Shell-Script
Extracting URLs from unformatted text
I have only found examples that extract substrings from formatted text such as HTML files, but in my case I need to output a list of URLs such as:
... https://twitter.com/user1/status/xyza https://twitter.com/user1/status/xyzb https://twitter.com/user1/status/xyzc https://twitter.com/user2/status/xyza https://twitter.com/user2/status/xyzb ...
from an unstructured and very large file (100+ MB). This is my input:
n 3\\n \\n \\n \\n \\n \\n Retweeted\\n \\n \\n \\n 3\\n \\n \\n \\n\\n \\n \\n \\n \\n \\n \\n Like\\n \\n \\n \\n 5\\n \\n \\n \\n \\n \\n \\n \\n Liked\\n \\n \\n \\n 5\\n \\n \\n \\n\\n \\n\\n \\n \\n \\n \\n \\n More\\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n Copy link to Tweet\\n \\n \\n Embed Tweet\\n \\n \\n \\n\\n\\n\\n\\n \\n\\n \\n\\n \\n\\n \\n \\n \\n \\n \\n \\n\\n \\n \\n\\n \\n\\n\\n \\n\\n\\n \\n \\n \\n\\n \\n \\n \\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n \\n \\n \\n \\n \\n \\n\\n \\n \\n\\n \\n\\n Back to top ↑\\n\\n \\n\\n\\n \\n \\n \\n \\n\\n\\n \\n\\n\\n \\n \\n Loading seems to be taking a while.\\n \\n Twitter may be over capacity or experiencing a momentary hiccup. Try again or visit Twitter Status for more information.\\n \\n \\n\\n\\n\\n \\n \\n \\n\\n \\n \\n\\n\\n\\n\\n\\n \\n\\n\\n \\n \\n Suggested by Twitter\\n \\n \\n \\n \\n \\n\\n \\n \\n \\n \\n \\n false\\n \\n \\n \\n \\n \\n\\n \\n\\n\\n\\n \\n \\n \\n \\n \\n © 2015 Twitter\\n About\\n Help\\n Terms\\n Privacy\\n Cookies\\n Ads info\\n \\n \\n \\n\\n\\n \\n\\n\\n\\n \\n \\n \\n\\n\\n \\n \\n \\n\\n\\n\\n \\n \\n \\n\\n \\n\\n \\n\\n \\n \\n\\n \\n \\n\\n\",\"meta_tags\":[{},{\"content\":\"0; URL=https://mobile.twitter.com/i/nojs_router?path=%2FTerriBauman%2Fstatus%2F680996161843380224\"},{\"name\":\"robots\",\"content\":\"NOODP\"},{\"name\":\"msapplication-TileImage\",\"content\":\"//abs.twimg.com/favicons/win8-tile-144.png\"},{\"name\":\"msapplication-TileColor\",\"content\":\"#00aced\"},{\"name\":\"swift-page-name\",\"content\":\"permalink\"},{\"content\":\"article\"},{\"content\":\"https://twitter.com/TerriBauman/status/680996161843380224\"},{\"content\":\"Terri Bauman on Twitter\"},{\"content\":\"https://pbs.twimg.com/media/BcaVtMKCEAAyz9f.jpg:large\"},{\"content\":\"true\"},{\"content\":\"“Social Media Jobs: https://t.co/NDDK4WaRA4 Please Retweet to spread words #OnlineJobs 
#Jobs”\"},{\"content\":\"Twitter\"},{\"content\":\"2231777543\"}],\"links\":[\"https://twitter.com/\",\"https://twitter.com/about\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/#supported_languages\",\"https://twitter.com/?lang=id\",\"https://twitter.com/?lang=msa\",\"https://twitter.com/?lang=cs\",\"https://twitter.com/?lang=da\",\"https://twitter.com/?lang=de\",\"https://twitter.com/?lang=en-gb\",\"https://twitter.com/?lang=es\",\"https://twitter.com/?lang=fil\",\"https://twitter.com/?lang=fr\",\"https://twitter.com/?lang=it\",\"https://twitter.com/?lang=hu\",\"https://twitter.com/?lang=nl\",\"https://twitter.com/?lang=no\",\"https://twitter.com/?lang=pl\",\"https://twitter.com/?lang=pt\",\"https://twitter.com/?lang=ro\",\"https://twitter.com/?lang=fi\",\"https://twitter.com/?lang=sv\",\"https://twitter.com/?lang=vi\",\"https://twitter.com/?lang=tr\",\"https://twitter.com/?lang=el\",\"https://twitter.com/?lang=ru\",\"https://twitter.com/?lang=uk\",\"https://twitter.com/?lang=he\",\"https://twitter.com/?lang=ar\",\"https://twitter.com/?lang=fa\",\"https://twitter.com/?lang=mr\",\"https://twitter.com/?lang=hi\",\"https://twitter.com/?lang=bn\",\"https://twitter.com/?lang=gu\",\"https://twitter.com/?lang=ta\",\"https://twitter.com/?lang=kn\",\"https://twitter.com/?lang=th\",\"https://twitter.com/?lang=ko\",\"https://twitter.com/?lang=ja\",\"https://twitter.com/?lang=zh-cn\",\"https://twitter.com/?lang=zh-tw\",\"https://twitter.com/login\",\"https://twitter.com/account/begin_password_reset\",\"https://tw
itter.com/signup\",\"https://twitter.com/TerriBauman\",\"https://pbs.twimg.com/profile_images/598412523734310913/t3ettYkj.jpg\",\"https://pbs.twimg.com/profile_images/598412523734310913/t3ettYkj.jpg\",\"https://twitter.com/TerriBauman\",\"https://twitter.com/TerriBauman\",\"https://twitter.com/TerriBauman\",\"https://twitter.com/TerriBauman\",\"https://twitter.com/hashtag/Entrepreneur?src=hash\",\"https://twitter.com/hashtag/SocialMediaExpert?src=hash\",\"https://twitter.com/hashtag/SocialMediaMarketer?src=hash\",\"https://twitter.com/hashtag/BusinessOwner?src=hash\",\"https://twitter.com/hashtag/InternetMarketer?src=hash\",\"https://twitter.com/hashtag/SocialMediaJobs?src=hash\",\"https://t.co/ZciT91kZwP\",\"https://twitter.com/about\",\"http:////support.twitter.com\",\"https://twitter.com/tos\",\"https://twitter.com/privacy\",\"http:////support.twitter.com/articles/20170514\",\"http:////support.twitter.com/articles/20170451\",\"https://twitter.com/#\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"http://support.twitter.com/forums/2681
0/entries/78525\",\"http:////dev.twitter.com/docs/embedded-tweets\",\"http:////dev.twitter.com/docs/embedded-tweets\",\"https://twitter.com/account/begin_password_reset\",\"https://twitter.com/signup\",\"https://twitter.com/signup\",\"https://twitter.com/login\",\"http://support.twitter.com/articles/14226-how-to-find-your-twitter-short-code-or-long-code\",\"https://twitter.com/TerriBauman/status/680996164058001408\",\"https://twitter.com/TerriBauman/status/680977383365578752\",\"https://twitter.com/TerriBauman\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://t.co/NDDK4WaRA4\",\"https://twitter.com/hashtag/OnlineJobs?src=hash\",\"https://twitter.com/hashtag/Jobs?src=hash\",\"https://t.co/SJvkM1yWUI\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/TerriBauman/status/680996161843380224\",\"https://twitter.com/cakafete\",\"https://twitter.com/KassemAlYateem\",\"https://twitter.com/Worldspacetech1\",\"https://twitter.com/ElisaBW\",\"https://twitter.com/patrickarrelle\",\"https://twitter.com/AcousticsPro1\",\"https://twitter.com/#\",\"http://status.twitter.com\",\"https://twitter.com/about\",\"http:////support.twitter.com\",\"https://twitter.com/tos\",\"https://twitter.com/privacy\",\"http:////support.twitter.com/articles/20170514\",\"http:////support.twitter.com/articles/20170451\"]}"},{"url":"http://status.twitter.com/page/2","result":"{\"date_crawled\":\"2015-12-27T10:01:58Z\",\"title\":\"Twitter Status\",\"lossyHTML\":\"\\n\\n\\r\\n\\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n\\r\\n \\r\\n Twitter Status\\r\\n \\n\\r\\n \\r\\n \\r\\n\\r\\n \\r\\n\\r\\n \\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\r\\n \\r\\n\\r\\n\\r\\n\\r\\n\\r\\n \\r\\n\\r\\n\\r\\n\\r\\n \\r\\n \\r\\n \\r\\n \\r\\n \\r\\n Updates on the status of the Twitter service.\\r\\n\\r\\n\\r\\n\\r\\n\\r\\nRelated Links\\r\\nOfficial Company 
Blog\\r\\n\\r\\nOfficial Help Documents\\r\\n\\r\\nDeveloper Community\\r\\n\\r\\n\\r\\n\\r\\n Archive\\r\\n\\r\\n\\r\\n\\r\\n \\r\\n Powered by Tumblr\\r\\n \\r\\n\\r\\n \\r\\n \\r\\n \\r\\n\\r\\n\\r\\n \\r\\n \\r\\n \\r\\n
This is what I have been trying:
grep 'https://' input.txt | grep 'status' >> output.txt
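I have seen examples that use sed and awk, but besides being very hard to understand, they almost always rely on column selection, which is not possible in my case.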
With GNU grep, try this to get URLs with exactly two slashes after the scheme:
grep -o 'http[s]*://[^/][^\\]*' file
For URLs with two or more slashes:
grep -o 'http[s]*://[^\\]*' file
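Since the goal in the question is a list of tweet URLs only, the same idea can be narrowed and de-duplicated. The following is a minimal sketch, not a definitive solution: the function name extract_status_urls is hypothetical, and it assumes that tweet URLs have the form https://twitter.com/USER/status/DIGITS, as in the sample input.

```shell
# Hypothetical helper: print each distinct tweet URL found in the file "$1".
# Assumption: a tweet URL looks like http(s)://twitter.com/<user>/status/<digits>.
extract_status_urls() {
    # -o prints only the matched text, one match per line;
    # -E enables extended regular expressions (?, +).
    # [^/\\"]+ matches the user name: any run without slash, backslash or quote.
    grep -oE 'https?://twitter\.com/[^/\\"]+/status/[0-9]+' "$1" | sort -u
}
```

It could be called as extract_status_urls input.txt > output.txt. Because sort -u drops repeats, each URL appears once even though the input lists the same tweet many times.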
Recommended reading: the Stack Overflow Regular Expressions FAQ
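[s]*: The star quantifier (*) means that the preceding expression may match zero or more times. Here the preceding expression is a character class (marked by the brackets) that contains only s, so it is simpler to write s*. Both forms also accept more than one s, as in httpss.
[^\\]*: Matches any character other than a backslash, zero or more times. I escaped the backslash with a backslash so that it would not escape the ]. In a POSIX bracket expression a backslash is already literal, so the doubling is harmless rather than required.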
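The effect of each piece of the pattern can be checked on a small sample. This is only an illustrative sketch: the sample string is made up to mimic the escaped JSON in the question, and httpss://example.invalid is a deliberately bogus URL.

```shell
# Made-up sample line; every URL is followed by a backslash, as in the input.
sample='\"https://twitter.com/about\",\"http:////support.twitter.com\",\"httpss://example.invalid\"'

# s* (same as [s]*) allows any number of "s", so the bogus "httpss" also matches.
printf '%s\n' "$sample" | grep -o 'https*://[^\\]*'

# s\{0,1\} allows zero or one "s": only http and https match.
printf '%s\n' "$sample" | grep -o 'https\{0,1\}://[^\\]*'

# [^/] right after :// rejects URLs that have more than two slashes there.
printf '%s\n' "$sample" | grep -o 'https\{0,1\}://[^/][^\\]*'
```

The first command prints all three URLs, the second drops the httpss one, and the third prints only https://twitter.com/about. In every case the match stops at the first backslash because of [^\\]*.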