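Extracting data between two patterns from a huge (forced) text file
I have a file, filename.json. If I inspect it in the terminal with
file filename.json
the output is:
filename.json: UTF-8 Unicode text, with very long lines
and
wc -l filename.json
gives:
1 filename.json
If I parse it as JSON with jq, I have to name the part of the data I want printed, such as id, summary, author, etc. I have thousands of JSON files with a similar structure, but the data I want is stored under different keys such as "summary", "description", "review", etc. Because there are thousands of files, I don't want to check each one. But I know that the data I want lies between two patterns, "title": and "url":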
$ cat filename.json
gives:
{"source":"PhoneArena","author":"","title":"Apple's US Black Friday shopping event has gift cards galore for select iPhones, iPads, and more","description":"As confirmed earlier this week, a four-day Black Friday and Cyber Monday shopping event is underway, offering Apple Store gift cards with purchases of select iPhone models, three iPad variants, an assortment of Macs, the entire Apple Watch Series 3 family, as well as the HomePod, both Apple TV versions, and select Beats headphones.That ...","url":"https://www.phonearena.com/news/Apples-US-Black-Friday-shopping-event-has-gift-cards-galore-for-select-iPhones-iPads-and-more_id111287","urlToImage":"https://i-cdn.phonearena.com/images/article/111287-two_lead/Apples-US-Black-Friday-shopping-event-has-gift-cards-galore-for-select-iPhones-iPads-and-more.jpg","publishedAt":"2018-11-23T09:05:00Z","dataRefreshedTime":"2018-11-23T09:43:09Z","category":"phone_news_reviews","resource":"PhoneArena"},{"source":"PhoneArena","author":"","title":"Verizon's top Black Friday bargain is a free Moto G6, no trade-in required","description":"That made it virtually impossible for retailers like Best Buy and B&H Photo Video to outdo themselves come the actual Black Friday frenzy, but luckily, that’s what carriers are (sometimes) good for.Enter Verizon, which revealed a wide range of killer deals on popular high-end ...","url":"https://www.phonearena.com/news/Verizons-top-Black-Friday-bargain-is-a-free-Moto-G6-no-trade-in-required_id111285","urlToImage":"https://i-cdn.phonearena.com/images/article/111285-two_lead/Verizons-top-Black-Friday-bargain-is-a-free-Moto-G6-no-trade-in-required.jpg","publishedAt":"2018-11-23T07:54:00Z","dataRefreshedTime":"2018-11-23T09:43:09Z","category":"phone_news_reviews","resource":"PhoneArena"},
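So I want to print everything between the patterns, but in the terminal the file is a single line, and the patterns occur multiple times. The only way I could think of is to print between the two patterns until the end of the file.
I tried using sed: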
sed -n '^/title/,/^url/p' filename.json
But it prints nothing.
I want to feed the data into language analysis using machine learning techniques.
Any suggestions for other ways to print the text between the patterns? The patterns repeat many times, so I want the data between each pair printed.
The expected result is CSV or TSV output:
1 "As confirmed earlier this week, a four-day Black Friday and Cyber Monday shopping event is underway, offering Apple Store gift cards with purchases of select iPhone models, three iPad variants, an assortment of Macs, the entire Apple Watch Series 3 family, as well as the HomePod, both Apple TV versions, and select Beats headphones.That ..."
2 "That made it virtually impossible for retailers like Best Buy and B&H Photo Video to outdo themselves come the actual Black Friday frenzy, but luckily, that’s what carriers are (sometimes) good for.Enter Verizon, which revealed a wide range of killer deals on popular high-end ..."
etc.
until the end of the file.
TL;DR
In ksh, bash, zsh:
sed -e $'s,"title":,\1,g' -e $'s,"url":,\2,g' -e $'s,^[^\1]*,,' -e $'s,\1\\([^\2]*\\)\2[^\1]*,\\1\\\n,g' infile
The rest of this answer explains how that command is built up.
One-character delimiters.
The canonical solution for one-character delimiters, assuming @ and # for example, is:
sed 's,^[^@]*,,;s,@\([^#]*\)#[^@]*,\1 ,g' infile
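This will:
- remove every character that is not an @ from the start of the line
- extract the characters between each @ and the next #, dropping everything up to the following @
for each line of the input file infile.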
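As a quick check of the one-character form, here is a minimal sketch run on a made-up line. The sample text and the variable names sample and result are illustrative only and do not come from the question.

```shell
# Made-up input: two @...# fields surrounded by unrelated text.
sample='junk@first#noise@second#tail'

# First command: drop everything before the first @.
# Second command: keep each @...# body, followed by a space,
# and drop the text that runs up to the next @.
result=$(printf '%s\n' "$sample" |
  sed 's,^[^@]*,,;s,@\([^#]*\)#[^@]*,\1 ,g')

# Brackets make the trailing space visible.
printf '[%s]\n' "$result"
```

Each extracted field is followed by the space from the replacement, so the result ends with a trailing space.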
Generic delimiters.
Any other delimiters can be reduced to the answer above simply by converting each delimiter string to a single character first.
sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@\([^#]*\)#[^@]*/\1 /g' infile
In your case you could use a newline instead of the space in the replacement (\1 ). Written for GNU sed that is simply (\1\n):
sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@\([^#]*\)#[^@]*/\1\n/g' infile
For other (older) seds, add an explicit newline, escaped with a backslash:
sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@\([^#]*\)#[^@]*/\1\
/g' infile
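To see what the backslash-newline form produces, here is a small sketch on a made-up two-record line in the same shape as the question's file. The sample content and the variable names are assumptions for illustration, and the sample is assumed to contain no @ or # of its own.

```shell
# Made-up two-record sample shaped like the question's single-line file.
sample='{"title":"A","description":"one","url":"u1"},{"title":"B","description":"two","url":"u2"},'

# Map the two delimiter strings to @ and #, strip the leading text,
# then emit each @...# body followed by a literal newline
# (written as backslash-newline so it does not rely on GNU sed's \n).
result=$(printf '%s\n' "$sample" |
  sed -e 's,"title":,@,g' -e 's,"url":,#,g' \
      -e 's/^[^@]*//' -e 's/@\([^#]*\)#[^@]*/\1\
/g')

printf '%s\n' "$result"
```

Each output line still ends with the comma that preceded "url": in the input, because that comma sits between the two delimiters.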
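If there is a risk that the delimiters used above occur in the file, choose other delimiters that are guaranteed not to appear in it. If that seems to be a problem, the start and end delimiters can be control characters such as Ctrl-A (displayed as ^A; hex 0x01, octal \001) and Ctrl-B. You can type Ctrl-A in a shell console by pressing Ctrl-V Ctrl-A. You will see a ^A on the command line (each ^A and ^B below is a single control character, not a caret followed by a letter):
sed -e 's,"title":,^A,g' -e 's,"url":,^B,g' -e 's,^[^^A]*,,;s,^A\([^^B]*\)^B[^^A]*,\1\n,g' infile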
Or, if that is too troublesome to type, use (ksh, bash, zsh):
sed -e $'s,"title":,\1,g' -e $'s,"url":,\2,g' -e $'s,^[^\1]*,,' -e $'s,\1\\([^\2]*\\)\2[^\1]*,\\1\\\n,g' infile
Or, if your sed supports it:
sed -e 's,"title":,\o001,g' -e 's,"url":,\o002,g' -e 's,^[^\o001]*,,' -e 's,\o001\([^\o002]*\)\o002[^\o001]*,\1\o012,g' infile
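Before relying on control characters as delimiters, it may be worth confirming that the file does not already contain them. The following is only a sketch: the function name delims_are_free and the demo file are hypothetical, and it assumes a grep that accepts these bytes inside a bracket expression.

```shell
# Hypothetical helper: succeeds only if the file contains neither byte
# 0x01 nor byte 0x02, i.e. both are free to serve as delimiters.
delims_are_free() {
  pattern=$(printf '[\001\002]')   # bracket expression holding both bytes
  ! LC_ALL=C grep -q "$pattern" "$1"
}

# Made-up demo file in the same shape as the question's data.
tmp=$(mktemp)
printf '%s\n' '{"title":"A","description":"one","url":"u1"},' > "$tmp"

if delims_are_free "$tmp"; then
  status='free'
else
  status='taken'
fi
printf '%s\n' "$status"
rm -f "$tmp"
```

If the check fails for a real file, another pair of bytes that does not occur in the data would have to be chosen.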
If the delimiter is "description":
If the start tag is actually "description": (judging from your expected output), just use that instead of "title":
The output of the commands above (run on the file you linked earlier in your question):
"Black Friday deal: Palm companion phone is $150 off at Verizon, but there's a catch","description":"",
"LG trademarks potential names for its foldable phone, one fits a crazy concept found in patents","description":"",
"Blackview's Black Friday promo discounts the BV9500 Pro and other rugged phones on Amazon","description":"Advertorial by Blackview: the opinions expressed in this story may not reflect the positions of PhoneArena! disclaimer amzn_assoc_tracking_id = 'phone0e0d-20';amzn_assoc_ad_mode = 'manual';amzn_assoc_ad_type ...",
If you need to number the lines, pipe the output through sed once more with sed -n '=;p;g;p', which prints the line number, then the line, then an empty line:
| sed -n '=;p;g;p'
1
"Black Friday deal: Palm companion phone is $150 off at Verizon, but there's a catch","description":"",

2
"LG trademarks potential names for its foldable phone, one fits a crazy concept found in patents","description":"",

3
"Blackview's Black Friday promo discounts the BV9500 Pro and other rugged phones on Amazon","description":"Advertorial by Blackview: the opinions expressed in this story may not reflect the positions of PhoneArena! disclaimer amzn_assoc_tracking_id = 'phone0e0d-20';amzn_assoc_ad_mode = 'manual';amzn_assoc_ad_type ...",
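If the goal is the numbered TSV layout from the question, with the number and the text on the same line, one possible sketch is to start at "description": and let awk do the numbering instead of sed. This swaps in awk for the numbering step and uses ,"url": as the end delimiter so the trailing comma is dropped; the sample data and variable names are made up, and the text is assumed to contain no @, # or delimiter strings of its own.

```shell
# Made-up sample shaped like the question's file.
sample='{"source":"S","author":"","title":"T1","description":"First text","url":"u1"},{"source":"S","author":"","title":"T2","description":"Second text","url":"u2"},'

# Extract what lies between "description": and ,"url": one item per line,
# then number the non-empty lines and join number and text with a tab.
result=$(printf '%s\n' "$sample" |
  sed -e 's/"description":/@/g' -e 's/,"url":/#/g' \
      -e 's/^[^@]*//' -e 's/@\([^#]*\)#[^@]*/\1\
/g' |
  awk '$0 != "" { printf "%d\t%s\n", ++n, $0 }')

printf '%s\n' "$result"
```

The awk condition skips the empty line that the sed step leaves after the last record.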
AWK
Similar logic implemented in awk:
awk -v one=$'\1' -v two=$'\2' '{
    gsub(/"title":/,one);
    gsub(/"url":/,two);
    sub("^[^"one"]*"one,"");
    gsub(two"[^"one"]*"one,ORS);
    sub(two"[^"two"]*$","")
} 1' infile
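For shells without the $'...' quoting, the same awk logic can be tried with the delimiter bytes produced by printf. This is a sketch on a made-up sample; the variable names sample, one, two and result are illustrative.

```shell
# Made-up two-record sample shaped like the question's file.
sample='{"title":"A","description":"one","url":"u1"},{"title":"B","description":"two","url":"u2"},'

one=$(printf '\001')   # start delimiter, byte 0x01
two=$(printf '\002')   # end delimiter, byte 0x02

result=$(printf '%s\n' "$sample" |
  awk -v one="$one" -v two="$two" '{
    gsub(/"title":/, one)                  # mark every start
    gsub(/"url":/, two)                    # mark every end
    sub("^[^" one "]*" one, "")            # drop text up to the first start
    gsub(two "[^" one "]*" one, ORS)       # end..next start becomes a newline
    sub(two "[^" two "]*$", "")            # drop the tail after the last end
  } 1')

printf '%s\n' "$result"
```

Unlike the sed versions, this leaves no empty line after the last record, because the final end delimiter and everything after it are removed rather than replaced by a newline.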