Awk

Extracting data between two patterns from a huge (coerced) text file

  • November 25, 2018

I have filename.json. If I inspect it in the terminal with

file filename.json

the output is:

filename.json: UTF-8 Unicode text, with very long lines  

wc -l filename.json    
1 filename.json

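If I parse it as JSON with jq, I have to specify which part of the data I want printed, e.g. id, summary, author, etc. I have thousands of similarly structured JSON files, but the data I want is stored under parts such as "summary", "description", "review", etc. Because there are thousands of JSON files, I don't want to inspect each one. I do know, however, that the data I want lies between two patterns:

"title": and "url":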

$ cat filename.json

gives:

{"source":"PhoneArena","author":"","title":"Apple's US Black Friday shopping event has gift cards galore for select iPhones, iPads, and more","description":"As confirmed earlier this week, a four-day Black Friday and Cyber Monday shopping event is underway, offering Apple Store gift cards with purchases of select iPhone models, three iPad variants, an assortment of Macs, the entire Apple Watch Series 3 family, as well as the HomePod, both Apple TV versions, and select Beats headphones.That ...","url":"https://www.phonearena.com/news/Apples-US-Black-Friday-shopping-event-has-gift-cards-galore-for-select-iPhones-iPads-and-more_id111287","urlToImage":"https://i-cdn.phonearena.com/images/article/111287-two_lead/Apples-US-Black-Friday-shopping-event-has-gift-cards-galore-for-select-iPhones-iPads-and-more.jpg","publishedAt":"2018-11-23T09:05:00Z","dataRefreshedTime":"2018-11-23T09:43:09Z","category":"phone_news_reviews","resource":"PhoneArena"},{"source":"PhoneArena","author":"","title":"Verizon's top Black Friday bargain is a free Moto G6, no trade-in required","description":"That made it virtually impossible for retailers like Best Buy and B&H Photo Video to outdo themselves come the actual Black Friday frenzy, but luckily, that’s what carriers are (sometimes) good for.Enter Verizon, which revealed a wide range of killer deals on popular high-end ...","url":"https://www.phonearena.com/news/Verizons-top-Black-Friday-bargain-is-a-free-Moto-G6-no-trade-in-required_id111285","urlToImage":"https://i-cdn.phonearena.com/images/article/111285-two_lead/Verizons-top-Black-Friday-bargain-is-a-free-Moto-G6-no-trade-in-required.jpg","publishedAt":"2018-11-23T07:54:00Z","dataRefreshedTime":"2018-11-23T09:43:09Z","category":"phone_news_reviews","resource":"PhoneArena"},

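So I want to print everything between the patterns, but in the terminal the file is a single line and the patterns occur many times. The only approach I can think of is to print between the two patterns until the end of the file.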

I tried using sed:

sed -n '^/title/,/^url/p' filename.json

But it prints nothing.

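I want to feed the data onward for language analysis with machine learning techniques.

Any suggestions on other ways to print between patterns? The patterns also repeat many times, so I want the data between each repetition to be printed.

The expected result is CSV or TSV output: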

1 "As confirmed earlier this week, a four-day Black Friday and Cyber Monday shopping event is underway, offering Apple Store gift cards with purchases of select iPhone models, three iPad variants, an assortment of Macs, the entire Apple Watch Series 3 family, as well as the HomePod, both Apple TV versions, and select Beats headphones.That ..."

2 "That made it virtually impossible for retailers like Best Buy and B&H Photo Video to outdo themselves come the actual Black Friday frenzy, but luckily, that’s what carriers are (sometimes) good for.Enter Verizon, which revealed a wide range of killer deals on popular high-end ..."

etc.

Until the end of the file.

TL;DR

In ksh, bash, zsh:

sed -e $'s,"title":,\1,g' -e $'s,"url":,\2,g' -e $'s,^[^\1]*,,' -e $'
        s,\1\\([^\2]*\\)\2[^\1]*,\\1\\\n,g' infile

One-character delimiters.

The canonical solution for one-character delimiters, let's assume @ and # for example, is:

sed 's,^[^@]*,,;s,@\([^#]*\)#[^@]*,\1 ,g' infile

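For each line of the input file infile, this will:

  • remove, from the start of the line, every character that is not an @;
  • extract the characters between each @ and the next #.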

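General delimiters.

Any other delimiter can be reduced to the solution above simply by first converting each delimiter string to a single character.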

sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@\([^#]*\)#[^@]*/\1 /g' infile

In your case you can use a newline instead of the space after \1; written for GNU sed that is simply \1\n:

sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@\([^#]*\)#[^@]*/\1\n/g' infile

For other (older) seds, add an explicit newline:

sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@\([^#]*\)#[^@]*/\1\
/g' infile
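To see the effect on data shaped like the question's file, the explicit-newline form can be tried on a shortened two-record sample. This is only a sketch: the sample string and the function name extract_title_to_url are invented here, and the sample deliberately contains no @ or # characters.

```shell
# Hypothetical two-record sample in the same shape as the question's file.
sample='{"title":"A","description":"first","url":"u1"},{"title":"B","description":"second","url":"u2"},'

# Same commands as above, reading standard input instead of infile.
# The replacement ends in backslash + newline, so the closing /g must
# start at column 0 to avoid adding indentation to every output line.
extract_title_to_url() {
  sed -e 's,"title":,@,g' -e 's,"url":,#,g' \
      -e 's/^[^@]*//;s/@\([^#]*\)#[^@]*/\1\
/g'
}

printf '%s\n' "$sample" | extract_title_to_url
```

Each record ends up on its own line, still carrying the "description": field and a trailing comma. The output also ends with one empty line, because a newline is appended after the last record as well.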

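If there is a risk that the delimiters used above occur in the file, choose other delimiters that are guaranteed not to be present in it. If that seems to be a problem, the start and end delimiters can be control characters such as Ctrl-A (or encoded: ^A, hex: 0x01, or octal: \001). You can type it in a shell console by pressing Ctrl-V Ctrl-A. You will see a ^A on the command line: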

sed -e 's,"title":,^A,g' -e 's,"url":,^B,g' -e 's,^[^^A]*,,;s,^A\([^^B]*\)^B[^^A]*,\1\n,g' infile

Or, if that is too cumbersome to type, use (ksh, bash, zsh):

sed -e $'s,"title":,\1,g' -e $'s,"url":,\2,g' -e $'s,^[^\1]*,,' -e $'s,\1\\([^\2]*\\)\2[^\1]*,\\1\\\n,g' infile

Or, if your sed supports it:

sed -e 's,"title":,\o001,g' -e 's,"url":,\o002,g' -e 's,^[^\o001]*,,' -e 's,\o001\([^\o002]*\)\o002[^\o001]*,\1\o012,g' infile
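If neither $'...' quoting nor \o escapes are available, one further possibility is to build the control characters with printf and interpolate them into a double-quoted sed script. This is a sketch under that assumption, not part of the commands above; the names soh, stx, extract_ctrl and the sample (which contains @ and # on purpose) are made up for illustration.

```shell
soh=$(printf '\001')   # Ctrl-A, stands in for "title":
stx=$(printf '\002')   # Ctrl-B, stands in for "url":

# Inside double quotes, \\ yields one backslash for sed and \" a literal quote.
extract_ctrl() {
  sed -e "s,\"title\":,${soh},g" -e "s,\"url\":,${stx},g" \
      -e "s,^[^${soh}]*,," \
      -e "s,${soh}\\([^${stx}]*\\)${stx}[^${soh}]*,\\1\\
,g"
}

# Hypothetical sample whose text contains @ and #, which would break the @/# variant.
sample='{"title":"Deal #1 @ shop","description":"x","url":"u1"},{"title":"B","description":"y","url":"u2"},'
printf '%s\n' "$sample" | extract_ctrl
```

Because the delimiters are control characters, the @ and # inside the first title pass through untouched.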

If the delimiter is "description":

If the start tag is actually "description": (from your output example), just use it instead of "title":
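A sketch of that swap, using the @/# variant on a made-up two-record sample (the name extract_description and the sample are assumptions for illustration, and the sample contains no @ or #):

```shell
# Start delimiter is now "description": instead of "title":.
extract_description() {
  sed -e 's,"description":,@,g' -e 's,"url":,#,g' \
      -e 's/^[^@]*//;s/@\([^#]*\)#[^@]*/\1\
/g'
}

# Hypothetical sample in the shape of the question's file.
sample='{"title":"A","description":"first","url":"u1"},{"title":"B","description":"second","url":"u2"},'
printf '%s\n' "$sample" | extract_description
```

Only the quoted description text remains on each line, followed by the comma that preceded "url": in the input.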

Output of the "title": to "url": commands above (from the file you linked earlier in the question):

"Black Friday deal: Palm companion phone is $150 off at Verizon, but there's a catch","description":"",
"LG trademarks potential names for its foldable phone, one fits a crazy concept found in patents","description":"",
"Blackview's Black Friday promo discounts the BV9500 Pro and other rugged phones on Amazon","description":"Advertorial by Blackview: the opinions expressed in this story may not reflect the positions of PhoneArena! disclaimer   amzn_assoc_tracking_id = 'phone0e0d-20';amzn_assoc_ad_mode = 'manual';amzn_assoc_ad_type ...",

If you need to number the lines, pipe the output through sed once more, using sed -n '=;p;g;p':

| sed -n '=;p;g;p'
1
"Black Friday deal: Palm companion phone is $150 off at Verizon, but there's a catch","description":"",

2
"LG trademarks potential names for its foldable phone, one fits a crazy concept found in patents","description":"",

3
"Blackview's Black Friday promo discounts the BV9500 Pro and other rugged phones on Amazon","description":"Advertorial by Blackview: the opinions expressed in this story may not reflect the positions of PhoneArena! disclaimer   amzn_assoc_tracking_id = 'phone0e0d-20';amzn_assoc_ad_mode = 'manual';amzn_assoc_ad_type ...",

AWK

Similar logic implemented in awk:

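awk -v one=$'\1' -v two=$'\2' '{
           gsub(/"title":/,one);         # start delimiter -> \001
           gsub(/"url":/,two);           # end delimiter   -> \002
           sub("^[^"one"]*"one,"");      # drop everything up to the first start
           gsub(two"[^"one"]*"one,ORS);  # end .. next start -> newline
           sub(two"[^"two"]*$","")       # drop everything after the last end
          } 1' infile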
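The awk logic can also be wrapped so that the delimiters become arguments and the result is numbered on the same line, which is closer to the TSV the question asked for. This is a sketch only: between, number_tsv and the sample are hypothetical names, the control characters are built with printf rather than $'...', and the delimiter arguments are treated as regular expressions, so regex metacharacters in them would need escaping. Using ,"url": as the end delimiter is an assumption made here to drop the trailing comma.

```shell
# Hypothetical helper: print the text between each START and the next END.
between() {
  awk -v start="$1" -v end="$2" \
      -v one="$(printf '\001')" -v two="$(printf '\002')" '{
        gsub(start, one)                        # START -> \001
        gsub(end, two)                          # END   -> \002
        if (!sub("^[^" one "]*" one, "")) next  # skip lines without START
        gsub(two "[^" one "]*" one, ORS)        # END .. next START -> newline
        sub(two "[^" two "]*$", "")             # drop everything after last END
        print
      }'
}

# Hypothetical helper: prefix each non-empty line with its number and a tab.
number_tsv() {
  awk 'NF { printf "%d\t%s\n", ++n, $0 }'
}

# Made-up sample in the shape of the question's file.
sample='{"title":"A","description":"first","url":"u1"},{"title":"B","description":"second","url":"u2"},'
printf '%s\n' "$sample" | between '"description":' ',"url":' | number_tsv
```

For thousands of files, the same pipeline could be fed with cat over the files or run once per file in a loop, depending on whether the numbering should restart for each file.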

Quoted from: https://unix.stackexchange.com/questions/483627