Text-Processing

sed regex fails to capture the whole paragraph containing the pattern

  • August 17, 2021

I have this XML file (sample):

<This is a line of text with a year=2020 month=12 in it
This line of text does not have a year or month in it
This year=2021 is the current year the current month=1
This is the year=2021 the month=2/>


<This is a line of text with a year=33020 month=12 in it
This line of text does not have a year or month in it
This year=33020 is the current year the current month=1
This is the year=33020 the month=2/>

Using the sed installation that ships with my Linux distribution (sed (GNU sed) 4.2.2), I search this file with the following regular expression:

sed -En 'N;s/\<(This.*2020.*[\s\S\n]*?)\>/\1/gp' test2.txt

However, it only captures this string:

<This is a line of text with a year=2020 month=12 in it
This line of text does not have a year or month in it

But I am trying to capture the entire first paragraph, from the < to the >, that contains the pattern.

What am I doing wrong here?

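The reason this doesn't work is that < and > do not need to be escaped in regular expressions: they have no special meaning. However, \< and \> do have a special meaning in GNU extended regular expressions (which you enable with -E): they match at word boundaries. \< matches the beginning of a word and \> the end. So \<(This does not actually match the <, it matches the beginning of the word This. The same goes for the \> at the end. In addition, N only appends one more line to the pattern space, so each substitution sees just two lines at a time, which is why the output stops after the second line. The GNU sed manual has an example that is almost exactly what you are after: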

$ sed -En '/./{H;1h;$!d} ; x; s/(<This.*2020.*?>)/\1/p;' file
<This is a line of text with a year=2020 month=12 in it
This line of text does not have a year or month in it
This year=2021 is the current year the current month=1
This is the year=2021 the month=2/>
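
The two points above can be tried in isolation. What follows is a minimal sketch, assuming GNU sed and a throwaway sample file created with mktemp (the file contents and the variable names boundary, literal and paragraph are made up for illustration). The first two commands show that \< anchors at a word start while a bare < is an ordinary character. The last one is the same paragraph-collecting idea spread over several lines with comments, using a plain /regex/p instead of the s/(...)/\1/p form, since replacing a match with itself changes nothing.

```shell
# Hypothetical sample file, modelled on the data in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<This is a line of text with a year=2020 month=12 in it
This line of text does not have a year or month in it
This is the year=2021 the month=2/>

<This is a line of text with a year=33020 month=12 in it
This is the year=33020 the month=2/>
EOF

# \< is a zero-width word-start anchor, so the literal < stays outside the match.
boundary=$(printf '%s\n' '<This/>' | sed -E 's/\<This/[&]/')
# An unescaped < is an ordinary character and becomes part of the match.
literal=$(printf '%s\n' '<This/>' | sed -E 's/<This/[&]/')
printf '%s\n%s\n' "$boundary" "$literal"

# Collect each blank-line-separated paragraph in the hold space, then test it.
paragraph=$(sed -En '
  /./ {
    # Non-empty line: append it to the hold space.
    H
    # First input line only: overwrite the hold space to avoid a leading newline.
    1h
    # Not the last line yet: start the next cycle without printing anything.
    $!d
  }
  # Blank line or last line: swap the collected paragraph into the pattern space.
  x
  # Print the paragraph only if it contains <This ... 2020 ... >.
  /<This.*2020.*>/p
' "$sample")
printf '%s\n' "$paragraph"
rm -f "$sample"
```

Putting each sed command on its own line is only a readability choice; the one-liner above does the same work.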

Personally, I find sed particularly ill-suited to this kind of task. I would use perl instead:

$ perl -000 -ne 'chomp;/<.*2020.*?>/s && print "$_\n"' file
<This is a line of text with a year=2020 month=12 in it
This line of text does not have a year or month in it
This year=2021 is the current year the current month=1
This is the year=2021 the month=2/>

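Here, we use Perl in "paragraph mode" (-000), which means a "line" is delimited by two consecutive \n characters, that is, by an empty line. The script will:

  • chomp: remove the trailing newlines from the end of the "line" (paragraph).
  • /<.*2020.*?>/s && print "$_\n": if this "line" (paragraph) matches a < followed by zero or more characters up to 2020, then zero or more characters, and then a >, print the line with a newline appended (print "$_\n"). The s modifier of the match operator allows . to match newlines.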

Another option is awk:

$ awk 'BEGIN{RS="\n\n"} /<.*2020.+?>/' file
<This is a line of text with a year=2020 month=12 in it
This line of text does not have a year or month in it
This year=2021 is the current year the current month=1
This is the year=2021 the month=2/>

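We set the record separator RS to two consecutive newlines and then match with essentially the same regular expression as above. Since awk's default action when a pattern matches (or when any other expression evaluates to true) is to print the current record, this prints what you need.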
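
POSIX leaves the behaviour of a multi-character RS unspecified, so RS="\n\n" depends on the awk implementation in use. As a hedged alternative, not part of the original answer, the sketch below uses RS="", the standard awk paragraph mode, in which one or more blank lines separate records. The sample file and the variable name matched are hypothetical, and the regular expression is simplified to plain greedy quantifiers.

```shell
# Hypothetical sample with two blank lines between the paragraphs.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<This is a line of text with a year=2020 month=12 in it
This is the year=2021 the month=2/>


<This is a line of text with a year=33020 month=12 in it
This is the year=33020 the month=2/>
EOF

# RS="" selects paragraph mode; a pattern with no action prints matching records.
matched=$(awk 'BEGIN { RS = "" } /<.*2020.*>/' "$sample")
printf '%s\n' "$matched"
rm -f "$sample"
```

In paragraph mode, runs of blank lines count as a single separator, so the number of empty lines between paragraphs does not matter.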

Source: https://unix.stackexchange.com/questions/662958