awk: extract lines and merge them into a new file

September 29, 2022

Here is my reproducible example.
file01.txt
line to skip
line to skip
line to skip
line to keep file 01
heading 1 in the form: 2017243 01 2017243 01
data 1 file 01
heading 2 in the form: 2017243 02 2017243 02
data 2 file 01
heading 3 in the form: 2017243 03 2017243 03
data 3 file 01
file02.txt
line to skip
line to skip
line to skip
line to keep file 02
heading 1 in the form: 2017243 01 2017243 01
data 1 file 02
heading 2 in the form: 2017243 02 2017243 02
data 2 file 02
heading 3 in the form: 2017243 03 2017243 03
data 3 file 02
file03.txt
line to skip
line to skip
line to skip
line to keep file 03
heading 1 in the form: 2017243 01 2017243 01
data 1 file 03
heading 2 in the form: 2017243 02 2017243 02
data 2 file 03
heading 3 in the form: 2017243 03 2017243 03
data 3 file 03
Desired output
line to keep file 01
line to keep file 02
line to keep file 03
heading 1 in the form: 2017243 01 2017243 01
data 1 file 01
data 1 file 02
data 1 file 03
heading 2 in the form: 2017243 02 2017243 02
data 2 file 01
data 2 file 02
data 2 file 03
heading 3 in the form: 2017243 03 2017243 03
data 3 file 01
data 3 file 02
data 3 file 03
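So far I have managed the very simple task of extracting the fourth line from each input file:
awk 'FNR == 4' *.txt >> out_row4
But then I got stuck on the rest of the file processing and cannot work out a viable final solution...
I need to keep the solution very general, because the number of files and lines to process is very large (more than 5900 lines per file).
The general pattern, for reference:
always skip the first 3 lines of each file
keep the 4th line of each file
headings 1, 2, 3 (... and so on) are exactly the same across files (so they need to be reported only once in the desired output file)
all files contain the same number of lines
the files have no known structured format; they are plain text files
The common pattern to extract and rearrange is:
heading n in the form: 2017243 n 2017243 n
data n file ...
Any hints?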

Applying the DSU (Decorate-Sort-Undecorate) idiom, using any version of the mandatory POSIX tools awk, sort, and cut:

$ cat tst.sh
#!/usr/bin/env bash

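# Decorate: prefix each kept line with its line index (FNR-3) and file index.
# Sort: order by line index, then by file index.
# Filter: keep every odd-indexed line (line 4 and data) but headings from file 1 only.
# Undecorate: cut removes the two prefix fields.
awk -v OFS='\t' '
   FNR == 1 { fileNr++ }
   FNR >= 4 { print FNR-3, fileNr, $0 }
' "$@" |
sort -n -k1,1 -k2,2 |
awk '($1 % 2) || ($2 == 1)' |
cut -f 3-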

$ ./tst.sh file01.txt file02.txt file03.txt
line to keep file 01
line to keep file 02
line to keep file 03
heading 1 in the form: 2017243 01 2017243 01
data 1 file 01
data 1 file 02
data 1 file 03
heading 2 in the form: 2017243 02 2017243 02
data 2 file 01
data 2 file 02
data 2 file 03
heading 3 in the form: 2017243 03 2017243 03
data 3 file 01
data 3 file 02
data 3 file 03

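The only tool above that has to handle all of the input at once is sort, and it is designed to cope with huge inputs by using demand paging, etc., so this will work no matter how many input files you have (as long as the argument list does not exceed ARG_MAX, of course) or how large they are.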
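To make the decorate and sort steps easier to follow, here is a minimal sketch that prints the intermediate streams. The inputs a.txt and b.txt are made-up, shortened files created in a temporary directory purely for illustration; they are not the files from the question.

```shell
#!/usr/bin/env bash
# Illustrative only: a.txt and b.txt are made-up inputs, not files from the question.
tmp=$(mktemp -d)
printf '%s\n' skip skip skip 'keep A' 'heading 1' 'data 1 A' > "$tmp/a.txt"
printf '%s\n' skip skip skip 'keep B' 'heading 1' 'data 1 B' > "$tmp/b.txt"

# Decorate step only: prefix each kept line with <line index> TAB <file index>.
decorated=$(
    awk -v OFS='\t' '
        FNR == 1 { fileNr++ }
        FNR >= 4 { print FNR-3, fileNr, $0 }
    ' "$tmp/a.txt" "$tmp/b.txt"
)
printf '%s\n' "$decorated"

# Sort step: group by line index first, then by file index.
sorted=$(printf '%s\n' "$decorated" | sort -n -k1,1 -k2,2)
printf '%s\n' "$sorted"

rm -rf "$tmp"
```

After decoration the lines are still in per-file order; after sorting, all lines with index 1 come first (one per file), then all lines with index 2, and so on, which is the interleaving the final filter and cut rely on.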

Alternatively, using any awk and assuming the number of input files is not large enough to produce a "too many open files" error:

$ cat tst.awk
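BEGIN {
   while ( ! eof ) {
       # read the next line from every input file in lock step
       for ( fileNr=1; fileNr<ARGC; fileNr++ ) {
           if ( (getline vals[fileNr] < ARGV[fileNr]) <= 0 ) {
               eof = 1
           }
       }
       # skip the first 3 lines of every file
       if ( !eof && (++lineNr >= 4) ) {
           if ( lineNr % 2 ) {
               # odd line: a heading, identical in all files, so print it once
               print vals[1]
           }
           else {
               # even line: line 4 or a data line, print it from every file
               for ( fileNr=1; fileNr<ARGC; fileNr++ ) {
                   print vals[fileNr]
               }
           }
       }
   }
   exit
}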

$ awk -f tst.awk file01.txt file02.txt file03.txt
line to keep file 01
line to keep file 02
line to keep file 03
heading 1 in the form: 2017243 01 2017243 01
data 1 file 01
data 1 file 02
data 1 file 03
heading 2 in the form: 2017243 02 2017243 02
data 2 file 01
data 2 file 02
data 2 file 03
heading 3 in the form: 2017243 03 2017243 03
data 3 file 01
data 3 file 02
data 3 file 03

I am using getline above, with care, to avoid reading most of the input files into memory at once; for more information on when and how to use it, see http://awk.freeshell.org/AllAboutGetline .
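As a small illustration of why the script above tests the result of getline with `<= 0`, the following sketch prints the values getline returns when reading from a file: 1 for each record read, 0 at end of file, and -1 when the file cannot be read. The file f.txt and the `.missing` suffix are hypothetical names used only for this demonstration.

```shell
#!/usr/bin/env bash
# Illustrative only: f.txt is a made-up two-line input.
tmp=$(mktemp -d)
printf 'one\ntwo\n' > "$tmp/f.txt"

result=$(
    awk -v f="$tmp/f.txt" 'BEGIN {
        # getline returns 1 while a record is read and 0 at end of file
        while ( (rc = (getline line < f)) > 0 ) {
            print rc, line
        }
        print "at EOF:", rc
        close(f)
        # reading from a file that cannot be opened returns -1
        rc = (getline line < (f ".missing"))
        print "unreadable:", rc
    }'
)
printf '%s\n' "$result"
rm -rf "$tmp"
```

Because both 0 and -1 mean that no further line is available, treating any result `<= 0` as the end of input stops the loop on a short or unreadable file instead of looping forever.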

Source: https://unix.stackexchange.com/questions/719101
