Awk

Grep not returning identical matches from awk pipe

  • September 14, 2017

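I am trying to identify all lines that two files have in common, based on the first column of one file. I am using the following command: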

awk '{print $1}' File1 | fgrep -wf - File2 >Out

File1:

M01605:153:000000000-B55NK:1:1101:10003:14536   chr1    150129998   A   Rev 18
M01605:153:000000000-B55NK:1:1101:10007:14573   chr17   44166311    C   38  44166311
M01605:153:000000000-B55NK:1:1101:10007:14573   chr17   44166500    G   Rev 34
M01605:153:000000000-B55NK:1:1101:10009:9160    chr8    16716272    G   35  16716395
M01605:153:000000000-B55NK:1:1101:10009:9160    chr8    16716336    A   37  16716337
M01605:153:000000000-B55NK:1:1101:10009:9160    chr8    16716336    A   38  16716459
M01605:153:000000000-B55NK:1:1101:10010:14111   chr8    89574844    A   38  89574844
M01605:153:000000000-B55NK:1:1101:10010:19939   chr3    181151945   T   36  181151945
M01605:153:000000000-B55NK:1:1101:10011:22802   chr17   43984669    A   34  43984765
M01605:153:000000000-B55NK:1:1101:10011:22802   chr17   43984669    A   38  43984689

File2:

M01605:153:000000000-B55NK:1:1101:10003:14536   2:N:0:1 GTTTGCGCCGATGTA 
M01605:153:000000000-B55NK:1:1101:10003:4882    2:N:0:1 GCACTGTAAAAAGTA 
M01605:153:000000000-B55NK:1:1101:10007:14573   2:N:0:1 GGGGATAAGCGTTGC 
M01605:153:000000000-B55NK:1:1101:10007:5336    2:N:0:1 GTGTTTGTGTAGCTA 
M01605:153:000000000-B55NK:1:1101:10008:14477   2:N:0:1 GGGCGGAGGTGAAGA 
M01605:153:000000000-B55NK:1:1101:10009:18543   2:N:0:1 AGTTCGAGCGCAGTG 
M01605:153:000000000-B55NK:1:1101:10009:9160    2:N:0:1 CAGAAGAGGTAATGT 
M01605:153:000000000-B55NK:1:1101:10010:14111   2:N:0:1 CTGCGTACTGATAGC 
M01605:153:000000000-B55NK:1:1101:10010:19939   2:N:0:1 TCCGTGGTGCCGGCA 
M01605:153:000000000-B55NK:1:1101:10011:22802   1:N:0:1 TGAGTTCGGATAAAG 

Out:

M01605:153:000000000-B55NK:1:1101:10003:14536 2:N:0:1   GTTTGCGCCGATGTA 
M01605:153:000000000-B55NK:1:1101:10007:14573 2:N:0:1   GGGGATAAGCGTTGC 
M01605:153:000000000-B55NK:1:1101:10009:9160 2:N:0:1    CAGAAGAGGTAATGT 
M01605:153:000000000-B55NK:1:1101:10010:14111 2:N:0:1   CTGCGTACTGATAGC 
M01605:153:000000000-B55NK:1:1101:10010:19939 2:N:0:1   TCCGTGGTGCCGGCA 
M01605:153:000000000-B55NK:1:1101:10011:22802 1:N:0:1   TGAGTTCGGATAAAG 

Expected output:

M01605:153:000000000-B55NK:1:1101:10003:14536 2:N:0:1 GTTTGCGCCGATGTA
M01605:153:000000000-B55NK:1:1101:10007:14573 2:N:0:1 GGGGATAAGCGTTGC
M01605:153:000000000-B55NK:1:1101:10007:14573 2:N:0:1 GGGGATAAGCGTTGC
M01605:153:000000000-B55NK:1:1101:10009:9160 2:N:0:1 CAGAAGAGGTAATGT
M01605:153:000000000-B55NK:1:1101:10009:9160 2:N:0:1 CAGAAGAGGTAATGT
M01605:153:000000000-B55NK:1:1101:10009:9160 2:N:0:1 CAGAAGAGGTAATGT
M01605:153:000000000-B55NK:1:1101:10010:14111 2:N:0:1 CTGCGTACTGATAGC
M01605:153:000000000-B55NK:1:1101:10010:19939 2:N:0:1 TCCGTGGTGCCGGCA
M01605:153:000000000-B55NK:1:1101:10011:22802 1:N:0:1 TGAGTTCGGATAAAG

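Note that the repeated lines (shown in bold in the original post) are missing from the actual output, and they are exactly what I want in the output file.

It seems that grep runs correctly but then collapses all identical lines into a single output line. Any suggestions?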

This is exactly what the join command is for: it joins two files on a common field:

$ awk '{print $1}' File1 | join - File2

M01605:153:000000000-B55NK:1:1101:10003:14536 2:N:0:1 GTTTGCGCCGATGTA 
M01605:153:000000000-B55NK:1:1101:10007:14573 2:N:0:1 GGGGATAAGCGTTGC 
M01605:153:000000000-B55NK:1:1101:10007:14573 2:N:0:1 GGGGATAAGCGTTGC 
M01605:153:000000000-B55NK:1:1101:10009:9160 2:N:0:1 CAGAAGAGGTAATGT 
M01605:153:000000000-B55NK:1:1101:10009:9160 2:N:0:1 CAGAAGAGGTAATGT 
M01605:153:000000000-B55NK:1:1101:10009:9160 2:N:0:1 CAGAAGAGGTAATGT 
M01605:153:000000000-B55NK:1:1101:10010:14111 2:N:0:1 CTGCGTACTGATAGC 
M01605:153:000000000-B55NK:1:1101:10010:19939 2:N:0:1 TCCGTGGTGCCGGCA 
M01605:153:000000000-B55NK:1:1101:10011:22802 1:N:0:1 TGAGTTCGGATAAAG 
M01605:153:000000000-B55NK:1:1101:10011:22802 1:N:0:1 TGAGTTCGGATAAAG 

join expects both inputs to be sorted on the join field. If join complains, modify the command above slightly to sort the inputs with GNU sort:

$ awk '{print $1}' File1 | sort | join - <(sort -k1,1 --stable File2)

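Since your second file seems to contain duplicate lines (see the comments), you may want to change the second sort command to sort -k1,1 --stable --unique File2 (this still assumes GNU sort; otherwise, use uniq).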
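As a rough illustration of the sort, uniq, and join steps working together without GNU-only options or process substitution, here is a minimal sketch. The sample data, the temporary file names (f1.txt, f2.txt, keys.sorted, f2.sorted), and the LC_ALL=C setting are assumptions made for this example; they are not part of the original answer.

```shell
# Hypothetical sample data standing in for File1 and File2.
export LC_ALL=C                      # assumption: fix the collation so sort and join agree
tmp=$(mktemp -d)
printf '%s\n' 'id2 y' 'id1 x' 'id2 z' > "$tmp/f1.txt"
printf '%s\n' 'id2 BBB' 'id1 AAA' 'id2 BBB' 'id3 CCC' > "$tmp/f2.txt"

# Keys from the first file, sorted so that join accepts them.
awk '{print $1}' "$tmp/f1.txt" | sort > "$tmp/keys.sorted"

# Second file sorted on field 1; uniq then drops adjacent identical lines.
sort -k1,1 "$tmp/f2.txt" | uniq > "$tmp/f2.sorted"

# Each key occurrence in keys.sorted pairs with its line in f2.sorted.
joined=$(join "$tmp/keys.sorted" "$tmp/f2.sorted")
printf '%s\n' "$joined"

rm -rf "$tmp"
```

Note that uniq compares whole lines, so it only removes exact duplicates; two lines that share a key but differ elsewhere would both survive.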

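As I interpret your desired output, you want each line of File2 to be repeated as many times as its first field appears in File1. Grep will not do that. Instead, try: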

$ awk 'FNR==NR{a[FNR]=$1;next} {for (k in a) if (a[k]==$1) print}' File1 File2
M01605:153:000000000-B55NK:1:1101:10003:14536 2:N:0:1 GTTTGCGCCGATGTA
M01605:153:000000000-B55NK:1:1101:10007:14573 2:N:0:1 GGGGATAAGCGTTGC
M01605:153:000000000-B55NK:1:1101:10007:14573 2:N:0:1 GGGGATAAGCGTTGC
M01605:153:000000000-B55NK:1:1101:10009:9160 2:N:0:1 CAGAAGAGGTAATGT
M01605:153:000000000-B55NK:1:1101:10009:9160 2:N:0:1 CAGAAGAGGTAATGT
M01605:153:000000000-B55NK:1:1101:10009:9160 2:N:0:1 CAGAAGAGGTAATGT
M01605:153:000000000-B55NK:1:1101:10010:14111 2:N:0:1 CTGCGTACTGATAGC
M01605:153:000000000-B55NK:1:1101:10010:19939 2:N:0:1 TCCGTGGTGCCGGCA
M01605:153:000000000-B55NK:1:1101:10011:22802 1:N:0:1 TGAGTTCGGATAAAG
M01605:153:000000000-B55NK:1:1101:10011:22802 1:N:0:1 TGAGTTCGGATAAAG
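A small experiment may help show why grep cannot produce the repeats: for each input line, grep decides whether any pattern matches and prints that line at most once, however many times a pattern is listed. The sketch below uses made-up sample data (patterns.txt and f2.txt are hypothetical names) and grep -F, which is equivalent to fgrep.

```shell
# Hypothetical sample data: the pattern "id2" is listed twice on purpose.
tmp=$(mktemp -d)
printf '%s\n' 'id2' 'id2' 'id1' > "$tmp/patterns.txt"
printf '%s\n' 'id1 AAA' 'id2 BBB' 'id3 CCC' > "$tmp/f2.txt"

# -F: fixed strings, -w: whole-word match, -f: read patterns from a file.
grep_out=$(grep -F -w -f "$tmp/patterns.txt" "$tmp/f2.txt")
printf '%s\n' "$grep_out"

rm -rf "$tmp"
```

The "id2 BBB" line appears only once in the result, even though its pattern was supplied twice.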

How it works

  • FNR==NR{a[FNR]=$1;next}

While reading the first file, File1, store the first field, $1, in array a under the key of the line number, FNR.

  • for (k in a) if (a[k]==$1) print

While reading the second file, loop over every element of array a and print the current line each time the first field of the File2 line matches a value stored in a.
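The two steps above can also be laid out as a commented, multi-line awk program, which may be easier to follow. This is only a sketch: the sample files (f1.txt and f2.txt) and their contents are invented for illustration.

```shell
# Hypothetical sample data, not the files from the question.
tmp=$(mktemp -d)
printf '%s\n' 'id1 x' 'id2 y' 'id2 z' > "$tmp/f1.txt"
printf '%s\n' 'id1 AAA' 'id2 BBB' 'id3 CCC' > "$tmp/f2.txt"

result=$(awk '
    # FNR (line number within the current file) equals NR (overall line
    # number) only while the first file is being read.
    FNR == NR { a[FNR] = $1; next }

    # Second file: print the current line once for every stored
    # first field that equals this line'"'"'s first field.
    { for (k in a) if (a[k] == $1) print }
' "$tmp/f1.txt" "$tmp/f2.txt")
printf '%s\n' "$result"

rm -rf "$tmp"
```

Because every line of the second file triggers a scan over all stored entries, the work grows with the product of the two file lengths.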

A more efficient alternative

$ awk 'FNR==NR{a[$1]++;next} {for (i=1;i<=a[$1];i++) print}' File1 File2
M01605:153:000000000-B55NK:1:1101:10003:14536 2:N:0:1 GTTTGCGCCGATGTA
M01605:153:000000000-B55NK:1:1101:10007:14573 2:N:0:1 GGGGATAAGCGTTGC
M01605:153:000000000-B55NK:1:1101:10007:14573 2:N:0:1 GGGGATAAGCGTTGC
M01605:153:000000000-B55NK:1:1101:10009:9160 2:N:0:1 CAGAAGAGGTAATGT
M01605:153:000000000-B55NK:1:1101:10009:9160 2:N:0:1 CAGAAGAGGTAATGT
M01605:153:000000000-B55NK:1:1101:10009:9160 2:N:0:1 CAGAAGAGGTAATGT
M01605:153:000000000-B55NK:1:1101:10010:14111 2:N:0:1 CTGCGTACTGATAGC
M01605:153:000000000-B55NK:1:1101:10010:19939 2:N:0:1 TCCGTGGTGCCGGCA
M01605:153:000000000-B55NK:1:1101:10011:22802 1:N:0:1 TGAGTTCGGATAAAG
M01605:153:000000000-B55NK:1:1101:10011:22802 1:N:0:1 TGAGTTCGGATAAAG
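This version is likely cheaper because it stores one counter per distinct first field instead of one entry per line, so each line of the second file needs a single array lookup rather than a scan. A minimal commented sketch follows; the sample files and the array name count are illustrative assumptions.

```shell
# Hypothetical sample data, not the files from the question.
tmp=$(mktemp -d)
printf '%s\n' 'id1 x' 'id2 y' 'id2 z' > "$tmp/f1.txt"
printf '%s\n' 'id1 AAA' 'id2 BBB' 'id3 CCC' > "$tmp/f2.txt"

counted=$(awk '
    # First file: count how many times each first field occurs.
    FNR == NR { count[$1]++; next }

    # Second file: print the line count[$1] times. A key never seen in
    # the first file evaluates to 0 here, so the loop body does not run.
    { for (i = 1; i <= count[$1]; i++) print }
' "$tmp/f1.txt" "$tmp/f2.txt")
printf '%s\n' "$counted"

rm -rf "$tmp"
```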

Source: https://unix.stackexchange.com/questions/392310