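Grep not returning identical matches from an awk pipe
I am trying to identify all the lines two files have in common, based on the first column of one of them. I am using the following command: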
awk '{print $1}' File1 | fgrep -wf - File2 >Out
File1:
M01605:153:000000000-B55NK:1:1101:10003:14536 chr1 150129998 A Rev 18
M01605:153:000000000-B55NK:1:1101:10007:14573 chr17 44166311 C 38 44166311
M01605:153:000000000-B55NK:1:1101:10007:14573 chr17 44166500 G Rev 34
M01605:153:000000000-B55NK:1:1101:10009:9160 chr8 16716272 G 35 16716395
M01605:153:000000000-B55NK:1:1101:10009:9160 chr8 16716336 A 37 16716337
M01605:153:000000000-B55NK:1:1101:10009:9160 chr8 16716336 A 38 16716459
M01605:153:000000000-B55NK:1:1101:10010:14111 chr8 89574844 A 38 89574844
M01605:153:000000000-B55NK:1:1101:10010:19939 chr3 181151945 T 36 181151945
M01605:153:000000000-B55NK:1:1101:10011:22802 chr17 43984669 A 34 43984765
M01605:153:000000000-B55NK:1:1101:10011:22802 chr17 43984669 A 38 43984689
File2:
M01605:153:000000000-B55NK:1:1101:10003:14536 2:N:0:1 GTTTGCGCCGATGTA
M01605:153:000000000-B55NK:1:1101:10003:4882 2:N:0:1 GCACTGTAAAAAGTA
M01605:153:000000000-B55NK:1:1101:10007:14573 2:N:0:1 GGGGATAAGCGTTGC
M01605:153:000000000-B55NK:1:1101:10007:5336 2:N:0:1 GTGTTTGTGTAGCTA
M01605:153:000000000-B55NK:1:1101:10008:14477 2:N:0:1 GGGCGGAGGTGAAGA
M01605:153:000000000-B55NK:1:1101:10009:18543 2:N:0:1 AGTTCGAGCGCAGTG
M01605:153:000000000-B55NK:1:1101:10009:9160 2:N:0:1 CAGAAGAGGTAATGT
M01605:153:000000000-B55NK:1:1101:10010:14111 2:N:0:1 CTGCGTACTGATAGC
M01605:153:000000000-B55NK:1:1101:10010:19939 2:N:0:1 TCCGTGGTGCCGGCA
M01605:153:000000000-B55NK:1:1101:10011:22802 1:N:0:1 TGAGTTCGGATAAAG
Out:
M01605:153:000000000-B55NK:1:1101:10003:14536 2:N:0:1 GTTTGCGCCGATGTA
M01605:153:000000000-B55NK:1:1101:10007:14573 2:N:0:1 GGGGATAAGCGTTGC
M01605:153:000000000-B55NK:1:1101:10009:9160 2:N:0:1 CAGAAGAGGTAATGT
M01605:153:000000000-B55NK:1:1101:10010:14111 2:N:0:1 CTGCGTACTGATAGC
M01605:153:000000000-B55NK:1:1101:10010:19939 2:N:0:1 TCCGTGGTGCCGGCA
M01605:153:000000000-B55NK:1:1101:10011:22802 1:N:0:1 TGAGTTCGGATAAAG
Expected output:
M01605:153:000000000-B55NK:1:1101:10003:14536 2:N:0:1 GTTTGCGCCGATGTA
M01605:153:000000000-B55NK:1:1101:10007:14573 2:N:0:1 GGGGATAAGCGTTGC
M01605:153:000000000-B55NK:1:1101:10007:14573 2:N:0:1 GGGGATAAGCGTTGC
M01605:153:000000000-B55NK:1:1101:10009:9160 2:N:0:1 CAGAAGAGGTAATGT
M01605:153:000000000-B55NK:1:1101:10009:9160 2:N:0:1 CAGAAGAGGTAATGT
M01605:153:000000000-B55NK:1:1101:10009:9160 2:N:0:1 CAGAAGAGGTAATGT
M01605:153:000000000-B55NK:1:1101:10010:14111 2:N:0:1 CTGCGTACTGATAGC
M01605:153:000000000-B55NK:1:1101:10010:19939 2:N:0:1 TCCGTGGTGCCGGCA
M01605:153:000000000-B55NK:1:1101:10011:22802 1:N:0:1 TGAGTTCGGATAAAG
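Note that the repeated lines in the expected output are missing from the actual output; those are what I want in the output file.
It seems grep runs correctly but then collapses all identical lines into a single output line. Any suggestions?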
This is exactly what the join command is for: it joins two files on a common field:
$ awk '{print $1}' File1 | join - File2
M01605:153:000000000-B55NK:1:1101:10003:14536 2:N:0:1 GTTTGCGCCGATGTA
M01605:153:000000000-B55NK:1:1101:10007:14573 2:N:0:1 GGGGATAAGCGTTGC
M01605:153:000000000-B55NK:1:1101:10007:14573 2:N:0:1 GGGGATAAGCGTTGC
M01605:153:000000000-B55NK:1:1101:10009:9160 2:N:0:1 CAGAAGAGGTAATGT
M01605:153:000000000-B55NK:1:1101:10009:9160 2:N:0:1 CAGAAGAGGTAATGT
M01605:153:000000000-B55NK:1:1101:10009:9160 2:N:0:1 CAGAAGAGGTAATGT
M01605:153:000000000-B55NK:1:1101:10010:14111 2:N:0:1 CTGCGTACTGATAGC
M01605:153:000000000-B55NK:1:1101:10010:19939 2:N:0:1 TCCGTGGTGCCGGCA
M01605:153:000000000-B55NK:1:1101:10011:22802 1:N:0:1 TGAGTTCGGATAAAG
M01605:153:000000000-B55NK:1:1101:10011:22802 1:N:0:1 TGAGTTCGGATAAAG
join expects both inputs to be sorted on the join field. If join complains, modify the command above slightly to sort the inputs with GNU sort:
$ awk '{print $1}' File1 | sort | join - <(sort -k1,1 --stable File2)
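Since your second file appears to contain duplicate lines (see the comments), you may want to change the second sort command to sort -k1,1 --stable --unique File2 (still assuming you are using GNU sort; otherwise, use uniq).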
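To illustrate how join handles repeated keys, here is a minimal sketch on toy data. The file names keys.txt and reads.txt and the IDs are hypothetical stand-ins for column 1 of File1 and for File2, and the script assumes a mktemp that supports -d. It sorts both sides under the same locale, then joins on the first field:

```shell
#!/bin/sh
# Toy data (hypothetical, not from the question).
join_demo() (
    LC_ALL=C; export LC_ALL          # same collation for sort and join
    tmp=$(mktemp -d) || exit 1
    # keys.txt stands in for column 1 of File1; id2 occurs twice.
    printf '%s\n' id2 id1 id2 id3 > "$tmp/keys.txt"
    # reads.txt stands in for File2; each ID occurs once.
    printf '%s\n' 'id1 AAA' 'id2 CCC' 'id4 GGG' > "$tmp/reads.txt"
    # Sort both inputs on the join field, then join on field 1 (the default).
    sort "$tmp/keys.txt" > "$tmp/keys.sorted"
    sort -k1,1 "$tmp/reads.txt" > "$tmp/reads.sorted"
    join "$tmp/keys.sorted" "$tmp/reads.sorted"
    rm -r "$tmp"
)
join_demo
```

The id2 line is printed twice because join emits one output line for every pairing of matching keys; id3 and id4 have no partner and are dropped.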
As I interpret your desired output, you want each line of File2 repeated as many times as its first field appears in File1. Grep won't do that. Instead, try:
$ awk 'FNR==NR{a[FNR]=$1;next} {for (k in a) if (a[k]==$1) print}' File1 File2
M01605:153:000000000-B55NK:1:1101:10003:14536 2:N:0:1 GTTTGCGCCGATGTA
M01605:153:000000000-B55NK:1:1101:10007:14573 2:N:0:1 GGGGATAAGCGTTGC
M01605:153:000000000-B55NK:1:1101:10007:14573 2:N:0:1 GGGGATAAGCGTTGC
M01605:153:000000000-B55NK:1:1101:10009:9160 2:N:0:1 CAGAAGAGGTAATGT
M01605:153:000000000-B55NK:1:1101:10009:9160 2:N:0:1 CAGAAGAGGTAATGT
M01605:153:000000000-B55NK:1:1101:10009:9160 2:N:0:1 CAGAAGAGGTAATGT
M01605:153:000000000-B55NK:1:1101:10010:14111 2:N:0:1 CTGCGTACTGATAGC
M01605:153:000000000-B55NK:1:1101:10010:19939 2:N:0:1 TCCGTGGTGCCGGCA
M01605:153:000000000-B55NK:1:1101:10011:22802 1:N:0:1 TGAGTTCGGATAAAG
M01605:153:000000000-B55NK:1:1101:10011:22802 1:N:0:1 TGAGTTCGGATAAAG
How it works
FNR==NR{a[FNR]=$1;next}
While reading the first file, File1, this stores the first field, $1, in array a under the line-number key FNR.
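for (k in a) if (a[k]==$1) print
While reading the second file, this loops over every element of array a and prints the line each time the first field of File2 matches a value stored in a.
More efficient alternative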
$ awk 'FNR==NR{a[$1]++;next} {for (i=1;i<=a[$1];i++) print}' File1 File2
M01605:153:000000000-B55NK:1:1101:10003:14536 2:N:0:1 GTTTGCGCCGATGTA
M01605:153:000000000-B55NK:1:1101:10007:14573 2:N:0:1 GGGGATAAGCGTTGC
M01605:153:000000000-B55NK:1:1101:10007:14573 2:N:0:1 GGGGATAAGCGTTGC
M01605:153:000000000-B55NK:1:1101:10009:9160 2:N:0:1 CAGAAGAGGTAATGT
M01605:153:000000000-B55NK:1:1101:10009:9160 2:N:0:1 CAGAAGAGGTAATGT
M01605:153:000000000-B55NK:1:1101:10009:9160 2:N:0:1 CAGAAGAGGTAATGT
M01605:153:000000000-B55NK:1:1101:10010:14111 2:N:0:1 CTGCGTACTGATAGC
M01605:153:000000000-B55NK:1:1101:10010:19939 2:N:0:1 TCCGTGGTGCCGGCA
M01605:153:000000000-B55NK:1:1101:10011:22802 1:N:0:1 TGAGTTCGGATAAAG
M01605:153:000000000-B55NK:1:1101:10011:22802 1:N:0:1 TGAGTTCGGATAAAG
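For comparison, here is a small sketch that runs the grep pipeline and the counting awk approach side by side on toy data. The IDs and file contents are hypothetical, the array is named count instead of a for readability, and the script assumes a mktemp that supports -d and a grep that accepts -f - for reading patterns from standard input. grep prints each matching line of File2 once, however many patterns match it, whereas the awk version looks up how often the ID occurred in File1 and prints the line that many times:

```shell
#!/bin/sh
# Toy data (hypothetical, not from the question).
count_demo() (
    tmp=$(mktemp -d) || exit 1
    # id2 appears twice in column 1 of File1.
    printf '%s\n' 'id1 chr1' 'id2 chr8' 'id2 chr8' 'id3 chr3' > "$tmp/File1"
    printf '%s\n' 'id1 AAA' 'id2 CCC' 'id4 GGG' > "$tmp/File2"

    echo 'grep:'
    # Each line of File2 is printed at most once, even if its ID was
    # supplied as a pattern several times.
    awk '{print $1}' "$tmp/File1" | grep -F -w -f - "$tmp/File2"

    echo 'awk:'
    # First pass (File1): count occurrences of each ID.
    # Second pass (File2): print the line once per occurrence.
    awk 'FNR==NR { count[$1]++; next }
         { for (i = 1; i <= count[$1]; i++) print }' "$tmp/File1" "$tmp/File2"
    rm -r "$tmp"
)
count_demo
```

Because the counts are keyed by ID, each File2 line costs a single array lookup rather than a scan over every stored File1 entry, which is where the efficiency gain over the first awk command comes from.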