Text-Processing
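Compare columns from two different files and print the records from the first file that do not match the second file
I want to compare columns between file 1 and file 2. Column 2 of file1 should match neither column 1 nor column 2 of file2, and the non-matching records from file 1 should be printed.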
File 1:
cat test.head20.R2.fastq.tab
@0_1_2367_1112_211 ENSG00000165837 GAAATTAAGTTATAATTTTCATGGGACATTTTCATCACTGTTGACACAGTTTCAAGCATTCCATCATGTTATTTTGACTCTTTTTCTTTTTTTTTTCTTT + @6@CDCFFEDEIJIIJJJFBFHIIJJJJJGC?CDDDDDDFEDGBFFFFHEDFFBBBDDDDDDDDBDDD@@@@CDDDDDEHHJJJGJIIIGIJJJIIIFCH
@10000000_0_0_0_0 rupesh TCCCTACTCACGTGGTGGACGCACAACCTAAGGTCAAGCTTATAGGTAAACACGCAGTGAAATATCCAGAAACGAAGCTATCACCCGGGTAGTGTCTTGG + =FGIIIFDCCDDDCAA5BBBBGIJIIGJIJJJJJJIIGGHHIIIJIJIIJJIEE8?DDECGGIEDDDDDDHHJJJJJJIGIIIJED?CB5@CFFHHHCFF
@10000001_0_3150_2465_134 ENSG00000137860 GCCTCTCAAGTAGCTGGGATTACAGGCACCTGCCACCACGCCCAGCCAATTTTTGTATTTTTAGTAGAGACAATTTCACTATGTTGGCCAGGCTGGTCTT + DEDDB>HJIGHFJJJIGFFFHJJJJJJJJIIGHHFFDDCCCIIJJJJJJJJJJJJJIGIJJJHJIFHHGJJIIHEEEDDDDDC>?@DDDEEEDFFFFFFC
@10000002_0_2947_952_158 ENSG00000028203 CCCCCAGGACCAGCTGCTGTTTTGTGATGACTGCGATCGGGGTTACCACATGTACTGCCTGAGTCCCCCCATGGCGGAGCCCCCGGAAGGGAGCTGGAGC + JFHHEDDB;;63JJJIJJJHHFIIJIHGHHHHGHHHJJIIIEEEIJJHHHJJHHFFJIJJJJJJJJJJJJJJHDDDDDDFGBB?8BDDDEDDDDDDDDCC
@10000003_0_8902_3193_186 ENSG00000177051 CAAGGCCAGAGAGACAAATAATGCCTCATGTCCCACTGCTTTAAAATTACATTAATTTATAAAATGGCCACTATGGGCTCTTTTTGACTGTTTCTCGGAG + HGEDDDDDDDDDDDCDDB@<DDDD<>CDDDDB@>DDDEIJJJIGJJIIIIGIJHHHGEEIJJJJJJJJIHJJJJJJIIJIIIJJIJJJJIIIIHEDDDCC
@5000345_0_3_0_0 ENSG00000178057 TCCCTACTCACGTGGTGGACGCACAACCTAAGGTCAAGCTTATAGGTAAACACGCAGTGAAATATCCAGAAACGAAGCTATCACCCGGGTAGTGTCTTGG + =FGIIIFDCCDDDCAA5BBBBGIJIIGJIJJJJJJIIGGHHIIIJIJIIJJIEE8?
File 2:
cat fusions.head16.R2.fastq.tab
ENSG00000137860 ENSG00000165837 1431 1598 0:0:0 0:0:0 0/2 CAGGTCATCTGCTCCTATCTCCTAAGGCCCATGGTTTTCATGATGGGTGTAGAGTGGACAGACTGTCCAATGGTGGCTGAGATGGTGGGAATCAAGTTCT + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
ENSG00000177051 ENSG00000134905 277 433 0:0:0 0:0:0 451/2 CTTCACTGCACAGCCAGGGTGAGCCTCGCTGGGAAGGTGCAGGTGACTCGTGCCTGTCGGGGAGCCCGTCCTGTCCGTACAAAACATGTGCCAGGCAAGG + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
ENSG00000137860 ENSG00000165837 2761 2951 0:0:0 0:0:0 2/2 AAACAATCTTACGGATTAAGAGGAGACGTGAAGCTCAAAAGTTAACAGAGATGACCAGTTTCACATTTCATTTAATGAGCAAACCAACACCTGAGAAGCC + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
ENSG00000028203 ENSG00000157766 183 411 0:0:0 0:0:0 101/2 TTCTTTGTCACCAAAAACAGAAAAATGCACAACAGAGGGACAACAAAAGCCTCCTACAAGAGTCCTACCAAAATACCTGGGATATAGTAATCACTCAATG + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
Desired output:
@10000000_0_0_0_0 rupesh TCCCTACTCACGTGGTGGACGCACAACCTAAGGTCAAGCTTATAGGTAAACACGCAGTGAAATATCCAGAAACGAAGCTATCACCCGGGTAGTGTCTTGG + =FGIIIFDCCDDDCAA5BBBBGIJIIGJIJJJJJJIIGGHHIIIJIJIIJJIEE8?DDECGGIEDDDDDDHHJJJJJJIGIIIJED?CB5@CFFHHHCFF
@5000345_0_3_0_0 ENSG00000178057 TCCCTACTCACGTGGTGGACGCACAACCTAAGGTCAAGCTTATAGGTAAACACGCAGTGAAATATCCAGAAACGAAGCTATCACCCGGGTAGTGTCTTGG + =FGIIIFDCCDDDCAA5BBBBGIJIIGJIJJJJJJIIGGHHIIIJIJIIJJIEE8?
So far I have only managed to print the non-matching entries from the second file, and I don't know how to print them from file1 instead:
awk '{k=$2} NR==FNR{a[k]; next} !(k in a)' test.head20.R2.fastq.tab fusions.head16.R2.fastq.tab
Output of the code above, which is not what I want:
ENSG00000177051 ENSG00000134905 277 433 0:0:0 0:0:0 451/2 CTTCACTGCACAGCCAGGGTGAGCCTCGCTGGGAAGGTGCAGGTGACTCGTGCCTGTCGGGGAGCCCGTCCTGTCCGTACAAAACATGTGCCAGGCAAGG + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
ENSG00000028203 ENSG00000157766 183 411 0:0:0 0:0:0 101/2 TTCTTTGTCACCAAAAACAGAAAAATGCACAACAGAGGGACAACAAAAGCCTCCTACAAGAGTCCTACCAAAATACCTGGGATATAGTAATCACTCAATG + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
$ awk ' NR==FNR {a[$1]++; a[$2]++; next}; !($2 in a)' fusions.head16.R2.fastq.tab test.head20.R2.fastq.tab
@10000000_0_0_0_0 rupesh TCCCTACTCACGTGGTGGACGCACAACCTAAGGTCAAGCTTATAGGTAAACACGCAGTGAAATATCCAGAAACGAAGCTATCACCCGGGTAGTGTCTTGG + =FGIIIFDCCDDDCAA5BBBBGIJIIGJIJJJJJJIIGGHHIIIJIJIIJJIEE8?DDECGGIEDDDDDDHHJJJJJJIGIIIJED?CB5@CFFHHHCFF
@5000345_0_3_0_0 ENSG00000178057 TCCCTACTCACGTGGTGGACGCACAACCTAAGGTCAAGCTTATAGGTAAACACGCAGTGAAATATCCAGAAACGAAGCTATCACCCGGGTAGTGTCTTGG + =FGIIIFDCCDDDCAA5BBBBGIJIIGJIJJJJJJIIGGHHIIIJIJIIJJIEE8?
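This is simpler and easier than I initially thought if you read the exclusion file (fusions.head16.R2.fastq.tab) before the data file (test.head20.R2.fastq.tab). The script reads in the first file and uses the array a to store the identifiers found in fields $1 and $2. Then, for each line of the second file (and any subsequent files), it prints the line if field $2 is not in array a.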
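The same two-pass idea can be tried on a miniature input. What follows is only a sketch under stated assumptions: the file names exclude.tab and data.tab, the GENE_* identifiers, the @read* names, and the array name seen are all made up for illustration and do not come from the question's data. It uses the same default whitespace field splitting as the one-liner above.

```shell
#!/bin/sh
# Hypothetical miniature inputs; every identifier below is invented.
tmpdir=$(mktemp -d)
exclude="$tmpdir/exclude.tab"
data="$tmpdir/data.tab"

# Exclusion file: identifiers live in columns 1 and 2.
printf '%s\t%s\t%s\n' \
    GENE_A GENE_B 100 \
    GENE_C GENE_D 200 > "$exclude"

# Data file: the identifier to test lives in column 2.
printf '%s\t%s\t%s\n' \
    @read1 GENE_B ACGT \
    @read2 GENE_X ACGT \
    @read3 GENE_C ACGT \
    @read4 GENE_Y ACGT > "$data"

result=$(awk '
    # First file only (NR == FNR): record every identifier in columns 1 and 2.
    # Merely referencing seen[key] creates the key; no counter is needed.
    NR == FNR { seen[$1]; seen[$2]; next }
    # Later files: print the line when column 2 was never recorded.
    !($2 in seen)
' "$exclude" "$data")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Only the @read2 and @read4 lines should survive, since GENE_X and GENE_Y appear in neither column of the exclusion file. One caveat with the NR==FNR idiom: if the exclusion file is empty, NR==FNR stays true while the data file is read, so nothing is printed.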