Text-Processing

Compare columns from two different files and print the records from the first file that do not match columns in the second file

  • September 9, 2021

I want to compare columns between file 1 and file 2, and print the lines of file1 whose column 2 matches neither column 1 nor column 2 of file2.

File 1:

cat test.head20.R2.fastq.tab

@0_1_2367_1112_211  ENSG00000165837 GAAATTAAGTTATAATTTTCATGGGACATTTTCATCACTGTTGACACAGTTTCAAGCATTCCATCATGTTATTTTGACTCTTTTTCTTTTTTTTTTCTTT    +   @6@CDCFFEDEIJIIJJJFBFHIIJJJJJGC?CDDDDDDFEDGBFFFFHEDFFBBBDDDDDDDDBDDD@@@@CDDDDDEHHJJJGJIIIGIJJJIIIFCH
@10000000_0_0_0_0   rupesh  TCCCTACTCACGTGGTGGACGCACAACCTAAGGTCAAGCTTATAGGTAAACACGCAGTGAAATATCCAGAAACGAAGCTATCACCCGGGTAGTGTCTTGG    +   =FGIIIFDCCDDDCAA5BBBBGIJIIGJIJJJJJJIIGGHHIIIJIJIIJJIEE8?DDECGGIEDDDDDDHHJJJJJJIGIIIJED?CB5@CFFHHHCFF
@10000001_0_3150_2465_134   ENSG00000137860 GCCTCTCAAGTAGCTGGGATTACAGGCACCTGCCACCACGCCCAGCCAATTTTTGTATTTTTAGTAGAGACAATTTCACTATGTTGGCCAGGCTGGTCTT    +   DEDDB>HJIGHFJJJIGFFFHJJJJJJJJIIGHHFFDDCCCIIJJJJJJJJJJJJJIGIJJJHJIFHHGJJIIHEEEDDDDDC>?@DDDEEEDFFFFFFC
@10000002_0_2947_952_158    ENSG00000028203 CCCCCAGGACCAGCTGCTGTTTTGTGATGACTGCGATCGGGGTTACCACATGTACTGCCTGAGTCCCCCCATGGCGGAGCCCCCGGAAGGGAGCTGGAGC    +   JFHHEDDB;;63JJJIJJJHHFIIJIHGHHHHGHHHJJIIIEEEIJJHHHJJHHFFJIJJJJJJJJJJJJJJHDDDDDDFGBB?8BDDDEDDDDDDDDCC
@10000003_0_8902_3193_186   ENSG00000177051 CAAGGCCAGAGAGACAAATAATGCCTCATGTCCCACTGCTTTAAAATTACATTAATTTATAAAATGGCCACTATGGGCTCTTTTTGACTGTTTCTCGGAG    +   HGEDDDDDDDDDDDCDDB@<DDDD<>CDDDDB@>DDDEIJJJIGJJIIIIGIJHHHGEEIJJJJJJJJIHJJJJJJIIJIIIJJIJJJJIIIIHEDDDCC
@5000345_0_3_0_0    ENSG00000178057 TCCCTACTCACGTGGTGGACGCACAACCTAAGGTCAAGCTTATAGGTAAACACGCAGTGAAATATCCAGAAACGAAGCTATCACCCGGGTAGTGTCTTGG    +   =FGIIIFDCCDDDCAA5BBBBGIJIIGJIJJJJJJIIGGHHIIIJIJIIJJIEE8?

File 2:

cat fusions.head16.R2.fastq.tab

ENSG00000137860 ENSG00000165837 1431    1598    0:0:0   0:0:0   0/2 CAGGTCATCTGCTCCTATCTCCTAAGGCCCATGGTTTTCATGATGGGTGTAGAGTGGACAGACTGTCCAATGGTGGCTGAGATGGTGGGAATCAAGTTCT    +   IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
ENSG00000177051 ENSG00000134905 277 433 0:0:0   0:0:0   451/2   CTTCACTGCACAGCCAGGGTGAGCCTCGCTGGGAAGGTGCAGGTGACTCGTGCCTGTCGGGGAGCCCGTCCTGTCCGTACAAAACATGTGCCAGGCAAGG    +   IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
ENSG00000137860 ENSG00000165837 2761    2951    0:0:0   0:0:0   2/2 AAACAATCTTACGGATTAAGAGGAGACGTGAAGCTCAAAAGTTAACAGAGATGACCAGTTTCACATTTCATTTAATGAGCAAACCAACACCTGAGAAGCC    +   IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
ENSG00000028203 ENSG00000157766 183 411 0:0:0   0:0:0   101/2   TTCTTTGTCACCAAAAACAGAAAAATGCACAACAGAGGGACAACAAAAGCCTCCTACAAGAGTCCTACCAAAATACCTGGGATATAGTAATCACTCAATG    +   IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

Desired output:

@10000000_0_0_0_0   rupesh  TCCCTACTCACGTGGTGGACGCACAACCTAAGGTCAAGCTTATAGGTAAACACGCAGTGAAATATCCAGAAACGAAGCTATCACCCGGGTAGTGTCTTGG    +   =FGIIIFDCCDDDCAA5BBBBGIJIIGJIJJJJJJIIGGHHIIIJIJIIJJIEE8?DDECGGIEDDDDDDHHJJJJJJIGIIIJED?CB5@CFFHHHCFF
@5000345_0_3_0_0    ENSG00000178057 TCCCTACTCACGTGGTGGACGCACAACCTAAGGTCAAGCTTATAGGTAAACACGCAGTGAAATATCCAGAAACGAAGCTATCACCCGGGTAGTGTCTTGG    +   =FGIIIFDCCDDDCAA5BBBBGIJIIGJIJJJJJJIIGGHHIIIJIJIIJJIEE8?

So far I have only managed to print the non-matching entries from the second file; I don't know how to print the ones from file1 instead.

awk '{k=$2} NR==FNR{a[k]; next} !(k in a)' test.head20.R2.fastq.tab fusions.head16.R2.fastq.tab

Output of the above code, which is not what I want:

ENSG00000177051 ENSG00000134905 277 433 0:0:0   0:0:0   451/2   CTTCACTGCACAGCCAGGGTGAGCCTCGCTGGGAAGGTGCAGGTGACTCGTGCCTGTCGGGGAGCCCGTCCTGTCCGTACAAAACATGTGCCAGGCAAGG    +   IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
ENSG00000028203 ENSG00000157766 183 411 0:0:0   0:0:0   101/2   TTCTTTGTCACCAAAAACAGAAAAATGCACAACAGAGGGACAACAAAAGCCTCCTACAAGAGTCCTACCAAAATACCTGGGATATAGTAATCACTCAATG    +   IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

$ awk '
  NR==FNR {a[$1]++; a[$2]++; next};
  !($2 in a)' fusions.head16.R2.fastq.tab test.head20.R2.fastq.tab
@10000000_0_0_0_0   rupesh  TCCCTACTCACGTGGTGGACGCACAACCTAAGGTCAAGCTTATAGGTAAACACGCAGTGAAATATCCAGAAACGAAGCTATCACCCGGGTAGTGTCTTGG    +   =FGIIIFDCCDDDCAA5BBBBGIJIIGJIJJJJJJIIGGHHIIIJIJIIJJIEE8?DDECGGIEDDDDDDHHJJJJJJIGIIIJED?CB5@CFFHHHCFF
@5000345_0_3_0_0    ENSG00000178057 TCCCTACTCACGTGGTGGACGCACAACCTAAGGTCAAGCTTATAGGTAAACACGCAGTGAAATATCCAGAAACGAAGCTATCACCCGGGTAGTGTCTTGG    +   =FGIIIFDCCDDDCAA5BBBBGIJIIGJIJJJJJJIIGGHHIIIJIJIIJJIEE8?

This turns out to be simpler and easier than I originally thought, provided you read the exclusion file (fusions.head16.R2.fastq.tab) before the data file (test.head20.R2.fastq.tab).

This reads in the first file and uses the array a to store the identifiers found in fields $1 and $2.

Then, for each line of the second file (and of any subsequent files), it prints the line if field $2 is not a key in the array a.
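
To make the two steps easier to follow, here is a small self-contained sketch of the same idea. The file names exclude.tab and data.tab, the identifiers G1 to G9, and the array name seen are made up for illustration and are not the poster's data; it also assumes the default whitespace field splitting used in the command above.

```shell
#!/bin/sh
# Hypothetical miniature inputs, NOT the real FASTQ-derived files.
tmpdir=$(mktemp -d)
printf 'G1\tG2\t10\nG3\tG4\t20\n' > "$tmpdir/exclude.tab"
printf '@r1\tG2\tACGT\n@r2\tG9\tTTGA\n@r3\tG3\tCCAT\n@r4\tname\tGGGA\n' > "$tmpdir/data.tab"

result=$(awk '
  # First file only (NR equals FNR there): remember every ID in column 1 or 2.
  # Merely referencing seen[...] creates the key; no counter is needed.
  NR == FNR { seen[$1]; seen[$2]; next }
  # Later files: print lines whose column 2 was never seen.
  !($2 in seen)
' "$tmpdir/exclude.tab" "$tmpdir/data.tab")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

With these sample files only the @r2 and @r4 lines are printed, because G2 and G3 both occur in exclude.tab. One caveat with the NR==FNR idiom: if the exclusion file is empty, the condition stays true while the data file is read, so nothing is printed.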

Source: https://unix.stackexchange.com/questions/537905