Shell-Script

How to copy lines from multiple files into one file and label each line with the name of its original file

  • January 13, 2020

My problem: I have 200 files in FASTA format, for example:

/User/Bin/bin.0.fa
/User/Bin/bin.1.fa
...
/User/Bin/bin.200.fa

Each .fa file contains contig name IDs, each followed by its nucleotide sequence, laid out as follows:

In /User/Bin/bin.0.fa

>c_000000000001
CGACATTTTCCAACTTATTTTTTCCTGTAGTAAAAATTATTTACATACAAAAAAGGAGCTGTTCACTAATTATTTAGTGC
>c_000000000002 
TACAACTCCTTTTTACTATTCTTCTGAATTTGATTTTTCATCCATTTGTTTTTGAGCTTCTTGAACTAATTTATCAAGACTATTATCTTCTACAACTTCATTTTCTTGTCTATCTAATTCATCTGTTAATGTTAATTGCTGATCTTTATCTTCTACATCT CTACCTGAAATTTTAGCTATAGCTACAATCTTTTCTTCATCAGAAGTTCTCATTAATCTAACTCCCATTGTAGCTCTAC
>c_000000000003  
AGTTACAGATACATCTGATACATTAATTCTTATAGCAACACCACTTGTATTTATAAGCATTAATTCATCTTCAGATTTACATACTGTTGCACCAACAACTTTACCAGTCTTTTCACTGATTTTGTATGTTATTAAACCAACTCCACCTCTATTTTGTCTC
...

In /User/Bin/bin.1.fa

>c_000000000004
GGATCATCGCTTGTACATCCCAAACCAAAAAAGAATACTGCACTTACAATCAGTTGGATTTGAAACGCGATTTTCATTTTTGGTATATGTTTAAGATTAGCACTTTGTTTCATTGCTTTTGGCTATGAACGATGTTTACGGGGGTGTA
>c_000000000005 
GAAAGAAGCGTATTGGTCGGTATAAATACCGCTCAACTAAACGAGCACAAAGCTACCGAAAATTTGGATGAATTGGCTTTTCTGGCCCAAACGGCTGGAGC
>c_000000000006
CGGCACTTATTTGCCCCAGCCCATTTTGGGGGTAGAAATACCCAAGAGCAAGGGAAAGGTTCGCCTTCTGGGTGTGCCTACCGTGGTTGACCGTATGTTGCAAC
... 

...

In /User/Bin/bin.200.fa

>c_000000020120   
CTCTGCAACTGGATCCCGAAAAGATCCGCAAAGAAAGCGAACCCAAAGAAAAAGTCGATCTGGAGAGCACCGTCGCCCGCAGTCTGGCCACCCT
>c_000000020121
CATCAATCATCTCAAATACTACCGCAACGCAGATTATTCCCAGTGCAATAACAAAACCGACTCCCGCCTCTTTTGTCTGGCCGTA
>c_000000050122 
GGTACGCCTCCGGCAGAACAAGGCGGCAACGAACCTCAGAACGAGGGAAAGCTAACCCAGGCCGGGTACGCCTCCGGCAGAACAAGGCGGCAACGAACCTCAGAACGAGGGAAAGCTAACCCAGGCCG
...  

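I want to copy every contig name ID (without the ">") from each .fa file into a single TAB-delimited txt file, where each contig name ID is labeled with the number of its original file, n+1. Like this: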

In /User/Bin/Summary.txt

c_000000000001 Bin_1
c_000000000002 Bin_1
c_000000000003 Bin_1
...
c_000000000004 Bin_2
c_000000000005 Bin_2
c_000000000006 Bin_2
...
...
c_000000020120 Bin_201
c_000000020121 Bin_201
c_000000020122 Bin_201

Given the sample input/output you posted and the answer you accepted, all you really need is GNU awk, for ARGIND:

awk -F'>' -v OFS='\t' 'NF>1{print $2, "Bin_"ARGIND}' /User/Bin/bin*.fa > /User/Bin/Summary.txt

Or with any awk:

awk -F'>' -v OFS='\t' 'FNR==1{++c} NF>1{print $2, "Bin_"c}' /User/Bin/bin*.fa > /User/Bin/Summary.txt
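One caveat applies to both awk commands: ARGIND and the counter c number the files in the order the shell expands bin*.fa. That order is normally character by character rather than numeric, so bin.10.fa sorts before bin.2.fa and most labels no longer equal n+1. A minimal Python sketch of the difference, using a handful of made-up file names that follow the question's bin.<n>.fa pattern (the exact shell order also depends on the locale, so the plain string sort is only an approximation):

```python
# Hypothetical file names following the question's bin.<n>.fa pattern
names = ["bin.%d.fa" % n for n in (0, 1, 2, 10, 11)]

# Plain string sort, roughly how a shell orders the expansion of bin*.fa
lexicographic = sorted(names)

# Sort by the integer between the dots instead
numeric = sorted(names, key=lambda name: int(name.split(".")[1]))

print(lexicographic)
print(numeric)
```

In bash, passing the files through brace expansion, for example /User/Bin/bin.{0..200}.fa instead of /User/Bin/bin*.fa, should hand them to awk in numeric order.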
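
#!/usr/bin/env python3

import os

bin_dir = '/User/Bin'
# Keep only the bin.<n>.fa files and visit them in numeric order of <n>
files = [f for f in os.listdir(bin_dir) if f.startswith('bin.') and f.endswith('.fa')]
files.sort(key=lambda f: int(f.split('.')[1]))
for file in files:
    # bin.<n>.fa is labeled Bin_<n+1>, as in the question
    bins = 'Bin_%d' % (int(file.split('.')[1]) + 1)
    # os.listdir returns bare names, so join them with the directory
    with open(os.path.join(bin_dir, file)) as fi:
        for line in fi:
            line = line.strip()
            if line.startswith('>'):
                print("%s\t%s" % (line[1:], bins))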

As long as you are on Linux, you probably already have Python installed. This may work.
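To try the idea on a small sample before running it on all the files, the same logic can be wrapped in a function that writes the TAB-delimited file directly. This is only a sketch under a few assumptions: the function name write_summary, the temporary directory, and the short contig IDs are made up for illustration, and header lines are assumed to hold nothing but the ID and optional trailing spaces, as in the question.

```python
import os
import re
import tempfile

def write_summary(bin_dir, out_path):
    """Write 'contig_id<TAB>Bin_<n+1>' for every header in bin_dir/bin.<n>.fa."""
    pattern = re.compile(r"^bin\.(\d+)\.fa$")
    numbered = []
    for name in os.listdir(bin_dir):
        match = pattern.match(name)
        if match:
            numbered.append((int(match.group(1)), name))
    with open(out_path, "w") as out:
        # Sorting the (n, name) pairs visits the files in numeric order
        for n, name in sorted(numbered):
            with open(os.path.join(bin_dir, name)) as fasta:
                for line in fasta:
                    if line.startswith(">"):
                        # Drop '>' plus any trailing spaces and the newline
                        out.write("%s\tBin_%d\n" % (line[1:].strip(), n + 1))

# Tiny made-up sample instead of the real /User/Bin directory
sample_dir = tempfile.mkdtemp()
for n, ids in [(0, ["c_1", "c_2"]), (10, ["c_4"]), (2, ["c_3"])]:
    with open(os.path.join(sample_dir, "bin.%d.fa" % n), "w") as f:
        for contig in ids:
            f.write(">%s \nACGT\n" % contig)

summary_path = os.path.join(sample_dir, "Summary.txt")
write_summary(sample_dir, summary_path)
with open(summary_path) as f:
    summary = f.read()
print(summary)
```

For the real data the call would be write_summary('/User/Bin', '/User/Bin/Summary.txt'); the regular expression skips Summary.txt and any other file that does not match bin.<n>.fa.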

Source: https://unix.stackexchange.com/questions/561359