Bash

根據每個標題行的特定列上的關鍵字將文件拆分為單獨的文件

  • April 28, 2022

我想根據標題行中提到的第一個染色體創建單獨的文件。標題行中提到了 24 條染色體,它們的序列在接下來的 2 行中提到。文件架構是:

header

人類序列

其他基因組序列

但是,所有染色體序列都合併在一個文件中,我想將其拆分為單個染色體文件以及它們各自的序列對。我為此創建了一個 Python 腳本,但在集群上上傳大合併文件需要大量時間,並且經常導致連接錯誤。所以,我想用 Bash 腳本來做。

這個想法是在標題行的第 2 列搜尋“chrY”(或任何 chrome 名稱),然後將該標題行及其 2 個後續序列行粘貼到單獨的文件中。

2057524 chrY 68 170 chrX 23685 23787 - 4125
TCCAGACTACCAGACACAAGACATTACACATTGTAATGCATTAAATGCATAGTTTTAACAGTAATAATTTAAAAGAGATTTAGAATTTTATAATGTTTGGAAA
TCAAGGCCCCGGGCTACCTGACATTACCCTCATTAATGCATGAAATGCAGAATATTAACATGAGCAATTTAAGATGAACTTAAGATTCTGTAATGTTTAGAAA

細節:

2057524 chrY (human chromosome) 68 170 chrX (other genome chromosome) 23685 23787 - 4125 -> header line
TCCAGACTACCAGACACAAGACATTACACATTGTAATGCATTAAATGCATAGTTTTAACAGTAATAATTTAAAAGAGATTTAGAATTTTATAATGTTTGGAAA (human sequence)
TCAAGGCCCCGGGCTACCTGACATTACCCTCATTAATGCATGAAATGCAGAATATTAACATGAGCAATTTAAGATGAACTTAAGATTCTGTAATGTTTAGAAA (other genome sequence)

用於檢測:

2057521 chr10 57211219 57211230 NW_007726181v1 1018288 1018299 + 575
CTGGGCACTATG
CTGAGCGCGGTG

2057522 chr2 57211231 57214400 NW_007726181v1 1018406 1021615 + 116172
GTTTtgagcttgt----acccagcgctgcttttgccttgctctgtgaccccaggcaagctgcctcacctctctgggccagtttccccat-cgtacagtggTGCTGCACACCCTGGCCCTGGCCC-CGAGGTGGCTGGGAGGTGGCTCCTCAAACAGCCGCTTTCTCATCAGTGCCCGGTGCTGGGT-CAGGGATCGACTGAGGCTCT--GAGCTAACTAGGAAACACAGTGGCCTTG--GAGGGCTGGGGAGTGTCATGGGGGTG---GGGACAGGGAGCCACCGGTCGCATGTGACTGAACTCTT-----------------CACCCCAGTCTGTGGCTTTCCCGTTGCAGTGAGAGCCACGAGCCAAGGTGGGCACTTGATGTCGGATCTCTTCAACAAGCTGGTCATGAGGCGCAAGGGTAGGAGGCAGGGCCGCTGCCCGCCCTGGGTCGGCACCT---------------TGTAATTCTGTCCTGCCTTTTTCTTCCTGTATTTAAGTCTCCGGGGGCTGGGGGAATCAGGGTTTCCCACCAACCACCCTCACTCAGCCTTTTCCC-TCCAGGCATCTCTGGGAAAGGACCT------GGGGCTGGTGAGGGGCCCGGAGGAGCCTTTGCCCGCGTGTCAGACTCCATCCCTCCTGTGCCCCCACCGCAACAGCCACAGGCAGAGGAGGACGAGGACGACTGGGAATCCTAGGGGGCTCCATGACACCTTCCCCCCCAGACCCAGACTTGGGCTGTTGCTCTGACATGGACACAGCCAGGACAAGCTGCTCAGACCTGCTTCCCTGG-GAGGGGGTGACGGAACCAGCACTG---------TGTG-GAGACCAGCTTCAAGGAGCGGAAGGCTGGCTTGAGGCCACACAGCTGGGGCG---GGGACTT-CTGTCTGCCTGTGCTCCATGGGGGGACGGCTCCACC--------CAGCCTGCGCCACTGTGTTCTTAAGAGGCTTCCAGAGAAAACGGCA-CACCAATCAATA-----------AAGAACTGAGCAGAAACCAACAGTGTGCTTTTAATAAAGGACCTCTAGCTGTGCAGGATGCAAACGTCTCGGGGTCAGTGACTGCCTCCTGCCCCTGTTGGTCCCTAGGCAGTGGGGGCAGAAGCTCCCAGCTGACCTG------TTTCTCTGGGAGAGAAGGGCAGTCAGCAGGGGCAGCTGTTGCAGATGGGAGGAATAG--------TCTCCCACA----AAAAAGGTTTCAGTGACAGACACGGGGTCTCTAAAAATAGTCATGCTGAGAGCCCAATGGCCCTTGGCACAATTGCTGGTGTTGGGGTAGAAGATGTCTTGGAGTTTGCTCAAGTGGTTGAGAGGGAGGGAGGTGCCATCAACTT---GGAGGAACTGGCACCAAGCCAGGGAGATAGAAATCCAGGCAAGGCTGTGGGGCAGGTTAGGGAGCAAGGCTGCAGGGGTGACTCAGGAAGAAGGTGGGGGAGGTGACAAGCCCCCAGGCAGGGGCCCTGTGGCC-------------ATGGGGATCTTTTTAAATTGAGACTAGGGGGTGAATAGTCCAGGGCAGCTAACTTTAGTTATTATAGAAAG-GGCAGTAGCAGATGGGTCTG-CTCCGTCTCGCTTCTAAGAAGGTGG---------------GCAGGACAAATGGCAGCCTCCTGCAGAGGCCCAGTGAGAAGCCTGGCCC-------TCGGCCAC-----ACAGGATGGAAGACAGATTGGATTCCACAGAGGGGAGCTGCCCTGGGAAGATCTCACGGATGGCCAGGACCCACCATTTCTTCGGGGTTCCCCT-GTTTTCTCCAACGGGCACTAATGCCTGTGCCTGGGTCCTGGCAACAC----------------------TCTGGACTCCACACTCT--TCTGGGTTTCACCTTTGTA-GCAGGATCCCTGCAGATCAGGCCCATGACAAACACCGTCTCCAGCGGGCAGAGCAAAGGAAGGGCGCAGCGCCAGGCAGTGGTGCAGCTGCCTGTCAGGAAGAGGCCTACTTCT---GGTGAAACTGGGCAGAC---AAAAGGCAGTGAGAAATGTGATCTCGGGGTGGTGGAGGCTC-TAGGGAAAGGAAAAGGCAGGAGTGAACTTCCACACAGCAGCAATGGCAGAACCAAAGGTGGCTTTGACCTCCACGAGGGCTCAGATCCAGGCCAACAGCTTGTCCAGGACAGGGTGCCGGGTGTATCACTAATCCAGGAGCACTATGCTGGCAGAATCCCTTTGGTGCCTGATGGCCCTGCCTTCGTGGGAACAGAGGCTAAGGCTTTGAGTTACAGCTGCCTCCCCAACAGTGCATCCCCTTCTCCTTCCTCAGCCTCAGGTAGGAGACAGGGCAGGCAACCCCCCTTTCCTCTTCTCCCCTTCTCCAGCCCCTGTCTGTCCACCCAGCTGGAGGCAG--CCAGGCTTGCCTATGGACTGGTTGACAGCCTTCATGCACAGGTTCTCCACCAGAGCCTTTCTTGGGGGCCCCTGGCT--GGGCTCTGAGCTGGGAGTGAAGGGGATGACCCATGCGGACTGTTTGCTGC-------------TTGTAGCTTTCCCTGGGA-AAGACTCTGCCAGGCCTTGGAGCCAGACCAGGAGGCTTTATAGGCCACTGCAAGCAGCAGGGCTCCAGATGACATCACAGGGAATATCAAGAGGGTGTGGAGGGGCATCGAAGCCTCTCCAGGAG---ACAG----GAGAC---GCCGGCCCAGTAGAGCCCTAGGGGCGACGCCACTCCCACTCACTGTCTACTCTCCTCTCACCTCTGCAACACTGGGGACACTCACAAGATTGTGATCCAAGTCGGCCGTCGTCTTCTGCAGCTCTGGAGACCTGATGCTGGGGAAGGGCATGCCTGGCATCACCACACACCTGGGAGGAGACAGGAGCCTG-GGGCCGGTGG---------------------GCCCACACATCACCAGCTGCTCCGTTCTACCATTTCTTCAGCCCTCTTGGCTGTGC-CTGCGGCTCTGCCCCTCCCCTCTCTGCACCTACCACCCAGAGAGGGCTTGTTGAGCTCAGAGATCCCACCTAGGCCAATCCACTGGGTTCTGTGGCAGCGATGGCCTGCCTGATCTTCCACCTGCTCTCCCAGGGCCAAAGCCAGACCTGCTGAGCCCCTCCC--TCCAGCCGGCTGGT-CTGAGCAGTCACAGCCCGGCTTTGGGCTCCGATGGCAGCAGATGGCAGGTAGGGGTCCAGCTGCTGG-AGCGAGGGCCGGCCACGTATCACAG-CCAAGGAGATGAGCACAAG--CACTACTTACTGGCCTAGGTTGTCAGAGAAGTTGATGCTCTCACTCATCTTTCCTCCAATC
gtcctgagtttgccaaggcccagctctgcttctgacttgtcctgtg-----agacaaagtgcctaacgtctttgggccagtttcctcatccccacagtggggctgcaca-cctgcccgtgtcttacaggatggccgtgatgt------tca-----CCATTTTCTTAT-AGTACCCACTGCCAGATACACGGACAGACCAAAGCTCCCAGAGCTCA-TGGTGAACAT-GTGGCTGTGGAGAGGGCTGGGGACTGTTGCAGGGGCAAGTGAGCCAAGAGGGCACTGG-CGACTGGGCCTGGAGCCTCCCGACTTGGCCCCCGCACCCTCCCACCTGCAGCTTTCCTCTTGCAGTGAGAGCCACAAGCCAAGGTGGAGACCTGATGTCAGATCTCTTCAACAAGCTGGTCATGAGGCGCAAAGGTAGGAGGCAGAGGGGCCGCC--TTCAGGGCAGGGGCCTCAGGGTGTCCTGCAGTGTACTTCTGTTCTGCCTCTTTCTTCCTGTTTTTAAATCTTCAGGGGCTGTGTGACCTGGGGCCTCCATACACCCTCCCTCA--CAGCCTTTTCCCTTCCAGGTATCTCCGGGAAAGGACCTGGAACAGGGGCCAGCGAGGGGCCAGGAGGAGCCTTCGCCCGAATGTCAGACTCCATCCCGCCTCTGCCTCCCCCACAGCAGCCAC---CGGGAGAGGACGAGGATGACTGGGAATCCTAGGGGTCT-CAGCACTCCTTCCTCCCCCAACCCAGACTTGGGCTGTGGCCCTGAGACAGACACAGCTGGGACA--------------GCCCCCTTGGTGAGACAGGGATGGTG-CAGGACTGCCCTACGTCTGTGCTGGGCCTTCTTCAGGGAGCGGGTAGGTTGCATGAAACCATAAGTGTGGGGTGGGAGGGGCTCGCTCTCCACCTGTGCCCCACCGTGTGCCTGCTCTACCCACCCCTTCAGCGTGTGCTCCTCTTCCCGAAAGAGACT--CGAAGAAAACAGCACCATGAATCAATAAAGGACGATGTAAGAACTGAGCATAAACCAACAGTGCACTTTTAATTAAGGAGTCAAGGCTGGGTGGCTTGCAAACATCTGAGAACCAGTGACTG--TCCTGCCCC-GTGGGTCTCCAGGCAAT-GGGGCAGAACATCTGAGTGGACCAGGGCCCCTTGCACTGGCTCGAAGGTGCAGTCAGCAGGGGCAGCTGCTGTGGATGGGAGGGAGGGAGGGAGATGTTCCCACGGGATAAAGATGTCTCAGTGACAGACATGGGGTCTCTAAAAATAGTTGTGCTGAGAGCCTAATGGCCCTTGGCATAATTGCTGATGTCAGGGTAGAAGGTGTCTTGGAGTTTGCTCAAGTGCCTGAGAGGGAAGGAGGTGCCATCAACTTGGAGGAGGAATGGGAGCCAAGCCAGAGAGA-AAAGCCCTGCGCGGAGCTGTGGAGCAGACCA--GAGCACAGCTG-----------------------------------AGGCTGGCAGG-AGGAGCCGTGTGGACAGCAGAACTAGAAATGGGGAACGTTTTGAGT------------GTGAAATGTCTAGAACAGCTCATTTTAGCTAGGATGAACAGAGGCAG-----GATGGGCCTGTTTCCATCGGACCTCTGAGAAGGTGGCTACTGAGAAAACATGCAGGACAGAAG-----CTGCAGCAGAACACCGGGCAGGAGCCTGGCGCGGCCAGTGTGGCCACACTAAACAGGGAGGAAGATGCAATGG------CAGGGAGCAGCTGCCCTGCAGTGGGCTCAAGGGCAGTCAGGACCCACTGTTTACTCAGGATCAACCTAGTTTTCTCCAACTGGCTTTTCTACCTGGGCCTGCATGCGGGCAGCCCACTGATGCTGGAAGGGGGCTGGTCTGGACCTCACACTCTACACCTGGTTTCACCTTCTTAGGCAGGATCCCTGTAGACCAGGCCCAAGACAAACACCATTCTAAGTGGGCAGGGTAAAGGAAGAGC------CCGGGC--TGGTGCAGCCATCCATCAGGAACGGCCAAACTTCTCCCGATGAAACTGGGGAGATGGGAAAAGGCAGTGAGAGACTAGATCTCAGGGTGA-GCAGGCTCGGGGGGGAAGGAAAAGGCAGGACTGACCTTACGCATAGCAGCAACAGCATGGCCAAAGGTGGCCTTGACCTCCACACGGTCTCGGATCCAGCCTGGCAGCTTTGCCAGGATGGGTGGGCGGGCATATCGCTGGTCTAGGAGCACTATGCTGGCAAAATCCCTCTGGTGCCTGATGGCTCTGCCTGGATGGGAACAGAATTTGGGGCTCCTAGGTAAA-------------------ATCCTCTCCTGTGACTTCATTCTC-------------------CAACCACCCAT--CTGTACTCC----------CAACTATCCATCCTGACAGCCAGGAGCAGTCCCAGGCTTACCTATAGATTGGTTGACAGCCTTCATACACAGATTCTCCACCAAGGCCTTCCCTGGTGGGGGCTGGCCTGGGGTTCTGGGCTGGGAAGGGTAGAAAGGACCTATCAGAACTGTTCCTTACCTCCTGTCTAGTGTTCTAGCTCTCCCTGGGAGAAGAGCCTGCCAGGCTTTGGAGCAAGACCAGGCAGCTTCACAAGCCAGTGCCAGCAGCTGG------CACGATGTCATGGAGAAGGTCAAGAGGGGGACAGGAAACACC--AGCATGGCAAGGAAGTCACAGCTACAAGACCCTGCTATCTCAG------CCTAGGGAATACACCACACTTCCCCCCGGCC--CTCTCCTCAT-CCTCTGGAATCCTGGAGGTACTCACAAGGGTCTGATCCAAGTAGGTCATCTTCTCTTGTAGTTCTGGAGAGTTGATGTTGGGGTAGGGCATGCCCACCATCACTACACACCTAGGTGGAGATGCACGCCGATGGGCATGTGGCCTCACACTCACTGAGTCCTCACCCACATGCCACCGACTGCT--GCTCTACCTCTGCTGCCG--CTCTTGGCTATGCTCGGCAGCTCTACCCTCCGCATC-CCGTACCTACCACCTGGAAAGGATTTTTTCAGCTAAGAGACCCAGTCTAAGCCAATGAACATAGTCCTGATAAGGTTATGGTTTGCCCCATTTTCCATCTGCTCT-CAAAGGCCCAATCCAGAGTTGCTGAAACACTTCCCGCCTGGCTGCCTGATCCTGAGCAGC--CAGCCTGGGTGCAGACTCAGATGGCATCAGATTGCAGGT-GGGGCCCAGCTGCTGGAAGTGAGGAGTAGCCAGGTGTCATAGCCCCAGGAGAGAAGGAGAGGACCACTACTTACCGGCCTAGGTTGTCAGAGAAGTTAATCCCTTCACTCATCTTTCCTCCAACC

2057523 chrY 57214466 57215088 NW_007726181v1 1023265 1023919 + 29358
GGCCCATCCCACTCTAGGCATGGCTCCTCTCCACAGGAAAACTCCACTCCAGTGCTCAGCTTGCACCCTGGCACAGGCCAGCAGTTGCT---GGAAGTCAGACACCTGCAGATCAAGACCACAGCATCAAGACCCTGTGACCTCTCAAAGGCCTGGTGGAAAGGA--------------CACGG-----------GAAGTCTGGGCTAAGAGACAGCAAATACACATGAACAGAAAGAAGAGGTCAAAGAAAA--GGCTGACGGCAAGTTAACAAAAAGAAA--AATGGTGAATGATACCCGGTGCTGGCAATCTCGTTTAAACTACATGCAGGAACAGCAAAGGAAATCCGGCAAATTT-GCGcagtcattctcaacaccggccatgcagcaaaatcatcagtggaaatttaaaaaaatacacgtggccaggccccagcccaaatcact-aataagaatctccaggg-CTtcacctgttagactggcaaaaaatccaaaag--taaacactttgtggagaaacaggcactcctagacattgctggtgggatacagaacagtacaattctga------------tggtaatcagttaacaaattaaacatatttattttatacttttaaacccaggaatcccatatttaggagtctactgagaccaaacagc
GGCTCGCTCCGCCCGGGTCACA-CTCCTCACCGCAGGAGAACTCCACCAC-TCGCTCAGCCTCAGCCCCAGCGCACGCCAGCAGCTGCTCCCGGAAGTCAGACACCTGCAGA------CCACAATGGCAGGGCCCTGTGACCTCCCAGAGGCACAGGGGAGAAGAACCTCAGGCCTCGGCATGGAGGGCAAGACAGAAGTCTGGGCTGGAAGGCAGCAAGTACGTACAAACAGAAAAAAGAGCTAAAAAAAAAAAGGCTAACAACAAATTAACAATAATAAATAAATTGTTAATGATATCCAGTGTTGGCAGTTTCATTTAAGCTACTGGTAGAAACAGCAAAGGAAATCTGGCAAACTTGGCAcagtgattctcaaccctggctatgcatcaaaaccagcagtgggaatttaaaaaaatacACATGGCCCAGCCACAGTCCAAACTACTGAATAACAATCTCCAGGGttttcacctaccaaattggc---aaatccgaaagtttaaccactctgtggagaaaaaggcatttttaaacattgctggtgcaatacagaatagtacaactcttacataggggaatttgacaat-acttaacaaattaaatgga-----tttttactttttaactcaggaatctcatatctgggactccacccagaatacacagc

2057524 chrX 68 170 NW_007727164v1 23685 23787 - 4125
TCCAGACTACCAGACACAAGACATTACACATTGTAATGCATTAAATGCATAGTTTTAACAGTAATAATTTAAAAGAGATTTAGAATTTTATAATGTTTGGAAA
TCAAGGCCCCGGGCTACCTGACATTACCCTCATTAATGCATGAAATGCAGAATATTAACATGAGCAATTTAAGATGAACTTAAGATTCTGTAATGTTTAGAAA

我建議grep-A開關一起使用,它告訴它還包括匹配後的行。像這樣的東西:

#!/bin/bash
file=$1

for i in `seq 1 20`; do
 grep -A2 "chr$i " $file > seq_$i
done

grep -A2 "chrX " $file > seq_X
grep -A2 "chrY " $file > seq_X

然後執行:

./extract.sh myfile

這是一個awk將在目前目錄中為每個染色體創建一個文件:

awk '
   FNR == 1 {
       # here we define a prefix and a suffix based on the name
       # of the file that we're currently reading
       prefix = FILENAME
       suffix = ""
       sub(".*/","",prefix)
       if (i = index(prefix,".")) {
           suffix = prefix
           sub(/\.[^.]*$/,"",prefix)
           sub(/.*\./,".",suffix)
       }
   }
   $2 ~ /^chr/ {
       close(outfile)
       outfile = prefix "-" $2 suffix
       print outfile # display the created filename
       l = 3
   }
   l-- > 0 { print > outfile }
' file

引用自:https://unix.stackexchange.com/questions/699605