Text-Processing
使用本地安裝文件中的床文件資訊檢索 fasta 序列
我有一個 .bed 文件,其中包含大約 30000 行,我使用
fetch-sequences
rsat 工具 ( http://rsat.ulb.ac.be/rsat/help.fetch-sequences.html#usage ) 的模組檢索了其中的序列。$$ Note: This tool connects to the server each time to retrieve the sequences $$ 現在我有大約 10000 個隨機排序的同一個床文件的子集,我想為其檢索序列。使用該
fetch-sequences
模組,需要 10 個小時才能實現這一目標。但我想通過首先檢索原始文件的序列來快速完成。使用這個本地文件,我想檢索子集的序列。有沒有辦法使用linux命令或perl來做到這一點?我不確定如何使用本地安裝的文件執行此操作。請幫忙。
這是我的樣本床文件:
chr10 91061477 91062132 peak4069 41 . 134.220 -1 -1 chr12 77456450 77457116 peak7216 97 . 313.820 -1 -1 chr7 150754939 150755706 peak30140 87 . 281.000 -1 -1 chr3 11643031 11643625 peak20536 135 . 435.740 -1 -1 chr19 6521662 6522869 peak14719 264 . 851.950 -1 -1 chr14 35008667 35009076 peak9034 40 . 131.480 -1 -1 chr6 88851148 88852148 peak27572 55 . 178.560 -1 -1 chr6 20212045 20212627 peak26579 44 . 144.630 -1 -1 chr2 136422022 136422722 peak17485 83 . 270.330 -1 -1 chr11 45951365 45952105 peak4995 284 . 915.840 -1 -1 chr19 50181053 50181876 peak15733 101 . 328.650 -1 -1 chr9 36140202 36140695 peak32014 42 . 137.080 -1 -1 chr4 40992483 40993431 peak23120 40 . 129.060 -1 -1 chr14 50528290 50529818 peak9133 256 . 826.310 -1 -1 chr18 57542948 57543911 peak14298 244 . 789.750 -1 -1 chr21 44528945 44529572 peak19741 45 . 148.260 -1 -1 chr6 16763102 16763743 peak26552 84 . 272.680 -1 -1 chr1 44678710 44679433 peak803 97 . 312.860 -1 -1 chr12 11323475 11324633 peak6358 123 . 396.790 -1 -1 chr9 98271450 98271859 peak32325 37 . 120.160 -1 -1 chr19 2167913 2169475 peak14575 455 . 1470.150 -1 -1 chr16 80613819 80615001 peak12054 261 . 843.100 -1 -1 chr12 118574314 118574830 peak7774 43 . 141.040 -1 -1 chr19 59083917 59085058 peak15917 129 . 418.440 -1 -1 chr2 191184311 191184984 peak17942 103 . 332.220 -1 -1 chr15 85956548 85957254 peak10816 179 . 578.110 -1 -1 chr4 84031272 84032016 peak23411 60 . 195.570 -1 -1 chr12 32012425 32013045 peak6654 45 . 148.210 -1 -1 chr6 70575973 70577060 peak27441 52 . 168.800 -1 -1 chr12 57852728 57853291 peak7023 59 . 192.900 -1 -1 chr10 120879718 120880402 peak4449 209 . 675.760 -1 -1 chr1 28833424 28834023 peak526 35 . 114.020 -1 -1 chr8 60963955 60965013 peak30803 329 . 1062.570 -1 -1 chr7 77326420 77326889 peak29382 32 . 103.320 -1 -1 chr5 133476115 133476527 peak25468 37 . 121.150 -1 -1 chr19 45909117 45910074 peak15572 69 . 222.490 -1 -1 chr5 16467271 16468036 peak24373 117 . 380.290 -1 -1 chr15 66682042 66683502 peak10489 182 . 589.480 -1 -1 chr9 35603806 35604394 peak31993 71 . 229.000 -1 -1 chr4 48249067 48249653 peak23178 50 . 162.070 -1 -1 chr3 112775853 112776466 peak21577 62 . 202.250 -1 -1 chr12 12867020 12867982 peak6428 33 . 106.930 -1 -1 chr6 35655359 35655985 peak27066 53 . 171.010 -1 -1 chr6 74171305 74172116 peak27466 161 . 521.390 -1 -1 chr11 12195905 12196539 peak4741 256 . 826.330 -1 -1 chr2 55386393 55386871 peak16583 40 . 131.810 -1 -1 chr9 37030245 37030957 peak32041 89 . 290.090 -1 -1 chr4 30431566 30431997 peak22948 66 . 214.250 -1 -1 chr10 16612633 16613149 peak3304 49 . 160.900 -1 -1
這是我獲取的 fasta 序列的範例(對於上面範例文件中的前 3 行):
>hg19_chr10_91061478_91062132_+ range=chr10:91061478-91062132 5'pad=0 3'pad=0 strand=+ repeatMasking=none AATTGTATTACAAGTCCCCAACTTAATCTTTTCGAATATGAAATAAGAGAGGGACAGTGCACAAGAGCAATGTCCCCAGACCCATCTTTAAGTGAAGCACCAGGCCGATGAAACATCCCTCTCTGCTGCCTTCTTTCTCTGATCACAACTCAGCTCCGGAGGAAAAAGAGTCCTCTAAAGTATAATAAAAAGAAAAAAAGAAAAAGAGTCCTGCCAATTTCACTTTCTAGTTTCACTTTCCCTTTTGTAACGTCAGCTGAAGGGAAACAAACAAAAAGGAACCAGAGGCCACTTGTATATATAGGTCTCTTCAGCATTTATTGGTGGCAGAAGAGGAAGATTTCTGAAGAGTGCAGCTGCCTGAACCGAGCCCTGCCGAACAGCTGAGAATTGCACTGCAACCATGAGGTAAATATTTTCCCTTCGTATTCGGTAGTGCTGTTGAGTCATCTTGTCCAATGCAAATCCTGAGAAGCTATGTTCCCAAAGAGGGCCAGCTCCATTTTAGTGTTTGTTTATAGCCTTACTATGCCTCTACCTCTGTTGGTTGTAAATCTGTCTTACCAATGGTGGTTTGTTCCCTCCTGAACAATTTTCTGCTTCACACTGGCAAGCTTCCTAAATTCATCTCCAGAACTGCACGCCTGGGGAGTTG >hg19_chr12_77456451_77457116_+ range=chr12:77456451-77457116 5'pad=0 3'pad=0 strand=+ repeatMasking=none GTCTTTGTTGAAGGTACTTGATAAAAGTCATTGACCCAGAGGAAGAGAAGTAAAGCACTGACTTAGACGTTATATAAATGTATGGATGTGTATTTTTTCAAGGCTGAACCATCCAAATTGGAAAGGAAAACAAAGTTTTGCTCTAAAACTCTCAAAGCCAAAACTCTGAATATATACTTTAAGTCTGGGCATTTCCACCCTCATGACTTAGATAATTAAAAAAAAAAAAAAAAGGCCACTTTAAATAATCTTCACTTTATCTGTGGTTTCACTTTCAGTGGCCAACTGCGGTCCAAAAATATCACATGGAAAATTCCAGAAATAAACAATTCATGAGTTTTAGATTGTGTGCAGTTCTGTGTAATGAAATCTCACGTCATCCTGCTCCGTCCTGCTTCGGATGTGACTCACCCCTTTGTCCAGCGTATTTGCACGGTAGATACTACCTGCTCGAGCAGCCACTGTGTTTTCAGGCTGGCTGTCACGGTATTGCAGTGCTCATGTTCGAGTAACTCTTATTTGACTTCATAATGGCTCCAAAGCACAAGAGTAGTGATGCTGGCAATTTGGATATGCCAAAGGGAAGCCATAAAGTGCTTCTTTTAAGTGAAAAGGTGAACGTTCTTGACTTAAGGAAAGAAAATCGTACGCCAAGGTTGCTAAGAT >hg19_chr7_150754940_150755706_+ range=chr7:150754940-150755706 5'pad=0 3'pad=0 strand=+ repeatMasking=none GCGGCCGCGGGGACCCCTGCGGGCCCTCGGTTTTAAGACTCTGGCCCCGGCGTTGCAGAGGAGGCGGCACCTGAGCCTCCAGCCCCGCCCCGTCTGCCCGCGCAGGGCCTTCTGGGGCTTGTAGTCCTAAAACGACTAGGTCTCCCCGCAATGCCCGAGACCGAGGACAAGAAGACTACAACCCCCAGCCGTCTGCGCTCACGCTCTCTGCAGCACTCGACTCCAGGGCTCCGTTTCTACCGCGGAGGCAAACCTTGGACTTCAAGTCCCAGGAAGCAACGCGTGTCCGTTTTCCGGGCGCCCCGCCGCGGCGCCGTGGTTCCCAGTGTGCCCCGCTGTTGTTATTCCTTTTATGTGCTCCCAGCCCTCTTGAAAAGGGCCGCTCCGGGACTACGCGTTCCAGAATGCAGCGGAAATGGGGGCGGAGCGCTCTCGGTTAGGGGTTTGGGGTTTGGCGGCCTAGATCCCGGGCACTGGCGGCCCAGCGCTGACCTGGTTGGTGGCATTGTGTTCCCAACGGCCTCTTGACGACCTCAGCACGGGTTTCCACCTCTCCCCAAGCCACCTAGTGACCCCAGAATTGACTGGGGAATGCCTGTGAGCGATGATGACCTCACAGGGAACAGCTGACCGCAGGGCTGGGAGAACAGCTGTGCCCCTTCGAGGCTGGATTTTAGTGGAGGGACACACGCCAAAGACCCCCTCTCTGCTGAGCCCCGTTTGTTGTCTCGGAGCCCACCCGACTCTAGCCGCTGAACTCTGACATG
這應該做你需要的:
for i in $(awk '{print $1"_"$2+1"_"$3}' foo.bed); do grep -A 1 $i seqs.fa ; done
解釋
awk '{print $1"_"$2+1"_"$3}' foo.bed
:這將列印 .bed 文件的每一行的染色體以及開始和結束座標。請注意$2+1
, 因為您文件中的座標與.bed
. 例如,對於文件的前三行:$ awk '{print $1"_"$2+1"_"$3}' foo.bed chr10_91061478_91062132 chr12_77456451_77457116 chr7_150754940_150755706
for i in $(command); do ...; done
:這將保存command
as返回的每個值$i
,然後對其進行處理。grep -A 1 $i seqs.fa
: 這就是“東西”。它將對awk
命令的每個結果進行 grep 並列印匹配的行以及下一行 (-A 1
)。如果您需要再次執行此操作,請不要使用 RSAT!1有更簡單的方法可以做到這一點。相反,下載每個染色體的 FASTA 序列,然後使用
exonerate
軟體包中的工具(可安裝在基於 Debian 的系統上sudo apt-get install exonerate
)。該過程(假設您對每個名為 的染色體都有一個 FASTA 文件chrN.fa
)將是:awk '{print $1,$2,($3-$2)}' foo.bed | while read chr start length; do fastasubseq /path/to/$chr.fa $start $length >> subseqs.fa done
上面的命令將提取感興趣的子序列(假設座標總是相對於正鏈,因為它們應該是)並且應該只需要幾秒鐘。
1 不要誤解我的意思,RSAT 套件是一個很棒的工具,我非常尊重與我一起工作了幾年的作者,但它不一定是此類大規模工作的最佳工具。