Text-Processing
awk: split a file into multiple files, named according to a separate index file
I have a clustered fasta file (called file) that looks like this:
    >1AB2
    >1AB2 AA
    NWWIEUNJRNIBGOWNGIOWGRBIGBRGRIOWGI
    NCIDHFR8EHGBVPIWOBGIGRI
    >1AB3 AA
    WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
    NWFWRUBGREOUEREOBRIOBNERIOBN
    >1SC4 AA
    WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
    NWFWRUBGREOUEREOBRIOBNERIOBN
    >2CD5 AA
    WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
    NWFWRUBGREOUEREOBRIOBNERIOBN
    >2AC6
    >2AC6 AA
    NFIGEURHGEIROHEGHTUTJGENLJBBEOWRIU
    NFIROUHBOERVERUGBERUOVREOIBROEBVUE
    NVHIRE
    >2ONM AA
    BUCIEHBUORBREOBWQVURVELLAJFLHIEBGR
    NHEIBVEURIGBVNRIHEOEAJVSJDNHVUGBVR
    NEBIBVVBRU
    >2POD AA
    BUFEWIBOEUWBWOREBRIUBGUERIGBVOSRIP
    BUEIBVEO
    >7KZL
    >7KZL AA
    BUIREBVAUREVBREOIRGPNJBFDVERUBVROR
    >6GH3
    >6GH3 AA
    NBVUIREVOIAWRHRUGRTYUVDNJKDFHUGSEI
    FHUIERBLUUIREB
    >6GH4 AA
    BDFUIGEVUERERHOBERIHBSDLKFJBNIERIH
    NFHILRUGAURHG
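The file has 4 clusters: 1AB2, 2AC6, 7KZL, and 6GH3. Each cluster starts with a line of the form >XXXX. The content between the first >1AB2 and the first >2AC6 belongs to cluster 1AB2. The content between the first >2AC6 and the first >7KZL belongs to cluster 2AC6.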
I want to split the file into 4 files whose names are given in this index file (ind.txt):

    HG001 1AB2
    HG010 2AC6
    HG023 7KZL
    HG004 6GH3
The resulting files should look like this:
HG001.fa

    >1AB2 AA
    NWWIEUNJRNIBGOWNGIOWGRBIGBRGRIOWGI
    NCIDHFR8EHGBVPIWOBGIGRI
    >1AB3 AA
    WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
    NWFWRUBGREOUEREOBRIOBNERIOBN
    >1SC4 AA
    WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
    NWFWRUBGREOUEREOBRIOBNERIOBN
    >2CD5 AA
    WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
    NWFWRUBGREOUEREOBRIOBNERIOBN

HG010.fa

    >2AC6 AA
    NFIGEURHGEIROHEGHTUTJGENLJBBEOWRIU
    NFIROUHBOERVERUGBERUOVREOIBROEBVUE
    NVHIRE
    >2ONM AA
    BUCIEHBUORBREOBWQVURVELLAJFLHIEBGR
    NHEIBVEURIGBVNRIHEOEAJVSJDNHVUGBVR
    NEBIBVVBRU
    >2POD AA
    BUFEWIBOEUWBWOREBRIUBGUERIGBVOSRIP
    BUEIBVEO

HG023.fa

    >7KZL AA
    BUIREBVAUREVBREOIRGPNJBFDVERUBVROR

HG004.fa

    >6GH3 AA
    NBVUIREVOIAWRHRUGRTYUVDNJKDFHUGSEI
    FHUIERBLUUIREB
    >6GH4 AA
    BDFUIGEVUERERHOBERIHBSDLKFJBNIERIH
    NFHILRUGAURHG
I tried
awk '/^>/ && NF==1; NR==FNR{a[$2]=$1} (substr($1,2) in a) {close(out); out="cluster/"a[substr($1,2)]".fa"} {print > out}' ind.txt file
but it did not work, and I could not find a way to fix the error.
mkdir -p cluster && awk 'NR==FNR {map[">"$2]="cluster/"$1".fa"; next} /^>/ && NF==1 {close(out); out=map[$0]; next} out != "" {print > out} ' ind.txt file
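For readability, the one-liner above can be spread over several lines. The following is a minimal sketch of essentially the same program, run against a shortened copy of the sample data. The scratch directory split_demo, the awk variable dir, and the guard around close() are assumptions added here for illustration; they are not part of the original command.

```shell
#!/bin/sh
# Sketch only: split_demo is a hypothetical scratch directory, and the
# FASTA sample below is a shortened copy of the data from the question.
set -eu
demo=split_demo
mkdir -p "$demo/cluster"

# Index file: output name in column 1, cluster ID in column 2.
cat > "$demo/ind.txt" <<'EOF'
HG001 1AB2
HG010 2AC6
EOF

# Clustered FASTA: a bare ">ID" line opens each cluster.
cat > "$demo/file" <<'EOF'
>1AB2
>1AB2 AA
NWWIEUNJRNIBGOWNGIOWGRBIGBRGRIOWGI
NCIDHFR8EHGBVPIWOBGIGRI
>1AB3 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
>2AC6
>2AC6 AA
NFIGEURHGEIROHEGHTUTJGENLJBBEOWRIU
EOF

awk -v dir="$demo/cluster" '
    # First file (ind.txt): store "dir/NAME.fa" under the key ">ID".
    NR == FNR { map[">" $2] = dir "/" $1 ".fa"; next }

    # Bare cluster header: switch the output file and skip the header line.
    /^>/ && NF == 1 {
        if (out != "") close(out)   # close the previous file, if any
        out = map[$0]               # empty string when the ID is not in the index
        next
    }

    # Every other line goes to the currently selected file, if there is one.
    out != "" { print > out }
' "$demo/ind.txt" "$demo/file"
```

After running it, split_demo/cluster/ should contain HG001.fa and HG010.fa, each starting with the first record header of its cluster.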
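The first condition-action pair (NR==FNR) parses the index file, builds the output file names, and stores them in an array keyed by the cluster headers of the second file. When a cluster header is found (/^>/ && NF==1), we set the output file name to use. Every other line is printed to the currently selected file.
I have also added a condition (out != "") so that nothing is printed to a file such as "cluster/.fa" when a header has no mapping. Testing with the sample input created these files: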
    $ head cluster/*.fa
    ==> cluster/HG001.fa <==
    >1AB2 AA
    NWWIEUNJRNIBGOWNGIOWGRBIGBRGRIOWGI
    NCIDHFR8EHGBVPIWOBGIGRI
    >1AB3 AA
    WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
    NWFWRUBGREOUEREOBRIOBNERIOBN
    >1SC4 AA
    WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
    NWFWRUBGREOUEREOBRIOBNERIOBN
    >2CD5 AA

    ==> cluster/HG004.fa <==
    >6GH3 AA
    NBVUIREVOIAWRHRUGRTYUVDNJKDFHUGSEI
    FHUIERBLUUIREB
    >6GH4 AA
    BDFUIGEVUERERHOBERIHBSDLKFJBNIERIH
    NFHILRUGAURHG

    ==> cluster/HG010.fa <==
    >2AC6 AA
    NFIGEURHGEIROHEGHTUTJGENLJBBEOWRIU
    NFIROUHBOERVERUGBERUOVREOIBROEBVUE
    NVHIRE
    >2ONM AA
    BUCIEHBUORBREOBWQVURVELLAJFLHIEBGR
    NHEIBVEURIGBVNRIHEOEAJVSJDNHVUGBVR
    NEBIBVVBRU
    >2POD AA
    BUFEWIBOEUWBWOREBRIUBGUERIGBVOSRIP

    ==> cluster/HG023.fa <==
    >7KZL AA
    BUIREBVAUREVBREOIRGPNJBFDVERUBVROR
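To see in advance which cluster headers would be skipped for lack of a mapping, one could run a check along these lines. This is a sketch, not part of the original answer: the directory unmapped_demo, the array name known, and the cluster 9ZZZ are hypothetical and used only for illustration.

```shell
#!/bin/sh
# Sketch only: unmapped_demo and the cluster 9ZZZ are made up for this example.
set -eu
demo=unmapped_demo
mkdir -p "$demo"

# Index that knows about a single cluster.
cat > "$demo/ind.txt" <<'EOF'
HG001 1AB2
EOF

# FASTA with one mapped cluster (1AB2) and one unmapped cluster (9ZZZ).
cat > "$demo/file" <<'EOF'
>1AB2
>1AB2 AA
NWWIEUNJRNIBGOWNGIOWGRBIGBRGRIOWGI
>9ZZZ
>9ZZZ AA
BUIREBVAUREVBREOIRGPNJBFDVERUBVROR
EOF

awk '
    # First file: remember every cluster ID listed in the index.
    NR == FNR { known[">" $2]; next }

    # Second file: report bare cluster headers that have no index entry.
    /^>/ && NF == 1 && !($0 in known) { print substr($0, 2) }
' "$demo/ind.txt" "$demo/file" > "$demo/unmapped.txt"

cat "$demo/unmapped.txt"
```

With this input the only line written to unmapped.txt is 9ZZZ; an empty file would mean every cluster header has a mapping.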