Text-Processing
awk/sed 將集群文件拆分為多個文件
我有一個集群 fasta 文件(稱為文件),如下所示:
>1AB2 >1AB2 AA NWWIEUNJRNIBGOWNGIOWGRBIGBRGRIOWGI NCIDHFR8EHGBVPIWOBGIGRI >1AB3 AA WNIOREHUEBRGOUERGHBERGIORBGREUGEGO NWFWRUBGREOUEREOBRIOBNERIOBN >1SC4 AA WNIOREHUEBRGOUERGHBERGIORBGREUGEGO NWFWRUBGREOUEREOBRIOBNERIOBN >2CD5 AA WNIOREHUEBRGOUERGHBERGIORBGREUGEGO NWFWRUBGREOUEREOBRIOBNERIOBN >2AC6 >2AC6 AA NFIGEURHGEIROHEGHTUTJGENLJBBEOWRIU NFIROUHBOERVERUGBERUOVREOIBROEBVUE NVHIRE >2ONM AA BUCIEHBUORBREOBWQVURVELLAJFLHIEBGR NHEIBVEURIGBVNRIHEOEAJVSJDNHVUGBVR NEBIBVVBRU >2POD AA BUFEWIBOEUWBWOREBRIUBGUERIGBVOSRIP BUEIBVEO >7KZL >7KZL AA BUIREBVAUREVBREOIRGPNJBFDVERUBVROR >6HG3 >6GH3 AA NBVUIREVOIAWRHRUGRTYUVDNJKDFHUGSEI FHUIERBLUUIREB >6GH4 AA BDFUIGEVUERERHOBERIHBSDLKFJBNIERIH NFHILRUGAURHG
about 文件有 4 組:
1AB2, 2AC6, 7KZL, and 6GH3
.>1AB2
第一個和第一個期間的內容>2AC6
屬於集群1AB2
。>2AC6
第一個和第一個期間的內容>7KZL
屬於集群2AC6
。我想在第二個將文件分成 4 個文件
>XXXX
。每個文件應如下所示:文件_1
>1AB2 AA NWWIEUNJRNIBGOWNGIOWGRBIGBRGRIOWGI NCIDHFR8EHGBVPIWOBGIGRI >1AB3 AA WNIOREHUEBRGOUERGHBERGIORBGREUGEGO NWFWRUBGREOUEREOBRIOBNERIOBN >1SC4 AA WNIOREHUEBRGOUERGHBERGIORBGREUGEGO NWFWRUBGREOUEREOBRIOBNERIOBN >2CD5 AA WNIOREHUEBRGOUERGHBERGIORBGREUGEGO NWFWRUBGREOUEREOBRIOBNERIOBN
文件_2
>2AC6 AA NFIGEURHGEIROHEGHTUTJGENLJBBEOWRIU NFIROUHBOERVERUGBERUOVREOIBROEBVUE NVHIRE >2ONM AA BUCIEHBUORBREOBWQVURVELLAJFLHIEBGR NHEIBVEURIGBVNRIHEOEAJVSJDNHVUGBVR NEBIBVVBRU >2POD AA BUFEWIBOEUWBWOREBRIUBGUERIGBVOSRIP BUEIBVEO
文件_3
>7KZL AA BUIREBVAUREVBREOIRGPNJBFDVERUBVROR
文件_4
>6GH3 AA NBVUIREVOIAWRHRUGRTYUVDNJKDFHUGSEI FHUIERBLUUIREB >6GH4 AA BDFUIGEVUERERHOBERIHBSDLKFJBNIERIH NFHILRUGAURHG
awk '/^>/ && NF==1 {close(out); out="file_"++n; next} {print > out}' file
根據您的測試輸入,您要更改輸出文件的標題定義為:以一個欄位開頭
>
且只有一個欄位的行。使用next
我們對此行不列印任何內容,但設置輸出文件名。此外,close()
呼叫可確保我們不會以打開太多文件而結束,因為這awk
可能會引發錯誤。輸出:
$ head file_* ==> file_1 <== >1AB2 AA NWWIEUNJRNIBGOWNGIOWGRBIGBRGRIOWGI NCIDHFR8EHGBVPIWOBGIGRI >1AB3 AA WNIOREHUEBRGOUERGHBERGIORBGREUGEGO NWFWRUBGREOUEREOBRIOBNERIOBN >1SC4 AA WNIOREHUEBRGOUERGHBERGIORBGREUGEGO NWFWRUBGREOUEREOBRIOBNERIOBN >2CD5 AA ==> file_2 <== >2AC6 AA NFIGEURHGEIROHEGHTUTJGENLJBBEOWRIU NFIROUHBOERVERUGBERUOVREOIBROEBVUE NVHIRE >2ONM AA BUCIEHBUORBREOBWQVURVELLAJFLHIEBGR NHEIBVEURIGBVNRIHEOEAJVSJDNHVUGBVR NEBIBVVBRU >2POD AA BUFEWIBOEUWBWOREBRIUBGUERIGBVOSRIP ==> file_3 <== >7KZL AA BUIREBVAUREVBREOIRGPNJBFDVERUBVROR ==> file_4 <== >6GH3 AA NBVUIREVOIAWRHRUGRTYUVDNJKDFHUGSEI FHUIERBLUUIREB >6GH4 AA BDFUIGEVUERERHOBERIHBSDLKFJBNIERIH NFHILRUGAURHG thanasis@basis:~/Documents/development/temp>