Text-Processing

awk 將一個文件拆分為多個文件,在另一個索引文件中具有特定名稱

  • June 1, 2022

我有一個集群 fasta 文件(稱為文件),如下所示:

>1AB2
>1AB2 AA
NWWIEUNJRNIBGOWNGIOWGRBIGBRGRIOWGI
NCIDHFR8EHGBVPIWOBGIGRI
>1AB3 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>1SC4 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>2CD5 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>2AC6
>2AC6 AA
NFIGEURHGEIROHEGHTUTJGENLJBBEOWRIU
NFIROUHBOERVERUGBERUOVREOIBROEBVUE
NVHIRE
>2ONM AA
BUCIEHBUORBREOBWQVURVELLAJFLHIEBGR
NHEIBVEURIGBVNRIHEOEAJVSJDNHVUGBVR
NEBIBVVBRU
>2POD AA
BUFEWIBOEUWBWOREBRIUBGUERIGBVOSRIP
BUEIBVEO
>7KZL
>7KZL AA
BUIREBVAUREVBREOIRGPNJBFDVERUBVROR
>6GH3
>6GH3 AA
NBVUIREVOIAWRHRUGRTYUVDNJKDFHUGSEI
FHUIERBLUUIREB
>6GH4 AA
BDFUIGEVUERERHOBERIHBSDLKFJBNIERIH
NFHILRUGAURHG

該文件有 4 個組:1AB2, 2AC6, 7KZL, and 6GH3. >1AB2第一個和第一個期間的內容>2AC6屬於集群1AB2>2AC6第一個和第一個期間的內容>7KZL屬於集群2AC6

>XXXX我想在此索引文件 (ind.txt)中將文件分成 4 個具有特定名稱的文件:

HG001 1AB2
HG010 2AC6
HG023 7KZL
HG004 6GH3

結果文件應如下所示:

HG001.fa

>1AB2 AA
NWWIEUNJRNIBGOWNGIOWGRBIGBRGRIOWGI
NCIDHFR8EHGBVPIWOBGIGRI
>1AB3 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>1SC4 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>2CD5 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN

HG010.fa

>2AC6 AA
NFIGEURHGEIROHEGHTUTJGENLJBBEOWRIU
NFIROUHBOERVERUGBERUOVREOIBROEBVUE
NVHIRE
>2ONM AA
BUCIEHBUORBREOBWQVURVELLAJFLHIEBGR
NHEIBVEURIGBVNRIHEOEAJVSJDNHVUGBVR
NEBIBVVBRU
>2POD AA
BUFEWIBOEUWBWOREBRIUBGUERIGBVOSRIP
BUEIBVEO

HG023.fa

>7KZL AA
BUIREBVAUREVBREOIRGPNJBFDVERUBVROR

HG004.fa

>6GH3 AA
NBVUIREVOIAWRHRUGRTYUVDNJKDFHUGSEI
FHUIERBLUUIREB
>6GH4 AA
BDFUIGEVUERERHOBERIHBSDLKFJBNIERIH
NFHILRUGAURHG

我試著用

awk '/^>/ && NF==1; NR==FNR{a[$2]=$1} (substr($1,2) in a) {close(out); out="cluster/"a[substr($1,2)]".fa"}  {print > out}' ind.txt file

但它沒有用,我找不到錯誤的解決方案。

mkdir -p cluster &&
awk 'NR==FNR  {map[">"$2]="cluster/"$1".fa"; next}
    /^>/ && NF==1 {close(out); out=map[$0]; next} 
    out != "" {print > out}
' ind.txt file

第一個條件動作 ( NR==FNR) 是解析索引文件,創建文件名並將它們儲存到一個數組中,其中第二個文件的標題是雜湊值。

找到標頭 ( /^>/ && NF==1) 時,我們定義要使用的輸出文件名。

對於任何其他行,我們列印到選定的文件名。"cluster/.fa"如果此標頭沒有映射,我還添加了不列印到文件的條件。

使用範例輸入進行測試創建了這些文件:

$ head cluster/*.fa
==> cluster/HG001.fa <==
>1AB2 AA
NWWIEUNJRNIBGOWNGIOWGRBIGBRGRIOWGI
NCIDHFR8EHGBVPIWOBGIGRI
>1AB3 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>1SC4 AA
WNIOREHUEBRGOUERGHBERGIORBGREUGEGO
NWFWRUBGREOUEREOBRIOBNERIOBN
>2CD5 AA

==> cluster/HG004.fa <==
>6GH3 AA
NBVUIREVOIAWRHRUGRTYUVDNJKDFHUGSEI
FHUIERBLUUIREB
>6GH4 AA
BDFUIGEVUERERHOBERIHBSDLKFJBNIERIH
NFHILRUGAURHG

==> cluster/HG010.fa <==
>2AC6 AA
NFIGEURHGEIROHEGHTUTJGENLJBBEOWRIU
NFIROUHBOERVERUGBERUOVREOIBROEBVUE
NVHIRE
>2ONM AA
BUCIEHBUORBREOBWQVURVELLAJFLHIEBGR
NHEIBVEURIGBVNRIHEOEAJVSJDNHVUGBVR
NEBIBVVBRU
>2POD AA
BUFEWIBOEUWBWOREBRIUBGUERIGBVOSRIP

==> cluster/HG023.fa <==
>7KZL AA
BUIREBVAUREVBREOIRGPNJBFDVERUBVROR

引用自:https://unix.stackexchange.com/questions/704634