Shell-Script

從文件中提取對應於第 n:th 模式的子序列

  • December 8, 2019

我有以下數據塊(多個)

chr1.trna4 (17188416-17188486)  Length: 71 bp
Type: Gly   Anticodon: CCC at 33-35 (17188448-17188450) Score: 78.3
HMM Sc=56.60    Sec struct Sc=21.70
        *    |    *    |    *    |    *    |    *    |    *    |    *    |
Seq: GCATTGGTGGTTCAGTGGTAGAATTCTCGCCTCCCACGCGGGAGaCCCGGGTTCAATTCCCGGCCAATGCA
Str: >>>>>>>..>>>>.......<<<<.>>>>>.......<<<<<....>>>>>.......<<<<<<<<<<<<.

對於每個塊,我需要在塊的最後一行找到以 . 開頭的第 8 個模式Str。在上述情況下,第 8 個模式是.......(7 個週期)。這是因為第一組>符號構成一個模式,第二組句點構成第二個模式,依此類推。

現在我需要從Seq模式行正上方的行中提取這 7 個字元。在範例中,這對應於 subsequence CTCCCAC

輸出應該是Seq is CTCCCAC and Anticodon: CCC

這是可能的bash還是任何外殼?

數據塊的更多範例

chr19.trna11 (4724719-4724647)  Length: 73 bp
Type: Val   Anticodon: CAC at 34-36 (4724686-4724684)   Score: 79.2
HMM Sc=49.10    Sec struct Sc=30.10
        *    |    *    |    *    |    *    |    *    |    *    |    *    |  
Seq: GTTTCCGTAGTGTAGCGGTtATCACATTCGCCTCACACGCGAAAGGtCCCCGGTTCGATCCCGGGCGGAAACA
Str: >>>>>>>..>>>..........<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.


chr19.trna12 (1383433-1383361)  Length: 73 bp
Type: Phe   Anticodon: GAA at 34-36 (1383400-1383398)   Score: 88.9
HMM Sc=68.40    Sec struct Sc=20.50
        *    |    *    |    *    |    *    |    *    |    *    |    *    |  
Seq: GCCGAAATAGCTCAGTTGGGAGAGCGTTAGACTGAAGATCTAAAGGtCCCTGGTTCGATCCCGGGTTTCGGCA
Str: >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.


chr21.trna1 (18827177-18827107) Length: 71 bp
Type: Gly   Anticodon: GCC at 33-35 (18827145-18827143) Score: 80.9
HMM Sc=60.10    Sec struct Sc=20.80
        *    |    *    |    *    |    *    |    *    |    *    |    *    |
Seq: GCATGGGTGGTTCAGTGGTAGAATTCTCGCCTGCCACGCGGGAGGCCCGGGTTCGATTCCCGGCCCATGCA
Str: >>>>>>>..>>>>.......<<<<.>>>>>.......<<<<<....>>>>>.......<<<<<<<<<<<<.



chrX.trna4 (18693101-18693029)  Length: 73 bp
Type: Val   Anticodon: TAC at 34-36 (18693068-18693066) Score: 82.9
HMM Sc=54.70    Sec struct Sc=28.20
        *    |    *    |    *    |    *    |    *    |    *    |    *    |  
Seq: GGTTCCATAGTGTAGTGGTtATCACGTCTGCTTTACACGCAGAAGGtCCTGGGTTCGAGCCCCAGTGGAACCA
Str: >>>>>>>..>>>..........<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.


chrX.trna6 (3833344-3833271)    Length: 74 bp
Type: Ile   Anticodon: GAT at 35-37 (3833310-3833308)   Score: 75.5
HMM Sc=50.20    Sec struct Sc=25.30
        *    |    *    |    *    |    *    |    *    |    *    |    *    |   
Seq: GGCCGGTTAGCTCAGTTGGTaAGAGCGTGGTGCTGATAACACCAAGGtCGCGGGCTCGACTCCCGCACCGGCCA
Str: >>>>>>>..>>>>.........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.


chrX.trna8 (3794915-3794842)    Length: 74 bp
Type: Ile   Anticodon: GAT at 35-37 (3794881-3794879)   Score: 75.5
HMM Sc=50.20    Sec struct Sc=25.30
        *    |    *    |    *    |    *    |    *    |    *    |    *    |   
Seq: GGCCGGTTAGCTCAGTTGGTaAGAGCGTGGTGCTGATAACACCAAGGtCGCGGGCTCGACTCCCGCACCGGCCA
Str: >>>>>>>..>>>>.........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.



chrX.trna10 (3756491-3756418)   Length: 74 bp
Type: Ile   Anticodon: GAT at 35-37 (3756457-3756455)   Score: 75.5
HMM Sc=50.20    Sec struct Sc=25.30
        *    |    *    |    *    |    *    |    *    |    *    |    *    |   
Seq: GGCCGGTTAGCTCAGTTGGTaAGAGCGTGGTGCTGATAACACCAAGGtCGCGGGCTCGACTCCCGCACCGGCCA
Str: >>>>>>>..>>>>.........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.

chr19.trna8 (45981945-45981859) Length: 87 bp
Type: SeC   Anticodon: TCA at 36-38 (45981910-45981908) Score: 146.9
HMM Sc=0.00 Sec struct Sc=0.00
        *    |    *    |    *    |    *    |    *    |    *    |    *    |    *    |    * 
Seq: GCCCGGATGATCCTCAGTGGTCTGGGGTGCAGGCTTCAAACCTGTAGCTGTCTAGCGACAGAGTGGTTCAATTCCACCTTTCGGGCG
Str: >>>>>>>.>..>>>>>>....<<<<<<<<<<<<.......<<<<<<.>>>>>....<<<<<.>>>>.......<<<<<.<<<<<<<.

鑑於我們可以將起始索引與反密碼子一起提取:

len=7
prior=2

while IFS= read  -r line; do
   if [[ $line =~ Anticodon:" "([[:alpha:]]+)" at "([0-9]+) ]]; then
       anticodon=${BASH_REMATCH[1]}
       start=$(( BASH_REMATCH[2] - 1))  # string indexing is zero-based
   elif [[ $line == "Seq: "* ]]; then
       seq=${line#Seq: }
       printf "Seq: %s, Anticodon: %s\n" "${seq:start-prior:len}" "$anticodon"
   fi
done < file

一個更複雜的解決方案,每次都解析“Str:”行,但不會將長度硬編碼為 7(它會硬編碼“nth”模式):

8thSeq() {
   local seq=$1 str=$2
   local last=${str:0:1}
   local nth=8 n=1 start

   for (( i=1; i < ${#str}; i++)); do
       if [[ "${str:i:1}" != "$last" ]]; then
           ((n++))
           if ((n == nth)); then
               start=$i
           elif ((n == nth+1)); then
               echo "${seq:start:i-start}"
               break
           fi
       fi
       last=${str:i:1}
   done
}

while IFS= read  -r line; do
   if [[ $line =~ Anticodon:" "([[:alpha:]]+) ]]; then
       anticodon=${BASH_REMATCH[1]}
   elif [[ $line == "Seq: "* ]]; then
       seq=${line#Seq: }
   elif [[ $line == "Str: "* ]]; then
       str=${line#Str: }
       printf "Seq: %s, Anticodon: %s\n" "$(8thSeq "$seq" "$str")" "$anticodon"
   fi
done < file

使用“更多”數據,兩種解決方案都輸出

Seq: CTCACAC, Anticodon: CAC
Seq: CTGAAGA, Anticodon: GAA
Seq: CTGCCAC, Anticodon: GCC
Seq: TTTACAC, Anticodon: TAC
Seq: CTGATAA, Anticodon: GAT
Seq: CTGATAA, Anticodon: GAT
Seq: CTGATAA, Anticodon: GAT
Seq: CTTCAAA, Anticodon: TCA

使用awk

$ awk -f script.awk file
Sequence: CTCACAC, Anticodon: CAC, Type: Val
Sequence: CTGAAGA, Anticodon: GAA, Type: Phe
Sequence: CTGCCAC, Anticodon: GCC, Type: Gly
Sequence: TTTACAC, Anticodon: TAC, Type: Val
Sequence: CTGATAA, Anticodon: GAT, Type: Ile
Sequence: CTGATAA, Anticodon: GAT, Type: Ile
Sequence: CTGATAA, Anticodon: GAT, Type: Ile
Sequence: CTTCAAA, Anticodon: TCA, Type: SeC

script.awk以下awk程序在哪裡:

/^Type:/ {
       type = $2
       anticodon = $4
       split($6, pos, "-")
}

/^Seq:/ {
       seq = substr($2, pos[1]-2, length(anticodon) + 4)
       # or: seq = substr($2, pos[1]-2, pos[2]-pos[1]+5)
       printf "Sequence: %s, Anticodon: %s, Type: %s\n", seq, anticodon, type
}

第一個塊由任何以字元串開頭的行觸發,Type:它從第 2 個和第 4 個空格分隔的欄位中挑選出類型和反密碼子序列,並將第 6 個這樣的欄位拆分-以生成序列中的開始和結束座標。

第二個塊由以字元串開頭的行觸發,Seq:它使用反密碼子的起始位置和從最新行讀取的反密碼子長度從第二個空格分隔的欄位中挑選出序列Type:,確保獲得幾個鹼基- 兩邊的對。

然後產生輸出。


以下sed腳本使用該行中的第 8 個“模式”Str:來提取所需序列,而不是該Type:行中給出的反密碼子的數字位置。

/^Type:[[:blank:]]*/ {
       s/.*Type: \([^[:blank:]]*\)[[:blank:]]*Anticodon: \([^[:blank:]]*\).*/ Anticodon: \2, Type: \1/
       h
}

/^Seq:[[:blank:]]*/ {
       s//Sequence: /
       G
       y/\n/,/
       w data.tmp
}

/^Str:[[:blank:]]*/ {
       s///
       s,\(\(\([<>.]\)\3*\)\{7\}\)\(\([<>.]\)\5*\).*,s/: \1\\(\4\\)[^\,]*/: \\1/;n,
       y/<>/../
       w pass2.sed
}

d

(尾隨d不是錯字)。

它分兩次執行。

在第一遍中,創建了兩個新文件,data.tmp並且pass2.sed.

$ sed -f script.sed file

(這裡沒有終端輸出)

對於給定的數據,data.tmp看起來像

Sequence: GTTTCCGTAGTGTAGCGGTtATCACATTCGCCTCACACGCGAAAGGtCCCCGGTTCGATCCCGGGCGGAAACA, Anticodon: CAC, Type: Val
Sequence: GCCGAAATAGCTCAGTTGGGAGAGCGTTAGACTGAAGATCTAAAGGtCCCTGGTTCGATCCCGGGTTTCGGCA, Anticodon: GAA, Type: Phe
Sequence: GCATGGGTGGTTCAGTGGTAGAATTCTCGCCTGCCACGCGGGAGGCCCGGGTTCGATTCCCGGCCCATGCA, Anticodon: GCC, Type: Gly
Sequence: GGTTCCATAGTGTAGTGGTtATCACGTCTGCTTTACACGCAGAAGGtCCTGGGTTCGAGCCCCAGTGGAACCA, Anticodon: TAC, Type: Val
Sequence: GGCCGGTTAGCTCAGTTGGTaAGAGCGTGGTGCTGATAACACCAAGGtCGCGGGCTCGACTCCCGCACCGGCCA, Anticodon: GAT, Type: Ile
Sequence: GGCCGGTTAGCTCAGTTGGTaAGAGCGTGGTGCTGATAACACCAAGGtCGCGGGCTCGACTCCCGCACCGGCCA, Anticodon: GAT, Type: Ile
Sequence: GGCCGGTTAGCTCAGTTGGTaAGAGCGTGGTGCTGATAACACCAAGGtCGCGGGCTCGACTCCCGCACCGGCCA, Anticodon: GAT, Type: Ile
Sequence: GCCCGGATGATCCTCAGTGGTCTGGGGTGCAGGCTTCAAACCTGTAGCTGTCTAGCGACAGAGTGGTTCAATTCCACCTTTCGGGCG, Anticodon: TCA, Type: SeC

whilepass2.sed是一個sed後處理的腳本:

s/: ...............................\(.......\)[^,]*/: \1/;n
s/: ...............................\(.......\)[^,]*/: \1/;n
s/: ..............................\(.......\)[^,]*/: \1/;n
s/: ...............................\(.......\)[^,]*/: \1/;n
s/: ................................\(.......\)[^,]*/: \1/;n
s/: ................................\(.......\)[^,]*/: \1/;n
s/: ................................\(.......\)[^,]*/: \1/;n
s/: .................................\(.......\)[^,]*/: \1/;n

應用pass2.seddata.sed給你最終的結果:

$ sed -f pass2.sed data.tmp
Sequence: CTCACAC, Anticodon: CAC, Type: Val
Sequence: CTGAAGA, Anticodon: GAA, Type: Phe
Sequence: CTGCCAC, Anticodon: GCC, Type: Gly
Sequence: TTTACAC, Anticodon: TAC, Type: Val
Sequence: CTGATAA, Anticodon: GAT, Type: Ile
Sequence: CTGATAA, Anticodon: GAT, Type: Ile
Sequence: CTGATAA, Anticodon: GAT, Type: Ile
Sequence: CTTCAAA, Anticodon: TCA, Type: SeC

注意:我不確定第二步如何在非常大的數據集上執行。

引用自:https://unix.stackexchange.com/questions/554476