Shell-Script
從文件中提取對應於第 n:th 模式的子序列
我有以下數據塊(多個)
chr1.trna4 (17188416-17188486) Length: 71 bp Type: Gly Anticodon: CCC at 33-35 (17188448-17188450) Score: 78.3 HMM Sc=56.60 Sec struct Sc=21.70 * | * | * | * | * | * | * | Seq: GCATTGGTGGTTCAGTGGTAGAATTCTCGCCTCCCACGCGGGAGaCCCGGGTTCAATTCCCGGCCAATGCA Str: >>>>>>>..>>>>.......<<<<.>>>>>.......<<<<<....>>>>>.......<<<<<<<<<<<<.
對於每個塊,我需要在塊的最後一行找到以 . 開頭的第 8 個模式
Str
。在上述情況下,第 8 個模式是.......
(7 個週期)。這是因為第一組>
符號構成一個模式,第二組句點構成第二個模式,依此類推。現在我需要從
Seq
模式行正上方的行中提取這 7 個字元。在範例中,這對應於 subsequenceCTCCCAC
。輸出應該是
Seq is CTCCCAC and Anticodon: CCC
這是可能的
bash
還是任何外殼?數據塊的更多範例
chr19.trna11 (4724719-4724647) Length: 73 bp Type: Val Anticodon: CAC at 34-36 (4724686-4724684) Score: 79.2 HMM Sc=49.10 Sec struct Sc=30.10 * | * | * | * | * | * | * | Seq: GTTTCCGTAGTGTAGCGGTtATCACATTCGCCTCACACGCGAAAGGtCCCCGGTTCGATCCCGGGCGGAAACA Str: >>>>>>>..>>>..........<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<. chr19.trna12 (1383433-1383361) Length: 73 bp Type: Phe Anticodon: GAA at 34-36 (1383400-1383398) Score: 88.9 HMM Sc=68.40 Sec struct Sc=20.50 * | * | * | * | * | * | * | Seq: GCCGAAATAGCTCAGTTGGGAGAGCGTTAGACTGAAGATCTAAAGGtCCCTGGTTCGATCCCGGGTTTCGGCA Str: >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<. chr21.trna1 (18827177-18827107) Length: 71 bp Type: Gly Anticodon: GCC at 33-35 (18827145-18827143) Score: 80.9 HMM Sc=60.10 Sec struct Sc=20.80 * | * | * | * | * | * | * | Seq: GCATGGGTGGTTCAGTGGTAGAATTCTCGCCTGCCACGCGGGAGGCCCGGGTTCGATTCCCGGCCCATGCA Str: >>>>>>>..>>>>.......<<<<.>>>>>.......<<<<<....>>>>>.......<<<<<<<<<<<<. chrX.trna4 (18693101-18693029) Length: 73 bp Type: Val Anticodon: TAC at 34-36 (18693068-18693066) Score: 82.9 HMM Sc=54.70 Sec struct Sc=28.20 * | * | * | * | * | * | * | Seq: GGTTCCATAGTGTAGTGGTtATCACGTCTGCTTTACACGCAGAAGGtCCTGGGTTCGAGCCCCAGTGGAACCA Str: >>>>>>>..>>>..........<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<. chrX.trna6 (3833344-3833271) Length: 74 bp Type: Ile Anticodon: GAT at 35-37 (3833310-3833308) Score: 75.5 HMM Sc=50.20 Sec struct Sc=25.30 * | * | * | * | * | * | * | Seq: GGCCGGTTAGCTCAGTTGGTaAGAGCGTGGTGCTGATAACACCAAGGtCGCGGGCTCGACTCCCGCACCGGCCA Str: >>>>>>>..>>>>.........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<. chrX.trna8 (3794915-3794842) Length: 74 bp Type: Ile Anticodon: GAT at 35-37 (3794881-3794879) Score: 75.5 HMM Sc=50.20 Sec struct Sc=25.30 * | * | * | * | * | * | * | Seq: GGCCGGTTAGCTCAGTTGGTaAGAGCGTGGTGCTGATAACACCAAGGtCGCGGGCTCGACTCCCGCACCGGCCA Str: >>>>>>>..>>>>.........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<. chrX.trna10 (3756491-3756418) Length: 74 bp Type: Ile Anticodon: GAT at 35-37 (3756457-3756455) Score: 75.5 HMM Sc=50.20 Sec struct Sc=25.30 * | * | * | * | * | * | * | Seq: GGCCGGTTAGCTCAGTTGGTaAGAGCGTGGTGCTGATAACACCAAGGtCGCGGGCTCGACTCCCGCACCGGCCA Str: >>>>>>>..>>>>.........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<. chr19.trna8 (45981945-45981859) Length: 87 bp Type: SeC Anticodon: TCA at 36-38 (45981910-45981908) Score: 146.9 HMM Sc=0.00 Sec struct Sc=0.00 * | * | * | * | * | * | * | * | * Seq: GCCCGGATGATCCTCAGTGGTCTGGGGTGCAGGCTTCAAACCTGTAGCTGTCTAGCGACAGAGTGGTTCAATTCCACCTTTCGGGCG Str: >>>>>>>.>..>>>>>>....<<<<<<<<<<<<.......<<<<<<.>>>>>....<<<<<.>>>>.......<<<<<.<<<<<<<.
鑑於我們可以將起始索引與反密碼子一起提取:
len=7 prior=2 while IFS= read -r line; do if [[ $line =~ Anticodon:" "([[:alpha:]]+)" at "([0-9]+) ]]; then anticodon=${BASH_REMATCH[1]} start=$(( BASH_REMATCH[2] - 1)) # string indexing is zero-based elif [[ $line == "Seq: "* ]]; then seq=${line#Seq: } printf "Seq: %s, Anticodon: %s\n" "${seq:start-prior:len}" "$anticodon" fi done < file
一個更複雜的解決方案,每次都解析“Str:”行,但不會將長度硬編碼為 7(它會硬編碼“nth”模式):
8thSeq() { local seq=$1 str=$2 local last=${str:0:1} local nth=8 n=1 start for (( i=1; i < ${#str}; i++)); do if [[ "${str:i:1}" != "$last" ]]; then ((n++)) if ((n == nth)); then start=$i elif ((n == nth+1)); then echo "${seq:start:i-start}" break fi fi last=${str:i:1} done } while IFS= read -r line; do if [[ $line =~ Anticodon:" "([[:alpha:]]+) ]]; then anticodon=${BASH_REMATCH[1]} elif [[ $line == "Seq: "* ]]; then seq=${line#Seq: } elif [[ $line == "Str: "* ]]; then str=${line#Str: } printf "Seq: %s, Anticodon: %s\n" "$(8thSeq "$seq" "$str")" "$anticodon" fi done < file
使用“更多”數據,兩種解決方案都輸出
Seq: CTCACAC, Anticodon: CAC Seq: CTGAAGA, Anticodon: GAA Seq: CTGCCAC, Anticodon: GCC Seq: TTTACAC, Anticodon: TAC Seq: CTGATAA, Anticodon: GAT Seq: CTGATAA, Anticodon: GAT Seq: CTGATAA, Anticodon: GAT Seq: CTTCAAA, Anticodon: TCA
使用
awk
:$ awk -f script.awk file Sequence: CTCACAC, Anticodon: CAC, Type: Val Sequence: CTGAAGA, Anticodon: GAA, Type: Phe Sequence: CTGCCAC, Anticodon: GCC, Type: Gly Sequence: TTTACAC, Anticodon: TAC, Type: Val Sequence: CTGATAA, Anticodon: GAT, Type: Ile Sequence: CTGATAA, Anticodon: GAT, Type: Ile Sequence: CTGATAA, Anticodon: GAT, Type: Ile Sequence: CTTCAAA, Anticodon: TCA, Type: SeC
script.awk
以下awk
程序在哪裡:/^Type:/ { type = $2 anticodon = $4 split($6, pos, "-") } /^Seq:/ { seq = substr($2, pos[1]-2, length(anticodon) + 4) # or: seq = substr($2, pos[1]-2, pos[2]-pos[1]+5) printf "Sequence: %s, Anticodon: %s, Type: %s\n", seq, anticodon, type }
第一個塊由任何以字元串開頭的行觸發,
Type:
它從第 2 個和第 4 個空格分隔的欄位中挑選出類型和反密碼子序列,並將第 6 個這樣的欄位拆分-
以生成序列中的開始和結束座標。第二個塊由以字元串開頭的行觸發,
Seq:
它使用反密碼子的起始位置和從最新行讀取的反密碼子長度從第二個空格分隔的欄位中挑選出序列Type:
,確保獲得幾個鹼基- 兩邊的對。然後產生輸出。
以下
sed
腳本使用該行中的第 8 個“模式”Str:
來提取所需序列,而不是該Type:
行中給出的反密碼子的數字位置。/^Type:[[:blank:]]*/ { s/.*Type: \([^[:blank:]]*\)[[:blank:]]*Anticodon: \([^[:blank:]]*\).*/ Anticodon: \2, Type: \1/ h } /^Seq:[[:blank:]]*/ { s//Sequence: / G y/\n/,/ w data.tmp } /^Str:[[:blank:]]*/ { s/// s,\(\(\([<>.]\)\3*\)\{7\}\)\(\([<>.]\)\5*\).*,s/: \1\\(\4\\)[^\,]*/: \\1/;n, y/<>/../ w pass2.sed } d
(尾隨
d
不是錯字)。它分兩次執行。
在第一遍中,創建了兩個新文件,
data.tmp
並且pass2.sed
.$ sed -f script.sed file
(這裡沒有終端輸出)
對於給定的數據,
data.tmp
看起來像Sequence: GTTTCCGTAGTGTAGCGGTtATCACATTCGCCTCACACGCGAAAGGtCCCCGGTTCGATCCCGGGCGGAAACA, Anticodon: CAC, Type: Val Sequence: GCCGAAATAGCTCAGTTGGGAGAGCGTTAGACTGAAGATCTAAAGGtCCCTGGTTCGATCCCGGGTTTCGGCA, Anticodon: GAA, Type: Phe Sequence: GCATGGGTGGTTCAGTGGTAGAATTCTCGCCTGCCACGCGGGAGGCCCGGGTTCGATTCCCGGCCCATGCA, Anticodon: GCC, Type: Gly Sequence: GGTTCCATAGTGTAGTGGTtATCACGTCTGCTTTACACGCAGAAGGtCCTGGGTTCGAGCCCCAGTGGAACCA, Anticodon: TAC, Type: Val Sequence: GGCCGGTTAGCTCAGTTGGTaAGAGCGTGGTGCTGATAACACCAAGGtCGCGGGCTCGACTCCCGCACCGGCCA, Anticodon: GAT, Type: Ile Sequence: GGCCGGTTAGCTCAGTTGGTaAGAGCGTGGTGCTGATAACACCAAGGtCGCGGGCTCGACTCCCGCACCGGCCA, Anticodon: GAT, Type: Ile Sequence: GGCCGGTTAGCTCAGTTGGTaAGAGCGTGGTGCTGATAACACCAAGGtCGCGGGCTCGACTCCCGCACCGGCCA, Anticodon: GAT, Type: Ile Sequence: GCCCGGATGATCCTCAGTGGTCTGGGGTGCAGGCTTCAAACCTGTAGCTGTCTAGCGACAGAGTGGTTCAATTCCACCTTTCGGGCG, Anticodon: TCA, Type: SeC
while
pass2.sed
是一個sed
後處理的腳本:s/: ...............................\(.......\)[^,]*/: \1/;n s/: ...............................\(.......\)[^,]*/: \1/;n s/: ..............................\(.......\)[^,]*/: \1/;n s/: ...............................\(.......\)[^,]*/: \1/;n s/: ................................\(.......\)[^,]*/: \1/;n s/: ................................\(.......\)[^,]*/: \1/;n s/: ................................\(.......\)[^,]*/: \1/;n s/: .................................\(.......\)[^,]*/: \1/;n
應用
pass2.sed
到data.sed
給你最終的結果:$ sed -f pass2.sed data.tmp Sequence: CTCACAC, Anticodon: CAC, Type: Val Sequence: CTGAAGA, Anticodon: GAA, Type: Phe Sequence: CTGCCAC, Anticodon: GCC, Type: Gly Sequence: TTTACAC, Anticodon: TAC, Type: Val Sequence: CTGATAA, Anticodon: GAT, Type: Ile Sequence: CTGATAA, Anticodon: GAT, Type: Ile Sequence: CTGATAA, Anticodon: GAT, Type: Ile Sequence: CTTCAAA, Anticodon: TCA, Type: SeC
注意:我不確定第二步如何在非常大的數據集上執行。