Awk
Mutating positions in a string to generate a list of outputs
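I am working with short strings of about 30 characters (they are DNA sequences). For my purposes, every 5th position needs to be swapped for any of the 4 DNA bases (A, C, T, G). For example, if I have the input

    AAAAAAAAAAAAAA

the output would be a list:

    AAAAAAAAAAAAAA
    AAAACAAAAAAAAA
    AAAATAAAAAAAAA
    AAAAGAAAAAAAAA
    AAAACAAAACAAAA
    AAAACAAAATAAAA
    ....

That is, each 5th position is swapped individually for A, C, T or G, to generate an array of all possible sequences in which every 5th position takes each of the possible DNA bases.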
I have been trying with for loops and can edit every 5th position, but not in a combinatorial way.
For example,
    echo "AAAAAAAAAAAAAAA" > one.spacer
    for i in $(seq 1 3)
    do
        for base in {a,c,t,g}
        do
            awk -v b=$base -v x=$i '{print substr ($0,1,5*x-1) b substr ($0,5*x+1,100)}' one.spacer
        done
    done
gives the output:

    AAAAaAAAAAAAAAA
    AAAAcAAAAAAAAAA
    AAAAtAAAAAAAAAA
    AAAAgAAAAAAAAAA
    AAAAAAAAAaAAAAA
    AAAAAAAAAcAAAAA
    AAAAAAAAAtAAAAA
    AAAAAAAAAgAAAAA
    AAAAAAAAAAAAAAa
    AAAAAAAAAAAAAAc
    AAAAAAAAAAAAAAt
    AAAAAAAAAAAAAAg
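But hopefully you can see that it only edits each 5th position on its own. I need the list to include sequences such as

    AAAAgAAAAgAAAAg
    AAAAcAAAAtAAAAa

and all the other combinations. Hopefully this is a bit clearer.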
This will run in a fraction of a second, even for real 30-character-wide input, using any awk in any shell on every Unix box:
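    $ cat tst.awk
    # old, lgth are arguments; new, i, j are local variables
    function mutate(old,lgth,      new,i,j) {
        for (i=5; i<=lgth; i+=5) {          # every 5th position
            for (j=1; j<=4; j++) {          # each of the 4 bases
                new = substr(old,1,i-1) substr("ACTG",j,1) substr(old,i+1)
                if ( !seen[new]++ ) {       # only handle each sequence once
                    print new
                    mutate(new,lgth)
                }
            }
        }
    }
    { mutate($0,length($0)) }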
    $ echo 'AAAAAAAAAAAAAAA' | awk -f tst.awk
    AAAAAAAAAAAAAAA
    AAAACAAAAAAAAAA
    AAAATAAAAAAAAAA
    AAAAGAAAAAAAAAA
    AAAAGAAAACAAAAA
    AAAAAAAAACAAAAA
    AAAACAAAACAAAAA
    AAAATAAAACAAAAA
    AAAATAAAATAAAAA
    AAAAAAAAATAAAAA
    AAAACAAAATAAAAA
    AAAAGAAAATAAAAA
    AAAAGAAAAGAAAAA
    AAAAAAAAAGAAAAA
    AAAACAAAAGAAAAA
    AAAATAAAAGAAAAA
    AAAATAAAAGAAAAC
    AAAAAAAAAGAAAAC
    AAAACAAAAGAAAAC
    AAAAGAAAAGAAAAC
    AAAAGAAAAAAAAAC
    AAAAAAAAAAAAAAC
    AAAACAAAAAAAAAC
    AAAATAAAAAAAAAC
    AAAATAAAACAAAAC
    AAAAAAAAACAAAAC
    AAAACAAAACAAAAC
    AAAAGAAAACAAAAC
    AAAAGAAAATAAAAC
    AAAAAAAAATAAAAC
    AAAACAAAATAAAAC
    AAAATAAAATAAAAC
    AAAATAAAATAAAAT
    AAAAAAAAATAAAAT
    AAAACAAAATAAAAT
    AAAAGAAAATAAAAT
    AAAAGAAAAAAAAAT
    AAAAAAAAAAAAAAT
    AAAACAAAAAAAAAT
    AAAATAAAAAAAAAT
    AAAATAAAACAAAAT
    AAAAAAAAACAAAAT
    AAAACAAAACAAAAT
    AAAAGAAAACAAAAT
    AAAAGAAAAGAAAAT
    AAAAAAAAAGAAAAT
    AAAACAAAAGAAAAT
    AAAATAAAAGAAAAT
    AAAATAAAAGAAAAG
    AAAAAAAAAGAAAAG
    AAAACAAAAGAAAAG
    AAAAGAAAAGAAAAG
    AAAAGAAAAAAAAAG
    AAAAAAAAAAAAAAG
    AAAACAAAAAAAAAG
    AAAATAAAAAAAAAG
    AAAATAAAACAAAAG
    AAAAAAAAACAAAAG
    AAAACAAAACAAAAG
    AAAAGAAAACAAAAG
    AAAAGAAAATAAAAG
    AAAAAAAAATAAAAG
    AAAACAAAATAAAAG
    AAAATAAAATAAAAG
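If recursion is not wanted, the same set of sequences can probably also be built iteratively, as a plain Cartesian product over the positions 5, 10, 15, and so on. The following is only a sketch of that alternative, not the approach used in tst.awk above; the function name mutate_every_fifth and the array names seqs and out are made up for illustration. It keeps all variants of a line in memory, which should be fine at this scale (a 30-character line has 6 such positions, so 4^6 = 4096 variants).

```shell
# Hypothetical helper (name chosen for illustration only).
# Reads sequences on stdin, one per line, and prints every variant in which
# positions 5, 10, 15, ... take each of the bases A, C, T and G.
mutate_every_fifth() {
    awk '
    {
        n = 1                 # number of variants built so far
        seqs[1] = $0          # start from the unmodified input line
        len = length($0)
        for (i = 5; i <= len; i += 5) {
            m = 0
            # Expand every existing variant with each base at position i.
            for (k = 1; k <= n; k++)
                for (j = 1; j <= 4; j++)
                    out[++m] = substr(seqs[k], 1, i-1) substr("ACTG", j, 1) substr(seqs[k], i+1)
            # The expanded list becomes the input for the next position.
            for (k = 1; k <= m; k++)
                seqs[k] = out[k]
            n = m
        }
        for (k = 1; k <= n; k++)
            print seqs[k]
    }'
}
```

For example, echo 'AAAAAAAAAAAAAAA' | mutate_every_fifth | wc -l should report 64 lines (4^3, one per combination of bases at positions 5, 10 and 15). The order differs from the recursive version's output, and no seen[] bookkeeping is needed because each combination is generated exactly once.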