Awk

Changing positions in a string to generate a list of outputs

  • August 17, 2021

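I am working with short strings about 30 characters long (they are DNA sequences). For my purposes, every 5th position needs to be swapped for any of the 4 DNA bases (A, C, T, G). For example, if I have the input AAAAAAAAAAAAAA, the output would be a list: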

AAAAAAAAAAAAAA
AAAACAAAAAAAAA
AAAATAAAAAAAAA
AAAAGAAAAAAAAA
AAAACAAAACAAAA
AAAACAAAATAAAA
....

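That is, every 5th position is individually swapped for A, C, T, or G, to generate an array of all possible sequences in which every 5th position takes each of the possible DNA bases.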

I have been trying to do this with for loops. I can edit each 5th position, but not in a combinatorial way.

For example:

echo "AAAAAAAAAAAAAAA" > one.spacer 
for i in $(seq 1 3)
 do
   for base in {a,c,t,g}
     do
      awk -v b=$base -v x=$i '{print substr ($0,1,5*x-1) b substr ($0,5*x+1,100)}' one.spacer
   done
done

gives the output:

AAAAaAAAAAAAAAA
AAAAcAAAAAAAAAA
AAAAtAAAAAAAAAA
AAAAgAAAAAAAAAA
AAAAAAAAAaAAAAA
AAAAAAAAAcAAAAA
AAAAAAAAAtAAAAA
AAAAAAAAAgAAAAA
AAAAAAAAAAAAAAa
AAAAAAAAAAAAAAc
AAAAAAAAAAAAAAt
AAAAAAAAAAAAAAg

But hopefully you can see that it only edits each 5th position individually. I need the list of sequences to include, for example,

AAAAgAAAAgAAAAg
AAAAcAAAAtAAAAa

as well as all other combinations. Hopefully this is a bit clearer.

This will run in a fraction of a second, even for real 30-character-wide input, using any awk in any shell on every Unix box:

$ cat tst.awk
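# Recursively print every sequence reachable by changing one 5th position at a time.
# new, i and j are local variables (extra parameters are awk's way of declaring locals).
function mutate(old,lgth,       new,i,j) {
   for (i=5; i<=lgth; i+=5) {            # each 5th position
       for (j=1; j<=4; j++) {            # each of the 4 bases
           new = substr(old,1,i-1) substr("ACTG",j,1) substr(old,i+1)
           if ( !seen[new]++ ) {         # handle each sequence only once
               print new
               mutate(new,lgth)
           }
       }
   }
}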

{ mutate($0,length($0)) }
$ echo 'AAAAAAAAAAAAAAA' | awk -f tst.awk
AAAAAAAAAAAAAAA
AAAACAAAAAAAAAA
AAAATAAAAAAAAAA
AAAAGAAAAAAAAAA
AAAAGAAAACAAAAA
AAAAAAAAACAAAAA
AAAACAAAACAAAAA
AAAATAAAACAAAAA
AAAATAAAATAAAAA
AAAAAAAAATAAAAA
AAAACAAAATAAAAA
AAAAGAAAATAAAAA
AAAAGAAAAGAAAAA
AAAAAAAAAGAAAAA
AAAACAAAAGAAAAA
AAAATAAAAGAAAAA
AAAATAAAAGAAAAC
AAAAAAAAAGAAAAC
AAAACAAAAGAAAAC
AAAAGAAAAGAAAAC
AAAAGAAAAAAAAAC
AAAAAAAAAAAAAAC
AAAACAAAAAAAAAC
AAAATAAAAAAAAAC
AAAATAAAACAAAAC
AAAAAAAAACAAAAC
AAAACAAAACAAAAC
AAAAGAAAACAAAAC
AAAAGAAAATAAAAC
AAAAAAAAATAAAAC
AAAACAAAATAAAAC
AAAATAAAATAAAAC
AAAATAAAATAAAAT
AAAAAAAAATAAAAT
AAAACAAAATAAAAT
AAAAGAAAATAAAAT
AAAAGAAAAAAAAAT
AAAAAAAAAAAAAAT
AAAACAAAAAAAAAT
AAAATAAAAAAAAAT
AAAATAAAACAAAAT
AAAAAAAAACAAAAT
AAAACAAAACAAAAT
AAAAGAAAACAAAAT
AAAAGAAAAGAAAAT
AAAAAAAAAGAAAAT
AAAACAAAAGAAAAT
AAAATAAAAGAAAAT
AAAATAAAAGAAAAG
AAAAAAAAAGAAAAG
AAAACAAAAGAAAAG
AAAAGAAAAGAAAAG
AAAAGAAAAAAAAAG
AAAAAAAAAAAAAAG
AAAACAAAAAAAAAG
AAAATAAAAAAAAAG
AAAATAAAACAAAAG
AAAAAAAAACAAAAG
AAAACAAAACAAAAG
AAAAGAAAACAAAAG
AAAAGAAAATAAAAG
AAAAAAAAATAAAAG
AAAACAAAATAAAAG
AAAATAAAATAAAAG
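
For comparison, here is a minimal non-recursive sketch of the same enumeration. It is not the method used in the answer above. It assumes that with n editable positions (5, 10, 15, ...) there are 4^n combinations, and it counts from 0 to 4^n - 1, reading each counter value as a base-4 number whose digits select the base for each position. The function name `expand_every_fifth` is hypothetical, chosen only for this illustration. The lines come out in a different order from the recursive version, and each sequence is produced exactly once, so no `seen` array is needed.

```shell
# Hypothetical helper (not from the original answer): enumerate every
# combination of A/C/T/G at each 5th position by counting in base 4.
expand_every_fifth() {
    awk '
    {
        n = int(length($0) / 5)                  # number of editable positions
        total = 1
        for (k = 1; k <= n; k++) total *= 4      # 4^n combinations in all
        for (c = 0; c < total; c++) {
            out = ""
            rest = c
            for (k = 1; k <= n; k++) {
                # keep 4 unchanged characters, then pick a base from digit k of c
                out = out substr($0, 5*k - 4, 4) substr("ACTG", rest % 4 + 1, 1)
                rest = int(rest / 4)
            }
            # append any characters after the last multiple of 5 unchanged
            print out substr($0, 5*n + 1)
        }
    }'
}

echo 'AAAAAAAAAAAAAAA' | expand_every_fifth
```

For the 15-character input this prints 64 lines (4^3); a 30-character input would give 4^6 = 4096 lines.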

Quoted from: https://unix.stackexchange.com/questions/664797