Bash

/usr/bin/awk:參數列表太長

  • March 10, 2015

我正在嘗試執行以下 bash 腳本:

#!/bin/bash

file=$1
filename=${file%%.*}
line1=$(sed -n 1~2p ${file})
seqs=$(grep -v '^>' ${file})
pos=$(echo "${line1}" | awk -F"[__]" 'NF>2{print $2}')

( 
   awk -v str="${seqs}" -v str2="${pos}" -v str3="${line1}" -v name=${filename} -v sep="[$IFS]" '
       BEGIN {
           n = split(str, a, sep)
           m = split(str2, b, sep)
           k = split(str3, c, sep)
           for (i=1;i<=n;i++) {o=10;d[$i]=b[i]-o;s[$i]=d[i]>0?d[i]:1; print c[i] "\n" substr(a[i],d[$i],2*o+(d[$i]<0?d[$i]:1)) > name"_flanks.fasta"}
       }
   '
)

但是我得到:

$ ./test.sh myfile.fasta
./test.sh: line 10: /usr/bin/xargs: Argument list too long

因不使用版本控製而受到打擊,但這在我的程式碼的先前版本中有效。似乎是什麼問題?

**編輯:**已經註意到,如果我管“頭 $ {file} |" into the sed and grep commands then this runs fine, but doing “cat $ {file} |” 重新產生原始錯誤。這真的是文件大小限制嗎?我是否必須將計算分成更小的文件塊?

的輸出"$seqs$是這些元素中的大約 6,000 個

MEDEAVLDRGASFLKHVCDEEEVEGHHTIYIGVHVPKSYRRRRRHKRKTGHKEKKEKERISENYSDKSDIENADESSSSILKPLISPAAERIRFILGEEDDSPAPPQLFTELDELLAVDGQEMEWKETARWIKFEEKVEQGGERWSKPHVATLSLHSLFELRTCMEKGSIMLDREASSLPQLVEMIVDHQIETGLLKPELKDKVTYTLLRKHRHQTKKSNLRSLADIGKTVSSASRMFTNPDNGSPAMTHRNLTSSSLNDISDKPEKDQLKNKFMKKLPRDAEASNVLVGEVDFLDTPFIAFVRLQQAVMLGALTEVPVPTRFLFILLGPKGKAKSYHEIGRAIATLMSDEVFHDIAYKAKDRHDLIAGIDEFLDEVIVLPPGEWDPAIRIEPPKSLPSSDKRKNMYSGGENVQMNGDTPHDGGHGGGGHGDCEELQRTGRFCGGLIKDIKRKAPFFASDFYDALNIQALSAILFIYLATVTNAITFGGLLGDATDNMQGVLESFLGTAVSGAIFCLFAGQPLTILSSTGPVLVFERLLFNFSKDNNFDYLEFRLWIGLWSAFLCLILVATDASFLVQYFTRFTEEGFSSLISFIFIYDAFKKMIKLADYYPINSNFKVGYNTLFSCTCVPPDPANISISNDTTLAPEYLPTMSSTDMYHNTTFDWAFLSKKECSKYGGNLVGNNCNFVPDITLMSFILFLGTYTSSMALKKFKTSPYFPTTARKLISDFAIILSILIFCVIDALVGVDTPKLIVPSEFKPTSPNRGWFVPPFGENPWWVCLAAAIPALLVTILIFMDQQITAVIVNRKEHKLKKGAGYHLDLFWVAILMVICSLMALPWYVAATVISIAHIDSLKMETETSAPGEQPKFLGVREQRVTGTLVFILTGLSVFMAPILKFIPMPVLYGVFLYMGVASLNGVQFMDRLKLLLMPLKHQPDFIYLRHVPLRRVHLFTFLQVLCLALLWILKSTVAAIIFPVMILALVAVRKGMDYLFSQHDLSFLDDVIPEKDKKKKEDEKKKKKKKGSLDSDNDDSDCPYSEKVPSIKIPMDIMEQQPFLSDSKPSDRERSPTFLERHTSC

該文件包含許多重複的數據,例如:

>Q9UM01_334_L_R
MVDSTEYEVASQPEVETSPLGDGASPGPEQVKLKKEISLLNGVCLIVGNMIGSGIFVSPKGVLIYSASFGLSLVIWAVGGLFSVFGALCYAELGTTIKKSGASYAYILEAFGGFLAFIRLWTSLLIIEPTSQAIIAITFANYMVQPLFPSCFAPYAASRLLAAACICLLTFINCAYVKWGTLVQDIFTYAKVLALIAVIVAGIVRLGQGASTHFENSFEGSSFAVGDIALALYSALFSYSGWDTLNYVTEEIKNPERNLPLSIGISMPIVTIIYILTNVAYYTVLDMRDILASDAVAVTFADQIFGIFNWIIPLSVALSCFGGLNASIVAASRLFFVGSREGHLPDAICMIHVERFTPVPSLLFNGIMALIYLCVEDIFQLINYYSFSYWFFVGLSIVGQLYLRWKEPDRPRPLKLSVFFPIVFCLCTIFLVAVPLYSDTINSLIGIAIALSGLPFYFLIIRVPEHKRPLYLRRIVGSATRYLQVLCMSVAAEMDLEDGGEMPKQRDPKSN

我想閱讀標題(以“>”開頭),去掉位置編號(334),然後第 2 行是我想要的“序列”:

轉到位置pos[i]seqs[i]選擇一個子字元串,seqs[i]它的兩邊最多 10 個位置pos[i]。例如,如果pos[i] = 15我會返回:

EYEVASQPEVETSPLGDGAS

我可以在不使用整個文件時執行此操作,但似乎將所有內容直接讀入 awk 會使程序比通過 shell 變數載入所有內容更有效。

為什麼您不只使用@Olivier Dulacawk提供的方法:

awk '/^>/{split($0,N,"_");n=N[2];print;next}{print substr($0,n-10,20)}' file > file_flanks.fasta

相同的:

awk -F'_' '/^>/{n=$2;print;next}{print substr($0,n-10,20)}' file > file_flanks.fasta

或者沒有數組:

awk '/^>/{print;sub("[^_]*_","");n=$0+0;next}{print substr($0,n-10,20)}' file > file_flanks.fasta

引用自:https://unix.stackexchange.com/questions/189245