Grep
Speeding up a grep search with GNU parallel
I am using the following grep script to output all the patterns that are not matched:
grep -oFf patterns.txt large_strings.txt | grep -vFf - patterns.txt > unmatched_patterns.txt
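To make the behaviour of this pipeline concrete, here is a toy-sized sketch. The `demo` directory and the sample strings are made up for illustration; only the two-stage grep pipeline itself is taken from the command above.

```shell
# Toy data (hypothetical): three patterns and one short line to search.
demo=$(mktemp -d)
printf '%s\n' 6b6c665d4f44 8b715a5d5f5f 26364d605243 > "$demo/patterns.txt"
printf '%s\n' 00006b6c665d4f44ffff26364d6052430000 > "$demo/large_strings.txt"

# Stage 1: print every pattern occurrence found in the large strings (-o),
#          treating the patterns as fixed strings (-F) read from a file (-f).
# Stage 2: use those hits as the pattern list ("-f -" reads it from stdin)
#          and keep only the lines of patterns.txt that match none of them (-v).
grep -oFf "$demo/patterns.txt" "$demo/large_strings.txt" |
  grep -vFf - "$demo/patterns.txt" > "$demo/unmatched_patterns.txt"

cat "$demo/unmatched_patterns.txt"   # prints 8b715a5d5f5f
```

Two of the three toy patterns occur in the sample line, so only the third one ends up in the output file.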
The patterns file contains 12-character substrings (a few examples are shown below):
6b6c665d4f44 8b715a5d5f5f 26364d605243 717c8a919aa2
The large_strings file contains extremely long strings of roughly 20 to 100 million characters each (a small fragment of one string is shown below):
121b1f212222212123242223252b36434f5655545351504f4e4e5056616d777d80817d7c7b7a7a7b7c7d7f8997a0a2a2a3a5a5a6a6a6a6a6a7a7babbbcbebebdbcbcbdbdbdbdbcbcbcbcc2c2c2c2c2c2c2c2c4c4c4c3c3c3c2c2c3c3c3c3c3c3c3c3c2c2c1c0bfbfbebdbebebebfbfc0c0c0bfbfbfbebebdbdbdbcbbbbbababbbbbcbdbdbdbebebfbfbfbebdbcbbbbbbbbbcbcbcbcbcbcbcbcbcb8b8b8b7b7b6b6b6b8b8b9babbbbbcbcbbbabab9b9bababbbcbcbcbbbbbababab9b8b7b6b6b6b6b7b7b7b7b7b7b7b7b7b7b6b6b5b5b6b6b7b7b7b7b8b8b9b9b9b9b9b8b7b7b6b5b5b5b5b5b4b4b3b3b3b6b5b4b4b5b7b8babdbebfc1c1c0bfbec1c2c2c2c2c1c0bfbfbebebebebfc0c1c0c0c0bfbfbebebebebebebebebebebebebebdbcbbbbbab9babbbbbcbcbdbdbdbcbcbbbbbbbbbbbabab9b7b6b5b4b4b4b4b3b1aeaca9a7a6a9a9a9aaabacaeafafafafafafafafafb1b2b2b2b2b1b0afacaaa8a7a5a19d9995939191929292919292939291908f8e8e8d8c8b8a8a8a8a878787868482807f7d7c7975716d6b6967676665646261615f5f5e5d5b5a595957575554525
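How can we speed up the above script (GNU parallel, xargs, fgrep, etc.)? I tried using --pipepart and --block, but that does not let you pipe two grep commands together.
By the way, these are all hexadecimal strings and patterns.
A more efficient answer that does not use grep: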
build_k_mers() {
    k="$1"
    slot="$2"
    perl -ne 'for $n (0..(length $_)-'"$k"') {
       $prefix = substr($_,$n,2);
       $fh{$prefix} or open $fh{$prefix}, ">>", "tmp/kmer.$prefix.'"$slot"'";
       $fh = $fh{$prefix};
       print $fh substr($_,$n,'"$k"'),"\n"
    }'
}
export -f build_k_mers

rm -rf tmp
mkdir tmp
export LC_ALL=C
# search strings must be sorted for comm
parsort patterns.txt | awk '{print >>"tmp/patterns."substr($1,1,2)}' &

# make shorter lines: Insert \n(last 12 char before \n) for every 32k
# This makes it easier for --pipepart to find a newline
# It will not change the kmers generated
perl -pe 's/(.{32000})(.{12})/$1$2\n$2/g' large_strings.txt > large_lines.txt
# Build 12-mers
parallel --pipepart --block -1 -a large_lines.txt 'build_k_mers 12 {%}'
# -j10 and 20s may be adjusted depending on hardware
parallel -j10 --delay 20s 'parsort -u tmp/kmer.{}.* > tmp/kmer.{}; rm tmp/kmer.{}.*' ::: `perl -e 'map { printf "%02x ",$_ } 0..255'`
wait
parallel comm -23 {} {=s/patterns./kmer./=} ::: tmp/patterns.??
I have tested this on the full job (patterns.txt: 9 GBytes/725937231 lines, large_strings.txt: 19 GBytes/184 lines), and on my 64-core machine it completes in 3 hours.
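The idea behind the script can be sketched at toy scale with only awk, sort and comm: list every 12-character window (12-mer) of the large strings, sort both lists, and let `comm -23` print the patterns that are absent from the 12-mer list. This is a simplified single-process sketch, not the script above; the `kmer_demo` directory and the sample data are hypothetical.

```shell
export LC_ALL=C   # comm needs both inputs sorted in the same collation order
kmer_demo=$(mktemp -d)
printf '%s\n' 6b6c665d4f44 8b715a5d5f5f 26364d605243 > "$kmer_demo/patterns.txt"
printf '%s\n' 00006b6c665d4f44ffff26364d6052430000 > "$kmer_demo/large_strings.txt"

# Emit every window of k characters from each input line, then deduplicate.
awk -v k=12 '{
  for (i = 1; i + k - 1 <= length($0); i++) print substr($0, i, k)
}' "$kmer_demo/large_strings.txt" | sort -u > "$kmer_demo/kmers.sorted"

sort -u "$kmer_demo/patterns.txt" > "$kmer_demo/patterns.sorted"

# -2 drops lines found only in the k-mer list, -3 drops lines common to both,
# so what remains are the patterns that never occur as a 12-mer.
comm -23 "$kmer_demo/patterns.sorted" "$kmer_demo/kmers.sorted" \
  > "$kmer_demo/unmatched_patterns.txt"

cat "$kmer_demo/unmatched_patterns.txt"   # prints 8b715a5d5f5f
```

Unlike the grep pipeline, the result comes out in sorted order rather than in the original order of patterns.txt. In the full script, bucketing the 12-mers and the patterns by their first two characters presumably serves to keep each sort and each `comm` run small enough to be done in parallel.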
This should work:
parallel --pipepart --block -1 -a large_strings.txt grep -oFf patterns.txt | grep -vFf - patterns.txt > unmatched_patterns.txt
If you have ripgrep, use it instead:
parallel --pipepart --block -1 -a large_strings.txt rg -oFf patterns.txt | rg -vFf - patterns.txt > unmatched_patterns.txt
If patterns.txt is also large, have a look at: https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Grepping-n-lines-for-m-regular-expressions
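One way to picture the chunking idea for a large pattern list is the sequential sketch below. It is only an illustration under assumptions, not the command from the linked manual section: the `chunk_demo` directory, the chunk size of 2 and the sample data are hypothetical, and a real run would distribute the chunks with GNU parallel instead of a plain loop.

```shell
chunk_demo=$(mktemp -d)
printf '%s\n' 6b6c665d4f44 8b715a5d5f5f 26364d605243 717c8a919aa2 \
  > "$chunk_demo/patterns.txt"
printf '%s\n' 00006b6c665d4f44ffff26364d6052430000 > "$chunk_demo/large_strings.txt"

# Split the pattern list into chunks of 2 lines each (a real run would pick
# a chunk size that fits comfortably in RAM).
split -l 2 "$chunk_demo/patterns.txt" "$chunk_demo/chunk."

# Search with one chunk of patterns at a time and collect every hit.
for chunk in "$chunk_demo"/chunk.*; do
  grep -oFf "$chunk" "$chunk_demo/large_strings.txt"
done | sort -u > "$chunk_demo/matched.txt"

# Patterns that never showed up as a hit are the unmatched ones
# (-x compares whole lines, which is enough because hits are full patterns).
grep -vxFf "$chunk_demo/matched.txt" "$chunk_demo/patterns.txt" \
  > "$chunk_demo/unmatched_patterns.txt"

cat "$chunk_demo/unmatched_patterns.txt"
```

The trade-off is that the large file is read once per chunk, so smaller chunks save memory at the cost of more passes over the data.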
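Your situation is also very close to the problem that BLAT solves, except that BLAT is built for DNA. Still, I do not see why you could not use BLAT in your case, possibly with a few changes (for example, you could convert each hex value into 2 DNA letters and use it directly). BLAT is as fast as a database lookup, so grep cannot compete with it. http://genome.ucsc.edu/FAQ/FAQblat.html#blat3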
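The hex-to-DNA conversion mentioned above could look roughly like the sketch below. It assumes that "each hex value" means each hex digit (4 bits, hence two letters from a 4-letter alphabet) and uses an arbitrary encoding A=00, C=01, G=10, T=11; the function name `hex_to_dna` is hypothetical, and nothing here is specific to BLAT itself.

```shell
# Convert each line of hex digits on stdin into DNA letters, 2 letters per digit.
# Assumed encoding (arbitrary choice): A=00, C=01, G=10, T=11.
hex_to_dna() {
  awk 'BEGIN {
         # code[1] is the letter pair for hex digit 0, code[16] for digit f
         split("AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT", code, " ")
         hex = "0123456789abcdef"
       }
       {
         line = tolower($0)
         out = ""
         for (i = 1; i <= length(line); i++) {
           pos = index(hex, substr(line, i, 1))   # 1..16, or 0 if not a hex digit
           if (pos == 0) exit 1                   # give up on non-hex input
           out = out code[pos]
         }
         print out
       }'
}

printf '%s\n' 6b6c665d4f44 | hex_to_dna
```

A 12-digit pattern becomes a 24-letter sequence. One caveat of this encoding: a hit that starts at an odd offset in the DNA text does not line up with a hex digit boundary, so such hits would have to be filtered out.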