How to improve the performance of cat and xargs in a bash script

May 14, 2021

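I have to process a file containing 1 million lines of domain names, called:
1m.txt
Right now my script validates the lines contained in 1m.txt:
cat out.txt > advance.txt 2> /dev/null
cat 1m.txt | xargs -I {} -P 100 sh -c "if ! grep --quiet {} advance.txt; then if host {} >/dev/null; then echo OK {}; else echo DIE {}; fi; fi" >> out.txt
What this script does is that, if it is interrupted (ctrl + c) and restarted, it resumes from the last line processed. With 1,000 lines, restarting after an interruption at line 200 is fast. But with 1 million lines, restarting after an interruption at line 500k takes hours.
Is there a way to make this more efficient?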

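So your current logic is: "for each line in 1m.txt, check whether it is already in advance.txt. If it is not, process it and add it to out.txt. When the job starts, update advance.txt with all the lines in out.txt."
The problem with this is that as more lines are added to advance.txt, each line has to be compared against more lines. In the worst case, if every line has already been processed, each of the one million lines in 1m.txt has to be checked to see whether it is in advance.txt. On average you need to compare against half of the lines in advance.txt, so this takes 1,000,000 × 500,000, or 500,000,000,000 (500 billion), comparisons.
If you were not processing things in parallel, the straightforward approach would be to find the last line in out.txt and then skip all the lines in 1m.txt up to that point. For example: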
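To make the size of that estimate concrete, here is a minimal sketch of the arithmetic in shell. The two figures come from the question (one million input lines) and from the average case described above (half of advance.txt scanned per lookup); the variable names are only illustrative.

```shell
#!/bin/sh
# Figures taken from the discussion above, not measured values.
total_lines=1000000   # lines in 1m.txt
avg_scanned=500000    # average number of advance.txt lines scanned per lookup

# Total line comparisons performed by the repeated grep calls.
comparisons=$((total_lines * avg_scanned))
echo "$comparisons"
```

By contrast, an approach that reads each file once only has to touch on the order of the combined line count of the two files.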
# Pipe the output of the if/then/else/fi construct to xargs.
# use the if/then/else/fi to select the input.
# Use '-s' to see if the file exists and has non zero size.
if [ -s out.txt ] ; then
  # we have some existing data
  # Get the host from the last line:
  # delete anything that is not the last line,
  # remove the leading "DIE " or "OK " status, then
  # quote anything not alphanumeric with a backslash.
  lasthost="$(sed '$!d;s/^\(DIE\|OK\) //;s/[^0-9a-zA-Z]/\\&/g' out.txt)"
  # get the lines from 1m.txt from after the matched host
  # uses GNU sed extension to start at line "0"
  sed "0,/^$lasthost\$/d" 1m.txt
else
  # no existing data, so just copy the 1m.txt using cat
  cat 1m.txt
fi | xargs -I {} sh -c "if host {} >/dev/null; then echo OK {}; else echo DIE {}; fi" >> out.txt
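The skip-ahead step can be tried on its own without doing any DNS lookups. The following is a minimal sketch that uses a temporary directory and four made-up host names in place of the real 1m.txt and out.txt; it assumes GNU sed (for the `0,/regex/` address and `\|` alternation) and strips the status prefix with the group `\(DIE\|OK\)`.

```shell
#!/bin/sh
# Hypothetical sample data standing in for 1m.txt and out.txt.
tmpdir=$(mktemp -d)
printf '%s\n' a.example b.example c.example d.example > "$tmpdir/1m.txt"
printf '%s\n' 'OK a.example' 'DIE b.example' > "$tmpdir/out.txt"

# Keep only the last line of out.txt, drop the "OK "/"DIE " prefix, and
# backslash-escape every non-alphanumeric character so the host name can
# be used safely inside a regular expression.
lasthost="$(sed '$!d;s/^\(DIE\|OK\) //;s/[^0-9a-zA-Z]/\\&/g' "$tmpdir/out.txt")"

# Delete everything up to and including the line matching the last host;
# what remains is the list of hosts still to be processed.
remaining="$(sed "0,/^$lasthost\$/d" "$tmpdir/1m.txt")"
printf '%s\n' "$remaining"

rm -rf "$tmpdir"
```

With this sample data the last processed host is b.example, so only c.example and d.example are left for xargs. This only works when the output order matches the input order, which is why it suits the non-parallel case.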
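However, you are processing things in parallel. Since host can take a highly variable amount of time to return, the output can end up in a significantly different order from the input. A faster way of checking whether a host has already been seen is needed. The standard approach is to use some kind of hash table. One way is to use awk: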
if [ -s out.txt ] ; then
  # we have some existing data. Process the two files given:
  # for the first file, set the entries of the seen array to 1;
  # for the second file, print out the hosts which have not been seen.
  awk 'FNR==NR {seen[$2]=1;next} seen[$1]!=1' out.txt 1m.txt
else
  cat 1m.txt
fi | xargs -I {} -P 100 sh -c "if host {} >/dev/null; then echo OK {}; else echo DIE {}; fi" >> out.txt
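The awk filter can also be checked in isolation. This is a minimal sketch with made-up host names in a temporary directory; the out.txt lines are deliberately out of input order, as they would be after a parallel run.

```shell
#!/bin/sh
# Hypothetical sample data standing in for 1m.txt and out.txt.
tmpdir=$(mktemp -d)
printf '%s\n' a.example b.example c.example d.example > "$tmpdir/1m.txt"
printf '%s\n' 'DIE c.example' 'OK a.example' > "$tmpdir/out.txt"

# First file (FNR==NR): record the host in field 2 of every result line.
# Second file: print only the hosts that were not recorded.
pending="$(awk 'FNR==NR {seen[$2]=1; next} seen[$1]!=1' "$tmpdir/out.txt" "$tmpdir/1m.txt")"
printf '%s\n' "$pending"

rm -rf "$tmpdir"
```

Here b.example and d.example are the hosts still pending. Each lookup in the seen array is a hash lookup, so both files are read only once. Note that the `-s` test in the script above matters for this idiom: if out.txt were empty, FNR==NR would also be true while reading 1m.txt, and nothing would be printed.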

Source: https://unix.stackexchange.com/questions/649621
