Text-Processing

How can I make my sed script run faster?

  • August 4, 2019

I got this script from a related question of mine, "How to insert the file name and a header at the beginning of a csv":

find . -name '*.csv' -printf "%f\n" |
sed 's/.csv$//' |
xargs -I{} sed -i '1s/^/customer|/ '$'\n'' 1!s/^/{}|/' {}.csv;

Currently this takes quite a while when there are many files. I scaled it to 50,000 files and got this result.

real    1m41.251s
user    0m59.326s
sys     0m38.681s

For 100,000 files, I got this.

real    3m18.466s
user    1m58.451s
sys     1m16.550s

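du -sh on the 100,000 files reports 485M. I want to scale this up to 10-20 GB of data.

I would like to know whether there is any way to speed up the script above. I am open to using any tool that makes it faster.

If it helps, I am using Ubuntu 18.04.02 LTS with 16 GB of RAM.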


Using Ed Morton's answer to my question:

time awk -i inplace -v OFS='|' 'FNR==1{cust=FILENAME; sub(/\.csv$/,"",cust)} {print (FNR>1 ? cust : "customer"), $0}' *.csv

real    0m20.253s
user    0m3.336s
sys     0m14.854s

It is much faster than the initial sed version :o. I don't understand why. If someone could explain it, that would be really helpful.
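One plausible explanation, offered as a sketch rather than a measured result: with -I{}, xargs starts a separate sed process for every file name, whereas the awk command receives all file names at once and runs as a single process. The toy input below (the names a, b and c are placeholders, and echo stands in for the real command) shows the difference in the number of invocations.

```shell
# With -I{}, xargs runs the command once per input line: three echo processes.
per_line=$(( $(printf '%s\n' a b c | xargs -I{} echo {} | wc -l) ))

# Without -I, xargs packs as many arguments as fit into one invocation.
batched=$(( $(printf '%s\n' a b c | xargs echo | wc -l) ))

printf 'per-line invocations: %s, batched invocations: %s\n' "$per_line" "$batched"
```

At 100,000 files the first form means 100,000 process start-ups, which is likely where much of the extra user and sys time goes.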


When I scaled it to one million files, the script above failed with "Argument list too long".

I tried the following, but it is slow:

find . -name \*.csv -exec awk -i inplace -v OFS='|' 'FNR==1{cust=FILENAME; sub(/\.csv$/,"",cust)} {print (FNR>1 ? cust : "customer"), $0}' {} \;

Even when I process in batches, 100,000 files still seems slow.

time find . -name "10*.csv" -exec awk -i inplace -v OFS='|' 'FNR==1{cust=FILENAME; sub(/\.csv$/,"",cust)} {print (FNR>1 ? cust : "customer"), $0}' {} \;

real    9m29.474s
user    2m3.336s
sys     6m37.822s
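A possible reason for the slow timings above is that -exec ... {} \; starts one awk process per file. find also accepts -exec ... {} +, which appends many file names to each invocation. The sketch below uses a scratch directory and echo as a stand-in command (both are assumptions for illustration) to show the difference in the number of runs.

```shell
# Scratch directory with three empty files (hypothetical names).
demo_dir=$(mktemp -d)
touch "$demo_dir/1.csv" "$demo_dir/2.csv" "$demo_dir/3.csv"

# '\;' runs the command once per file: three output lines.
runs_semicolon=$(( $(find "$demo_dir" -name '*.csv' -exec echo {} \; | wc -l) ))

# '+' passes many file names to one invocation: a single output line here.
runs_plus=$(( $(find "$demo_dir" -name '*.csv' -exec echo {} + | wc -l) ))

printf 'semicolon form: %s runs, plus form: %s runs\n' "$runs_semicolon" "$runs_plus"
rm -rf "$demo_dir"
```

One caveat when applying this to the awk command: find . reports names such as ./10000000.csv, so FILENAME carries a leading ./ that would end up in the customer column unless the script strips it.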

I tried the usual for loop with Ed's answer, but it seems to run at about the same speed as generating the original files, roughly 40 minutes for 1 million records.

for file in *.csv; do
   echo "$file"
   awk -i inplace -v OFS='|' 'FNR==1{cust=FILENAME; sub(/\.csv$/,"",cust)} {print (FNR>1 ? cust : "customer"), $0}' "$file"
done

I tried batching it per 100,000 files using ls and xargs, and the timings look reasonable, comparable to the initial solution Ed gave.

time ls 11*.csv | xargs awk -i inplace -v OFS='|' 'FNR==1{cust=FILENAME; sub(/\.csv$/,"",cust)} {print (FNR>1 ? cust : "customer"), $0}'

real    0m23.619s
user    0m3.537s
sys     0m15.272s

time ls 12*.csv | xargs awk -i inplace -v OFS='|' 'FNR==1{cust=FILENAME; sub(/\.csv$/,"",cust)} {print (FNR>1 ? cust : "customer"), $0}'

real    0m25.044s
user    0m3.892s
sys     0m16.261s

time ls 13*.csv | xargs awk -i inplace -v OFS='|' 'FNR==1{cust=FILENAME; sub(/\.csv$/,"",cust)} {print (FNR>1 ? cust : "customer"), $0}'

real    0m24.997s
user    0m4.035s
sys     0m16.757s

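What I plan to do now is use the solution above, with a for loop to run the batches. Given an average of 25 seconds per batch, it should finish in 25*10 = 250 seconds, or about 4 minutes. For a million records, that feels fast to me.

If anyone has a better solution, please let me know. If any of the code above is wrong or bad, please let me know as well. I am still a beginner and may have copied or understood something incorrectly.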
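As an alternative to hand-written batches, here is a sketch that lets xargs do the batching. It assumes an xargs that supports -0 (GNU and BSD versions do) and a shell where printf is a builtin, as in bash and most other shells; because a builtin is not started through execve, expanding the glob for it does not hit the "Argument list too long" limit. The scratch directory, the echo stand-in and the -n 2 batch size are all illustrative.

```shell
# Scratch directory with three empty files named like the ones in the question.
work=$(mktemp -d)
touch "$work/10000000.csv" "$work/10000001.csv" "$work/10000002.csv"

# printf emits the NUL-separated names; xargs -0 splits them into batches.
# -n 2 caps each batch at two files so the batching is visible on a tiny input.
batches=$(( $(printf '%s\0' "$work"/*.csv | xargs -0 -n 2 echo | wc -l) ))

printf 'files were handed over in %s batches\n' "$batches"
rm -rf "$work"
```

For the real job, echo would be replaced by the awk command and -n left out, so that xargs fills each command line as far as the system limit allows.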

$ awk -v OFS=',' 'FNR==1{cust=FILENAME; sub(/\.csv$/,"",cust)} {print (FNR>1 ? cust : "customer"), $0}' 10000000.csv
customer,first_name,middle_name,last_name,gender,email,phone_number,address,city,state,country,date_order_start,date_order_complete,invoice_number,invoice_date,item,item_price,quantity,cost,job_name,job_price,total_cost
10000000,Chae,Jesusa,Cummings,Female,deifier2040@example.com,555-555-8750,911 Hauser Pike,Moline,Georgia,Cameroon,2016-06-29,2016-07-16,36298,2016-07-17,Acer,493.86,14,354.77,Broken,123.68,898.13

So with any awk you can do:

for file in *.csv; do
   awk 'script' "$file" > tmp && mv tmp "$file"
done
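For reference, the loop above with the placeholder 'script' filled in might look like the sketch below. The scratch directory and the shortened two-column sample are assumptions for illustration, and the extra sub() that strips the directory part is only needed because the demo files are addressed by full path.

```shell
# Scratch directory with one small sample file (columns shortened for the demo).
work=$(mktemp -d)
printf 'first_name,last_name\nChae,Cummings\n' > "$work/10000000.csv"

for file in "$work"/*.csv; do
    # Write to a temporary file, then replace the original only if awk succeeded.
    awk -v OFS=',' '
        FNR==1 { cust = FILENAME; sub(/^.*\//, "", cust); sub(/\.csv$/, "", cust) }
        { print (FNR>1 ? cust : "customer"), $0 }
    ' "$file" > "$file.tmp" && mv "$file.tmp" "$file"
done

result=$(cat "$work/10000000.csv")
printf '%s\n' "$result"
rm -rf "$work"
```

This still starts one awk per file, so it trades speed for portability.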

or, with GNU awk for "inplace" editing:

$ tail -n +1 10000000.csv 10000001.csv
==> 10000000.csv <==
first_name,middle_name,last_name,gender,email,phone_number,address,city,state,country,date_order_start,date_order_complete,invoice_number,invoice_date,item,item_price,quantity,cost,job_name,job_price,total_cost
Chae,Jesusa,Cummings,Female,deifier2040@example.com,555-555-8750,911 Hauser Pike,Moline,Georgia,Cameroon,2016-06-29,2016-07-16,36298,2016-07-17,Acer,493.86,14,354.77,Broken,123.68,898.13

==> 10000001.csv <==
first_name,middle_name,last_name,gender,email,phone_number,address,city,state,country,date_order_start,date_order_complete,invoice_number,invoice_date,item,item_price,quantity,cost,job_name,job_price,total_cost
Fleta,Rosette,Hurley,Other,tobacconist1857@example.com,1-555-555-1210,35 Freelon Arcade,Beaverton,Rhode Island,Cayman Islands,2009-06-08,2009-06-29,39684,2009-07-01,NVIDIA GeForce GTX 980,474.31,16,395.79,Broken,157.53,1088.04
Bennett,Dennis,George,Male,dona1910@example.com,(555) 555-4131,505 Robert C Levy Arcade,Wellington,Louisiana,Mexico,2019-05-09,2019-05-19,37938,2019-05-21,8GB,187.67,16,205.77,Service,170.21,1007.85
Tommye,Pamula,Diaz,Other,dovelet1967@example.com,555.555.4445,1001 Canby Boulevard,Edinburg,Massachusetts,Gambia,2004-05-02,2004-05-24,31364,2004-05-26,Lenovo,137.21,13,193.63,Replacement,246.43,934.31
Albert,Jerrold,Cohen,Other,bolio2036@example.com,+1-(555)-555-8491,1181 Baden Avenue,Menomonee Falls,Texas,Tajikistan,2019-08-03,2019-08-12,37768,2019-08-15,Intel® Iris™ Graphics 6100,396.46,17,223.02,Service,118.53,960.27
Louetta,Collene,Best,Fluid,dinner1922@example.com,1-555-555-7050,923 Barry Viaduct,Laurel,Illinois,St. Barthélemy,2009-03-02,2009-03-06,39557,2009-03-07,AMD Radeon R9 M395X,133.9,11,198.49,Fix,178.54,1055.32
Kandace,Wesley,Diaz,Female,closterium1820@example.com,+1-(555)-555-5414,341 Garlington Run,Santa Maria,New Jersey,Mexico,2005-10-09,2005-10-10,30543,2005-10-14,Samsung,590.29,5,354.85,Service,292.56,1032.22

.

$ awk -i inplace -v OFS=',' 'FNR==1{cust=FILENAME; sub(/\.csv$/,"",cust)} {print (FNR>1 ? cust : "customer"), $0}' 10000000.csv 10000001.csv

.

$ tail -n +1 10000000.csv 10000001.csv
==> 10000000.csv <==
customer,first_name,middle_name,last_name,gender,email,phone_number,address,city,state,country,date_order_start,date_order_complete,invoice_number,invoice_date,item,item_price,quantity,cost,job_name,job_price,total_cost
10000000,Chae,Jesusa,Cummings,Female,deifier2040@example.com,555-555-8750,911 Hauser Pike,Moline,Georgia,Cameroon,2016-06-29,2016-07-16,36298,2016-07-17,Acer,493.86,14,354.77,Broken,123.68,898.13

==> 10000001.csv <==
customer,first_name,middle_name,last_name,gender,email,phone_number,address,city,state,country,date_order_start,date_order_complete,invoice_number,invoice_date,item,item_price,quantity,cost,job_name,job_price,total_cost
10000001,Fleta,Rosette,Hurley,Other,tobacconist1857@example.com,1-555-555-1210,35 Freelon Arcade,Beaverton,Rhode Island,Cayman Islands,2009-06-08,2009-06-29,39684,2009-07-01,NVIDIA GeForce GTX 980,474.31,16,395.79,Broken,157.53,1088.04
10000001,Bennett,Dennis,George,Male,dona1910@example.com,(555) 555-4131,505 Robert C Levy Arcade,Wellington,Louisiana,Mexico,2019-05-09,2019-05-19,37938,2019-05-21,8GB,187.67,16,205.77,Service,170.21,1007.85
10000001,Tommye,Pamula,Diaz,Other,dovelet1967@example.com,555.555.4445,1001 Canby Boulevard,Edinburg,Massachusetts,Gambia,2004-05-02,2004-05-24,31364,2004-05-26,Lenovo,137.21,13,193.63,Replacement,246.43,934.31
10000001,Albert,Jerrold,Cohen,Other,bolio2036@example.com,+1-(555)-555-8491,1181 Baden Avenue,Menomonee Falls,Texas,Tajikistan,2019-08-03,2019-08-12,37768,2019-08-15,Intel® Iris™ Graphics 6100,396.46,17,223.02,Service,118.53,960.27
10000001,Louetta,Collene,Best,Fluid,dinner1922@example.com,1-555-555-7050,923 Barry Viaduct,Laurel,Illinois,St. Barthélemy,2009-03-02,2009-03-06,39557,2009-03-07,AMD Radeon R9 M395X,133.9,11,198.49,Fix,178.54,1055.32
10000001,Kandace,Wesley,Diaz,Female,closterium1820@example.com,+1-(555)-555-5414,341 Garlington Run,Santa Maria,New Jersey,Mexico,2005-10-09,2005-10-10,30543,2005-10-14,Samsung,590.29,5,354.85,Service,292.56,1032.22

If you have too many files to pass on the command line and running it through xargs is too slow, then here is another option:

awk -i inplace ... '
   BEGIN {
       while ( (getline line < ARGV[1]) > 0 ) {
           if ( line ~ /\.csv$/ ) {
               ARGV[ARGC] = line
               ARGC++
           }
       }
       ARGV[1] = ""
   }
   { the "real" script }
' <(ls)

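The above reads the output of ls as an input file rather than as arguments, populates the ARGV array with the file names that end in .csv, and then operates on those files as if they had been passed as arguments on the command line.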
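The ARGV-filling technique can be tried safely with a read-only variant. The sketch below is an assumption-laden demo, not the answer's exact command: it drops -i inplace, reads the file list from a temporary file instead of <(ls) so that it runs in any POSIX shell, and replaces the "real" script with a simple counter.

```shell
# Scratch directory with two csv files and one file that should be skipped.
work=$(mktemp -d)
list=$(mktemp)
printf 'h\na\n' > "$work/1.csv"
printf 'h\nb\nc\n' > "$work/2.csv"
printf 'ignored\n' > "$work/notes.txt"
ls "$work" > "$list"

# Run from inside the directory because the listed names are relative.
summary=$(cd "$work" && awk '
    BEGIN {
        # Read file names from the list and append the csv ones to ARGV.
        while ( (getline line < ARGV[1]) > 0 ) {
            if ( line ~ /\.csv$/ ) {
                ARGV[ARGC] = line
                ARGC++
            }
        }
        ARGV[1] = ""    # an empty ARGV entry is skipped as an input file
    }
    FNR == 1 { files++ }
    END { print files " files, " NR " lines" }
' "$list")

printf '%s\n' "$summary"
rm -rf "$work" "$list"
```

Because the names arrive through a file rather than the command line, the number of files is not bounded by the argument-length limit.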

Quoted from: https://unix.stackexchange.com/questions/533626