Text-Processing

Fixing a malformed CSV with stray newlines using only sed or perl

  • April 2, 2018

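I have a comma-separated CSV file, but for some reason our system inserts newlines at random positions in it, which breaks the whole file. I can find out how many columns the file has.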

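How can I fix this with sed and/or perl in a one-line command? I know it can be solved with awk, but this is for learning purposes. If perl is used, I don't want to use the built-in CSV functions. Can it be solved? I have been working on this for days and can't seem to find a solution :(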

Sample malformed input (lots of randomly inserted \n)

policyID,statecode,county,Point longitude,Some Thing Here,point_granularity
119736,FL,CLAY COUNTY,-81.711777,“Residential Lot”,1
448094,FL,CLAY COUNTY,-81.707664,“Residen
tial Lot”,3
206893,FL,CLAY COUNTY,-81.7
00455,“Residen
tial Lot”,1
333743,FL,CLAY COUNTY,-81.707703,“Residential Lot”,
3
172534,FL,CLAY COUNTY,-81.702675,“Residential Lot”,1
785275,FL,CLAY COUNTY,-81.707703,“Residential Lot”,3
995932,FL,CLAY COUNTY,-81.713882,
“Residential Lot”,1
223488,FL,CLAY COUNTY,-81.707146,“Residential Lot”,1
4335
12,FL,CLAY COUNTY,-81.704613,
“Residential Lot”,1

Desired output

policyID,statecode,county,Point longitude,Some Thing Here,point_granularity
119736,FL,CLAY COUNTY,-81.711777,“Residential Lot”,1
448094,FL,CLAY COUNTY,-81.707664,“Residential Lot”,3
206893,FL,CLAY COUNTY,-81.700455,“Residential Lot”,1
333743,FL,CLAY COUNTY,-81.707703,“Residential Lot”,3
172534,FL,CLAY COUNTY,-81.702675,“Residential Lot”,1
785275,FL,CLAY COUNTY,-81.707703,“Residential Lot”,3
995932,FL,CLAY COUNTY,-81.713882,“Residential Lot”,1
223488,FL,CLAY COUNTY,-81.707146,“Residential Lot”,1
433512,FL,CLAY COUNTY,-81.704613,“Residential Lot”,1
$ awk -F, '{ while (NF < 6 || $NF == "") { brokenline=$0; getline; $0 = brokenline $0}; print }' file.csv
policyID,statecode,county,Point longitude,Some Thing Here,point_granularity
119736,FL,CLAY COUNTY,-81.711777,“Residential Lot”,1
448094,FL,CLAY COUNTY,-81.707664,“Residential Lot”,3
206893,FL,CLAY COUNTY,-81.700455,“Residential Lot”,1
333743,FL,CLAY COUNTY,-81.707703,“Residential Lot”,3
172534,FL,CLAY COUNTY,-81.702675,“Residential Lot”,1
785275,FL,CLAY COUNTY,-81.707703,“Residential Lot”,3
995932,FL,CLAY COUNTY,-81.713882,“Residential Lot”,1
223488,FL,CLAY COUNTY,-81.707146,“Residential Lot”,1
433512,FL,CLAY COUNTY,-81.704613,“Residential Lot”,1

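The awk code appends the next input line to the current line as long as the current line has fewer than six fields, or its last field is empty (meaning the line was broken right after the last field separator).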
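Since the question asks about sed, the same join rule can be sketched there as well. This is only a sketch under some assumptions: every record has exactly six fields (as in the sample header), no field contains a comma or a legitimate newline, and the function name `fix_csv_sed` is made up for illustration. A line counts as complete when it has five comma-terminated fields followed by a non-empty sixth field; anything else gets the next line appended.

```shell
# Hypothetical helper: join physical lines until a record has six fields.
# Assumes 6 columns and no commas or real newlines inside fields.
fix_csv_sed() {
    sed -e ':a' \
        -e '/^\([^,]*,\)\{5\}[^,]\{1,\}$/!{' \
        -e '$!{' \
        -e 'N' \
        -e 's/\n//' \
        -e 'ba' \
        -e '}' \
        -e '}' "$@"
}
```

It would be called as `fix_csv_sed file.csv`. The `$!` guard stops the loop on the last input line, so a truncated final record is printed as-is instead of looping. Like the awk version, it cannot detect a break that falls inside the last field, because what comes before the break already looks like a complete record.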


Perl works similarly:

perl -ne 'chomp;while (tr/,/,/ < 5 || /,$/) { $_ .= readline; chomp } print "$_\n"' file.csv

Source: https://unix.stackexchange.com/questions/434979