Text-Processing
如何將具有多個子行項(從列)的 CSV 轉換為 Awk 中的新列?
我有一個 CSV 文件,其一般格式如下圖所示。
在該 CSV 中有多個行屬於某個列 (
desc
),我想提取這些項目並將它們分別添加到名為 的新列name, size, weight, glass
中。我已經突出顯示(紅色)條目的那些子行項目。原始結構:
預期結構:
原始 CSV:
page,item,id,desc,price 1,2,F1,Alpha,111.11 1,,,380 x 2280 mm (size), 1,,,Weight: 33.0 kg, 1,,,Glass: AAA, 1,,,XXX, 1,3,F2,Beta,222.22 1,,,880 x 2280 mm (size), 1,,,Weight: 68.4 kg, 1,,,Glass: BBB, 1,,,YYY, 1,4,F3,Gamma,333.33 1,,,1980 x 1580 mm (size), 1,,,Weight: 78.2 kg, 1,,,Glass: CCC, 1,,,XXX, 1,,,YYY, 1,,,ZZZ,
預期生成的 CSV:
page,item,id,name,size,weight,glass,price 1,2,F1,Alpha,380 x 2280,33.0,AAA,111.11 1,3,F2,Beta,880 x 2280,68.4,BBB,222.22 1,4,F3,Gamma,1980 x 1580,78.2,CCC,333.33
其中name將取代desc中的第一行。
更新:
在某些情況下,某些 awk 解決方案可能適用於上述情況,但在添加第 4 項時會失敗。要進行全面測試,請考慮將其添加到上述內容中:
1,7,F4,Delta,111.11 1,,,11 x 22 mm (size), 1,,,Weight: 33.0 kg, 1,,,Glass: DDD, 1,,,Random-1,
所以重要的3點:
- 列中的子行數
desc
可能會有所不同。- 之後的任何子行都
Glass:...
應該被忽略。- 列中可能有沒有任何子行的項目,它們也應該被忽略。
desc
問:如何使用Awk將這些子行重新映射到新列?
(或者是否有更合適的工具在 bash 中執行此操作?)
可能相關(但不是很有幫助)問題:
awk 'BEGIN{ FS=OFS=","; print "page,item,id,name,size,weight,glass,price" } $2!=""{ price=$5; data=$1 FS $2 FS $3 FS $4; desc=""; c=0; next } { gsub(/ ?(mm \(size\)|Weight:|kg|Glass:) ?/, "") } ++c<=3{ desc=(desc==""?"":desc OFS) $4; next } data { print data, desc, price; data="" } ' infile
包括解釋:
awk 'BEGIN{ FS=OFS=","; print "page,item,id,name,size,weight,glass,price" } #this block will be executed only once before reading any line, and does: #set FS (Field Separator), OFS (Output Field Separator) to a comma character #print the "header" line .... $2!=""{ price=$5; data=$1 FS $2 FS $3 FS $4; desc=""; c=0; next } #this blocks will be executed only when column#2 value was not empty, and does: #backup column#5 into "price" variable #also backup columns#1~4 into "data" variable #reset the "desc" variable and also counter variable "c" #then read next line and skip processing the rest of the code { gsub(/ ?(mm \(size\)|Weight:|kg|Glass:) ?/, "") } #this block runs for every line and replace strings above with empty string ++c<=3{ desc=(desc==""?"":desc OFS) $4; next } #this block runs at most 3reps and #joining the descriptions in column#4 of every line #and read the next line until counter var "c" has value <=3 data { print data, desc, price; data="" } #if "data" variable has containing any data, then #print the data, desc, price and empty "data" variable ' infile