Text-Processing
Remove duplicates from a CSV based on specified columns
I am working with a CSV dataset that looks like this:
year,manufacturer,brand,series,variation,card_number,card_title,sport,team
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,,
2015,Leaf,Metal Draft,Touchdown Kings,Die-Cut Autographs Blue Prismatic,TDK-DF1,Darren Smith,Football,
2015,Leaf,Metal Draft,Touchdown Kings,Die-Cut Autographs Blue Prismatic,TDK- DF1,Darren Smith,Football,
2015,Leaf,Trinity,Patch Autograph,Bronze,PA-DJ2,Duke Johnson,Football,
2015,Leaf,Army All-American Bowl,5-Star Future Autographs,,FSF-RG1,Rasheem Green,Soccer,
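It contains many duplicates that I need to remove (keeping one instance of each record). Following "Remove duplicate entries from a CSV file", I used: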
sort -u file.csv --o deduped-file.csv
It works for examples such as:
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
But it does not catch examples like this:
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,,
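Here the data is incomplete, but both rows represent the same thing.
Is it possible to remove duplicates based on specified fields (e.g. year, manufacturer, brand, series, variation)?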
I would build a "key" from the first 5 fields, and then print a line only the first time that key is seen:
awk -F, '
    {key = $1 FS $2 FS $3 FS $4 FS $5}   # build the key from the first 5 fields
    !seen[key]++                          # print only the first line seen for each key
' file
year,manufacturer,brand,series,variation,card_number,card_title,sport,team
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Metal Draft,Touchdown Kings,Die-Cut Autographs Blue Prismatic,TDK-DF1,Darren Smith,Football,
2015,Leaf,Trinity,Patch Autograph,Bronze,PA-DJ2,Duke Johnson,Football,
2015,Leaf,Army All-American Bowl,5-Star Future Autographs,,FSF-RG1,Rasheem Green,Soccer,
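If the key columns are likely to change, the same idea can be wrapped in a small shell function that takes the column numbers as an argument. This is only a sketch: the name dedupe_by_cols is made up for illustration, and like the command above it splits on every comma, so it assumes no field contains a quoted comma. It also keeps whichever row appears first for a key, so an incomplete row wins if it comes before the complete one.

```shell
# Hypothetical helper (not a standard tool): print only the first row seen
# for each distinct combination of the listed 1-based column numbers.
# Usage: dedupe_by_cols 1,2,3,4,5 < file.csv > deduped-file.csv
dedupe_by_cols() {
    awk -F, -v cols="$1" '
        BEGIN { n = split(cols, idx, ",") }     # idx[1..n] hold the key column numbers
        {
            key = ""
            for (i = 1; i <= n; i++)
                key = key FS $(idx[i] + 0)      # join the key fields with the separator
        }
        !seen[key]++                            # true only the first time a key appears
    '
}
```

With the sample data, dedupe_by_cols 1,2,3,4,5 < file.csv should give the same result as the one-liner, since it builds the key from the same five fields.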