Remove duplicates from a CSV based on specified columns

December 12, 2021

I am working with a CSV dataset that looks like this:
year,manufacturer,brand,series,variation,card_number,card_title,sport,team
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,,
2015,Leaf,Metal Draft,Touchdown Kings,Die-Cut Autographs Blue Prismatic,TDK-DF1,Darren Smith,Football,
2015,Leaf,Metal Draft,Touchdown Kings,Die-Cut Autographs Blue Prismatic,TDK- DF1,Darren Smith,Football,
2015,Leaf,Trinity,Patch Autograph,Bronze,PA-DJ2,Duke Johnson,Football,
2015,Leaf,Army All-American Bowl,5-Star Future Autographs,,FSF-RG1,Rasheem Green,Soccer,
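It contains many duplicates that I need to remove, keeping one instance of each record. Following "Remove duplicate entries from a CSV file", I used sort -u file.csv --o deduped-file.csv, which works for exact duplicates such as: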
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
but it does not catch cases like this:
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,,
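where one record is incomplete but both represent the same item.
Is it possible to remove duplicates based on specified fields only (e.g. year, manufacturer, brand, series, variation)?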

I would build a "key" from the first 5 fields, then print a line only the first time that key is seen:

awk -F, '
 {key = $1 FS $2 FS $3 FS $4 FS $5}
 !seen[key]++ 
' file

year,manufacturer,brand,series,variation,card_number,card_title,sport,team
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Metal Draft,Touchdown Kings,Die-Cut Autographs Blue Prismatic,TDK-DF1,Darren Smith,Football,
2015,Leaf,Trinity,Patch Autograph,Bronze,PA-DJ2,Duke Johnson,Football,
2015,Leaf,Army All-American Bowl,5-Star Future Autographs,,FSF-RG1,Rasheem Green,Soccer,
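
If the key columns need to vary, the same idea can take the column list as a parameter instead of hard-coding $1 through $5. The following is only a minimal sketch and not part of the original answer: dedupe_by_cols is a hypothetical helper name, and it assumes a simple CSV with no quoted fields containing commas, because -F, splits on every comma.

```shell
# Hypothetical helper (not from the original answer):
#   dedupe_by_cols COLS FILE
# COLS is a comma-separated list of 1-based column numbers, e.g. 1,2,3,4,5.
# Assumes no quoted fields with embedded commas.
dedupe_by_cols() {
  awk -F, -v cols="$1" '
    BEGIN { n = split(cols, idx, ",") }   # idx[1..n] holds the key column numbers
    {
      key = ""
      for (i = 1; i <= n; i++)            # join the chosen fields into one key
        key = key FS $(idx[i])
      if (!seen[key]++) print             # print only the first line for each key
    }
  ' "$2"
}

# Usage: same key as the answer above (first five fields)
# dedupe_by_cols 1,2,3,4,5 file.csv > deduped-file.csv
```

As in the answer above, the first line seen for each key is the one that is kept, so a more complete record is preserved only if it happens to appear before the incomplete one.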

Source: https://unix.stackexchange.com/questions/681059
