Sed
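Merge two CSVs, comparing only by a specific column
I want to compare two CSV files that have the format shown below. They have no header row. I want to compare them by a specific column (in this case, the second column).
The source CSV files are about 4-5 GB each, so loading them into memory is not an option.
Each row of new.csv should be written to out.csv if old.csv has no row with a matching value in that column.
Column 2 will be an HTML link; for simplicity it is just a single word here.
My question is: can the same result be achieved with sed, awk, join, or grep?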
old.csv
"person"|"john"|"smith"
"person"|"anne"|"frank"
"person"|"bob"|"macdonald"
"fruit"|"orange"|"banana"
"fruit"|"strawberry"|"fields"
"fruit"|"ringring"|"banana"
new.csv
"person"|"john"|"smith"
"person"|"anne"|"frank"
"person"|"bob"|"macdonald"
"fruit"|"orange"|"banana"
"fruit"|"strawberry"|"fields"
"glider"|"person"|"airport"
"fruit"|"ringring"|"banana"
"glider"|"person2"|"airport"
diff.py
#!/usr/bin/env python3
"""
Source: https://gist.github.com/davidrleonard/4dbeebf749248a956e44
Usage:
    $ ./csv-difference.py -d new.csv -s old.csv -o out.csv -c 1
"""
import sys
import argparse
import csv


def main():
    parser = argparse.ArgumentParser(description='Output difference in CSVs.')
    parser.add_argument('-d', '--dataset', help='A CSV file of the full dataset', required=True)
    parser.add_argument('-s', '--subset', help='A CSV file that is a subset of the full dataset', required=True)
    parser.add_argument('-o', '--output', help='The CSV file we should write to (will be overwritten if it exists)', required=True)
    parser.add_argument('-c', '--column', help='A number of the column to be compared (0 is column 1, 1 is column 2, etc.)', required=True, type=int)
    args = parser.parse_args()

    dataset_file = args.dataset
    subset_file = args.subset
    output_file = args.output
    column_num = args.column

    with open(dataset_file, 'r') as datafile, open(subset_file, 'r') as subsetfile, open(output_file, 'w') as outputfile:
        data = {row[column_num]: row for row in csv.reader(datafile, delimiter='|', quotechar='"')}
        subset = {row[column_num]: row for row in csv.reader(subsetfile, delimiter='|', quotechar='"')}
        data_keys = set(data.keys())
        subset_keys = set(subset.keys())
        output_keys = data_keys - subset_keys
        output = [data[key] for key in output_keys]
        output_csv = csv.writer(outputfile, delimiter='|', quotechar='"', quoting=csv.QUOTE_ALL)
        for row in output:
            output_csv.writerow(row)


if __name__ == '__main__':
    main()
    sys.stdout.flush()
This produces:
out.csv
"glider"|"person"|"airport"
"glider"|"person2"|"airport"
With awk this is very simple:
$ awk -F'|' 'NR == FNR {old[$2]; next} !($2 in old)' old.csv new.csv
"glider"|"person"|"airport"
"glider"|"person2"|"airport"
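While reading old.csv (the first file, where NR == FNR), it stores the second field of each record as a key in an array named "old". Then, for new.csv, it prints every record whose second field is not a key in "old". Only the second-column values of old.csv are held in memory, not whole rows.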
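For readers more comfortable with the Python from the question, the same two-pass idea can be sketched as follows. This is a minimal sketch, not the original script: the function name rows_not_in_old is hypothetical, the sample data is an abbreviated copy of the files above, and like the awk one-liner it splits naively on "|" without regard to quoting.

```python
import io


def rows_not_in_old(old_lines, new_lines, col=1, sep="|"):
    """Yield lines from new_lines whose col-th field never occurs in old_lines.

    Hypothetical helper for illustration; col is 0-based, so col=1 is the
    second column, mirroring $2 in the awk command.
    """
    # First pass: keep only the key column of the old file in memory.
    old_keys = {line.rstrip("\n").split(sep)[col] for line in old_lines}
    # Second pass: stream the new file and emit rows with an unseen key.
    for line in new_lines:
        if line.rstrip("\n").split(sep)[col] not in old_keys:
            yield line


# Abbreviated sample data; in-memory streams stand in for the real files.
old_csv = io.StringIO(
    '"person"|"john"|"smith"\n'
    '"fruit"|"orange"|"banana"\n'
)
new_csv = io.StringIO(
    '"person"|"john"|"smith"\n'
    '"glider"|"person"|"airport"\n'
    '"fruit"|"orange"|"banana"\n'
    '"glider"|"person2"|"airport"\n'
)

result = list(rows_not_in_old(old_csv, new_csv))
print("".join(result), end="")
```

With real files, the two StringIO objects would be replaced by open("old.csv") and open("new.csv"); because both are iterated line by line, only the set of old keys stays in memory.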
Admittedly, this does not respect pipe characters that appear inside quotes. For that, I like Ruby's csv module:
ruby -rcsv -e '
  old_col2 = []
  old_data = CSV.foreach("./old.csv", :col_sep => "|") do |row|
    old_col2 << row[1]
  end
  CSV.foreach("./new.csv", :col_sep => "|") do |row|
    if not old_col2.include?(row[1])
      puts CSV.generate_line(row, :col_sep => "|", :force_quotes => true)
    end
  end
'
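To illustrate why quote-aware parsing matters here, the short Python sketch below compares a naive split on "|" with the standard csv module on a single row. The row and its example.com URL are made up for illustration; they are not taken from the question's data.

```python
import csv
import io

# Hypothetical row whose second field contains a pipe inside the quotes.
line = '"link"|"http://example.com/a|b"|"note"\n'

# Naive split, which is effectively what awk -F"|" does: the quoted pipe
# is treated as a separator and the row falls apart into four pieces.
naive_fields = line.rstrip("\n").split("|")

# Quote-aware parsing keeps the quoted pipe inside the second field.
csv_fields = next(csv.reader(io.StringIO(line), delimiter="|", quotechar='"'))

print(len(naive_fields), naive_fields[1])
print(len(csv_fields), csv_fields[1])
```

Since column 2 holds links in the real data, a quoted pipe in a URL would make the naive comparison key differ from the actual field value.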