Shell-Script
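Create a table with the frequencies of unique names retrieved from multiple .csv files
I have 32 CSV files containing information fetched from a database. I need to make a frequency table in TSV/CSV format in which the row names are the file names and the column names are the unique names found across all of the files. The table then needs to be filled with the frequency count of each name in each file. The biggest problem is that not all files contain the same extracted names.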
`.csv` input:

    $ cat file_1
    name_of_sequence,C cc,'other_information'
    name_of_sequence,C cc,'other_information'
    name_of_sequence,C cc,'other_information'
    name_of_sequence,D dd,'other_information'
    ...
    $ cat file_2
    name_of_sequence,B bb,'other_information'
    name_of_sequence,C cc,'other_information'
    name_of_sequence,C cc,'other_information'
    name_of_sequence,C cc,'other_information'
    ...
    $ cat file_3
    name_of_sequence,A aa,'other_information'
    name_of_sequence,A aa,'other_information'
    name_of_sequence,A aa,'other_information'
    name_of_sequence,A aa,'other_information'
    ...

`.csv/.tsv` output:

    taxa,A aa,B bb,C cc,D dd
    File_1,0,0,3,1
    File_2,0,1,3,0
    File_3,4,0,0,0
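Using bash, I know how to `cut` the second column, `sort` and `uniq` the names, and then get the count of each name in each file. What I don't know is how to build a table that shows all the names and their counts and puts a 0 wherever a name does not occur in a file. I usually use Bash to sort data, but a Python script would work too.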
The following should work with Python 2 and 3. Save it as `xyz.py` and run it with `python xyz.py file_1 file_2 file_3`:

    import sys
    import csv

    names = set()  # to keep track of all sequence names
    files = {}     # map of file_name to dict of sequence names mapped to counts

    # counting
    for file_name in sys.argv[1:]:
        # look up the file_name, creating a new dict if it is not in files yet
        b = files.setdefault(file_name, {})
        with open(file_name) as fp:
            for line in fp:
                # split on commas: the names in the sample input contain
                # spaces ("C cc"), so a whitespace split would pick the
                # wrong field
                x = line.strip().split(',')
                if len(x) < 2:
                    continue  # skip blank or malformed lines
                names.add(x[1])  # might be a new sequence name
                # retrieve the counter for this name, or set it if not there
                # yet. A plain "i += 1" would not work, as you would need to
                # assign the result to b[x[1]] again. The list "[0]", however,
                # is a reference and can be updated in place.
                b.setdefault(x[1], [0])[0] += 1

    # output
    names = sorted(names)  # sort the unique sequence names for the columns
    grid = []

    # create top line
    top_line = ['taxa']
    grid.append(top_line)
    for name in names:
        top_line.append(name)

    # append each file's values to the grid
    for file_name in sys.argv[1:]:
        data = files[file_name]
        line = [file_name]
        grid.append(line)
        for name in names:
            line.append(data.get(name, [0])[0])  # 0 if sequence name not in file

    # dump the grid to CSV
    with open('out.csv', 'w') as fp:
        writer = csv.writer(fp)
        writer.writerows(grid)
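Using a `[0]` list as the counter makes it easier to update the value in place than using an integer directly. If the input files are more complex, it is better to read them with Python's `csv` library.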
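As a rough illustration of that last suggestion, here is a minimal sketch that reads the files with `csv.reader` and counts with `collections.Counter` instead of the `[0]` list trick. It assumes Python 3, comma-separated input, and that the name is always in the second column. The function names `count_names` and `write_table` are made up for this example and are not part of the script above.

```python
import csv
import sys
from collections import Counter


def count_names(paths, column=1):
    """Return ({path: Counter of names}, sorted list of all names seen)."""
    counts = {}
    for path in paths:
        with open(path, newline='') as fp:
            # Count the value in the chosen column; skip rows that are too short.
            counts[path] = Counter(
                row[column] for row in csv.reader(fp) if len(row) > column
            )
    # Union of the names from every file gives the table's columns.
    all_names = sorted(set().union(*counts.values()))
    return counts, all_names


def write_table(counts, all_names, out):
    """Write one header row, then one row of counts per input file."""
    writer = csv.writer(out)
    writer.writerow(['taxa'] + all_names)
    for path, counter in counts.items():
        # A Counter returns 0 for names it has never seen.
        writer.writerow([path] + [counter[name] for name in all_names])


if __name__ == '__main__':
    per_file, names = count_names(sys.argv[1:])
    write_table(per_file, names, sys.stdout)
```

Because a `Counter` returns 0 for missing keys, the "0 when the name is not in the file" case needs no special handling. This version prints the table to standard output, so it could be run as, for example, `python3 table.py file_1 file_2 file_3 > out.csv` (the script name is arbitrary).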