Bash
awk:計算CSV給定列中元素的出現次數
我正在處理多列數據文件的分析(這裡是10行的例子,但真正的日誌會包含150行!)格式如下:
ID(Prot), ID(lig), ID(cluster), dG(rescored), dG(before), POP(before) 9000, lig662, 1, 0.421573, -7.8400, 153 10V2, lig662, 1, 0.42692, -8.0300, 149 3000, lig158, 1, 0.427342, -8.1900, 147 3000, lig158, 1, 0.427342, -8.1900, 147 10V2, lig342, 1, 0.432943, -9.4200, 137 10V1, lig807, 1, 0.434338, -8.0300, 147 10V2, lig369, 1, 0.440377, -7.3200, 156 10V1, lig342, 1, 0.441205, -9.4200, 135 10V1, lig369, 1, 0.465029, -7.3600, 148 10V1, lig158, 1, 0.504513, -7.4800, 135
從這些數據中,我需要關注第一列(如
9000
或)10V1
以及3000
第二列(如 ligXXX)中的索引。特別是我需要列印兩列中的前三個索引以及它們在 CSV 的所有行中出現的次數(從而指示兩列中最常見的索引):TOP PROT; TOP LIG 10V1 (number of cases:4), lig 158 (number of cases: 3) 10V2 (number of cases:3), lig 662 (number of cases: 2) 3000 (number of cases: 2), lig 369 (number of cases: 2)
可以直接應用 AWK 來計算所選列中的出現次數,這些列可以排序等。
awk '{print $1}' file.csv | sort | uniq -c
我需要為列和按出現的排名發展這個想法。
對數組數組和 sorted_in 使用 GNU awk:
$ cat tst.awk BEGIN { FS=", *"; OFS=", " } NR > 1 { cnts[1][$1]++ cnts[2][$2]++ } END { numRows = 3 numCols = 2 PROCINFO["sorted_in"] = "@val_num_desc" for (colNr=1; colNr<=numCols; colNr++) { rowNr = 0 for (key in cnts[colNr]) { vals[++rowNr][colNr] = sprintf("%s (number of cases: %d)", key, cnts[colNr][key]) } } print "TOP PROT", "TOP LIG" for (rowNr=1; rowNr<=numRows; rowNr++) { for (colNr=1; colNr<=numCols; colNr++) { printf "%s%s", vals[rowNr][colNr], (colNr<numCols ? OFS : ORS) } } }
$ awk -f tst.awk file TOP PROT, TOP LIG 10V1 (number of cases: 4), lig158 (number of cases: 3) 10V2 (number of cases: 3), lig662 (number of cases: 2) 3000 (number of cases: 2), lig369 (number of cases: 2)
使用 GNU awk
gawk -F',[[:blank:]]+' -v N=3 ' { count["prot"][$1]++ count["lig"][$2]++ } function show(thing, n, id) { print "TOP " toupper(thing) n = N for (id in count[thing]) { printf "%s (number of cases: %d)\n", id, count[thing][id] if (--n == 0) break } } END { PROCINFO["sorted_in"] = "@val_num_desc" show("prot") show("lig") } ' file.csv | pr -2Ts$'\t' | sed 's/\t/, /'
TOP PROT, TOP LIG 10V1 (number of cases: 4), lig158 (number of cases: 3) 10V2 (number of cases: 3), lig662 (number of cases: 2) 3000 (number of cases: 2), lig369 (number of cases: 2)