Bash

awk:計算CSV給定列中元素的出現次數

  • June 3, 2021

我正在處理多列數據文件的分析(這裡是10行的例子,但真正的日誌會包含150行!)格式如下:

ID(Prot),   ID(lig),    ID(cluster),    dG(rescored),   dG(before), POP(before)
9000,   lig662, 1,  0.421573,   -7.8400,    153
10V2,   lig662, 1,  0.42692,    -8.0300,    149
3000,   lig158, 1,  0.427342,   -8.1900,    147
3000,   lig158, 1,  0.427342,   -8.1900,    147
10V2,   lig342, 1,  0.432943,   -9.4200,    137
10V1,   lig807, 1,  0.434338,   -8.0300,    147
10V2,   lig369, 1,  0.440377,   -7.3200,    156
10V1,   lig342, 1,  0.441205,   -9.4200,    135
10V1,   lig369, 1,  0.465029,   -7.3600,    148
10V1,   lig158, 1,  0.504513,   -7.4800,    135

從這些數據中,我需要關注第一列(如9000或)10V1以及3000第二列(如 ligXXX)中的索引。特別是我需要列印兩列中的前三個索引以及它們在 CSV 的所有行中出現的次數(從而指示兩列中最常見的索引):

TOP PROT; TOP LIG
10V1 (number of cases:4), lig 158 (number of cases: 3)
10V2 (number of cases:3), lig 662 (number of cases: 2)
3000 (number of cases: 2), lig 369 (number of cases: 2)

可以直接應用 AWK 來計算所選列中的出現次數,這些列可以排序等。

awk '{print $1}' file.csv | sort | uniq -c

我需要為列和按出現的排名發展這個想法。

對數組數組和 sorted_in 使用 GNU awk:

$ cat tst.awk
BEGIN { FS=", *"; OFS=", " }
NR > 1 {
   cnts[1][$1]++
   cnts[2][$2]++
}
END {
   numRows = 3
   numCols = 2

   PROCINFO["sorted_in"] = "@val_num_desc"
   for (colNr=1; colNr<=numCols; colNr++) {
       rowNr = 0
       for (key in cnts[colNr]) {
           vals[++rowNr][colNr] = sprintf("%s (number of cases: %d)", key, cnts[colNr][key])
       }
   }

   print "TOP PROT", "TOP LIG"
   for (rowNr=1; rowNr<=numRows; rowNr++) {
       for (colNr=1; colNr<=numCols; colNr++) {
           printf "%s%s", vals[rowNr][colNr], (colNr<numCols ? OFS : ORS)
       }
   }
}
$ awk -f tst.awk file
TOP PROT, TOP LIG
10V1 (number of cases: 4), lig158 (number of cases: 3)
10V2 (number of cases: 3), lig662 (number of cases: 2)
3000 (number of cases: 2), lig369 (number of cases: 2)

使用 GNU awk

gawk -F',[[:blank:]]+' -v N=3 '
   {
       count["prot"][$1]++
       count["lig"][$2]++
   }

   function show(thing,      n, id) {
       print "TOP " toupper(thing)
       n = N
       for (id in count[thing]) {
           printf "%s (number of cases: %d)\n", id,  count[thing][id]
           if (--n == 0) break
       }
   }

   END {
       PROCINFO["sorted_in"] = "@val_num_desc"
       show("prot")
       show("lig")
   }
' file.csv | pr -2Ts$'\t' | sed 's/\t/, /'
TOP PROT, TOP LIG
10V1 (number of cases: 4), lig158 (number of cases: 3)
10V2 (number of cases: 3), lig662 (number of cases: 2)
3000 (number of cases: 2), lig369 (number of cases: 2)

引用自:https://unix.stackexchange.com/questions/652627