Command-Line
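Pairwise combinations of strings based on two columns
I am trying to get the pairwise combinations of the strings available for each data stack, that is, for each gene.
The input file contains two columns: col1 is the gene name and col2 is the name of a stressor.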

    gene1 FishKairomones
    gene1 Microcystin
    gene1 Calcium
    gene2 Cadmium
    gene2 Microcystis
    gene2 FishKairomones
    gene2 Phosphorous
    gene3 FishKairomones
    gene3 Microcystin
    gene3 Phosphorous
    gene3 Cadmium

So, as the table shows, gene1 responds to 3 stressors: fish kairomones, microcystin, and calcium.
I would like to get a pairwise table like this:

    gene1 FishKairomones gene1 Microcystin
    gene1 FishKairomones gene1 Calcium
    gene1 Microcystin gene1 Calcium
    gene2 Cadmium gene2 Microcystis
    gene2 Cadmium gene2 FishKairomones
    gene2 Cadmium gene2 Phosphorous
    gene2 Microcystis gene2 FishKairomones
    gene2 Microcystis gene2 Phosphorous
    gene2 FishKairomones gene2 Phosphorous

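As you can see, gene1 FishKairomones is paired with gene1 Microcystin, gene1 FishKairomones is also paired with gene1 Calcium, and gene1 Microcystin is paired with gene1 Calcium. I would like to do the same for every gene.
A gene may have 3 stressors, or 4, and so on.
I tried the code from here: Command line tool to “cat” pairwise expansion of all rows in a file
That creates all pairwise combinations across the entire file, which is not what I want.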
**AWK** solution (works even for unordered input lines):

    awk '{ a[$1]=($1 in a? a[$1]",":"")$2 }   # grouping `stressors` by `gene` names
    END {
        for (k in a) {                         # for each `gene`
            len=split(a[k], b, ",");           # split `stressors` string into array b
            for (i=1;i<len;i++)                # construct pairwise combinations
                for (j=i+1;j<=len;j++)         # between `stressors`
                    print k,b[i],k,b[j]
        }
    }' file

Output:

    gene1 FishKairomones gene1 Microcystin
    gene1 FishKairomones gene1 Calcium
    gene1 Microcystin gene1 Calcium
    gene2 Cadmium gene2 Microcystis
    gene2 Cadmium gene2 FishKairomones
    gene2 Cadmium gene2 Phosphorous
    gene2 Microcystis gene2 FishKairomones
    gene2 Microcystis gene2 Phosphorous
    gene2 FishKairomones gene2 Phosphorous
    gene3 FishKairomones gene3 Microcystin
    gene3 FishKairomones gene3 Phosphorous
    gene3 FishKairomones gene3 Cadmium
    gene3 Microcystin gene3 Phosphorous
    gene3 Microcystin gene3 Cadmium
    gene3 Phosphorous gene3 Cadmium
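
One caveat: awk does not specify the order in which `for (k in a)` visits array keys, so the genes may not always come out in the order shown above. If that order matters, a possible variant is sketched below. It records each gene the first time it appears and walks that list in the `END` block. The function name `pairwise_by_gene` and the `order` array are illustrative names only, and the sketch assumes whitespace-separated input with no commas inside stressor names.

```shell
#!/bin/sh
# Sketch: per-gene pairwise combinations, with genes printed in first-seen order.
# pairwise_by_gene is a hypothetical helper name; it reads the files given as
# arguments, or standard input when called without arguments.
pairwise_by_gene() {
    awk '
    NF >= 2 {
        if ($1 in a) {
            a[$1] = a[$1] "," $2        # append stressor to an already seen gene
        } else {
            order[++n] = $1             # remember the first appearance of this gene
            a[$1] = $2
        }
    }
    END {
        for (g = 1; g <= n; g++) {      # walk genes in first-seen order
            k = order[g]
            len = split(a[k], b, ",")   # stressors of gene k, in input order
            for (i = 1; i < len; i++)
                for (j = i + 1; j <= len; j++)
                    print k, b[i], k, b[j]
        }
    }' "$@"
}
```

Calling `pairwise_by_gene file` should print the same pairs as the one-liner above, with the genes in the order they first occur in `file`.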