Command-Line

Pairwise combinations of strings based on two columns

  • November 7, 2017

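I am trying to get pairwise combinations of the strings that are available within each group of data.

The input file contains two columns: col1 is the gene name and col2 is the name of a stressor.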

       gene1   FishKairomones
       gene1   Microcystin
       gene1   Calcium
       gene2   Cadmium
       gene2   Microcystis
       gene2   FishKairomones
       gene2   Phosphorous
       gene3   FishKairomones
       gene3   Microcystin
       gene3   Phosphorous
       gene3   Cadmium

As the table shows, gene1 responds to 3 stressors: FishKairomones, Microcystin, and Calcium.

I would like to get a pairwise table like this:

   gene1   FishKairomones  gene1   Microcystin
   gene1   FishKairomones  gene1   Calcium
   gene1   Microcystin gene1   Calcium
   gene2   Cadmium gene2   Microcystis
   gene2   Cadmium gene2   FishKairomones
   gene2   Cadmium gene2   Phosphorous
   gene2   Microcystis gene2   FishKairomones
   gene2   Microcystis gene2   Phosphorous
   gene2   FishKairomones  gene2   Phosphorous

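As you can see, gene1 FishKairomones is paired with gene1 Microcystin, gene1 FishKairomones is also paired with gene1 Calcium, and gene1 Microcystin is paired with gene1 Calcium. I would like to do the same for every gene.

A gene may have 3 stressors, or 4, and so on.

I tried the code from here: Command line tool to “cat” pairwise expansion of all rows in a file

That creates all pairwise combinations across the whole file, which is not what I want; I need the combinations within each gene only.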

**AWK** solution (works even when the input lines are not sorted by gene):

awk '{ a[$1]=($1 in a? a[$1]",":"")$2 }   # grouping `stressors` by `gene` names
    END { 
        for (k in a) {                   # for each `gene`
            len=split(a[k], b, ",");     # split `stressors` string into array b
            for (i=1;i<len;i++)          # construct pairwise combinations
                for (j=i+1;j<=len;j++)   # between `stressors` 
                    print k,b[i],k,b[j] 
        } 
    }' file

Output:

gene1 FishKairomones gene1 Microcystin
gene1 FishKairomones gene1 Calcium
gene1 Microcystin gene1 Calcium
gene2 Cadmium gene2 Microcystis
gene2 Cadmium gene2 FishKairomones
gene2 Cadmium gene2 Phosphorous
gene2 Microcystis gene2 FishKairomones
gene2 Microcystis gene2 Phosphorous
gene2 FishKairomones gene2 Phosphorous
gene3 FishKairomones gene3 Microcystin
gene3 FishKairomones gene3 Phosphorous
gene3 FishKairomones gene3 Cadmium
gene3 Microcystin gene3 Phosphorous
gene3 Microcystin gene3 Cadmium
gene3 Phosphorous gene3 Cadmium
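
One caveat: awk does not define the order in which `for (k in a)` visits the keys, so the genes are not guaranteed to come out as gene1, gene2, gene3 in every awk implementation. If that matters, a variant along the following lines may help. It is only a sketch, not part of the original answer: the function name `pairs_by_gene` and the array names `order`, `cnt` and `item` are made up for illustration, and it swaps the comma-joined string for a per-gene counter plus a two-index array, so stressor names containing commas are not split.

```shell
# pairs_by_gene: hypothetical helper name chosen for this sketch.
# Reads "gene stressor" lines from the given files (or stdin) and prints
# every pair of stressors within each gene, genes in order of first appearance.
pairs_by_gene() {
    awk '
    NF >= 2 {
        if (!($1 in cnt)) order[++n] = $1   # first time this gene is seen
        item[$1, ++cnt[$1]] = $2            # append the stressor to that gene
    }
    END {
        for (g = 1; g <= n; g++) {          # genes in input order
            k = order[g]
            for (i = 1; i < cnt[k]; i++)    # all pairs i < j within the gene
                for (j = i + 1; j <= cnt[k]; j++)
                    print k, item[k, i], k, item[k, j]
        }
    }' "$@"
}

# Small interleaved input to show that grouping does not depend on line order.
printf '%s\n' 'gene2 Cadmium' 'gene1 Calcium' 'gene2 Microcystis' \
              'gene1 Microcystin' 'gene2 Phosphorous' | pairs_by_gene
# gene2 Cadmium gene2 Microcystis
# gene2 Cadmium gene2 Phosphorous
# gene2 Microcystis gene2 Phosphorous
# gene1 Calcium gene1 Microcystin
```

For the file in the question it would be called as `pairs_by_gene file`. A gene with n stressors yields n(n-1)/2 lines, so the sample input should give 3 + 6 + 6 = 15 lines in total.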

Source: https://unix.stackexchange.com/questions/403027