Text-Processing

根據字元串 z 用字元串 y 替換字元串 x

  • April 20, 2022

(已編輯)我有一個包含相當多問題的文件。例如如下:

Chr1_RagTag_p   AUGUSTUS    transcript  393571  396143  .   +   .   transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; gene_name "AT1G02100"; oId "g109.t1"; cmp_ref "AT1G02100.1"; class_code "="; tss_id "TSS64"; num_samples "1";
Chr1_RagTag_p   AUGUSTUS    exon    393571  393638  .   +   .   transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "1";
Chr1_RagTag_p   AUGUSTUS    exon    393732  393945  .   +   .   transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "2";
Chr1_RagTag_p   AUGUSTUS    exon    394047  394094  .   +   .   transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "3";
Chr1_RagTag_p   AUGUSTUS    exon    394178  394259  .   +   .   transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "4";
Chr1_RagTag_p   AUGUSTUS    exon    394457  394559  .   +   .   transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "5";
Chr1_RagTag_p   AUGUSTUS    exon    394698  394818  .   +   .   transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "6";
Chr1_RagTag_p   AUGUSTUS    exon    394911  394958  .   +   .   transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "7";
Chr1_RagTag_p   AUGUSTUS    exon    395153  395236  .   +   .   transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "8";
Chr1_RagTag_p   AUGUSTUS    exon    395347  395411  .   +   .   transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "9";
Chr1_RagTag_p   AUGUSTUS    exon    395716  395767  .   +   .   transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "10";
Chr1_RagTag_p   AUGUSTUS    exon    395957  395995  .   +   .   transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "11";
Chr1_RagTag_p   AUGUSTUS    exon    396069  396143  .   +   .   transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "12";
Chr1_RagTag_p   manual  transcript  396451  399224  .   +   .   transcript_id "TCONS_00000071"; gene_id "XLOC_000060"; gene_name "AT1G02110"; oId "g110.t1"; cmp_ref "AT1G02110.1"; class_code "="; tss_id "TSS65"; num_samples "2";
Chr1_RagTag_p   manual  exon    396451  397570  .   +   .   transcript_id "TCONS_00000071"; gene_id "XLOC_000060"; exon_number "1";
Chr1_RagTag_p   manual  exon    397661  397848  .   +   .   transcript_id "TCONS_00000071"; gene_id "XLOC_000060"; exon_number "2";
Chr1_RagTag_p   manual  exon    397923  398146  .   +   .   transcript_id "TCONS_00000071"; gene_id "XLOC_000060"; exon_number "3";
Chr1_RagTag_p   manual  exon    398367  399224  .   +   .   transcript_id "TCONS_00000071"; gene_id "XLOC_000060"; exon_number "4";
Chr1_RagTag_p   AUGUSTUS    transcript  77905   78201   .   +   .   transcript_id "TCONS_00000004"; gene_id "XLOC_000004"; oId "g15.t1"; cmp_ref "AT1G01150.1"; class_code "x"; cmp_ref_gene "AT1G01150"; tss_id "TSS4"; num_samples "1";
Chr1_RagTag_p   AUGUSTUS    exon    77905   78201   .   +   .   transcript_id "TCONS_00000004"; gene_id "XLOC_000004"; exon_number "1";

您可以清楚地看到根據 的範例中有兩個基因gene_name,但是軟體不知何故將這兩個基因合併為一個基因(gene_id)。我想通過使用相同的.gene_name替換gene_idfor 行來糾正這些問題transcript_name。另外,對於第三個基因(gene_id "XLOC_000004"),我想用oId沒有部分的東西來代替它的名字.t[0-9]

輸出是這樣的

Chr1_RagTag_p   AUGUSTUS    transcript  393571  396143  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; gene_name "AT1G02100"; oId "g109.t1"; cmp_ref "AT1G02100.1"; class_code "="; tss_id "TSS64"; num_samples "1";
Chr1_RagTag_p   AUGUSTUS    exon    393571  393638  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "1";
Chr1_RagTag_p   AUGUSTUS    exon    393732  393945  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "2";
Chr1_RagTag_p   AUGUSTUS    exon    394047  394094  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "3";
Chr1_RagTag_p   AUGUSTUS    exon    394178  394259  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "4";
Chr1_RagTag_p   AUGUSTUS    exon    394457  394559  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "5";
Chr1_RagTag_p   AUGUSTUS    exon    394698  394818  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "6";
Chr1_RagTag_p   AUGUSTUS    exon    394911  394958  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "7";
Chr1_RagTag_p   AUGUSTUS    exon    395153  395236  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "8";
Chr1_RagTag_p   AUGUSTUS    exon    395347  395411  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "9";
Chr1_RagTag_p   AUGUSTUS    exon    395716  395767  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "10";
Chr1_RagTag_p   AUGUSTUS    exon    395957  395995  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "11";
Chr1_RagTag_p   AUGUSTUS    exon    396069  396143  .   +   .   transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "12";
Chr1_RagTag_p   manual  transcript  396451  399224  .   +   .   transcript_id "TCONS_00000071"; gene_id "AT1G02110"; gene_name "AT1G02110"; oId "g110.t1"; cmp_ref "AT1G02110.1"; class_code "="; tss_id "TSS65"; num_samples "2";
Chr1_RagTag_p   manual  exon    396451  397570  .   +   .   transcript_id "TCONS_00000071"; gene_id "AT1G02110"; exon_number "1";
Chr1_RagTag_p   manual  exon    397661  397848  .   +   .   transcript_id "TCONS_00000071"; gene_id "AT1G02110"; exon_number "2";
Chr1_RagTag_p   manual  exon    397923  398146  .   +   .   transcript_id "TCONS_00000071"; gene_id "AT1G02110"; exon_number "3";
Chr1_RagTag_p   manual  exon    398367  399224  .   +   .   transcript_id "TCONS_00000071"; gene_id "AT1G02110"; exon_number "4";
Chr1_RagTag_p   AUGUSTUS    transcript  77905   78201   .   +   .   transcript_id "TCONS_00000004"; gene_id "g15"; oId "g15.t1"; cmp_ref "AT1G01150.1"; class_code "x"; cmp_ref_gene "AT1G01150"; tss_id "TSS4"; num_samples "1";
Chr1_RagTag_p   AUGUSTUS    exon    77905   78201   .   +   .   transcript_id "TCONS_00000004"; gene_id "g15"; exon_number "1";

我猜邏輯是 grep gene_nameby transcript_id,然後根據數字替換gene_id每一行中的。gene_name``transcript_id

到目前為止,我已經創建了一個transcript_id列表gene_name

TCONS_00000070 AT1G02100
TCONS_00000071 AT1G02110
TCONS_00000004 g15

然後我需要根據相關替換每一gene_id行中的。但我不知道該怎麼做。我可以使用嗎?gene_name``transcript_id``sed

最後但同樣重要的是,我應該提到我的真實文件很大,其中包含 35000 個不同的trnascript_id.

提前非常感謝!

我無法判斷這是否會涵蓋所有邊緣情況,但是對於您的範例,它可以滿足您的要求:

sed '/.*gene_name/{h;s///;s/;.*//;x;};G;s/gene_id[^;]*\(.*\)\n\(.*\)/gene_id\2\1/' file

它所做的是提取gene_name如果存在,將其儲存在保持空間中並將其用作所有後續gene_ids 的替換,直到出現新gene_name的。

更詳細:

  • /.*gene_name/是一個地址,所以里面的所有內容{}都只會應用於具有該模式的行
  • 在我們搞砸一切之前,我們將原始行儲存到h舊空間
  • s///刪除前一個模式(直到 的所有內容gene_name);s/;.*//刪除從分號開始的所有內容。所以剩下的是空格和雙引號中的字元串。
  • x交換兩個空間,到現在我們在保持空間中有替換,在模式空間中有原始行
  • 從現在開始的所有內容都應用於所有行:G將保留空間附加到每一行,所以我們有行、換行符和替換
  • s/gene_id[^;]*\(.*\)\n\(.*\)/gene_id\2\1/' is easier to write than to read: $$ ^; $$matches everything between基因ID and the, thus the part to be replaced. The (. ) parts cover the text before and after the embedded newline, so we can refer to them as\1 and\2` 替換。

gene_name為了說明最後一步,請查看在將保留空間與with<CR>作為嵌入式換行符附加後緩衝區的外觀:

Chr1_RagTag_p ………; gene_id "XLOC_000060"; exon_number "1";<CR> "AT1G02100"
                         \______v_____/\________v_______/    \____v____/
                  gene_id     [^;]*          \(.*\)       \n    \(.*\)

使用擴展正則表達式(選項)可能更容易閱讀-E

sed -E '/.*gene_name/{h;s///;s/;.*//;x;};G;s/(gene_id)[^;]*(.*)\n(.*)/\1\3\2/' file

考慮更新問題中的 no gene_namecase 進行更新

我只是簡單地添加了與提取oId類似的gene_name提取,但在它之前。因此,如果之後出現 a gene_name,它將覆蓋oId. 這次在單獨的行中以提高可讀性:

sed '
 /.*oId/{
   h
   s///
   s/\..*/"/
   x
 }
 /.*gene_name/{
   h
   s///
   s/;.*//
   x
 }
 G
 s/gene_id[^;]*\(.*\)\n\(.*\)/gene_id\2\1/' file

引用自:https://unix.stackexchange.com/questions/699608