Replace string x with string y based on string z
(Edited) I have a file that contains quite a few problems. For example:
Chr1_RagTag_p AUGUSTUS transcript 393571 396143 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; gene_name "AT1G02100"; oId "g109.t1"; cmp_ref "AT1G02100.1"; class_code "="; tss_id "TSS64"; num_samples "1";
Chr1_RagTag_p AUGUSTUS exon 393571 393638 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "1";
Chr1_RagTag_p AUGUSTUS exon 393732 393945 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "2";
Chr1_RagTag_p AUGUSTUS exon 394047 394094 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "3";
Chr1_RagTag_p AUGUSTUS exon 394178 394259 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "4";
Chr1_RagTag_p AUGUSTUS exon 394457 394559 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "5";
Chr1_RagTag_p AUGUSTUS exon 394698 394818 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "6";
Chr1_RagTag_p AUGUSTUS exon 394911 394958 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "7";
Chr1_RagTag_p AUGUSTUS exon 395153 395236 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "8";
Chr1_RagTag_p AUGUSTUS exon 395347 395411 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "9";
Chr1_RagTag_p AUGUSTUS exon 395716 395767 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "10";
Chr1_RagTag_p AUGUSTUS exon 395957 395995 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "11";
Chr1_RagTag_p AUGUSTUS exon 396069 396143 . + . transcript_id "TCONS_00000070"; gene_id "XLOC_000060"; exon_number "12";
Chr1_RagTag_p manual transcript 396451 399224 . + . transcript_id "TCONS_00000071"; gene_id "XLOC_000060"; gene_name "AT1G02110"; oId "g110.t1"; cmp_ref "AT1G02110.1"; class_code "="; tss_id "TSS65"; num_samples "2";
Chr1_RagTag_p manual exon 396451 397570 . + . transcript_id "TCONS_00000071"; gene_id "XLOC_000060"; exon_number "1";
Chr1_RagTag_p manual exon 397661 397848 . + . transcript_id "TCONS_00000071"; gene_id "XLOC_000060"; exon_number "2";
Chr1_RagTag_p manual exon 397923 398146 . + . transcript_id "TCONS_00000071"; gene_id "XLOC_000060"; exon_number "3";
Chr1_RagTag_p manual exon 398367 399224 . + . transcript_id "TCONS_00000071"; gene_id "XLOC_000060"; exon_number "4";
Chr1_RagTag_p AUGUSTUS transcript 77905 78201 . + . transcript_id "TCONS_00000004"; gene_id "XLOC_000004"; oId "g15.t1"; cmp_ref "AT1G01150.1"; class_code "x"; cmp_ref_gene "AT1G01150"; tss_id "TSS4"; num_samples "1";
Chr1_RagTag_p AUGUSTUS exon 77905 78201 . + . transcript_id "TCONS_00000004"; gene_id "XLOC_000004"; exon_number "1";
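You can clearly see from `gene_name` that the example contains two genes, but the software somehow merged them into a single gene (`gene_id`). I want to correct this by replacing `gene_id` with the `gene_name` on all lines that share the same `transcript_id`. In addition, for the third gene (`gene_id "XLOC_000004"`), which has no `gene_name`, I want to use its `oId` without the `.t[0-9]` part instead.

The output should look like this: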
Chr1_RagTag_p AUGUSTUS transcript 393571 396143 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; gene_name "AT1G02100"; oId "g109.t1"; cmp_ref "AT1G02100.1"; class_code "="; tss_id "TSS64"; num_samples "1";
Chr1_RagTag_p AUGUSTUS exon 393571 393638 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "1";
Chr1_RagTag_p AUGUSTUS exon 393732 393945 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "2";
Chr1_RagTag_p AUGUSTUS exon 394047 394094 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "3";
Chr1_RagTag_p AUGUSTUS exon 394178 394259 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "4";
Chr1_RagTag_p AUGUSTUS exon 394457 394559 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "5";
Chr1_RagTag_p AUGUSTUS exon 394698 394818 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "6";
Chr1_RagTag_p AUGUSTUS exon 394911 394958 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "7";
Chr1_RagTag_p AUGUSTUS exon 395153 395236 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "8";
Chr1_RagTag_p AUGUSTUS exon 395347 395411 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "9";
Chr1_RagTag_p AUGUSTUS exon 395716 395767 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "10";
Chr1_RagTag_p AUGUSTUS exon 395957 395995 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "11";
Chr1_RagTag_p AUGUSTUS exon 396069 396143 . + . transcript_id "TCONS_00000070"; gene_id "AT1G02100"; exon_number "12";
Chr1_RagTag_p manual transcript 396451 399224 . + . transcript_id "TCONS_00000071"; gene_id "AT1G02110"; gene_name "AT1G02110"; oId "g110.t1"; cmp_ref "AT1G02110.1"; class_code "="; tss_id "TSS65"; num_samples "2";
Chr1_RagTag_p manual exon 396451 397570 . + . transcript_id "TCONS_00000071"; gene_id "AT1G02110"; exon_number "1";
Chr1_RagTag_p manual exon 397661 397848 . + . transcript_id "TCONS_00000071"; gene_id "AT1G02110"; exon_number "2";
Chr1_RagTag_p manual exon 397923 398146 . + . transcript_id "TCONS_00000071"; gene_id "AT1G02110"; exon_number "3";
Chr1_RagTag_p manual exon 398367 399224 . + . transcript_id "TCONS_00000071"; gene_id "AT1G02110"; exon_number "4";
Chr1_RagTag_p AUGUSTUS transcript 77905 78201 . + . transcript_id "TCONS_00000004"; gene_id "g15"; oId "g15.t1"; cmp_ref "AT1G01150.1"; class_code "x"; cmp_ref_gene "AT1G01150"; tss_id "TSS4"; num_samples "1";
Chr1_RagTag_p AUGUSTUS exon 77905 78201 . + . transcript_id "TCONS_00000004"; gene_id "g15"; exon_number "1";
I guess the logic is to grep the `gene_name` by `transcript_id`, and then replace the `gene_id` on every line with the `gene_name` that belongs to that line's `transcript_id`.
So far, I have created a list of `transcript_id` and `gene_name`:

TCONS_00000070 AT1G02100
TCONS_00000071 AT1G02110
TCONS_00000004 g15
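For reference, a list of this shape can be derived from the transcript lines. The following is only a sketch and not necessarily how the list above was produced: it assumes whitespace-separated columns with the feature type in column 3, and the sample lines (`chr1`, `src`, `T1`, `GENE_A` and so on) are made up for illustration.

```shell
# Made-up sample lines; the real input would be the annotation file.
sample='chr1 src transcript 1 9 . + . transcript_id "T1"; gene_id "X1"; gene_name "GENE_A"; oId "g1.t1";
chr1 src exon 1 9 . + . transcript_id "T1"; gene_id "X1"; exon_number "1";
chr1 src transcript 20 29 . + . transcript_id "T2"; gene_id "X2"; oId "g2.t1"; class_code "x";'

# For every transcript line, print transcript_id and gene_name;
# fall back to oId with the trailing .tN removed when gene_name is absent.
list=$(printf '%s\n' "$sample" | awk '$3 == "transcript" {
  tid = ""; name = ""
  if (match($0, /transcript_id "[^"]+"/)) {
    tid = substr($0, RSTART + 15, RLENGTH - 16)
  }
  if (match($0, /gene_name "[^"]+"/)) {
    name = substr($0, RSTART + 11, RLENGTH - 12)
  } else if (match($0, /oId "[^"]+"/)) {
    name = substr($0, RSTART + 5, RLENGTH - 6)
    sub(/\.t[0-9]+$/, "", name)
  }
  print tid, name
}')
printf '%s\n' "$list"
```

The offsets passed to `substr` skip the key, the space and the opening quote, and drop the closing quote.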
Then I need to replace the `gene_id` on every line with the `gene_name` that goes with its `transcript_id`. But I don't know how to do that. Can I use `sed`?
Last but not least, I should mention that my real file is large: it contains 35000 different `transcript_id` values. Thanks a lot in advance!
I can't tell whether this covers all edge cases, but for your example it does what you ask for:
sed '/.*gene_name/{h;s///;s/;.*//;x;};G;s/gene_id[^;]*\(.*\)\n\(.*\)/gene_id\2\1/' file
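What it does is extract the `gene_name` if present, store it in the hold space, and use it as the replacement for all subsequent `gene_id`s until a new `gene_name` appears. In more detail: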
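- `/.*gene_name/` is an address, so everything inside the `{}` is applied only to lines matching that pattern
- Before we mess everything up, we save the original line to the hold space with `h`
- `s///` removes the previous pattern (everything up to and including `gene_name`); `s/;.*//` removes everything from the first semicolon onwards. What remains is a space and the string in double quotes.
- `x` exchanges the two spaces, so now we have the replacement in the hold space and the original line in the pattern space
- Everything from here on is applied to all lines: `G` appends the hold space to each line, so we have the line, a newline, and the replacement
- `s/gene_id[^;]*\(.*\)\n\(.*\)/gene_id\2\1/` is easier to write than to read: `[^;]*` matches everything between `gene_id` and the `;`, that is, the part to be replaced. The `\(.*\)` parts cover the text before and after the embedded newline, so we can refer to them as `\1` and `\2` in the replacement.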
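To illustrate the last step, see what the buffer looks like after appending the hold space containing the `gene_name`, with `<CR>` standing for the embedded newline:

    Chr1_RagTag_p ………; gene_id "XLOC_000060"; exon_number "1";<CR> "AT1G02100"
                              \______v_____/\________v_______/    \_____v____/
                       gene_id     [^;]*           \(.*\)      \n   \(.*\)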
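The individual steps can also be watched on a tiny sample. This is just a sketch: the two lines below are made-up stand-ins for a transcript line and one of its exons (`chr1`, `src`, `T1`, `X1` and `GENE_A` are not from the real file). The first command prints only what the `{}` block leaves behind for the hold space; the second runs the complete command.

```shell
# Two made-up lines standing in for a transcript line and one of its exons.
sample='chr1 src transcript 1 9 . + . transcript_id "T1"; gene_id "X1"; gene_name "GENE_A"; oId "g1.t1";
chr1 src exon 1 9 . + . transcript_id "T1"; gene_id "X1"; exon_number "1";'

# Step 1: the text that ends up in the hold space (a space plus the quoted name).
held=$(printf '%s\n' "$sample" | sed -n '/.*gene_name/{s///;s/;.*//;p;}')
printf 'held: [%s]\n' "$held"

# Step 2: the complete command; both lines receive the held name as gene_id.
fixed=$(printf '%s\n' "$sample" |
  sed '/.*gene_name/{h;s///;s/;.*//;x;};G;s/gene_id[^;]*\(.*\)\n\(.*\)/gene_id\2\1/')
printf '%s\n' "$fixed"
```

Because the held text already starts with a space, the replacement `gene_id\2\1` needs no extra separator between `gene_id` and the quoted name.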
It may be easier to read with extended regular expressions (option `-E`):
sed -E '/.*gene_name/{h;s///;s/;.*//;x;};G;s/(gene_id)[^;]*(.*)\n(.*)/\1\3\2/' file
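A quick way to convince yourself that the two spellings are equivalent is to run both on the same input and compare the results. The sample lines here are made up for illustration, and the check assumes a `sed` that accepts `-E` (GNU and BSD sed both do).

```shell
# Made-up sample: one transcript line followed by one exon line.
sample='chr1 src transcript 1 9 . + . transcript_id "T1"; gene_id "X1"; gene_name "GENE_A"; oId "g1.t1";
chr1 src exon 1 9 . + . transcript_id "T1"; gene_id "X1"; exon_number "1";'

# Basic regular expressions: groups are written \( \).
bre=$(printf '%s\n' "$sample" |
  sed '/.*gene_name/{h;s///;s/;.*//;x;};G;s/gene_id[^;]*\(.*\)\n\(.*\)/gene_id\2\1/')

# Extended regular expressions: groups are written ( ), so gene_id becomes \1.
ere=$(printf '%s\n' "$sample" |
  sed -E '/.*gene_name/{h;s///;s/;.*//;x;};G;s/(gene_id)[^;]*(.*)\n(.*)/\1\3\2/')

if [ "$bre" = "$ere" ]; then
  echo "identical output"
else
  echo "outputs differ"
fi
```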
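Update for the case without `gene_name` from the updated question: I simply added an extraction of the `oId`, analogous to the `gene_name` extraction but placed before it. So if a `gene_name` is found afterwards, it overrides the `oId`. This time on separate lines for readability:

    sed '
    /.*oId/{
      h
      s///
      s/\..*/"/
      x
    }
    /.*gene_name/{
      h
      s///
      s/;.*//
      x
    }
    G
    s/gene_id[^;]*\(.*\)\n\(.*\)/gene_id\2\1/' file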
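As a small check of the fallback, the script can be run on a sample where the second transcript has no `gene_name`. The lines below are made up for illustration (`T1`, `T2`, `GENE_A`, `g2.t1` are not from the real file), and the sketch relies on each transcript line appearing before its exon lines, as in the example from the question.

```shell
# Made-up sample: T1 has a gene_name, T2 only has an oId.
sample='chr1 src transcript 1 9 . + . transcript_id "T1"; gene_id "X1"; gene_name "GENE_A"; oId "g1.t1";
chr1 src exon 1 9 . + . transcript_id "T1"; gene_id "X1"; exon_number "1";
chr1 src transcript 20 29 . + . transcript_id "T2"; gene_id "X2"; oId "g2.t1"; class_code "x";
chr1 src exon 20 29 . + . transcript_id "T2"; gene_id "X2"; exon_number "1";'

result=$(printf '%s\n' "$sample" | sed '
# Hold the oId, cut at its first dot and re-closed with a quote.
/.*oId/{
  h
  s///
  s/\..*/"/
  x
}
# A gene_name on the same line replaces the held oId.
/.*gene_name/{
  h
  s///
  s/;.*//
  x
}
# Append the held name and move it into the gene_id attribute.
G
s/gene_id[^;]*\(.*\)\n\(.*\)/gene_id\2\1/')
printf '%s\n' "$result"
```

The T1 lines should end up with `gene_id "GENE_A"` and the T2 lines with `gene_id "g2"`.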