Remove the string after the first space in the first column only

February 18, 2021

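The following file is tab-delimited. In the first column, I am trying to remove the number that follows the space in NbLab330C00 64506568, so that only NbLab330C00 remains.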

$ head LAB330_TE_annotation.gff3 
##gff-version 3      
##date Sun Feb 14 08:41:36 UTC 2021      
##Identity: Sequence identity (0-1) between the library sequence and the target region.      
##ltr_identity: Sequence identity (0-1) between the left and right LTR regions.      
##tsd: target site duplication.      
##seqid source sequence_ontology start end score strand phase attributes      
NbLab330C00 64506568    EDTA    Gypsy_LTR_retrotransposon   2   3364    20798   -   .   ID=TE_homo_0;Name=TE_00007365_INT;Classification=LTR/Gypsy;Sequence_ontology=SO:0002265;Identity=0.868;Method=homology      
NbLab330C00 64506568    EDTA    Gypsy_LTR_retrotransposon   3367    4198    3385    -   .   ID=TE_homo_1;Name=TE_00008087_LTR;Classification=LTR/Gypsy;Sequence_ontology=SO:0002265;Identity=0.865;Method=homology      
NbLab330C00 64506568    EDTA    hAT_TIR_transposon  4424    4715    1278    +   .   ID=TE_homo_2;Name=TE_00003964;Classification=DNA/DTA;Sequence_ontology=SO:0002279;Identity=0.834;Method=homology      
NbLab330C00 64506568    EDTA    hAT_TIR_transposon  5236    5453    835 +   .   ID=TE_homo_3;Name=TE_00001425;Classification=DNA/DTA;Sequence_ontology=SO:0002279;Identity=0.828;Method=homology

I tried the following awk command, but it also cuts off the last column.

$ awk -v OFS='\t' '{print $1,$3,$4,$5,$7,$8,$9}' LAB330_TE_annotation.gff3 > LAB330_TE_annotation.fix.gff3
(base) ubuntu@ip-10-23-2-113:/efs/apollo/LAB330$ head LAB330_TE_annotation.fix.gff3 
##gff-version                       
##date  Feb 14  08:41:36    2021        
##Identity: identity    (0-1)   between library sequence    and
##ltr_identity: identity    (0-1)   between left    and right
##tsd:  site    duplication.                
##seqid sequence_ontology   start   end strand  phase   attributes
NbLab330C00 EDTA    Gypsy_LTR_retrotransposon   2   20798   -   .
NbLab330C00 EDTA    Gypsy_LTR_retrotransposon   3367    3385    -   .
NbLab330C00 EDTA    hAT_TIR_transposon  4424    1278    +   .
NbLab330C00 EDTA    hAT_TIR_transposon  5236    835 +   .
(base) ubuntu@ip-10-23-2-113:/efs/apollo/LAB330$

How can I fix the above command?

Thanks in advance.

awk 'BEGIN{ OFS=FS="\t" } 
 !/^#/{ sub(/ [0-9]+$/, "", $1) }
 1
' LAB330_TE_annotation.gff3 > LAB330_TE_annotation.fix.gff3
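This leaves the header lines, which start with #, unmodified, and replaces a space followed by at least one digit at the end of the first field with an empty string.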
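The attempt in the question goes wrong because awk's default field separator splits on any whitespace, so the number becomes a field of its own and the hand-picked field list no longer lines up with the real columns. To see the tab-based version at work on a small case, here is a minimal sketch. It assumes a hypothetical two-line file, sample.gff3 (not the real annotation), whose first field contains a space followed by digits.

```shell
# Build a tiny tab-delimited sample; sample.gff3 is a hypothetical file name.
# The first field of the data line contains a space followed by digits.
printf '##gff-version 3\n' > sample.gff3
printf 'NbLab330C00 64506568\tEDTA\thAT_TIR_transposon\t4424\t4715\t1278\t+\t.\tID=TE_homo_2\n' >> sample.gff3

# Same program as above: split and join on tabs, skip header lines,
# strip " <digits>" from the end of field 1, then print every line.
awk 'BEGIN{ OFS=FS="\t" }
 !/^#/{ sub(/ [0-9]+$/, "", $1) }
 1
' sample.gff3 > sample.fix.gff3

cat sample.fix.gff3
```

Because both FS and OFS are a tab, the remaining eight fields, including the attributes column with its semicolons, pass through unchanged.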

You can use cut to delete the second column. The default delimiter is a tab, so you do not need to specify the -d switch.

$ cut -f 1,3- LAB330_TE_annotation.gff3
##gff-version 3
##date Sun Feb 14 08:41:36 UTC 2021
##Identity: Sequence identity (0-1) between the library sequence and the target region.
##ltr_identity: Sequence identity (0-1) between the left and right LTR regions.
##tsd: target site duplication.
##seqid source sequence_ontology start end score strand phase attributes
NbLab330C00 EDTA    Gypsy_LTR_retrotransposon   2   3364    20798   -   .   ID=TE_homo_0;Name=TE_00007365_INT;Classification=LTR/Gypsy;Sequence_ontology=SO:0002265;Identity=0.868;Method=homology
NbLab330C00 EDTA    Gypsy_LTR_retrotransposon   3367    4198    3385    -   .   ID=TE_homo_1;Name=TE_00008087_LTR;Classification=LTR/Gypsy;Sequence_ontology=SO:0002265;Identity=0.865;Method=homology
NbLab330C00 EDTA    hAT_TIR_transposon  4424    4715    1278    +   .   ID=TE_homo_2;Name=TE_00003964;Classification=DNA/DTA;Sequence_ontology=SO:0002279;Identity=0.834;Method=homology
NbLab330C00 EDTA    hAT_TIR_transposon  5236    5453    835 +   .   ID=TE_homo_3;Name=TE_00001425;Classification=DNA/DTA;Sequence_ontology=SO:0002279;Identity=0.828;Method=homology

Alternatively, with GNU cut:

$ cut -f 2 --complement LAB330_TE_annotation.gff3
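Note that cut -f splits on tabs only, so it removes the number only when the number is a tab-separated field of its own. If the number sits inside the first field after a space, as the question describes, cut -f 1,3- would drop the source column (EDTA) instead. One way to tell the two layouts apart is to count the tab-separated fields on the first data line. The sketch below assumes a hypothetical sample file, sample_tab.gff3, and relies on a GFF3 feature line having nine columns.

```shell
# sample_tab.gff3 is a hypothetical file; the number is tab-separated here,
# so the data line has ten tab-delimited fields instead of GFF3's nine.
printf '##gff-version 3\n' > sample_tab.gff3
printf 'NbLab330C00\t64506568\tEDTA\thAT_TIR_transposon\t4424\t4715\t1278\t+\t.\tID=TE_homo_2\n' >> sample_tab.gff3

# Count tab-separated fields on the first non-header line.
nf=$(awk -F'\t' '!/^#/{ print NF; exit }' sample_tab.gff3)

if [ "$nf" -eq 10 ]; then
    # The number is its own field: dropping field 2 is enough.
    cut -f 1,3- sample_tab.gff3 > sample_tab.fix.gff3
else
    # The number sits inside field 1 after a space: strip it with awk.
    awk 'BEGIN{ OFS=FS="\t" } !/^#/{ sub(/ [0-9]+$/, "", $1) } 1' \
        sample_tab.gff3 > sample_tab.fix.gff3
fi

cat sample_tab.fix.gff3
```

Header lines contain no tab, and cut prints such lines unchanged unless -s is given, so the ## comments survive in either branch.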

Source: https://unix.stackexchange.com/questions/635097
