Awk

Simplest way to extract part of a string?

  • December 26, 2020

I have a file (bigfile.txt) in which one column looks like this:

NW_017095471.1  Gnomon  mRNA    108321  109565  .   +   .   ID=rna34;Parent=gene27;Dbxref=GeneID:108565285,Genbank:XM_017925071.1;Name=XM_017925071.1;gbkey=mRNA;gene=LOC108565285;model_evidence=Supporting evidence includes similarity to: 7 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 30 samples with support for all annotated introns;product=transmembrane protein 126A;transcript_id=XM_017925071.1
ID=gene27;Dbxref=GeneID:108565285;Name=LOC108565285;gbkey=Gene;gene=LOC108565285;gene_biotype=protein_coding
ID=gene28;Dbxref=GeneID:108569527;Name=LOC108569527;gbkey=Gene;gene=LOC108569527;gene_biotype=protein_coding
ID=gene78;Dbxref=GeneID:108562956;Name=LOC108562956;gbkey=Gene;gene=LOC108562956;gene_biotype=protein_coding

I also have a separate list:

gene27
gene28

I want to take each line of that list, grep for it in the ID field, and return the "LOC#" value that follows "Name=".

gene=$line
`grep $gene";" bigfile.txt | sed -e 's/Name=

so that it returns:

LOC108565285
LOC108569527

How can I extract this part?

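Assuming this is the 9th tab-delimited field (the "attributes" field) of a GFF file, you can use awk to extract the value of the gene attribute for the gene features whose ID attribute matches one of the IDs read from a separate file, like this: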

BEGIN { FS = "\t" }

FNR == NR {
   # Read IDs into a hash as keys.
   ids[$1] = 1
   next
}

$3 == "gene" {
   # Split the attribute field into separate key-value pairs.
   n = split($9, keyvalues, ";")

   id = ""    # Not found a gene ID yet
   gene = ""  # No gene name to print

   # Loop over the key-value pairs, split them on the "="
   # and extract the gene name and gene ID.
   for (i = 1; i <= n; ++i) {
       split(keyvalues[i], attr, "=")
       if (attr[1] == "ID") {
           if (attr[2] in ids)
               id = attr[2]
           else
               next  # This line is not of interest
       }
       else if (attr[1] == "gene")
           gene = attr[2]
   }

   if (id != "" && gene != "")
       print gene
}

Running it on a GFF file called file.gff that contains the given data in its 9th column, with the list of gene IDs in id.list:

$ awk -f script.awk id.list file.gff
LOC108565285
LOC108569527

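The list of gene IDs is read from the first file in the FNR == NR block of the awk code, while the last block processes the attribute field of (only) the gene feature lines in the second (and any later) file given on the command line.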
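As a minimal sketch of why FNR == NR selects the first file: NR counts records across all input files, while FNR restarts at 1 for each file, so the two are equal only while the first file is being read. The function name label_lines below is hypothetical and exists only for this illustration.

```shell
# Hypothetical demo helper: label each input line by the awk block that handles it.
label_lines() {
    awk '
        FNR == NR { print "first: " $0; next }   # true only while reading the first file
                  { print "later: " $0 }         # lines from the second and later files
    ' "$@"
}

# Usage: label_lines id.list file.gff
```

One caveat with this idiom: if the first file is empty, FNR == NR also holds for the second file, so the ID list should not be empty.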

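The awk code assumes that the ID and gene attributes in the GFF file each hold a single value (not a comma-separated list of values) and that the values are not quoted.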

To get the output as a list of gene IDs and gene names (two columns), change the print gene statement to print id, gene.
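A compact sketch of the two-column variant, wrapped in a shell function so it can be called directly. The function name gene_names and the choice of a tab as output separator are assumptions made for this example. Unlike the script above, it collects the ID and gene attributes first and checks the ID against the list afterwards.

```shell
# Hypothetical helper: print "ID<TAB>gene" for every wanted gene ID.
# $1 = file with one gene ID per line, $2 = tab-delimited GFF file.
gene_names() {
    awk -F '\t' -v OFS='\t' '
        FNR == NR { ids[$1] = 1; next }      # first file: remember the wanted IDs
        $3 == "gene" {
            id = ""; gene = ""
            n = split($9, kv, ";")           # attributes are in column 9
            for (i = 1; i <= n; ++i) {
                split(kv[i], attr, "=")
                if (attr[1] == "ID")        id = attr[2]
                else if (attr[1] == "gene") gene = attr[2]
            }
            if (id != "" && (id in ids) && gene != "")
                print id, gene
        }
    ' "$1" "$2"
}

# Usage: gene_names id.list file.gff
```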

This needs refactoring, but it should do what you want:

while IFS= read -r line; do grep -Fw "$line" bigfile.txt; done < other_file | awk -F';' '{split($3,a,"=");print a[2]}'
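One possible refactoring of that loop, offered as a sketch rather than a definitive fix: match on the literal text ID=<id>; so that lines which only mention the ID elsewhere (for example in Parent=gene27) are skipped, and let sed print whatever follows Name= up to the next semicolon. The function name names_by_id is hypothetical. The sketch assumes the IDs contain no characters that are special to sed (such as / or .) and that the ID attribute is followed by a semicolon.

```shell
# Hypothetical helper: for each ID listed in "$1", print the Name= value
# of the line in "$2" whose attributes contain "ID=<id>;".
names_by_id() {
    while IFS= read -r id; do
        [ -n "$id" ] || continue             # skip empty lines in the list
        # Capture the text after "Name=" up to the next ";" on matching lines only.
        sed -n "s/.*ID=${id};.*Name=\([^;]*\).*/\1/p" "$2"
    done < "$1"
}

# Usage: names_by_id other_file bigfile.txt
```

This reads the big file once per ID, so for long lists the single-pass awk script is likely the better choice.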

Quoted from: https://unix.stackexchange.com/questions/529493