Awk
Easiest way to extract part of a string?
I have a file (bigfile.txt) in which one column looks like this:
NW_017095471.1 Gnomon mRNA 108321 109565 . + . ID=rna34;Parent=gene27;Dbxref=GeneID:108565285,Genbank:XM_017925071.1;Name=XM_017925071.1;gbkey=mRNA;gene=LOC108565285;model_evidence=Supporting evidence includes similarity to: 7 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 30 samples with support for all annotated introns;product=transmembrane protein 126A;transcript_id=XM_017925071.1 ID=gene27;Dbxref=GeneID:108565285;Name=LOC108565285;gbkey=Gene;gene=LOC108565285;gene_biotype=protein_coding ID=gene28;Dbxref=GeneID:108569527;Name=LOC108569527;gbkey=Gene;gene=LOC108569527;gene_biotype=protein_coding ID=gene78;Dbxref=GeneID:108562956;Name=LOC108562956;gbkey=Gene;gene=LOC108562956;gene_biotype=protein_coding
I have a separate list:
gene27
gene28
I want to take each line of the list, grep for it in the ID field, and then return the "LOC#" that follows "Name=".
gene=$line `grep $gene";" bigfile.txt | sed -e 's/Name=
to return
LOC108565285
LOC108569527
How would I go about extracting this part?
Assuming this is the 9th tab-delimited field (the "attributes" field) of a GFF file, you can extract the value of the gene attribute for the features whose ID attribute matches one of a given set of IDs (read from a separate file) with awk, like so:

    BEGIN { FS = "\t" }

    FNR == NR {
        # Read IDs into a hash as keys.
        ids[$1] = 1
        next
    }

    $3 == "gene" {
        # Split the attribute field into separate key-value pairs.
        n = split($9, keyvalues, ";")

        id = ""     # Not found a gene ID yet
        gene = ""   # No gene name to print

        # Loop over the key-value pairs, split them on the "="
        # and extract the gene name and gene ID.
        for (i = 1; i <= n; ++i) {
            split(keyvalues[i], attr, "=")

            if (attr[1] == "ID") {
                if (attr[2] in ids)
                    id = attr[2]
                else
                    next    # This line is not of interest
            } else if (attr[1] == "gene")
                gene = attr[2]
        }

        if (id != "" && gene != "")
            print gene
    }
Running it on a GFF file called file.gff, which contains the given data in its 9th column, with the list of gene IDs in id.list:

    $ awk -f script.awk id.list file.gff
    LOC108565285
    LOC108569527
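The list of gene IDs is read from the first file by the FNR == NR block of the awk code, while the last block processes the attribute field of the gene feature lines (only) in the second file, and in any later files, given on the command line.

The awk code assumes that the ID and gene attributes in the GFF file each contain a single value (not a comma-delimited list of values) and that the values are not quoted.

To get a two-column list of gene IDs and gene names as output, change the print gene statement to print id, gene.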
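As a rough end-to-end sketch of the above, the following shell snippet builds a tiny GFF file and an ID list in a temporary directory and runs a condensed version of the script using print id, gene. The file names (demo.gff, ids.txt), the coordinates on the gene lines and the choice of a tab as output separator are made up for illustration; only the attribute strings are taken from the question.

```shell
#!/bin/sh
# Hypothetical demo data: paths and coordinates are illustrative only.
tmp=$(mktemp -d)

# Two tab-separated "gene" feature lines with the attributes in field 9.
# printf reuses the format string once per attribute argument.
printf 'NW_017095471.1\tGnomon\tgene\t108321\t109565\t.\t+\t.\t%s\n' \
    'ID=gene27;Dbxref=GeneID:108565285;Name=LOC108565285;gbkey=Gene;gene=LOC108565285;gene_biotype=protein_coding' \
    'ID=gene78;Dbxref=GeneID:108562956;Name=LOC108562956;gbkey=Gene;gene=LOC108562956;gene_biotype=protein_coding' \
    > "$tmp/demo.gff"

# One gene ID per line; gene28 has no matching line in demo.gff.
printf '%s\n' gene27 gene28 > "$tmp/ids.txt"

result=$(awk -F '\t' -v OFS='\t' '
    FNR == NR { ids[$1] = 1; next }   # first file: remember the wanted IDs
    $3 == "gene" {
        id = ""; gene = ""
        n = split($9, kv, ";")
        for (i = 1; i <= n; ++i) {
            split(kv[i], attr, "=")
            if (attr[1] == "ID" && (attr[2] in ids)) id = attr[2]
            else if (attr[1] == "gene") gene = attr[2]
        }
        # Only lines with a wanted ID and a gene attribute are printed.
        if (id != "" && gene != "") print id, gene
    }
' "$tmp/ids.txt" "$tmp/demo.gff")

printf '%s\n' "$result"   # prints: gene27<TAB>LOC108565285
rm -rf "$tmp"
```

Only gene27 is reported here, since gene78 is not in the ID list and gene28 does not occur in the demo file.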
This needs refactoring, but should do what you want:
while IFS= read -r line; do grep -Fw "$line" bigfile.txt; done < other_file | awk -F';' '{split($3,a,"=");print a[2]}'
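One thing to watch with this loop: grep -Fw gene27 also matches lines that only mention the ID elsewhere, such as the mRNA line with Parent=gene27, and on that line the third ;-separated piece is Dbxref rather than Name. A minimal sketch of a stricter variant, which searches for the literal string ID=<id>; and picks out Name by key instead of by position, could look like the following. The file names attrs.txt and ids.txt and the shortened sample lines are made up for the demo, and it assumes that every ID value is followed by a semicolon and that Name is never the first attribute.

```shell
#!/bin/sh
# Hypothetical demo data: shortened attribute strings based on the question.
tmp=$(mktemp -d)
cat > "$tmp/attrs.txt" <<'EOF'
ID=rna34;Parent=gene27;Dbxref=GeneID:108565285,Genbank:XM_017925071.1;Name=XM_017925071.1;gbkey=mRNA
ID=gene27;Dbxref=GeneID:108565285;Name=LOC108565285;gbkey=Gene;gene=LOC108565285
ID=gene28;Dbxref=GeneID:108569527;Name=LOC108569527;gbkey=Gene;gene=LOC108569527
ID=gene78;Dbxref=GeneID:108562956;Name=LOC108562956;gbkey=Gene;gene=LOC108562956
EOF
printf '%s\n' gene27 gene28 > "$tmp/ids.txt"

names=$(
    while IFS= read -r id; do
        # Fixed-string match on "ID=<id>;" skips the Parent=gene27 line
        # and cannot match longer IDs such as gene270.
        grep -F "ID=$id;" "$tmp/attrs.txt"
    done < "$tmp/ids.txt" |
    # Print only the value that follows ";Name=", up to the next semicolon.
    sed -n 's/.*;Name=\([^;]*\).*/\1/p'
)

printf '%s\n' "$names"
rm -rf "$tmp"
```

On the demo data this prints LOC108565285 and LOC108569527, one per line. It still runs grep once per ID, so for long ID lists a single awk pass over the file is likely to be faster.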