Shell-Script
grep: find an exact match for a word that contains periods
I have a large .csv file in this format:

    "acc","lineage"
    "MT993865","B.1.509"
    "MW483477","B.1.402"
    "MW517757","B.1.2"
    "MW517758","B.1.2"
    "MW592770","B.1.564"
    ...
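That is, the first column is a string giving the accession_id of a data sample, and the second column is the COVID variant lineage. I want to extract the accession_ids and their lineages for some specific variants of interest, for example Omicron (B.1.1.529). I tried running grep -w on the file, but because . is a non-word character, it also returns results for sub-lineages of Omicron, e.g. B.1.1.529.1.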
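The behaviour described above can be reproduced with a small sketch. The two sample rows and their accession IDs are invented for illustration and are not taken from the real metadata file:

```shell
# Hypothetical two-row sample in the tab-separated layout (accession IDs invented).
sample=$(printf 'MW000001\tB.1.1.529\nMW000002\tB.1.1.529.1\n')

# -w only requires a non-word character (or a line boundary) on each side of the match.
# The "." that follows "529" in the second row satisfies that, so both rows are printed.
matches=$(printf '%s\n' "$sample" | grep -w 'B.1.1.529')
printf '%s\n' "$matches"
```

Both rows come back, which is exactly the unwanted sub-lineage match.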
For the details, see this bash script I wrote:

    # filter data based on the selected lineages (refer to variants_lineage.txt for more info) as given below.

    # File with metadata
    metadata_file="$HOME/thesis/SARS-CoV2-data/metadata.csv"
    cat "$metadata_file" | tr -d '"' | tr ',' $'\t' > adj_metadata.tsv

    # list of lineages of interest
    selected_lineages=("B.1.1.7" "B.1.351" "P.1" "B.1.617.2" "B.1.1.5290" "C.37" "B.1.621" "B.1.429" "B.1.427" "CAL.20C" "P.2" "B.1.525" "P.3" "B.1.526" "B.1.617.1" )
    pattern=$(echo ${selected_lineages[*]}|tr ' ' '|')

    if [ -f "adj_metadata.tsv" ]
    then
        echo "File exists"
        for lineage in ${selected_lineages[@]}
        do
            echo "Filtering for lineage $lineage"
            grep -w "$lineage" adj_metadata.tsv >> filtered_metadata.tsv
        done
    else
        echo "Adjusted metadata file does not exist."
    fi

    # Check for the uniqueness of the filtered_metadata.csv file, this should fetch the list of selected_lineages
    cut -d$'\t' -f2 filtered_metadata.tsv | sort | uniq
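Any suggestions are much appreciated.
Please also feel free to comment on improvements unrelated to the question.
Thank you in advance.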
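Method 1
Since the strings in your .csv are always enclosed in double quotes ("), you can include the quotes in the match. You then just need to wrap the expression in single quotes ('). Example: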
asdf.csv:

    "foo","B.1.1.529"
    "bar","B.1.1.529.1"

    $ grep '"B.1.1.529"' ./asdf.csv
    "foo","B.1.1.529"
As you can see, B.1.1.529.1 is not matched in this case.

Method 2
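While Method 1 works on your input data, it does not work on adj_metadata.tsv, because all the quotes have been stripped from that file. You could of course change your script to match first and then pipe the output through tr, but that would involve unnecessary work. What you can do instead is anchor the regular expression to the end of the line with $. Example: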
adj_metadata.tsv:

    foo	B.1.1.529
    bar	B.1.1.529.1

    $ grep "B.1.1.529$" adj_metadata.tsv
    foo	B.1.1.529
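Note that $ anchors only the right-hand side of the match. As a rough sketch (the sample rows are invented, and AP.1 is used only as an example of a lineage name that ends in "P.1"), combining the anchor with -w also stops a short pattern from matching the tail of a longer name:

```shell
# Invented sample rows; AP.1 only illustrates a lineage that ends in "P.1".
sample=$(printf 'foo\tP.1\nbar\tAP.1\nbaz\tP.1.1\n')

# "$" alone rejects P.1.1 but still accepts AP.1.
anchored_only=$(printf '%s\n' "$sample" | grep 'P\.1$')

# Adding -w requires a non-word character before the "P", which rejects AP.1 as well.
anchored_word=$(printf '%s\n' "$sample" | grep -w 'P\.1$')

printf 'with $ only:\n%s\n' "$anchored_only"
printf 'with -w and $:\n%s\n' "$anchored_word"
```

This is why keeping -w alongside the end-of-line anchor is worthwhile when looping over several lineages.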
The only modification you need to make to your script for this method is to add \$ at the right place in the grep command:

    #!/bin/bash
    # filter data based on the selected lineages (refer to variants_lineage.txt for more info) as given below.

    # File with metadata
    metadata_file="$HOME/thesis/SARS-CoV2-data/metadata.csv"
    cat "$metadata_file" | tr -d '"' | tr ',' $'\t' > adj_metadata.tsv

    # list of lineages of interest
    selected_lineages=("B.1.1.7" "B.1.351" "P.1" "B.1.617.2" "B.1.1.5290" "C.37" "B.1.621" "B.1.429" "B.1.427" "CAL.20C" "P.2" "B.1.525" "P.3" "B.1.526" "B.1.617.1" )

    if [ -f "adj_metadata.tsv" ]
    then
        echo "File exists"
        # quote the expansion so that every array element is visited as one word
        for lineage in "${selected_lineages[@]}"
        do
            echo "Filtering for lineage $lineage"
            # replace all occurrences of "." with "\." so that each dot is matched literally
            escaped_lineage=$(printf '%s\n' "$lineage" | sed 's/\./\\./g')
            # \$ anchors the match to the end of the line
            grep -w "$escaped_lineage\$" adj_metadata.tsv >> filtered_metadata.tsv
        done
    else
        echo "Adjusted metadata file does not exist."
    fi

    # Check for the uniqueness of the filtered_metadata.tsv file, this should fetch the list of selected_lineages
    cut -d$'\t' -f2 filtered_metadata.tsv | sort | uniq
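Note: although . normally stands for any character in a regular expression, you need to escape it with a \ to search for a literal ., like this: B\.1\.1\.529$. For simplicity you can still leave the \ out; an unescaped . then also matches any other single character in that position.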
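If regular expressions can be avoided altogether, one alternative (not part of the method above) is an exact string comparison on the second field. A minimal sketch with awk, assuming a tab-separated file laid out like adj_metadata.tsv; the sample rows, the temporary directory, and the file names sample.tsv and lineages.txt are invented for illustration:

```shell
# Work in a temporary directory so no real files are touched.
workdir=$(mktemp -d)

# Invented stand-in for adj_metadata.tsv.
printf 'acc\tlineage\nfoo\tB.1.1.529\nbar\tB.1.1.529.1\nbaz\tP.1\nqux\tAP.1\n' > "$workdir/sample.tsv"

# One lineage of interest per line.
printf '%s\n' 'B.1.1.529' 'P.1' > "$workdir/lineages.txt"

# First file: remember each wanted lineage as an array key.
# Second file: print rows whose second field is exactly one of those keys.
filtered=$(awk -F'\t' 'NR == FNR { want[$0]; next } $2 in want' \
    "$workdir/lineages.txt" "$workdir/sample.tsv")

printf '%s\n' "$filtered"
rm -r "$workdir"
```

Because the comparison is plain string equality, neither the dots nor the sub-lineages need special handling, and the metadata file is read only once instead of once per lineage.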