Shell-Script

grep: find an exact word match that includes periods

  • March 16, 2022

I have a large .csv file in this format:

"acc","lineage"
"MT993865","B.1.509"
"MW483477","B.1.402"
"MW517757","B.1.2"
"MW517758","B.1.2"
"MW592770","B.1.564"
...

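That is, the first column is a string representing the accession_id of a data sample, and the second column is the covid variant lineage. I want to extract the accession_ids and their lineages for certain variants of interest, e.g. B.1.1.529 for Omicron. I tried grepping the file with -w, but because . is a non-word character, the results also include sub-lineages that extend Omicron, e.g. B.1.1.529.1.

For more detail, have a look at this bash script I wrote: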

# filter data based on the selected lineages (refer to variants_lineage.txt for more info) as given below.

# File with metadata
metadata_file="$HOME/thesis/SARS-CoV2-data/metadata.csv"
cat "$metadata_file" | tr -d '"' | tr ',' $'\t' > adj_metadata.tsv

# list of lineages of interest
selected_lineages=("B.1.1.7" "B.1.351" "P.1" "B.1.617.2" "B.1.1.5290" "C.37" "B.1.621" "B.1.429" "B.1.427" "CAL.20C" "P.2" "B.1.525" "P.3" "B.1.526" "B.1.617.1" )
pattern=$(echo ${selected_lineages[*]}|tr ' ' '|')

if [ -f "adj_metadata.tsv" ]
then
 echo "File exists"
 for lineage in ${selected_lineages[@]}
   do
     echo "Filtering for lineage $lineage"
     grep -w "$lineage" adj_metadata.tsv >> filtered_metadata.tsv
   done
else
 echo "Adjusted metadata file does not exist."
fi

# Check for the uniqueness of the filtered_metadata.tsv file, this should fetch the list of selected_lineages
cut -d$'\t' -f2 filtered_metadata.tsv | sort | uniq

Any suggestions are much appreciated.

Please also feel free to comment on improvements unrelated to the question.

Thanks in advance.

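Method 1

Since the strings in the .csv are always enclosed in double quotes ", you can include the quotes in the match. You then just need to wrap the expression in single quotes '.

Example: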

asdf.csv:

"foo","B.1.1.529"
"bar","B.1.1.529.1"
$ grep '"B.1.1.529"' ./asdf.csv
"foo","B.1.1.529"

As you can see, B.1.1.529.1 does not match in this case.
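The quoted match can be tightened a little further. As a minimal sketch (the temporary sample file is made up for this demonstration, and `-F` is an addition not used in the answer above), `grep -F` treats the pattern as a fixed string, so every `.` only matches a literal dot:

```shell
# Hypothetical sample file, created only for this demonstration.
sample=$(mktemp)
printf '%s\n' '"foo","B.1.1.529"' '"bar","B.1.1.529.1"' > "$sample"

# -F: the pattern is a fixed string, so "." is a literal dot.
# The closing double quote in the pattern keeps B.1.1.529.1 from matching.
matches=$(grep -F '"B.1.1.529"' "$sample")
printf '%s\n' "$matches"

rm -f "$sample"
```

Only the `"foo","B.1.1.529"` row is printed.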


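Method 2

While Method 1 works on your input data, it does not work on adj_metadata.tsv, because all quotes have been stripped from it. You could of course change your script to match first and then pipe the output through tr, but that would involve unnecessary work.

What you can do instead is anchor the regular expression to the end of the line with $.

Example: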

adj_metadata.tsv:

foo     B.1.1.529
bar     B.1.1.529.1
$ grep "B.1.1.529$" adj_metadata.tsv
foo     B.1.1.529
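To see the difference between `-w` alone and the end-of-line anchor side by side, here is a small sketch. It assumes a throwaway two-line TSV like the one in the example; the file and variable names are illustrative only:

```shell
# Hypothetical two-line sample in the adj_metadata.tsv layout.
sample=$(mktemp)
printf 'foo\tB.1.1.529\nbar\tB.1.1.529.1\n' > "$sample"

# -w alone: "." is a non-word character, so the "word" B.1.1.529
# is also found inside B.1.1.529.1 and both lines are counted.
word_count=$(grep -c -w 'B.1.1.529' "$sample")

# Anchoring to the end of the line excludes the longer lineage.
anchored_count=$(grep -c 'B\.1\.1\.529$' "$sample")

echo "with -w: $word_count, with \$ anchor: $anchored_count"
rm -f "$sample"
```

This prints `with -w: 2, with $ anchor: 1`.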

The only change your script needs for this method is adding \$ at the right place in the grep command:

#!/bin/bash
# filter data based on the selected lineages (refer to variants_lineage.txt for more info) as given below.

# File with metadata
metadata_file="$HOME/thesis/SARS-CoV2-data/metadata.csv"
cat "$metadata_file" | tr -d '"' | tr ',' $'\t' > adj_metadata.tsv

# list of lineages of interest
selected_lineages=("B.1.1.7" "B.1.351" "P.1" "B.1.617.2" "B.1.1.5290" "C.37" "B.1.621" "B.1.429" "B.1.427" "CAL.20C" "P.2" "B.1.525" "P.3" "B.1.526" "B.1.617.1" )

if [ -f "adj_metadata.tsv" ]
then
 echo "File exists"
 for lineage in "${selected_lineages[@]}"
   do
     echo "Filtering for lineage $lineage"
     # replace all occurrences of "." with "\." so grep treats them as literal dots
     escaped=$(printf '%s\n' "$lineage" | sed 's/\./\\./g')
     grep -w "$escaped\$" adj_metadata.tsv >> filtered_metadata.tsv
   done
else
 echo "Adjusted metadata file does not exist."
fi

# Check for the uniqueness of the filtered_metadata.tsv file, this should fetch the list of selected_lineages
cut -d$'\t' -f2 filtered_metadata.tsv | sort | uniq
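Since unrelated improvements were invited: the loop runs grep once per lineage. As a hedged alternative that is not part of the answer's method, awk can compare the second field for exact string equality in a single pass, which avoids regex escaping altogether. The sample rows and the `wanted_list` variable below are assumptions for illustration; in the script, the list could be passed as `"${selected_lineages[*]}"`.

```shell
# Hypothetical sample in the same tab-separated layout as adj_metadata.tsv.
sample=$(mktemp)
printf 'acc\tlineage\nMW1\tB.1.1.529\nMW2\tB.1.1.529.1\nMW3\tP.1\nMW4\tB.1.2\n' > "$sample"

# Space-separated lineages of interest (illustrative subset).
wanted_list="B.1.1.529 P.1"

# Build a lookup table from the list, then keep rows whose second
# field is exactly one of the wanted lineages (string comparison, no regex).
filtered=$(awk -F'\t' -v list="$wanted_list" '
  BEGIN { n = split(list, names, " "); for (i = 1; i <= n; i++) wanted[names[i]] = 1 }
  $2 in wanted
' "$sample")

printf '%s\n' "$filtered"
rm -f "$sample"
```

Only the MW1 and MW3 rows are printed; B.1.1.529.1 is skipped because the comparison is on the whole field.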

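Note: since . normally stands for any single character in a regular expression, you need to escape it with \ to search for a literal ., like so: B\.1\.1\.529$.

For simplicity you can still leave out the \. The pattern will then also match values that have some other character where a dot is expected, which is unlikely to matter for lineage names.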
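A quick sketch of what escaping changes. The second lineage value below is invented purely to provoke a false positive and is not taken from the real data:

```shell
sample=$(mktemp)
# "B.1.10529" is a made-up value used only to show the false positive.
printf 'foo\tB.1.1.529\nbaz\tB.1.10529\n' > "$sample"

# Unescaped: each "." matches any character, so "0" is accepted too.
unescaped_count=$(grep -c 'B.1.1.529$' "$sample")

# Escaped: "\." matches only a literal dot.
escaped_count=$(grep -c 'B\.1\.1\.529$' "$sample")

echo "unescaped: $unescaped_count, escaped: $escaped_count"
rm -f "$sample"
```

This prints `unescaped: 2, escaped: 1`.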

Source: https://unix.stackexchange.com/questions/694553