How can I use awk to identify blanks in the last column?

  • September 30, 2022

I have a file that looks like this. The first line is the header.

"variant_id" "hg38_chr" "hg38_pos" "ref_allele" "alt_allele" "hg19_chr" "hg19_pos"
"chr10_100000235_C_T_b38" "chr10" "100000235" "C" "T" "chr10" 101759992
"chr10_100002628_A_C_b38" "chr10" "100002628" "A" "C" "chr10" 
"chr10_100004827_A_C_b38" "chr10" "100004827" "A" "C" "chr10" 101764584
"chr10_100005358_G_C_b38" "chr10" "100005358" "G" "C" "chr10" 101765115
"chr10_100005711_G_A_b38" "chr10" "100005711" "G" "A" "chr10" 101765468
"chr10_100006780_C_T_b38" "chr10" "100006780" "C" "T" "chr10" 101766537
"chr10_100007241_C_T_b38" "chr10" "100007241" "C" "T" "chr10" 101766998
"chr10_100008640_A_G_b38" "chr10" "100008640" "A" "G" "chr10" 
"chr10_100009013_G_A_b38" "chr10" "100009013" "G" "A" "chr10" 101768770

How can I identify the empty fields in the last column? I tried the following commands:

awk '$7==" "' file.txt > blanks.txt
awk '{if($7==" ") print}' file.txt > blanks.txt

Both produced an empty file.

The contents of blanks.txt should be:

"chr10_100002628_A_C_b38" "chr10" "100002628" "A" "C" "chr10" 
"chr10_100008640_A_G_b38" "chr10" "100008640" "A" "G" "chr10"

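The last option in this answer is stricter about what it accepts, and it does not depend on whether the fields are separated by tabs and/or spaces.

But, to start:

If the last field is empty, the line has only 6 fields (when split on spaces or tabs). To print those lines, you can do this: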

$ awk ' NF<7 {print}' infile

"chr10_100002628_A_C_b38" "chr10" "100002628" "A" "C" "chr10" 
"chr10_100008640_A_G_b38" "chr10" "100008640" "A" "G" "chr10"

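The {print} action is not actually required, because by default awk prints the line whenever the pattern is true. It is dropped in the solutions that follow (thanks to FelixJN).

If you also need the header, add: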
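
As a side note on why the two commands in the question print nothing, here is a minimal sketch assuming awk's default whitespace field splitting. The temporary file and the variable names (`infile`, `counts`, `space_matches`, `blank_ids`) are illustrative only. With default splitting, trailing blanks do not create a seventh field, so `$7` on the short lines is the empty string rather than a single space.

```shell
# Assumed sample: two data rows taken from the question, written to a temp file.
infile=$(mktemp)
printf '%s\n' \
  '"chr10_100000235_C_T_b38" "chr10" "100000235" "C" "T" "chr10" 101759992' \
  '"chr10_100002628_A_C_b38" "chr10" "100002628" "A" "C" "chr10" ' \
  > "$infile"

# Field count per line: the trailing space does not add a 7th field.
counts=$(awk '{ print NF }' "$infile")
echo "$counts"          # prints 7, then 6

# Comparing $7 with a single space never matches, so the count stays 0.
space_matches=$(awk '$7 == " " { n++ } END { print n + 0 }' "$infile")
echo "$space_matches"   # prints 0

# Comparing $7 with the empty string finds the short line.
blank_ids=$(awk '$7 == "" { print $1 }' "$infile")
echo "$blank_ids"       # prints "chr10_100002628_A_C_b38"

rm -f "$infile"
```

So `$7 == ""` and `NF < 7` select the same lines here, as long as no other field is missing.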

$ awk '(NF<7) || (NR==1)' infile

"variant_id" "hg38_chr" "hg38_pos" "ref_allele" "alt_allele" "hg19_chr" "hg19_pos"
"chr10_100002628_A_C_b38" "chr10" "100002628" "A" "C" "chr10" 
"chr10_100008640_A_G_b38" "chr10" "100008640" "A" "G" "chr10"

And if you want to keep the lines that have enough fields, do this:

$ awk '(NF>=7) || (NR==1)' infile

"variant_id" "hg38_chr" "hg38_pos" "ref_allele" "alt_allele" "hg19_chr" "hg19_pos"
"chr10_100000235_C_T_b38" "chr10" "100000235" "C" "T" "chr10" 101759992
"chr10_100004827_A_C_b38" "chr10" "100004827" "A" "C" "chr10" 101764584
"chr10_100005358_G_C_b38" "chr10" "100005358" "G" "C" "chr10" 101765115
"chr10_100005711_G_A_b38" "chr10" "100005711" "G" "A" "chr10" 101765468
"chr10_100006780_C_T_b38" "chr10" "100006780" "C" "T" "chr10" 101766537
"chr10_100007241_C_T_b38" "chr10" "100007241" "C" "T" "chr10" 101766998
"chr10_100009013_G_A_b38" "chr10" "100009013" "G" "A" "chr10" 101768770

If you need a solution that does not rely on the last field being missing, but instead checks that the line ends with a number, use:

$ awk '/[0-9]+[ \t]*$/ || (NR==1)' infile

"variant_id" "hg38_chr" "hg38_pos" "ref_allele" "alt_allele" "hg19_chr" "hg19_pos"
"chr10_100000235_C_T_b38" "chr10" "100000235" "C" "T" "chr10" 101759992
"chr10_100004827_A_C_b38" "chr10" "100004827" "A" "C" "chr10" 101764584
"chr10_100005358_G_C_b38" "chr10" "100005358" "G" "C" "chr10" 101765115
"chr10_100005711_G_A_b38" "chr10" "100005711" "G" "A" "chr10" 101765468
"chr10_100006780_C_T_b38" "chr10" "100006780" "C" "T" "chr10" 101766537
"chr10_100007241_C_T_b38" "chr10" "100007241" "C" "T" "chr10" 101766998
"chr10 100009013_G_A_b38" "chr10" "100009013" "G" "A" "chr10" 101768770
"chr10 100009013 G_A_b38" "chr10" "100009013" "G" "A" "chr10" 101768770
"chr10_100009013_G_A_b38" "chr10" "100009013" "G" "A" "chr10" 101768770

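This is not affected by any other field being missing, and it does not depend on which field separator is used (spaces and/or tabs).

It assumes that the last field is a number not enclosed in double quotes, but that is easy to change if needed.

And, to strictly match the output your question asks for: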
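
One way that change could look, as a sketch only: the pattern below accepts a trailing number with or without double quotes. The sample rows and the names `infile` and `kept` are hypothetical. Simply making both quotes optional would wrongly match a line ending in `"chr10"`, since `10"` would pass, so the quoted form is spelled out as its own alternative.

```shell
# Hypothetical sample: the last field may be quoted, unquoted, or missing.
infile=$(mktemp)
printf '%s\n' \
  '"variant_id" "hg19_chr" "hg19_pos"' \
  '"v1" "chr10" "101759992"' \
  '"v2" "chr10" 101764584' \
  '"v3" "chr10" ' \
  > "$infile"

# Keep the header plus lines whose last token is a number, quoted or not.
# The opening quote must sit right before the digits, so a trailing
# "chr10" is not mistaken for a quoted number.
kept=$(awk '/[ \t]("[0-9]+"|[0-9]+)[ \t]*$/ || NR==1' "$infile")
echo "$kept"

rm -f "$infile"
```

With this sample, the header, `"v1"` and `"v2"` lines are kept and the `"v3"` line is dropped.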

$ awk '!/[0-9]+[ \t]*$/ && NR>1' infile
"chr10_100002628_A_C_b38" "chr10" "100002628" "A" "C" "chr10" 
"chr10_100008640_A_G_b38" "chr10" "100008640" "A" "G" "chr10"

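A field-based variant of the same idea, offered as a sketch rather than as part of the original answer: test the last field `$NF` directly instead of matching the whole line. It assumes default whitespace splitting and an unquoted numeric last column. The sample rows and the names `infile` and `blanks` are illustrative.

```shell
# Assumed sample: the header, one complete row and one row missing hg19_pos.
infile=$(mktemp)
printf '%s\n' \
  '"variant_id" "hg38_chr" "hg38_pos" "ref_allele" "alt_allele" "hg19_chr" "hg19_pos"' \
  '"chr10_100000235_C_T_b38" "chr10" "100000235" "C" "T" "chr10" 101759992' \
  '"chr10_100002628_A_C_b38" "chr10" "100002628" "A" "C" "chr10" ' \
  > "$infile"

# Skip the header (NR > 1) and print lines whose last field is not all digits.
blanks=$(awk 'NR > 1 && $NF !~ /^[0-9]+$/' "$infile")
echo "$blanks"

rm -f "$infile"
```

Here only the `chr10_100002628_A_C_b38` row is printed. Redirecting the awk output to `blanks.txt` instead of capturing it in a variable would produce the file the question asks for.
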
Source: https://unix.stackexchange.com/questions/719266