Awk
How can I use awk to identify blanks in the last column?
I have a file that looks like this. The first line is the header.

    "variant_id" "hg38_chr" "hg38_pos" "ref_allele" "alt_allele" "hg19_chr" "hg19_pos"
    "chr10_100000235_C_T_b38" "chr10" "100000235" "C" "T" "chr10" 101759992
    "chr10_100002628_A_C_b38" "chr10" "100002628" "A" "C" "chr10"
    "chr10_100004827_A_C_b38" "chr10" "100004827" "A" "C" "chr10" 101764584
    "chr10_100005358_G_C_b38" "chr10" "100005358" "G" "C" "chr10" 101765115
    "chr10_100005711_G_A_b38" "chr10" "100005711" "G" "A" "chr10" 101765468
    "chr10_100006780_C_T_b38" "chr10" "100006780" "C" "T" "chr10" 101766537
    "chr10_100007241_C_T_b38" "chr10" "100007241" "C" "T" "chr10" 101766998
    "chr10_100008640_A_G_b38" "chr10" "100008640" "A" "G" "chr10"
    "chr10_100009013_G_A_b38" "chr10" "100009013" "G" "A" "chr10" 101768770
How can I identify the empty fields in the last column? I tried the following commands:

    awk '$7==" "' file.txt > blanks.txt
    awk '{if($7==" ") print}' file.txt > blanks.txt

Both produced empty files.
The contents of blanks.txt should be:

    "chr10_100002628_A_C_b38" "chr10" "100002628" "A" "C" "chr10"
    "chr10_100008640_A_G_b38" "chr10" "100008640" "A" "G" "chr10"
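The last option in this answer is stricter about what it accepts and does not depend on whether the fields are separated by tabs and/or spaces.
But, to start:
If the last field is empty, the line has only 6 fields (when fields are separated by spaces or tabs). To print those lines, you can do this: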
    $ awk 'NF<7 {print}' infile
    "chr10_100002628_A_C_b38" "chr10" "100002628" "A" "C" "chr10"
    "chr10_100008640_A_G_b38" "chr10" "100008640" "A" "G" "chr10"
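To see why counting fields works while the `$7==" "` test from the question never matches, it can help to print what awk actually sees. The following is a minimal sketch, assuming a four-line excerpt of the file written to a temporary file; the variable names (`sample`, `field_counts`, `space_match`, `empty_match`) are made up for the illustration.

```shell
# Write a short excerpt of the question's data to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
"variant_id" "hg38_chr" "hg38_pos" "ref_allele" "alt_allele" "hg19_chr" "hg19_pos"
"chr10_100000235_C_T_b38" "chr10" "100000235" "C" "T" "chr10" 101759992
"chr10_100002628_A_C_b38" "chr10" "100002628" "A" "C" "chr10"
"chr10_100004827_A_C_b38" "chr10" "100004827" "A" "C" "chr10" 101764584
EOF

# Print the line number and the number of fields awk finds on each line.
field_counts=$(awk '{print NR ": " NF}' "$sample")
printf '%s\n' "$field_counts"

# With default splitting, a field beyond NF is the empty string, never a single space.
space_match=$(awk '$7==" "' "$sample")   # selects nothing
empty_match=$(awk '$7==""' "$sample")    # selects the line with only 6 fields
printf '%s\n' "$empty_match"

rm -f "$sample"
```

With whitespace as the separator, the missing value leaves no seventh field at all, so comparing `$7` with the empty string (or testing `NF<7`) selects the line, while a comparison with `" "` cannot.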
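The {print} command is not actually needed, because by default awk prints every line for which the expression is true; it is dropped in the solutions that follow (thanks, FelixJN). If you also want the header, add: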
    $ awk '(NF<7) || (NR==1)' infile
    "variant_id" "hg38_chr" "hg38_pos" "ref_allele" "alt_allele" "hg19_chr" "hg19_pos"
    "chr10_100002628_A_C_b38" "chr10" "100002628" "A" "C" "chr10"
    "chr10_100008640_A_G_b38" "chr10" "100008640" "A" "G" "chr10"
And if you instead want to keep the lines that have enough fields, do this:
    $ awk '(NF>=7) || (NR==1)' infile
    "variant_id" "hg38_chr" "hg38_pos" "ref_allele" "alt_allele" "hg19_chr" "hg19_pos"
    "chr10_100000235_C_T_b38" "chr10" "100000235" "C" "T" "chr10" 101759992
    "chr10_100004827_A_C_b38" "chr10" "100004827" "A" "C" "chr10" 101764584
    "chr10_100005358_G_C_b38" "chr10" "100005358" "G" "C" "chr10" 101765115
    "chr10_100005711_G_A_b38" "chr10" "100005711" "G" "A" "chr10" 101765468
    "chr10_100006780_C_T_b38" "chr10" "100006780" "C" "T" "chr10" 101766537
    "chr10_100007241_C_T_b38" "chr10" "100007241" "C" "T" "chr10" 101766998
    "chr10_100009013_G_A_b38" "chr10" "100009013" "G" "A" "chr10" 101768770
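If both sets of lines are wanted, the two filters can be combined into a single pass that writes each line to one of two files. This is only a sketch of one way to do it, not part of the original answer: the file names `blanks.txt` and `complete.txt`, the temporary directory, and the shell variable names are assumptions made for the example.

```shell
# Work in a temporary directory with a short excerpt of the data.
workdir=$(mktemp -d)
cat > "$workdir/infile" <<'EOF'
"variant_id" "hg38_chr" "hg38_pos" "ref_allele" "alt_allele" "hg19_chr" "hg19_pos"
"chr10_100000235_C_T_b38" "chr10" "100000235" "C" "T" "chr10" 101759992
"chr10_100002628_A_C_b38" "chr10" "100002628" "A" "C" "chr10"
"chr10_100004827_A_C_b38" "chr10" "100004827" "A" "C" "chr10" 101764584
EOF

# One pass: header to both files, short lines to blanks, the rest to complete.
awk -v blanks="$workdir/blanks.txt" -v complete="$workdir/complete.txt" '
    NR == 1 { print > blanks; print > complete; next }  # copy the header twice
    NF < 7  { print > blanks; next }                    # last field is missing
            { print > complete }                        # all 7 fields present
' "$workdir/infile"

blanks_out=$(cat "$workdir/blanks.txt")
complete_out=$(cat "$workdir/complete.txt")
printf '%s\n' "$blanks_out"

rm -rf "$workdir"
```

Drop the `NR == 1` rule if the header should not be repeated in the output files.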
If you need a solution that does not rely on the last field being missing, but instead checks that the line ends with a trailing number, use:
    $ awk '/[0-9]+[ \t]*$/ || (NR==1)' infile
    "variant_id" "hg38_chr" "hg38_pos" "ref_allele" "alt_allele" "hg19_chr" "hg19_pos"
    "chr10_100000235_C_T_b38" "chr10" "100000235" "C" "T" "chr10" 101759992
    "chr10_100004827_A_C_b38" "chr10" "100004827" "A" "C" "chr10" 101764584
    "chr10_100005358_G_C_b38" "chr10" "100005358" "G" "C" "chr10" 101765115
    "chr10_100005711_G_A_b38" "chr10" "100005711" "G" "A" "chr10" 101765468
    "chr10_100006780_C_T_b38" "chr10" "100006780" "C" "T" "chr10" 101766537
    "chr10_100007241_C_T_b38" "chr10" "100007241" "C" "T" "chr10" 101766998
    "chr10 100009013_G_A_b38" "chr10" "100009013" "G" "A" "chr10" 101768770
    "chr10 100009013 G_A_b38" "chr10" "100009013" "G" "A" "chr10" 101768770
    "chr10_100009013_G_A_b38" "chr10" "100009013" "G" "A" "chr10" 101768770
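This is not affected by any other field being missing, and it does not matter which field separator (spaces and/or tabs) is used.
It does assume that the last field is a number that is not enclosed in double quotes, but that is easy to change if needed.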
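As an illustration of such a change, here is one possible variant of the pattern that also accepts a quoted number. It is a sketch under assumptions: the sample below is hypothetical (one value has been wrapped in quotes for the demonstration), and `sample` and `kept` are made-up names. Making the quotes optional is not enough on its own, because `"chr10"` ends in digits followed by a quote, so `/"?[0-9]+"?[ \t]*$/` would accept the short lines too. Requiring whitespace directly before the number avoids that.

```shell
# Hypothetical sample: the first data line has its last value in double quotes.
sample=$(mktemp)
cat > "$sample" <<'EOF'
"variant_id" "hg38_chr" "hg38_pos" "ref_allele" "alt_allele" "hg19_chr" "hg19_pos"
"chr10_100000235_C_T_b38" "chr10" "100000235" "C" "T" "chr10" "101759992"
"chr10_100002628_A_C_b38" "chr10" "100002628" "A" "C" "chr10"
"chr10_100004827_A_C_b38" "chr10" "100004827" "A" "C" "chr10" 101764584
EOF

# Whitespace, optional quote, digits, optional quote, then only trailing blanks.
# The leading [ \t] stops the digits inside "chr10" from counting as a number.
kept=$(awk '/[ \t]"?[0-9]+"?[ \t]*$/ || NR==1' "$sample")
printf '%s\n' "$kept"

rm -f "$sample"
```

Here the header and the two lines ending in a number (quoted or not) are kept, and the line ending in `"chr10"` is dropped.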
And, to produce exactly the output your question asks for:
    $ awk '!/[0-9]+[ \t]*$/ && NR>1' infile
    "chr10_100002628_A_C_b38" "chr10" "100002628" "A" "C" "chr10"
    "chr10_100008640_A_G_b38" "chr10" "100008640" "A" "G" "chr10"
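All of the commands above rely on default whitespace splitting, where a missing value simply shortens the line. If the file were strictly tab-delimited and an empty last field still had its tab in front of it (an assumption; the file in the question does not appear to be like this), the seventh field would exist and be empty, and it could be tested directly. A sketch with hypothetical data built by `printf`:

```shell
# Hypothetical tab-delimited data: the last line ends with a tab and an empty 7th field.
sample=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    '"variant_id"' '"hg38_chr"' '"hg38_pos"' '"ref_allele"' '"alt_allele"' '"hg19_chr"' '"hg19_pos"' \
    '"chr10_100000235_C_T_b38"' '"chr10"' '"100000235"' '"C"' '"T"' '"chr10"' '101759992' \
    '"chr10_100002628_A_C_b38"' '"chr10"' '"100002628"' '"A"' '"C"' '"chr10"' '' \
    > "$sample"

# With a tab as the separator, every line has 7 fields and the empty one compares equal to "".
tab_blank_info=$(awk -F'\t' 'NR>1 && $7=="" {print NR ": " NF}' "$sample")
printf '%s\n' "$tab_blank_info"

rm -f "$sample"
```

In that layout `NF<7` would no longer find the line, because the field count stays at 7, so the `$7==""` test is the one to use.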