Text-Formatting
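Delete everything that is not inside brackets
I want to delete everything that is not inside square brackets, including the brackets themselves, but only on lines that start with ">". Is there a sed substitution that does this? I would also like to sort the records alphabetically, where a record is a line starting with ">" together with the line that follows it.
Sample input: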
>ID:000:FLKLNFIA_00192 |[Ignicoccus_hospitalis_KIN4-I.gbfspecies]|strain|Ignicoccus_hospitalis_KIN4-I.gbf|LSU ribosomal protei..|447|FLKLNFIA_1(1297538):162644-163090:1 ^^ Archaeagenomesparanahui Ignicoccus_hospitalis_KIN4-I.gbfspecies strain strain.|neighbours:ID:000:FLKLNFIA_00191(1),ID:000:FLKLNFIA_00193(1)|neighbour_genes:LSU ribosomal protei..,SSU ribosomal protei..|
ATGAGTGTGACTA---TTT---GCAATCAGCTAGCTACTACGTACTGATCGTAGCTGACG
>ID:000:MGCDKLCO_01184 |[Archaeoglobus_fulgidus_DSM_4304.gbfspecies]|strain|Archaeoglobus_fulgidus_DSM_4304.gbf|50S ribosomal protei..|471|MGCDKLCO_1(2178400):1005279-1005749:1 ^^ Archaeagenomesparanahui Archaeoglobus_fulgidus_DSM_4304.gbfspecies strain strain.|neighbours:ID:000:MGCDKLCO_01183(1),ID:000:MGCDKLCO_01185(1)|neighbour_genes:LSU ribosomal protei..,SSU ribosomal protei..|
ATGCGCGCGATAGCTAGCTAGCTAGCTTTAGGGGGATTAGCTA----ACTCTGATTCGGA
Expected output:
>Archaeoglobus_fulgidus_DSM_4304.gbfspecies
ATGCGCGCGATAGCTAGCTAGCTAGCTTTAGGGGGATTAGCTA----ACTCTGATTCGGA
>Ignicoccus_hospitalis_KIN4-I.gbfspecies
ATGAGTGTGACTA---TTT---GCAATCAGCTAGCTACTACGTACTGATCGTAGCTGACG
Thanks
With perl:
perl -ne 'push @l, ">" . join("", /\[(.*?)\]/g) . "\n" . <>; END{print for sort @l}' your-file
With sed:
<your-file sed 's/^[^[]*\[/>/
  s/\][^[]*\[\{0,1\}//g
  N;s/\n/\[/' | sort | tr '[' '\n'
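The sed pipeline works by temporarily gluing each header to its sequence with a "[", so that sort sees one record per line, and then splitting the records again with tr. The sketch below shows that intermediate stage on shortened data. The helper name `join_records` and the temporary file are hypothetical, the commands are passed with `-e` instead of embedded newlines, and `LC_ALL=C` is added only to make the sort order predictable. It assumes "[" never occurs in a sequence line.

```shell
# Hypothetical helper: the sed stage only, before sort and tr.
join_records() {
  # 1. drop everything up to and including the first "[" and put ">" there
  # 2. drop each "]" plus the text up to (and including) the next "[", if any
  # 3. N appends the sequence line; the embedded newline becomes "["
  sed -e 's/^[^[]*\[/>/' \
      -e 's/\][^[]*\[\{0,1\}//g' \
      -e 'N' \
      -e 's/\n/[/' "$1"
}

# Shortened sample data (not the full headers from the question).
sample=$(mktemp)
cat > "$sample" <<'EOF'
>ID:000:FLKLNFIA_00192 |[Ignicoccus_hospitalis_KIN4-I.gbfspecies]|strain|447|
ATGAGTGTGACTA---TTT
>ID:000:MGCDKLCO_01184 |[Archaeoglobus_fulgidus_DSM_4304.gbfspecies]|strain|471|
ATGCGCGCGATAGCTAGCT
EOF

# Intermediate form: one "header[sequence" record per line.
join_records "$sample"

# Full pipeline: sort the joined records, then split them back into two lines.
join_records "$sample" | LC_ALL=C sort | tr '[' '\n'
```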
My (admittedly complicated) suggestion:
cat file | grep -Po "^[CGTA-]*$|^>.*$" | grep -Po "(?<=\[).*(?=])|^[ACGT-]*$" | awk '{printf (NR%2==0) ? $0 "\n" : ">"$0"#"}' | sort | sed 's/#/\n/'
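The first grep keeps only the lines made up of the characters CGTA- and the lines that start with >:
grep -Po "^[CGTA-]*$|^>.*$"
The second grep keeps only what is inside the brackets, without the brackets themselves, together with the lines that match the pattern ACGT-:
| grep -Po "(?<=\[).*(?=])|^[ACGT-]*$"
Join every two lines, adding the separator # and the leading >, then sort:
| awk '{printf (NR%2==0) ? $0 "\n" : ">"$0"#"}' | sort
Finally, replace the separator # with a newline:
| sed 's/#/\n/'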
Output:
>Archaeoglobus_fulgidus_DSM_4304.gbfspecies
ATGCGCGCGATAGCTAGCTAGCTAGCTTTAGGGGGATTAGCTA----ACTCTGATTCGGA
>Ignicoccus_hospitalis_KIN4-I.gbfspecies
ATGAGTGTGACTA---TTT---GCAATCAGCTAGCTACTACGTACTGATCGTAGCTGACG
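Since `-P` is a GNU grep extension, the same four steps can also be sketched without it. The variant below is an assumption-laden rewrite, not the pipeline from the answer: it swaps the lookaround grep for a sed substitution that keeps only the first bracketed group, and uses tr instead of a `\n` in the sed replacement. The function name `extract_sorted` and the temporary file are hypothetical, and "#" must not occur in the data.

```shell
# Hypothetical portable variant of the grep/awk pipeline.
extract_sorted() {
  # Step 1: keep sequence lines (only A, C, G, T, -) and header lines.
  grep -E '^[ACGT-]*$|^>' "$1" |
    # Step 2: on header lines, keep only the text of the first [...] group.
    sed 's/^>[^[]*\[\([^]]*\)\].*/\1/' |
    # Step 3: odd lines are headers, even lines are sequences;
    #         join each pair as ">name#sequence".
    awk '{ printf "%s", ((NR % 2 == 0) ? $0 "\n" : ">" $0 "#") }' |
    # Step 4: sort the joined records, then split them at the separator.
    LC_ALL=C sort |
    tr '#' '\n'
}

# Shortened sample data (not the full headers from the question).
sample=$(mktemp)
cat > "$sample" <<'EOF'
>ID:000:FLKLNFIA_00192 |[Ignicoccus_hospitalis_KIN4-I.gbfspecies]|strain|447|
ATGAGTGTGACTA---TTT
>ID:000:MGCDKLCO_01184 |[Archaeoglobus_fulgidus_DSM_4304.gbfspecies]|strain|471|
ATGCGCGCGATAGCTAGCT
EOF

extract_sorted "$sample"
```

Passing the record through a fixed `"%s"` format also keeps awk from interpreting any stray "%" in the input as a format specifier.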