Text-Processing
Split a file each time the content of a given column changes
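I'm working with a text file and have run into a problem I can't solve, despite some Googling and asking around.
I want to split this file (20,880 lines) into separate files based on the content of column 2 (columns are delimited by |). Every time the content of column 2 changes, I want a new file. Unfortunately, the number of lines for each value of column 2 is irregular, so I can't simply split the file every n lines. Here are the first few lines of the original file: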
>00000000|gene_cluster:GC_00001105|genome_name:r7534_20160316|gene_callers_id:24
>00000001|gene_cluster:GC_00001105|genome_name:r7537_20160321|gene_callers_id:78
>00000002|gene_cluster:GC_00001105|genome_name:r7541_20160426|gene_callers_id:774
>00000003|gene_cluster:GC_00001105|genome_name:r7544_20160502|gene_callers_id:1034
>00000004|gene_cluster:GC_00001105|genome_name:r7547_20160512|gene_callers_id:330
>00000005|gene_cluster:GC_00001105|genome_name:r7550_20160517|gene_callers_id:2094
>00000006|gene_cluster:GC_00001290|genome_name:r7534_20160316|gene_callers_id:76
>00000007|gene_cluster:GC_00001290|genome_name:r7537_20160321|gene_callers_id:358
>00000008|gene_cluster:GC_00001290|genome_name:r7541_20160426|gene_callers_id:1601
>00000009|gene_cluster:GC_00001290|genome_name:r7544_20160502|gene_callers_id:2134
I then sorted it by the second column, which gave me this:
>00006406|gene_cluster:GC_00000001|genome_name:r7534_20160316|gene_callers_id:1988
>00006409|gene_cluster:GC_00000001|genome_name:r7537_20160321|gene_callers_id:1059
>00006410|gene_cluster:GC_00000001|genome_name:r7537_20160321|gene_callers_id:1811
>00006407|gene_cluster:GC_00000001|genome_name:r7537_20160321|gene_callers_id:1947
>00006411|gene_cluster:GC_00000001|genome_name:r7537_20160321|gene_callers_id:643
>00006408|gene_cluster:GC_00000001|genome_name:r7537_20160321|gene_callers_id:759
>00006412|gene_cluster:GC_00000001|genome_name:r7541_20160426|gene_callers_id:1252
>00006415|gene_cluster:GC_00000001|genome_name:r7541_20160426|gene_callers_id:1920
>00006414|gene_cluster:GC_00000001|genome_name:r7541_20160426|gene_callers_id:2021
>00006413|gene_cluster:GC_00000001|genome_name:r7541_20160426|gene_callers_id:2094
But I haven't figured out how to split the file each time the second column changes. How can I split this file?
Thanks!
Use awk, with | as the field separator and the second field as the output file name:
$ awk -F"|" '{print > $2}' input_file
$ head gene_cluster*
==> gene_cluster:GC_00001105 <==
>00000000|gene_cluster:GC_00001105|genome_name:r7534_20160316|gene_callers_id:24
>00000001|gene_cluster:GC_00001105|genome_name:r7537_20160321|gene_callers_id:78
>00000002|gene_cluster:GC_00001105|genome_name:r7541_20160426|gene_callers_id:774
>00000003|gene_cluster:GC_00001105|genome_name:r7544_20160502|gene_callers_id:1034
>00000004|gene_cluster:GC_00001105|genome_name:r7547_20160512|gene_callers_id:330
>00000005|gene_cluster:GC_00001105|genome_name:r7550_20160517|gene_callers_id:2094

==> gene_cluster:GC_00001290 <==
>00000006|gene_cluster:GC_00001290|genome_name:r7534_20160316|gene_callers_id:76
>00000007|gene_cluster:GC_00001290|genome_name:r7537_20160321|gene_callers_id:358
>00000008|gene_cluster:GC_00001290|genome_name:r7541_20160426|gene_callers_id:1601
>00000009|gene_cluster:GC_00001290|genome_name:r7544_20160502|gene_callers_id:2134
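One possible refinement, offered as a sketch rather than as part of the original answer: with many distinct values in column 2, some awk implementations may run out of open file descriptors, because the one-liner keeps every output file open until awk exits. The variant below closes the previous file whenever column 2 changes, and parenthesizes the redirection target for portability. It appends with `>>` so that a value reappearing later in unsorted input is not truncated; this assumes no stale output files are left over from an earlier run. The scratch directory and the three-line sample are illustrative only.

```shell
# Work in a scratch directory so the generated files do not clutter anything.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Small sample in the same format as the question (illustrative data subset).
cat > input_file <<'EOF'
>00000000|gene_cluster:GC_00001105|genome_name:r7534_20160316|gene_callers_id:24
>00000001|gene_cluster:GC_00001105|genome_name:r7537_20160321|gene_callers_id:78
>00000006|gene_cluster:GC_00001290|genome_name:r7534_20160316|gene_callers_id:76
EOF

# When column 2 changes, close the previous output file, then append the
# current line to the file named after column 2.
awk -F'|' '
    $2 != prev { if (prev != "") close(prev); prev = $2 }
    { print >> ($2) }
' input_file
```

Only one output file is open at any time, so the number of distinct clusters no longer matters. If the output should be rebuilt from scratch, remove any existing gene_cluster:* files before running it.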