Text-Processing

每次給定列的內容更改時拆分文件

  • September 19, 2022

我正在處理一個文本文件,並且遇到了一個我無法解決的問題,儘管進行了一些Google搜尋和四處詢問。

我想根據第 2 列的內容(用 分隔)將此文件(20,880 行)拆分為單獨的文件|。每次第 2 列的內容更改時,我都想要一個新文件。不幸的是,第 2 列的每個實例的行數不規則,所以我不能每n行都拆分文件。

以下是原始文件的前幾行:

>00000000|gene_cluster:GC_00001105|genome_name:r7534_20160316|gene_callers_id:24
>00000001|gene_cluster:GC_00001105|genome_name:r7537_20160321|gene_callers_id:78
>00000002|gene_cluster:GC_00001105|genome_name:r7541_20160426|gene_callers_id:774
>00000003|gene_cluster:GC_00001105|genome_name:r7544_20160502|gene_callers_id:1034
>00000004|gene_cluster:GC_00001105|genome_name:r7547_20160512|gene_callers_id:330
>00000005|gene_cluster:GC_00001105|genome_name:r7550_20160517|gene_callers_id:2094
>00000006|gene_cluster:GC_00001290|genome_name:r7534_20160316|gene_callers_id:76
>00000007|gene_cluster:GC_00001290|genome_name:r7537_20160321|gene_callers_id:358
>00000008|gene_cluster:GC_00001290|genome_name:r7541_20160426|gene_callers_id:1601
>00000009|gene_cluster:GC_00001290|genome_name:r7544_20160502|gene_callers_id:2134

然後我按第二列對其進行排序,給我這個:

>00006406|gene_cluster:GC_00000001|genome_name:r7534_20160316|gene_callers_id:1988
>00006409|gene_cluster:GC_00000001|genome_name:r7537_20160321|gene_callers_id:1059
>00006410|gene_cluster:GC_00000001|genome_name:r7537_20160321|gene_callers_id:1811
>00006407|gene_cluster:GC_00000001|genome_name:r7537_20160321|gene_callers_id:1947
>00006411|gene_cluster:GC_00000001|genome_name:r7537_20160321|gene_callers_id:643
>00006408|gene_cluster:GC_00000001|genome_name:r7537_20160321|gene_callers_id:759
>00006412|gene_cluster:GC_00000001|genome_name:r7541_20160426|gene_callers_id:1252
>00006415|gene_cluster:GC_00000001|genome_name:r7541_20160426|gene_callers_id:1920
>00006414|gene_cluster:GC_00000001|genome_name:r7541_20160426|gene_callers_id:2021
>00006413|gene_cluster:GC_00000001|genome_name:r7541_20160426|gene_callers_id:2094

但是我還沒有弄清楚每次第二列更改時如何拆分文件。我該如何拆分這個文件?

謝謝!

使用awk

$ awk -F"|" '{print > $2}' input_file
$ head gene_cluster*
==> gene_cluster:GC_00001105 <==
>00000000|gene_cluster:GC_00001105|genome_name:r7534_20160316|gene_callers_id:24
>00000001|gene_cluster:GC_00001105|genome_name:r7537_20160321|gene_callers_id:78
>00000002|gene_cluster:GC_00001105|genome_name:r7541_20160426|gene_callers_id:774
>00000003|gene_cluster:GC_00001105|genome_name:r7544_20160502|gene_callers_id:1034
>00000004|gene_cluster:GC_00001105|genome_name:r7547_20160512|gene_callers_id:330
>00000005|gene_cluster:GC_00001105|genome_name:r7550_20160517|gene_callers_id:2094

==> gene_cluster:GC_00001290 <==
>00000006|gene_cluster:GC_00001290|genome_name:r7534_20160316|gene_callers_id:76
>00000007|gene_cluster:GC_00001290|genome_name:r7537_20160321|gene_callers_id:358
>00000008|gene_cluster:GC_00001290|genome_name:r7541_20160426|gene_callers_id:1601
>00000009|gene_cluster:GC_00001290|genome_name:r7544_20160502|gene_callers_id:2134

引用自:https://unix.stackexchange.com/questions/717815