Massaging some poorly delimited data into a useful CSV

March 2, 2017

I have some output of the form:
count  id     type
588    10 |    3
10    12 |    3
883    14 |    3
98    17 |    3
17    18 |    1
77598    18 |    3
10000    21 |    3
17892     2 |    3
20000    23 |    3
63    27 |    3
 6     3 |    3
2446    35 |    3
14    4 |    3
15     4 |    1
253     4 |    2
19857     4 |    3
1000     5 |    3
...
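This is pretty messy and needs to be cleaned up into a CSV so I can hand it to a project manager to spreadsheet to their heart's content.
The core of the problem is this: I need the output to be:
id, sum_of_type_1, sum_of_type_2, sum_of_type_3
An example of this is id "4":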
14    4 |    3
15     4 |    1
253     4 |    2
19857     4 |    3
This should instead be:
4,15,253,19871
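Unfortunately, I'm pretty rubbish at this sort of thing. I've managed to get all the lines cleaned up and into a CSV, but I haven't been able to deduplicate and group the rows. Right now I have this: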
awk 'BEGIN{OFS=",";} {split($line, part, " "); print part[1],part[2],part[4]}' | awk '{ gsub (" ", "", $0); print}'
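But all that does is clean up the junk characters and print the rows out again.
What's the best way to massage the rows into the output described above?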

One way is to put everything into a hash.

# put values into a hash keyed on the id and type
awk 'NR>1{n[$2","$4]+=$1}
END{
   # merge the same ids onto one line
   for(i in n){
       id=i;
       sub(/,.*/,"",id);
       a[id]=a[id]","n[i];
   }
   # print everything
   for(i in a){
       print i""a[i];
   }
}'
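
One caveat with the hash approach: awk's `for (i in n)` visits keys in an unspecified order, so the per-type sums are not guaranteed to come out as type 1, 2, 3, and an id that lacks a type ends up with fewer columns. The following is a minimal sketch of a fixed-column variant, not part of the original answer. It assumes the only types are 1, 2 and 3, prints 0 where an id has no rows of a given type, and wraps the awk in a hypothetical shell function named `sum_by_type`.

```shell
# Hypothetical helper: reads the "count id | type" table on stdin and
# prints one CSV row per id, with a fixed column for each of types 1, 2, 3.
sum_by_type() {
    awk '
        NR > 1 && NF >= 4 {          # skip the header; ignore short lines such as "..."
            sum[$2, $4] += $1        # accumulate the count under the (id, type) key
            ids[$2] = 1              # remember every id that was seen
        }
        END {
            # Assumption: the types are exactly 1, 2 and 3; a missing sum prints as 0.
            for (id in ids)
                printf "%s,%d,%d,%d\n", id, sum[id, 1], sum[id, 2], sum[id, 3]
        }
    ' | sort -t, -k1,1n              # order the rows numerically by id
}
```

It could be run as `sum_by_type < input.txt`, where `input.txt` is a placeholder for the saved output. For the four id "4" rows in the question, this prints `4,15,253,19871`.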

Edit: my first answer didn't answer the question properly.

Perl to the rescue:

#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

<>;  # Skip the header.

my %sum;
my %types;
while (<>) {
   my ($count, $id, $type) = grep length, split '[\s|]+';
   $sum{$id}{$type} += $count;
   $types{$type} = 1;
}

say join ',', 'id', sort keys %types;
for my $id (sort { $a <=> $b } keys %sum) {
   say join ',', $id, map $_ // q(), @{ $sum{$id} }{ sort keys %types };
}

It keeps two tables, a table of types and a table of ids. For each id, it stores the sum for each type.

Source: https://unix.stackexchange.com/questions/348303
