Merge multiple rows of data that share a common field

December 1, 2016

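I have a large file with 9+ columns of data, separated by semicolons (;). I would like to merge the data in the 3rd column (separated with a ,) for lines where the data in the 5th column matches. The data is held on a Linux machine that has the usual awk/perl tools, but I am not sure how to use them.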

File:

Domain Name;ID;Machine;Environment;ENV URL;Start Date;End Date;Disk Size;Used
orion.uk.localhost.com;XY01123;Machine-apache-ua01;uat;uat.orion.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
orion.uk.localhost.com;XY01123;Machine-apache-ua02;uat;uat.orion.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
matrix.localhost.com;XY6124;Machine-apache-dev1;dev;dev.matrix.localhost.com;16 April 2013 06:32:28 GMT+01:00;16 April 2018 07:02:28 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-bcp1;test;test.matrix.localhost.com;2 April 2013 08:12:10 GMT+01:00;2 April 2018 08:42:10 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-dev2;dev;dev.matrix.localhost.com;16 April 2013 06:32:28 GMT+01:00;16 April 2018 07:02:28 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-prd1;test;test.matrix.localhost.com;2 April 2013 08:12:10 GMT+01:00;2 April 2018 08:42:10 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-uat1;uat;uat.matrix.localhost.com;16 April 2013 07:06:33 GMT+01:00;16 April 2018 07:36:33 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-uat2;uat;uat.matrix.localhost.com;22 March 2013 06:16:10 GMT;22 March 2018 06:46:10 GMT;1024;External
Upgrade.uk.localhost.com;IN022345;Machine-apache-pf01;per;per.Upgrade.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
Upgrade.uk.localhost.com;IN022345;Machine-apache-pf02;per;per.Upgrade.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal

Expected output:

Domain Name;ID;Machine;Environment;ENV URL;Start Date;End Date;Disk Size;Used
orion.uk.localhost.com;XY01123;Machine-apache-ua01,Machine-apache-ua02;uat;uat.orion.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
matrix.localhost.com;XY6124;Machine-apache-dev1,Machine-apache-dev2;dev;dev.matrix.localhost.com;16 April 2013 06:32:28 GMT+01:00;16 April 2018 07:02:28 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-bcp1;test;test.matrix.localhost.com;2 April 2013 08:12:10 GMT+01:00;2 April 2018 08:42:10 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-prd1;test;test.matrix.localhost.com;2 April 2013 08:12:10 GMT+01:00;2 April 2018 08:42:10 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-uat1,Machine-apache-uat2;uat;uat.matrix.localhost.com;16 April 2013 07:06:33 GMT+01:00;16 April 2018 07:36:33 GMT+01:00;1024;External
Upgrade.uk.localhost.com;IN022345;Machine-apache-pf01,Machine-apache-pf02;per;per.Upgrade.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal

Any ideas on how to do the merge would be much appreciated.

There may be a more elegant way to do this in awk, but here is a possible script.
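BEGIN { FS=";" ; OFS=";" }       # semicolon as input and output field separator
NR==1 { print $0 }               # print the header line unchanged
NR>1 {
   if ( b[$5]=="" ) {            # first occurrence of this field 5 value
       a[$5]=$0                  # store the line as is
       b[$5]=$3                  # start the list of field 3 values
   }
   else {
       b[$5]=b[$5]","$3          # append this field 3 value to the list
       $3=b[$5]                  # put the merged list into the current line
       a[$5]=$0                  # store the modified line
   }
}
END {
   for (c in a) {                # traversal order is unspecified
       print a[c]
   }
}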
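Explanation:
BEGIN sets the semicolon as both the input and output field separator.
NR==1 simply prints the first line (the header) without any processing.
NR>1 handles all other lines:
b[$5] is an array indexed by the field 5 value, holding the (growing) comma-separated list of field 3 entries.
a[$5] is an array indexed by the field 5 value, holding the modified line (i.e. the line containing the comma-separated field 3 values).
If b[$5] is not set (first occurrence of this value), set a[$5] to the line and b[$5] to field 3.
Otherwise (b[$5] is set), append field 3 to b[$5] with a comma separator, replace field 3 in this line with that list, then replace a[$5] with the changed line.
END: for every index value c of array a, print the array element (i.e. the desired line).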
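
The accumulation step described above can be tried in isolation. This is only a minimal sketch on made-up two-field input: the function name collect_by_key and the sample keys are assumptions for illustration, not part of the original script. It builds one comma-separated list per key and pipes the result through sort, because the order of a for-in loop over an awk array is unspecified.

```shell
# collect_by_key: hypothetical helper name, reads "key;value" lines on stdin.
collect_by_key() {
    awk -F';' '
        {
            # First occurrence starts the list, later ones append with a comma.
            if ($1 in list) list[$1] = list[$1] "," $2
            else            list[$1] = $2
        }
        # Traversal order of for-in is unspecified, hence the sort below.
        END { for (k in list) print k ";" list[k] }
    ' | sort
}

# Made-up sample input; prints "k1;a,b" and then "k2;c".
printf 'k1;a\nk2;c\nk1;b\n' | collect_by_key
```

Testing membership with ($1 in list) instead of comparing against the empty string also copes with keys whose first value happens to be empty.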
I don't really know how awk orders the output (the traversal order of for (c in a) is unspecified), but this is my result:
Domain Name;ID;Machine;Environment;ENV URL;Start Date;End Date;Disk Size;Used
Upgrade.uk.localhost.com;IN022345;Machine-apache-pf01,Machine-apache-pf02;per;per.Upgrade.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
matrix.localhost.com;XY6124;Machine-apache-dev1,Machine-apache-dev2;dev;dev.matrix.localhost.com;16 April 2013 06:32:28 GMT+01:00;16 April 2018 07:02:28 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-uat1,Machine-apache-uat2;uat;uat.matrix.localhost.com;22 March 2013 06:16:10 GMT;22 March 2018 06:46:10 GMT;1024;External
orion.uk.localhost.com;XY01123;Machine-apache-ua01,Machine-apache-ua02;uat;uat.orion.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
matrix.localhost.com;XY6124;Machine-apache-bcp1,Machine-apache-prd1;test;test.matrix.localhost.com;2 April 2013 08:12:10 GMT+01:00;2 April 2018 08:42:10 GMT+01:00;1024;External
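
If the order of the input should be kept, one possible variant is to remember the order in which each field 5 value first appears. The following is a sketch under assumptions, not the script from this answer: the function name merge_rows and the arrays order, row and machines are made up for illustration. It also keeps the first line of each group as the template, whereas the script above stores the last one (which is why the uat.matrix line in the result carries the dates of Machine-apache-uat2). Like the script above, it merges Machine-apache-bcp1 and Machine-apache-prd1, since they share the same ENV URL, even though the expected output in the question lists them separately.

```shell
# merge_rows: hypothetical name; reads the semicolon-separated data on stdin.
merge_rows() {
    awk '
        BEGIN { FS = OFS = ";" }
        NR == 1 { print; next }            # header passes through unchanged
        !($5 in machines) {                # first time this field 5 value is seen
            order[++n] = $5                # remember first-seen order
            row[$5] = $0                   # keep the first line as the template
            machines[$5] = $3              # start the list of field 3 values
            next
        }
        { machines[$5] = machines[$5] "," $3 }   # later lines only add field 3
        END {
            for (i = 1; i <= n; i++) {
                $0 = row[order[i]]         # re-split the saved line into fields
                $3 = machines[order[i]]    # swap in the merged list
                print
            }
        }
    '
}

# Made-up sample with five fields; the real file would be fed the same way.
printf '%s\n' 'H1;H2;H3;H4;H5' 'a;1;m1;uat;uat.a' 'b;2;m3;dev;dev.b' 'a;9;m2;uat;uat.a' | merge_rows
```

For the file in the question this would be run as merge_rows < file.txt, assuming the data is saved under that name.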

Do you have sqlite available? And is my understanding of how you want to join the rows correct?

sqlite> .separator ;
sqlite> .import file.txt alldata
sqlite> select "ENV URL", group_concat("Machine") from alldata group by "ENV URL";
dev.matrix.localhost.com;Machine-apache-dev1,Machine-apache-dev2
per.Upgrade.uk.localhost.com;Machine-apache-pf01,Machine-apache-pf02
test.matrix.localhost.com;Machine-apache-bcp1,Machine-apache-prd1
uat.matrix.localhost.com;Machine-apache-uat1,Machine-apache-uat2
uat.orion.uk.localhost.com;Machine-apache-ua01,Machine-apache-ua02

Or non-interactively:

echo 'select "ENV URL", group_concat("Machine") from alldata group by "ENV URL";' \
 | sqlite3 -separator ";" -cmd ".import file.txt alldata" -batch

Source: https://unix.stackexchange.com/questions/327140
