Text-Processing
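Merge multiple rows of data that share a common field
I have a large file with 9+ rows of data, delimited by semicolons (;). I want to merge the rows whose column 5 values match, joining their column 3 values with commas (,). The data sits on a Linux machine that has the usual awk/perl tools, but I don't know how to use them.
File: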
Domain Name;ID;Machine;Environment;ENV URL;Start Date;End Date;Disk Size;Used
orion.uk.localhost.com;XY01123;Machine-apache-ua01;uat;uat.orion.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
orion.uk.localhost.com;XY01123;Machine-apache-ua02;uat;uat.orion.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
matrix.localhost.com;XY6124;Machine-apache-dev1;dev;dev.matrix.localhost.com;16 April 2013 06:32:28 GMT+01:00;16 April 2018 07:02:28 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-bcp1;test;test.matrix.localhost.com;2 April 2013 08:12:10 GMT+01:00;2 April 2018 08:42:10 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-dev2;dev;dev.matrix.localhost.com;16 April 2013 06:32:28 GMT+01:00;16 April 2018 07:02:28 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-prd1;test;test.matrix.localhost.com;2 April 2013 08:12:10 GMT+01:00;2 April 2018 08:42:10 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-uat1;uat;uat.matrix.localhost.com;16 April 2013 07:06:33 GMT+01:00;16 April 2018 07:36:33 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-uat2;uat;uat.matrix.localhost.com;22 March 2013 06:16:10 GMT;22 March 2018 06:46:10 GMT;1024;External
Upgrade.uk.localhost.com;IN022345;Machine-apache-pf01;per;per.Upgrade.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
Upgrade.uk.localhost.com;IN022345;Machine-apache-pf02;per;per.Upgrade.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
Expected output:
Domain Name;ID;Machine;Environment;ENV URL;Start Date;End Date;Disk Size;Used
orion.uk.localhost.com;XY01123;Machine-apache-ua01,Machine-apache-ua02;uat;uat.orion.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
matrix.localhost.com;XY6124;Machine-apache-dev1,Machine-apache-dev2;dev;dev.matrix.localhost.com;16 April 2013 06:32:28 GMT+01:00;16 April 2018 07:02:28 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-bcp1;test;test.matrix.localhost.com;2 April 2013 08:12:10 GMT+01:00;2 April 2018 08:42:10 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-prd1;test;test.matrix.localhost.com;2 April 2013 08:12:10 GMT+01:00;2 April 2018 08:42:10 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-uat1,Machine-apache-uat2;uat;uat.matrix.localhost.com;16 April 2013 07:06:33 GMT+01:00;16 April 2018 07:36:33 GMT+01:00;1024;External
Upgrade.uk.localhost.com;IN022345;Machine-apache-pf01,Machine-apache-pf02;per;per.Upgrade.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
Any ideas on how to do the merge would be greatly appreciated.
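There may be a more elegant way to do this in awk, but here is one possible script.
BEGIN { FS=";" ; OFS=";" }      # semicolon as input and output separator
NR==1 { print $0 }              # header line is printed as-is
NR>1 {
    if ( b[$5]=="" ) {          # first row seen for this field 5 value
        a[$5]=$0
        b[$5]=$3
    } else {                    # later row: extend the list, rewrite the row
        b[$5]=b[$5]","$3
        $3=b[$5]
        a[$5]=$0
    }
}
END { for (c in a) { print a[c] } }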
Explanation:
BEGIN sets the semicolon as both the input and the output field separator.
NR==1 just prints the first line (the header) without doing anything to it.
NR>1 handles all other lines:
- b[$5] is an array indexed by the field 5 value, holding the (growing) comma-separated list of field 3 entries
- a[$5] is an array indexed by the field 5 value, holding the modified line (i.e. the line with the comma-separated field 3 values)
- if b[$5] is unset (first occurrence of this value), set a[$5] to the line and b[$5] to field 3
- otherwise (b[$5] is set), append field 3 to b[$5] with a comma separator, replace field 3 in this line with that list, and then replace a[$5] with the changed line
END prints the element of array a (i.e. the required line) for every index value c of the array.
I don't really know how awk orders the output, but here is my result:
Domain Name;ID;Machine;Environment;ENV URL;Start Date;End Date;Disk Size;Used
Upgrade.uk.localhost.com;IN022345;Machine-apache-pf01,Machine-apache-pf02;per;per.Upgrade.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
matrix.localhost.com;XY6124;Machine-apache-dev1,Machine-apache-dev2;dev;dev.matrix.localhost.com;16 April 2013 06:32:28 GMT+01:00;16 April 2018 07:02:28 GMT+01:00;1024;External
matrix.localhost.com;XY6124;Machine-apache-uat1,Machine-apache-uat2;uat;uat.matrix.localhost.com;22 March 2013 06:16:10 GMT;22 March 2018 06:46:10 GMT;1024;External
orion.uk.localhost.com;XY01123;Machine-apache-ua01,Machine-apache-ua02;uat;uat.orion.uk.localhost.com;5 August 2015 16:54:08 GMT+01:00;2 August 2025 16:54:08 GMT+01:00;2048;Internal
matrix.localhost.com;XY6124;Machine-apache-bcp1,Machine-apache-prd1;test;test.matrix.localhost.com;2 April 2013 08:12:10 GMT+01:00;2 April 2018 08:42:10 GMT+01:00;1024;External
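The result above comes out in a different order from the input because awk does not specify the order in which for (c in a) visits an array. Each merged row also carries the remaining fields of the last matching row (note the 22 March dates on the uat.matrix line). If input order and the first row's fields are wanted instead, something like the following might work. It is only a sketch: the function name merge_by_env_url and the array names machines, first and order are made up for illustration, and it assumes the same semicolon-separated layout with a single header line.

```shell
# Merge rows that share field 5 (ENV URL), joining field 3 (Machine) with commas.
# Rows are printed in order of first appearance, and all fields other than
# field 3 are taken from the first row of each group.
merge_by_env_url() {
    awk '
    BEGIN { FS = OFS = ";" }
    NR == 1 { print; next }                  # header line passes through unchanged
    !($5 in machines) {                      # first time this ENV URL is seen
        order[++n] = $5                      # remember arrival order
        first[$5] = $0                       # keep the first full row
        machines[$5] = $3                    # start the machine list
        next
    }
    { machines[$5] = machines[$5] "," $3 }   # later rows: extend the list
    END {
        for (i = 1; i <= n; i++) {
            $0 = first[order[i]]             # re-split the saved row into fields
            $3 = machines[order[i]]          # swap in the merged machine list
            print
        }
    }' "$@"
}
```

It could then be called as merge_by_env_url file.txt > merged.txt, where file.txt is the input file. Like the script above, this groups purely on field 5, so the two test.matrix rows are merged as well.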
Do you have sqlite? And do I understand correctly how you want the rows joined?
sqlite> .separator ;
sqlite> .import file.txt alldata
sqlite> select "ENV URL", group_concat("Machine") from alldata group by "ENV URL";
dev.matrix.localhost.com;Machine-apache-dev1,Machine-apache-dev2
per.Upgrade.uk.localhost.com;Machine-apache-pf01,Machine-apache-pf02
test.matrix.localhost.com;Machine-apache-bcp1,Machine-apache-prd1
uat.matrix.localhost.com;Machine-apache-uat1,Machine-apache-uat2
uat.orion.uk.localhost.com;Machine-apache-ua01,Machine-apache-ua02
Or non-interactively:
echo 'select "ENV URL", group_concat("Machine") from alldata group by "ENV URL";' \
| sqlite3 -separator ";" -cmd ".import file.txt alldata" -batch
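If sqlite is not installed, roughly the same two-column summary (ENV URL, then the comma-joined machines) can probably be produced with awk and sort alone. This is a minimal sketch under the same assumptions about the file layout. The function name env_url_summary and the array name list are hypothetical, and the explicit sort is there only because awk's for-in order is unspecified.

```shell
# Print "ENV URL;machine1,machine2,..." for each distinct field 5 value,
# sorted by URL. The header line is skipped.
env_url_summary() {
    awk -F';' '
    NR > 1 {
        if ($5 in list) list[$5] = list[$5] "," $3   # extend an existing group
        else            list[$5] = $3                # start a new group
    }
    END { for (url in list) print url ";" list[url] }
    ' "$@" | LC_ALL=C sort -t';' -k1,1
}
```

Called as env_url_summary file.txt, it reads the same input file as the sqlite commands above.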