Text-Processing
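How to split a file on a text pattern using awk
I have a 2 GB file. It consists of a header followed by many "event" structures. This is what the beginning looks like: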
<run example> <header> 5 This is header </header> <event = 22> <evhead> 8 3 1 2 0 0 0 0 0 0 0 1 0 1 30 0 1 4 1 4 3 1 0 1 0 0 0 0 0 0 1 1 8 0 1 0 2 1 5 2 0 2 1 3 7 3 1 1 0 1 0 10100 2 3 1 5 1 1 5 1 7 2 3 2 2 </evhead> 0 97 3 11 0 0 3 4 1.94791176123E-14 0.00000000000E+00 -2.75000000000E+01 2.75000000047E+01 5.10000000000E-04 0.00000000000E+00 0.00000000000E+00 0.00000000000E+00 0.00000000000E+00 3 2212 0 0 5 0 -1.94791176123E-14 0.00000000000E+00 9.20000000000E+02 9.20000478451E+02 9.38270000000E-01 0.00000000000E+00 0.00000000000E+00 0.00000000000E+00 0.00000000000E+00 3 11 1 0 0 0 4.63012694434E+00 2.62561831936E+00 -2.31855757639E+01 2.37887130977E+01 5.10000000000E-04 0.00000000000E+00 0.00000000000E+00 0.00000000000E+00 0.00000000000E+00 3 22 1 0 0 0 -4.63012694434E+00 -2.62561831936E+00 -4.31442423592E+00 3.71128690719E+00 -5.75956188088E+00 0.00000000000E+00 0.00000000000E+00 0.00000000000E+00 0.00000000000E+00 3 2212 2 0 0 0 -2.16995636615E-14 -1.11022302463E-15 9.20000000000E+02 9.20000478451E+02 9.38270000000E-01 0.00000000000E+00 0.00000000000E+00 0.00000000000E+00 0.00000000000E+00 3 22 4 0 0 0 -4.60626572550E+00 -2.61208727495E+00 -2.23619853289E+00 5.74815342040E+00 0.00000000000E+00 0.00000000000E+00 0.00000000000E+00 0.00000000000E+00 0.00000000000E+00 </event>
The whole file contains 97,000 such "event" blocks. I want to split it into 10 files, each containing the header and 10,000 "event" blocks. The blocks' indices all differ (they are random). The last file will of course contain only 7,000 blocks.
I have tried several recipes from Stack Overflow, such as https://stackoverflow.com/questions/8544197/splitting-a-file-in-linux-based-on-content, but none of them worked for me.
Below is a larger sample of the file that can be used for testing (file_to_download):
<run example> <header> 5 header </header> <event = 22> <evhead> 8 3 1 2 0 0 0 0 0 0 0 1 0 1 30 0 1 4 1 4 3 1 0 1 0 </evhead> 0 97 3 11 0 0 3 4 1.94791176123E-14 0.00000000000E+00 -2.75000000000E+01 2.75000000047E+01 5.10000000000E-04 0.00000000000E+00 0.00000000000E+00 0.00000000000E+00 0.00000000000E+00 3 2212 0 0 5 0 -1.94791176123E-14 0.00000000000E+00 9.20000000000E+02 9.20000478451E+02 9.38270000000E-01 </event> <event = 26> <evhead> 8 3 1 2 0 0 0 0 0 0 0 1 0 1 30 0 1 4 1 4 3 1 0 1 0 </evhead> 0 52 3 11 0 0 3 4 1.94791176123E-14 0.00000000000E+00 -2.75000000000E+01 2.75000000047E+01 5.10000000000E-04 0.00000000000E+00 0.00000000000E+00 0.00000000000E+00 0.00000000000E+00 3 2212 0 0 5 0 -1.94791176123E-14 0.00000000000E+00 9.20000000000E+02 9.20000478451E+02 9.38270000000E-01 </event> <event = 31> <evhead> 8 3 1 2 0 0 0 0 0 0 0 1 0 1 30 0 1 4 1 4 3 1 0 1 0 0 0 0 0 0 1 1 8 </evhead> 0 92 3 11 0 0 3 4 1.94791176123E-14 0.00000000000E+00 -2.75000000000E+01 2.75000000047E+01 5.10000000000E-04 0.00000000000E+00 0.00000000000E+00 0.00000000000E+00 0.00000000000E+00 3 2212 0 0 5 0 -1.94791176123E-14 0.00000000000E+00 9.20000000000E+02 9.20000478451E+02 9.38270000000E-01 0.00000000000E+00 0.00000000000E+00 0.00000000000E+00 0.00000000000E+00 3 11 1 0 0 0 4.39003604933E+00 4.97037860337E+00 -2.04926313413E+01 2.15389187176E+01 5.10000000000E-04 </event> <event = 37> <evhead> 8 3 1 2 0 0 0 0 0 0 0 1 0 1 30 0 1 </evhead> 0 77 3 11 0 0 3 4 1.94791176123E-14 0.00000000000E+00 -2.75000000000E+01 2.75000000047E+01 5.10000000000E-04 0.00000000000E+00 0.00000000000E+00 0.00000000000E+00 0.00000000000E+00 3 2212 0 0 5 0 -1.94791176123E-14 0.00000000000E+00 9.20000000000E+02 9.20000478451E+02 9.38270000000E-01 0.00000000000E+00 0.00000000000E+00 0.00000000000E+00 0.00000000000E+00 3 11 1 0 0 0 7.91768942174E+00 3.75815788575E+00 -2.09569980000E+01 2.27158385693E+01 5.10000000000E-04 0.00000000000E+00 0.00000000000E+00 0.00000000000E+00 0.00000000000E+00 </event> <event = 41> 
<evhead> 8 3 1 2 0 0 0 0 0 0 0 1 0 1 30 0 1 4 1 4 3 1 0 1 0 </evhead> 0 122 3 11 0 0 3 4 1.94791176123E-14 0.00000000000E+00 -2.75000000000E+01 2.75000000047E+01 5.10000000000E-04 0.00000000000E+00 0.00000000000E+00 0.00000000000E+00 0.00000000000E+00 3 2212 0 0 5 0 -1.94791176123E-14 0.00000000000E+00 9.20000000000E+02 9.20000478451E+02 9.38270000000E-01 0.00000000000E+00 0.00000000000E+00 0.00000000000E+00 0.00000000000E+00 3 11 1 0 0 0 -3.63469912393E+00 3.95372353695E+00 -1.62133507727E+01 1.70796870892E+01 5.10000000000E-04 0.00000000000E+00 0.00000000000E+00 0.00000000000E+00 0.00000000000E+00 </event>
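Using GNU awk:

    # Lines seen before the first event (other than the header) are discarded
    BEGIN { fname = "/dev/null" }

    # Collect the header block so it can be repeated in every output file
    /<header>/,/<\/header>/ { hdr = hdr $0 "\n"; next }

    # Every 10000 events, close the current file and start a new one
    /^<event / {
        events++
        if (events % 10000 == 1) {
            if (files++) close(fname)
            fname = sprintf("file%02d.txt", files)
            # hdr already ends with a newline, so printf avoids an extra blank line
            printf "%s", hdr > fname
        }
    }

    # Write every remaining line to the current output file
    { print > fname }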
To run it, save the script to a file named script.awk, then run:

    gawk -f script.awk file.txt
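Before running this on the full 2 GB file, it may be worth checking the logic on a tiny input. The following is only a sketch under stated assumptions: it assumes each tag sits on its own line (the script relies on `<event ` appearing at the start of a line), and the names `sample.txt`, `part%02d.txt` and the chunk-size variable `n` are hypothetical, introduced here for illustration. It uses a slightly parameterized variant of the script above rather than the script itself.

```shell
# Work in a scratch directory so no existing files are touched
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a tiny sample: a wrapper line, a header block, then five events
{
    printf '<run example>\n<header>\n5\nThis is header\n</header>\n'
    for id in 22 26 31 37 41; do
        printf '<event = %s>\n<evhead>\n8 3 1\n</evhead>\n0 97 3\n</event>\n' "$id"
    done
} > sample.txt

# Variant of the script above with the chunk size passed in as n (hypothetical name)
awk -v n=2 '
    # Collect the header so it can be repeated in every output file
    /<header>/,/<\/header>/ { hdr = hdr $0 "\n"; next }
    # Start a new output file at every n-th event
    /^<event / {
        if (events++ % n == 0) {
            if (fname != "") close(fname)
            fname = sprintf("part%02d.txt", ++files)
            printf "%s", hdr > fname
        }
    }
    # Lines seen before the first event are skipped
    fname != "" { print > fname }
' sample.txt

# Report how many events landed in each output file
for f in part*.txt; do
    printf '%s: %s events\n' "$f" "$(grep -c '^<event ' "$f")"
done
```

With five events and `n=2`, this should produce three files holding 2, 2 and 1 events, each starting with the header block. If the counts look right, the same check with `n=10000` on the real file should give ten files, the last one holding 7,000 events.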