Text-Processing
How to merge two files on a single key column, auto-fill the fixed columns, and fill in missing data
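I have two sets of files that I am trying to merge into one file while filling in the associated missing data. Both files are comma-delimited. The first file contains 13 columns, and column 8 holds a date in YYYY-MM-DD format (note: this file is missing 44 days). The second file has 2 columns: the first is the complete calendar year (a leap year, 366 days) in YYYY-MM-DD format, and the second is the associated Julian date value.

Sample of file #1, with missing days: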
    06,037,0016,42101,34.14435,-117.85036,1-HOUR,2020-01-26,Parts-per-million,24,100.0,0.379167,10
    06,037,0016,42101,34.14435,-117.85036,1-HOUR,2020-01-27,Parts-per-million,24,100.0,0.2875,10
    06,037,0016,42101,34.14435,-117.85036,1-HOUR,2020-01-28,Parts-per-million,11,46.0,0.163636,10
    06,037,0016,42101,34.14435,-117.85036,1-HOUR,2020-01-30,Parts-per-million,20,83.0,0.23,10
    06,037,0016,42101,34.14435,-117.85036,1-HOUR,2020-01-31,Parts-per-million,24,100.0,0.195833,10
I tried the following command to merge the files and create a new file with 14 columns, with the missing dates filled in and the Julian date added. I am also looking for code that auto-populates the fixed values of columns 1-7 and 9 from the initial file, and fills columns 10-13 with -999 where that data is missing.

    awk -F ',' 'NR==FNR {h[$1] = $14; next} {print $1,$2,$3,$4,$5,$6,$7,$8,h[$2],$9,$10,$11,$12,$13}' temp2.tmp temp1.tmp > temp3.tmp

Desired output (the 2020-01-29 row is the filled-in one):

    06,037,0016,42101,34.14435,-117.85036,1-HOUR,2020-01-26,26,Parts-per-million,24,100.0,0.379167,10
    06,037,0016,42101,34.14435,-117.85036,1-HOUR,2020-01-27,27,Parts-per-million,24,100.0,0.2875,10
    06,037,0016,42101,34.14435,-117.85036,1-HOUR,2020-01-28,28,Parts-per-million,11,46.0,0.163636,10
    06,037,0016,42101,34.14435,-117.85036,1-HOUR,2020-01-29,29,Parts-per-million,-999,-999,-999,-999
    06,037,0016,42101,34.14435,-117.85036,1-HOUR,2020-01-30,30,Parts-per-million,20,83.0,0.23,10
    06,037,0016,42101,34.14435,-117.85036,1-HOUR,2020-01-31,31,Parts-per-million,24,100.0,0.195833,10
Now, date and time calculations are always a ... difficult thing, especially if the date-time sequence crosses midnight, a month's or year's end, or a DST switch. Here, to be on the safe side, we use epoch seconds. Converting back to a date with the date command might not work on all *nix flavours. Also, we set the TZ variable to "UTC" to avoid DST problems. Try it without and you'll see. Come on, try:

    export TZ=UTC    # get rid of side effects, e.g. DST switching

    # prepend epoch seconds to the input file
    cut -d, -f8 samplefile | date -f- +%s | paste - samplefile > TMP1

    {
        read MIN DUMMY                   # get the MIN and MAX dates of the file
        while read TMP DUMMY
        do
            MAX=$TMP
        done
        # calculate the sequence of days between them,
        # in epoch, yyyy-mm-dd, and julian format
        eval echo @{$MIN..$MAX..86400} | tr ' ' $'\n' | date -f- +$'%s\t%Y-%m-%d\t%y%j'
    } < TMP1 > TMP2

    # join first and second intermediate files
    join -a1 -a2 -- TMP1 TMP2 | awk -F"[, ]" '
    NF == 3 {                            # line missing from the original file
        split($0, TMPINS)                # fill temp array with epoch etc. data
        $0 = SAVED                       # get last saved complete line
        $9 = TMPINS[2]                   # overwrite the saved (previous) date
        $NF = TMPINS[3]                  # append julian date
        $11 = $12 = $13 = $14 = -999     # set invalid indicator
    }
    NF >= 13 {
        SAVED = $0                       # correct line? save it
        $1 = $1                          # recreate line with OFS char
    }
    {                                    # for all lines:
        sub($1",",_)                     # remove leading epoch field
        $14 = $15                        # put julian date into right place
        NF--                             # get rid of last field; may not work in ALL awks
    }
    1                                    # default action: print
    ' OFS=","

This yields:

    06,037,0016,42101,34.14435,-117.85036,1-HOUR,2020-01-26,Parts-per-million,24,100.0,0.379167,10,20026
    06,037,0016,42101,34.14435,-117.85036,1-HOUR,2020-01-27,Parts-per-million,24,100.0,0.2875,10,20027
    06,037,0016,42101,34.14435,-117.85036,1-HOUR,2020-01-28,Parts-per-million,11,46.0,0.163636,10,20028
    06,037,0016,42101,34.14435,-117.85036,1-HOUR,2020-01-29,Parts-per-million,-999,-999,-999,-999,20029
    06,037,0016,42101,34.14435,-117.85036,1-HOUR,2020-01-30,Parts-per-million,20,83.0,0.23,10,20030
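The day-sequence step can be tried in isolation. What follows is only a minimal sketch, assuming GNU date (the -d option and the @EPOCH input form are GNU extensions). The function name day_sequence and its output layout (date, then %y%j) are made up for illustration and are not part of the script above.

```shell
# Assumes GNU date: -d and @EPOCH are GNU extensions.
export TZ=UTC    # in UTC every day is exactly 86400 seconds long

# day_sequence START END  (hypothetical helper; both dates as YYYY-MM-DD)
# Prints one "YYYY-MM-DD,YYJJJ" line per day from START to END inclusive.
day_sequence() {
    ds_t=$(date -d "$1" +%s) || return 1      # start of range in epoch seconds
    ds_end=$(date -d "$2" +%s) || return 1    # end of range in epoch seconds
    while [ "$ds_t" -le "$ds_end" ]; do
        date -d "@$ds_t" +'%Y-%m-%d,%y%j'     # convert back to date and julian day
        ds_t=$((ds_t + 86400))                # step one day forward
    done
}

day_sequence 2020-01-28 2020-01-30
```

With the sample range this prints 2020-01-28,20028, then 2020-01-29,20029, then 2020-01-30,20030, one per line. Stepping by 86400 seconds is only safe because TZ is set to UTC; in a zone with DST, some local days are 23 or 25 hours long.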
With the help of a FIFO, the whole thing can be written as one lengthy command pipeline:
    mkfifo TMPFIFO
    cut -d, -f8 samplefile |
        date -f- +%s |
        tee -a >(read MIN; while read TMP; do MAX=$TMP; done; eval echo @{$MIN..$MAX..86400} | tr ' ' $'\n' > TMPFIFO) |
        paste - samplefile |
        join -a1 -a2 -- - <(date -fTMPFIFO +$'%s\t%Y-%m-%d\t%y%j') |
        awk -F"[, ]" 'NF == 3 {split($0, TMPINS); $0 = SAVED; $9 = TMPINS[2]; $NF = TMPINS[3]; $11 = $12 = $13 = $14 = -999} NF >= 13 {SAVED = $0; $1 = $1} {sub($1",",_); $14 = $15; NF--} 1' OFS=","
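Since the question already has a complete calendar file with the Julian values, the gap filling could also be sketched in awk alone, with no date arithmetic at all. This is a sketch under several assumptions, not a tested replacement for the pipeline above: the function name fill_gaps is made up, the fixed columns are copied from the first data row, the Julian value goes into column 9 as in the question's desired output, one row is printed for every calendar day, and the data file must not be empty.

```shell
# fill_gaps DATAFILE CALENDARFILE  (hypothetical helper)
# DATAFILE:     13 comma-separated columns, date in column 8
# CALENDARFILE: 2 comma-separated columns, date and julian value
# Assumes DATAFILE is non-empty; otherwise NR == FNR also matches the calendar.
fill_gaps() {
    awk -F, -v OFS=, '
        NR == FNR {                       # first file: the data rows
            if (FNR == 1) template = $0   # source of the fixed columns
            row[$8] = $0                  # index each row by its date
            next
        }
        {                                 # second file: one line per calendar day
            found = ($1 in row)
            split(found ? row[$1] : template, f, ",")
            if (!found) {                 # day missing from the data file
                f[8] = $1                 # use the calendar date
                f[10] = f[11] = f[12] = f[13] = "-999"
            }
            out = f[1]
            for (i = 2; i <= 8; i++) out = out OFS f[i]
            out = out OFS $2              # julian value becomes column 9
            for (i = 9; i <= 13; i++) out = out OFS f[i]
            print out
        }
    ' "$1" "$2"
}

# Usage with the file names from the question:
#   fill_gaps temp1.tmp temp2.tmp > temp3.tmp
```

Reading the data file first and driving the output from the calendar file means the output follows the calendar's order, so neither file has to be sorted, which join would require.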