Text-Processing
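awk script to check each line in a file
I am having some trouble creating an awk script that checks, and possibly corrects, each line in a text file.
Consider this example: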
    $ cat employee.txt
    "100","Thomas","Sales","5000"
    "200","Jason","Technology","5500"
    "300","Mayla",
    "Technology","7000"
    "400","Nisha","Marketing","9500"
    "500","Randy","Techno
    logy","6000"
    "501","Ritu","Accounting","5400"
As you can see, some lines are broken at the wrong point. The file should look like this:
    $ cat employee.txt
    "100","Thomas","Sales","5000"
    "200","Jason","Technology","5500"
    "300","Mayla","Technology","7000"
    "400","Nisha","Marketing","9500"
    "500","Randy","Technology","6000"
    "501","Ritu","Accounting","5400"
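So I would like to know whether there is a way in awk to detect when a line does not follow the pattern, for example by checking the number of commas in each line, and then join the broken line back together.
I receive files like this with hundreds or thousands of lines, so fixing every broken line by hand is tedious.
I am creating a control file to load the data into a table with SQLLDR, but I get errors because the text file contains broken lines. So my plan is to repair each line with a script.
Any ideas? The script does not have to be in awk.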
    $ awk -F, 'FNR == 1 { nf = NF } { while (NF < nf || !/[^,]"$/) { line = $0; getline; $0 = line $0 }; print }' file
    "100","Thomas","Sales","5000"
    "200","Jason","Technology","5500"
    "300","Mayla","Technology","7000"
    "400","Nisha","Marketing","9500"
    "500","Randy","Technology","6000"
    "501","Ritu","Accounting","5400"
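This uses awk and assumes that the first line has the correct number of fields and that no field contains an embedded comma. It further assumes that no line has too many fields, i.e. a line may contain extra newlines, but no line is joined with the next or previous line. When a line with too few fields is found (or a line that does not end with a " character, which means the last field was split), the current line is saved in the variable line and the next line is read. The current line is then updated to be the concatenation of line and the line just read. This continues (in case of several consecutive splits) until we end up with something that has the correct number of fields. The reconstructed line is then printed.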
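One case the loop above does not cover is a file that ends in the middle of a broken record: getline then fails, $0 is left unchanged, and the loop keeps re-appending the same text and may never finish. A minimal sketch of a guarded variant, under the same assumptions about the input (the wrapper name fix_lines is made up for illustration and is not part of the original answer):

```shell
# fix_lines is a hypothetical wrapper name used only for this sketch.
fix_lines() {
    awk -F, '
        FNR == 1 { nf = NF }               # trust the first line for the field count
        {
            # Keep appending the next line while the record has too few fields
            # or does not end in a non-comma character followed by a quote.
            while (NF < nf || !/[^,]"$/) {
                line = $0
                if ((getline) <= 0) break  # stop at end of input or on a read error
                $0 = line $0               # join the pieces; awk re-splits and updates NF
            }
            print
        }
    ' "$1"
}
```

Calling fix_lines employee.txt prints the repaired records. A record that is still incomplete at the end of the file is printed as it stands rather than being dropped.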
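NF is a special awk variable that holds the number of fields in the current record (a record is one line by default). This number is updated automatically whenever $0 (the current record) is assigned to or a new record is read. The nf variable is our own variable, set from the first line to the "correct number of fields".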
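To see the automatic update of NF in isolation, here is a small demonstration on one made-up broken line (the variable name nf_demo is only for illustration). It records NF, appends the missing half of the record to $0, and prints both counts:

```shell
# Feed one broken line to awk, then assign to $0 and compare NF before and after.
nf_demo=$(printf '%s\n' '"300","Mayla",' |
    awk -F, '{ before = NF; $0 = $0 "\"Technology\",\"7000\""; print before, NF }')
echo "$nf_demo"   # prints: 3 4
```

The trailing comma on the broken line produces an empty third field, so NF starts at 3. After the assignment awk splits the record again and NF becomes 4.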