Text-Processing

awk script to check each line in a file

  • December 5, 2019

I'm having some trouble creating an awk script to check, and possibly correct, each line in a text file.

Consider this example:

$ cat employee.txt
"100","Thomas","Sales","5000"
"200","Jason","Technology","5500"
"300","Mayla",
"Technology","7000"
"400","Nisha","Marketing","9500"
"500","Randy","Techno
logy","6000"
"501","Ritu","Accounting","5400"

As you can see, some lines are broken at the wrong point. The file should look like this:

$ cat employee.txt
"100","Thomas","Sales","5000"
"200","Jason","Technology","5500"
"300","Mayla","Technology","7000"
"400","Nisha","Marketing","9500"
"500","Randy","Technology","6000"
"501","Ritu","Accounting","5400"

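So I was wondering whether there is a way in awk to detect when a line does not follow the pattern, for example by checking the number of commas in each line, and then join the broken line back onto the previous one.

I receive files like this containing hundreds or thousands of lines, so fixing all the broken lines by hand is tedious.

I am creating a control file to load the data into a table using SQLLDR, but I get errors because the text file contains broken lines. So my solution is to fix every line with a script.

Any ideas? The script does not have to be in awk.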

$ awk -F, 'FNR == 1 { nf = NF } { while (NF < nf || !/[^,]"$/) { line = $0; getline; $0 = line $0 }; print }' file
"100","Thomas","Sales","5000"
"200","Jason","Technology","5500"
"300","Mayla","Technology","7000"
"400","Nisha","Marketing","9500"
"500","Randy","Technology","6000"
"501","Ritu","Accounting","5400"

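This uses awk and assumes that the first line has the correct number of fields and that no field contains an embedded comma. It further assumes that no line has too many fields, i.e. a line may contain extra newlines, but no line has been joined with the next or previous line.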
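To see whether a given file actually deviates from the expected pattern, a quick check along the following lines can list every line whose field count differs from that of the first line. This is only a sketch: sample.txt is a made-up file name, and the check relies on the same assumptions as above (the first line is intact and no field contains a comma).

```shell
# Hypothetical sample file: one intact line followed by one record split in two.
cat > sample.txt <<'EOF'
"100","Thomas","Sales","5000"
"300","Mayla",
"Technology","7000"
EOF

# Report the line number and field count of every line that differs from line 1.
bad=$(awk -F, 'FNR == 1 { nf = NF } NF != nf { print FNR ": " NF " fields" }' sample.txt)
printf '%s\n' "$bad"
```

An empty result means every line has the same number of comma-separated fields as the first line. It does not catch a line whose last field was split without losing a comma.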

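When a line is found with too few fields (or one that does not end in a " character, which would mean that the last field was split), the current line is saved in the variable line and the next line is read. The current line is then updated to be the concatenation of line and the line just read. This continues (in the case of several consecutive split lines) until we end up with something that has the correct number of fields. The reconstructed line is then printed.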
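The same logic can be laid out over several lines with comments, which may be easier to follow. The sketch below recreates the sample data from the question and stores the result in a shell variable called fixed (a name chosen here for illustration). It also adds one thing that is not in the one-liner: a check on the return value of getline, so that a file ending in a truncated record does not make the loop spin forever.

```shell
# Sample input with broken lines (same data as in the question).
cat > employee.txt <<'EOF'
"100","Thomas","Sales","5000"
"200","Jason","Technology","5500"
"300","Mayla",
"Technology","7000"
"400","Nisha","Marketing","9500"
"500","Randy","Techno
logy","6000"
"501","Ritu","Accounting","5400"
EOF

fixed=$(awk -F, '
    # Assume the first line is intact and use its field count as the target.
    FNR == 1 { nf = NF }
    {
        # Keep joining while the record is short or lacks a closing quote at the end.
        while (NF < nf || !/[^,]"$/) {
            line = $0
            # Addition to the original: stop at end of input instead of looping forever.
            if ((getline) <= 0) break
            $0 = line $0    # assigning to $0 re-splits the record and updates NF
        }
        print
    }
' employee.txt)

printf '%s\n' "$fixed"
```

If the input ends in the middle of a record, this version prints whatever was collected so far and stops.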

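NF is a special awk variable that holds the number of fields in the current record (a record is a line by default). This number is updated automatically when $0 (the current record) is assigned to or when a new record is read. The nf variable is our own variable, set from the first line to be "the correct number of fields".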
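A small demonstration of that behaviour, using one of the broken lines from the question as input. The variable name nf_demo is only for illustration.

```shell
nf_demo=$(printf '%s\n' '"300","Mayla",' |
awk -F, '{
    print NF                            # the trailing comma yields an empty third field
    $0 = $0 "\"Technology\",\"7000\""   # assigning to $0 re-splits the record
    print NF                            # NF now reflects the rebuilt record
}')

printf '%s\n' "$nf_demo"
```

This prints 3 and then 4: the broken line has three fields (the last one empty), and after the rest of the record is appended to $0, awk recomputes NF as 4.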

Source: https://unix.stackexchange.com/questions/555761