Text-Processing

Remove lines with a repeated first field/column from a large file

  • November 29, 2019

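I have a very large file (snippet below). I need to remove any line whose number in the first column is not the next consecutive value after the line above it.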

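For example, I want to keep the first line of the snippet, where the identifier in the first column is "40812". Then I want to keep the line with "40813" in the first column (line 3 in my example), then the line starting with "40814", and so on. I want to delete any line that breaks this succession, such as the second line.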

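I have looked through previous questions and answers here for a possible solution, but without success so far. One solution that appears in several questions is: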

awk -F',' '!seen[$1]++' myFile

I also adapted another solution I came across:

sort -t':' -k 1,1 -u myFile

I would be grateful if someone could tell me where I am going wrong. I am not very experienced with file manipulation.

40812        20406.000000         0.843859468      1083.209050130      -994.562279080      -993.349611938        22.120868921
40829        20414.500000         0.891283743      1144.084593627      -994.539001565      -993.349739827        21.177788019
40813        20406.500000         0.829362077      1064.599666089      -994.546948121      -993.348764740        22.087239027
40830        20415.000000         0.889606427      1141.931529727      -994.537943593      -993.350242614        21.282490969
40814        20407.000000         0.822524589      1055.822814442      -994.540118434      -993.348757318        22.083606005
40831        20415.500000         0.875230513      1123.478077086      -994.523844766      -993.350421831        20.606467962
40815        20407.500000         0.823511602      1057.089780943      -994.541681744      -993.349315083        22.432111979
40832        20416.000000         0.846150258      1086.149592126      -994.494220141      -993.349798791        22.309054136
40816        20408.000000         0.824550451      1058.423286012      -994.543159511      -993.349731194        22.481428146
40833        20416.500000         0.811604775      1041.805740021      -994.458563132      -993.348626225        21.118428946
40834        20417.000000         0.787796672      1011.244783236      -994.434062658      -993.347887110        20.963790894
40817        20408.500000         0.819160081      1051.504008955      -994.537767061      -993.349702160        22.268819809
40835        20417.500000         0.784857495      1007.471947645      -994.431441227      -993.348167742        20.731789112
40818        20409.000000         0.807571275      1036.628191427      -994.525675417      -993.349169067        22.332761049
40836        20418.000000         0.799208319      1025.893192994      -994.446595759      -993.348938468        21.268665075
40819        20409.500000         0.797104599      1023.192780242      -994.514563564      -993.348491176        22.622548103
40837        20418.500000         0.819797939      1052.322786256      -994.467698852      -993.349417295        21.013041973
40820        20410.000000         0.796605925      1022.552664951      -994.513928312      -993.348319789        22.193170071

This is exactly the kind of thing awk is good at:

$ awk '{ if(NR==1 || $1 == last+1){print; last=$1}}' file
40812        20406.000000         0.843859468      1083.209050130      -994.562279080      -993.349611938        22.120868921
40813        20406.500000         0.829362077      1064.599666089      -994.546948121      -993.348764740        22.087239027
40814        20407.000000         0.822524589      1055.822814442      -994.540118434      -993.348757318        22.083606005
40815        20407.500000         0.823511602      1057.089780943      -994.541681744      -993.349315083        22.432111979
40816        20408.000000         0.824550451      1058.423286012      -994.543159511      -993.349731194        22.481428146
40817        20408.500000         0.819160081      1051.504008955      -994.537767061      -993.349702160        22.268819809
40818        20409.000000         0.807571275      1036.628191427      -994.525675417      -993.349169067        22.332761049
40819        20409.500000         0.797104599      1023.192780242      -994.514563564      -993.348491176        22.622548103
40820        20410.000000         0.796605925      1022.552664951      -994.513928312      -993.348319789        22.193170071

Or, golfed a little:

$ awk '(NR==1 || $1 == last+1) && last=$1' file
40812        20406.000000         0.843859468      1083.209050130      -994.562279080      -993.349611938        22.120868921
40813        20406.500000         0.829362077      1064.599666089      -994.546948121      -993.348764740        22.087239027
40814        20407.000000         0.822524589      1055.822814442      -994.540118434      -993.348757318        22.083606005
40815        20407.500000         0.823511602      1057.089780943      -994.541681744      -993.349315083        22.432111979
40816        20408.000000         0.824550451      1058.423286012      -994.543159511      -993.349731194        22.481428146
40817        20408.500000         0.819160081      1051.504008955      -994.537767061      -993.349702160        22.268819809
40818        20409.000000         0.807571275      1036.628191427      -994.525675417      -993.349169067        22.332761049
40819        20409.500000         0.797104599      1023.192780242      -994.514563564      -993.348491176        22.622548103
40820        20410.000000         0.796605925      1022.552664951      -994.513928312      -993.348319789        22.193170071
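
A caveat worth keeping in mind with the compact form: awk prints a line only when the whole pattern is true, and the value of the assignment last=$1 is the field itself. If the first field is ever 0 (or empty), that value is false and the line is silently skipped. The sketch below uses a made-up four-line input (not from the question) to show the difference; the assignment is parenthesised only for readability.

```shell
# Made-up input whose sequence starts at 0 (not from the question).
input='0 a
5 x
1 b
2 c'

# Compact form: on line 1 the assignment yields 0, which is false,
# so "0 a" is not printed even though last is still set to 0.
golfed=$(printf '%s\n' "$input" | awk '(NR==1 || $1 == last+1) && (last=$1)')

# Explicit form: printing does not depend on the value assigned.
explicit=$(printf '%s\n' "$input" | awk 'NR==1 || $1 == last+1 { print; last=$1 }')

printf 'compact:\n%s\n\nexplicit:\n%s\n' "$golfed" "$explicit"
```

Here the compact form keeps only the lines starting with 1 and 2, while the explicit form also keeps the line starting with 0. For identifiers like those in the question, which are never 0, both forms behave the same.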

Explanation

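  • if(NR==1 || $1 == last+1): NR is the current line number, so NR==1 is true only while reading the first line of the file. We need this so that the first line is always printed. Then, $1 == last+1 is true if the first field of the line ($1) equals the value stored in the variable last plus 1. Taken together, this means "if this is the first line, or the first field equals last + 1", which defines your target lines.
  • print; last=$1: if either of the two conditions above is true, print the line and set last to the value of this line's first field, so that it is ready for the comparison on the next line.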
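
The logic described above can be tried on a small input. This is a minimal sketch, assuming a hypothetical six-line sample whose second column (a to f) is invented for illustration; the awk program is the same rule as in the answer, just spread over several lines with comments.

```shell
# Hypothetical sample: the first column interleaves two runs of identifiers.
sample='40812 a
40829 b
40813 c
40830 d
40814 e
40815 f'

kept=$(printf '%s\n' "$sample" | awk '
    # Keep line 1 unconditionally; keep any later line whose first
    # field is exactly one more than the identifier last printed.
    NR == 1 || $1 == last + 1 {
        print
        last = $1   # remember the identifier just printed
    }
')
printf '%s\n' "$kept"
```

Because last only changes when a line is printed, the out-of-sequence lines (40829 and 40830 here) never disturb the comparison, and the run 40812, 40813, 40814, 40815 is what remains.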

Source: https://unix.stackexchange.com/questions/554625