Text-Processing
Remove lines with a duplicate first field/column from a large file
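I have a very large file (snippet below). I need to remove any line whose first-column number does not increase consecutively from the line above it.
For example, I want to keep the first line of the snippet, whose first-column identifier is "40812". Then I want to keep the line with "40813" in the first column (line 3 in my example), then the line beginning with "40814", and so on. I want to remove any line that breaks this succession, such as the second line. I have looked through previous questions and answers here for a possible solution, but so far without success. One solution that came up in several questions was: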
awk -F',' '!seen[$1]++' myFile
I also adapted another solution I saw:
sort -t':' -k 1,1 -u myFile
I would be grateful if anyone could tell me where I am going wrong. I am not very experienced with file manipulation.
40812 20406.000000 0.843859468 1083.209050130 -994.562279080 -993.349611938 22.120868921
40829 20414.500000 0.891283743 1144.084593627 -994.539001565 -993.349739827 21.177788019
40813 20406.500000 0.829362077 1064.599666089 -994.546948121 -993.348764740 22.087239027
40830 20415.000000 0.889606427 1141.931529727 -994.537943593 -993.350242614 21.282490969
40814 20407.000000 0.822524589 1055.822814442 -994.540118434 -993.348757318 22.083606005
40831 20415.500000 0.875230513 1123.478077086 -994.523844766 -993.350421831 20.606467962
40815 20407.500000 0.823511602 1057.089780943 -994.541681744 -993.349315083 22.432111979
40832 20416.000000 0.846150258 1086.149592126 -994.494220141 -993.349798791 22.309054136
40816 20408.000000 0.824550451 1058.423286012 -994.543159511 -993.349731194 22.481428146
40833 20416.500000 0.811604775 1041.805740021 -994.458563132 -993.348626225 21.118428946
40834 20417.000000 0.787796672 1011.244783236 -994.434062658 -993.347887110 20.963790894
40817 20408.500000 0.819160081 1051.504008955 -994.537767061 -993.349702160 22.268819809
40835 20417.500000 0.784857495 1007.471947645 -994.431441227 -993.348167742 20.731789112
40818 20409.000000 0.807571275 1036.628191427 -994.525675417 -993.349169067 22.332761049
40836 20418.000000 0.799208319 1025.893192994 -994.446595759 -993.348938468 21.268665075
40819 20409.500000 0.797104599 1023.192780242 -994.514563564 -993.348491176 22.622548103
40837 20418.500000 0.819797939 1052.322786256 -994.467698852 -993.349417295 21.013041973
40820 20410.000000 0.796605925 1022.552664951 -994.513928312 -993.348319789 22.193170071
This is exactly the kind of thing awk is good at:
$ awk '{ if(NR==1 || $1 == last+1){print; last=$1}}' file
40812 20406.000000 0.843859468 1083.209050130 -994.562279080 -993.349611938 22.120868921
40813 20406.500000 0.829362077 1064.599666089 -994.546948121 -993.348764740 22.087239027
40814 20407.000000 0.822524589 1055.822814442 -994.540118434 -993.348757318 22.083606005
40815 20407.500000 0.823511602 1057.089780943 -994.541681744 -993.349315083 22.432111979
40816 20408.000000 0.824550451 1058.423286012 -994.543159511 -993.349731194 22.481428146
40817 20408.500000 0.819160081 1051.504008955 -994.537767061 -993.349702160 22.268819809
40818 20409.000000 0.807571275 1036.628191427 -994.525675417 -993.349169067 22.332761049
40819 20409.500000 0.797104599 1023.192780242 -994.514563564 -993.348491176 22.622548103
40820 20410.000000 0.796605925 1022.552664951 -994.513928312 -993.348319789 22.193170071
Or, golfed a little:
$ awk '(NR==1 || $1 == last+1) && last=$1' file
40812 20406.000000 0.843859468 1083.209050130 -994.562279080 -993.349611938 22.120868921
40813 20406.500000 0.829362077 1064.599666089 -994.546948121 -993.348764740 22.087239027
40814 20407.000000 0.822524589 1055.822814442 -994.540118434 -993.348757318 22.083606005
40815 20407.500000 0.823511602 1057.089780943 -994.541681744 -993.349315083 22.432111979
40816 20408.000000 0.824550451 1058.423286012 -994.543159511 -993.349731194 22.481428146
40817 20408.500000 0.819160081 1051.504008955 -994.537767061 -993.349702160 22.268819809
40818 20409.000000 0.807571275 1036.628191427 -994.525675417 -993.349169067 22.332761049
40819 20409.500000 0.797104599 1023.192780242 -994.514563564 -993.348491176 22.622548103
40820 20410.000000 0.796605925 1022.552664951 -994.513928312 -993.348319789 22.193170071
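One caveat with the golfed one-liner above is worth a quick sketch: there the value of the assignment doubles as the print condition, so a line whose first field is 0 (or empty) would update last without being printed. That cannot happen with the identifiers in the question, which are all large positive numbers. The wrapper names long_form, golfed_form and sample below are hypothetical, the sample data is made up, and the assignment is parenthesised only for readability.

```shell
#!/bin/sh
# Hypothetical wrappers used only for this comparison.

# Long form: printing does not depend on the value of the first field.
long_form() {
    awk '{ if (NR == 1 || $1 == last + 1) { print; last = $1 } }' "$@"
}

# Golfed form: the line is printed only when the assigned value is
# non-zero and non-empty, because the whole expression is the pattern.
golfed_form() {
    awk '(NR == 1 || $1 == last + 1) && (last = $1)' "$@"
}

# Made-up sample whose identifiers pass through 0.
sample() { printf '%s\n' '-1 x' '0 y' '1 z'; }

echo 'long form:';   sample | long_form
echo 'golfed form:'; sample | golfed_form
```

With this sample the long form should keep all three lines, while the golfed form should drop the "0 y" line. For data like that in the question the two forms behave the same.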
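Explanation
if(NR==1 || $1 == last+1): NR is the current line number, so NR == 1 is true only while the first line of the file is being read. We need this so that the first line is always printed. Then, $1 == last+1 is true if the first field of the line ($1) equals the value stored in the variable last plus 1. Taken together, this means "if this is the first line, or the first field equals last + 1", which is what defines your target lines.
print; last=$1: if either of the two conditions above is true, print the line and set last to the line's first field, so that we are ready to process the next line.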
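The explanation above can also be written out as a commented, multi-line script. This is only a sketch of the same logic: keep_consecutive is a hypothetical helper name, and the two-column demo records are made up (only their first-column values are borrowed from the question).

```shell
#!/bin/sh
# keep_consecutive is a hypothetical helper name, not part of the original answer.
# It reads whitespace-separated records from stdin (or from files given as
# arguments) and keeps a line only when its first field is exactly one more
# than the first field of the last line that was kept.
keep_consecutive() {
    awk '
        NR == 1 || $1 == last + 1 {   # first line of input, or the next expected number
            print                     # keep the line
            last = $1                 # remember its first field for the next comparison
        }
    ' "$@"
}

# Small demo on made-up records; the lines starting with 40829 and 40830 are dropped.
printf '%s\n' '40812 a' '40829 b' '40813 c' '40830 d' '40814 e' | keep_consecutive
```

Note that last is only updated when a line is kept, so a skipped line never changes which number is expected next. To filter a real file and save the result, something like keep_consecutive myFile > filtered (with filtered as an arbitrary output name) should work.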