Remove all URLs in a series with the same domain, except the last occurrence, in a long list of many URL series

March 11, 2021

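I have a problem that I think sed might be a great fit for, but I don't know it well enough to figure out how to use it properly.
This is what I have: a file like this, but much longer: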
https://www.npmjs.com
https://www.npmjs.com/package/rabin
https://www.politico.com/news/magazine/blah/blah
https://www.raspberrypi.org
https://www.raspberrypi.org/documentation/blah
https://www.raspberrypi.org/products/raspberry-pi-zero-w/
https://www.reddit.com
https://www.reddit.com/
https://www.reddit.com/r/geology/blah/blah/blah
https://www.reddit.com/r/commandline/blah/blah/blah
...thousands more...
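The file contains many series of URLs that share a domain name, and what I need is the last URL of each series, across the whole text file.
So only the ones marked with an arrow: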
https://www.npmjs.com
->https://www.npmjs.com/package/rabin
->https://www.politico.com/news/magazine/blah/blah
https://www.raspberrypi.org
https://www.raspberrypi.org/documentation/blah
->https://www.raspberrypi.org/products/raspberry-pi-zero-w/
https://www.reddit.com
https://www.reddit.com/
https://www.reddit.com/r/geology/blah/blah/blah
->https://www.reddit.com/r/commandline/blah/blah/blah
...thousands more...
Any ideas?
Thanks!

This did the trick:
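gawk '
# Capture the scheme and domain part of each URL into url[1].
{ match($0, /(https?:\/\/(?:www.)?[a-zA-Z0-9-]+?[a-z0-9.]+)/, url) }
# First time a domain is seen: remember its position.
!a[url[1]]++ { b[++count] = url[1] }
# Always record the current line, so the last URL per domain wins.
{ c[url[1]] = $0 }
# Print the last URL of each domain, in first-seen domain order.
END {
    for (i = 1; i <= count; i++) {
        print c[b[i]]
    }
}' input.txt > output.txt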
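The regular expression could probably be simplified a lot, and made to catch more variations in domain names, but for my purposes it works fine. The awk command was adapted from this answer. (Funny how someone removed the "bash" tag from my question, while the answer that actually helped me was tagged "bash"...)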
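The simplification mentioned above might look like the following sketch. It assumes that the text between `://` and the next `/` is a good enough grouping key, so `www.reddit.com` and `reddit.com` would count as different domains and a port number would be part of the key. It also avoids `(?:...)` and `+?`, which are Perl-style constructs; awk's extended regular expressions have no non-capturing groups or lazy quantifiers. The function name `last_url_per_host` is made up for illustration.

```shell
# Hypothetical helper; reads URLs on stdin, one per line.
last_url_per_host() {
    awk '
    {
        # Key is scheme plus host: everything up to the first "/" after "://".
        # Lines that do not look like an http(s) URL are keyed by the whole line.
        key = match($0, /^https?:\/\/[^\/]+/) ? substr($0, RSTART, RLENGTH) : $0
        if (!(key in last)) order[++n] = key   # remember first-seen order of hosts
        last[key] = $0                         # later lines overwrite earlier ones
    }
    END { for (i = 1; i <= n; i++) print last[order[i]] }
    '
}
```

It could be used as `last_url_per_host < input.txt > output.txt`. Since it relies only on `match`, `RSTART` and `RLENGTH`, it should not need gawk specifically.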
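Thinking about this some more, I guess you could also use awk to append the matched domain as a separate "field" at the end, use sort with its unique option to select the last one, and then strip the domain "field" again, or rather use awk to print only the first "field" after the unique sort, i.e. the original URL.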
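A rough sketch of that decorate, sort, undecorate idea follows. It is not something taken from the answer: the function name is hypothetical, the helper fields are put in front of the URL rather than at the end, and it assumes GNU sort, where `-u` keeps the first line of each run of equal keys (POSIX does not say which line survives). It also assumes the URLs contain no tab characters.

```shell
# Hypothetical helper; reads URLs on stdin, one per line.
last_url_per_host_sorted() {
    tab=$(printf '\t')
    # 1. Decorate: line number, host key, original URL (tab-separated).
    awk -v OFS='\t' '
    {
        key = match($0, /^https?:\/\/[^\/]+/) ? substr($0, RSTART, RLENGTH) : $0
        print NR, key, $0
    }' |
    # 2. Group by host, highest line number first within each host.
    LC_ALL=C sort -t "$tab" -k2,2 -k1,1nr |
    # 3. Keep one line per host (assumed GNU behaviour: the first of each run).
    LC_ALL=C sort -t "$tab" -s -u -k2,2 |
    # 4. Restore the original line order and drop the two helper fields.
    LC_ALL=C sort -t "$tab" -k1,1n |
    cut -f3-
}
```

Usage would be `last_url_per_host_sorted < input.txt > output.txt`. The output follows the position of each surviving URL in the input, which matches the awk-only version whenever the URLs of one domain sit next to each other.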

Source: https://unix.stackexchange.com/questions/638469
