Wget
Recursive download with wget
I have a question about the following wget command:
wget -nd -r -l 10 http://web.archive.org/web/20110726051510/http://feedparser.org/docs/
It is supposed to recursively download all the documents linked from the original page, but it only downloads two files (index.html and robots.txt). How can I make the recursive download of this page work?
By default, wget honors the robots.txt standard when crawling pages, just like search engines do, and for archive.org the robots.txt disallows the entire /web/ subdirectory. To override this, use -e robots=off:
wget -nd -r -l 10 -e robots=off http://web.archive.org/web/20110726051510/http://feedparser.org/docs/
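If you want to see for yourself which paths are blocked before overriding them, you can print the site's robots.txt to stdout (a quick check, assuming the file is still served at the usual location; the rules may have changed since this was written):

# -q suppresses wget's own status output, -O- writes the fetched file to stdout
$ wget -qO- http://web.archive.org/robots.txt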
$ wget --random-wait -r -p -e robots=off -U Mozilla \
    http://web.archive.org/web/20110726051510/http://feedparser.org/docs/
This downloads the contents of the URL recursively.
--random-wait - vary the pause between requests between 0.5 and 1.5 times the --wait interval.
-r - turn on recursive retrieving.
-p - download all page requisites (images, stylesheets, etc.) needed to display each page.
-e robots=off - ignore robots.txt.
-U Mozilla - set the "User-Agent" header to "Mozilla". Though a better choice is a real User-Agent like "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729)", as shown below.
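For instance, here is a sketch of the same command with a fuller browser User-Agent string (the exact string is only an illustration; note that a User-Agent containing spaces must be quoted on the shell command line, unlike the bare "Mozilla" above):

$ wget --random-wait -r -p -e robots=off \
    -U "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1)" \
    http://web.archive.org/web/20110726051510/http://feedparser.org/docs/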
Some other useful options are:
--limit-rate=20k - limit the download speed to 20 KB/s.
-o logfile.txt - write a log of the downloads to logfile.txt.
-l 0 - remove the recursion depth limit (which is 5 by default).
--wait=1h - be sneaky: download one file every hour.
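Putting several of these together, a polite, logged mirror of the same docs with unlimited depth might look like this (a sketch; the rate and wait values are arbitrary examples to adjust to taste):

$ wget -r -l 0 -p -e robots=off --limit-rate=20k --wait=1 --random-wait \
    -o logfile.txt \
    http://web.archive.org/web/20110726051510/http://feedparser.org/docs/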