Wget
Recursive download with wget
I have a question about the following wget command:
wget -nd -r -l 10 http://web.archive.org/web/20110726051510/http://feedparser.org/docs/
It is supposed to recursively download all the documents linked from the original page, but it only downloads two files (index.html and robots.txt). How can I make the recursive download of this page work?
By default, wget honors the robots.txt standard when crawling pages, just like search engines do, and for archive.org the robots.txt disallows the entire /web/ subdirectory. To override this, use -e robots=off:
wget -nd -r -l 10 -e robots=off http://web.archive.org/web/20110726051510/http://feedparser.org/docs/
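If you want to see for yourself which paths are blocked before overriding them, you can print the site's robots.txt to stdout (a quick check, assuming the file is still served at the usual location; the rules may have changed since this was written):

# -q suppresses wget's own status output, -O- writes the fetched file to stdout
$ wget -qO- http://web.archive.org/robots.txt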
$ wget --random-wait -r -p -e robots=off -U Mozilla \
    http://web.archive.org/web/20110726051510/http://feedparser.org/docs/
This downloads the contents of the URL recursively.
--random-wait - vary the pause between requests between 0.5 and 1.5 times the --wait interval.
-r - turn on recursive retrieving.
-p - download all page requisites (images, stylesheets, etc.) needed to display each page.
-e robots=off - ignore robots.txt.
-U Mozilla - set the "User-Agent" header to "Mozilla". Though a better choice is a real User-Agent like "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729)", as shown below.
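For instance, here is a sketch of the same command with a fuller browser User-Agent string (the exact string is only an illustration; note that a User-Agent containing spaces must be quoted on the shell command line, unlike the bare "Mozilla" above):

$ wget --random-wait -r -p -e robots=off \
    -U "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1)" \
    http://web.archive.org/web/20110726051510/http://feedparser.org/docs/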
Some other useful options are:
--limit-rate=20k - limit the download speed to 20 KB/s.
-o logfile.txt - write a log of the downloads to logfile.txt.
-l 0 - remove the recursion depth limit (which is 5 by default).
--wait=1h - be sneaky: download one file every hour.
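Putting several of these together, a polite, logged mirror of the same docs with unlimited depth might look like this (a sketch; the rate and wait values are arbitrary examples to adjust to taste):

$ wget -r -l 0 -p -e robots=off --limit-rate=20k --wait=1 --random-wait \
    -o logfile.txt \
    http://web.archive.org/web/20110726051510/http://feedparser.org/docs/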