How do I download this web page correctly?

May 17, 2022

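I am trying to download the contents of this web page: https://bcs.wiley.com/he-bcs/Books?action=index&itemId=1119299160&bcsId=10685. In particular, I am interested in the PDF files that can be reached from the "Browse by Chapter", "Browse by Resource", etc. menus that you can see on that page. I tried to download the page with wget, but without success.
I have already used the -r -l 0 options of wget and quoted the URL (as discussed in the comments below).
Can you help me? Thanks in advance!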

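Because of the way JavaScript handles the URLs, wget alone will not work. You have to parse the page with xmllint and then process the URLs into a format that wget can handle.
First, extract and process the URLs handled by JavaScript and write them to urls.txt: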
wget -O - 'https://bcs.wiley.com/he-bcs/Books?action=resource&bcsId=10685&itemId=1119299160&resourceId=42647' | \
xmllint --html --xpath "//li[@class='resourceColumn']//a/@href" - 2>/dev/null | \
sed -e 's# href.*Books#https://bcs.wiley.com/he-bcs/Books#' -e 's/amp;//g' -e 's/&newwindow.*$//' > urls.txt
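To see what the sed stage of the pipeline above does, here is a minimal offline sketch. The function name clean_href and the sample line are made up for illustration; the real attribute text printed by xmllint for this site may look different, so treat this as a shape check rather than a record of the actual output.

```shell
# Hypothetical helper wrapping the three sed expressions used above.
clean_href() {
  sed -e 's# href.*Books#https://bcs.wiley.com/he-bcs/Books#' \
      -e 's/amp;//g' \
      -e 's/&newwindow.*$//'
}

# Made-up sample in the general shape of an @href match (assetId is invented).
sample=' href="/he-bcs/Books?action=mininav&amp;bcsId=10685&amp;itemId=1119299160&amp;assetId=0000&amp;newwindow=true"'

# 1) everything from " href" through "Books" becomes the absolute URL prefix
# 2) the "amp;" left over from "&amp;" is dropped, leaving plain "&"
# 3) the trailing "&newwindow..." parameter and closing quote are cut off
cleaned=$(printf '%s\n' "$sample" | clean_href)
printf '%s\n' "$cleaned"
```

Running the sample through the function should leave a plain absolute URL with ordinary & separators, which is the form wget -i expects in urls.txt.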
Now download the PDF files found by opening each URL in urls.txt:
wget -O - -i urls.txt | grep -o 'https.*pdf' | wget -i -
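One thing to keep in mind about the grep step: 'https.*pdf' is greedy, so if a single line of HTML happened to contain two PDF links, it would be returned as one long match. Whether that occurs on this site is not something the answer states. As a hedged alternative, the sketch below uses a tighter pattern; extract_pdf_urls and the example.com markup are invented for illustration.

```shell
# Hypothetical helper: match each PDF URL separately by refusing to cross
# quotes, spaces, or angle brackets inside a URL.
extract_pdf_urls() {
  grep -oE "https://[^\"' <>]+\.pdf"
}

# Made-up HTML line containing two links on the same line.
html='<a href="https://example.com/a.pdf">A</a> <a href="https://example.com/b.pdf">B</a>'

pdfs=$(printf '%s\n' "$html" | extract_pdf_urls)
printf '%s\n' "$pdfs"
```

If this behaves as intended, the function can replace grep -o 'https.*pdf' in the download pipeline without other changes.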
curl alternative:
curl 'https://bcs.wiley.com/he-bcs/Books?action=resource&bcsId=10685&itemId=1119299160&resourceId=42647' | \
xmllint --html --xpath "//li[@class='resourceColumn']//a/@href" - 2>/dev/null | \
sed -e 's# href.*Books#https://bcs.wiley.com/he-bcs/Books#' -e 's/amp;//g' -e 's/&newwindow.*$//' > urls.txt
curl -s $(cat urls.txt) | grep -o 'https.*pdf' | xargs -n 1 curl -O

Source: https://unix.stackexchange.com/questions/702602
