Bash
嘗試使用 sed 從 .txt 文件中的 html 源中 grep url
我之前已經能夠使用下面的程式碼從 html 源中獲取 url 列表,但由於某種原因,它不適用於這個特定的範例。
緊握:
grep -1 box-download shareit1.txt|sed 's/<a/\/n/'|sed 's/href/\/n/'|grep http|cut -d\" -f2>> shareit2.txt
網址:
<div class="box-download"> <a data-no-file="0" title="SHAREit free download" href="http://gsf-cf.softonic.com/c98/1a8/173dd01ec9001985d81eb5f2023b03280c/LenovoShareIt-win.exe?SD_used=0&channel=WEB&fdh=no&id_file=69703978&instance=softonic_en&type=PROGRAM&Expires=1444364906&Signature=SdKSfTDHY4dG6HVu2--lqt8lRbGK9S1opIDZiSNwvggAAAXB3hESz1G1Y00rU5iLGY5lai0YOJBXhE4y6gvL4uQvCV4U5jzLDU9TmFTxe4xNDrEmkSC95LyGdGSudQKfrWdD06gBlVrqE49AeeotENtdA3SpkmfQGGd1tnjS138_&Key-Pair-Id=APKAJUA62FNWTI37JTGQ&filename=LenovoShareIt-win.exe" id="download-button" class="button-main-download-xl" data-ua="#c,#l,a=Download,downloadType=HostedDownload" > <strong>Free Download <span>Safe download</span> </strong> <i class="icon-download-alt"></i> </a>
感謝幫助。
sed 's/^[^"]* *// s/" */"\n/2 /\n/P;D ' <in >out
這將輪流列印和吃輸入行,一次一個雙引號上下文。它可能會使您的數據更加
grep
友好。如所寫,如果引用的上下文可以跨越換行符,則它不起作用,但是,據我了解,它們不應該在 HTML 中。無論如何,它確實使您的樣品更容易處理:
class="box-download" data-no-file="0" title="SHAREit free download" href="http://gsf-cf.softonic.com/c98/1a8/173dd01ec9001985d81eb5f2023b03280c/LenovoShareIt-win.exe?SD_used=0&channel=WEB&fdh=no&id_file=69703978&instance=softonic_en&type=PROGRAM&Expires=1444364906&Signature=SdKSfTDHY4dG6HVu2--lqt8lRbGK9S1opIDZiSNwvggAAAXB3hESz1G1Y00rU5iLGY5lai0YOJBXhE4y6gvL4uQvCV4U5jzLDU9TmFTxe4xNDrEmkSC95LyGdGSudQKfrWdD06gBlVrqE49AeeotENtdA3SpkmfQGGd1tnjS138_&Key-Pair-Id=APKAJUA62FNWTI37JTGQ&filename=LenovoShareIt-win.exe" id="download-button" class="button-main-download-xl" data-ua="#c,#l,a=Download,downloadType=HostedDownload" class="icon-download-alt"
有了這個(固定的)file.html:
<html> <div class="box-download"> <a data-no-file="0" title="SHAREit free download" href="http://gsf-cf.softonic.com/c98/1a8/173dd01ec9001985d81eb5f2023b03280c/LenovoShareIt-win.exe?SD_used=0&channel=WEB&fdh=no&id_file=69703978&instance=softonic_en&type=PROGRAM&Expires=1444364906&Signature=SdKSfTDHY4dG6HVu2--lqt8lRbGK9S1opIDZiSNwvggAAAXB3hESz1G1Y00rU5iLGY5lai0YOJBXhE4y6gvL4uQvCV4U5jzLDU9TmFTxe4xNDrEmkSC95LyGdGSudQKfrWdD06gBlVrqE49AeeotENtdA3SpkmfQGGd1tnjS138_&Key-Pair-Id=APKAJUA62FNWTI37JTGQ&filename=LenovoShareIt-win.exe" id="download-button" class="button-main-download-xl" data-ua="#c,#l,a=Download,downloadType=HostedDownload"> <strong>Free Download<span>Safe download</span></strong> <i class="icon-download-alt"></i> </a> </div> </html>
命令:
xmlstarlet sel -t -v "//html/div/a/@href" file.html
輸出:
http://gsf-cf.softonic.com/c98/1a8/173dd01ec9001985d81eb5f2023b03280c/LenovoShareIt-win.exe?SD_used=0&channel=WEB&fdh=no&id_file=69703978&instance=softonic_en&type=PROGRAM&Expires=1444364906&Signature=SdKSfTDHY4dG6HVu2--lqt8lRbGK9S1opIDZiSNwvggAAAXB3hESz1G1Y00rU5iLGY5lai0YOJBXhE4y6gvL4uQvCV4U5jzLDU9TmFTxe4xNDrEmkSC95LyGdGSudQKfrWdD06gBlVrqE49AeeotENtdA3SpkmfQGGd1tnjS138_&Key-Pair-Id=APKAJUA62FNWTI37JTGQ&filename=LenovoShareIt-win.exe