Bash

嘗試使用 sed 從 .txt 文件中的 html 源中 grep url

  • August 5, 2016

我之前已經能夠使用下面的程式碼從 html 源中獲取 url 列表,但由於某種原因,它不適用於這個特定的範例。

緊握:

grep -1 box-download shareit1.txt|sed 's/<a/\/n/'|sed 's/href/\/n/'|grep http|cut -d\" -f2>> shareit2.txt

網址:

<div class="box-download">
<a data-no-file="0" title="SHAREit free download" href="http://gsf-cf.softonic.com/c98/1a8/173dd01ec9001985d81eb5f2023b03280c/LenovoShareIt-win.exe?SD_used=0&channel=WEB&fdh=no&id_file=69703978&instance=softonic_en&type=PROGRAM&Expires=1444364906&Signature=SdKSfTDHY4dG6HVu2--lqt8lRbGK9S1opIDZiSNwvggAAAXB3hESz1G1Y00rU5iLGY5lai0YOJBXhE4y6gvL4uQvCV4U5jzLDU9TmFTxe4xNDrEmkSC95LyGdGSudQKfrWdD06gBlVrqE49AeeotENtdA3SpkmfQGGd1tnjS138_&Key-Pair-Id=APKAJUA62FNWTI37JTGQ&filename=LenovoShareIt-win.exe" id="download-button" class="button-main-download-xl"
       data-ua="#c,#l,a=Download,downloadType=HostedDownload"
   >
   <strong>Free Download
       <span>Safe download</span>
   </strong>
   <i class="icon-download-alt"></i>
</a>

感謝幫助。

sed 's/^[^"]*  *//
    s/" */"\n/2
     /\n/P;D
'    <in >out

這將輪流列印和吃輸入行,一次一個雙引號上下文。它可能會使您的數據更加grep友好。如所寫,如果引用的上下文可以跨越換行符,則它不起作用,但是,據我了解,它們不應該在 HTML 中。

無論如何,它確實使您的樣品更容易處理:

class="box-download"
data-no-file="0"
title="SHAREit free download"
href="http://gsf-cf.softonic.com/c98/1a8/173dd01ec9001985d81eb5f2023b03280c/LenovoShareIt-win.exe?SD_used=0&channel=WEB&fdh=no&id_file=69703978&instance=softonic_en&type=PROGRAM&Expires=1444364906&Signature=SdKSfTDHY4dG6HVu2--lqt8lRbGK9S1opIDZiSNwvggAAAXB3hESz1G1Y00rU5iLGY5lai0YOJBXhE4y6gvL4uQvCV4U5jzLDU9TmFTxe4xNDrEmkSC95LyGdGSudQKfrWdD06gBlVrqE49AeeotENtdA3SpkmfQGGd1tnjS138_&Key-Pair-Id=APKAJUA62FNWTI37JTGQ&filename=LenovoShareIt-win.exe"
id="download-button"
class="button-main-download-xl"
data-ua="#c,#l,a=Download,downloadType=HostedDownload"
class="icon-download-alt"

有了這個(固定的)file.html:

<html>
 <div class="box-download">
   <a data-no-file="0" title="SHAREit free download" href="http://gsf-cf.softonic.com/c98/1a8/173dd01ec9001985d81eb5f2023b03280c/LenovoShareIt-win.exe?SD_used=0&channel=WEB&fdh=no&id_file=69703978&instance=softonic_en&type=PROGRAM&Expires=1444364906&Signature=SdKSfTDHY4dG6HVu2--lqt8lRbGK9S1opIDZiSNwvggAAAXB3hESz1G1Y00rU5iLGY5lai0YOJBXhE4y6gvL4uQvCV4U5jzLDU9TmFTxe4xNDrEmkSC95LyGdGSudQKfrWdD06gBlVrqE49AeeotENtdA3SpkmfQGGd1tnjS138_&Key-Pair-Id=APKAJUA62FNWTI37JTGQ&filename=LenovoShareIt-win.exe" id="download-button" class="button-main-download-xl" data-ua="#c,#l,a=Download,downloadType=HostedDownload">
     <strong>Free Download<span>Safe download</span></strong>
     <i class="icon-download-alt"></i>
   </a>
 </div>
</html>

命令:

xmlstarlet sel -t -v "//html/div/a/@href" file.html

輸出:

http://gsf-cf.softonic.com/c98/1a8/173dd01ec9001985d81eb5f2023b03280c/LenovoShareIt-win.exe?SD_used=0&channel=WEB&fdh=no&id_file=69703978&instance=softonic_en&type=PROGRAM&Expires=1444364906&Signature=SdKSfTDHY4dG6HVu2--lqt8lRbGK9S1opIDZiSNwvggAAAXB3hESz1G1Y00rU5iLGY5lai0YOJBXhE4y6gvL4uQvCV4U5jzLDU9TmFTxe4xNDrEmkSC95LyGdGSudQKfrWdD06gBlVrqE49AeeotENtdA3SpkmfQGGd1tnjS138_&Key-Pair-Id=APKAJUA62FNWTI37JTGQ&filename=LenovoShareIt-win.exe

引用自:https://unix.stackexchange.com/questions/234817