Bash

Search pattern between tags in HTML

  • April 17, 2021

I need to get the value from tags that have a specific title.

I have this command.

sed -n 's/title="view quote">\(.*\)<\/a>/\1/p' index.html

This is part of index.html; from it I need "Everything in life is luck."

<a title="view quote" href="https://www.brainyquote.com/quotes/donald_trump_106578" class="oncl_q">
<img id="qimage_106578" src="./Donald Trump Quotes - BrainyQuote_files/donaldtrump1.jpg" class="bqphtgrid" alt="Everything in life is luck. - Donald Trump">
</a>
</div>
<a href="https://www.brainyquote.com/quotes/donald_trump_106578" class="b-qt qt_106578 oncl_q" title="view quote">Everything in life is luck.</a>
<a href="https://www.brainyquote.com/quotes/donald_trump_106578" class="bq-aut qa_106578 oncl_a" title="view author">Donald Trump</a>
</div>
<div class="qbn-box">
<div class="sh-cont">
<a href="https://www.brainyquote.com/share/fb/106578" aria-label="Share this quote on Facebook" class="sh-fb sh-grey" target="_blank" rel="nofollow"><img src="./Donald Trump Quotes - BrainyQuote_files/facebook-f.svg" alt="Share on Facebook" class="bq-fa"></a><a href="https://www.brainyquote.com/share/tw/106578?ti=Donald+Trump+Quotes" aria-label="Share this quote on Twitter" class="sh-tw sh-grey" target="_blank" rel="nofollow"><img src="./Donald Trump Quotes - BrainyQuote_files/twitter.svg" alt="Share on Twitter" class="bq-fa"></a><a href="https://www.brainyquote.com/share/li/106578?ti=Donald+Trump+Quotes+-+BrainyQuote" aria-label="Share this quote on LinkedIn" class="sh-tw sh-grey" target="_blank" rel="nofollow"><img src="./Donald Trump Quotes - BrainyQuote_files/linkedin-in.svg" alt="Share on LinkedIn" class="bq-fa"></a>
</div>
</div>
<div class="qll-dsk-kw-box">
<div class="kw-box">
<a href="https://www.brainyquote.com/topics/life-quotes" class="qkw-btn btn btn-xs oncl_klc" data-idx="0">Life</a>
<a href="https://www.brainyquote.com/topics/luck-quotes" class="qkw-btn btn btn-xs oncl_klc" data-idx="1">Luck</a>
<a href="https://www.brainyquote.com/topics/everything-quotes" class="qkw-btn btn btn-xs oncl_klc" data-idx="2">Everything</a>
</div>
</div>
</div>
<div id="qpos_1_2" class="m-brick grid-item boxy bqQt r-width" style="position: absolute; left: 623px; top: 2px;">
<div class="clearfix">
<div class="qti-listm">
<a title="view quote" href="https://www.brainyquote.com/quotes/donald_trump_119339" class="oncl_q">
<img id="qimage_119339" src="./Donald Trump Quotes - BrainyQuote_files/donaldtrump1(1).jpg" class="bqphtgrid" alt="The first thing the secretary types is the boss. - Donald Trump">
</a>
</div>
<a href="https://www.brainyquote.com/quotes/donald_trump_119339" class="b-qt qt_119339 oncl_q" title="view quote">The first thing the secretary types is the boss.</a>
<a href="https://www.brainyquote.com/quotes/donald_trump_119339" class="bq-aut qa_119339 oncl_a" title="view author">Donald Trump</a>
</div>
<div class="qbn-box">
<div class="sh-cont">
<a href="https://www.brainyquote.com/share/fb/119339" aria-label="Share this quote on Facebook" class="sh-fb sh-grey" target="_blank" rel="nofollow"><img src="./Donald Trump Quotes - BrainyQuote_files/facebook-f.svg" alt="Share on Facebook" class="bq-fa"></a><a href="https://www.brainyquote.com/share/tw/119339?ti=Donald+Trump+Quotes" aria-label="Share this quote on Twitter" class="sh-tw sh-grey" target="_blank" rel="nofollow"><img src="./Donald Trump Quotes - BrainyQuote_files/twitter.svg" alt="Share on Twitter" class="bq-fa"></a><a href="https://www.brainyquote.com/share/li/119339?ti=Donald+Trump+Quotes+-+BrainyQuote" aria-label="Share this quote on LinkedIn" class="sh-tw sh-grey" target="_blank" rel="nofollow"><img src="./Donald Trump Quotes - BrainyQuote_files/linkedin-in.svg" alt="Share on LinkedIn" class="bq-fa"></a>
</div>
</div>

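I need all of these values to populate an array in bash. The expected output here is

['Everything in life is luck', 'The first thing the secretary types is the boss.']. But I need every quote in index.html, so I need a selector that collects all of the quotes into an array.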

Even though this is HTML rather than well-formed XML, you can actually use xmlstarlet.

Let's call your file index.html. The command invocation:

xmlstarlet fo -H index.html 2>/dev/null |
   xmlstarlet sel -t -v '//a[@title="view quote" and string-length(text()) > 1]' -n 2>/dev/null

Output:

Everything in life is luck.
The first thing the secretary types is the boss.
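Since the question asks for a bash array, the one-quote-per-line output above can be loaded with the `mapfile` builtin (bash 4 or later). The following is only a minimal sketch: the function name `load_quotes` and the array name `quotes` are illustrative, and a here-document stands in for the xmlstarlet pipeline so the snippet runs without xmlstarlet installed.

```shell
#!/usr/bin/env bash
# Read one quote per line from stdin into the global array "quotes".
# Both names are hypothetical; mapfile needs bash 4 or later.
load_quotes() {
    mapfile -t quotes
}

# Stand-in for the pipeline output. With the real file, feed the pipeline
# through process substitution instead of the here-document:
#   load_quotes < <(xmlstarlet fo -H index.html 2>/dev/null |
#       xmlstarlet sel -t -v '//a[@title="view quote" and string-length(text()) > 1]' -n 2>/dev/null)
load_quotes <<'EOF'
Everything in life is luck.
The first thing the secretary types is the boss.
EOF

printf 'Loaded %d quotes\n' "${#quotes[@]}"
printf '%s\n' "${quotes[@]}"
```

Process substitution is used rather than `pipeline | load_quotes` because the last stage of a pipe normally runs in a subshell, so the array would be lost once the pipeline finishes.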

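You may not have come across xmlstarlet before. It is a terrific tool that lets you format, edit, and parse XML. Today I discovered that it can also reformat malformed HTML. If you don't have it, install it. (If you don't have permission to install it, ask.) It understands XML in a way that sed and awk cannot begin to. Reformat the XML? sed and awk may well break, but to xmlstarlet it makes no significant difference.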
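To illustrate the point about reformatting, here is a small sketch that runs a line-based sed expression over the same anchor written two ways. The sed expression is a variant of the one in the question, with a leading and trailing `.*` added so that only the link text is printed; the helper name `extract` and the two sample strings are made up for this demonstration.

```shell
#!/usr/bin/env bash
# The same anchor element, once on a single line and once wrapped across lines.
one_line='<a class="b-qt" title="view quote">Everything in life is luck.</a>'
multi_line=$'<a class="b-qt"\n   title="view quote">Everything in life is luck.\n</a>'

# Line-based extraction (hypothetical helper): matches only when the
# attribute, the text, and the closing tag all sit on one line.
extract() {
    sed -n 's/.*title="view quote">\(.*\)<\/a>.*/\1/p'
}

echo "single-line markup:"
printf '%s\n' "$one_line" | extract

echo "re-wrapped markup:"
printf '%s\n' "$multi_line" | extract   # no line matches, so nothing is printed
```

The first call prints the quote, while the second prints nothing, because sed processes its input one line at a time and no single line contains the whole pattern. An XPath query sees the element regardless of how the markup is wrapped.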

Source: https://unix.stackexchange.com/questions/645559