Bash
Searching for a pattern between tags in HTML
I need to extract the text of tags that have a specific title.
I have this command:
sed -n 's/title="view quote">\(.*\)<\/a>/\1/p' index.html
This is part of index.html; I need "Everything in life is luck."
<a title="view quote" href="https://www.brainyquote.com/quotes/donald_trump_106578" class="oncl_q"> <img id="qimage_106578" src="./Donald Trump Quotes - BrainyQuote_files/donaldtrump1.jpg" class="bqphtgrid" alt="Everything in life is luck. - Donald Trump"> </a> </div> <a href="https://www.brainyquote.com/quotes/donald_trump_106578" class="b-qt qt_106578 oncl_q" title="view quote">Everything in life is luck.</a> <a href="https://www.brainyquote.com/quotes/donald_trump_106578" class="bq-aut qa_106578 oncl_a" title="view author">Donald Trump</a> </div> <div class="qbn-box"> <div class="sh-cont"> <a href="https://www.brainyquote.com/share/fb/106578" aria-label="Share this quote on Facebook" class="sh-fb sh-grey" target="_blank" rel="nofollow"><img src="./Donald Trump Quotes - BrainyQuote_files/facebook-f.svg" alt="Share on Facebook" class="bq-fa"></a><a href="https://www.brainyquote.com/share/tw/106578?ti=Donald+Trump+Quotes" aria-label="Share this quote on Twitter" class="sh-tw sh-grey" target="_blank" rel="nofollow"><img src="./Donald Trump Quotes - BrainyQuote_files/twitter.svg" alt="Share on Twitter" class="bq-fa"></a><a href="https://www.brainyquote.com/share/li/106578?ti=Donald+Trump+Quotes+-+BrainyQuote" aria-label="Share this quote on LinkedIn" class="sh-tw sh-grey" target="_blank" rel="nofollow"><img src="./Donald Trump Quotes - BrainyQuote_files/linkedin-in.svg" alt="Share on LinkedIn" class="bq-fa"></a> </div> </div> <div class="qll-dsk-kw-box"> <div class="kw-box"> <a href="https://www.brainyquote.com/topics/life-quotes" class="qkw-btn btn btn-xs oncl_klc" data-idx="0">Life</a> <a href="https://www.brainyquote.com/topics/luck-quotes" class="qkw-btn btn btn-xs oncl_klc" data-idx="1">Luck</a> <a href="https://www.brainyquote.com/topics/everything-quotes" class="qkw-btn btn btn-xs oncl_klc" data-idx="2">Everything</a> </div> </div> </div> <div id="qpos_1_2" class="m-brick grid-item boxy bqQt r-width" style="position: absolute; left: 623px; top: 2px;"> <div 
class="clearfix"> <div class="qti-listm"> <a title="view quote" href="https://www.brainyquote.com/quotes/donald_trump_119339" class="oncl_q"> <img id="qimage_119339" src="./Donald Trump Quotes - BrainyQuote_files/donaldtrump1(1).jpg" class="bqphtgrid" alt="The first thing the secretary types is the boss. - Donald Trump"> </a> </div> <a href="https://www.brainyquote.com/quotes/donald_trump_119339" class="b-qt qt_119339 oncl_q" title="view quote">The first thing the secretary types is the boss.</a> <a href="https://www.brainyquote.com/quotes/donald_trump_119339" class="bq-aut qa_119339 oncl_a" title="view author">Donald Trump</a> </div> <div class="qbn-box"> <div class="sh-cont"> <a href="https://www.brainyquote.com/share/fb/119339" aria-label="Share this quote on Facebook" class="sh-fb sh-grey" target="_blank" rel="nofollow"><img src="./Donald Trump Quotes - BrainyQuote_files/facebook-f.svg" alt="Share on Facebook" class="bq-fa"></a><a href="https://www.brainyquote.com/share/tw/119339?ti=Donald+Trump+Quotes" aria-label="Share this quote on Twitter" class="sh-tw sh-grey" target="_blank" rel="nofollow"><img src="./Donald Trump Quotes - BrainyQuote_files/twitter.svg" alt="Share on Twitter" class="bq-fa"></a><a href="https://www.brainyquote.com/share/li/119339?ti=Donald+Trump+Quotes+-+BrainyQuote" aria-label="Share this quote on LinkedIn" class="sh-tw sh-grey" target="_blank" rel="nofollow"><img src="./Donald Trump Quotes - BrainyQuote_files/linkedin-in.svg" alt="Share on LinkedIn" class="bq-fa"></a> </div> </div>
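I need all of these values to populate an array in bash. The expected output here is
['Everything in life is luck', 'The first thing the secretary types is the boss.']. But I need every quote in index.html, so I need a selector that collects all of the quotes into the array.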
Even though this is HTML rather than well-formed XML, you can actually use xmlstarlet. Let's call your file index.html.
Command invocation:
xmlstarlet fo -H index.html 2>/dev/null | xmlstarlet sel -t -v '//a[@title="view quote" and string-length(text()) > 1]' -n 2>/dev/null
Output:
Everything in life is luck.
The first thing the secretary types is the boss.
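Since the question asks for a bash array, here is a minimal sketch of one way to capture that output, assuming bash 4 or newer (for mapfile) and that xmlstarlet prints one quote per line as shown above. The function name load_quotes and the array name quotes are hypothetical, chosen only for illustration.

```bash
#!/usr/bin/env bash
# Sketch: collect the text of every "view quote" link into the global
# array `quotes`. load_quotes is a hypothetical helper name.
# Assumes bash 4+ (mapfile) and one quote per output line.
load_quotes() {
    local file=$1
    # mapfile -t reads stdin line by line and strips each trailing newline.
    # Process substitution keeps the assignment in the current shell,
    # which a plain pipe into mapfile would not.
    mapfile -t quotes < <(
        xmlstarlet fo -H "$file" 2>/dev/null |
        xmlstarlet sel -t -v '//a[@title="view quote" and string-length(text()) > 1]' -n 2>/dev/null
    )
}
```

After calling load_quotes index.html, "${#quotes[@]}" should hold the number of quotes and "${quotes[0]}" the first one. Reading line by line rather than word-splitting the output is what keeps a quote containing spaces in a single array element.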
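You may not have come across xmlstarlet before. It is a marvelous tool that lets you format, edit, and parse XML. Today I discovered that it can also reformat malformed HTML. If you don't have it, install it. (If you don't have permission to install it, ask.) It understands XML in a way that sed and awk cannot even begin to handle. Reformat the XML? sed and awk will likely break, but to xmlstarlet it makes no significant difference.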
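To illustrate the point about reformatting, here is a small sketch using only sed. The two sample strings are made up for this demonstration, and the sed expression is modeled on the one in the question (with leading and trailing .* added) rather than copied from it.

```bash
#!/usr/bin/env bash
# Hypothetical demo data: the same link in two layouts.
one_line='<a href="#" title="view quote">Everything in life is luck.</a>'
reflowed='<a href="#"
   title="view quote">Everything in life
   is luck.</a>'

# Line-oriented extraction, modeled on the sed command from the question.
extract() { sed -n 's/.*title="view quote">\(.*\)<\/a>.*/\1/p'; }

compact_result=$(printf '%s\n' "$one_line" | extract)
reflowed_result=$(printf '%s\n' "$reflowed" | extract)

# The first layout matches; the second yields nothing, because no single
# line contains both the title attribute and the closing </a>.
printf 'one line: [%s]\n' "$compact_result"
printf 'reflowed: [%s]\n' "$reflowed_result"
```

sed sees lines, not elements, so harmless whitespace changes in the markup alter what it matches. An XPath query operates on the parsed tree, so the position of the line breaks should not affect which nodes are selected.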