Extracting PDF links from an already-downloaded index.html in order to fetch them, even when grep shows several PDFs on one line

September 4, 2020

I have an index.html file that contains href links to PDF files.

When I run grep -i 'href=' index.html, I get, for example:

<p>Télécharger : <a href="https://ecole-euclid.cnrs.fr/wp-content/uploads/EDE2019_Henrot-Versillé-C1_L1.pdf"><span style="color: #0000ff;">Cours n°1</span></a> (S. Henrot-Versillé), <span style="color: #0000ff;"><a href="https://ecole-euclid.cnrs.fr/wp-content/uploads/EDE2019_Henrot-Versillé_C1_L2.pdf">Cours n°2</a></span> (S. Henrot-Versillé), <a href="https://ecole-euclid.cnrs.fr/wp-content/uploads/EDE2018_Henrot-Versillé_C3.pdf"><span style="color: #0000ff;">Cours n°3</span></a> (S. Henrot-Versillé)</p>
<p>Télécharger le cours sur <a href="https://ecole-euclid.cnrs.fr/wp-content/uploads/EDE2018_Martinelli_C2_L1_Bayesian.pdf">la méthode bayésienne</a> (M. Martinelli) et <a href="https://ecole-euclid.cnrs.fr/wp-content/uploads/EDE2018_Martinelli_C2_TD_Bayesian.pdf">son TD</a> (M. Martinelli).</p></div>
<p><a href="https://github.com/mhuertascompany/EDE19" title="GitHub Deep Learning 2019 EDE">https://github.com/mhuertascompany/EDE19</a></p>
<p><a href="https://colab.research.google.com/drive" title="TDs Deep Learning 2019">https://colab.research.google.com/drive</a></p></div>
       <a href="https://www.facebook.com/euclid.france" class="icon">
       <a href="https://twitter.com/Euclid_FR" class="icon">
       <a href="#" class="icon">
       <a href="https://ecole-euclid.cnrs.fr/feed/" class="icon">

Now I would like to pipe this grep output through gsed (on macOS Catalina) to extract the full href of every PDF file, even when several PDF links sit on the same line.

I first tried:

grep -i 'href=' index.html | gsed 's/href="\(.*pdf\)"/\1/g'

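But this doesn't work: as you can see, I only get the first PDF link printed, not all the PDF links on the same line. So, how can I print every match of the pattern?

The goal is then to download all the PDF files referenced in index.html.

Any help would be great.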

Since you have GNU sed, you can install GNU awk. Using GNU awk for multi-char RS and RT:

$ awk -v RS='href="http[^"]+[.]pdf"' -F'"' 'RT{$0=RT; print $2}' file
https://ecole-euclid.cnrs.fr/wp-content/uploads/EDE2019_Henrot-Versillé-C1_L1.pdf
https://ecole-euclid.cnrs.fr/wp-content/uploads/EDE2019_Henrot-Versillé_C1_L2.pdf
https://ecole-euclid.cnrs.fr/wp-content/uploads/EDE2018_Henrot-Versillé_C3.pdf
https://ecole-euclid.cnrs.fr/wp-content/uploads/EDE2018_Martinelli_C2_L1_Bayesian.pdf
https://ecole-euclid.cnrs.fr/wp-content/uploads/EDE2018_Martinelli_C2_TD_Bayesian.pdf

Otherwise, using any awk in any shell on every UNIX box:

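$ awk '{
   # Consume the line one match at a time so several links per line are all found.
   while ( match($0,/href="http[^"]+[.]pdf"/) ) {
       split(substr($0,RSTART,RLENGTH),f,/"/)   # f[2] is the text between the quotes
       print f[2]
       $0 = substr($0,RSTART+RLENGTH)           # keep only the text after this match
   }
}' file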
https://ecole-euclid.cnrs.fr/wp-content/uploads/EDE2019_Henrot-Versillé-C1_L1.pdf
https://ecole-euclid.cnrs.fr/wp-content/uploads/EDE2019_Henrot-Versillé_C1_L2.pdf
https://ecole-euclid.cnrs.fr/wp-content/uploads/EDE2018_Henrot-Versillé_C3.pdf
https://ecole-euclid.cnrs.fr/wp-content/uploads/EDE2018_Martinelli_C2_L1_Bayesian.pdf
https://ecole-euclid.cnrs.fr/wp-content/uploads/EDE2018_Martinelli_C2_TD_Bayesian.pdf
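
As a side note, not part of the answer above: the sed attempt in the question misbehaves for two reasons. Without -n, sed prints every input line whether or not it matched, and .* is greedy, so a single match runs from the first href=" to the last pdf" on the line. A grep-only sketch that avoids both problems is shown below. It relies on the -o option, which is a widely supported extension (GNU and BSD grep) rather than part of POSIX; the function name extract_pdf_links and the sample HTML are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print every PDF URL found in href="..." attributes.
# Reads the file named by $1, or standard input when no file is given.
extract_pdf_links() {
    # -o prints only the matched text, one match per output line.
    # [^"]+ cannot cross a closing quote, so each match stays inside one attribute.
    # cut then keeps the second double-quote-delimited field, i.e. the URL itself.
    grep -oE 'href="http[^"]+[.]pdf"' "$@" | cut -d '"' -f 2
}

# Made-up sample: two PDF links on one line, then one non-PDF link.
extract_pdf_links <<'EOF'
<p><a href="https://example.org/a.pdf">A</a>, <a href="https://example.org/b.pdf">B</a></p>
<a href="https://example.org/feed/" class="icon">
EOF
```

On real data this would be called as extract_pdf_links index.html, with no preliminary grep for href= needed.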

Just pipe that output to xargs -n 1 curl -O to download the PDFs (assuming there are no spaces in the URLs).
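
Before downloading anything, it can be useful to preview what xargs would run. The sketch below puts echo in front of the command so that nothing is fetched; the function name preview_downloads and the example.org URLs are placeholders, not taken from the question.

```shell
#!/bin/sh
# Hypothetical helper: read one URL per line on standard input and print
# the curl command that xargs would run for each one, without running it.
preview_downloads() {
    # -n 1 passes a single URL to each invocation; echo turns this into a dry run.
    xargs -n 1 echo curl -O
}

# Placeholder URLs standing in for the awk output shown earlier.
printf '%s\n' \
    'https://example.org/a.pdf' \
    'https://example.org/b.pdf' | preview_downloads
```

Removing echo gives the real download step, where curl -O saves each file in the current directory under the last path component of its URL.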

Source: https://unix.stackexchange.com/questions/607889
