Text-Processing
How to massage or format HTML so that it can be parsed with xmlstarlet?
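Do I first need to run HTML found in the wild through something like jsoup? The goal is not to make it valid in any human-readable sense (it may well turn into gibberish), but at least to produce a file that xmlstarlet can process. Ideally I'm looking for a CLI tool that can be installed and used like this: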
massage foo.html > bar.xhtml
or at least something along those lines.
Case in point:
thufir@doge:~/.html$
thufir@doge:~/.html$ curl http://int.soccerway.com/ > soccer.html
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  188k    0  188k    0     0   313k      0 --:--:-- --:--:-- --:--:--  313k
thufir@doge:~/.html$
thufir@doge:~/.html$ xmlstarlet sel -t -v "/html/body/table/tr/td[1]" -n soccer.html
soccer.html:70.13: xmlParseEntityRef: no name
if (this.$ && this.$.fn && this.$.fn.jquery) {
^
soccer.html:70.14: xmlParseEntityRef: no name
if (this.$ && this.$.fn && this.$.fn.jquery) {
^
soccer.html:70.26: xmlParseEntityRef: no name
if (this.$ && this.$.fn && this.$.fn.jquery) {
^
soccer.html:70.27: xmlParseEntityRef: no name
if (this.$ && this.$.fn && this.$.fn.jquery) {
^
soccer.html:198.8: Opening and ending tag mismatch: link line 27 and head
</head>
^
soccer.html:209.45: EntityRef: expecting ';'
j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
^
soccer.html:223.40: xmlParseEntityRef: no name
if (typeof(e.data) === 'string' && (e.data.indexOf('onEplayerVideoStarted'
^
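Ideally there would be an htmlstarlet that runs directly against a URL, but no such tool seems to exist. There is the format (fo) option, but I could not get results any different from the above.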
If you only want the data cells of the table, you can use xmlstarlet fo followed by xmlstarlet sel. The main problem you have is the XPath. If you add a couple of "wildcard" steps (//), which match descendants at any depth, you will get the desired result:
# fetch URL silently, following redirects, and send to standard out
curl -sL http://int.soccerway.com/ |
# interpret input as HTML (-H) and try to recover as much as possible (-R)
xmlstarlet fo -H -R 2> /dev/null |
# use the following XPath expression and return the value (-t -v),
# also add a newline after the result (-n)
xmlstarlet sel -t -v '//table//tr//h3/span' -n 2> /dev/null |
# only show the first 10 values
head -n10
Output:
World - Friendlies
Argentina - Prim B Nacional
Australia - National Premier Leagues
Australia - NPL Youth League
Bangladesh - Premier League
Belarus - Premier League
Benin - Championnat National
Brazil - Serie A
Brazil - Serie D
Brazil - Copa Paulista
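
The same xmlstarlet fo -H -R step could be wrapped in a small shell function to get close to the massage foo.html > bar.xhtml usage the question asks for. This is only a sketch: the function name massage is borrowed from the question and is not an existing tool, and the sample HTML below is made up to mimic the kinds of errors shown above (an unclosed link tag and bare && inside a script).

```shell
#!/bin/sh
# Hypothetical helper (name taken from the question, not a real tool).
# Reads HTML from the file given as $1, or from stdin when no argument
# is passed, and writes the recovered XML to standard out.
massage() {
    # -H: treat input as HTML; -R: recover as much as possible.
    # Parser complaints on stderr are discarded.
    xmlstarlet fo -H -R "$@" 2> /dev/null
}

# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up sample: unclosed <link>, bare '&&' in a script, and a table
# nested inside a <div> rather than directly under <body>.
cat > sample.html <<'EOF'
<html><head><link rel="stylesheet" href="a.css"></head>
<body><div><table>
<tr><td><h3><span>World - Friendlies</span></h3></td></tr>
<tr><td><h3><span>Brazil - Serie A</span></h3></td></tr>
</table>
<script>if (a && b) { run(); }</script>
</div></body></html>
EOF

massage sample.html > sample.xhtml

# The absolute path from the question matches no nodes in this sample,
# because the table is not a direct child of <body>.
xmlstarlet sel -t -v '/html/body/table/tr/td[1]' -n sample.xhtml

# The descendant steps (//) find the spans wherever the table sits.
xmlstarlet sel -t -v '//table//tr//h3/span' -n sample.xhtml
```

If xmlstarlet recovers the sample as expected, the last command should print the two league names. Because the function passes its arguments straight through, it should also work at the end of a pipe, for example curl -sL http://int.soccerway.com/ | massage > soccer.xhtml.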