Text-Processing

How can I massage or format HTML so that it can be parsed with xmlstarlet?

  • February 6, 2020

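Do I first need to run HTML as found in the wild through something like jsoup? The goal is not to make it valid in a human sense (it might even turn into gibberish), but at least to get a file that xmlstarlet can process.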

Preferably I am looking for a CLI that can be installed and used like this:

massage foo.html > bar.xhtml

Or at least something along those lines.

Case in point:

thufir@doge:~/.html$ 
thufir@doge:~/.html$ curl http://int.soccerway.com/  > soccer.html
 % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                Dload  Upload   Total   Spent    Left  Speed
100  188k    0  188k    0     0   313k      0 --:--:-- --:--:-- --:--:--  313k
thufir@doge:~/.html$ 
thufir@doge:~/.html$ xmlstarlet sel -t -v "/html/body/table/tr/td[1]" -n soccer.html 
soccer.html:70.13: xmlParseEntityRef: no name
if (this.$ && this.$.fn && this.$.fn.jquery) {
           ^
soccer.html:70.14: xmlParseEntityRef: no name
if (this.$ && this.$.fn && this.$.fn.jquery) {
            ^
soccer.html:70.26: xmlParseEntityRef: no name
if (this.$ && this.$.fn && this.$.fn.jquery) {
                        ^
soccer.html:70.27: xmlParseEntityRef: no name
if (this.$ && this.$.fn && this.$.fn.jquery) {
                         ^
soccer.html:198.8: Opening and ending tag mismatch: link line 27 and head
</head>
      ^
soccer.html:209.45: EntityRef: expecting ';'
 j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
                                           ^
soccer.html:223.40: xmlParseEntityRef: no name
     if (typeof(e.data) === 'string' && (e.data.indexOf('onEplayerVideoStarted'
                                      ^

Ideally I would run xmlstarlet directly against the URL, but there seems to be no such provision.

There is a format (fo) option, but I was not able to get results different from the above.

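If you only want the table's data cells, you can use xmlstarlet fo followed by xmlstarlet sel. The main problem you are running into is the XPath. An absolute path such as /html/body/table/tr/td[1] only matches a table that is a direct child of body. If you add a few "wildcard" steps (//), which match descendants at any depth, you will get the desired results: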

# fetch URL silently, following redirects, and send to standard out
curl -sL http://int.soccerway.com/                          |

# interpret input as HTML (-H) and try to recover as much as possible (-R)
xmlstarlet fo  -H -R                           2> /dev/null |

# use the following XPath expression and return the value (-t -v), 
# also add a newline after the result (-n)
xmlstarlet sel -t -v '//table//tr//h3/span' -n 2> /dev/null |

# only show the first 10 values
head -n10

Output:

World - Friendlies
Argentina - Prim B Nacional
Australia - National Premier Leagues
Australia - NPL Youth League
Bangladesh - Premier League
Belarus - Premier League
Benin - Championnat National
Brazil - Serie A
Brazil - Serie D
Brazil - Copa Paulista
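
To tie this back to the massage foo.html > bar.xhtml interface asked for in the question, here is a minimal sketch of such a wrapper. It only reuses the xmlstarlet fo -H -R call from the pipeline above. The function name massage is hypothetical (borrowed from the question), and the sample HTML is made up for illustration; it is not taken from the soccerway page.

```shell
#!/bin/sh
# Hypothetical helper, named after the command wished for in the question.
# Reads an HTML file (or standard input when no file is given) and writes
# recovered, well-formed XML to standard output.
massage() {
    # -H: interpret input as HTML; -R: recover as much as possible.
    # Parser complaints go to stderr, which is discarded here.
    xmlstarlet fo -H -R "$@" 2> /dev/null
}

# Made-up sample with the same kinds of problems as in the question:
# a bare && inside a script, an unclosed link tag, unclosed td/tr tags.
tmp=$(mktemp -d)
cat > "$tmp/sample.html" <<'EOF'
<!DOCTYPE html>
<html><head><title>demo</title>
<link rel="stylesheet" href="x.css">
<script>if (a && b) { run(); }</script>
</head><body>
<div><table>
<tr><td><h3><span>World - Friendlies</span></h3></td></tr>
<tr><td><h3><span>Brazil - Serie A</span></h3>
</table></div>
</body></html>
EOF

# Usage in the shape asked for in the question.
massage "$tmp/sample.html" > "$tmp/sample.xhtml"

# In this sample the table sits inside a div, so an absolute path such as
# /html/body/table would not match it; the // steps match at any depth.
xmlstarlet sel -t -v '//table//tr//h3/span' -n "$tmp/sample.xhtml"
```

Because massage passes its arguments straight through, it should also work at the end of a pipe, for example curl -sL URL | massage > bar.xhtml. Discarding stderr hides every parser warning, so drop the 2> /dev/null while debugging a page that yields no output.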

Source: https://unix.stackexchange.com/questions/382794