Text-Processing

Extracting specific words and their data from an html/xml file

  • September 19, 2019

The sample input is:

<bre rt="1600" et="1550794901464" st="1550794899864" tid="8390500116294391399" mh="N" cn="" lc="" ts="N/A" cidc="" IDC="" eidc="BRE-S-TRA-0085418501"/>
   <r1>
       <gr1>
           <a="1" b="smaple data with spaces" c="Created TrasctionInfo" d="1550794901228"/>
           <e="INITIAL" f="2" g="INITIAL_LEGACY" h="1550794901228" i="LegacyToggle is off. Follow Legacy flow"/>
           <lx ets="2019-02-22T00:21:41.228Z" trxn="smaple data with spaces 2 record" rn="Derive data" abc="COT def" def="Season occur" trxn="smaple data with spaces 3rd record" den="andys and others" trxn="smaple data with spaces 4th record" kit="Theater - Span day"
            rns="Span day" trxn="smaple data with spaces 5th record" off="|"/>
           <cwl wc="2.0766" tot="16" act="116.28960000000001" CSE="CHE-CSFL" wg1.0" high="1" </cwl>
               </gr1>
           </r1>
</bre>
<bre rt="1234" et="1234794901464" st="1234794899864" tid="2345500116294391399" mh="Y" cn="At123" lc="" ts="NA" cidc="" IDC="some text value" eidc="abc-def-gh-2385418501"/>
   <r1>
       <gr1>
           <a="1" trxn="other data with spaces" c="Created Info" d="3434794545228"/>
           <e="begin" f="2" g="INITIAL_LEGACY" h="1234709901228" i="Toggle hig. Follow toggle flow"/>
           <lx ets="2017-02-22T00:21:41.228Z" trxn="another record data" rn="Derive data" abc="COT def" trxn="smaple data with spaces record" def="Season occur" den="andys and others" trxn="smaple data with spaces 4th record" kit="Theater - Span day"
            rns="Span day" trxn="data with spaces" off="|"/>
           <cwl wc="2.0766" tot="16" act="116.28960000000001" CSE="CHE-CSFL" wg1.0" high="1" </cwl>
               </gr1>
           </r1>
</bre>
<bre rt="1234" et="1234794901464" st="1234794899864" tid="2345500116294391399" mh="Y" cn="At123" lc="" ts="NA" cidc="" IDC="some text value" eidc="abc-def-gh-2385418501"/>
   <r1>
       <gr1>
           <a="1" c="Created transaction" b="3434794545228"/>
           <e="begin" f="2" g="INITIAL_LEGACY" h="1234709901228" i="Toggle hig. Follow toggle flow"/>
           <lx ets="2017-02-22T00:21:41.228Z" rn="Derive data" abc="COT def" def="Season occur" den="andys and others" kit="Theater - Span day"
            rns="Span day" off="|"/>
           <cwl wc="2.0766" tot="16" act="116.28960000000001" CSE="CHE-CSFL" wg1.0" high="1" </cwl>
               </gr1>
           </r1>
</bre>

The output should be:

tid="8390500116294391399"
ts="N/A"
ets="2019-02-22T00:21:41.228Z" 
trxn="smaple data with spaces 2 record"
trxn="smaple data with spaces 3rd record"
trxn="smaple data with spaces 5th record"
tid="2345500116294391399"
ts="NA"
ets="2017-02-22T00:21:41.228Z" 
trxn="other data with spaces"
trxn="another record data"
trxn="smaple data with spaces record"
trxn="data with spaces"
tid="2345500116294391399"
ts="NA"
ets="2017-02-22T00:21:41.228Z"

This is what I tried:

sed -e 's/trxn=/\ntrxn=/g' -e 's/tid=/\ntid=/g' -e 's/ts=/\nts=/g'

while IFS= read -r var
do
   if grep -Fxq "$trxn" temp2.txt
   then
     awk -F"=" '/tid/{print VAL=$i} /ts/{print VAL=$i} /ets/{print VAL=$i} /trxn/{print VAL=$i} /tid/{print VAL=$i;next}' temp2.txt >> out.txt
   else
     awk -F"=" '/tid/{print VAL=$i} /ts/{print VAL=$i} /ets/{print VAL=$i} /tid/{print VAL=$i;next}' temp2.txt >> out.txt
   fi
done < "$input"

Or with grep, where -E enables extended regular expressions and -o prints each match on its own line:

$ grep -Eo '(ets|tid|trxn|ts)="[^"]+"' file
tid="8390500116294391399"
ts="N/A"
ets="2019-02-22T00:21:41.228Z"
trxn="smaple data with spaces 2 record"
trxn="smaple data with spaces 3rd record"
trxn="smaple data with spaces 4th record"
trxn="smaple data with spaces 5th record"
tid="2345500116294391399"
ts="NA"
trxn="other data with spaces"
ets="2017-02-22T00:21:41.228Z"
trxn="another record data"
trxn="smaple data with spaces record"
trxn="smaple data with spaces 4th record"
trxn="data with spaces"
tid="2345500116294391399"
ts="NA"
ets="2017-02-22T00:21:41.228Z"
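
The grep output above follows file order, so in the second record trxn="other data with spaces" is printed before ets, whereas the requested output lists ets first. The following is only a rough sketch of one way to regroup the lines, under two assumptions: every record begins with a tid= attribute (true for the sample), and attribute names are preceded by whitespace. The function names extract_attrs and reorder_attrs and the xts attribute in the demo are made up for illustration. It does not try to drop the "4th record" entries that the requested output omits, since the question gives no rule for that.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the original answer.

# Print ets/tid/trxn/ts attribute="value" pairs in file order.
# The name must follow start-of-line or whitespace, so a made-up
# attribute such as xts="..." is not reported as ts="...".
extract_attrs() {
    grep -Eo '(^|[[:space:]])(ets|tid|trxn|ts)="[^"]+"' "$1" |
        sed 's/^[[:space:]]*//'
}

# Regroup each record as tid, ts, ets, then its trxn lines.
# Assumes every record starts with a tid= line, as in the sample.
reorder_attrs() {
    awk '
        function emit(    i) {
            if (tid != "") print tid
            if (ts  != "") print ts
            if (ets != "") print ets
            for (i = 1; i <= n; i++) print trxn[i]
            tid = ts = ets = ""; n = 0
        }
        /^tid=/  { emit(); tid = $0; next }
        /^ts=/   { ts = $0; next }
        /^ets=/  { ets = $0; next }
        /^trxn=/ { trxn[++n] = $0 }
        END      { emit() }
    '
}

# Small demo input modelled on the second record of the sample.
demo=$(mktemp)
printf '%s\n' \
    '<bre rt="1234" tid="2345" xts="ignored" ts="NA"/>' \
    '<a="1" trxn="other data with spaces" c="Created Info"/>' \
    '<lx ets="2017-02-22T00:21:41.228Z" trxn="another record data" rn="x"/>' \
    > "$demo"

extract_attrs "$demo" | reorder_attrs
rm -f "$demo"
```

On the demo input this prints tid, ts and ets first, followed by the two trxn lines in their original order. On the real data it would be run as extract_attrs file | reorder_attrs.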

Source: https://unix.stackexchange.com/questions/512484