Text-Processing
Extract specific words and their data from an HTML/XML file
The sample input is:
<bre rt="1600" et="1550794901464" st="1550794899864" tid="8390500116294391399" mh="N" cn="" lc="" ts="N/A" cidc="" IDC="" eidc="BRE-S-TRA-0085418501"/> <r1> <gr1> <a="1" b="smaple data with spaces" c="Created TrasctionInfo" d="1550794901228"/> <e="INITIAL" f="2" g="INITIAL_LEGACY" h="1550794901228" i="LegacyToggle is off. Follow Legacy flow"/> <lx ets="2019-02-22T00:21:41.228Z" trxn="smaple data with spaces 2 record" rn="Derive data" abc="COT def" def="Season occur" trxn="smaple data with spaces 3rd record" den="andys and others" trxn="smaple data with spaces 4th record" kit="Theater - Span day" rns="Span day" trxn="smaple data with spaces 5th record" off="|"/> <cwl wc="2.0766" tot="16" act="116.28960000000001" CSE="CHE-CSFL" wg1.0" high="1" </cwl> </gr1> </r1> </bre> <bre rt="1234" et="1234794901464" st="1234794899864" tid="2345500116294391399" mh="Y" cn="At123" lc="" ts="NA" cidc="" IDC="some text value" eidc="abc-def-gh-2385418501"/> <r1> <gr1> <a="1" trxn="other data with spaces" c="Created Info" d="3434794545228"/> <e="begin" f="2" g="INITIAL_LEGACY" h="1234709901228" i="Toggle hig. Follow toggle flow"/> <lx ets="2017-02-22T00:21:41.228Z" trxn="another record data" rn="Derive data" abc="COT def" trxn="smaple data with spaces record" def="Season occur" den="andys and others" trxn="smaple data with spaces 4th record" kit="Theater - Span day" rns="Span day" trxn="data with spaces" off="|"/> <cwl wc="2.0766" tot="16" act="116.28960000000001" CSE="CHE-CSFL" wg1.0" high="1" </cwl> </gr1> </r1> </bre> <bre rt="1234" et="1234794901464" st="1234794899864" tid="2345500116294391399" mh="Y" cn="At123" lc="" ts="NA" cidc="" IDC="some text value" eidc="abc-def-gh-2385418501"/> <r1> <gr1> <a="1" c="Created transaction" b="3434794545228"/> <e="begin" f="2" g="INITIAL_LEGACY" h="1234709901228" i="Toggle hig. 
Follow toggle flow"/> <lx ets="2017-02-22T00:21:41.228Z" rn="Derive data" abc="COT def" def="Season occur" den="andys and others" kit="Theater - Span day" rns="Span day" off="|"/> <cwl wc="2.0766" tot="16" act="116.28960000000001" CSE="CHE-CSFL" wg1.0" high="1" </cwl> </gr1> </r1> </bre>
The output should be:
tid="8390500116294391399"
ts="N/A"
ets="2019-02-22T00:21:41.228Z"
trxn="smaple data with spaces 2 record"
trxn="smaple data with spaces 3rd record"
trxn="smaple data with spaces 5th record"
tid="2345500116294391399"
ts="NA"
ets="2017-02-22T00:21:41.228Z"
trxn="other data with spaces"
trxn="another record data"
trxn="smaple data with spaces record"
trxn="data with spaces"
tid="2345500116294391399"
ts="NA"
ets="2017-02-22T00:21:41.228Z"
I tried the following:
sed -e 's/trxn=/\ntrxn=/g' -e 's/tid=/\ntid=/g' -e 's/ts=/\nts=/g'
while IFS= read -r var
do
    if grep -Fxq "$trxn" temp2.txt
    then
        awk -F"=" '/tid/{print VAL=$i} /ts/{print VAL=$i} /ets/{print VAL=$i} /trxn/{print VAL=$i} /tid/{print VAL=$i;next}' temp2.txt >> out.txt
    else
        awk -F"=" '/tid/{print VAL=$i} /ts/{print VAL=$i} /ets/{print VAL=$i} /tid/{print VAL=$i;next}' temp2.txt >> out.txt
    fi
done < "$input"
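As shown, the attempt cannot produce the wanted output: the sed command is given no input file, the variable trxn is never set, and inside awk the unset variable i evaluates to 0, so $i is the whole line rather than a single field. A minimal sketch of one possible awk-only approach follows. It assumes attribute values never contain a double quote; the function name extract_with_awk and the demo line are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print every ets/tid/trxn/ts attribute, one per line,
# in the order it appears in the input (files given as arguments, or stdin).
extract_with_awk() {
    awk '{
        line = $0
        # match() sets RSTART and RLENGTH to the leftmost match in "line".
        while (match(line, /(ets|tid|trxn|ts)="[^"]*"/)) {
            print substr(line, RSTART, RLENGTH)
            # Continue the search in the text after this match.
            line = substr(line, RSTART + RLENGTH)
        }
    }' "$@"
}

# Demo on a made-up line; rt= and rn= are skipped.
printf '%s\n' '<bre rt="1" tid="42" ts="N/A"/> <lx ets="2019" trxn="a b" rn="x" trxn="c d"/>' |
    extract_with_awk
```

Because the search always takes the leftmost match, ets="..." is reported as a whole and never as a ts="..." found inside it.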
Or use grep:
$ grep -Eo '(ets|tid|trxn|ts)="[^"]+"' file
tid="8390500116294391399"
ts="N/A"
ets="2019-02-22T00:21:41.228Z"
trxn="smaple data with spaces 2 record"
trxn="smaple data with spaces 3rd record"
trxn="smaple data with spaces 4th record"
trxn="smaple data with spaces 5th record"
tid="2345500116294391399"
ts="NA"
trxn="other data with spaces"
ets="2017-02-22T00:21:41.228Z"
trxn="another record data"
trxn="smaple data with spaces record"
trxn="smaple data with spaces 4th record"
trxn="data with spaces"
tid="2345500116294391399"
ts="NA"
ets="2017-02-22T00:21:41.228Z"
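One caveat with the pattern above: it would also report the tail of a longer attribute name, for example ts="3" out of a hypothetical hits="3". The sample input has no such attribute, so the result there is unaffected. A hedged variant that only accepts a name preceded by whitespace could look like the sketch below; the function name extract_attrs and the demo line are made up for illustration, and -o is assumed to be available (it is in GNU and BSD grep, but it is not part of POSIX).

```shell
#!/bin/sh
# Hypothetical helper: same idea as the grep above, but the attribute name
# must follow whitespace, so hits="3" is not reported as ts="3".
extract_attrs() {
    grep -Eo '[[:space:]](ets|tid|trxn|ts)="[^"]+"' "$@" |
        sed 's/^[[:space:]]*//'    # drop the whitespace kept by the match
}

# Demo on a made-up line.
printf '%s\n' '<x tid="42" hits="3" ts="N/A" trxn="a b"/>' | extract_attrs
```

The trade-off is that an attribute at the very start of a line, with no whitespace before it, would be missed by this variant.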