Filter XML documents matching a specific id
Suppose you have a file containing many XML documents, for example:

    <a>
      <b>
      ...
    </a>
    in between xml documents there may be plain text log messages
    <x>
      ...
    </x>
    ...
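How would I filter this file to show only those XML documents in which a given regular expression matches at least one line? I am talking about a simple textual match here, so the regex matching part may be completely ignorant of the underlying format (XML).
You can assume that the opening and closing tags of the root element are always on lines of their own (though they may be padded with whitespace), and that they are used only as root elements, i.e. tags with the same name do not appear below the root element. This should make it possible to get the job done without resorting to XML-aware tools.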
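I've written a Python solution, a Bash solution, and an Awk solution. The idea is the same for all of the scripts: go through the file line by line and use flag variables to keep track of state (i.e. whether or not we are currently inside an XML subdocument, and whether or not we have found a matching line).
In the Python script I read all of the lines into a list and keep track of the list index where the current XML subdocument begins, so that I can print out the current subdocument once the closing tag is reached. I check each line against the regex pattern and use a flag to record whether or not to output the current subdocument when we are done processing it.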
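In the Bash script I use a temporary file as a buffer to store the current XML subdocument, and wait until it has been completely written before using grep to check whether it contains a line matching the given regex. The Awk script is similar to the Bash script, but it uses an Awk array as the buffer instead of a file.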
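The buffering idea described above can be sketched in Python 3 as a generator that holds only the current subdocument in memory rather than the whole file. This is a minimal sketch, not one of the three scripts: the function name `filter_xml_docs` and its `invert` parameter are hypothetical, and it relies on the question's assumption that root tags sit on lines of their own.

```python
import re


def filter_xml_docs(lines, pattern, invert=False):
    """Yield each XML subdocument (as one string) that has a matching line."""
    regex = re.compile(pattern)
    buffer = []          # lines of the subdocument currently being read
    closing_tag = None   # None means we are between subdocuments
    matched = False      # set once a line of the subdocument matches

    for line in lines:
        stripped = line.strip()
        if closing_tag is None:
            # Between subdocuments: look for a root tag on its own line
            opening = re.match(r'<(\w+)>$', stripped)
            if opening:
                closing_tag = '</' + opening.group(1) + '>'
                buffer = [line]
                matched = False
        else:
            buffer.append(line)
            if stripped == closing_tag:
                # End of the subdocument: emit it if the flags agree
                if matched != invert:
                    yield ''.join(buffer)
                closing_tag = None
            elif regex.search(stripped):
                matched = True


# Sample data in the shape given in the question
sample = """<a>
  <b>
    string to search for: stuff
  </b>
</a>
in between xml documents there may be plain text log messages
<x>
  unicode string: øæå
</x>
""".splitlines(keepends=True)

for doc in filter_xml_docs(sample, 'stuff'):
    print(doc, end='')
```

Because the function accepts any iterable of lines, it could be fed an open file object or `sys.stdin` directly.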
I checked the scripts against the following data file (data.xml), which is based on the example data given in your question:

    <a>
      <b>
        string to search for: stuff
      </b>
    </a>
    in between xml documents there may be plain text log messages
    <x>
      unicode string: øæå
    </x>
Python solution
Here is a simple Python script that does what you want:
    #!/usr/bin/env python2
    # -*- encoding: ascii -*-
    """xmlgrep.py"""

    import sys
    import re

    invert_match = False
    if sys.argv[1] == '-v' or sys.argv[1] == '--invert-match':
        invert_match = True
        # Drop the script name so that the remaining arguments shift
        # down by one: argv[1] is now the regex, argv[2] the filename
        sys.argv.pop(0)

    regex = sys.argv[1]

    # Open the XML-ish file (or fall back to standard input)
    with open(sys.argv[2], 'r') if len(sys.argv) > 2 else sys.stdin as xmlfile:

        # Read all of the data into a list
        lines = xmlfile.readlines()

        # Use flags to keep track of which XML subdocument we're in
        # and whether or not we've found a match in that document
        start_index = closing_tag = regex_match = False

        # Iterate through all the lines
        for index, line in enumerate(lines):

            # Remove trailing and leading white-space
            line = line.strip()

            # If we have a start_index then we're inside an XML document
            if start_index is not False:

                # If this line is a closing tag then reset the flags
                # and print the document if we found a match
                if line == closing_tag:
                    if regex_match != invert_match:
                        print(''.join(lines[start_index:index+1]))
                    start_index = closing_tag = regex_match = False

                # If this line is NOT a closing tag then we
                # search the current line for a match
                elif re.search(regex, line):
                    regex_match = True

            # If we do NOT have a start_index then we're either at the
            # beginning of a new XML subdocument or we're in between
            # XML subdocuments
            else:

                # Check for an opening tag for a new XML subdocument
                match = re.match(r'^<(\w+)>$', line)
                if match:

                    # Store the current line number
                    start_index = index

                    # Construct the matching closing tag
                    closing_tag = '</' + match.groups()[0] + '>'
    python xmlgrep.py stuff data.xml

    <a>
      <b>
        string to search for: stuff
      </b>
    </a>

    python xmlgrep.py øæå data.xml

    <x>
      unicode string: øæå
    </x>
To search for the documents that do not match, reading from standard input:

    cat data.xml | python xmlgrep.py -v stuff
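The `-v` option works through the comparison `regex_match != invert_match` in the script, which behaves as an exclusive-or of the two flags. The snippet below is only an illustration of that test; the helper name `should_print` is hypothetical and does not appear in the script.

```python
def should_print(regex_match, invert_match):
    # Print when exactly one of the two flags is set (exclusive-or):
    # a match without -v, or no match with -v
    return regex_match != invert_match


# Enumerate all four combinations of the two flags
for regex_match in (False, True):
    for invert_match in (False, True):
        print(regex_match, invert_match, should_print(regex_match, invert_match))
```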
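Bash solution
Here is a bash implementation of the same basic algorithm. It uses a flag to keep track of whether the current line belongs to an XML document, and a temporary file as a buffer to store each XML document while it is being processed.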
    #!/usr/bin/env bash
    # xmlgrep.sh

    # Get the filename and search pattern from the command-line
    FILENAME="$1"
    REGEX="$2"

    # Use flags to keep track of which XML subdocument we're in
    XML_DOC=false
    CLOSING_TAG=""

    # Use a temporary file to store the current XML subdocument
    TEMPFILE="$(mktemp)"

    # Remove the temporary file when the script exits
    trap 'rm -f "${TEMPFILE}"' EXIT

    # Iterate through all the lines of the file; IFS= and -r preserve
    # leading/trailing white-space and backslashes
    while IFS= read -r LINE; do

        # If we're already in an XML subdocument then update
        # the temporary file and check to see if we've reached
        # the end of the document
        if "${XML_DOC}"; then

            # Append the line to the temp-file
            echo "${LINE}" >> "${TEMPFILE}"

            # If this line is a closing tag then reset the flags
            if echo "${LINE}" | grep -Pq '^\s*'"${CLOSING_TAG}"'\s*$'; then
                XML_DOC=false
                CLOSING_TAG=""

                # Print the document if it contains the match pattern
                if grep -Pq "${REGEX}" "${TEMPFILE}"; then
                    cat "${TEMPFILE}"
                fi
            fi

        # Otherwise we check to see if we've reached
        # the beginning of a new XML subdocument
        elif echo "${LINE}" | grep -Pq '^\s*<\w+>\s*$'; then

            # Extract the tag-name
            TAG_NAME="$(echo "${LINE}" | sed 's/^\s*<\(\w\+\)>\s*$/\1/;tx;d;:x')"

            # Construct the corresponding closing tag
            CLOSING_TAG="</${TAG_NAME}>"

            # Set the XML_DOC flag so we know we're inside an XML subdocument
            XML_DOC=true

            # Start storing the subdocument in the temporary file
            echo "${LINE}" > "${TEMPFILE}"
        fi
    done < "${FILENAME}"
    bash xmlgrep.sh data.xml 'stuff'

    <a>
      <b>
        string to search for: stuff
      </b>
    </a>

    bash xmlgrep.sh data.xml 'øæå'

    <x>
      unicode string: øæå
    </x>
Awk solution
Here is an awk solution. My awk isn't great, though, so it's pretty rough. It uses the same basic idea as the Bash and Python scripts: it stores each XML document in a buffer (an awk array) and uses flags to keep track of state. When it finishes processing a document, it prints the document if it contains any line matching the given regular expression. Here is the script:

    #!/usr/bin/env gawk
    # xmlgrep.awk
    # Usage: gawk -v PATTERN="regex" -f xmlgrep.awk FILE

    # Variables:
    #
    #   XML_DOC
    #       XML_DOC=1 if the current line is inside an XML document.
    #
    #   CLOSING_TAG
    #       Stores the closing tag for the current XML document.
    #
    #   BUFFER_LENGTH
    #       Stores the number of lines in the current XML document.
    #
    #   MATCH
    #       MATCH=1 if we found a matching line in the current XML document.
    #
    #   PATTERN
    #       The regular expression pattern to match against
    #       (given as a command-line argument).
    #

    # Initialize Variables
    BEGIN{
        XML_DOC=0;
        CLOSING_TAG="";
        BUFFER_LENGTH=0;
        MATCH=0;
    }
    {
        if (XML_DOC==1) {

            # If we're inside an XML block, add the current line to the buffer
            BUFFER[BUFFER_LENGTH]=$0;
            BUFFER_LENGTH++;

            # If we've reached a closing tag (alone on its line, possibly
            # padded with white-space), reset the XML_DOC and CLOSING_TAG flags
            if ($0 ~ ("^[[:space:]]*" CLOSING_TAG "[[:space:]]*$")) {
                XML_DOC=0;
                CLOSING_TAG="";

                # If there was a match then output the XML document.
                # Iterate by index: "for (i in BUFFER)" does not guarantee order
                if (MATCH==1) {
                    for (i=0; i<BUFFER_LENGTH; i++) {
                        print BUFFER[i];
                    }
                }
            }
            # If we found a matching line then update the MATCH flag
            else {
                if ($0 ~ PATTERN) {
                    MATCH=1;
                }
            }
        }
        else {

            # If we reach a new opening tag (alone on its line) then
            # start storing the data in the buffer
            if ($0 ~ /^[[:space:]]*<[a-z]+>[[:space:]]*$/) {

                # Set the XML_DOC flag
                XML_DOC=1;

                # Reset the buffer
                delete BUFFER;
                BUFFER[0]=$0;
                BUFFER_LENGTH=1;

                # Reset the match flag
                MATCH=0;

                # Compute the corresponding closing tag
                match($0, /<([a-z]+)>/, match_groups);
                CLOSING_TAG="</" match_groups[1] ">";
            }
        }
    }
    gawk -v PATTERN="øæå" -f xmlgrep.awk data.xml

    <x>
      unicode string: øæå
    </x>