Awk

Filter XML documents that match a specific id

  • February 2, 2018

Suppose you have a file containing many XML documents, such as:

<a>
 <b>
 ...
</a>
in between xml documents there may be plain text log messages
<x>
 ...
</x>

...

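How would I filter this file to show only those XML documents in which a given regular expression matches at least one line of the document? I am talking about a simple text match here, so the regex-matching part may be completely unaware of the underlying format, XML.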

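You can assume that the opening and closing tags of the root element are always on their own lines (although possibly padded with whitespace), and that they are used only as root elements, i.e. tags with the same name do not appear below the root element. This should make it possible to get the job done without resorting to XML-aware tools.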
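To make the assumption above concrete, here is a minimal Python sketch of how a root opening tag on its own line could be recognised and its closing tag derived. The names `OPENING_TAG` and `closing_tag_for` are hypothetical, and the sketch assumes attribute-free tags made only of word characters.

```python
import re

# Hypothetical helper: an opening root tag is assumed to sit alone on its
# line, optionally padded with whitespace, and to carry no attributes.
OPENING_TAG = re.compile(r'^\s*<(\w+)>\s*$')

def closing_tag_for(line):
    """Return the closing tag matching an opening root tag, or None."""
    match = OPENING_TAG.match(line)
    if match is None:
        return None
    return '</' + match.group(1) + '>'

print(closing_tag_for('  <a>  '))          # </a>
print(closing_tag_for('plain log line'))   # None
```

Because the closing tag is built from the captured name, any line-based tool can find the end of a document with a plain string comparison.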

Summary

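I wrote a Python solution, a Bash solution, and an Awk solution. The idea is the same in all three scripts: go through the input line by line and use flag variables to keep track of state (i.e. whether we are currently inside an XML subdocument and whether we have found a matching line).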

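In the Python script, I read all of the lines into a list and keep track of the list index where the current XML subdocument begins, so that I can print out the current subdocument when we reach its closing tag. I check each line against the regex pattern and use a flag to record whether to output the current subdocument once we have finished processing it.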

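In the Bash script, I use a temporary file as a buffer to store the current XML subdocument, and wait until it has been completely written before using grep to check whether it contains a line matching the given regex.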

The Awk script is similar to the Bash script, but I use an Awk array as the buffer instead of a file.
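For comparison, the buffer-based idea can also be sketched in Python as a generator that keeps only the current subdocument in memory. This is an illustrative variant, not one of the three scripts: the function name `matching_documents` and its `invert` parameter are assumptions, and the sample lines mirror the test data.

```python
import re

def matching_documents(lines, pattern, invert=False):
    """Yield each XML subdocument (a list of lines) selected by the pattern."""
    regex = re.compile(pattern)
    buffer = []          # lines of the subdocument currently being read
    closing_tag = None   # None means we are between subdocuments
    found = False        # whether any buffered line matched so far

    for line in lines:
        stripped = line.strip()
        if closing_tag is None:
            # Between subdocuments: look for an opening root tag
            opening = re.match(r'^<(\w+)>$', stripped)
            if opening:
                closing_tag = '</' + opening.group(1) + '>'
                buffer = [line]
                found = False
        else:
            buffer.append(line)
            if stripped == closing_tag:
                # End of the subdocument: emit it if it qualifies
                if found != invert:
                    yield buffer
                closing_tag = None
            elif regex.search(stripped):
                found = True

sample = [
    '<a>\n', ' <b>\n', '   string to search for: stuff\n', ' </b>\n', '</a>\n',
    'in between xml documents there may be plain text log messages\n',
    '<x>\n', '   unicode string: øæå\n', '</x>\n',
]
for document in matching_documents(sample, 'stuff'):
    print(''.join(document), end='')
```

Since the generator accepts any iterable of lines, it could be fed an open file or sys.stdin directly instead of a list.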

Test data file

I checked the scripts against the following data file (data.xml), based on the example data given in your question:

<a>
 <b>
   string to search for: stuff
 </b>
</a>
in between xml documents there may be plain text log messages
<x>
   unicode string: øæå
</x>

Python solution

Here is a simple Python script that does what you are asking for:

#!/usr/bin/env python2
# -*- encoding: ascii -*-
"""xmlgrep.py"""

import sys
import re

invert_match = False

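# Check for the optional invert-match flag; popping the first element
# shifts the remaining arguments so the pattern is again at sys.argv[1]
if len(sys.argv) > 1 and sys.argv[1] in ('-v', '--invert-match'):
   invert_match = True
   sys.argv.pop(0)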

regex = sys.argv[1]

# Open the XML-ish file
with open(sys.argv[2], 'r') if len(sys.argv) > 2 else sys.stdin as xmlfile:

   # Read all of the data into a list
   lines = xmlfile.readlines()

   # Use flags to keep track of which XML subdocument we're in
   # and whether or not we've found a match in that document
   start_index = closing_tag = regex_match = False

   # Iterate through all the lines
   for index, line in enumerate(lines):

       # Remove trailing and leading white-space
       line = line.strip()

       # If we have a start_index then we're inside an XML document
       if start_index is not False:

           # If this line is a closing tag then reset the flags
           # and print the document if we found a match
           if line == closing_tag:
               if regex_match != invert_match:
                   print(''.join(lines[start_index:index+1]))
               start_index = closing_tag = regex_match = False

           # If this line is NOT a closing tag then we
           # search the current line for a match
           elif re.search(regex, line):
               regex_match = True

       # If we do NOT have a start_index then we're either at the
       # beginning of a new XML subdocument or we're in between
       # XML subdocuments
       else:

           # Check for an opening tag for a new XML subdocument
           match = re.match(r'^<(\w+)>$', line)
           if match:

               # Store the current line number
               start_index = index

               # Construct the matching closing tag
               closing_tag = '</' + match.groups()[0] + '>'

Here is how you would run the script to search for the string "stuff":

python xmlgrep.py stuff data.xml

Here is the output:

<a>
 <b>
   string to search for: stuff
 </b>
</a>

Here is how you would run the script to search for the string "øæå":

python xmlgrep.py øæå data.xml

Here is the output:

<x>
   unicode string: øæå
</x>

You can also specify -v or --invert-match to search for documents that do not match, and read from standard input:

cat data.xml | python xmlgrep.py -v stuff
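If the manual sys.argv handling feels fragile, the same command-line interface could be expressed with the standard argparse module. The following is only a sketch under that assumption; the `parse_args` wrapper is hypothetical and is not part of xmlgrep.py.

```python
import argparse

def parse_args(argv):
    """Parse xmlgrep-style arguments from a list of strings."""
    parser = argparse.ArgumentParser(prog='xmlgrep.py')
    parser.add_argument('-v', '--invert-match', action='store_true',
                        help='select subdocuments with no matching line')
    parser.add_argument('regex', help='pattern to search for')
    # The file is optional; None signals that standard input should be read
    parser.add_argument('file', nargs='?', default=None)
    return parser.parse_args(argv)

args = parse_args(['-v', 'stuff'])
print(args.invert_match, args.regex, args.file)
```

In a real script the function would be called as parse_args(sys.argv[1:]), and a missing pattern would produce a usage message rather than an IndexError.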

Bash solution

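Here is a bash implementation of the same basic algorithm. It uses flags to keep track of whether the current line belongs to an XML document, and uses a temporary file as a buffer to store each XML document as it is being processed.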

#!/usr/bin/env bash
# xmlgrep.sh

# Get the filename and search pattern from the command-line
FILENAME="$1"
REGEX="$2"

# Use flags to keep track of which XML subdocument we're in
XML_DOC=false
CLOSING_TAG=""

# Use a temporary file to store the current XML subdocument
TEMPFILE="$(mktemp)"

# Reset the internal field separator to preserve white-space
export IFS=''

# Iterate through all the lines of the file
while read -r LINE; do

   # If we're already in an XML subdocument then update
   # the temporary file and check to see if we've reached
   # the end of the document
   if "${XML_DOC}"; then

       # Append the line to the temp-file
       echo "${LINE}" >> "${TEMPFILE}"

       # If this line is a closing tag then reset the flags
       if echo "${LINE}" | grep -Pq '^\s*'"${CLOSING_TAG}"'\s*$'; then
           XML_DOC=false
           CLOSING_TAG=""

           # Print the document if it contains the match pattern 
           if grep -Pq "${REGEX}" "${TEMPFILE}"; then
               cat "${TEMPFILE}"
           fi
       fi

   # Otherwise we check to see if we've reached
   # the beginning of a new XML subdocument
   elif echo "${LINE}" | grep -Pq '^\s*<\w+>\s*$'; then

       # Extract the tag-name
       TAG_NAME="$(echo "${LINE}" | sed 's/^\s*<\(\w\+\)>\s*$/\1/;tx;d;:x')"

       # Construct the corresponding closing tag
       CLOSING_TAG="</${TAG_NAME}>"

       # Set the XML_DOC flag so we know we're inside an XML subdocument
       XML_DOC=true

       # Start storing the subdocument in the temporary file 
       echo "${LINE}" > "${TEMPFILE}"
   fi
done < "${FILENAME}"

# Clean up the temporary file
rm -f "${TEMPFILE}"

Here is how you would run the script to search for the string "stuff":

bash xmlgrep.sh data.xml 'stuff'

Here is the corresponding output:

<a>
 <b>
   string to search for: stuff
 </b>
</a>

Here is how you would run the script to search for the string "øæå":

bash xmlgrep.sh data.xml 'øæå'

Here is the corresponding output:

<x>
   unicode string: øæå
</x>

Awk solution

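Here is an awk solution, although my awk isn't very good, so it's rough. It uses the same basic idea as the Bash and Python scripts. It stores each XML document in a buffer (an awk array) and uses flags to keep track of state. When it finishes processing a document, it prints it if it contains any line matching the given regular expression. Here is the script: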

#!/usr/bin/env gawk
# xmlgrep.awk

# Variables:
#
#   XML_DOC
#       XML_DOC=1 if the current line is inside an XML document.
#
#   CLOSING_TAG
#       Stores the closing tag for the current XML document.
#
#   BUFFER_LENGTH
#       Stores the number of lines in the current XML document.
#
#   MATCH
#       MATCH=1 if we found a matching line in the current XML document.
#
#   PATTERN
#       The regular expression pattern to match against (given as a command-line argument).
#

# Initialize Variables
BEGIN{
   XML_DOC=0;
   CLOSING_TAG="";
   BUFFER_LENGTH=0;
   MATCH=0;
}
{
   if (XML_DOC==1) {

       # If we're inside an XML block, add the current line to the buffer
       BUFFER[BUFFER_LENGTH]=$0;
       BUFFER_LENGTH++;

       # If we've reached a closing tag, reset the XML_DOC and CLOSING_TAG flags
       if ($0 ~ CLOSING_TAG) {
           XML_DOC=0;
           CLOSING_TAG="";

           # If there was a match then output the XML document
           if (MATCH==1) {
               # Iterate by index: "for (i in BUFFER)" does not guarantee order
               for (i = 0; i < BUFFER_LENGTH; i++) {
                   print BUFFER[i];
               }
           }
       }
       # If we found a matching line then update the MATCH flag
       else {
           if ($0 ~ PATTERN) {
               MATCH=1;
           }
       }
   }
   else {

       # If we reach a new opening tag then start storing the data in the buffer
       if ($0 ~ /^[[:space:]]*<[a-z]+>[[:space:]]*$/) {

           # Set the XML_DOC flag
           XML_DOC=1;

           # Reset the buffer
           delete BUFFER;
           BUFFER[0]=$0;
           BUFFER_LENGTH=1;

           # Reset the match flag
           MATCH=0;

           # Compute the corresponding closing tag
           match($0, /<([a-z]+)>/, match_groups);
           CLOSING_TAG="</" match_groups[1] ">";
       }
   }
}

Here is how you call it:

gawk -v PATTERN="øæå" -f xmlgrep.awk data.xml

Here is the corresponding output:

<x>
   unicode string: øæå
</x>

Source: https://unix.stackexchange.com/questions/411496