Filtering XML documents that match a specific id

February 2, 2018

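Suppose you have a file that contains many XML documents, for example:
<a>
 <b>
 ...
</a>
in between xml documents there may be plain text log messages
<x>
 ...
</x>

...
How would I filter this file to show only those XML documents in which a given regular expression matches at least one line? I am talking about simple text matching here, so the regex-matching part may be completely ignorant of the underlying format, XML.
You can assume that the opening and closing tags of the root element are always on lines of their own (though they may be padded with whitespace), and that they are used only as root elements, i.e. tags with the same name do not occur below the root element. This should make it possible to get the job done without resorting to XML-aware tools.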

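Summary

I wrote a Python solution, a Bash solution, and an Awk solution. The idea is the same in all three scripts: go through the file line by line and use flag variables to keep track of state (i.e. whether we are currently inside an XML sub-document and whether we have found a matching line).

In the Python script I read all of the lines into a list and keep track of the list index where the current XML sub-document begins, so that the sub-document can be printed once its closing tag is reached. I check each line against the regex pattern and use a flag to record whether to output the current sub-document when we are done with it.

In the Bash script I use a temporary file as a buffer to store the current XML sub-document, and wait until it has been written completely before using grep to check whether it contains a line matching the given regex.

The Awk script is similar to the Bash script, but uses an Awk array as the buffer instead of a file.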
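The buffer-based variant described above can also be split into two small steps: first cut the input into sub-documents, then test each one. The following is only a rough sketch of that idea, not one of the scripts from this answer; the names iter_documents and grep_documents are made up for illustration. Unlike the Python script below, it also searches the opening and closing tag lines, as the Bash script does.

```python
import re

# An opening root tag alone on its line, possibly padded with white-space
OPENING_TAG = re.compile(r'^\s*<(\w+)>\s*$')


def iter_documents(lines):
    """Yield each XML sub-document as a list of lines, skipping log lines."""
    buffer = None        # lines of the current document, or None between documents
    closing_tag = None   # closing tag that ends the current document
    for line in lines:
        if buffer is None:
            opening = OPENING_TAG.match(line)
            if opening:
                buffer = [line]
                closing_tag = '</' + opening.group(1) + '>'
        else:
            buffer.append(line)
            if line.strip() == closing_tag:
                yield buffer
                buffer = None


def grep_documents(lines, pattern):
    """Yield the documents in which at least one line matches pattern."""
    regex = re.compile(pattern)
    for document in iter_documents(lines):
        if any(regex.search(line) for line in document):
            yield document


sample = [
    '<a>\n', ' <b>\n', '   string to search for: stuff\n', ' </b>\n', '</a>\n',
    'in between xml documents there may be plain text log messages\n',
    '<x>\n', '   unicode string: øæå\n', '</x>\n',
]
for document in grep_documents(sample, 'stuff'):
    print(''.join(document), end='')
```

Keeping the splitting separate from the matching means the same generator could feed an inverted match or any other per-document test.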

Test data file

I tested the scripts against the following data file (data.xml), which is based on the example data given in your question:

<a>
 <b>
   string to search for: stuff
 </b>
</a>
in between xml documents there may be plain text log messages
<x>
   unicode string: øæå
</x>

Python solution

Here is a simple Python script that does what you ask:

#!/usr/bin/env python2
# -*- encoding: ascii -*-
"""xmlgrep.py"""

import sys
import re

invert_match = False

# Consume the optional flag so that the regex is always sys.argv[1]
if len(sys.argv) > 1 and sys.argv[1] in ('-v', '--invert-match'):
    invert_match = True
    sys.argv.pop(1)

if len(sys.argv) < 2:
    sys.exit('usage: xmlgrep.py [-v|--invert-match] REGEX [FILE]')

regex = sys.argv[1]

# Open the XML-ish file, or fall back to standard input
with open(sys.argv[2], 'r') if len(sys.argv) > 2 else sys.stdin as xmlfile:

    # Read all of the data into a list
    lines = xmlfile.readlines()

    # Use flags to keep track of which XML subdocument we're in
    # and whether or not we've found a match in that document
    start_index = closing_tag = None
    regex_match = False

    # Iterate through all the lines
    for index, line in enumerate(lines):

        # Remove trailing and leading white-space
        line = line.strip()

        # If we have a start_index then we're inside an XML document
        if start_index is not None:

            # If this line is a closing tag then reset the flags
            # and print the document if we found a match
            if line == closing_tag:
                if regex_match != invert_match:
                    print(''.join(lines[start_index:index+1]))
                start_index = closing_tag = None
                regex_match = False

            # If this line is NOT a closing tag then we
            # search the current line for a match
            elif re.search(regex, line):
                regex_match = True

        # If we do NOT have a start_index then we're either at the
        # beginning of a new XML subdocument or we're in between
        # XML subdocuments
        else:

            # Check for an opening tag for a new XML subdocument
            match = re.match(r'^<(\w+)>$', line)
            if match:

                # Store the current line number
                start_index = index

                # Construct the matching closing tag
                closing_tag = '</' + match.group(1) + '>'

Here is how you would run the script to search for the string "stuff":

python xmlgrep.py stuff data.xml

And here is the output:

<a>
 <b>
   string to search for: stuff
 </b>
</a>

And here is how you would run the script to search for the string "øæå":

python xmlgrep.py øæå data.xml

And here is the output:

<x>
   unicode string: øæå
</x>
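A note on the non-ASCII search: under Python 3, a file opened in text mode is decoded with the locale's preferred encoding unless an encoding is passed explicitly. The sketch below assumes the data file is UTF-8 encoded, which the answer does not state, and uses an in-memory byte buffer as a stand-in for open('data.xml', encoding='utf-8') so that it runs on its own.

```python
import io
import re

# Assumption: the data file is UTF-8 encoded; the original answer does not say.
raw_bytes = '<x>\n   unicode string: øæå\n</x>\n'.encode('utf-8')

# io.TextIOWrapper stands in for open('data.xml', encoding='utf-8')
with io.TextIOWrapper(io.BytesIO(raw_bytes), encoding='utf-8') as xmlfile:
    lines = xmlfile.readlines()

# The pattern and the decoded lines are both Unicode text, so they compare directly
matching = [line for line in lines if re.search('øæå', line)]
print(matching)
```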

You can also specify -v or --invert-match to search for documents that do not match, and read from standard input:

cat data.xml | python xmlgrep.py -v stuff
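The script handles its arguments by inspecting sys.argv directly. If more options were ever needed, the same interface could be expressed with the standard argparse module. This is only a sketch of one possible layout, not part of the script above; build_parser is a hypothetical helper name, and the file argument is left as None when input should come from standard input.

```python
import argparse


def build_parser():
    """Build a parser mirroring the script's interface: [-v] REGEX [FILE]."""
    parser = argparse.ArgumentParser(prog='xmlgrep.py')
    parser.add_argument('-v', '--invert-match', action='store_true',
                        help='select documents that contain no matching line')
    parser.add_argument('regex', help='regular expression to search for')
    parser.add_argument('file', nargs='?', default=None,
                        help='input file (standard input if omitted)')
    return parser


# Equivalent of: cat data.xml | python xmlgrep.py -v stuff
args = build_parser().parse_args(['-v', 'stuff'])
print(args.invert_match, args.regex, args.file)
```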

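Bash solution

Here is a Bash implementation of the same basic algorithm. It uses a flag to keep track of whether the current line belongs to an XML document, and a temporary file as a buffer to store each XML document as it is being processed.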

#!/usr/bin/env bash
# xmlgrep.sh

# Get the filename and search pattern from the command-line
FILENAME="$1"
REGEX="$2"

# Use flags to keep track of which XML subdocument we're in
XML_DOC=false
CLOSING_TAG=""

# Use a temporary file to store the current XML subdocument
# and remove it again when the script exits
TEMPFILE="$(mktemp)"
trap 'rm -f "${TEMPFILE}"' EXIT

# Iterate through all the lines of the file. An empty IFS preserves
# leading and trailing white-space and -r keeps backslashes literal.
while IFS= read -r LINE; do

    # If we're already in an XML subdocument then update
    # the temporary file and check to see if we've reached
    # the end of the document
    if "${XML_DOC}"; then

        # Append the line to the temp-file
        printf '%s\n' "${LINE}" >> "${TEMPFILE}"

        # If this line is a closing tag then reset the flags
        if printf '%s\n' "${LINE}" | grep -Pq '^\s*'"${CLOSING_TAG}"'\s*$'; then
            XML_DOC=false
            CLOSING_TAG=""

            # Print the document if it contains the match pattern
            if grep -Pq -- "${REGEX}" "${TEMPFILE}"; then
                cat "${TEMPFILE}"
            fi
        fi

    # Otherwise we check to see if we've reached
    # the beginning of a new XML subdocument
    elif printf '%s\n' "${LINE}" | grep -Pq '^\s*<\w+>\s*$'; then

        # Extract the tag-name
        TAG_NAME="$(printf '%s\n' "${LINE}" | sed 's/^\s*<\(\w\+\)>\s*$/\1/;tx;d;:x')"

        # Construct the corresponding closing tag
        CLOSING_TAG="</${TAG_NAME}>"

        # Set the XML_DOC flag so we know we're inside an XML subdocument
        XML_DOC=true

        # Start storing the subdocument in the temporary file
        printf '%s\n' "${LINE}" > "${TEMPFILE}"
    fi
done < "${FILENAME}"

Here is how you would run the script to search for the string "stuff":

bash xmlgrep.sh data.xml 'stuff'

And here is the corresponding output:

<a>
 <b>
   string to search for: stuff
 </b>
</a>

And here is how you would run the script to search for the string "øæå":

bash xmlgrep.sh data.xml 'øæå'

And here is the corresponding output:

<x>
   unicode string: øæå
</x>

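Awk solution

Here is an awk solution. My awk isn't great, though, so it's pretty rough. It uses the same basic idea as the Bash and Python scripts: it stores each XML document in a buffer (an awk array) and uses flags to keep track of state. When it finishes processing a document, it prints the document if it contains any line matching the given regular expression. Here is the script: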

#!/usr/bin/gawk -f
# xmlgrep.awk
# (adjust the interpreter path above if gawk is installed elsewhere)

# Variables:
#
#   XML_DOC
#       XML_DOC=1 if the current line is inside an XML document.
#
#   CLOSING_TAG
#       Stores the closing tag for the current XML document.
#
#   BUFFER_LENGTH
#       Stores the number of lines in the current XML document.
#
#   MATCH
#       MATCH=1 if we found a matching line in the current XML document.
#
#   PATTERN
#       The regular expression pattern to match against (given as a command-line argument).
#

# Initialize Variables
BEGIN{
    XML_DOC=0;
    CLOSING_TAG="";
    BUFFER_LENGTH=0;
    MATCH=0;
}
{
    if (XML_DOC==1) {

        # If we're inside an XML block, add the current line to the buffer
        BUFFER[BUFFER_LENGTH]=$0;
        BUFFER_LENGTH++;

        # If we've reached a closing tag (alone on its line, possibly padded
        # with white-space), reset the XML_DOC and CLOSING_TAG flags
        if ($0 ~ ("^[[:space:]]*" CLOSING_TAG "[[:space:]]*$")) {
            XML_DOC=0;
            CLOSING_TAG="";

            # If there was a match then output the XML document.
            # Loop by index: "for (i in BUFFER)" does not guarantee line order.
            if (MATCH==1) {
                for (i = 0; i < BUFFER_LENGTH; i++) {
                    print BUFFER[i];
                }
            }
        }
        # If we found a matching line then update the MATCH flag
        else {
            if ($0 ~ PATTERN) {
                MATCH=1;
            }
        }
    }
    else {

        # If we reach a new opening tag then start storing the data in the buffer
        if ($0 ~ /^[[:space:]]*<[[:alnum:]_]+>[[:space:]]*$/) {

            # Set the XML_DOC flag
            XML_DOC=1;

            # Reset the buffer
            delete BUFFER;
            BUFFER[0]=$0;
            BUFFER_LENGTH=1;

            # Reset the match flag
            MATCH=0;

            # Compute the corresponding closing tag
            # (the three-argument form of match() is a gawk extension)
            match($0, /<([[:alnum:]_]+)>/, match_groups);
            CLOSING_TAG="</" match_groups[1] ">";
        }
    }
}

Here is how you would call it:

gawk -v PATTERN="øæå" -f xmlgrep.awk data.xml

And here is the corresponding output:

<x>
   unicode string: øæå
</x>

Source: https://unix.stackexchange.com/questions/411496
