grep for many patterns in files and show which pattern matched which file, without re-reading the files

February 22, 2022

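I have n separate, non-fixed greps over m files. I only need to know whether each file contains at least one match, but I need that for every pattern. I currently run n separate greps so that I can merge their results later, but this is very slow and some of the files are large.
Is there a way to replace these that doesn't require me to read all the files n times? (The results don't need to be in separate files, as long as I can map each pattern (not the match) to the files that contain a match.) grep -f looks promising, but it shows files that match any of the patterns, not which files match each pattern.
What I currently have, which later gets merged into one big file: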
grep -liE  pattern1  file_glob* > temp_pattern1.txt && sed -i 's/^/escapedpattern1 /' temp_pattern1.txt
grep -liE  pattern2  file_glob* > temp_pattern2.txt && sed -i 's/^/escapedpattern2 /' temp_pattern2.txt
...
grep -liE  patternN  file_glob* > temp_patternN.txt && sed -i 's/^/escapedpatternN /' temp_patternN.txt

temp_pattern1.txt
pattern1 /path/to/file1
pattern1 /path/to/file2
pattern1 /path/to/file3

temp_pattern2.txt
pattern2 /path/to/file1
pattern2 /path/to/file3
...
temp_patternN.txt
patternN /path/to/fileM

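If you want to use grep, the best you can do is use the -m 1 option to make grep stop reading its current input file at the first match. You will still read each input file multiple times (once per pattern), but it should be faster (unless the match is on or near the last line of the file). Note that -l on its own already lets grep stop scanning a file at its first match, so how much -m 1 adds will depend on your grep.
For example: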
#!/bin/bash

# Speed up each grep by exiting on 1st match with -m 1
#
# This still reads each file multiple times, but should run faster because it
# won't read the entire file each time unless the match is on the last line.
#
# Also reduce repetitive code by using an array and a for loop iterating over
# the indices of the array, rather than the values

patterns=(pattern1 pattern2 pattern3 patternN)

# iterate over the indices of the array (with `${!`), not the values.
for p in "${!patterns[@]}"; do
 # escape backslashes first, then forward slashes and ampersands, so that
 # the pattern is safe to use as a sed replacement string
 esc=$(printf '%s\n' "${patterns[$p]}" | sed -e 's:\\:\\\\:g; s:[/&]:\\&:g')
 grep -liE -m 1 "${patterns[$p]}" file_glob* |
   sed -e "s/^/$esc\t/" > "temp_pattern$(($p+1)).txt"
done
Note: the $p+1 is there because bash arrays are zero-based. The +1 makes the temp_pattern files start from 1.
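The escaping step above exists only because the pattern is pasted into a sed replacement. As a minimal sketch of one way to avoid it, the helper below (tag_matches is a hypothetical name, not part of the original answer) prefixes each matching file name with the pattern using printf instead of sed. It assumes file names contain no newlines.

```shell
#!/bin/bash
# Hypothetical helper (not from the original answer):
# print "PATTERN<TAB>FILE" for every file that matches PATTERN.
# usage: tag_matches PATTERN FILE...
tag_matches() {
  local pattern=$1
  shift
  # -l lists matching files, -i ignores case, -E uses extended regexes,
  # -m 1 stops at the first match, -- protects patterns starting with "-".
  grep -liE -m 1 -- "$pattern" "$@" |
    while IFS= read -r file; do
      # printf inserts the pattern literally, so no sed escaping is needed
      printf '%s\t%s\n' "$pattern" "$file"
    done
}
```

It could be called once per pattern, for example `for p in "${patterns[@]}"; do tag_matches "$p" file_glob*; done > all_matches.txt`, where all_matches.txt is just a placeholder name. This still reads each file once per pattern.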
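If you use a scripting language like awk or perl, you can do what you want. For example, the following perl script reads each input file only once, and checks each line against every pattern that hasn't yet been seen in that file. It keeps track of which patterns have already been seen in a particular file (using the array @seen), and also notices when all available patterns have been seen in a file (also using @seen) and closes the current file in that case.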
#!/usr/bin/perl
use strict;

# array to hold the patterns
my @patterns = qw(pattern1 pattern2 pattern3 patternN);

# Array-of-Arrays (AoA, see man pages for perllol and perldsc)
# to hold matches
my @matches;

# Array for keeping track of whether current pattern has
# been seen already in current file
my @seen;

# read each line of each file
while(<>) {
 # check each line against all patterns that haven't been seen yet
 for my $i (keys @patterns) {
   next if $seen[$i];
   if (m/$patterns[$i]/i) {
     # add the pattern and the filename to the @matches AoA
     push @{ $matches[$i] }, "$patterns[$i]\t$ARGV";
     $seen[$i] = 1;
   }
 };

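 # handle end-of-file AND having seen all patterns in a file.
 # Count the true entries in @seen: $#seen is only the highest index
 # set so far, so it can equal $#patterns before every pattern has matched.
 if (eof || scalar(grep { $_ } @seen) == scalar(@patterns)) {
   #print "closing $ARGV on line $.\n" unless eof;
   # close the current input file.  This will have
   # the effect of skipping to the next file.
   close(ARGV);
   # reset seen array at the end of every input file
   @seen = ();
 };
}

# now create output files
for my $i (keys @patterns) {
 #next unless @{ $matches[$i] // [] }; # skip patterns with no matches
 my $outfile = "temp_pattern" . ($i+1) . ".txt";
 open(my $out,">",$outfile) || die "Couldn't open output file '$outfile' for write: $!\n";
 # "// []" substitutes an empty list for patterns that never matched
 print $out join("\n", @{ $matches[$i] // [] }), "\n";
 close($out);
}
The if (eof || ...) line tests for either eof (end of file) on the current file, or whether we have seen all of the available patterns in the current file (i.e. whether the number of patterns flagged in @seen equals the number of elements in @patterns).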
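In either case, we want to reset the @seen array to empty so that it is ready for the next input file.
In the latter case, we also want to close the current input file early: we have already seen everything we want to see in it, so there is no need to keep reading and processing the rest of the file.
BTW, if you don't want empty files to be created (i.e. when a pattern has no matches), uncomment the next unless ... line in the output for loop.
If you don't want or need the temp files and just want to output all of the matches to one file, replace the final output for loop with: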
for my $i (keys @patterns) {
 #next unless @{ $matches[$i] // [] }; # skip patterns with no matches
 # "// []" substitutes an empty list for patterns that never matched
 print join("\n", @{ $matches[$i] // [] }), "\n";
}
and redirect the output to a file.
BTW, if you want to add the line number at which the pattern first occurred in the file, change:
push @{ $matches[$i] }, "$patterns[$i]\t$ARGV";
to
push @{ $matches[$i] }, "$patterns[$i]\t$.\t$ARGV";
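$. is a built-in perl variable which holds the current line number of the input read via <>. It gets reset to zero whenever the current file (ARGV) is closed.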
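The answer mentions awk as an alternative to perl. As a rough sketch of the same read-each-file-once idea, the shell function below wraps an awk program. The name match_once and the pattern file argument are hypothetical. It assumes one extended regex per line in a non-empty pattern file, and an awk that supports nextfile and whole-array delete (gawk and recent mawk do). It approximates grep -i by lower-casing both the line and the pattern, which may not suit patterns containing upper-case escape sequences.

```shell
#!/bin/bash
# Hypothetical helper (not from the original answer):
# print "PATTERN<TAB>FILE" once per (pattern, file) pair,
# reading each data file at most once.
# usage: match_once PATTERN_FILE FILE...
match_once() {
  local pattern_file=$1
  shift
  awk '
    # first input (the pattern file): store one regex per line
    NR == FNR { pat[++n] = $0; next }
    # first line of each data file: reset the per-file state
    FNR == 1 { delete seen; nseen = 0 }
    {
      line = tolower($0)
      for (i = 1; i <= n; i++) {
        # only test patterns not yet seen in this file
        if (!(i in seen) && line ~ tolower(pat[i])) {
          seen[i] = 1
          nseen++
          print pat[i] "\t" FILENAME
        }
      }
      # every pattern has matched: skip the rest of this file
      if (nseen == n) nextfile
    }
  ' "$pattern_file" "$@"
}
```

The output is grouped by file rather than by pattern, so piping it through sort would group the lines by pattern if that layout is preferred.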

Source: https://unix.stackexchange.com/questions/691264
