ack ：獲取第 10 個（或更大的第 n 個）匹配/擷取組

June 18, 2020

我想我可能只是搜尋錯誤，但我沒有找到任何答案。如果有重複，請告訴我，我可以刪除它。
問題背景
我正在使用ack( link )，它在引擎蓋下有 Perl 5，來獲取 n-gram - 尤其是高階 n-gram。使用我知道的語法（基本上最多$9），我最多可以得到 9 克，但我無法得到 10 克。使用$10只是給了我$1一個0之後。喜歡的事情$(10)並${10}沒有解決問題。我對使用語言建模工具包的解決方案不感興趣，我想使用ack.
我正在使用的一個數據集是馬克吐溫的全集
( wget http://www.gutenberg.org/cache/epub/3200/pg3200.txt && mv pg3200.txt TWAIN_Mark_complete_orig.txt ).
我已經把事情解析得很乾淨（請參閱文章末尾的解析說明TWAIN_Mark_complete_parsed.txt）並將解析結果保存為.
我從 2-gram 得到很好，程式碼和部分結果是
time cat TWAIN_Mark_complete_parsed.txt | \
   ack '(\S+) +(?=(\S+) +)' \
   --output '$1 $2' | \
   sort | uniq -c | \
   sort -rn &gt; Twain_2grams.txt
## `time` info not shown
$ head -n 2 Twain_2grams.txt
 18176 of the
 13288 in the
一直到 9 克，與
time cat TWAIN_Mark_complete_parsed.txt | \
   ack '(\S+) (?=(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+))' \
   --output '$1 $2 $3 $4 $5 $6 $7 $8 $9' | \
   sort | uniq -c | sort -rn &gt; Twain_9grams.txt
## time info not shown
$ head -n 2 Twain_9grams.txt
    17 to mrs jane clemens and mrs moffett in st
    17 mrs jane clemens and mrs moffett in st louis
（注意我對命令進行元程式ack，而不是只輸入每一個。）
問題/我嘗試過的
我第一次嘗試 10 克，結果是
time cat TWAIN_Mark_complete_parsed.txt | \
   ack '(\S+) (?=(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+))' \
   --output '$1 $2 $3 $4 $5 $6 $7 $8 $9 $10' | \
   sort | uniq -c | sort -rn &gt; Twain_10grams.txt

$ head -n 2 Twain_10grams.txt
    17 to mrs jane clemens and mrs moffett in st to0
    17 mrs jane clemens and mrs moffett in st louis mrs0
為了更好地了解正在發生的事情，
參看。這個 SO answer（和這個評論）有關如何通過逐字突出顯示來獲得彩色差異的詳細資訊。基本上apt還是yumfor colordiff，然後pipfor diff-highlight。
使用$(10)而不是$10給出前兩行輸出為
    17 to mrs jane clemens and mrs moffett in st $(10)
    17 mrs jane clemens and mrs moffett in st louis $(10)
（兩分鐘後）。
使用${10}而不是$10給出前兩行輸出為
    17 to mrs jane clemens and mrs moffett in st ${10}
    17 mrs jane clemens and mrs moffett in st louis ${10}
這就是我的想法。
預期/期望的輸出
請注意，實際輸出存在與此處顯示的不同的統計（非常非零和有限）可能性。9-gram 的前兩個結果不是不同的單詞序列。更常見的 10-gram 的其他可能部分可以通過查看前 10 個最常見的 9-gram 來找到 - 使用head而不是head -n 2. 即便如此，我相當肯定即使這樣也不能保證我們有兩個最常見的 10 克。但是，我希望我能清楚地說明我想要完成的工作。
~~17 to mrs jane clemens and mrs moffett in st louis~~
~~3 mrs jane clemens and mrs moffett in st louis honolulu~~
編輯我已經找到了另一組將預期輸出更改為（可能不是實際輸出，而是從我之前使用的簡單模型更改它的一組。）
    17 to mrs jane clemens and mrs moffett in st louis
     7 happiness in his home had been wounded and bruised almost
那將是head -n 2我一直用來展示我得到什麼樣的結果的。
我不想通過我將在這裡使用的相同過程來獲得它。
$ grep -o "to mrs jane clemens and mrs moffett in st [^ ]\+" \
  TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
    17 to mrs jane clemens and mrs moffett in st louis

$ grep -o "mrs jane clemens and mrs moffett in st louis [^ ]\+" \
  TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
     3 mrs jane clemens and mrs moffett in st louis honolulu
     2 mrs jane clemens and mrs moffett in st louis san
     2 mrs jane clemens and mrs moffett in st louis no
     2 mrs jane clemens and mrs moffett in st louis 224
     1 mrs jane clemens and mrs moffett in st louis wash
     1 mrs jane clemens and mrs moffett in st louis wailuku
     1 mrs jane clemens and mrs moffett in st louis virginia
     1 mrs jane clemens and mrs moffett in st louis the
     1 mrs jane clemens and mrs moffett in st louis sept
     1 mrs jane clemens and mrs moffett in st louis on
     1 mrs jane clemens and mrs moffett in st louis hartford
     1 mrs jane clemens and mrs moffett in st louis carson
編輯用於查找較新的第二位頻率的程式碼是
$ grep -o "[^ ]\+ happiness in his home had been wounded and bruised" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
     6 shelley's happiness in his home had been wounded and bruised
     1 his happiness in his home had been wounded and bruised
$ grep -o "shelley's happiness in his home had been wounded and [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
     6 shelley's happiness in his home had been wounded and bruised
$ grep -o "happiness in his home had been wounded and bruised [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
     7 happiness in his home had been wounded and bruised almost
$ grep -o "in his home had been wounded and bruised almost [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
     7 in his home had been wounded and bruised almost to
$ grep -o "his home had been wounded and bruised almost to [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
     7 his home had been wounded and bruised almost to death
$ grep -o "home had been wounded and bruised almost to death [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
     1 home had been wounded and bruised almost to death thirdly
     1 home had been wounded and bruised almost to death secondly
     1 home had been wounded and bruised almost to death it
     1 home had been wounded and bruised almost to death fourthly
     1 home had been wounded and bruised almost to death first
     1 home had been wounded and bruised almost to death fifthly
     1 home had been wounded and bruised almost to death and
從評論編輯
@Inian 發表了很棒的評論：
這記錄在發行說明中 - github.com/beyondgrep/ack3/blob/dev/RELEASE-NOTES.md -您現在受限於以下變數： $ 1 thru $ 9, $ , $ ., $ &, $ ` , $ ’ and $ +_
對於未來的人，我放了一個版本，今天存檔，RELEASE-NOTES
man頁面ack確實有線條
$1 through $9
The subpattern from the corresponding set of capturing parentheses.
If your pattern is "(.+) and (.+)", and the string is "this and that',
then $1 is "this" and $2 is "that".
但我希望有辦法獲得更高的數字。有了來自的資訊RELEASE-NOTES，這種希望似乎幾乎消失了。
但是，我仍然想知道是否有人有解決方法或破解方法，無論是使用ack還是任何更“標準”的 NIX 類型的終端工具。按順序，我的偏好是perl, grep, awk, sed。如果有類似的東西ack（即只是命令行解析，而不是*基於 NLP 工具包的解決方案），我也對此感興趣。
我認為將其作為一個新問題提出可能會更好。如果你在這裡回答，那就太好了。如果我最終發布了一個新問題，我會將連結放在這裡：目前，這只是指向同一個問題的連結。
解析說明
為了讓我的語料庫為 n-gram 分析做好準備，這是我的解析。
tr [:upper:] [:lower:] &lt; TWAIN_Mark_complete_orig.txt | \
# upper case to lower case and avoid useless use of cat
tr '\n' ' ' | \
# newlines into spaces, so we can later make it one line, single-spaced
sed -E "s/[^a-z0-9 '*-]+//g" | \
# get rid of everything but letters, numbers, and a few other symbols (corpus)
awk '{$0=$0;$1=$1}1' &gt; TWAIN_Mark_complete_parsed.txt && \
# collapse all multiple spaces to one space (includes tabs), save to output
:
是的，這可能都在一行上（並且沒有尾隨 && :），但這有助於更容易閱讀以及解釋我為什麼要做我正在做的事情。
系統詳情
$ uname -a
CYGWIN_NT-10.0 MY_MACHINE 3.0.7(0.338/5/3) 2019-04-30 18:08 x86_64 Cygwin
$ bash --version | head -n 1
GNU bash, version 4.4.12(3)-release (x86_64-unknown-cygwin)
$ ack --version | head -n 2
ack v3.3.1 (standard build)
Running under Perl v5.26.3 at /usr/bin/perl.exe
$ systeminfo | sed -n 's/^OS\ *//p'
Name:                   Microsoft Windows 10 Enterprise
Version:                10.0.17134 N/A Build 17134
Manufacturer:           Microsoft Corporation
Configuration:          Member Workstation
Build Type:             Multiprocessor Free

儘管我不是 perl 專家，但這是一個可能的 hack。查看多合一源文件，它似乎ack只處理輸出字元串中的單個字元$。將其更改為接受多個字元無疑是可行的，但為了保持簡單，您可以0..9使用abc.... 例如，我進行了這些更改以接受$a和$b作為$10和$11（顯示為diff -u）
@@ -188,7 +188,7 @@
        $opt_output =~ s/\\r/\r/g;
        $opt_output =~ s/\\t/\t/g;

-        my @supported_special_variables = ( 1..9, qw( _ . ` & ' +  f ) );
+        my @supported_special_variables = ( 1..9, qw( a b _ . ` & ' +  f ) );
        @special_vars_used_by_opt_output = grep { $opt_output =~ /\$$_/ } @supported_special_variables;

        # If the $opt_output contains $&, $` or $', those vars won't be
@@ -924,6 +924,8 @@
                # on them not changing in the process of doing the s///.

                my %keep = map { ($_ =&gt; ${$_} // '') } @special_vars_used_by_opt_output;
+                $keep{a} = $10;
+                $keep{b} = $11;
                $keep{_} = $line if exists $keep{_}; # Manually set it because $_ gets reset in a map.
                $keep{f} = $filename if exists $keep{f};
                my $special_vars_used_by_opt_output = join( '', @special_vars_used_by_opt_output );
但是，如果您只想查找第 10 個匹配項，則可以使用$+它顯示與最後一個成功搜尋模式的最後一個括號匹配的文本。

引用自：https://unix.stackexchange.com/questions/593467

ack ：獲取第 10 個（或更大的第 n 個）匹配/擷取組

問題背景

問題/我嘗試過的

預期/期望的輸出

從評論編輯

解析說明

系統詳情

相關問答

在perl中提取模式

ack.pl 工具和 ack.pl 標誌

確認文件名的正則表達式

需要在模式旁邊或下方搜尋單詞

這個 glob 表達式如何刪除冒號？

perl 在 debian 中寫了多少程式碼？