Perl

ack: Get the 10th (or greater nth) match / capture group

  • June 18, 2020

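I think I may just be searching for the wrong thing, but I haven't found any answers. If this is a duplicate, please let me know and I can delete it.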

Problem Background

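I'm using ack, which has Perl 5 under the hood, to get n-grams, especially high-order n-grams. Using the syntax I know (basically, up to $9), I can get up to 9-grams, but I can't get 10-grams. Using $10 just gives me $1 followed by a 0. Things like $(10) and ${10} have not solved the problem. I am not interested in solutions that use language-modeling toolkits; I want to use ack.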

One data set I'm working with is the complete works of Mark Twain

( wget http://www.gutenberg.org/cache/epub/3200/pg3200.txt && mv pg3200.txt TWAIN_Mark_complete_orig.txt ).

I have parsed things quite cleanly (see the parsing notes at the end of the post) and saved the parsed result as TWAIN_Mark_complete_parsed.txt.

I get good results starting from 2-grams; the code and partial results are

time cat TWAIN_Mark_complete_parsed.txt | \
   ack '(\S+) +(?=(\S+) +)' \
   --output '$1 $2' | \
   sort | uniq -c | \
   sort -rn > Twain_2grams.txt
## `time` info not shown
$ head -n 2 Twain_2grams.txt
 18176 of the
 13288 in the

all the way up to 9-grams, with

time cat TWAIN_Mark_complete_parsed.txt | \
   ack '(\S+) (?=(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+))' \
   --output '$1 $2 $3 $4 $5 $6 $7 $8 $9' | \
   sort | uniq -c | sort -rn > Twain_9grams.txt
## time info not shown
$ head -n 2 Twain_9grams.txt
    17 to mrs jane clemens and mrs moffett in st
    17 mrs jane clemens and mrs moffett in st louis

(Note that I metaprogrammed the ack commands rather than typing each one out by hand.)
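The post does not show the metaprogramming itself. A minimal sketch of how such commands could be generated in POSIX sh follows; the function names ngram_pattern and ngram_output are hypothetical, and the generated strings simply follow the shape of the 9-gram command above.

```shell
# Hypothetical helpers, not from the original post.

# Build the ack pattern for an n-gram (n >= 2): one capture group,
# then a lookahead containing the remaining n-1 capture groups.
ngram_pattern() {
    n=$1
    pat='(\S+) (?=(\S+)'
    i=2
    while [ "$i" -lt "$n" ]; do
        pat="$pat"' +(\S+)'
        i=$((i + 1))
    done
    printf '%s)\n' "$pat"
}

# Build the matching --output string: $1 $2 ... $n
ngram_output() {
    n=$1
    out='$1'
    i=2
    while [ "$i" -le "$n" ]; do
        out="$out \$$i"
        i=$((i + 1))
    done
    printf '%s\n' "$out"
}

# Possible usage (assumes ack is installed):
#   ack "$(ngram_pattern 9)" --output "$(ngram_output 9)" \
#       TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
```

Generating both strings from the same n keeps the number of capture groups and the number of output variables in step.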

Problem / What I've Tried

My first attempt at 10-grams was the following, with the results shown after it

time cat TWAIN_Mark_complete_parsed.txt | \
   ack '(\S+) (?=(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+))' \
   --output '$1 $2 $3 $4 $5 $6 $7 $8 $9 $10' | \
   sort | uniq -c | sort -rn > Twain_10grams.txt

$ head -n 2 Twain_10grams.txt
    17 to mrs jane clemens and mrs moffett in st to0
    17 mrs jane clemens and mrs moffett in st louis mrs0

To get a better look at what is going on,

diff -u <(head -n 2 Twain_10grams.txt) <(head -n 2 Twain_9grams.txt) | colordiff | diff-highlight

Cf. this SO answer (and this comment) for details on how to get a colored diff with word-by-word highlighting. Basically, apt or yum for colordiff, then pip for diff-highlight.

Using $(10) instead of $10 gives the first two lines of output as

    17 to mrs jane clemens and mrs moffett in st $(10)
    17 mrs jane clemens and mrs moffett in st louis $(10)

(after two minutes).

Using ${10} instead of $10 gives the first two lines of output as

    17 to mrs jane clemens and mrs moffett in st ${10}
    17 mrs jane clemens and mrs moffett in st louis ${10}

And those are all the ideas I had.

Expected/Desired Output

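Note that there is a statistical (very much non-zero and finite) possibility that the actual output differs from what is shown here. The top two 9-gram results are not distinct word sequences: they overlap. Other candidates for more frequent 10-grams could be found by looking at the 10 most common 9-grams, using head instead of head -n 2. Even so, I'm fairly sure that this would not guarantee finding the two most common 10-grams. Still, I hope it makes clear what I am trying to accomplish.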

    17 to mrs jane clemens and mrs moffett in st louis
     3 mrs jane clemens and mrs moffett in st louis honolulu

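Edit: I have found another set that changes the expected output to the following (probably still not the actual output, but it does change it from the simplistic model I was using before).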

    17 to mrs jane clemens and mrs moffett in st louis
     7 happiness in his home had been wounded and bruised almost

That would be the head -n 2 I have been using to show what kind of results I get.

I would not want to obtain it through the same process I use here.

$ grep -o "to mrs jane clemens and mrs moffett in st [^ ]\+" \
  TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
    17 to mrs jane clemens and mrs moffett in st louis

$ grep -o "mrs jane clemens and mrs moffett in st louis [^ ]\+" \
  TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
     3 mrs jane clemens and mrs moffett in st louis honolulu
     2 mrs jane clemens and mrs moffett in st louis san
     2 mrs jane clemens and mrs moffett in st louis no
     2 mrs jane clemens and mrs moffett in st louis 224
     1 mrs jane clemens and mrs moffett in st louis wash
     1 mrs jane clemens and mrs moffett in st louis wailuku
     1 mrs jane clemens and mrs moffett in st louis virginia
     1 mrs jane clemens and mrs moffett in st louis the
     1 mrs jane clemens and mrs moffett in st louis sept
     1 mrs jane clemens and mrs moffett in st louis on
     1 mrs jane clemens and mrs moffett in st louis hartford
     1 mrs jane clemens and mrs moffett in st louis carson

Edit: The code used to find the newer second-place frequency is

$ grep -o "[^ ]\+ happiness in his home had been wounded and bruised" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
     6 shelley's happiness in his home had been wounded and bruised
     1 his happiness in his home had been wounded and bruised
$ grep -o "shelley's happiness in his home had been wounded and [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
     6 shelley's happiness in his home had been wounded and bruised
$ grep -o "happiness in his home had been wounded and bruised [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
     7 happiness in his home had been wounded and bruised almost
$ grep -o "in his home had been wounded and bruised almost [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
     7 in his home had been wounded and bruised almost to
$ grep -o "his home had been wounded and bruised almost to [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
     7 his home had been wounded and bruised almost to death
$ grep -o "home had been wounded and bruised almost to death [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
     1 home had been wounded and bruised almost to death thirdly
     1 home had been wounded and bruised almost to death secondly
     1 home had been wounded and bruised almost to death it
     1 home had been wounded and bruised almost to death fourthly
     1 home had been wounded and bruised almost to death first
     1 home had been wounded and bruised almost to death fifthly
     1 home had been wounded and bruised almost to death and
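The one-word extension repeated by hand above can be wrapped in a small helper. This is only a sketch: the name next_words is hypothetical, it assumes a grep that supports -o (as the commands above already do), and the phrase is interpreted as a basic regular expression, so words containing characters such as * would need escaping.

```shell
# Hypothetical helper: count which words follow a given phrase.
#   $1 = phrase (used as a basic regular expression)
#   $2 = corpus file (single-spaced words)
next_words() {
    # "[^ ][^ ]*" is the portable spelling of the GNU "[^ ]\+" used above
    grep -o "$1 [^ ][^ ]*" "$2" | sort | uniq -c | sort -rn
}

# Possible usage:
#   next_words "mrs jane clemens and mrs moffett in st louis" \
#       TWAIN_Mark_complete_parsed.txt
```

Like the manual commands, this only extends one known phrase at a time; it does not search for the most frequent 10-grams overall.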

Edit from comments

@Inian posted a great comment:

This is documented in the release notes - github.com/beyondgrep/ack3/blob/dev/RELEASE-NOTES.md - You are now restricted to the following variables: $1 thru $9, $_, $., $&, $`, $' and $+

For future readers, I have put up a version of RELEASE-NOTES archived today.

The man page for ack does have the lines

$1 through $9

The subpattern from the corresponding set of capturing parentheses.

If your pattern is "(.+) and (.+)", and the string is "this and that',

then $1 is "this" and $2 is "that".

But I was hoping there was a way to reach higher numbers. With the information from RELEASE-NOTES, that hope seems almost gone.

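However, I would still like to know whether anyone has a workaround or hack, either using ack or any of the more "standard" *NIX-type terminal tools. In order, my preferences are perl, grep, awk, sed. If there is something similar to ack (i.e. just command-line parsing, not an NLP-toolkit-based solution), I am interested in that too.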

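I figured it might be better to ask this as a new question. If you answer here, that would be great. If I end up posting a new question, I will put the link here: for now, it is just a link to this same question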


Parsing Notes

To get my corpus ready for n-gram analysis, here is my parsing.

tr [:upper:] [:lower:] < TWAIN_Mark_complete_orig.txt | \
# upper case to lower case and avoid useless use of cat
tr '\n' ' ' | \
# newlines into spaces, so we can later make it one line, single-spaced
sed -E "s/[^a-z0-9 '*-]+//g" | \
# get rid of everything but letters, numbers, and a few other symbols (corpus)
awk '{$0=$0;$1=$1}1' > TWAIN_Mark_complete_parsed.txt && \
# collapse all multiple spaces to one space (includes tabs), save to output
:

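Yes, this could all be on one line (and without the trailing && :), but laying it out this way makes it easier to read and to explain why I am doing what I am doing.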
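A quick way to see what the pipeline does is to run it on a small sample. The sketch below wraps the same four stages in a function that reads standard input; the name parse_corpus and the sample sentence are illustrative and not from the post, and the tr character classes are quoted here so the shell cannot glob them.

```shell
# Hypothetical wrapper around the four parsing stages shown above.
parse_corpus() {
    tr '[:upper:]' '[:lower:]' |           # upper case to lower case
        tr '\n' ' ' |                      # newlines into spaces
        sed -E "s/[^a-z0-9 '*-]+//g" |     # keep letters, digits, space, ' * -
        awk '{$0=$0;$1=$1}1'               # collapse runs of whitespace
}

# Small sample with punctuation, mixed case and a double space
printf '%s\n' 'Hello,  World!' "It's 1 TEST." | parse_corpus
# prints: hello world it's 1 test
```

The apostrophe survives because it is in the set of characters the sed stage keeps, which is why tokens such as shelley's appear in the n-gram output.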


System Details

$ uname -a
CYGWIN_NT-10.0 MY_MACHINE 3.0.7(0.338/5/3) 2019-04-30 18:08 x86_64 Cygwin
$ bash --version | head -n 1
GNU bash, version 4.4.12(3)-release (x86_64-unknown-cygwin)
$ ack --version | head -n 2
ack v3.3.1 (standard build)
Running under Perl v5.26.3 at /usr/bin/perl.exe
$ systeminfo | sed -n 's/^OS\ *//p'
Name:                   Microsoft Windows 10 Enterprise
Version:                10.0.17134 N/A Build 17134
Manufacturer:           Microsoft Corporation
Configuration:          Member Workstation
Build Type:             Multiprocessor Free

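Although I am not a perl expert, here is a possible hack. Looking at the all-in-one source file, it seems that ack only handles single-character $ variables in the output string. Changing this to accept multiple characters is no doubt feasible, but to keep things simple you can extend 0..9 with abc.... For example, I made these changes to accept $a and $b as $10 and $11 (shown as diff -u):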

@@ -188,7 +188,7 @@
        $opt_output =~ s/\\r/\r/g;
        $opt_output =~ s/\\t/\t/g;

-        my @supported_special_variables = ( 1..9, qw( _ . ` & ' +  f ) );
+        my @supported_special_variables = ( 1..9, qw( a b _ . ` & ' +  f ) );
        @special_vars_used_by_opt_output = grep { $opt_output =~ /\$$_/ } @supported_special_variables;

        # If the $opt_output contains $&, $` or $', those vars won't be
@@ -924,6 +924,8 @@
                # on them not changing in the process of doing the s///.

                my %keep = map { ($_ => ${$_} // '') } @special_vars_used_by_opt_output;
+                $keep{a} = $10;
+                $keep{b} = $11;
                $keep{_} = $line if exists $keep{_}; # Manually set it because $_ gets reset in a map.
                $keep{f} = $filename if exists $keep{f};
                my $special_vars_used_by_opt_output = join( '', @special_vars_used_by_opt_output );

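However, if you only need the 10th group, you can use $+, which holds the text matched by the last bracket of the last successful search pattern. For example, with the ten-capture pattern from the question, an output string of '$1 $2 $3 $4 $5 $6 $7 $8 $9 $+' puts the tenth word in the last position.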
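Since $+ only reaches one group past $9, longer windows need either the patch above or a different tool. One option that avoids capture groups altogether is a sliding window in awk, which is on the asker's list of acceptable tools. The following is a minimal sketch, not part of ack: the function name ngrams is hypothetical, and it assumes the single-spaced corpus format described in the parsing notes.

```shell
# Hypothetical helper: print every n-word window of a file, one per line.
#   $1 = n (window length), $2 = corpus file
ngrams() {
    # Put one word per line first, so awk never has to split one huge line
    tr -s ' ' '\n' < "$2" | awk -v n="$1" '
        NF {                               # skip empty lines
            buf[seen % n] = $1             # ring buffer holding the last n words
            seen++
            if (seen >= n) {
                out = buf[(seen - n) % n]  # oldest word in the window
                for (i = seen - n + 1; i < seen; i++)
                    out = out " " buf[i % n]
                print out
            }
        }'
}

# Possible usage, mirroring the pipelines in the question:
#   ngrams 10 TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn \
#       > Twain_10grams.txt
```

The windows overlap by n-1 words, which is the same effect the lookahead in the ack patterns is used for, and n is not limited by the number of capture variables.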

Quoted from: https://unix.stackexchange.com/questions/593467