  • June 18, 2020



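I am using ack, which has Perl 5 under the hood, to get n-grams, especially high-order n-grams. Using the syntax I know (basically, up to $9), I can get up to 9-grams, but I cannot get 10-grams. Using $10 just gives me $1 followed by a 0. Things like $(10) and ${10} did not solve the problem. I am not interested in solutions that use a language modeling toolkit; I want to use ack.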


( wget && mv pg3200.txt TWAIN_Mark_complete_orig.txt ).


Getting 2-grams works fine; the code and partial results are

time cat TWAIN_Mark_complete_parsed.txt | \
   ack '(\S+) +(?=(\S+) +)' \
   --output '$1 $2' | \
   sort | uniq -c | \
   sort -rn > Twain_2grams.txt
## `time` info not shown
$ head -n 2 Twain_2grams.txt
 18176 of the
 13288 in the

and this keeps working all the way up to 9-grams, with

time cat TWAIN_Mark_complete_parsed.txt | \
   ack '(\S+) (?=(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+))' \
   --output '$1 $2 $3 $4 $5 $6 $7 $8 $9' | \
   sort | uniq -c | sort -rn > Twain_9grams.txt
## time info not shown
$ head -n 2 Twain_9grams.txt
    17 to mrs jane clemens and mrs moffett in st
    17 mrs jane clemens and mrs moffett in st louis



My first attempt at 10-grams results in

time cat TWAIN_Mark_complete_parsed.txt | \
   ack '(\S+) (?=(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+))' \
   --output '$1 $2 $3 $4 $5 $6 $7 $8 $9 $10' | \
   sort | uniq -c | sort -rn > Twain_10grams.txt

$ head -n 2 Twain_10grams.txt
    17 to mrs jane clemens and mrs moffett in st to0
    17 mrs jane clemens and mrs moffett in st louis mrs0


diff -u <(head -n 2 Twain_10grams.txt) <(head -n 2 Twain_9grams.txt) | colordiff | diff-highlight

Cf. this SO answer (and this comment) for details on how to get colored diffs with word-by-word highlighting. Basically, use apt or yum for colordiff, then pip for diff-highlight.


With $(10) in the --output string, the placeholder is printed literally:

    17 to mrs jane clemens and mrs moffett in st $(10)
    17 mrs jane clemens and mrs moffett in st louis $(10)



With ${10} the result is the same:

    17 to mrs jane clemens and mrs moffett in st ${10}
    17 mrs jane clemens and mrs moffett in st louis ${10}



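Note that there is a statistical (decidedly non-zero, but finite) possibility that the actual output differs from what is shown here. The first two 9-gram results are not distinct word sequences; they overlap. Other candidates for the most common 10-grams can be found by looking at the ten most common 9-grams, using head instead of head -n 2. Even so, I am fairly sure that this does not guarantee we have found the two most common 10-grams. Still, I hope it makes clear what I am trying to accomplish.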

17 to mrs jane clemens and mrs moffett in st louis

3 mrs jane clemens and mrs moffett in st louis honolulu


    17 to mrs jane clemens and mrs moffett in st louis
     7 happiness in his home had been wounded and bruised almost

That would be the head -n 2 output I have been using to show what kind of results I get.


$ grep -o "to mrs jane clemens and mrs moffett in st [^ ]\+" \
  TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
    17 to mrs jane clemens and mrs moffett in st louis

$ grep -o "mrs jane clemens and mrs moffett in st louis [^ ]\+" \
  TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
     3 mrs jane clemens and mrs moffett in st louis honolulu
     2 mrs jane clemens and mrs moffett in st louis san
     2 mrs jane clemens and mrs moffett in st louis no
     2 mrs jane clemens and mrs moffett in st louis 224
     1 mrs jane clemens and mrs moffett in st louis wash
     1 mrs jane clemens and mrs moffett in st louis wailuku
     1 mrs jane clemens and mrs moffett in st louis virginia
     1 mrs jane clemens and mrs moffett in st louis the
     1 mrs jane clemens and mrs moffett in st louis sept
     1 mrs jane clemens and mrs moffett in st louis on
     1 mrs jane clemens and mrs moffett in st louis hartford
     1 mrs jane clemens and mrs moffett in st louis carson


$ grep -o "[^ ]\+ happiness in his home had been wounded and bruised" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
     6 shelley's happiness in his home had been wounded and bruised
     1 his happiness in his home had been wounded and bruised
$ grep -o "shelley's happiness in his home had been wounded and [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
     6 shelley's happiness in his home had been wounded and bruised
$ grep -o "happiness in his home had been wounded and bruised [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
     7 happiness in his home had been wounded and bruised almost
$ grep -o "in his home had been wounded and bruised almost [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
     7 in his home had been wounded and bruised almost to
$ grep -o "his home had been wounded and bruised almost to [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
     7 his home had been wounded and bruised almost to death
$ grep -o "home had been wounded and bruised almost to death [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
     1 home had been wounded and bruised almost to death thirdly
     1 home had been wounded and bruised almost to death secondly
     1 home had been wounded and bruised almost to death it
     1 home had been wounded and bruised almost to death fourthly
     1 home had been wounded and bruised almost to death first
     1 home had been wounded and bruised almost to death fifthly
     1 home had been wounded and bruised almost to death and


@Inian posted a great comment:

This is documented in the release notes: "You are now restricted to the following variables: $1 thru $9, $_, $., $&, $`, $' and $+"



$1 through $9

The subpattern from the corresponding set of capturing parentheses.

If your pattern is "(.+) and (.+)", and the string is "this and that", then $1 is "this" and $2 is "that".

but I was hoping there was a way to get higher numbers. With the information from RELEASE-NOTES, that hope seems all but gone.

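However, I would still like to know whether anyone has a workaround or hack, whether it uses ack or any of the more "standard" *NIX-type terminal tools. In order, my preferences are perl, grep, awk, sed. If there is something similar to ack (i.e. plain command-line parsing, *not* an NLP-toolkit-based solution), I am interested in that as well.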
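One possible workaround along these lines, sketched with awk instead of ack: loop over the fields of each line and print every run of N consecutive words, so the order of the n-gram is a parameter and no numbered capture groups are needed. The function name ngrams is made up for this illustration, and the sample input stands in for the parsed corpus. Note that awk implementations differ in how many fields they accept per record, so a corpus stored as one very long line may need gawk or a different record layout.

```shell
# ngrams N: read whitespace-separated words on stdin and print every
# N-gram found within each input line, one N-gram per output line.
# (Hypothetical helper name; not part of ack or any toolkit.)
ngrams() {
  awk -v n="$1" '{
    # i is the index of the first word of the current N-gram
    for (i = 1; i + n - 1 <= NF; i++) {
      s = $i
      for (j = 1; j < n; j++) s = s " " $(i + j)
      print s
    }
  }'
}

# Small demonstration on a made-up sentence
printf 'a b c d\n' | ngrams 3
```

On the real corpus, the counting step would stay the same as in the commands above, i.e. piping the output of ngrams 10 through sort | uniq -c | sort -rn.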



To get my corpus ready for n-gram analysis, here is my parsing.

tr '[:upper:]' '[:lower:]' < TWAIN_Mark_complete_orig.txt | \
# upper case to lower case and avoid useless use of cat
tr '\n' ' ' | \
# newlines into spaces, so we can later make it one line, single-spaced
sed -E "s/[^a-z0-9 '*-]+//g" | \
# get rid of everything but letters, numbers, and a few other symbols (corpus)
awk '{$0=$0;$1=$1}1' > TWAIN_Mark_complete_parsed.txt && \
# collapse all multiple spaces to one space (includes tabs), save to output

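Yes, this could all go on one line (and without the trailing && :) ), but this layout makes it easier to read and to explain why I am doing what I am doing.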
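To make the effect of the four steps concrete, here is the same pipeline wrapped in a function and run on a two-line sample. The function name normalize and the sample text are made up for this illustration; the commands themselves are the ones from the pipeline above, with the character classes for tr quoted.

```shell
# normalize: apply the same four steps as the pipeline above to stdin.
# (Hypothetical wrapper name, used only for this demonstration.)
normalize() {
  tr '[:upper:]' '[:lower:]' |        # upper case to lower case
    tr '\n' ' ' |                     # newlines into spaces
    sed -E "s/[^a-z0-9 '*-]+//g" |    # keep letters, digits, space, ' * -
    awk '{$0=$0;$1=$1}1'              # collapse runs of whitespace
}

printf 'Hello,  World!\nA  TEST-case: 42.\n' | normalize
# prints: hello world a test-case 42
```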


$ uname -a
CYGWIN_NT-10.0 MY_MACHINE 3.0.7(0.338/5/3) 2019-04-30 18:08 x86_64 Cygwin
$ bash --version | head -n 1
GNU bash, version 4.4.12(3)-release (x86_64-unknown-cygwin)
$ ack --version | head -n 2
ack v3.3.1 (standard build)
Running under Perl v5.26.3 at /usr/bin/perl.exe
$ systeminfo | sed -n 's/^OS\ *//p'
Name:                   Microsoft Windows 10 Enterprise
Version:                10.0.17134 N/A Build 17134
Manufacturer:           Microsoft Corporation
Configuration:          Member Workstation
Build Type:             Multiprocessor Free

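Here is one possible hack, although I am no Perl expert. Looking at the all-in-one source file, it seems that ack only handles a single character after $ in the output string. Changing it to accept multiple characters would no doubt be feasible, but to keep things simple you can extend 0..9 with a, b, c, .... For example, I made these changes to accept $a and $b as $10 and $11 (shown as diff -u):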

@@ -188,7 +188,7 @@
        $opt_output =~ s/\\r/\r/g;
        $opt_output =~ s/\\t/\t/g;

-        my @supported_special_variables = ( 1..9, qw( _ . ` & ' +  f ) );
+        my @supported_special_variables = ( 1..9, qw( a b _ . ` & ' +  f ) );
        @special_vars_used_by_opt_output = grep { $opt_output =~ /\$$_/ } @supported_special_variables;

        # If the $opt_output contains $&, $` or $', those vars won't be
@@ -924,6 +924,8 @@
                # on them not changing in the process of doing the s///.

                my %keep = map { ($_ => ${$_} // '') } @special_vars_used_by_opt_output;
+                $keep{a} = $10;
+                $keep{b} = $11;
                $keep{_} = $line if exists $keep{_}; # Manually set it because $_ gets reset in a map.
                $keep{f} = $filename if exists $keep{f};
                my $special_vars_used_by_opt_output = join( '', @special_vars_used_by_opt_output );
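The to0 and mrs0 suffixes in the question are what one would expect if only one character after $ is treated as part of the placeholder. A minimal sketch of that behaviour, using sed as a stand-in and not ack's actual Perl code; the function name expand_single_char is hypothetical, and the value must not contain characters that are special to sed:

```shell
# expand_single_char TEMPLATE VALUE: replace every "$1" in TEMPLATE with VALUE,
# looking at exactly one character after the dollar sign.
# (Illustration only; this is not how ack is implemented.)
expand_single_char() {
  printf '%s\n' "$1" | sed 's/\$1/'"$2"'/g'
}

# "$10" is read as "$1" followed by a literal 0
expand_single_char '$1 $10' 'to'
# prints: to to0
```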

However, if you only need the 10th match, you can use $+, which gives the text matched by the last bracket of the last successful search pattern.
