ack :獲取第 10 個(或更大的第 n 個)匹配/擷取組
我想我可能只是搜尋錯誤,但我沒有找到任何答案。如果有重複,請告訴我,我可以刪除它。
問題背景
我正在使用
ack
( link ),它在引擎蓋下有 Perl 5,來獲取 n-gram - 尤其是高階 n-gram。使用我知道的語法(基本上最多$9
),我最多可以得到 9 克,但我無法得到 10 克。使用$10
只是給了我$1
一個0
之後。喜歡的事情$(10)
並${10}
沒有解決問題。我對使用語言建模工具包的解決方案不感興趣,我想使用ack
.我正在使用的一個數據集是馬克吐溫的全集
(
wget http://www.gutenberg.org/cache/epub/3200/pg3200.txt && mv pg3200.txt TWAIN_Mark_complete_orig.txt
).我已經把事情解析得很乾淨(請參閱文章末尾的解析說明
TWAIN_Mark_complete_parsed.txt
)並將解析結果保存為.我從 2-gram 得到很好,程式碼和部分結果是
time cat TWAIN_Mark_complete_parsed.txt | \ ack '(\S+) +(?=(\S+) +)' \ --output '$1 $2' | \ sort | uniq -c | \ sort -rn > Twain_2grams.txt ## `time` info not shown $ head -n 2 Twain_2grams.txt 18176 of the 13288 in the
一直到 9 克,與
time cat TWAIN_Mark_complete_parsed.txt | \ ack '(\S+) (?=(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+))' \ --output '$1 $2 $3 $4 $5 $6 $7 $8 $9' | \ sort | uniq -c | sort -rn > Twain_9grams.txt ## time info not shown $ head -n 2 Twain_9grams.txt 17 to mrs jane clemens and mrs moffett in st 17 mrs jane clemens and mrs moffett in st louis
(注意我對命令進行元程式
ack
,而不是只輸入每一個。)問題/我嘗試過的
我第一次嘗試 10 克,結果是
time cat TWAIN_Mark_complete_parsed.txt | \ ack '(\S+) (?=(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+))' \ --output '$1 $2 $3 $4 $5 $6 $7 $8 $9 $10' | \ sort | uniq -c | sort -rn > Twain_10grams.txt $ head -n 2 Twain_10grams.txt 17 to mrs jane clemens and mrs moffett in st to0 17 mrs jane clemens and mrs moffett in st louis mrs0
為了更好地了解正在發生的事情,
參看。這個 SO answer(和這個評論)有關如何通過逐字突出顯示來獲得彩色差異的詳細資訊。基本上
apt
還是yum
forcolordiff
,然後pip
fordiff-highlight
。使用
$(10)
而不是$10
給出前兩行輸出為17 to mrs jane clemens and mrs moffett in st $(10) 17 mrs jane clemens and mrs moffett in st louis $(10)
(兩分鐘後)。
使用
${10}
而不是$10
給出前兩行輸出為17 to mrs jane clemens and mrs moffett in st ${10} 17 mrs jane clemens and mrs moffett in st louis ${10}
這就是我的想法。
預期/期望的輸出
請注意,實際輸出存在與此處顯示的不同的統計(非常非零和有限)可能性。9-gram 的前兩個結果不是不同的單詞序列。更常見的 10-gram 的其他可能部分可以通過查看前 10 個最常見的 9-gram 來找到 - 使用
head
而不是head -n 2
. 即便如此,我相當肯定即使這樣也不能保證我們有兩個最常見的 10 克。但是,我希望我能清楚地說明我想要完成的工作。
17 to mrs jane clemens and mrs moffett in st louis
3 mrs jane clemens and mrs moffett in st louis honolulu
編輯我已經找到了另一組將預期輸出更改為(可能不是實際輸出,而是從我之前使用的簡單模型更改它的一組。)
17 to mrs jane clemens and mrs moffett in st louis 7 happiness in his home had been wounded and bruised almost
那將是
head -n 2
我一直用來展示我得到什麼樣的結果的。我不想通過我將在這裡使用的相同過程來獲得它。
$ grep -o "to mrs jane clemens and mrs moffett in st [^ ]\+" \ TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn 17 to mrs jane clemens and mrs moffett in st louis $ grep -o "mrs jane clemens and mrs moffett in st louis [^ ]\+" \ TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn 3 mrs jane clemens and mrs moffett in st louis honolulu 2 mrs jane clemens and mrs moffett in st louis san 2 mrs jane clemens and mrs moffett in st louis no 2 mrs jane clemens and mrs moffett in st louis 224 1 mrs jane clemens and mrs moffett in st louis wash 1 mrs jane clemens and mrs moffett in st louis wailuku 1 mrs jane clemens and mrs moffett in st louis virginia 1 mrs jane clemens and mrs moffett in st louis the 1 mrs jane clemens and mrs moffett in st louis sept 1 mrs jane clemens and mrs moffett in st louis on 1 mrs jane clemens and mrs moffett in st louis hartford 1 mrs jane clemens and mrs moffett in st louis carson
編輯用於查找較新的第二位頻率的程式碼是
$ grep -o "[^ ]\+ happiness in his home had been wounded and bruised" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn 6 shelley's happiness in his home had been wounded and bruised 1 his happiness in his home had been wounded and bruised $ grep -o "shelley's happiness in his home had been wounded and [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn 6 shelley's happiness in his home had been wounded and bruised $ grep -o "happiness in his home had been wounded and bruised [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn 7 happiness in his home had been wounded and bruised almost $ grep -o "in his home had been wounded and bruised almost [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn 7 in his home had been wounded and bruised almost to $ grep -o "his home had been wounded and bruised almost to [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn 7 his home had been wounded and bruised almost to death $ grep -o "home had been wounded and bruised almost to death [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn 1 home had been wounded and bruised almost to death thirdly 1 home had been wounded and bruised almost to death secondly 1 home had been wounded and bruised almost to death it 1 home had been wounded and bruised almost to death fourthly 1 home had been wounded and bruised almost to death first 1 home had been wounded and bruised almost to death fifthly 1 home had been wounded and bruised almost to death and
從評論編輯
@Inian 發表了很棒的評論:
這記錄在發行說明中 - github.com/beyondgrep/ack3/blob/dev/RELEASE-NOTES.md -您現在受限於以下變數: $ 1 thru $ 9, $ , $ ., $ &, $ ` , $ ’ and $ +_
對於未來的人,我放了一個版本,今天存檔,
RELEASE-NOTES
man
頁面ack
確實有線條
$1 through $9
The subpattern from the corresponding set of capturing parentheses.
If your pattern is "(.+) and (.+)", and the string is "this and that',
then $1 is "this" and $2 is "that".
但我希望有辦法獲得更高的數字。有了來自 的資訊
RELEASE-NOTES
,這種希望似乎幾乎消失了。但是,我仍然想知道是否有人有解決方法或破解方法,無論是使用
ack
還是任何更“標準”的 NIX 類型的終端工具。按順序,我的偏好是perl
,grep
,awk
,sed
。如果有類似的東西ack
(即只是命令行解析,而不是*基於 NLP 工具包的解決方案),我也對此感興趣。我認為將其作為一個新問題提出可能會更好。如果你在這裡回答,那就太好了。如果我最終發布了一個新問題,我會將連結放在這裡:目前,這只是指向同一個問題的連結。
解析說明
為了讓我的語料庫為 n-gram 分析做好準備,這是我的解析。
tr [:upper:] [:lower:] < TWAIN_Mark_complete_orig.txt | \ # upper case to lower case and avoid useless use of cat tr '\n' ' ' | \ # newlines into spaces, so we can later make it one line, single-spaced sed -E "s/[^a-z0-9 '*-]+//g" | \ # get rid of everything but letters, numbers, and a few other symbols (corpus) awk '{$0=$0;$1=$1}1' > TWAIN_Mark_complete_parsed.txt && \ # collapse all multiple spaces to one space (includes tabs), save to output :
是的,這可能都在一行上(並且沒有尾隨
&& :
),但這有助於更容易閱讀以及解釋我為什麼要做我正在做的事情。系統詳情
$ uname -a CYGWIN_NT-10.0 MY_MACHINE 3.0.7(0.338/5/3) 2019-04-30 18:08 x86_64 Cygwin $ bash --version | head -n 1 GNU bash, version 4.4.12(3)-release (x86_64-unknown-cygwin) $ ack --version | head -n 2 ack v3.3.1 (standard build) Running under Perl v5.26.3 at /usr/bin/perl.exe $ systeminfo | sed -n 's/^OS\ *//p' Name: Microsoft Windows 10 Enterprise Version: 10.0.17134 N/A Build 17134 Manufacturer: Microsoft Corporation Configuration: Member Workstation Build Type: Multiprocessor Free
儘管我不是 perl 專家,但這是一個可能的 hack。查看多合一源文件,它似乎
ack
只處理輸出字元串中的單個字元$
。將其更改為接受多個字元無疑是可行的,但為了保持簡單,您可以0..9
使用abc...
. 例如,我進行了這些更改以接受$a
和$b
作為$10
和$11
(顯示為diff -u
)@@ -188,7 +188,7 @@ $opt_output =~ s/\\r/\r/g; $opt_output =~ s/\\t/\t/g; - my @supported_special_variables = ( 1..9, qw( _ . ` & ' + f ) ); + my @supported_special_variables = ( 1..9, qw( a b _ . ` & ' + f ) ); @special_vars_used_by_opt_output = grep { $opt_output =~ /\$$_/ } @supported_special_variables; # If the $opt_output contains $&, $` or $', those vars won't be @@ -924,6 +924,8 @@ # on them not changing in the process of doing the s///. my %keep = map { ($_ => ${$_} // '') } @special_vars_used_by_opt_output; + $keep{a} = $10; + $keep{b} = $11; $keep{_} = $line if exists $keep{_}; # Manually set it because $_ gets reset in a map. $keep{f} = $filename if exists $keep{f}; my $special_vars_used_by_opt_output = join( '', @special_vars_used_by_opt_output );
但是,如果您只想查找第 10 個匹配項,則可以使用
$+
它顯示與最後一個成功搜尋模式的最後一個括號匹配的文本。