A shell command to find every n-gram of words in a text

April 7, 2020

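I have a text stream or a file that contains words separated by spaces, like:
I have a toy. you may not like it.
Each space-separated word may be composed of two or more small words, joined in camel case (separated by a change of case), in snake case (separated by underscores), or by dots, for example: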
I_amAManTest you_haveAHouse FOO_BAR_test.model
For example:
I_amAManTest
can be split into:
I
am
A
Man
Test
But I want to print every n-gram of the compound word (every run of consecutive small words), for example:
I_amAManTest
Output:
// from first word on
I
I_am
I_amA
I_amAMan
I_amAManTest
// from second word on 
am
amA
amAMan
amAManTest
// from third word on 
A
AMan
AManTest
// from fourth word on
Man
ManTest
// from fifth word on
Test
So, in summary, for an input like
I_amAManTest you_haveAHouse FOO_BAR_test
the output should be
I
I_am
I_amA
I_amAMan
I_amAManTest
am
amA
amAMan
amAManTest
A
AMan
AManTest
Man
ManTest
Test
you
you_have
you_haveA
you_haveAHouse
have
haveA
haveAHouse
A
AHouse
House
FOO
FOO_BAR
FOO_BAR_test
BAR
BAR_test
test

A (mostly) sed solution:
cat "$@" |
   tr -cs -- '._[:alpha:]' '[\n*]' |
   sed -n  -e 'h; :ms' \
           -e 'p; :ss' \
               -e 's/\([[:lower:]]\)[[:upper:]][[:lower:]]*$/\1/p; t ss' \
               -e 's/\([[:lower:]]\)[[:upper:]][[:upper:]]*$/\1/p; t ss' \
               -e 's/\([[:upper:]]\)[[:upper:]][[:lower:]]\+$/\1/p; t ss' \
               -e 's/[._][[:alpha:]][[:lower:]]*$//p; t ss' \
               -e 's/[._][[:upper:]]\+$//p; t ss' \
           -e 'g' \
           -e 's/^[[:upper:]]\?[[:lower:]]\+\([[:upper:]]\)/\1/; t mw' \
           -e 's/^[[:upper:]]\+\([[:upper:]][[:lower:]]\)/\1/; t mw' \
           -e 's/^[[:alpha:]][[:lower:]]*[._]//; t mw' \
           -e 's/^[[:upper:]]\+[._]//; t mw' \
           -e 'b' \
           -e ':mw; h; b ms'
The algorithm is
for each compound word (e.g., “FOO_BAR_test”) in the input
do
   repeat
       print what you’ve got
       repeat
           remove a small word from the end (e.g., “FOO_BAR_test” → “FOO_BAR”) and print what’s left
       until you’re down to the last one (e.g., “FOO_BAR_test” → “FOO”)
       go back to what you had at the beginning of the above loop
         and remove a small word from the beginning
         (e.g., “FOO_BAR_test” → “BAR_test”) ... but don’t print anything
   until you’re down to the last one (e.g., “FOO_BAR_test” → “test”)
end for loop
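The loop structure above can be sketched in plain POSIX shell, assuming the compound word has already been split into its small words. The function name print_runs, and the convention that each argument keeps its leading separator (if any), are assumptions made for this illustration; the answer itself does both the splitting and the looping inside sed.

```shell
#!/bin/sh
# print_runs is a hypothetical helper, not part of the original answer.
# Arguments: the small words of ONE compound word, each keeping its
# leading separator, e.g.   FOO _BAR _test    or    I _am A Man Test
print_runs() {
    # Outer loop: runs until every small word has been dropped from the start.
    while [ "$#" -gt 0 ]; do
        # Inner loop: print the first n small words, for n = $# down to 1,
        # which is the same as dropping small words from the end.
        n=$#
        while [ "$n" -gt 0 ]; do
            run=
            i=0
            for piece in "$@"; do
                [ "$i" -lt "$n" ] || break
                run=$run$piece
                i=$((i + 1))
            done
            # A run never starts with a separator, so strip one if present.
            printf '%s\n' "${run#[._]}"
            n=$((n - 1))
        done
        # Drop one small word from the beginning, printing nothing.
        shift
    done
}

print_runs FOO _BAR _test
```

Called as shown, the sketch prints FOO_BAR_test, FOO_BAR, FOO, BAR_test, BAR and test, one per line, which is the same order the sed script produces.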
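Details:
cat "$@" is a UUOC (useless use of cat). I usually avoid those; you can do tr args < file, but you cannot pass multiple files directly to tr.
tr -cs -- '._[:alpha:]' '[\n*]' breaks a line of many compound words into separate lines; for example,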
I_amAManTest you_haveAHouse FOO_BAR_test
becomes
I_amAManTest
you_haveAHouse
FOO_BAR_test
so sed can process one compound word at a time.
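The splitting step can be tried on its own. In the sketch below, -c makes tr act on every character that is not a dot, an underscore, or a letter, and -s squeezes each run of the resulting newlines into a single one. The wrapper name split_compounds is hypothetical and used only here.

```shell
#!/bin/sh
# split_compounds is a hypothetical name used only for this sketch.
# It reads text on standard input and writes one compound word per line.
split_compounds() {
    tr -cs -- '._[:alpha:]' '[\n*]'
}

# Two spaces, and a comma followed by a space, each collapse to one newline.
printf 'I_amAManTest  you_haveAHouse, FOO_BAR_test\n' | split_compounds
```

This prints the three compound words on three separate lines, with no empty lines in between.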
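sed -n suppresses automatic printing; sed prints only when told to.
-e specifies that the following expression is part of the sed script.
h copies the pattern space to the hold space.
:ms is a label (Main loop Start).
p prints the pattern space.
:ss is a label (Secondary loop Start).
The following commands remove one small word from the end of the compound word and, if successful, print the result and jump back to the beginning of the secondary loop.
s/\([[:lower:]]\)[[:upper:]][[:lower:]]*$/\1/p; t ss changes “nTest” to “n”.
s/\([[:lower:]]\)[[:upper:]][[:upper:]]*$/\1/p; t ss changes “mOK” to “m”.
s/\([[:upper:]]\)[[:upper:]][[:lower:]]\+$/\1/p; t ss changes “AMan” to “A”.
s/[._][[:alpha:]][[:lower:]]*$//p; t ss deletes “_am” (replaces it with nothing).
s/[._][[:upper:]]\+$//p; t ss deletes “_BAR” (replaces it with nothing).
This is the end of the secondary loop.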
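To watch the secondary loop in isolation, its five substitutions can be run as a small script of their own. This is a sketch that assumes GNU sed, since \+ is a GNU extension in basic regular expressions; the function name shrink_from_end is hypothetical.

```shell
#!/bin/sh
# shrink_from_end is a hypothetical name; the sed commands are the
# secondary loop of the script above, without the main loop around it.
# Reads one compound word per line on standard input.
shrink_from_end() {
    sed -n -e 'p; :ss' \
           -e 's/\([[:lower:]]\)[[:upper:]][[:lower:]]*$/\1/p; t ss' \
           -e 's/\([[:lower:]]\)[[:upper:]][[:upper:]]*$/\1/p; t ss' \
           -e 's/\([[:upper:]]\)[[:upper:]][[:lower:]]\+$/\1/p; t ss' \
           -e 's/[._][[:alpha:]][[:lower:]]*$//p; t ss' \
           -e 's/[._][[:upper:]]\+$//p; t ss'
}

printf 'I_amAManTest\n' | shrink_from_end
```

For this input the lines printed are I_amAManTest, I_amAMan, I_amA, I_am and I: each pass removes exactly one small word from the end.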
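g copies the hold space to the pattern space (going back to what we had at the beginning of the loop above).
The following commands remove one small word from the beginning of the compound word and, if successful, jump to the main loop wrap-up (mw = Main loop Wrap-up).
s/^[[:upper:]]\?[[:lower:]]\+\([[:upper:]]\)/\1/; t mw changes “amA” to “A” and “ManT” to “T”.
s/^[[:upper:]]\+\([[:upper:]][[:lower:]]\)/\1/; t mw changes “AMa” to “Ma”.
s/^[[:alpha:]][[:lower:]]*[._]//; t mw deletes “I_” and “you_” (replaces them with nothing).
s/^[[:upper:]]\+[._]//; t mw deletes “FOO_” (replaces it with nothing).
Each of the above substitute commands jumps to the main loop wrap-up (below) if it succeeds (if it finds/matches something). If we get here, the pattern space contains only a single small word, so we are done.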
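The start-of-word substitutions can be exercised the same way. The sketch below is a simplified variant, not the answer's exact control flow: it prints the pattern space and then loops on the four substitutions directly, with no hold space, because nothing is removed from the end here. It assumes GNU sed (\? and \+ are GNU extensions), and the name strip_from_start is hypothetical.

```shell
#!/bin/sh
# strip_from_start is a hypothetical name. It prints the word, removes one
# small word from the beginning, and repeats until no substitution matches.
# Reads one compound word per line on standard input.
strip_from_start() {
    sed -n -e ':ms' -e 'p' \
           -e 's/^[[:upper:]]\?[[:lower:]]\+\([[:upper:]]\)/\1/; t ms' \
           -e 's/^[[:upper:]]\+\([[:upper:]][[:lower:]]\)/\1/; t ms' \
           -e 's/^[[:alpha:]][[:lower:]]*[._]//; t ms' \
           -e 's/^[[:upper:]]\+[._]//; t ms'
}

printf 'I_amAManTest\n' | strip_from_start
```

For this input the lines printed are I_amAManTest, amAManTest, AManTest, ManTest and Test, which are the starting points of the five groups in the full script's output.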
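b branches (jumps) to the end of the sed script; that is, it finishes the current compound word, and sed moves on to the next input line.
:mw is the label for the main loop wrap-up.
h copies the pattern space to the hold space, ready for the next iteration of the main loop.
b ms jumps to the beginning of the main loop.
It produces the requested output. Unfortunately, the lines come out in a different order. If that matters, I can probably fix it.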
$ echo "I_amAManTest you_haveAHouse FOO_BAR_test" | ./myscript
I_amAManTest
I_amAMan
I_amA
I_am
I
amAManTest
amAMan
amA
am
AManTest
AMan
A
ManTest
Man
Test
you_haveAHouse
you_haveA
you_have
you
haveAHouse
haveA
have
AHouse
A
House
FOO_BAR_test
FOO_BAR
FOO
BAR_test
BAR
test

Source: https://unix.stackexchange.com/questions/573370
