Bash
Search for three consecutive words
My book list (a txt file) contains duplicates, like this:

    The Ideal Team Player
    The Ideal Team Player: How to Recognize and Cultivate The Three Essential Virtues
    Ideal Team Player: Recognize and Cultivate The Three Essential Virtues
    Joy on Demand: The Art of Discovering the Happiness Within
    Crucial Conversations Tools for Talking When Stakes Are High
    Joy on Demand
    Search Inside Yourself: The Unexpected Path to Achieving Success, Happiness
    Search Inside Yourself
    ......
    ......
    ......
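I need to find the duplicate books and delete them manually after checking. I searched and found that matching duplicate lines requires a pattern.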
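In my case, though, it is hard to find a pattern in the lines themselves. However, I did find a pattern in the word sequences.

I want to mark lines as duplicates only if they share three consecutive words *(case-insensitive)*.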
If you look at these lines:

    The Ideal Team Player
    The Ideal Team Player: How to Recognize and Cultivate The Three Essential Virtues
    Ideal Team Player: Recognize and Cultivate The Three Essential Virtues

`Ideal Team Player` is the run of consecutive words I am looking for. I would like output similar to the following:

    3 Ideal Team Player
    2 Joy on Demand
    2 Search Inside Yourself
    ......
    ......
    ......
How can I do this?
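The following `awk` program counts how many times each set of three consecutive words occurs (after removing punctuation) and, at the end, prints the count and the words for every set whose count is greater than 1:

    {
        # Strip punctuation; awk then re-splits the line into fields.
        gsub("[[:punct:]]", "")

        # Count every run of three adjacent words on this line.
        for (i = 3; i <= NF; ++i)
            w[$(i-2),$(i-1),$i]++
    }
    END {
        for (key in w) {
            count = w[key]
            if (count > 1) {
                # Keys are joined with SUBSEP; turn it back into spaces.
                gsub(SUBSEP," ",key)
                print count, key
            }
        }
    }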
Given the text in your question, this produces

    2 Search Inside Yourself
    2 Cultivate The Three
    2 The Three Essential
    2 Joy on Demand
    2 Recognize and Cultivate
    2 Three Essential Virtues
    2 and Cultivate The
    2 The Ideal Team
    3 Ideal Team Player
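As you can see, this may not be all that useful.

Instead, we can collect the same counts and then make a second pass over the file, printing every line that contains a word triplet whose count is greater than 1: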
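    # First pass over the file: count every three-word run.
    NR == FNR {
        gsub("[[:punct:]]", "")

        for (i = 3; i <= NF; ++i)
            w[$(i-2),$(i-1),$i]++

        next
    }

    # Second pass: print each original line that contains a repeated run.
    {
        orig = $0
        gsub("[[:punct:]]", "")

        for (i = 3; i <= NF; ++i)
            if (w[$(i-2),$(i-1),$i] > 1) {
                print orig
                next
            }
    }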
Testing on your file (the file name is given twice so that `awk` reads it in two passes):

    $ cat file
    The Ideal Team Player
    The Ideal Team Player: How to Recognize and Cultivate The Three Essential Virtues
    Ideal Team Player: Recognize and Cultivate The Three Essential Virtues
    Joy on Demand: The Art of Discovering the Happiness Within
    Crucial Conversations Tools for Talking When Stakes Are High
    Joy on Demand
    Search Inside Yourself: The Unexpected Path to Achieving Success, Happiness
    Search Inside Yourself

    $ awk -f script.awk file file
    The Ideal Team Player
    The Ideal Team Player: How to Recognize and Cultivate The Three Essential Virtues
    Ideal Team Player: Recognize and Cultivate The Three Essential Virtues
    Joy on Demand: The Art of Discovering the Happiness Within
    Joy on Demand
    Search Inside Yourself: The Unexpected Path to Achieving Success, Happiness
    Search Inside Yourself
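The question asks for a case-insensitive comparison, while the scripts above compare words exactly as written, so `the ideal team` and `The Ideal Team` would count as different triplets. A minimal sketch of one way to fold case before counting follows. The wrapper function `find_dupes` and the helper `norm` are names made up for this illustration, and the sketch assumes an `awk` that supports `tolower` and `[[:punct:]]` (both are POSIX).

```shell
# find_dupes FILE (hypothetical helper): print the lines of FILE that share
# a three-word run with some other text in FILE, ignoring case and punctuation.
find_dupes() {
    awk '
    # Return s lowercased, with punctuation removed.
    function norm(s) {
        s = tolower(s)
        gsub(/[[:punct:]]/, "", s)
        return s
    }

    # First pass: count every normalized three-word run.
    NR == FNR {
        n = split(norm($0), f)
        for (i = 3; i <= n; ++i)
            w[f[i-2], f[i-1], f[i]]++
        next
    }

    # Second pass: print the untouched line if any of its runs repeats.
    {
        n = split(norm($0), f)
        for (i = 3; i <= n; ++i)
            if (w[f[i-2], f[i-1], f[i]] > 1) {
                print
                next
            }
    }' "$1" "$1"
}
```

It would be called as `find_dupes file`. Because the normalized text is split into a separate array `f`, `$0` is never modified and the original line can be printed directly.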
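Caveat: the `awk` program needs enough memory to store roughly three times the text of the file, and it may report duplicates for commonly used phrases even when the entries are not really duplicates (for example, "How to Cook" could be part of several book titles).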
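One possible way to cut down on matches caused by common phrases is to require a longer run of words. The sketch below generalizes the counting script to runs of `n` words. The wrapper `count_runs` and the variable `n` are assumptions introduced for this example and are not part of the scripts above.

```shell
# count_runs N FILE (hypothetical helper): print "count words..." for every
# run of N consecutive words that occurs more than once in FILE.
count_runs() {
    awk -v n="$1" '
    {
        # Strip punctuation; awk then re-splits the line into fields.
        gsub(/[[:punct:]]/, "")

        for (i = n; i <= NF; ++i) {
            # Join fields i-n+1 .. i into one space-separated key.
            key = $(i - n + 1)
            for (j = i - n + 2; j <= i; ++j)
                key = key " " $j
            w[key]++
        }
    }
    END {
        # The order of "for (key in w)" is unspecified; pipe to sort if needed.
        for (key in w)
            if (w[key] > 1)
                print w[key], key
    }' "$2"
}
```

For example, `count_runs 4 file | sort -rn` lists repeated four-word runs, most frequent first. The trade-off is that a longer run can never match a shorter title: with `n` set to 4, a three-word entry such as `Joy on Demand` is no longer reported.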