Bash

Search for three consecutive words

  • September 3, 2019

My book list (a txt file) contains duplicates, like this:

The Ideal Team Player
The Ideal Team Player: How to Recognize and Cultivate The Three Essential Virtues
Ideal Team Player: Recognize and Cultivate The Three Essential Virtues

Joy on Demand: The Art of Discovering the Happiness Within
Crucial Conversations Tools for Talking When Stakes Are High
Joy on Demand

Search Inside Yourself: The Unexpected Path to Achieving Success, Happiness
Search Inside Yourself
......
......
......

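I need to find the duplicate books and, after checking them, delete them manually. I searched and found that the existing approaches need a pattern in the lines.

E.g.

Remove duplicate lines based on partial line comparison

Find partially duplicate lines in a file and count how many times each line is repeated?

In my case, it is hard to find a pattern in the lines. However, I did find a pattern in the sequence of words.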

I want to mark lines as duplicates only if they share three consecutive words *(case-insensitive)*.

If you look at these lines, you will see it:

The Ideal Team Player
The Ideal Team Player: How to Recognize and Cultivate The Three Essential Virtues
Ideal Team Player: Recognize and Cultivate The Three Essential Virtues

Ideal Team Player is the run of consecutive words I am looking for.

I would like the output to be something like the following:

3 Ideal Team Player
2 Joy on Demand
2 Search Inside Yourself
......
......
......

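How can I do this?

The following awk program stores a count of how many times each set of three consecutive words occurs (after removing punctuation), and at the end prints the count and the word set wherever the count is greater than 1: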

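{
       # Strip punctuation so that "Player:" and "Player" compare equal
       gsub("[[:punct:]]", "")

       # Count every run of three consecutive words on this line
       for (i = 3; i <= NF; ++i)
               w[$(i-2),$(i-1),$i]++
}
END {
       for (key in w) {
               count = w[key]
               if (count > 1) {
                       # The three words are joined with SUBSEP; turn it into spaces
                       gsub(SUBSEP," ",key)
                       print count, key
               }
       }
}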

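Given the text in your question, this produces (in no particular order, since awk does not guarantee one when looping over an array with "for (key in w)")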

2 Search Inside Yourself
2 Cultivate The Three
2 The Three Essential
2 Joy on Demand
2 Recognize and Cultivate
2 Three Essential Virtues
2 and Cultivate The
2 The Ideal Team
3 Ideal Team Player

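As you can see, this may not be all that useful.

Instead, we can collect the same count information and then make a second pass over the file, printing every line that contains a word triplet whose count is greater than 1: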

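# First pass over the file: count every word triplet
NR == FNR {
       gsub("[[:punct:]]", "")

       for (i = 3; i <= NF; ++i)
               w[$(i-2),$(i-1),$i]++

       next
}

# Second pass: print each line that contains a repeated triplet
{
       orig = $0
       gsub("[[:punct:]]", "")

       for (i = 3; i <= NF; ++i)
               if (w[$(i-2),$(i-1),$i] > 1) {
                       print orig
                       next
               }
}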

Testing on your file:

$ cat file
The Ideal Team Player
The Ideal Team Player: How to Recognize and Cultivate The Three Essential Virtues
Ideal Team Player: Recognize and Cultivate The Three Essential Virtues

Joy on Demand: The Art of Discovering the Happiness Within
Crucial Conversations Tools for Talking When Stakes Are High
Joy on Demand

Search Inside Yourself: The Unexpected Path to Achieving Success, Happiness
Search Inside Yourself
$ awk -f script.awk file file
The Ideal Team Player
The Ideal Team Player: How to Recognize and Cultivate The Three Essential Virtues
Ideal Team Player: Recognize and Cultivate The Three Essential Virtues
Joy on Demand: The Art of Discovering the Happiness Within
Joy on Demand
Search Inside Yourself: The Unexpected Path to Achieving Success, Happiness
Search Inside Yourself
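
The question asks for a case-insensitive comparison, while the script above compares words exactly as written. A minimal sketch of one way to relax that is to lower-case a copy of each line before building the triplets, leaving the line itself untouched so it can be printed in its original form. The function name dup_titles and the file name books.txt are made up for illustration and are not part of the original answer.

```shell
# Hypothetical wrapper: prints every line of the given file that shares
# three consecutive words with some line, ignoring case and punctuation.
dup_titles() {
    awk '
    # First pass: count triplets on a lower-cased, punctuation-free copy
    NR == FNR {
        line = tolower($0)
        gsub(/[[:punct:]]/, "", line)
        n = split(line, f)
        for (i = 3; i <= n; ++i)
            w[f[i-2], f[i-1], f[i]]++
        next
    }
    # Second pass: print the untouched line if any triplet is repeated
    {
        line = tolower($0)
        gsub(/[[:punct:]]/, "", line)
        n = split(line, f)
        for (i = 3; i <= n; ++i)
            if (w[f[i-2], f[i-1], f[i]] > 1) {
                print
                next
            }
    }' "$1" "$1"
}
```

It would be called as dup_titles books.txt. The file is passed to awk twice inside the function, just as in the test run above.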

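Caveat: This awk program needs enough memory to store about three times the text of the file, and it may find duplication in common phrases even if the entries are not actually duplicates (e.g. "how to cook" may be part of the titles of several books).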
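
Since false positives of this kind have to be weeded out by hand, it may help to see which triplet caused each line to be flagged. The following is only a sketch under that assumption: it prints the count, the shared triplet, and the original line, separated by tabs and sorted by triplet so that candidate duplicates end up next to each other. The function name dup_titles_by_trigram and the file name books.txt are hypothetical. A line that shares several triplets with other lines is printed once per triplet.

```shell
# Hypothetical helper: lists "count <TAB> triplet <TAB> original line"
# for every repeated triplet, grouped by triplet.
dup_titles_by_trigram() {
    awk '
    # First pass: count every triplet of consecutive words
    NR == FNR {
        gsub(/[[:punct:]]/, "")
        for (i = 3; i <= NF; ++i)
            w[$(i-2), $(i-1), $i]++
        next
    }
    # Second pass: report each repeated triplet next to its line
    {
        orig = $0
        gsub(/[[:punct:]]/, "")
        for (i = 3; i <= NF; ++i) {
            key = $(i-2) SUBSEP $(i-1) SUBSEP $i
            if (w[key] > 1)
                printf "%d\t%s %s %s\t%s\n", w[key], $(i-2), $(i-1), $i, orig
        }
    }' "$1" "$1" | LC_ALL=C sort -t "$(printf '\t')" -k2,2 -k3,3
}
```

Called as dup_titles_by_trigram books.txt, a group whose triplet is something generic like "how to cook" can then be skipped at a glance.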

Source: https://unix.stackexchange.com/questions/537800