Compare the similarity or Levenshtein distance between every pair of lines in a file?

July 25, 2017

I would like to find the most similar pair of lines in a file, using something like Levenshtein distance. For example, given a file:
What is your favorite color?
What is your favorite food?
Who was the 8th president?
Who was the 9th president?
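…it would return lines 3 and 4 as the most similar pair of lines.
Ideally, I would like to be able to compute the top X most similar pairs. So, using the example above, the second most similar pair would be lines 1 and 2.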

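I'm not familiar with Levenshtein distance, but Perl has a module for computing it, so I wrote a simple Perl script that computes the distance for every pair of lines in the input and then prints the pairs in order of increasing "distance", limited by an optional "top X" parameter (-n):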

#!/usr/bin/perl -w
use strict;
use Text::Levenshtein qw(distance);
use Getopt::Std;

our $opt_n;
getopts('n:');
$opt_n ||= -1; # print all the matches if -n is not provided

my @lines = <>;      # slurp every input line (trailing newlines are kept)
my %distances = ();  # distance => list of line pairs at that distance

# for each combination of two lines, compute distance
for (my $i = 0; $i <= $#lines - 1; $i++) {
    for (my $j = $i + 1; $j <= $#lines; $j++) {
        my $d = distance($lines[$i], $lines[$j]);
        push @{ $distances{$d} }, $lines[$i] . $lines[$j];
    }
}

# print in order of increasing distance
foreach my $d (sort { $a <=> $b } keys %distances) {
    print "At distance $d:\n" . join("\n", @{ $distances{$d} }) . "\n";
    last unless --$opt_n;  # stop after -n distance groups
}
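
As a rough illustration of what a Levenshtein distance function computes, here is a minimal sketch in Python rather than Perl. It is not the Text::Levenshtein implementation; it is the usual two-row dynamic-programming approach, and the function name levenshtein is only chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the rows short
    # previous[j] is the distance from the empty prefix of a to b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match for free
            ))
        previous = current
    return previous[-1]


print(levenshtein("Who was the 8th president?", "Who was the 9th president?"))
print(levenshtein("What is your favorite color?", "What is your favorite food?"))
```

The two calls print 1 and 3, the same distances the Perl script reports for those pairs.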

On the sample input, the Perl script gives:

$ ./solve.pl < input
At distance 1:
Who was the 8th president?
Who was the 9th president?

At distance 3:
What is your favorite color?
What is your favorite food?

At distance 21:
What is your favorite color?
Who was the 8th president?
What is your favorite color?
Who was the 9th president?
What is your favorite food?
Who was the 8th president?
What is your favorite food?
Who was the 9th president?

And with the optional parameter:

$ ./solve.pl -n 2 < input
At distance 1:
Who was the 8th president?
Who was the 9th president?

At distance 3:
What is your favorite color?
What is your favorite food?

I wasn't sure exactly how the output should be printed, but the strings can be printed in whatever form is needed.
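
If line numbers are wanted instead of the line text, one possible shape is sketched below in Python. This is an assumption about the desired output, not part of the Perl script. To stay within the standard library it swaps Levenshtein distance for the similarity ratio from difflib.SequenceMatcher (1.0 means identical), and the names most_similar_pairs and top are made up for this example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def most_similar_pairs(lines, top=None):
    """Return (ratio, line_no_a, line_no_b) tuples, most similar first.

    Line numbers are 1-based. `top` limits the number of pairs returned;
    None returns every pair.
    """
    scored = []
    for (i, a), (j, b) in combinations(enumerate(lines, start=1), 2):
        ratio = SequenceMatcher(None, a, b).ratio()  # similarity, not distance
        scored.append((ratio, i, j))
    # highest ratio first; ties fall back to line-number order
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return scored if top is None else scored[:top]


sample = [
    "What is your favorite color?",
    "What is your favorite food?",
    "Who was the 8th president?",
    "Who was the 9th president?",
]

for ratio, i, j in most_similar_pairs(sample, top=2):
    print(f"lines {i} and {j}: ratio {ratio:.2f}")
```

On the sample input this lists lines 3 and 4 first, then lines 1 and 2. Like the Perl script, it compares every pair, so the work grows quadratically with the number of lines.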

Source: https://unix.stackexchange.com/questions/381552