Text-Processing
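Compare similarity or Levenshtein distance between every pair of lines in a file?
I'd like to find the most similar pair of lines in a file, using something like Levenshtein distance. For example, given a file: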
    What is your favorite color?
    What is your favorite food?
    Who was the 8th president?
    Who was the 9th president?
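…it would return lines 3 and 4 as the most similar pair.
Ideally, I'd like to be able to compute the top X most similar pairs. So, using the example above, the second most similar pair would be lines 1 and 2.
I'm not familiar with Levenshtein distance, but Perl has a module for computing it, so I wrote a simple Perl script that computes the distance for each combination of two lines in the input, then prints them in order of increasing distance, subject to a "top X" (N) parameter: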
    #!/usr/bin/perl -w

    use strict;
    use Text::Levenshtein qw(distance);
    use Getopt::Std;

    our $opt_n;
    getopts('n:');
    $opt_n ||= -1; # print all the matches if -n is not provided

    my @lines=<>;
    my %distances = ();

    # for each combination of two lines, compute distance
    foreach(my $i=0; $i <= $#lines - 1; $i++) {
      foreach(my $j=$i + 1; $j <= $#lines; $j++) {
        my $d = distance($lines[$i], $lines[$j]);
        push @{ $distances{$d} }, $lines[$i] . $lines[$j];
      }
    }

    # print in order of increasing distance
    foreach my $d (sort { $a <=> $b } keys %distances) {
      print "At distance $d:\n" . join("\n", @{ $distances{$d} }) . "\n";
      last unless --$opt_n;
    }
On the sample input, it gives:
    $ ./solve.pl < input
    At distance 1:
    Who was the 8th president?
    Who was the 9th president?

    At distance 3:
    What is your favorite color?
    What is your favorite food?

    At distance 21:
    What is your favorite color?
    Who was the 8th president?

    What is your favorite color?
    Who was the 9th president?

    What is your favorite food?
    Who was the 8th president?

    What is your favorite food?
    Who was the 9th president?
And showing the optional parameter:
    $ ./solve.pl -n 2 < input
    At distance 1:
    Who was the 8th president?
    Who was the 9th president?

    At distance 3:
    What is your favorite color?
    What is your favorite food?
I wasn't sure how to print the output unambiguously, but the strings can be printed however you like.
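One possible way to make the output unambiguous is to report line numbers alongside each pair, and to count individual pairs rather than distance groups. The following is only a rough sketch under a few assumptions: `levenshtein` and `closest_pairs` are hypothetical helper names of my own (they are not part of Text::Levenshtein), the distance is the textbook dynamic-programming version written out by hand so the script needs no CPAN module, and the sample lines are hard-coded instead of being read from standard input.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical helper: textbook dynamic-programming edit distance,
# keeping only the previous and current rows of the table.
sub levenshtein {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my @prev = (0 .. scalar @t);        # distance from "" to each prefix of $t
    for my $i (1 .. @s) {
        my @cur = ($i);                 # distance from a prefix of $s to ""
        for my $j (1 .. @t) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,          # deletion
                $cur[$j - 1] + 1,       # insertion
                $prev[$j - 1] + $cost,  # substitution or match
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Hypothetical helper: return up to $n triples [distance, i, j]
# (1-based line numbers), closest pairs first; all pairs if $n is undef.
sub closest_pairs {
    my ($lines, $n) = @_;
    my @pairs;
    for my $i (0 .. $#$lines - 1) {
        for my $j ($i + 1 .. $#$lines) {
            push @pairs,
                [ levenshtein($lines->[$i], $lines->[$j]), $i + 1, $j + 1 ];
        }
    }
    # Sort by distance, breaking ties by line numbers for a stable order.
    @pairs = sort {
        $a->[0] <=> $b->[0] or $a->[1] <=> $b->[1] or $a->[2] <=> $b->[2]
    } @pairs;
    splice @pairs, $n if defined $n && $n < @pairs;
    return @pairs;
}

# Demo on the sample lines from the question (hard-coded for illustration).
my @sample = (
    'What is your favorite color?',
    'What is your favorite food?',
    'Who was the 8th president?',
    'Who was the 9th president?',
);
for my $p (closest_pairs(\@sample, 2)) {
    my ($d, $i, $j) = @$p;
    print "distance $d: lines $i and $j\n";
}
```

On the sample input, the two closest pairs this reports are lines 3 and 4 at distance 1, then lines 1 and 2 at distance 3, which agrees with the output above. Unlike `-n` in the original script, the second argument here limits the number of pairs, not the number of distinct distances. Every pair is still compared, so the work grows quadratically with the number of lines.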