How to find the N most common words in a file, and how to handle hyphenation?

October 22, 2019

Suppose we have a file containing the following text:
hello hel-
lo world wor-
ld test test he-
lo words words
If we only use whitespace as the delimiter, we get:
hello: 1
world: 1
wor-: 1
ld: 1
he-: 1
hel-: 1
test: 2
lo: 2
words: 2
In other words, how do we handle words that are split across two lines by a hyphen, and count each one as a single word?

This should do it:

sed ':1;/-$/{N;b1};s/-\n//g;y/ /\n/' file | sort | uniq -c
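The sed script is dense, so here is a rough Python sketch of what each of its commands appears to do. The function name `words_per_line` and the sample string are made up for illustration, and this is only an approximation of sed's behavior, not a replacement for it.

```python
def words_per_line(text):
    """Roughly mimic the sed script: join hyphen-split lines, then emit one word per entry."""
    lines = text.splitlines()
    out = []
    i = 0
    while i < len(lines):
        buf = lines[i]
        # :1;/-$/{N;b1}  -> while the buffer ends with '-', append the next line
        while buf.endswith("-") and i + 1 < len(lines):
            i += 1
            buf += "\n" + lines[i]
        # s/-\n//g       -> delete every hyphen-plus-newline pair
        buf = buf.replace("-\n", "")
        # y/ /\n/        -> turn each space into a line break (one word per line)
        out.extend(buf.split(" "))
        i += 1
    return out


# The sample file from the question, as a string
sample = "hello hel-\nlo world wor-\nld test test he-\nlo words words\n"
print("\n".join(words_per_line(sample)))
```

The resulting one-word-per-line stream is what `sort | uniq -c` then groups and counts.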

Perl is handy here: the -0777 switch slurps the whole file into a single string.
perl -0777 -ne '
  s/-\n//g;                  # join the hyphenated words
  $count{$_}++ for split;    # count all the words
  while (($k,$v) = each %count) {print "$k:$v\n"}
' file
world:2
helo:1
hello:2
words:2
test:2
The output is in no particular order.
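Since the question asks for the N most common words, the counts also need to be ranked. A minimal Python sketch of the same join-then-count idea, using `collections.Counter` for the ranking, might look like this. The function name `top_words` and the inline sample string are assumptions for illustration; for a real file you would pass in its contents instead.

```python
from collections import Counter


def top_words(text, n):
    """Return the n most frequent words as (word, count) pairs."""
    # Join words that were hyphenated across a line break
    joined = text.replace("-\n", "")
    # Split on any whitespace and tally each word
    counts = Counter(joined.split())
    # most_common sorts by count, highest first
    return counts.most_common(n)


# The sample file from the question, as a string
sample = "hello hel-\nlo world wor-\nld test test he-\nlo words words\n"
for word, count in top_words(sample, 3):
    print(f"{word}:{count}")
```

Like the Perl version, this treats every hyphen at the end of a line as a continuation, so a word that genuinely ends in a hyphen there would be merged with the next line.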
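Here's a more obscure one: Tcl. tclsh doesn't have a nice -e option like other languages, so a one-liner takes more work. This approach has the benefit of preserving the order in which the words appear in the file.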
echo '
   set fh [open [lindex $argv 1] r]
   set data [read -nonewline $fh]
   close $fh
   foreach word [split [string map {"-\n" ""} $data]] {
       dict incr count $word
   }
   dict for {k v} $count {puts "$k:$v"}
' | tclsh -- file
hello:2
world:2
test:2
helo:1
words:2

Source: https://unix.stackexchange.com/questions/548014
