Shell
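How can I find keywords that frequently occur together across multiple files?
I want to find keywords that are frequently associated with each other.
Example
A directory contains markdown files, each of which has some keywords on its last line: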
    $ tail -n 1 file1.md
    #doctor #donkey #plants
    $ tail -n 1 file2.md
    #doctor #firework #university
    $ tail -n 1 file3.md
    #doctor #donkey #linux #plants
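Pseudo-output
- 100% of the files containing the keyword "#donkey" also contain the keyword "#doctor".
- 50% of the files containing the keyword "#plants" also contain the keyword "#linux".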
- …
A shell script, an awk script, or just an explanation of how to achieve this would be enough!
Any help would be greatly appreciated. Thanks a lot.
Using GNU awk for arrays of arrays:
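If the keywords were on the first line of each file rather than the last, also use GNU awk's nextfile for efficiency:

    $ cat tst.awk
    FNR == 1 {
        # words[w]: number of files containing keyword w
        # pairs[a][b]: number of files containing both a and b
        for ( i=1; i<=NF; i++ ) {
            words[$i]++
            for ( j=i+1; j<=NF; j++ ) {
                pairs[$i][$j]++
                pairs[$j][$i]++
            }
        }
        nextfile    # skip the rest of the current file
    }
    END {
        for ( word1 in pairs ) {
            for ( word2 in pairs[word1] ) {
                pct = pairs[word1][word2] * 100 / words[word1]
                printf "%d%% of the files containing the keyword \"%s\" also contain the keyword \"%s\".\n", pct, word1, word2
            }
        }
    }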
    $ awk -f tst.awk file*.md
    100% of the files containing the keyword "#university" also contain the keyword "#doctor".
    100% of the files containing the keyword "#university" also contain the keyword "#firework".
    100% of the files containing the keyword "#plants" also contain the keyword "#donkey".
    50% of the files containing the keyword "#plants" also contain the keyword "#linux".
    100% of the files containing the keyword "#plants" also contain the keyword "#doctor".
    100% of the files containing the keyword "#donkey" also contain the keyword "#plants".
    50% of the files containing the keyword "#donkey" also contain the keyword "#linux".
    100% of the files containing the keyword "#donkey" also contain the keyword "#doctor".
    100% of the files containing the keyword "#linux" also contain the keyword "#plants".
    100% of the files containing the keyword "#linux" also contain the keyword "#donkey".
    100% of the files containing the keyword "#linux" also contain the keyword "#doctor".
    33% of the files containing the keyword "#doctor" also contain the keyword "#university".
    66% of the files containing the keyword "#doctor" also contain the keyword "#plants".
    66% of the files containing the keyword "#doctor" also contain the keyword "#donkey".
    33% of the files containing the keyword "#doctor" also contain the keyword "#linux".
    33% of the files containing the keyword "#doctor" also contain the keyword "#firework".
    100% of the files containing the keyword "#firework" also contain the keyword "#university".
    100% of the files containing the keyword "#firework" also contain the keyword "#doctor".
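To sanity-check a single figure from output like this, one pair can be counted independently. The following is only a sketch: the function name `cooccur` and the scratch directory are made up for illustration, and it assumes, as in the question, that the keywords sit on the last line of each file. The percentage is truncated with `%d`, the same way the script above does it.

```shell
#!/bin/sh
# Recreate the three sample files from the question in a scratch directory.
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
printf '%s\n' '#doctor #donkey #plants'        > "$dir/file1.md"
printf '%s\n' '#doctor #firework #university'  > "$dir/file2.md"
printf '%s\n' '#doctor #donkey #linux #plants' > "$dir/file3.md"

# cooccur A B FILE...  (hypothetical helper)
# Prints the truncated percentage of files whose last line contains
# keyword A that also contain keyword B on that line.
cooccur() {
    a=$1 b=$2
    shift 2
    for f in "$@"; do tail -n 1 "$f"; done |
    awk -v a="$a" -v b="$b" '
        {
            hasA = hasB = 0
            for (i = 1; i <= NF; i++) {
                if ($i == a) hasA = 1
                if ($i == b) hasB = 1
            }
            if (hasA) { total++; if (hasB) both++ }
        }
        END {
            if (total) printf "%d\n", both * 100 / total
            else       print "n/a"
        }
    '
}

cooccur '#plants' '#linux'  "$dir"/file*.md   # 1 of 2 files -> 50
cooccur '#doctor' '#donkey' "$dir"/file*.md   # 2 of 3 files -> 66
```

Because `%d` truncates, 2 out of 3 is reported as 66 rather than 67; use `%.0f` instead if rounding is preferred.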
Or, if the keywords are on the last line, again relying on gawk, this time for ENDFILE:

    $ cat tst.awk
    ENDFILE {
        for ( i=1; i<=NF; i++ ) {
            words[$i]++
            for ( j=i+1; j<=NF; j++ ) {
                pairs[$i][$j]++
                pairs[$j][$i]++
            }
        }
    }
    END {
        for ( word1 in pairs ) {
            for ( word2 in pairs[word1] ) {
                pct = pairs[word1][word2] * 100 / words[word1]
                printf "%d%% of the files containing the keyword \"%s\" also contain the keyword \"%s\".\n", pct, word1, word2
            }
        }
    }
    $ awk -f tst.awk file*.md
Or, still with the keywords on the last line, but more efficiently using tail+gawk:
    $ cat tst.awk
    {
        for ( i=1; i<=NF; i++ ) {
            words[$i]++
            for ( j=i+1; j<=NF; j++ ) {
                pairs[$i][$j]++
                pairs[$j][$i]++
            }
        }
    }
    END {
        for ( word1 in pairs ) {
            for ( word2 in pairs[word1] ) {
                pct = pairs[word1][word2] * 100 / words[word1]
                printf "%d%% of the files containing the keyword \"%s\" also contain the keyword \"%s\".\n", pct, word1, word2
            }
        }
    }
    $ tail -qn1 file*.md | awk -f tst.awk
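Since the tail-based variant needs nothing from gawk except arrays of arrays, it could probably be adapted to any POSIX awk. The sketch below is not part of the answer above: it swaps the nested arrays for composite keys joined with `SUBSEP`, and the wrapper name `pair_stats` is made up for illustration. It assumes that no keyword contains the `SUBSEP` character, which holds for ordinary text.

```shell
#!/bin/sh
# pair_stats (hypothetical helper): reads one line of keywords per file
# on stdin and prints the co-occurrence percentages for every ordered pair.
pair_stats() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                words[$i]++                 # files containing this keyword
                for (j = i + 1; j <= NF; j++) {
                    pairs[$i, $j]++         # key is $i SUBSEP $j
                    pairs[$j, $i]++
                }
            }
        }
        END {
            for (key in pairs) {
                split(key, w, SUBSEP)       # w[1] = first word, w[2] = second
                pct = pairs[key] * 100 / words[w[1]]
                printf "%d%% of the files containing the keyword \"%s\" also contain the keyword \"%s\".\n", pct, w[1], w[2]
            }
        }
    '
}

# Demo on the three keyword lines from the question; the order of a
# "for (key in array)" loop is unspecified, so sort for stable output.
printf '%s\n' \
    '#doctor #donkey #plants' \
    '#doctor #firework #university' \
    '#doctor #donkey #linux #plants' |
pair_stats | sort
```

With real files the input would come from `tail -qn1 file*.md | pair_stats`, as in the command above; note that `-q` is supported by GNU and BSD tail but is not required by POSIX.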