Shell

How to find keywords that frequently appear together across multiple files?

  • May 29, 2022

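I want to find keywords that are frequently associated with each other.

Example

A directory contains Markdown files, and the last line of each file holds some keywords: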

$ tail -n 1 file1.md
#doctor #donkey #plants

$ tail -n 1 file2.md
#doctor #firework #university

$ tail -n 1 file3.md
#doctor #donkey #linux #plants

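Pseudo output

  • 100% of the files containing the keyword "#donkey" also contain the keyword "#doctor".
  • 50% of the files containing the keyword "#plants" also contain the keyword "#linux".

A shell script, an awk script, or just an explanation of how to achieve this would be enough!

Any help would be greatly appreciated. Thanks a lot.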

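All of the scripts below use GNU awk for arrays of arrays.

If the keywords were on the first line of each file rather than the last, then, also using nextfile (supported by GNU awk) to skip the rest of each file for efficiency: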

$ cat tst.awk
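FNR == 1 {                          # keywords are on the first line of each file
   for ( i=1; i<=NF; i++ ) {
       words[$i]++                  # number of files containing this keyword
       for ( j=i+1; j<=NF; j++ ) {
           pairs[$i][$j]++          # count the co-occurrence in both directions
           pairs[$j][$i]++
       }
   }
   nextfile                         # skip the rest of the current file
}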
END {
   for ( word1 in pairs ) {
       for ( word2 in pairs[word1] ) {
           pct = pairs[word1][word2] * 100 / words[word1]
           printf "%d%% of the files containing the keyword \"%s\" also contain the keyword \"%s\".\n", pct, word1, word2
       }
   }
}
$ awk -f tst.awk file*.md
100% of the files containing the keyword "#university" also contain the keyword "#doctor".
100% of the files containing the keyword "#university" also contain the keyword "#firework".
100% of the files containing the keyword "#plants" also contain the keyword "#donkey".
50% of the files containing the keyword "#plants" also contain the keyword "#linux".
100% of the files containing the keyword "#plants" also contain the keyword "#doctor".
100% of the files containing the keyword "#donkey" also contain the keyword "#plants".
50% of the files containing the keyword "#donkey" also contain the keyword "#linux".
100% of the files containing the keyword "#donkey" also contain the keyword "#doctor".
100% of the files containing the keyword "#linux" also contain the keyword "#plants".
100% of the files containing the keyword "#linux" also contain the keyword "#donkey".
100% of the files containing the keyword "#linux" also contain the keyword "#doctor".
33% of the files containing the keyword "#doctor" also contain the keyword "#university".
66% of the files containing the keyword "#doctor" also contain the keyword "#plants".
66% of the files containing the keyword "#doctor" also contain the keyword "#donkey".
33% of the files containing the keyword "#doctor" also contain the keyword "#linux".
33% of the files containing the keyword "#doctor" also contain the keyword "#firework".
100% of the files containing the keyword "#firework" also contain the keyword "#university".
100% of the files containing the keyword "#firework" also contain the keyword "#doctor".
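
As a sanity check, one of the figures above can be recomputed by hand with grep. This is only a minimal sketch: it assumes the three sample files from the question are recreated in a scratch directory (the dir variable and the temporary paths are illustrative, not part of the original), that each tag appears only on the keyword line, and that no other tag begins with the same text.

```shell
# Recreate the three sample files in a scratch directory (hypothetical paths).
dir=$(mktemp -d)
printf '%s\n' '#doctor #donkey #plants'        > "$dir/file1.md"
printf '%s\n' '#doctor #firework #university'  > "$dir/file2.md"
printf '%s\n' '#doctor #donkey #linux #plants' > "$dir/file3.md"

# Number of files that contain "#plants" (grep -l lists matching file names).
with_plants=$(grep -l -- '#plants' "$dir"/file*.md | wc -l)

# Of those files, the number that also contain "#linux".
with_both=$(grep -l -- '#plants' "$dir"/file*.md | xargs grep -l -- '#linux' | wc -l)

# Integer percentage, truncated the same way as awk's %d.
pct=$(( with_both * 100 / with_plants ))
echo "$pct% of the files containing the keyword \"#plants\" also contain the keyword \"#linux\"."

rm -r "$dir"
```

Two of the three files contain "#plants" and one of those two also contains "#linux", which gives the same 50% that the awk script reports.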

Or, with the keywords on the last line, again relying on gawk, this time for ENDFILE:

$ cat tst.awk
ENDFILE {
   for ( i=1; i<=NF; i++ ) {
       words[$i]++
       for ( j=i+1; j<=NF; j++ ) {
           pairs[$i][$j]++
           pairs[$j][$i]++
       }
   }
}
END {
   for ( word1 in pairs ) {
       for ( word2 in pairs[word1] ) {
           pct = pairs[word1][word2] * 100 / words[word1]
           printf "%d%% of the files containing the keyword \"%s\" also contain the keyword \"%s\".\n", pct, word1, word2
       }
   }
}

$ awk -f tst.awk file*.md

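Or, still with the keywords on the last line, but more efficiently using tail+gawk, so that awk only has to process one line per file: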

$ cat tst.awk
{
   for ( i=1; i<=NF; i++ ) {
       words[$i]++
       for ( j=i+1; j<=NF; j++ ) {
           pairs[$i][$j]++
           pairs[$j][$i]++
       }
   }
}
END {
   for ( word1 in pairs ) {
       for ( word2 in pairs[word1] ) {
           pct = pairs[word1][word2] * 100 / words[word1]
           printf "%d%% of the files containing the keyword \"%s\" also contain the keyword \"%s\".\n", pct, word1, word2
       }
   }
}

$ tail -qn1 file*.md | awk -f tst.awk
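
If GNU awk is not available, the same counting can be sketched with the one-dimensional arrays that any POSIX awk provides, by joining the two keywords into a single key with SUBSEP and splitting it again in the END block. This is a hedged variant rather than the script above: the cooccur function name, the scratch directory and the final sort are assumptions added for illustration, and a plain loop replaces tail -q.

```shell
# Sample files in a scratch directory (hypothetical paths); keywords on the last line.
dir=$(mktemp -d)
printf '%s\n' 'Some notes' '#doctor #donkey #plants'        > "$dir/file1.md"
printf '%s\n' 'Some notes' '#doctor #firework #university'  > "$dir/file2.md"
printf '%s\n' 'Some notes' '#doctor #donkey #linux #plants' > "$dir/file3.md"

# cooccur FILE...: print co-occurrence percentages for the last line of each file.
cooccur() {
    for f in "$@"; do
        tail -n 1 "$f"                      # one keyword line per file
    done | awk '
    {
        for (i = 1; i <= NF; i++) {
            words[$i]++                     # files containing this keyword
            for (j = i + 1; j <= NF; j++) {
                pairs[$i, $j]++             # key is $i SUBSEP $j
                pairs[$j, $i]++
            }
        }
    }
    END {
        for (key in pairs) {
            split(key, w, SUBSEP)           # w[1] = first keyword, w[2] = second
            pct = pairs[key] * 100 / words[w[1]]
            printf "%d%% of the files containing the keyword \"%s\" also contain the keyword \"%s\".\n", pct, w[1], w[2]
        }
    }' | sort                               # for-in order is unspecified, so sort
}

out=$(cooccur "$dir"/file*.md)
printf '%s\n' "$out"
rm -r "$dir"
```

The trade-off is that the END block walks one flat array instead of nested ones, so lines for the same first keyword are only grouped together because of the sort.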

Quoted from: https://unix.stackexchange.com/questions/703977