Pdf

Can we search a PDF file for pages that contain several words, in no particular order?

  • April 20, 2019

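I would like to search a PDF file for all the pages that each contain several given words, in no particular order. For example, I want to find all the pages that contain both "hello" and "world", in any order.

I am not sure whether pdfgrep can do this.

I am trying to do something similar to how we can search for several words in a book displayed in Google Books.

Thanks.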

Yes, if you use the -P option (which makes it use the PCRE engine and perl-like regular expressions), you can do that with zero-width lookahead assertions.

$ pdfgrep -Pn '(?=.*process)(?=.*preparation)' ~/Str-Cmp.pdf
8:•     If a preparation process is used, the method used shall be declared.
10:Standard, preparation may be an important part of the ordering process. See Annex C for some examples of
38:padding. The preparation processing could move the original numerals (in order of occurrence) to the very
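
To see why the lookaheads make the word order irrelevant, the same pattern can be tried on plain text. This is only a sketch: it assumes GNU grep built with PCRE support (the -P option), and the sample lines and the both_words function name are made up for illustration, not taken from the PDF above.

```shell
# Hypothetical sample text, not taken from Str-Cmp.pdf.
sample='a preparation process is used
the process starts here
preparation only
process after preparation'

# Each (?=.*word) is a zero-width lookahead tried from the same position,
# so both words have to be present on the line, in either order.
both_words() {
    grep -P '(?=.*process)(?=.*preparation)'
}

# Prints only the lines that contain both words.
printf '%s\n' "$sample" | both_words
```

Because the lookaheads consume no text, adding a third word is just a matter of appending another (?=.*word) group.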

The above only works when both words are on the same line; if the words may appear on different lines of the same page, you can do the following:

$ pdfgrep -Pn '^(?s:(?=.*process)(?=.*preparation))' ~/Str-Cmp.pdf
8:ISO/IEC 14651:2007(E)
9:                                                                                                  ISO/IEC 14651:2007(E)
10:ISO/IEC 14651:2007(E)
12:ISO/IEC 14651:2007(E)
...

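The s flag in (?s: means that . will also match newlines. Notice that this only prints the first line of each page; you can adjust that with the -A option: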

$ pdfgrep -A4 -Pn '^(?s:(?=.*process)(?=.*preparation))' ~/Str-Cmp.pdf
8:ISO/IEC 14651:2007(E)
8-•     Any specific internal format for intermediate keys used when comparing, nor for the table used. The use of
8-      numeric keys is not mandated either.
8-•     A context-dependent ordering.
8-•     Any particular preparation of character strings prior to comparison.
--
9:                                                                                                  ISO/IEC 14651:2007(E)
...
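
The effect of the s flag can be checked without a PDF. The sketch below assumes GNU grep with PCRE support and uses its -z option so that the whole input is treated as one record, standing in for a page; the page text and variable names are hypothetical, and the ^ anchor from the pdfgrep command is left out to keep the demonstration minimal.

```shell
# Hypothetical "page" whose two words sit on different lines.
page='the process starts here
and the preparation comes later'

# Without the s flag, . stops at the newline, so no position sees both words.
if printf '%s' "$page" | grep -Pzq '(?=.*process)(?=.*preparation)'; then
    line_mode=match
else
    line_mode=no-match
fi

# With (?s: ... ), . also crosses newlines, so the lookaheads scan the whole page.
if printf '%s' "$page" | grep -Pzq '(?s:(?=.*process)(?=.*preparation))'; then
    page_mode=match
else
    page_mode=no-match
fi

echo "without s: $line_mode, with s: $page_mode"
```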

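A crude wrapper script that prints the lines matching any of the patterns, taken from the pages that match all of the patterns, in any order: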

usage: **pdfgrepa** [options] files ... -- patterns ...

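#! /bin/sh
# r1: lookaheads that a page must satisfy (all patterns);
# r2: alternation used to select the lines to print (any pattern).
r1= r2=
for a; do
       if [ "$r2" ]; then
               # after the "--" argument: every argument is a pattern
               r1="$r1(?=.*$a)"; r2="$r2|$a"
       else
               case $a in
               # "--" ends the options and files; r2 starts with a pattern
               # that keeps the "--" group separator lines in the output
               --)     r2='(?=^--$)';;
               # before "--": re-append the option or file name for pdfgrep
               *)      set -- "$@" "$a";;
               esac
       fi
       # drop the original copy of the argument just handled
       shift
done
# print up to 10000 lines after each match, then keep the lines matching any pattern
pdfgrep -A10000 -Pn "(?s:$r1)" "$@" | grep -P --color "$r2"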

$ pdfgrepa ~/Str-Cmp.pdf -i -- obtains process preparation
37- the strings after **preparation** are identical, and the end result (as the user would normally see it) could be
37- collation **process** applying the same rules. This kind of indeterminacy is undesirable.
37-one **obtains** after this **preparation** the following strings:
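
To make the wrapper easier to follow, the sketch below isolates the step that builds the two regular expressions from the words given after --. The build_patterns function is a hypothetical helper written for illustration; it mirrors the r1 and r2 variables of the script but does not call pdfgrep.

```shell
# Build the two patterns the way the wrapper does, from the words passed as arguments.
build_patterns() {
    r1=
    r2='(?=^--$)'                 # keeps the "--" separator lines, as in the script
    for word; do
        r1="$r1(?=.*$word)"       # page filter: every word must be present
        r2="$r2|$word"            # line filter: any one word is enough
    done
    # First line: pattern handed to pdfgrep; second line: pattern handed to grep.
    printf '%s\n%s\n' "(?s:$r1)" "$r2"
}

build_patterns obtains process preparation
```

Since the words are pasted into the patterns unescaped, they are interpreted as regular expressions; words containing characters such as . or ( may need escaping.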

Quoted from: https://unix.stackexchange.com/questions/513500