Split a single large PDF file into n PDF files based on content and rename each split file (in Bash)

June 7, 2018

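I'm working on a way to split a single large PDF file that contains the monthly credit card statements. It was built for printing, but we want to split it into individual files for later use. Each statement has a variable length (2 pages, 3 pages, 4 pages...), so we need to "read" each page, find "Page 1 of X", and cut the chunk at the point where the next "Page 1 of X" appears. In addition, each resulting file must be named with a unique ID, which also appears on the "Page 1 of X" page.
While researching, I found a tool called "PDF Content Split SA" that does exactly what we need. But I'm sure there is a way to do this on Linux (we are moving to OpenSource+Libre).
Thanks for reading. Any help would be greatly appreciated.
Edit
So far I've found this Nautilus script, which does exactly what we need, but I can't get it to work.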
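The grouping rule described above (start a new chunk at every "Page 1 of X") can be sketched in plain shell. This is only an illustration: the function name page_ranges is hypothetical, the exact marker wording "Page 1 of" is an assumption about the statement layout, and the input is assumed to be one marker line per PDF page, in page order.

```shell
# Sketch: turn a per-page list of "Page N of X" markers into page ranges.
# Input (stdin): one line per PDF page, in order, holding that page's marker.
# Output: one "first-last" range per statement.
page_ranges() {
    start=0   # first page of the statement currently being collected
    page=0    # number of the page being examined
    while IFS= read -r line; do
        page=$((page + 1))
        case $line in
            # Assumed marker text; adjust to what the real PDF contains.
            *"Page 1 of "[0-9]*)
                # A new statement starts here, so emit the one that just ended.
                if [ "$start" -gt 0 ]; then
                    printf '%d-%d\n' "$start" "$((page - 1))"
                fi
                start=$page
                ;;
        esac
    done
    # Flush the final statement, which has no following "Page 1 of X".
    if [ "$start" -gt 0 ]; then
        printf '%d-%d\n' "$start" "$page"
    fi
}
```

Each printed range is meant to be usable as a page range for a tool such as pdftk; extracting the marker line from each page is left out of this sketch.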
#!/bin/bash
# NAUTILUS SCRIPT
# automatically splits pdf file to multiple pages based on search criteria while renaming the output files using the search criteria and some of the pdf text.

# read files
IFS=$'\n' read -d '' -r -a filelist < <(printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS"); unset $IFS

# process files
for file in "${filelist[@]}"; do
pagecount=`pdfinfo $file | grep "Pages" | awk '{ print $2 }'`
# MY SEARCH CRITERIA is a 10 digit long ID number that begins with number 8: 
storedid=`pdftotext -f 1 -l 1 $file - | egrep '8?[0-9]{9}'`
pattern=''
pagetitle=''
datestamp=''

for (( pageindex=1; pageindex<=$pagecount; pageindex+=1 )); do

 header=`pdftotext -f $pageindex -l $pageindex $file - | head -n 1`
 pageid=`pdftotext -f $pageindex -l $pageindex $file - | egrep '8?[0-9]{9}'`
 let "datestamp =`date +%s%N`" # to avoid overwriting with same new name

 # match ID found on the page to the stored ID
 if [[ $pageid == $storedid ]]; then
  pattern+="$pageindex " # adds number as text to variable separated by spaces
  pagetitle+="$header+"

  if [[ $pageindex == $pagecount ]]; then #process last output of the file 
   pdftk $file cat $pattern output "$storedid $pagetitle $datestamp.pdf"
   storedid=0
   pattern=''
   pagetitle=''
  fi
 else 
  #process previous set of pages to output
  pdftk $file cat $pattern output "$storedid $pagetitle $datestamp.pdf"
  storedid=$pageid
  pattern="$pageindex "
  pagetitle="$header+"
 fi
done
done
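I have edited the search criteria, and the script is correctly placed in the Nautilus scripts folder, but it doesn't work. I tried debugging it with the activity log in the console and by adding markers to the code. There seems to be a conflict with the value returned by pdfinfo, but I don't know how to resolve it.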
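The post does not say what the pdfinfo conflict actually was. One possibility, offered purely as an assumption, is that the selected path contains spaces, so the unquoted $file is split into several arguments before pdfinfo sees it. A minimal sketch of a more defensive page-count lookup follows; the helper name page_count_from_info is hypothetical.

```shell
# Extract the page count from pdfinfo output supplied on stdin.
# Anchoring on "^Pages:" means only the page-count field can match.
page_count_from_info() {
    awk '/^Pages:/ { print $2; exit }'
}

# Usage sketch (quoting "$file" keeps a path with spaces as one argument):
#   pagecount=$(pdfinfo "$file" | page_count_from_info)
```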

I got it working. At least, it works. But now I want to optimize the process: handling 1000 items in a single massive PDF takes up to 40 minutes.

#!/bin/bash
# NAUTILUS SCRIPT
# automatically splits pdf file to multiple pages based on search criteria while renaming the output files using the search criteria and some of the pdf text.

# read files
IFS=$'\n' read -d '' -r -a filelist < <(printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS")

# process files
for file in "${filelist[@]}"; do
pagecount=$(pdfinfo "$file" | grep "Pages" | awk '{ print $2 }')
# MY SEARCH CRITERIA is a 10 digit long ID number that begins with number 8: 
#storedid=`pdftotext -f 1 -l 1 $file - | egrep '8?[0-9]{9}'`
storedid=$(pdftotext -f 1 -l 1 "$file" - | grep -E 'RESUMEN DE CUENTA Nº ?[0-9]{8}')
pattern=''
pagetitle=''
datestamp=''

#for (( pageindex=1; pageindex <= $pagecount; pageindex+=1 )); do
# one extra iteration past the last page flushes the final set of pages
for (( pageindex=1; pageindex <= $pagecount+1; pageindex+=1 )); do

 header=$(pdftotext -f "$pageindex" -l "$pageindex" "$file" - | head -n 1)
 pageid=$(pdftotext -f "$pageindex" -l "$pageindex" "$file" - | grep -E 'RESUMEN DE CUENTA Nº ?[0-9]{8}')

 echo "$pageid"
 let "datestamp = $(date +%s%N)" # to avoid overwriting with same new name

 # match ID found on the page to the stored ID
 if [[ $pageid == "$storedid" ]]; then
  pattern+="$pageindex " # adds number as text to variable separated by spaces
  pagetitle+="$header+"

  if [[ $pageindex == "$pagecount" ]]; then #process last output of the file 
#   pdftk $file cat $pattern output "$storedid $pagetitle $datestamp.pdf"
   # $pattern is left unquoted on purpose: each page number is a separate argument
   pdftk "$file" cat $pattern output "$storedid.pdf"
   storedid=0
   pattern=''
   pagetitle=''
  fi
 else 
  #process previous set of pages to output
#  pdftk $file cat $pattern output "$storedid $pagetitle $datestamp.pdf"
  pdftk "$file" cat $pattern output "$storedid.pdf"
  storedid=$pageid
  pattern="$pageindex "
  pagetitle="$header+"
 fi
done
done
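
On the 40-minute run time: the script above launches pdftotext twice for every page. One way to cut that down, sketched here under stated assumptions rather than as a tested replacement, is to run pdftotext once on the whole file and split its output on the form feed characters it inserts between pages by default. The function name statement_ranges is hypothetical, and the ID pattern is loosely adapted from the search criteria above (the text "RESUMEN DE CUENTA" followed by a number). Pages that carry no ID are attached to the statement in progress.

```shell
# Read the text of a whole PDF on stdin (pages separated by form feeds)
# and print "ID first-last" for each run of pages belonging to one ID.
statement_ranges() {
    awk -v RS='\f' '
    {
        id = ""
        # Assumed ID layout: "RESUMEN DE CUENTA", optional non-digits, digits.
        if (match($0, /RESUMEN DE CUENTA [^0-9]*[0-9]+/)) {
            id = substr($0, RSTART, RLENGTH)
            sub(/^[^0-9]*/, "", id)   # keep only the digits
        }
        # A new, different ID closes the previous run and opens a new one.
        if (id != "" && id != current) {
            if (current != "") print current, first "-" (NR - 1)
            current = id
            first = NR
        }
        last = NR
    }
    END {
        # Flush the final run, which has no following ID to close it.
        if (current != "") print current, first "-" last
    }'
}

# Usage sketch (one pdftotext call, then one pdftk call per statement):
#   pdftotext "$file" - | statement_ranges |
#   while read -r id range; do
#       pdftk "$file" cat "$range" output "$id.pdf"
#   done
```

This still runs pdftk once per statement, so it only removes the per-page text extraction; how much time that saves on a real file would need to be measured.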

Would a bit of quick Python be an option? The PyPDF2 package would let you do exactly what you're asking.

Source: https://unix.stackexchange.com/questions/448212
