Split a text file into lines with a fixed number of words

September 5, 2015

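Related, but with no satisfactory answer: How can I split a large text file into chunks of 500 words or so?
I am trying to take a text file (http://mattmahoney.net/dc/text8.zip) that has more than 10^7 words all on one line and split it into lines of N words each. My current approach works, but it is fairly slow and ugly (a shell script):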
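i=0
# Split on whitespace so the loop sees one word at a time
for word in $(sed -e 's/\s\+/\n/g' input.txt)
do
   echo -n "${word} "
   let "i=i+1"

   # After 1000 words, end the current line and reset the counter
   if [ "$i" -eq "1000" ]
   then
       echo
       let "i=0"
   fi
done > output.txt   # redirect once; a ">" inside the loop would truncate the file on every write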
Any tips on how to make this faster or more compact?

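Use xargs (17 seconds):
xargs -n1000 <file >output
This relies on the -n flag of xargs, which sets the maximum number of arguments per command line. Change 1000 to 500, or whatever limit you want.
I made a test file with 10^7 words: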
$ wc -w file
10000000 file
Here are the timing statistics:
$ time xargs -n1000 <file >output
real    0m16.677s
user    0m1.084s
sys     0m0.744s
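
To see what -n does on a small input, here is a minimal sketch. The helper name wrap_words is hypothetical and not part of the answer. Because no command is given, xargs runs its default, echo, once per group of words. One caveat: xargs gives quotes and backslashes special meaning, so this is only safe for input without them, which appears to hold for text8.

```shell
# Hypothetical helper (not from the answer): regroup stdin into lines of N words.
wrap_words() {
    xargs -n "$1"
}

# Seven words, three per line; the last line holds the remainder.
printf 'a b c d e f g\n' | wrap_words 3
# a b c
# d e f
# g
```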

Perl seems remarkably good at this:
Create a file with 10,000,000 space-separated words:
for ((i=1; i<=10000000; i++)); do printf "%s " $RANDOM ; done > one.line
Now use perl to add a newline after every 1,000 words:
time perl -pe '
   s{ 
       (?:\S+\s+){999} \S+   # 1000 words
       \K                    # then reset start of match
       \s+                   # and the next bit of whitespace
   }
   {\n}gx                    # replace whitespace with newline
' one.line > many.line
Timing:
real    0m1.074s
user    0m0.996s
sys     0m0.076s
Verify the results:
$ wc one.line many.line
       0  10000000  56608931 one.line
   10000  10000000  56608931 many.line
   10000  20000000 113217862 total
The accepted awk solution took just over 5 seconds on my input file.
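
The accepted answer is not reproduced in this thread, so the following is only a sketch of one way to do the job with awk, not the accepted code. It assumes tr is used first to put one word per line, which keeps awk from having to hold the whole 10^7-word line as a single record. The helper name wrap_awk is hypothetical, and no timing is claimed for it.

```shell
# Hypothetical helper (not the accepted answer): N words per output line.
wrap_awk() {
    # One word per line, with runs of whitespace squeezed to a single newline
    tr -s '[:space:]' '\n' |
        awk -v n="$1" '
            # Skip empty lines; end the line after every n-th word
            NF { printf "%s%s", $0, (++c % n ? " " : "\n") }
            # Terminate a partial last line
            END { if (c % n) print "" }
        '
}

printf 'a b c d e f g\n' | wrap_awk 3
# a b c
# d e f
# g
```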

Source: https://unix.stackexchange.com/questions/227581
