Zip
What method does unzip use to find a single file in an archive?
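Suppose I create 100 files, each containing 30 MB of random text data. I then create a zip archive with compression level 0 (store only), i.e.

    zip dataset.zip -r -0 *.txt

Now I want to extract just one file from this archive. As described here, there are two ways to extract a file from an archive: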
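1. Seek to the end of the file and locate the central directory, then use it for fast random access to the file to be extracted (amortized O(1) complexity).
2. Scan every local header and extract the entries that match (O(n) complexity).

Which method does unzip use? From my experiments, it seems to use method 2?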
When searching for a single file in a large archive, it uses method 1, which you can see using strace:

    open("dataset.zip", O_RDONLY) = 3
    ioctl(1, TIOCGWINSZ, 0x7fff9a895920) = -1 ENOTTY (Inappropriate ioctl for device)
    write(1, "Archive: dataset.zip\n", 22Archive: dataset.zip
    ) = 22
    lseek(3, 943718400, SEEK_SET) = 943718400
    read(3, "\340P\356(s\342\306\205\201\27\360U[\250/2\207\346<\252+u\234\225\1[<\2310E\342\274"..., 4522) = 4522
    lseek(3, 943722880, SEEK_SET) = 943722880
    read(3, "\3\f\225P\\ux\v\0\1\4\350\3\0\0\4\350\3\0\0", 20) = 20
    lseek(3, 943718400, SEEK_SET) = 943718400
    read(3, "\340P\356(s\342\306\205\201\27\360U[\250/2\207\346<\252+u\234\225\1[<\2310E\342\274"..., 8192) = 4522
    lseek(3, 849346560, SEEK_SET) = 849346560
    read(3, "D\262nv\210\343\240C\24\227\344\367q\300\223\231\306\330\275\266\213\276M\7I'&35\2\234J"..., 8192) = 8192
    stat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
    lstat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
    stat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
    lstat("rand-28.txt", 0x559f43e0a550) = -1 ENOENT (No such file or directory)
    open("rand-28.txt", O_RDWR|O_CREAT|O_TRUNC, 0666) = 4
    ioctl(1, TIOCGWINSZ, 0x7fff9a895790) = -1 ENOTTY (Inappropriate ioctl for device)
    write(1, " extracting: rand-28.txt "..., 37 extracting: rand-28.txt ) = 37
    read(3, "\275\3279Y\206\223\217}\355W%:\220YNT\0\257\260z^\361T\242\2\370\21\336\372+\306\310"..., 8192) = 8192
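unzip opens dataset.zip, seeks to the end, then seeks to the start of the requested file in the archive (rand-28.txt, at offset 849346560) and reads from there. The central directory is found by scanning the last 65557 bytes of the archive; see the code starting here: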
    /*---------------------------------------------------------------------------
        Find and process the end-of-central-directory header.  UnZip need only
        check last 65557 bytes of zipfile:  comment may be up to 65535, end-of-
        central-directory record is 18 bytes, and signature itself is 4 bytes;
        add some to allow for appended garbage.  Since ZipInfo is often used as
        a debugging tool, search the whole zipfile if zipinfo_mode is true.
      ---------------------------------------------------------------------------*/
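
The same lookup can be sketched in a few lines of Python. This is only an illustration of method 1, not UnZip's actual code: the function names `find_entry`, `read_member`, and `build_demo_archive` are hypothetical, and the sketch assumes a single-disk archive with no Zip64 extensions, no encryption, and ASCII or UTF-8 file names. It also simply trusts the last signature match in the tail window.

```python
import io
import struct
import zipfile
import zlib

EOCD_SIG = b"PK\x05\x06"  # end-of-central-directory record signature
CDFH_SIG = b"PK\x01\x02"  # central directory file header signature
LFH_SIG = b"PK\x03\x04"   # local file header signature
TAIL_SIZE = 65557         # window size quoted in the UnZip comment


def find_entry(f, wanted):
    """Return (local_header_offset, compressed_size, method) for `wanted`."""
    # Read only the tail of the archive and look for the EOCD signature.
    f.seek(0, io.SEEK_END)
    size = f.tell()
    f.seek(max(0, size - TAIL_SIZE))
    tail = f.read()
    pos = tail.rfind(EOCD_SIG)
    if pos < 0 or len(tail) - pos < 22:
        raise ValueError("end-of-central-directory record not found")
    # EOCD: signature, 4 x uint16, central directory size and offset, comment length.
    eocd = struct.unpack("<4s4H2LH", tail[pos:pos + 22])
    total, cd_size, cd_offset = eocd[4], eocd[5], eocd[6]

    # Read the central directory in one go and walk its (small) entries.
    f.seek(cd_offset)
    cd = f.read(cd_size)
    p = 0
    for _ in range(total):
        hdr = struct.unpack("<4s6H3L5H2L", cd[p:p + 46])
        if hdr[0] != CDFH_SIG:
            raise ValueError("bad central directory entry")
        method, csize = hdr[4], hdr[8]
        nlen, elen, clen = hdr[10], hdr[11], hdr[12]
        offset = hdr[16]
        name = cd[p + 46:p + 46 + nlen].decode("utf-8", "replace")
        if name == wanted:
            return offset, csize, method
        p += 46 + nlen + elen + clen
    raise KeyError(wanted)


def read_member(f, wanted):
    """Seek straight to the member's local header and return its data."""
    offset, csize, method = find_entry(f, wanted)
    f.seek(offset)
    lfh = struct.unpack("<4s5H3L2H", f.read(30))
    if lfh[0] != LFH_SIG:
        raise ValueError("bad local file header")
    f.seek(lfh[9] + lfh[10], io.SEEK_CUR)  # skip file name and extra field
    data = f.read(csize)
    if method == 0:   # stored, as produced by `zip -0`
        return data
    if method == 8:   # deflate (raw stream, no zlib header)
        return zlib.decompress(data, -15)
    raise NotImplementedError(f"compression method {method}")


def build_demo_archive():
    """Build a small stored-only archive in memory (stand-in for dataset.zip)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        for i in range(5):
            zf.writestr(f"rand-{i}.txt", f"contents of file {i}\n" * 3)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    archive = build_demo_archive()
    print(read_member(archive, "rand-3.txt").decode())
```

Note that finding the name still walks the central directory entries one by one, but those are a few dozen bytes each and sit together at the end of the archive, so none of the other members' data has to be read.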