Hard-Disk
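The entire root filesystem has become read-only
I first ran into this problem while downloading a huge data archive (about 500 GB), but I did not keep the output or logs at the time.
I rebooted the system and it dropped into emergency mode automatically. I ran fsck and that fixed it, but a few hours later it happened again. This time I found that the entire root filesystem, even /tmp, was read-only (someone suggested I check this). Here is the tail of the dmesg output: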
[35761.273361] ata4.00: exception Emask 0x0 SAct 0x1800 SErr 0x0 action 0x0
[35761.273373] ata4.00: irq_stat 0x40000008
[35761.273379] ata4.00: failed command: READ FPDMA QUEUED
[35761.273386] ata4.00: cmd 60/00:58:c0:31:a1/02:00:38:00:00/40 tag 11 ncq dma 262144 in
         res 41/40:00:f3:31:a1/00:00:38:00:00/40 Emask 0x409 (media error) <F>
[35761.273394] ata4.00: status: { DRDY ERR }
[35761.273398] ata4.00: error: { UNC }
[35761.276060] ata4.00: configured for UDMA/133
[35761.276077] sd 3:0:0:0: [sdb] tag#11 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[35761.276083] sd 3:0:0:0: [sdb] tag#11 Sense Key : Medium Error [current]
[35761.276089] sd 3:0:0:0: [sdb] tag#11 Add. Sense: Unrecovered read error - auto reallocate failed
[35761.276095] sd 3:0:0:0: [sdb] tag#11 CDB: Read(16) 88 00 00 00 00 00 38 a1 31 c0 00 00 02 00 00 00
[35761.276101] print_req_error: I/O error, dev sdb, sector 950088179
[35761.276117] ata4: EH complete
[38523.236782] ata4.00: exception Emask 0x0 SAct 0x18080 SErr 0x0 action 0x0
[38523.236793] ata4.00: irq_stat 0x40000001
[38523.236797] ata4.00: failed command: READ FPDMA QUEUED
[38523.236802] ata4.00: cmd 60/08:38:f0:31:a1/00:00:38:00:00/40 tag 7 ncq dma 4096 in
         res 41/40:00:f3:31:a1/00:00:38:00:00/40 Emask 0x409 (media error) <F>
[38523.236807] ata4.00: status: { DRDY ERR }
[38523.236810] ata4.00: error: { UNC }
[38523.236813] ata4.00: failed command: WRITE FPDMA QUEUED
[38523.236821] ata4.00: cmd 61/40:78:80:b9:81/09:00:30:00:00/40 tag 15 ncq dma 1212416 ou
         res 41/40:00:00:00:00/00:00:00:00:00/00 Emask 0x9 (media error)
[38523.236825] ata4.00: status: { DRDY ERR }
[38523.236828] ata4.00: error: { UNC }
[38523.236830] ata4.00: failed command: WRITE FPDMA QUEUED
[38523.236834] ata4.00: cmd 61/70:80:e8:4d:9f/00:00:ea:00:00/40 tag 16 ncq dma 57344 out
         res 41/40:00:00:00:00/00:00:00:00:00/00 Emask 0x9 (media error)
[38523.236838] ata4.00: status: { DRDY ERR }
[38523.236840] ata4.00: error: { UNC }
[38523.238584] ata4.00: configured for UDMA/133
[38523.238607] sd 3:0:0:0: [sdb] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[38523.238615] sd 3:0:0:0: [sdb] tag#7 Sense Key : Medium Error [current]
[38523.238622] sd 3:0:0:0: [sdb] tag#7 Add. Sense: Unrecovered read error - auto reallocate failed
[38523.238628] sd 3:0:0:0: [sdb] tag#7 CDB: Read(16) 88 00 00 00 00 00 38 a1 31 f0 00 00 00 08 00 00
[38523.238634] print_req_error: I/O error, dev sdb, sector 950088179
[38523.238659] sd 3:0:0:0: [sdb] tag#15 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[38523.238664] sd 3:0:0:0: [sdb] tag#15 Sense Key : Medium Error [current]
[38523.238668] sd 3:0:0:0: [sdb] tag#15 Add. Sense: Unrecovered read error - auto reallocate failed
[38523.238674] sd 3:0:0:0: [sdb] tag#15 CDB: Write(16) 8a 00 00 00 00 00 30 81 b9 80 00 00 09 40 00 00
[38523.238679] print_req_error: I/O error, dev sdb, sector 813808000
[38523.238687] EXT4-fs warning (device sdb3): ext4_end_bio:323: I/O error 10 writing to inode 56511830 (offset 26411008 size 1212416 starting block 101726296)
[38523.238694] Buffer I/O error on device sdb3, logical block 93788464
[38523.238704] Buffer I/O error on device sdb3, logical block 93788465
[38523.238708] Buffer I/O error on device sdb3, logical block 93788466
[38523.238713] Buffer I/O error on device sdb3, logical block 93788467
[38523.238717] Buffer I/O error on device sdb3, logical block 93788468
[38523.238722] Buffer I/O error on device sdb3, logical block 93788469
[38523.238728] Buffer I/O error on device sdb3, logical block 93788470
[38523.238733] Buffer I/O error on device sdb3, logical block 93788471
[38523.238738] Buffer I/O error on device sdb3, logical block 93788472
[38523.238747] Buffer I/O error on device sdb3, logical block 93788473
[38523.238982] JBD2: Detected IO errors while flushing file data on sdb3-8
[38523.238984] sd 3:0:0:0: [sdb] tag#16 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[38523.238995] sd 3:0:0:0: [sdb] tag#16 Sense Key : Medium Error [current]
[38523.238999] sd 3:0:0:0: [sdb] tag#16 Add. Sense: Unrecovered read error - auto reallocate failed
[38523.239005] sd 3:0:0:0: [sdb] tag#16 CDB: Write(16) 8a 00 00 00 00 00 ea 9f 4d e8 00 00 00 70 00 00
[38523.239010] print_req_error: I/O error, dev sdb, sector 3936308712
[38523.239026] ata4: EH complete
[38523.239032] Aborting journal on device sdb3-8.
[38523.239045] EXT4-fs (sdb3): Delayed block allocation failed for inode 56511830 at logical offset 6748 with max blocks 120 with error 30
[38523.239055] EXT4-fs (sdb3): This should not happen!! Data will be lost
[38523.239643] EXT4-fs error (device sdb3) in ext4_writepages:2906: IO failure
[38523.296445] EXT4-fs (sdb3): Remounting filesystem read-only
[38523.296477] EXT4-fs error (device sdb3): ext4_journal_check_start:61: Detected aborted journal
[38525.832744] ata4.00: exception Emask 0x0 SAct 0x30 SErr 0x0 action 0x0
[38525.833100] ata4.00: irq_stat 0x40000008
[38525.833365] ata4.00: failed command: READ FPDMA QUEUED
[38525.833629] ata4.00: cmd 60/80:20:c0:3b:a1/00:00:38:00:00/40 tag 4 ncq dma 65536 in
         res 41/40:00:e7:3b:a1/00:00:38:00:00/40 Emask 0x409 (media error) <F>
[38525.834152] ata4.00: status: { DRDY ERR }
[38525.834415] ata4.00: error: { UNC }
[38525.836456] ata4.00: configured for UDMA/133
[38525.836737] sd 3:0:0:0: [sdb] tag#4 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[38525.837001] sd 3:0:0:0: [sdb] tag#4 Sense Key : Medium Error [current]
[38525.837267] sd 3:0:0:0: [sdb] tag#4 Add. Sense: Unrecovered read error - auto reallocate failed
[38525.837531] sd 3:0:0:0: [sdb] tag#4 CDB: Read(16) 88 00 00 00 00 00 38 a1 3b c0 00 00 00 80 00 00
[38525.837796] print_req_error: I/O error, dev sdb, sector 950090727
[38525.838072] ata4: EH complete
[38528.260746] ata4.00: exception Emask 0x0 SAct 0x400 SErr 0x0 action 0x0
[38528.261092] ata4.00: irq_stat 0x40000008
[38528.261357] ata4.00: failed command: READ FPDMA QUEUED
[38528.261623] ata4.00: cmd 60/08:50:e0:3b:a1/00:00:38:00:00/40 tag 10 ncq dma 4096 in
         res 41/40:00:e7:3b:a1/00:00:38:00:00/40 Emask 0x409 (media error) <F>
[38528.262144] ata4.00: status: { DRDY ERR }
[38528.262405] ata4.00: error: { UNC }
[38528.264870] ata4.00: configured for UDMA/133
[38528.265149] sd 3:0:0:0: [sdb] tag#10 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[38528.265410] sd 3:0:0:0: [sdb] tag#10 Sense Key : Medium Error [current]
[38528.265668] sd 3:0:0:0: [sdb] tag#10 Add. Sense: Unrecovered read error - auto reallocate failed
[38528.265923] sd 3:0:0:0: [sdb] tag#10 CDB: Read(16) 88 00 00 00 00 00 38 a1 3b e0 00 00 00 08 00 00
[38528.266182] print_req_error: I/O error, dev sdb, sector 950090727
[38528.266459] ata4: EH complete
[54010.452717] EXT4-fs error (device sdb3): ext4_remount:5338: Abort forced by user
[56341.190097] EXT4-fs error (device sdb3): ext4_remount:5338: Abort forced by user
[56572.048951] EXT4-fs error (device sdb3): ext4_remount:5338: Abort forced by user
[56633.963486] EXT4-fs error (device sdb3): ext4_remount:5338: Abort forced by user
After that no further messages were logged, because the whole root filesystem had become read-only. Then I tried:
# mount / -o remount,rw
mount: /: cannot remount /dev/sdb3 read-write, is write-protected.
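It failed, but luckily the partitions on the other disks were fine, so I was able to upload a locally built smartctl to the server and look at the SMART information: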
# /other/smartmontools-7.2/smartctl -a /dev/sdb
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-4.19.0-14-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Gold
Device Model:     WDC WD4002FYYZ-01B7CB0
Serial Number:    WD-N8G6724Y
LU WWN Device Id: 5 0014ee 25f546502
Firmware Version: 01.01K03
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Jun 29 23:11:25 2021 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 121) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                (49440) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 533) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x70bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   198   197   051    Pre-fail  Always       -       14
  3 Spin_Up_Time            0x0027   175   149   021    Pre-fail  Always       -       10208
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       29
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   079   079   000    Old_age   Always       -       15352
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       29
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       9
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       19
194 Temperature_Celsius     0x0022   109   098   000    Old_age   Always       -       43
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       10
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       6
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       8

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     15351         788280736
# 2  Short offline       Completed: read failure       90%     15327         788280736
# 3  Short offline       Completed without error       00%        97         -
# 4  Extended offline    Aborted by host               90%        34         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
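I am not entirely sure this is an HDD problem, because I cannot make much sense of this information. English is not my native language, so please excuse my poor wording :/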
According to your SMART log you have bad sectors; in other words, your drive is dying. How fast? Nobody knows. Always make backups and test them.
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       10
# 1  Extended offline    Completed: read failure       90%     15351         788280736
# 2  Short offline       Completed: read failure       90%     15327         788280736
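To make the reading of those rows concrete, here is a minimal Python sketch that picks out the sector-related counters from smartctl attribute rows and reports the ones with a non-zero raw value. The rows are copied from the output in the question; the choice of which attributes to watch and the function name are my own assumptions, and raw values can be vendor-specific, so treat the result as a hint and not as a verdict.

```python
# Attribute rows copied from the smartctl -a output in the question.
SMART_ROWS = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       10
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       6
"""

# Assumption: these are the attributes worth watching for bad sectors.
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}


def nonzero_sector_counters(text):
    """Return {attribute_name: raw_value} for watched attributes with raw_value > 0."""
    found = {}
    for line in text.splitlines():
        fields = line.split()
        # An attribute row has 10 columns:
        # ID, name, flag, value, worst, thresh, type, updated, when_failed, raw.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            found[name] = int(raw)
    return found


if __name__ == "__main__":
    print(nonzero_sector_counters(SMART_ROWS))
```

Here the pending and offline-uncorrectable counters are non-zero while the reallocated count is still 0, which fits the "auto reallocate failed" messages in the kernel log.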
You can use these guides to try to force your drive to reallocate them:
- https://www.smartmontools.org/wiki/BadBlockHowto
- https://linoxide.com/how-to-fix-repair-bad-blocks-in-linux/
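As I understand the first guide, the key step is turning the failing LBA reported by the self-test into a block number inside the filesystem, so you can find out which file (if any) sits on it. Below is a small Python sketch of that arithmetic. The 512-byte sector size comes from the smartctl output; the 4096-byte filesystem block size and the partition start sector are assumptions (the start value is a made-up placeholder, read the real one from fdisk -l or similar), and you should first confirm the LBA actually falls inside the partition you are checking.

```python
def lba_to_fs_block(lba, partition_start, sector_size=512, block_size=4096):
    """Map an absolute disk LBA to a block number inside the filesystem on a partition.

    lba             -- failing sector reported by the drive (absolute, in sectors)
    partition_start -- first sector of the partition holding the filesystem
    sector_size     -- logical sector size in bytes
    block_size      -- filesystem block size in bytes (assumed 4096 here)
    """
    if lba < partition_start:
        raise ValueError("LBA lies before the start of the partition")
    return (lba - partition_start) * sector_size // block_size


FAILING_LBA = 788280736   # LBA_of_first_error from the self-test log
PARTITION_START = 2048    # HYPOTHETICAL placeholder, not taken from the question

if __name__ == "__main__":
    print(lba_to_fs_block(FAILING_LBA, PARTITION_START))
```

With the real partition start filled in, the resulting block number is what the guide then feeds to the filesystem tools to locate the affected file.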
Either way, you will need to run e2fsck -c on the affected partition. Wikipedia has a good overview of SMART: https://en.wikipedia.org/wiki/SMART
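If you want to cross-check the SMART numbers against the kernel log, a rough Python sketch like the following counts how often each sector shows up in the "I/O error" lines. The sample lines are copied from the dmesg output in the question; the regular expression and function name are my own, and other kernel versions may word this message differently.

```python
import re
from collections import Counter

# The "I/O error" lines copied from the dmesg output in the question.
LOG = """\
[35761.276101] print_req_error: I/O error, dev sdb, sector 950088179
[38523.238634] print_req_error: I/O error, dev sdb, sector 950088179
[38523.238679] print_req_error: I/O error, dev sdb, sector 813808000
[38523.239010] print_req_error: I/O error, dev sdb, sector 3936308712
[38525.837796] print_req_error: I/O error, dev sdb, sector 950090727
[38528.266182] print_req_error: I/O error, dev sdb, sector 950090727
"""

# Assumption: the message format matches the 4.19 kernel shown in the question.
PATTERN = re.compile(r"I/O error, dev (\w+), sector (\d+)")


def failing_sectors(text):
    """Count I/O errors per (device, sector) pair found in kernel log text."""
    return Counter((dev, int(sector)) for dev, sector in PATTERN.findall(text))


if __name__ == "__main__":
    for (dev, sector), count in sorted(failing_sectors(LOG).items()):
        print(f"{dev} sector {sector}: {count} error(s)")
```

Sectors that keep reappearing across separate error-handling rounds are consistent with real media damage, as opposed to a one-off cabling or transport glitch.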