Hard-Disk

The entire root filesystem became read-only

  • June 30, 2021

This problem first appeared while I was downloading a huge data archive (about 500 GB), but I did not keep the output or logs at the time.

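I rebooted the system and it automatically entered emergency mode. I ran fsck, which resolved the problem, but a few hours later it happened again. This time I found that the whole root fs, even /tmp, was read-only (someone suggested I check this). Here is the last output of dmesg: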

[35761.273361] ata4.00: exception Emask 0x0 SAct 0x1800 SErr 0x0 action 0x0
[35761.273373] ata4.00: irq_stat 0x40000008
[35761.273379] ata4.00: failed command: READ FPDMA QUEUED
[35761.273386] ata4.00: cmd 60/00:58:c0:31:a1/02:00:38:00:00/40 tag 11 ncq dma 262144 in
                       res 41/40:00:f3:31:a1/00:00:38:00:00/40 Emask 0x409 (media error) <F>
[35761.273394] ata4.00: status: { DRDY ERR }
[35761.273398] ata4.00: error: { UNC }
[35761.276060] ata4.00: configured for UDMA/133
[35761.276077] sd 3:0:0:0: [sdb] tag#11 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[35761.276083] sd 3:0:0:0: [sdb] tag#11 Sense Key : Medium Error [current]
[35761.276089] sd 3:0:0:0: [sdb] tag#11 Add. Sense: Unrecovered read error - auto reallocate failed
[35761.276095] sd 3:0:0:0: [sdb] tag#11 CDB: Read(16) 88 00 00 00 00 00 38 a1 31 c0 00 00 02 00 00 00
[35761.276101] print_req_error: I/O error, dev sdb, sector 950088179
[35761.276117] ata4: EH complete
[38523.236782] ata4.00: exception Emask 0x0 SAct 0x18080 SErr 0x0 action 0x0
[38523.236793] ata4.00: irq_stat 0x40000001
[38523.236797] ata4.00: failed command: READ FPDMA QUEUED
[38523.236802] ata4.00: cmd 60/08:38:f0:31:a1/00:00:38:00:00/40 tag 7 ncq dma 4096 in
                       res 41/40:00:f3:31:a1/00:00:38:00:00/40 Emask 0x409 (media error) <F>
[38523.236807] ata4.00: status: { DRDY ERR }
[38523.236810] ata4.00: error: { UNC }
[38523.236813] ata4.00: failed command: WRITE FPDMA QUEUED
[38523.236821] ata4.00: cmd 61/40:78:80:b9:81/09:00:30:00:00/40 tag 15 ncq dma 1212416 ou
                       res 41/40:00:00:00:00/00:00:00:00:00/00 Emask 0x9 (media error)
[38523.236825] ata4.00: status: { DRDY ERR }
[38523.236828] ata4.00: error: { UNC }
[38523.236830] ata4.00: failed command: WRITE FPDMA QUEUED
[38523.236834] ata4.00: cmd 61/70:80:e8:4d:9f/00:00:ea:00:00/40 tag 16 ncq dma 57344 out
                       res 41/40:00:00:00:00/00:00:00:00:00/00 Emask 0x9 (media error)
[38523.236838] ata4.00: status: { DRDY ERR }
[38523.236840] ata4.00: error: { UNC }
[38523.238584] ata4.00: configured for UDMA/133
[38523.238607] sd 3:0:0:0: [sdb] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[38523.238615] sd 3:0:0:0: [sdb] tag#7 Sense Key : Medium Error [current]
[38523.238622] sd 3:0:0:0: [sdb] tag#7 Add. Sense: Unrecovered read error - auto reallocate failed
[38523.238628] sd 3:0:0:0: [sdb] tag#7 CDB: Read(16) 88 00 00 00 00 00 38 a1 31 f0 00 00 00 08 00 00
[38523.238634] print_req_error: I/O error, dev sdb, sector 950088179
[38523.238659] sd 3:0:0:0: [sdb] tag#15 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[38523.238664] sd 3:0:0:0: [sdb] tag#15 Sense Key : Medium Error [current]
[38523.238668] sd 3:0:0:0: [sdb] tag#15 Add. Sense: Unrecovered read error - auto reallocate failed
[38523.238674] sd 3:0:0:0: [sdb] tag#15 CDB: Write(16) 8a 00 00 00 00 00 30 81 b9 80 00 00 09 40 00 00
[38523.238679] print_req_error: I/O error, dev sdb, sector 813808000
[38523.238687] EXT4-fs warning (device sdb3): ext4_end_bio:323: I/O error 10 writing to inode 56511830 (offset 26411008 size 1212416 starting block 101726296)
[38523.238694] Buffer I/O error on device sdb3, logical block 93788464
[38523.238704] Buffer I/O error on device sdb3, logical block 93788465
[38523.238708] Buffer I/O error on device sdb3, logical block 93788466
[38523.238713] Buffer I/O error on device sdb3, logical block 93788467
[38523.238717] Buffer I/O error on device sdb3, logical block 93788468
[38523.238722] Buffer I/O error on device sdb3, logical block 93788469
[38523.238728] Buffer I/O error on device sdb3, logical block 93788470
[38523.238733] Buffer I/O error on device sdb3, logical block 93788471
[38523.238738] Buffer I/O error on device sdb3, logical block 93788472
[38523.238747] Buffer I/O error on device sdb3, logical block 93788473
[38523.238982] JBD2: Detected IO errors while flushing file data on sdb3-8
[38523.238984] sd 3:0:0:0: [sdb] tag#16 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[38523.238995] sd 3:0:0:0: [sdb] tag#16 Sense Key : Medium Error [current]
[38523.238999] sd 3:0:0:0: [sdb] tag#16 Add. Sense: Unrecovered read error - auto reallocate failed
[38523.239005] sd 3:0:0:0: [sdb] tag#16 CDB: Write(16) 8a 00 00 00 00 00 ea 9f 4d e8 00 00 00 70 00 00
[38523.239010] print_req_error: I/O error, dev sdb, sector 3936308712
[38523.239026] ata4: EH complete
[38523.239032] Aborting journal on device sdb3-8.
[38523.239045] EXT4-fs (sdb3): Delayed block allocation failed for inode 56511830 at logical offset 6748 with max blocks 120 with error 30
[38523.239055] EXT4-fs (sdb3): This should not happen!! Data will be lost

[38523.239643] EXT4-fs error (device sdb3) in ext4_writepages:2906: IO failure
[38523.296445] EXT4-fs (sdb3): Remounting filesystem read-only
[38523.296477] EXT4-fs error (device sdb3): ext4_journal_check_start:61: Detected aborted journal
[38525.832744] ata4.00: exception Emask 0x0 SAct 0x30 SErr 0x0 action 0x0
[38525.833100] ata4.00: irq_stat 0x40000008
[38525.833365] ata4.00: failed command: READ FPDMA QUEUED
[38525.833629] ata4.00: cmd 60/80:20:c0:3b:a1/00:00:38:00:00/40 tag 4 ncq dma 65536 in
                       res 41/40:00:e7:3b:a1/00:00:38:00:00/40 Emask 0x409 (media error) <F>
[38525.834152] ata4.00: status: { DRDY ERR }
[38525.834415] ata4.00: error: { UNC }
[38525.836456] ata4.00: configured for UDMA/133
[38525.836737] sd 3:0:0:0: [sdb] tag#4 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[38525.837001] sd 3:0:0:0: [sdb] tag#4 Sense Key : Medium Error [current]
[38525.837267] sd 3:0:0:0: [sdb] tag#4 Add. Sense: Unrecovered read error - auto reallocate failed
[38525.837531] sd 3:0:0:0: [sdb] tag#4 CDB: Read(16) 88 00 00 00 00 00 38 a1 3b c0 00 00 00 80 00 00
[38525.837796] print_req_error: I/O error, dev sdb, sector 950090727
[38525.838072] ata4: EH complete
[38528.260746] ata4.00: exception Emask 0x0 SAct 0x400 SErr 0x0 action 0x0
[38528.261092] ata4.00: irq_stat 0x40000008
[38528.261357] ata4.00: failed command: READ FPDMA QUEUED
[38528.261623] ata4.00: cmd 60/08:50:e0:3b:a1/00:00:38:00:00/40 tag 10 ncq dma 4096 in
                       res 41/40:00:e7:3b:a1/00:00:38:00:00/40 Emask 0x409 (media error) <F>
[38528.262144] ata4.00: status: { DRDY ERR }
[38528.262405] ata4.00: error: { UNC }
[38528.264870] ata4.00: configured for UDMA/133
[38528.265149] sd 3:0:0:0: [sdb] tag#10 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[38528.265410] sd 3:0:0:0: [sdb] tag#10 Sense Key : Medium Error [current]
[38528.265668] sd 3:0:0:0: [sdb] tag#10 Add. Sense: Unrecovered read error - auto reallocate failed
[38528.265923] sd 3:0:0:0: [sdb] tag#10 CDB: Read(16) 88 00 00 00 00 00 38 a1 3b e0 00 00 00 08 00 00
[38528.266182] print_req_error: I/O error, dev sdb, sector 950090727
[38528.266459] ata4: EH complete
[54010.452717] EXT4-fs error (device sdb3): ext4_remount:5338: Abort forced by user
[56341.190097] EXT4-fs error (device sdb3): ext4_remount:5338: Abort forced by user
[56572.048951] EXT4-fs error (device sdb3): ext4_remount:5338: Abort forced by user
[56633.963486] EXT4-fs error (device sdb3): ext4_remount:5338: Abort forced by user

No further messages were logged after that, because the whole rootfs had become read-only. Then I tried:

# mount / -o remount,rw
mount: /: cannot remount /dev/sdb3 read-write, is write-protected.

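It failed, but fortunately the partitions on my other hard drives were fine, so I could upload a locally built smartctl to the server and look at the SMART information: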

# /other/smartmontools-7.2/smartctl -a /dev/sdb
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-4.19.0-14-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Gold
Device Model:     WDC WD4002FYYZ-01B7CB0
Serial Number:    WD-N8G6724Y
LU WWN Device Id: 5 0014ee 25f546502
Firmware Version: 01.01K03
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Jun 29 23:11:25 2021 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                       was completed without error.
                                       Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 121) The previous self-test completed having
                                       the read element of the test failed.
Total time to complete Offline
data collection:                (49440) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                       Auto Offline data collection on/off support.
                                       Suspend Offline collection upon new
                                       command.
                                       Offline surface scan supported.
                                       Self-test supported.
                                       Conveyance Self-test supported.
                                       Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                       power-saving mode.
                                       Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                       General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 533) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x70bd) SCT Status supported.
                                       SCT Error Recovery Control supported.
                                       SCT Feature Control supported.
                                       SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
 1 Raw_Read_Error_Rate     0x002f   198   197   051    Pre-fail  Always       -       14
 3 Spin_Up_Time            0x0027   175   149   021    Pre-fail  Always       -       10208
 4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       29
 5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
 7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
 9 Power_On_Hours          0x0032   079   079   000    Old_age   Always       -       15352
10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       29
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       9
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       19
194 Temperature_Celsius     0x0022   109   098   000    Old_age   Always       -       43
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       10
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       6
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       8

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     15351         788280736
# 2  Short offline       Completed: read failure       90%     15327         788280736
# 3  Short offline       Completed without error       00%        97         -
# 4  Extended offline    Aborted by host               90%        34         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Not_testing
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing
Selective self-test flags (0x0):
 After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

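I'm not entirely sure this is an HDD problem, because I can't understand this information very well. English is not my native language, so please excuse my poor wording :/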

According to your SMART log, you have bad sectors; in other words, your drive is dying. How fast? No one knows. Always make and test backups.

197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       10

# 1  Extended offline    Completed: read failure       90%     15351         788280736
# 2  Short offline       Completed: read failure       90%     15327         788280736
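As a rough illustration of how those lines can be picked out of a full `smartctl -a` report, here is a minimal Python sketch. The function name `bad_sector_counts`, the `WATCHED` table, and the choice of attributes 5, 197 and 198 as the ones worth flagging are assumptions made for this example, not something smartctl itself defines; the sample rows are copied from the output in the question.

```python
# Attribute IDs treated as bad-sector indicators in this sketch (an assumption).
WATCHED = {5, 197, 198}


def bad_sector_counts(smart_text):
    """Return {attribute_name: raw_value} for watched attributes with a nonzero raw value."""
    found = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns:
        # ID, name, flag, value, worst, thresh, type, updated, when_failed, raw
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if int(fields[0]) in WATCHED and fields[9].isdigit():
            raw = int(fields[9])
            if raw > 0:
                found[fields[1]] = raw
    return found


# Rows copied from the smartctl output in the question.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       10
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       6
"""

print(bad_sector_counts(SAMPLE))
```

For this drive the sketch reports 10 pending and 6 offline-uncorrectable sectors, while the reallocated count is still 0. That is consistent with the "auto reallocate failed" messages in dmesg: the drive has found sectors it cannot read but has not yet remapped them.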

You can use these guides to try to force your drive to reallocate them:

In any case, you will need to run e2fsck -c on the affected partition.
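To cross-check which sectors the kernel actually complained about, the `CDB: Read(16)` lines in the dmesg output from the question can be decoded by hand. The sketch below assumes the standard READ(16)/WRITE(16) layout (opcode in byte 0, an 8-byte big-endian LBA in bytes 2 to 9, a 4-byte transfer length in bytes 10 to 13); the helper name `decode_rw16` is made up for this example.

```python
def decode_rw16(cdb_hex):
    """Decode a SCSI READ(16)/WRITE(16) CDB into (opcode, start_lba, sector_count)."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 16 or cdb[0] not in (0x88, 0x8A):
        raise ValueError("not a READ(16)/WRITE(16) CDB")
    start_lba = int.from_bytes(cdb[2:10], "big")      # bytes 2..9: logical block address
    sector_count = int.from_bytes(cdb[10:14], "big")  # bytes 10..13: transfer length
    return cdb[0], start_lba, sector_count


# First failed read from the dmesg output in the question.
opcode, start_lba, sector_count = decode_rw16(
    "88 00 00 00 00 00 38 a1 31 c0 00 00 02 00 00 00"
)
failed_sector = 950088179  # from "print_req_error: I/O error, dev sdb, sector 950088179"

print(start_lba, sector_count, failed_sector - start_lba)
```

The request starts at LBA 950088128 and covers 512 sectors, so the reported sector 950088179 sits 51 sectors into that read. The later 8-sector read at LBA 950088176 fails on the same sector, which suggests a specific unreadable spot on the platter rather than a cabling or controller problem (UDMA_CRC_Error_Count is also 0).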

Wikipedia has a good overview of SMART: https://en.wikipedia.org/wiki/SMART

Source: https://unix.stackexchange.com/questions/656309