Manufacturer's tool found bad blocks, but smartctl doesn't show any
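My problem description is rather long, so I'll start with a short summary and then describe the situation precisely.
Short summary: The manufacturer's diagnostic tool found and repaired some errors on my hard drive. As far as I understand the tool's manual, these errors were bad blocks. However, smartctl (a Linux tool for querying SMART on hard drives) doesn't show any reallocated sectors and says the drive is fine. First question: how is that possible? Repairing bad blocks means reallocating sectors, right? So why doesn't smartctl report any reallocated sectors? Second question: I bought this disk a few months ago and it is still under warranty. Should I ask the seller to replace it with a new one, or is this disk fine and can I keep using it?
Now the precise description:
I have a Western Digital hard drive, model WDC WD5000AAKX-001CA0. Recently I noticed that my computer sometimes hangs for a short while (about a minute). After such a hang, dmesg shows errors like these: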
    knoppix@Microknoppix:~$ dmesg
    (...)
    [  504.003363] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
    [  504.003374] ata1.00: failed command: READ DMA EXT
    [  504.003383] ata1.00: cmd 25/00:00:80:07:01/00:02:00:00:00/e0 tag 0 dma 262144 in
    [  504.003385]          res 40/00:00:09:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
    [  504.003389] ata1.00: status: { DRDY }
    [  509.016652] ata1: link is slow to respond, please be patient (ready=0)
    [  514.030002] ata1: soft resetting link
    [  514.200386] ata1.00: configured for UDMA/133
    [  514.200420] ata1: EH complete
    [  546.003333] ata1: lost interrupt (Status 0x50)
    [  546.003364] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
    [  546.003371] ata1.00: failed command: READ DMA EXT
    [  546.003380] ata1.00: cmd 25/00:00:80:15:06/00:02:00:00:00/e0 tag 0 dma 262144 in
    [  546.003381]          res 40/00:00:09:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
    [  546.003386] ata1.00: status: { DRDY }
    [  546.003401] ata1: soft resetting link
    [  546.181205] ata1.00: configured for UDMA/133
    [  546.181234] ata1: EH complete
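For reference, the "cmd" line of such a report can be decoded to see which sectors the failed read covered. The following is a minimal sketch, assuming the byte layout that the Linux libata error handler uses when it prints the taskfile (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device). The function name decode_cmd is made up for this illustration.

```python
import re

# Assumed layout of the twelve hex bytes after "cmd":
# command/feature:nsect:lbal:lbam:lbah/
# hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
CMD_RE = re.compile(r"cmd ((?:[0-9a-f]{2}[/:]){11}[0-9a-f]{2})")


def decode_cmd(line):
    """Return (command, lba, sector_count) for a libata 'cmd' line, else None."""
    match = CMD_RE.search(line)
    if match is None:
        return None
    b = [int(x, 16) for x in re.split(r"[/:]", match.group(1))]
    command, _feature, nsect, lbal, lbam, lbah = b[0:6]
    _hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = b[6:11]
    # 48-bit LBA: the "hob" (high order byte) fields carry bits 24..47.
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    # 16-bit sector count; a value of 0 is not special-cased in this sketch.
    count = (hob_nsect << 8) | nsect
    return command, lba, count


line = "[  504.003383] ata1.00: cmd 25/00:00:80:07:01/00:02:00:00:00/e0 tag 0 dma 262144 in"
command, lba, count = decode_cmd(line)
print(hex(command), lba, count)  # 0x25 67456 512
```

The decoded count of 512 sectors times 512 bytes per sector gives 262144 bytes, which agrees with the "dma 262144" field on the same line, so the assumed layout is at least consistent with this log.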
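However, smartctl says "SMART overall-health self-assessment test result: PASSED" (I will paste the full smartctl output a few paragraphs below). Whenever I try to run a smartctl self-test (with smartctl -t short or smartctl -t long), the test is reported as aborted by host. So I downloaded the bootable CD diagnostic tool for my hard drive, this one: http://support.wdc.com/product/download.asp?groupid=606&sid=2&lang=en
With this tool I first ran the quick test, and it reported an error (unfortunately, I don't remember the error code). As far as I understand, this test only runs the SMART quick self-test (http://wdc.custhelp.com/app/answers/detail/search/1/a_id/940/c/130/p/227,295 says "QUICK TEST - performs SMART drive quick self-test to gather and verify the Data Lifeguard information contained on the drive."). Then I ran the extended test. As far as I understand, the extended test looks for bad sectors (the same page says "EXTENDED TEST - performs a Full Media Scan to detect bad sectors"). After some time, the tool reported that it had found some errors and repaired them.
I then booted the machine with Knoppix and ran smartctl --all. Here is its output: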
    root@Microknoppix:/home/knoppix# smartctl --all /dev/sda
    smartctl 5.43 2012-06-05 r3561 [i686-linux-3.4.9] (local build)
    Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

    === START OF INFORMATION SECTION ===
    Model Family:     Western Digital Caviar Blue Serial ATA
    Device Model:     WDC WD5000AAKX-001CA0
    Serial Number:    WD-WMAYUW952768
    LU WWN Device Id: 5 0014ee 6ad1d9ef1
    Firmware Version: 15.01H15
    User Capacity:    500,107,862,016 bytes [500 GB]
    Sector Size:      512 bytes logical/physical
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   8
    ATA Standard is:  Exact ATA specification draft version not indicated
    Local Time is:    Wed Dec 12 03:34:39 2012 UTC
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    General SMART Values:
    Offline data collection status:  (0x82) Offline data collection activity
                                            was completed without error.
                                            Auto Offline Data Collection: Enabled.
    Self-test execution status:      (   0) The previous self-test routine completed
                                            without error or no self-test has ever
                                            been run.
    Total time to complete Offline
    data collection:                ( 8160) seconds.
    Offline data collection
    capabilities:                    (0x7b) SMART execute Offline immediate.
                                            Auto Offline data collection on/off support.
                                            Suspend Offline collection upon new
                                            command.
                                            Offline surface scan supported.
                                            Self-test supported.
                                            Conveyance Self-test supported.
                                            Selective Self-test supported.
    SMART capabilities:            (0x0003) Saves SMART data before entering
                                            power-saving mode.
                                            Supports SMART auto save timer.
    Error logging capability:        (0x01) Error logging supported.
                                            General Purpose Logging supported.
    Short self-test routine
    recommended polling time:        (   2) minutes.
    Extended self-test routine
    recommended polling time:        (  83) minutes.
    Conveyance self-test routine
    recommended polling time:        (   5) minutes.
    SCT capabilities:              (0x3037) SCT Status supported.
                                            SCT Feature Control supported.
                                            SCT Data Table supported.

    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       486
      3 Spin_Up_Time            0x0027   189   141   021    Pre-fail  Always       -       1525
      4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       587
      5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
      9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       1553
     10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
     11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       578
    192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       173
    193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       413
    194 Temperature_Celsius     0x0022   097   093   000    Old_age   Always       -       46
    196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       5
    199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
    200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       5

    SMART Error Log Version: 1
    ATA Error Count: 2
            CR = Command Register [HEX]
            FR = Features Register [HEX]
            SC = Sector Count Register [HEX]
            SN = Sector Number Register [HEX]
            CL = Cylinder Low Register [HEX]
            CH = Cylinder High Register [HEX]
            DH = Device/Head Register [HEX]
            DC = Device Command Register [HEX]
            ER = Error register [HEX]
            ST = Status register [HEX]
    Powered_Up_Time is measured from power on, and printed as
    DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
    SS=sec, and sss=millisec. It "wraps" after 49.710 days.

    Error 2 occurred at disk power-on lifetime: 1548 hours (64 days + 12 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      04 51 01 30 4f c2 a0  Error: ABRT

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      b0 d6 01 be 4f c2 a0 02      00:02:58.316  SMART WRITE LOG
      b0 da 01 00 4f c2 a0 02      00:02:58.259  SMART RETURN STATUS
      80 44 00 00 44 57 a0 02      00:02:58.259  [VENDOR SPECIFIC]
      b0 d6 01 be 4f c2 a0 02      00:02:58.241  SMART WRITE LOG
      80 45 00 01 44 57 a0 02      00:02:58.241  [VENDOR SPECIFIC]

    Error 1 occurred at disk power-on lifetime: 1515 hours (63 days + 3 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      04 51 01 30 4f c2 a0  Error: ABRT

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      b0 d6 01 be 4f c2 a0 02      00:02:21.841  SMART WRITE LOG
      b0 da 01 00 4f c2 a0 02      00:02:21.784  SMART RETURN STATUS
      80 44 00 00 44 57 a0 02      00:02:21.784  [VENDOR SPECIFIC]
      b0 d6 01 be 4f c2 a0 02      00:02:21.768  SMART WRITE LOG
      80 45 00 01 44 57 a0 02      00:02:21.768  [VENDOR SPECIFIC]

    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Conveyance offline  Completed without error       00%      1552         -
    # 2  Conveyance offline  Completed: read failure       90%      1548         787927349
    # 3  Conveyance offline  Completed: read failure       90%      1515         883391611
    # 4  Short offline       Completed without error       00%      1503         -
    # 5  Short offline       Completed without error       00%      1503         -
    # 6  Short offline       Aborted by host               80%      1502         -
    # 7  Extended offline    Completed without error       00%         9         -
    # 8  Short offline       Completed without error       00%         6         -
    # 9  Short offline       Aborted by host               90%         6         -

    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.
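As you can see, on the one hand two of the conveyance offline tests completed with a read failure. On the other hand, all the attributes seem fine: for example, Reallocated_Sector_Ct is 0.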
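To make the failing locations easier to see, the self-test log can be filtered for rows that ended in a read failure. This is a minimal sketch that assumes the row layout printed by smartctl 5.43 above, where the last column is LBA_of_first_error, and the 512-byte sector size reported in the information section. The helper name read_failures is hypothetical.

```python
def read_failures(log_text, sector_size=512):
    """Return (lba, byte_offset) for each self-test row that reports a read failure."""
    failures = []
    for line in log_text.splitlines():
        row = line.strip()
        # Self-test rows start with "#"; failed ones contain "read failure".
        if row.startswith("#") and "read failure" in row:
            last = row.split()[-1]  # LBA_of_first_error column
            if last.isdigit():
                lba = int(last)
                failures.append((lba, lba * sector_size))
    return failures


SAMPLE_LOG = """\
# 1  Conveyance offline  Completed without error       00%      1552         -
# 2  Conveyance offline  Completed: read failure       90%      1548         787927349
# 3  Conveyance offline  Completed: read failure       90%      1515         883391611
# 4  Short offline       Completed without error       00%      1503         -
"""

for lba, offset in read_failures(SAMPLE_LOG):
    print(f"read failure at LBA {lba} (byte offset {offset})")
```

The two failures are at different LBAs, which suggests more than one damaged area rather than a single unreadable sector.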
I also tried to cat the whole disk to /dev/null, and again I got errors in dmesg:
    root@Microknoppix:/home/knoppix# nice -n 20 ionice -c 3 cat /dev/sda > /dev/null

During this cat, dmesg shows errors like these:

    knoppix@Microknoppix:~$ dmesg
    (...)
    [  504.003363] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
    [  504.003374] ata1.00: failed command: READ DMA EXT
    [  504.003383] ata1.00: cmd 25/00:00:80:07:01/00:02:00:00:00/e0 tag 0 dma 262144 in
    [  504.003385]          res 40/00:00:09:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
    [  504.003389] ata1.00: status: { DRDY }
    [  509.016652] ata1: link is slow to respond, please be patient (ready=0)
    [  514.030002] ata1: soft resetting link
    [  514.200386] ata1.00: configured for UDMA/133
    [  514.200420] ata1: EH complete
    [  546.003333] ata1: lost interrupt (Status 0x50)
    [  546.003364] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
    [  546.003371] ata1.00: failed command: READ DMA EXT
    [  546.003380] ata1.00: cmd 25/00:00:80:15:06/00:02:00:00:00/e0 tag 0 dma 262144 in
    [  546.003381]          res 40/00:00:09:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
    [  546.003386] ata1.00: status: { DRDY }
    [  546.003401] ata1: soft resetting link
    [  546.181205] ata1.00: configured for UDMA/133
    [  546.181234] ata1: EH complete
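I thought this might be a fault in the motherboard or in the data cable connecting the disk to the motherboard. So I connected another disk to the motherboard using the same cable and the same port, and catted it to /dev/null. That succeeded, and dmesg showed no errors.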
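The same whole-device read can be sketched in Python so that the offsets of chunks that fail to read are recorded and the scan continues. This is only an illustration of the cat test, not a replacement for a proper surface scan: the function name scan_read_errors and the 1 MiB chunk size are arbitrary choices, it relies on os.pread (available on Unix-like systems), and reading a real device such as /dev/sda needs root. A read that the kernel eventually completes after a timeout and link reset, as in the dmesg output above, would not show up here as an error.

```python
import os


def scan_read_errors(path, chunk_size=1 << 20):
    """Read `path` sequentially; return (bytes_read_ok, offsets_of_failed_chunks)."""
    bad_offsets = []
    ok_bytes = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size of a regular file and,
        # on Linux, of a block device as well.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            want = min(chunk_size, size - offset)
            try:
                data = os.pread(fd, want, offset)
            except OSError:
                # The read failed: remember where, then skip this chunk.
                bad_offsets.append(offset)
                offset += want
                continue
            if not data:
                break  # unexpected end of data
            ok_bytes += len(data)
            offset += len(data)
    finally:
        os.close(fd)
    return ok_bytes, bad_offsets


# Example call (as root, on the disk under test):
#   ok, bad = scan_read_errors("/dev/sda")
```

Running it against the second, known-good disk first gives a baseline: an empty list of failed offsets there and a non-empty one on the suspect disk would point at the disk rather than the cable or port.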
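There are no reallocated sectors because the drive failed to reallocate them. Your drive shows 5 Offline_Uncorrectable sectors, which is what happens when the automatic repair fails. There are clear read failures in the dmesg output, errors in the SMART error log, and read failures in the SMART self-tests. As you mention in your question, there are ways to repair these sectors, but in my experience that is a very short-term fix.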
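The check described above can be sketched as a small filter over the attribute table, so that a PASSED overall status does not hide a nonzero sector counter. This assumes the ten-column table layout printed by smartctl 5.43 in the question; the set of watched IDs (5, 196, 197, 198) is my own choice of the reallocation and uncorrectable-sector attributes, and the function name is hypothetical.

```python
# Attribute IDs watched in this sketch: Reallocated_Sector_Ct,
# Reallocated_Event_Count, Current_Pending_Sector, Offline_Uncorrectable.
WATCHED = {5, 196, 197, 198}


def nonzero_sector_attrs(report):
    """Return (id, name, raw_value) for watched attributes with a nonzero raw value."""
    found = []
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) >= 10 and fields[0].isdigit() and int(fields[0]) in WATCHED:
            raw = fields[9]
            if raw.isdigit() and int(raw) != 0:
                found.append((int(fields[0]), fields[1], int(raw)))
    return found


SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       5
"""

print(nonzero_sector_attrs(SAMPLE))  # [(198, 'Offline_Uncorrectable', 5)]
```

On the output in the question, only Offline_Uncorrectable is flagged, which matches the picture of sectors that could not be read or remapped.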
Replace the drive.