Hard-Disk
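Is an mdadm failure a definitive state?
Today, mdadm sent me a message. Replacing the drive is no problem, as I have some spare drives.
I don't fully understand how mdadm decides that a drive has failed, so I have a question for you: is a failure a definitive state for mdadm, or can I somehow try to revive the drive? For example, I can still access the drive with gdisk, so technically the drive is not dead (yet), which is why I'm asking. Details: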
mdadm --detail /dev/md1
Output:
/dev/md1:
        Version : 1.2
  Creation Time : Sun Mar 26 17:25:30 2017
     Raid Level : raid1
     Array Size : 976630464 (931.39 GiB 1000.07 GB)
  Used Dev Size : 976630464 (931.39 GiB 1000.07 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Mon Oct 2 07:31:25 2017
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           Name : backup-server:1 (local to host backup-server)
           UUID : 319334f9:76d6fccf:d61307bd:2427b6ba
         Events : 13023

    Number   Major   Minor   RaidDevice State
       0       8       49        0      active sync   /dev/sdd1
       -       0        0        1      removed

       1       8       65        -      faulty   /dev/sde1
and
hdparm -I /dev/sde
Output:
/dev/sde:

ATA device, with non-removable media
    Model Number:       WDC WD1002F9YZ-09H1JL1
    Serial Number:      WD-WMC5K0D33MEU
    Firmware Revision:  01.01M03
    Transport:          Serial, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
    Supported: 8 7 6 5
    Likely used: 8
Configuration:
    Logical         max     current
    cylinders       16383   16383
    heads           16      16
    sectors/track   63      63
    --
    CHS current addressable sectors:   16514064
    LBA    user addressable sectors:  268435455
    LBA48  user addressable sectors: 1953525168
    Logical  Sector size:                   512 bytes
    Physical Sector size:                  4096 bytes
    Logical Sector-0 offset:                  0 bytes
    device size with M = 1024*1024:      953869 MBytes
    device size with M = 1000*1000:     1000204 MBytes (1000 GB)
    cache/buffer size  = unknown
    Form Factor: 3.5 inch
    Nominal Media Rotation Rate: 7200
Capabilities:
    LBA, IORDY(can be disabled)
    Queue depth: 32
    Standby timer values: spec'd by Standard, with device specific minimum
    R/W multiple sector transfer: Max = 16  Current = 0
    Advanced power management level: 128
    DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
         Cycle time: min=120ns recommended=120ns
    PIO: pio0 pio1 pio2 pio3 pio4
         Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
    Enabled Supported:
       *    SMART feature set
            Security Mode feature set
       *    Power Management feature set
       *    Write cache
       *    Look-ahead
       *    WRITE_BUFFER command
       *    READ_BUFFER command
       *    NOP cmd
       *    DOWNLOAD_MICROCODE
       *    Advanced Power Management feature set
            Power-Up In Standby feature set
       *    SET_FEATURES required to spinup after power up
       *    48-bit Address feature set
       *    Mandatory FLUSH_CACHE
       *    FLUSH_CACHE_EXT
       *    SMART error logging
       *    SMART self-test
       *    General Purpose Logging feature set
       *    WRITE_{DMA|MULTIPLE}_FUA_EXT
       *    64-bit World wide name
       *    IDLE_IMMEDIATE with UNLOAD
       *    WRITE_UNCORRECTABLE_EXT command
       *    {READ,WRITE}_DMA_EXT_GPL commands
       *    Segmented DOWNLOAD_MICROCODE
            unknown 119[7]
       *    Gen1 signaling speed (1.5Gb/s)
       *    Gen2 signaling speed (3.0Gb/s)
       *    Gen3 signaling speed (6.0Gb/s)
       *    Native Command Queueing (NCQ)
       *    Phy event counters
       *    Idle-Unload when NCQ is active
       *    NCQ priority information
       *    READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
       *    DMA Setup Auto-Activate optimization
       *    Software settings preservation
       *    SMART Command Transport (SCT) feature set
       *    SCT Write Same (AC2)
       *    SCT Error Recovery Control (AC3)
       *    SCT Features Control (AC4)
       *    SCT Data Tables (AC5)
            unknown 206[7]
            unknown 206[12] (vendor specific)
            unknown 206[13] (vendor specific)
       *    DOWNLOAD MICROCODE DMA command
       *    WRITE BUFFER DMA command
       *    READ BUFFER DMA command
Security:
    Master password revision code = 65534
            supported
    not     enabled
    not     locked
    not     frozen
    not     expired: security count
            supported: enhanced erase
    112min for SECURITY ERASE UNIT. 112min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 50014ee05950af82
    NAA         : 5
    IEEE OUI    : 0014ee
    Unique ID   : 05950af82
Checksum: correct
and
smartctl -a /dev/sde
Output:
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-3-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Se
Device Model:     WDC WD1002F9YZ-09H1JL1
Serial Number:    WD-WMC5K0D33MEU
LU WWN Device Id: 5 0014ee 05950af82
Firmware Version: 01.01M03
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Oct 2 07:41:14 2017 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (10560) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 118) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x30bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   174   171   021    Pre-fail  Always       -       2291
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       202
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   093   093   000    Old_age   Always       -       5402
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       202
 16 Unknown_Attribute       0x0022   255   000   000    Old_age   Always       -       8668797885185
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       65
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       136
194 Temperature_Celsius     0x0022   106   094   000    Old_age   Always       -       37
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       20
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
It isn't final; you can try re-adding the drive with --re-add. There is even a variant that automatically re-adds all failed devices:
mdadm --re-add /dev/md1 faulty
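As a rough sketch of that workflow, one could first list the members that mdadm currently marks as faulty and then re-add them one at a time. The helper name faulty_members is made up for illustration, and it assumes the device table layout of the mdadm --detail output shown in the question; the device names are the ones from the question.

```shell
#!/bin/sh
# Hypothetical helper (not part of mdadm): read `mdadm --detail` output on
# stdin and print the device node of every member whose row says "faulty".
faulty_members() {
    awk '/faulty/ && $NF ~ /^\/dev\// { print $NF }'
}

# Typical use, as root (shown as comments, nothing is executed here):
#   mdadm --detail /dev/md1 | faulty_members   # which members are faulty?
#   mdadm /dev/md1 --remove /dev/sde1          # detach the faulty member
#   mdadm /dev/md1 --re-add /dev/sde1          # try to put it back
#   cat /proc/mdstat                           # watch the recovery progress
```

Since the array in the question has an internal write-intent bitmap, a successful re-add should only need to resync the blocks that changed while the drive was out of the array. If the re-add is refused, the drive can still be added back as a fresh member with --add, at the cost of a full resync.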
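The kernel log should tell you why the drive was marked as failed. Given the SMART status, I suspect UDMA CRC errors. You can also look at the extended error log on the drive with
smartctl -x /dev/sde
The entries there should indicate the nature of the errors; for example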
Error 10 [9] occurred at disk power-on lifetime: 31192 hours (1299 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 73 30 a5 58 40 00  Error: UNC at LBA = 0x7330a558 = 1932567896

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 05 00 00 e0 00 00 73 30 a1 00 40 08 13d+02:07:12.334  READ FPDMA QUEUED
  60 00 08 00 d8 00 00 03 d3 aa c0 40 08 13d+02:07:12.334  READ FPDMA QUEUED
  60 05 00 00 d0 00 00 73 30 9c 00 40 08 13d+02:07:12.327  READ FPDMA QUEUED
  60 00 08 00 c8 00 00 03 d3 a9 90 40 08 13d+02:07:12.327  READ FPDMA QUEUED
  60 05 00 00 c0 00 00 73 30 97 00 40 08 13d+02:07:12.321  READ FPDMA QUEUED
(from one of my SMART test drives).
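To follow up on the CRC suspicion, it may help to track the raw value of SMART attribute 199 over time and to read the kernel messages from around the failure. The sketch below is only an illustration: crc_error_count is a made-up helper name, and it assumes the attribute row layout that smartctl printed in the question (attribute name in the second column, raw value in the last).

```shell
#!/bin/sh
# Hypothetical helper (not part of smartmontools): read smartctl attribute
# output on stdin and print the raw value of UDMA_CRC_Error_Count.
crc_error_count() {
    awk '$2 == "UDMA_CRC_Error_Count" { print $NF }'
}

# Typical use, as root (shown as comments, nothing is executed here):
#   smartctl -A /dev/sde | crc_error_count   # 20 in the question's output
#   dmesg | grep -i -e sde -e md1            # kernel messages about the drive and array
```

A count that keeps rising between checks is commonly associated with the link (cable, connector, or port) rather than with the platters, so reseating or swapping the SATA cable before re-adding the drive is a reasonable first step. A count that stays at its old value suggests the errors are historical.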