Hard-Disk
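Replacing old SATA cables: could the old ones be causing dmesg errors on the HDDs?
Background
When I assembled one of my servers with brand-new WD Red 3TB disks, I may have made a serious error of judgment by using one old and one very old SATA (Wikipedia) cable. My question is essentially about hardware, with some history behind it; the system runs Debian 10 Linux.
There are two cables in question:
- One data cable that has been in use for 5+ years but is still SATA III certified. IMHO it could go bad if bent too much or otherwise mishandled, but I am not aware of any abuse, and I was wondering whether the shielding might even have been better in those years (?), and so on.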
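- To my surprise, the same server also contained a SATA cable that may be 15+ years old, labeled only "Serial ATA" and nothing else.
Considering the growing problems with my mdadm RAID 1 array since I assembled this server, in the form of various dmesg disk I/O error messages and eventually a degraded array, I suspect that this cable, or possibly both, caused the errors while reading from and writing to the array.
Replacing the cables
What I did today was buy 2 new SATA III certified cables, made in Germany (according to the sticker) and presumably of higher quality, to see what happens.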
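Tests
- I booted the server, unmounted the array and stopped it.
- I started these two separate whole-disk read commands (one per disk) to run overnight:
pv < /dev/sdX > /dev/null
I also started monitoring dmesg and nmon. After 1 hour, no errors or slowdowns have occurred so far…
Question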
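Supposing I wake up and dmesg shows no errors after both hard drives have been read in full, may I conclude that the old cables were the source of the errors, or is there something I have not considered?
I could not decide whether to post this here or on SuperUser. If it fits better elsewhere and enough comments say so, I will repost it in the morning. In any case, thank you for your time.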
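For the morning check, a small filter over the kernel log may be easier than reading all of dmesg by eye. This is only a sketch: the function names are made up for illustration, and the pattern list is an assumption about what libata and block-layer trouble usually looks like, not a complete catalogue of messages.

```shell
#!/bin/sh
# Filter kernel log text (on stdin) for lines that commonly accompany
# SATA link or disk I/O trouble. The pattern list is an assumption,
# not an exhaustive list of libata messages.
ata_errors() {
    grep -E -i 'ata[0-9]+(\.[0-9]+)?: .*(error|exception|failed|reset)|I/O error|blk_update_request' || true
}

# Count the matching lines; prints 0 when nothing suspicious was found.
ata_error_count() {
    ata_errors | wc -l | tr -d ' '
}

# Usage once the overnight read has finished (as root):
#   dmesg | ata_error_count
#   dmesg | ata_errors
```

A count of 0 after a full read of both disks would only show that reads are clean with the new cables; it says nothing yet about writes.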
smartctl
WD-WCC4N6EZXNSD
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-9-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-WCC4N6EZXNSD
LU WWN Device Id: 5 0014ee 210a9a0ef
Firmware Version: 82.00A82
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Jun 20 08:47:05 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (40380) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 405) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   179   178   021    Pre-fail  Always       -       6050
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       31
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2443
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       31
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       3
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2423
194 Temperature_Celsius     0x0022   116   109   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 2033 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2033 occurred at disk power-on lifetime: 424 hours (17 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 02 00 00 00 a0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 10 02 00 00 00 a0 00   2d+13:04:28.795  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 00   2d+13:04:28.794  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00   2d+13:04:28.794  SET FEATURES [Set transfer mode]
  ef 10 02 00 00 00 a0 00   2d+13:04:28.794  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 00   2d+13:04:28.793  IDENTIFY DEVICE

Error 2032 occurred at disk power-on lifetime: 424 hours (17 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 46 00 00 00 a0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 03 46 00 00 00 a0 00   2d+13:04:28.794  SET FEATURES [Set transfer mode]
  ef 10 02 00 00 00 a0 00   2d+13:04:28.794  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 00   2d+13:04:28.793  IDENTIFY DEVICE
  c8 00 08 00 00 00 e0 00   2d+13:04:28.779  READ DMA
  ef 10 02 00 00 00 a0 00   2d+13:04:28.779  SET FEATURES [Enable SATA feature]

Error 2031 occurred at disk power-on lifetime: 424 hours (17 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 02 00 00 00 a0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 10 02 00 00 00 a0 00   2d+13:04:28.794  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 00   2d+13:04:28.793  IDENTIFY DEVICE
  c8 00 08 00 00 00 e0 00   2d+13:04:28.779  READ DMA
  ef 10 02 00 00 00 a0 00   2d+13:04:28.779  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 00   2d+13:04:28.778  IDENTIFY DEVICE

Error 2030 occurred at disk power-on lifetime: 424 hours (17 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 08 00 00 00 e0  Device Fault; Error: ABRT 8 sectors at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 00 00 00 e0 00   2d+13:04:28.779  READ DMA
  ef 10 02 00 00 00 a0 00   2d+13:04:28.779  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 00   2d+13:04:28.778  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00   2d+13:04:28.778  SET FEATURES [Set transfer mode]
  ef 10 02 00 00 00 a0 00   2d+13:04:28.778  SET FEATURES [Enable SATA feature]

Error 2029 occurred at disk power-on lifetime: 424 hours (17 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 02 00 00 00 a0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 10 02 00 00 00 a0 00   2d+13:04:28.779  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 00   2d+13:04:28.778  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00   2d+13:04:28.778  SET FEATURES [Set transfer mode]
  ef 10 02 00 00 00 a0 00   2d+13:04:28.778  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 00   2d+13:04:28.777  IDENTIFY DEVICE

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
WD-WCC4N5EKLTNX
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-9-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-WCC4N5EKLTNX
LU WWN Device Id: 5 0014ee 2bb548051
Firmware Version: 82.00A82
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Jun 20 08:50:48 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (39540) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 397) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   180   179   021    Pre-fail  Always       -       5975
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       32
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2443
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       31
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       4
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2443
194 Temperature_Celsius     0x0022   115   107   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 45 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 45 occurred at disk power-on lifetime: 2416 hours (100 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 02 00 00 00 a0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 10 02 00 00 00 a0 08      04:26:20.066  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08      04:26:20.066  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08      04:26:20.066  SET FEATURES [Set transfer mode]
  ef 10 02 00 00 00 a0 08      04:26:20.065  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08      04:26:20.065  IDENTIFY DEVICE

Error 44 occurred at disk power-on lifetime: 2416 hours (100 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 46 00 00 00 a0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 03 46 00 00 00 a0 08      04:26:20.066  SET FEATURES [Set transfer mode]
  ef 10 02 00 00 00 a0 08      04:26:20.065  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08      04:26:20.065  IDENTIFY DEVICE
  ef 10 02 00 00 00 a0 08      04:26:20.046  SET FEATURES [Enable SATA feature]

Error 43 occurred at disk power-on lifetime: 2416 hours (100 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 02 00 00 00 a0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 10 02 00 00 00 a0 08      04:26:20.065  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08      04:26:20.065  IDENTIFY DEVICE
  ef 10 02 00 00 00 a0 08      04:26:20.046  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08      04:26:20.046  IDENTIFY DEVICE

Error 42 occurred at disk power-on lifetime: 2416 hours (100 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 02 00 00 00 a0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 10 02 00 00 00 a0 08      04:26:20.046  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08      04:26:20.046  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08      04:26:20.046  SET FEATURES [Set transfer mode]
  ef 10 02 00 00 00 a0 08      04:26:20.045  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08      04:26:20.045  IDENTIFY DEVICE

Error 41 occurred at disk power-on lifetime: 2416 hours (100 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 46 00 00 00 a0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 03 46 00 00 00 a0 08      04:26:20.046  SET FEATURES [Set transfer mode]
  ef 10 02 00 00 00 a0 08      04:26:20.045  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 08      04:26:20.045  IDENTIFY DEVICE
  ef 10 02 00 00 00 a0 08      04:26:20.030  SET FEATURES [Enable SATA feature]

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
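Two numbers in these reports are worth tracking across the cable swap: the "ATA Error Count" from the error log, and the raw value of attribute 199 (UDMA_CRC_Error_Count), which is commonly read as an interface/cable indicator and is 0 on both disks here. A rough sketch for pulling them out of saved smartctl output follows; the function names are hypothetical, and the parsing assumes the standard attribute table layout shown above.

```shell
#!/bin/sh
# Print the RAW_VALUE (last column) of one SMART attribute, selected by ID,
# from `smartctl -a` or `smartctl -A` text on stdin.
# The FLAG column (0x....) is checked so that register dumps in the
# error log are not mistaken for attribute rows.
smart_raw() {
    awk -v id="$1" '$1 == id && $3 ~ /^0x[0-9a-f]+$/ { print $NF; exit }'
}

# Print the number that follows "ATA Error Count:" in `smartctl -a` text.
ata_error_log_count() {
    sed -n 's/^ATA Error Count: *\([0-9][0-9]*\).*/\1/p'
}

# Usage (as root), before and after a test run:
#   smartctl -a /dev/sdX | smart_raw 199
#   smartctl -a /dev/sdX | ata_error_log_count
```

If the error count stops growing after the swap while everything else stays the same, that would be consistent with the cable theory, though it is not proof on its own.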
Warranty
Both hard drives are still under warranty, so I can have them replaced if there is proof of a fault.
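After replacing the SATA cables (especially the SATA v1 one)
So, what actually happened after replacing both SATA cables?
- First, as mentioned in my question, I read both drives in full without a single error!
- Second, I was aware that the errors might be write-specific, so I ran a write test as well!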
No more errors show up in dmesg, which makes me happy and confirms my theory. It saddens me that I did not realize how old that cable was when I assembled the server. In any case, the problem is now gone.
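The post does not say how the write test was run. One non-destructive possibility, sketched here purely as an assumption, is to write random data to a file on the array and compare checksums afterwards. The function name, the file name write_test.bin and the default size are made up for illustration.

```shell
#!/bin/sh
# Write N MiB of random data into a directory and verify it by checksum.
# Returns 0 when the copy on the target matches the reference.
# ASSUMPTION: $1 is a directory on the filesystem that lives on the array.
write_verify() {
    dir=$1
    mib=${2:-64}
    ref=$(mktemp) || return 2
    # Reference data is created in the default temp directory.
    dd if=/dev/urandom of="$ref" bs=1M count="$mib" 2>/dev/null
    # Copy to the target and flush dirty pages to the disks.
    cp "$ref" "$dir/write_test.bin" && sync
    want=$(sha256sum < "$ref")
    got=$(sha256sum < "$dir/write_test.bin")
    rm -f "$ref" "$dir/write_test.bin"
    [ "$want" = "$got" ]
}

# Usage: write_verify /mnt/array 1024 && echo "write test OK"
```

Note that the read-back may be served from the page cache rather than the disks; dropping caches first (as root, `echo 3 > /proc/sys/vm/drop_caches`) or using a file larger than RAM makes the check more meaningful. Watching dmesg during the run remains the main signal.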
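The relevant SATA revisions were released around 2002 (1.5 Gb/s), 2005 (3.0 Gb/s) and 2008 (6.0 Gb/s), so your cables date from the 1.5 or 3.0 Gb/s era. In theory, old cables should work with newer, faster devices, but problems with such combinations are well known.
You can get the current SATA link speed with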
smartctl -a /dev/sda | grep SATA
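If only the negotiated speed is of interest, the "(current: ...)" part of that line can be cut out. A minimal sketch, assuming the "SATA Version is:" line has the same shape as in the reports in the question (the function name is hypothetical):

```shell
#!/bin/sh
# Extract the negotiated link speed from the "SATA Version is:" line
# of `smartctl -a` / `smartctl -i` text on stdin.
sata_current_speed() {
    sed -n 's/^SATA Version is:.*(current: \([^)]*\)).*/\1/p'
}

# Usage (as root):
#   smartctl -i /dev/sda | sata_current_speed
```

For both disks in the question this would report 6.0 Gb/s, since that is what their smartctl output shows as the current speed.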
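You can force the kernel to configure the link at a lower speed with the kernel parameter libata.force=1.5 (spelled 1.5Gbps in the kernel documentation). If the problem goes away with the old cable still in place and this kernel parameter set, then I would be reasonably sure that the cable is the problem.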
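Before drawing conclusions from such a test, it helps to confirm that the running kernel was actually booted with the parameter. The helper below is a sketch with a hypothetical name; it only parses a command line string. On Debian with GRUB (an assumption about the boot loader), the parameter would typically be added to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by update-grub and a reboot.

```shell
#!/bin/sh
# Print the value of libata.force from a kernel command line on stdin,
# or nothing if the parameter is absent.
libata_force_value() {
    tr ' ' '\n' | sed -n 's/^libata\.force=//p'
}

# Usage on the running system:
#   libata_force_value < /proc/cmdline
```

After the reboot, the link speed check with smartctl should then show the lower current speed while the old cable is still connected.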