Hard-Disk

Replacing old SATA cables. Can old ones cause dmesg errors on HDDs?

  • June 20, 2020

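Background

When I assembled one of my servers with brand-new WD Red 3TB disks, I may have made a serious error in judgment by using one old and one very old SATA (Wikipedia) cable. My question is essentially a hardware one, with some history of running under a Debian 10 Linux system.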


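There are two cables in question:

  • One is a data cable that has been in use for 5+ years but is still SATA III certified. If it were bent too much or otherwise mistreated, IMHO it could go bad, but I am not aware of any abuse. I am also wondering whether shielding has improved over those years (?), and so on.
  • I was surprised to find, in the same server, a SATA cable that may be 15+ years old, with only "Serial ATA" written on it and nothing else. Ever since I assembled this server, my mdadm RAID 1 array has had growing problems, in the form of various dmesg disk I/O error messages and eventually a degraded array. I therefore suspect that this cable, or possibly both, could be causing the errors when reading from and writing to the array.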

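Replacing the cables

What I did today was buy 2 new cables, made in Germany (according to the sticker), presumably of higher quality and SATA III certified, to see what would happen.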


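Testing

  1. I booted the server, unmounted the array, and stopped it.
  2. I then started these two separate whole-disk read commands (one per disk) to run overnight:
pv < /dev/sdX > /dev/null
  3. I also started monitoring dmesg and nmon. After 1 hour, no errors or slowdowns have appeared so far…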
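Watching dmesg by eye overnight is easy to get wrong, so here is a minimal sketch of how the kernel log text could be filtered for disk-related trouble afterwards. The function name `find_disk_errors` and the pattern list are assumptions made for illustration; the patterns cover a few message shapes that libata, the block layer, and md commonly emit, and are not exhaustive.

```python
import re

# Illustrative patterns only: a few message shapes commonly seen when a
# SATA link or disk misbehaves. Extend the list to match your own logs.
DISK_ERROR_PATTERNS = [
    # libata port/device messages such as "ata3.00: exception Emask ..."
    re.compile(
        r"\bata\d+(\.\d+)?: .*"
        r"(exception|failed command|hard resetting link|link is slow to respond)",
        re.IGNORECASE,
    ),
    # Block-layer I/O errors, e.g. "... I/O error, dev sdb, sector 2048"
    re.compile(r"I/O error, dev \w+", re.IGNORECASE),
    # md RAID 1 kicking a member out of the array
    re.compile(r"\bmd/raid1:.*Disk failure", re.IGNORECASE),
]


def find_disk_errors(log_text):
    """Return the lines of log_text that match any pattern above."""
    return [
        line
        for line in log_text.splitlines()
        if any(p.search(line) for p in DISK_ERROR_PATTERNS)
    ]
```

One way to use it is to save the log after the read test (for example `dmesg > dmesg.txt`) and pass the file contents to `find_disk_errors`; an empty result only means none of these particular patterns appeared.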

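Question

Supposing I wake up and dmesg shows no errors after those disks have been read in full, can I conclude that the old cables were the source of the errors, or is there something I have not considered?

I could not decide whether to post this here or on SuperUser. If it fits better elsewhere and enough comments say so, I will repost it in the morning. In any case, thank you for your time.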


smartctl

WD-WCC4N6EZXNSD

smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-9-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-WCC4N6EZXNSD
LU WWN Device Id: 5 0014ee 210a9a0ef
Firmware Version: 82.00A82
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Jun 20 08:47:05 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                   was never started.
                   Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                   without error or no self-test has ever 
                   been run.
Total time to complete Offline 
data collection:        (40380) seconds.
Offline data collection
capabilities:            (0x7b) SMART execute Offline immediate.
                   Auto Offline data collection on/off support.
                   Suspend Offline collection upon new
                   command.
                   Offline surface scan supported.
                   Self-test supported.
                   Conveyance Self-test supported.
                   Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                   power-saving mode.
                   Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                   General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    ( 405) minutes.
Conveyance self-test routine
recommended polling time:    (   5) minutes.
SCT capabilities:          (0x703d) SCT Status supported.
                   SCT Error Recovery Control supported.
                   SCT Feature Control supported.
                   SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
 1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
 3 Spin_Up_Time            0x0027   179   178   021    Pre-fail  Always       -       6050
 4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       31
 5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
 7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
 9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2443
10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       31
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       3
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2423
194 Temperature_Celsius     0x0022   116   109   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 2033 (device log contains only the most recent five errors)
   CR = Command Register [HEX]
   FR = Features Register [HEX]
   SC = Sector Count Register [HEX]
   SN = Sector Number Register [HEX]
   CL = Cylinder Low Register [HEX]
   CH = Cylinder High Register [HEX]
   DH = Device/Head Register [HEX]
   DC = Device Command Register [HEX]
   ER = Error register [HEX]
   ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2033 occurred at disk power-on lifetime: 424 hours (17 days + 16 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 04 61 02 00 00 00 a0  Device Fault; Error: ABRT

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 ef 10 02 00 00 00 a0 00   2d+13:04:28.795  SET FEATURES [Enable SATA feature]
 ec 00 00 00 00 00 a0 00   2d+13:04:28.794  IDENTIFY DEVICE
 ef 03 46 00 00 00 a0 00   2d+13:04:28.794  SET FEATURES [Set transfer mode]
 ef 10 02 00 00 00 a0 00   2d+13:04:28.794  SET FEATURES [Enable SATA feature]
 ec 00 00 00 00 00 a0 00   2d+13:04:28.793  IDENTIFY DEVICE

Error 2032 occurred at disk power-on lifetime: 424 hours (17 days + 16 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 04 61 46 00 00 00 a0  Device Fault; Error: ABRT

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 ef 03 46 00 00 00 a0 00   2d+13:04:28.794  SET FEATURES [Set transfer mode]
 ef 10 02 00 00 00 a0 00   2d+13:04:28.794  SET FEATURES [Enable SATA feature]
 ec 00 00 00 00 00 a0 00   2d+13:04:28.793  IDENTIFY DEVICE
 c8 00 08 00 00 00 e0 00   2d+13:04:28.779  READ DMA
 ef 10 02 00 00 00 a0 00   2d+13:04:28.779  SET FEATURES [Enable SATA feature]

Error 2031 occurred at disk power-on lifetime: 424 hours (17 days + 16 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 04 61 02 00 00 00 a0  Device Fault; Error: ABRT

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 ef 10 02 00 00 00 a0 00   2d+13:04:28.794  SET FEATURES [Enable SATA feature]
 ec 00 00 00 00 00 a0 00   2d+13:04:28.793  IDENTIFY DEVICE
 c8 00 08 00 00 00 e0 00   2d+13:04:28.779  READ DMA
 ef 10 02 00 00 00 a0 00   2d+13:04:28.779  SET FEATURES [Enable SATA feature]
 ec 00 00 00 00 00 a0 00   2d+13:04:28.778  IDENTIFY DEVICE

Error 2030 occurred at disk power-on lifetime: 424 hours (17 days + 16 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 04 61 08 00 00 00 e0  Device Fault; Error: ABRT 8 sectors at LBA = 0x00000000 = 0

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 c8 00 08 00 00 00 e0 00   2d+13:04:28.779  READ DMA
 ef 10 02 00 00 00 a0 00   2d+13:04:28.779  SET FEATURES [Enable SATA feature]
 ec 00 00 00 00 00 a0 00   2d+13:04:28.778  IDENTIFY DEVICE
 ef 03 46 00 00 00 a0 00   2d+13:04:28.778  SET FEATURES [Set transfer mode]
 ef 10 02 00 00 00 a0 00   2d+13:04:28.778  SET FEATURES [Enable SATA feature]

Error 2029 occurred at disk power-on lifetime: 424 hours (17 days + 16 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 04 61 02 00 00 00 a0  Device Fault; Error: ABRT

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 ef 10 02 00 00 00 a0 00   2d+13:04:28.779  SET FEATURES [Enable SATA feature]
 ec 00 00 00 00 00 a0 00   2d+13:04:28.778  IDENTIFY DEVICE
 ef 03 46 00 00 00 a0 00   2d+13:04:28.778  SET FEATURES [Set transfer mode]
 ef 10 02 00 00 00 a0 00   2d+13:04:28.778  SET FEATURES [Enable SATA feature]
 ec 00 00 00 00 00 a0 00   2d+13:04:28.777  IDENTIFY DEVICE

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Not_testing
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing
Selective self-test flags (0x0):
 After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

WD-WCC4N5EKLTNX

smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-9-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-WCC4N5EKLTNX
LU WWN Device Id: 5 0014ee 2bb548051
Firmware Version: 82.00A82
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Jun 20 08:50:48 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                   was never started.
                   Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                   without error or no self-test has ever 
                   been run.
Total time to complete Offline 
data collection:        (39540) seconds.
Offline data collection
capabilities:            (0x7b) SMART execute Offline immediate.
                   Auto Offline data collection on/off support.
                   Suspend Offline collection upon new
                   command.
                   Offline surface scan supported.
                   Self-test supported.
                   Conveyance Self-test supported.
                   Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                   power-saving mode.
                   Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                   General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    ( 397) minutes.
Conveyance self-test routine
recommended polling time:    (   5) minutes.
SCT capabilities:          (0x703d) SCT Status supported.
                   SCT Error Recovery Control supported.
                   SCT Feature Control supported.
                   SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
 1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
 3 Spin_Up_Time            0x0027   180   179   021    Pre-fail  Always       -       5975
 4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       32
 5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
 7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
 9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2443
10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       31
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       4
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2443
194 Temperature_Celsius     0x0022   115   107   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 45 (device log contains only the most recent five errors)
   CR = Command Register [HEX]
   FR = Features Register [HEX]
   SC = Sector Count Register [HEX]
   SN = Sector Number Register [HEX]
   CL = Cylinder Low Register [HEX]
   CH = Cylinder High Register [HEX]
   DH = Device/Head Register [HEX]
   DC = Device Command Register [HEX]
   ER = Error register [HEX]
   ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 45 occurred at disk power-on lifetime: 2416 hours (100 days + 16 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 04 61 02 00 00 00 a0  Device Fault; Error: ABRT

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 ef 10 02 00 00 00 a0 08      04:26:20.066  SET FEATURES [Enable SATA feature]
 ec 00 00 00 00 00 a0 08      04:26:20.066  IDENTIFY DEVICE
 ef 03 46 00 00 00 a0 08      04:26:20.066  SET FEATURES [Set transfer mode]
 ef 10 02 00 00 00 a0 08      04:26:20.065  SET FEATURES [Enable SATA feature]
 ec 00 00 00 00 00 a0 08      04:26:20.065  IDENTIFY DEVICE

Error 44 occurred at disk power-on lifetime: 2416 hours (100 days + 16 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 04 61 46 00 00 00 a0  Device Fault; Error: ABRT

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 ef 03 46 00 00 00 a0 08      04:26:20.066  SET FEATURES [Set transfer mode]
 ef 10 02 00 00 00 a0 08      04:26:20.065  SET FEATURES [Enable SATA feature]
 ec 00 00 00 00 00 a0 08      04:26:20.065  IDENTIFY DEVICE
 ef 10 02 00 00 00 a0 08      04:26:20.046  SET FEATURES [Enable SATA feature]

Error 43 occurred at disk power-on lifetime: 2416 hours (100 days + 16 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 04 61 02 00 00 00 a0  Device Fault; Error: ABRT

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 ef 10 02 00 00 00 a0 08      04:26:20.065  SET FEATURES [Enable SATA feature]
 ec 00 00 00 00 00 a0 08      04:26:20.065  IDENTIFY DEVICE
 ef 10 02 00 00 00 a0 08      04:26:20.046  SET FEATURES [Enable SATA feature]
 ec 00 00 00 00 00 a0 08      04:26:20.046  IDENTIFY DEVICE

Error 42 occurred at disk power-on lifetime: 2416 hours (100 days + 16 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 04 61 02 00 00 00 a0  Device Fault; Error: ABRT

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 ef 10 02 00 00 00 a0 08      04:26:20.046  SET FEATURES [Enable SATA feature]
 ec 00 00 00 00 00 a0 08      04:26:20.046  IDENTIFY DEVICE
 ef 03 46 00 00 00 a0 08      04:26:20.046  SET FEATURES [Set transfer mode]
 ef 10 02 00 00 00 a0 08      04:26:20.045  SET FEATURES [Enable SATA feature]
 ec 00 00 00 00 00 a0 08      04:26:20.045  IDENTIFY DEVICE

Error 41 occurred at disk power-on lifetime: 2416 hours (100 days + 16 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 04 61 46 00 00 00 a0  Device Fault; Error: ABRT

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 ef 03 46 00 00 00 a0 08      04:26:20.046  SET FEATURES [Set transfer mode]
 ef 10 02 00 00 00 a0 08      04:26:20.045  SET FEATURES [Enable SATA feature]
 ec 00 00 00 00 00 a0 08      04:26:20.045  IDENTIFY DEVICE
 ef 10 02 00 00 00 a0 08      04:26:20.030  SET FEATURES [Enable SATA feature]

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Not_testing
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing
Selective self-test flags (0x0):
 After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Warranty

Both drives are still under warranty, so I can have them replaced if there is proof of a fault.

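After replacing the SATA cables (the SATA v1 one in particular)

So, what exactly happened after I replaced both SATA cables?

  • First, as mentioned in my question, I read both drives in full with no errors!
  • Second, I was aware that the errors might be write-specific, so I ran a write test as well!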

The screenshot from after the cable swap:

[Image: After replacing the SATA cables]


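As you can see for yourself, there are no more errors in dmesg, which makes me happy and also confirms my theory. It saddens me that I did not realize how old that cable was when I assembled the server. In any case, the problem is now gone.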

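The relevant SATA revisions were released around 2002 (1.5G), 2005 (3.0G), and 2008 (6.0G). So your cables date from the 1.5 or 3.0 era. In theory, old cables should work with newer, faster devices, but problems with this combination are well known.

You can get the current SATA link speed with: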

smartctl -a /dev/sda | grep SATA
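For scripting the same check, the "SATA Version is:" line can be parsed to compare the negotiated speed with the maximum the drive reports. This is a small sketch based on the line format seen in the smartctl reports in the question; `link_speeds` is a hypothetical helper, and other smartctl versions may format the line differently.

```python
import re

# Matches e.g. "SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)"
SATA_LINE = re.compile(
    r"SATA Version is:\s+(?P<version>SATA [\d.]+),\s+(?P<max>[\d.]+) Gb/s"
    r"(?:\s+\(current:\s+(?P<current>[\d.]+) Gb/s\))?"
)


def link_speeds(smartctl_output):
    """Return (max_gbps, current_gbps) or None if the line is absent.

    current_gbps is None when smartctl did not print a "(current: ...)" part.
    """
    m = SATA_LINE.search(smartctl_output)
    if not m:
        return None
    max_speed = float(m.group("max"))
    current = float(m.group("current")) if m.group("current") else None
    return max_speed, current
```

A current speed below the maximum means the link came up at a reduced rate, which is what you would expect to see after deliberately limiting it.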

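You can force the kernel to configure the link at a lower speed with the kernel parameter libata.force=1.5. If the problem goes away with the old cable still in place and this parameter set, then I would be reasonably sure that the cable is the problem.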
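Before drawing conclusions from such a test, it helps to confirm that the parameter actually reached the running kernel. A minimal sketch, assuming the kernel command line is read from /proc/cmdline; `libata_force_values` is a hypothetical helper name.

```python
def libata_force_values(cmdline):
    """Return the comma-separated values of every libata.force= token.

    cmdline is the kernel command line as a single string, for example
    the contents of /proc/cmdline.
    """
    values = []
    for token in cmdline.split():
        if token.startswith("libata.force="):
            # Keep everything after the first "=" and split the list.
            values.extend(v for v in token.split("=", 1)[1].split(",") if v)
    return values
```

For example, `libata_force_values(open("/proc/cmdline").read())` returns an empty list when the parameter was not passed at boot. Combined with the current link speed reported by smartctl, this shows whether the limit took effect.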

Source: https://unix.stackexchange.com/questions/593886