Raid

Repairing a RAID5 array

  • November 17, 2018

I am trying to repair a RAID5 array consisting of three 2TB disks. After running perfectly for a while, the computer (running Debian) suddenly failed to boot and got stuck at the GRUB prompt. I am fairly sure this is related to the RAID array.

Since it is hard to fully describe everything I have already tried, I will describe the current state instead.

mdadm --detail /dev/md0 outputs:

/dev/md0:
        Version : 1.2
  Creation Time : Sun Mar 22 15:13:25 2015
     Raid Level : raid5
  Used Dev Size : 1953381888 (1862.89 GiB 2000.26 GB)
   Raid Devices : 3
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Sun Mar 22 16:18:56 2015
          State : active, degraded, Not Started
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : ubuntu:0  (local to host ubuntu)
           UUID : ae2b72c0:60444678:25797b77:3695130a
         Events : 57

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       17        1      active sync   /dev/sdb1
       2       8       33        2      active sync   /dev/sdc1

mdadm --examine /dev/sda1 gives:

mdadm: No md superblock detected on /dev/sda1.

That makes sense, because I reformatted this partition when I thought it was the faulty one.

mdadm --examine /dev/sdb1 gives:

/dev/sdb1:
      Magic : a92b4efc
      Version : 1.2
      Feature Map : 0x0
      Array UUID : ae2b72c0:60444678:25797b77:3695130a
      Name : ubuntu:0  (local to host ubuntu)
      Creation Time : Sun Mar 22 15:13:25 2015
      Raid Level : raid5
      Raid Devices : 3

Avail Dev Size : 3906764800 (1862.89 GiB 2000.26 GB)
Array Size : 3906763776 (3725.78 GiB 4000.53 GB)
Used Dev Size : 3906763776 (1862.89 GiB 2000.26 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
State : clean
Device UUID : f1817af9:1d964693:774d5d63:bfa69e3d

Update Time : Sun Mar 22 16:18:56 2015
Checksum : ab7c79ae - correct
Events : 57

Layout : left-symmetric
Chunk Size : 512K

Device Role : Active device 1
Array State : .AA ('A' == active, '.' == missing)

mdadm --examine /dev/sdc1 gives:

/dev/sdc1:
   Magic : a92b4efc
   Version : 1.2
   Feature Map : 0x0
   Array UUID : ae2b72c0:60444678:25797b77:3695130a
   Name : ubuntu:0  (local to host ubuntu)
   Creation Time : Sun Mar 22 15:13:25 2015
   Raid Level : raid5
   Raid Devices : 3

   Avail Dev Size : 3906764800 (1862.89 GiB 2000.26 GB)
   Array Size : 3906763776 (3725.78 GiB 4000.53 GB)
   Used Dev Size : 3906763776 (1862.89 GiB 2000.26 GB)
   Data Offset : 262144 sectors
   Super Offset : 8 sectors
   State : clean
   Device UUID : f076b568:007e3f9b:71a19ea2:474e5fe9

   Update Time : Sun Mar 22 16:18:56 2015
   Checksum : db25214 - correct
   Events : 57

   Layout : left-symmetric
   Chunk Size : 512K

   Device Role : Active device 2
   Array State : .AA ('A' == active, '.' == missing)

cat /proc/mdstat:

Personalities : [raid6] [raid5] [raid4] 
md0 : inactive sdb1[1] sdc1[2]
 3906764800 blocks super 1.2

unused devices: <none>

fdisk -l:

Disk /dev/sda: 2000.4 GB, 2000398934016 bytes
81 heads, 63 sectors/track, 765633 cylinders, total 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x000d84fa

Device Boot      Start         End      Blocks   Id  System
/dev/sda1            2048  3907029167  1953513560   fd  Linux raid autodetect

Disk /dev/sdb: 2000.4 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders, total 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x000802d9

Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *        2048  3907028991  1953513472   fd  Linux raid autodetect

Disk /dev/sdc: 2000.4 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders, total 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x000a8dca

Device Boot      Start         End      Blocks   Id  System
/dev/sdc1            2048  3907028991  1953513472   fd  Linux raid autodetect

Disk /dev/sdd: 7756 MB, 7756087296 bytes
255 heads, 63 sectors/track, 942 cylinders, total 15148608 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x128faec9

Device Boot      Start         End      Blocks   Id  System
/dev/sdd1   *        2048    15148607     7573280    c  W95 FAT32 (LBA)

Naturally, I have already tried adding /dev/sda1 back. mdadm --manage /dev/md0 --add /dev/sda1 gives:

mdadm: add new device failed for /dev/sda1 as 3: Invalid argument

Once the RAID is repaired, I will probably also need to get GRUB up and running again so that it can detect the RAID/LVM and boot once more.

Edit (smartctl test results added)

Output of the smartctl tests.

smartctl -a /dev/sda:

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.16.0-30-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF, SATA 6Gb/s)
Device Model:     WDC WD20EZRX-00D8PB0
Serial Number:    WD-WMC4M0760056
LU WWN Device Id: 5 0014ee 003a4a444
Firmware Version: 80.00A80
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Mar 24 22:07:08 2015 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                   was completed without error.
                   Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 121) The previous self-test completed having
                   the read element of the test failed.
Total time to complete Offline 
data collection:        (26280) seconds.
Offline data collection
capabilities:            (0x7b) SMART execute Offline immediate.
                   Auto Offline data collection on/off support.
                   Suspend Offline collection upon new
                   command.
                   Offline surface scan supported.
                   Self-test supported.
                   Conveyance Self-test supported.
                   Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                   power-saving mode.
                   Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                   General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    ( 266) minutes.
Conveyance self-test routine
recommended polling time:    (   5) minutes.
SCT capabilities:          (0x7035) SCT Status supported.
                   SCT Feature Control supported.
                   SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
 1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       3401
 3 Spin_Up_Time            0x0027   172   172   021    Pre-fail  Always       -       4375
 4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       59
 5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
 7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
 9 Power_On_Hours          0x0032   087   087   000    Old_age   Always       -       9697
10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       59
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       51
193 Load_Cycle_Count        0x0032   115   115   000    Old_age   Always       -       255276
194 Temperature_Celsius     0x0022   119   106   000    Old_age   Always       -       28
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       12
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      9692         2057

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Not_testing
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing
Selective self-test flags (0x0):
 After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

You are missing one of the three drives of your /dev/md0 RAID5 array, so mdadm will assemble the array but not run it.

-R, --run: Attempt to start the array even if fewer drives were given than were present the last time the array was active. Normally, if not all the expected drives are found and --scan is not being used, the array will be assembled but not started. With --run, an attempt will be made to start it anyway.

So all you need to do is run mdadm --run /dev/md0. If you want to be cautious, you can try mdadm --run --readonly /dev/md0 followed by mount -o ro,norecover /dev/md0 /mnt to check that everything looks OK. (The opposite of --readonly is, of course, --readwrite.)
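The cautious sequence above can be sketched as a small script. This is a dry run: the run helper only echoes each command so the exact steps can be reviewed first, and nothing touches the disks. Device (/dev/md0) and mount point (/mnt) are taken from this question; adjust them for another system.

```shell
#!/bin/sh
# Dry-run sketch of the cautious start-and-verify sequence described above.
# The run helper only echoes each command; remove it (or replace echo with
# "$@") once you have reviewed the steps and want to execute them for real.
run() { echo "+ $*"; }

# Start the degraded array read-only so nothing is written yet.
run mdadm --run --readonly /dev/md0

# Mount read-only and eyeball the contents.
run mount -o ro,norecover /dev/md0 /mnt
run ls /mnt

# If everything looks fine, unmount and switch the array to read-write.
run umount /mnt
run mdadm --readwrite /dev/md0
```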

Once it is running, you can add a new disk back in.
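Adding a replacement disk back could look roughly like the following, again as a dry run that only echoes commands. /dev/sde is a hypothetical device name for the new disk, and cloning the MBR partition table with an sfdisk dump is one common approach; substitute the real device and verify the layout before writing anything.

```shell
#!/bin/sh
# Dry-run sketch: bring the degraded array back to 3 working devices with a
# new disk. /dev/sde is a placeholder; verify every step before running it.
run() { echo "+ $*"; }

# Copy the partition layout from a healthy member (one common way for MBR disks).
run "sfdisk -d /dev/sdb | sfdisk /dev/sde"

# Add the new partition; md begins rebuilding parity onto it immediately.
run mdadm --manage /dev/md0 --add /dev/sde1

# Watch the rebuild progress.
run cat /proc/mdstat
```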

I would not recommend re-adding your existing disk, because it is showing SMART disk errors, as this recent self-test report demonstrates:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      9692         2057

However, if you really want to try re-adding the existing disk, then running --zero-superblock on that disk first would probably be a very good idea. I would still recommend replacing it, though.
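If you nonetheless decide to re-use the failing disk, the superblock wipe and re-add might look like this (dry run again, echoing only, with /dev/sda1 as in the question):

```shell
#!/bin/sh
# Dry-run sketch: wipe stale md metadata, then re-add the old partition.
# Risky given the SMART read failure on this disk; a replacement is safer.
run() { echo "+ $*"; }

# Erase any leftover md superblock so the disk joins as a fresh member.
run mdadm --zero-superblock /dev/sda1

# Re-add it to the running (degraded) array; a full resync follows.
run mdadm --manage /dev/md0 --add /dev/sda1
```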

Quoted from: https://unix.stackexchange.com/questions/191857