mdadm DegradedArray: is this a software problem or a hardware defect?
On my hosted dedicated server, I received mdadm emails for all RAID arrays md0/md1/md2:
```
This is an automatically generated mail message from mdadm running on example.com

A DegradedArray event had been detected on md device /dev/md/2.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md2 : active raid1 nvme0n1p3[0]
      903479616 blocks super 1.2 [2/1] [U_]
      bitmap: 7/7 pages [28KB], 65536KB chunk

md0 : active raid1 nvme0n1p1[0]
      33520640 blocks super 1.2 [2/1] [U_]

md1 : active raid1 nvme0n1p2[0]
      523264 blocks super 1.2 [2/1] [U_]

unused devices: <none>
```
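The `[2/1] [U_]` fields tell the story: two member slots, only one active, and the underscore marks the missing member. As a quick triage step, degraded arrays can be pulled out of this format mechanically. A minimal sketch (the sample file path is made up for the demo; on a live system you would pass `/proc/mdstat` instead):

```shell
#!/bin/sh
# Minimal sketch: list degraded md arrays from mdstat-formatted text.
# "[2/1]" means 2 member slots but only 1 active; the "_" in "[U_]"
# marks the missing member.
degraded_arrays() {
    # $1: path to an mdstat-format file (normally /proc/mdstat)
    awk '/^md/ { name = $1 }
         / blocks / && /_\]/ { print name }' "$1"
}

# Demo against a saved copy of the output above (illustrative path).
cat > /tmp/mdstat.sample <<'EOF'
md2 : active raid1 nvme0n1p3[0]
      903479616 blocks super 1.2 [2/1] [U_]
      bitmap: 7/7 pages [28KB], 65536KB chunk
md0 : active raid1 nvme0n1p1[0]
      33520640 blocks super 1.2 [2/1] [U_]
md1 : active raid1 nvme0n1p2[0]
      523264 blocks super 1.2 [2/1] [U_]
EOF
degraded_arrays /tmp/mdstat.sample
# prints: md2, md0, md1 -- all three arrays are missing a member
```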
I don't know whether this is a RAID sync problem or whether the disk has really failed, and I'm hoping a Linux expert can help me.
The two Samsung NVMe devices run as a software RAID under mdadm.
```
$ lsblk
nvme1n1     259:0    0 894.3G  0 disk
├─nvme1n1p1 259:2    0    32G  0 part
├─nvme1n1p2 259:3    0   512M  0 part
└─nvme1n1p3 259:4    0 861.8G  0 part
nvme0n1     259:1    0 894.3G  0 disk
├─nvme0n1p1 259:5    0    32G  0 part
│ └─md0       9:0    0    32G  0 raid1 [SWAP]
├─nvme0n1p2 259:6    0   512M  0 part
│ └─md1       9:1    0   511M  0 raid1 /boot
└─nvme0n1p3 259:7    0 861.8G  0 part
  └─md2      9:2    0 861.6G  0 raid1 /
```

(The tree output with MAJ:MIN columns is from `lsblk`, not `fdisk -l`.)
As the listing shows, nvme1n1 and its partitions are no longer part of any RAID set, even though the operating system still recognizes nvme1n1.
```
$ dmesg
[ 7664.380493] pcieport 0000:00:1b.4: AER: Corrected error received: 0000:00:1b.4
[ 7664.380514] pcieport 0000:00:1b.4: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[ 7664.380795] pcieport 0000:00:1b.4: AER: device [8086:a32c] error status/mask=00000001/00002000
[ 7664.381066] pcieport 0000:00:1b.4: AER:    [ 0] RxErr
[ 7664.780438] pcieport 0000:00:1b.4: AER: Corrected error received: 0000:00:1b.4
[ 7664.780459] pcieport 0000:00:1b.4: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[ 7664.780739] pcieport 0000:00:1b.4: AER: device [8086:a32c] error status/mask=00000001/00002000
[ 7664.781011] pcieport 0000:00:1b.4: AER:    [ 0] RxErr
```
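Corrected AER errors are link-level retries that the hardware recovered from, but a steady stream of them on one root port is itself a symptom. A quick way to see whether they all cluster on a single PCI address (a sketch over a saved dmesg capture; the file path and sample are illustrative):

```shell
#!/bin/sh
# Sketch: tally corrected AER error reports per PCI address in a dmesg
# capture, to check whether they all come from one root port.
aer_summary() {
    # $1: path to a file containing dmesg output
    grep -o 'pcieport [0-9a-f:.]*: AER: Corrected' "$1" \
        | awk '{ sub(/:$/, "", $2); print $2 }' \
        | sort | uniq -c
}

# Demo on the two events quoted above (saved to an illustrative path).
cat > /tmp/dmesg.sample <<'EOF'
[ 7664.380493] pcieport 0000:00:1b.4: AER: Corrected error received: 0000:00:1b.4
[ 7664.380514] pcieport 0000:00:1b.4: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[ 7664.780438] pcieport 0000:00:1b.4: AER: Corrected error received: 0000:00:1b.4
[ 7664.780459] pcieport 0000:00:1b.4: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
EOF
aer_summary /tmp/dmesg.sample
# prints a count of 2 for address 0000:00:1b.4
```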
`lspci` shows me both NVMe devices:

```
$ lspci
01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
03:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
```
Checking the mdadm details, e.g. for md0:
```
$ mdadm -D /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Sat Aug  7 19:34:45 2021
        Raid Level : raid1
        Array Size : 33520640 (31.97 GiB 34.33 GB)
     Used Dev Size : 33520640 (31.97 GiB 34.33 GB)
      Raid Devices : 2
     Total Devices : 1
       Persistence : Superblock is persistent

       Update Time : Fri Mar  4 17:42:37 2022
             State : clean, degraded
    Active Devices : 1
   Working Devices : 1
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : resync

              Name : rescue:0
              UUID : 2e61cb41:dee3a004:b12de575:72c13ed0
            Events : 46

    Number   Major   Minor   RaidDevice State
       0     259        2        0      active sync   /dev/nvme0n1p1
       -       0        0        1      removed
```
Here I no longer see the device /dev/nvme1n1p1. What does that mean for me?
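Its absence from `mdadm -D` means the kernel kicked that member out of the array (state "removed"). Whether the partition still carries an md superblock, and what the drive reports about its own health, can be checked directly. A print-only checklist sketch (device names taken from the layout above; the commands are only echoed, so nothing runs by accident):

```shell
#!/bin/sh
# Print-only triage checklist; nothing is executed. Device names come
# from the question -- adjust to your own layout before running by hand.
triage_checklist() {
    echo 'smartctl -a /dev/nvme1n1         # drive health / error log'
    echo 'nvme smart-log /dev/nvme1n1      # NVMe-native wear and error counters'
    echo 'mdadm --examine /dev/nvme1n1p1   # does the md superblock still exist?'
}

triage_checklist
```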
My mdadm.conf file:
```
# mdadm.conf
#
# !NB! Run update-initramfs -u after updating this file.
# !NB! This will ensure that initramfs has an uptodate copy.
#
# Please refer to mdadm.conf(5) for information about this file.
#

# by default (built-in), scan all partitions (/proc/partitions) and all
# containers for MD superblocks. alternatively, specify devices to scan, using
# wildcards if desired.
#DEVICE partitions containers

# automatically tag new arrays as belonging to the local system
HOMEHOST <system>

# instruct the monitoring daemon where to send mail alerts
MAILADDR root

# definitions of existing MD arrays
ARRAY /dev/md/0 metadata=1.2 UUID=2e61cb41:dee3a004:b12de575:72c13ed0 name=rescue:0
ARRAY /dev/md/1 metadata=1.2 UUID=455ba7de:599eb665:202c1fe8:33c709f4 name=rescue:1
ARRAY /dev/md/2 metadata=1.2 UUID=c1f88478:e4ed5e8d:56f296cc:38e97b8c name=rescue:2
ARRAY /dev/md/0 metadata=1.2 UUID=e8c8f0cb:91007124:62e03226:94a707dc name=rescue:0
ARRAY /dev/md/1 metadata=1.2 UUID=a335efb7:cc52634c:3221294c:e7feb748 name=rescue:1
ARRAY /dev/md/2 metadata=1.2 UUID=f2a13b49:17f5e812:8e7c5adf:3114a929 name=rescue:2

# This configuration was auto-generated on Sat, 07 Aug 2021 19:35:14 +0200 by mkconf
```
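One thing stands out in this file: it carries two full sets of ARRAY lines with conflicting UUIDs, most likely left over from an earlier install. The first set appears to be the live one, since the md0 UUID 2e61cb41:… matches the `mdadm -D /dev/md0` output above (that the other two lines match analogously is an assumption; compare them against `mdadm --detail --scan` before deleting anything). With the stale second set removed, the definitions section would look like this:

```
# definitions of existing MD arrays
ARRAY /dev/md/0 metadata=1.2 UUID=2e61cb41:dee3a004:b12de575:72c13ed0 name=rescue:0
ARRAY /dev/md/1 metadata=1.2 UUID=455ba7de:599eb665:202c1fe8:33c709f4 name=rescue:1
ARRAY /dev/md/2 metadata=1.2 UUID=c1f88478:e4ed5e8d:56f296cc:38e97b8c name=rescue:2
```

As the file's own header says, run `update-initramfs -u` after editing it.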
I hope you can help me.
This is a hardware-level failure. The kernel still enumerates nvme1n1, but its partitions have dropped out of every array and the PCIe root port is logging physical-layer receive errors, which points at the drive or its link rather than at mdadm. Since it is a hosted server, have the provider replace the failed device. Don't bother trying to repair it in place; just have it swapped out. That's what you're paying them for.
- You will need to arrange downtime with the hosting provider.
- Make sure they are 100% certain which disk device is the faulty one (I once had a vendor who should have known better pull a good disk. Fortunately I was running RAID 6 and could survive the second "failure").
- If you possibly can, take a backup first, "just in case"; everything that can go wrong, will. You should have backups anyway, so take one more.
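Once the provider has swapped the disk, the rebuild itself is routine. A print-only sketch of the usual sequence (it only echoes the commands; the device names assume the replacement enumerates as /dev/nvme1n1 again, so verify with `lsblk` before running anything):

```shell
#!/bin/sh
# Print-only rebuild plan; the commands are echoed, not executed.
# Assumes the replacement disk shows up as /dev/nvme1n1 -- verify first.
rebuild_plan() {
    # Clone the partition layout of the surviving disk onto the new one.
    echo 'sfdisk -d /dev/nvme0n1 | sfdisk /dev/nvme1n1'
    # Re-add each partition to its array; mdadm resyncs automatically.
    echo 'mdadm --manage /dev/md0 --add /dev/nvme1n1p1'
    echo 'mdadm --manage /dev/md1 --add /dev/nvme1n1p2'
    echo 'mdadm --manage /dev/md2 --add /dev/nvme1n1p3'
    # Watch the rebuild progress.
    echo 'watch cat /proc/mdstat'
}

rebuild_plan
```

The resync order doesn't matter much; md throttles concurrent rebuilds on the same disks by itself.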