How do I troubleshoot a disk controller on an Illumos-based system?

February 4, 2019

I am using OmniOS, which is based on Illumos.
I have a ZFS pool made up of two mirrored SSDs. The pool, named data, reports a %b of 100. Below is the output of iostat -xn:
r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
0.0    8.0    0.0   61.5  8.7  4.5 1092.6  556.8  39 100 data
Unfortunately, there is not actually much throughput: iotop reports 23552 bytes per second.
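To make the mismatch between a 100% busy device and a tiny data rate concrete, here is a minimal Python sketch that parses the iostat -xn row above and derives the write rate and average write size. The helper names (parse_iostat_row, summarize) are hypothetical, and the conversion assumes that kw/s means units of 1024 bytes.

```python
# Column names as printed in the `iostat -xn` header shown above.
COLUMNS = ["r/s", "w/s", "kr/s", "kw/s", "wait", "actv",
           "wsvc_t", "asvc_t", "%w", "%b", "device"]


def parse_iostat_row(line):
    """Turn one `iostat -xn` data row into a dict (floats plus the device name)."""
    row = dict(zip(COLUMNS, line.split()))
    return {k: (v if k == "device" else float(v)) for k, v in row.items()}


def summarize(row):
    """Derive write rate and average write size from a parsed row."""
    writes = row["w/s"]
    return {
        # Assumption: 1 "k" in kw/s is 1024 bytes.
        "write_bytes_per_s": row["kw/s"] * 1024,
        "avg_write_kb": row["kw/s"] / writes if writes else 0.0,
        "busy_pct": row["%b"],
        "avg_active_service_ms": row["asvc_t"],
    }


row = parse_iostat_row(
    "0.0    8.0    0.0   61.5  8.7  4.5 1092.6  556.8  39 100 data")
summary = summarize(row)
for key, value in summary.items():
    print(key, value)
```

A device that is busy 100% of the time while completing only a handful of small writes per second, each with a service time of hundreds of milliseconds, points at latency somewhere in the I/O path rather than at a saturated disk.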
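I also ran iostat -E, which reported a lot of Transport Errors; we changed the port and they went away.
I thought there might be a problem with the drives, but SMART reports no issues. I ran smartctl -t short and smartctl -t long several times; no problems were reported.
I ran fmadm faulty, which reported the following: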
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Jun 01 18:34:01 5fdf0c4c-5627-ccaa-d41e-fc5b2d282ab2  ZFS-8000-D3    Major     

Host        : sys1
Platform    : xxxx-xxxx       Chassis_id  : xxxxxxx
Product_sn  : 

Fault class : fault.fs.zfs.device
Affects     : zfs://pool=data/vdev=cad34c3e3be42919
                 faulted but still in service
Problem in  : zfs://pool=data/vdev=cad34c3e3be42919
                 faulted but still in service

Description : A ZFS device failed.  Refer to http://illumos.org/msg/ZFS-8000-D3
             for more information.

Response    : No automated response will occur.

Impact      : Fault tolerance of the pool may be compromised.

Action      : Run 'zpool status -x' and replace the bad device.
As it suggests, I ran zpool status -x, which reported all pools are healthy.
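When fmadm and zpool status disagree like this, it can help to work out which vdev the fault record refers to. The sketch below pulls the vdev identifier out of the FMRI shown above and converts it to decimal. It assumes that the FMRI prints the vdev GUID in hexadecimal and that other tools (for example the pool configuration dumped by zdb) print GUIDs in decimal; both are worth verifying on your release. The function name fmri_vdev_guid is hypothetical.

```python
def fmri_vdev_guid(fmri):
    """Return (hex_string, decimal_int) for the vdev= component of a zfs:// FMRI."""
    for part in fmri.split("/"):
        if part.startswith("vdev="):
            hex_guid = part[len("vdev="):]
            # Assumption: the FMRI encodes the GUID as hexadecimal.
            return hex_guid, int(hex_guid, 16)
    raise ValueError("no vdev= component in FMRI: %r" % fmri)


hex_guid, dec_guid = fmri_vdev_guid("zfs://pool=data/vdev=cad34c3e3be42919")
print(hex_guid, "->", dec_guid)
```

The decimal value can then be compared against the guid fields of the pool's vdevs. Note also that the fault is dated Jun 01, so it may simply be an old record that was never cleared.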
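I ran some DTrace scripts and found that all of the I/O activity comes from <none> (for the file), which means it is metadata; so there is actually no file I/O going on.
When I run kstat -p zone_vfs, it reports the following: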
zone_vfs:0:global:100ms_ops     21412
zone_vfs:0:global:10ms_ops      95554
zone_vfs:0:global:10s_ops       1639
zone_vfs:0:global:1s_ops        20752
zone_vfs:0:global:class zone_vfs
zone_vfs:0:global:crtime        0
zone_vfs:0:global:delay_cnt     0
zone_vfs:0:global:delay_time    0
zone_vfs:0:global:nread 69700628762
zone_vfs:0:global:nwritten      42450222087
zone_vfs:0:global:reads 14837387
zone_vfs:0:global:rlentime      229340224122
zone_vfs:0:global:rtime 202749379182
zone_vfs:0:global:snaptime      168018.106250637
zone_vfs:0:global:wlentime      153502283827640
zone_vfs:0:global:writes        2599025
zone_vfs:0:global:wtime 113171882481275
zone_vfs:0:global:zonename      global
The high 1s_ops and 10s_ops counts are very concerning.
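To put those counters in proportion, here is a rough Python sketch that parses kstat -p style output and expresses each latency bucket as a share of all operations. It uses reads + writes as the total, which is an assumption about what the buckets count, and it makes no claim about whether the buckets are cumulative. The names parse_kstat and slow_op_share are hypothetical.

```python
# A subset of the `kstat -p zone_vfs` output shown above.
KSTAT_SAMPLE = """\
zone_vfs:0:global:100ms_ops     21412
zone_vfs:0:global:10ms_ops      95554
zone_vfs:0:global:10s_ops       1639
zone_vfs:0:global:1s_ops        20752
zone_vfs:0:global:reads 14837387
zone_vfs:0:global:writes        2599025
"""

BUCKETS = ("10ms_ops", "100ms_ops", "1s_ops", "10s_ops")


def parse_kstat(text):
    """Map the last component of each module:instance:name:statistic key to its value."""
    stats = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or value-less lines
        statistic = parts[0].rsplit(":", 1)[-1]
        stats[statistic] = parts[1].strip()
    return stats


def slow_op_share(stats):
    """Return (total_ops, {bucket: fraction of total_ops})."""
    # Assumption: reads + writes approximates the number of operations counted.
    total = int(stats["reads"]) + int(stats["writes"])
    return total, {b: int(stats[b]) / total for b in BUCKETS}


total_ops, shares = slow_op_share(parse_kstat(KSTAT_SAMPLE))
print("total ops:", total_ops)
for bucket in BUCKETS:
    print("%-10s %.4f%%" % (bucket, shares[bucket] * 100))
```

Even a small percentage matters here: any operation that lands in the 1s or 10s bucket represents a very long stall for whatever was waiting on it.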
I think it is the controller, but I cannot be sure. Does anyone have any ideas, or know where I can get more information?

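The data pool is an encrypted lofi ZFS container; that is the problem.
I was able to confirm that this is a performance issue with lofi's "virtual" controller, for the following reasons:
lofi + ZFS + encryption has a throughput of about 10-25 MB/s
lofi + ZFS + no encryption has a throughput of about 30 MB/s
Plain old ZFS without lofi has a throughput of about 250 MB/s
The data controller reports 100% utilization, while the real controller shows practically none.
I tested this on multiple machines with the same setup, and the results were basically the same.
The problem here is lofi, not the disk controller.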
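For anyone who wants to repeat this kind of comparison, here is a minimal Python sketch of a sequential write test. It is not the method used for the figures above (the original does not say how they were measured); it simply writes incompressible data into a directory, forces it to disk, and reports MB/s. The function name write_throughput_mb_s and its parameters are hypothetical.

```python
import os
import tempfile
import time


def write_throughput_mb_s(directory, total_mb=64, block_kb=128):
    """Write total_mb of random data into `directory`, fsync it, return MB/s."""
    # Random data so that ZFS compression cannot inflate the result.
    block = os.urandom(block_kb * 1024)
    blocks = (total_mb * 1024) // block_kb
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.monotonic()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # include the time needed to reach stable storage
        elapsed = time.monotonic() - start
    finally:
        os.remove(path)
    return total_mb / elapsed if elapsed > 0 else float("inf")


# Demo against a temporary directory; point `directory` at a dataset on the
# lofi-backed pool and at one on a plain pool to compare the two.
with tempfile.TemporaryDirectory() as demo_dir:
    rate = write_throughput_mb_s(demo_dir, total_mb=8)
    print("%.1f MB/s" % rate)
```

Running the same test on a lofi-backed dataset and on a plain ZFS dataset on the same hardware should show whether the gap described above reproduces. A small total_mb mostly measures caching, so use a size well beyond a few megabytes for a real comparison.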

Source: https://unix.stackexchange.com/questions/207364
