Cluster
PCS STONITH (fencing) kills both nodes of a two-node cluster if the first node goes down
I have configured a two-node cluster of physical servers (HP ProLiant DL560 Gen8) with pcs (corosync/pacemaker/pcsd). I have also configured fencing on them using fence_ilo4.
If one node goes down (and by DOWN I mean powered off), something strange happens: the second node dies as well. Fencing kills its own node, leaving both servers offline.
How can I correct this behavior?
What I tried was adding `wait_for_all: 0` and `expected_votes: 1` to the `quorum` section of `/etc/corosync/corosync.conf`, but the surviving node still gets killed.
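For reference, the `quorum` section after my edits looks roughly like this (a sketch; note the `2Node` flag in the status output below, meaning `two_node` mode is active, which forces the expected votes back to 2):

quorum {
    provider: corosync_votequorum
    two_node: 1          # presumably added by pcs setup for a two-node cluster
    expected_votes: 1    # my addition; effectively overridden while two_node is on
    wait_for_all: 0      # my addition
}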
At some point I will have to perform maintenance on one of the servers and will need to shut it down. If that happens, I do not want the other node to go down with it.
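To be clear, for that maintenance I only intend the standard graceful sequence, something like the following (kvm_aquila-01 chosen just as an example):

pcs cluster standby kvm_aquila-01    # drain resources off the node to be serviced
pcs cluster stop kvm_aquila-01       # stop cluster services so the power-off is not treated as a failure
# ... power off, do the maintenance, power back on, then:
pcs cluster start kvm_aquila-01
pcs cluster unstandby kvm_aquila-01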
Here is some output:
[root@kvm_aquila-02 ~]# pcs quorum status
Quorum information
------------------
Date:             Fri Jun 28 09:07:18 2019
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          2
Ring ID:          1/284
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           1
Flags:            2Node Quorate

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1         NR kvm_aquila-01
         2          1         NR kvm_aquila-02 (local)

[root@kvm_aquila-02 ~]# pcs config show
Cluster Name: kvm_aquila
Corosync Nodes:
 kvm_aquila-01 kvm_aquila-02
Pacemaker Nodes:
 kvm_aquila-01 kvm_aquila-02

Resources:
 Clone: dlm-clone
  Meta Attrs: interleave=true ordered=true
  Resource: dlm (class=ocf provider=pacemaker type=controld)
   Operations: monitor interval=30s on-fail=fence (dlm-monitor-interval-30s)
               start interval=0s timeout=90 (dlm-start-interval-0s)
               stop interval=0s timeout=100 (dlm-stop-interval-0s)
 Clone: clvmd-clone
  Meta Attrs: interleave=true ordered=true
  Resource: clvmd (class=ocf provider=heartbeat type=clvm)
   Operations: monitor interval=30s on-fail=fence (clvmd-monitor-interval-30s)
               start interval=0s timeout=90s (clvmd-start-interval-0s)
               stop interval=0s timeout=90s (clvmd-stop-interval-0s)
 Group: test_VPS
  Resource: test (class=ocf provider=heartbeat type=VirtualDomain)
   Attributes: config=/shared/xml/test.xml hypervisor=qemu:///system migration_transport=ssh
   Meta Attrs: allow-migrate=true is-managed=true priority=100 target-role=Started
   Utilization: cpu=4 hv_memory=4096
   Operations: migrate_from interval=0 timeout=120s (test-migrate_from-interval-0)
               migrate_to interval=0 timeout=120 (test-migrate_to-interval-0)
               monitor interval=10 timeout=30 (test-monitor-interval-10)
               start interval=0s timeout=300s (test-start-interval-0s)
               stop interval=0s timeout=300s (test-stop-interval-0s)

Stonith Devices:
 Resource: kvm_aquila-01 (class=stonith type=fence_ilo4)
  Attributes: ipaddr=10.0.4.39 login=fencing passwd=0ToleranciJa pcmk_host_list="kvm_aquila-01 kvm_aquila-02"
  Operations: monitor interval=60s (kvm_aquila-01-monitor-interval-60s)
 Resource: kvm_aquila-02 (class=stonith type=fence_ilo4)
  Attributes: ipaddr=10.0.4.49 login=fencing passwd=0ToleranciJa pcmk_host_list="kvm_aquila-01 kvm_aquila-02"
  Operations: monitor interval=60s (kvm_aquila-02-monitor-interval-60s)
Fencing Levels:

Location Constraints:
Ordering Constraints:
  start dlm-clone then start clvmd-clone (kind:Mandatory)
Colocation Constraints:
  clvmd-clone with dlm-clone (score:INFINITY)
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: kvm_aquila
 dc-version: 1.1.19-8.el7_6.4-c3c624ea3d
 have-watchdog: false
 last-lrm-refresh: 1561619537
 no-quorum-policy: ignore
 stonith-enabled: true

Quorum:
  Options:
    wait_for_all: 0

[root@kvm_aquila-02 ~]# pcs cluster status
Cluster Status:
 Stack: corosync
 Current DC: kvm_aquila-02 (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
 Last updated: Fri Jun 28 09:14:11 2019
 Last change: Thu Jun 27 16:23:44 2019 by root via cibadmin on kvm_aquila-01
 2 nodes configured
 7 resources configured

PCSD Status:
  kvm_aquila-02: Online
  kvm_aquila-01: Online

[root@kvm_aquila-02 ~]# pcs status
Cluster name: kvm_aquila
Stack: corosync
Current DC: kvm_aquila-02 (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
Last updated: Fri Jun 28 09:14:31 2019
Last change: Thu Jun 27 16:23:44 2019 by root via cibadmin on kvm_aquila-01

2 nodes configured
7 resources configured

Online: [ kvm_aquila-01 kvm_aquila-02 ]

Full list of resources:

 kvm_aquila-01  (stonith:fence_ilo4):   Started kvm_aquila-01
 kvm_aquila-02  (stonith:fence_ilo4):   Started kvm_aquila-02
 Clone Set: dlm-clone [dlm]
     Started: [ kvm_aquila-01 kvm_aquila-02 ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ kvm_aquila-01 kvm_aquila-02 ]
 Resource Group: test_VPS
     test       (ocf::heartbeat:VirtualDomain): Started kvm_aquila-01

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
It looks like you have configured each STONITH device so that it can fence both nodes, and you have no location constraints keeping the fence agent responsible for a given node from running on that very node (STONITH suicide). That is bad practice.
Try configuring the STONITH devices and location constraints like this:
pcs stonith create kvm_aquila-01 fence_ilo4 pcmk_host_list=kvm_aquila-01 ipaddr=10.0.4.39 login=fencing passwd=0ToleranciJa op monitor interval=60s
pcs stonith create kvm_aquila-02 fence_ilo4 pcmk_host_list=kvm_aquila-02 ipaddr=10.0.4.49 login=fencing passwd=0ToleranciJa op monitor interval=60s
pcs constraint location kvm_aquila-01 avoids kvm_aquila-01=INFINITY
pcs constraint location kvm_aquila-02 avoids kvm_aquila-02=INFINITY
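STONITH resources named kvm_aquila-01 and kvm_aquila-02 already exist in your configuration, so you would delete those first (`pcs stonith delete kvm_aquila-01`, `pcs stonith delete kvm_aquila-02`) before recreating them as above. Afterwards, verify that each fence device ends up running on the opposite node, for example with:

pcs stonith show
pcs constraint
pcs status

With the `avoids` constraints in place, the device that fences kvm_aquila-01 can only run on kvm_aquila-02 and vice versa, so a node is never the one asked to fence itself.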