Cluster

PCS STONITH (fencing) kills the two-node cluster if the first node goes down

  • June 28, 2019

I have configured a two-node cluster of physical servers (HP ProLiant DL560 Gen8) with pcs (corosync/pacemaker/pcsd). I have also configured fencing on them using fence_ilo4.

Something strange happens when one node fails (by DOWN I mean powered off): the second node dies as well. Fencing kills it too, leaving both servers offline.

How can I correct this behavior?

What I tried was adding `wait_for_all: 0` and `expected_votes: 1` to the `quorum` section of `/etc/corosync/corosync.conf`, but it still kills the surviving node.
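For reference, a quorum section with those options would look roughly like this (a sketch rather than a verbatim copy of my file; the two_node setting is assumed from the "2Node" flag visible in the pcs quorum status output below):

quorum {
    provider: corosync_votequorum
    two_node: 1          # assumed from the "2Node" flag shown below
    expected_votes: 1
    wait_for_all: 0
}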

At some point I will need to do maintenance on one of the servers and will have to shut it down. When that happens, I don't want the other node to go down as well.

Here is some output:

[root@kvm_aquila-02 ~]# pcs quorum status
Quorum information
------------------
Date:             Fri Jun 28 09:07:18 2019
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          2
Ring ID:          1/284
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           1  
Flags:            2Node Quorate 

Membership information
----------------------
   Nodeid      Votes    Qdevice Name
        1          1         NR kvm_aquila-01
        2          1         NR kvm_aquila-02 (local)


[root@kvm_aquila-02 ~]# pcs config show
Cluster Name: kvm_aquila
Corosync Nodes:
kvm_aquila-01 kvm_aquila-02
Pacemaker Nodes:
kvm_aquila-01 kvm_aquila-02

Resources:
Clone: dlm-clone
 Meta Attrs: interleave=true ordered=true 
 Resource: dlm (class=ocf provider=pacemaker type=controld)
  Operations: monitor interval=30s on-fail=fence (dlm-monitor-interval-30s)
              start interval=0s timeout=90 (dlm-start-interval-0s)
              stop interval=0s timeout=100 (dlm-stop-interval-0s)
Clone: clvmd-clone
 Meta Attrs: interleave=true ordered=true 
 Resource: clvmd (class=ocf provider=heartbeat type=clvm)
  Operations: monitor interval=30s on-fail=fence (clvmd-monitor-interval-30s)
              start interval=0s timeout=90s (clvmd-start-interval-0s)
              stop interval=0s timeout=90s (clvmd-stop-interval-0s)
Group: test_VPS
 Resource: test (class=ocf provider=heartbeat type=VirtualDomain)
  Attributes: config=/shared/xml/test.xml hypervisor=qemu:///system migration_transport=ssh
  Meta Attrs: allow-migrate=true is-managed=true priority=100 target-role=Started 
  Utilization: cpu=4 hv_memory=4096
  Operations: migrate_from interval=0 timeout=120s (test-migrate_from-interval-0)
              migrate_to interval=0 timeout=120 (test-migrate_to-interval-0)
              monitor interval=10 timeout=30 (test-monitor-interval-10)
              start interval=0s timeout=300s (test-start-interval-0s)
              stop interval=0s timeout=300s (test-stop-interval-0s)

Stonith Devices:
Resource: kvm_aquila-01 (class=stonith type=fence_ilo4)
 Attributes: ipaddr=10.0.4.39 login=fencing passwd=0ToleranciJa pcmk_host_list="kvm_aquila-01 kvm_aquila-02"
 Operations: monitor interval=60s (kvm_aquila-01-monitor-interval-60s)
Resource: kvm_aquila-02 (class=stonith type=fence_ilo4)
 Attributes: ipaddr=10.0.4.49 login=fencing passwd=0ToleranciJa pcmk_host_list="kvm_aquila-01 kvm_aquila-02"
 Operations: monitor interval=60s (kvm_aquila-02-monitor-interval-60s)
Fencing Levels:

Location Constraints:
Ordering Constraints:
 start dlm-clone then start clvmd-clone (kind:Mandatory)
Colocation Constraints:
 clvmd-clone with dlm-clone (score:INFINITY)
Ticket Constraints:

Alerts:
No alerts defined

Resources Defaults:
No defaults set
Operations Defaults:
No defaults set

Cluster Properties:
cluster-infrastructure: corosync
cluster-name: kvm_aquila
dc-version: 1.1.19-8.el7_6.4-c3c624ea3d
have-watchdog: false
last-lrm-refresh: 1561619537
no-quorum-policy: ignore
stonith-enabled: true

Quorum:
 Options:
   wait_for_all: 0

[root@kvm_aquila-02 ~]# pcs cluster status
Cluster Status:
Stack: corosync
Current DC: kvm_aquila-02 (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
Last updated: Fri Jun 28 09:14:11 2019
Last change: Thu Jun 27 16:23:44 2019 by root via cibadmin on kvm_aquila-01
2 nodes configured
7 resources configured

PCSD Status:
 kvm_aquila-02: Online
 kvm_aquila-01: Online
[root@kvm_aquila-02 ~]# pcs status
Cluster name: kvm_aquila
Stack: corosync
Current DC: kvm_aquila-02 (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
Last updated: Fri Jun 28 09:14:31 2019
Last change: Thu Jun 27 16:23:44 2019 by root via cibadmin on kvm_aquila-01

2 nodes configured
7 resources configured

Online: [ kvm_aquila-01 kvm_aquila-02 ]

Full list of resources:

kvm_aquila-01  (stonith:fence_ilo4):   Started kvm_aquila-01
kvm_aquila-02  (stonith:fence_ilo4):   Started kvm_aquila-02
Clone Set: dlm-clone [dlm]
    Started: [ kvm_aquila-01 kvm_aquila-02 ]
Clone Set: clvmd-clone [clvmd]
    Started: [ kvm_aquila-01 kvm_aquila-02 ]
Resource Group: test_VPS
    test   (ocf::heartbeat:VirtualDomain): Started kvm_aquila-01

Daemon Status:
 corosync: active/enabled
 pacemaker: active/enabled
 pcsd: active/enabled

It looks like you have configured each STONITH device so that it can fence both nodes. You also have no location constraints keeping the fence agent responsible for fencing a given node from running on that same node (STONITH suicide), which is bad practice.

Try configuring the STONITH devices and location constraints like this:

pcs stonith create kvm_aquila-01 fence_ilo4 pcmk_host_list=kvm_aquila-01 ipaddr=10.0.4.39 login=fencing passwd=0ToleranciJa op monitor interval=60s
pcs stonith create kvm_aquila-02 fence_ilo4 pcmk_host_list=kvm_aquila-02 ipaddr=10.0.4.49 login=fencing passwd=0ToleranciJa op monitor interval=60s
pcs constraint location kvm_aquila-01 avoids kvm_aquila-01=INFINITY
pcs constraint location kvm_aquila-02 avoids kvm_aquila-02=INFINITY
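Since your configuration already contains fence devices with those names, you would likely need to remove the old definitions before re-creating them, and then verify the result. A rough sketch using standard pcs commands (the delete/verify steps are an addition, not part of the quoted answer):

pcs stonith delete kvm_aquila-01    # drop the old device that listed both hosts
pcs stonith delete kvm_aquila-02
# re-create the devices and constraints as shown above, then check:
pcs stonith show --full             # each device should list only its own host in pcmk_host_list
pcs constraint location show        # each device should avoid the node it is meant to fence
pcs status                          # each fence device should be started on the opposite node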

Source: https://unix.stackexchange.com/questions/527400