This post takes a deep dive into RHV Power Management.

What is Host Power Management (Fencing)?

When Power Management is configured, RHV can reboot hosts that are in the NonOperational or NonResponsive state. RHV supports the following power management devices:

  • American Power Conversion (apc)
  • IBM Bladecenter (Bladecenter)
  • Cisco Unified Computing System (cisco_ucs)
  • Dell Remote Access Card 5 (drac5)
  • Dell Remote Access Card 7 (drac7)
  • Electronic Power Switch (eps)
  • HP BladeSystem (hpblade)
  • Integrated Lights Out (ilo, ilo2, ilo3, ilo4, ilo_ssh)
  • Intelligent Platform Management Interface (ipmilan)
  • Remote Supervisor Adapter (rsa)
  • Fujitsu-Siemens RSB (rsb)
  • Western Telematic, Inc (wti)

RHV uses fence agents to communicate with power management devices.

What is Auto Fencing?

When a host fails unexpectedly, its status changes to Connecting, and it stays in that status for a grace period. If the grace period elapses, the host moves to the ‘NonResponsive’ or ‘NonOperational’ state. To react to that state, Engine fences the problematic host by rebooting it. Engine uses the fencing agent for the host's power management card to stop the host, confirm it has stopped, start the host, and confirm that it has started.

Auto Fence Grace Period:

By default, Engine asks VDSM for the host status twice:

option_name              | option_value | default_value
-------------------------+--------------+---------------
VDSAttemptsToResetCount  | 2            | 2
(1 row)

Grace Period = TimeoutToResetVdsInSeconds + DelayResetPerVmInSeconds * (number of VMs on the host) + DelayResetForSpmInSeconds (added only if the host is the SPM)

For example, if the host is the SPM, runs two VMs, and all options are at their default values, then the grace period = 60 + 0.5*2 + 20 = 81s

option_name                 | option_value | default_value
----------------------------+--------------+---------------
TimeoutToResetVdsInSeconds  | 60           | 60
DelayResetForSpmInSeconds   | 20           | 20
DelayResetPerVmInSeconds    | 0.5          | 0.5
VDSAttemptsToResetCount     | 2            | 2
(4 rows)
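The formula is easy to check by hand, but a small helper makes it simple to try other host sizes. This is only a sketch of the arithmetic described above, not Engine code; the function name `grace_period` is made up for illustration, and the defaults are the values from the table.

```python
# Defaults copied from the engine-config table above.
TIMEOUT_TO_RESET_VDS_IN_SECONDS = 60
DELAY_RESET_PER_VM_IN_SECONDS = 0.5
DELAY_RESET_FOR_SPM_IN_SECONDS = 20


def grace_period(vm_count, is_spm,
                 timeout=TIMEOUT_TO_RESET_VDS_IN_SECONDS,
                 per_vm=DELAY_RESET_PER_VM_IN_SECONDS,
                 spm=DELAY_RESET_FOR_SPM_IN_SECONDS):
    """Return the auto-fence grace period in seconds (hypothetical helper)."""
    # The SPM delay is added only when the host holds the SPM role.
    return timeout + per_vm * vm_count + (spm if is_spm else 0)


if __name__ == "__main__":
    print(grace_period(vm_count=2, is_spm=True))    # SPM host with two VMs
    print(grace_period(vm_count=10, is_spm=False))  # non-SPM host with ten VMs
```

The first call reproduces the 81s example; a non-SPM host only pays the base timeout plus the per-VM delay.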

Kdump Fence:

When “Kdump integration” is enabled, Engine delays the hard fence until the host finishes writing its memory dump after a crash.

Soft Fence:

Fencing is enabled at the cluster level:

Admin Portal --> Compute --> Cluster --> Edit Cluster --> Fencing Policy --> Enable Fencing

Before rebooting a ‘non-responsive’ host, Engine first attempts to restart VDSM on it over SSH. This is called “SSH Soft Fencing”.

option_name            | option_value                                    | default_value                                  | version
-----------------------+-------------------------------------------------+------------------------------------------------+---------
SshSoftFencingCommand  | /usr/bin/vdsm-tool service-restart vdsmd        | /usr/bin/vdsm-tool service-restart vdsmd       | 4.3
(1 row)

Soft-fencing over SSH can be executed on hosts that have no power management configured. This is distinct from “fencing”. Fencing can be executed only on hosts that have power management configured.
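To make the difference between soft fencing and fencing concrete, here is a rough sketch of the escalation order implied by the sections above. The function `recovery_steps` and the step names are invented for this post, and the exact ordering inside Engine is an assumption based on the descriptions here, not taken from the Engine source.

```python
def recovery_steps(has_power_management, kdump_integration=False):
    """List the recovery actions that could be tried, in order (assumed)."""
    # SSH soft fencing only needs SSH access, so it is always available.
    steps = ["ssh_soft_fence"]
    if has_power_management:
        if kdump_integration:
            # Kdump integration only delays the hard fence; it does not replace it.
            steps.append("wait_for_kdump")
        steps.append("hard_fence")
    return steps


if __name__ == "__main__":
    print(recovery_steps(has_power_management=False))
    print(recovery_steps(has_power_management=True, kdump_integration=True))
```

A host without power management never reaches the hard fence step, which matches the distinction drawn above.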

Selecting a Proxy

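Fence commands are sent through another host that acts as a proxy. The default Power Management Proxy Preference is cluster, dc: Engine looks for a proxy host in “UP” status in the same cluster as the failed host first, then in the same data center. There is an option to add “other_dc” to the preference.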

# engine-config -g FenceProxyDefaultPreferences
FenceProxyDefaultPreferences: cluster,dc version: general
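The preference order can be pictured with a small sketch. Everything here is hypothetical: the host dictionaries, their keys, and the `select_proxy` function are made up for illustration, and the real Engine selection logic may apply extra conditions.

```python
def select_proxy(failed_host, hosts, preferences="cluster,dc"):
    """Return the first UP host matching the preference order, or None."""
    for scope in preferences.split(","):
        scope = scope.strip()
        for host in hosts:
            # The failed host cannot proxy for itself, and a proxy must be UP.
            if host["name"] == failed_host["name"] or host["status"] != "UP":
                continue
            same_dc = host["dc"] == failed_host["dc"]
            same_cluster = same_dc and host["cluster"] == failed_host["cluster"]
            if ((scope == "cluster" and same_cluster)
                    or (scope == "dc" and same_dc)
                    or (scope == "other_dc" and not same_dc)):
                return host
    return None


if __name__ == "__main__":
    failed = {"name": "h1", "dc": "dc1", "cluster": "c1", "status": "NonResponsive"}
    hosts = [
        failed,
        {"name": "h2", "dc": "dc1", "cluster": "c1", "status": "Maintenance"},
        {"name": "h3", "dc": "dc1", "cluster": "c2", "status": "UP"},
        {"name": "h4", "dc": "dc2", "cluster": "c9", "status": "UP"},
    ]
    print(select_proxy(failed, hosts)["name"])
```

In this example no host in the same cluster is UP, so the search falls through to the data center scope.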

Flow:

[Figure: Flow]

Engine Flow:

[Figure: EngineFlow]

Configuration Metadata

Here is the metadata for VdsFenceType, VdsFenceOptionTypes, VdsFenceOptionMapping, FenceAgentMapping, and FenceAgentDefaultParams:

-[ RECORD 1 ]+----------------------------------------------------------------------------------------------------------
option_name | VdsFenceType
option_value | apc,apc_snmp,bladecenter,cisco_ucs,drac5,drac7,eps,hpblade,ilo,ilo2,ilo3,ilo4,ilo_ssh,ipmilan,rsa,rsb,wti
version | 4.3

-[ RECORD 2 ]-+---------------------------------------------------------------------------------------------------------
option_name | VdsFenceOptionTypes
option_value | encrypt_options=bool,secure=bool,port=int,slot=int
default_value | encrypt_options=bool,secure=bool,port=int,slot=int

-[ RECORD 3 ]-+----------------------------------------------------------------------------------------------------------
option_name | VdsFenceOptionMapping
option_value |

apc:secure=secure,port=ipport,slot=port;
apc_snmp:port=port,encrypt_options=encrypt_options;
bladecenter:secure=secure,port=ipport,slot=port;
cisco_ucs:secure=ssl,slot=port;
drac5:secure=secure,slot=port;
drac7:;
eps:slot=port;
hpblade:port=port;
ilo:secure=ssl,port=ipport;
ipmilan:;
ilo2:secure=ssl,port=ipport;
ilo3:;
ilo4:;
ilo_ssh:port=port;
rsa:secure=secure,port=ipport;
rsb:;
wti:secure=secure,port=ipport,slot=port

default_value |
apc:secure=secure,port=ipport,slot=port;
apc_snmp:port=port,encrypt_options=encrypt_options;
bladecenter:secure=secure,port=ipport,slot=port;
cisco_ucs:secure=ssl,slot=port;
drac5:secure=secure,slot=port;
drac7:;
eps:slot=port;
hpblade:port=port;
ilo:secure=ssl,port=ipport;
ipmilan:;
ilo2:secure=ssl,port=ipport;
ilo3:;
ilo4:;
ilo_ssh:port=port;
rsa:secure=secure,port=ipport;
rsb:;
wti:secure=secure,port=ipport,slot=port

-[ RECORD 4 ]-+----------------------------------------------------------------------------------------------
option_name | FenceAgentMapping
option_value | drac7=ipmilan,ilo2=ilo
default_value | drac7=ipmilan,ilo2=ilo

-[ RECORD 5 ]-+-----------------------------------------------------------------------------------------------
option_name | FenceAgentDefaultParams
option_value | drac7:privlvl=OPERATOR,lanplus=1,delay=10;ilo3:power_wait=4;ilo4:power_wait=4;ilo_ssh:secure=1
default_value | drac7:privlvl=OPERATOR,lanplus=1,delay=10;ilo3:power_wait=4;ilo4:power_wait=4;ilo_ssh:secure=1
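The values above share a simple format: entries separated by `;`, an agent name before `:`, and comma-separated `key=value` pairs after it. The sketch below parses that format and looks up an agent. The function names are made up for this post, and reading FenceAgentMapping as "configured type to agent actually used" is my interpretation of the data, not something taken from the Engine source.

```python
def parse_agent_options(raw):
    """Parse 'agent:k=v,k=v;agent2:;...' into {agent: {k: v}}."""
    result = {}
    for entry in raw.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        agent, _, options = entry.partition(":")
        # An entry such as 'drac7:' has no options and yields an empty dict.
        result[agent.strip()] = dict(
            pair.strip().split("=", 1)
            for pair in options.split(",") if pair.strip()
        )
    return result


def parse_agent_mapping(raw):
    """Parse 'drac7=ipmilan,ilo2=ilo' into {'drac7': 'ipmilan', ...}."""
    return dict(pair.strip().split("=", 1)
                for pair in raw.split(",") if pair.strip())


def resolve_agent(agent_type, agent_mapping, default_params):
    """Return (agent to use, default parameters) for a configured type."""
    return (agent_mapping.get(agent_type, agent_type),
            default_params.get(agent_type, {}))


if __name__ == "__main__":
    mapping = parse_agent_mapping("drac7=ipmilan,ilo2=ilo")
    defaults = parse_agent_options(
        "drac7:privlvl=OPERATOR,lanplus=1,delay=10;"
        "ilo3:power_wait=4;ilo4:power_wait=4;ilo_ssh:secure=1"
    )
    print(resolve_agent("drac7", mapping, defaults))
    print(resolve_agent("apc", mapping, defaults))
```

Note that all parsed values stay strings; VdsFenceOptionTypes is what tells you whether an option such as `port` or `secure` should be read as an int or a bool.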

The metadata above can be configured in engine-config:

# engine-config -a | grep -E 'CustomFence|CustomVdsFence'
CustomFenceAgentMapping: version: general
CustomFenceAgentDefaultParams: version: general
CustomFenceAgentDefaultParamsForPPC: version: general
CustomVdsFenceOptionMapping: version: general
CustomVdsFenceType: version: general
CustomFencePowerWaitParam: version: general

Other Configuration (Timeouts and Retries):

# engine-config -a | grep -E 'FenceStart|FenceStop'
FenceStartStatusRetries: 18 version: general
FenceStartStatusDelayBetweenRetriesInSec: 10 version: general
FenceStopStatusRetries: 18 version: general
FenceStopStatusDelayBetweenRetriesInSec: 10 version: general
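With 18 retries and 10 seconds between them, confirming a stop or a start can take up to about 180 seconds each. The polling loop below is a sketch of how such settings are typically used. The function `wait_for_power_status` is hypothetical, and it assumes the delay comes before every status check, which may not match Engine exactly.

```python
import time


def wait_for_power_status(get_status, expected, retries=18, delay=10,
                          sleep=time.sleep):
    """Poll get_status() until it returns `expected`.

    Returns the attempt number that succeeded, or None if all retries
    were used up. `sleep` is injectable so the loop can be tested quickly.
    """
    for attempt in range(1, retries + 1):
        sleep(delay)  # assumed: wait first, then ask the fence agent
        if get_status() == expected:
            return attempt
    return None


if __name__ == "__main__":
    # Simulated power states reported after a 'stop' request.
    statuses = iter(["on", "on", "off"])
    waited = []
    attempt = wait_for_power_status(lambda: next(statuses), "off",
                                    sleep=waited.append)
    print(attempt, sum(waited))
```

Passing `waited.append` as the sleep function records the delays instead of actually sleeping, which is handy for trying out different retry settings.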