This post explains in detail how RHEVM monitors storage health.

Monitor Storage Pool Manager Health

The Storage Pool Manager (SPM) is a management role assigned to one of the hosts in a data center, enabling that host to manage the storage domains of the data center. RHEVM checks SPM availability and metadata integrity at every SPM polling interval, which is 10 seconds by default.

You can check these engine configuration values as follows. Changing the parameters below is not recommended unless a support engineer advises it for a specific use case:

$ engine-config -g StoragePoolRefreshTimeInSeconds
StoragePoolRefreshTimeInSeconds: 10 version: general

$ engine-config -g SpmCommandFailOverRetries
SpmCommandFailOverRetries: 3 version: general

$ engine-config -g SPMFailOverAttempts
SPMFailOverAttempts: 3 version: general

$ engine-config -g DelayResetForSpmInSeconds
DelayResetForSpmInSeconds: 20 version: general
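
If you want to collect these values in a script, the output shown above is easy to parse. The following is a minimal sketch that only assumes the "Key: value version: general" line format seen in this post; the function name parse_engine_config is hypothetical and not part of RHEVM.

```python
import re

# Matches the output format shown above: "<Key>: <value> version: <version>".
# Prompt lines such as "$ engine-config -g ..." do not match and are skipped.
_LINE_RE = re.compile(
    r"^(?P<key>\S+):\s*(?P<value>.*?)\s+version:\s*(?P<version>\S+)\s*$"
)


def parse_engine_config(output):
    """Return {key: value} for every 'Key: value version: x' line in output."""
    settings = {}
    for line in output.splitlines():
        match = _LINE_RE.match(line.strip())
        if match:
            value = match.group("value")
            # Convert plain integers, keep anything else as a string.
            settings[match.group("key")] = int(value) if value.isdigit() else value
    return settings


# Sample text copied from the command output in this post.
sample = """\
StoragePoolRefreshTimeInSeconds: 10 version: general
SpmCommandFailOverRetries: 3 version: general
SPMFailOverAttempts: 3 version: general
DelayResetForSpmInSeconds: 20 version: general
"""
spm_settings = parse_engine_config(sample)
```

In practice you would feed the function the captured output of the engine-config commands instead of the sample string.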

Monitor Storage Domain Health

Storage domain health is monitored by both the RHEVM engine and Vdsm on the KVM hosts.

1. RHEVM monitors storage domain health

By default, RHEVM polls the status of each host every 3 seconds (VdsRefreshRate), and the storage status is checked as part of that poll. If getRepoStats reports a non-zero code for a domain, or if lastcheck (statsGenTime - domStatus.checkTime) is higher than MaxStorageVdsTimeoutCheckSec (30 seconds by default), the storage domain is treated as problematic and a timer starts for that domain. The timer runs for StorageDomainFailureTimeoutInMinutes (5 minutes by default). If the problematic storage domain has not recovered by the time the timer expires, the host is set to Non-Operational. If the problematic domain recovers during this time, RHEVM automatically activates the KVM host again.
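
The decision described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the engine's actual code: the class DomainTracker, the function is_problematic, and the returned state strings are hypothetical names, and the thresholds are the default values quoted in this post.

```python
from dataclasses import dataclass, field

# Defaults quoted in this post (engine-config values).
MAX_STORAGE_VDS_TIMEOUT_CHECK_SEC = 30       # MaxStorageVdsTimeoutCheckSec
STORAGE_DOMAIN_FAILURE_TIMEOUT_SEC = 5 * 60  # StorageDomainFailureTimeoutInMinutes


def is_problematic(code, lastcheck):
    """A domain is problematic on a non-zero code or a stale lastcheck."""
    return code != 0 or lastcheck > MAX_STORAGE_VDS_TIMEOUT_CHECK_SEC


@dataclass
class DomainTracker:
    """Hypothetical helper: remembers when each domain first looked bad."""
    problem_since: dict = field(default_factory=dict)

    def update(self, domain_id, code, lastcheck, now):
        """Process one poll result; 'now' is a timestamp in seconds."""
        if not is_problematic(code, lastcheck):
            # Domain recovered: drop any running timer.
            self.problem_since.pop(domain_id, None)
            return "ok"
        # Start the timer on the first bad report, keep it on later ones.
        started = self.problem_since.setdefault(domain_id, now)
        if now - started >= STORAGE_DOMAIN_FAILURE_TIMEOUT_SEC:
            return "host-non-operational"
        return "problematic"


tracker = DomainTracker()
states = [
    tracker.update("domain-1", code=0, lastcheck=2.0, now=0),
    tracker.update("domain-1", code=0, lastcheck=31.0, now=3),   # stale lastcheck
    tracker.update("domain-1", code=1, lastcheck=1.0, now=200),  # non-zero code
    tracker.update("domain-1", code=1, lastcheck=1.0, now=303),  # 5 minutes elapsed
    tracker.update("domain-1", code=0, lastcheck=1.0, now=310),  # recovered
]
```

The sketch keeps one timer per domain, matching the description that the timer starts for the domain rather than for the host as a whole.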

Here are the related engine configuration values:

$ engine-config -g VdsRefreshRate
VdsRefreshRate: 3 version: general

$ engine-config -g StorageDomainFailureTimeoutInMinutes
StorageDomainFailureTimeoutInMinutes: 5 version: general

$ engine-config -g MaxStorageVdsTimeoutCheckSec
MaxStorageVdsTimeoutCheckSec: 30 version: general

$ engine-config -g MaxStorageVdsDelayCheckSec
MaxStorageVdsDelayCheckSec: 5 version: general

2. KVM host monitors storage health

If a KVM host can’t access the storage domains, it becomes Non-Operational. Vdsm refreshes the storage status at the monitoring interval set by sd_health_check_delay. It invokes getStorageDomainStats to get the domain status (dom.getStats) and returns that status to RHEVM via repoStats in _getDomsStats. Both repo_stats_cache_refresh_timeout and sd_health_check_delay can be configured in /etc/vdsm/vdsm.conf.
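
To see which values a host is actually using, you can read the two options from the configuration file. The sketch below is a hedged example: it assumes vdsm.conf is INI-style and that both options live in an [irs] section, which you should verify on your own host. The function name read_monitor_settings is hypothetical, and the numbers in the sample text are illustrative, not documented defaults.

```python
import configparser


def read_monitor_settings(conf_text, section="irs"):
    """Return the two monitoring options as ints, or None when not set.

    The section name "irs" is an assumption; pass another name if your
    vdsm.conf keeps these options elsewhere.
    """
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    keys = ("sd_health_check_delay", "repo_stats_cache_refresh_timeout")
    result = {}
    for key in keys:
        # fallback=None covers both a missing section and a missing option.
        raw = parser.get(section, key, fallback=None)
        result[key] = int(raw) if raw is not None else None
    return result


# Illustrative content only; the values are not taken from a real host.
sample_conf = """\
[irs]
sd_health_check_delay = 15
repo_stats_cache_refresh_timeout = 300
"""
monitor_settings = read_monitor_settings(sample_conf)
```

On a real host you would pass the contents of /etc/vdsm/vdsm.conf instead of the sample string. A value of None means the option is not set in the file, in which case Vdsm uses its built-in default.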