High availability is a set of automatic features designed to plan for, and safely recover from, issues that take down Citrix Hypervisor servers or make them unreachable, for example physically disrupted networking or host hardware failures.
High availability ensures that when a host becomes unreachable or unstable, VMs running on that host are shut down and restarted on another host. Shutting the VMs down first avoids the scenario in which the VMs are started (manually or automatically) on a new host and, at some later point, the original host recovers. That scenario can lead to two instances of the same VM running on different hosts, with a correspondingly high probability of VM disk corruption and data loss.
When the pool master becomes unreachable or unstable, high availability can also recover administrative control of a pool. High availability ensures that administrative control is restored automatically without any manual intervention.
Optionally, high availability can also automate the process of restarting VMs on hosts which are known to be in a good state without manual intervention. These VMs can be scheduled for restart in groups to allow time to start services. This grouping allows infrastructure VMs to be started before their dependent VMs (for example, a DHCP server before its dependent SQL server).
Use high availability along with multipathed storage and bonded networking. Configure multipathed storage and bonded networking before attempting to set up high availability. Customers who do not set up multipathed storage and bonded networking can see unexpected host reboot behavior (Self-Fencing) when there is an infrastructure instability.
All graphics solutions (NVIDIA vGPU, Intel GVT-d, Intel GVT-G, AMD MxGPU, and vGPU pass-through) can be used in an environment that uses high availability. However, VMs that use these graphics solutions cannot be protected with high availability. These VMs can be restarted on a best-effort basis while there are hosts with the appropriate free resources.
A pool is overcommitted when the VMs that are currently running cannot be restarted elsewhere following a user-defined number of host failures.
Overcommitting can happen if there is not enough free memory across the pool to run those VMs following a failure. However, there are also more subtle changes which can make high availability guarantees unsustainable: Changes to Virtual Block Devices (VBDs) and networks can affect which VMs can be restarted on which hosts. Citrix Hypervisor cannot check all potential actions and determine if they cause violation of high availability demands. However, an asynchronous notification is sent if high availability becomes unsustainable.
Citrix Hypervisor dynamically maintains a failover plan which details what to do when a set of hosts in a pool fail at any given time. An important concept to understand is the host failures to tolerate value, which is defined as part of high availability configuration. The value of host failures to tolerate determines the number of failures that are allowed without any loss of service. For example, consider a resource pool that consists of 64 hosts with the tolerated failures set to 3. In this case, the pool calculates a failover plan that allows any three hosts to fail and restarts their VMs on other hosts. If a plan cannot be found, then the pool is considered to be overcommitted. The plan is dynamically recalculated based on VM lifecycle operations and movement. If changes, for example the addition of new VMs to the pool, cause your pool to become overcommitted, alerts are sent (either through XenCenter or email).
If any attempts to start or resume a VM cause the pool to be overcommitted, a warning alert is displayed. This warning appears in XenCenter and is also available as a message instance through the management API. If you have configured an email address, a message can also be sent to the email address. You can then cancel the operation, or proceed anyway. Proceeding causes the pool to become overcommitted. The amount of memory used by VMs of different priorities is displayed at the pool and host levels.
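The plan state described above can also be inspected from the CLI. The following is a minimal sketch, not an official procedure. It assumes that the pool fields `ha-plan-exists-for`, `ha-host-failures-to-tolerate`, and `ha-overcommitted` are readable through `xe pool-param-get` on your release; the function name `ha_plan_status` is hypothetical.

```shell
# Hypothetical helper: summarize the failover-plan state of a pool.
# Assumption: the pool fields ha-plan-exists-for,
# ha-host-failures-to-tolerate, and ha-overcommitted can be read with
# `xe pool-param-get`.
ha_plan_status() {
    pool_uuid=$1
    planned=$(xe pool-param-get uuid="$pool_uuid" param-name=ha-plan-exists-for) || return 1
    wanted=$(xe pool-param-get uuid="$pool_uuid" param-name=ha-host-failures-to-tolerate) || return 1
    over=$(xe pool-param-get uuid="$pool_uuid" param-name=ha-overcommitted) || return 1

    echo "plan covers $planned host failure(s); configured to tolerate $wanted"
    # Treat the pool as overcommitted if it says so itself, or if the
    # current plan covers fewer failures than the configured value.
    if [ "$over" = "true" ] || [ "$planned" -lt "$wanted" ]; then
        echo "pool is overcommitted"
        return 2
    fi
    echo "pool is not overcommitted"
}

# Example (pool_uuid is a placeholder):
#   ha_plan_status pool_uuid
```

The function returns a non-zero status when the pool is overcommitted, so it can be used as a guard before starting more VMs from a script.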
Sometimes, a server can fail due to the loss of network connectivity or when a problem with the control stack is encountered. In such cases, the Citrix Hypervisor server self-fences to ensure the VMs are not running on two servers simultaneously. When a fence action is taken, the server restarts immediately and abruptly, causing all VMs running on it to be stopped. The other servers detect that the VMs are no longer running and the VMs are restarted according to the restart priorities assigned to them. The fenced server enters a reboot sequence, and when it has restarted it tries to rejoin the resource pool.
Hosts in clustered pools can also self-fence when they cannot communicate with more than half the other hosts in the resource pool. For more information, see Clustered pools.
To use the high availability feature, you need:
Citrix Hypervisor pool (this feature provides high availability at the server level within a single resource pool).
We recommend that you enable high availability only in pools that contain at least three Citrix Hypervisor servers. For more information, see CTX129721 - High Availability Behavior When the Heartbeat is Lost in a Pool.
Shared storage, including at least one iSCSI, NFS, or Fibre Channel LUN of size 356 MB or greater - the heartbeat SR. The high availability mechanism creates two volumes on the heartbeat SR:
4 MB heartbeat volume: Used for heartbeating.
256 MB metadata volume: To store pool master metadata to be used if there is a master failover.
- For maximum reliability, we recommend that you use a dedicated NFS or iSCSI storage repository as your high availability heartbeat disk. Do not use this storage repository for any other purpose.
- If your pool is a clustered pool, your heartbeat SR must be a GFS2 SR.
- Storage attached using either SMB or iSCSI when authenticated using CHAP cannot be used as the heartbeat SR.
- When using a NetApp or EqualLogic SR, manually provision an NFS or iSCSI LUN on the array to use as the heartbeat SR.
Static IP addresses for all hosts.
If the IP address of a server changes while high availability is enabled, high availability assumes that the host’s network has failed. The change in IP address can fence the host and leave it in an unbootable state. To remedy this situation, disable high availability using the host-emergency-ha-disable command, reset the pool master using pool-emergency-reset-master, and then re-enable high availability.
For maximum reliability, we recommend that you use a dedicated bonded interface as the high availability management network.
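The size requirement for the heartbeat SR listed above can be checked before you enable high availability. This is a small sketch under two assumptions: that 1 MB means 1024 * 1024 bytes, and that the SR field `physical-size` reports the size in bytes. The names `MIN_HEARTBEAT_SR_BYTES` and `heartbeat_sr_large_enough` are hypothetical.

```shell
# Minimum heartbeat SR size stated in the requirements: 356 MB.
# Assumption: 1 MB = 1024 * 1024 bytes.
MIN_HEARTBEAT_SR_BYTES=$((356 * 1024 * 1024))

# Hypothetical helper: succeed if a size in bytes meets the minimum.
heartbeat_sr_large_enough() {
    [ "$1" -ge "$MIN_HEARTBEAT_SR_BYTES" ]
}

# Example (requires a live pool; sr_uuid is a placeholder, and
# physical-size is assumed to be reported in bytes):
#   size=$(xe sr-param-get uuid=sr_uuid param-name=physical-size)
#   heartbeat_sr_large_enough "$size" || echo "SR too small for heartbeat"
```

The check covers size only. The other constraints above (SR type, authentication method, dedicated use) still need to be verified separately.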
For a VM to be protected by high availability, it must be agile. This means that the VM:
Must have its virtual disks on shared storage. You can use any type of shared storage. An iSCSI, NFS, or Fibre Channel LUN is required only for the storage heartbeat, but it can also be used for virtual disk storage.
Can use live migration
Does not have a connection to a local DVD drive configured
Has its virtual network interfaces on pool-wide networks
When high availability is enabled, we strongly recommend using a bonded management interface on the servers in the pool and multipathed storage for the heartbeat SR.
If you create VLANs and bonded interfaces from the CLI, then they might not be plugged in and active despite being created. In this situation, a VM can appear to be not agile and it is not protected by high availability. You can use the CLI pif-plug command to bring up the VLAN and bond PIFs so that the VM can become agile. You can also determine precisely why a VM is not agile by using the xe diagnostic-vm-status CLI command. This command analyzes the VM's placement constraints, and you can take remedial action if necessary.
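The remediation described above can be scripted. The sketch below assumes that `xe pif-list` accepts `currently-attached=false` as a filter and that `--minimal` prints a comma-separated list of UUIDs; the function name `plug_unattached_pifs` is hypothetical. It plugs every unattached PIF it finds, so review the list first if some PIFs are intentionally left unplugged.

```shell
# Hypothetical helper: plug every PIF that is not currently attached.
# Assumption: `xe pif-list currently-attached=false params=uuid --minimal`
# prints a comma-separated list of PIF UUIDs.
plug_unattached_pifs() {
    uuids=$(xe pif-list currently-attached=false params=uuid --minimal) || return 1
    # Split the comma-separated list into one UUID per line.
    echo "$uuids" | tr ',' '\n' | while read -r pif; do
        [ -n "$pif" ] || continue
        xe pif-plug uuid="$pif" </dev/null
    done
}

# Afterwards, ask why a particular VM is still not agile
# (vm_uuid is a placeholder):
#   xe diagnostic-vm-status uuid=vm_uuid
```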
Restart configuration settings
Virtual machines can be considered protected, best-effort, or unprotected by high availability. The value of ha-restart-priority defines whether a VM is treated as protected, best-effort, or unprotected. The restart behavior for VMs in each of these categories is different.
High availability guarantees to restart a protected VM that goes offline or whose host goes offline, provided the pool isn’t overcommitted and the VM is agile.
If a protected VM cannot be restarted when a server fails, high availability attempts to start the VM when there is extra capacity in a pool. Attempts to start the VM when there is extra capacity might now succeed.
If the host of a best-effort VM goes offline, high availability attempts to restart the best-effort VM on another host. It makes this attempt only after all protected VMs have been successfully restarted. High availability makes only one attempt to restart a best-effort VM. If this attempt fails, high availability does not make further attempts to restart the VM.
If an unprotected VM or the host it runs on is stopped, high availability does not attempt to restart the VM.
For an unprotected VM, the value of ha-restart-priority is an empty string.
High availability never stops or migrates a running VM to free resources for a protected or best-effort VM to be restarted.
If the pool experiences server failures and the number of tolerable failures drops to zero, the protected VMs are not guaranteed to restart. In such cases, a system alert is generated. If another failure occurs, all VMs that have a restart priority set behave according to the best-effort behavior.
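The three categories can be told apart by reading `ha-restart-priority` for a VM. The sketch below maps the values that appear in this article (`restart`, `best-effort`, and the empty string) to their category. The function name `ha_category` is hypothetical, and any other value is reported as unknown rather than guessed at.

```shell
# Hypothetical helper: map an ha-restart-priority value to its category.
ha_category() {
    case "$1" in
        restart)     echo "protected" ;;
        best-effort) echo "best-effort" ;;
        "")          echo "unprotected" ;;
        *)           echo "unknown"; return 1 ;;
    esac
}

# Example (requires a live pool; vm_uuid is a placeholder):
#   ha_category "$(xe vm-param-get uuid=vm_uuid param-name=ha-restart-priority)"
```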
The start order is the order in which Citrix Hypervisor high availability attempts to restart protected VMs when a failure occurs. The values of the order property for each of the protected VMs determine the start order.
The order property of a VM is used by high availability and also by other features that start and shut down VMs. Any VM can have the order property set, not just the VMs marked as protected for high availability. However, high availability uses the order property for protected VMs only.
The value of the order property is an integer. The default value is 0, which is the highest priority. Protected VMs with an order value of 0 are restarted first by high availability. The higher the value of the order property, the later in the sequence the VM is restarted.
You can set the value of the order property of a VM by using the command-line interface:
xe vm-param-set uuid=VM_UUID order=int
Or in XenCenter, in the Start Options panel for a VM, set Start order to the required value.
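To preview the sequence that a set of order values produces, you can sort the VMs by that value. This sketch only illustrates the rule stated above (lowest order value first). It does not query the pool, and the function name `restart_sequence` is hypothetical. How VMs that share an order value are sequenced is not stated here, so the sketch keeps them in input order. It also assumes that VM names contain no spaces and that `sort` supports the `-s` (stable) option.

```shell
# Hypothetical helper: read "vm_name order" pairs on stdin and print the
# VM names with the lowest order value first.
# VMs that share an order value keep their input order (stable sort).
restart_sequence() {
    sort -s -k2,2n | awk '{ print $1 }'
}

# Example:
#   printf '%s\n' 'sql 2' 'dhcp 0' 'dns 1' | restart_sequence
```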
Enable high availability on your Citrix Hypervisor pool
You can enable high availability on a pool by using either XenCenter or the command-line interface. In either case, you specify a set of priorities that determine which VMs are given the highest restart priority when a pool is overcommitted.
When you enable high availability, some operations that compromise the plan for restarting VMs, such as removing a server from a pool, might be disabled. You can temporarily disable high availability to perform such operations, or alternatively, make VMs protected by high availability unprotected.
If high availability is enabled, you cannot enable clustering on your pool. Temporarily disable high availability to enable clustering. After clustering is enabled, you can enable high availability on your clustered pool. Some high availability behavior, such as self-fencing, is different for clustered pools. For more information, see Clustered pools.
Enable high availability by using the CLI
Verify that you have a compatible Storage Repository (SR) attached to your pool. iSCSI, NFS, or Fibre Channel SRs are compatible. For information about how to configure such a storage repository using the CLI, see Manage storage repositories.
For each VM you want to protect, set a restart priority and start order. You can set the restart priority as follows:
xe vm-param-set uuid=vm_uuid ha-restart-priority=restart order=1
Enable high availability on the pool, and optionally, specify a timeout:
xe pool-ha-enable heartbeat-sr-uuids=sr_uuid ha-config:timeout=timeout_in_seconds
The timeout is the period during which networking or storage is not accessible by the hosts in your pool. If you do not specify a timeout when you enable high availability, Citrix Hypervisor uses the default timeout of 30 seconds. If any Citrix Hypervisor server is unable to access networking or storage within the timeout period, it can self-fence and restart.
Run the command pool-ha-compute-max-host-failures-to-tolerate. This command returns the maximum number of hosts that can fail before there are insufficient resources to run all the protected VMs in the pool.
The number of failures to tolerate determines when an alert is sent. The system recomputes a failover plan as the state of the pool changes. It uses this computation to identify the pool capacity and how many more failures are possible without loss of the liveness guarantee for protected VMs. A system alert is generated when this computed value falls below the value specified for the ha-host-failures-to-tolerate parameter. Set the parameter to a value that is less than or equal to the computed value:
xe pool-param-set ha-host-failures-to-tolerate=2 uuid=pool-uuid
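The last three steps can be combined in a script. The following is a sketch only, using the commands shown above. The function name `enable_ha` and its argument order are hypothetical. Lowering the requested value to the computed maximum is a choice made for this example; you might prefer to stop with an error instead.

```shell
# Hypothetical helper: enable HA and set the failures-to-tolerate value.
# Arguments: pool UUID, heartbeat SR UUID, requested failures to tolerate,
# and an optional timeout in seconds.
enable_ha() {
    pool_uuid=$1
    sr_uuid=$2
    requested=$3
    timeout=${4:-}

    if [ -n "$timeout" ]; then
        xe pool-ha-enable heartbeat-sr-uuids="$sr_uuid" ha-config:timeout="$timeout" || return 1
    else
        xe pool-ha-enable heartbeat-sr-uuids="$sr_uuid" || return 1
    fi

    # The configured value must not exceed what the pool can tolerate.
    max=$(xe pool-ha-compute-max-host-failures-to-tolerate) || return 1
    if [ "$requested" -gt "$max" ]; then
        echo "requested $requested exceeds computed maximum $max; using $max" >&2
        requested=$max
    fi
    xe pool-param-set ha-host-failures-to-tolerate="$requested" uuid="$pool_uuid"
}

# Example (placeholders throughout):
#   enable_ha pool_uuid sr_uuid 2 45
```

Set the restart priority and start order of each VM before you call the function, because the computed maximum depends on which VMs are protected.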
Remove high availability protection from a VM by using the CLI
To disable high availability features for a VM, use the xe vm-param-set command to set the ha-restart-priority parameter to an empty string. Setting the ha-restart-priority parameter does not clear the start order settings. You can enable high availability for a VM again by setting the ha-restart-priority parameter to restart or best-effort as appropriate.
Recover an unreachable host
If a host cannot access the high availability state file, the host might become unreachable. To recover your Citrix Hypervisor installation, you might have to disable high availability by running the following command on the host:
xe host-emergency-ha-disable --force
If the host was the pool master, it starts up as normal with high availability disabled. Pool members reconnect and automatically disable high availability. If the host was a pool member and cannot contact the master, you might have to take one of the following actions:
Force the host to reboot as a pool master:
xe pool-emergency-transition-to-master uuid=host_uuid
Tell the host where the new master is:
xe pool-emergency-reset-master master-address=new_master_hostname
When all hosts have successfully restarted, re-enable high availability:
xe pool-ha-enable heartbeat-sr-uuids=sr_uuid
Shutting down a host when high availability is enabled
Take special care when shutting down or rebooting a host to prevent the high availability mechanism from assuming that the host has failed. To shut down a host cleanly when high availability is enabled, disable the host, evacuate the host, and finally shut down the host by using either XenCenter or the CLI. To shut down a host in an environment where high availability is enabled, run these commands:
xe host-disable host=host_name
xe host-evacuate uuid=host_uuid
xe host-shutdown host=host_name
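When scripting this sequence, it helps to stop at the first step that fails, so that a host which could not be evacuated is not powered off. A minimal sketch follows. The function name `clean_host_shutdown` is hypothetical, it assumes that all three commands accept `uuid=` as the host selector, and re-enabling the host after a failed evacuation is a choice made for this example.

```shell
# Hypothetical helper: shut a host down cleanly, stopping at the first
# step that fails so a host that still runs VMs is never powered off.
# Assumption: each command accepts uuid= to select the host.
clean_host_shutdown() {
    host_uuid=$1
    xe host-disable uuid="$host_uuid" || return 1
    if ! xe host-evacuate uuid="$host_uuid"; then
        # Evacuation failed: make the host available again instead of
        # leaving it disabled.
        xe host-enable uuid="$host_uuid"
        return 1
    fi
    xe host-shutdown uuid="$host_uuid"
}

# Example (host_uuid is a placeholder):
#   clean_host_shutdown host_uuid
```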
Shut down a VM protected by high availability
When a VM is protected under a high availability plan and set to restart automatically, it cannot be shut down while this protection is active. To shut down a VM, first disable its high availability protection and then run the shutdown command from the CLI. XenCenter offers you a dialog box to automate disabling the protection when you select the Shutdown button of a protected VM.
If you shut down a VM from within the guest, and the VM is protected, it is automatically restarted under the high availability failure conditions. The automatic restart helps ensure that operator error doesn’t result in a protected VM being left shut down accidentally. If you want to shut down this VM, disable its high availability protection first.
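From the CLI, the two steps can be sketched as follows. The function name `shutdown_protected_vm` is hypothetical. The sketch prints the previous `ha-restart-priority` value so that you can restore it when the VM is started again; it does not restore the value itself.

```shell
# Hypothetical helper: shut down a VM that may be protected by HA.
# Prints the previous ha-restart-priority so it can be restored later.
shutdown_protected_vm() {
    vm_uuid=$1
    previous=$(xe vm-param-get uuid="$vm_uuid" param-name=ha-restart-priority) || return 1
    # An empty string removes HA protection; the start order is kept.
    xe vm-param-set uuid="$vm_uuid" ha-restart-priority="" || return 1
    xe vm-shutdown uuid="$vm_uuid" || return 1
    echo "previous ha-restart-priority: $previous"
}

# Example (vm_uuid is a placeholder):
#   shutdown_protected_vm vm_uuid
```

If the shutdown step fails, the VM is left unprotected, so check the result and set `ha-restart-priority` again if needed.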