Citrix Hypervisor

High availability

High availability is a set of automatic features designed to plan for, and safely recover from issues which take down Citrix Hypervisor servers or make them unreachable. For example, during physically disrupted networking or host hardware failures.

Overview

High availability ensures that if a host becomes unreachable or unstable, the VMs running on that host are safely restarted on another host automatically. This removes the need for the VMs to be manually restarted, resulting in minimal VM downtime.

When the pool master becomes unreachable or unstable, high availability can also recover administrative control of a pool. High availability ensures that administrative control is restored automatically without any manual intervention.

Optionally, high availability can also automate the process of restarting VMs on hosts which are known to be in a good state without manual intervention. These VMs can be scheduled for restart in groups to allow time to start services. It allows infrastructure VMs to be started before their dependent VMs (for example, a DHCP server before its dependent SQL server).

Warnings:

Use high availability along with multipathed storage and bonded networking. Configure multipathed storage and bonded networking before attempting to set up high availability. Customers who do not set up multipathed storage and bonded networking can see unexpected host reboot behavior (Self-Fencing) when there is an infrastructure instability.

All graphics solutions (NVIDIA vGPU, Intel GVT-d, Intel GVT-G, AMD MxGPU, and vGPU pass-through) can be used in an environment that uses high availability. However, VMs that use these graphics solutions cannot be protected with high availability. These VMs can be restarted on a best-effort basis while there are hosts with the appropriate free resources.

Overcommitting

A pool is overcommitted when the VMs that are currently running cannot be restarted elsewhere following a user-defined number of host failures. Overcommitting can happen if there is not enough free memory across the pool to run those VMs following a failure. However, there are also more subtle changes which can make high availability unsustainable: changes to Virtual Block Devices (VBDs) and networks can affect which VMs can be restarted on which hosts. Citrix Hypervisor cannot check all potential actions and determine if they cause violation of high availability demands. However, an asynchronous notification is sent if high availability becomes unsustainable.

Citrix Hypervisor dynamically maintains a failover plan which details what to do when a set of hosts in a pool fail at any given time. An important concept to understand is the host failures to tolerate value, which is defined as part of the high availability configuration. The value of host failures to tolerate determines the number of host failures that are allowed while still being able to restart all protected VMs. For example, consider a resource pool that consists of 64 hosts and host failures to tolerate is set to 3. In this case, the pool calculates a failover plan that allows any three hosts to fail and then restarts the VMs on other hosts. If a plan cannot be found, then the pool is considered to be overcommitted. The plan is dynamically recalculated based on VM lifecycle operations and movement. If changes (for example, the addition of new VMs to the pool) cause the pool to become overcommitted, alerts are sent either via XenCenter or email.

Overcommitment warning

If any attempts to start or resume a VM would cause the pool to become overcommitted, a warning alert is displayed in XenCenter. You can then choose to cancel the operation or proceed anyway. Proceeding causes the pool to become overcommitted and a message is sent to any configured email addresses. This is also available as a message instance through the management API. The amount of memory used by VMs of different priorities is displayed at the pool and host levels.

Host fencing

Sometimes a server can fail due to the loss of network connectivity or when a problem with the control stack is encountered. In such cases, the Citrix Hypervisor server self-fences to ensure that the VMs are not running on two servers simultaneously. When a fence action is taken, the server restarts immediately and abruptly, causing all VMs running on it to be stopped. The other servers detect that the VMs are no longer running and then the VMs are restarted according to the restart priorities assigned to them. The fenced server enters a reboot sequence, and when it has restarted it tries to rejoin the resource pool.

Note:

Hosts in clustered pools can also self-fence when they cannot communicate with more than half of the other hosts in the resource pool. For more information, see Clustered pools.

Configuration requirements

To use the high availability feature, you need:

  • Citrix Hypervisor pool (this feature provides high availability at the server level within a single resource pool).

    Note:

    We recommend that you enable high availability only in pools that contain at least 3 Citrix Hypervisor servers. For more information, see CTX129721 - High Availability Behavior When the Heartbeat is Lost in a Pool.

  • Shared storage, including at least one iSCSI, NFS, or Fibre Channel LUN of size 356 MB or greater - the heartbeat SR. The high availability mechanism creates two volumes on the heartbeat SR:

    • 4 MB heartbeat volume: Used to provide a heartbeat.
    • 256 MB metadata volume: To store pool master metadata to be used if there is a master failover.

    Note:

    Previously, we recommended using a dedicated NFS or iSCSI storage repository as your high availability heartbeat disk. However, this is only of benefit if the storage repository is not sharing resources on the underlying storage appliance, otherwise it is simply increasing complexity and resource usage in the control domain (dom0) of the host.

    If your pool is a clustered pool, your heartbeat SR must be a GFS2 SR.

    Storage attached using either SMB or iSCSI when authenticated using CHAP cannot be used as the heartbeat SR.

  • Static IP addresses for all hosts.

    Warning:

    If the IP address of a server changes while high availability is enabled, high availability assumes that the host’s network has failed. The change in IP address can fence the host and leave it in an unbootable state. To remedy this situation, disable high availability using the host-emergency-ha-disable command, reset the pool master using pool-emergency-reset-master, and then re-enable high availability.

  • For maximum reliability, we recommend that you use a dedicated bonded interface as the high availability management network.

For a VM to be protected by high availability, it must be agile. This means the VM:

  • Must have its virtual disks on shared storage. You can use any type of shared storage. iSCSI, NFS, or Fibre Channel LUN is only required for the storage heartbeat and can be used for virtual disk storage.

  • Can use live migration.

  • Does not have a connection to a local DVD drive configured.

  • Has its virtual network interfaces on pool-wide networks.

Note:

When high availability is enabled, we strongly recommend using a bonded management interface on the servers in the pool and multipathed storage for the heartbeat SR.

If you create VLANs and bonded interfaces from the CLI, then they might not be plugged in and active despite being created. In this situation, a VM can appear to be not agile and it is not protected by high availability. You can use the CLI pif-plug command to bring up the VLAN and bond PIFs so that the VM can become agile. You can also determine precisely why a VM is not agile by using the xe diagnostic-vm-status CLI command. This command analyzes its placement constraints, and you can take remedial action if necessary.

Restart configuration settings

Virtual machines can be considered protected, best-effort, or unprotected by high availability. The value of ha-restart-priority defines whether a VM is treated as protected, best-effort, or unprotected. The restart behavior for VMs in each of these categories is different.

Protected

High availability guarantees to restart a protected VM that goes offline or whose host goes offline, provided the pool isn’t overcommitted and the VM is agile.

If a protected VM cannot be restarted when a server fails, high availability attempts to start the VM when there is extra capacity in a pool. Attempts to start the VM when there is extra capacity might now succeed.

ha-restart-priority Value: restart

Best-effort

If the host of a best-effort VM goes offline, high availability attempts to restart the best-effort VM on another host. It makes this attempt only after all protected VMs have been successfully restarted. High availability makes only one attempt to restart a best-effort VM. If this attempt fails, high availability does not make further attempts to restart the VM.

ha-restart-priority Value: best-effort

Unprotected

If an unprotected VM or the host it runs on is stopped, high availability does not attempt to restart the VM.

ha-restart-priority Value: Value is an empty string

Note:

High availability never stops or migrates a running VM to free resources for a protected or best-effort VM to be restarted.

If the pool experiences server failures and the number of tolerable failures drops to zero, the protected VMs are not guaranteed to restart. In such cases, a system alert is generated. If another failure occurs, all VMs that have a restart priority set behave according to the best-effort behavior.

Start order

The start order is the order in which Citrix Hypervisor high availability attempts to restart protected VMs when a failure occurs. The values of the order property for each of the protected VMs determines the start order.

The order property of a VM is used by high availability and also by other features that start and shut down VMs. Any VM can have the order property set, not just the VMs marked as protected for high availability. However, high availability uses the order property for protected VMs only.

The value of the order property is an integer. The default value is 0, which is the highest priority. Protected VMs with an order value of 0 are restarted first by high availability. The higher the value of the order property, the later in the sequence the VM is restarted.

You can set the value of the order property of a VM by using the command-line interface:

xe vm-param-set uuid=<vm uuid> order=<int>
<!--NeedCopy-->

Or in XenCenter, in the Start Options panel for a VM, set Start order to the required value.

Enable high availability on your Citrix Hypervisor pool

You can enable high availability on a pool by using either XenCenter or the command-line interface (CLI). In either case, you specify a set of priorities that determine which VMs are given the highest restart priority when a pool is overcommitted.

Warnings:

  • When you enable high availability, some operations that compromise the plan for restarting VMs (such as removing a server from a pool) might be disabled. You can temporarily disable high availability to perform such operations, or alternatively, make VMs protected by high availability unprotected.

  • If high availability is enabled, you cannot enable clustering on your pool. Temporarily disable high availability to enable clustering. You can enable high availability on your clustered pool. Some high availability behavior, such as self-fencing, is different for clustered pools. For more information, see Clustered pools.

Enable high availability by using the CLI

  1. Verify that you have a compatible Storage Repository (SR) attached to your pool. iSCSI, NFS, or Fibre Channel SRs are compatible. For information about how to configure such a storage repository using the CLI, see Manage storage repositories.

  2. For each VM you want to protect, set a restart priority and start order. You can set the restart priority as follows:

    xe vm-param-set uuid=<vm uuid> ha-restart-priority=restart order=1
    <!--NeedCopy-->
    
  3. Enable high availability on the pool, and optionally, specify a timeout:

    xe pool-ha-enable heartbeat-sr-uuids=<sr uuid> ha-config:timeout=<timeout in seconds>
    <!--NeedCopy-->
    

    Alternatively, you can set a default timeout for your pool. For more information about how to set a timeout, see Configure high availability timeout.

  4. Run the command pool-ha-compute-max-host-failures-to-tolerate. This command returns the maximum number of hosts that can fail before there are insufficient resources to run all the protected VMs in the pool.

    xe pool-ha-compute-max-host-failures-to-tolerate
    <!--NeedCopy-->
    

    The number of failures to tolerate determines when an alert is sent. The system recomputes a failover plan as the state of the pool changes. It uses this computation to identify the pool capacity and how many more failures are possible without loss of the liveness guarantee for protected VMs. A system alert is generated when this computed value falls below the specified value for ha-host-failures-to-tolerate.

  5. Specify the ha-host-failures-to-tolerate parameter. The value must be less than or equal to the computed value:

    xe pool-param-set ha-host-failures-to-tolerate=2 uuid=<pool uuid>
    <!--NeedCopy-->
    

Configure high availability timeout

High availability timeout is the period during which networking or storage is not accessible by the hosts in your pool. If any Citrix Hypervisor server is unable to access networking or storage within the timeout period, it can self-fence and restart. The default timeout is 60 seconds. However, you can change this value by using the following command.

Set a default high availability timeout for your pool:

xe pool-param-set uuid=<pool uuid> other-config:default_ha_timeout=<timeout in seconds>
<!--NeedCopy-->

If you enable high availability by using XenCenter instead of the xe CLI, this default still applies.

Alternatively, you can set timeout when you enable high availability:

xe pool-ha-enable heartbeat-sr-uuids=<sr uuid> ha-config:timeout=<timeout in seconds>
<!--NeedCopy-->

Note that if you set timeout when enabling high availability, it only applies to that specific enablement. Therefore, if you disable and then reenable high availability later, the high availability feature reverts to using the default timeout.

Remove high availability protection from a VM by using the CLI

To disable high availability features for a VM, use the xe vm-param-set command to set the ha-restart-priority parameter to be an empty string. Setting the ha-restart-priority parameter does not clear the start order settings. You can enable high availability for a VM again by setting the ha-restart-priority parameter to restart or best-effort as appropriate.

Recover an unreachable host

If for some reason, a host cannot access the high availability state file, it is possible that a host might become unreachable. To recover your Citrix Hypervisor installation, you might have to disable high availability using the host-emergency-ha-disable command:

xe host-emergency-ha-disable --force
<!--NeedCopy-->

If the host was the pool master, it starts up as normal with high availability disabled. Pool members reconnect and automatically disable high availability. If the host was a pool member and cannot contact the master, you might have to take one of the following actions:

  • Force the host to reboot as a pool master (xe pool-emergency-transition-to-master)

     xe pool-emergency-transition-to-master uuid=<host uuid>
     <!--NeedCopy-->
    
  • Tell the host where the new master is (xe pool-emergency-reset-master):

     xe pool-emergency-reset-master master-address=<new master hostname>
     <!--NeedCopy-->
    

When all hosts have successfully restarted, re-enable high availability:

xe pool-ha-enable heartbeat-sr-uuid=<sr uuid>
<!--NeedCopy-->

Shutting down a host when high availability is enabled

Take special care when shutting down or rebooting a host to prevent the high availability mechanism from assuming that the host has failed. To shut down a host cleanly when high availability is enabled, disable the host, evacuate the host, and finally shutdown the host by using either XenCenter or the CLI. To shut down a host in an environment where high availability is enabled, run these commands:

xe host-disable host=<host name>
xe host-evacuate uuid=<host uuid>
xe host-shutdown host=<host name>
<!--NeedCopy-->

Shut down a VM protected by high availability

When a VM is protected under a high availability plan and set to restart automatically, it cannot be shut down while this protection is active. To shut down a VM, first disable its high availability protection and then run the CLI command. XenCenter offers you a dialog box to automate disabling the protection when you select the Shutdown button of a protected VM.

Note:

If you shut down a VM from within the guest, and the VM is protected, it is automatically restarted under the high availability failure conditions. The automatic restart helps ensure that operator error doesn’t result in a protected VM being left shut down accidentally. If you want to shut down this VM, disable its high availability protection first.

High availability