Configure high availability

High availability is a set of automatic features designed to plan for, and safely recover from, issues that take down XenServer hosts or make them unreachable: for example, physically disrupted networking or host hardware failures.

Overview

High availability ensures that when a host becomes unreachable or unstable, VMs running on that host are safely shut down and restarted on another host. This coordinated behavior avoids the VMs being started (manually or automatically) on a new host while the original host might later recover. Without this safeguard, two instances of the same VM can end up running on different hosts, with a corresponding high probability of VM disk corruption and data loss.

When the pool master becomes unreachable or unstable, high availability can also recover administrative control of the pool. High availability ensures that administrative control is restored automatically, without manual intervention.

Optionally, high availability can also automate the process of restarting VMs on hosts that are known to be in a good state, without manual intervention. These VMs can be scheduled for restart in groups to allow time for services to start. This ordering allows infrastructure VMs to be started before their dependent VMs (for example, a DHCP server before a SQL server that depends on it).

Warning:

High availability can be used along with multipathed storage and bonded networking. Configure multipathed storage and bonded networking before attempting to set up high availability. Customers who do not set up multipathed storage and bonded networking can see unexpected host reboot behavior (self-fencing) during infrastructure instability.

Note:

All graphics solutions (NVIDIA vGPU, Intel GVT-d, Intel GVT-g, AMD MxGPU, and vGPU pass-through) can be used in an environment that uses high availability. However, VMs that use these graphics solutions cannot be protected with high availability. These VMs can be restarted on a best-effort basis while there are hosts with the appropriate free resources.

Overcommitting

A pool is overcommitted when the VMs that are currently running cannot be restarted elsewhere following a user-defined number of host failures.

Overcommitting can happen if there is not enough free memory across the pool to restart those VMs following a failure. However, there are also more subtle changes that can make the high availability guarantee unsustainable: changes to Virtual Block Devices (VBDs) and networks can affect which VMs can be restarted on which hosts. XenServer cannot check all potential actions and determine whether they violate high availability requirements. However, an asynchronous notification is sent if high availability becomes unsustainable.

XenServer dynamically maintains a failover plan which details what to do when a set of hosts in a pool fail at any given time. An important concept to understand is the host failures to tolerate value, which is defined as part of the high availability configuration. The value of host failures to tolerate determines the number of failures that are allowed without any loss of service. For example, consider a resource pool that consists of 64 hosts, with host failures to tolerate set to 3. In this case, the pool calculates a failover plan that allows any three hosts to fail and their VMs to be restarted on other hosts. If a plan cannot be found, the pool is considered to be overcommitted. The plan is dynamically recalculated based on VM lifecycle operations and movement. If changes, for example the addition of new VMs to the pool, cause your pool to become overcommitted, alerts are sent (either through XenCenter or email).
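You can inspect the current plan state from the CLI. The following sketch assumes a single pool and reads the pool's ha-plan-exists-for and ha-overcommitted parameters; output depends on your environment:

```shell
# Query the pool's current high availability plan state.
# ha-plan-exists-for reports the largest number of host failures
# for which a failover plan can currently be computed.
# ha-overcommitted reports whether the configured
# ha-host-failures-to-tolerate value can no longer be satisfied.
POOL_UUID=$(xe pool-list --minimal)

xe pool-param-get uuid="$POOL_UUID" param-name=ha-plan-exists-for
xe pool-param-get uuid="$POOL_UUID" param-name=ha-overcommitted
```

If ha-plan-exists-for falls below your configured host failures to tolerate, the pool is overcommitted.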

Overcommitment warning

If any attempts to start or resume a VM cause the pool to be overcommitted, a warning alert is displayed. This warning appears in XenCenter and is also available as a message instance through the Xen API. If you have configured an email address, a message may also be sent to the email address. You can then cancel the operation, or proceed anyway. Proceeding causes the pool to become overcommitted. The amount of memory used by VMs of different priorities is displayed at the pool and host levels.

Host fencing

Sometimes, a server can fail due to the loss of network connectivity or when a problem with the control stack is encountered. In such cases, the XenServer host self-fences to ensure the VMs are not running on two servers simultaneously. When a fence action is taken, the server restarts immediately and abruptly, causing all VMs running on it to be stopped. The other servers detect that the VMs are no longer running and the VMs are restarted according to the restart priorities assigned to them. The fenced server enters a reboot sequence, and when it has restarted it tries to rejoin the resource pool.

Note:

Hosts in clustered pools can also self-fence when they cannot communicate with more than half the other hosts in the resource pool. For more information, see Clustered pools.

Configuration requirements

Note:

Citrix recommends that you enable high availability only in pools that contain at least three XenServer hosts. For more information, see CTX129721 - High Availability Behavior When the Heartbeat is Lost in a XenServer Pool.

To use the high availability feature, you need:

  • Shared storage, including at least one iSCSI, NFS, or Fibre Channel LUN of size 356 MB or greater - the heartbeat SR. The high availability mechanism creates two volumes on the heartbeat SR:

    4 MB heartbeat volume: Used for heartbeating.

    256 MB metadata volume: To store pool master metadata to be used if there is a master failover.

    Notes:

    • For maximum reliability, Citrix recommends that you use a dedicated NFS or iSCSI storage repository as your high availability heartbeat disk. Do not use this storage repository for any other purpose.
    • If your pool is a clustered pool, your heartbeat SR must be a GFS2 SR.
    • Storage attached using either SMB or iSCSI when authenticated using CHAP cannot be used as the heartbeat SR.
    • When using a NetApp or EqualLogic SR, manually provision an NFS or iSCSI LUN on the array to use as the heartbeat SR.
  • XenServer pool (this feature provides high availability at the server level within a single resource pool).

  • Static IP addresses for all hosts.

Warning:

If the IP address of a server changes while high availability is enabled, high availability assumes that the host’s network has failed. The change in IP address can fence the host and leave it in an unbootable state. To remedy this situation, disable high availability using the host-emergency-ha-disable command, reset the pool master using pool-emergency-reset-master, and then re-enable high availability.

For a VM to be protected by high availability, it must be agile. This means that the VM:

  • Must have its virtual disks on shared storage. You can use any type of shared storage. An iSCSI, NFS, or Fibre Channel LUN is only required for the storage heartbeat; it can also be used for virtual disk storage, but this is not required.

  • Can use XenMotion

  • Does not have a connection to a local DVD drive configured

  • Has its virtual network interfaces on pool-wide networks

When high availability is enabled, Citrix strongly recommends using a bonded management interface on the servers in the pool and multipathed storage for the heartbeat SR.

If you create VLANs and bonded interfaces from the CLI, they may not be plugged in and active despite being created. In this situation, a VM can appear to be not agile and is not protected by high availability. You can use the CLI pif-plug command to bring up the VLAN and bond PIFs so that the VM can become agile. You can also determine precisely why a VM is not agile by using the xe diagnostic-vm-status CLI command. This command analyzes the placement constraints of the VM, and you can take remedial action if necessary.
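As a sketch, the two commands mentioned above can be used as follows (pif_uuid and vm_uuid are placeholders for UUIDs from your own pool):

```shell
# Plug the PIF of the newly created bond or VLAN so that it becomes
# active; until it is plugged, VMs that use it are not agile.
xe pif-plug uuid=pif_uuid

# Report the placement constraints that prevent this VM from being
# restarted on other hosts in the pool.
xe diagnostic-vm-status uuid=vm_uuid
```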

Restart configuration settings

Virtual machines can be considered protected, best-effort, or unprotected by high availability. The value of ha-restart-priority defines whether a VM is treated as protected, best-effort, or unprotected. The restart behavior for VMs in each of these categories is different.

Protected

High availability guarantees to restart a protected VM that goes offline or whose host goes offline, provided the pool isn’t overcommitted and the VM is agile.

If a protected VM cannot be restarted when a server fails, high availability attempts to start the VM when there is extra capacity in the pool. These later attempts might then succeed.

ha-restart-priority Value: restart

Best-effort

If the host of a best-effort VM goes offline, high availability attempts to restart the best-effort VM on another host. It makes this attempt only after all protected VMs have been successfully restarted. High availability makes only one attempt to restart a best-effort VM. If this attempt fails, high availability does not make further attempts to restart the VM.

ha-restart-priority Value: best-effort

Unprotected

If an unprotected VM or the host it runs on is stopped, high availability does not attempt to restart the VM.

ha-restart-priority Value: an empty string

Note:

High availability never stops or migrates a running VM to free resources for a protected or best-effort VM to be restarted.

If the pool experiences server failures and the number of tolerable failures drops to zero, the protected VMs are not guaranteed to restart. In such cases, a system alert is generated. If another failure occurs, all VMs that have a restart priority set behave according to the best-effort behavior.

Start order

The start order is the order in which XenServer high availability attempts to restart protected VMs when a failure occurs. The value of the order property for each protected VM determines the start order.

The order property of a VM is used by high availability and also by other features that start and shut down VMs. Any VM can have the order property set, not just the VMs marked as protected for high availability. However, high availability uses the order property for protected VMs only.

The value of the order property is an integer. The default value is 0, which is the highest priority. Protected VMs with an order value of 0 are restarted first by high availability. The higher the value of the order property, the later in the sequence the VM is restarted.

You can set the value of the order property of a VM by using the command-line interface:

xe vm-param-set uuid=VM_UUID order=int

Or in XenCenter, in the Start Options panel for a VM, set Start order to the required value.

Enable high availability on your XenServer pool

You can enable high availability on a pool by using either XenCenter or the command-line interface. In either case, you specify a set of priorities that determine which VMs are given the highest restart priority when a pool is overcommitted.

Warnings:

  • When you enable high availability, some operations that compromise the plan for restarting VMs, such as removing a server from a pool, may be disabled. You can temporarily disable high availability to perform such operations, or alternatively, make VMs protected by high availability unprotected.

  • If high availability is enabled, you cannot enable clustering on your pool. Temporarily disable high availability to enable clustering. After clustering is enabled, you can re-enable high availability on the clustered pool. Some high availability behavior, such as self-fencing, is different for clustered pools. For more information, see Clustered pools.

Enable high availability by using the CLI

  1. Verify that you have a compatible Storage Repository (SR) attached to your pool. iSCSI, NFS, or Fibre Channel SRs are compatible. For information about how to configure such a storage repository using the CLI, see Manage storage repositories.

  2. For each VM you want to protect, set a restart priority and start order. You can set the restart priority as follows:

    xe vm-param-set uuid=vm_uuid ha-restart-priority=restart order=1
    
  3. Enable high availability on the pool, and optionally, specify a timeout:

    xe pool-ha-enable heartbeat-sr-uuids=sr_uuid ha-config:timeout=timeout_in_seconds
    

    The timeout is the maximum period during which networking or storage can be inaccessible to the hosts in your pool. If you do not specify a timeout when you enable high availability, XenServer uses the default timeout of 30 seconds. If any XenServer host is unable to access networking or storage within the timeout period, it can self-fence and restart.

  4. Run the command pool-ha-compute-max-host-failures-to-tolerate. This command returns the maximum number of hosts that can fail before there are insufficient resources to run all the protected VMs in the pool.

    xe pool-ha-compute-max-host-failures-to-tolerate
    

    The number of failures to tolerate determines when an alert is sent. The system recomputes a failover plan as the state of the pool changes. It uses this computation to identify the pool capacity and how many more failures are possible without loss of the liveness guarantee for protected VMs. A system alert is generated when this computed value falls below the specified value for ha-host-failures-to-tolerate.

  5. Specify the number of failures to tolerate parameter. The value must be less than or equal to the computed value:

    xe pool-param-set ha-host-failures-to-tolerate=2 uuid=pool-uuid
    

Remove high availability protection from a VM by using the CLI

To disable high availability features for a VM, use the xe vm-param-set command to set the ha-restart-priority parameter to an empty string. Setting the ha-restart-priority parameter to an empty string does not clear the start order settings. You can enable high availability for a VM again by setting the ha-restart-priority parameter to restart or best-effort, as appropriate.
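As a sketch, with vm_uuid standing in for the UUID of your VM, the protection can be removed and later restored as follows:

```shell
# Remove high availability protection: an empty ha-restart-priority
# marks the VM as unprotected. The order (start order) value is kept.
xe vm-param-set uuid=vm_uuid ha-restart-priority=

# Re-enable protection later by restoring the restart priority.
xe vm-param-set uuid=vm_uuid ha-restart-priority=restart
```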

Recover an unreachable host

If a host cannot access the high availability state file for any reason, the host might become unreachable. To recover your XenServer installation, you might have to disable high availability by using the host-emergency-ha-disable command:

xe host-emergency-ha-disable --force

If the host was the pool master, it starts up as normal with high availability disabled. Pool members reconnect and automatically disable high availability. If the host was a pool member and cannot contact the master, you might have to take one of the following actions:

  • Force the host to reboot as a pool master (xe pool-emergency-transition-to-master):

     xe pool-emergency-transition-to-master uuid=host_uuid
    
  • Tell the host where the new master is (xe pool-emergency-reset-master):

     xe pool-emergency-reset-master master-address=new_master_hostname
    

When all hosts have successfully restarted, re-enable high availability:

xe pool-ha-enable heartbeat-sr-uuids=sr_uuid

Shutting down a host when high availability is enabled

Take special care when shutting down or rebooting a host to prevent the high availability mechanism from assuming that the host has failed. To shut down a host cleanly when high availability is enabled, disable the host, evacuate the host, and finally shut down the host by using either XenCenter or the CLI. To shut down a host in an environment where high availability is enabled, run these commands:

xe host-disable host=host_name
xe host-evacuate uuid=host_uuid
xe host-shutdown host=host_name

Shut down a VM protected by high availability

When a VM is protected under a high availability plan and set to restart automatically, it cannot be shut down while this protection is active. To shut down the VM, first disable its high availability protection and then run the shutdown command. When you select the Shutdown button of a protected VM, XenCenter offers a dialog box to automate disabling the protection.

Note:

If you shut down a VM from within the guest, and the VM is protected, it is automatically restarted under the high availability failure conditions. The automatic restart helps ensure that operator error doesn’t result in a protected VM being left shut down accidentally. If you want to shut down this VM, disable its high availability protection first.
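From the CLI, the sequence described above can be sketched as follows (vm_uuid is a placeholder for the UUID of your VM):

```shell
# Disable the VM's high availability protection first; otherwise
# high availability restarts the VM after it shuts down.
xe vm-param-set uuid=vm_uuid ha-restart-priority=

# The VM can now be shut down cleanly and stays down.
xe vm-shutdown uuid=vm_uuid
```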