Product Documentation

High Availability

Jun 13, 2017

HA is a set of automatic features designed to plan for, and safely recover from, issues that take down XenServer hosts or make them unreachable, such as physically disrupted networking or host hardware failures.

Note

HA is designed to be used in conjunction with storage multipathing and network bonding to create a system that is resilient to, and can recover from, hardware faults. HA should always be used with multipathed storage and bonded networking.

Firstly, HA ensures that if a host becomes unreachable or unstable, VMs that were known to be running on that host are shut down and can be restarted elsewhere. This avoids the scenario where VMs are started (manually or automatically) on a new host and, at some later point, the original host recovers. That scenario could lead to two instances of the same VM running on different hosts, with a correspondingly high probability of VM disk corruption and data loss.

Secondly, HA recovers administrative control of a pool in the event that the pool master becomes unreachable or unstable. HA ensures that administrative control is restored automatically without any manual intervention.

Optionally, HA can also automate the process of restarting VMs on hosts that are known to be in a good state, without manual intervention. These VMs can be scheduled for restart in groups to allow time to start services. This allows infrastructure VMs to be started before their dependent VMs (for example, a DHCP server before its dependent SQL server).

Warning

HA is designed to be used in conjunction with multipathed storage and bonded networking, and these should be configured before attempting to set up HA. Customers who do not set up multipathed networking and storage may see unexpected host reboot behaviour (self-fencing) in the event of infrastructure instability. For more information, see CTX134880 - Designing XenServer Network Configurations and CTX134881 - Configuring iSCSI Multipathing Support for XenServer.

A pool is overcommitted if the VMs that are currently running could not be restarted elsewhere following a user-defined number of host failures.

This would happen if there was not enough free memory across the pool to run those VMs following a failure. However, there are also more subtle changes that can make HA guarantees unsustainable: changes to Virtual Block Devices (VBDs) and networks can affect which VMs may be restarted on which hosts. Currently, XenServer cannot check every action before it occurs to determine whether it will violate HA demands. However, an asynchronous notification is sent if HA becomes unsustainable.

XenServer dynamically maintains a failover plan which details what to do if a set of hosts in a pool fail at any given time. An important concept to understand is the host failures to tolerate value, which is defined as part of HA configuration. This determines the number of host failures that can occur without any loss of service. For example, if a resource pool consists of 16 hosts and the tolerated failures value is set to 3, the pool calculates a failover plan that allows any 3 hosts to fail while still being able to restart their VMs on other hosts. If a plan cannot be found, the pool is considered to be overcommitted. The plan is dynamically recalculated based on VM lifecycle operations and movement. Alerts are sent (either through XenCenter or e-mail) if changes (for example, the addition of new VMs to the pool) cause your pool to become overcommitted.
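
The idea of a failover plan can be illustrated with a small Python model. This is a minimal sketch under stated assumptions, not XenServer's actual planner: it considers memory only, it places VMs with a simple first-fit-decreasing heuristic (so it may report "no plan" in cases where a cleverer placement exists), and the names `fits`, `can_tolerate`, `hosts` and `vms` are hypothetical. It tries every combination of failed hosts of the requested size and checks whether the displaced VMs fit into the free memory of the surviving hosts.

```python
from itertools import combinations


def fits(displaced, free):
    """Try to place displaced VM memory sizes into free host memory.

    Uses first-fit decreasing: largest VM first, first host with room.
    """
    free = dict(free)  # work on a copy so the caller's data is untouched
    for mem in sorted(displaced, reverse=True):
        target = next((h for h, f in free.items() if f >= mem), None)
        if target is None:
            return False  # no surviving host has room for this VM
        free[target] -= mem
    return True


def can_tolerate(hosts, vms, failures):
    """Return True if a memory-only plan exists for any `failures` host losses.

    hosts: {host_name: total_memory}  (hypothetical model of the pool)
    vms:   [(host_name, memory), ...] for running, protected VMs
    """
    used = {h: 0 for h in hosts}
    for host, mem in vms:
        used[host] += mem
    for failed in combinations(hosts, failures):
        survivors = {h: hosts[h] - used[h] for h in hosts if h not in failed}
        displaced = [mem for host, mem in vms if host in failed]
        if not fits(displaced, survivors):
            return False  # this failure set has no plan: overcommitted
    return True


# Three hosts with 16 units of memory each, one 8-unit VM per host.
hosts = {"h1": 16, "h2": 16, "h3": 16}
vms = [("h1", 8), ("h2", 8), ("h3", 8)]
print(can_tolerate(hosts, vms, 1))  # True: any single failure can be absorbed
print(can_tolerate(hosts, vms, 2))  # False: one survivor cannot hold two more VMs
```

In this model, a pool is overcommitted for a given tolerance value exactly when `can_tolerate` returns `False`.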

If you attempt to start or resume a VM and that action causes the pool to be overcommitted, a warning alert is raised. This warning is displayed in XenCenter and is also available as a message instance through the Xen API. The message may also be sent to an email address if configured. You will then be allowed to cancel the operation, or proceed anyway. Proceeding causes the pool to become overcommitted. The amount of memory used by VMs of different priorities is displayed at the pool and host levels.

If a server failure occurs, such as the loss of network connectivity or a problem with the control stack, the XenServer host self-fences to ensure that the VMs are not running on two servers simultaneously. When a fence action is taken, the server immediately and abruptly restarts, causing all VMs running on it to be stopped. The other servers detect that the VMs are no longer running, and the VMs are restarted according to the restart priorities assigned to them. The fenced server enters a reboot sequence and, when it has restarted, tries to re-join the resource pool.

Note

Citrix recommends that you enable HA only in pools that contain at least 3 XenServer hosts. For details on how the HA feature behaves when the heartbeat is lost between two hosts in a pool, see the Citrix Knowledge Base article CTX129721.

To use the HA feature, you need:

  • Shared storage, including at least one iSCSI, NFS or Fibre Channel LUN of size 356MB or greater (the heartbeat SR). The HA mechanism creates two volumes on the heartbeat SR:

    4MB heartbeat volume

    Used for heartbeating.

    256MB metadata volume

    Stores pool master metadata to be used in the case of master failover.

    Note

    For maximum reliability, Citrix strongly recommends that you use a dedicated NFS or iSCSI storage repository as your HA heartbeat disk, which must not be used for any other purpose.

    If you are using a NetApp or EqualLogic SR, manually provision an NFS or iSCSI LUN on the array to use as the heartbeat SR.

  • A XenServer pool (this feature provides high availability at the server level within a single resource pool).

  • Static IP addresses for all hosts.

Warning

Should the IP address of a server change while HA is enabled, HA will assume that the host's network has failed, and will probably fence the host and leave it in an unbootable state. To remedy this situation, disable HA using the host-emergency-ha-disable command, reset the pool master using pool-emergency-reset-master, and then re-enable HA.

For a VM to be protected by the HA feature, it must be agile. This means that:

  • it must have its virtual disks on shared storage (any type of shared storage may be used; the iSCSI, NFS or Fibre Channel LUN is only required for the storage heartbeat and can be used for virtual disk storage if you prefer, but this is not necessary)

  • it must not have a connection to a local DVD drive configured

  • it should have its virtual network interfaces on pool-wide networks.
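
The agility conditions above can be expressed as a simple check. The following Python sketch is illustrative only: `VMConfig` and its fields are a hypothetical model, not part of the Xen API, and it treats the pool-wide network recommendation as a condition alongside the other two.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VMConfig:
    """Hypothetical summary of the VM properties relevant to agility."""
    name: str
    disks_on_shared_storage: List[bool]    # one entry per virtual disk
    has_local_dvd: bool                    # connected to a local DVD drive?
    vifs_on_pool_wide_network: List[bool]  # one entry per virtual interface


def agility_problems(vm: VMConfig) -> List[str]:
    """Return the reasons a VM is not agile (empty list means agile)."""
    problems = []
    if not all(vm.disks_on_shared_storage):
        problems.append("virtual disk on non-shared storage")
    if vm.has_local_dvd:
        problems.append("local DVD drive connected")
    if not all(vm.vifs_on_pool_wide_network):
        problems.append("virtual network interface not on a pool-wide network")
    return problems


def is_agile(vm: VMConfig) -> bool:
    """A VM is agile when none of the conditions is violated."""
    return not agility_problems(vm)


web = VMConfig("web", [True, True], False, [True])
legacy = VMConfig("legacy", [True, False], True, [True])
print(is_agile(web))             # True
print(agility_problems(legacy))  # lists the two violated conditions
```

Returning the list of reasons rather than a bare boolean mirrors the kind of diagnosis an administrator needs when deciding what remedial action to take.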

Citrix strongly recommends the use of a bonded management interface on the servers in the pool if HA is enabled, and multipathed storage for the heartbeat SR.

If you create VLANs and bonded interfaces from the CLI, then they may not be plugged in and active despite being created. In this situation, a VM can appear to be not agile, and cannot be protected by HA. If this occurs, use the CLI pif-plug command to bring the VLAN and bond PIFs up so that the VM can become agile. You can also determine precisely why a VM is not agile by using the xe diagnostic-vm-status CLI command to analyze its placement constraints, and take remedial action if required.

Virtual machines can be assigned a restart priority and a flag that indicates whether or not they should be protected by HA. When HA is enabled, every effort is made to keep protected virtual machines live. If a restart priority is specified, any protected VM that is halted will be started automatically. If a server fails, the VMs that were running on it will be started on another server.

An explanation of the restart priorities is shown below:

HA Restart Priority   Restart Explanation
0                     Attempt to start VMs with this priority first
1                     Attempt to start VMs with this priority only after having attempted to restart all VMs with priority 0
2                     Attempt to start VMs with this priority only after having attempted to restart all VMs with priority 1
3                     Attempt to start VMs with this priority only after having attempted to restart all VMs with priority 2
best-effort           Attempt to start VMs with this priority only after having attempted to restart all VMs with priority 3

HA Always Run   Explanation
True            VMs with this setting are included in the restart plan
False           VMs with this setting are NOT included in the restart plan

Warning

Citrix strongly advises that only StorageLink Service VMs should be given a restart priority of 0. All other VMs (including those dependent on a StorageLink VM) should be assigned a restart priority 1 or higher.

The "best-effort" HA restart priority must NOT be used in pools with StorageLink SRs.

The restart priorities determine the order in which XenServer attempts to start VMs when a failure occurs. In a given configuration where a number of server failures greater than zero can be tolerated (as indicated in the HA panel in the GUI, or by the ha-plan-exists-for field on the pool object on the CLI), the VMs that have restart priorities 0, 1, 2 or 3 are guaranteed to be restarted given the stated number of server failures. VMs with a best-effort priority setting are not part of the failover plan and are not guaranteed to be kept running, since capacity is not reserved for them. If the pool experiences server failures and enters a state where the number of tolerable failures drops to zero, the protected VMs are no longer guaranteed to be restarted. If this condition is reached, a system alert is generated. In this case, should an additional failure occur, all VMs that have a restart priority set behave according to the best-effort behavior.
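
The ordering described above can be sketched in a few lines of Python. This is an illustration of the ordering rule only, under assumptions: the dictionary keys `name`, `priority` and `always_run` are hypothetical, and the sketch does not model capacity reservation or the guarantee that applies to priorities 0 to 3.

```python
BEST_EFFORT = "best-effort"


def restart_order(vms):
    """Return VM names in the order restart would be attempted.

    vms: list of dicts with hypothetical keys
         'name', 'priority' (0-3 or 'best-effort') and 'always_run' (bool).
    VMs with always_run set to False are left out of the restart plan.
    """
    def rank(vm):
        # best-effort sorts after the numeric priorities 0..3
        return 4 if vm["priority"] == BEST_EFFORT else int(vm["priority"])

    planned = [vm for vm in vms if vm["always_run"]]
    # sorted() is stable, so VMs with equal priority keep their input order
    return [vm["name"] for vm in sorted(planned, key=rank)]


vms = [
    {"name": "sql", "priority": 2, "always_run": True},
    {"name": "test", "priority": BEST_EFFORT, "always_run": True},
    {"name": "dhcp", "priority": 1, "always_run": True},
    {"name": "scratch", "priority": 3, "always_run": False},
]
print(restart_order(vms))  # ['dhcp', 'sql', 'test']
```

Here the DHCP server is attempted before the SQL server that depends on it, the best-effort VM comes last, and the VM that is not marked always-run is skipped.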

If a protected VM cannot be restarted at the time of a server failure (for example, if the pool was overcommitted when the failure occurred), further attempts to start this VM will be made as the state of the pool changes. This means that if extra capacity becomes available in a pool (if you shut down a non-essential VM, or add an additional server, for example), a fresh attempt to restart the protected VMs will be made, which may now succeed.

Note

No running VM will ever be stopped or migrated in order to free resources for a VM with always-run=true to be restarted.

