About Citrix Hypervisor HA
Citrix Hypervisor High Availability (HA) allows virtual machines to automatically be restarted in the event of an underlying hardware failure or loss of any managed server. HA is about making sure that important VMs are always running in a resource pool. With HA enabled, if one of your servers fails, its VMs will be intelligently restarted on other servers in the same pool, allowing essential services to be restored in the event of system or component failure with minimal service interruption. If the pool master server fails, Citrix Hypervisor HA automatically selects a new server to take over as master, so you can continue to manage the pool. Any server in a pool can be a master server, and the pool database is constantly replicated across all nodes and also backed up to shared storage on the heartbeat SR for additional safety.
There are two key aspects to Citrix Hypervisor HA: reliably detecting server failure, and computing a failure plan to enable swift recovery. Both are covered in detail below.
Heartbeats for availability
Detecting server failure reliably is difficult since you need to remotely distinguish between a server disappearing for a while versus catastrophic failure. If we mistakenly decide that a master server has broken down and elect a new master in its place, there may be unpredictable results if the original server were to make a comeback. Similarly, if there is a network issue and a resource pool splits into two equal halves, we need to ensure that only one half accesses the shared storage and not both simultaneously. Citrix Hypervisor solves all these problems by having two mechanisms: a storage heartbeat and a network heartbeat.
When you enable HA in a pool, you nominate an iSCSI, Fibre Channel or NFS storage repository to be the heartbeat SR. Citrix Hypervisor automatically creates a couple of small virtual disks in this SR. The first disk is used by every server in the resource pool as a shared quorum disk. Each server allocates itself a unique block in the shared disk and regularly writes to the block to indicate that it is alive. When HA starts up, all servers exchange data over both network and storage channels, indicating which servers they can see over both channels - that is, which I/O paths are working and which are not. This information is exchanged until a fixed point is reached and all of the servers in the pool are satisfied that they are in agreement about what they can see. When this happens, HA is enabled and the pool is protected. This HA arming process can take a few minutes to settle for larger pools, but is only required when HA is first enabled.
Once HA is active, each server regularly writes storage updates to the heartbeat virtual disk, and network packets over the management interface. It is vital to ensure that network adapters are bonded for resilience, and that storage interfaces are using dynamic multipathing where supported. This will ensure that any single adapter or wiring failures do not result in any availability issues.
The worst-case scenario for HA is the situation where a server is thought to be off-line but is actually still writing to the shared storage, since this can result in corruption of persistent data. In order to prevent this situation, Citrix Hypervisor uses server fencing, that is, the server is automatically powered off and isolated from accessing any shared resources in the pool. This prevents the failing server from writing to any shared disks and damaging the consistency of the stored data during automated failover, when protected virtual machines are being moved to other, healthy servers in the pool.
Servers will self-fence (that is, power off and restart) in the event of any heartbeat failure unless any of the following hold true:
- The storage heartbeat is present for all servers but the network has partitioned (so that there are now two groups of servers). In this case, all of the servers that are members of the largest network partition stay running, and the servers in the smaller network partition self-fence. The assumption here is that the network outage has isolated the VMs, and they ought to be restarted on a server with working networking. If the network partitions are exactly the same size, then only one of them will self-fence according to a stable selection function.
- If the storage heartbeat goes away but the network heartbeat remains, then the servers check to see if they can see all other servers over the network. If this condition holds true, then the servers remain running on the assumption that the storage heartbeat server has gone away. This doesn’t compromise VM safety, but any network glitches will result in fencing, since that would mean both heartbeats have disappeared.
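The two exceptions above can be sketched as a small decision function. This is only an illustrative model, not the actual Citrix Hypervisor implementation: the function name `should_self_fence`, its parameters, and the tie-break rule (the partition containing the alphabetically first server name survives) are assumptions made for the example. The text only says that ties are resolved by "a stable selection function".

```python
def should_self_fence(me, partitions, storage_ok):
    """Return True if server `me` should self-fence (hypothetical model).

    partitions: list of disjoint sets of server names, grouped by which
                servers can reach each other over the network heartbeat.
    storage_ok: True if the storage heartbeat is present for all servers.
    """
    if len(partitions) == 1:
        # Every server sees every other server over the network, so the
        # pool keeps running even if the storage heartbeat has gone away.
        return False
    if not storage_ok:
        # Network is partitioned AND storage heartbeat is lost:
        # both heartbeats have effectively disappeared, so fence.
        return True
    # Storage heartbeat is intact but the network has partitioned:
    # the largest partition survives. Ties are broken by an assumed
    # stable rule: the partition holding the smallest server name wins.
    winner = min(partitions, key=lambda p: (-len(p), min(p)))
    return me not in winner


# A 3+1 split with working storage: only the isolated server fences.
split = [{"a", "b", "c"}, {"d"}]
print(should_self_fence("a", split, storage_ok=True))  # False
print(should_self_fence("d", split, storage_ok=True))  # True
```

Any deterministic tie-break would serve here, as long as every server evaluates it identically and so agrees on which half survives.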
Capacity planning for failure
The heartbeat system gives us reliable notification of server failure, and so we move onto the second step of HA: capacity planning for failure.
A resource pool consists of several servers (say, 32), each with potentially different amounts of memory and a different number of running VMs. In order to ensure that no single server failure will make it impossible to restart its VMs on another server (for example, due to insufficient memory on any other server), Citrix Hypervisor HA dynamically computes a failure plan which calculates the actions that would be taken on any server failure. In addition to dealing with failure of a single server, Citrix Hypervisor HA can deal with the loss of multiple servers in a pool, for example when failure of a network partition takes out an entire group of servers.
In addition to calculating what actions will be taken, the failure plan considers the number of server failures that can be tolerated in the pool. There are two important considerations involved in calculating the HA plan for a pool:
- Maximum failure capacity. This is the maximum number of servers that can fail before there are insufficient resources to run all the protected VMs in the pool; this value is calculated by Citrix Hypervisor by taking account of the restart priorities of the VMs in the pool, and the pool configuration (the number of servers and their CPU and memory capacity).
- Server failure limit. This is a value that you can define as part of HA configuration which specifies the number of server failures that you want to allow in the pool, within the HA plan. For example, in a resource pool of 16 servers, when you set the server failure limit to 3, Citrix Hypervisor calculates a failover plan that allows for any 3 servers to fail and still be able to run all protected VMs in the pool. You can configure the server failure limit to a value that is lower than the maximum failure capacity, making it less likely that the pool will become overcommitted. This can be useful in an environment with RBAC enabled, for example, to allow RBAC users without Pool Operator permissions to bring more VMs online without breaking the HA plan; see HA and Role-Based Access Control (RBAC) below.
A system alert will be generated when the maximum failure capacity value falls below the value specified for the server failure limit.
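To make the relationship between the two values concrete, here is a rough sketch of how a maximum failure capacity could be computed. It is a deliberately simplified model and not the algorithm Citrix Hypervisor actually uses: it considers memory only (no CPU, storage, or network constraints), treats every VM as protected, ignores restart priorities, tries every combination of failed servers by brute force, and places displaced VMs with a first-fit-decreasing heuristic. The names `max_failure_capacity`, `can_restart`, and `needs_alert` are hypothetical.

```python
from itertools import combinations


def can_restart(free, displaced):
    """First-fit-decreasing placement of displaced VM sizes into free memory."""
    free = list(free)
    for need in sorted(displaced, reverse=True):
        for i, available in enumerate(free):
            if available >= need:
                free[i] -= need
                break
        else:
            return False  # no surviving server can hold this VM
    return True


def max_failure_capacity(host_memory, vms):
    """host_memory: {host: total GiB}; vms: list of (host, GiB) for protected VMs."""
    hosts = list(host_memory)
    used = {h: 0 for h in hosts}
    for host, mem in vms:
        used[host] += mem
    capacity = 0
    for k in range(1, len(hosts)):  # at least one server must survive
        for failed in combinations(hosts, k):
            free = [host_memory[h] - used[h] for h in hosts if h not in failed]
            displaced = [mem for host, mem in vms if host in failed]
            if not can_restart(free, displaced):
                return capacity
        capacity = k
    return capacity


def needs_alert(capacity, server_failure_limit):
    """Alert when the computed capacity falls below the configured limit."""
    return capacity < server_failure_limit


hosts = {"h1": 64, "h2": 64, "h3": 64, "h4": 64}
vms = [("h1", 16), ("h1", 16), ("h2", 32), ("h3", 16), ("h4", 16)]
capacity = max_failure_capacity(hosts, vms)
print(capacity)                  # 2
print(needs_alert(capacity, 3))  # True: limit 3 exceeds capacity 2
```

Because first-fit-decreasing is a heuristic, this sketch may report a lower capacity than an exact packing would. It errs on the side of caution rather than overstating resilience.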
When HA is first enabled on a pool, a failure plan is calculated based on the resources available at that time. Citrix Hypervisor HA dynamically calculates a new failure plan in response to events which would affect the pool, for example, starting a new VM. If a new plan cannot be calculated due to insufficient resources across the pool (for example, not enough free memory, or changes to virtual disks and networks that affect which VMs may be restarted on which servers), the pool becomes overcommitted.
HA restart priority is used to determine which VMs should be started when a pool is overcommitted. When you configure the restart priority for the VMs you want to protect in the HA Configuration dialog box or in the Configure HA wizard, you can see the maximum failure capacity for the pool being recalculated dynamically. This lets you try various combinations of VM restart priorities depending on your business needs, and see whether the maximum failure capacity is appropriate to the level of protection you need for the critical VMs in the pool.
If you attempt to start or resume a VM and that action would cause the pool to be overcommitted, a warning will be displayed in XenCenter. The message may also be sent to an email address, if configured. You will then be allowed to cancel the operation, or proceed anyway, causing the pool to become overcommitted.
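The role of restart priority in an overcommitted pool can be illustrated with a small greedy sketch. This is an assumption-laden model rather than the product's real behavior: priorities are represented as plain integers where a lower number is restarted first, only memory is considered, and the function name `plan_restarts` is hypothetical.

```python
def plan_restarts(free_memory, vms):
    """Greedy restart plan for an overcommitted pool (illustrative only).

    free_memory: {host: free GiB on each surviving server}
    vms: list of (name, priority, GiB); lower priority number restarts first
         (an assumed encoding, not the product's own priority values).
    Returns (started, skipped): started is a list of (vm, host) pairs.
    """
    free = dict(free_memory)  # copy so the caller's data is not modified
    started, skipped = [], []
    for name, _priority, mem in sorted(vms, key=lambda v: (v[1], v[0])):
        # Pick the first host (in name order) with enough free memory.
        host = next((h for h in sorted(free) if free[h] >= mem), None)
        if host is None:
            skipped.append(name)
        else:
            free[host] -= mem
            started.append((name, host))
    return started, skipped


started, skipped = plan_restarts(
    {"h1": 16, "h2": 8},
    [("batch", 2, 8), ("db", 0, 16), ("web", 1, 8)],
)
print(started)  # [('db', 'h1'), ('web', 'h2')]
print(skipped)  # ['batch']
```

With only 24 GiB free across the survivors, the two higher-priority VMs are restarted and the lowest-priority one is left out, which is the trade-off an overcommitted pool forces.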
Working with an HA-enabled pool
The best practice for HA is not to make configuration changes to the pool while HA is enabled. Instead, it is intended to be the “2am safeguard” which will restart protected VMs in the event of a problem when there isn’t a human administrator nearby. If you are actively making configuration changes in the pool, such as applying software updates, then HA should be disabled for the duration of these changes.
- If you try to shut down a protected VM from XenCenter, XenCenter will offer you the option of removing the VM from the pool failure plan first and then shutting it down. This ensures that accidental VM shutdowns will not result in downtime, but that you can still stop a protected VM if you really want to.
- If you need to reboot a server when HA is enabled, XenCenter automatically uses the VM restart priorities to determine if this would invalidate the pool failure plan. If it doesn’t affect the plan, then the server is shut down normally. If the plan would be violated, but the maximum failure capacity is greater than 1, XenCenter will give you the option of lowering the pool’s server failure limit by 1. This reduces the overall resilience of the pool, but always ensures that at least one server failure will be tolerated. When the server comes back up, the plan is automatically recalculated and the original server failure limit is restored if appropriate.
- When you install software updates using the Install Update wizard, you must disable HA on the pool by selecting the Turn HA off option, and keep it disabled until after the update has been installed. If you do not disable HA, the update will not proceed. You will need to monitor the pool manually while updates are being installed to ensure that server failures do not disrupt the operation of the pool.
- When HA is enabled, some operations that would compromise the plan for restarting VMs may be disabled, such as removing a server from a pool. To perform these operations, you should temporarily disable HA or you can shut down the protected VMs before continuing.
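The server reboot behavior described in the list above can be written out as a short decision function. This is a sketch of the documented rules only, not XenCenter code: the function name `reboot_decision` and the returned labels are hypothetical, and the final branch covers a case the text does not describe, so it is marked as such rather than guessed at. Keeping the lowered limit at 1 or above is an interpretation of "at least one server failure will be tolerated".

```python
def reboot_decision(plan_survives_reboot, max_failure_capacity, server_failure_limit):
    """Return (action, resulting server failure limit) for a reboot request.

    plan_survives_reboot: True if taking the server down leaves the pool
                          failure plan valid.
    """
    if plan_survives_reboot:
        # The plan is unaffected, so the server is shut down normally.
        return ("shutdown_normally", server_failure_limit)
    if max_failure_capacity > 1:
        # Offer to lower the limit by 1, never below 1 (assumed floor).
        return ("offer_lower_limit", max(1, server_failure_limit - 1))
    # The text does not say what happens here; flag it instead of inventing it.
    return ("not_described_in_text", server_failure_limit)


print(reboot_decision(True, 3, 2))   # ('shutdown_normally', 2)
print(reboot_decision(False, 3, 2))  # ('offer_lower_limit', 1)
```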
HA and Role-Based Access Control (RBAC)
In Citrix Hypervisor environments where Role-Based Access Control (RBAC) is implemented, not all users will be permitted to change a pool’s HA configuration settings. Users who can start VMs (VM Operators), for example, will not have sufficient permissions to adjust the failover capacity for an HA-enabled pool. If starting a VM would reduce the maximum failure capacity to a value lower than the configured server failure limit, the VM Operator will not be able to start the VM. Only Pool Administrator or Pool Operator-level users can configure the server failure limit.
To work around this, the user who enables HA (with a Pool Administrator or Pool Operator role) can set the server failure limit to a number that is lower than the maximum failure capacity. This creates slack capacity and so ensures that less privileged users can start up new VMs and reduce the pool’s failover capacity without threatening the failure plan.
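The effect of slack capacity on less privileged users can be shown with a minimal check. This is an illustrative model of the rule described above, not the product's permission logic: the function name `may_start_vm` and the role set are assumptions, and the caller is assumed to hold a role that can start VMs in the first place.

```python
# Roles that the text says can configure the server failure limit.
CAN_CONFIGURE_HA = {"Pool Administrator", "Pool Operator"}


def may_start_vm(role, capacity_after_start, server_failure_limit):
    """Return True if starting the VM is permitted for this role.

    capacity_after_start: the maximum failure capacity the pool would
                          have once the VM is running.
    """
    if capacity_after_start >= server_failure_limit:
        # The failure plan still holds, so any VM-starting role may proceed.
        return True
    # The start would break the plan: only roles allowed to change the
    # HA configuration can go ahead.
    return role in CAN_CONFIGURE_HA


# Capacity is 3 and the limit is 2, so there is one server of slack.
print(may_start_vm("VM Operator", 2, 2))    # True: slack absorbs the VM
print(may_start_vm("VM Operator", 1, 2))    # False: plan would break
print(may_start_vm("Pool Operator", 1, 2))  # True
```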