High availability

Citrix Hypervisor high availability enables VMs to restart automatically in the event of an underlying hardware failure or loss of any server. High availability is about making sure important VMs are always running in a resource pool. With high availability enabled, if one of your servers fails, its VMs restart on other servers in the same pool. This capability allows essential services to be restored with minimal service interruption in the event of system or component failure.

If the pool master server fails, Citrix Hypervisor high availability selects a new server to take over as master. Any server in a pool can be a master server. Citrix Hypervisor replicates the pool database constantly across all nodes. It also backs up the database to shared storage on the heartbeat SR for extra safety.

There are two key aspects to Citrix Hypervisor high availability:

  • Reliably detecting server failure
  • Computing a failure plan to enable swift recovery

Heartbeats for availability

Detecting server failure reliably is difficult because you must remotely distinguish a server that has disappeared for a while from one that has failed catastrophically. If high availability incorrectly decides a master server has broken down and elects a new master, there might be unpredictable results if the original server returns. Similarly, if a network issue causes the resource pool to split into two equal halves, only one half must access the shared storage, never both simultaneously. Citrix Hypervisor solves these problems by using two mechanisms: a storage heartbeat and a network heartbeat.

When you enable high availability in a pool, you nominate an iSCSI, Fibre Channel or NFS storage repository to be the heartbeat SR. Citrix Hypervisor automatically creates a couple of small virtual disks in this SR. The first disk is used by every server in the resource pool as a shared quorum disk. Each server allocates itself a unique block in the shared disk and regularly writes to the block to indicate that it is alive. When high availability starts up, all servers exchange data over both network and storage channels. This action indicates which servers they can see over both channels and demonstrates which I/O paths are working and which are not. This information is exchanged until a fixed point is reached and all servers in the pool agree about what they can see. When this agreement happens, high availability is enabled and the pool is protected. This high availability arming process can take a few minutes to settle for larger pools, but is only required when you first enable high availability.
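The quorum-disk idea can be sketched in a few lines of Python. This is a minimal illustration, not the actual on-disk format: the block size, the timestamp encoding, the timeout, and the names QuorumDisk, write_heartbeat, and live_slots are all assumptions made for this example.

```python
import struct

BLOCK_SIZE = 512  # assumed block size, for illustration only


class QuorumDisk:
    """In-memory stand-in for the shared heartbeat virtual disk."""

    def __init__(self, num_servers: int) -> None:
        # One block per server; a block of zeros means "never written".
        self._data = bytearray(BLOCK_SIZE * num_servers)

    def write_heartbeat(self, slot: int, now: float) -> None:
        # Each server writes only to its own block.
        struct.pack_into("<d", self._data, slot * BLOCK_SIZE, now)

    def last_seen(self, slot: int) -> float:
        return struct.unpack_from("<d", self._data, slot * BLOCK_SIZE)[0]

    def live_slots(self, now: float, timeout: float) -> list:
        # A server counts as alive if its block was updated recently enough.
        count = len(self._data) // BLOCK_SIZE
        return [
            slot
            for slot in range(count)
            if self.last_seen(slot) > 0 and now - self.last_seen(slot) <= timeout
        ]


disk = QuorumDisk(num_servers=3)
disk.write_heartbeat(0, now=100.0)
disk.write_heartbeat(1, now=95.0)
disk.write_heartbeat(2, now=50.0)  # this server stopped writing long ago
print(disk.live_slots(now=101.0, timeout=10.0))
```

Giving each server its own block means that no two servers ever write to the same location, so no locking is needed on the shared disk in this sketch.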

After high availability is active, each server regularly writes storage updates to the heartbeat virtual disk, and network packets over the management interface. Ensure that network adapters are bonded for resilience, and that storage interfaces are using dynamic multipathing where supported. This configuration ensures that any single adapter or wiring failures do not result in any availability issues.

Server fencing

The worst-case scenario for high availability is one where a server is thought to be off-line but is still writing to the shared storage. This scenario can result in corruption of persistent data. Citrix Hypervisor uses server fencing to prevent this situation. The server is automatically powered off and isolated from accessing any shared resources in the pool. Fencing prevents the failing server from writing to any shared disks and damaging the consistency of the stored data during automated failover, when protected virtual machines are being moved to other, healthy servers in the pool.

Servers self-fence (that is, power off and restart) in the event of any heartbeat failure unless any of the following hold true:

  • The storage heartbeat is present for all servers but the network has partitioned (so that there are now two groups of servers). In this case, all servers that are members of the larger network partition stay running, and the servers in the smaller network partition self-fence. The assumption here is that the network outage has isolated the VMs, and they must be restarted on a server with working networking. If the network partitions are the same size, then only one of them self-fences according to a stable selection function.
  • If the storage heartbeat goes away but the network heartbeat remains, the servers check to see if they can see all other servers over the network. If this condition holds true, the servers remain running on the assumption that the storage heartbeat server has gone away. This action doesn’t compromise VM safety, but any network glitches result in fencing, since that would mean both heartbeats have disappeared.
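The self-fencing rules above can be expressed as a small decision function. This is a sketch under stated assumptions rather than the product's implementation: the function names are hypothetical, the list of network partitions is assumed to be known to every server, and the tie-break (the partition containing the alphabetically smallest server name survives) is only a stand-in for the unspecified stable selection function.

```python
from typing import List, Set


def surviving_partition(partitions: List[Set[str]]) -> Set[str]:
    # The larger partition wins. Ties go to the partition that holds the
    # alphabetically smallest server name (an assumed tie-break).
    return min(partitions, key=lambda p: (-len(p), min(p)))


def should_self_fence(
    me: str, pool: Set[str], partitions: List[Set[str]], storage_ok: bool
) -> bool:
    my_partition = next(p for p in partitions if me in p)
    sees_everyone = my_partition == pool

    if storage_ok and sees_everyone:
        return False  # both heartbeats are healthy
    if storage_ok:
        # Network has partitioned: only the surviving partition stays up.
        return my_partition != surviving_partition(partitions)
    # Storage heartbeat lost: stay up only if every server is visible
    # over the network; otherwise both heartbeats are effectively gone.
    return not sees_everyone


pool = {"a", "b", "c", "d", "e"}
split = [{"a", "b", "c"}, {"d", "e"}]
print(should_self_fence("a", pool, split, storage_ok=True))
print(should_self_fence("d", pool, split, storage_ok=True))
```

The tie-break depends only on the partition membership, so every server that evaluates it reaches the same answer, which is the property a stable selection function needs.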

Capacity planning for failure

The heartbeat system gives us reliable notification of server failure, and so we move on to the second step of high availability: capacity planning for failure.

A resource pool consists of several servers (say, 32), each with potentially different amounts of memory and a different number of running VMs. To ensure that no single server failure makes it impossible to restart its VMs on another server (for example, due to insufficient memory on any other server), Citrix Hypervisor high availability dynamically computes a failure plan, which sets out the actions that would be taken on any server failure. In addition to dealing with the failure of a single server, Citrix Hypervisor high availability can deal with the loss of multiple servers in a pool. For example, it can handle the case where a network partition takes out an entire group of servers.

In addition to calculating what actions are taken, the failure plan considers the number of server failures that can be tolerated in the pool. There are two important considerations involved in calculating the high availability plan for a pool:

  • Maximum failure capacity. This value is the maximum number of servers that can fail before there are insufficient resources to run all the protected VMs in the pool. Citrix Hypervisor calculates the maximum failure capacity by taking account of the restart priorities of the VMs in the pool, and the pool configuration (the number of servers and their CPU and memory capacity).
  • Server failure limit. You define this value as part of the high availability configuration. It specifies the number of server failures that you want the high availability plan to tolerate in the pool. For example, when you set the server failure limit for a resource pool to 3, Citrix Hypervisor calculates a failover plan that allows any 3 servers to fail while still being able to run all protected VMs in the pool. You can configure the server failure limit to a value that is lower than the maximum failure capacity, making it less likely that the pool becomes overcommitted. This configuration can be useful in an environment with RBAC enabled. For example, this setting allows RBAC users with lower permissions than Pool Operator to bring more VMs online without breaking the high availability plan. For more information, see the High availability and Role-Based Access Control (RBAC) section.

A system alert is generated when the maximum failure capacity value falls below the value specified for the server failure limit.
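To make the relationship between the two values concrete, the following sketch computes a maximum failure capacity by brute force and compares it with a server failure limit. It is a deliberately simplified model, not the planner Citrix Hypervisor uses: it considers memory only, ignores restart priorities, CPU, storage, and network constraints, and places VMs with a first-fit heuristic. The function names and the pool data are hypothetical.

```python
from itertools import combinations


def can_restart(free_memory, displaced_vms):
    # First-fit decreasing: place the largest VMs first.
    free = sorted(free_memory, reverse=True)
    for vm in sorted(displaced_vms, reverse=True):
        for i, available in enumerate(free):
            if available >= vm:
                free[i] -= vm
                break
        else:
            return False  # no surviving server has room for this VM
    return True


def max_failure_capacity(servers):
    """servers maps a name to (total memory, [memory of each protected VM])."""
    names = list(servers)
    capacity = 0
    for n in range(1, len(names)):
        for failed in combinations(names, n):
            survivors = [s for s in names if s not in failed]
            free = [servers[s][0] - sum(servers[s][1]) for s in survivors]
            displaced = [vm for s in failed for vm in servers[s][1]]
            if not can_restart(free, displaced):
                return capacity  # some set of n failures cannot be handled
        capacity = n  # every set of n failures can be handled
    return capacity


# Memory in GiB; the values are made up for the example.
pool = {"h1": (16, [4, 4]), "h2": (16, [4]), "h3": (16, [8])}
server_failure_limit = 2

capacity = max_failure_capacity(pool)
raise_alert = capacity < server_failure_limit
print(capacity, raise_alert)
```

In this example any single server can fail safely, but losing h1 and h2 together leaves 12 GiB of protected VMs for a survivor with 8 GiB free, so the capacity is 1 and a limit of 2 would trigger the alert.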

Overcommit protection

When high availability is first enabled on a pool, a failure plan is calculated based on the resources available at that time. Citrix Hypervisor high availability dynamically calculates a new failure plan in response to events that would affect the pool, for example, starting a new VM. If a new plan cannot be calculated due to insufficient resources across the pool (for example, not enough free memory, or changes to virtual disks and networks that affect which VMs can be restarted on which servers), the pool becomes overcommitted.

High availability restart priority is used to determine which VMs to start when a pool is overcommitted. When you configure the restart priority for the VMs you want to protect in the HA Configuration dialog box or in the Configure HA wizard, you can see the maximum failure capacity for the pool being recalculated dynamically. This information enables you to try various combinations of VM restart priorities depending on your business needs, and see if the maximum failure capacity is appropriate to the level of protection you need for the critical VMs in the pool.
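As a rough illustration of how a restart priority can decide which VMs start when resources are short, the sketch below starts VMs in priority order until memory runs out. It rests on several assumptions that are not taken from the product: priorities are plain integers where a lower number means more important, the pool is treated as one block of free memory, and the function name is hypothetical.

```python
def select_vms_to_restart(vms, free_memory):
    """vms is a list of (name, priority, memory) tuples.

    Lower priority numbers are treated as more important (an assumption).
    """
    started = []
    # Sort by priority, then by name so the result is deterministic.
    for name, _priority, memory in sorted(vms, key=lambda v: (v[1], v[0])):
        if memory <= free_memory:
            started.append(name)
            free_memory -= memory
    return started


vms = [("db", 0, 8), ("web", 1, 4), ("batch", 2, 6), ("cache", 1, 2)]
print(select_vms_to_restart(vms, free_memory=14))
```

With 14 units of free memory, the three most important VMs fit and the lowest-priority VM is left stopped.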

If you attempt to start or resume a VM and that action would cause the pool to be overcommitted, a warning is displayed in XenCenter. The message can also be sent to an email address, if configured. You are given the option to cancel the operation, or proceed anyway, causing the pool to become overcommitted.

Working with an HA-enabled pool

The best practice for high availability is not to make configuration changes to the pool while high availability is enabled. Instead, high availability is intended to be the “2am safeguard” which restarts servers in the event of a problem when there isn’t a human administrator nearby. If you are actively making configuration changes in the pool, such as applying software updates, disable high availability while you make these changes.

  • If you try to shut down a protected VM from XenCenter, XenCenter offers you the option of removing the VM from the pool failure plan first and then shutting it down. This ensures that accidental VM shutdowns do not result in downtime, but that you can still stop a protected VM if you really want to.
  • If you need to reboot a server when high availability is enabled, XenCenter automatically uses the VM restart priorities to determine if this would invalidate the pool failure plan. If it doesn’t affect the plan, then the server is shut down normally. If the plan would be violated, but the maximum failure capacity is greater than 1, XenCenter gives you the option of lowering the pool’s server failure limit by 1. This action reduces the overall resilience of the pool, but always ensures that at least one server failure is tolerated. When the server comes back up, the plan is automatically recalculated and the original server failure limit is restored if appropriate.
  • When you install software updates using the Install Update wizard, you must disable high availability on the pool by clicking the Turn HA off option, and keep it disabled until the update has been installed. If you do not disable high availability, the update does not proceed. Monitor the pool manually while updates are being installed to ensure that server failures do not disrupt the operation of the pool.
  • When high availability is enabled, some operations that would compromise the plan for restarting VMs may be disabled, such as removing a server from a pool. To perform these operations, either temporarily disable high availability or shut down the protected VMs before continuing.

High availability and role-based access control (RBAC)

In Citrix Hypervisor environments where role-based access control (RBAC) is implemented, not all users are permitted to change a pool’s high availability configuration settings. For example, VM Operators do not have sufficient permissions to adjust the failover capacity for an HA-enabled pool. If starting a VM would reduce the maximum failure capacity to a value lower than the configured server failure limit, the VM Operator cannot start the VM. Only Pool Administrator or Pool Operator-level users can configure the number of server failures allowed.

In this case, the Pool Administrator or Pool Operator who enables high availability can set the server failure limit to a number that is lower than the maximum failure capacity. This setting creates slack capacity and so ensures that less privileged users can start up new VMs. It reduces the pool’s failover capacity without threatening the failure plan.
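The effect of slack capacity can be shown with a small check. This sketch is an assumption-laden model rather than the product's permission logic: the function name is hypothetical, the capacity after the VM starts is taken as a given input, and Pool Administrators and Pool Operators are modelled as simply being allowed to proceed when the plan would be broken.

```python
# Roles that can change the high availability configuration.
PLAN_EDITOR_ROLES = {"Pool Administrator", "Pool Operator"}


def may_start_vm(role, capacity_after_start, server_failure_limit):
    # The plan still holds if the pool can tolerate at least as many
    # failures as the configured limit once the VM is running.
    if capacity_after_start >= server_failure_limit:
        return True
    # Otherwise only roles that can change the plan may go ahead.
    return role in PLAN_EDITOR_ROLES


# Starting the VM would lower the maximum failure capacity from 3 to 2.
print(may_start_vm("VM Operator", capacity_after_start=2, server_failure_limit=3))
print(may_start_vm("VM Operator", capacity_after_start=2, server_failure_limit=1))
```

With the limit set equal to the original capacity of 3, the VM Operator is blocked. With the limit lowered to 1, the same start succeeds because the plan still tolerates more failures than the limit requires.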