Disaster Recovery (DR)
The disaster recovery (DR) feature allows you to recover VMs and vApps from a catastrophic hardware failure that disables or destroys a whole pool or site.
For protection against single-server failures, use high availability, which restarts VMs on an alternate server in the same pool.
Disaster recovery stores all the information needed to recover your business-critical VMs and vApps on storage repositories (SRs). These storage repositories are then replicated from your primary (production) environment to a backup environment. When a protected pool at your primary site fails, the VMs and vApps in that pool can be recovered from the replicated storage and recreated on a secondary (DR) site. The result is minimal application or user downtime.
After the recovered VMs are up and running in the DR pool, the DR pool metadata must also be saved on storage that is replicated. This action allows recovered VMs and vApps to be restored back to the primary site when it is back online.
Disaster Recovery can only be used with LVM over HBA or LVM over iSCSI storage types.
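A script that prepares SRs for DR could filter on SR type before doing anything else. The following is a minimal sketch, assuming the SR type strings `lvmohba` and `lvmoiscsi` correspond to LVM over HBA and LVM over iSCSI. The function name is illustrative and is not part of any Citrix Hypervisor API.

```python
# Assumption: these type strings identify LVM over HBA and LVM over iSCSI SRs.
DR_CAPABLE_SR_TYPES = frozenset({"lvmohba", "lvmoiscsi"})


def is_dr_capable(sr_type: str) -> bool:
    """Return True if an SR of this type can be used with Disaster Recovery."""
    return sr_type.strip().lower() in DR_CAPABLE_SR_TYPES


if __name__ == "__main__":
    for sr_type in ("lvmoiscsi", "nfs", "lvmohba", "ext"):
        verdict = "usable" if is_dr_capable(sr_type) else "not usable"
        print(f"{sr_type}: {verdict} for DR")
```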
Citrix Hypervisor VMs consist of two components:
- Virtual disks that are being used by the VM, stored on configured storage repositories (SRs) in the pool where the VMs are located.
- Metadata describing the VM environment. The metadata contains all the information required to recreate the VM if the original VM is unavailable or corrupted. Most metadata is written when the VM is created and is updated only when you change the VM configuration. For VMs in a pool, a copy of this metadata is stored on every server in the pool.
In a DR environment, VMs are recreated on a secondary (DR) site from the pool metadata, which is configuration information about all the VMs and vApps in the pool. The metadata for each VM includes its name, description, and Universally Unique Identifier (UUID), and its memory, virtual CPU, networking, and storage configuration. It also includes the VM startup options used when restarting the VM in a high availability or DR environment: start order, delay interval, and restart priority. For example, when recovering VMs, the VMs within a vApp restart in the DR pool in the order and with the delay intervals specified in the metadata.
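To illustrate how start order and delay interval could shape a restart sequence, here is a minimal sketch. The `VmStartup` type and `startup_plan` function are hypothetical. The sketch assumes that VMs with the same start order start together and that the next group waits for the longest delay in the current group; the product's actual scheduling may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmStartup:
    """Hypothetical subset of a VM's startup options from the pool metadata."""
    name: str
    order: int    # start order: lower values start first
    delay_s: int  # seconds to wait after this VM starts before the next group


def startup_plan(vms):
    """Return (offset_seconds, vm_name) pairs in the order the VMs would start."""
    plan = []
    offset = 0
    for order in sorted({vm.order for vm in vms}):
        group = sorted((vm for vm in vms if vm.order == order), key=lambda vm: vm.name)
        for vm in group:
            plan.append((offset, vm.name))
        # Assumption: the next group waits for the longest delay in this group.
        offset += max(vm.delay_s for vm in group)
    return plan


if __name__ == "__main__":
    vapp = [VmStartup("web", 2, 0), VmStartup("db", 0, 30), VmStartup("app", 1, 20)]
    for offset, name in startup_plan(vapp):
        print(f"t+{offset}s: start {name}")
```

With the sample vApp, the database starts first, the application server 30 seconds later, and the web server 20 seconds after that.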
To use Disaster Recovery, you must be logged in as root or have a role of Pool Operator or higher.
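The permission rule can be expressed as a simple rank comparison. This sketch assumes the standard role-based access control role names ordered from least to most privileged. The function is illustrative only and does not query a real pool.

```python
# Assumption: RBAC role names, ordered from least to most privileged.
ROLES_BY_PRIVILEGE = [
    "read-only",
    "vm-operator",
    "vm-admin",
    "vm-power-admin",
    "pool-operator",
    "pool-admin",
]
MINIMUM_DR_ROLE = "pool-operator"


def can_use_dr(user, role=None):
    """Return True for root, or for a role of Pool Operator or higher."""
    if user == "root":
        return True
    if role not in ROLES_BY_PRIVILEGE:
        return False
    return ROLES_BY_PRIVILEGE.index(role) >= ROLES_BY_PRIVILEGE.index(MINIMUM_DR_ROLE)


if __name__ == "__main__":
    print(can_use_dr("root"))
    print(can_use_dr("alice", "pool-admin"))
    print(can_use_dr("bob", "vm-admin"))
```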
Disaster recovery terminology
vApp: A logical group of related VMs which are managed as a single entity.
Site: A physical group of Citrix Hypervisor resource pools, storage, and hardware equipment.
Primary site: A physical site that runs VMs or vApps which must be protected in the event of disaster.
Secondary site, DR site: A physical site whose purpose is to serve as the recovery location for the primary site, in the event of a disaster.
Failover: Recovery of VMs and vApps on a secondary (recovery) site in the event of disaster at the primary site.
Failback: Restoration of VMs and vApps back to the primary site from a secondary (recovery) site.
Test failover: A “dry run” failover where VMs and vApps are recovered from replicated storage to a pool on a secondary (recovery) site but not started up. Test failovers can be run to check that DR is correctly configured and that your processes are effective.
Pool metadata: Information about the VMs and vApps in the pool, such as their name and description. For VMs, the configuration information includes UUID, memory, virtual CPU, networking and storage configuration, and startup options. Pool metadata is used in DR to re-create the VMs and vApps from the primary site in a recovery pool on the secondary site.
Disaster recovery infrastructure
To use Disaster Recovery, set up the appropriate DR infrastructure at both the primary and secondary sites:
- The storage used for both the pool metadata and the virtual disks used by the VMs must be replicated from your primary (production) environment to a backup environment. Storage replication, for example using mirroring, varies from device to device. We recommend you use your storage solution to handle storage replication.
- After recovered VMs and vApps are up and running on a pool on your DR site, replicate the SRs containing the DR pool metadata and virtual disks. This action allows the recovered VMs and vApps to be restored back to the primary site (failed back) once the primary site is back online.
- The hardware infrastructure at your DR site does not have to match the primary site. However, the Citrix Hypervisor environment must be at the same release and patch level. Also, sufficient resources must be configured in the target pool to allow all the failed over VMs to be re-created and started.
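The last requirement lends itself to an automated check before a disaster occurs. The sketch below is one possible way to compare the two pools. The `PoolInfo` type is hypothetical, the values would have to be gathered by your own tooling, and memory stands in here for the wider set of resources (CPU, storage, networks) that a real check would cover.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolInfo:
    """Hypothetical summary of a pool; not a Citrix Hypervisor API type."""
    release: str
    patches: frozenset
    free_memory_mib: int


def dr_pool_problems(primary, dr, required_memory_mib):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if primary.release != dr.release:
        problems.append(f"release mismatch: {primary.release} vs {dr.release}")
    if primary.patches != dr.patches:
        # Patches applied on one pool but not the other.
        differing = sorted(primary.patches ^ dr.patches)
        problems.append("patch level mismatch: " + ", ".join(differing))
    if dr.free_memory_mib < required_memory_mib:
        short = required_memory_mib - dr.free_memory_mib
        problems.append(f"DR pool is short of {short} MiB of memory")
    return problems


if __name__ == "__main__":
    primary = PoolInfo("8.2", frozenset({"patch-a", "patch-b"}), 0)
    dr = PoolInfo("8.2", frozenset({"patch-a"}), 8192)
    for problem in dr_pool_problems(primary, dr, required_memory_mib=16384):
        print(problem)
```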
XenCenter and the Disaster Recovery wizard do not control any storage array functionality. Ensure the pool metadata, and the storage used by the VMs which are to be restarted in the event of a disaster, are replicated to a backup site. Some storage arrays contain mirroring features to achieve the copy automatically. If these features are used, disable the mirror functionality before VMs are restarted on the recovery site.
Failover, failback, and test failover with the Disaster Recovery wizard
The Disaster Recovery wizard makes failover and failback simple. The steps involved in these processes are outlined here:
Failover:
1. Choose a target pool on your secondary DR site to which you want to recover your VMs and vApps.
2. Provide details of the storage targets containing the replicated SRs from your primary site. The wizard scans the targets and lists all SRs found there.
3. Select the SRs containing the metadata and virtual disks for the VMs and vApps you want to recover. The wizard scans the SRs and lists all the VMs and vApps found.
4. Select which VMs and vApps you want to recover to the DR site. Specify whether you want the wizard to start them up automatically when they have been recovered, or whether you prefer to wait and start them up manually yourself.
5. The wizard performs prechecks to ensure that the selected VMs and vApps can be recovered to the target DR pool. For example, the wizard checks that all the storage required by the selected VMs and vApps is available.
6. When the prechecks are complete and any issues resolved, the failover process begins. The selected VMs and vApps are exported from the replicated storage to the DR pool. Failover is now complete.
Failback:
1. Choose the target pool on your primary site to which you want to restore the VMs and vApps currently running on the DR site.
2. Provide details of the storage targets containing the replicated SRs from your DR site. The wizard scans the targets and lists all SRs found.
3. Select the SRs containing the metadata and virtual disks for the VMs and vApps you want to restore. The wizard scans the SRs and lists all the VMs and vApps found.
4. Select which VMs and vApps you want to restore back to the primary site. Specify whether you want the wizard to start them up automatically when they have been recovered, or whether you prefer to wait and start them up manually yourself.
5. The wizard then performs prechecks to ensure that the selected VMs and vApps can be recovered to the target pool on the primary site. For example, the wizard checks that all the storage required by the selected VMs and vApps is available.
6. When the prechecks are complete and any issues resolved, the failback process begins. The selected VMs and vApps running on your DR site are exported from the replicated storage back to the selected pool at your primary site. Failback is now complete.
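The storage precheck mentioned in both procedures can be pictured as a set difference between the SRs each selected VM needs and the SRs visible to the target pool. This is only a sketch of the idea, not the wizard's implementation. The function name, the input shapes, and the SR names are hypothetical.

```python
def missing_storage(required_srs_by_vm, available_srs):
    """Map each VM name to the sorted list of SRs it needs that are not available."""
    available = set(available_srs)
    missing = {}
    for vm, required in required_srs_by_vm.items():
        absent = sorted(set(required) - available)
        if absent:
            missing[vm] = absent
    return missing


if __name__ == "__main__":
    required = {"db": ["sr-data", "sr-logs"], "web": ["sr-data"]}
    visible = ["sr-data"]
    for vm, srs in missing_storage(required, visible).items():
        print(f"{vm}: cannot recover, missing {', '.join(srs)}")
```

A VM that does not appear in the result passed the check; recovery would proceed only when the result is empty or the reported issues are resolved.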
If the Disaster Recovery wizard finds information for the same VM in two or more places, it uses only the most recent information for that VM. For example, the information might be stored on the primary site storage, on the DR site storage, and in the pool that the data is imported into.
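The "most recent wins" rule can be sketched as a reduction keyed by VM UUID. The `VmRecord` type and its `updated` timestamp field are assumptions made for illustration; the source does not describe how the wizard determines recency.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class VmRecord:
    """Hypothetical metadata record for one VM found in one location."""
    uuid: str
    source: str        # where the record was found
    updated: datetime  # assumed last-modified time of the record


def latest_per_vm(records):
    """Return a dict of VM UUID to the most recently updated record for that VM."""
    latest = {}
    for record in records:
        current = latest.get(record.uuid)
        if current is None or record.updated > current.updated:
            latest[record.uuid] = record
    return latest


if __name__ == "__main__":
    found = [
        VmRecord("vm-1", "primary site storage", datetime(2024, 1, 10)),
        VmRecord("vm-1", "DR site storage", datetime(2024, 2, 1)),
        VmRecord("vm-2", "target pool", datetime(2024, 1, 5)),
    ]
    for uuid, record in latest_per_vm(found).items():
        print(f"{uuid}: using record from {record.source}")
```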
To make recovering VMs and vApps easier, name your SRs to indicate how your VMs and vApps are mapped to SRs, and the SRs to LUNs.
You can also use the Disaster Recovery wizard to run test failovers for non-disruptive testing of your disaster recovery system. In a test failover, the steps are the same as for failover, but recovered VMs and vApps are started up in a paused state on the DR site. Cleanup is performed when the test is finished to remove all VMs, vApps, and storage recreated on the DR site. For more information, see Test Failover.