About XenServer DR
The XenServer Disaster Recovery (DR) feature is designed to allow you to recover virtual machines (VMs) and vApps from a catastrophic failure of hardware which disables or destroys a whole pool or site. For protection against single server failures, you can use XenServer High Availability to have VMs restarted on an alternate server in the same pool.
Understanding XenServer DR
XenServer DR works by storing all the information needed to recover your business-critical VMs and vApps on storage repositories (SRs) that are then replicated from your primary (production) environment to a backup environment. When a protected pool at your primary site goes down, the VMs and vApps in that pool can be recovered from the replicated storage and recreated on a secondary (DR) site, with minimal application or user downtime.
Once the recovered VMs are up and running in the DR pool, the DR pool metadata must also be saved on storage that is replicated, allowing recovered VMs and vApps to be restored back to the primary site when it is back online.
Note: XenServer DR can only be used with LVM over HBA or LVM over iSCSI storage types.
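As a rough illustration of the storage-type restriction in the note above, the sketch below filters a list of SRs down to those that could be used with DR. The type strings `lvmohba` and `lvmoiscsi` are assumed here to correspond to LVM over HBA and LVM over iSCSI, and the SR names are invented for the example.

```python
# Assumption: these type strings identify "LVM over HBA" and "LVM over iSCSI" SRs.
DR_CAPABLE_SR_TYPES = {"lvmohba", "lvmoiscsi"}


def dr_capable(srs):
    """Return the names of SRs whose type can be used with DR."""
    return [name for name, sr_type in srs if sr_type in DR_CAPABLE_SR_TYPES]


# Hypothetical (name, type) pairs; not read from a real pool.
example_srs = [
    ("prod-iscsi-01", "lvmoiscsi"),
    ("prod-nfs-01", "nfs"),
    ("prod-fc-01", "lvmohba"),
    ("local", "lvm"),
]
print(dr_capable(example_srs))
```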
XenServer VMs consist of two components:
- Virtual disks that are being used by the VM, stored on configured storage repositories (SRs) in the pool where the VMs are located.
- Metadata describing the VM environment. This is all the information needed to recreate the VM if the original VM is unavailable or corrupted. Most metadata configuration data is written when the VM is created and is updated only when you make changes to the VM configuration. For VMs in a pool, a copy of this metadata is stored on every server in the pool.
In a DR environment, VMs are recreated on a secondary (DR) site from the pool metadata - configuration information about all the VMs and vApps in the pool. The metadata for each VM includes its name, description and Universally Unique Identifier (UUID), and its memory, virtual CPU, networking and storage configuration. It also includes the VM’s startup options - start order, delay interval and HA restart priority - which are used when restarting the VM in an HA or DR environment. For example, when recovering VMs during disaster recovery, the VMs within a vApp will be restarted in the DR pool in the order specified in the VM metadata, and with the specified delay intervals.
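The way start order and delay interval shape a vApp restart can be sketched as follows. This is an illustrative model only, not the XenServer implementation: the `VmStartup` class and its field names are hypothetical. The sketch also assumes that VMs sharing a start order are started together, and that the longest delay in a group is waited out before the next group starts.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class VmStartup:
    # Hypothetical record mirroring the startup options held in VM metadata.
    name: str
    order: int  # start order: lower values start first
    delay: int  # delay interval in seconds


def startup_plan(vms):
    """Return (order, [vm names], wait_seconds) tuples in start sequence."""
    plan = []
    ordered = sorted(vms, key=lambda vm: (vm.order, vm.name))
    for order, group in groupby(ordered, key=lambda vm: vm.order):
        members = list(group)
        # Assumption: the longest delay in a group gates the next group.
        wait = max(vm.delay for vm in members)
        plan.append((order, [vm.name for vm in members], wait))
    return plan


vapp = [
    VmStartup("web", order=2, delay=0),
    VmStartup("db", order=0, delay=30),
    VmStartup("app", order=1, delay=15),
    VmStartup("cache", order=1, delay=5),
]
for order, names, wait in startup_plan(vapp):
    print(order, names, wait)
```

In this example the database VM starts first, the application and cache VMs start together after its delay, and the web VM starts last.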
XenServer DR requirements
|Software version|XenServer version 6.0 or later|
|Access|You must be logged in as root or have a role of Pool Operator or higher.|
Disaster recovery infrastructure
To use XenServer DR, the appropriate DR infrastructure needs to be set up at both the primary and secondary sites:
- The storage used for both the pool metadata and the virtual disks used by the VMs must be replicated from your primary (production) environment to a backup environment. Storage replication, for example using mirroring, is best handled by your storage solution, and will vary from device to device.
- Once VMs and vApps have been recovered to a pool on your DR site and are up and running, the SRs containing the DR pool metadata and virtual disks must also be replicated to allow the recovered VMs and vApps to be restored back to the primary site (failed back) once the primary site is back online.
- The hardware infrastructure at your DR site does not have to match the primary site, but the XenServer environment must be at the same release and patch level, and sufficient resources should be configured in the target pool to allow all the failed over VMs to be re-created and started.
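The requirement that the DR pool match the primary pool's release and patch level, and have enough resources, can be expressed as a simple comparison. The sketch below is a hypothetical planning aid, not part of XenServer: `PoolSummary`, its fields, and the patch names are invented, and memory stands in for "sufficient resources" in general.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolSummary:
    # Hypothetical summary of a pool; not a XenServer API object.
    release: str
    patches: frozenset
    free_memory_mib: int


def dr_pool_issues(primary, dr, required_memory_mib):
    """List the reasons the DR pool is not ready to receive failed-over VMs."""
    issues = []
    if dr.release != primary.release:
        issues.append(f"release {dr.release} != {primary.release}")
    if dr.patches != primary.patches:
        differing = sorted(dr.patches ^ primary.patches)
        issues.append(f"patch level differs: {differing}")
    if dr.free_memory_mib < required_memory_mib:
        short = required_memory_mib - dr.free_memory_mib
        issues.append(f"short of memory by {short} MiB")
    return issues


# Made-up values for illustration.
primary = PoolSummary("6.0", frozenset({"hotfix-1", "hotfix-2"}), 0)
dr_site = PoolSummary("6.0", frozenset({"hotfix-1"}), 16384)
print(dr_pool_issues(primary, dr_site, required_memory_mib=24576))
```

An empty list means none of the modelled checks failed; it does not replace the prechecks that the Disaster Recovery wizard runs.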
Important: XenCenter and the Disaster Recovery wizard do not control any storage array functionality. Users of the Disaster Recovery feature must ensure that the pool metadata and the storage used by the VMs which are to be restarted in the event of a disaster are replicated to a backup site. Some storage arrays contain mirroring features to achieve the copy automatically: if these features are used, then it is essential that the mirror functionality is disabled (the mirror is broken) before VMs are restarted on the recovery site.
Failover, Failback and Test Failover with the Disaster Recovery wizard
The Disaster Recovery wizard makes failover (recovery of protected VMs and vApps to a secondary site) and failback (restoration of VMs and vApps back to the primary site) simple. The steps involved in the two processes are outlined here:
Failover
- First, you choose a target pool on your secondary DR site to which you want to recover your VMs and vApps.
- Next, you provide details of the storage targets containing the replicated SRs from your primary site. The wizard scans the targets and lists all SRs found there.
- Now you select the SRs containing the metadata and virtual disks for the VMs and vApps you want to recover. The wizard scans the SRs and lists all the VMs and vApps found.
- Now you select which VMs and vApps you want to recover to the DR site, and specify whether you want the wizard to start them up automatically as soon as they have been recovered, or whether you prefer to start them up manually yourself later.
- The wizard then performs a number of prechecks to ensure that the selected VMs and vApps can be recovered to the target DR pool; for example, it checks that all the storage required by the selected VMs and vApps is available.
- Finally, when the prechecks are complete and any issues resolved, the failover process begins. The selected VMs and vApps are exported from the replicated storage to the DR pool.
Failover is now complete.
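The storage precheck mentioned in the failover steps can be pictured as a set comparison between the storage each selected VM needs and the storage that is available. The following is a minimal sketch under that assumption; the function name and the SR identifiers are hypothetical, and the wizard's actual prechecks are not documented here.

```python
def missing_storage(selected, available_srs):
    """Map each VM name to the sorted SR identifiers it needs but cannot reach."""
    available = set(available_srs)
    problems = {}
    for vm_name, required_srs in selected.items():
        missing = sorted(set(required_srs) - available)
        if missing:
            problems[vm_name] = missing
    return problems


# Made-up selection: VM name -> SRs holding its virtual disks.
selected = {
    "db": ["sr-data", "sr-logs"],
    "web": ["sr-data"],
    "batch": ["sr-archive"],
}
available = ["sr-data", "sr-logs"]

problems = missing_storage(selected, available)
if problems:
    print("resolve before failover:", problems)
else:
    print("prechecks passed")
```

Recovery would only proceed once the returned mapping is empty, which matches the rule that issues are resolved before the failover begins.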
Failback
- First, you choose the target pool on your primary site to which you want to restore the VMs and vApps currently running on the DR site.
- Next, you provide details of the storage targets containing the replicated SRs from your DR site. The wizard scans the targets and lists all SRs found.
- Now you select the SRs containing the metadata and virtual disks for the VMs and vApps you want to restore. The wizard scans the SRs and lists all the VMs and vApps found.
- Now you select which VMs and vApps you want to restore back to the primary site, and specify whether you want the wizard to start them up automatically as soon as they have been recovered, or whether you prefer to start them up manually yourself later.
- The wizard then performs a number of prechecks to ensure that the selected VMs and vApps can be recovered to the target pool on the primary site; for example, it checks that all the storage required by the selected VMs and vApps is available.
- Finally, when the prechecks are complete and any issues resolved, the failback process begins. The selected VMs and vApps running on your DR site are exported from the replicated storage back to the selected pool at your primary site.
Failback is now complete.
If the Disaster Recovery wizard finds information for the same VM in two or more places (for example, storage from the primary site, storage from the DR site, and the pool that the data is to be imported into), it ensures that only the most recent information for each VM is used.
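The "most recent information per VM" rule amounts to keeping one record per VM UUID. How the wizard decides which copy is most recent is not stated here, so the sketch below assumes each metadata record carries a modification timestamp; the `modified` and `source` fields and the sample records are hypothetical.

```python
from datetime import datetime


def latest_per_vm(records):
    """Keep only the most recently modified record for each VM UUID."""
    latest = {}
    for record in records:
        seen = latest.get(record["uuid"])
        # Assumption: "most recent" is decided by a modification timestamp.
        if seen is None or record["modified"] > seen["modified"]:
            latest[record["uuid"]] = record
    return latest


# Made-up records for the same VMs found in several places.
records = [
    {"uuid": "vm-a", "source": "primary SR", "modified": datetime(2012, 3, 1, 9, 0)},
    {"uuid": "vm-a", "source": "DR SR", "modified": datetime(2012, 3, 4, 17, 30)},
    {"uuid": "vm-a", "source": "target pool", "modified": datetime(2012, 2, 20, 8, 0)},
    {"uuid": "vm-b", "source": "primary SR", "modified": datetime(2012, 3, 2, 12, 0)},
]

chosen = latest_per_vm(records)
for uuid, record in sorted(chosen.items()):
    print(uuid, record["source"])
```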
Tip: Recovering VMs and vApps from replicated storage will be easier if your SRs are named in a way that captures how your VMs and vApps are mapped to SRs, and the SRs to LUNs.
You can also use the Disaster Recovery wizard to run test failovers for non-disruptive testing of your disaster recovery system. In a test failover, all the steps are the same as for failover, but the VMs and vApps are started up in a paused state after they have been recovered to the DR site, and cleanup is performed when the test is finished to remove all VMs, vApps and storage recreated on the DR site. See Test Failover.
XenServer DR terminology
vApp: A logical group of related VMs which are managed as a single entity.
Site: A physical group of XenServer resource pools, storage and hardware equipment.
Primary site: A physical site that runs VMs or vApps which must be protected in the event of disaster.
Secondary site, DR site: A physical site whose purpose is to serve as the recovery location for the primary site, in the event of a disaster.
Failover: Recovery of VMs and vApps on a secondary (recovery) site in the event of disaster at the primary site.
Failback: Restoration of VMs and vApps back to the primary site from a secondary (recovery) site.
Test failover: A “dry run” failover where VMs and vApps are recovered from replicated storage to a pool on a secondary (recovery) site but not actually started up. Test failovers can be run to check that DR is correctly configured and that your processes are effective.
Pool metadata: Information about the VMs and vApps in the pool, such as their name and description, and, for VMs, configuration information including UUID, memory, virtual CPU, networking and storage configuration, and startup options - start order, delay interval and HA restart priority. Pool metadata is used in DR to re-create the VMs and vApps from the primary site in a recovery pool on the secondary site.