Sizing VDA Instances on Google Cloud Compute Engine

Introduction

This document provides single-instance scalability and economic guidance for enterprises that are deploying Citrix workloads (VDAs) on Google Cloud Compute Engine. The focus is on hosted shared workloads (Windows Server VDAs with the RDSH role installed) running on Google Compute Engine VM instances. Scale testing was performed using a heavy, reproducible synthetic workload, with the goal of providing conservative yet actionable guidance. We close with some sizing guidance for Google Cloud VMware Engine.

Executive Summary

The findings from this study suggest the following:

  • For a large percentage of use cases (workloads), the performance of a Citrix VDA tends to be limited by CPU rather than by disk or network.
  • The best end user experience (as measured, given our heavy reference workload) was provided by the latest processors available on Google Cloud at the time of publication. These are available in the N2, N2D, and C2 instance families, and leverage Intel’s Cascade Lake and AMD’s Milan processors.
  • Google’s Tau family of virtual machines was unavailable for this test, but is likely worth considering for Citrix VDA workloads.
  • The 4-vCPU and 8-vCPU instance sizes tended to deliver the highest density and lowest cost per user.
  • For our heavy test workload, the N2D-Standard-8 instance type (with the AMD Milan processor) was the most cost effective, providing costs as low as $0.028 per user hour using published pricing.
  • With the latest processors, the storage type used for persistent disk does not have a material impact on instance scalability.
  • While not explicitly tested as part of this study, workloads which require a GPU must be run on N1 instance types. Where GPU acceleration is a requirement, leveraging Skylake processors and Premium SSD persistent disk storage will likely yield the best performance and highest scalability.

Scalability Considerations

In this section, we discuss some of the key factors that affect single-server scalability, such as workload, CPU overcommit, processor architecture, and storage selection.

The workload selection used for testing should approximate the users in your environment. For instance, if 50% of your users generate a light user load, and 30% of your users generate a moderate load and the remaining 20% generate a heavy workload, then using a moderate workload would give you a rough idea of what to expect. However, if you had an environment where 20% of the users generate a light user load and 80% generate a heavy workload, using a heavy workload would give you the best approximation.

The ideal situation is to perform scalability testing using your own workload, rather than rely on scalability numbers derived from our testing. One useful feature of Login Enterprise is that you can easily create your own custom workloads or build a workload from the application workflows provided out of the box. For this test, we wanted to select a workload that would be representative of most users and provide relevant data for comparison, while providing conservative estimates for those enterprises not performing scalability testing on their own workloads.

Another consideration is the ratio of users to vCPUs on an instance. This ratio compares the number of users expected to the number of vCPUs present on the virtual machine. For instance, if we expect 2 users for each vCPU, the ratio would be expressed as 2:1. Since CVAD workloads are CPU-intensive, the overcommit ratio is commonly used for estimating, and we provide it for the recommended instance types. The number of users per vCPU can vary based on the workload, the processor architecture, or even the amount of available memory in a host.
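As a small illustration of the ratio described above, the sketch below computes users per vCPU and formats it in the N:1 style. The function names are hypothetical and the user counts are illustrative inputs, not measured results.

```python
def users_per_vcpu(expected_users: int, vcpus: int) -> float:
    """Return the user-to-vCPU ratio for a single instance."""
    if vcpus <= 0:
        raise ValueError("vcpus must be positive")
    return expected_users / vcpus


def format_ratio(expected_users: int, vcpus: int) -> str:
    """Express the ratio in the 'N:1' form used in the text."""
    return f"{users_per_vcpu(expected_users, vcpus):g}:1"


# Illustrative input: 16 users on an 8-vCPU instance is a 2:1 ratio.
print(format_ratio(16, 8))
```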

The processor architecture also has an impact on the scalability results, as one might expect with CPU-intensive workloads. One of the goals of this study was to determine which compute instance families and shapes are the most cost effective on Google Cloud. All things being equal, faster processors typically support more users per instance than slower processors, but typically come at an additional cost.

The type and size of storage used for persistent disk on the virtual machine can impact single-server scalability because the local disk is used for swapping memory and storing user profiles. Faster disks reduce disk latency, which leads to faster read/write operations. If a disk runs out of space to store user profiles, or has long latency when retrieving swapped memory, the overall performance of the host will degrade. Using Premium SSD storage for persistent disks on Google Cloud should minimize any storage bottlenecks and ensure we are testing the compute capacity of the instance type and not the storage. To evaluate storage, we keep the instance type static and vary the disk type.

Another factor is the launch rate of the sessions. If the sessions are launched quickly (say, 1 session every 10 seconds) versus moderately (1 session per minute) you will find different results. Logon storms will stress an environment and you should always plan to have sufficient capacity to support your peak logon periods. However, scalability testing is usually completed at a moderate logon rate of one session every 30 seconds or 60 seconds depending on the workload intensity. The longer intervals provide a more accurate view of the number of active sessions that can be supported in a steady state.

Test Environment

For the scalability testing, the infrastructure VMs were configured as follows:

  • One Login Enterprise virtual appliance running version 4.5.11
  • Four Login Enterprise launchers running on e2-standard-2 instances using a session launch interval of 1 minute
  • One Citrix Cloud Connector
  • One Active Directory domain controller acting also as a profile server and DNS Server

Citrix workloads were run on a single Windows Server 2019 Datacenter instance with the following configured software:

  • Microsoft RDSH role installed, including desktop experience
  • Citrix Server Multi-session OS VDA 2103.0.0.29045
  • Microsoft Office 2019
  • Microsoft Defender with default settings
  • Latest Windows updates available at time of testing (September 2021)
  • Out of the box settings were used unless otherwise specified

Instance Types

For this study, we first selected machine families where we expected to have the most efficient use of resources, including the following:

  • N1 (General Purpose Intel Skylake with GPU support)
  • N2 (General Purpose Intel Cascade Lake)
  • N2D (General Purpose AMD)
  • C2 (Compute Optimized workload Intel Cascade Lake)
  • E2 (Cost optimized Intel Skylake)

Note:

Google’s Tau series T2D instances were unavailable at the time of testing, but based on the results detailed here we have high expectations for them.

We started with evaluating the 8-vCPU versions of the selected machine families and evaluated the results to determine the top performers. We then went a step further and evaluated the 4-vCPU and 16-vCPU versions in those machine families. Test runs were completed on the following virtual machine instance types:

  • n1-standard-8 (Skylake)
  • n1-highcpu-8 (Skylake)
  • n1-highmem-8 (Skylake)
  • n2-standard-8 (Cascade Lake)
  • c2-standard-4 (Cascade Lake)
  • c2-standard-8 (Cascade Lake)
  • c2-standard-16 (Cascade Lake)
  • e2-standard-8 (Skylake)
  • n2d-standard-4 (Rome)
  • n2d-standard-8 (Rome)
  • n2d-standard-16 (Rome)
  • n2d-standard-4 (Milan)
  • n2d-standard-8 (Milan)
  • n2d-standard-16 (Milan)

We learned early on that the 2-vCPU versions of the machine families are not well suited for our heavy reference workload. For this reason, our testing focused on the 4-vCPU, 8-vCPU, and 16-vCPU instance shapes.

User sessions were simulated using Login Enterprise on each instance type in different test runs to analyze the scalability of different virtual machine instance types and shapes.

Lab Architecture

All infrastructure for session brokering and administration was provided by the Citrix Cloud Virtual Apps and Desktops service. An Active Directory domain controller and Cloud Connector were installed into a VPC, creating a resource location on Google Cloud. The figure below depicts the test architecture.

Lab Architecture

Note:

The N2 and N2D instances were not available in the US-West2 region, so they were created in the US-West1 region.

Login Enterprise Knowledge Worker Workload with End-user Experience (EUX) Score

This version of Login Enterprise provides a way to construct a virtual user’s workflow into a common Office 365 knowledge worker workload. Working closely with Login VSI, this testing also used an experimental scoring system called the EUX score. The EUX score is a way to quantify the responsiveness of a user’s experience, or feel, within a desktop or application session. This user’s workflow includes the following applications:

  • Login VSI EUX Score Application
  • Microsoft Outlook
  • Microsoft Word
  • Microsoft Excel
  • Microsoft PowerPoint
  • Microsoft Edge
  • YouTube

The Login Enterprise Knowledge Worker is a heavy, CPU intensive workload when used at 100% concurrency, and allowed for about 2 users per vCPU on average on the cloud instances tested.

Multiple times during the session, the EUX Scoring Application executes a set of instructions and records the time it takes to execute each step. The results can then be used to generate the EUX score of a session. Those familiar with the legacy Login VSI application will find these operations familiar.

The theoretical maximum EUX score is 10; however, the score is most useful as a relative measure, comparing one configuration or load level against another. For instance, the graph below compares the same user load on an n2d-standard-8 instance (solid line) and a c2-standard-8 instance (dashed line). The graph clearly shows the n2d-standard-8 instance, with an average EUX score of 8.07, outperforming the c2-standard-8 instance, with an average EUX score of 7.59, throughout the test. You can also see how the score drops slightly as the system becomes busy.

EUX Score Trend Line

The average EUX score, while not a perfect representation of the entire test run, does provide an indicator of how a particular configuration is likely to perform under the workload conditions. The table below shows the EUX scores for different machine type configurations for a single user running the EUX Workload.

Instance type                 vCPUs  Memory (GiB)  EUX Score (Single-user)
n2d-standard-4-Milan AMD      4      16            8.61
n2d-standard-8-Milan AMD      8      32            8.54
c2-standard-8-Cascade Lake    8      32            8.5
c2-standard-16-Cascade Lake   16     64            8.4
n2d-standard-16-Milan AMD     16     64            8.35
c2-standard-4-Cascade Lake    4      16            8.13
n1-highmem-8-Skylake          8      42            7.75
n1-standard-8-Skylake         8      30            7.7
n2d-standard-8-Rome AMD       8      32            7.7
n2d-standard-16-Rome AMD      16     64            7.7
n1-highcpu-8-Skylake          8      7.2           7.56
n2-standard-8-Cascade Lake    8      32            7.56
e2-standard-8-Broadwell       8      32            7.45
n2d-standard-4-Rome AMD       4      16            7.29

When used by itself for a single user, the EUX score does provide a relatively good indicator of the virtual machine’s maximum performance potential for the test workload. Unfortunately, the EUX score does not provide enough data to determine the number of users that a particular instance type can support. For that, we need to measure and evaluate additional criteria such as the CPU utilization and the new Windows User Input Delay performance counters.

Scoring Methodology

Analyzing the test results on any system is always a challenge and requires an analysis of multiple components within the system. Since we are using the same test on each host, the test results are consistent enough for comparison between the instance configurations. To derive an estimated number of users that a particular instance type would support, we also measured two key performance counters: CPU Utilization and User Input Delay.

CPU Utilization: This counter (Processor(_Total)\% Processor Time) measures the percentage of elapsed time that all the vCPUs of the processor are busy executing non-idle thread tasks. CPU utilization has historically been one of the best indicators of user experience because as the CPU utilization approaches a full load, the user experience degrades rapidly.

For this analysis, a one-minute moving average was calculated based on the total processor time. We looked at the number of users on the system when the moving average reached 85% and again at 90%.

User Input Delay: This counter (User Input Delay per Session(Max)\Max Input Delay) measures the time it takes for user input, such as from a mouse or keyboard, to be taken out of the queue and processed. This counter is a good indication of the user experience because users will notice the delay once it exceeds a second.

For this analysis, a one-minute moving average was calculated based on the maximum user delay experienced by any user on the session host. For example, if 3 users were on the system and all three experienced user delays, the user with the highest input delay would be the one selected for the moving average. We looked at the number of users on the system when the moving average for the worst experience reached a 1-second (1000 ms) delay and again when it reached a 2-second (2000 ms) delay. Remember that not all the users would be experiencing this type of delay, but any one user is enough to warrant a help desk call when a 2-second delay is common for all mouse and keyboard movement.
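The moving-average lookups described for both counters can be sketched as follows. This is a minimal illustration, not the tooling used in the study: the function name, the sample layout, and the 15-second sampling interval in the example are all assumptions. Each sample pairs the number of active users with a counter value, for example total CPU percentage, or the highest input delay in milliseconds across all sessions.

```python
from collections import deque
from typing import Iterable, Optional, Tuple


def users_at_threshold(
    samples: Iterable[Tuple[int, float]],
    threshold: float,
    window: int,
) -> Optional[int]:
    """Return the active-user count when the moving average first reaches threshold.

    samples: (active_users, counter_value) pairs in time order.
    window:  number of samples that make up one minute (assumed sampling rate).
    """
    recent = deque(maxlen=window)
    for active_users, value in samples:
        recent.append(value)
        # Only evaluate once a full one-minute window is available.
        if len(recent) == window and sum(recent) / window >= threshold:
            return active_users
    return None  # threshold never reached during the run


# Hypothetical max input delay samples (ms), one every 15 seconds (window = 4).
delay_samples = [
    (1, 200), (2, 400), (3, 600), (4, 800), (5, 1200),
    (6, 1600), (7, 2000), (8, 2400), (9, 3000),
]
print(users_at_threshold(delay_samples, 1000, window=4))
print(users_at_threshold(delay_samples, 2000, window=4))
```

The same function applies to the CPU counter by passing CPU percentages as the values and 85 or 90 as the threshold.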

These two ranges gave us data points as to the number of users an instance type could support.

Minimum number of users: Calculated by comparing the number of users at 85% CPU utilization with the number of users at a User Input Delay of 1 second and selecting the lower number.

Maximum number of users: Calculated by comparing the number of users at 90% CPU utilization with the number of users at a User Input Delay of 2 seconds and selecting the lower number.
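The two definitions above reduce to a pair of min() calls. The sketch below uses a hypothetical function name and illustrative user counts.

```python
from typing import Tuple


def user_range(
    users_at_cpu_85: int,
    users_at_delay_1s: int,
    users_at_cpu_90: int,
    users_at_delay_2s: int,
) -> Tuple[int, int]:
    """Return (minimum users, maximum users) for one instance type."""
    minimum = min(users_at_cpu_85, users_at_delay_1s)
    maximum = min(users_at_cpu_90, users_at_delay_2s)
    return minimum, maximum


# Illustrative counts only, not results from the study.
print(user_range(18, 20, 21, 19))
```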

Expected Users

Before determining the expected costs, we first needed to determine how many users run successfully on an instance type. This turned out to be a little more challenging than originally expected because the EUX scoring system did not tell us the total number of users that could fit on the system before performance degraded to an unacceptable level, but rather the general experience of all the users on the system.

Using the combination of CPU Utilization and User Input Delay counters as described above, we were able to identify the range of expected users on a particular instance type. Admittedly, the choice of the metric boundaries (85-90% CPU and 1-2 second input delays) seems a bit arbitrary, but for our purposes it provides a reference point to be used in combination with the EUX score. In the end, we decided to go with the most conservative number for expected users by using the following calculation:

Expected Users: Calculated by taking the lowest of the following values for a test run that passed with 99% or higher:

  1. Users logging in during the test run
  2. Users logged in when the Max UID 1-minute moving average was above 1 second
  3. Users logged in when the Max UID 1-minute moving average was above 2 seconds
  4. Users logged in when the CPU 1-minute moving average was above 85%
  5. Users logged in when the CPU 1-minute moving average was above 90%
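The calculation listed above can be sketched as a minimum over the five values, gated on the pass criterion. The class and field names are hypothetical, and treating "passed with 99% or higher" as a success rate expressed as a fraction is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunSummary:
    """Per-run values; names are illustrative, not Login Enterprise fields."""
    success_rate: float      # assumed: fraction of the run that passed, 0.0 to 1.0
    users_logged_in: int     # users logging in during the test run
    users_at_uid_1s: int     # users when Max UID moving average passed 1 second
    users_at_uid_2s: int     # users when Max UID moving average passed 2 seconds
    users_at_cpu_85: int     # users when CPU moving average passed 85%
    users_at_cpu_90: int     # users when CPU moving average passed 90%


def expected_users(run: RunSummary, min_success_rate: float = 0.99) -> Optional[int]:
    """Return the most conservative user count, or None if the run did not pass."""
    if run.success_rate < min_success_rate:
        return None
    return min(
        run.users_logged_in,
        run.users_at_uid_1s,
        run.users_at_uid_2s,
        run.users_at_cpu_85,
        run.users_at_cpu_90,
    )


# Illustrative run: the 1-second input delay value is the limiting factor.
print(expected_users(RunSummary(1.0, 25, 23, 26, 24, 27)))
```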

Expected Costs

The charts below show the cost per user per hour for each of the 8-vCPU instances tested with this study. All hosts are priced with a 100 GB Performance Disk and the applicable Windows licensing costs included in the Total.

8 vCPU Family Comparison

As you can see from the data, the most cost-effective machine family configurations for this workload are the c2-standard (Cascade Lake), n2d-standard (Rome), and n2d-standard (Milan). We were able to achieve a total compute and storage cost of $0.028 per hour per user with the n2d-standard-8 (Milan) configuration, using the machine costs for the US-Central1 region as of November 1, 2021.
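The cost-per-user figure is the total hourly cost of the host divided by the expected users. The sketch below shows that arithmetic under stated assumptions: the function name is hypothetical, and the prices in the example are placeholders rather than published Google Cloud pricing.

```python
def cost_per_user_hour(
    compute_per_hour: float,
    disk_per_hour: float,
    license_per_hour: float,
    expected_users: int,
) -> float:
    """Divide the total hourly host cost by the number of expected users."""
    if expected_users <= 0:
        raise ValueError("expected_users must be positive")
    total = compute_per_hour + disk_per_hour + license_per_hour
    return total / expected_users


# Placeholder prices (USD per hour); substitute current pricing for your region.
cost = cost_per_user_hour(
    compute_per_hour=0.50,
    disk_per_hour=0.02,
    license_per_hour=0.30,
    expected_users=20,
)
print(f"${cost:.3f} per user hour")
```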

Compute Recommendations

Returning to the user-to-vCPU ratios discussed earlier, the chart below shows the results for the top contenders from our testing:

Expected Users per vCPU

As you can see from the chart, the 4-vCPU and 8-vCPU instances meet or exceed the target of 2 users per vCPU, with the n2d-standard-8 Milan leading the pack. One key takeaway is that the 16-vCPU instances have lower users per vCPU than expected. This surprised us, since larger instances normally scale better than this. However, for this workload on these instance types, we recommend the 4-vCPU or 8-vCPU instances, which achieve 2 or more users per vCPU.

Generally speaking, the latest generation processors should provide the best scalability on a CPU-intensive workload such as the Knowledge Worker workload we used for this study. The types with the highest users per vCPU include two generations of the AMD EPYC chips (Rome and Milan) but only the latest Intel processor (Cascade Lake), because we chose to include only the top three performing types.

As it turns out, we were able to provide some comparative numbers between the processor generations when the instance types offered a choice between the two processor types. Unfortunately, this meant we were unable to compare the Cascade Lake processors running in N2 or C2 instances directly with the previous generations running in N1 instances. However, we were able to compare the Broadwell and Skylake processors from Intel as well as the Rome and Milan EPYC processors from AMD. The chart below shows that processor comparison.

Processor Comparison

As the chart shows, the general trend is for the latest generations to provide the best performance. One advantage of Google Cloud is that you can choose your processor generation for the N1 and N2D machine types and the costs remain the same regardless of the processor selection, so unless you need or want that earlier generation processor, be sure to set the vCPU processor generation to the latest.

While the Cascade Lake processors are the latest generation from Intel and are likely to perform the best since the test is CPU intensive, you may not always want to go for the latest generation. Although not shown on this chart, one odd finding is that the Broadwell processor did occasionally outperform the newer Skylake processors on the n1-standard-8. We believe this is possible for applications that have intensive graphics because the Broadwell processors support quad channel memory while the Skylake processors have only dual channel memory support.

There are non-performance-related factors to consider as well. In addition to how a processor performs on your workload, the following should also be considered when choosing machine families and configurations:

Virtual GPU: If you have a workload that requires a GPU, the only machine family that we tested that supports a GPU is the N1 family.

Availability: Not all processor generations or even machine families are available in each region. Look at the regions where you wish to deploy and determine what compute options you have before settling on a particular virtual machine configuration.

Cost: Costs vary between regions. For this paper we use the US-Central1 region for all cost information because it typically has the lowest costs and will have all the machine types available. In some cases, committed use discounts may also not be available in certain regions.

Storage Considerations

Google Cloud offers different types of block storage to use for persistent disks on compute instances, each with a set of performance criteria for application needs. During this study, we tested the three most broadly available types: Standard, Balanced, and SSD.

Disk types are listed from lower cost (Standard) to higher performance (SSD).

Disk type   Typical use cases                                                           Performance                            Price
Standard    Big compute/big data, log analysis, cold disks                              1,200 IOPS, 2,000 MiB/s throughput     $0.04 per GiB
Balanced    Most enterprise apps, simple databases, steady-state LOB apps, boot disks   80,000 IOPS, 1,200 MiB/s throughput    $0.10 per GiB
SSD         Most databases, persistent cache, scale out analytics                       100,000 IOPS, 1,200 MiB/s throughput   $0.17 per GiB

We ran a number of tests across different 8-vCPU machine families (C2, N2D, and N1) in which we varied the disk type but kept the compute power constant. Those tests allowed us to do a quick comparison between Standard, Balanced, and SSD disks. The results are shown in the chart below.

Disk Type Comparison

One of the most interesting things that stands out from this chart is that the faster the vCPU, the less likely the disk type is to have an impact on the expected number of users. For instance, both the C2 and N2D machine families saw very little impact from changing the disk type, but the N1 family saw a significant difference (almost a 33% increase) in scalability. However, keep in mind that the standard disk is under half the cost of the balanced disk and less than a quarter of the cost of an SSD disk, so if performance is the same, lower storage costs may pay off in the long run.
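The cost relationships in the paragraph above follow directly from the per-GiB prices listed in the storage table. The sketch below checks them; the dictionary and function names are hypothetical, and the prices are the ones quoted in this document, which may not reflect current pricing.

```python
# Per-GiB prices as quoted in this document (USD).
PRICE_PER_GIB = {"standard": 0.04, "balanced": 0.10, "ssd": 0.17}


def disk_cost(disk_type: str, size_gib: float) -> float:
    """Cost of a persistent disk of the given size at the quoted per-GiB price."""
    return PRICE_PER_GIB[disk_type] * size_gib


def relative_cost(disk_type: str, baseline: str) -> float:
    """Price of one disk type as a fraction of another."""
    return PRICE_PER_GIB[disk_type] / PRICE_PER_GIB[baseline]


for disk_type in PRICE_PER_GIB:
    print(f"{disk_type}: ${disk_cost(disk_type, 100):.2f} for 100 GiB")

# Standard is under half the cost of balanced and under a quarter of SSD.
print(f"standard vs balanced: {relative_cost('standard', 'balanced'):.2f}")
print(f"standard vs ssd: {relative_cost('standard', 'ssd'):.2f}")
```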

The data from our specific workload testing shows that if cost is the primary concern as you move to GCP, using the less expensive disks will not affect performance significantly on the latest processor generations. However, the results using your workload may vary, so we strongly recommend that you test your workload before committing to a particular architecture.

Using the EUX Score

Though this paper did not fully utilize the EUX scoring algorithm, we found it to have a high correlation with the performance counters that we did analyze. We ended up using all three data points (EUX score, CPU Utilization and User Input Delay) to calculate our expected users. The graph below provides a view of how the EUX scores vary between a single user on the system and the EUX score at our expected user load.

EUX Score Comparison

As you can see, with the only exception being the E2-standard-8 Broadwell which came in at 6.93, all of the EUX scores are above 7.0. We completed well over 300 test runs during this analysis and in all of the runs where the EUX score was below 7.0, we were able to immediately identify a cause for the lower EUX score. Most often, it was CPU utilization being above 90% or the Max User Input Delay exceeding 2 seconds.

We would go further to say that while a EUX score of 7.0 is the absolute lowest acceptable score, you may find users are not happy at that level either. A good rule of thumb deduced from our testing is as follows:

The EUX score obtained from a system loaded with the expected number of users should be greater than both 7.0 and the single-user EUX score minus a half point.
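The rule of thumb above can be written as a simple check. The function name and defaults are illustrative; the example scores are taken from figures quoted elsewhere in this document and are used only to exercise the check.

```python
def eux_acceptable(
    loaded_score: float,
    single_user_score: float,
    floor: float = 7.0,
    max_drop: float = 0.5,
) -> bool:
    """True when the loaded score is above the floor and within max_drop of baseline."""
    return loaded_score > floor and loaded_score > single_user_score - max_drop


# n2d-standard-8 (Milan): single-user 8.54, average under load 8.07.
print(eux_acceptable(8.07, 8.54))

# e2-standard-8: single-user 7.45, 6.93 at the expected user load.
print(eux_acceptable(6.93, 7.45))
```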

Scalability Guidance for Google Cloud VMware Engine

As of the date of publishing, Citrix Engineering had just completed the testing required for Google Cloud VMware Engine (GCVE) to be considered an officially supported platform for Citrix virtualization. Citrix had not completed any formal scalability testing for GCVE. That said, GCVE is a Google-managed service that provides VMware-powered Software Defined Data Centers (SDDCs) based upon VMware’s Cloud Foundation (VCF) architecture. Since VMware, its customers, and its partners have been deploying Citrix on VCF-based systems for many years, there is already a wealth of knowledge available that applies directly to GCVE.

For example, GCVE is currently available on a node type called ve1-standard-72. This node type has 36 physical cores (72 hyperthreaded cores) and 768 GB of RAM. When estimating a hosted shared desktop workload on Windows Server 2019 using the Server OS VDA and a 3-node SDDC configuration, you will have a total of 108 physical cores, or 216 hyperthreaded vCPUs, available. For reference, we typically see the most efficient scalability using a Citrix Server OS VDA with a 4-vCPU configuration (2 physical cores) and enough memory to support your user workload. After reserving some compute and memory resources for the VMware host, we would expect each GCVE node to comfortably host around 17 Citrix VDA instances, each with 4 vCPUs and 36 GB of RAM.

If you are familiar with Nick Rintalan’s “Rule of 5 and 10”, which provides single-server scalability guidance for on-premises data centers, you will recall his recommendation of about 10 CVAD session users per physical core. Since the GCVE servers are hyperthreaded, a 4-vCPU virtual machine uses 2 physical cores for each Citrix session host. Applying the rule of 10 for CVAD session hosts means we can expect about 20 users per VM with a light workload. For a heavy workload, that number is about half, or about 10 users per VM.
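The arithmetic in the two paragraphs above can be combined into a rough cluster estimate. This is a sketch with a hypothetical function name; it uses the figures quoted above (3 nodes, about 17 VDA instances per node, 2 physical cores per VDA) and does not reserve any capacity for node failure or maintenance, which you may want to account for separately.

```python
from typing import Tuple


def gcve_estimate(
    nodes: int,
    vdas_per_node: int,
    physical_cores_per_vda: int,
    users_per_core: int,
) -> Tuple[int, int, int]:
    """Return (users per VM, total VDA VMs, total users) for a GCVE cluster."""
    users_per_vm = physical_cores_per_vda * users_per_core
    total_vms = nodes * vdas_per_node
    return users_per_vm, total_vms, total_vms * users_per_vm


# Light workload: about 10 users per physical core.
print(gcve_estimate(nodes=3, vdas_per_node=17, physical_cores_per_vda=2, users_per_core=10))

# Heavy workload: about half that, or 5 users per physical core.
print(gcve_estimate(nodes=3, vdas_per_node=17, physical_cores_per_vda=2, users_per_core=5))
```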

These calculations are used to identify the approximate number of users to expect within your GCVE cluster and will change given a particular workload. We strongly urge you to test your own workload and make a decision on which configuration works best with your workload.

Conclusion

The findings from this study suggest the following:

As expected, the performance of a Citrix VDA tends to be limited by CPU instead of disk performance or network bandwidth.

The best end user experience (as measured, given our knowledge worker reference workload) was provided by the latest processors available on Google Cloud at the time of publication. These are available in the N2, N2D, and C2 instance families, and leverage Intel’s Cascade Lake and AMD’s Milan and Rome processors.

For the Knowledge Worker profile, the top three most cost-effective instance types are:

  • N2D-Standard-8 instance type (with the AMD Milan processor), achieving 2.875 users per vCPU, was the most cost effective, with costs as low as $0.028 per user hour using published pay-as-you-go pricing.
  • N2D-Standard-8 instance type (with the AMD Rome processor) followed with 2.25 users per vCPU, and a cost as low as $0.036 per user hour.
  • C2-Standard-8 instance type (with the Intel Cascade Lake processor) also achieved 2.25 users per vCPU, with a slightly higher cost at $0.040 per user hour.

The faster processors make the best platform for knowledge worker workloads. The results show that the sweet spot is 8-vCPU and 4-vCPU instance sizes, while moving to 16-vCPU instance sizes significantly reduces the number of users you can expect per vCPU.

With the latest processors, the storage type used for persistent disk does not have a material impact on instance scalability. While we found no correlation between disk performance type and the number of users expected on a virtual machine when the N2D and C2 instance types were used, we did see a significant performance difference for the N1 instances.

While not explicitly tested as part of this study, workloads which require a GPU must be run on N1 instance types. Where GPU acceleration is a requirement, leveraging Skylake processors and Premium SSD persistent disk storage will likely yield the best performance and highest scalability.

As always, keep in mind that this is a sample workload that is used for testing and validation and may or may not represent your workload. We recommend testing your workloads in Google Cloud before committing to a specific architecture. You can leverage Login VSI’s Login Enterprise tool to create a custom workload that will approximate your users’ daily activities and provide a EUX score that will help you gauge their experience. Record the EUX score with a single user on the system and then add users one at a time until the score drops by 0.5 to determine the recommended users on a system.
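The procedure in the last sentence above can be sketched as a scan over EUX scores recorded at each user count. The function name and sample scores are hypothetical, and returning the last user count before the drop reaches 0.5 is one interpretation of the guidance.

```python
from typing import Sequence


def recommended_users(eux_scores: Sequence[float], max_drop: float = 0.5) -> int:
    """Return the highest user count reached before the score drops by max_drop.

    eux_scores[0] is the single-user score, eux_scores[1] the score with
    two users on the system, and so on.
    """
    if not eux_scores:
        raise ValueError("at least the single-user score is required")
    baseline = eux_scores[0]
    recommended = 1
    for users, score in enumerate(eux_scores, start=1):
        if baseline - score >= max_drop:
            break  # the drop from the single-user score has reached the limit
        recommended = users
    return recommended


# Hypothetical scores for 1 through 6 users.
print(recommended_users([8.5, 8.4, 8.3, 8.1, 7.9, 7.5]))
```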

Special thanks to the Login VSI team that worked closely with us as we used their EUX scoring algorithms pre-release to evaluate Google Cloud.
