Shedding light on XenApp on XenServer performance tuning

(via Citrix) / Aug 2008

You won’t be surprised to hear that we spend a lot of time improving XenApp performance when running on XenServer. Although there are some good benchmark comparisons available (such as the Tolly Group report), I still get a lot of customers asking what the “secret sauce” is. I sat down with George Dunlap, the lead XenServer performance engineer, to chat about the very first optimisation we did back in XenServer 4.0 last year.

Before we dive in, we first need to explain how a normal operating system handles memory. George explains:

Modern desktop and server processors don’t access memory directly using its physical address. They use ‘virtual memory’ to separate the addresses that processes use to read and write memory from the actual memory itself. This allows operating systems to hide from processes all the dirty details of how much memory there is, where in physical memory the process needs to write to, and so on.

However, the processor still needs to translate each virtual address to a physical memory address before it can read or write any memory. This translation is done with something called page tables.

Page tables are used to implement virtual memory by mapping virtual addresses to physical addresses. The operating system constructs page tables using physical memory addresses, and then puts the physical address of the “top-level” page table into a hardware register called the ‘base pointer’. Then the processor will read these page tables to translate virtual addresses to physical addresses as needed, before reading and writing to physical memory.
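To make the lookup concrete, here is a minimal sketch of a multi-level page table walk. It is a toy model, not how any real processor or operating system is implemented: the `ToyMachine` class, its method names, and the dictionary-based "memory" are all assumptions made for illustration. The only details borrowed from x86-64 are the 4 KiB page size and the four levels of 512-entry tables.

```python
PAGE_SIZE = 4096          # bytes per page
ENTRIES_PER_TABLE = 512   # 9 index bits per level, as on x86-64
LEVELS = 4                # number of page table levels


def split_virtual_address(vaddr):
    """Return ([top-level index, ..., leaf index], offset within the page)."""
    offset = vaddr & (PAGE_SIZE - 1)
    vpn = vaddr >> 12
    indices = []
    for _ in range(LEVELS):
        indices.append(vpn & (ENTRIES_PER_TABLE - 1))
        vpn >>= 9
    return list(reversed(indices)), offset


class ToyMachine:
    """Hypothetical model: each page table is a dict stored at a frame number."""

    def __init__(self):
        self.tables = {}                        # frame number -> {index: frame number}
        self.next_free = 1
        self.base_pointer = self.alloc_table()  # frame of the top-level table

    def alloc_table(self):
        frame = self.next_free
        self.next_free += 1
        self.tables[frame] = {}
        return frame

    def map_page(self, vaddr, data_frame):
        """What the OS does: build the tables so vaddr points at data_frame."""
        indices, _ = split_virtual_address(vaddr)
        table = self.base_pointer
        for idx in indices[:-1]:
            entries = self.tables[table]
            if idx not in entries:
                entries[idx] = self.alloc_table()
            table = entries[idx]
        self.tables[table][indices[-1]] = data_frame

    def translate(self, vaddr):
        """What the processor does: walk the tables from the base pointer."""
        indices, offset = split_virtual_address(vaddr)
        frame = self.base_pointer
        for idx in indices:
            frame = self.tables[frame][idx]     # a KeyError here models a page fault
        return frame * PAGE_SIZE + offset


machine = ToyMachine()
machine.map_page(0x7F0000001000, data_frame=0x1234)
print(hex(machine.translate(0x7F0000001ABC)))
```

The walk starts at the base pointer and follows one table per level, so every translation of an uncached address costs several memory reads. That is why the contents of these tables matter so much for performance.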

Most modern processor types have some sort of paging mechanism, although XenServer is specifically tuned for x86-64 CPUs. An excellent book on the general topic is Modern Operating Systems by Andrew Tanenbaum. When XenServer creates Windows VMs, it takes advantage of the virtualization extensions in modern CPUs, which requires special memory handling in Xen. George explains this further:

When we create a virtual machine, we virtualize the memory as well; that means that the guest operating system’s idea of physical memory does not match up to real physical memory on the host. Traditionally, what the guest thinks of as physical memory is called “physical memory”, and what the hypervisor thinks of as physical memory is called “machine memory”. Since this terminology is a bit confusing, Xen tends to refer to the guest’s view of physical memory as “guest physical” memory, just to make things clearer.

This means that any fully-virtualized operating system, like Windows, will create page tables using guest physical memory, and will point the base pointer at the guest physical address of the top-level page table. Unfortunately, the hardware still needs to map from virtual addresses to machine addresses, not guest physical addresses.

In order to allow this to happen, the hypervisor sets up shadow page tables. These page tables, generated by the hypervisor, are copies of the guest page tables, but with the guest physical addresses converted into machine addresses. The guest cannot access them directly, and they don’t reside in the guest’s physical memory; they’re generated out of a pool of memory that the hypervisor allocates when a VM is created, called shadow page table memory.

What this means is that whenever the guest operating system wants to map some new memory, after it writes the data into the page table but before it can actually use it, the hypervisor needs to translate the change to the guest page table into changes to the shadow page table. So any workload that changes its memory mappings frequently will necessarily involve the hypervisor frequently, which causes overhead.
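The relationship between guest page tables and shadow page tables can be sketched in a few lines. This is a deliberately flattened, single-level model and not Xen's actual implementation: the `ShadowedGuest` class, the `p2m` dictionary (guest physical frame to machine frame), and the trap counter are all hypothetical names chosen for illustration.

```python
class ShadowedGuest:
    """Toy model: every guest page table update is mirrored into a shadow table."""

    def __init__(self, p2m):
        self.p2m = dict(p2m)        # guest physical frame -> machine frame
        self.guest_pt = {}          # virtual page -> guest physical frame (guest-visible)
        self.shadow_pt = {}         # virtual page -> machine frame (what the hardware walks)
        self.hypervisor_traps = 0   # how often the hypervisor had to step in

    def guest_map(self, vpage, gfn):
        """The guest writes a mapping into its own page table."""
        self.guest_pt[vpage] = gfn
        self._sync_shadow(vpage)

    def guest_unmap(self, vpage):
        """The guest removes a mapping from its own page table."""
        self.guest_pt.pop(vpage, None)
        self._sync_shadow(vpage)

    def _sync_shadow(self, vpage):
        """The hypervisor propagates the guest's change into the shadow table."""
        self.hypervisor_traps += 1
        gfn = self.guest_pt.get(vpage)
        if gfn is None:
            self.shadow_pt.pop(vpage, None)
        else:
            self.shadow_pt[vpage] = self.p2m[gfn]


guest = ShadowedGuest(p2m={2: 0x8002, 3: 0x9A10})
guest.guest_map(0x10, 2)
guest.guest_map(0x11, 3)
print(guest.shadow_pt, guest.hypervisor_traps)
```

The guest only ever sees `guest_pt`, while the hardware only ever walks `shadow_pt`. Every mapping change costs one trip through the hypervisor, which is the overhead described above.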

So shadow page tables are our mechanism for giving a guest an interface which is identical to real hardware (so it doesn’t need to be modified), while still intercepting changes before they reach the real hardware. You can find more details from the XenSummit 2006 talk or from the 2005 NSDI paper. So how is this all relevant to XenApp performance? Back to George…

The hypervisor allocates a certain amount of memory for each VM to use for shadow page tables; this is called shadow page table memory. As new page tables are created and old ones aren’t used anymore, the hypervisor cycles through this shadow page table memory. When it needs a new page and none is free, it will ‘unshadow’ the guest page tables that haven’t been used for the longest time, reclaiming their shadow memory for the new shadows.
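The recycling policy described here is essentially least-recently-used eviction. A minimal sketch, assuming one pool slot per shadowed guest page table (the `ShadowPool` class and its method names are hypothetical, and the real hypervisor's bookkeeping is considerably more involved):

```python
from collections import OrderedDict


class ShadowPool:
    """Toy fixed-size pool that unshadows the least recently used page table."""

    def __init__(self, capacity_pages):
        self.capacity = capacity_pages
        self.shadows = OrderedDict()   # guest page table id -> placeholder shadow
        self.unshadow_count = 0

    def use(self, guest_table_id):
        """Touch a guest page table; return True if its shadow already existed."""
        if guest_table_id in self.shadows:
            self.shadows.move_to_end(guest_table_id)   # mark as most recently used
            return True
        if len(self.shadows) >= self.capacity:
            self.shadows.popitem(last=False)           # unshadow the oldest entry
            self.unshadow_count += 1
        self.shadows[guest_table_id] = object()        # build a fresh shadow
        return False


pool = ShadowPool(capacity_pages=2)
for table in ["A", "B", "A", "C"]:
    pool.use(table)
print(list(pool.shadows), pool.unshadow_count)
```

In this run, "B" is the entry that gets unshadowed when "C" arrives, because "A" was touched more recently.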

We don’t know ahead of time how much shadow memory a given workload will use, but we can estimate based on the amount of memory that the VM has. We allocate enough shadow memory for each page to be mapped once, more or less, then add an extra 50% to have some slack. For all the workloads we’ve tested, that’s been enough – except XenApp.
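As a back-of-the-envelope illustration of that sizing rule, here is a sketch of the arithmetic. It is not the formula XenServer actually uses: it assumes 4 KiB pages and 8-byte page table entries (as on x86-64), counts only the lowest level of page tables, and the function name and `slack` parameter are made up for this example.

```python
PAGE_SIZE = 4096   # bytes per page
PTE_SIZE = 8       # bytes per page table entry on x86-64


def estimate_shadow_bytes(vm_memory_bytes, slack=0.5):
    """Shadow memory needed to map every guest page once, plus some slack."""
    guest_pages = vm_memory_bytes // PAGE_SIZE
    entries_per_table = PAGE_SIZE // PTE_SIZE              # 512 entries per table page
    leaf_tables = -(-guest_pages // entries_per_table)     # ceiling division
    base_bytes = leaf_tables * PAGE_SIZE
    return int(base_bytes * (1 + slack))


four_gib = 4 * 2**30
print(estimate_shadow_bytes(four_gib) // 2**20, "MiB")
```

Under these assumptions a 4 GiB guest needs about 8 MiB of leaf page tables to map each page once, or 12 MiB with the 50% slack. The estimate breaks down as soon as the same pages are mapped many times over.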

XenApp is the one workload we’ve found that requires more shadow page table memory than our standard default. Because XenApp generally starts hundreds of copies of the same process, the same memory ends up mapped in hundreds of different processes. What happens when all of those processes are active is that XenServer is continually unshadowing one process’ page tables in order to shadow another process’ page tables, only to have to re-shadow the original ones a second or two later when that process runs again! This is called thrashing: there isn’t enough of a limited resource to hold everything in active use, so most of the time goes into recycling it.
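A small simulation shows how abrupt this effect is. It is only a sketch under simplifying assumptions: each process needs exactly one pool slot, processes run in strict round-robin order, and the pool evicts the least recently used entry. The function name and the numbers are invented for illustration and are not measurements.

```python
from collections import OrderedDict


def count_shadow_builds(num_processes, pool_capacity, rounds):
    """Count how often a process's page tables must be (re)shadowed."""
    pool = OrderedDict()   # process id -> True, ordered from oldest to newest use
    builds = 0
    for _ in range(rounds):
        for pid in range(num_processes):       # round-robin scheduling
            if pid in pool:
                pool.move_to_end(pid)          # shadow still present: no work
                continue
            if len(pool) >= pool_capacity:
                pool.popitem(last=False)       # unshadow the least recently used
            pool[pid] = True                   # build the shadow again
            builds += 1
    return builds


print(count_shadow_builds(num_processes=300, pool_capacity=200, rounds=10))
print(count_shadow_builds(num_processes=300, pool_capacity=300, rounds=10))
```

With a pool that is too small, every single scheduling step rebuilds a shadow, because the entry needed next is always the one that was just evicted. Once the pool is large enough to hold all the processes, the shadows are built once during the first round and never again.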

Once the bottleneck was discovered, the solution was simple. In XenServer 4.1, we created a special XenServer application template called “Citrix XenApp”, which has an increased shadow multiplier that reserves more shadow memory for the guest when it starts. This is also a good example of how templates hide the complexities of performance tuning from the user while still permitting custom modifications if they are required. For example, on your XenServer host with a VM called “XenApp”, you could view the shadow multiplier by using the CLI:

# xe vm-list name-label=XenApp params=HVM-shadow-multiplier
  HVM-shadow-multiplier ( RW)    : 4.000

The same value is also available from XenCenter in the Optimization pane, but do remember that the default value was chosen through extensive testing and doesn’t need to be changed. Most of the other templates in XenServer also have carefully tuned settings (e.g. the hardware platform flags) to ensure smooth running, or in the case of Linux templates, to support para-virtual installation. This is why it’s so important that you not use the “Other Install Media” template in preference to a more specialised one!

I mentioned at the beginning of this post that this was the first of many XenApp optimisations. We’ve just released the public beta of the latest XenServer (“Orlando”), which is even faster. The story of what those improvements are, and the tools which George and his team use to analyze the inner workings of Xen, are a topic for a future post. For now, get downloading XenServer and start virtualizing your XenApp installations! Or if you’re feeling inspired, go over to xen.org, check out the source, and get coding…

# 4th Aug 2008
