What items can you check to troubleshoot a performance issue for a Lotus Domino server on VMware ESX?
The list below provides information about what items to check first; this list is not meant to be exhaustive.
- VMware File System (VMFS) - the native file system used by VMware ESX servers. VMs are stored on VMFS formatted disks
- Storage Area Network (SAN) - typically used with Fibre Channel to provide fast, large-capacity storage to servers
- Network Attached Storage (NAS) - typically accessed over NFS, CIFS, or iSCSI (and sometimes Fibre Channel) to provide less expensive storage solutions
- Network File System (NFS) - a network file-sharing protocol
- SW iSCSI - iSCSI using the VMware software initiator
- HW iSCSI - a network card with TCP Offload Engine (TOE) capability and hardware support for iSCSI
|Items to check|
1. VMware ESX Server version
What version of ESX is the VM running on?
ESX 3.x provides better disk and network I/O performance than ESX 2.5.x. While the upgrade is not required, it is something you can try. Some customers report performance improvements after upgrading from ESX 2.5.x to 3.0 (and more improvement if upgrading from 2.0.x to 3.x).
2. RAM and CPUs
a. How much RAM does ESX have? And how many cores (logical CPUs)?
(For example, a two socket DualCore system has four cores; a two socket QuadCore has eight cores.)
The more cores the server has, the better. Four cores is the minimum for running a virtualized environment; eight or more cores gives good performance. Avoid overcommitting resources (that is, assigning more vCPUs and RAM to VMs than are physically available), because overcommitment hurts performance.
b. How many CPUs and how much RAM does the Domino VM have?
More RAM allows Domino to cache more data, and the operating system's own I/O caching improves overall performance further. Make sure the VM has at least the number of CPUs recommended for an equivalent physical environment. For I/O-intensive servers, assign at least two vCPUs to the VM.
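The overcommitment check from item 2a can be sketched as a simple ratio of allocated to physical resources. This is a minimal illustration with made-up host and VM figures, not data from any specific environment:

```python
# Minimal sketch of an overcommitment check. All host and VM figures
# below are illustrative assumptions, not from a real environment.

def overcommit_ratios(host_cores, host_ram_gb, vms):
    """vms is a list of (vcpus, ram_gb) pairs, one per VM."""
    total_vcpus = sum(v for v, _ in vms)
    total_ram_gb = sum(r for _, r in vms)
    return total_vcpus / host_cores, total_ram_gb / host_ram_gb

# Example: a two-socket QuadCore host (8 cores) with 32 GB of RAM
# running four VMs.
cpu_ratio, ram_ratio = overcommit_ratios(
    8, 32, [(2, 8), (2, 8), (4, 16), (2, 8)])
print(f"vCPU ratio {cpu_ratio:.2f}, RAM ratio {ram_ratio:.2f}")
# Any ratio above 1.0 means that resource is overcommitted.
```

Here 10 vCPUs on 8 cores and 40 GB of VM memory on 32 GB of physical RAM both give a ratio of 1.25, so this host is overcommitted on both counts.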
3. File system
How are the disks presented to the VM? Are they on a VMFS file system, or do they use RDM (Raw Device Mapping)?
In testing, Domino performed better when using a virtual disk on a dedicated VMFS volume than when using an RDM. Also, VMFS 3 (used by ESX 3) has been shown to outperform VMFS 2 when multiple virtual disks share the same VMFS volume.
a. Where are the virtual disks stored? On local or SAN storage?
SAN storage is recommended for I/O-intensive Domino VMs, using a fast connection such as 2 or 4 Gb Fibre Channel.
b. (For SAN or NAS storage) What is the connection to the storage?
Avoid NFS as the repository for virtual disks. iSCSI to NAS connects over Ethernet and should use a 1 Gb link; even then, the latency and throughput may not be sufficient for Domino. If VMware uses the software initiator instead of a TCP/IP Offload Engine (TOE) adapter, which moves the connection overhead to the NIC, the available bandwidth is limited even further and latency suffers. Fibre Channel is the better choice: at least 2 Gb HBAs, preferably 4 Gb.
c. What is the size and speed of the disks used to build the logical unit number (LUN)?
15K RPM disks provide up to 33% more IOPS than 10K RPM disks, so avoid slower disks. The smaller the disks, the better the performance, because building a LUN of a given size from smaller disks uses more spindles.
More information about connection type and disk size can be found in an additional information section below.
4. Platform statistics or perfmon data
What is the queue length reported by Domino's platform statistics for the VM's disks?
If the queue length rises while the performance issue occurs, the storage is not fast enough to handle the load. The value should stay as close as possible to 2.
On some SAN configurations, however, IBM Support has seen acceptable performance with queue lengths up to 10 (this varies by customer), but performance degrades quickly beyond 12. If the peak exceeds 12, check the storage, the HBA, and the SAN's cache settings. You can use esxtop to verify the latency of the SAN (press d for disks, then f for fields and select A, H, I, and J; maximize the window to see all values). A latency of 5 ms is ideal; anything beyond 10 ms indicates an issue to investigate.
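The thresholds above can be written down as a small triage sketch. The cutoffs (queue length 2 / 10 / 12, latency 5 ms / 10 ms) come directly from this section; treat them as starting points and adjust for your own SAN:

```python
# Sketch applying the queue-length and latency thresholds from the
# text to Domino platform statistics / esxtop readings. The cutoff
# values come from the section above and may vary by environment.

def assess_queue_length(q):
    if q <= 2:
        return "healthy"
    if q <= 10:
        return "acceptable on some SAN configurations"
    if q <= 12:
        return "borderline"
    return "check storage, HBA, and SAN cache settings"

def assess_latency_ms(ms):
    if ms <= 5:
        return "ideal"
    if ms <= 10:
        return "acceptable"
    return "investigate"

print(assess_queue_length(14))  # peak above 12: storage can't keep up
print(assess_latency_ms(12))    # beyond 10 ms: needs investigation
```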
5. DRS and VMotion
Does the problem happen without VMotion?
DRS is the VMware term for clustering servers for load balancing. It uses VMotion to dynamically move a running VM from one physical host to another. In theory, the move is completely transparent to the application, but it sometimes causes issues. VMotion can easily be disabled at the server level to verify whether this helps.
- To verify if VMotion is enabled, in the VI Client, select the host server. On the Summary tab in the General section, you can see if VMotion is enabled.
Screen capture of Summary tab:
- To disable VMotion, select the Configuration tab. Locate the vSwitch that has the VMkernel port, and click Properties. Select the VMkernel port, and click Edit. Clear (uncheck) the VMotion check box.
Screen capture of VMkernel Properties:
6. VMware Tools
Are the VMware tools up to date and installed inside the VM?
These tools are required to provide the best performance of a VM. If they are not installed or are outdated, performance can suffer greatly.
- To verify if the VMware tools are installed and up to date, in the VI client, select the VM. Click the Summary tab, and look at the General section.
Screen capture of Summary tab:
7. Windows time sync / NTP sync
Is the operating system inside the VM synchronizing time with the Windows time service or an NTP daemon? Is time synchronization enabled in the VMware Tools?
Only the VMware Tools time synchronization should be enabled; disable any other time service. The host server should be the only machine synchronizing with an NTP server.
8. Isolating the VM from the others
Can you reproduce the issue when only this VM is running on the server?
Isolating the VM is generally a simple test and lets you verify whether the server is too busy to handle the entire load.
Is the same NIC used by other VMs or dedicated to this VM? Is it 100 Mb or 1 Gb?
I/O-intensive Domino servers should have a dedicated NIC for best performance. In VMware, it is easy to temporarily dedicate a NIC to a VM to verify whether this helps.
9. Resource reservation
Did you set any limits on memory or resources available to this VM?
VMware allows you to limit the CPU (in shares or MHz) that a VM can use, giving it a lower priority than other VMs or artificially slowing it down. It also allows you to limit the memory actually available even when the VM sees more, by reclaiming memory from the guest operating system through the balloon driver. These limits can hurt Domino performance because Domino assumes all visible memory is available and sizes itself accordingly.
When VMware ESX memory is overcommitted, VMs still run and operate as if all their memory were available, but performance can suffer. In ESX 3.x, administrators tend to think memory is plentiful based on what the VI Client reports; however, the VMs underperform because memory is constrained and ballooning is in use.
In production environments, ballooning should always be 0 for optimal performance.
|Additional information about Storage performance|
Assuming the back-end storage can sustain a heavy load, the following chart shows the relative expected performance of different technologies accessing the same external storage. For iSCSI, the assumptions below are based on 1 Gb Ethernet.
Higher throughput also means lower latency (the time between requesting data from storage and receiving it). For Domino, both fast throughput and low latency are necessary to obtain the best performance on a mail server. 2 and 4 Gb Fibre Channel can provide such performance when used with a good SAN solution.
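As a rough guide, the relative throughput of these links can be estimated from their raw line rates. The sketch below simply divides the line rate by 8; real throughput is lower once protocol overhead (TCP/IP framing for NFS/iSCSI, FCP framing for Fibre Channel) is subtracted:

```python
# Back-of-the-envelope throughput ceilings for common storage links,
# computed as raw line rate / 8. Real-world throughput is lower once
# protocol overhead is accounted for.

LINK_GBPS = {
    "1 Gb Ethernet (NFS / SW iSCSI)": 1,
    "2 Gb Fibre Channel": 2,
    "4 Gb Fibre Channel": 4,
}

def ceiling_mb_per_s(gbps):
    return gbps * 1000 / 8

for link, gbps in LINK_GBPS.items():
    print(f"{link}: ~{ceiling_mb_per_s(gbps):.0f} MB/s ceiling")
```

This makes the gap concrete: a 4 Gb Fibre Channel link has roughly four times the ceiling of 1 Gb Ethernet (~500 MB/s versus ~125 MB/s) before any protocol overhead is considered.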
The following chart shows the relative theoretical performance in I/O operations per second (IOPS) when building a 1 TB logical unit number (LUN) from disks of different sizes, at both 10K and 15K RPM. The Domino mail server is sensitive not only to how many MB/s the storage can provide, but also to how many IOPS it can sustain. Smaller disks can provide faster throughput (measured in MB/s) and higher IOPS: each disk, or spindle, sustains a limited number of IOPS, so using more spindles to build a LUN of the same size yields higher aggregate IOPS. Another way to increase IOPS is to use faster disks (higher RPM).
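The spindle-count arithmetic can be sketched as follows. The per-spindle IOPS figures here are common rules of thumb and are assumptions, not values from this document (they are roughly consistent with the "up to 33% more IOPS" figure for 15K versus 10K disks cited above):

```python
import math

# Rule-of-thumb per-spindle IOPS (assumed figures, not from the text):
# ~130 IOPS for a 10K RPM disk, ~170 IOPS for a 15K RPM disk.
SPINDLE_IOPS = {"10K": 130, "15K": 170}

def lun_iops(lun_gb, disk_gb, rpm):
    """Spindle count and aggregate IOPS for a LUN striped
    across identical disks (RAID overhead ignored)."""
    spindles = math.ceil(lun_gb / disk_gb)
    return spindles, spindles * SPINDLE_IOPS[rpm]

# 1 TB LUN: many small fast disks vs. a few large slow ones
print(lun_iops(1024, 36, "15K"))   # (29, 4930)
print(lun_iops(1024, 300, "10K"))  # (4, 520)
```

Even with these rough figures, the same 1 TB LUN built from 36 GB 15K RPM disks sustains roughly an order of magnitude more IOPS than one built from 300 GB 10K RPM disks, simply because it spreads the load across many more spindles.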
In production environments with several VMs running Domino on the same SAN storage, use 2 to 4 Gb FC connections to a SAN built from smaller 15K RPM disks. In a benchmark that IBM Support ran internally, we used LUNs created with 36 GB 15K RPM disks and were able to get this result. Customers who used NAS storage with 300 GB 10K RPM disks presented to ESX over iSCSI with the ESX 3.x software initiator reported performance issues because the back-end storage could not sustain the required IOPS, even though throughput (MB/s) was still below the theoretical limit of their storage solution.