ESXTOP is a fantastic tool for the VMware administrator when troubleshooting performance issues in a vSphere environment. ESXTOP has a somewhat steep learning curve, but it is well worth the effort. In this post I want to help you get a head start with ESXTOP. If you want a really good read I recommend Duncan’s very comprehensive post on the same subject here.
ESXTOP is available in two ways. Either through the ESXi Shell or through the vSphere Management Assistant with the command RESXTOP. In this article I will focus on ESXTOP from the ESXi shell. It is very simple to get access to ESXTOP.
Step 1: Get access to the ESXi Shell. In the vSphere Client, select the host, go to Configuration, then Security Profile, and start the ESXi Shell service on that specific ESXi host. To connect remotely over SSH, the SSH service on the host must be running as well.
Step 2: Download PuTTY (or another SSH client) and open an SSH connection on port 22 to your ESXi host. Log in as root with your password.
Step 3: Type the command esxtop and hit return.
Step 4: You are now looking at ESXTOP. It should look similar to this:
What you are looking at is the CPU screen in ESXTOP, which shows the CPU-specific counters. You can switch between the different screens with single keys, and the keys are case-sensitive. If you type m you will see memory metrics, n shows network metrics, and so on. If you type h you will see all available commands. By default ESXTOP shows a lot of “worlds”. A world is similar to a process in Windows Task Manager. To filter out the “vmkernel worlds”, type an upper case V. By doing this you only see the virtual machines running on this specific ESXi host.
Now that you are inside ESXTOP, let's focus on some good counters to use for performance troubleshooting.
CPU
When troubleshooting CPU performance for your virtual machines the following counters are the most important.
%USED, %RDY, %CSTP
%USED tells you how much time the virtual machine spent executing CPU cycles on the physical CPU.
%RDY is a Key Performance Indicator! Always start with this one. It shows how much time your virtual machine wanted to execute CPU cycles but could not get access to the physical CPU. In other words, it tells you how much time the virtual machine spent waiting in a “queue”. I normally expect this value to stay below 5% (this equals 1000 ms in the vCenter real-time performance graphs; read about it here).
%CSTP tells you how much time a vCPU of a multi-vCPU virtual machine spends stopped while waiting for the other vCPUs of the same virtual machine to catch up. If this number is higher than 3% you should consider lowering the number of vCPUs in your virtual machine.
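The relationship between the 5% guideline and the 1000 ms figure can be sketched in a few lines of Python. This is only an illustration: it assumes the vCenter real-time charts use a 20-second sample interval, and the function names and the way the thresholds are applied are hypothetical helpers built from the rules of thumb in this post, not anything ESXTOP or vCenter provides.

```python
from typing import List

# Assumption: vCenter real-time charts report CPU ready as a summation
# in milliseconds over a 20-second sample interval.
REALTIME_INTERVAL_S = 20


def ready_ms_to_percent(ready_ms: float, interval_s: float = REALTIME_INTERVAL_S) -> float:
    """Convert a CPU ready summation (ms) into a percentage of the interval."""
    return ready_ms * 100.0 / (interval_s * 1000.0)


def cpu_warnings(rdy_pct: float, cstp_pct: float) -> List[str]:
    """Apply the rules of thumb from this post to %RDY and %CSTP (hypothetical helper)."""
    warnings = []
    if rdy_pct >= 5.0:
        warnings.append("%RDY is 5% or higher: the VM is queuing for physical CPU")
    if cstp_pct > 3.0:
        warnings.append("%CSTP is above 3%: consider fewer vCPUs")
    return warnings


if __name__ == "__main__":
    print(ready_ms_to_percent(1000))   # 1000 ms out of 20 s
    print(cpu_warnings(rdy_pct=7.5, cstp_pct=4.0))
```

With a different chart interval, pass the interval length in seconds as the second argument.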
Memory
When troubleshooting memory performance these are the counters you want to focus on from a virtual machine perspective.
MCTL?, MCTLSZ, SWCUR, SWR/s, SWW/s
MCTL? This column is either YES or NO. If it says YES, the balloon driver is installed. The balloon driver is automatically installed with VMware Tools and should be in every virtual machine. If it says NO in this column, figure out why.
MCTLSZ This column shows you how inflated the balloon is in the virtual machine. If it says 500 MB, the balloon driver inside the guest operating system has “stolen” 500 MB from Windows, Linux, etc. You would expect to see a value of 0 (zero) in this column.
SWCUR tells you how much memory the virtual machine has in the .vswp file. If you see 500 MB here, it means that 500 MB of the virtual machine's memory currently lives in the swap file. This does not necessarily mean bad performance. To figure out if your virtual machine is suffering from hypervisor swapping you need to look at the next two counters. In a healthy environment you would want this value to be 0 (zero).
SWR/s This value tells you the read activity against your swap file. If you see a number above 0 here, your virtual machine is suffering from hypervisor swapping.
SWW/s This value tells you the write activity against your swap file. You want to see the number 0 (zero) here. Every number above 0 is BAD.
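To make the memory rules above concrete, here is a minimal Python sketch that applies them to one set of counter values. The `VmMemorySample` class and `memory_findings` function are hypothetical names made up for this illustration, and the values are assumed to have been read off the ESXTOP memory screen by hand. This is not an ESXTOP API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VmMemorySample:
    """One VM's memory counters as read from ESXTOP (hypothetical container)."""
    name: str
    mctl_installed: bool   # MCTL? column: True for Y, False for N
    mctlsz_mb: float       # MCTLSZ: current balloon size in MB
    swcur_mb: float        # SWCUR: memory currently in the .vswp file, in MB
    swr_per_s: float       # SWR/s: swap read activity
    sww_per_s: float       # SWW/s: swap write activity


def memory_findings(sample: VmMemorySample) -> List[str]:
    """Return a list of findings based on the rules of thumb in this post."""
    findings = []
    if not sample.mctl_installed:
        findings.append("balloon driver missing: check VMware Tools")
    if sample.mctlsz_mb > 0:
        findings.append("balloon inflated by %.0f MB" % sample.mctlsz_mb)
    if sample.swcur_mb > 0:
        # Not necessarily a performance problem on its own.
        findings.append("%.0f MB in the swap file" % sample.swcur_mb)
    if sample.swr_per_s > 0 or sample.sww_per_s > 0:
        findings.append("active hypervisor swapping")
    return findings


if __name__ == "__main__":
    vm = VmMemorySample("web01", True, 500, 120, 0.0, 0.35)
    for finding in memory_findings(vm):
        print(vm.name + ": " + finding)
```

A healthy virtual machine, with the balloon driver installed and every other counter at zero, produces an empty list.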
If you have made it this far I suggest you look at the following document, which details ALL of the counters in ESXTOP. I call it the ESXTOP Bible 🙂
Awesome explanation.
Hi vfrank
What is not really clear to me is this:
all the posts and documentation I found talk about CPU ready at the VM level, but what about CPU ready at the ESX host level? I mean: in the performance charts of the vSphere Client I can see both the VM CPU ready and, when I click on a physical host, the host's CPU ready. Should I pay attention to the host CPU ready value?
Thanks
Marco
I second Marco’s question. I have been trying to get a thorough answer to this question for the last 24 hours. There are a few articles that mention HOST %Ready briefly, but apparently no authoritative word from VMware on whether this is an aggregate of all guest VMs %Ready during that sample or whether it is an average. And, even once we know that, do the same rules apply, i.e. does a host with >5% Ready warrant concern? 10%?
Hi,
At the host level it is an aggregate of the ready time of every single vCPU on the host. To use this number for anything meaningful you need to figure out how many vCPUs are deployed on the ESXi host and then divide by that number. Because of this the number is hard to use for anything valuable: you need to know how many vCPUs you have on the host, and why would you know that in an environment with VMware DRS?
Best
Frank
Hi frankbrix,
Thanks for your kind explanation. Now it is clear to me that I should check CPU ready per VM only and not at the host level. It was just a curiosity 🙂
Hi Franck,
Thanks for this article! I want to improve my monitoring, and I want to monitor the SWW/s and SWR/s values.
In your opinion, for how long do these values need to be different from 0 before we notice slowness in the VM?
For example, if SWW/s is different from 0 for 1 second, I guess there is no impact on the VM, but if it lasts for 1 minute I guess we will feel some slowness in the VM, right?
Regards