So today I got to sink my teeth into a good problem. Performance issues in a virtual environment.
I have to say, this is probably the first time in my career where I walked in and I didn’t have to prove it was the network. The customer was prepared. He had his NMS tools in place (Cacti) and had been trending various points in the network over a period of time.
Of course we started at the 101 stage and looked at counters, and when I said “Hey, you have some issues on your ASA” he pulled up the Cacti graph and said “Yeah, that’s an offsite backup that runs at midnight, we know about it and it’s fine with us.”
Can I say it out loud?
A lot of the customers I see are SMB/SME customers (I am in Canada, remember?) and although it’s uncommon to find a network with NMS tools in place, it’s even rarer to find one where they are actually using them!
I got called onsite to help out with some performance issues. The nice thing is that it was not the network, at least not yet. (Until we’re 100% sure, I’m not going to discount anything, right?) But we DO need to figure out where to start targeting our efforts.
This is one of the problems I’m starting to see more and more of. Hard to troubleshoot anything when it’s in the cloud.
No idea where the apps live in that picture, right? This gets even more interesting when you have VDI accessing virtual applications and start having performance issues on the client side.
I know I’m going to get some snickers from this one, but my suggestion to deal with this is to create application flow maps to document how a complete transaction is made in a multi-tiered application.
I know… “We can’t get them to create Visio diagrams for the networks they already have, and you’re suggesting we ask them to create more?”
Yeah… I know. But I can dream, right?
So let’s look at the following VDI multi-tiered application. This is pretty simple, right?
1) A client workstation connects to a Citrix Server over ICA or RDP.
2) The Citrix server browses to a web app on a web host.
3) The web host queries a remote MS SQL database, which returns the results to the web host.
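If it helps to make the idea concrete, here’s a rough sketch (in Python, purely illustrative — the tier names and protocols below are just my stand-ins for the example above) of how you might capture a flow map as structured data instead of only a drawing, so it can be kept in version control and queried during a troubleshooting call:

```python
# A minimal, hypothetical representation of an application flow map.
# Each hop records the client tier, the server tier, and the protocol used.
FLOW_MAP = [
    {"src": "workstation",   "dst": "citrix-server", "protocol": "ICA/RDP"},
    {"src": "citrix-server", "dst": "web-host",      "protocol": "HTTP"},
    {"src": "web-host",      "dst": "sql-server",    "protocol": "MS-SQL"},
]

def tiers_involved(flow_map):
    """Return every tier that participates in a complete transaction,
    in the order it first appears in the flow."""
    tiers = []
    for hop in flow_map:
        for tier in (hop["src"], hop["dst"]):
            if tier not in tiers:
                tiers.append(tier)
    return tiers

print(tiers_involved(FLOW_MAP))
# → ['workstation', 'citrix-server', 'web-host', 'sql-server']
```

Nothing fancy, but even a list like this answers the first question on a performance call: which tiers does a complete transaction actually touch?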
Can’t get much easier than this, right? The great thing is that it becomes fairly easy to overlay this onto the virtual environment, which starts to give you a better idea of how the application is currently instantiated in the physical/virtual environment.
Let’s look at the above example installed in a blade server environment where the three parts of this particular app flow live on three different blades in three different chassis.
As you can see, from a performance troubleshooting standpoint we just went from three points to check (let’s throw out the client, as that’s just screen caps) to twenty-one points, not counting the network devices which provide connectivity between the blade chassis.
Although you can create affinity rules between VMs to ensure they are located on the same physical hypervisor host to avoid performance issues, we all know that people make mistakes. By creating the application flow map and applying it to the physical environment, you can narrow your focus to only the specific devices which are actually involved in your particular performance issue.
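Affinity rules only help if they still match reality, so a quick sanity check is worth automating. Here’s a toy Python sketch (the VM and blade names are invented; in practice you’d pull placement data from your hypervisor’s API) that flags when the VMs in one flow have drifted across physical hosts:

```python
# Hypothetical placement data: which physical blade each VM currently runs on.
# In a real environment this would come from the hypervisor's management API.
PLACEMENT = {
    "citrix-server": "chassis1-blade3",
    "web-host":      "chassis2-blade1",
    "sql-server":    "chassis3-blade4",
}

def hosts_spanned(vms, placement):
    """Return the set of physical hosts a group of VMs is spread across.

    More than one host means the affinity intent has drifted and the
    application flow now crosses blade/chassis boundaries."""
    return {placement[vm] for vm in vms}

flow_vms = ["citrix-server", "web-host", "sql-server"]
hosts = hosts_spanned(flow_vms, PLACEMENT)
if len(hosts) > 1:
    print(f"Flow spans {len(hosts)} physical hosts: {sorted(hosts)}")
```

Run that against the blade example above and it immediately tells you the three tiers are on three different chassis — which is exactly why the troubleshooting surface ballooned to twenty-one points.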
Last, but not least, I would also suggest you have on hand the storage flow maps for both the specific application as well as the relationship between the physical hypervisor hosts and their storage arrays.
I’m not a storage expert, but I’ve heard my storage buddies tell stories of database and VDI LUNs thrashing on the same physical disks that had obviously left them with nightmares for weeks.
Anyone have any tricks or suggestions on troubleshooting application performance issues in highly virtualized environments? As we move towards “THE CLOUD” I don’t see this getting any easier.
Let me know how you’re approaching these problems! I’d love to see a better approach!