Troubleshooting Performance Issues in a Virtual Environment

So today I got to sink my teeth into a good problem. Performance issues in a virtual environment.

I have to say, this is probably the first time in my career where I walked in and I didn’t have to prove it was the network. The customer was prepared. He had his NMS tools in place ( Cacti ) and had been trending various points in the network over a period of time. 

Of course we started at the 101 stage and looked at counters, and when I said “Hey, you have some issues on your ASA” he pulled up the Cacti graph and said “Yeah, that’s an offsite backup that runs at midnight, we know about it and it’s fine with us.”

Can I say it out loud?  

WOW.

A lot of the customers I see are SMB/SME customers ( I am in Canada, remember? ) and although it’s uncommon to find a network with NMS tools in place, it’s even more rare to find one where they are actually using them!

I got called onsite to help out with some performance issues. The nice thing is that it was not the network, at least not yet. ( Until we’re 100% sure, I’m not going to discount anything, right? ). But we DO need to figure out where to start targeting our efforts.  

This is one of the problems I’m starting to see more and more of. Hard to troubleshoot anything when it’s in the cloud.

[Image: MP900426639. Picture courtesy of Microsoft’s Online ClipArt Gallery.]

No idea where the apps live in that picture, right? This gets even more interesting when you have VDI accessing virtual applications and start having performance issues on the client side. 

I know I’m going to get some snickers from this one, but my suggestion to deal with this is to create application flow maps to document how a complete transaction is made in a multi-tiered application. 

I know… “We can’t get them to create Visios for the networks they already have, and you’re suggesting we ask them to create more?”

Yeah… I know. But I can dream, right?

 

So let’s look at the following VDI multi-tiered application. This is pretty simple, right?

1) A client workstation connects to a Citrix Server over ICA or RDP.  

2) The Citrix server browses to a web app on a web host.

3) The web host queries a remote MS SQL database, which returns the results to the web host.

4) etc… 

 

[Figure: application flow map for the VDI multi-tiered application]

Can’t get much easier than this, right? The great thing about this is that it becomes fairly easy to overlay the flow map on the virtual environment, which gives you a better idea of how the application is currently instantiated in the physical/virtual environment.

Let’s look at the above example installed in a blade server environment where the three parts of this particular app flow live on three different blades in three different chassis.

[Figure: the application flow map overlaid on three blades in three different chassis]

As you can see, from a performance troubleshooting standpoint we just went from three points to check ( let’s throw out the client as that’s just screen caps ) to twenty-one points, without counting the network devices which are used to provide connectivity between the blade chassis.

Although you can create affinity rules between VMs to ensure they are located on the same physical hypervisor host and avoid performance issues, we all know that people make mistakes. By creating the application flow map and applying it to the physical environment, you can start looking at only the specific devices which are actually involved in your specific performance issue.
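To make the overlay idea a bit more concrete, here is a minimal Python sketch of an application flow map applied to a physical placement. Everything in it is an assumption for illustration: the VM, blade, and chassis names are hypothetical, and it only tracks three layers per tier, so it does not try to reproduce the twenty-one points in the diagram. It simply lists the components one transaction touches and flags the hops that have to leave a chassis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Where one tier of the application currently runs (all names hypothetical)."""
    vm: str
    blade: str
    chassis: str


# Application flow map: the ordered tiers of one transaction (client left out).
APP_FLOW = ["citrix", "web", "sql"]

# Hypothetical placement of each tier in the blade environment.
PLACEMENTS = {
    "citrix": Placement(vm="ctx01", blade="blade-1a", chassis="chassis-1"),
    "web": Placement(vm="web01", blade="blade-2c", chassis="chassis-2"),
    "sql": Placement(vm="sql01", blade="blade-3b", chassis="chassis-3"),
}


def points_to_check(flow, placements):
    """Return the ordered, de-duplicated components that one transaction touches."""
    seen = []
    for tier in flow:
        p = placements[tier]
        for component in (f"vm:{p.vm}", f"blade:{p.blade}", f"chassis:{p.chassis}"):
            if component not in seen:
                seen.append(component)
    return seen


def cross_chassis_hops(flow, placements):
    """Return adjacent tier pairs whose traffic has to leave the chassis."""
    return [
        (a, b)
        for a, b in zip(flow, flow[1:])
        if placements[a].chassis != placements[b].chassis
    ]


if __name__ == "__main__":
    for component in points_to_check(APP_FLOW, PLACEMENTS):
        print(component)
    print("cross-chassis hops:", cross_chassis_hops(APP_FLOW, PLACEMENTS))
```

If an affinity rule were doing its job, the web and SQL tiers would share a blade, the list of components would shrink, and the cross-chassis hop between them would disappear from the output.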

Last, but not least, I would also suggest you have on hand the storage flow maps for both the specific application as well as the relationship between the physical hypervisor hosts and their storage arrays.

[Figure: storage flow map for the application and the hypervisor hosts]

I’m not a storage expert, but I’ve heard my storage buddies tell stories of database and VDI LUNs thrashing on the same physical disks, stories that had obviously left them with nightmares for weeks.
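A storage flow map can be checked for exactly that kind of overlap. The Python sketch below is only an illustration under stated assumptions: the LUN names, workload labels, and disk group names are all made up, and a real array would need its own inventory export to fill in the table. It reports any physical disk group that backs both a database LUN and a VDI LUN.

```python
from collections import defaultdict

# Hypothetical storage flow map: LUN -> (workload type, physical disk group).
LUNS = {
    "lun-sql-01": ("database", "diskgroup-a"),
    "lun-vdi-01": ("vdi", "diskgroup-a"),
    "lun-vdi-02": ("vdi", "diskgroup-b"),
    "lun-web-01": ("web", "diskgroup-c"),
}


def shared_disk_groups(luns, conflicting=("database", "vdi")):
    """Return disk groups that host every workload type in `conflicting`.

    The result maps each offending disk group to the sorted LUNs involved.
    """
    by_group = defaultdict(dict)
    for lun, (workload, group) in luns.items():
        by_group[group].setdefault(workload, []).append(lun)

    result = {}
    for group, workloads in by_group.items():
        if all(w in workloads for w in conflicting):
            result[group] = sorted(
                lun for w in conflicting for lun in workloads[w]
            )
    return result


if __name__ == "__main__":
    for group, luns in shared_disk_groups(LUNS).items():
        print(f"{group}: {', '.join(luns)}")
```

With the made-up data above, only diskgroup-a is flagged, because it is the one place where a database LUN and a VDI LUN land on the same spindles.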

 

Anyone have any tricks or suggestions on troubleshooting application performance issues in highly virtualized environments? As we move towards “THE CLOUD” I don’t see this getting any easier.

Let me know how you’re approaching these problems! I’d love to see a better approach! 

 


It’s with great sadness and reservation I take these powers….

So first thing, I’m not taking on any powers. Now that that’s out of the way, I wanted to take a little time and put together some thoughts on the current state of our industry.

We’re at an inflection point, a paradigm shift where everything that once was is about to change. I’m sure some would argue that we’ve already fallen over that edge. I don’t mean to be all dramatic ( although it does create a more interesting bit of writing! ), but I truly believe our industry is in for change.

    BIG CHANGE

Like “Fish crawling out of the ocean” change.

There’s a great book called The Aquarian Conspiracy that deals with the concept of paradigm shifts. It’s not IT related at all, but I think it’s applicable to this topic since we are dealing with a point where so many of our “beliefs” are being destroyed that we need to find a new path forward.

What beliefs am I talking about? Let’s start with this small list, although I’m sure there are more ( feel free to post in the comments if you have any! )

1) International Standards bodies work. – IEEE/IETF have been infiltrated by overly powerful vendors with their own agendas, allowing them to force or stall individual projects on a whim. ( see Greg Ferro’s article for a great description of this in detail )

2) That overlay networks are going to solve all of our problems.

3) A protocol per problem: Any problem can be solved by adding a new protocol.

There are a lot of solutions right now, and SDN is hot, whether that’s Nicira, BigSwitch, or IBM and HP with the recent VEPA gear ( yes, one of them is VEPA “ready” @ioshints! ), or whether we’re talking about something much more devious like VxLAN or NVGRE.

There’s a lot of great work that’s been done in the OpenFlow arena but I’m not sure it’s gotten out of the “solution looking for a problem” stage yet.

But to be honest, there’s one player that has me a little bit concerned here. Perhaps I’ve just seen too many Star Wars movies in my time, and with the recent re-release of Episode 1, my mind is going down a strange road.

VMware scares me.

There. I said it out loud.

Now to explain to you WHY they scare me, I have to explain a little bit of the Star Wars story ( for those of you who have been living under a rock ).

Once upon a time there was a republic that had lived for a thousand years with their glorious protectors.

The Network Jedi Knights.

Now these brave men and women had been rescuing the business for years from spanning-tree loops, layer 2 data center interconnects, and the evil of double ( and single ) NATs.

But unfortunately the Senate ( IEEE/IETF ) had grown complacent, with the member planets ( vendors ) arguing internally for placement.

“TRILL!”

“NO, SPB!”

“TRILL lets us sell more hardware!”

“VEPA!!!”

“No, VN-Link!”

“You’re proprietary!”

“No YOU’RE proprietary!”

And suddenly out of the darkness comes VMWARE and VxLAN.

“Yes, I know you haven’t done anything about that little VLAN problem, but you guys just keep arguing… it makes me really sad, but I suppose I will handle all the traffic decisions, but just until you guys get this figured out, ok? It’s with great sadness and reservation that I take on these powers…”

I’m pretty sure that everyone remembers how that story ended.

Don’t believe me? Think about this: VMware introduced the vSwitch and took Cisco on as its “apprentice”. For the last few years, Cisco had the only vSwitch in the industry that had access to the hypervisor of the major player in the industry.

And now, VMware has its own security suite that negates the need for an ASA. Especially when you consider that there are currently no hardware products that support the termination of VxLAN tunnels.

And if all the shady behavior is not enough to convince you, check out this little nugget that I found on the Microsoft page today.

“learn about the Cisco and NetApp pre-validated private cloud offering through the Microsoft Hyper-V Cloud Fast Track”

What the heck happened to VCE???  Now we have to deal with MCN as well?

I don’t know how this is going to play out. Will OpenFlow grow out of a lab toy into a solution which not only scales, but actually addresses technical requirements in a much more elegantly simple way than our current protocol-per-problem paradigm? I guess we’ll see…

What do you think? Anyone else worried about the state of the networking industry? Change is a constant and embracing change is the key to surviving in this industry, but I also think a dose of vendor skepticism and suspicion is not only healthy, but a survival trait.

I just hope that the network industry pulls out of this before John Chambers ends up in a black suit with a respirator and all the rest of us Jedi are gone.

Feel free to let me know where you see this headed in the comments.

@netmanchris