Troubleshooting Performance Issues in a Virtual Environment

So today I got to sink my teeth into a good problem. Performance issues in a virtual environment.

I have to say, this is probably the first time in my career where I walked in and I didn’t have to prove it was the network. The customer was prepared. He had his NMS tools in place ( Cacti ) and had been trending various points in the network over a period of time. 

Of course we started at the 101 stage and looked at counters, and when I said “Hey, you have some issues on your ASA,” he pulled up the Cacti graph and said “Yeah, that’s an offsite backup that runs at midnight. We know about it and it’s fine with us.”

Can I say it out loud?  

WOW.

A lot of the customers I see are SMB/SME customers ( I am in Canada, remember? ) and although it’s uncommon to find a network with NMS tools in place, it’s even more rare to find one where they are actually using them!

I got called onsite to help out with some performance issues. The nice thing is that it was not the network, at least not yet. ( Until we’re 100% sure, I’m not going to discount anything, right? ). But we DO need to figure out where to start targeting our efforts.  

This is one of the problems I’m starting to see more and more of. Hard to troubleshoot anything when it’s in the cloud.

[Image: MP900426639]

Picture Courtesy of Microsoft’s Online ClipArt Gallery.

 

No idea where the apps live in that picture, right? This gets even more interesting when you have VDI accessing virtual applications and start having performance issues on the client side. 

I know I’m going to get some snickers from this one, but my suggestion to deal with this is to create application flow maps to document how a complete transaction is made in a multi-tiered application. 

I know… “We can’t get them to create Visios for the networks they already have, and you’re suggesting we ask them to create more?”

Yeah… I know. But I can dream, right?

 

So let’s look at the following VDI multi-tiered application. This is pretty simple, right?

1) A client workstation connects to a Citrix Server over ICA or RDP.  

2) The Citrix server browses to a web-app on a web host.

3) The web host queries a remote MS SQL database, which returns the results to the web host.

4) etc… 

 

[Figure: application flow map for the example above (Screen Shot 2012 09 21 at 9 59 02 PM)]

Can’t get much easier than this, right? The great thing is that it becomes fairly easy to overlay this map onto the virtual environment, which gives you a better idea of how the application is currently instantiated in the physical/virtual environment.

Let’s look at the above example installed in a blade server environment where the three parts of this particular app flow live on three different blades in three different chassis.
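
A flow map doesn't have to be a Visio. As a minimal sketch, the example above could be written down as an ordered list of hops in Python. The Hop class, the tiers helper, and the protocol labels on the second and third hops are my own assumptions for illustration; the post only names ICA or RDP for the first hop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    """One leg of a transaction: who talks to whom, and over what."""
    source: str
    destination: str
    protocol: str


# The example application, in transaction order.
# Protocol labels for hops 2 and 3 are assumptions, not from the post.
APP_FLOW = [
    Hop("client workstation", "Citrix server", "ICA or RDP"),
    Hop("Citrix server", "web host", "HTTP(S)"),
    Hop("web host", "MS SQL database", "SQL client connection"),
]


def tiers(flow):
    """Return each tier once, in the order a transaction touches it."""
    seen = []
    for hop in flow:
        for name in (hop.source, hop.destination):
            if name not in seen:
                seen.append(name)
    return seen


if __name__ == "__main__":
    for hop in APP_FLOW:
        print(f"{hop.source} -> {hop.destination} ({hop.protocol})")
```

Even something this small gives you a list you can walk through, hop by hop, when someone says the app is slow.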

[Figure: the application flow map overlaid on three blades in three chassis (Screen Shot 2012 09 21 at 9 59 07 PM)]

As you can see, from a performance troubleshooting standpoint we just went from three points to check ( let’s throw out the client as that’s just screen caps ) to twenty-one points, without counting the network devices which provide connectivity between the blade chassis.

Although you can create affinity rules between VMs to ensure they are located on the same physical hypervisor host to avoid performance issues, we all know that people make mistakes. By creating the application flow map and applying it to the physical environment, you can start looking at only the devices which are actually involved in your specific performance issue.

Last, but not least, I would also suggest you have on hand the storage flow maps for both the specific application and the relationship between the physical hypervisor hosts and their storage arrays.
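
Here is a rough sketch of what applying the flow map to the physical environment might look like in Python. Everything named here (the VM, blade, and chassis names, and the two helper functions) is hypothetical; in real life the placement data would come out of your hypervisor manager. It is nowhere near the full twenty-one points, but it shows the idea of narrowing the search to the devices in the path.

```python
# Hypothetical VMs that make up the application flow, in transaction order.
APP_VMS = ["citrix-vm", "web-vm", "sql-vm"]

# Hypothetical placement: VM -> (blade, chassis).
PLACEMENT = {
    "citrix-vm": ("blade-1", "chassis-A"),
    "web-vm": ("blade-4", "chassis-B"),
    "sql-vm": ("blade-2", "chassis-C"),
    "unrelated-vm": ("blade-3", "chassis-A"),  # not part of this flow
}


def points_to_check(flow_vms, placement):
    """List only the VMs, blades, and chassis this flow actually touches."""
    points = []
    for vm in flow_vms:
        blade, chassis = placement[vm]
        for item in (vm, blade, chassis):
            if item not in points:
                points.append(item)
    return points


def hosts_in_use(flow_vms, placement):
    """Blades hosting this flow; more than one may mean an affinity rule was missed."""
    return {placement[vm][0] for vm in flow_vms}


if __name__ == "__main__":
    print(points_to_check(APP_VMS, PLACEMENT))
    if len(hosts_in_use(APP_VMS, PLACEMENT)) > 1:
        print("Flow is spread across multiple blades")
```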

[Figure: storage flow map (Screen Shot 2012 09 21 at 10 01 43 PM)]

I’m not a storage expert, but I’ve heard my storage buddies tell stories of database and VDI LUNs thrashing on the same physical disks, stories that had obviously left them with nightmares for weeks.
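
If you have the storage flow map written down, spotting that kind of overlap can be a few lines of Python. This is only a sketch: the LUN names, workload labels, and disk group names are made up, and a real array may not expose its layout this neatly.

```python
from collections import defaultdict

# Hypothetical storage map: LUN -> (workload type, physical disk group).
LUNS = {
    "lun-vdi-01": ("VDI", "diskgroup-1"),
    "lun-sql-01": ("Database", "diskgroup-1"),
    "lun-web-01": ("Web", "diskgroup-2"),
}


def shared_disk_groups(luns):
    """Return disk groups that back more than one workload type."""
    by_group = defaultdict(set)
    for workload, group in luns.values():
        by_group[group].add(workload)
    return {
        group: sorted(workloads)
        for group, workloads in by_group.items()
        if len(workloads) > 1
    }


if __name__ == "__main__":
    for group, workloads in shared_disk_groups(LUNS).items():
        print(f"{group} is shared by: {', '.join(workloads)}")
```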

 

Anyone have any tricks or suggestions on troubleshooting application performance issues in highly virtualized environments? As we move towards “THE CLOUD” I don’t see this getting any easier.

Let me know how you’re approaching these problems! I’d love to see a better approach! 

 

FCAPS – A Quick Introduction

It occurs to me that I’ve been writing the last few posts about network management tasks based on an ITSM model and I didn’t even introduce what is arguably the more useful model for breaking down and understanding network management tasks: the FCAPS model.
FCAPS has its roots in the ISO, similar to another model we all know and love: the OSI model. Everyone remember that one? Please Do Not Take Sales People’s Advice? You may have learned another acronym for it, but this is probably the most basic conceptual model that every networking person uses to understand the world we live in.

For those of you who are looking for some extra credit reading, or need a cure for insomnia, you can find the actual FCAPS standards in ITU-T Recommendation M.3400. For the rest, I’m hoping to give a brief overview to help you understand the different disciplines of network management.

F is for Fault

This involves the detection, isolation, and correction of a fault condition. Or in plain English, this lets you know when things are broken.

Fault Management could involve things like syslog and SNMP traps being escalated to alarms, root-cause analysis and alarm suppression ( or some AI which tries to separate the signal from the noise during event storms ), and alarm notification policies ( sending out an e-mail once you get an alarm ).

Traditionally this was implemented in a lot of NMSs as Green-is-good management. Basically, if everything is green, things are OK. If they are yellow or red, you’ve probably got a long night ahead of you.

In recent years, Fault Management has started to include application performance management as well. In modern networks, it’s not enough to know that an application is “up”. Now we must also make sure that the level of service, or SLA, that is being delivered to the end-user is adequate to meet their needs.

Note: Whether an activity falls into one category of FCAPS or another might depend on your perspective. If you are measuring bandwidth on a particular port, you may be in the “P”, but if you are measuring the bandwidth and raising an alarm if you cross a certain threshold, you’re now in the “F”.
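
To make the P-versus-F distinction concrete, here is a minimal Python sketch. The first function is the “P” part (turning two interface octet-counter samples into a utilization figure) and the second is the “F” part (raising an alarm when a threshold is crossed). The function names and the 80% default threshold are my own assumptions, and the sketch ignores counter wrap between samples.

```python
def utilization_percent(octets_start, octets_end, seconds, link_bps):
    """The "P": average link utilization between two octet-counter samples.

    Assumes the counter did not wrap between the two samples.
    """
    bits = (octets_end - octets_start) * 8
    return 100.0 * bits / (seconds * link_bps)


def check_threshold(util_percent, threshold_percent=80.0):
    """The "F": return an alarm message if over the threshold, else None."""
    if util_percent > threshold_percent:
        return (
            f"ALARM: utilization {util_percent:.1f}% "
            f"exceeds {threshold_percent:.1f}%"
        )
    return None


if __name__ == "__main__":
    # Two samples taken 300 seconds apart on a 100 Mbps port.
    util = utilization_percent(0, 3_375_000_000, 300, 100_000_000)
    print(f"utilization: {util:.1f}%")
    print(check_threshold(util))
```

Same measurement, two different FCAPS letters, depending on what you do with the number.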

This may seem confusing at first, but remember that FCAPS is just a conceptual model. This is similar to the 7 Layer OSI model. Ask any good network person what layer MPLS falls at and they will either answer “It depends” or potentially “2.5”.

C is for Configuration

This involves the configuration of the software and hardware in the network. This includes the versions of software, the actual configurations, change management, etc…

This is probably the easiest to understand. If you’re upgrading code on a switch or router, if you’re logging into a router to make a configuration change, or if you’re just plugging a network cable into a PC, you’re in the “C”s.

A is for Accounting

This involves the identification of cost to the service provider and payment due from the customer, i.e. billing.

Personally, I find this definition a little restrictive and prefer to apply the definition that I heard in a presentation. I wish I could remember the name of the gentleman to give him credit. He started out in a thick southern drawl:

“The thing to remember about a’counting, is that the rest of the world just calls it counting.”

I know. Barely funny, right?

But it does allow us to use this to include things like:

  • NetFlow for counting the different protocols running across a certain WAN link.
  • SNMP polling of T1/PRI interfaces for ensuring that your Erlang calculations are accurate and you don’t need to raise or lower the number of trunks on your voice gateways.
  • RADIUS to track how long a user was logged into a specific port on the network or how much bandwidth they actually used.
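
For the Erlang bullet, here is a small Python sketch of the standard Erlang B formula, which relates offered traffic ( in erlangs ) and trunk count to the probability that a call is blocked. The function names are my own, and whether Erlang B is the right model for your voice gateways is an assumption you should check against your own traffic.

```python
def erlang_b(traffic_erlangs, trunks):
    """Blocking probability for the offered traffic on a number of trunks.

    Uses the usual recurrence B(n) = A*B(n-1) / (n + A*B(n-1)), B(0) = 1.
    """
    blocking = 1.0  # with zero trunks, every call is blocked
    for n in range(1, trunks + 1):
        blocking = traffic_erlangs * blocking / (n + traffic_erlangs * blocking)
    return blocking


def trunks_needed(traffic_erlangs, target_blocking):
    """Smallest trunk count whose blocking is at or below the target."""
    if not 0 < target_blocking <= 1:
        raise ValueError("target_blocking must be in (0, 1]")
    trunks = 0
    while erlang_b(traffic_erlangs, trunks) > target_blocking:
        trunks += 1
    return trunks


if __name__ == "__main__":
    # Busy-hour traffic as measured by polling the T1/PRI interfaces.
    print(trunks_needed(10.0, 0.01))
```

The polled numbers feed the traffic figure; the result tells you whether the trunk count on the gateway is still about right.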

You get the picture. Basically, accounting is just counting things which might be interesting to you.
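
In that spirit, counting protocols on a WAN link really can be this simple once the flow records are exported. A minimal Python sketch, where the record format and the sample numbers are made up for illustration and not taken from any particular NetFlow collector:

```python
from collections import Counter

# Hypothetical, already-exported flow records: (protocol, bytes).
FLOWS = [
    ("HTTPS", 120_000),
    ("DNS", 900),
    ("HTTPS", 80_000),
    ("SSH", 5_000),
]


def bytes_per_protocol(flows):
    """Total the bytes seen for each protocol."""
    totals = Counter()
    for protocol, nbytes in flows:
        totals[protocol] += nbytes
    return totals


if __name__ == "__main__":
    for protocol, nbytes in bytes_per_protocol(FLOWS).most_common():
        print(f"{protocol}: {nbytes} bytes")
```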

Although this is not the strict definition from ITU-T M.3400, this amended version makes it easier for me to apply because I don’t have very many customers (read: any) who actually do charge-backs for their services.

Obviously, in an XaaS service, this domain is probably going to get a lot of attention in the coming years.

P is for Performance

This involves evaluating and reporting on the effectiveness of the network, and individual network devices.

Way back when I did my CCNA, one of the things I remember reading about was how you should be checking your routers and switches often to see if their CPU or memory was running high. I’ve never actually met anyone who logged into a device to check on a daily basis, but the advice was actually really good.

With a good NMS, you can

  • use SNMP polling for the CPU and Memory to track their trending over time.
  • use ICMP to track availability of the devices ( assuming they respond! )
  • use ICMP to track the latency to the device as a rough test of the quality of the link.
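
What the NMS does with those poll results is not magic. As a rough Python sketch, assuming the polling has already happened and each result is either a round-trip time in milliseconds or None for a timeout ( the function name and input format are my own ):

```python
from statistics import mean


def summarize_polls(latencies_ms):
    """Return (availability percent, average latency in ms) for a poll series.

    latencies_ms holds one entry per poll; None means the poll timed out.
    Average latency is None if no poll was answered.
    """
    if not latencies_ms:
        raise ValueError("need at least one poll result")
    answered = [x for x in latencies_ms if x is not None]
    availability = 100.0 * len(answered) / len(latencies_ms)
    avg_latency = mean(answered) if answered else None
    return availability, avg_latency


if __name__ == "__main__":
    availability, latency = summarize_polls([10.0, 20.0, None, 30.0])
    print(f"availability: {availability:.1f}%, average latency: {latency} ms")
```

Store those two numbers for every polling interval and you have the raw material for the trend graphs.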

As I mentioned in the Fault section, performance often blurs with fault in that good performance management habits can alert you to faults in the network. In fact, good performance management can even allow you to proactively avoid faults by identifying a potential bottleneck in the network and addressing the issue before it turns into a fault.

Probably the most important thing to know about performance management is that it helps you make better decisions.

Most good network engineers can instinctively know where the bottlenecks are in their networks and can usually correctly identify what needs to be upgraded to get the most benefit.

Most great network engineers can use the pretty graphs from a good performance management tool to get the money from their CFO for those upgrades.

In my home network, I actually track the response time of all my links, as well as additional services, such as the one below which allows me to keep my wife happy.

[Figure: Facebook Response Time Performance Tracking]

Note: probably the most recognizable performance management tool would be MRTG/PRTG. I can’t even imagine how many network upgrades were justified by the pretty graphs that came out of these tools.

S is for Security

Security is… well, security. These are the network management activities that involve securing the network and the data running over it.

In a lot of ways, I strongly believe that security should be addressed in every waking (and sleeping!) moment that you’re thinking about your networks. Security should become so second nature to us that it should be almost impossible to perform any of the other tasks without security entering the conversation.

What do I mean?

Fault – CIA – Confidentiality, Integrity, and Availability. Hard to be secure when it’s not available, and the Fault domain helps us keep it that way!

Configuration – Auditing – Good configuration management practices can involve automated IT Control objective verification tools, otherwise known as “scripts” which will allow us to have the NMS ensure all the configurations are what they should be and no unneeded services are on our routers and switches.

Performance – Most performance data is polled over SNMP, and if you’re using SNMP, PLEASE USE SNMPv3 if possible! It supports both encryption and message integrity checking. Also, lock down your management interfaces with ACLs on your devices.
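
As for the “scripts” in the Configuration point, a minimal Python sketch of a config audit might look like the following. The list of forbidden lines is illustrative only ( they are Cisco IOS style commands I picked as examples, not a recommended baseline ), and the function name is my own.

```python
# Illustrative rules only; substitute your own control objectives.
FORBIDDEN_LINES = [
    "ip http server",
    "service tcp-small-servers",
    "service udp-small-servers",
]


def audit_config(config_text, forbidden=FORBIDDEN_LINES):
    """Return (line number, line) for every forbidden line in a config.

    Matches whole lines, so "no ip http server" is not flagged.
    """
    findings = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped in forbidden:
            findings.append((number, stripped))
    return findings


if __name__ == "__main__":
    sample = "hostname r1\nno ip http server\nservice tcp-small-servers\n"
    for number, line in audit_config(sample):
        print(f"line {number}: unneeded service enabled: {line}")
```

Run something like this against every config your NMS backs up and you have the start of an automated audit.
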
FCAPS

It’s just a model

Please don’t take it too seriously. It’s not a binary model. Feel free to apply some fuzzy logic here and be confident that it’s 46% Fault Management and 54% Performance Management.

The important thing here is that it helps us understand the network management world we live in. It gives us a conceptual model to be able to understand the different activities involved in network management. As an added bonus, it also gives us a handy tool to evaluate different NMS software packages.

Think about the tools you’re using. Are you using a point solution, like SolarWinds Orion NPM which focuses on Performance monitoring, or an Open Source tool like RANCID which focuses on Configuration?

Or are you looking at a single-pane-of-glass ( SPOG ) solution like HP’s IMC which provides full FCAPS (and more!) in the base package?

What tools are you using? Are they full FCAPS? Or are they more focused on one particular area?