Getting ITSM Experience

Many of the best network engineers I know have little to no network operation experience.

“What? How can that be?” you ask? Well it’s really quite simple.

Most of the best network engineers I know, and we’re talking some double and triple CCIEs in this crowd, have never actually operated a network for any length of time. They were professional services guys, short-term contract guys, some pre-sales or post-sales guys. There are a LOT of paths to the top of Mt. Fuji after all.

Although I did a short-term network ops gig early in my career, I feel I squandered the opportunity, as I just wasn’t mature enough to understand the experience I should have been gaining.

So this question came up with a colleague last week: “How do you get network operations experience if you’re not in a network ops group?”

This blog post is dedicated to him.

A few years back, after I decided to really get serious about network management, I had the same issue. I wanted to get some experience in network management, but I had no large network to run. In my day job I’m a pre-sales resource, so I wasn’t likely to get that experience any time soon. It occurred to me that I could start a simulation to try and gain it.

At this point, I had already done some CiscoWorks LMS projects (long sleeve shirts to cover the scars to prove it!). I had passed my ITILv3 Foundation certification, and I had even gained the honor of being one of the first SolarWinds Certified Professionals shortly after the SCP program was launched.

The Project

So with a bit of knowledge, I decided to run my home network under an ITSM framework for a year. This meant implementing good network management hygiene: good change management practices, good fault management practices, and some of the ITIL processes around Service Strategy, Service Design, Service Operation, and Continual Service Improvement. Basically, I ran it like a business whose success depended on the network.

The Tools

So I had some ideas around the processes I wanted to put in place, but it always takes the three P’s to successfully implement any ITSM initiative: people, products, and processes. Fortunately for me, I had access to HP’s Intelligent Management Center, as well as the trial versions of SolarWinds Orion NPM and NCM, but I was still missing some critical pieces of the puzzle.

Service Operation: One of the primary activities in Service Operation is really the help desk. How are tickets logged? How are they tracked? What are the escalation procedures? How do you build out and grow the KMS (knowledge management system)?

I didn’t have any help desk or ticketing software in place, so I decided to go the free route: Spiceworks.

For those of you who don’t know it, Spiceworks is a free IT management app which “includes a free IT management app for everything from network inventory and monitoring to help desk and more!”

It’s not what I would call a full FCAPS system, but it does have an ok help desk system, and it’s hard to beat free, right?

Note: I noticed last week that my Synology NAS now has a help desk app named osTicket in the available apps. I haven’t tried this, but considering it’s free and installs easily on the Synology box, it might be a good option for those of you who are lucky enough to have one of these great little machines.

Financial Management

Financial Management falls under the Service Strategy volume of the ITILv3 core books. I’ll be honest: this wasn’t exactly the strongest part of my little experiment, but I did try to implement some financial processes.

But unlike some of the help desk and change control procedures, this wasn’t exactly something that I could count on good self-discipline to track. Can you imagine that conversation?

“Hey Me… I’d really like this new Synology RS812.”

“Hmm… Don’t we already have a 411?”

“Yeah, but this one has TWO gigabit ports!”

“Let me think about that… ok. Let’s buy it!”

As you can see, I had to come up with a different plan.

Fortunately, I’m married, so I merely formalized the process of having to ask my wife for permission to buy any new toys. I have to say, this was probably the year that I got the fewest new tech toys, but I like to think the experience I gained was worth it. ( < – What’s the HTML tag for the sarcasm font again? )

The Results

So how did things go? Well, it was a little funny at times, emailing myself a support ticket so that I could fix something that wasn’t working. I did try to get my wife to e-mail the tickets in, but that lasted about a week before she just said “Can you just fix it!?!?!?!”

For the other things, it felt a little strange asking myself for permission to make a change to the environment and then having to consult myself to see what the effects might be (Change Advisory Board). Implementing the RACI (responsible, accountable, consulted, informed) matrix was pretty easy because I generally get along with myself.

To be honest, I wish I had been blogging back then, because I think it would have made for some interesting reading in retrospect. I’d like to say that I followed all the processes and ran a bulletproof network for the year, but I didn’t. Sometimes I slipped, made a change, and locked myself out of my own gear.

But on the bright side… I did learn why change management is important.

Anyone else gone through an experiment like this? Anyone willing to take up the challenge and blog on the experience?


FCAPS – A Quick Introduction

It occurs to me that I’ve been writing the last few posts about network management tasks based on an ITSM model, and I didn’t even introduce what is arguably the more useful model for breaking down and understanding network management tasks: the FCAPS model.

FCAPS has its roots in the ISO, similar to another model we all know and love: the OSI model. Everyone remember that one? Please Do Not Take Sales People’s Advice? You may have learned another mnemonic for it, but the OSI model is probably the most basic conceptual model that every networking person uses to understand the world we live in.

For those of you who are looking for some extra credit reading, or need a cure for insomnia, you can find the actual FCAPS standards in the ITU-T M.3400 recommendations. For the rest, I’m hoping to give a brief overview to help you understand the different aspects of the disciplines of network management.

F is for Fault

This involves the detection, isolation, and correction of a fault condition. Or in plain English, this lets you know when things are broken.

Fault management could involve things like syslog messages and SNMP traps being escalated to alarms, root-cause analysis and alarm suppression (or some AI which tries to separate the signal from the noise during event storms), and alarm notification policies (sending out an e-mail once you get an alarm).

Traditionally this was implemented in a lot of NMSs as green-is-good management. Basically, if everything is green, things are OK. If they are yellow or red, you’ve probably got a long night ahead of you.

In recent years, Fault Management has started to include application performance management as well. In modern networks, it’s not enough to know that an application is “up”. Now we must also make sure that the level of service, or SLA, that is being delivered to the end-user is adequate to meet their needs.
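To make the alarm suppression idea a little more concrete, here is a minimal Python sketch. It assumes events have already been collected from syslog or traps as simple tuples. The `suppress_duplicates` name, the event layout, and the 300-second hold-down window are illustrative assumptions, not taken from any particular NMS.

```python
from typing import Iterable, List, Tuple

Event = Tuple[float, str, str]  # (timestamp in seconds, device, message)


def suppress_duplicates(events: Iterable[Event], window: float = 300.0) -> List[Event]:
    """Keep the first event per (device, message) and drop repeats inside the window."""
    last_alarm = {}  # (device, message) -> timestamp of the last alarm raised
    alarms = []
    for ts, device, message in sorted(events):
        key = (device, message)
        if key not in last_alarm or ts - last_alarm[key] >= window:
            alarms.append((ts, device, message))
            last_alarm[key] = ts
    return alarms


# Hypothetical event storm: the same link-down message repeats within seconds.
storm = [
    (0.0, "core-sw1", "link down Gi1/0/1"),
    (5.0, "core-sw1", "link down Gi1/0/1"),
    (10.0, "edge-rtr2", "BGP neighbor down"),
    (400.0, "core-sw1", "link down Gi1/0/1"),
]
for alarm in suppress_duplicates(storm):
    print(alarm)
```

Real products do far more than this (topology-aware root cause, for one), but a hold-down window is probably the simplest way to stop one flapping link from burying everything else.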

Note: Whether an activity falls into one category of FCAPS or another might depend on your perspective. If you are measuring bandwidth on a particular port, you may be in the “P”, but if you are measuring the bandwidth and raising an alarm if you cross a certain threshold, you’re now in the “F”.

This may seem confusing at first, but remember that FCAPS is just a conceptual model. This is similar to the 7 Layer OSI model. Ask any good network person what layer MPLS falls at and they will either answer “It depends” or potentially “2.5”.
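Going back to the bandwidth example in the note above, the difference between “P” and “F” can be sketched in a few lines of Python. The function name and the 80% threshold are assumptions picked purely for illustration.

```python
def classify_sample(utilization_pct: float, threshold_pct: float = 80.0) -> str:
    """Return which FCAPS bucket a bandwidth sample lands in."""
    # Every sample is recorded for trending, which is Performance work.
    # Crossing the threshold turns the same number into an alarm, which is Fault work.
    if utilization_pct >= threshold_pct:
        return "F"
    return "P"


for sample in (35.0, 79.9, 92.5):
    print(sample, classify_sample(sample))
```

Same measurement, same port, same poller. The only thing that changed is what you decided to do with the number.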

C is for Configuration

This involves the configuration of the software and hardware in the network. This includes the versions of software, the actual configurations, change management, etc…

This is probably the easiest to understand. If you’re upgrading code on a switch or router, if you’re logging into a router to make a configuration change, or if you’re just plugging a network cable in to a PC, you’re in the “C”s.

A is for Accounting

This involves the identification of cost to the service provider and payment due from the customer, i.e., billing.

Personally, I find this definition a little restrictive and prefer to apply the definition that I heard in a presentation. I wish I could remember the name of the gentleman to give him credit. He started out in a thick southern drawl:

“The thing to remember about a’counting is that the rest of the world just calls it counting.”

I know. Barely funny, right?

But it does allow us to use this to include things like:

  • NetFlow for counting the different protocols running across a certain WAN link.
  • SNMP polling of T1/PRI interfaces for ensuring that your Erlang calculations are accurate and you don’t need to raise or lower the number of trunks on your voice gateways.
  • RADIUS to track how long a user was logged into a specific port on the network or how much bandwidth they actually used.

You get the picture. Basically, accounting is just counting things which might be interesting to you.
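Since Erlang calculations came up in the list above, here is a small Python sketch of the standard Erlang B recursion, which is one common way to size trunks from the traffic you have counted. The function names and the busy-hour figures are made up for the example, and the target blocking probability must be greater than zero or the loop will never finish.

```python
def erlang_b(offered_erlangs: float, trunks: int) -> float:
    """Blocking probability for a given offered load and number of trunks."""
    blocking = 1.0  # with zero trunks every call is blocked
    for m in range(1, trunks + 1):
        blocking = (offered_erlangs * blocking) / (m + offered_erlangs * blocking)
    return blocking


def trunks_needed(offered_erlangs: float, target_blocking: float) -> int:
    """Smallest number of trunks that keeps blocking at or below the target."""
    trunks = 0
    while erlang_b(offered_erlangs, trunks) > target_blocking:
        trunks += 1
    return trunks


# Hypothetical busy-hour figure: 2 Erlangs offered, 1% blocking target.
print(trunks_needed(2.0, 0.01))
```

The offered load is exactly the kind of number that comes out of polling your T1/PRI interfaces, which is why the counting matters.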

Although this is not the strict definition from ITU-T M.3400, this amended version makes it easier for me to apply because I don’t have very many customers (read: any) who actually do charge-backs for their services.

Obviously, with XaaS offerings, this domain is probably going to get a lot of attention in the coming years.

P is for Performance

This involves evaluating and reporting on the effectiveness of the network, and individual network devices.

Way back when I did my CCNA, one of the things I remember reading about was how you should be checking your routers and switches often to see if their CPU or memory was running high. I’ve never actually met anyone who logged into a device to check on a daily basis, but the advice was actually really good.

With a good NMS, you can:

  • use SNMP polling for the CPU and memory to track their trending over time.
  • use ICMP to track availability of the devices (assuming they respond!)
  • use ICMP to track the latency of the device to test the quality of the link.
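As a rough illustration of the two ICMP bullets, the Python sketch below turns a handful of ping results into availability and average latency. It assumes the probes have already been sent by something else and that a lost probe is recorded as `None`; the function name and sample values are hypothetical.

```python
from typing import Dict, Optional, Sequence


def summarize_pings(rtts_ms: Sequence[Optional[float]]) -> Dict[str, Optional[float]]:
    """Summarize ping samples; None marks a probe that got no reply."""
    replies = [rtt for rtt in rtts_ms if rtt is not None]
    availability = 100.0 * len(replies) / len(rtts_ms) if rtts_ms else None
    avg_latency = sum(replies) / len(replies) if replies else None
    return {"availability_pct": availability, "avg_latency_ms": avg_latency}


# Hypothetical samples from one polling cycle against a single device.
print(summarize_pings([12.0, 14.0, None, 10.0]))
```

Store those two numbers every polling cycle and you have the raw material for the trend graphs.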

As I mentioned in the Fault section, performance often blurs with fault in that good performance management habits can alert you to faults in the network. In fact, good performance management can even allow you to proactively avoid faults by identifying a potential performance bottleneck in the network and addressing the issue before it turns into a fault.

Probably the most important thing to know about performance management is that it helps you make better decisions.

Most good network engineers can instinctively know where the bottlenecks are in their networks and can usually correctly identify what needs to be upgraded to get the most benefit.

Most great network engineers can use the pretty graphs from a good performance management tool to get the money from their CFO for those upgrades.

In my home network, I actually track the response time of all my links, as well as additional services, such as the one below which allows me to keep my wife happy.

Facebook Response Time Performance Tracking

Note: Probably the most recognizable performance management tools would be MRTG and PRTG. I can’t even imagine how many network upgrades were justified by the pretty graphs that came out of these tools.

S is for Security

Security is… well, security. These are the network management activities that involve securing the network and the data running over it.

In a lot of ways, I strongly believe that security should be addressed in every waking (and sleeping!) moment that you’re thinking about your networks. Security should become so second nature to us that it should be almost impossible to perform any of the other tasks without security entering the conversation.

What do I mean?

Fault – CIA – Confidentiality, Integrity, and Availability. It’s hard to be secure when the network isn’t available, and the Fault domain helps us keep it that way!

Configuration – Auditing – Good configuration management practices can involve automated IT Control objective verification tools, otherwise known as “scripts” which will allow us to have the NMS ensure all the configurations are what they should be and no unneeded services are on our routers and switches.

Performance – Most performance data is collected with SNMP, and if you’re using SNMP, PLEASE USE SNMPv3 if possible! It supports authentication, integrity checking, and encryption. Also, lock down your management interfaces with ACLs on your devices.
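The configuration auditing “scripts” mentioned above do not have to be fancy. Here is a minimal Python sketch that checks a saved configuration for lines you would rather not see. The policy list, the function name, and the sample config are all illustrative assumptions; a real audit would be driven by your own control objectives.

```python
from typing import List

# Illustrative policy only: lines we do not want to see in a running config.
FORBIDDEN_LINES = (
    "ip http server",
    "snmp-server community public",
)


def audit_config(config_text: str) -> List[str]:
    """Return the config lines that violate the policy above."""
    findings = []
    for raw in config_text.splitlines():
        line = raw.strip()
        if any(line.startswith(bad) for bad in FORBIDDEN_LINES):
            findings.append(line)
    return findings


# Hypothetical saved configuration for one router.
sample_config = """\
hostname edge-rtr2
ip http server
no ip http secure-server
snmp-server community public RO
"""
print(audit_config(sample_config))
```

Run something like this against every stored config on a schedule and the NMS tells you about drift before an auditor does.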

It’s just a model

Please don’t take it too seriously. It’s not a binary model. Feel free to apply some fuzzy logic here and be confident that it’s 46% Fault Management and 54% Performance Management.

The important thing here is that it helps us understand the network management world we live in. It gives us a conceptual model to be able to understand the different activities involved in network management. As an added bonus, it also gives us a handy tool to evaluate different NMS software packages.

Think about the tools you’re using. Are you using a point solution, like SolarWinds Orion NPM, which focuses on performance monitoring, or an open source tool like RANCID, which focuses on configuration?

Or are you looking at a SPOG solution like HP’s IMC which provides full FCAPS (and more!) in the base package?

What tools are you using? Are they full FCAPS? Or are they more focused on one particular area?

Configuration Management – Software Management

So in the last post I introduced the concepts of the Configuration Management System, and the Configuration Item. Today, I’m going to introduce the concept of the Definitive Media Library.

The DML is really nothing more than a software library. Ideally, this should be tied directly into your element management system so that you can define the baseline software image, deploy the image out to the appropriate devices, and audit the network to ensure that all of the devices are inline with your golden software definitions.

As I laid out in the last post, standardization is there to make your lives easier. But it takes a lot of commitment, especially if your network has gone through significant “organic growth”. Making the choice to commit to good configuration management hygiene is sort of like committing to going to the gym or committing to eat healthier.

Just like going to the gym, the first thing you need to do is figure out your current software state. Hopefully, your NMS software will have the ability to discover and audit the software running on the devices in your network and report against a known good state.

Audit the Current State of the Network

If you don’t have an NCCM tool in place with these features, you may end up writing scripts or, worst case, logging into your devices manually and noting the software version in an Excel spreadsheet. Once you have a handle on what’s out there, the next step is choosing what version of code you need to be running.
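If you do end up writing scripts, the comparison step can be as simple as the Python sketch below. It assumes you have already gathered an inventory of device name, model, and running version by whatever means you have; the names, models, and version strings are placeholders, not real release numbers.

```python
from typing import Dict, List, Tuple


def audit_versions(inventory: Dict[str, Tuple[str, str]], golden: Dict[str, str]) -> List[str]:
    """Return device names whose software differs from the golden version for their model."""
    drift = []
    for name, (model, version) in sorted(inventory.items()):
        # A model with no golden version defined is also reported as drift.
        if golden.get(model) != version:
            drift.append(name)
    return drift


# Placeholder golden versions and inventory, purely for illustration.
golden = {"5500EI": "1.0", "MSR": "2.0"}
inventory = {
    "sw1": ("5500EI", "1.0"),
    "sw2": ("5500EI", "1.1"),
    "rtr1": ("MSR", "2.0"),
}
print(audit_versions(inventory, golden))
```

The output is your to-do list: every device named there needs to be brought in line or documented as an exception.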

Choosing your Software Version

So now that you’ve figured out that your devices are all over the place, it’s time to figure out what version of software you actually want to be running. Whether you are running Comware, IOS, NXOS, Junos, FTOS, or some other OS that I haven’t mentioned, the guidelines are pretty much the same.

Wash, Rinse and Repeat.

What about the exceptions?

I was going to try to sugarcoat this, but I’ll just come out and say it. Cisco has licensing for many of their platforms, and this can create situations where you can’t actually get on a common code version without incurring additional CAPEX for the licenses and OPEX to deal with the SMARTnet. Or you can get into a situation where the features you’re looking for are mutually exclusive in two different IOS images for your routers. Or you’re running Cisco CallManager and your gateways require the voice image while your regular WAN routers require another.

In any event, my recommendation is still the same. Find the fewest possible combinations of software for the hardware platforms in your network and stick to them unless there is a REALLY good reason to change.

Check out this video of the basic NCCM features in HP’s Intelligent Management Center to help you navigate through your software baseline woes.

Anything I missed here? Feel free to post in the comments below.

Intro to Configuration Management

So in a previous post, I made the recommendation to go find an ITSM framework. For the rest of this series, I’ll be referring to the ITILv3 ITSM framework a lot. The two books that, IMHO, apply the most to network operations are the Service Transition and Service Operation books.

For the next few posts, I’m going to focus on the Service Transition volume, and specifically on the Configuration Management sections.

So in ITILv3, one of the MOST important things to understand is the concept of a Configuration Item.

What’s a CI?

The way I explain this to customers is it’s the smallest managed thing, or set of things, in the environment.

How does that apply to my network?

Well, hopefully, I’m going to try and explain that now.

The first CI in a network might be the hardware devices that are in the network. These are your switches, routers, firewalls, load balancers, servers, etc…

So most people are good with the idea of standardization. It makes sense that it’s easier to manage fewer kinds of devices. This is recommendation #1.

1) Standardize on as few hardware platforms as possible.

The good thing is that this is fairly easy to achieve. In fact a lot of people do this instinctively. They standardize on the same two chassis switches in their core, they use the same model in their distribution, and they use the same model for the access layer.

Here’s where things get crazy though.

Many of the same customers who try to standardize on a current device often have no processes in place to ensure that they are all running the same version of code.

So think back to the ITIL Configuration Item. If you have five HP 5500EI switches with five different OS versions on them, you now have five different CIs to track. Make sense?

Five different versions of commands.

Five different versions of bugs.

Five times the headache.

If a configuration item is the smallest manageable object, then each of the different combinations of hardware and software count as a single CI. BUT… if we standardize one version of code for that hardware platform, we get one configuration item.
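The arithmetic here is simple enough to sketch in Python. Under the definition above, the number of CIs is just the number of distinct hardware and software combinations. The function name and version strings are placeholders.

```python
def count_cis(devices):
    """Count distinct (hardware model, software version) combinations."""
    return len({(model, version) for model, version in devices})


# Five switches of the same model, before and after standardizing the code.
mixed = [("5500EI", v) for v in ("1.0", "1.1", "1.2", "1.3", "1.4")]
standardized = [("5500EI", "1.4")] * 5
print(count_cis(mixed), count_cis(standardized))
```

Five switches either way, but only one of those networks has a single set of commands and bugs to learn.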

So the first thing that I recommend customers do is…

GET ON ONE VERSION OF CODE!!!!!!

This is commonly called a golden software version. One version of commands, one version of bugs. One CI.

On the flip side, one of the other common mistakes I see made by customers who have taken the first step of getting on a single version is upgrading without a reason.

My recommendation here is to do your homework. When a new version of code is released, read the release notes, check the bug fixes, check the new features. If there’s nothing in there that is addressing an issue you’re having, or new functionality that you NEED to have,

WHY WOULD YOU CHANGE?

It may seem strange, but when you get a new switch out of the box, you may want to just plug that into your network and downgrade it to the older software. More thoughts on this in this blog post.

Any decent NMS should be able to define, report on, and deploy the correct version of code to the hardware devices.

Funnily enough, after writing this, I found another great blog post by Terry Slattery, this time over at http://www.nojitter.com.

What about you guys? What configuration tools are you using? HP IMC? Orion NCM? Rancid? Prime? A TFTP Server on a wandering laptop?

Device Instrumentation

Not all devices are created equal.

I know this seems like a piece of Cap’n Obvious wisdom, but it bears thinking about a little in the context of network management.

One of the things I see all the time is someone asking to do XYZ on the device, whether that’s pulling serial numbers from power supplies or reading the sticker on the back of a switch. There are some things that are just outside the realm of possibility, or would just be too difficult to put into place.

If you are seriously looking at implementing an NMS, you need to get friendly with SNMP. Simple Network Management Protocol is probably the most common management protocol on the planet.

To be honest, SNMP is a second language, and I would highly recommend that anyone who wants to get SERIOUS about network management pick up a book or two and start learning it. SimpleWeb has some tutorials, podcasts, and slide decks available which may be a good place to start.

In a nutshell, SNMP MIBs fall into two major categories:

Public – These are the standard MIBs that are defined by the IETF. These are your friends: the Bridge MIB, dot3 MIB, Entity MIB, etc. MOST vendors should support these.

Private – Also referred to as Enterprise, these are the MIBs which vendors write to support their own device-specific functionality.

Occasionally, someone brings in a non-SNMP-capable device and asks for it to be monitored. And then they complain because you can’t make the same pretty graphs.
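One quick way to tell the two categories apart is by where an OID sits in the tree: most standard MIBs hang off mib-2 (1.3.6.1.2.1), while vendor MIBs live under enterprises (1.3.6.1.4.1). The Python sketch below uses that as a rule of thumb. It is a simplification, since a few standard MIBs live in other branches, and the function name is my own.

```python
MIB2_PREFIX = "1.3.6.1.2.1"         # mib-2: where most standard MIBs live
ENTERPRISES_PREFIX = "1.3.6.1.4.1"  # enterprises: vendor-specific MIBs


def classify_oid(oid: str) -> str:
    """Rule-of-thumb classification of a numeric OID as public, private, or other."""
    oid = oid.lstrip(".")
    if oid == MIB2_PREFIX or oid.startswith(MIB2_PREFIX + "."):
        return "public"
    if oid == ENTERPRISES_PREFIX or oid.startswith(ENTERPRISES_PREFIX + "."):
        return "private"
    return "other"


print(classify_oid("1.3.6.1.2.1.1.1.0"))  # sysDescr.0 from the standard system group
print(classify_oid("1.3.6.1.4.1.9.2.1"))  # an OID under Cisco's enterprise number (9)
```

If the thing you want to graph only exists under the enterprises branch, expect to load the vendor’s MIB files into your NMS before it will make sense of the data.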

If it’s not instrumented in the device, we can’t do anything with it.

Let me say that again…

If it’s not instrumented in the device, we can’t do anything with it.

Here’s an example: Say someone comes to you and says, “Hey! Can you please tell me what the serial number is on the power supply in XYZ vendor’s chassis switch?”

I check the MIBs and it seems that XYZ vendor hasn’t instrumented serial numbers as one of the pieces of information which they make available. So the answer is “No, I can’t.”

Then they complain that this NMS stuff, or the specific NMS product, sucks. Remember:

If it’s not instrumented in the device, we can’t do anything with it.
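For the power supply example, the standard place to look is the ENTITY-MIB, where `entPhysicalClass` identifies power supplies and `entPhysicalSerialNum` holds the serial number if the vendor chose to populate it. The Python sketch below works on a dictionary of already-collected walk results, so the collection step itself is left out. The function name and the sample data are hypothetical.

```python
ENT_PHYSICAL_CLASS = "1.3.6.1.2.1.47.1.1.1.1.5"
ENT_PHYSICAL_SERIAL = "1.3.6.1.2.1.47.1.1.1.1.11"
POWER_SUPPLY = 6  # PhysicalClass value for powerSupply in the ENTITY-MIB


def power_supply_serials(walk):
    """Map entPhysicalIndex -> serial (or None) for each power supply in a walk result."""
    serials = {}
    for oid, value in walk.items():
        if oid.startswith(ENT_PHYSICAL_CLASS + ".") and value == POWER_SUPPLY:
            index = oid[len(ENT_PHYSICAL_CLASS) + 1:]
            # An empty or missing value means the device does not instrument it.
            serials[index] = walk.get(f"{ENT_PHYSICAL_SERIAL}.{index}") or None
    return serials


# Hypothetical walk output: the second power supply reports no serial number.
walk = {
    ENT_PHYSICAL_CLASS + ".10": 6,
    ENT_PHYSICAL_SERIAL + ".10": "PSU-SN-0001",
    ENT_PHYSICAL_CLASS + ".11": 6,
    ENT_PHYSICAL_SERIAL + ".11": "",
    ENT_PHYSICAL_CLASS + ".20": 3,
}
print(power_supply_serials(walk))
```

A `None` in the result is the device telling you, politely, that the answer is “No, I can’t.”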

Baselining

Where are you coming from?

One of the first mistakes that new network operators make is that they don’t have a good idea of where they already are.

Take getting directions as an example. Say you want to get to Disneyland. Google can tell you where Disneyland is, but if you don’t have a starting point, there’s really no way to understand how you are going to get to where you want to be.

So the first thing you need to do when you’re trying to operate a new network is figure out exactly where you are.

This concept is known as baselining.

Baselining:

At its essence, baselining is really nothing more than taking stock of where you are. Most experienced engineers instinctively know when they reach a new environment that they want to spend some time just getting to know the place before they make any changes. Network management as a discipline takes this to a much more structured level.

There are a few types of baselines, performance and configuration being the major ones.

Performance Baselines –

This is the simple act of counting and recording things in the network. One of the common mistakes that I see new network management practitioners make is the “I want to monitor everything!” move.

Now there are some grounds for this. Remember this guy?

“When you can measure what you are speaking about, and express it in numbers, you know something about it. But when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind.”

Lord Kelvin, 1891
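In the spirit of expressing things in numbers, here is a minimal Python sketch of what a performance baseline might boil down to for a single metric: the mean, a 95th percentile, and the maximum over a collection period. The function name, the nearest-rank percentile method, and the latency samples are my own assumptions, and the function expects at least one sample.

```python
import statistics


def performance_baseline(samples):
    """Summarize a non-empty list of measurements into a simple baseline."""
    ordered = sorted(samples)
    # Nearest-rank 95th percentile: ceil(0.95 * n), computed with integers.
    rank = (95 * len(ordered) + 99) // 100
    return {
        "mean": statistics.mean(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }


# Hypothetical latency samples in milliseconds for one link, with one outlier.
latency_ms = [10] * 10 + [12] * 8 + [15, 40]
print(performance_baseline(latency_ms))
```

Once you have those three numbers for a metric, “is this normal?” stops being a matter of opinion.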
