How many SPOGs?

Seems like every vendor is preaching the value of the Single Pane of Glass (SPOG) to their customers. As those of you who have been operations folks know, the fragmented nature of xMS (NMS, SMS (security), SMS (server), BSM, APM, etc.) has been a nightmare for most organizations. The data is more siloed than the IT departments, and it really doesn’t scale because of the lack of interaction between the data in the different management domains.

So the industry lately has really zoomed in on the idea of the single pane of glass management system. And it got me thinking:

Does anyone really want a single pane of glass?

I think a lot of people are looking for a way to manage complex environments and the idea of having a SPOG that lets you see everything in one console is such a tempting idea. But is it realistic? And even if it was, would it even be useful?

I don’t think anyone would try to argue that convergence in the data center isn’t a reality. The network is virtual, storage is distributed. Applications are federated. Everything is built on a stack of lies and no one in the operations group has any idea where their particular domain of responsibility ends anymore.

But in meeting with many different organizations, it seems that although people want (and NEED) the SPOG, they also want to keep the separate silos of servers, storage, and networking.

I’m still thinking this through, but it seems to me that the network guys (and gals) want to see things from a network-centric point of view. The server folks want to see things from a server-centric point of view, and the storage folks want to see things from a storage-centric point of view.

What’s interesting though, is that in smaller shops where the Ops team is actually one or two people who do everything, they still seem to prefer a SPOG per IT domain.

Functionally dysfunctional, if you will.

There are some solutions out there, like Cisco UCS Manager, that have some great stuff going for them and seem to bring together the data center network and the servers. I haven’t had a lot of hands-on time, but it does seem to bring the data center into a SPOG, and I can see the value in that.

But I wonder about the rest of the network. What about the end-users? The data center only exists to offer services to end-users and a solution that seems to completely discount the users it is supposed to serve just seems like it’s missing something to me.

What do you guys think? Would you rather have an NMS tool that allows you to see into the network-centric portions of the virtual environment and gives you full visibility to the end-user? Full visibility into the end-to-end transaction, at least from the network perspective?

Still thinking this one through…

@netmanchris


Through the eyes of a child

Wrote this last summer and apparently didn’t publish. Still amazes me.

While I was listening to the Packet Pushers podcast, a listener’s question came up on studying and learning. Coincidentally, I had just had one of the most amazing experiences of my life. I had just watched my 6-year-old son ride his bike alone for the first time.

We’d been working on it all summer, and he was more scared than anything else. It was late September and he was discouraged and didn’t want to practice anymore, and I literally had to force him back on that bike. But I knew that this day was the day.

And it was.

The look of absolute wonder on his face at his new, seemingly superhuman, ability to ride his bike by himself was awe-inspiring. And it got me thinking just how lucky we are in this industry.

We have the privilege every day to learn. What a great job we have.

Code is like my wife.

So it’s obvious I’m a management guy. And I don’t mean I manage people; I manage my network, and I help advise my customers on managing their networks. One of the things that never ceases to amaze me is how little people in our industry actually know about change management. In fact, I find it downright ironic that an industry whose only constant is change hasn’t really embraced change management at all. I think it’s safe to say we run screaming from change management as quickly as possible. The only thing we avoid more than change management is its dreaded enforcer:

THE PROJECT MANAGER

Personally, I think that a little more time spent in ITSM (IT service management) school could really do a lot to change the stereotype of our profession as networking-focused IT workers. The most common ITSM framework today is, of course, ITIL. The Information Technology Infrastructure Library is a CBOK (common body of knowledge) that focuses on how to run an IT organization. Its loftiest goal is to give us a common vocabulary so we can communicate effectively across the IT silos. It’s as simple as that. The focus of ITIL is really just to standardize and codify the best practices and wisdom that other IT professionals have gathered over the years and allow us to leverage their prior experience.

def: WISDOM: The experience we get from making bad decisions.

So what does ITIL have to do with code and my wife? We’re getting there, but first we need a little more ITIL knowledge.

ITIL comprises 5 core volumes (Strategy, Design, Transition, Operations, and CSI, for those of you who are counting). A couple of those volumes, Transition and Operations, deal specifically with change management. There’s a LOT of good content in there, but I’m going to focus on just one piece:

def: Configuration Item: A CI is an asset, service component or other item that is, or will be, under the control of Configuration Management. CI’s may vary widely in complexity, size and type, ranging from an entire service or system including all hardware, software, documentation and support staff to a single software module or a minor hardware component. Configuration items may be grouped and managed together, e.g. a set of components may be grouped into a release. CI’s should be selected using established selection criteria, grouped, classified and identified in such a way that they are manageable and traceable through the service lifecycle.

One of the goals of Configuration Management is really to offer stability in the IT environment. One of the ways to get more stability is to ensure consistency of operations, and one of the ways to ensure consistency of operations is to ensure that you have common components. This is why telcos often standardize on a single box, even though in many cases it may be vastly overkill for the application. From an operational cost point of view, the CAPEX cost is offset by the stability and consistency gained by having a single common CI.

So let me bring this back to my wife… One of the MOST common mistakes (er, inefficiencies) that I see in many customers’ networks is the following.

Joe Admin receives a brand new Cisco 3750/Juniper EX4200/HP 5500EI switch for the new wiring closet. It arrived on time, and he got a good price on it. In fact, he got a great price because it’s the exact same model that he’s been buying now for a couple of years. So Joe grabs his trusty serial-to-USB adapter, his MacBook Pro and a console cable, and 10 minutes later the box has an IP address assigned to VLAN interface 1, he’s set up telnet and a local user account for remote administration, and he’s good to go!

Now there are a bunch of things that he’s just done that are less than ideal (see if you can spot them all!) but I’m just going to focus on one today: Joe didn’t check the version of code on this switch. And this brings me back to my point.

Code is like my wife

Let me try to explain that one: Every version of code has things that we love, things that we don’t. It has features that we may or may not use, and always, always, always bugs. Of course vendors do their best to ensure that code is as bug free as possible before it goes out the door, but there is always something unintended or unexpected that comes up.

I’ve been married for a few years now, and like any married person I’ve found that my wife has certain… special charms that make me love her even more. I’m not allowed to call them bugs. But there are certain things in people’s personalities that are just like bugs in code. You can’t predict them, they really don’t seem to make any sense, and they usually pop up at the LEAST desirable moment.

So how is code like my wife? Both have bugs of course! ( I mean undocumented “features”!!!)  But…  we’ve been together for a few years now and I’ve got a lot of the bugs figured out. I understand that when I do X she will exhibit a certain symptom that may not make sense to me, but I can tell you that it’s repeatable on a somewhat consistent basis. ( Yes, I’m male. So that does make me a slow learner ).

So let’s apply that back to Joe. Joe just grabbed a brand new switch out of the box, with a brand new software load and a brand new set of undocumented “features” that he has yet to discover. Now there are certain people in the world who enjoy the excitement of new experiences. The thrill of discovery. The rush of jumping headfirst off a cliff with no clue whether or not there are rocks in that water. Me? Not so much.

I love my wife, but I have to admit that the practical engineer side of me really loves the stability as well. I know a lot of her bugs. I know what sets her off. I know how to avoid them, and when they happen, I know how to defuse them quickly. I also know that if I have to troubleshoot a new bug, I have a wide range of experience with this particular version to draw on, which should speed up the whole MTTR (mean time to recovery) process.

Note: I usually use MTTI (Mean Time to Innocence), but my experience so far with my wife has taught me that it’s VERY rarely not me who’s at fault.

So what should Joe have done here?

1) Pick a version of code. Any version of code. But only pick one. Get to know it inside and out. Read the configuration guide and the release notes, and make sure he’s happy with the features and documented bugs.

2) Install THIS specific version of code on every single compatible hardware platform in the entire network.

3) When Joe receives a new switch with a new version of code, read the release notes. Check for the new features. Are they required? Is there a solid business reason for upgrading? Did the code fix any specific bugs which are impacting the availability of the network or the profitability of the business? If the answer is No to all of these questions:

DOWNGRADE THE SWITCH!

Yup. You heard me, I’m recommending that you actually downgrade your switches to the chosen version of code. Your knee-jerk reaction would be to upgrade everything else in the network to the new version, right? Wrong.

Wrong. Wrong. Wrong. Wrong. Wrong.

In an industry where there is so much change on a daily basis, we often don’t take the time to actually sit down and think about WHY we are changing things in the environment. What’s the net benefit of this change? Are we changing just for change’s sake? Or is there a compelling reason for us to invest the time and energy into learning an entire new set of bugs? To link this back to the ITIL CI, if you have a single hardware platform with a single version of code on it, you have a single CI. The fewer CIs you have to manage, the fewer CIs you have to manage. And I think we can all agree that eliminating complexity and reducing the number of managed elements we have to think about in a day is a good thing.
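The three steps above boil down to a simple decision rule, and it can be sketched in a few lines of Python. This is only an illustration, not a real tool: the platform names, the version strings and the `GOLDEN_VERSIONS` table are all made up, and how you actually collect the running version from a switch is left out entirely.

```python
from dataclasses import dataclass

# Hypothetical "golden" code version per hardware platform.
# Platform names and version strings are invented for illustration.
GOLDEN_VERSIONS = {
    "access-switch-a": "12.2(55)SE",
    "access-switch-b": "5.20.99",
}


@dataclass
class Switch:
    name: str
    platform: str
    running_version: str


def recommend(switch: Switch, business_reason_to_change: bool = False) -> str:
    """Return 'ok', 'reimage', 'review' or 'unknown-platform' for one switch."""
    golden = GOLDEN_VERSIONS.get(switch.platform)
    if golden is None:
        # No standard version chosen yet for this platform: pick one first.
        return "unknown-platform"
    if switch.running_version == golden:
        return "ok"
    if business_reason_to_change:
        # A required feature or a bug fix that matters to the business:
        # evaluate the new code as a candidate for the next golden version.
        return "review"
    # Otherwise bring the box to the chosen version, even if that
    # means a downgrade.
    return "reimage"


if __name__ == "__main__":
    new_box = Switch("closet-7", "access-switch-a", "15.0(2)SE")
    print(new_box.name, recommend(new_box))
```

The point of keeping one entry per platform in the table is the same as the point of the post: one platform plus one version equals one CI.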

 

@netmanchris

Tales from the trenches: The divorce feature

This discussion came up on Twitter with @amyengineer when discussing odd gateway bugs, and I had a couple of requests to blog this out as this is just TOO classic to lose to time.

Let me set the stage: It was approx 2003/2004. I was taking a break from the usual CallManager/Unity craziness and was assigned to install a CME implementation with the JUST released Cisco Unity Express 1.1. (I had installed 1.0 a few weeks before and was VERY happy they had fixed that little browser logout issue!)

Nothing special or crazy about this install. No ACD functionality was in CME yet, and this SHOULD have been a slam dunk.

I’m doing the typical day2 support calls. Cleaning up some fax issues, etc… And then I get a call.

“The system is randomly dropping calls and it seems to be happening more to the women.”

Of course, you can imagine my reaction: “Are you seriously trying to imply that my phone system is sexist?” (sarcasm) “Users.” (/sarcasm)

So after doing some more info gathering, I start to put together more pieces of the puzzle.
Symptoms
1) completely random
2) impossible to consistently reproduce ( I never got it to reproduce once in fact!)
3) Users can hear far-end. But far-end cannot hear users.
4) The majority of the incident reports were filed by… WOMEN.

Typical one-way audio call, right? Except for that last one. This was getting interesting. But we were never able to reproduce it, so we didn’t take it very seriously since there was no proof. Yes, I said it. I’m a network engineer, and users’ random anecdotal complaints aren’t proof.

So I arrive onsite about day 4 of the day2 support week, and I get summoned to the CIO’s office.

(scary music) dum dum dum dum (/scary music)

Now I had heard some stories about this guy. No nonsense, ruled the IT department with an iron hand. Super tight ship. I too have control issues, so I kinda liked this guy. But there was one more thing I had learned in the safe confines of the soundproofed data center. They had a nickname for Mrs. CIO.

The Dragon Lady

It was not uncommon for employees to hear this woman emasculating her husband over the phone. They told me that on bad days, you could hear her screeching through the closed door… And she wasn’t on speaker phone.

Now apparently it was one of those days. And apparently the issue happened. And apparently Mrs. Dragon Lady thought that Mr. CIO had hung up on her in the middle of the verbal assault. Things are not looking good for Mr. CIO when he gets home. And he knows it.

Back to the summons

“Phone boy!” says Mr. CIO
“Yes sir?” I reply.
Mr. CIO: “You are not leaving until you figure out what the he** is going on and it’s fixed.”
Me: “You’re the customer. I’ll do my best.”
Mr. CIO: “And then you’re calling my wife.”

Huh?

Now I’ve got some skills. I had already been a desktop guy and a server guy (Vines, Windows, Novell, OS/2). I had done some AS/400, even done some Avaya and Nortel PBX installs.

But I was NOT a couples therapist!

But of course the customer is always right, right? I was just trying to figure out how I was going to put that activity into the timesheet system. Anyone got the code for “psych consult”?

So, I’ve got TAC on the phone. In those days there weren’t THAT many good voice guys, and I had been working with most of them for a few years. The case has been open for a few days, but we have no hard data points to bite into, so we’re in waiting mode.

So I call up my TAC engineer and explain my situation to him. He puts me on hold so he can tell the guys around him. We wait for the laughter to die down to a dull roar and start getting to work.

10 hours, no food, and a LOT of coffee later (I was never allowed to leave the building), we finally figure this out. As most of you voice guys (and gals!) know, human speech falls within a known frequency range. The Nyquist theorem says we need to sample at twice the highest frequency we want to capture, and of course we picked approx 0-4000 Hz, which gets us to the 64k codec.
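For anyone who wants to check the arithmetic behind that 64k figure, here is a quick sketch. The function names are just for this example, and the 8 bits per sample is an assumption on my part (it is the G.711 sample size, which the story above doesn’t spell out).

```python
def nyquist_rate_hz(max_frequency_hz: int) -> int:
    """Minimum sampling rate needed to capture frequencies up to max_frequency_hz."""
    return 2 * max_frequency_hz


def pcm_bitrate_bps(sample_rate_hz: int, bits_per_sample: int) -> int:
    """Bit rate of an uncompressed, single-channel PCM stream."""
    return sample_rate_hz * bits_per_sample


if __name__ == "__main__":
    rate = nyquist_rate_hz(4000)        # 8000 samples per second
    bitrate = pcm_bitrate_bps(rate, 8)  # 8 bits per sample assumed (G.711)
    print(rate, bitrate)                # prints: 8000 64000
```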

Now as we all know, women USUALLY have a higher voice than their hairier male counterparts. And as any married man can attest, the more excited they get, the higher it goes.

The issue? When a female caller’s voice crossed a certain threshold, CME was interpreting this as the initial negotiation tone of a fax machine. You can see how this goes:

Mrs. Dragon Lady calls Mr. CIO -> gets mad at Mr. CIO -> voice raises -> fax detected -> fax negotiation commences -> one-way audio occurs -> Mrs. Dragon thinks he hung up on her -> creative vocabulary ensues -> divorce is imminent -> it’s 9pm, I haven’t eaten in 12 hours and I’m sitting alone in a cold data center with Mr. CIO watching me from outside the glass.

Worst part: This was a previously resolved bug which R&D had accidentally reintroduced ( pre IOS code train streamlining) and we had already discounted this hours earlier since it had been “fixed”.

To jump to the end, I called Mrs Dragon and explained the situation. Emailed her the case notes and the bugtraq link. She was actually a very pleasant lady!

Case closed. Marriage saved.

Almost ten years later I still laugh every time I think of this. Goes to show that there is ALWAYS a reason that things go the way they do. It is what it is. And when it’s not?

It’s because you didn’t understand it in the first place.

We don’t need another hero….

I was reading the latest posts of the IOS Hints blog http://ow.ly/1uR4r2 on disaster recovery and it got me thinking about the word hero and about the “heroes” in the networking industry today.

One of the things that stuck out was “watch for early signs”, and it got me thinking about how many customers miss this basic rule in disaster recovery. Too often in the networking world I talk to customers about the value of network management and there is inevitably a CLI jockey in the room who spouts some variation of the comment “a GUI’s not going to help me when everything is burning around me! I need the CLI”.

Now I don’t actually disagree with him in principle. Often when you are in the nuclear heat that is the center of a network meltdown, it is only a calm zen-inspired focus and mad CLI skills that are going to get you through the situation with no more than some singed hair. But it really struck me that it actually appeared that he WANTED the network to burn so he could ride in on his white horse and save the day. I’m not sure if it’s a general need for attention, disdain for the business he works for, or just the adrenaline rush he gets from fighting a core meltdown. Even worse than that, it REALLY struck me as odd that his manager usually supports him!

Like most of the current generation of network administrators, I grew up in a Cisco world with a severe lack of good network management tools. But somewhere along the line I decided that I would rather avoid the fire and spend time with my family and friends. I would like to think the CCIE still means that I’m capable of debugging with the cool kids when the need arises, but to be honest I don’t need the attention.

I would much rather let my management station receive the trap and send me an e-mail letting me know one of the power supplies on my core switch is going to fail, so I can call support and get the replacement part ordered, received, and installed before there is any service-impacting outage.
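As a rough sketch of what that early-warning step might look like, here is a small Python function that turns an already-received event into an e-mail. Everything specific in it is hypothetical: the event names, the `PLAYBOOK` table and the addresses are invented for illustration, and the actual trap reception and mail delivery are deliberately left out.

```python
from email.message import EmailMessage
from typing import Optional

# Hypothetical event names mapped to the follow-up we want to suggest.
# None of these identifiers come from a real MIB.
PLAYBOOK = {
    "power-supply-degraded": "Open a support case and order a replacement PSU.",
    "fan-degraded": "Schedule a fan tray replacement.",
}


def build_alert(device: str, event: str, to_addr: str) -> Optional[EmailMessage]:
    """Turn a received event into an e-mail, or None if there is no playbook entry."""
    action = PLAYBOOK.get(event)
    if action is None:
        return None
    msg = EmailMessage()
    msg["To"] = to_addr
    msg["From"] = "nms@example.com"  # placeholder sender address
    msg["Subject"] = f"[early warning] {device}: {event}"
    msg.set_content(f"{device} reported '{event}'.\nSuggested action: {action}\n")
    return msg


if __name__ == "__main__":
    alert = build_alert("core-sw-1", "power-supply-degraded", "me@example.com")
    if alert is not None:
        print(alert["Subject"])
```

Keeping the suggested action next to the event is the whole idea: the message that reaches you should already say what to do before anything is actually on fire.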

A structured, process-driven approach to running your network, combined with strict operational discipline and the right tools, can make all the difference. Do you want to be the hero? Or would you rather be the “3 o’clock and alll is wellll” guy?

Now please don’t misunderstand me; I’m with everyone else who applauds the skills of the hero who is able to jump into the burning building and perform a daring rescue, but I’m sure we can all agree we would much rather the fire was never allowed to get out of control in the first place.

These days the only fires I want are the ones involving my two sons and a bag of marshmallows.