Rethinking Change Control in a SDN world

I had the opportunity to attend the Open Networking User Group event (ONUG) in New York recently and had a chance to talk through some of my musings around change management in an SDN world with some very smart, knowledgable people from a range of different backgrounds.

Let’s talk a little about change control

In a nutshell, people screw things up when left to their own devices. Individuals will inevitably type a wrong command, misplace a decimal point, not have sufficient information, or just plain not-think-something-through.

People are frail, fragile and error prone. But when people come together in groups, share information, share experience, and double check each other’s work, then the error-rate per change tends to drop significantly and changes start to be implemented in a much higher quality fashion.

Change Policy in Modern Organizations

Most modern organizations have some change management process in place. Whether they have succumbed to a full ITIL based process, gone the DevOps route of continual integration, or fall somewhere in between, people have generally figured out that change management is a good thing.

I’ve seen good change management that promotes healthy growth, and I’ve seen bad change managements that restricts the business into stagnation because nothing is ever allowed to change in the organization. ( There’s another word for something that never changes – dead. 🙂

Change Control in a SDN environment

One of the major issues I see in SDN environments is that many of the changes that we are not only capable, but advocating, are currently heavily restricted through the existing organizations change policies.

To make this example more concrete, let’s talk about an app from Guardicore that uses SDN to detect potential advanced persistent threat attacks in the data center and then uses OpenFlow and the HP VAN SDN Controller to dynamically keep the session alive and re-route ( re-bridge?) the flow directly to a Honeypot which is capable of performing further analysis on that particular session to see if that particular flow is trying to do anything more interesting like trying to execute shell code or some other dubious shenanigans.

Now imagine how the Change Advisory Board is going to react to this request. I imagine it could go something like this.

” What? You want to reconfigure the edge, distribution and core of my data center based on an unknown event at an unknown time because something may or may not be going on?”

How do you think that’s going to go?

ITSM Pre-Approved Changes

There is a concept in ITSM frameworks like ITIL and MOF that allow for a common change to be pre-approved. The change request still has to be fed into the system, but the approvals are automatic and no one has to actively log into a system and click the ” I Approve “

One of the approaches I’ve been advocating is the possibility of repurposing the pre-approved change to allow for dynamic flow modification based on known conditions. This seems to be the simplest way for us to allow the ITSM structures in well-run IT organizations to continue to work without having to scrape the whole change approval process.

This is new ground and I think that this topic requires a lot more discussion that we are currently giving it.

What do you think? Is pre-approved change the way to go? Is there another better way? Is your organization currently using SDN and found a way to rationalize this to the Change Advisory Board?

Please blog it up or post in the comments below.

Advertisements

Intro to Configuration Management

So in a previous post, I made the recommendation to go find an ITSM framework.  For the rest of this series, I’ll be referring to the ITILv3 ITSM framework a lot.  The two books that, IMHO, apply the most to Network Operations are the Service Transition and the Service Operations books.

For the next few posts, I’m going to focus on the Service Transition volume, and specifically on the Configuration Management sections.

So in ITILv3, one of the MOST important things to understand is the concept of a Configuration Item.

What’s a CI?

The way I explain this to customers is it’s the smallest managed thing, or set of things, in the environment.

How does that apply to my network?

Well, hopefully, I’m going to try and explain that now.

The first CI in a network might be the hardware devices that are in the network. These are your switches, routers, firewalls, load balancers, servers, etc…

So most people are good with the idea of standardization. It makes senses that it’s easier to manage fewer kinds of devices. This is recommendation #1.

1) Standardize on as few hardware platforms as possible.

The good thing is that this is fairly easy to achieve. In fact a lot of people do this instinctively. They standardize on the same two chassis switches in their core, they use the same model in their distribution, and they use the same model for the access layer.

Here’s where things get crazy though.

Many of the same customers who try to standardize on a current device often have no processes in place to ensure that they are all running the same version of code.

So think back to the ITIL Configuration Item. If you have five HP 5500EI switches, and five different OS’s on them, you now have five different CI’s to track. Make sense?

Five different versions of commands

Five different versions of bugs.

5 times the headache.

If a configuration item is the smallest manageable object, then each of the different combinations of hardware and software count as a single CI. BUT… if we standardize one version of code for that hardware platform, we get one configuration item.

So the first thing that I recommend customers to do is…

GET ON ONE VERSION OF CODE!!!!!!

This is commonly called a golden software version. One version of commands, one version of bugs. One CI.

On the flip side; one of the other common mistakes I see made by customers who have taken the first step of getting on a single version is that of upgrading without a reason.

My recommendation here is to do your homework. When a new version of code is released, read the release notes, check the bug fixes, check the new features. If there’s nothing in there that is addressing an issue you’re having, or new functionality that you NEED to have,

WHY WOULD YOU CHANGE?

It may seem strange, but when you get a new switch out of the box, you may want to just plug that into your network and downgrade it to the older software. More thoughts on this in this blog post.

Any decent NMS should have the ability to be able to define, report, and deploy the correct version of code to the hardware devices.

Funny enough, post writting this, I found this another great blog by Terry Slattery, this time over at http://www.nojitter.com.

What about you guys? What configuration tools are you using? HP IMC? Orion NCM? Rancid? Prime? A TFTP Server on a wandering laptop?