Amazon S3 Outage: Another Opinion Piece

So Amazon S3 had some “issues” last week, and it’s taken me a few days to put my thoughts together around this. Hopefully I’ve made it to the tail end of the still-interested-enough-to-find-this-blog-valuable period.

Trying to make the best of a bad situation, the good news, in my opinion, is that this shows infrastructure people still have a place in the automated, cloudy world of the future. At least that’s something, right?

What happened:

You can read the detailed explanation on Amazon’s summary here.

In a nutshell

  • There was a small problem
  • They tried to fix it
  • Things went bad for a relatively short time
  • They fixed it

What happened during:

The internet lost its mind. Or, more accurately, some parts of the internet went down, some of them in extremely ironic ways.


Initial thoughts

The reaction to this event is amusing, and it drives home the point that infrastructure engineers are as critical as ever, if not more important, considering the complete lack of architecture that seems to have gone into the majority of these “applications”.

First let’s talk about availability: looking at the Amazon AWS S3 SLA, available here, it looks like they fell below their 99.9% availability SLA. A quick check on https://uptime.is/ shows that for a monthly period, 99.9% allows for no more than 43m 49.7s of outage. The outage ran somewhere around 6-8 hours, so clearly they blew through that budget. Looking at the S3 SLA page, it looks like customers might be eligible for 25% service credits. I’ll let you work that out with AWS.
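For anyone who wants to check those numbers, here’s a quick back-of-the-napkin sketch in python. The month length is an assumption on my part (an average Gregorian month of 365.2425/12 days, which seems to be how you land on the 43m 49.7s figure):

```python
# Downtime budget for a 99.9% monthly SLA.
# Assumption: an "average" month of 365.2425 / 12 days.
sla = 0.999
seconds_per_month = 365.2425 / 12 * 24 * 3600

budget = seconds_per_month * (1 - sla)
print(f"{int(budget // 60)}m {budget % 60:.1f}s")   # 43m 49.7s

# A 6-8 hour outage is roughly 8-11x that budget.
print(6 * 3600 / budget, 8 * 3600 / budget)
```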

Don’t “JUST CLICK NEXT”

One of the first things that struck me as funny here was the fact that it was the US-EAST-1 Region which was affected. US-EAST is the default region for most of the AWS services. You have to intentionally select another region if you want your service to be hosted somewhere else. But because it’s easier to just click next, it seems that the majority of people just clicked past that part and didn’t think about where they were actually hosting their services, or the implications of hosting everything in the same region and probably the same availability zone. For more on this topic, take a look here.

There’s been a lot of criticism of infrastructure people lately, along the lines of “anyone with a credit card can go to Amazon, sign up for an AWS account, and start consuming their infrastructure.” This gets thrown around like it’s actually a good thing, right?

Well, this is exactly what happens when “anyone” does that. You end up with all your eggs in one basket.

“Design your infrastructure for the four S’s: Stability, Scalability, Security, and Stupidity” — Jeff Kabel

Again, this is not an issue with AWS or any cloud provider’s offerings. This is an issue with people who think that infrastructure and architecture don’t matter and can just be “automated” away. Automation is important, but it’s there so that your infrastructure people can free up time from mind-numbing tasks to help you properly architect the infra components your applications rely upon.

Why oh Why oh Why

Why anyone would architect their revenue-generating system on an infrastructure that is only guaranteed to 99.9% is beyond me. The right answer, at least from an infrastructure engineer’s point of view, is obvious, right?

You would use redundant architecture to raise the overall resilience of the application, relying on the fact that it’s highly unlikely you’ll lose the different redundant pieces at the same time. Put simply, what are the chances that two different systems, each guaranteed to a 99.9% SLA, are going to go down at the exact same time?

Doing some really basic probability calculations, and assuming the outages are independent events, we multiply the probability of non-SLA’d time in system 1 (0.001, i.e. 0.1%) by the same metric in system 2 and we get:

0.001 * 0.001 = 0.000001 probability of both systems going down at the same time.

Or, put another way, 99.9999% uptime. Pretty great, right?
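The same thing in python, just to make the big assumption explicit (namely, that the two failures really are independent events):

```python
# Two systems, each with a 99.9% availability SLA.
availability = 0.999
p_down = 1 - availability                  # 0.001 probability of being down

# Assumption: the outages are independent events.
p_both_down = p_down * p_down              # 0.000001
combined = 1 - p_both_down                 # 0.999999 -> 99.9999% uptime

print(p_both_down, f"{combined:.6%}")
```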

Note: I’m not an availability calculation expert, so if I’ve messed up a basic assumption here, someone please feel free to correct me. Always looking to learn!

So application people made the mistake of just signing responsibility for their application uptime over to “the cloud”, and most of them probably didn’t even read the SLA for the S3 service or sit down to think it through.

Really? We had people armed with an IDE and a credit card move our apps to “the cloud”, and then we wonder why things failed.

What could they have done?

There’s a million ways to answer this I’m sure, but let’s just look at what was available within the AWS list of service offerings.

CloudFront is AWS’s content delivery system. It’s extremely easy to use, easy to set up, and takes care of automatically moving your content to multiple AWS Regions and Availability Zones.

Route 53 is AWS’s DNS service; it lets you perform health checks and direct DNS queries only to resources which are “healthy” or actively available.
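To give a sense of how little code that takes, here’s a rough sketch using boto3 (the python AWS SDK). The IP address, path, and thresholds are placeholders I made up for illustration, and you’d still need to attach the health check to a failover or weighted record set before it actually steers any traffic.

```python
import boto3

route53 = boto3.client('route53')

# Create a health check against a (hypothetical) origin in another region.
# All of the values below are placeholders, not real resources.
response = route53.create_health_check(
    CallerReference='s3-outage-blog-example-1',
    HealthCheckConfig={
        'IPAddress': '203.0.113.10',
        'Port': 80,
        'Type': 'HTTP',
        'ResourcePath': '/healthcheck',
        'RequestInterval': 30,
        'FailureThreshold': 3,
    },
)

print(response['HealthCheck']['Id'])
```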

There are probably a lot of other options as well, both within AWS and without, but my point is that the applications that went down most likely didn’t bother. Or they were denied the budget to properly architect resiliency into their system.

On the bright side, the latter just had a budget opening event.

Look who did it right

Unsurprisingly, there were companies who weathered the S3 storm like nothing happened. In fact, I was able to sit and binge-watch Netflix while the rest of the internet was melting down. Yes, it looks like it cost 25% more, but then again, I had no problems with season 4 of The Big Bang Theory at all last week, so I’m a happy customer.

Companies still like happy customers, don’t they?

The Cloud is still a good thing

I’m hoping that no one reads this as an anti-cloud post. There’s enough anti-cloud rhetoric happening right now, which I suppose is inevitable considering last week’s highly visible outage, and I don’t want to add to that.

What I do want is for people who read this to spend a little bit of time thinking about their applications and the infrastructure that supports them. This type of thing happens in enterprise environments every day. Systems die. Hardware fails. Get over it and design your architecture to treat these failures as a foregone conclusion. It IS going to happen, it’s just a matter of when. So shouldn’t we design up front around that?

Alternately, we could also choose to take the risk for those services that don’t generate revenue for the business. If it’s not making you money, maybe you don’t want to pay for it to be resilient. That’s ok too. Just make an informed decision.

For the record, I’m a network engineer well versed in the arcane discipline of plumbing packets. Cloud and application architectures are pretty far away from the land of BGP peering and routing tables where I spend my days. But for the low, low price of $15 and a bit of time on Udemy, I was able to dig into AWS and build some skills that let me look at last week’s outage with a much more informed perspective. To all my infrastructure engineer peeps: I highly encourage you to take the time, learn a bit, and get involved in these conversations at your companies. Hoping we can all raise the bar together.

Comments, questions?

@netmanchris

Devops for Networking Forum in Santa Clara

Normally, I would have written this a few weeks ago, but sometimes the world just takes the luxury of time away from you. In this case, I couldn’t be happier, though, as I’m about to be part of something that I believe is going to be really, really amazing. This event is really a testimony to Brent Salisbury and John Willis’s commitment to community and their relentless pursuit of trying to evolve the whole industry, bringing along as many of the friends they’ve made along the way as possible.

Given the speaker list, I don’t believe there’s been any event in recent (or long-term!) memory with such an amazing list of speakers. The most amazing part is that this event was really put together in the last month!!!!

If you’re in the bay area, you should definitely be there. If you’re not in the area, you should buy a plane ticket as you might not ever get a chance like this again. 

 

DevOps Forum for Networking

From the website

 

previously known as DevOps4Networks is an event started in 2014 by John Willis and Brent Salisbury to begin a discussion on what Devops and Networking will look like over the next five years. The goal is to create a conversation for change similar to what CloudCamp did for Cloud adoption and DevopsDays for Devops.

 

When and Where

You can register here

DevOps Networking Forum 2016

Monday, March 14, 2016 9:00 AM – 5:00 PM (Pacific Time)

Santa Clara Convention Center
5001 Great America Pkwy
Santa Clara, California 95054
United States
Questions? Contact us at events@linuxfoundation.org

 Who

You can hit the actual speakers page here, but here’s the short list:

  • Kelsey Hightower, Google
  • Kenneth Duda, Arista
  • Dave Meyer, Brocade
  • Anees Shaikh, Google
  • Chris Young, HPE
  • Leslie Carr, SFMIX
  • Dinesh Dutt, Cumulus
  • Petr Lapukhov, Facebook
  • Matt Oswalt, keepingitclassless
  • Scott Lowe, VMware

I’ve also heard that a few other industry notables will be wandering the hallways as ONS starts to spin up for the week.

Yup. What an amazing list and for the low low price of $100, you can join us as well!

OMG

I’m absolutely honoured and, to be honest, a little intimidated to be sharing a spot with some of the industry luminaries who have been guiding lights for me personally over the last five years. I’m hoping to be a little educational, a little entertaining, and other than that, I’ll be in the front row with a box of popcorn soaking up as much as I can from the rest of the speakers.

Hope to see you there!

 

@netmanchris

 

2015 Recap and plans for 2016

About this time last year, I wrote this post.  It’s time to revisit again and plan for 2016. 

 

How did I do?

In 2015, I planned to work on skills in four major areas: python, data science, virtualization, and just keeping up on networking. In general, I think I did well in all areas, with the breakaway really being python. I made a concerted effort this year to seek out project after project that would allow me to explore different aspects of python and force me to grow in as many areas as possible. Attending Interop sessions led by trailblazers like Jason Edelman, Jeremy Schulman, and Matt Oswalt definitely gave me ideas and inspiration to explore new areas and push boundaries, and they’ve really helped me grow as a coder/developer, not to mention helped me cement some opinions on the future of networking in general.

I did manage to get through a bunch of the R courses on Coursera and they were great. I’d love to say that I’m going to return and finish the specialization (I’m two courses short), but if I’m honest, I’m probably going to move towards the data science aspects of Python and get into Anaconda and pandas more in 2016. Nothing like combining two growth areas into one to really push the growth.

 

More so this year than others, having great conversations over beers, Moscow Mules, and many a sugary umbrella drink helped me expand my knowledge and firm up some of the thoughts I’ve been having for the last couple of years. No need to name drop, but you all know who you are, and I’d like to say thanks for all of the conversations and laughs. As much as I love the tech, it’s the people that always help drive me forward.

If I’m keeping score, I think that 2015 was a good year. 

 

Plans Plans Plans

Now comes time to publicly declare what I want to accomplish in 2016. This is always the scary part as I know I’m now publicly accountable for any grandiose designs I throw into the ether. 🙂 

 

Practice to Application:

 

I’ve got a bit of a lab at home. If things go as planned, at the end of the year I will be able to factory reset *almost* the entire lab and have it come back from the dead in a completely automated fashion. The plan here is to use a combination of python, jinja2, IMC, Ansible, and whatever other pieces I need to duct tape together to make this work by the end of the year. Just because I like to make my life harder than it has to be, I’m planning on building out the topology using vendor-independent methodologies, meaning that I want to be able to place a Cisco, HP, or Juniper box into any position in my lab (access/distribution/core/dmz/wan/etc…) and use a YAML file to dynamically build the required configurations on demand.
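To make that a little more concrete, here’s a minimal sketch of the general idea: a YAML file describing a device in vendor-neutral terms, rendered through a Jinja2 template. The keys and the template are illustrative assumptions on my part, not the actual schema I’ll end up with.

```python
import yaml
from jinja2 import Template

# Hypothetical vendor-neutral definition of one device in the lab.
device_yaml = """
hostname: lab-access-01
vendor: cisco
vlans:
  - id: 10
    name: users
  - id: 20
    name: servers
"""

# A tiny template; in practice there would be one template per vendor/role,
# selected by the 'vendor' key in the YAML.
config_template = Template("""hostname {{ hostname }}
{% for vlan in vlans -%}
vlan {{ vlan.id }}
 name {{ vlan.name }}
{% endfor %}""")

device = yaml.safe_load(device_yaml)
print(config_template.render(**device))
```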

 

Yeah… I know….   But it’s good to set goals right?

 

OpenSwitch

OpenSwitch is another area I will definitely be exploring during 2016. The project is still very new and definitely has some places, mostly in the documentation area, where there’s room to make a difference. I’ve been really lucky to work at a place where I have direct access to some of the project’s core developers, and I’m hoping I can share the fruits of that access in more blog posts, pull requests to enhance the documentation, and some interoperability testing with some of the usual-suspect network kit I already have in my lab. Right now, I’m thinking OSPF, BGP, and Spanning Tree as an unambitious start, then moving from there to the declarative interface and REST interfaces to see how I can incorporate it into project one.

 

 

Thoughts on 2015

2015 was a good year. As an industry, I think we’ve made some great gains in general. The whole “Is SDN really a thing?” conversation seems to be over, and the “I don’t care what you call it, what does it do for me?” conversations are starting to get really interesting. The projects with value are starting to separate themselves from the science fair exhibits, and it looks like parts of the networking profession are finally past the outright denial stage and have reached bargaining (“Can someone else write the scripts for me? Please?”).

I’ve been able to make forward momentum in all the areas I wanted to and I’m generally where I thought I was going to be at the start of the year.

 

Looking forward to 2016!

 

@netmanchris

XML, JSON, and YAML… Oh my!

I’m a network engineer who codes. Maybe even a network coder. Probably not a network programmer. Definitely not a programmer who knows networking. I’m in that weird zone where I’m enough of two things that don’t normally go together that it makes conversations with some of my peers awkward.

I had one such conversation today trying to explain the different data serialization formats in python and why, at the end of the day, they really don’t matter.

The conversation started with one of those “But they have an XML API!!!” comments, thrown out as a criticism of someone’s product. My response was something like “And why does that matter?”

The person who made the comment certainly couldn’t answer that question. It was just something they had read in a competitive deck somewhere.

I’m all about competing and trying to make sure that customers have the BEST possible information to make the best decisions for their particular requirements, but this little criticism was definitely not, IMHO, the best information. In fact, it was totally irrelevant. This post is my way of trying to explain why. Hopefully, it will help clear up some of the confusion around data structures and APIs and why they really don’t matter so much, at least not their formatting.

XML

You can read more about XML here. In a nutshell, XML uses tags, similar to HTML, to represent the different values in your data stream. The <item> tag opens up an item and </item> closes it, and what lives between the two is the value for that item. Take a look at the following XML output from the HP IMC NMS. I just cut and pasted this straight out of the API interface, so you should be able to do the same if you want to follow along at home. In this code, I have created a string called x and pasted in the XML-formatted text, which is a bunch of information about a Cisco 2811 router that lives in my lab. Pay attention to the values, as they will stay the same going through this exercise.
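Something along these lines. The element names and most of the values below are made up for illustration, not the exact fields the IMC API returns; only the Cisco 2811 and the 10.101.0.1 address are the ones I’ll refer to later.

```python
# A stand-in for the IMC API output: an XML string assigned to x.
x = """<device>
    <id>116</id>
    <label>Cisco2811.lab.local</label>
    <ip>10.101.0.1</ip>
    <mask>255.255.255.0</mask>
    <typeName>Cisco 2811</typeName>
    <status>1</status>
</device>"""
```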

XML is the oldest of the bunch, being a W3C recommendation in 1998. It’s important to note though that XML is still relevant, being the native data format of Netconf and still used in a lot of places. It’s old, but that doesn’t mean devoid of value.

Ordered Dictionary

A dictionary is a way of storing data in python that uses keys instead of an index to access the content or value of a specific piece of information you want. For example, item[‘ip’] would return “10.101.0.1” with a dictionary.

One of the “issues” with dictionaries is that they are unordered, which means there’s no guarantee that when you print out a dictionary the values will be in the same order. (Pretty obvious when you read the word “unordered”, I know.) The OrderedDict is a “dict subclass that remembers the order entries were added”. So we’re going to use a great little library called xmltodict, which takes the XML string (called x above) and transforms it into a python ordered dictionary. Now we can do interesting things to it in python. We can access the keys and get to the values directly. We can iterate over it because it’s one of python’s native data structures. It’s easy to use. People know it and understand it. It’s a good thing. Lists and dictionaries are the bread and butter of data structures in python. You need, need, need them.

In this code example, we’re going to take the XML string from above, run it through xmltodict to convert it to an ordered dictionary, and assign it to the variable y. Once I’ve got the ordered dictionary y, I could also use xmltodict to convert it back into XML with little to no effort. Cool?
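Roughly like this, assuming the x string from the snippet above:

```python
import xmltodict

# XML string -> OrderedDict. xmltodict.parse() returns an OrderedDict,
# so the keys come back in the same order they appeared in the XML.
y = xmltodict.parse(x)

print(y['device']['ip'])                      # '10.101.0.1'

# And back again: unparse() turns the OrderedDict back into XML.
print(xmltodict.unparse(y, pretty=True))
```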

JSON

JSON has become one of the standard ways to represent data between machines. It’s structured, well understood and it’s mostly human readable. A lot of “newer” systems now use JSON as the default data type. Most RESTful APIs for instance seem to have settled on JSON.

This is where things get interesting. Now that I’ve got XML in an ordereddict, I can use the JSON library to convert it to a JSON formatted string which I can then send along to any system that understands JSON. Or write it to a file, or just stare at those pretty, pretty braces.

Note: If I convert from JSON back to a python structure using the json.loads method, it will actually return a regular dictionary, not an Ordered Dictionary, so the values might appear out of order which COULD, in theory, cause issues with an upstream system, but I haven’t seen that in any of my work.
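Something like this, still working with the y ordered dictionary from the previous step:

```python
import json

# OrderedDict -> JSON string. indent=4 gives you the pretty, pretty braces.
json_str = json.dumps(y, indent=4)
print(json_str)

# JSON string -> plain dict (not an OrderedDict), as per the note above.
z = json.loads(json_str)
print(z['device']['ip'])                      # '10.101.0.1'
```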

YAML

Although JSON is “more” readable than XML, it’s still got all those braces and quotes to worry about. And so YAML was born. YAML is easily the most human-readable of the formats I’ve worked with. It uses white space, dashes, and asterisks to denote different levels of the data structure. It’s what’s commonly used with Jinja2 templating, Ansible, and the other cool buzzwords we’re all starting to play with.

Just like with the JSON example above, I can take the Ordered Dictionary and convert it to a YAML format (shown below ) and back again.  The yaml.load method does actually return an Ordered Dictionary.
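Here’s a rough sketch of that round trip with PyYAML. One assumption on my part: plain PyYAML doesn’t know how to represent an OrderedDict natively, so I convert to a regular dict before dumping.

```python
import yaml

# OrderedDict -> YAML. default_flow_style=False gives the block style
# (indentation and dashes) instead of inline braces.
yaml_str = yaml.dump({'device': dict(y['device'])}, default_flow_style=False)
print(yaml_str)

# YAML string -> python dict, and back to the values we started with.
back = yaml.safe_load(yaml_str)
print(back['device']['ip'])                   # '10.101.0.1'
```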

 

What’s my point?

So the original criticism was “But they have an XML API!!!”, right? Well, in these little code snippets, I just demonstrated how, using python and a couple of readily available libraries (pyyaml and xmltodict are not native python and must be installed), I was able to go from XML, to OrderedDict, to JSON, to YAML, with almost no effort. I could take any of these and convert it to something like a Python pickle, pull it back, and convert it to something else. It really doesn’t matter. I can go from one to another without much effort.

Personally, I don’t like working with XML. I can do it, but I would RATHER work with JSON. But that’s just my personal preference, there’s no technical reason why JSON is superior to XML that I can see. At least not in the implementations and the levels that I’m dealing with.

Just like Bilbo Baggins, I can go there and back again without worrying too much about the actual format in between, because when I’m doing something in python, I’m really looking to work with a native structure like a list or a dictionary anyway.

Anything I get from an external system, I’m just going to convert into a native python data type, munge away, then convert it back to whatever data format I need, be that JSON, XML, or YAML, and be on to the next task.

The actual data is what matters.

As long as it’s structured in a way that I can parse easily, I couldn’t care less how it comes in and how it goes out.

Don’t even get me started about simply wrapping CLI commands in XML…

Does that mean the format doesn’t matter at all?

No, I’m sure there are more experienced programmers who can tell horror stories about converting between different data formats, or about that time when this thing happened that caused this other thing to blow up. But for me, I’d much rather you had a well-structured API that gives me data in a way that I can easily access, convert to a format I can work with, and move on.

Hopefully, if you’ve made it to the end of this blog, you’ll agree that the actual format is much less important than you might once have believed. Disagree? Let me know in the comments below. Always looking to learn something, and in the coding realm, I know I’ve got a LOT to learn!!!

@netmanchris

 

Sometimes Size Matters: I’m sorry, but you’re just not big enough.

So now that I’ve got your attention… I wanted to put together some thoughts around a design principle of what I call the acceptable unit of loss, or AUL.

Acceptable Unit of Loss: def. A unit to describe the amount of a specific resource that you’re willing to lose

 

Sounds pretty simple doesn’t it?  But what does it have to do with data networking? 

White Boxes and Cattle

2015 is the year of the white box. For those of you who have been hiding under a router for the last year, a white box is basically a network infrastructure device, right now limited to switches, that ships with no operating system.

The idea is that you:

  1. Buy some hardware
  2. Buy an operating system license ( or download an open source version) 
  3. Install the operating system on your network device
  4. Use DevOps tools to manage the whole thing

Add ice, shake, and IT operational goodness ensues.

Where’s the beef?

So where do the cattle come in? Pets vs. Cattle is something you can research elsewhere for a more thorough treatment, but in a nutshell, it’s the idea that pets are something you love and care for and let sleep on the bed and give special treats to at Christmas. Cattle, on the other hand, are things you give a number, feed from a trough, and kill off without remorse if a small group suddenly becomes ill. You replace them without a second thought.

Cattle vs. Pets is a way to describe the operational model that’s been applied to the server operations at scale. The metaphor looks a little like this:

The servers are cattle. They get managed by tools like Ansible, Puppet, Chef, Salt Stack, Docker, Rocket, etc…, which at a high level are all tools that allow a new version of the server to be instantiated to a very specific configuration with little to no human intervention. Fully orchestrated.

Your server starts acting up? Kill it. Rebuild it. Put it back in the herd.

Now, one thing that a lot of enterprise engineers seem to be missing is that this operational model is predicated on your application having been built out with a well-thought-out scale-out architecture, one that allows the distributed application to continue to operate when the “sick” servers are destroyed and will seamlessly integrate the new servers into the collective without a second thought. Pretty cool, no?

 

Are your switches Cattle?

So this brings me to the Acceptable Unit of Loss. I’ve had a lot of discussions with enterprise focused engineers who seem to believe that Whitebox and DevOps tools are going to drive down all their infrastructure costs and solve all their management issues.

“It’s broken? Just nuke it and rebuild it!”  “It’s broken? grab another one, they’re cheap!”

For me, the only way this particular argument holds up is if their AUL metric is big enough.

To hopefully make this point I’ll use a picture and a little math:

 

Consider the following hardware:

  • HP C7000 Blade Server Chassis – 16 Blades per Chassis
  • HP 6125XLG Ethernet Interconnect – 4 x 40Gb Uplinks
  • HP 5930 Top of Rack Switch – 32 40G ports, but from the data sheet “ 40GbE ports may be split into four 10GbE ports each for a total of 96 10GbE ports with 8 40GbE Uplinks per switch.”

So let’s put this together


So we’ll start with

  • 2 x HP 5930 ToR switches

For the math, I’m going to assume dual 5930’s with dual 6125XLGs in the C7000 chassis, we will assume all links are redundant, making the math a little bit easier. ( We’ll only count this with 1 x 5930, cool? )

  • 32 x 40Gb ports on the HP 5930, minus the 8 x 40Gb ports reserved for uplinks, leaves 24 x 40Gb ports for connection to those HP 6125XLG interconnects in the C7000 Blade Chassis.
  • 24 x 40Gb ports from the HP 5930 will allow us to connect 6 x 6125XLGs, with all four 40Gb uplinks per interconnect in use.

Still with me? 

  • 6 x 6125XLGs means 6 x C7000, which then translates into 6 * 16 physical servers.

Just so we’re all on the same page, if my math is right: we’ve got 96 physical servers on six blade chassis, connected through the interconnects at 320Gb (4 x 40Gb x 2, remember the redundant links?) to the dual HP 5930 ToR switches, which will have 640Gb of bandwidth out to the spine (8 x 40Gb uplinks from each of the two HP 5930s).

If we go with a conservative VM to server ratio of 30:1,  that gets us to 2,880 VMs running on our little design. 
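If you want to check my math, here it is as a few lines of python:

```python
# Back-of-the-envelope numbers for the design above.
ports_per_5930 = 32
uplinks_per_5930 = 8
downlink_ports = ports_per_5930 - uplinks_per_5930    # 24 x 40Gb left for chassis

uplinks_per_6125xlg = 4
chassis = downlink_ports // uplinks_per_6125xlg        # 6 x C7000

servers = chassis * 16                                  # 96 blades
vms = servers * 30                                      # 2,880 VMs at 30:1

print(downlink_ports, chassis, servers, vms)
```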

How much can you lose?

So now is where you ask the question:  

Can you afford to lose 2,880 VMs? 

According to the cattle & pets analogy, cattle can be replaced with no impact to operations because the herd will move on without noticing. I.e., the Acceptable Unit of Loss is small enough that you’re still able to get the required value from the infrastructure assets.

The obvious first objection I’m going to get is

“But wait! There are two REDUNDANT switches right? No problem, right?”

The reality of most networks today is that they are designed to maximize network throughput and make efficient use of all available bandwidth. MLAG, in this case brought to you by HP’s IRF, allows you to bind interfaces from two different physical boxes into a single link aggregation pipe. (Think vPC, VSS, or whatever other MLAG technology you’re familiar with.)

So I ask you, what are the chances that you’re running those links at below 50% of the available bandwidth?

Yeah… I thought so.

So the reality is that when we lose that single ToR switch, we’re actually going to start dropping packets somewhere as you’ve been running the system at 70-80% utilization maximizing the value of those infrastructure assets. 

So what happens to a TCP-based application when we start to experience packet loss? For a full treatment of the subject, feel free to go check out Terry Slattery’s excellent blog on TCP Performance and the Mathis Equation. For those of you who didn’t follow the math, let me sum it up for you.

Really Bad Things.  

On a ten gig link, bad things start to happen at 0.0001% packet loss. 
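For the curious, here’s the rough shape of that math. The MSS and RTT below are assumptions I picked for a short data-centre path, and I’m using the simplified Mathis formula with the constant taken as roughly 1.

```python
import math

# Mathis et al. approximation for a single TCP flow:
#   throughput <= (MSS / RTT) * (C / sqrt(p)), with C ~ 1
mss_bytes = 1460          # assumed MSS
rtt = 0.001               # assumed RTT of 1 ms
loss = 0.000001           # 0.0001% packet loss

throughput_bps = (mss_bytes * 8 / rtt) * (1 / math.sqrt(loss))
print(throughput_bps / 1e9)   # ~11.7 Gbps, right at the edge of a 10G link
```

Push the loss rate much past that and a single flow can no longer fill the pipe, which is where the Really Bad Things start.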

Are your Switches Cattle or Pets?

So now that we’ve done a bit of math and metaphor, we get to the real question of the day: are your switches cattle? Or are they pets? I would argue that if you’re measuring your AUL in less than 2,000 servers, then your switches are probably pets. You can’t afford to lose even one without bad things happening to your network, and more importantly to the critical business applications being accessed by those pesky users. Did I mention they are the only reason the network exists?

Now, this doesn’t mean that you can’t afford to lose a device. It’s going to happen. Plan for it. Have spares, support contracts, whatever. But my point is that you probably won’t be able to go with a disposable infrastructure model like the one suggested by many of the engineers I’ve talked to in recent months about why they want white boxes in their environments.

Wrap up

So are white boxes a bad thing if I don’t have a ton of servers and a well-architected distributed application? Not at all! There are other reasons why white box could be a great choice for comparatively smaller environments. If you’ve got the right human resource pool internally, with the right skill set, there are some REALLY interesting things you can do with a white box switch running an OS like Cumulus Linux. For some ideas, check out this Software Gone Wild podcast with Ivan Pepelnjak and Matthew Stone.

But in general, if your metric for  Acceptable Unit of Loss is not measured in Data Centres, Rows, Pods, or Entire Racks, you’re probably just not big enough. 

 

Agree? Disagree? Hate the hand drawn diagram? All comments welcome below.

 

@netmanchris

IT Generalist or Network Specialist?

Very shortly after posting this blog, I was pleasantly surprised by a comment from Ethan Banks. Apparently my posting and his free time aligned. 🙂 One line of his comment really got me thinking about where I want to take my career in the future. You can read the whole comment, but the part that got me thinking was Ethan’s remark: “I’m looking into becoming more of an IT generalist in the long run.”

 

IT Generalist vs. Network Specialist

When I went back and read my post, I can definitely see where the IT generalist thought came from. There are a lot of different skills I’m trying to develop this year, and to be honest, there are probably a lot more that I’m going to have to gain before I can start developing in the areas I really want to go after. As an example, I just spent a few hours reading about Git.

Git was not on the list, but I’ve just spent a few very precious hours of my time on it because it’s almost unthinkable now to be doing any software development without using Git to share your work. It’s the standard, and I just don’t have a good enough working knowledge of repos, forks, merges, etc… It’s just not something I’ve had to pick up in my career yet. So although it’s not on my list, it’s a prerequisite for really being able to use the skills I want to develop.

So, back to the question: am I trying to become an IT generalist or a network specialist? I’m not quite sure I have an answer to that yet. But I think we need to first ask the question “what is the network?” before I can figure out an answer.

I think this was an easy question a few years ago, but it’s getting much, much harder to figure out where my areas of responsibility as a networking professional now end.

 

How many Tiers?

I was having a conversation with @networkstatic a couple of weeks ago and we were laughing about the whole idea of a two-tier network. Now, I work in marketing, and I understand that the point of the phrase “two-tier network” is really to communicate that there’s less hardware involved, and therefore less cost, less latency, less complexity, etc… But once the marketing is done and the POs have been cut, it becomes time to operate the network, and then the number of tiers is suddenly a whole different number.

Think about a typical connection from a user to an application. Let’s assume we’ve got a typical architecture where there’s a user on a tablet trying to access an application in a data centre somewhere. I’m sure someone is going to disagree, and we could easily expand the DMZ into a few more tiers, but for discussion purposes, just work with me here.

 

Network Tiers

Now by my count, this “two tier” network actually passes through thirteen different tiers where policy can be applied, any of which could potentially alter how the traffic flows through the network, which of course impacts the quality of the specific application the user is trying to access.

Arguably, the last OVS/Linux Bridge layer would not be there for a lot of typical networks, but with technologies like Docker and Rocket, I think we’re going to see that become more commonplace over the next year. This also doesn’t even factor in the whole overlay/underlay and VTEP mess that can also throw another wrench into the mix. But I think you get my point.

Closing the loop

So bringing this back to the opening question: if a networking professional has a responsibility to understand the end-to-end path and where issues may arise for a connection between a consumer and a service, it would seem that we need to develop skills across the entire stack. There are a lot of other places where a problem can arise: database seek times, noisy-neighbour issues in a storage array, badly coded applications, etc… I would not argue that a network engineer needs to be knowledgeable on the ins and outs of ALL the things that can go wrong, but I do believe we should be making a fighting attempt at understanding the parts that affect our craft.

In the long run, I think we’re going to continue to see networking divide into sub-professions that specialize into specific architectural blocks of the network. Although there may be a lot of common knowledge, I would also argue there’s also a lot of specialized knowledge that can only be gained through experience dedicated to one of these architectural blocks over a period of time.

Network Disciplines List

 

Then again, I might be over thinking the whole thing. It’s also possible that the network knowledge will start to become considered generalist knowledge.

What do you think? Post your comments below.

 

@netmanchris

Plans for 2015: Where to from here?

I know I’m a little bit late for New Year’s resolutions, but it’s been a tough decision-making process. There is so much going on right now in the networking industry and, to be honest, I’m not sure that networking is going to be a skill that will command the premium it has for the last 10-15 years. Don’t get me wrong, I’m not saying that networking is dead. In fact, just the opposite: networking is going to flourish. There is going to be so much networking that needs to be done that the only way we will be able to deal with it is to dump all of our collective knowledge into code and start to automate what would previously have been the domain of bit-plumbers like us.

 

What skills to pick up in 2015:

So the question: what skills am I looking at picking up in 2015? I am a huge believer in the infrastructure-as-code movement. Looking at where leaders like Matt Oswalt, Jason Edelman, Brent Salisbury, Dave Tucker, Colin McNamara, Jeremy Schulman, etc… are taking us, it’s obvious that coding skills are becoming mandatory for anyone in the networking field who wants to become, or remain, at the top of the field. That’s not to say that core networking skills are not going to be important, but I’m definitely branching out this year, trying to pick up another language as well as improve my chops with what I already know.

Increase Python Skills

As anyone who’s been here for the last year knows, I’ve been playing around with python a lot. I’m hoping that 2015 will allow me to continue to increase my python skills, specifically focused around networking, and I’m hoping I will have enough time to go from just learning to actually contributing some code back to the community. I’m signed up for Kirk Byers’ Python for Network Engineers course starting in January, as well as going through a few different books. Best of all, my 9 year old son has also shown some interest in learning to code, so this might actually become a father-son project.

I’m also hoping to get more involved with things like Ansible and Schprokits, as well as possibly releasing some of my own projects. Crossing my fingers on the stretch goals. 🙂

Gain Data Analysis Skills

Coursera is awesome. If you haven’t checked it out, you need to. You would have to be living under a rock, buried in a lead can, stored in a faraday cage at the bottom of the ocean not to have heard about SDN. I believe there’s an ENORMOUS opportunity within the networking space for applying data analysis techniques to the massive amounts of information that flow across our networks every day. There’s a Coursera Data Science Specialization that I’m signed up for that I’m hoping will start me down the path of being able to execute on some of the ideas I’ve had bouncing around in my skull for more than half a decade. I’m sure I will be blogging on the course, but you might have to wait for some of the ideas.

Virtualization-Ho!

Docker, Rocket, NSX, ESX, KVM, OVS. They are all going to get a little love this year from this guy. I’m not sure how much I’m going to be able to consume, but I believe these are all technologies that are going to be relevant in the coming years. I believe that containers are going to get a lot of love in the industry, and companies like http://www.socketplane.io are going to be something I’m watching closely.

Networking Networking Networking

This is my core knowledge set and, I believe, what will continue to be the foundation of my value for the foreseeable future. I hit my CCIE Emeritus this year and also had a chance to attend a Narbik bootcamp. It was an incredibly humbling experience and reminded me of how much there is still to learn in this space that I love. If you get a chance to attend a Micronics CCIE bootcamp, I couldn’t recommend it highly enough. There are very few people who understand and can TEACH this information at the level Narbik can. I’m actually planning on finding time to re-sit the bootcamp this year just to soak up more of the goodness.

 

Plans Plans Plans

2014 was a bit of a mess for me. But I think I still did fairly well in executing on gaining some of the programming skills that I wanted. 2015 is going to be a crazy time for the whole industry. I’m not sure which of these four areas is going consume the most of my time. The way our industry has been going, it’s entirely possible that I will fall in love with something else entirely. 🙂  

If at the end of 2015 I have managed to move forward in these four areas by at least a few steps, I think I will consider the year a success. 

 

What about you?

 

@netmanchris