Sometimes Size Matters: I’m sorry, but you’re just not big enough.

So now that I’ve got your attention… I wanted to put together some thoughts around a design principle I call the acceptable unit of loss, or AUL.

Acceptable Unit of Loss: def. The amount of a specific resource that you’re willing to lose.

 

Sounds pretty simple, doesn’t it? But what does it have to do with data networking?

White Boxes and Cattle

2015 is the year of the white box. For those of you who have been hiding under a router for the last year, a white box is basically a network infrastructure device, right now limited to switches, that ships with no operating system.

The idea is that you:

  1. Buy some hardware
  2. Buy an operating system license ( or download an open source version) 
  3. Install the operating system on your network device
  4. Use DevOps tools to manage the whole thing

Add ice. Shake, and IT operational goodness ensues.
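To make that workflow a little more concrete, here’s a minimal sketch of steps 3 and 4 in Python. Everything in it (the switch names, image URL, and helper functions) is hypothetical; a real shop would use ONIE plus a tool like Ansible or Puppet for this.

```python
# Hypothetical sketch of the white box workflow above. The switch names,
# image URL, and helpers are invented for illustration only.

SWITCHES = ["tor-01", "tor-02", "tor-03"]
NOS_IMAGE = "http://repo.example.com/images/network-os-latest.bin"


def install_nos(switch: str, image_url: str) -> None:
    """Step 3: load the network OS onto the bare-metal switch."""
    # In real life, ONIE or a vendor installer does this part.
    print(f"{switch}: installing {image_url}")


def apply_baseline_config(switch: str) -> None:
    """Step 4: hand the box over to your DevOps tooling."""
    # Ansible/Puppet/Chef would push a templated config from source control.
    print(f"{switch}: applying baseline config")


for switch in SWITCHES:
    install_nos(switch, NOS_IMAGE)
    apply_baseline_config(switch)
```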

Where’s the beef?

So where do the cattle come in?  Pets vs. Cattle is something you can research elsewhere for a more thorough treatment, but in a nutshell, it’s the idea that pets are things you love and care for and let sleep on the bed and give special treats to at Christmas.  Cattle, on the other hand, are things you give a number, feed from a trough, and kill off without remorse if a small group suddenly becomes ill. You replace them without a second thought.

Cattle vs. Pets is a way to describe the operational model that’s been applied to server operations at scale. The metaphor looks a little like this:

The servers are cattle. They get managed by tools like Ansible, Puppet, Chef, SaltStack, Docker, Rocket, etc., which at a high level all allow a new copy of a server to be instantiated with a very specific configuration and little to no human intervention. Fully orchestrated.

One of your servers starts acting up? Kill it. Rebuild it. Put it back in the herd.
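In Python-flavoured pseudocode, the whole cattle model is basically this loop. All three functions here are stubs standing in for whatever your monitoring and orchestration tools actually do:

```python
# A minimal sketch of the cattle model. All three functions are stubs
# standing in for your monitoring and orchestration tooling.

def is_healthy(server: str) -> bool:
    """Ask the monitoring system whether this server is behaving."""
    return True  # stub: wire this up to your NMS

def destroy(server: str) -> None:
    """Kill the sick animal. No remorse."""
    print(f"destroying {server}")

def provision_replacement() -> str:
    """Instantiate a fresh server from the known-good image and config."""
    return "web-new"  # stub: your orchestration tool returns the new name

herd = ["web-01", "web-02", "web-03"]

for server in list(herd):
    if not is_healthy(server):
        destroy(server)
        herd.remove(server)
        herd.append(provision_replacement())  # back in the herd
```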

Now, one thing that a lot of enterprise engineers seem to be missing is that this operational model is predicated on the fact that your application has been built out with a well-thought-out scale-out architecture, one that allows the distributed application to keep operating when the “sick” servers are destroyed and seamlessly integrates the new servers into the collective without a second thought. Pretty cool, no?

 

Are your switches Cattle?

So this brings me to the Acceptable Unit of Loss. I’ve had a lot of discussions with enterprise-focused engineers who seem to believe that white box and DevOps tools are going to drive down all their infrastructure costs and solve all their management issues.

“It’s broken? Just nuke it and rebuild it!”  “It’s broken? grab another one, they’re cheap!”

For me, this particular argument only holds up if your AUL metric is big enough.

To hopefully make this point I’ll use a picture and a little math:

 

Consider the following hardware:

  • HP C7000 Blade Server Chassis – 16 Blades per Chassis
  • HP 6125XLG Ethernet Interconnect – 4 x 40Gb Uplinks
  • HP 5930 Top of Rack Switch – 32 x 40Gb ports, but from the data sheet: “40GbE ports may be split into four 10GbE ports each for a total of 96 10GbE ports with 8 40GbE Uplinks per switch.”

So let’s put this together

[Hand-drawn diagram: dual HP 5930 ToR switches connecting down to HP 6125XLG interconnects in the C7000 blade chassis]

So we’ll start with:

  • 2 x HP 5930 ToR switches

For the math, I’m going to assume dual 5930s with dual 6125XLGs in each C7000 chassis, and we will assume all links are redundant, making the math a little bit easier. ( We’ll only count this with 1 x 5930, cool? )

  • 32 x 40Gb ports on the HP 5930 – 8 x 40Gb ports reserved for uplinks = 24 x 40Gb ports for connection to those HP 6125XLG interconnects in the C7000 Blade Chassis.
  • 24 x 40Gb ports from the HP 5930 will allow us to connect 6 x 6125XLGs, each using all four of its 40Gb uplinks.

Still with me? 

  • 6 x 6125XLGs means 6 x C7000, which then translates into 6 x 16 physical servers.

Just so we’re all on the same page, if my math is right: we’ve got 96 physical servers on six blade chassis, connected through the interconnects at 320Gb each ( 4 x 40Gb x 2 – remember the redundant links? ) to the dual HP 5930 ToR switches, which in turn have 640Gb of bandwidth out to the spine ( 8 x 40Gb uplinks from each HP 5930 ).

If we go with a conservative VM-to-server ratio of 30:1, that gets us to 2,880 VMs running on our little design.
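If you’d rather check my arithmetic in code than on a napkin, the whole design fits in a few lines of Python:

```python
# The back-of-the-napkin math from above, counting one side of the
# redundant design as agreed.

ports_5930 = 32                              # 40Gb ports per HP 5930
uplinks_5930 = 8                             # 40Gb ports reserved for the spine
downlinks = ports_5930 - uplinks_5930        # 24 ports toward the chassis

uplinks_6125 = 4                             # 40Gb uplinks per 6125XLG
chassis = downlinks // uplinks_6125          # 6 C7000 blade chassis

blades_per_chassis = 16
servers = chassis * blades_per_chassis       # 96 physical servers

vm_ratio = 30                                # conservative VMs per server
vms = servers * vm_ratio                     # 2,880 VMs

chassis_bw = uplinks_6125 * 40 * 2           # 320Gb per chassis (redundant links)
spine_bw = uplinks_5930 * 40 * 2             # 640Gb to the spine (both 5930s)

print(servers, vms, chassis_bw, spine_bw)    # 96 2880 320 640
```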

How much can you lose?

So now is where you ask the question:  

Can you afford to lose 2,880 VMs? 

According to the cattle & pets analogy, cattle can be replaced with no impact to operations because the herd will move on without noticing. I.e., the Acceptable Unit of Loss is small enough that you’re still able to get the required value from the infrastructure assets.

The obvious first objection I’m going to get is

“But wait! There are two REDUNDANT switches right? No problem, right?”

The reality of most networks today is that they are designed to maximize network throughput and make efficient use of all available bandwidth. MLAG, in this case brought to you by HP’s IRF, allows you to bind interfaces from two different physical boxes into a single link aggregation pipe. ( Think vPC, VSS, or whatever other MLAG technology you’re familiar with. )

So I ask you: what are the chances that you’re running that pair at below 50% of the available bandwidth?

Yeah… I thought so.

So the reality is that when we lose that single ToR switch, we’re actually going to start dropping packets somewhere, because you’ve been running the system at 70-80% utilization to maximize the value of those infrastructure assets.
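To put rough numbers on that, assume the pair is carrying 75% of the combined 640Gb from the design above when one ToR dies. The utilization figure is just an assumption for illustration:

```python
total_bw = 640           # Gb across both 5930s (from the design above)
utilization = 0.75       # assumed: where most shops actually run

offered = total_bw * utilization    # 480 Gb of traffic in flight
surviving = total_bw / 2            # 320 Gb left when one ToR dies

print(f"{offered - surviving:.0f} Gb with nowhere to go")   # 160 Gb
```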

So what happens to a TCP-based application when we start to experience packet loss? For a full treatment of the subject, feel free to go check out Terry Slattery’s excellent blog on TCP Performance and the Mathis Equation. For those of you who don’t want to follow the math, let me sum it up for you.

Really Bad Things.  

On a ten gig link, bad things start to happen at 0.0001% packet loss. 
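If you want to poke at the numbers yourself, here’s the Mathis et al. estimate in a few lines of Python. The MSS and RTT values are example assumptions on my part; the takeaway is how fast the throughput ceiling collapses as loss climbs:

```python
from math import sqrt

def mathis_max_bps(mss_bytes: int, rtt_s: float, loss: float) -> float:
    """Mathis et al. ceiling on a single TCP flow: (MSS / RTT) * (1 / sqrt(p))."""
    return (mss_bytes * 8 / rtt_s) / sqrt(loss)

# Assumed: 1460-byte MSS and a 1 ms RTT inside the data centre.
for loss_pct in (0.0001, 0.001, 0.01):
    bps = mathis_max_bps(1460, 0.001, loss_pct / 100)
    print(f"{loss_pct}% loss -> {bps / 1e9:.2f} Gbps ceiling")

# 0.0001% loss -> 11.68 Gbps (barely enough to fill a 10Gb link)
# 0.001%  loss ->  3.69 Gbps
# 0.01%   loss ->  1.17 Gbps
```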

Are your Switches Cattle or Pets?

So now that we’ve done a bit of math and metaphor, we get to the real question of the day: Are your switches cattle? Or are they pets? I would argue that if you’re measuring your AUL in less than 2,000 servers, then your switches are probably pets. You can’t afford to lose even one without bad things happening to your network, and more importantly to the critical business applications being accessed by those pesky users. Did I mention they’re the only reason the network exists?

Now, this doesn’t mean that you can’t afford to lose a device. It’s going to happen. Plan for it. Have spares, support contracts, whatever. But my point is that you probably won’t be able to go with the disposable infrastructure model that’s been suggested by many of the engineers I’ve talked to in recent months as a reason they want white boxes in their environments.

Wrap up

So are white boxes a bad thing if I don’t have a ton of servers and a well-architected distributed application? Not at all! There are other reasons why white box could be a great choice for comparatively smaller environments. If you’ve got the right human resource pool internally with the right skill set, there are some REALLY interesting things that you can do with a white box switch running an OS like Cumulus Linux. For some ideas, check out this Software Gone Wild podcast with Ivan Pepelnjak and Matthew Stone.

But in general, if your metric for Acceptable Unit of Loss is not measured in Data Centres, Rows, Pods, or Entire Racks, you’re probably just not big enough.

 

Agree? Disagree? Hate the hand drawn diagram? All comments welcome below.

 

@netmanchris

Rethinking the UPoE value proposition

First, full disclosure: I am an HP Networking employee. All of the opinions, comments, and general snarkiness in this blog are my own, though. I am writing this from my own personal perspective, not as an HP employee, but I think it's important that anyone reading this knows that I do have some skin in the game, at least in the big picture.

 

So a couple of weeks ago, I was having a conversation with someone at an HP event about VDI, UPoE, thin clients, etc., and I said “Yes! We've been talking to customers about the total solution for months!”

Not many people realize how truly broad the HP portfolio is when you look at the entire company. So we have been talking for months about the ability to put together a complete VDI solution from HP.

Basically, you pick your flavour of virtualization and then pick the appropriate Virtual System configuration. For those of you who don't know, Virtual System is an HP validated configuration built specifically for different virtualization workloads. You do have options: either XenServer, Hyper-V, or VMware View.

Then you choose the appropriate HP Networking switch for your infrastructure, and finally you just attach one of the HP thin clients to connect your users to their applications.

So what does this have to do with rethinking the value prop of UPoE? When I first saw the 60-watt-per-port blades that Cisco released for the 4500E last year, I thought:

“Wow… I wonder how hot those cables will be?”

After I got past that thought, I started thinking about what applications or devices would start to appear in the market to take advantage of these new capabilities. There were some examples out there, but I've noticed something interesting in the last year: devices are using LESS power, not MORE power.

Do you remember when 802.11n access points first came out? They were one of the first devices that actually justified powering up to 802.3at. If you wanted 11n, you needed either power injectors or 802.3at switches. Fast forward, and today you can buy 3x3 MIMO access points with 3 spatial streams that will work on 802.3af PoE ports @ 15.4 watts or less. That's right: they will work on the same switches that you've probably had for years. No need to upgrade your infrastructure to support a new device. Just buy the new access points, get more throughput on your wireless, and life is good.

The HP t410 All-in-one Thin Client

So a couple of weeks later, I was invited to a meeting with someone from HP's Personal Systems Group to talk about how we had been evangelizing the products, and then, amazingly… he offered me an HP t410 AIO unit to play with!

I, of course, said

Heck yeah!!!

One week, a couple of customer meetings, and a skeptical Twitter conversation later, it seems there's a lot of interest in the t410 at the moment. Mostly around the disbelief that anyone could get an all-in-one thin client to actually run below 15.4 watts!

So I have collected some pictures of the experience to show you how easy this thing was to set up: SILLY easy. I didn't include a picture of the box, but I think we've all seen an 18″ monitor, and the link above also has some nice pictures of the unit. It's got a small footprint and a nice screen.

So without further ado…

1) Here's a picture of the Unit's Model Name. ( This was the last picture I took, but it's the one I have with the model name ).

[Photo: the unit's model name label]

2) After I took it out of the box, I plugged it into an old 3Com 4120 9-port PoE switch ( it's what I had ).

I got the following login page. From what I've read, if I had a “real” VDI solution broadcasting its services, it would automatically detect the connection type and connect to the server broadcasting the available sessions. No VDI in my home lab ( yet ) though, so I get to manually select which type of VDI I would like to connect to. ( I chose RDP7 for a Windows 2008 R2 server. )

[Photo: the connection-type selection screen]

3) It now prompts me for the Server name or address.

[Photo: the server name/address prompt]

4) I put in my username and password ( I didn't need the other options ), and seconds later, I'm logged in.

[Photo: the login screen]

 

Pretty cool, right? ( I'll save you the screen cap of a Windows Server desktop. ) I didn't get to test out the internal speakers, since the VM I was connecting to had no sound card.

 

So what about the PoE part? This is the awesome part.

[Screenshot: the switch's PoE power readings for the t410 port]

 

yup. That's right 10.6 watts while fully operating. Max of 13 watts, Average of 10.9 watts. Can you see why I question UPoE? Somehow the guys in the PC division at HP actually managed to put together a full all-in-one thin client with monitor and left JUST enough power for the keyboard, mouse, and the speakers as well ( I presume on the last one, never tested it ).

Caveats

Are there tradeoffs here? Of course! I've only been playing with this for a few hours now, but so far, it's great. No issues at all. According to the data sheet, there are a few things you will sacrifice in PoE mode, though.

Specifically, there's the speed drop from Gig to 10/100. But in the case of a thin client, most of the streams are less than 2Mb +/-, so the whole speed drop is PROBABLY not going to cause anyone any issues.

The other thing, which I haven't experienced, is that the screen brightness will actually come down if there's not enough power budget left on the switch to fully power the unit.

 

Final Thoughts

This is a nice unit. It's got a small footprint. Nice screen. The out-of-box setup was extremely easy, and the fact that it only draws 13 watts of power ( I'm using the max draw value I saw ) is absolutely AMAZING to me. It would have been easy for HP to keep making thin clients that consumed more and more power to try and drive customers into purchasing new switches. Instead, HP threw some engineers at the problem and came out with a product that will work in customers' existing environments without a costly upgrade.

As an HP Networking pre-sales engineer, I have to say it would be nice to have another reason for our customers to upgrade their switches, but as a human being, it makes me proud to work for a company that does the right thing for its customers and the environment.

 

 

Troubleshooting Performance Issues in a Virtual Environment

So today I got to sink my teeth into a good problem. Performance issues in a virtual environment.

I have to say, this is probably the first time in my career where I walked in and didn’t have to prove it wasn’t the network. The customer was prepared. He had his NMS tools in place ( Cacti ) and had been trending various points in the network over a period of time.

Of course we started at the 101 stage and looked at counters, and when I said “Hey, you have some issues on your ASA,” he pulled up the Cacti graph and said “Yeah, that’s an offsite backup that runs at midnight. We know about it, and it’s fine with us.”

Can I say it out loud?  

WOW.

A lot of the customers I see are SMB/SME customers ( I am in Canada, remember? ), and although it’s uncommon to find a network with NMS tools in place, it’s even rarer to find one where they’re actually being used!

I got called onsite to help out with some performance issues. The nice thing is that it was not the network, at least not yet. ( Until we’re 100% sure, I’m not going to discount anything, right? ) But we DO need to figure out where to start targeting our efforts.

This is one of the problems I’m starting to see more and more of. Hard to troubleshoot anything when it’s in the cloud.

[Clip art: a cloud]

Picture Courtesy of Microsoft’s Online ClipArt Gallery.

 

No idea where the apps live in that picture, right? This gets even more interesting when you have VDI accessing virtual applications and start having performance issues on the client side. 

I know I’m going to get some snickers from this one, but my suggestion to deal with this is to create application flow maps to document how a complete transaction is made in a multi-tiered application. 

I know… “We can’t get them to create Visios for the networks they already have, and you’re suggesting we ask them to create more?”

Yeah… I know. But I can dream, right?

 

So let’s look at the following VDI multi-tiered application. This is pretty simple, right?

1) A client workstation connects to a Citrix Server over ICA or RDP.  

2) The Citrix server browses to a web app on a web host.

3) The web host connects to a remote MS SQL database, which returns the results to the web host.

4) etc… 

 

[Diagram: the logical application flow map – client → Citrix server → web host → SQL database]

Can’t get much easier than this, right? The great thing is that it becomes fairly easy to overlay this onto the virtual environment, which starts to give you a better idea of how the application is currently instantiated in the physical/virtual environment.

Let’s look at the above example installed in a blade server environment where the three parts of this particular app flow lives on three different blades in three different chassis. 

[Diagram: the same flow map overlaid on three blade chassis]

As you can see, from a performance troubleshooting standpoint we just went from three points to check ( let’s throw out the client, as that’s just screen caps ) to twenty-one points, without counting the network devices that provide connectivity between the blade chassis.

Although you can create affinity rules between VMs to ensure they are located on the same physical hypervisor host to avoid performance issues, we all know that people make mistakes. By creating the application flow map and applying it to the physical environment, you can start looking at only the specific devices which are actually involved in your specific performance issue.
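If a proper Visio is a dream too far, even a tiny data structure beats nothing. Here’s a hypothetical sketch of the flow map above overlaid on the physical environment; every VM, host, and chassis name is invented:

```python
# A hypothetical application flow map, hop by hop. All names are invented;
# the point is recording where each tier of the app actually lives today.
FLOW_MAP = [
    {"tier": "citrix", "vm": "ctx-vm-07", "host": "esx-03", "chassis": "c7000-1"},
    {"tier": "web",    "vm": "web-vm-12", "host": "esx-08", "chassis": "c7000-2"},
    {"tier": "mssql",  "vm": "sql-vm-02", "host": "esx-14", "chassis": "c7000-3"},
]

def checkpoints(flow_map):
    """List only the devices actually involved in this app's path."""
    points = set()
    for hop in flow_map:
        points.update((hop["vm"], hop["host"], hop["chassis"]))
    return sorted(points)

print(checkpoints(FLOW_MAP))  # nine places to look, instead of the whole farm
```

Keep it in source control next to the network diagrams, and update it whenever the affinity rules change.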

Last, but not least, I would also suggest you have on hand the storage flow maps, both for the specific application and for the relationship between the physical hypervisor hosts and their storage arrays.

[Diagram: storage flow map – hypervisor hosts to their storage arrays]

I’m not a storage expert, but I’ve heard my storage buddies tell stories of database and VDI LUNs thrashing on the same physical disks, stories that had obviously left them with nightmares for weeks.

 

Anyone have any tricks or suggestions for troubleshooting application performance issues in highly virtualized environments? As we move towards “THE CLOUD,” I don’t see this getting any easier.

Let me know how you’re approaching these problems! I’d love to see a better approach!