Are comatose servers your next big IT headache?

Virtual machine sprawl might be costing your business tens of thousands of dollars. Here's how you can find out – and what to do about it.


Picture this. An executive at your organization gets an idea for a big project, one that adds a new product line to your company and could result in millions of additional dollars in revenue per year. The whole company is gung ho about this. The new mantra each workday is "what are we doing to advance Project X?" Cheers are sung each morning. And, of course, the IT team gets involved and spins up a number of servers, both physical and virtual, to help out the development team and put the new product or service into production.

There's just one thing: All of this happened in 2005. It is now 2015, a full decade later, and Project X has been replaced by Projects Y, Z and Omega. Omega is now hosted up in the cloud, either on Google Compute Engine, Amazon Web Services or Microsoft Azure. The executive who championed Project X in the first place is long gone, and the original IT team that set up all of the computing power for Project X has transitioned into other teams or out of the company.

Now answer me this: What is the disposition of all of those Project X servers?

It's 8 p.m. Do you know where your servers are?

You probably don't know if you're anything but the smallest of organizations. The authors of a new study on deprecated equipment agree. Jonathan Koomey of Stanford University teamed up with the Anthesis Group to study 4,000 servers and found that up to 30 percent of servers in datacenters are turned on, ready for service, and actively drawing power and consuming resources ... but are not actually doing anything. The study refers to these types of machines as comatose servers, in that the "bodies" are there and working and breathing but, like an unfortunate accident victim who is brain dead, the servers are not actually doing anything. Previous studies by TSO Logic and the Natural Resources Defense Council (NRDC) reported much the same findings.

To get a sense of the cost of the problem, think about how much you could save if you just turned off a third of the hardware that you manage: got rid of or re-used the licensing, unplugged the hardware, and liquidated the rest of it. It's a problem with an enormous cost, and even if the study is off by half and the real figure is 15 percent, that's still a significant cost.

Why does this happen? Fundamentally it comes down to the problem of not knowing what you have and what it is doing. It used to be a little easier to keep track of things because in order to roll out new servers, you had to requisition one, send a PO, receive it, inventory it and mark it, so at least you knew what type of silicon you had on your server closet racks. The operating system and software were another story, but at least you had a fighting chance.

Virtualization changed all that: now spinning up a new Web server to host one dinky little authentication task takes minutes and requires no input at all from finance. Virtual machine sprawl is a real problem, and while management software like System Center Virtual Machine Manager and similar VMware tools has long tried to help catalog and inventory virtual machines as well as make sense of how they're deployed, not every organization has either invested in such tools or is actively using them. What's more, it's incredibly easy to spin up new virtual machines to take over and consolidate tasks old virtual machines were handling, and it's perhaps even easier to forget to decommission the old virtual machine. Now you have three or four VMs for every physical server you used to have. What a nightmare.

And then there's the fact that business process owners do not always inform IT when things have changed or priorities have shifted. IT may be unaware that an outsourcer or third party has taken over a workload, especially if the project was only minimally staffed by your own IT team. New IT folks might be reluctant to turn off old servers, because there might be a process or dependent resource they don't even know about since they don't have the full institutional memory of the previous team.

The cloud does not solve this problem either; in fact, it might even make it worse. At first blush you might think comatose servers are the cloud provider's problem to work around ("scale up, guys!"), but remember who's ultimately footing the bill for that. Plus, unlike comatose servers in your datacenter, which only eat up power and network bandwidth, unused servers sitting active on a cloud solution platform like Azure or AWS are costing you hourly fees. A reasonably equipped virtual machine might run $0.50/hour, which doesn't sound like much until you realize that, at 24 hours a day and 365 days a year, it equates to burning $4,380 every year for every single cloud server you have running that's not doing anything. If anything, shutting those servers down is a quick way to reduce expenses and look great in the next budget review.
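To put your own numbers on that, here is a minimal Python sketch of the arithmetic. The $0.50 hourly rate comes from the example above; the function name and the fleet of 20 idle servers are hypothetical, and the calculation assumes each server runs around the clock for a non-leap year.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annual_idle_cost(hourly_rate: float, idle_servers: int = 1) -> float:
    """Yearly spend on cloud servers that run nonstop but do no work."""
    return hourly_rate * HOURS_PER_YEAR * idle_servers


# One idle VM at the article's example rate of $0.50/hour.
print(f"${annual_idle_cost(0.50):,.0f}")      # $4,380

# A hypothetical fleet of 20 idle VMs at the same rate.
print(f"${annual_idle_cost(0.50, 20):,.0f}")  # $87,600
```

Swap in your provider's actual hourly rate and your own count of suspect machines; real bills will also include storage and bandwidth, which this sketch leaves out.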

Solving the problem

What are some ways you can reduce the comatose servers in your organization? Ultimately, the solution is to know what you have and understand its lifecycle. Barring that, however, there are ways to get your head around the problem:

Use a free network scanning tool to get a sense of exactly what you have. This won't pick up everything, mostly because machines have different network security settings, but it will give you a good starting point and may jog your memory or the memory of your teammates about a group of machines that might still be around.

Consult with your finance department to see if you can get records of hardware and software purchases by year to piece together a history of machines. If you can grab MAC addresses off an invoice, or you know a particular line of business application was purchased for some group based on the expenditure justification report, it might be easier to track down those machines and find out if they are still around or not.

Pick low intensity, low usage periods during shoulder and off seasons in your business and just turn old machines off. Chances are, you'll have a group of machines (maybe a Virtual Server host and a bunch of guests from 2006) that are still on but that you suspect aren't doing anything. Wait until the week after Christmas, or Spring Break or (for universities) the interim period between terms, and just turn them off. See what happens. Note who complains.

Put procedures in place. IT should manage server requests with a justification, however brief, and an expected lifecycle. IT should know who owns what workloads and whom to contact each year so that an audit of necessary services can be performed. This should go for physical machines, virtual machines and cloud services too.
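The record-keeping described above does not need a big tool to get started. As a rough sketch in Python, assuming you keep one record per server with an owner, a justification, an expected end date and a last review date, you could flag machines that are past their lifecycle or overdue for the yearly audit, and spot hosts that a network scan found but nobody ever registered. Every name, field and sample value here is hypothetical, not taken from any particular product.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ServerRecord:
    name: str            # hostname or VM name
    owner: str           # person or team to contact at audit time
    justification: str   # brief reason the server was requested
    expected_end: date   # end of the expected lifecycle
    last_reviewed: date  # date of the most recent audit


def needs_review(record: ServerRecord, today: date, max_age_days: int = 365) -> bool:
    """True if the server has outlived its lifecycle or missed its yearly audit."""
    past_lifecycle = today > record.expected_end
    stale = today - record.last_reviewed > timedelta(days=max_age_days)
    return past_lifecycle or stale


def unknown_hosts(scanned: list[str], inventory: list[ServerRecord]) -> list[str]:
    """Hosts seen on the network that have no inventory record at all."""
    known = {record.name for record in inventory}
    return sorted(set(scanned) - known)


# Hypothetical sample data for illustration only.
inventory = [
    ServerRecord("projx-web01", "product team", "Project X web front end",
                 date(2008, 12, 31), date(2007, 6, 1)),
    ServerRecord("omega-api01", "platform team", "Omega API",
                 date(2017, 12, 31), date(2015, 3, 1)),
]
today = date(2015, 6, 1)

flagged = [r.name for r in inventory if needs_review(r, today)]
print(flagged)  # ['projx-web01']

scan_results = ["projx-web01", "omega-api01", "projx-db02"]
print(unknown_hosts(scan_results, inventory))  # ['projx-db02']
```

The flagged list tells you which owners to call; the unknown list is your starting set of comatose candidates.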