Scaling Cloud: Ops Team Manages Petabytes with Clear View

October 2, 2013
7 minute read

Handling scale is one of those topics that is an odd mixture of bragging and legitimate challenge. At Acquia, the Operations team is tasked with managing over 6,500 AWS instances, over 12,000 EBS volumes and over 5 million S3 objects. In the early days of growth you can keep track of this infrastructure with just a spreadsheet, but as you can imagine, the sprawl outgrows what a spreadsheet can handle once you get past a few dozen servers. It’s not just a matter of keeping track of the resources individually; you also need to keep track of the relationships between them.

This is a great problem to have as it speaks to the phenomenal growth and success we’ve had selling our services, but there comes a time when you look at the AWS Console (see below for just one of our accounts) and you realize that you need to seize control.

[Image: AWS Console view of one of our accounts]

When you start getting past even just a couple of AWS instances, you need to start thinking about the following:

1. How do I manage that many servers effectively and efficiently? Manual execution of the EC2 commands doesn’t cut it.
2. How do I understand (and control) my costs? In our case, our server footprint is just about doubling every year.
3. And, with a drumroll, how do I anticipate what my growth is going to be next year so our forecasts are accurate and aligned with the rest of the business?

To address the first one, we are fortunate at Acquia to have an extremely talented cloud engineering team that has built a great deal of automation around the deployment and management of our servers. For example, provisioning an entire application stack takes only a handful of commands, which provision and configure load balancers, web servers, database servers, storage, and development and staging environments. Historically, this enabled us to go from a starting point of nothing to a fully operational and tested cloud environment in less than an hour. Recent efforts by the engineering team have cut that time by 25%, and there’s more coming. The key message here isn’t how awesome Acquia is, but that any organization dealing with more than a few dozen servers needs to start looking at tools (Puppet, Chef, whatever) that will enable it to move quickly.

Managing costs is another tricky one, and something we need to keep a very close eye on to make sure things don't spin out of control financially. Knowing where the growth is coming from, and ensuring that the numbers and types of servers we are spinning up make sense, requires an ability to look at things from both the 10-inch and the 10,000-foot perspective. To help us manage this, we’ve partnered with a company called CloudHealth Technologies, who are truly amazing when it comes to putting some sanity and perspective around our thousands of servers and petabytes of data. For example, take the image above showing one AWS account’s “dashboard” and now view it through the CloudHealth lens:

I now have a high-level snapshot of where I am and my overall status (spend, usage, and changes in server counts and distributions). Based on this dashboard, I'm able to see changes both large and small that may make me go "huh" and research why things are going up and down in surprising ways. Amazon can provide a very detailed view of the "now", and that's quite useful. However, if you want a historical perspective, you either need to find the right partner or build your own systems.

That leads to the last consideration: predicting future growth. We’ve probably all heard the famous quote from George Santayana, "Those who cannot remember the past are condemned to repeat it," and it is extremely applicable to handling scale. The other axiom, for the IT folks in the audience, is “If you can’t measure it, you can’t manage it.” Once again we use CloudHealth to help us with this, as we have close to a year’s worth of readily available historical change data to review as part of our annual forecasting effort.
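To make the forecasting arithmetic concrete, here is a minimal sketch of projecting a server count forward from two historical data points, assuming constant compound growth. This is not how CloudHealth or our own forecasting works; the function name `project_count` and the starting figure of 3,250 instances are hypothetical, chosen only to match a footprint that roughly doubles in a year.

```python
from datetime import date


def project_count(start_count, end_count, start_day, end_day, target_day):
    """Project a count forward, assuming constant compound daily growth."""
    elapsed = (end_day - start_day).days
    if elapsed <= 0 or start_count <= 0:
        raise ValueError("need a positive start count and a later end date")
    # Daily growth factor implied by the two observed counts.
    daily_rate = (end_count / start_count) ** (1 / elapsed)
    horizon = (target_day - end_day).days
    return end_count * daily_rate ** horizon


# Hypothetical numbers: footprint doubled from 3,250 to 6,500 in one year.
projected = project_count(
    3250, 6500,
    date(2012, 10, 2), date(2013, 10, 2),
    date(2014, 10, 2),
)
print(round(projected))
```

Two data points give you only a straight-line (in log terms) guess. The value of daily history is that you can check whether the growth rate itself is changing before you commit to a number.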

Historically, we had only high-level data around our growth. Essentially, we had good data on our server breakdowns at the start and the end of the year, but fine-grained information was not at our fingertips. That’s changed: we now know on a daily basis (and we could get more detailed than that if we wanted) exactly what our “base” looks like, and we can correlate that against sales activities as well as events, such as some of the scale-up scenarios I’ve discussed in prior blogs. One last picture (I promise) of a subset of our environment shows how visualizing this makes this whole effort better.

Without this knowledge, understanding the drivers of storage and server growth and aligning them to, for example, quarterly sales numbers is all but impossible. Using AWS tagging, you can understand not only changes in count but also the nature of the change, and understanding “why” is a big part of the scalability battle. If you don’t understand why you have more or less of something, your forecasts are going to be significantly harder to develop, much less justify.
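As a rough illustration of what tagging buys you, the sketch below compares two inventory snapshots and reports how the count changed per tag value. The snapshot format (a list of dicts with a `tags` mapping), the `role` tag, and the instance IDs are all assumptions made up for this example; in practice the data would come from your own inventory export or tooling.

```python
from collections import Counter


def count_by_tag(instances, tag_key):
    """Count instances per value of one tag; untagged ones are grouped together."""
    return Counter(
        inst.get("tags", {}).get(tag_key, "untagged") for inst in instances
    )


def tag_delta(before, after, tag_key):
    """Return {tag_value: change in count} between two snapshots, skipping zeros."""
    old = count_by_tag(before, tag_key)
    new = count_by_tag(after, tag_key)
    return {
        value: new[value] - old[value]
        for value in sorted(set(old) | set(new))
        if new[value] != old[value]
    }


# Hypothetical snapshots taken on two different days.
before = [
    {"id": "i-1", "tags": {"role": "web"}},
    {"id": "i-2", "tags": {"role": "db"}},
]
after = before + [
    {"id": "i-3", "tags": {"role": "web"}},
    {"id": "i-4", "tags": {"role": "web"}},
    {"id": "i-5", "tags": {}},  # launched without a role tag
]

print(tag_delta(before, after, "role"))
```

A raw count would only say "three more servers." The per-tag view says two of them are web servers and one was launched without a tag, which is the kind of "why" you can line up against a sales event or a scale-up.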

Successfully scaling Acquia Cloud to meet the demands of an ever-changing market takes a talented team and partners such as CloudHealth, who together ensure that our customers can continue to innovate and create great digital experiences without missing a beat.