Most weeks I consult with large clients who are ready to host or tune Drupal. Many of these clients are large Java/Oracle shops, some are non-tech companies just looking to build internal knowledge, and still others love open source and are simply new to PHP/MySQL applications. In any of these cases, they're often familiar (maybe in name alone) with most of the Drupal infrastructure buzzwords - Varnish, Memcached, Redis, APC, XHProf, etc. Most clients are excited and ready to hit the ground running and build out their screaming-fast web sites on their newly procured, enterprise-level hardware.
As part of our infrastructure practice here in Professional Services, we go way past performance and look at what it really means to have a successful infrastructure - reliability, stability, consistency, and, maybe most importantly, determinability. What if your site performs exceptionally well, but there are unexpected hiccups every Tuesday morning at 7am EST? None of these performance tools will provide any answers. Questions you should be asking yourself - Is this happening on all of the servers? Are we sure each server is seeing the same issue? What is this hiccup doing - is it crashing MySQL or just Apache? When did this problem actually start, and has it been getting worse or better? None of these questions is about performance, and answering them, or better yet predicting the problems behind them, takes the knowledge and guidance of infrastructure experts.
In a series of subsequent blog posts, we'll cover a variety of answers to these questions and look at what value they provide to organizations looking to host Drupal at a truly "enterprise" level. That includes an in-depth look at each of the four tenets of a successful infrastructure mentioned above -
- Reliability does not just mean an infrastructure that's always available. More generally, it means your systems behave in a predictable way, in terms of both uptime and performance. We will look at ways to plan baselines and acceptable ranges, as well as methods for verifying that those targets are being met and can be sustained over the long term.
- Building on the concept of reliability, cases where small problems do occur must also be accounted for. Stability is actually a piece of an overall reliability strategy. Specifically, stability refers to the ability to keep working transparently in the event of unexpected behavior or outages. Rather than focusing on a mean uptime number, we will focus on creating testing scenarios to ensure all possible outages can be safely handled without customer impact.
- The underlying foundation of a reliable infrastructure is the concept of verifiable consistency. Absolute consistency makes management easier. No single part of the solution should be disproportionately utilized. This significantly reduces the possibility of unforeseen problems and facilitates provisioning and growth.
- The most overlooked, and I believe most valuable, tenet is determinability. When a problem does arise, this ensures that we can quickly and definitively understand what has happened and what the current status is. Whether we are troubleshooting a current issue, performing a post mortem on yesterday's issues, or looking at long-term historical data, the right information should be readily available in order to make any necessary adjustments and improvements.
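
To make the idea of baselines and acceptable ranges a little more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration only: the server names, the response times in milliseconds, the function names, and the choice of "mean plus or minus three standard deviations" as the acceptable range. A real plan would pick its own metrics and thresholds.

```python
from statistics import mean, stdev


def acceptable_range(baseline_samples, tolerance=3.0):
    """Return (low, high) as mean +/- tolerance * sample standard deviation.

    The tolerance of 3.0 is an illustrative assumption, not a recommendation.
    """
    center = mean(baseline_samples)
    spread = stdev(baseline_samples)
    return center - tolerance * spread, center + tolerance * spread


def out_of_range(samples_by_server, low, high):
    """Map each server name to the samples that fall outside [low, high]."""
    flagged = {}
    for server, samples in samples_by_server.items():
        outliers = [s for s in samples if s < low or s > high]
        if outliers:
            flagged[server] = outliers
    return flagged


# Hypothetical page response times in milliseconds from a known-good period.
baseline = [200, 210, 190, 205, 195, 200, 198, 202]
low, high = acceptable_range(baseline)

# Hypothetical recent measurements, kept per server so we can tell whether
# every server sees the same issue or only one of them does.
recent = {"web1": [201, 199, 207], "web2": [203, 480, 196]}
flagged = out_of_range(recent, low, high)
print(flagged)
```

Keeping the samples per server is the point of the sketch: it speaks directly to questions like "is this happening on all of the servers?" rather than only reporting that something, somewhere, was slow.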