Infrastructure as Code, Made Simple
Using a source control system to manage your code is perhaps the most important rule in software development, for a simple reason: when everything is versioned, it is easier to track contributions from multiple people, and easy to roll back to a previous state.
Over the past few years, more and more people have been applying this principle to infrastructure: couldn’t we describe our infrastructure in version-controlled configuration files, and have something manage the whole setup automatically behind the scenes? The key benefit is a versioned infrastructure, just like software.
This declarative approach to infrastructure is one of the main selling points of frameworks such as Kubernetes. Unfortunately, Kubernetes also comes with features such as total hardware abstraction, designed with multi-server distribution in mind. Those features necessarily bring a lot of concepts and abstractions, a lot of source code, a steep learning curve, and significant maintenance costs (unless you’re willing to go with fully hosted solutions such as GKE, but that’s another story).
But do you need total hardware abstraction? Unless your web application has tens of thousands of users, you can probably get away with a very simple infrastructure. If your use case is batch computing (e.g. for science), you can probably wait until you have a few hundred nodes before considering total hardware abstraction.
I recently set up the infrastructure of an AI startup that uses deep learning neural networks to automatically extract information from satellite images. The startup’s web application currently has fewer than 1,000 users (it’s B2B), and the computing cluster in charge of running the deep learning algorithms currently handles 20–25 prediction jobs per day. Here is how we achieved our objective of Infrastructure as Code, in a very simple fashion.

We have a good bare-metal central server (16 threads, 32 GB RAM) running Ubuntu. We install a firewall and a VPN, and everything else (Docker registry, SSO provider, nginx, file servers, database, Celery task queue, various watchers and services) is just a docker-compose app brought up with docker-compose up -d. All docker-compose.yml files are in version control.

Our main web application handles our users’ traffic without a problem, and should we need to scale quickly, we can easily add api and front containers to the docker-compose.yml to scale horizontally. For all compute-intensive tasks (transforming satellite images, running AI algorithms on them…), we have a Celery task queue where each worker spins up a VM at our cloud provider for each job. Here too, scaling is not a problem: more jobs simply means the workers spin up more VMs.

The key point is that we do NOT abstract away the underlying hardware. All our services run as docker-compose apps on a bare-metal server, and our computing cluster is a Celery task queue that spins up and shuts down VMs, where 1 job = 1 VM. Since our jobs usually last from 15 minutes to 1 hour, the overhead of VM startup and termination is negligible. So we get the benefits of Infrastructure as Code (95% of our infrastructure is set up with just a few docker-compose ups, and all the yml files are in version control), yet keep a very simple architecture, with no extra concepts introduced to abstract away the underlying hardware.
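The “1 job = 1 VM” pattern can be sketched as follows. This is a minimal illustration, not our actual code: provision_vm and terminate_vm are hypothetical stand-ins for a cloud provider’s API, and the Celery wiring (broker, task registration) is omitted; in the real setup, a function like run_job would be the body of a Celery task executed by a worker.

```python
# Sketch of the "1 job = 1 VM" worker pattern described above.
# provision_vm / terminate_vm are hypothetical placeholders for
# cloud-provider API calls (e.g. via boto3 or libcloud).

import uuid


def provision_vm() -> str:
    """Placeholder: ask the cloud provider for a fresh VM, return its id."""
    return f"vm-{uuid.uuid4().hex[:8]}"


def terminate_vm(vm_id: str) -> None:
    """Placeholder: shut the VM down, whatever happened to the job."""
    pass


def run_job(job: str) -> dict:
    """Run one compute job on its own dedicated VM.

    In the real setup this is the body of a Celery task: the worker
    provisions a VM, ships the job to it, waits for the result, then
    terminates the VM in all cases (hence the try/finally).
    """
    vm_id = provision_vm()
    try:
        # Placeholder for "copy inputs to the VM, run the container,
        # fetch the outputs" -- here we just echo the job back.
        return {"job": job, "ran_on": vm_id}
    finally:
        terminate_vm(vm_id)


result = run_job("satellite-tile-42")
```

Because each job gets a fresh VM and the VM is always terminated in the finally block, a crashed job cannot leak a machine, and scaling out is just a matter of more tasks being picked up by the queue.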
Of course, this setup may not be optimal for all startups. However, if your resource consumption is under 10–15 physical servers, this architecture will most likely get the job done with no fuss and no bloat. If the number of servers you need is greater than that, you *might* consider testing more complex alternatives such as Kubernetes. For batch computing, the limit is in my view even higher: many labs run grids made of hundreds of nodes using very simple scheduling tools such as gLite, HTCondor, etc.
The key takeaway is this: always make sure you are not engineering for requirements you don’t have. Sometimes you may find a software tool that does everything you want out of the box (or so they say), but keep in mind that each time a piece of software tries to solve a problem in a generic way, it *has* to introduce abstractions, and thus complexity.
Originally published at fruty.io on May 10, 2018.