Traditionally, building software, testing it, and managing the servers it runs on have been kept poles apart. Developers built the software, QA tested it, and then they flung it over the fence to the operations team, which handled deployment. The DevOps philosophy tries to bring these three disciplines together, and encourages automated testing and deployment of software components to reduce deployment times and friction.
One of the key components of this process is called Infrastructure-as-Code, abbreviated as IaC. This involves defining your server clusters the same way you would write application code: using programming best practices, a version control system, robust unit tests, etc. IaC is a larger topic, but today I want to go over a small part of it, which is immutable servers.
But before I do that, I want to go over some of the ways I’ve deployed code over the years.
My career in software started pretty simply. I would write my code, which at the time was simple HTML and CSS, and then write a simple bash script to upload the files to the target server. At some point, I discovered that using rsync can ensure that only changed files get uploaded, and that was the extent of my optimisation. The target server in question was something I would launch and provision manually, which would involve using the UI provided by the vendor to launch a machine with the OS I wanted. Then, I would get SSH access to install the applications and server software I needed. To provision the machine, I would run the required apt install <application> commands directly on the machine, use sed to update configuration files, and then reload or restart the services as required. If I was feeling really smart, I would write all these steps into a single bash file for future use.
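As a rough illustration of that upload step, here is a minimal sketch. The original script was bash; this version is Python, and the function names, paths, user, and host are all hypothetical. It only assembles the rsync command, and runs it solely when dry_run is turned off.

```python
import shlex
import subprocess


def build_rsync_command(source_dir: str, user: str, host: str, dest_dir: str) -> list[str]:
    """Return the rsync argv that mirrors source_dir onto the target server."""
    # A trailing slash makes rsync copy the directory's contents, not the directory itself.
    source = source_dir.rstrip("/") + "/"
    return [
        "rsync",
        "-a",        # archive mode: recurse and preserve permissions and timestamps
        "-z",        # compress data in transit
        "--delete",  # remove remote files that no longer exist locally
        source,
        f"{user}@{host}:{dest_dir}",
    ]


def deploy(source_dir: str, user: str, host: str, dest_dir: str, dry_run: bool = True) -> list[str]:
    """Build the command and, unless dry_run is set, execute it."""
    command = build_rsync_command(source_dir, user, host, dest_dir)
    if not dry_run:
        subprocess.run(command, check=True)
    return command


if __name__ == "__main__":
    # Hypothetical values; the default dry run only prints the command.
    print(shlex.join(deploy("./site", "deploy", "example.com", "/var/www/html")))
```

Because rsync transfers only files that differ, repeated runs of the same command stay cheap, which is the optimisation described above.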
And that’s how life went. This new server would then become a holy temple; restricting and controlling access by others would become my sacred calling, and any problem with the server or the application would require manually entering the machine/temple, checking logs and CPU, etc., and fixing the problem on the machine itself. Maybe I would update my single-install bash script to have those changes there as well, unless I was too busy. In the event of this machine failing, I would launch a new one, repeat the entire process, and change the DNS entries to point to the new machine. Repeating the installation process on the new machine was error-prone, because if the install script was several weeks or months old, then the software libraries for the base packages would have changed, and installation would become more complicated. Backups would be handled by cron jobs running on the machine itself. Monitoring was negligible, and would usually involve waiting for the application to die. If the traffic on the machine increased, I would just increase the size of the machine, since horizontal scaling was not an option for me at the time. Managing even a single machine manually was hard enough.
Eventually, vertical scaling became too expensive, and I decided to add a few more servers to handle application load. This required refactoring the application to move session state off the machines, and then using a load balancer to distribute traffic across them. But adding more machines meant my toolchain to manage them would have to change as well. Ansible came in, promising to alleviate the pain points of managing multiple machines. With Ansible, I was able to build my code locally, and then synchronize it to multiple machines. Build times were minimal, and the code would deploy fairly quickly. But with multiple servers came multiple issues, which is only natural, and tools cannot change bad habits. Maybe the disk became full, or a service crashed and required restarting, or a configuration file needed updating. These always required intervention, and would usually involve me writing some ad-hoc Ansible script and deploying the changes to the cluster.
Over time, this collection of ad-hoc scripts grew, which meant on-boarding new colleagues became difficult, and involving other teams in deployment and operations became next to impossible, because now they needed to understand the specific way I did things, instead of just generally understanding the application and its requirements. The problem was even worse if I was managing a legacy system, where someone else had their own flavour of ad-hoc scripts, or no scripts at all. Over time, some machines would have some changes, and some would not, causing configuration drift, resulting in an application cluster that was not homogeneous. This caused an ever-increasing fear that automation tools and techniques would not work for me, leading to the dreaded automation fear spiral. This fear caused me to only touch the cluster when changes were needed, which meant servers went a long time without being updated, which meant when I did end up updating them out of sheer necessity, the changes required would be huge, or the required system packages were no longer available. This led to even more frustration, and even more fear.
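To make configuration drift concrete, here is a minimal sketch that fingerprints each machine's configuration files and groups hosts by fingerprint. It assumes the file contents have already been collected into a dictionary; the host names, paths, and function names are hypothetical.

```python
import hashlib
from collections import defaultdict


def fingerprint(config_files: dict[str, str]) -> str:
    """Hash a machine's config files (path -> content) into one stable digest."""
    digest = hashlib.sha256()
    for path in sorted(config_files):
        # Null separators keep "ab" + "c" distinct from "a" + "bc".
        digest.update(path.encode())
        digest.update(b"\0")
        digest.update(config_files[path].encode())
        digest.update(b"\0")
    return digest.hexdigest()


def find_drift(cluster: dict[str, dict[str, str]]) -> list[list[str]]:
    """Group hosts by fingerprint; more than one group means the cluster has drifted."""
    groups = defaultdict(list)
    for host, config_files in cluster.items():
        groups[fingerprint(config_files)].append(host)
    # Largest group first, so the outliers are easy to spot.
    return sorted(
        (sorted(hosts) for hosts in groups.values()),
        key=lambda hosts: (-len(hosts), hosts),
    )


if __name__ == "__main__":
    cluster = {
        "web-1": {"/etc/nginx/nginx.conf": "worker_processes 4;"},
        "web-2": {"/etc/nginx/nginx.conf": "worker_processes 4;"},
        "web-3": {"/etc/nginx/nginx.conf": "worker_processes 8;"},
    }
    print(find_drift(cluster))  # [['web-1', 'web-2'], ['web-3']]
```

A homogeneous cluster yields exactly one group; here web-3 is the snowflake.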
While all of this can be made to work for existing machines (sort of), how do I make sure new machines come up correctly? I used to make AMIs of the currently running machines, and use user data to provision those machines in real time. An AMI is an image of the OS and disks of a machine which can be made programmatically, and new machines can be brought up from those images. But depending on how frequently you make these images, it can be hard to ensure that new machines always have the latest code. User data is a script that runs when a machine comes up, for example to download the latest code and configuration files from a repository. AMIs and user data can ensure that the machines come up with the code we need, but managing them can be difficult. Along with Ansible deployment scripts and ad-hoc maintenance scripts, they add a lot to the burden of managing a cluster.
One way I tried to get over the fear was to routinely apply the required configuration changes, regardless of whether the changes were required or not. It just required an Ansible script on a cron job that updated all the necessary configuration files, overwriting any files that were changed from outside this particular Ansible script. This ensured that server configurations always stayed in line, and reduced the haystack in case something went wrong. Code deployments would still require other Ansible scripts, which would also require updating AMIs and user data to handle new machines. Periodically overwriting configuration files at least prevented configuration drift and snowflake servers, but the larger problem was far from solved.
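The periodic overwrite can be sketched in a few lines. This is not the Ansible script itself, only an illustration of the idea in Python: the desired state lives in one place, and any file that differs from it is rewritten. The function name and the relative paths are hypothetical.

```python
import tempfile
from pathlib import Path


def enforce_config(desired: dict[str, str], root: Path) -> list[str]:
    """Overwrite any file under root whose content differs from the desired state.

    Keys of desired are paths relative to root. Returns the paths that were rewritten.
    """
    rewritten = []
    for relative_path, content in sorted(desired.items()):
        target = root / relative_path
        current = target.read_text() if target.exists() else None
        if current != content:
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(content)
            rewritten.append(relative_path)
    return rewritten


if __name__ == "__main__":
    desired = {"nginx/nginx.conf": "worker_processes 4;\n"}
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        print(enforce_config(desired, root))  # first run writes the file
        print(enforce_config(desired, root))  # second run finds nothing to do
```

Running it on a schedule is safe because it is idempotent: a run changes nothing unless someone has edited a file out of band.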
Consider a different approach. What if we could programmatically create AMIs, at will, pre-loaded with the latest production code, the required configuration files, and all of the stuff we wanted to do in the user data? Once this AMI was built, we could just recycle the current instances to use the new AMI, and update the launch configuration to ensure new machines also used the same AMI. This would ensure there would be no need to maintain separate scripts for separate steps, and the burden of managing existing and new machines would fall on maintaining the process of just building and provisioning the AMI. One place where this approach would fail would be if we continued to use it alongside the old one, meaning we create new AMIs to launch machines and at the same time also use ad-hoc scripts to update things.
In order for this new approach to work, we would have to forgo the old one, which is the concept of immutable servers. An immutable server is one which does not change once launched. Any update to code or configuration required on servers would need updating our AMI provisioning steps, adding what we need, creating a new AMI, recycling instances, and updating launch configurations for new machines. This would take some time to get used to, since breaking bad habits is hard, but it would ensure that all servers are exactly the same. Once automated, this approach can really improve the quality of life of someone managing systems.
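The recycling step can be modelled in memory to show its shape. This sketch makes no cloud API calls; the Instance class, the recycle function, and the identifiers are hypothetical stand-ins for what a real deployment tool would do. The dataclass is frozen, so an instance is never modified in place, only replaced.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Instance:
    instance_id: str
    image_id: str


def recycle(fleet: list[Instance], new_image_id: str, batch_size: int = 1) -> list[Instance]:
    """Replace every instance not built from new_image_id, batch_size at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    fresh_ids = count(1)
    current = list(fleet)
    stale = [inst for inst in current if inst.image_id != new_image_id]
    for start in range(0, len(stale), batch_size):
        batch_ids = {inst.instance_id for inst in stale[start:start + batch_size]}
        # Swap the batch: drop the stale instances and add replacements from the new image.
        replacements = [
            Instance(f"replacement-{next(fresh_ids)}", new_image_id) for _ in batch_ids
        ]
        current = [inst for inst in current if inst.instance_id not in batch_ids]
        current += replacements
    return current


if __name__ == "__main__":
    fleet = [Instance("i-a", "ami-old"), Instance("i-b", "ami-old"), Instance("i-c", "ami-new")]
    for inst in recycle(fleet, "ami-new"):
        print(inst)
```

Working in small batches keeps the fleet size constant throughout the rollout, so the load balancer always has machines to send traffic to.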
To prevent existing servers from drifting apart over time, and to wipe out the effects of untracked ad-hoc scripts, the immutable server approach involves periodically recycling instances by destroying them and launching new ones from the updated launch configuration. This is called the Phoenix server approach, and has the advantage that any ad-hoc changes are destroyed at the next recycle. If there are any issues, just update the AMI provisioning steps, and recycle the instances.
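One simple way to decide which instances to recycle is by age. The sketch below assumes launch times are already known and picks every instance older than a chosen limit; the function name, the identifiers, and the seven-day limit are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def due_for_recycling(
    launch_times: dict[str, datetime], max_age: timedelta, now: datetime
) -> list[str]:
    """Return the ids of instances that have been running longer than max_age."""
    return sorted(
        instance_id
        for instance_id, launched in launch_times.items()
        if now - launched > max_age
    )


if __name__ == "__main__":
    now = datetime(2024, 1, 10, tzinfo=timezone.utc)
    launch_times = {
        "i-old": datetime(2024, 1, 1, tzinfo=timezone.utc),
        "i-new": datetime(2024, 1, 9, tzinfo=timezone.utc),
    }
    print(due_for_recycling(launch_times, timedelta(days=7), now))  # ['i-old']
```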
Building and provisioning AMIs should be repeatable and automated. This involves keeping the required Packer or Ansible scripts in a single repo, thereby providing a single source of truth. This would ensure that on-boarding new team members would be easy, since everything they need to know is in one place. This is the philosophy of Infrastructure-as-Code: defining infrastructure components in source code like you would for any other application.
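As an illustration of what such a repo might hold, here is a sketch that generates a Packer template pairing the amazon-ebs builder with the ansible provisioner. It assumes Packer's JSON template format (newer Packer versions favour HCL2), and every value in it, including the base AMI id, region, instance type, SSH user, playbook path, and naming scheme, is a hypothetical placeholder.

```python
import json


def packer_template(source_ami: str, region: str, playbook: str, version: str) -> dict:
    """Build a Packer JSON template that bakes an AMI provisioned by an Ansible playbook."""
    return {
        "builders": [
            {
                "type": "amazon-ebs",
                "region": region,
                "source_ami": source_ami,        # base image to start from
                "instance_type": "t3.micro",     # assumed size for the temporary build machine
                "ssh_username": "ubuntu",        # assumed login user of the base image
                "ami_name": f"app-{version}",    # AMI names must be unique, so include a version
            }
        ],
        "provisioners": [
            {"type": "ansible", "playbook_file": playbook},
        ],
    }


if __name__ == "__main__":
    # Placeholder values; substitute your own before building.
    template = packer_template("ami-00000000", "us-east-1", "provision.yml", "2024-01-10")
    print(json.dumps(template, indent=2))
```

Since the template is plain data checked into version control, a change to the server becomes a reviewed change to this file followed by a fresh AMI build.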
Consider disabling SSH access. Removing the one line of communication we rely on to deal with servers forces us to think about what we actually need from them to diagnose problems and monitor usage. Need CPU, memory and network stats? Application access and error logs? Set up a monitoring pipeline using Collectd and InfluxDB and move the information off the server. Need to save state data? Use a remote database and move the information off the server. This would make our servers disposable. This is the central tenet of this new philosophy: servers are disposable, not holy shrines to be worshipped.
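To give a feel for what "moving the information off the server" looks like, here is a sketch that formats one metric sample in InfluxDB's line protocol. In practice Collectd would do the collecting and shipping; this only shows the shape of the data. The function names are hypothetical, escaping is simplified, and only float fields are handled.

```python
def escape(value: str) -> str:
    """Backslash-escape commas, equals signs, and spaces in tag or field keys and tag values."""
    for char in (",", "=", " "):
        value = value.replace(char, "\\" + char)
    return value


def to_line(measurement: str, tags: dict[str, str], fields: dict[str, float], timestamp_ns: int) -> str:
    """Format one sample as: measurement,tag=value field=value timestamp."""
    if not fields:
        raise ValueError("at least one field is required")
    # The measurement name is assumed to contain no commas or spaces.
    tag_part = "".join(f",{escape(key)}={escape(val)}" for key, val in sorted(tags.items()))
    field_part = ",".join(f"{escape(key)}={float(val)}" for key, val in sorted(fields.items()))
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


if __name__ == "__main__":
    print(to_line("cpu", {"host": "web-1"}, {"usage": 0.64}, 1465839830100400200))
    # cpu,host=web-1 usage=0.64 1465839830100400200
```

Tagging each sample with the host name means the data stays meaningful after the server that produced it has been destroyed.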