Fundamentals of DevOps and Software Delivery

Update, June 25, 2024: This blog post series is now also available as a book called Fundamentals of DevOps and Software Delivery: A hands-on guide to deploying and managing production software, published by O’Reilly Media!

There are many guides out there on how to write software. This blog post series is a guide to software delivery: that is, all the processes, tools, and techniques that are required to run and maintain software in production on an ongoing basis. In particular, this series is a guide to DevOps, one of the methodologies used today to make software delivery vastly more efficient.

Whereas most DevOps content tends to focus on culture and theory, this series is a hands-on guide that includes dozens of examples that walk you, step-by-step, through how to run real systems and real code. You’ll start with the basics—an app running on a single server—and work all the way up to microservices running in a Kubernetes cluster with a service mesh, automated deployment pipeline, end-to-end encryption, and more. By the time you’re done with the blog post series, you will have had hands-on practice with all the core concepts and practices of modern DevOps and software delivery.

In the spirit of "ship early and often," I’m releasing the first half of this series now, and I’ll incrementally add the remaining parts over the following weeks. Here’s an outline of the series, with links to the parts available now, and "coming soon" for the parts that’ll be available in the future:

An Introduction to DevOps and Software Delivery
How to Manage Your Infrastructure as Code
How to Deploy Your Apps
How to Version, Build, and Test Your Code
How to Set Up Continuous Integration (CI) and Continuous Delivery (CD)
How to Work with Multiple Teams and Environments
How to Set Up Networking
How to Secure Communication and Storage
How to Store Data
How to Monitor Your Systems
The Future of DevOps and Software Delivery [coming soon]

Before jumping into all this content, I want to take a moment to talk about why I felt the need to write this series—and why you may want to read it.

Why I Wrote This Series

Almost every piece of software today depends on software delivery practices to be able to deploy, maintain, and secure that software. And yet, I’m not aware of any hands-on guides that teach software delivery end-to-end. Just about everyone who learns software delivery today is learning it the hard way: that is, through trial and error. Unfortunately, errors in software delivery can be very costly: they involve outages, data loss, and security breaches. The lack of a good way to learn software delivery is making the entire software industry slower, less effective, and less secure.

I experienced this firsthand. Back in 2011, I was working at LinkedIn. From the outside, everything looked great: the company had just had its IPO, the share price was up by over 100%, revenue was growing by more than 100% year over year, and the website had over 100M members, with 2 new members joining every single second. But from the inside, the company was in turmoil. Why? Because our software delivery practices had gotten so bad that we could no longer deploy.

Back then, we would do deployments once every two weeks, and it was always a painful, tedious, slow, and error-prone affair. In 2011, we had a deployment that went so badly that we couldn’t complete it, no matter how hard we tried. We rolled out some new changes, which caused outages and bugs; we pushed some fixes, but those caused new issues; we pushed more fixes, which only led to more issues. Teams worked through the night, into the next day, and we still couldn’t get things stable. In the end, after a several-day deployment nightmare, we had to roll everything back.

Here was a company worth nearly $10 billion, and we could not deploy code. This was the cost of not having proper software delivery practices in place. To get out of this mess, we kicked off Project Inversion: a complete freeze on all new feature development for several months while the entire engineering, product, and design team reworked all the underlying infrastructure, tooling, and practices. The result was a huge success: months later, we were able to deploy dozens of times per day, with far fewer issues and outages, and that allowed the whole company to move much faster.

Today, we might call this a "DevOps transformation" (though back then, the term "DevOps" had just appeared on the scene, so we didn’t call it that), and to get there, we had to go through a lot of pain and outages. The honest truth is that we didn’t know what we didn’t know. We had to go out and chat with companies across the industry, learning about trunk-based development from one company, canary deployments from another, feature toggles from another, and so on.

Sadly, even now, as I write this blog post series nearly 15 years later, relatively few developers know about these DevOps and software delivery practices. After leaving LinkedIn, I co-founded Gruntwork, where I had the opportunity to work with hundreds of companies on their DevOps and software delivery practices. What I saw was LinkedIn’s DevOps nightmare repeated over and over again at companies of all sizes. The techniques that a handful of the top tech companies had figured out were not filtering down to the rest of the industry. Most developers out there still don’t know what they don’t know.

So I decided to write a blog post series.

I hope that this series can be a small step in improving this situation. I hope that a comprehensive, hands-on overview of DevOps and software delivery will help the next generation of software companies get off on the right foot, and avoid some of the DevOps nightmares I’ve seen. I hope that instead of just hacking things together and learning things the hard way, this blog post series will allow you to learn from the experience of others, and perhaps the result will be a software industry that can build software faster, more reliably, and more securely.

All that said, a fair warning: while the results from adopting DevOps can be wonderful, the experience along the way can be anything but wonderful, as described next.

Watch Out for Snakes

I’m going to let you in on a little secret: we use a single word, "DevOps," to describe what’s actually dozens and dozens of largely unrelated concepts. What does the cryptography behind a TLS certificate have to do with defining a deployment pipeline in GitHub Actions YAML or backing up data from a PostgreSQL database? Not much. And yet, your typical SRE or DevOps Engineer has to deal with all of these, and countless other concepts, too.

What makes DevOps hard is not that any one of these concepts is incredibly complicated by itself, but that there are so many concepts to master, and you have to connect them all together just right. The TLS certificate must be configured just right, or your users will get scary errors that prevent them from accessing your website; your deployment pipeline must be configured just right, or your team won’t be able to deploy; your database backup must be set up just right, or you are at risk of data loss, and if you lose all your data, you may go out of business entirely. DevOps is a remarkable combination of an incredibly broad surface area and a need to sweat every single detail: either you get everything connected together correctly, or nothing works at all.

I often use the analogy of a box of cables: you reach into the box, hoping to pull out just one cable, but you inevitably end up pulling out a giant mess where everything is tangled together. Unfortunately, that’s the state of DevOps today: it’s a relatively new industry, the tools and techniques we have just aren’t that mature, and it often feels like everything is broken and frustrating and hopelessly tangled.

My hope in this blog post series is, as much as I can, to untangle this mess of cables for you. To show you that these are, in fact, separate cables—separate concepts—that, in isolation, are something you can readily understand, begin to work with, and ultimately become proficient in.

But sometimes, this is hard to do. Sometimes, reaching into this box of cables feels more like reaching into a box of snakes. You just end up getting bitten. If you find yourself sitting there, staring at some nonsense error message, tearing your hair out, stressed, angry, and afraid, know this:

You are not alone.

There are thousands of other developers reaching into that box of snakes and getting bitten, every day. I’m one of them. I’ve lost more hair to DevOps than I care to admit. Even while writing this blog post series, I frequently found myself frustrated, or confused, or yelling at my screen, even though I’ve done most of these things a thousand times. That’s just how it is today.

In fact, there are a few places in this series where I haven’t been able to untangle the wires as much as I’d like: e.g., some example code that’s just too complicated and long to include in the blog post series, so I have to settle for a simpler and less realistic version, or a concept I can’t explain without introducing ten other concepts that come later, so I can only give you a partial explanation for now. In cases like these, I’ve added a "box of snakes" warning that looks like this:

Watch out for snakes: example title for the warning

An example warning. When you see these, be prepared to enter a particularly hairy and tangled corner of DevOps.

Whenever you see such a warning, understand that you’re going to see part of the picture now, but perhaps won’t be able to get the full picture until later. In fact, this is true of DevOps in general. If you’re new to DevOps, initially, it’ll all seem strange, confusing, and full of incomprehensible buzzwords. And each time you go to learn a new buzzword, you’re hit with ten more unfamiliar buzzwords, so you never feel like you’re getting the whole picture. But I promise you that if you give it enough time, you’ll eventually get over a hump, and suddenly, the pieces will start to make sense, and really come together. You need to build up a big enough base of knowledge and experience, and it’s hard going at first, but at some point, it starts to get easier. It never becomes completely easy, but you get to a point where you always feel confident that you can figure it out.

So stick with it. And watch out for those snakes.

Who Should Read This Series

This blog post series is for anyone responsible for deploying and managing apps in production—that is, anyone responsible for software delivery. This includes:

Individual contributors in operations roles: Current and aspiring Site Reliability Engineers, DevOps Engineers, Sysadmins, Operations Engineers, and Release Engineers who want to level up their knowledge of software delivery.
Individual contributors in dev roles: Software Engineers, Software Developers, Web Developers, and Full Stack Engineers who want to learn more about the operations side of the house.
Managers: Engineering Managers, Engineering Directors, CTOs, VPEs, and CIOs who want to learn how to adopt and improve DevOps and software delivery practices in their organizations.

This blog post series does not assume that you’re already an expert coder or expert sysadmin: a basic familiarity with programming, the command line, and server-based software (e.g., websites) should suffice. Everything else you need you’ll be able to pick up as you go. The only tools you need are a computer, an internet connection, and the desire to learn.

What You’ll Find in This Series

Table 1 shows a part-by-part outline of what the blog post series covers, including the key ideas you’ll explore and the hands-on examples you’ll try in each part:

Table 1. An outline of the blog post series
Part	Key ideas you’ll explore	Examples you’ll try out
Part 1, An Introduction to DevOps and Software Delivery	The evolution of DevOps; On-prem vs cloud; PaaS vs IaaS	Run an app locally; Run an app on Render; Run an app on an EC2 instance in AWS
Part 2, How to Manage Your Infrastructure as Code	Ad hoc scripts; Configuration management tools; Server templating tools; Provisioning tools	Use Bash to deploy an EC2 instance; Use Ansible to deploy an EC2 instance; Use Packer to build an AMI; Use OpenTofu to deploy an EC2 instance
Part 3, How to Deploy Your Apps	Server orchestration; VM orchestration; Container orchestration; Serverless orchestration	Use Ansible to deploy app servers & Nginx; Use OpenTofu to deploy an ASG and ALB; Deploy a Dockerized app in Kubernetes; Deploy a serverless app with AWS Lambda
Part 4, How to Version, Build, and Test Your Code	Version control; Build systems; Dependency management; Automated testing	Store your code in GitHub and use PRs; Configure your build in npm; Set up automated tests for a Node.js app; Set up automated tests for OpenTofu code
Part 5, How to Set Up Continuous Integration (CI) and Continuous Delivery (CD)	Trunk-based development; Feature toggles; Deployment strategies, pipelines	Use OIDC with GitHub Actions and AWS; Run tests in GitHub Actions; Run deployments in GitHub Actions
Part 6, How to Work with Multiple Teams and Environments	Multiple environments; Multiple libraries; Multiple services	Create multiple AWS accounts; Configure apps for multiple environments; Deploy microservices in Kubernetes
Part 7, How to Set Up Networking	Domain Name System (DNS); Virtual private clouds (VPCs); Network access and hardening; Service discovery, service meshes	Set up a custom domain name in Route 53; Deploy a custom VPC in AWS; Use SSH to connect to a server; Use Istio as a service mesh with Kubernetes
Part 8, How to Secure Communication and Storage	Cryptography; Encryption at rest; Encryption in transit	Encrypt data with AES and RSA; Store secrets in AWS Secrets Manager; Set up HTTPS with Let’s Encrypt
Part 9, How to Store Data	Relational DBs, schemas; NoSQL, NewSQL, queues, streams; File storage and CDNs; Backup and recovery	Deploy PostgreSQL using RDS; Configure RDS backup, replicas; Use Knex.js for schema migrations; Use S3 and CloudFront for static assets
Part 10, How to Monitor Your Systems	Logs and log aggregation; Metrics, dashboards, alerts; Observability and tracing	Create a dashboard in CloudWatch; Do structured logging with Node.js; Set up Route 53 health checks and alerts
Part 11, The Future of DevOps and Software Delivery [coming soon]	Higher-level abstractions; Generative AI; Shift left, supply chain security; Platform engineering	Runme; Snyk; Chainguard; Backstage

Feel free to read the blog post series from beginning to end or jump around to the parts that interest you the most. Note that the examples in each part reference and build upon the examples from the previous parts, so if you skip around, use the open source code examples (as described in Open Source Code Examples) to get your bearings.

Given the breadth of DevOps, this series covers a lot of ground and includes a lot of detail. To help you avoid missing the forest for the trees, I try to call out the key takeaways in each blog post as follows:

Key takeaway #1

A key takeaway from the blog post.

Pay special attention to these items, as they typically highlight the most important lessons in that post.

What You Won’t Find in This Series

This blog post series is meant to fill a specific gap: a hands-on guide to DevOps and software delivery, targeted at practitioners. This is already a huge amount of content to cover, which means there are some DevOps and software delivery topics that this series will either skip or only touch on lightly:

DevOps culture and organizational processes: Most of the DevOps books out there today primarily focus on DevOps culture and organizational processes such as cross-functional teams, capacity planning, blameless postmortems, on-call rotations, KPIs, SLOs, and error budgets, so this blog post series won’t spend much time on these items.
Server hardening: While this blog post series covers a range of security topics, I can’t cover them all. In particular, one area I won’t be able to discuss too much is how to harden your servers against attacks: e.g., OS permissions, intrusion protection, file integrity monitoring, sandboxing, hardened images, etc.
Low-level networking: This blog post series includes a post on networking, but it only focuses on higher-level concepts: DNS, CDNs, VPCs, VPNs, service meshes, and basic network hardening. This post will not go into any lower-level details, such as routers, switches, links, routing protocols, and so on.
Compliance: DevOps engineers are often tasked with helping their companies meet various compliance standards and regulations, such as SOC 2, ISO 27001, HIPAA, PCI, GDPR, NIST 800-53 and so on. While the practices I recommend in this blog post series go a long way towards setting up the kind of security posture you need to meet these compliance standards, this series is not meant to be a detailed guide towards meeting any standard in particular.
Cost optimization and performance tuning: DevOps engineers are also often asked to help optimize the company’s systems to reduce costs or improve performance. These are detailed and ever-changing topics in their own right, so this blog post series will only touch on them at a surface level.

Open Source Code Examples

This blog post series includes many examples for you to work through. You can find all these code samples in the following GitHub repository:

https://github.com/brikis98/devops-book

You might want to check out this repo before you begin reading so you can follow along with all the examples on your own computer (if you are new to Git, check out the Git tutorial in Part 4):

$ git clone https://github.com/brikis98/devops-book.git

The code samples are organized by part (e.g., ch1, ch2, etc.), and within each part, by tool (e.g., ansible, kubernetes, tofu). For example, the example Packer template in Part 2 will be in the folder ch2/packer, and the example OpenTofu module called lambda in Part 3 will be in the folder ch3/tofu/modules/lambda.
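As a rough illustration of that layout, here is a minimal sketch of a shell helper that builds the folder path for a given part and tool. The function name example_dir is hypothetical (it is not part of the repo), and the sketch assumes only the ch<N>/<tool> convention described above:

```shell
# Hypothetical helper, not part of the repo: print the folder for a given
# part number and tool, following the ch<N>/<tool> layout described above.
example_dir() {
  part="$1"   # part number, e.g., 2
  tool="$2"   # tool folder within that part, e.g., packer
  printf 'ch%s/%s\n' "$part" "$tool"
}

example_dir 2 packer               # prints ch2/packer
example_dir 3 tofu/modules/lambda  # prints ch3/tofu/modules/lambda
```

After cloning the repo, you could use something like cd "devops-book/$(example_dir 2 packer)" to jump to the Part 2 Packer example, assuming you cloned into the default devops-book folder.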

It’s worth noting that most of the examples show you what the code looks like at the end of a part. If you want to maximize your learning, you’re better off writing the code yourself, from scratch, and checking the "official" solutions only at the very end.

An important note for Windows users

While the example code included in this blog post series should work on any operating system, the series also includes many example terminal commands that you run locally. These terminal commands are mostly written in Bash, so to run them, you need a computer running Unix, Linux, or macOS; if you’re on Windows, you can use the Windows Subsystem for Linux or Cygwin.

Opinionated Code Examples

The core concepts in the blog post series (e.g., managing infrastructure as code, CI/CD, networking, and secrets management) are relatively ubiquitous and applicable across the entire software industry. The code samples, however, represent just one opinionated way to implement these core concepts. The examples are there to give you hands-on practice, and to help with learning: they are not there as a claim that this is the only way or even the best way to do things.

In the real world, there is no single "best" way that applies to all circumstances. All technology choices are trade-offs, and some solutions will be a better fit in some situations than others. The goal of this blog post series is to teach you the underlying concepts and techniques of DevOps and software delivery, and not a specific set of tools or technologies, so once you understand the basics, feel free to explore other technologies and approaches, and always use your judgment to pick the right tool for the job.

A Note About Versions

Whereas the core concepts in this blog post series change only over relatively long time spans (5-10 years), the code samples used to demonstrate and implement the core concepts change much more frequently. Therefore, it’s possible that by the time you read this, some of the examples will be out of date. I’ll try to update the examples as often as I can, but if you hit an issue, please file a bug in the series’ GitHub repo.

You Have to Get Your Hands Dirty

Reading a blog post series is not enough to become an expert at DevOps and software delivery. This isn’t unique to DevOps or software delivery: for example, reading a book on weight lifting isn’t enough to become an expert at weight lifting. A book on weight lifting can teach you principles, routines, and exercises, but it’s only after you spend hours in the gym practicing, sweating, and applying what you read that you’ll be able to lift serious weight. Likewise, this series can teach you principles, techniques, and tools, but it’s only after you spend hours writing code, running code, and applying what you read that you’ll be able to achieve serious results.

That’s what the code examples in this series are for. Instead of only reading, you get to learn by doing. So don’t just skim the code examples: write the code, run it, and get it working. Moreover, you’ll see sections like the following throughout the series:

Get your hands dirty

A list of exercises to try at home.

The examples in this blog post series will get you to the point where you have something working; these "get your hands dirty" sections are an opportunity for you to take those examples and tweak them, customize them to your needs, break things, figure out how to fix them, and so on. Think of this as time spent practicing and sweating at the gym: getting your hands dirty is when the real learning happens.

Let’s Get Started

Now that you have a basic understanding of what this blog post series is all about, it’s time to get started. And where better to begin than at the beginning: head over to Part 1, An Introduction to DevOps and Software Delivery to learn where DevOps came from and the basics of deploying apps.