1. This data was originally compiled by Peter Norvig and later popularized by Jeff Dean. The numbers in this blog post series are based on data compiled by Colin Scott on this interactive website and Brendan Gregg in his book Systems Performance: Enterprise and the Cloud (Pearson Education).
1. The 2023 State of DevOps Report does not publish the raw data, but only summaries of the results, so the values in the "elite vs low performance" column are estimated. For example, the report says that the deployment frequency for elite companies is "on-demand (multiple deploys per day)": I’m assuming this is 2-10 deploys per day, so with 5 working days per week, and 4 working weeks per month, this works out to 40-200 deploys per month. The report says that the deployment frequency for low companies is "between once per week and once per month": with 4 working weeks per month, this works out to 1-4 deploys per month. So the difference is 10x at the low end and 200x at the high end.
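The estimate in this note can be reproduced with a few lines of arithmetic. This is only a sketch of the assumptions stated above (2-10 deploys per day, 5 working days per week, 4 working weeks per month); the variable names are illustrative and do not come from the report.

```python
# Assumptions taken from the note above, not from the report's raw data.
WORKING_DAYS_PER_WEEK = 5
WORKING_WEEKS_PER_MONTH = 4

# Elite performers: assumed 2-10 deploys per working day.
elite_per_day = (2, 10)
elite_per_month = tuple(
    d * WORKING_DAYS_PER_WEEK * WORKING_WEEKS_PER_MONTH for d in elite_per_day
)

# Low performers: between once per month and once per week.
low_per_month = (1, 1 * WORKING_WEEKS_PER_MONTH)

# Smallest gap: slowest elite vs. fastest low performer.
ratio_low_end = elite_per_month[0] / low_per_month[1]
# Largest gap: fastest elite vs. slowest low performer.
ratio_high_end = elite_per_month[1] / low_per_month[0]

print(elite_per_month, low_per_month, ratio_low_end, ratio_high_end)
```

Changing the assumed deploys per day changes both ends of the range, which is why the column is labeled as an estimate.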
2. From The DevOps Handbook: How to Create World-Class Agility, Reliability, & Security in Technology Organizations (IT Revolution Press, 2016) by Gene Kim, Jez Humble, Patrick Debois, and John Willis.
3. The Standish Group, "CHAOS Manifesto 2013: Think Big, Act Small," 2013, https://www.standishgroup.com/sample_research_files/CM2013-8+9.pdf.
4. Dan Milstein, "How to Survive a Ground-Up Rewrite Without Losing Your Sanity," OnStartups.com, April 8, 2013, https://www.onstartups.com/tabid/3339/bid/97052/How-To-Survive-a-Ground-Up-Rewrite-Without-Losing-Your-Sanity.aspx.
5. The loopback network interface is typically 127.0.0.1: you can try out http://127.0.0.1:8080 in your browser and it should give the same "Hello, World!" response.
6. One exception to this rule is to temporarily expose an app on your computer to a trusted 3rd party (typically, a coworker) for feedback: e.g., you’ve built something really cool and you want to quickly send a link to one of your colleagues so they can check it out. In that case, you can use tunneling tools to get a temporary, randomly-generated URL for testing: e.g., see localtunnel, ngrok, localhost.run, and btunnel. These are great for quick internal tests, but do not use these to share your app with the whole world.
7. See https://fortune.com/longform/amazon-web-services-ceo-adam-selipsky-cloud-computing/ for a more complete overview of the history of AWS.
8. See Basecamp for an example: https://basecamp.com/cloud-exit.
9. The first PaaS, Zimki, was actually created by Canon (the camera company) in 2006, but shut down just a year later: https://www.porter.run/blog/history-of-paas-how-canon-almost-became-a-major-cloud-provider.
10. I’m assuming that you’re running the examples in this blog post series in an AWS account dedicated solely to learning and testing so that the broad permissions of the AdministratorAccess Managed Policy are not a big risk.
11. This is where the term bus factor comes from: your team’s bus factor is the number of people you can lose (e.g., because they got hit by a bus, or perhaps something less dramatic, like they changed jobs) before you can no longer operate your business. You never want to have a bus factor of 1.
12. On most modern operating systems, code runs in one of two "spaces": kernel space or user space. Code running in kernel space has direct, unrestricted access to all of the hardware. There are no security restrictions (i.e., you can execute any CPU instruction, access any part of the hard drive, write to any address in memory) or safety restrictions (e.g., a crash in kernel space will typically crash the entire computer), so kernel space is generally reserved for the lowest-level, most trusted functions of the OS (typically called the kernel). Code running in user space does not have any direct access to the hardware and must use APIs exposed by the OS kernel instead. These APIs can enforce security restrictions (e.g., user permissions) and safety (e.g., a crash in a user space app typically affects only that app), so just about all application code runs in user space.
13. As a general rule, containers provide isolation that’s good enough to run your own code, but if you need to run third-party code (e.g., you’re building your own cloud provider) that might actively be performing malicious actions, you’ll want the increased isolation guarantees of a VM.
14. You can learn more about AWS regions and Availability Zones on the AWS website.
16. See A Brief History of Containers: From the 1970s Till Now for a nice overview.
17. This was actually one of the driving factors in the early days of container orchestration: huge companies such as Google found that, with server and VM orchestration, where each app gets its own cluster of servers, you’d have to provision enough servers to handle peak load, but when load was not at peak (as is the case most of the time), most servers sit completely idle. One of the big benefits of container orchestration was that it allowed running multiple containers on each server, and moving them around quickly, so all your servers could act as one big pool of resources, which allowed those companies to make much more efficient use of their computing resources.
18. Kubernetes supports other types of Services as well. See the documentation for details.
19. For example, as of 2024, AWS Lambda costs $0.0000166667 for every GB-second of execution, plus $0.20 per 1M requests, with a free tier that includes 400,000 GB-seconds and one million free requests per month. If you built an app that processed three million requests per month, and this app ran on a Lambda function that used 1536 MB of memory and had an average function execution duration of 120 ms, the total cost would be less than $3 per month. See Lambda pricing for more details.
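The cost estimate in this note can be checked with a short calculation. The sketch below uses only the 2024 prices and free-tier figures quoted above; the function name and parameters are hypothetical, and real bills may differ (for example, by region or architecture).

```python
# Prices and free tier as quoted in the note above (2024 figures).
PRICE_PER_GB_SECOND = 0.0000166667
PRICE_PER_MILLION_REQUESTS = 0.20
FREE_GB_SECONDS = 400_000
FREE_REQUESTS = 1_000_000


def monthly_lambda_cost(requests: int, memory_mb: int, avg_duration_ms: float) -> float:
    """Estimate the monthly cost in USD for a single Lambda function."""
    # Compute usage: requests x seconds per request x memory in GB.
    gb_seconds = requests * (avg_duration_ms / 1000) * (memory_mb / 1024)
    billable_gb_seconds = max(0.0, gb_seconds - FREE_GB_SECONDS)
    billable_requests = max(0, requests - FREE_REQUESTS)

    compute_cost = billable_gb_seconds * PRICE_PER_GB_SECOND
    request_cost = billable_requests / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    return compute_cost + request_cost


cost = monthly_lambda_cost(requests=3_000_000, memory_mb=1536, avg_duration_ms=120)
print(f"Estimated monthly cost: ${cost:.2f}")
```

With these inputs the function uses about 540,000 GB-seconds, of which 140,000 are billable, plus two million billable requests.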
20. Kubernetes Deployments support rolling and canary deployments natively. Several popular tools in the Kubernetes ecosystem support blue-green deployments, such as Argo Rollouts.
21. This is true up to a point. For example, AWS Lambda has concurrency limits, and if you have enough load to exceed them, you may have to configure concurrency controls to avoid being throttled.
22. Lambda functions can use the Invoke API to trigger other functions, but this is typically considered an anti-pattern. The more common approach is to use an event-driven architecture where one Lambda function performs an action—e.g., put a message in a queue, write a file to a file store, etc.—which, in turn, triggers another Lambda function asynchronously.
23. For example, it’s easy to launch an ASG with 5 EC2 instances, and to have an EBS Volume—a network-attached hard drive you’ll learn more about in Part 9—attached to each one. However, when you roll out a new change with instance refresh, you’ll end up with 5 new EC2 instances, and 5 new EBS Volumes attached, so any data they had before is not carried over.
24. This method of calculating commit IDs is very clever. First, it ensures that commit IDs are consistent without the need for any central mechanism for issuing IDs: the same commit done on any computer, anywhere in the world, at any time, always gets the exact same ID. Second, it ensures commits can’t be tampered with: change even 1 bit of the contents or metadata or history, and you get a totally different SHA-1 hash. Third, it gives you a very efficient way to compare commits. In part, this is because you can compare commits using just the IDs, without having to send the full contents. But even more interesting is that you can compare the full history just by comparing IDs. That’s because the commit ID calculation includes the ID of the previous commit, which, in turn, included the ID of its predecessor, and so on, so two commit IDs are only equal if their entire history is equal.
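The chaining property described in this note can be illustrated with a toy model. This is a simplified sketch, not Git's actual object format: the `toy_commit_id` and `history_tip` helpers and the payload layout are assumptions made for illustration. It only shows that hashing the parent ID into each commit makes the latest ID depend on the entire history.

```python
import hashlib
from typing import List, Optional, Tuple


def toy_commit_id(parent_id: Optional[str], metadata: str, contents: str) -> str:
    """Hash the parent ID together with this commit's metadata and contents."""
    payload = f"parent {parent_id or 'none'}\n{metadata}\n\n{contents}"
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def history_tip(commits: List[Tuple[str, str]]) -> Optional[str]:
    """Return the ID of the last commit in a linear history."""
    parent = None
    for metadata, contents in commits:
        parent = toy_commit_id(parent, metadata, contents)
    return parent


history_a = [("alice: initial commit", "hello"), ("bob: add greeting", "hello, world")]
history_b = list(history_a)  # the same commits, "made on another computer"
tampered = [("alice: initial commit", "hellp"), ("bob: add greeting", "hello, world")]

# Identical histories produce identical IDs with no central coordination.
same_tip = history_tip(history_a) == history_tip(history_b)
# Changing one character in the first commit changes the ID of the last one.
tampered_tip_differs = history_tip(history_a) != history_tip(tampered)
print(same_tip, tampered_tip_differs)
```

Comparing the two tip IDs is enough to tell whether the histories match, without comparing any commit contents.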
25. One study of code reviews found that they can catch 55%-60% of bugs, which is a higher rate than even automated testing ("Software Defect-Removal Efficiency" by Capers Jones, IEEE Computer, April 1996). Another study found code reviews could reduce error rates by over 80% (Handbook of Walkthroughs, Inspections, and Technical Reviews: Evaluating Programs, Projects, and Products by Daniel P. Freedman and Gerald M. Weinberg, Dorset House).
26. Unit tests for application code are typically fast. Unit tests for infrastructure code are typically moderately fast, at best.
27. "ISS Configuration." Wikipedia, The Free Encyclopedia, December 12, 2022. https://commons.wikimedia.org/wiki/File:ISS_configuration_2022-12_en.svg
28. You can also update batches of servers instead of just one at a time: e.g., if you have 50 servers to update, you might take 5 down at a time, boot up 5 new ones, and when they are healthy, repeat with the next batch of 5, until all 50 are replaced.
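The batched approach in this note can be sketched as a loop over fixed-size slices of the fleet. The `replace` and `is_healthy` callbacks below are hypothetical stand-ins for whatever your tooling does to launch a replacement and run health checks; this is a minimal illustration, not how any particular orchestrator implements it.

```python
from typing import Callable, Iterator, List


def batches(servers: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size servers."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(servers), batch_size):
        yield servers[start:start + batch_size]


def rolling_update(
    servers: List[str],
    batch_size: int,
    replace: Callable[[str], str],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Replace servers one batch at a time, stopping if a batch is unhealthy."""
    updated: List[str] = []
    for batch in batches(servers, batch_size):
        new_servers = [replace(old) for old in batch]
        unhealthy = [s for s in new_servers if not is_healthy(s)]
        if unhealthy:
            # Halt the rollout so the remaining old servers keep serving traffic.
            raise RuntimeError(f"unhealthy replacements: {unhealthy}")
        updated.extend(new_servers)
    return updated


old_fleet = [f"old-{i}" for i in range(50)]
new_fleet = rolling_update(
    old_fleet,
    batch_size=5,
    replace=lambda name: name.replace("old", "new"),
    is_healthy=lambda name: True,
)
print(len(new_fleet), new_fleet[:3])
```

A larger batch size finishes the rollout sooner but takes more capacity out of service at once.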
29. For an example of a full-featured deployment pipeline that implements most of this functionality out-of-the-box, see Gruntwork Pipelines.
30. For example, see GitHub’s push rulesets for a way to lock down who can edit specific file paths in a repo.
31. Site Reliability Engineering: How Google Runs Production Systems by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy (O’Reilly).
32. Code Complete: A Practical Handbook of Software Construction by Steve McConnell (Microsoft Press).
33. Statista estimates that there are over 5 billion Internet users, but even that’s a drop in the bucket compared to the total number of devices that need IPs: you also need to count all the users that have multiple devices (computer, phone, tablet, TV, car), all the networking devices (routers, switches), all the IoT devices, and so on.
35. By default, these are dynamic IPs that are chosen at random from the pool of IPs owned by the cloud provider, so they may change every time you redeploy. If you want to use the same IP address for a long period of time, you can typically reserve a static IP for an additional fee: e.g., AWS offers Elastic IPs (EIPs), GCP offers static external IP addresses, and Azure offers static public IP addresses.
36. For example, in AWS, you can use Elastic Network Interfaces (ENIs) to reserve static, private IP addresses, and attach them to EC2 instances.
37. See REST - Explained For Beginners for more details.
38. As of 2024, the world’s fastest distributed computer is the Frontier system at Oak Ridge National Laboratory, which was able to perform 1.2 exaFLOPS, or about 1.2 x 10^18 floating point operations per second. That’s a remarkable accomplishment, but even if you generously assume that you could try one key per floating point operation, this system would need to run for roughly 9 trillion years to perform 2^128 floating point operations, which is 650 times longer than the age of the universe (13.8 billion years).
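The brute-force estimate in this note follows from dividing the size of a 128-bit key space by the machine's throughput. The sketch below uses only the figures quoted above, plus the assumption of a 365.25-day year; the variable names are illustrative.

```python
# Figures from the note above.
OPERATIONS_PER_SECOND = 1.2e18      # 1.2 exaFLOPS
KEY_SPACE = 2 ** 128                # number of possible 128-bit keys
AGE_OF_UNIVERSE_YEARS = 13.8e9

# Assumption: a year of 365.25 days.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

# Generous assumption from the note: one key tried per floating point operation.
seconds_needed = KEY_SPACE / OPERATIONS_PER_SECOND
years_needed = seconds_needed / SECONDS_PER_YEAR
universe_ages = years_needed / AGE_OF_UNIVERSE_YEARS

print(f"{years_needed:.2e} years, about {universe_ages:.0f}x the age of the universe")
```

This counts the time to try every key; on average an attacker would find the key after trying half of them, which does not change the conclusion.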
39. As Phil Karlton famously said, "There are only two hard things in Computer Science: cache invalidation and naming things." I also liked Leon Bambrick’s version of this quote: "There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors."
40. There are some exceptions, such as MongoDB, which has support for joins via the lookup operator, although it’s more limited than the types of joins you get with relational databases.
41. Again, there are some exceptions, such as MongoDB, which has support for distributed transactions, though again, it’s more limited than what you get with relational databases. Moreover, transactions are not the default, but something you have to remember to use, which is quite error-prone.
42. This data comes from a blog post titled Building and deploying MySQL Raft at Meta, which explains that, to handle this level of scale, Meta had to create MySQL Raft, a consensus engine that turns MySQL into a "true distributed system," so it’s not clear if you can still call it a relational database.
43. The original EventBrite page for the meetup is no longer available, but you can still find snapshots of it in the Internet Archive.
44. The only other type of NoSQL data store that I didn’t cover in this blog post is the graph database, which wasn’t motivated by the need for scalability or availability, but the need to efficiently store, query, and navigate relationship data within a graph. I rarely come across these in the wild, but if you’re interested in them, have a look at Neo4j, Amazon Neptune, and Aerospike.
46. See Data Driven Products Now! by Dan McKinley for a great write-up on how data-driven product development works.