Site Reliability Engineering: Running software in production

1.7.2021 | 7 minutes reading time

Lately, Site Reliability Engineering (SRE) has been getting a lot of attention. With SRE came terms such as Service-Level Objective (SLO), Service-Level Indicator (SLI), and error budget. The SRE discipline also has a lot to say about running software in production. But these buzzwords are mostly just the tools that enable Site Reliability Engineers to do their job.
There is another buzzword: “production-ready.” This one is more about what an SRE or software developer can do to improve the software behind the metrics. This blog post will take a look at how these buzzwords work together and whether they are only buzzwords or there’s more to them.
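To make one of these terms concrete: an error budget is simply the amount of unreliability an SLO still permits. The following Python sketch assumes an availability SLO measured over a rolling window; the function name `error_budget_minutes` and the 30-day default are illustrative choices, not part of any SRE tooling.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability within the window for an availability SLO.

    Assumption: the SLO is expressed as a percentage of time the service is up.
    """
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


if __name__ == "__main__":
    # A 99.9 % SLO over 30 days leaves roughly 43 minutes of downtime.
    print(f"{error_budget_minutes(99.9):.1f} minutes")
```

The point of the number is the trade-off it makes visible: as long as budget is left, there is room for risky changes, and once it is spent, reliability work takes priority.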

The above-mentioned topics are not only buzzwords. Books have been written on them. There are the Google SRE books and there are also books on production-ready software:
“Production-Ready Microservices” by Susan Fowler and
“Release It! Design and Deploy Production-Ready Software” by Michael Nygard.

Besides being buzzwords and filling books, they have real impact. Although much of the material is centered around microservices, it is also valid for software running on servers, as functions, or at the edge. So, let’s have a closer look at what it is all about. To do so, let’s start with two examples which should be valid for most applications: logging and retries.

Logging

As an example, let’s assume this error message somewhere in a log:


Connection timed out.

If your software connects to only one service, this message might be acceptable: not nice, but acceptable. As soon as your software uses two or more backend services, questions arise. Which host did you try to reach? Which port did you use? After how long did the timeout occur?
This log message is more helpful:


Connection to host hello.from.the.other.side on port 12345 timed out after 100ms.

At first, the 100 ms timeout might seem strange. It could be a valid timeout if the consumed service runs inside the same datacenter. It’s definitely not valid when the service is located on the other side of the globe. With this example, other questions arise. Did this timeout happen because of a misconfiguration? Is a firewall blocking the connection? Or is the backend not available? The first log message would have led you to the same questions, but only after you had worked out the host, port, and timeout yourself. So, the additional information in the log message saves precious time.
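As a rough illustration, such a message could be produced like this in Python. This is a minimal sketch assuming a plain TCP connection; the names `format_timeout_message`, `connect_with_context`, and the logger name `backend-client` are made up for this example.

```python
import logging
import socket

logger = logging.getLogger("backend-client")


def format_timeout_message(host: str, port: int, timeout_ms: int) -> str:
    """Build a log message that names the host, the port, and the timeout."""
    return f"Connection to host {host} on port {port} timed out after {timeout_ms}ms."


def connect_with_context(host: str, port: int, timeout_ms: int = 100) -> socket.socket:
    """Open a TCP connection; if it times out, log the full context and re-raise."""
    try:
        return socket.create_connection((host, port), timeout=timeout_ms / 1000)
    except socket.timeout:
        # Everything needed to start debugging is in this one line.
        logger.error(format_timeout_message(host, port, timeout_ms))
        raise
```

Keeping the message construction in its own function makes it easy to reuse the same wording wherever a connection is opened.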

This log message directly leads to our next example.

Retries

If a backend isn’t available, the software shouldn’t give up after the first try. Perhaps a load balancer switched backends or a switch somewhere along the way rebooted. So, give it another try: not immediately, but after a short break. Combined with the production-ready log message from the previous section, this might look like this:


Connection to host hello... on port 12345 timed out after 100ms. Retries: 2/5

In most cases, you do not have to implement retries on your own. Google, for example, implemented them in its HTTP client for Java. It is also helpful to log the number of retries, or the fact that the client gave up. This will help during debugging.
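For illustration only, the idea can be sketched in a few lines of Python. This is not the Google client mentioned above; `call_with_retries` and its parameters are hypothetical, and it assumes the wrapped operation signals a timeout by raising `TimeoutError`.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("backend-client")


def call_with_retries(
    operation: Callable[[], T],
    description: str,
    max_retries: int = 5,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation; on a timeout, wait and retry up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TimeoutError:
            if attempt == max_retries:
                # Log that the client gave up, then let the caller handle it.
                logger.error("%s Giving up after %d retries.", description, max_retries)
                raise
            # The counter names the retry that is about to start.
            logger.warning("%s Retries: %d/%d", description, attempt + 1, max_retries)
            # Short break with exponential backoff: 0.1 s, 0.2 s, 0.4 s, ...
            sleep(base_delay_s * 2**attempt)
    raise AssertionError("unreachable")
```

The `sleep` parameter exists only so the waiting can be replaced in tests. In a real system you would typically also add some random jitter to the delay so that many clients do not retry at the same moment.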

These are two common examples. As you might guess from the books mentioned above, there is more to making software production-ready. These examples should get you started rather than educate you on the entire topic.

Even these two small examples will help your software, and even more the people running your software in production. More on that later.

But not all chapters of the production-ready or SRE books will apply to your software. For example, back-office software at an insurance company does not have to scale up from 100 to 1000 users in a matter of seconds. It also does not need to scale down some minutes later. But it’s good to know that this is not necessary. It is also worth documenting the factors which aren’t applicable to your software.

There is more to running software in production

But this article is about running software in production, not only production-ready software. So, there is more.

“Production-ready” mostly describes functionality built into the software. This functionality will help cope with situations occurring during production lifetime. To run software in production, you also have to take care of some infrastructure, systems, and processes around the software.

Let’s again take a look at two examples: backup/restore and certificates.

Backup/restore

Stateless is a buzzword I didn’t mention above. And although you should aim for stateless services, you will have some sort of state somewhere in your application. And this state will need backups.
The questions which will arise here are:

How often do you need backups?
How long do you have to keep them?
…

This functionality will not be part of the software, but these questions will arise sooner or later. If they go unanswered, an outage might lead to complete data loss. This happened, for example, to GitLab quite some time ago. This is not to blame GitLab; we should instead be thankful that they shared a postmortem from which we can learn.
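Once the two questions about frequency and retention have answers, those answers can be written down as explicit parameters and checked automatically. The following Python sketch is one possible way to do that; the function names and the idea of representing backups by their timestamps are assumptions made for this example.

```python
from datetime import datetime, timedelta
from typing import Iterable, List


def backups_to_delete(
    backup_times: Iterable[datetime], keep_for: timedelta, now: datetime
) -> List[datetime]:
    """Return the backup timestamps that are older than the retention period."""
    cutoff = now - keep_for
    return sorted(t for t in backup_times if t < cutoff)


def backup_is_overdue(
    backup_times: Iterable[datetime], interval: timedelta, now: datetime
) -> bool:
    """Return True if there is no backup, or the newest one is older than the interval."""
    times = list(backup_times)
    return not times or now - max(times) > interval
```

A check like `backup_is_overdue` is a natural candidate for an alert, and neither function says anything about whether a restore actually works, which needs its own regular test.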

Certificates

Your service will most likely be accessible via HTTPS, or it will access other services via some sort of secured communication protocol. So, here are some topics which aren’t covered by the software, but might cause major downtime when not handled:

Does the software need a client certificate?
Are certificates renewed automatically?
Are any self-signed certificates involved?

Depending on the answers to these questions, you might need additional processes around the software to keep it running smoothly. Otherwise, things like this might happen:
[Screenshot: not so production-ready certificate]

The Jenkins project was kind enough to discuss this in a publicly accessible ticket. And as you can see in the screenshot, the browser doesn’t even allow you to proceed. So, this broken certificate is actually a service outage. Thanks to Jenkins for discussing this publicly and giving us the opportunity to learn.
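One process that helps here is a regular check of how long a certificate remains valid. Below is a minimal Python sketch using only the standard library; the function names, the default port, and the timeout are assumptions for this example, and a real setup would more likely rely on an existing monitoring tool.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Return the number of full days left before the certificate's notAfter date.

    not_after uses the format found in certificates, e.g. "Jan  5 09:34:43 2018 GMT".
    A negative result means the certificate has already expired.
    """
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).days


def fetch_not_after(host: str, port: int = 443, timeout_s: float = 5.0) -> str:
    """Connect via TLS and return the notAfter field of the server certificate.

    With the default context, an already expired or otherwise invalid certificate
    makes the handshake fail with an ssl.SSLError, which is a finding in itself.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Calling `days_until_expiry(fetch_not_after(host), datetime.now(timezone.utc))` on a schedule and alerting when the result drops below a threshold of your choosing gives you time to react before the browser starts blocking users.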

But why?

Let’s take the opposite perspective. What could possibly go wrong?

But first some definitions up front.
Some people have very specific definitions of what a distributed system is. For this article, we use a not so strict definition:

A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another from any system. The components interact with one another in order to achieve a common goal.

https://en.wikipedia.org/wiki/Distributed_computing

The fallacies

So, by the definition in the previous paragraph, nearly every system you will develop nowadays is a distributed system. In this context, let’s take a look at the fallacies of distributed systems:

The network is reliable;
Latency is zero;
Bandwidth is infinite;
The network is secure;
Topology doesn’t change;
There is one administrator;
Transport cost is zero;
The network is homogeneous.

https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing

For a more detailed explanation of each fallacy, take a look at these blog posts.

Some seem obvious, some are inevitable. But you can circumvent some of these fallacies with production-ready software. That’s one of the reasons why “production-ready” makes your application more robust.

But you cannot circumvent all fallacies. The more complex your system gets, the more the following quote applies:

… complex systems run as broken systems. The system continues to function because it contains so many redundancies and because people can make it function, despite the presence of many flaws.

https://how.complexsystems.fail/#5

Short detour

That’s where observability comes into play. Since there is always something broken, you want to know:

What broke?
Where is the broken part in your system?
How does it affect your users?

But that’s an entirely different topic with more than a handful of books.

The fallacies and public clouds

What’s important is that you are aware of these fallacies and prepare your software accordingly. But not all the parts are under your control. In case your software is running in a public cloud, the cloud providers offer insight into how to make your services more robust on their platforms:

The Amazon Builders’ Library
Azure Application Architecture Guide
Google Cloud — Cloud Architecture Center

As already mentioned above, this article cannot cover the entire topic. It can only give you some hints to get you started.

Conclusion

Based on our experience running software in production, we offer Production Readiness Review (another term coined by the SRE discipline) workshops covering more of these topics. But these topics aren’t specific to us as the ones running the software in production. No matter who runs the software in production, they will be interested in the answers to questions like the ones mentioned above.

Blog author

Christian Zunker
