The how of monitoring your services

17.11.2020 | 5 minutes reading time

Lately, there has been a lot of discussion about SLAs, SLOs and SLIs. As this article states, it is hard to define the correct SLOs and SLIs. That discussion is about what part of your services you want to monitor. But it is just as difficult to measure these things correctly. In this blog post I look at two examples of what can (and for us, did) go wrong in monitoring. This post is about how you monitor your services.

Example: TCP connections for monitored services

The first example will be about TCP connections, a proxy, and handshakes.

Expectation vs. reality

For one of our projects we use an authentication proxy which talks to an LDAP server as its backend. We came across connections piling up on the server hosting this proxy. At first, it was not clear what was causing these connections.

After the proxy was installed, I integrated it into our Zabbix monitoring. To verify the proxy is answering requests, I used Zabbix’ built-in check net.tcp.connect. At first all seemed fine. The check was doing exactly what I expected.

But after a while we saw connections to the backend piling up on the server running the proxy. As no one was using the proxy for authentication at that point, I suspected that Zabbix was causing the large number of connections. The monitoring of the service wasn’t working as expected. But what exactly was happening?

Each time Zabbix initiated the check, it performed a three-way TCP handshake (SYN, SYN-ACK, ACK) and after that tore down the connection again.

[Figures: TCP three-way handshake and TCP connection teardown. Source: https://www.cs.purdue.edu/homes/park/cs536-e2e-3.pdf]

A tcpdump trace of the check showed exactly this sequence.

That was expected, so why were there so many connections still left on the system?

The proxy responded correctly, so the Zabbix check said everything is fine. But what happened on the connection from the proxy to the backend system?

It turned out that the proxy started a TLS connection to the backend for every incoming TCP connection. It did not matter to the proxy that no data was sent. The TLS connection to the backend should not have been a problem either: it should have been torn down when the TCP connection from Zabbix to the proxy ended. But that is theory. In reality, the TLS connection persisted even after the correct TCP teardown.

The Swiss army knife of networking

So, I found the connections piling up on the proxy. But I still did not know what the real problem was. I tried to get a more precise view by connecting to the proxy manually with netcat: nc -v backend.example.com 8636

But nothing unusual happened. Each time I opened a connection to the proxy with netcat, the proxy started a TLS connection to the backend. When I closed netcat, the proxy tore down the TLS connection to the backend. No connections piled up on the proxy. What was different? After some more testing and man page reading I managed to reproduce the Zabbix behaviour with netcat: nc -z -v backend.example.com 8636

The parameter that did the trick was -z. It instructs netcat to close the connection after a successful connect:


-z      Specifies that nc should just scan for listening daemons, 
        without sending any data to them. It is an error to use 
        this option in conjunction with the -l option.
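The connect-then-close pattern that -z triggers can also be reproduced without netcat. The following is a minimal Python sketch, not what Zabbix or netcat run internally; the function name tcp_connect_check and the default timeout are made up for illustration.

```python
import socket


def tcp_connect_check(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established.

    The socket is closed right after the handshake and no payload is sent.
    This is the connect-then-close pattern described in the text.
    """
    try:
        # create_connection performs the three-way handshake; leaving the
        # "with" block closes the socket immediately afterwards.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hosts.
        return False
```

From the point of view of the monitoring system such a check is perfectly healthy: it only reports whether the port accepts connections and says nothing about what the service does with a connection that is closed right away.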

So, the problem is not specific to Zabbix; it seems to lie with the proxy. During the tests with netcat I observed that the problem did not appear when I used netcat in interactive mode.

Perhaps everything is a timing problem?

Netcat offers another handy parameter for these tests:


-w timeout
             Connections which cannot be established or are idle 
             timeout after timeout seconds. The -w flag has no 
             effect on the -l option, i.e. nc will listen forever
             for a connection, with or without the -w flag. 
             The default is no timeout.

So, I tried it again with nc -w 1 -v control01.baremetal 8636, and it worked.

I did some more tests with this parameter, and none of them left connections behind. A closer look at the tcpdump traces showed that the TLS connection is not torn down when the initiating TCP connection to the proxy ends before the TLS handshake has finished. As soon as the TCP teardown sequence starts after the TLS handshake is done, the TLS connection also ends as expected.

Monitoring the service

So, I used the netcat command to create a new Zabbix check with the slowed-down TCP disconnect. It is not the perfect solution, but it works fine for my situation.

For completeness: implementing the check in Zabbix did not go without problems. In short, when netcat is run by Zabbix it also needs the parameter -d, which tells it not to read from stdin. Otherwise, it does something weird with stdin and the parameter -w 1 has no effect.
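The idea behind the slowed-down disconnect is independent of netcat. As a rough Python sketch of the same approach (the function name tcp_check_with_delay and the one-second default are assumptions for illustration, not the check I deployed):

```python
import socket
import time


def tcp_check_with_delay(host: str, port: int,
                         hold_seconds: float = 1.0,
                         timeout: float = 3.0) -> bool:
    """Connect, keep the idle connection open for a moment, then close it.

    Nothing is sent. The pause only gives the peer time to finish its own
    setup (for example a TLS handshake to a backend) before the teardown
    from our side starts.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            time.sleep(hold_seconds)
            return True
    except OSError:
        return False
```

The price is that every check takes at least hold_seconds longer, and the right delay depends on how long the proxy needs for its backend handshake, so this stays a workaround rather than a fix.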

Example: State of a monitored service

This is another example from an application we monitor. There we monitored the availability and response time of an HTTP endpoint. The first approach was a simple HTTP GET. The graph of the measured response times showed that the response time grew the more often we queried the endpoint.

As it turned out, the application held state associated with the endpoint. Surprise: not everything is stateless. This state grew bigger and bigger each time we queried the endpoint. Therefore, it took the application longer and longer to process our requests. The session timeout was too long for the session to be discarded between the monitoring queries.

The application had to be modified so that it no longer creates a session for the endpoints used for monitoring. This is just one example; the same effect can also show up in disk space, memory, or CPU consumption.
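To make the fix concrete, here is a toy Python sketch of an application that skips session creation for its monitoring endpoint. Everything in it is hypothetical: the class ToyApp, the path /health and the shape of the session data are invented for illustration and do not reflect the real application.

```python
import uuid


class ToyApp:
    """Hypothetical application that keeps session state per request."""

    # Assumed set of endpoints that are only used by the monitoring.
    MONITORING_PATHS = {"/health"}

    def __init__(self) -> None:
        # Maps a session id to the data stored for that session.
        self.sessions = {}

    def handle(self, path: str) -> int:
        """Handle a request and return an HTTP-like status code."""
        if path not in self.MONITORING_PATHS:
            # Regular requests get a session; monitoring requests do not,
            # so repeated checks cannot grow the application state.
            self.sessions[str(uuid.uuid4())] = {"path": path}
        return 200
```

With this split, a thousand monitoring requests leave the session store empty, while regular requests still get their sessions as before.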

Conclusion

It is not only hard to define SLOs/SLIs and the correct measures from a user's perspective. As these examples show, it is also hard to monitor services correctly without your measurement affecting the selected SLOs. It is crucial to know not only what service to monitor, but also how to monitor it. The observer effect does not only apply to quantum physics.

Blog author

Christian Zunker
