Monitoring AWS Lambda functions with CloudWatch

23.10.2018 | 10 minutes reading time

Introduction

Functions as a Service products like AWS Lambda provide a great deal of convenience compared to bare metal, virtual machines, and also containerized deployments. You only have to manage the actual code you want to run and the rest is taken care of by the cloud provider. But are they also convenient to operate?

In this blog post we want to take a look into how to assist Lambda operations through monitoring and alerting using Amazon CloudWatch. We will use existing metrics but also create a custom metric filter to parse the memory consumption from CloudWatch logs.

The metrics are visualized in a CloudWatch dashboard and alarms are configured to push a notification towards an AWS SNS topic in case a threshold is breached. As usual, everything will be deployed with HashiCorp Terraform. Below you find a screenshot of the resulting dashboard that we will have at the end of the post.

The source code is available on GitHub. Please note that we are not going to discuss the topics of managing multiple Lambda functions within a single repository or how to show the alarm notifications inside a Slack channel. Please refer to other posts for more information on those topics.

Metrics

In CloudWatch, metrics are organized in so-called namespaces. A namespace is like a folder for metrics and can be used to group together metrics of the same application. A metric is a time-ordered set of data points, also known as a time series. Examples are CPU usage of an EC2 instance or number of requests made towards your API.

Most AWS services send predefined metrics to CloudWatch out of the box, but it is also possible to send custom metrics. As of today, AWS Lambda exposes the following metrics to CloudWatch out of the box:

Invocations. The invocations metric measures the number of times a function is invoked. Invocations can happen either through an event or an invocation API call. It includes both failed and successful invocations, but not failed invocation requests, e.g. if throttling occurs.
Errors. This metric measures the number of times an invocation failed due to an error in the function, a function timeout, out of memory error, or permission error. It does not include failures due to exceeding concurrency limits or internal service errors.
Dead letter errors. If you configured a dead letter queue, AWS Lambda is going to write the event payload of failed invocations into this queue. The dead letter error metric captures failed deliveries of dead letters.
Duration. The duration measures the elapsed wall clock time from when the function code starts to when it stops executing. Watch out, as the clock is not monotonic, you might get negative values.
Throttles. Throttled invocations are counted whenever an invocation attempt fails due to exceeded concurrency limits.
Iterator age (stream only). The iterator age metric is only available when the Lambda function is invoked by an AWS streaming service such as Kinesis. It represents the time difference between an event being written to the stream and the time it gets picked up by the Lambda function.
Concurrent executions. This metric is an account-wide aggregate metric indicating the sum of concurrent executions for a given function. It is applicable to functions with a custom concurrency limit.
Unreserved concurrent executions. Similar to the concurrent execution metric, the unreserved concurrent executions metric is also an account-wide metric. It indicates the sum of concurrency of all functions that do not have a custom concurrency limit specified.

Every metric can have up to ten dimensions assigned. A dimension is a key-value pair that describes a metric and can be used to uniquely identify a metric in addition to the metric name. Metrics emitted by AWS Lambda have the following dimensions:

Function name. The function name can be used to select a metric based on the name of the Lambda function.
Resource. The resource dimension is useful to filter based on function version or alias.
Executed version. You can use the executed version dimension to filter based on the function version when using alias invocations.

Metrics can be aggregated through so-called statistics. CloudWatch offers the following statistics: minimum, maximum, sum, average, count, and percentiles. The statistics are computed within the specified period. As far as I understand, it uses a tumbling window to do that, but I am not entirely sure. Please refer to my previous post about Window Functions in Stream Analytics for more information on tumbling windows.

Metric Filters

As mentioned earlier, it is also possible to generate custom metrics in addition to the ones that AWS services provide out of the box. When looking at AWS Lambda a metric of common interest is the maximum memory consumption.

Custom metrics can be written directly through the CloudWatch API or using the AWS SDK. However, the Lambda function itself does not have any information about its memory consumption. How can we solve this problem? Luckily, metric filters come to the rescue! After every function execution, AWS writes a report into the CloudWatch logs that looks like this:

REPORT RequestId: f420d819-d07e-11e8-9ef2-4d8f649fd167  Duration: 158.15 ms Billed Duration: 200 ms Memory Size: 128 MB Max Memory Used: 38 MB

This report contains the information we are looking for: Max Memory Used: 38 MB. CloudWatch provides a convenient functionality to convert logs into metrics called a metric filter.

A filter consists of a pattern, a name, a namespace, a value, and an optional default value. It applies the pattern to each log line and if it matches, emits the specified value inside a metric of the given name in the given namespace. The default value will be emitted if no log events occur.

To extract the maximum memory from the log line we can use the pattern below and emit the value $max_memory_used_value. For more information on the pattern syntax please refer to the official documentation.

[
  report_label=\"REPORT\",
  request_id_label=\"RequestId:\", request_id_value,
  duration_label=\"Duration:\", duration_value, duration_unit=\"ms\",
  billed_duration_label1=\"Billed\", bill_duration_label2=\"Duration:\", billed_duration_value, billed_duration_unit=\"ms\",
  memory_size_label1=\"Memory\", memory_size_label2=\"Size:\", memory_size_value, memory_size_unit=\"MB\",
  max_memory_used_label1=\"Max\", max_memory_used_label2=\"Memory\", max_memory_used_label3=\"Used:\", max_memory_used_value, max_memory_used_unit=\"MB\"
]
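In Terraform, the pattern can be wired into an `aws_cloudwatch_log_metric_filter` resource. The following is a minimal sketch and not necessarily the code from the repository: the resource name, the metric name `calculator-max-memory-used`, and the use of `local.project-name` as the namespace are assumptions. It also assumes that the function's log group `/aws/lambda/<function name>` already exists.

```hcl
locals {
  # The pattern from above on a single line, with quotes escaped for Terraform.
  max-memory-pattern = "[report_label=\"REPORT\", request_id_label=\"RequestId:\", request_id_value, duration_label=\"Duration:\", duration_value, duration_unit=\"ms\", billed_duration_label1=\"Billed\", bill_duration_label2=\"Duration:\", billed_duration_value, billed_duration_unit=\"ms\", memory_size_label1=\"Memory\", memory_size_label2=\"Size:\", memory_size_value, memory_size_unit=\"MB\", max_memory_used_label1=\"Max\", max_memory_used_label2=\"Memory\", max_memory_used_label3=\"Used:\", max_memory_used_value, max_memory_used_unit=\"MB\"]"
}

resource "aws_cloudwatch_log_metric_filter" "calculator-max-memory" {
  name           = "${local.project-name}-calculator-max-memory"
  log_group_name = "/aws/lambda/${aws_lambda_function.calculator.function_name}"
  pattern        = "${local.max-memory-pattern}"

  metric_transformation {
    name      = "calculator-max-memory-used" # hypothetical metric name
    namespace = "${local.project-name}"      # hypothetical namespace
    value     = "$max_memory_used_value"     # field captured by the pattern
  }
}
```

Every REPORT line that matches the pattern then produces one data point whose value is the number in front of the final MB.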

I noticed that the maximum memory being written to the log is already aggregated across some previous invocations. My suspicion is that it refers to the maximum memory of one running instance behind the scenes which gets reset every time the function is starting up again, e.g. after a break or redeployment. If you have more information on that matter, please leave a comment!

Alarms

Metrics are an important building block to support your operations. In order to make them truly useful, we need to define a process including either automated or manual actions based on how metric values change over time.

CloudWatch allows you to define alarms, which are rules that trigger actions based on a threshold over a number of time periods for one metric. Each alarm is associated with one or more actions. This can be an EC2 action, an EC2 Auto Scaling action, or an SNS notification. If you send an SNS event, you can implement many different consumers, like a Lambda function sending a Slack message.
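The alarm listing further down refers to an SNS topic as `aws_sns_topic.alarms`. A minimal sketch of such a topic could look as follows. The topic name is an assumption, and subscriptions (for example a Lambda function forwarding to Slack) are left out.

```hcl
# Topic that receives all alarm state changes; the name is hypothetical.
resource "aws_sns_topic" "alarms" {
  name = "${local.project-name}-alarms"
}
```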

For our Lambda function, we will implement the following three basic alarms:

Execution Time. Every Lambda function has a configurable timeout. If your code runs longer than the timeout specified, the invocation will be aborted. You can create an alarm if the execution time exceeds a certain percentage of the configured timeout. This way, CloudWatch will notify you in case you might need to adjust the threshold or improve the performance of your code.
Maximum Memory. Similar to the execution time, there is also a configurable maximum amount of memory available. Thanks to our previously defined metric filter for the maximum used memory, we can trigger an alarm if a certain threshold is exceeded.
Execution Errors. Sometimes code breaks. It might happen because some downstream service is not available, the input format changed or your code contains a bug. By triggering an alarm for execution errors, you can receive notifications and act accordingly. Note, however, that this will include all errors even if the invocation succeeded after a retry. If you are only interested in events that could not be processed even after retrying, you need to configure a dead letter queue.

If you are using Terraform, you can directly interpolate the execution timeout and maximum memory and pick a percentage for the threshold. The following listing illustrates creating an alarm resource with Terraform that gets triggered if your execution time exceeds 75% of the timeout.

resource "aws_cloudwatch_metric_alarm" "calculator-time" {
  alarm_name          = "${local.project-name}-calculator-execution-time"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "1"
  metric_name         = "Duration"
  namespace           = "AWS/Lambda"
  period              = "60"
  statistic           = "Maximum"
  threshold           = "${aws_lambda_function.calculator.timeout * 1000 * 0.75}"
  alarm_description   = "Calculator Execution Time"
  treat_missing_data  = "ignore"

  insufficient_data_actions = [
    "${aws_sns_topic.alarms.arn}",
  ]

  alarm_actions = [
    "${aws_sns_topic.alarms.arn}",
  ]

  ok_actions = [
    "${aws_sns_topic.alarms.arn}",
  ]

  dimensions {
    FunctionName = "${aws_lambda_function.calculator.function_name}"
    Resource     = "${aws_lambda_function.calculator.function_name}"
  }
}
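The other two alarms can follow the same shape. The sketch below is an assumption about how they might look, not the repository code. The memory alarm presumes a metric filter that emits a metric named `calculator-max-memory-used` into the namespace `local.project-name` (both hypothetical names), and the error threshold of one error per minute is just an example. `dimensions` is written with `=` here because newer Terraform versions expect map arguments to be assigned that way.

```hcl
resource "aws_cloudwatch_metric_alarm" "calculator-memory" {
  alarm_name          = "${local.project-name}-calculator-max-memory"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "1"
  metric_name         = "calculator-max-memory-used" # hypothetical custom metric
  namespace           = "${local.project-name}"      # hypothetical namespace
  period              = "60"
  statistic           = "Maximum"
  # 75% of the configured memory size (both values are in MB)
  threshold           = "${aws_lambda_function.calculator.memory_size * 0.75}"
  alarm_description   = "Calculator Maximum Memory"
  treat_missing_data  = "ignore"
  alarm_actions       = ["${aws_sns_topic.alarms.arn}"]
  ok_actions          = ["${aws_sns_topic.alarms.arn}"]
}

resource "aws_cloudwatch_metric_alarm" "calculator-errors" {
  alarm_name          = "${local.project-name}-calculator-execution-errors"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "1"
  metric_name         = "Errors"
  namespace           = "AWS/Lambda"
  period              = "60"
  statistic           = "Sum"
  threshold           = "1" # example: alarm on the first error within a period
  alarm_description   = "Calculator Execution Errors"
  treat_missing_data  = "notBreaching" # no data points means no errors
  alarm_actions       = ["${aws_sns_topic.alarms.arn}"]
  ok_actions          = ["${aws_sns_topic.alarms.arn}"]

  dimensions = {
    FunctionName = "${aws_lambda_function.calculator.function_name}"
    Resource     = "${aws_lambda_function.calculator.function_name}"
  }
}
```

The memory alarm has no dimensions because the metric is assumed to be emitted by a metric filter without any.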

If you log in to the AWS Console, CloudWatch shows you an overview of all your current alarms. The table below illustrates the three alarms for our example function. I generated some test events, and one of them caused an error inside the function, which triggered the corresponding alarm.

In addition to the table, you also have a very simple graph view of each alarm which is independent of CloudWatch dashboards. The next figure depicts the three graphs for our alarms.

Dashboard

I am a big fan of automation. I believe that nobody should have to look at dashboards 24/7 trying to spot errors. Nevertheless, dashboards are very useful to get a quick overview of the system. They also allow humans to spot new patterns which lead to implementing new types of alarms.

In CloudWatch a dashboard consists of multiple widgets. A widget can be a graph of metrics or text in Markdown syntax. For our example function, we want to plot the four metrics execution time, max memory used, execution errors, and invocations.

Dashboards are internally stored as JSON objects and can also be managed by Terraform. The dashboard object consists of an array of widget objects. The source code of the complete dashboard (cloudwatch_dashboard.tf) is too large to be displayed here, so we will only look at two widgets to illustrate the point. The following listing shows the invocation sum widget.

{
  "type": "metric",
  "x": 12,
  "y": 7,
  "width": 12,
  "height": 6,
  "properties": {
    "metrics": [
      [
        "AWS/Lambda", "Invocations",
        "FunctionName", "${aws_lambda_function.calculator.function_name}",
        "Resource", "${aws_lambda_function.calculator.function_name}",
        {
          "color": "${local.dashboard-calculator-invocation-color}",
          "stat": "Sum",
          "period": 10
        }
      ]
    ],
    "view": "timeSeries",
    "stacked": false,
    "region": "${data.aws_region.current.name}",
    "title": "Invocations"
  }
}

And here is what it looks like in the browser:

We can also add horizontal annotations to indicate our alarm threshold. Additionally, it can be useful to display different statistics. For the execution time widget, we added a horizontal annotation as well as two statistics: Maximum and average execution time. Please find the code and a screenshot of the result below.

{
  "type": "metric",
  "x": 0,
  "y": 1,
  "width": 12,
  "height": 6,
  "properties": {
    "metrics": [
      [
        "AWS/Lambda", "Duration",
        "FunctionName", "${aws_lambda_function.calculator.function_name}",
        "Resource", "${aws_lambda_function.calculator.function_name}",
        {
          "stat": "Maximum",
          "yAxis": "left",
          "label": "Maximum Execution Time",
          "color": "${local.dashboard-calculator-max-time-color}",
          "period": 10
        }
      ],
      [
        "AWS/Lambda", "Duration",
        "FunctionName", "${aws_lambda_function.calculator.function_name}",
        "Resource", "${aws_lambda_function.calculator.function_name}",
        {
          "stat": "Average",
          "yAxis": "left",
          "label": "Average Execution Time",
          "color": "${local.dashboard-calculator-avg-time-color}",
          "period": 10
        }
      ]
    ],
    "view": "timeSeries",
    "stacked": false,
    "region": "${data.aws_region.current.name}",
    "yAxis": {
      "left": {
        "min": 0,
        "max": ${aws_lambda_function.calculator.timeout}000,
        "label": "ms",
        "showUnits": false
      }
    },
    "title": "Execution Time",
    "period": 300,
    "annotations": {
      "horizontal": [{
          "color": "${local.dashboard-calculator-max-time-color}",
          "label": "Alarm Threshold",
          "value": ${aws_cloudwatch_metric_alarm.calculator-time.threshold}
        }
      ]
    }
  }
}
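Widget objects like the two above go into the `widgets` array of the dashboard body. A minimal sketch of the surrounding `aws_cloudwatch_dashboard` resource might look like this. The resource and dashboard names are assumptions, and a single Markdown text widget stands in for the full widget list.

```hcl
resource "aws_cloudwatch_dashboard" "calculator" {
  dashboard_name = "${local.project-name}-calculator" # hypothetical name

  # The body is a JSON document. Terraform resolves interpolations inside
  # the heredoc before sending it to CloudWatch.
  dashboard_body = <<EOF
{
  "widgets": [
    {
      "type": "text",
      "x": 0,
      "y": 0,
      "width": 24,
      "height": 1,
      "properties": {
        "markdown": "# ${aws_lambda_function.calculator.function_name}"
      }
    }
  ]
}
EOF
}
```

Additional widgets are appended to the array as comma-separated objects, positioned through their `x`, `y`, `width`, and `height` values.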

Conclusion

In this post, we have seen how to create CloudWatch metric filters, alarms, and dashboards. We looked at the different metrics that are provided for Lambda functions out of the box and how to parse the maximum memory consumption from CloudWatch logs using a metric filter.

You can add automated alerting or even precautionary actions based on alarms in order to react to dangerous situations in your system. A dashboard helps humans to get a glimpse of what is going on from time to time.

Have you ever used CloudWatch to manage metrics and alarms? I personally find it much more convenient to work with than an external solution like Elasticsearch when working on Lambda functions. What is your opinion? Please let me know in the comments below.

Cover image by Roger Schultz

Blog author

Frank Rosner
