Selfhosting AI models in your kuberenetes clusters

3.7.2026 | 6 minutes reading time

AI is on everybody's mind nowadays. While some organizations have the possibility to use externally hosted models from e.g. Anthropic, Google, ..., others might not have those options. There are multiple options to host AI models on your own hardware or infrastructure from e.g. AWS, OVHCloud, IONOS, Hetzner. In this article we will create everything you need to host an open weight AI model (qwen3-30b) on a kubernetes cluster that utilized GPU nodes to run the models on.

Why Kubernetes?

Running large language models on any infrastructure is a tough task so you might wonder why bother running it in kubernetes? Why not run the container on a VM or hardware?

Kubernetes is heavily used across a wide spread of industries and often the go to for companies that need to run applications on their own and have more control over where data is located.

It is also the easiest to move around. If you want to move your workloads from AWS to OVH and all you have is running in kubernetes you most certainly have an easier time to move your applications and workload as if you had them on EC2 instances, Bedrock, Lambda, Aurora, etc.

Why GPU Nodes? (and VRAM Math)

While it is possible to run AI models on any compute node with decent memory there are enormous performance differences. This goes for overall response time as well as TTFT (time to first token). If your model only runs on your CPU you probably have to use rather small models with higher quantization or accept to wait for your responses. If you use a GPU you should make the effort to calculate how big of a model and context you can fit into to GPU's VRAM. Only if you can fit the whole model inside is when you can achieve the best performance and get the most out of your models. With llama-server (which is used in this setup) you can also move parts of your model out of the GPU's VRAM (through the -ngl flag) but this will also heavily drag down your performance. However, there are models that are more optimized for this use case. MoE (Mixture of Experts) models seem to perform better in this context.

Here is the calculation used for this setup: We utilize an AWS g5.2xlarge instance that provides an NVIDIA A10G GPU with 24 GB of VRAM. The Qwen3-30B-A3B model in the 4-bit quantized version takes up roughly 19 GB of VRAM. As it is a MoE model we have a small KV cache. Therefore, we can get away with a context size of 32k and still have enough headroom in the VRAM.

Architecture

To move past a basic "hello world" setup, we are integrating kagent—a CNCF sandbox project that brings an open-source agentic runtime directly into Kubernetes. This allows us to interact with our cluster using natural language through Model Context Protocol (MCP) servers and a clean web UI.

The illustration below displays a high level overview over what resources we have here.

Internet
    │ HTTP:80
    ▼
AWS ALB  (provisioned by AWS LBC from Ingress managed in OpenTofu)
    │
    ▼
kagent UI  (CPU system node, namespace: kagent, port 8080)
    │
kagent Engine ──HTTP──► llama-server.llama-server.svc.cluster.local
                                    │
                                    ▼
                         Llama-Server pod (GPU node, g5.2xlarge)
                           nvidia.com/gpu: 1 (A10G, 24 GB VRAM)
                           EBS gp3-retain PVC (model weights, 100 Gi default)

We will use opentofu to deploy this whole stack and use aws, kubernetes and helm providers for the resources. kagent and llama-server will be deployed using the helm provider and their respective helm charts.

If you only want to have an AI model in your cluster you can spare all the kagent setup and only deploy the llama-server.

How to get this setup running?

You can find the necessary code here.

Prerequisites

We are using opentofu to deploy this stack onto AWS as this the easiest way to get access to a kubernetes cluster and GPU nodes for an example at a reasonable price.

We need some resources in AWS to get started with opentofu e.g. a bucket for the state and a DynamoDB table for the lock state.

1BUCKET=my-kagent-tfstate   # must be globally unique
2TABLE=my-kagent-tflock
3REGION=us-east-1
4
5aws s3api create-bucket --bucket $BUCKET --region $REGION
6
7aws s3api put-bucket-versioning \
8  --bucket $BUCKET \
9  --versioning-configuration Status=Enabled
10
11aws s3api put-bucket-encryption \
12  --bucket $BUCKET \
13  --server-side-encryption-configuration \
14    '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'
15
16aws dynamodb create-table \
17  --table-name $TABLE \
18  --attribute-definitions AttributeName=LockID,AttributeType=S \
19  --key-schema AttributeName=LockID,KeyType=HASH \
20  --billing-mode PAY_PER_REQUEST \
21  --region $REGION

Create and adapt the backend config to your needs

1cp infra/backend.hcl.example infra/backend.hcl
2# Edit infra/backend.hcl — set bucket and dynamodb_table to the values above.

Initialise and generate the provider lock file

1cd infra
2tofu init -backend-config=backend.hcl
3
4# Generate the lock file and commit it (one-time, or after provider upgrades).
5tofu providers lock \
6  -platform=linux_amd64 \
7  -platform=linux_arm64 \
8  -platform=darwin_arm64 \
9  -platform=darwin_amd64
10git add .terraform.lock.hcl && git commit -m "chore: add provider lock file"

Restrict API endpoint access (recommended)

The EKS public API endpoint defaults to 0.0.0.0/0. As we don't want everybody to be able to access it we limit the access to our own IP and set it to our egress IP before sharing the cluster:

1# Find your egress IP
2curl -s https://checkip.amazonaws.com
3
4# adapt in infra/terraform.tfvars
5allowed_public_access_cidrs = ["203.0.113.42/32"]

Review the plan

We can use the command below to see what opentofu will create for us:

1tofu plan -var-file=terraform.tfvars

Not all resources are covered by the free tier resources in AWS so there will be costs associated with deploying this!

Apply the plan

If everything is looking fine we can use the command below to let opentofu create all resources needed:

1tofu apply -var-file=terraform.tfvars

This can take quite a while to deploy so be patient and grab a coffee.

Accessing your resources

You can configure your local kubectl for cluster access using this command and get the URL to access kagent:

1$(tofu output -raw kubeconfig_command)
2
3# ALB hostname (~60 s after apply for AWS to provision)
4tofu output kagent_ui_hostname

Now you can access the url provided and check the kagent UI or using regular kubectl you can check if the llama-server is deployed and utilizing the GPU:

1$(tofu output -raw kubeconfig_command)
2
3# All nodes ready — GPU nodes show nvidia.com/gpu=true
4kubectl get nodes -L nvidia.com/gpu,workload-type,role
5
6# Llama-server on a GPU node
7kubectl get pods -n llama-server -o wide
8
9# kagent components on system nodes
10kubectl get pods -n kagent -o wide

Hosting AI Models

Hosting you own open weight or even open source models can be difficult and for many companies this is probably not worth the cost associated to implement and maintain these resources in a fast-changing environment. If however, you can not use public services to get access to AI models this can be a way to still profit from AI in your environment. Closed weight model are still performing better but open weight models are catching up slowly - Open Weight vs Closed Weight comparison.

Outlook: The Future of Agentic Platform Operations

Operating open-weight models in-house requires initial engineering efforts, but the autonomy it brings to enterprise infrastructure is substantial. By wiring your custom LLM deployment into an operational runtime framework like kagent, you shift from basic chatbot interactions to true agentic automation. Platform and DevOps engineers can utilize these localized models to interpret live cluster metrics, trace configuration anomalies, and securely troubleshoot platform infrastructure without a single byte of sensitive data ever leaving your firewall boundaries.

Cost Optimization Tip:

To keep cloud bills low in sandbox or development environments, combine this configuration with Karpenter. It allows you to automatically scale your expensive GPU node groups down to zero when no active inference requests are hitting your cluster.

Was this post helpful?

Blog author

Andreas Maier

Do you still have questions? Just send me a message.

Why every redesign breaks your Playwright project — and how three layers...

TL;DR: We show how a structural separation of UI selectors and business logic can look like when using Playwright, adapting the proven Robot Pattern into the Layered Robot Pattern. This way, browser automation can proceed without fear of UI changes. ...

AI
Software development
Frontend
Testing
Pattern
UX/UI
Test Driven Development
Software architecture
Resilience
Webdevelopment
BDD
Android

3.7.2026 | 9 minutes reading time

Lars Jouon

Rebecca Jox

Replacing Low-Code Platforms with AI-Driven Custom Development in Healthcare

A healthcare software solution needs to be developed to aggregate information (e.g., patient data, diagnoses, lab results) from various medical systems and provide it to another component for further processing via a custom-defined API. The system must...

AI
Software development
Integration

27.6.2026 | 8 minutes reading time

Autonomous development workflows with Claude Code

Most developers today use AI tools as faster autocomplete. Over the past few months, on a client project, I took a different path: multi-agent setups with Claude Code, where specialized agents work in parallel, review one another, and coordinate on their...

AI
Software development
Generative AI

22.6.2026 | 17 minutes reading time

Christoph Dalski

From prompt to product: Why the design step matters

Anyone working with AI-assisted coding assistants today knows the promise: Type a description, and seconds later a working interface appears. Tools like Cursor, Claude Code, or GitHub Copilot deliver increasingly impressive results. Yet what is convincing...

AI
UX/UI
Frontend
Generative AI

16.6.2026 | 9 minutes reading time

Michel Ehmen

Brainstorming With AI — When to Play Devil's Advocate

Brainstorming With AI — When to Play Devil’s Advocate Part of the series Domain-Driven Design Meets AI. Every project starts with a blank canvas, and the blank canvas is where good ideas go to die. You put 8–12 people in a room, point at an empty whiteboard...

DDD
Generative AI
LLM

15.6.2026 | 10 minutes reading time

Ensuring accessibility with AI: what works today (and what doesn't)

Since June 2025, the Barrierefreiheitsstärkungsgesetz (BFSG), Germany's law implementing the European Accessibility Act, has been in effect. Most teams know they should be doing something about it, but in day-to-day work, the topic usually falls by the...

Accessibility
AI
UX/UI
Testing

2.6.2026 | 11 minutes reading time

Building MCP Servers with Spring AI

Introduction The Model Context Protocol (MCP) is an open standard that defines how AI models communicate with external tools, services, and data sources. It replaces ad-hoc integrations with a single, well-defined JSON-RPC 2.0 protocol, making it easy...

AI
Software development

17.5.2026 | 5 minutes reading time

Tobias Trelle

From Inference to Governance: Why Agent Metadata Matters When LLMs Already...

Modern LLMs demonstrate strong capability in inferring meaning from column names. A tool such as Genie can typically resolve pct_cust_attrit_q to "churn" or map rev_mrr_usd to a"MRR" through pattern recognition alone. On a small, well-structured table...

AI
LLM
Big Data
Database

15.5.2026 | 6 minutes reading time

Niklas Niggemann

AI as a Design Partner — Drafter, Validator, Provocateur

Part of the series Domain-Driven Design Meets AI. The previous post introduced the Synergetic Blueprint as the structured process that turns DDD methods into a coherent end-to-end design flow, and made the case that AI augments every step of it. This...

14.5.2026 | 12 minutes reading time

The Accessible Domain: Knowledge Engineering for AI-Assisted Development

The Old Promise In the late 1970s, Stanford computer scientist Edward Feigenbaum coined the term "Knowledge Engineering". He described it as the process of extracting expert knowledge, structuring it, and making it usable within a software system. Central...

Generative AI
AI
LLM
Software Modernization
Software development

11.5.2026 | 10 minutes reading time

Johannes Barop

Benjamin Font Pera

Data Quality Powers AI Analytics: Building Trustworthy Genie Spaces in...

Garbage In, Garbage Out. This computing truism has never been more critical than in the age of AI. Large Language Models don't amplify poor data quality, they wrap it in confident-sounding prose that can mislead even experienced users. As organizations...

Generative AI
LLM
AI
Data

7.5.2026 | 8 minutes reading time

Niklas Niggemann

16,000 Tests in 4 Days – Reaching 80% Test Coverage with Claude Code

The Starting Point When we at codecentric recently took over a codebase from a previous service provider for a client, it quickly became clear that this would be no ordinary challenge. Backends, frontends, batch jobs, services — a grown application landscape...

AI
Software development
Testing

5.5.2026 | 12 minutes reading time

Selvarajah Sivarupan

Isolated Kubernetes GitOps with FluxCD and OCI Repositories

Isolated Kubernetes GitOps with FluxCD and OCI Repositories Introduction: The Challenge of Isolated Environments Operating Kubernetes in isolated environments presents unique challenges for platform engineering teams. When clusters have no direct access...

Kubernetes
Infrastructure
DevSecOps
Compliance

5.5.2026 | 8 minutes reading time

Sven Hertzberg

The Synergetic Blueprint Revisited — and Why AI Changes Everything

From Workshop to Working Software — the Gap Nobody Talks About Most teams that adopt Domain-Driven Design invest heavily in workshops. Domain Storytelling sessions, EventStorming boards, context mapping exercises — the collaboration is real, and the ...

28.4.2026 | 8 minutes reading time

Is Spring Boot Becoming Obsolete?

In March 2026, we kicked off a modernization project for a client. Spring Boot was an obvious choice. There was a strategic decision behind it. There was existing know-how. There was existing infrastructure. The team was set. The work began. One of the...

Generative AI
LLM
AI
Software development
Software architecture

27.4.2026 | 7 minutes reading time

Johannes Barop

EXACT Coding: AI-powered development that prioritizes quality over chaotic...

TL;DR Uncontrolled agentic coding (“vibe coding”) delivers code quickly—and often leads to security and maintenance issues as soon as the software goes live. EXACT Coding (Example-guided AI-Collaborative Test-driven Coding) combines best practices: ....

Generative AI
AI
Test Driven Development

22.4.2026 | 7 minutes reading time

Marco Emrich

Ferdinand Ade

The Ralph Wiggum Loop: Autonomous Code Generation with a Fresh Context

Ralph Wiggum is the simple-minded boy from The Simpsons who says things like "I'm learnding!" and eats glue. Of all people, he is now the namesake for a technique for autonomous code generation. The idea behind: If the thought of letting code be generated...

Generative AI
LLM
AI
Software development

6.4.2026 | 7 minutes reading time

Johannes Barop

KubeCon Europe 2026: AI agents go to production

tl;dr A summary of KubeCon Europe 2026: It is the year AI agents move from prototypes to production. This article covers what that means: giving agents verifiable identities, routing inference traffic with the new Gateway API Inference Extension, governing...

Cloud native
AI

31.3.2026 | 11 minutes reading time

AI Code Tsunami Hits the QA Dam: The End of Balanced Velocity

Note upfront: This article is specifically aimed at teams working on the modernization and further development of existing systems, not at greenfield projects where completely different rules apply. Everyone is talking about the massive productivity ...

Generative AI
AI
DevOps
Test Driven Development
Testing

30.3.2026 | 8 minutes reading time

DeepFake: Detect AI-Generated Images in 5 Steps

We live in a time when an image is no longer a reliable guarantee of truth. AI‑generated content floods social media feeds, news platforms and messenger groups every single day, and only very few people are able to tell the difference. What once required...

IT-Security
AI
Generative AI
Search
Google
data protection
Digitalization

16.3.2026 | 5 minutes reading time

Selfhosting AI models in your kuberenetes clusters

Why Kubernetes?

Why GPU Nodes? (and VRAM Math)

Architecture

How to get this setup running?

Prerequisites

Create and adapt the backend config to your needs

Initialise and generate the provider lock file

Restrict API endpoint access (recommended)

Review the plan

Apply the plan

Accessing your resources

Hosting AI Models

Outlook: The Future of Agentic Platform Operations

Cost Optimization Tip:

Was this post helpful?

Blog author

More articles in this subject area

Why every redesign breaks your Playwright project — and how three layers...

Replacing Low-Code Platforms with AI-Driven Custom Development in Healthcare

Autonomous development workflows with Claude Code

From prompt to product: Why the design step matters

Brainstorming With AI — When to Play Devil's Advocate

Ensuring accessibility with AI: what works today (and what doesn't)

Building MCP Servers with Spring AI

From Inference to Governance: Why Agent Metadata Matters When LLMs Already...

AI as a Design Partner — Drafter, Validator, Provocateur

The Accessible Domain: Knowledge Engineering for AI-Assisted Development

Data Quality Powers AI Analytics: Building Trustworthy Genie Spaces in...

16,000 Tests in 4 Days – Reaching 80% Test Coverage with Claude Code

Isolated Kubernetes GitOps with FluxCD and OCI Repositories

The Synergetic Blueprint Revisited — and Why AI Changes Everything

Is Spring Boot Becoming Obsolete?

EXACT Coding: AI-powered development that prioritizes quality over chaotic...

The Ralph Wiggum Loop: Autonomous Code Generation with a Fresh Context

KubeCon Europe 2026: AI agents go to production

AI Code Tsunami Hits the QA Dam: The End of Balanced Velocity

DeepFake: Detect AI-Generated Images in 5 Steps