Popular searches
//

Selfhosting AI models in your kuberenetes clusters

3.7.2026 | 6 minutes reading time

AI is on everybody's mind nowadays. While some organizations have the possibility to use externally hosted models from e.g. Anthropic, Google, ..., others might not have those options. There are multiple options to host AI models on your own hardware or infrastructure from e.g. AWS, OVHCloud, IONOS, Hetzner. In this article we will create everything you need to host an open weight AI model (qwen3-30b) on a kubernetes cluster that utilized GPU nodes to run the models on.

Why Kubernetes?

Running large language models on any infrastructure is a tough task so you might wonder why bother running it in kubernetes? Why not run the container on a VM or hardware?

Kubernetes is heavily used across a wide spread of industries and often the go to for companies that need to run applications on their own and have more control over where data is located.

It is also the easiest to move around. If you want to move your workloads from AWS to OVH and all you have is running in kubernetes you most certainly have an easier time to move your applications and workload as if you had them on EC2 instances, Bedrock, Lambda, Aurora, etc.

Why GPU Nodes? (and VRAM Math)

While it is possible to run AI models on any compute node with decent memory there are enormous performance differences. This goes for overall response time as well as TTFT (time to first token). If your model only runs on your CPU you probably have to use rather small models with higher quantization or accept to wait for your responses. If you use a GPU you should make the effort to calculate how big of a model and context you can fit into to GPU's VRAM. Only if you can fit the whole model inside is when you can achieve the best performance and get the most out of your models. With llama-server (which is used in this setup) you can also move parts of your model out of the GPU's VRAM (through the -ngl flag) but this will also heavily drag down your performance. However, there are models that are more optimized for this use case. MoE (Mixture of Experts) models seem to perform better in this context.

Here is the calculation used for this setup: We utilize an AWS g5.2xlarge instance that provides an NVIDIA A10G GPU with 24 GB of VRAM. The Qwen3-30B-A3B model in the 4-bit quantized version takes up roughly 19 GB of VRAM. As it is a MoE model we have a small KV cache. Therefore, we can get away with a context size of 32k and still have enough headroom in the VRAM.

Architecture

To move past a basic "hello world" setup, we are integrating kagent—a CNCF sandbox project that brings an open-source agentic runtime directly into Kubernetes. This allows us to interact with our cluster using natural language through Model Context Protocol (MCP) servers and a clean web UI.

The illustration below displays a high level overview over what resources we have here.

Internet
    │ HTTP:80
    ▼
AWS ALB  (provisioned by AWS LBC from Ingress managed in OpenTofu)
    │
    ▼
kagent UI  (CPU system node, namespace: kagent, port 8080)
    │
kagent Engine ──HTTP──► llama-server.llama-server.svc.cluster.local
                                    │
                                    ▼
                         Llama-Server pod (GPU node, g5.2xlarge)
                           nvidia.com/gpu: 1 (A10G, 24 GB VRAM)
                           EBS gp3-retain PVC (model weights, 100 Gi default)

We will use opentofu to deploy this whole stack and use aws, kubernetes and helm providers for the resources. kagent and llama-server will be deployed using the helm provider and their respective helm charts.

If you only want to have an AI model in your cluster you can spare all the kagent setup and only deploy the llama-server.

How to get this setup running?

You can find the necessary code here.

Prerequisites

We are using opentofu to deploy this stack onto AWS as this the easiest way to get access to a kubernetes cluster and GPU nodes for an example at a reasonable price.

We need some resources in AWS to get started with opentofu e.g. a bucket for the state and a DynamoDB table for the lock state.

1BUCKET=my-kagent-tfstate   # must be globally unique
2TABLE=my-kagent-tflock
3REGION=us-east-1
4
5aws s3api create-bucket --bucket $BUCKET --region $REGION
6
7aws s3api put-bucket-versioning \
8  --bucket $BUCKET \
9  --versioning-configuration Status=Enabled
10
11aws s3api put-bucket-encryption \
12  --bucket $BUCKET \
13  --server-side-encryption-configuration \
14    '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'
15
16aws dynamodb create-table \
17  --table-name $TABLE \
18  --attribute-definitions AttributeName=LockID,AttributeType=S \
19  --key-schema AttributeName=LockID,KeyType=HASH \
20  --billing-mode PAY_PER_REQUEST \
21  --region $REGION

Create and adapt the backend config to your needs

1cp infra/backend.hcl.example infra/backend.hcl
2# Edit infra/backend.hcl — set bucket and dynamodb_table to the values above.

Initialise and generate the provider lock file

1cd infra
2tofu init -backend-config=backend.hcl
3
4# Generate the lock file and commit it (one-time, or after provider upgrades).
5tofu providers lock \
6  -platform=linux_amd64 \
7  -platform=linux_arm64 \
8  -platform=darwin_arm64 \
9  -platform=darwin_amd64
10git add .terraform.lock.hcl && git commit -m "chore: add provider lock file"

Restrict API endpoint access (recommended)

The EKS public API endpoint defaults to 0.0.0.0/0. As we don't want everybody to be able to access it we limit the access to our own IP and set it to our egress IP before sharing the cluster:

1# Find your egress IP
2curl -s https://checkip.amazonaws.com
3
4# adapt in infra/terraform.tfvars
5allowed_public_access_cidrs = ["203.0.113.42/32"]

Review the plan

We can use the command below to see what opentofu will create for us:

1tofu plan -var-file=terraform.tfvars

Not all resources are covered by the free tier resources in AWS so there will be costs associated with deploying this!

Apply the plan

If everything is looking fine we can use the command below to let opentofu create all resources needed:

1tofu apply -var-file=terraform.tfvars

This can take quite a while to deploy so be patient and grab a coffee.

Accessing your resources

You can configure your local kubectl for cluster access using this command and get the URL to access kagent:

1$(tofu output -raw kubeconfig_command)
2
3# ALB hostname (~60 s after apply for AWS to provision)
4tofu output kagent_ui_hostname

Now you can access the url provided and check the kagent UI or using regular kubectl you can check if the llama-server is deployed and utilizing the GPU:

1$(tofu output -raw kubeconfig_command)
2
3# All nodes ready — GPU nodes show nvidia.com/gpu=true
4kubectl get nodes -L nvidia.com/gpu,workload-type,role
5
6# Llama-server on a GPU node
7kubectl get pods -n llama-server -o wide
8
9# kagent components on system nodes
10kubectl get pods -n kagent -o wide

Hosting AI Models

Hosting you own open weight or even open source models can be difficult and for many companies this is probably not worth the cost associated to implement and maintain these resources in a fast-changing environment. If however, you can not use public services to get access to AI models this can be a way to still profit from AI in your environment. Closed weight model are still performing better but open weight models are catching up slowly - Open Weight vs Closed Weight comparison.

Outlook: The Future of Agentic Platform Operations

Operating open-weight models in-house requires initial engineering efforts, but the autonomy it brings to enterprise infrastructure is substantial. By wiring your custom LLM deployment into an operational runtime framework like kagent, you shift from basic chatbot interactions to true agentic automation. Platform and DevOps engineers can utilize these localized models to interpret live cluster metrics, trace configuration anomalies, and securely troubleshoot platform infrastructure without a single byte of sensitive data ever leaving your firewall boundaries.

Cost Optimization Tip:

To keep cloud bills low in sandbox or development environments, combine this configuration with Karpenter. It allows you to automatically scale your expensive GPU node groups down to zero when no active inference requests are hitting your cluster.

share post

//

More articles in this subject area

Discover exciting further topics and let the codecentric world inspire you.