Developing infrastructure for M. Zilinec’s on-prem, LLM-powered data extraction pipeline, one of the first of its kind to process the Czech language.

Problem at hand

  • We needed to create infrastructure to support an on-premises LLM document extraction pipeline
  • We didn’t have a reliable bare-metal GPU machine at hand
  • We needed to keep costs as low as possible, as cloud GPU virtual machines can get EXPENSIVE

Devising an architecture

  • The app backend/frontend + all scheduling components run in a single VM.
  • It sends requests to GPU worker instances, which run only the GPU workloads.
  • We need to save cost by autoscaling GPU instances, as the cheapest GPU instances with 16 GB of GPU memory cost around 400 EUR (330 GBP) per month.

This makes deploying the stack inside Kubernetes an obvious choice.
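
To make the split concrete, here is a minimal, hypothetical sketch of how the backend VM hands a document to a GPU worker. The Service name, port, and endpoint are illustrative only, not the project’s actual API.

# Hypothetical sketch of the backend -> GPU worker hand-off.
# Service name, port and the /extract endpoint are placeholders.
import requests

WORKER_URL = "http://my-worker-llm.gpu-worker.svc.cluster.local:8000/extract"

def submit_document(doc_bytes: bytes) -> dict:
    """Send one document to a GPU worker and return the extracted fields."""
    resp = requests.post(WORKER_URL, files={"document": doc_bytes}, timeout=300)
    resp.raise_for_status()
    return resp.json()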

Challenges

  • GPU resource management in Kubernetes is not very flexible (whole GPUs only)
  • CUDA Python images are BIG (6 GB).
  • Machine learning models are EVEN BIGGER (20+ GB).
  • GPU instance boot-up time is LONG (5+ minutes).

GPUs in k8s

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-worker-llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-app
  template:
    metadata:
      labels:
        app: gpu-app
    spec:
      containers:
      - name: cuda-container
        image: nvidia/cuda:12.9.0-cudnn-devel-ubuntu24.04 # 6GB just the image! +20GB ML models mounted
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1  # Request 1 GPU
      restartPolicy: Always
      tolerations:
        # This taint (or a similar one) is added to GPU nodes by the cloud
        # provider so that only GPU-requiring workloads get scheduled there
        - key: "nvidia.com/gpu"
          operator: "Equal"
          value: "present"
          effect: "NoSchedule"

The main takeaways here are:

  • The GPU k8s resource is discrete: you request whole GPUs, not fractions of one
  • The CUDA image required to run our workload is also very big

The bottleneck for these workloads will most likely be GPU VRAM. The VM we used had 16 GB of VRAM. However, as you can see, the allocation cannot be controlled at a finer granularity than a whole GPU. If you need two pods to share a GPU, or to control which pods land on which GPU instances, you have to build the scheduling logic that matches pods to their chip and VRAM requirements yourself.
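
One partial workaround, if you run the NVIDIA device plugin or gpu-operator, is time-slicing, which advertises a single physical GPU as several schedulable nvidia.com/gpu resources. Note that it multiplexes compute only and does not partition VRAM. A hedged sketch of the device plugin config (wired into the gpu-operator via its devicePlugin.config values), not something we shipped:

# Hypothetical time-slicing config for the NVIDIA k8s device plugin:
# one physical GPU is advertised as 2 schedulable nvidia.com/gpu resources.
# VRAM is NOT partitioned - the sharing pods must still fit into it together.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 2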

The cloud provider usually pre-installs a GPU driver in the GPU node image. A node image with pre-installed drivers speeds up node spin-up, although the pre-installed driver might not always be the latest. For the latest GPU features, consider installing the driver with the NVIDIA gpu-operator instead.
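
Installing the operator is a standard Helm deployment; roughly something like the following (chart values depend on your cluster):

# Install the NVIDIA gpu-operator via Helm; with driver.enabled=true it
# manages the driver itself instead of relying on the node image's driver.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.enabled=true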

📝 Note: High-end GPUs support Multi-Instance GPU (MIG), but this is not available on lower-grade cards. There is also interesting ongoing work to make GPU sharing simpler, such as Project-HAMi or training GPU schedulers with reinforcement learning.

Autoscaling trigger

I had to implement the app’s queue size as a custom metric to trigger GPU autoscaling. This means designing a custom scaling event based on a metric that represents the number of user requests waiting in the queue. The KEDA operator covered this use case perfectly:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-worker-llm
  namespace: gpu-worker
spec:
  scaleTargetRef:
    name: my-worker-llm  # The name of deployment to scale in this namespace
  maxReplicaCount: 5
  minReplicaCount: 0
  triggers:
    - type: prometheus  # Trigger scaling based on prometheus query
      metadata:
        serverAddress: http://prometheus-operated.monitoring.svc.cluster.local:9090  # Prometheus address
        metricName: "my_queue_size_total" # Custom queue metric
        threshold: "10"  # Scales up 1 pod per 10 of my_queue_size_total
        query: sum(my_queue_size_total)  # The Prometheus query to check for the metric
        activationThreshold: "1"  # Scale up when at least 1 in queue
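
On the app side, the my_queue_size_total metric just needs to be exposed for Prometheus to scrape. A minimal sketch (not the project’s actual code), assuming the Python prometheus_client library:

# Minimal sketch: expose the queue length that the KEDA trigger above queries.
# The metric name must match the Prometheus query (my_queue_size_total).
from prometheus_client import Gauge, start_http_server

queue_size = Gauge("my_queue_size_total",
                   "Number of user requests waiting in the queue")

def on_enqueue() -> None:
    queue_size.inc()   # call when a request is added to the queue

def on_dequeue() -> None:
    queue_size.dec()   # call when a worker picks a request up

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://<pod-ip>:8000/metrics
    import time
    while True:
        time.sleep(60)  # keep this standalone sketch alive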

I do not recommend using prometheus-adapter. It came up first in my search, but its configuration syntax is complicated; I got stuck on it for hours, whereas I achieved the same result with KEDA in minutes.

Spot vs Regular

We needed to save costs where possible, so at first we thought even Spot instances could be enough. However, Spot instances sometimes do not spin up for 15+ minutes; they are really only suited to delayable batch workloads.

We quickly switched to Regular priority, which boots a node up in about 1 minute.
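
For reference, this is roughly how a Spot GPU node pool is created on AKS; all names, the VM size and the counts are placeholders, and dropping the Spot-specific flags gives a Regular-priority pool:

# Hedged example: add a Spot-priority GPU node pool on AKS.
# Omit --priority/--eviction-policy/--spot-max-price for a Regular pool.
az aks nodepool add \
  --resource-group my-rg \
  --cluster-name my-aks \
  --name gpuspot \
  --node-vm-size Standard_NC16as_T4_v3 \
  --priority Spot \
  --eviction-policy Delete \
  --spot-max-price -1 \
  --enable-cluster-autoscaler \
  --min-count 0 \
  --max-count 5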

Images are TOO BIG

The images, even when stored ‘locally’ in an OCI registry in the same region as the k8s cluster, still need to be downloaded onto each new GPU node.

Same-region data transfer fortunately does not incur cost, but storing the images in the registry does.

Tier      Minimum Download Bandwidth
Basic     30 Mbps
Standard  60 Mbps
Premium   100 Mbps

Switching to the ‘Premium’ tier made the registry more costly, but did not meaningfully speed up the download. The figures above are minimum bandwidth guarantees, and we were already getting the maximum advertised speed even on Basic. They matter for meeting SLAs or for a larger number of nodes pulling concurrently, not for our use case.

Volume mount hacks

Loki’s Wager posted a great blog post tackling a similar problem: mount all the big files as read-only volumes in k8s and boot only a minimal image such as Alpine. Here is a snippet from their Dockerfile:

FROM builder as tmp

RUN mkdir /data && mv /usr/local/cuda-11.8 /data/cuda-11.8 && \
    mv /opt/nvidia /data/nvidia

FROM ubuntu:22.04 as slim

COPY --from=tmp /etc /etc
COPY --from=tmp /usr /usr
COPY --from=tmp /root /root
COPY --from=tmp /var/cache /var/cache
COPY --from=tmp /var/lib /var/lib
COPY --from=tmp /run /run

This is nice and definitely works, but it requires effort: digging through the image’s contents and re-mounting them correctly via Kubernetes PersistentVolumes. You also have to version each PersistentVolume if the files differ between images.

It was not easy to keep track of both the changing ML models in use and the current CUDA image, so we felt this was too hacky and searched for another solution.

Kubernetes 1.31 introduces an (alpha) feature for mounting OCI images as volumes. The images still must be downloaded from the container registry, so this does not speed up node boot time. However, it can make the transition to the aforementioned volume mount hack simpler.
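
A sketch of what that could look like; the feature sits behind the ImageVolume feature gate, and the model image reference below is a placeholder:

# Hypothetical use of the Kubernetes 1.31 image volume feature (alpha,
# requires the ImageVolume feature gate): mount a models-only OCI image
# read-only into a slim runtime container.
apiVersion: v1
kind: Pod
metadata:
  name: llm-worker-with-model-volume
spec:
  containers:
  - name: worker
    image: ubuntu:24.04        # slim runtime image
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: models
      mountPath: /models
  volumes:
  - name: models
    image:
      reference: myregistry.azurecr.io/llm-models:v1  # placeholder artifact
      pullPolicy: IfNotPresent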

Disk speed

Disk speed in Azure VMs is determined by the size of the VM itself. Within the same VM series, disk IO can be much slower or faster depending on the vCPU/RAM size.

It is not entirely transparent what disk IO speeds you actually get. Cloud providers disclose IOPS figures, but we still had to run measurements ourselves to get an accurate picture.
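
A raw measurement can be taken with a tool like fio; for example (file path and sizes are placeholders):

# Example fio run: sequential 1 MiB reads with direct IO for 60 seconds,
# roughly approximating a large image/model read from local disk.
fio --name=seqread \
    --filename=/mnt/scratch/fio.tmp \
    --size=4G \
    --rw=read \
    --bs=1M \
    --direct=1 \
    --ioengine=libaio \
    --time_based --runtime=60 \
    --group_reporting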

The image pull gets noticeably faster upon switching to the Premium_LRS disk tier. It was, however, still not fast enough to make a difference in the product’s user experience.

Deallocate vs Delete

This is how our boot timeline looked up to this point:

Time   Event
00:00  start (user request)
00:15  autoscaler adds pods
06:30  Azure adds node
09:15  worker image pulled

Finally, we settled on detaching the VM hard drives and re-attaching them upon node launch.

This doesn’t work with the Premium_LRS disk tier, so the initial launch of a node is considerably slower. Fortunately, that didn’t matter for the product experience, because re-attaching a disk with all images already downloaded is very fast:

Time   Event
00:00  start (user request)
00:15  autoscaler adds pods
01:30  Azure adds node

The re-mounted disk also contains all models, which were previously streamed from a bucket.

The cost is higher than with the volume mount hack method, as Azure bills each detached node disk separately.

In k8s, the deallocated nodes never disappear from the node list; they are instead marked as NotReady.
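
If the cluster is AKS, a built-in way to get this deallocate-instead-of-delete behaviour is the node pool scale-down mode; a hedged sketch with placeholder resource names, not necessarily how our setup was wired:

# Hedged example: make the autoscaler deallocate GPU nodes on scale-down
# instead of deleting them, so OS disks (with cached images and models)
# survive and re-attach on the next scale-up.
az aks nodepool update \
  --resource-group my-rg \
  --cluster-name my-aks \
  --name gpunp \
  --scale-down-mode Deallocate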

This got us down to a 1m30s service-ready time, from the initial user query to the workload starting to process, without paying for costly GPU virtual machines while the service is idle.