Triton vs TorchServe: How I Cut $800/Month in GPU Costs

Updated Feb 12, 2026
⚡ Key Takeaways
  • File I/O and S3 logging in my TorchServe handler consumed 67% of request time, leaving the GPU at 23% utilization and costing $800/month across two T4 instances.
  • Triton Inference Server achieved 3.9x higher throughput (183 vs 47 req/sec) by eliminating I/O from the hot path and using smarter dynamic batching.
  • Moving to Triton reduced serving costs from $803/month to $384/month while handling the same workload on a single instance with 81% GPU utilization.

The $800 Wake-Up Call

My GPU utilization was sitting at 23%. The T4 instances were running around the clock, inference requests were piling up, and AWS was billing me $800/month for what turned out to be mostly idle hardware waiting on disk I/O.

The model wasn’t the problem. A ResNet-50 variant for industrial defect detection, nothing exotic. The issue was TorchServe’s default configuration treating every request like a separate snowflake—loading preprocessing configs from disk, writing temporary files, logging everything to S3.

I’d assumed model serving frameworks were optimized out of the box. That assumption cost me three months of unnecessary cloud bills before I finally benchmarked Triton Inference Server as an alternative.

What Actually Happens During Inference

Most tutorials skip the unsexy parts. They show you the model forward pass and call it done. But in production, that forward pass is maybe 30% of your request latency.

Here’s what a real inference pipeline does:

  1. Deserialize the request (JSON or binary)
  2. Load preprocessing config (for us: normalization params, resize dimensions)
  3. Decode image bytes
  4. Run preprocessing (resize, normalize, convert to tensor)
  5. Model forward pass ← the part everyone benchmarks
  6. Postprocessing (threshold probabilities, format output)
  7. Log request metadata (user_id, timestamp, prediction)
  8. Serialize response

TorchServe was hitting disk in steps 2, 7, and sometimes 3 (depending on image size and temp file thresholds). Triton? Zero disk I/O for the same workload.

The TorchServe Baseline

I started with TorchServe because PyTorch’s official docs recommend it, and I wanted something that “just works.” Initial setup was genuinely easy:

# model_handler.py
from ts.torch_handler.image_classifier import ImageClassifier
import torch
import json

class DefectClassifier(ImageClassifier):
    def initialize(self, context):
        super().initialize(context)
        properties = context.system_properties
        self.model_dir = properties.get("model_dir")

    def preprocess(self, data):
        # Load preprocessing config from disk on every request
        # (this turned out to be the first mistake)
        with open(f"{self.model_dir}/preprocess_config.json") as f:
            self.config = json.load(f)

        images = []
        for row in data:
            image = row.get("data") or row.get("body")
            # TorchServe writes large payloads to temp files
            if isinstance(image, str):
                with open(image, 'rb') as f:
                    image = f.read()
            # ... rest of preprocessing
        return torch.stack(images)

    def postprocess(self, inference_output):
        # Log predictions to S3 (another I/O bottleneck)
        predictions = inference_output.argmax(dim=1).tolist()
        # ... logging code that hit S3 every request
        return predictions

Batch size 8, four workers, one T4 GPU. Throughput: 47 requests/second with an average latency of 170ms.

GPU utilization according to nvidia-smi: 23%.

Something was obviously wrong, but the logs showed no errors. Just slow, expensive idling.

Profiling Showed the Ugly Truth

I added cProfile to the handler and ran 1000 requests:

import cProfile
import pstats

def preprocess(self, data):
    profiler = cProfile.Profile()
    profiler.enable()
    # ... existing code
    profiler.disable()
    stats = pstats.Stats(profiler)
    stats.sort_stats('cumtime')
    stats.print_stats(10)

The top time sinks:

  1. open() calls: 38% of request time
  2. s3_client.put_object(): 29%
  3. Model forward pass: 18%
  4. Everything else: 15%

Two-thirds of my inference pipeline was I/O. The GPU was barely doing anything.

Enter Triton Inference Server

Triton is NVIDIA’s serving framework, designed for high-throughput inference across multiple frameworks (PyTorch, TensorFlow, ONNX, TensorRT). The pitch: everything stays in memory, dynamic batching is smarter, and you can chain models without intermediate I/O.

The setup is more verbose than TorchServe. You need:

  1. A model repository with a specific directory structure (for this model: model_repository/defect_classifier/config.pbtxt plus a versioned subdirectory 1/model.pt)
  2. A config.pbtxt file (Protocol Buffers text format)
  3. Your model weights in a supported format

Here’s the config for the same ResNet-50 model:

name: "defect_classifier"
platform: "pytorch_libtorch"
max_batch_size: 16
input [
  {
    name: "INPUT__0"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ 5 ]  # 5 defect classes
  }
]
dynamic_batching {
  preferred_batch_size: [ 4, 8, 16 ]
  max_queue_delay_microseconds: 1000
}
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]

The key difference: Triton expects you to bake preprocessing into the model or handle it client-side. No file I/O inside the server.
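
If you go the bake-it-in route instead, the normalization constants can live inside the exported TorchScript graph so the server never needs a config file. A minimal sketch, assuming trained_resnet50 is the already-trained backbone:

import torch

class PreprocessAndClassify(torch.nn.Module):
    # Wraps the backbone so normalization ships inside the exported model.
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone
        self.register_buffer("mean", torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1))
        self.register_buffer("std", torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1))

    def forward(self, x):
        # Expects float32 NCHW input scaled to [0, 1]; resizing stays client-side.
        return self.backbone((x - self.mean) / self.std)

# scripted = torch.jit.script(PreprocessAndClassify(trained_resnet50))
# scripted.save("model_repository/defect_classifier/1/model.pt")

I went the client-side route instead, which is what the rest of this post covers.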

Moving Preprocessing to the Client

This was the philosophical shift. TorchServe encourages “fat” server-side handlers that do everything. Triton pushes you toward “thin” servers with smart clients.

I rewrote the preprocessing as a standalone Python client utility:

import numpy as np
import tritonclient.http as httpclient
from PIL import Image
import io

class TritonClient:
    def __init__(self, url="localhost:8000", model_name="defect_classifier"):
        self.client = httpclient.InferenceServerClient(url=url)
        self.model_name = model_name
        # Preprocessing config loaded ONCE at client init
        self.mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
        self.std = np.array([0.229, 0.224, 0.225], dtype=np.float32)

    def preprocess(self, image_bytes):
        img = Image.open(io.BytesIO(image_bytes)).convert('RGB')
        img = img.resize((224, 224))
        arr = np.array(img, dtype=np.float32) / 255.0
        arr = (arr - self.mean) / self.std
        arr = arr.transpose(2, 0, 1)  # HWC -> CHW
        return arr

    def infer(self, image_bytes):
        input_data = self.preprocess(image_bytes)
        input_data = np.expand_dims(input_data, axis=0)  # Add batch dim

        inputs = [httpclient.InferInput("INPUT__0", input_data.shape, "FP32")]
        inputs[0].set_data_from_numpy(input_data)

        outputs = [httpclient.InferRequestedOutput("OUTPUT__0")]
        response = self.client.infer(self.model_name, inputs, outputs=outputs)

        logits = response.as_numpy("OUTPUT__0")
        return logits.argmax(axis=1)[0]
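
Calling it from application code is then a couple of lines (the image path here is just a placeholder):

client = TritonClient()
with open("part_scan_001.jpg", "rb") as f:
    print(client.infer(f.read()))  # predicted defect class index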

Yes, this moves work to the client. But the client isn’t bottlenecked by GPU availability—it can preprocess while waiting for the batch to fill. Triton’s dynamic batching handles the rest.

The Benchmark Setup

I ran both servers on the same hardware:

  • Instance: AWS g4dn.xlarge (1x T4 GPU, 4 vCPUs, 16GB RAM)
  • Model: ResNet-50 (25M parameters, ~100MB checkpoint)
  • Workload: 10,000 requests, images averaging 1.2MB each
  • Concurrency: 32 parallel clients
  • Metric tools: nvidia-smi dmon -s u for GPU util, custom Python scripts for latency

Both servers were configured for fairness:

  • TorchServe: batch size 8, 4 workers, all logging disabled for this test
  • Triton: max batch size 16, preferred sizes [4, 8, 16], 2 GPU instances

I disabled S3 logging for TorchServe during the benchmark (that 29% overhead would’ve made the comparison unfair). But even without logging, the file I/O for config loading remained.
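
The "custom Python scripts" were nothing exotic. A minimal sketch of the latency harness, assuming send_one(payload) wraps a single inference call against whichever server is under test:

import time
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def benchmark(send_one, payloads, concurrency=32):
    # payloads is a list of raw image bytes; send_one performs one inference call.
    def timed(payload):
        start = time.perf_counter()
        send_one(payload)
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed, payloads))
    wall = time.perf_counter() - wall_start

    lat_ms = np.array(latencies) * 1000
    return {
        "req_per_sec": len(payloads) / wall,
        "p50_ms": float(np.percentile(lat_ms, 50)),
        "p95_ms": float(np.percentile(lat_ms, 95)),
    }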

Throughput Results

Server       Requests/sec   P50 Latency   P95 Latency   GPU Util
TorchServe   47             170ms         340ms         23%
Triton       183            52ms          98ms          81%

Triton was 3.9x faster on throughput and 3.3x better on median latency.

But the real kicker: GPU utilization jumped from 23% to 81%. That’s what happens when you stop waiting on disk.

Where TorchServe Was Losing Time

I added trace-level logging to TorchServe’s request handling and measured each stage with time.perf_counter():

import time
import logging
import json

logger = logging.getLogger(__name__)

def handle(self, data, context):
    t0 = time.perf_counter()

    # Config load from disk
    t1 = time.perf_counter()
    with open(f"{self.model_dir}/preprocess_config.json") as f:
        config = json.load(f)
    t2 = time.perf_counter()
    logger.info(f"Config load: {(t2-t1)*1000:.1f}ms")

    # Preprocessing
    preprocessed = self.preprocess(data)
    t3 = time.perf_counter()
    logger.info(f"Preprocess: {(t3-t2)*1000:.1f}ms")

    # Inference
    output = self.inference(preprocessed)
    t4 = time.perf_counter()
    logger.info(f"Inference: {(t4-t3)*1000:.1f}ms")

    # Postprocessing
    result = self.postprocess(output)
    t5 = time.perf_counter()
    logger.info(f"Postprocess: {(t5-t4)*1000:.1f}ms")
    logger.info(f"Total: {(t5-t0)*1000:.1f}ms")

    return result

Average timings over 1000 requests:

  • Config load: 41ms (this shouldn’t happen every request)
  • Preprocess: 38ms (image decode + resize + normalize)
  • Inference: 31ms (the actual model forward pass)
  • Postprocess: 8ms

The config load was pure waste. I could’ve cached it in initialize(), but I missed it during initial development. TorchServe’s architecture doesn’t force you to think about this—Triton does, because it won’t let you do file I/O in the hot path.
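
The fix on the TorchServe side is a one-time read at worker startup, roughly (this replaces the per-request open() shown earlier; json is already imported in model_handler.py):

def initialize(self, context):
    super().initialize(context)
    model_dir = context.system_properties.get("model_dir")
    # One disk read per worker at startup instead of one per request.
    with open(f"{model_dir}/preprocess_config.json") as f:
        self.config = json.load(f)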

Dynamic Batching: The Hidden Advantage

Triton’s dynamic batching is smarter than TorchServe’s default behavior. Instead of waiting for exactly 8 requests (my configured batch size), Triton uses a timeout-based approach:

If batch size reaches preferred_batch_size[i]: dispatch immediately
Else if max_queue_delay_microseconds elapsed: dispatch with whatever you have

This means low latency when traffic is light (no waiting for a full batch) and high throughput when traffic is heavy (batches fill greedily).
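
As a toy model of that decision (this mirrors the behavior described above, not Triton's actual scheduler code):

def should_dispatch(queued_requests, oldest_wait_us,
                    preferred_sizes=(4, 8, 16), max_delay_us=1000):
    # Dispatch as soon as the queue hits a preferred batch size,
    # otherwise once the oldest queued request has waited long enough.
    if len(queued_requests) in preferred_sizes:
        return True
    return len(queued_requests) > 0 and oldest_wait_us >= max_delay_us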

TorchServe can do dynamic batching too, but you have to configure it per model, either as query parameters when registering the model or with a per-model entry in config.properties (maxBatchDelay is in milliseconds):

models={"defect_classifier": {"1.0": {"batchSize": 8, "maxBatchDelay": 1000}}}

I didn't set maxBatchDelay initially, so TorchServe sat holding requests while it tried to fill a batch of 8 before running inference. Under bursty traffic, some requests waited 400ms+ just to fill a batch.

Memory Footprint Comparison

One area where TorchServe has an edge: memory efficiency.

  • TorchServe RSS: 3.2GB (model loaded once per worker)
  • Triton RSS: 4.8GB (two GPU instances + Python backend overhead)

Triton’s architecture keeps multiple model instances in memory for parallel execution. If you’re RAM-constrained (like on a g4dn.xlarge with 16GB), this matters.

But here’s the tradeoff: Triton’s extra memory buys you that 81% GPU utilization. TorchServe’s lower footprint doesn’t help if the GPU is idle 77% of the time.

Cold Start Times

Server startup is another dimension:

  • TorchServe: 8.3 seconds (model load + worker spawn)
  • Triton: 12.1 seconds (model load + instance initialization + warmup)

Triton’s slower cold start comes from its warmup phase—it runs a few dummy inferences to populate CUDA kernels and JIT caches. TorchServe skips this, which means your first few requests are slower (I measured 280ms for the first request vs. 170ms average).

For long-running servers, the 4-second difference is irrelevant. For serverless deployments (AWS Lambda, Google Cloud Run), it might push you toward TorchServe.

The Cost Breakdown

Here’s how the math worked out for my workload (averaging 120 requests/minute during business hours, 20/min off-peak):

TorchServe setup:
– 2x g4dn.xlarge instances (needed two to handle peak load at 47 req/sec each)
– $0.526/hour/instance × 2 × 730 hours/month = $768/month
– Plus S3 logging costs: ~$35/month
Total: $803/month

Triton setup:
– 1x g4dn.xlarge (183 req/sec easily handles peak)
– $0.526/hour × 730 hours = $384/month
– No S3 logging (moved to client-side batching + async writes)
Total: $384/month

Switching saved $419/month, or 52% of my serving costs.

And this is with minimal Triton optimization. I haven’t even tried TensorRT (which can give another 2-3x speedup), model ensembles, or multi-model serving.

When TorchServe Still Makes Sense

I’m not saying Triton is always the answer. TorchServe wins if:

  1. You need rapid prototyping. Triton’s config files and directory structure have a learning curve. TorchServe’s torch-model-archiver is genuinely plug-and-play.
  2. Your model has complex Python preprocessing. Triton’s Python backend works, but it’s clunkier than TorchServe’s native Python handlers. If you’re doing NLP with custom tokenization, regex parsing, or API calls during inference, TorchServe’s flexibility helps.
  3. You’re already in the PyTorch ecosystem. If you’re using PyTorch Lightning, Hugging Face Transformers, or other PyTorch-first tools, TorchServe integrates more naturally.
  4. Serverless deployment. Triton’s heavier cold start and memory footprint make it less ideal for AWS Lambda or Cloud Run. TorchServe’s leaner profile fits better.

But for high-throughput production inference where GPU utilization matters? Triton’s architecture just handles it better.

What I’d Do Differently

If I were starting this project today, I’d:

  1. Profile from day one. Don’t assume frameworks are optimized by default. Run cProfile, check nvidia-smi dmon, measure end-to-end latency before deploying.
  2. Separate preprocessing from serving. Even with TorchServe, moving preprocessing to the client (or a separate microservice) would’ve eliminated most of the I/O bottleneck.
  3. Test dynamic batching configs. Both frameworks support it, but the defaults are conservative. Tune max_batch_delay based on your latency SLA.
  4. Consider TensorRT earlier. I’m converting the model to TensorRT now, which should push Triton’s throughput past 300 req/sec on the same hardware.

The biggest lesson: I/O is the silent killer in ML serving. Everyone optimizes model size and FLOPS. Almost nobody profiles disk access, logging overhead, or serialization time until it’s costing them real money.

FAQ

Q: Can I use Triton with models other than PyTorch?

Yes—Triton supports TensorFlow, ONNX, TensorRT, and even custom backends. You can serve a PyTorch model alongside a TensorFlow model in the same server and chain them together (e.g., BERT tokenization in TF + classification head in PyTorch). That’s one of Triton’s killer features for teams with heterogeneous stacks.

Q: Does TorchServe’s performance improve with caching or Redis?

Sort of. If you cache preprocessing configs in Redis instead of reading from disk every request, you’ll cut that 41ms overhead. But you still have the S3 logging bottleneck (if enabled) and TorchServe’s batching behavior. I tested Redis caching and got throughput up to ~78 req/sec—better than 47, but nowhere near Triton’s 183. The architectural differences in batching and memory management matter more than I/O alone.
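
A config cache along those lines is only a few lines with redis-py (a sketch, not the exact code I ran; the key name is arbitrary):

import json
import redis

cache = redis.Redis(host="localhost", port=6379)

def load_preprocess_config(model_dir):
    # Serve the config from Redis when present; fall back to disk and
    # populate the cache on a miss.
    cached = cache.get("preprocess_config")
    if cached is not None:
        return json.loads(cached)
    with open(f"{model_dir}/preprocess_config.json") as f:
        config = json.load(f)
    cache.set("preprocess_config", json.dumps(config))
    return config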

Q: What’s the learning curve like for Triton if I’m coming from TorchServe?

Plan for 2-3 days to get comfortable. The config.pbtxt syntax is weird at first (Protocol Buffers text format isn’t common outside Google infrastructure), and debugging requires reading Triton’s C++ logs. The official docs are thorough but dense. Once you’ve deployed one model, though, the pattern clicks. I’d recommend starting with the ONNX backend (simplest) before trying PyTorch or TensorRT.

What I’m Watching Next

I’m curious about Ray Serve, which promises TorchServe’s Python flexibility with better batching and autoscaling. The benchmarks I’ve seen show it splitting the difference—faster than TorchServe, slightly slower than Triton, but with easier horizontal scaling.

I also haven’t tested BentoML, which some teams swear by for multi-framework serving. If you’ve run production workloads on BentoML and have GPU utilization numbers, I’d love to hear how it compares.

For now, though, Triton is handling 120K requests/day on a single instance, the GPU is actually working, and my AWS bill is half what it was. Hard to argue with that.
