- TorchServe's default file I/O and logging consumed 67% of request time, leaving GPU at 23% utilization and costing $800/month on two T4 instances.
- Triton Inference Server achieved 3.9x higher throughput (183 vs 47 req/sec) by eliminating I/O from the hot path and using smarter dynamic batching.
- Moving to Triton reduced serving costs from $803/month to $384/month while handling the same workload on a single instance with 81% GPU utilization.
The $800 Wake-Up Call
My GPU utilization was sitting at 23%. The T4 instances were running around the clock, inference requests were piling up, and AWS was billing me $800/month for what turned out to be mostly idle hardware waiting on disk I/O.
The model wasn’t the problem. A ResNet-50 variant for industrial defect detection, nothing exotic. The issue was TorchServe’s default configuration treating every request like a separate snowflake—loading preprocessing configs from disk, writing temporary files, logging everything to S3.
I’d assumed model serving frameworks were optimized out of the box. That assumption cost me three months of unnecessary cloud bills before I finally benchmarked Triton Inference Server as an alternative.

What Actually Happens During Inference
Most tutorials skip the unsexy parts. They show you the model forward pass and call it done. But in production, that forward pass is maybe 30% of your request latency.
Here’s what a real inference pipeline does:
1. Deserialize the request (JSON or binary)
2. Load preprocessing config (for us: normalization params, resize dimensions)
3. Decode image bytes
4. Run preprocessing (resize, normalize, convert to tensor)
5. Model forward pass ← the part everyone benchmarks
6. Postprocessing (threshold probabilities, format output)
7. Log request metadata (user_id, timestamp, prediction)
8. Serialize response
TorchServe was hitting disk in steps 2, 7, and sometimes 3 (depending on image size and temp file thresholds). Triton? Zero disk I/O for the same workload.
The TorchServe Baseline
I started with TorchServe because PyTorch’s official docs recommend it, and I wanted something that “just works.” Initial setup was genuinely easy:
```python
# model_handler.py
from ts.torch_handler.image_classifier import ImageClassifier
import torch
import json


class DefectClassifier(ImageClassifier):
    def initialize(self, context):
        super().initialize(context)
        properties = context.system_properties
        self.model_dir = properties.get("model_dir")

    def preprocess(self, data):
        # Load preprocessing config from disk on every request
        # (this turned out to be the first mistake; it belongs in initialize())
        with open(f"{self.model_dir}/preprocess_config.json") as f:
            self.config = json.load(f)

        images = []
        for row in data:
            image = row.get("data") or row.get("body")
            # TorchServe writes large payloads to temp files
            if isinstance(image, str):
                with open(image, 'rb') as f:
                    image = f.read()
            # ... rest of preprocessing
        return torch.stack(images)

    def postprocess(self, inference_output):
        # Log predictions to S3 (another I/O bottleneck)
        predictions = inference_output.argmax(dim=1).tolist()
        # ... logging code that hit S3 every request
        return predictions
```
Batch size 8, four workers, one T4 GPU. Throughput: 47 requests/second with an average latency of 170ms.
GPU utilization according to nvidia-smi: 23%.
Something was obviously wrong, but the logs showed no errors. Just slow, expensive idling.
Profiling Showed the Ugly Truth
I added cProfile to the handler and ran 1000 requests:
```python
import cProfile
import pstats

def preprocess(self, data):
    profiler = cProfile.Profile()
    profiler.enable()
    # ... existing code
    profiler.disable()
    stats = pstats.Stats(profiler)
    stats.sort_stats('cumtime')
    stats.print_stats(10)
```
The top time sinks:
- `open()` calls: 38% of request time
- `s3_client.put_object()`: 29%
- Model forward pass: 18%
- Everything else: 15%
Two-thirds of my inference pipeline was I/O. The GPU was barely doing anything.
Enter Triton Inference Server
Triton is NVIDIA’s serving framework, designed for high-throughput inference across multiple frameworks (PyTorch, TensorFlow, ONNX, TensorRT). The pitch: everything stays in memory, dynamic batching is smarter, and you can chain models without intermediate I/O.
The setup is more verbose than TorchServe. You need:
- A model repository with a specific directory structure
- A `config.pbtxt` file (Protocol Buffers text format)
- Your model weights in a supported format
Here’s the config for the same ResNet-50 model:
name: "defect_classifier"
platform: "pytorch_libtorch"
max_batch_size: 16
input [
{
name: "INPUT__0"
data_type: TYPE_FP32
dims: [ 3, 224, 224 ]
}
]
output [
{
name: "OUTPUT__0"
data_type: TYPE_FP32
dims: [ 5 ] # 5 defect classes
}
]
dynamic_batching {
preferred_batch_size: [ 4, 8, 16 ]
max_queue_delay_microseconds: 1000
}
instance_group [
{
count: 2
kind: KIND_GPU
}
]
The key difference: Triton expects you to bake preprocessing into the model or handle it client-side. No file I/O inside the server.
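One thing the config implies but doesn't show: the `pytorch_libtorch` backend loads a TorchScript model, laid out as `model_repository/defect_classifier/1/model.pt` next to the `config.pbtxt`. Here's a minimal export sketch; the model class and checkpoint path are placeholders for my own code, not something Triton provides:

```python
import torch
from my_models import DefectResNet50  # hypothetical module for the ResNet-50 variant

# Load the trained weights (path is illustrative)
model = DefectResNet50(num_classes=5)
model.load_state_dict(torch.load("checkpoints/defect_resnet50.pt", map_location="cpu"))
model.eval()

# Trace to TorchScript so Triton's libtorch backend can run it without Python
example = torch.randn(1, 3, 224, 224)
traced = torch.jit.trace(model, example)

# Triton's layout: model_repository/<model_name>/<version>/model.pt
traced.save("model_repository/defect_classifier/1/model.pt")
```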
Moving Preprocessing to the Client
This was the philosophical shift. TorchServe encourages “fat” server-side handlers that do everything. Triton pushes you toward “thin” servers with smart clients.
I rewrote the preprocessing as a standalone Python client utility:
```python
import numpy as np
import tritonclient.http as httpclient
from PIL import Image
import io


class TritonClient:
    def __init__(self, url="localhost:8000", model_name="defect_classifier"):
        self.client = httpclient.InferenceServerClient(url=url)
        self.model_name = model_name
        # Preprocessing config loaded ONCE at client init
        self.mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
        self.std = np.array([0.229, 0.224, 0.225], dtype=np.float32)

    def preprocess(self, image_bytes):
        img = Image.open(io.BytesIO(image_bytes)).convert('RGB')
        img = img.resize((224, 224))
        arr = np.array(img, dtype=np.float32) / 255.0
        arr = (arr - self.mean) / self.std
        arr = arr.transpose(2, 0, 1)  # HWC -> CHW
        return arr

    def infer(self, image_bytes):
        input_data = self.preprocess(image_bytes)
        input_data = np.expand_dims(input_data, axis=0)  # Add batch dim
        inputs = [httpclient.InferInput("INPUT__0", input_data.shape, "FP32")]
        inputs[0].set_data_from_numpy(input_data)
        outputs = [httpclient.InferRequestedOutput("OUTPUT__0")]
        response = self.client.infer(self.model_name, inputs, outputs=outputs)
        logits = response.as_numpy("OUTPUT__0")
        return logits.argmax(axis=1)[0]
```
Yes, this moves work to the client. But the client isn’t bottlenecked by GPU availability—it can preprocess while waiting for the batch to fill. Triton’s dynamic batching handles the rest.
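Calling it from application code is then just a couple of lines; a quick usage sketch (the image path is illustrative):

```python
# Hypothetical usage of the client above
client = TritonClient(url="localhost:8000", model_name="defect_classifier")

with open("samples/part_0001.jpg", "rb") as f:  # any image from the defect dataset
    image_bytes = f.read()

class_id = client.infer(image_bytes)
print(f"Predicted defect class: {class_id}")
```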
The Benchmark Setup
I ran both servers on the same hardware:
- Instance: AWS `g4dn.xlarge` (1x T4 GPU, 4 vCPUs, 16GB RAM)
- Model: ResNet-50 (25M parameters, ~100MB checkpoint)
- Workload: 10,000 requests, images averaging 1.2MB each
- Concurrency: 32 parallel clients
- Metric tools: `nvidia-smi dmon -s u` for GPU utilization, custom Python scripts for latency
Both servers were configured for fairness:
- TorchServe: batch size 8, 4 workers, all logging disabled for this test
- Triton: max batch size 16, preferred sizes [4, 8, 16], 2 GPU instances
I disabled S3 logging for TorchServe during the benchmark (that 29% overhead would’ve made the comparison unfair). But even without logging, the file I/O for config loading remained.
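The "custom Python scripts" were nothing fancy; roughly this shape, with 32 worker threads hammering the server and percentiles computed at the end. This is a sketch, not the exact script: `infer_fn` can be the `TritonClient.infer` above or an HTTP call to TorchServe.

```python
import time
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def run_benchmark(infer_fn, payloads, concurrency=32):
    """Fire requests from `concurrency` threads and collect per-request latency."""
    latencies = []

    def one_request(payload):
        start = time.perf_counter()
        infer_fn(payload)
        latencies.append(time.perf_counter() - start)

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(one_request, payloads))
    wall = time.perf_counter() - wall_start

    lat_ms = np.array(latencies) * 1000
    print(f"Throughput: {len(payloads) / wall:.1f} req/sec")
    print(f"P50: {np.percentile(lat_ms, 50):.0f}ms  P95: {np.percentile(lat_ms, 95):.0f}ms")
```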

Throughput Results
| Server | Requests/sec | P50 Latency | P95 Latency | GPU Util |
|---|---|---|---|---|
| TorchServe | 47 | 170ms | 340ms | 23% |
| Triton | 183 | 52ms | 98ms | 81% |
Triton was 3.9x faster on throughput and 3.3x better on median latency.
But the real kicker: GPU utilization jumped from 23% to 81%. That’s what happens when you stop waiting on disk.
Where TorchServe Was Losing Time
I added trace-level logging to TorchServe’s request handling and measured each stage with time.perf_counter():
```python
import time
import logging

logger = logging.getLogger(__name__)

def handle(self, data, context):
    t0 = time.perf_counter()

    # Config load from disk
    t1 = time.perf_counter()
    with open(f"{self.model_dir}/preprocess_config.json") as f:
        config = json.load(f)
    t2 = time.perf_counter()
    logger.info(f"Config load: {(t2-t1)*1000:.1f}ms")

    # Preprocessing
    preprocessed = self.preprocess(data)
    t3 = time.perf_counter()
    logger.info(f"Preprocess: {(t3-t2)*1000:.1f}ms")

    # Inference
    output = self.inference(preprocessed)
    t4 = time.perf_counter()
    logger.info(f"Inference: {(t4-t3)*1000:.1f}ms")

    # Postprocessing
    result = self.postprocess(output)
    t5 = time.perf_counter()
    logger.info(f"Postprocess: {(t5-t4)*1000:.1f}ms")

    logger.info(f"Total: {(t5-t0)*1000:.1f}ms")
    return result
```
Average timings over 1000 requests:
- Config load: 41ms (this shouldn’t happen every request)
- Preprocess: 38ms (image decode + resize + normalize)
- Inference: 31ms (the actual model forward pass)
- Postprocess: 8ms
The config load was pure waste. I could’ve cached it in initialize(), but I missed it during initial development. TorchServe’s architecture doesn’t force you to think about this—Triton does, because it won’t let you do file I/O in the hot path.
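For completeness, the fix on the TorchServe side is only a few lines: read the file once when the worker starts and reuse it. A minimal sketch of the corrected handler:

```python
import json
from ts.torch_handler.image_classifier import ImageClassifier

class DefectClassifier(ImageClassifier):
    def initialize(self, context):
        super().initialize(context)
        model_dir = context.system_properties.get("model_dir")
        # Read the preprocessing config exactly once, at worker startup
        with open(f"{model_dir}/preprocess_config.json") as f:
            self.config = json.load(f)

    def preprocess(self, data):
        # No disk access on the request path anymore -- just reuse self.config
        ...
```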
Dynamic Batching: The Hidden Advantage
Triton’s dynamic batching is smarter than TorchServe’s default behavior. Instead of waiting for exactly 8 requests (my configured batch size), Triton uses a timeout-based approach:
- If the batch reaches a `preferred_batch_size[i]`: dispatch immediately
- Else, if `max_queue_delay_microseconds` has elapsed: dispatch with whatever is queued
This means low-latency when traffic is light (don’t wait for a full batch), high-throughput when traffic is heavy (fill batches greedily).
TorchServe can do dynamic batching too, but you have to configure it manually in config.properties:
```
batch_size=8
max_batch_delay=1000
```
I didn’t set max_batch_delay initially, so TorchServe would wait indefinitely for 8 requests before running inference. Under bursty traffic, some requests waited 400ms+ just to fill a batch.
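Depending on your TorchServe version, the batch parameters can also be supplied when registering the model through the management API instead of `config.properties`. A rough sketch using `requests`, assuming the default ports and the .mar name from my setup:

```python
import requests

# TorchServe's management API listens on :8081 by default
resp = requests.post(
    "http://localhost:8081/models",
    params={
        "url": "defect_classifier.mar",   # archive already in the model store
        "batch_size": 8,
        "max_batch_delay": 100,           # milliseconds
        "initial_workers": 4,
    },
)
print(resp.status_code, resp.text)
```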
Memory Footprint Comparison
One area where TorchServe has an edge: memory efficiency.
- TorchServe RSS: 3.2GB (model loaded once per worker)
- Triton RSS: 4.8GB (two GPU instances + Python backend overhead)
Triton’s architecture keeps multiple model instances in memory for parallel execution. If you’re RAM-constrained (like on a g4dn.xlarge with 16GB), this matters.
But here’s the tradeoff: Triton’s extra memory buys you that 81% GPU utilization. TorchServe’s lower footprint doesn’t help if the GPU is idle 77% of the time.
Cold Start Times
Server startup is another dimension:
- TorchServe: 8.3 seconds (model load + worker spawn)
- Triton: 12.1 seconds (model load + instance initialization + warmup)
Triton’s slower cold start comes from its warmup phase—it runs a few dummy inferences to populate CUDA kernels and JIT caches. TorchServe skips this, which means your first few requests are slower (I measured 280ms for the first request vs. 170ms average).
For long-running servers, the 4-second difference is irrelevant. For serverless deployments (AWS Lambda, Google Cloud Run), it might push you toward TorchServe.
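If those slow first requests matter, one cheap workaround on the TorchServe side (not something the framework does for you) is a client-side warmup pass before the instance goes into rotation. A sketch against the default inference endpoint; the port, model name, and sample image are assumptions:

```python
import requests

def warm_up(n=5, url="http://localhost:8080/predictions/defect_classifier"):
    """Send a few throwaway requests so the first real user doesn't pay the CUDA/JIT cost."""
    with open("samples/warmup.jpg", "rb") as f:  # any representative image
        payload = f.read()
    for _ in range(n):
        requests.post(url, data=payload, timeout=10)
```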
The Cost Breakdown
Here’s how the math worked out for my workload (averaging 120 requests/minute during business hours, 20/min off-peak):
TorchServe setup:
- 2x g4dn.xlarge instances (needed two to handle peak load at 47 req/sec each)
- $0.526/hour/instance × 2 × 730 hours/month = $768/month
- Plus S3 logging costs: ~$35/month
- Total: $803/month
Triton setup:
- 1x g4dn.xlarge (183 req/sec easily handles peak)
- $0.526/hour × 730 hours = $384/month
- No S3 logging (moved to client-side batching + async writes)
- Total: $384/month
Switching saved $419/month, or 52% of my serving costs.
And this is with minimal Triton optimization. I haven’t even tried TensorRT (which can give another 2-3x speedup), model ensembles, or multi-model serving.
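One line item above deserves a sketch: the "client-side batching + async writes" that replaced per-request S3 logging. The idea is to buffer prediction records and flush them off the request path; the bucket name, key scheme, and flush interval below are made up for illustration.

```python
import json
import threading
import time
import boto3

class AsyncPredictionLogger:
    """Buffer prediction records and flush them to S3 in batches, off the request path."""

    def __init__(self, bucket="my-inference-logs", flush_every=30):
        self.s3 = boto3.client("s3")
        self.bucket = bucket
        self.buffer = []
        self.lock = threading.Lock()
        threading.Thread(target=self._flush_loop, args=(flush_every,), daemon=True).start()

    def log(self, record):
        with self.lock:
            self.buffer.append(record)

    def _flush_loop(self, interval):
        while True:
            time.sleep(interval)
            with self.lock:
                batch, self.buffer = self.buffer, []
            if batch:
                key = f"predictions/{int(time.time())}.json"
                self.s3.put_object(Bucket=self.bucket, Key=key, Body=json.dumps(batch))
```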
When TorchServe Still Makes Sense
I’m not saying Triton is always the answer. TorchServe wins if:
- You need rapid prototyping. Triton's config files and directory structure have a learning curve. TorchServe's `torch-model-archiver` is genuinely plug-and-play.
- Your model has complex Python preprocessing. Triton's Python backend works, but it's clunkier than TorchServe's native Python handlers. If you're doing NLP with custom tokenization, regex parsing, or API calls during inference, TorchServe's flexibility helps.
- You’re already in the PyTorch ecosystem. If you’re using PyTorch Lightning, Hugging Face Transformers, or other PyTorch-first tools, TorchServe integrates more naturally.
- You're deploying serverless. Triton's heavier cold start and memory footprint make it less ideal for AWS Lambda or Cloud Run. TorchServe's leaner profile fits better.
But for high-throughput production inference where GPU utilization matters? Triton’s architecture just handles it better.
What I’d Do Differently
If I were starting this project today, I’d:
- Profile from day one. Don't assume frameworks are optimized by default. Run `cProfile`, check `nvidia-smi dmon`, and measure end-to-end latency before deploying.
- Separate preprocessing from serving. Even with TorchServe, moving preprocessing to the client (or a separate microservice) would've eliminated most of the I/O bottleneck.
- Test dynamic batching configs. Both frameworks support it, but the defaults are conservative. Tune `max_batch_delay` based on your latency SLA.
- Consider TensorRT earlier. I'm converting the model to TensorRT now, which should push Triton's throughput past 300 req/sec on the same hardware.
The biggest lesson: I/O is the silent killer in ML serving. Everyone optimizes model size and FLOPS. Almost nobody profiles disk access, logging overhead, or serialization time until it’s costing them real money.
FAQ
Q: Can I use Triton with models other than PyTorch?
Yes—Triton supports TensorFlow, ONNX, TensorRT, and even custom backends. You can serve a PyTorch model alongside a TensorFlow model in the same server and chain them together (e.g., BERT tokenization in TF + classification head in PyTorch). That’s one of Triton’s killer features for teams with heterogeneous stacks.
Q: Does TorchServe’s performance improve with caching or Redis?
Sort of. If you cache preprocessing configs in Redis instead of reading from disk every request, you’ll cut that 41ms overhead. But you still have the S3 logging bottleneck (if enabled) and TorchServe’s batching behavior. I tested Redis caching and got throughput up to ~78 req/sec—better than 47, but nowhere near Triton’s 183. The architectural differences in batching and memory management matter more than I/O alone.
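For anyone trying that route, the caching itself is straightforward; a rough sketch with `redis-py`, where the key name and TTL are arbitrary choices:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def load_preprocess_config(model_dir):
    """Fetch the preprocessing config from Redis, falling back to disk once."""
    cached = r.get("preprocess_config")
    if cached is not None:
        return json.loads(cached)
    with open(f"{model_dir}/preprocess_config.json") as f:
        config = json.load(f)
    r.set("preprocess_config", json.dumps(config), ex=3600)  # cache for an hour
    return config
```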
Q: What’s the learning curve like for Triton if I’m coming from TorchServe?
Plan for 2-3 days to get comfortable. The config.pbtxt syntax is weird at first (Protocol Buffers text format isn’t common outside Google infrastructure), and debugging requires reading Triton’s C++ logs. The official docs are thorough but dense. Once you’ve deployed one model, though, the pattern clicks. I’d recommend starting with the ONNX backend (simplest) before trying PyTorch or TensorRT.
What I’m Watching Next
I’m curious about Ray Serve, which promises TorchServe’s Python flexibility with better batching and autoscaling. The benchmarks I’ve seen show it splitting the difference—faster than TorchServe, slightly slower than Triton, but with easier horizontal scaling.
I also haven’t tested BentoML, which some teams swear by for multi-framework serving. If you’ve run production workloads on BentoML and have GPU utilization numbers, I’d love to hear how it compares.
For now, though, Triton is handling 120K requests/day on a single instance, the GPU is actually working, and my AWS bill is half what it was. Hard to argue with that.