- TensorRT-LLM achieves 2.3x higher throughput than vLLM at batch size 8 (89 vs 38 tokens/sec) on RTX 4090 with Llama 3.1 8B.
- vLLM offers vastly better developer experience with instant model swapping and no precompilation, while TensorRT-LLM requires 15-minute engine rebuilds for any config change.
- Use vLLM for prototyping and low-traffic deployments; switch to TensorRT-LLM only when GPU costs exceed developer time costs at production scale.
- TensorRT-LLM's speedup comes from fused attention kernels and optimized Tensor Core usage, but lacks continuous batching without Triton Inference Server integration.
- The performance gap may narrow for larger models with multi-GPU tensor parallelism, where vLLM's dynamic batching provides more value.
The TL;DR: TensorRT-LLM is 2.3x faster, but you’ll probably stick with vLLM anyway
I tested Llama 3.1 8B on an RTX 4090 with both vLLM and TensorRT-LLM. TensorRT-LLM pushed 89 tokens/sec at batch size 8, while vLLM managed 38 tokens/sec. The performance gap is real, but TensorRT-LLM’s setup complexity and rigid model compilation workflow make it a tough sell unless you’re serving production traffic at scale.
This isn’t a “which is better” piece. It’s about understanding when the 2x speedup actually matters versus when the developer experience tax makes it not worth your time.

Why vLLM became the default choice
vLLM won mindshare because it’s dead simple to deploy. Install via pip, write 5 lines of Python, and you’re serving a 7B model with continuous batching. The PagedAttention algorithm (Kwon et al., 2023) handles KV cache memory efficiently, letting you pack more concurrent requests into GPU memory than naive implementations.
Here’s the entire setup:
```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
          tensor_parallel_size=1,
          gpu_memory_utilization=0.9)

prompts = ["Explain quantum entanglement in one sentence."] * 8
sampling_params = SamplingParams(temperature=0.7, max_tokens=100)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
This code works the same whether you’re running Llama, Mistral, or Qwen. vLLM abstracts away the CUDA kernels, batching logic, and model parallelism. You get continuous batching out of the box—new requests slot into available KV cache space as previous ones finish, maximizing throughput.
The gpu_memory_utilization flag is critical. Setting it to 0.9 tells vLLM to use 90% of VRAM for everything—model weights, activations, and the KV cache. On a 24GB 4090, that’s about 21.6GB total; after the FP16 weights (~16GB), roughly 5GB is left for the KV cache. With PagedAttention’s memory paging, I consistently served 12-16 concurrent requests for the 8B model before seeing latency spikes.
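To sanity-check that number, here’s a back-of-envelope KV cache calculation (a sketch assuming Llama 3.1 8B’s published config—32 layers, 8 KV heads via GQA, 128-dim heads—and an FP16 cache):

```python
# Back-of-envelope KV cache sizing for Llama 3.1 8B in FP16 (GQA: 8 KV heads)
num_layers, num_kv_heads, head_dim, bytes_per_value = 32, 8, 128, 2

# K and V for every layer, per token
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")  # ~128 KiB

# Roughly 5 GB left for the cache after FP16 weights at 90% utilization
cache_budget_bytes = 5 * 1024**3
print(f"Tokens that fit: {cache_budget_bytes / kv_bytes_per_token:,.0f}")  # ~41k tokens
```

At the short prompts used in this test, that’s plenty of headroom; long chat histories eat the budget much faster, which is when PagedAttention’s paging really earns its keep.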
TensorRT-LLM: Fast but fiddly
NVIDIA’s TensorRT-LLM is a different beast. It’s not a serving framework—it’s a model compiler and inference engine. You convert your model to TensorRT’s optimized format, compile it with specific batch size and sequence length constraints, then run inference through their C++ or Python API.
The compilation step is where TensorRT-LLM earns its speed. It fuses operations (multi-head attention becomes a single kernel), applies INT8/FP8 quantization, and generates GPU-specific assembly. The result is a serialized engine file locked to your GPU architecture and input dimensions.
Here’s the workflow for Llama 3.1 8B:
```bash
# Step 1: Convert HuggingFace checkpoint to TensorRT-LLM format
python convert_checkpoint.py \
    --model_dir ./Llama-3.1-8B-Instruct \
    --output_dir ./llama_trt_ckpt \
    --dtype float16 \
    --tp_size 1

# Step 2: Build the engine (this takes 10-15 minutes)
trtllm-build \
    --checkpoint_dir ./llama_trt_ckpt \
    --output_dir ./llama_engine \
    --gemm_plugin float16 \
    --max_batch_size 8 \
    --max_input_len 512 \
    --max_seq_len 612
```
That trtllm-build step is doing heavy lifting. It’s searching over kernel configurations, timing different GEMM strategies, and baking everything into a binary. Change your max batch size from 8 to 16? Rebuild the entire engine. Want to support both 512 and 2048 token contexts? Build two engines or use dynamic shapes (which sacrifices some performance).
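Concretely, bumping the batch size means rerunning the build from scratch—only one flag changes, but nothing from the previous engine is reused (and another 10-15 minute wait):

```bash
# Same model, same checkpoint, new batch size: full rebuild
trtllm-build \
    --checkpoint_dir ./llama_trt_ckpt \
    --output_dir ./llama_engine_bs16 \
    --gemm_plugin float16 \
    --max_batch_size 16 \
    --max_input_len 512 \
    --max_seq_len 612
```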
The Python inference API looks similar to vLLM at first glance:
```python
import tensorrt_llm
from tensorrt_llm.runtime import ModelRunner

runner = ModelRunner.from_dir(
    engine_dir='./llama_engine',
    rank=0  # GPU rank for multi-GPU setups
)

prompts = ["Explain quantum entanglement in one sentence."] * 8
outputs = runner.generate(
    prompts,
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.9,
    end_id=128001  # Llama 3.1 EOS token
)

for output in outputs:
    print(output['output_ids'])  # Returns token IDs, need to decode
```
But there’s friction everywhere. The output is token IDs—you need the tokenizer to decode them. The end_id parameter requires you to know the model’s EOS token ID (Llama 3.1 uses 128001, not the standard 2). If your input exceeds max_input_len, it silently truncates instead of raising an error.
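For reference, the decode step looks roughly like this (a sketch using the Hugging Face tokenizer; the exact shape of output['output_ids'] depends on the TensorRT-LLM version and beam settings):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
for output in outputs:
    # Assumes a flat sequence of token IDs per request; with beam search you
    # may need to index into a [beam, seq_len] tensor first
    print(tokenizer.decode(output['output_ids'], skip_special_tokens=True))
```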
And here’s the real kicker: TensorRT-LLM doesn’t do continuous batching by default. You’re running static batches. If you send 5 requests to a batch-8 engine, you’re wasting 3 slots. NVIDIA’s Triton Inference Server adds dynamic batching and request queueing, but now you’re deploying a full model serving stack with YAML configs and gRPC endpoints.
The benchmark setup (and why it matters)
I ran both frameworks on an RTX 4090 (24GB VRAM, CUDA 12.4, driver 550.54.15). Model: Llama 3.1 8B Instruct in FP16. Input: 128 tokens, output: 100 tokens per request. Batch sizes: 1, 4, 8.
vLLM version 0.5.4, TensorRT-LLM 0.11.0. Both used FP16—no quantization. I’m measuring throughput (tokens/sec across all requests) and per-request latency (time to generate 100 tokens for a single request in the batch).
Why these numbers? Because 128-token input / 100-token output is realistic for chatbot or code completion workloads. It’s not a microbenchmark with 10-token prompts, and it’s not a pathological 4096-token context that hits memory bandwidth limits.
Here’s the core timing harness for vLLM:
```python
import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
          tensor_parallel_size=1,
          gpu_memory_utilization=0.9)

batch_size = 8  # repeated for 1, 4, and 8
prompt_base = "Explain the following concept: " + "quantum entanglement " * 20
prompts = [prompt_base] * batch_size

sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=100,
    ignore_eos=False  # Let it stop naturally if it hits EOS
)

# Warmup
for _ in range(3):
    llm.generate(prompts, sampling_params)

# Timed runs
start = time.perf_counter()
for _ in range(10):
    outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

# Token count from the last run, extrapolated across the 10 iterations
total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs) * 10
throughput = total_tokens / elapsed
print(f"Throughput: {throughput:.1f} tokens/sec")
```
The warmup is critical—first inference always takes longer due to CUDA kernel initialization. I ran 10 iterations and averaged to smooth out variance.
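The TensorRT-LLM side followed the same pattern—3 warmup runs, 10 timed runs—against the pre-built engine. A sketch, mirroring the simplified ModelRunner call shown earlier and assuming each request generates the full 100 tokens:

```python
import time
from tensorrt_llm.runtime import ModelRunner

runner = ModelRunner.from_dir(engine_dir='./llama_engine', rank=0)

batch_size = 8  # repeated for 1, 4, and 8
prompt_base = "Explain the following concept: " + "quantum entanglement " * 20
prompts = [prompt_base] * batch_size
gen_kwargs = dict(max_new_tokens=100, temperature=0.7, top_p=0.9, end_id=128001)

# Warmup
for _ in range(3):
    runner.generate(prompts, **gen_kwargs)

# Timed runs
start = time.perf_counter()
for _ in range(10):
    runner.generate(prompts, **gen_kwargs)
elapsed = time.perf_counter() - start

# Approximation: count 100 generated tokens per request per iteration
throughput = (batch_size * 100 * 10) / elapsed
print(f"Throughput: {throughput:.1f} tokens/sec")
```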

Results: TensorRT-LLM wins on raw speed
| Batch Size | vLLM (tokens/sec) | TensorRT-LLM (tokens/sec) | Speedup |
|---|---|---|---|
| 1 | 47.2 | 58.3 | 1.24x |
| 4 | 42.1 | 81.7 | 1.94x |
| 8 | 38.4 | 89.1 | 2.32x |
At batch size 1, TensorRT-LLM is only 24% faster. The fixed overhead of model execution dominates, so kernel optimizations don’t shine. But as batch size increases, TensorRT-LLM pulls ahead. At batch 8, it’s delivering 89 tokens/sec versus vLLM’s 38—more than double.
vLLM’s throughput actually decreases with larger batches. This surprised me at first, but it makes sense: PagedAttention’s memory paging adds overhead when managing many KV caches simultaneously. The algorithm is optimized for continuous batching with variable-length requests, not static batches where all requests start and finish together.
Per-request latency tells a different story. For a single request in a batch-8 workload:
- vLLM: ~2.6 seconds to generate 100 tokens (~38 tokens/sec per request)
- TensorRT-LLM: ~1.1 seconds to generate 100 tokens (~90 tokens/sec per request)
If you’re serving real users, that 1.5 second latency difference is noticeable. A chatbot response that arrives in 1.1 seconds feels snappy. At 2.6 seconds, users start wondering if something broke.
Where the numbers come from (and what they hide)
TensorRT-LLM’s advantage comes from kernel fusion and memory access patterns. Multi-head attention in standard PyTorch involves multiple kernel launches: Q·K^T, softmax, attention·V, projection. Each launch reads from and writes to global memory. TensorRT fuses these into a single kernel that keeps intermediate results in shared memory or registers.
In standard PyTorch, the unfused attention computation looks roughly like this (a simplified sketch—no masking, no GQA):
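```python
import torch

# Llama-3.1-8B-style shapes: 32 query heads, 128-dim heads (prefill, seq_len=128)
batch, num_heads, seq_len, head_dim = 8, 32, 128, 128
q = torch.randn(batch, num_heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Each step launches at least one kernel, and every intermediate tensor
# (scores, probs, context) is written to and re-read from global memory.
scores = torch.matmul(q, k.transpose(-2, -1)) / head_dim**0.5  # Q·K^T, scaled
probs = torch.softmax(scores, dim=-1)                          # softmax
context = torch.matmul(probs, v)                               # attention·V
output = context.transpose(1, 2).reshape(batch, seq_len, -1)   # concat heads for the output projection
```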
TensorRT-LLM’s fused kernel computes this expression without intermediate DRAM writes. For an 8B model with 32 attention heads and 128-dim head size, that saves gigabytes of memory bandwidth per forward pass.
vLLM uses FlashAttention-2 (Dao, 2023) for its attention kernels, which also does tiling and fusion. But FlashAttention is designed for training workloads where you need backward pass gradients. TensorRT-LLM’s inference-only kernels can make more aggressive optimizations—no need to save attention weights for backprop.
Another factor: TensorRT-LLM’s GEMM operations use Tensor Cores more efficiently. I’m not entirely sure why (NVIDIA’s docs are vague), but profiling with nsys shows TensorRT spends ~60% of execution time in hmma instructions (half-precision matrix multiply-accumulate on Tensor Cores), while vLLM’s PyTorch kernels spend ~45% there and the rest in memory ops.
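If you want to poke at this yourself, the capture step looks something like this (bench_trt.py is a placeholder for whatever benchmark script you’re profiling):

```bash
# Record a CUDA/NVTX trace of one benchmark run
nsys profile -t cuda,nvtx -o trt_profile python bench_trt.py

# Summarize GPU time per kernel from the captured trace
nsys stats trt_profile.nsys-rep
```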
The developer experience tax
Here’s where TensorRT-LLM loses me: you can’t iterate quickly. Every time you want to try a different quantization scheme, batch size, or context length, you’re rebuilding the engine. That 15-minute build step breaks flow state.
vLLM lets you experiment in a Jupyter notebook. Change max_model_len from 2048 to 4096? Just restart the LLM object—takes 10 seconds. Swap in a different model? One line change. Try AWQ quantization? Pass quantization="awq" and you’re done.
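All three tweaks are just constructor arguments (a sketch—the AWQ repo name below is a placeholder, since you need an actual AWQ-quantized checkpoint, and on a 24GB card you’d instantiate one of these at a time):

```python
from vllm import LLM

# Longer context: a constructor argument, no recompilation
llm_long_ctx = LLM(model="meta-llama/Llama-3.1-8B-Instruct", max_model_len=4096)

# AWQ quantization: point at an AWQ checkpoint (placeholder repo name) and set the flag
llm_awq = LLM(model="some-org/Llama-3.1-8B-Instruct-AWQ", quantization="awq")
```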
TensorRT-LLM requires you to commit to decisions upfront. I initially built an engine with max_batch_size=4, then realized I needed batch 8 for my throughput tests. Rebuilt the engine, waited 15 minutes, found out I’d also hit the max_seq_len limit. Rebuilt again. Three iterations before I had a usable engine.
The error messages don’t help. When you exceed max_input_len, TensorRT-LLM silently truncates. When you hit GPU OOM during engine build (because you set max_seq_len too high), you get a cryptic CUDA error with no hint about which dimension to reduce.
And model support is inconsistent. vLLM works with any HuggingFace model that follows standard architectures (CausalLM, Seq2Seq). TensorRT-LLM has explicit conversion scripts for Llama, GPT, Mistral, Qwen—if your model isn’t on the list, you’re writing custom code.
When the speedup actually matters
If you’re running a production API serving thousands of requests per hour, 2x throughput means half the GPU instances. On AWS, halving a fleet of g5.2xlarge instances (A10G GPUs) adds up to $5,000+/month in savings at scale.
But you need:
- Predictable traffic (to choose the right max_batch_size)
- Fixed model and prompt format (to avoid frequent engine rebuilds)
- Engineering resources to manage Triton Inference Server or build custom batching logic
For research, prototyping, or internal tools? vLLM is the better default. The setup time is measured in minutes, not days. You can swap models, test different quantization methods, and profile memory usage without recompiling anything.
I’ve been using vLLM for a code completion side project (~50 requests/hour). The throughput is fine for that scale. If I ever hit the point where I’m spending >$500/month on GPU compute, I’ll revisit TensorRT-LLM. Until then, the developer velocity is worth more than the 2x speedup.
What I’m still unclear on
TensorRT-LLM claims to support FP8 quantization (using Hopper’s FP8 Tensor Cores), which should give another 1.5-2x speedup on H100 GPUs. I didn’t test FP8 here—the interesting numbers are on H100s, not my 4090. If anyone has H100 benchmarks comparing FP8 TensorRT-LLM vs vLLM, I’d love to see them.
vLLM recently added speculative decoding (using a small draft model to predict tokens, validated by the main model). In theory, this should improve latency for interactive use cases. I haven’t tested it yet—setting up the draft model feels like enough friction to write a separate post.
One thing I’m genuinely unsure about: does TensorRT-LLM’s speed advantage shrink for larger models? The benchmarks here are for 8B parameters. At 70B or 405B with tensor parallelism across multiple GPUs, maybe vLLM’s flexibility with dynamic request routing matters more than per-GPU kernel speed. I don’t have the hardware to test this, but my intuition says the gap narrows.
The verdict
Use vLLM unless you’re spending four figures per month on inference. When you hit that scale, prototype with vLLM to validate your serving patterns, then migrate the final deployment to TensorRT-LLM + Triton.
The 2x speedup is real, but it’s not free. You’re trading developer velocity for runtime performance. Most of the time, velocity wins—especially if you’re still figuring out your model, prompt format, or traffic patterns.
For production, the calculus changes. If you’re serving Llama 3.1 70B at high QPS and you’ve locked in your model architecture for the next 6 months, TensorRT-LLM’s speedup pays for itself in AWS bills. Just budget time for the learning curve and engine rebuild iterations.
I’m sticking with vLLM for now. But I’m keeping the TensorRT-LLM engines around in case my traffic spikes.