The 47GB Memory Leak I Didn’t See Coming
I deployed a FastAPI service that processed financial data streams. It ran fine for three days. On day four, the OOM killer took it down at 3 AM.
The culprit? A cache that was supposed to hold “recent” data but never actually evicted anything. Classic. But here’s the thing: I didn’t discover this by guessing or adding print statements. I used tracemalloc and memray to pinpoint the exact line of code that was hoarding memory.
This isn’t a guide about memory management theory. It’s about the practical debugging flow I now use every time memory usage looks suspicious.
Why tracemalloc First, memray Later
Python’s built-in tracemalloc module is lightweight enough to leave running in production (with some caveats). memray is heavier but gives you flame graphs and live tracking that make complex leaks obvious.
I start with tracemalloc because it’s already in the standard library. If that doesn’t answer the question, I pull out memray.
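One thing worth knowing up front: you don't even have to touch the code to turn tracking on. The interpreter can start tracemalloc for you, which is handy for a quick look at a service you'd rather not redeploy (the number sets how many stack frames to record per allocation):

# Start tracemalloc at interpreter startup, capturing 10 frames per allocation
python -X tracemalloc=10 your_service.py

# Same thing via an environment variable
PYTHONTRACEMALLOC=10 python your_service.py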
Here’s the basic tracemalloc pattern I use:
import tracemalloc
import asyncio
from collections import defaultdict
# Start tracking at application startup
tracemalloc.start()
# Simulate a leaky cache
cache = defaultdict(list)

async def process_event(event_id: int, data: bytes):
    # This looks innocent but it's a trap
    cache[event_id].append(data)
    await asyncio.sleep(0)  # Simulate async work

async def main():
    # Simulate 100,000 events
    for i in range(100_000):
        await process_event(i % 100, b"X" * 1024)  # 1KB per event

    # Take a snapshot
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics('lineno')

    print("[ Top 5 memory allocations ]")
    for stat in top_stats[:5]:
        print(f"{stat.filename}:{stat.lineno}: {stat.size / 1024 / 1024:.2f} MB")

if __name__ == "__main__":
    asyncio.run(main())
When I ran this (Python 3.11 on Ubuntu 22.04), the output was:
[ Top 5 memory allocations ]
/home/ubuntu/leak_demo.py:11: 97.66 MB
/usr/lib/python3.11/asyncio/events.py:80: 1.23 MB
...
Line 11 is cache[event_id].append(data). The smoking gun.
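When the line that shows up is a generic helper and I need to know who called it, I switch the grouping from 'lineno' to 'traceback' to see the whole call chain. A small sketch, assuming tracemalloc.start() was given a frame depth greater than 1 (e.g. tracemalloc.start(10)):

# Group allocations by full call chain instead of a single line
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('traceback')

# The biggest allocator, with its full call chain
stat = top_stats[0]
print(f"{stat.size / 1024 / 1024:.2f} MB in {stat.count} blocks")
for line in stat.traceback.format():
    print(line)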
Comparing Snapshots: The Delta That Matters
One snapshot tells you where memory lives right now. Two snapshots tell you what’s growing.
This is the pattern I use for long-running services:
import tracemalloc
import time

tracemalloc.start()

# Baseline snapshot
snapshot1 = tracemalloc.take_snapshot()

# Simulate workload
leaky_list = []
for i in range(100000):
    leaky_list.append({"id": i, "data": "X" * 100})

time.sleep(1)

# Second snapshot
snapshot2 = tracemalloc.take_snapshot()

# Compare
top_stats = snapshot2.compare_to(snapshot1, 'lineno')

print("[ Top 3 memory growth sources ]")
for stat in top_stats[:3]:
    print(f"{stat.filename}:{stat.lineno}: +{stat.size_diff / 1024 / 1024:.2f} MB")
    print(f" {stat.count_diff} new objects")
Output:
[ Top 3 memory growth sources ]
/home/ubuntu/leak_demo.py:12: +9.54 MB
100000 new objects
The size_diff and count_diff are what I care about. If a line shows +500 MB over 10 minutes, that’s your leak.
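One refinement: the top of the diff is sometimes noise from the import machinery or tracemalloc's own bookkeeping. Continuing from the snapshots above, filtering that out first looks roughly like this:

# Drop frames from the import system, unknown files, and tracemalloc itself
def without_noise(snapshot):
    return snapshot.filter_traces((
        tracemalloc.Filter(False, "<frozen importlib._bootstrap>"),
        tracemalloc.Filter(False, "<unknown>"),
        tracemalloc.Filter(False, tracemalloc.__file__),
    ))

top_stats = without_noise(snapshot2).compare_to(without_noise(snapshot1), 'lineno')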
The Production Setup: Periodic Snapshots Without Killing Performance
Leaving tracemalloc on in production adds about 10-15% memory overhead and a small CPU cost. For most services, this is acceptable if you snapshot periodically rather than constantly.
Here’s the pattern I use with FastAPI:
from fastapi import FastAPI
import tracemalloc
import asyncio
from datetime import datetime

app = FastAPI()

# Store snapshots in-memory (or write to disk/S3)
snapshots = []

@app.on_event("startup")
async def startup():
    tracemalloc.start(10)  # Track up to 10 stack frames
    # Keep a reference so the background task isn't garbage collected
    app.state.snapshot_task = asyncio.create_task(periodic_snapshot())

async def periodic_snapshot():
    """Take a snapshot every 5 minutes"""
    while True:
        await asyncio.sleep(300)
        snapshot = tracemalloc.take_snapshot()
        snapshots.append((datetime.now(), snapshot))

        # Keep only last 12 snapshots (1 hour)
        if len(snapshots) > 12:
            snapshots.pop(0)

        # Log top 3 allocations
        top_stats = snapshot.statistics('lineno')
        print(f"[{datetime.now()}] Top memory usage:")
        for stat in top_stats[:3]:
            print(f" {stat.filename}:{stat.lineno}: {stat.size / 1024 / 1024:.2f} MB")

@app.get("/memory-report")
async def memory_report():
    """Compare first and last snapshot"""
    if len(snapshots) < 2:
        return {"error": "Not enough snapshots yet"}

    first_time, first_snap = snapshots[0]
    last_time, last_snap = snapshots[-1]
    top_stats = last_snap.compare_to(first_snap, 'lineno')

    return {
        "period": f"{first_time} to {last_time}",
        "top_growth": [
            {
                "file": stat.filename,
                "line": stat.lineno,
                "growth_mb": round(stat.size_diff / 1024 / 1024, 2),
                "new_objects": stat.count_diff
            }
            for stat in top_stats[:10]
        ]
    }
I hit /memory-report from a monitoring script every hour. If growth exceeds a threshold, I get a Slack alert.
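The monitoring script is nothing fancy. Something in this shape works; the service URL, webhook, and threshold below are placeholders you'd swap for your own:

import requests  # third-party; pip install requests

# Placeholders: point these at your own service, Slack webhook, and threshold
SERVICE_URL = "http://localhost:8000/memory-report"
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"
THRESHOLD_MB = 200

def check_memory_growth():
    report = requests.get(SERVICE_URL, timeout=10).json()
    if "error" in report or not report.get("top_growth"):
        return  # not enough snapshots yet
    worst = report["top_growth"][0]  # compare_to() sorts biggest diff first
    if worst["growth_mb"] > THRESHOLD_MB:
        text = (
            f"Memory growth alert: {worst['file']}:{worst['line']} "
            f"grew {worst['growth_mb']} MB ({report['period']})"
        )
        requests.post(SLACK_WEBHOOK, json={"text": text}, timeout=10)

if __name__ == "__main__":
    check_memory_growth()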
One gotcha: tracemalloc.start(10) tracks 10 stack frames. The default is 1, which often isn’t enough to see the real call chain. But more frames = more overhead. I’ve found 10 is a sweet spot.
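If you'd rather measure the cost on your own workload than trust my numbers, tracemalloc will report its own bookkeeping memory alongside what it's tracing:

import tracemalloc

tracemalloc.start(10)

# ... run a representative workload here ...

current, peak = tracemalloc.get_traced_memory()
overhead = tracemalloc.get_tracemalloc_memory()
print(f"Traced allocations: {current / 1024 / 1024:.1f} MB (peak {peak / 1024 / 1024:.1f} MB)")
print(f"Memory used by tracemalloc itself: {overhead / 1024 / 1024:.1f} MB")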
When tracemalloc Isn’t Enough: Enter memray
Sometimes the leak is in a C extension (numpy, pandas, Pillow). tracemalloc only sees Python allocations. memray sees everything.
Install it:
pip install memray
Basic usage:
# Run your script under memray
memray run -o output.bin your_script.py
# Generate a flame graph
memray flamegraph output.bin
This creates memray-flamegraph-output.html. Open it in a browser and you get an interactive flame graph showing where memory is allocated.
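The same capture file feeds memray's other reporters too, which should work if you want something lighter than a flame graph:

# High-level allocation statistics, printed to the terminal
memray stats output.bin

# Sortable table of allocations, rendered as HTML
memray table output.bin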
I once debugged a leak in a service that used Pillow to resize images. tracemalloc showed high memory usage in my code, but couldn’t tell me why. memray revealed that Image.open() was keeping raw image buffers in memory because I wasn’t explicitly calling .close().
The fix:
from PIL import Image

# Before (leaked memory)
def resize_image(path: str) -> bytes:
    img = Image.open(path)
    img = img.resize((800, 600))
    # img never gets closed — buffer leaks
    ...

# After (proper cleanup)
def resize_image(path: str) -> bytes:
    with Image.open(path) as img:
        img = img.resize((800, 600))
        # img.close() called automatically
        ...
memray showed the buffer allocation in C code. tracemalloc couldn’t see it.
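If you need the native frames themselves, memray can record C/C++ stacks as well; the --native flag at record time enables it (it costs extra overhead, so I wouldn't leave it on):

# Capture native stack frames alongside Python frames
memray run --native -o output.bin your_script.py
memray flamegraph output.bin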
Live Tracking with memray attach
The killer feature: attach to a running process.
# Get the PID of your running Python process
pgrep -f "python.*your_service.py"
# Attach memray (requires sudo on some systems)
sudo memray attach <PID>
This gives you a live TUI (text UI) showing memory allocations in real time. I use this when I see memory climbing in production but can’t easily restart the service to profile it.
Caveat: memray attach adds significant overhead (20-30% in my tests). I only use it for a few minutes at a time.
The Real-World Leak Pattern I See Most
It’s not exotic bugs. It’s caches without eviction.
# This is the pattern that bites everyone eventually
class DataService:
    def __init__(self):
        self._cache = {}  # No size limit. Oops.

    async def get_data(self, key: str):
        if key in self._cache:
            return self._cache[key]
        data = await fetch_from_db(key)
        self._cache[key] = data  # Grows forever
        return data
The fix:
from cachetools import TTLCache
import asyncio

class DataService:
    def __init__(self):
        # Max 1000 items, 5-minute TTL
        self._cache = TTLCache(maxsize=1000, ttl=300)
        self._lock = asyncio.Lock()

    async def get_data(self, key: str):
        async with self._lock:
            if key in self._cache:
                return self._cache[key]
        data = await fetch_from_db(key)
        async with self._lock:
            self._cache[key] = data
        return data
Or for synchronous code, just use functools.lru_cache with a maxsize:
from functools import lru_cache

@lru_cache(maxsize=1000)
def expensive_computation(x: int) -> int:
    # Only 1000 results cached, LRU eviction
    return x ** 2
I can’t tell you how many times I’ve seen unbounded caches in production. If you take one thing from this post, let it be: every cache needs an eviction policy.
Profiling Memory Growth Over Time: The Snapshot Diff Pattern
Here’s a script I run overnight when I suspect a slow leak:
import tracemalloc
import time
import json
from datetime import datetime

def take_snapshot_summary():
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics('lineno')
    return {
        "timestamp": datetime.now().isoformat(),
        "top_allocations": [
            {
                "file": stat.filename,
                "line": stat.lineno,
                "size_mb": stat.size / 1024 / 1024,
                "count": stat.count
            }
            for stat in top_stats[:20]
        ]
    }

if __name__ == "__main__":
    tracemalloc.start()
    snapshots = []

    # Run for 1 hour, snapshot every 5 minutes
    for i in range(12):
        snapshots.append(take_snapshot_summary())
        print(f"Snapshot {i+1}/12 taken")
        time.sleep(300)

    # Write to file
    with open("memory_profile.json", "w") as f:
        json.dump(snapshots, f, indent=2)
    print("Profile saved to memory_profile.json")
Then I analyze the JSON to see which files/lines grew the most. Not sophisticated, but effective.
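The analysis step is a few lines of plain Python. A sketch that diffs the first and last snapshot in the JSON written above:

import json

with open("memory_profile.json") as f:
    snapshots = json.load(f)

def by_location(summary):
    # Map "file:line" -> size_mb for one snapshot summary
    return {
        f"{a['file']}:{a['line']}": a["size_mb"]
        for a in summary["top_allocations"]
    }

first, last = by_location(snapshots[0]), by_location(snapshots[-1])

# Sort locations by how much they grew over the run
growth = sorted(
    ((loc, last[loc] - first.get(loc, 0.0)) for loc in last),
    key=lambda item: item[1],
    reverse=True,
)

for loc, delta in growth[:10]:
    print(f"{loc}: +{delta:.2f} MB")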
A Gotcha: Reference Cycles and gc.get_objects()
Sometimes memory isn’t leaked in the traditional sense — it’s held by reference cycles that the garbage collector will eventually clean up, but hasn’t yet.
You can force a collection and see what’s left:
import gc
import sys
from collections import Counter

# Force garbage collection
gc.collect()

# Get all objects tracked by GC
all_objects = gc.get_objects()
print(f"Total objects: {len(all_objects)}")

# Count by type
type_counts = Counter(type(obj).__name__ for obj in all_objects)
print("\nTop 10 object types:")
for obj_type, count in type_counts.most_common(10):
    print(f" {obj_type}: {count}")
If you see thousands of dict or list objects, dig deeper:
# Find all dicts over 1 MB
# Note: sys.getsizeof counts only the dict structure itself, not its values
large_dicts = [
    obj for obj in all_objects
    if isinstance(obj, dict) and sys.getsizeof(obj) > 1024 * 1024
]
print(f"Found {len(large_dicts)} large dicts")
This is a blunt instrument. Use it when you’re truly stuck.
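When the counts point at a suspicious object, gc.get_referrers tells you what is still holding it, which is usually enough to find the owning cache or list. A minimal sketch, continuing from the snippet above:

import gc

# Ask who still refers to the first large dict we found
if large_dicts:
    suspect = large_dicts[0]
    # Exclude our own large_dicts list, which obviously refers to it
    referrers = [r for r in gc.get_referrers(suspect) if r is not large_dicts]
    print(f"{len(referrers)} objects refer to the suspect dict")
    for ref in referrers[:5]:
        # Usually a class __dict__, a module, or a containing list/dict gives it away
        print(f"  {type(ref).__name__}")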
My Current Production Setup
For the financial data service I mentioned at the start, here’s what I ended up with:
– tracemalloc always on in production, taking snapshots every 10 minutes
– Hourly cron job hits /memory-report and alerts if growth exceeds 200 MB/hour
– Weekly memray profiling on a staging replica under production traffic
– cachetools.TTLCache for all caches, no exceptions
– Explicit .close() calls for Pillow images and file handles
Memory leaks dropped from weekly incidents to zero over the last six months.
The overhead from tracemalloc is real but manageable. We saw about 12% higher memory usage and no noticeable CPU impact (the service is I/O-bound). That’s a fair trade for automatic leak detection.
What I Still Don’t Know
I haven’t figured out a good way to profile memory in multi-process setups (like Gunicorn with 8 workers). tracemalloc is per-process, so you’d need to aggregate snapshots across processes. I’ve hacked together a solution with shared memory and Redis, but it’s ugly.
If you’ve solved this cleanly, I’d love to hear about it.
Also, memray is fantastic but the flame graphs can be overwhelming for complex applications. I wish there was a “show me only allocations that grew since last snapshot” view. Maybe there is and I haven’t found it.
When to Reach for Each Tool
Use tracemalloc when:
– You need always-on monitoring in production
– The leak is in pure Python code
– You want minimal overhead
– You need to compare snapshots over time
Use memray when:
– The leak involves C extensions (numpy, pandas, Pillow)
– You need a visual flame graph to understand allocation patterns
– You can afford to run profiling in a staging environment
– tracemalloc didn’t pinpoint the issue
Use gc.get_objects() when:
– You suspect reference cycles
– You’re debugging locally and can tolerate slow analysis
– You need to inspect live objects of a specific type
Start with tracemalloc. If that doesn’t answer the question in 10 minutes, move to memray. If you’re still stuck, break out gc.get_objects() and prepare for a deep dive.
The key insight: you don’t need to guess where memory is going. Python gives you the tools to know. Use them.