The 47GB Memory Leak I Didn’t See Coming
I deployed a FastAPI service that processed financial data streams. It ran fine for three days. On day four, the OOM killer took it down at 3 AM.
The culprit? A cache that was supposed to hold “recent” data but never actually evicted anything. Classic. But here’s the thing: I didn’t discover this by guessing or adding print statements. I used tracemalloc and memray to pinpoint the exact line of code that was hoarding memory.
This isn’t a guide about memory management theory. It’s about the practical debugging flow I now use every time memory usage looks suspicious.
Why tracemalloc First, memray Later
Python’s built-in tracemalloc module is lightweight enough to leave running in production (with some caveats). memray is heavier but gives you flame graphs and live tracking that make complex leaks obvious.
I start with tracemalloc because it’s already in the standard library. If that doesn’t answer the question, I pull out memray.
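One thing worth knowing up front: you don't even have to touch the code to turn tracking on. The interpreter can start tracemalloc for you, which is handy for a quick look at a service you'd rather not redeploy (the number sets how many stack frames to record per allocation):

# Start tracemalloc at interpreter startup, capturing 10 frames per allocation
python -X tracemalloc=10 your_service.py

# Same thing via an environment variable
PYTHONTRACEMALLOC=10 python your_service.py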
Here’s the basic tracemalloc pattern I use:
import tracemalloc
import asyncio
from collections import defaultdict
# Start tracking at application startup
tracemalloc.start()
# Simulate a leaky cache
cache = defaultdict(list)

async def process_event(event_id: int, data: bytes):
    # This looks innocent but it's a trap
    cache[event_id].append(data)
    await asyncio.sleep(0)  # Simulate async work

async def main():
    # Simulate 100,000 events
    for i in range(100_000):
        await process_event(i % 100, b"X" * 1024)  # 1KB per event

    # Take a snapshot
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics('lineno')

    print("[ Top 5 memory allocations ]")
    for stat in top_stats[:5]:
        print(f"{stat.filename}:{stat.lineno}: {stat.size / 1024 / 1024:.2f} MB")

if __name__ == "__main__":
    asyncio.run(main())
When I ran this (Python 3.11 on Ubuntu 22.04), the output was:
[ Top 5 memory allocations ]
/home/ubuntu/leak_demo.py:11: 97.66 MB
/usr/lib/python3.11/asyncio/events.py:80: 1.23 MB
...
Line 11 is cache[event_id].append(data). The smoking gun.
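When the line that shows up is a generic helper and I need to know who called it, I switch the grouping from 'lineno' to 'traceback' to see the whole call chain. A small sketch, assuming tracemalloc.start() was given a frame depth greater than 1 (e.g. tracemalloc.start(10)):

# Group allocations by full call chain instead of a single line
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('traceback')

# The biggest allocator, with its full call chain
stat = top_stats[0]
print(f"{stat.size / 1024 / 1024:.2f} MB in {stat.count} blocks")
for line in stat.traceback.format():
    print(line)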
Comparing Snapshots: The Delta That Matters
One snapshot tells you where memory lives right now. Two snapshots tell you what’s growing.
This is the pattern I use for long-running services:
import tracemalloc
import time

tracemalloc.start()

# Baseline snapshot
snapshot1 = tracemalloc.take_snapshot()

# Simulate workload
leaky_list = []
for i in range(100000):
    leaky_list.append({"id": i, "data": "X" * 100})

time.sleep(1)

# Second snapshot
snapshot2 = tracemalloc.take_snapshot()

# Compare
top_stats = snapshot2.compare_to(snapshot1, 'lineno')

print("[ Top 3 memory growth sources ]")
for stat in top_stats[:3]:
    print(f"{stat.filename}:{stat.lineno}: +{stat.size_diff / 1024 / 1024:.2f} MB")
    print(f" {stat.count_diff} new objects")
Output:
[ Top 3 memory growth sources ]
/home/ubuntu/leak_demo.py:12: +9.54 MB
100000 new objects
The size_diff and count_diff are what I care about. If a line shows +500 MB over 10 minutes, that’s your leak.
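One refinement: the top of the diff is sometimes noise from the import machinery or tracemalloc's own bookkeeping. Continuing from the snapshots above, filtering that out first looks roughly like this:

# Drop frames from the import system, unknown files, and tracemalloc itself
def without_noise(snapshot):
    return snapshot.filter_traces((
        tracemalloc.Filter(False, "<frozen importlib._bootstrap>"),
        tracemalloc.Filter(False, "<unknown>"),
        tracemalloc.Filter(False, tracemalloc.__file__),
    ))

top_stats = without_noise(snapshot2).compare_to(without_noise(snapshot1), 'lineno')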
The Production Setup: Periodic Snapshots Without Killing Performance
Leaving tracemalloc on in production adds about 10-15% memory overhead and a small CPU cost. For most services, this is acceptable if you snapshot periodically rather than constantly.
Here’s the pattern I use with FastAPI:
from fastapi import FastAPI
import tracemalloc
import asyncio
from datetime import datetime

app = FastAPI()

# Store snapshots in-memory (or write to disk/S3)
snapshots = []

@app.on_event("startup")
async def startup():
    tracemalloc.start(10)  # Track up to 10 stack frames
    # Keep a reference so the background task isn't garbage collected
    app.state.snapshot_task = asyncio.create_task(periodic_snapshot())

async def periodic_snapshot():
    """Take a snapshot every 5 minutes"""
    while True:
        await asyncio.sleep(300)
        snapshot = tracemalloc.take_snapshot()
        snapshots.append((datetime.now(), snapshot))

        # Keep only last 12 snapshots (1 hour)
        if len(snapshots) > 12:
            snapshots.pop(0)

        # Log top 3 allocations
        top_stats = snapshot.statistics('lineno')
        print(f"[{datetime.now()}] Top memory usage:")
        for stat in top_stats[:3]:
            print(f" {stat.filename}:{stat.lineno}: {stat.size / 1024 / 1024:.2f} MB")

@app.get("/memory-report")
async def memory_report():
    """Compare first and last snapshot"""
    if len(snapshots) < 2:
        return {"error": "Not enough snapshots yet"}

    first_time, first_snap = snapshots[0]
    last_time, last_snap = snapshots[-1]
    top_stats = last_snap.compare_to(first_snap, 'lineno')

    return {
        "period": f"{first_time} to {last_time}",
        "top_growth": [
            {
                "file": stat.filename,
                "line": stat.lineno,
                "growth_mb": round(stat.size_diff / 1024 / 1024, 2),
                "new_objects": stat.count_diff
            }
            for stat in top_stats[:10]
        ]
    }
I hit /memory-report from a monitoring script every hour. If growth exceeds a threshold, I get a Slack alert.
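The monitoring script is nothing fancy. Something in this shape works; the service URL, webhook, and threshold below are placeholders you'd swap for your own:

import requests  # third-party; pip install requests

# Placeholders: point these at your own service, Slack webhook, and threshold
SERVICE_URL = "http://localhost:8000/memory-report"
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"
THRESHOLD_MB = 200

def check_memory_growth():
    report = requests.get(SERVICE_URL, timeout=10).json()
    if "error" in report or not report.get("top_growth"):
        return  # not enough snapshots yet
    worst = report["top_growth"][0]  # compare_to() sorts biggest diff first
    if worst["growth_mb"] > THRESHOLD_MB:
        text = (
            f"Memory growth alert: {worst['file']}:{worst['line']} "
            f"grew {worst['growth_mb']} MB ({report['period']})"
        )
        requests.post(SLACK_WEBHOOK, json={"text": text}, timeout=10)

if __name__ == "__main__":
    check_memory_growth()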
One gotcha: tracemalloc.start(10) tracks 10 stack frames. The default is 1, which often isn’t enough to see the real call chain. But more frames = more overhead. I’ve found 10 is a sweet spot.
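If you'd rather measure the cost on your own workload than trust my numbers, tracemalloc will report its own bookkeeping memory alongside what it's tracing:

import tracemalloc

tracemalloc.start(10)

# ... run a representative workload here ...

current, peak = tracemalloc.get_traced_memory()
overhead = tracemalloc.get_tracemalloc_memory()
print(f"Traced allocations: {current / 1024 / 1024:.1f} MB (peak {peak / 1024 / 1024:.1f} MB)")
print(f"Memory used by tracemalloc itself: {overhead / 1024 / 1024:.1f} MB")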
When tracemalloc Isn’t Enough: Enter memray
Sometimes the leak is in a C extension (numpy, pandas, Pillow). tracemalloc only sees Python allocations. memray sees everything.
Install it:
pip install memray
Basic usage:
# Run your script under memray
memray run -o output.bin your_script.py
# Generate a flame graph
memray flamegraph output.bin
This creates memray-flamegraph-output.html. Open it in a browser and you get an interactive flame graph showing where memory is allocated.
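The same capture file feeds memray's other reporters too, which should work if you want something lighter than a flame graph:

# High-level allocation statistics, printed to the terminal
memray stats output.bin

# Sortable table of allocations, rendered as HTML
memray table output.bin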
I once debugged a leak in a service that used Pillow to resize images. tracemalloc showed high memory usage in my code, but couldn’t tell me why. memray revealed that Image.open() was keeping raw image buffers in memory because I wasn’t explicitly calling .close().
The fix:
from PIL import Image

# Before (leaked memory)
def resize_image(path: str) -> bytes:
    img = Image.open(path)
    img = img.resize((800, 600))
    # img never gets closed — buffer leaks
    ...

# After (proper cleanup)
def resize_image(path: str) -> bytes:
    with Image.open(path) as img:
        img = img.resize((800, 600))
        # img.close() called automatically
        ...
memray showed the buffer allocation in C code. tracemalloc couldn’t see it.
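If you need the native frames themselves, memray can record C/C++ stacks as well; the --native flag at record time enables it (it costs extra overhead, so I wouldn't leave it on):

# Capture native stack frames alongside Python frames
memray run --native -o output.bin your_script.py
memray flamegraph output.bin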
Live Tracking with memray attach
The killer feature: attach to a running process.
# Get the PID of your running Python process
pgrep -f "python.*your_service.py"
# Attach memray (requires sudo on some systems)
sudo memray attach <PID>
This gives you a live TUI (text UI) showing memory allocations in real time. I use this when I see memory climbing in production but can’t easily restart the service to profile it.
Caveat: memray attach adds significant overhead (20-30% in my tests). I only use it for a few minutes at a time.
The Real-World Leak Pattern I See Most
It’s not exotic bugs. It’s caches without eviction.
# This is the pattern that bites everyone eventually
class DataService:
    def __init__(self):
        self._cache = {}  # No size limit. Oops.

    async def get_data(self, key: str):
        if key in self._cache:
            return self._cache[key]
        data = await fetch_from_db(key)
        self._cache[key] = data  # Grows forever
        return data
The fix:
from cachetools import TTLCache
import asyncio

class DataService:
    def __init__(self):
        # Max 1000 items, 5-minute TTL
        self._cache = TTLCache(maxsize=1000, ttl=300)
        self._lock = asyncio.Lock()

    async def get_data(self, key: str):
        async with self._lock:
            if key in self._cache:
                return self._cache[key]
        data = await fetch_from_db(key)
        async with self._lock:
            self._cache[key] = data
        return data
Or for synchronous code, just use functools.lru_cache with a maxsize:
from functools import lru_cache

@lru_cache(maxsize=1000)
def expensive_computation(x: int) -> int:
    # Only 1000 results cached, LRU eviction
    return x ** 2
I can’t tell you how many times I’ve seen unbounded caches in production. If you take one thing from this post, let it be: every cache needs an eviction policy.
Profiling Memory Growth Over Time: The Snapshot Diff Pattern
Here’s a script I run overnight when I suspect a slow leak:
import tracemalloc
import time
import json
from datetime import datetime

def take_snapshot_summary():
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics('lineno')
    return {
        "timestamp": datetime.now().isoformat(),
        "top_allocations": [
            {
                "file": stat.filename,
                "line": stat.lineno,
                "size_mb": stat.size / 1024 / 1024,
                "count": stat.count
            }
            for stat in top_stats[:20]
        ]
    }

if __name__ == "__main__":
    tracemalloc.start()
    snapshots = []

    # Run for 1 hour, snapshot every 5 minutes
    for i in range(12):
        snapshots.append(take_snapshot_summary())
        print(f"Snapshot {i+1}/12 taken")
        time.sleep(300)

    # Write to file
    with open("memory_profile.json", "w") as f:
        json.dump(snapshots, f, indent=2)
    print("Profile saved to memory_profile.json")
Then I analyze the JSON to see which files/lines grew the most. Not sophisticated, but effective.
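The analysis step is a few lines of plain Python. A sketch that diffs the first and last snapshot in the JSON written above:

import json

with open("memory_profile.json") as f:
    snapshots = json.load(f)

def by_location(summary):
    # Map "file:line" -> size_mb for one snapshot summary
    return {
        f"{a['file']}:{a['line']}": a["size_mb"]
        for a in summary["top_allocations"]
    }

first, last = by_location(snapshots[0]), by_location(snapshots[-1])

# Sort locations by how much they grew over the run
growth = sorted(
    ((loc, last[loc] - first.get(loc, 0.0)) for loc in last),
    key=lambda item: item[1],
    reverse=True,
)

for loc, delta in growth[:10]:
    print(f"{loc}: +{delta:.2f} MB")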
A Gotcha: Reference Cycles and gc.get_objects()
Sometimes memory isn’t leaked in the traditional sense — it’s held by reference cycles that the garbage collector will eventually clean up, but hasn’t yet.
You can force a collection and see what’s left:
import gc
import sys
from collections import Counter

# Force garbage collection
gc.collect()

# Get all objects tracked by GC
all_objects = gc.get_objects()
print(f"Total objects: {len(all_objects)}")

# Count by type
type_counts = Counter(type(obj).__name__ for obj in all_objects)
print("\nTop 10 object types:")
for obj_type, count in type_counts.most_common(10):
    print(f" {obj_type}: {count}")
If you see thousands of dict or list objects, dig deeper:
# Find all dicts over 1 MB
# Note: sys.getsizeof counts only the dict structure itself, not its values
large_dicts = [
    obj for obj in all_objects
    if isinstance(obj, dict) and sys.getsizeof(obj) > 1024 * 1024
]
print(f"Found {len(large_dicts)} large dicts")
This is a blunt instrument. Use it when you’re truly stuck.
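When the counts point at a suspicious object, gc.get_referrers tells you what is still holding it, which is usually enough to find the owning cache or list. A minimal sketch, continuing from the snippet above:

import gc

# Ask who still refers to the first large dict we found
if large_dicts:
    suspect = large_dicts[0]
    # Exclude our own large_dicts list, which obviously refers to it
    referrers = [r for r in gc.get_referrers(suspect) if r is not large_dicts]
    print(f"{len(referrers)} objects refer to the suspect dict")
    for ref in referrers[:5]:
        # Usually a class __dict__, a module, or a containing list/dict gives it away
        print(f"  {type(ref).__name__}")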
My Current Production Setup
For the financial data service I mentioned at the start, here’s what I ended up with:
– tracemalloc always on in production, taking snapshots every 10 minutes
– Hourly cron job hits /memory-report and alerts if growth exceeds 200 MB/hour
– Weekly memray profiling on a staging replica under production traffic
– cachetools.TTLCache for all caches, no exceptions
– Explicit .close() calls for Pillow images and file handles
Memory leaks dropped from weekly incidents to zero over the last six months.
The overhead from tracemalloc is real but manageable. We saw about 12% higher memory usage and no noticeable CPU impact (the service is I/O-bound). That’s a fair trade for automatic leak detection.
What I Still Don’t Know
I haven’t figured out a good way to profile memory in multi-process setups (like Gunicorn with 8 workers). tracemalloc is per-process, so you’d need to aggregate snapshots across processes. I’ve hacked together a solution with shared memory and Redis, but it’s ugly.
If you’ve solved this cleanly, I’d love to hear about it.
Also, memray is fantastic but the flame graphs can be overwhelming for complex applications. I wish there was a “show me only allocations that grew since last snapshot” view. Maybe there is and I haven’t found it.
When to Reach for Each Tool
Use tracemalloc when:
– You need always-on monitoring in production
– The leak is in pure Python code
– You want minimal overhead
– You need to compare snapshots over time
Use memray when:
– The leak involves C extensions (numpy, pandas, Pillow)
– You need a visual flame graph to understand allocation patterns
– You can afford to run profiling in a staging environment
– tracemalloc didn’t pinpoint the issue
Use gc.get_objects() when:
– You suspect reference cycles
– You’re debugging locally and can tolerate slow analysis
– You need to inspect live objects of a specific type
Start with tracemalloc. If that doesn’t answer the question in 10 minutes, move to memray. If you’re still stuck, break out gc.get_objects() and prepare for a deep dive.
The key insight: you don’t need to guess where memory is going. Python gives you the tools to know. Use them.