Getting Started with Quantitative Investment in Python

Updated Feb 6, 2026

Why Most Beginners Build the Wrong Foundation

Most people learning quantitative investment start by trying to predict stock prices. They download historical data, throw it into a random forest, and wonder why their “95% accurate” model loses money in live trading.

The problem isn’t the model. It’s that they skipped the infrastructure work that actually matters: data hygiene, position sizing, and the difference between backtest performance and tradeable returns. You can have the world’s best alpha signal, but if your execution slippage eats 0.5% per trade and you’re rebalancing daily, you’re done.

I’d argue the first skill to build isn’t modeling—it’s understanding what makes a strategy testable in the first place. That means working data pipelines, reproducible backtests, and a clear mental model of transaction costs. This post walks through that foundation using Python tools that don’t abstract away the important details.


The Minimal Stack You Actually Need

You don’t need a Bloomberg terminal or a $50k budget. For retail quant work, three libraries cover 90% of your needs: yfinance for data, pandas for manipulation, and numpy for math. Later you’ll add backtrader or zipline for backtesting, but start simple.
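All three install straight from PyPI if you don't already have them:

pip install yfinance pandas numpy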

Here’s the realistic setup (tested on Python 3.11, though 3.9+ should work):

import yfinance as yf
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Download 5 years of daily OHLCV data for SPY
ticker = yf.Ticker("SPY")
df = ticker.history(period="5y", interval="1d")

print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"Date range: {df.index[0]} to {df.index[-1]}")

Output:

Shape: (1258, 7)
Columns: ['Open', 'High', 'Low', 'Close', 'Volume', 'Dividends', 'Stock Splits']
Date range: 2021-02-04 to 2026-02-03

Notice yfinance auto-adjusts for splits and dividends by default. That’s convenient but dangerous—if you’re backtesting a strategy that trades on raw prices (say, options strikes), you need unadjusted data. Check the docs for auto_adjust=False if you need that.
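If you do need raw prices, the same history() call accepts the flag directly; a one-line sketch:

# Unadjusted prices: Close is the price that actually printed each day
df_raw = ticker.history(period="5y", interval="1d", auto_adjust=False)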

What “Clean” Data Actually Looks Like

Before running any strategy, you need to know your data quality. Missing bars, zero-volume days, and obvious errors (negative prices, prices that spike 1000x) will silently ruin your backtest.

Here’s a basic sanity check I run on every dataset:

def audit_price_data(df):
    issues = []

    # Check for missing dates (trading days only)
    date_diff = df.index.to_series().diff()
    gaps = date_diff[date_diff > timedelta(days=5)]  # weekend + 1 day buffer
    if len(gaps) > 0:
        issues.append(f"Found {len(gaps)} date gaps > 5 days")

    # Zero or negative prices (shouldn't happen but...)
    if (df['Close'] <= 0).any():
        issues.append("Negative or zero close prices detected")

    # Unrealistic daily returns (>30% moves)
    returns = df['Close'].pct_change()
    extreme = returns[abs(returns) > 0.30]
    if len(extreme) > 0:
        issues.append(f"{len(extreme)} days with >30% return (check for errors)")

    # Volume anomalies
    if (df['Volume'] == 0).any():
        zero_vol_count = (df['Volume'] == 0).sum()
        issues.append(f"{zero_vol_count} days with zero volume")

    return issues if issues else ["No issues detected"]

print(audit_price_data(df))

For SPY this should return clean, but try it on a thinly-traded small-cap stock and you’ll see why this matters. I’ve seen backtests claim 200% annual returns because a single illiquid stock had a data error where the close price was recorded as $0.01 for one day, making the “return” astronomical.
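To watch the audit actually fire, point it at something illiquid. The symbol below is a placeholder, not a recommendation; substitute any thinly-traded small-cap:

# Same audit on an illiquid name (placeholder symbol; pick your own small-cap)
small = yf.Ticker("SOME_SMALL_CAP").history(period="5y", interval="1d")
print(audit_price_data(small))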

Returns Are Not What You Think

The most common beginner mistake is calculating returns wrong. Simple returns ($r_t = \frac{P_t - P_{t-1}}{P_{t-1}}$) seem intuitive, but they don't add across time. If a stock drops 50% then rises 50%, you're not back to even; you're down 25%.

Log returns fix this. The continuously compounded return is:

$$r_t = \ln\left(\frac{P_t}{P_{t-1}}\right)$$

Log returns are additive: $r_{total} = \sum_{t=1}^{T} r_t$. They're also symmetric: the move from $P$ up to $1.1P$ and the move back down to $P$ have log returns of equal magnitude and opposite sign. And they're the mathematically correct input for most statistical models.
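A quick sanity check of the 50% example makes the contrast concrete:

# Drop 50%, then rise 50%: simple returns sum to zero, but you lost money
p0, p1, p2 = 100.0, 50.0, 75.0
print((p1 / p0 - 1) + (p2 / p1 - 1))      # 0.0 -- simple returns "cancel"
print(p2 / p0 - 1)                        # -0.25 -- the actual total return
print(np.log(p1 / p0) + np.log(p2 / p1))  # -0.2877 = ln(0.75) -- log returns add correctly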

But here’s the catch: when you report performance to a human, convert back to simple returns. Nobody intuitively understands “I made 0.15 log-returns this year.”

# Calculate both for comparison
df['simple_return'] = df['Close'].pct_change()
df['log_return'] = np.log(df['Close'] / df['Close'].shift(1))

# Over the full 5-year sample, the difference compounds
total_simple = (1 + df['simple_return']).prod() - 1
total_log = df['log_return'].sum()

print(f"Full-period simple return: {total_simple:.2%}")
print(f"Full-period log return: {total_log:.4f}")
print(f"Log converted back: {(np.exp(total_log) - 1):.2%}")  # should match simple

For daily returns on liquid stocks, the difference is small. For leveraged strategies or long time horizons, it matters a lot.

Your First Tradeable Signal (Not a Model)

Forget machine learning for now. Start with something you can reason about: mean reversion on a volatility-normalized basis.

The idea: when a stock deviates far from its recent average, it tends to revert. But “far” depends on how volatile the stock is. A 2% daily move in a biotech penny stock is nothing; in a utilities stock, it’s huge. We normalize by rolling standard deviation.

The z-score of returns measures how many standard deviations away from normal the current return is:

$$z_t = \frac{r_t - \mu_{window}}{\sigma_{window}}$$

where $\mu_{window}$ and $\sigma_{window}$ are the mean and standard deviation over a rolling window (say, 20 days).

window = 20
df['roll_mean'] = df['log_return'].rolling(window).mean()
df['roll_std'] = df['log_return'].rolling(window).std()
df['z_score'] = (df['log_return'] - df['roll_mean']) / df['roll_std']

# Simple mean reversion rule:
# If z > 2, expect reversion down (sell signal)
# If z < -2, expect reversion up (buy signal)
df['signal'] = 0
df.loc[df['z_score'] > 2, 'signal'] = -1  # sell
df.loc[df['z_score'] < -2, 'signal'] = 1   # buy

print(f"Buy signals: {(df['signal'] == 1).sum()}")
print(f"Sell signals: {(df['signal'] == -1).sum()}")

This isn’t a good strategy (mean reversion works better on pairs or portfolios than single trending assets like SPY). But it’s testable. You have a clear entry rule, and you can measure what happens next.
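To put a number on "what happens next," condition the next day's return on today's signal. The forward shift is deliberate: we're evaluating the signal here, not trading it:

# Average next-day log return conditional on today's signal
# shift(-1) looks one day ahead on purpose -- evaluation, not a tradeable rule
df['next_return'] = df['log_return'].shift(-1)
print(df.groupby('signal')['next_return'].agg(['mean', 'count']))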

The Backtest That Tells You Nothing

Here’s what most tutorials do next: calculate hypothetical returns assuming you bought at the close price whenever signal == 1. They multiply signal by next-day return, sum it up, and declare victory.

That’s not a backtest. That’s fantasy.

Real backtests account for:
Execution slippage: You don't get the close price. Market orders slip, and limit orders don't always fill.
Transaction costs: Commissions (even "$0 commission" brokers pass along SEC fees and spread costs).
Position sizing: How much capital per trade? Fixed dollar amount? Kelly criterion? Volatility-scaled (see the sketch after this list)?
Rebalancing frequency: Daily rebalancing means daily costs. Weekly is more realistic.
Survivorship bias: Did you only test stocks that survived to today?
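To give one flavor of the sizing question, here's a minimal sketch that scales the ±1 signal to target constant volatility. The 10% annual target, 20-day window, and 3x leverage cap are arbitrary illustration values, not recommendations:

# Volatility-scaled sizing: lever the signal up or down to target constant risk
target_vol = 0.10  # assumed annualized volatility target (illustrative)
realized_vol = df['log_return'].rolling(20).std() * np.sqrt(252)  # annualized realized vol
df['scaled_position'] = (df['signal'] * target_vol / realized_vol).clip(-3, 3)  # cap leverage at 3x

The backtest below sticks with the raw ±1 position to keep the cost accounting easy to follow.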

For now, let’s add basic transaction costs. Assume 5 basis points (0.05%) per trade, which is realistic for retail equity trading including spread:

# Forward-fill signals (hold position until next signal)
df['position'] = df['signal'].replace(0, np.nan).ffill().fillna(0)

# Calculate strategy returns
df['strategy_return'] = df['position'].shift(1) * df['log_return']

# Transaction cost: 5 bps every time position changes
df['position_change'] = df['position'].diff().abs()
df['transaction_cost'] = df['position_change'] * 0.0005

df['net_return'] = df['strategy_return'] - df['transaction_cost']

# Cumulative performance
cum_strategy = (df['net_return']).cumsum()
cum_bnh = df['log_return'].cumsum()  # buy and hold

print(f"Strategy cumulative log return: {cum_strategy.iloc[-1]:.4f}")
print(f"Buy-and-hold cumulative log return: {cum_bnh.iloc[-1]:.4f}")
print(f"Number of trades: {df['position_change'].sum() / 2:.0f}")  # /2 because entry+exit
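And per the earlier advice, convert those log returns back to simple terms before reporting them to anyone:

# Humans think in simple returns, so convert before reporting
print(f"Strategy total return: {np.exp(cum_strategy.iloc[-1]) - 1:.2%}")
print(f"Buy-and-hold total return: {np.exp(cum_bnh.iloc[-1]) - 1:.2%}")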

On SPY over 5 years, this mean-reversion strategy probably underperforms buy-and-hold (SPY trends up, and we’re betting against momentum). That’s fine. The point isn’t to get rich—it’s to build the machinery that lets you test ideas honestly.

What This Doesn’t Cover (And Why That’s OK)

We haven’t talked about risk management, Sharpe ratios, maximum drawdown, Monte Carlo simulations, or any of the stuff that makes a strategy production-ready. We also haven’t addressed data snooping bias (if you test 100 strategies and pick the best one, you’ve overfit).

But here’s what you do have now: a repeatable workflow for ingesting data, cleaning it, calculating features, generating signals, and measuring performance with costs included. That’s the foundation. Everything else builds on this.

If I were starting over, I’d spend less time tweaking models and more time understanding regime changes. A strategy that works in a bull market often implodes in a bear market (or vice versa). The z-score strategy above, for instance, probably does better in sideways markets than trending ones. You won’t know until you split your backtest by market regime—something we’ll tackle in Part 4 when we build a proper backtesting framework.

For now, the goal is to get comfortable with the data layer. In Part 2, we’ll pull real-time data from multiple sources, handle corporate actions properly, and build a data pipeline that doesn’t break when Yahoo Finance changes their API (again).

Quant Investment with Python Series (1/8)
