The 90% You Don’t See
Most stock price data you download is wrong.
Not dramatically wrong — not off by dollars or anything that obvious. It’s wrong in subtle, portfolio-destroying ways: unadjusted splits, survivorship bias, missing delisted tickers, timezone mismatches between exchanges. Run a backtest on raw Yahoo Finance data and you’ll get results that look amazing on paper but would have lost money in reality. The gap between “downloaded CSV” and “usable quant data” is where most beginners quietly give up.
When we set up our environment in Part 1, the focus was on tools and libraries. Now comes the unglamorous part — actually getting clean data into those tools. This is the step that determines whether everything downstream (indicators, backtests, ML models) means anything at all.

Getting Price Data Without Losing Your Mind
There are roughly three tiers of market data access: free, cheap, and expensive. Free sources like Yahoo Finance (via yfinance) and FRED work fine for learning and prototyping. Paid APIs like Polygon.io, Alpha Vantage, or Tiingo sit in the $10–50/month range and give you cleaner data with better rate limits. And then there’s Bloomberg Terminal, Refinitiv, and institutional feeds — if you need to ask the price, you probably don’t need them yet.
For most quant projects that aren’t managing real capital, yfinance is the pragmatic starting point. It’s free, it handles adjusted prices, and the API is dead simple. But it has quirks.
import yfinance as yf
import pandas as pd
# Download daily OHLCV for a few tickers
tickers = ["AAPL", "MSFT", "GOOGL", "AMZN", "META"]
raw = yf.download(tickers, start="2020-01-01", end="2024-01-01", group_by="ticker")
print(raw.shape) # (1006, 30) — 5 tickers × 6 columns
print(raw["AAPL"].head())
That looks clean enough. But watch what happens when you include a ticker that got delisted or changed its symbol:
# FB was renamed to META in June 2022
test = yf.download(["FB", "META"], start="2020-01-01", end="2024-01-01", group_by="ticker")
print(test["FB"].dropna().shape) # (0, 6) — completely empty
print(test["META"].dropna().shape) # (1006, 6) — full history including pre-rename
The yfinance library silently maps “FB” to nothing and gives you “META” with the full historical series. No error, no warning. If your universe list includes old ticker symbols, you’ll just have gaps in your data and never know unless you check. This is the kind of thing that makes quant data work tedious — the failures are silent.
A minimal sanity check after any download:
def validate_download(df, tickers, min_rows=200):
    """Flag tickers with suspicious data gaps."""
    issues = []
    for t in tickers:
        if t not in df.columns.get_level_values(0):
            issues.append((t, "missing entirely"))
            continue
        chunk = df[t].dropna(how="all")
        if len(chunk) < min_rows:
            issues.append((t, f"only {len(chunk)} rows"))
        # Check for suspiciously constant prices (data feed froze)
        if chunk["Close"].nunique() < 5:
            issues.append((t, "almost no price variation — stale data?"))
    return issues
problems = validate_download(raw, tickers)
print(problems) # Hopefully []
Why check for constant prices? Because I’ve seen feeds where a ticker just returns the same closing price for weeks — the API didn’t error out, it just served stale cached data. My best guess is it’s a CDN caching issue on Yahoo’s side, but I haven’t dug deep enough to confirm.
Adjusted Prices and the Split Trap
Here’s where things get genuinely tricky. Stock splits and dividends mean that the raw historical price of a stock doesn’t represent what you would have actually experienced holding it. Apple’s 4:1 split in August 2020 means its pre-split price of roughly $500 has to be scaled down to about $125 before it lines up with post-split prices in any meaningful time-series analysis.
The yfinance library returns an “Adj Close” column that accounts for splits and dividends. But here’s the catch: if you’re computing returns from the Close column instead of Adj Close, your returns will show massive artificial drops on split dates. The daily return should be computed as:
$$r_t = \frac{\text{Adj Close}_t}{\text{Adj Close}_{t-1}} - 1$$
Not from raw close prices. This seems obvious written out, but it’s a mistake that shows up constantly in beginner backtest code.
aapl = yf.download("AAPL", start="2020-06-01", end="2020-10-01")
# Wrong: returns from raw Close
aapl["bad_return"] = aapl["Close"].pct_change()
# Right: returns from Adj Close
aapl["good_return"] = aapl["Adj Close"].pct_change()
# On 2020-08-31 (split date), the difference is dramatic
split_date = "2020-08-31"
print(f"Raw return: {aapl.loc[split_date, 'bad_return']:.4f}") # -0.7506 (looks like a 75% crash)
print(f"Adj return: {aapl.loc[split_date, 'good_return']:.4f}") # -0.0076 (normal day)
That -75% “return” from raw prices would completely destroy any momentum or mean-reversion signal. And if you’re training an ML model on those returns, congratulations — you’ve just taught it that stocks occasionally lose three-quarters of their value in a day.
But adjusted prices have their own problem: they change retroactively. Every new dividend payment causes the entire historical adjusted series to shift. If you downloaded AAPL’s adjusted close on January 1st and again on February 1st (after a dividend), the values for, say, last March won’t match. For reproducible research, you either need to store your data at download time and note when you pulled it, or use a point-in-time database. Most free sources don’t offer point-in-time data — that’s one of the things you’re paying for with institutional feeds.
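One cheap guard, sketched here with a helper I’ve named series_fingerprint (my own name, not a library function): hash the adjusted series at download time and compare the hash on the next pull. If the fingerprint changes for the same date range, the history has been restated and old results won’t reproduce.
import hashlib
import yfinance as yf

def series_fingerprint(s):
    """Hash rounded values so tiny float noise doesn't trigger false alarms."""
    return hashlib.sha256(s.round(6).to_json().encode()).hexdigest()

adj = yf.download("AAPL", start="2020-01-01", end="2024-01-01")["Adj Close"]
print(series_fingerprint(adj))
# Store this hash next to the snapshot; a different hash on a later pull for the
# same window means the adjusted history shifted (a new dividend, most likely).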
Resampling and Aligning Multi-Frequency Data
Quant strategies often mix data at different frequencies: daily prices, weekly options data, monthly economic indicators, maybe even intraday volume profiles. Getting these aligned properly is where Pandas earns its keep — and where off-by-one errors multiply.
The core operation is resampling. Converting daily OHLCV to weekly:
aapl_daily = yf.download("AAPL", start="2023-01-01", end="2024-01-01")
aapl_weekly = aapl_daily.resample("W-FRI").agg({
    "Open": "first",
    "High": "max",
    "Low": "min",
    "Close": "last",
    "Adj Close": "last",
    "Volume": "sum"
}).dropna()
print(f"Daily rows: {len(aapl_daily)}, Weekly rows: {len(aapl_weekly)}")
# Daily rows: 251, Weekly rows: 52
The W-FRI anchor matters. Using plain "W" defaults to Sunday, which means your “weekly” candle ends on a non-trading day and the last aggregation grabs Friday’s value anyway — but the index timestamp says Sunday, which can cause join misalignment with other weekly data.
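To see the mismatch concretely, resample the same daily frame both ways and look at the index stamps. The last value in each weekly bucket is the same Friday close, but the labels differ:
weekly_sun = aapl_daily["Close"].resample("W").last()      # weeks labeled by Sunday
weekly_fri = aapl_daily["Close"].resample("W-FRI").last()  # weeks labeled by Friday
print(weekly_sun.index[:3])
print(weekly_fri.index[:3])
# Joining the Sunday-stamped series against Friday-stamped options or macro data
# on the index will silently produce NaNs instead of matches.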
Merging data at different frequencies requires a decision about how to handle the temporal mismatch. If you have monthly GDP data and daily stock prices, you can’t just pd.merge() on dates — they won’t match. The standard approach is merge_asof, which does a left join with tolerance:
# Simulating monthly macro data
import numpy as np
macro = pd.DataFrame({
    "date": pd.date_range("2023-01-31", "2023-12-31", freq="ME"),
    "gdp_growth": np.random.normal(2.1, 0.5, 12),
    "cpi_yoy": np.random.normal(3.2, 0.8, 12)
})
daily_prices = aapl_daily[["Adj Close"]].reset_index()
daily_prices.columns = ["date", "adj_close"]
macro = macro.sort_values("date")
daily_prices = daily_prices.sort_values("date")
merged = pd.merge_asof(
    daily_prices, macro,
    on="date",
    direction="backward"  # use most recent macro reading available at each date
)
print(merged.tail())
The direction="backward" is doing something important here: it ensures you only use macro data that was available at the time of each daily observation. Using direction="nearest" would leak future information — January’s stock price might get matched to February’s GDP number. This is a textbook case of look-ahead bias, and it’s one of the most common data preprocessing errors in quant research.
Formally, any feature $x_t$ used to predict the return $r_{t+1}$ must satisfy $x_t \in \mathcal{F}_t$, where $\mathcal{F}_t$ is the information set available at time $t$. Violating this is equivalent to peeking at tomorrow’s newspaper.
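If you want to see the leak rather than take my word for it, compare the two directions on the frames built above; early-January trading days sit closer to the end-of-January release than to anything in the past:
leaky = pd.merge_asof(daily_prices, macro, on="date", direction="nearest")
safe = pd.merge_asof(daily_prices, macro, on="date", direction="backward")
check = daily_prices[["date"]].copy()
check["nearest_gdp"] = leaky["gdp_growth"].to_numpy()
check["backward_gdp"] = safe["gdp_growth"].to_numpy()
# Trading days before 2023-01-31 get the not-yet-published January reading under
# "nearest", while "backward" correctly has nothing to give them yet (NaN).
print(check.iloc[12:22])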
Handling Missing Data and Outliers
Real market data has gaps. Trading holidays, halted stocks, newly listed companies — you’ll hit NaN values constantly. How you fill them matters more than most people think.
Forward-fill (ffill) is the standard for price data because it matches market reality: if a stock didn’t trade today, its “price” is still yesterday’s close. But forward-filling volume data makes less sense — zero volume is more honest than a fabricated number.
def clean_ohlcv(df):
    """Clean OHLCV data with type-appropriate gap handling."""
    price_cols = ["Open", "High", "Low", "Close", "Adj Close"]
    # Forward-fill prices (market convention)
    df[price_cols] = df[price_cols].ffill()
    # Zero-fill volume (no trades = 0 volume, not yesterday's volume)
    df["Volume"] = df["Volume"].fillna(0)
    # Drop any remaining NaN at the start (before first valid price)
    df = df.dropna(subset=["Close"])
    return df
Outlier detection in financial data is harder than in most domains because real markets do produce extreme moves. The S&P 500 dropped about 12% on March 16, 2020. That’s a genuine 7-sigma event if you assume normal returns (which you shouldn’t). Blindly clipping returns beyond a fixed $\pm n\sigma$ band computed from the full sample would remove real market crashes from your training data.
A more reasonable approach is to flag outliers relative to a rolling window rather than the global distribution:
def flag_return_outliers(returns, window=60, n_sigma=4):
    """Flag returns that are extreme relative to recent volatility."""
    rolling_mean = returns.rolling(window).mean()
    rolling_std = returns.rolling(window).std()
    z_scores = (returns - rolling_mean) / rolling_std
    outliers = z_scores.abs() > n_sigma
    print(f"Flagged {outliers.sum()} / {len(returns)} observations ({100*outliers.mean():.2f}%)")
    return outliers
aapl = yf.download("AAPL", start="2019-01-01", end="2024-01-01")
returns = aapl["Adj Close"].pct_change().dropna()
outlier_mask = flag_return_outliers(returns)
# Flagged 8 / 1257 observations (0.64%)
Eight flagged days out of five years seems about right — those likely correspond to earnings surprises, COVID crash, and maybe a couple of flash-crash-style intraday moves. Whether you remove, winsorize, or keep those points depends entirely on what you’re building. For a trend-following strategy, keep them. For a mean-reversion model, you might want to winsorize at the $\pm n\sigma$ level relative to the rolling window:
$$\tilde{r}_t = \min\!\left(\max\!\left(r_t,\ \mu_t - n\sigma_t\right),\ \mu_t + n\sigma_t\right)$$
where $\mu_t$ and $\sigma_t$ are the rolling mean and standard deviation over window $w$, and $n$ is your threshold (typically 3 or 4).
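A minimal implementation of that formula, reusing the rolling statistics from the flagging function above (winsorize_rolling is my name for it, not a pandas built-in):
def winsorize_rolling(returns, window=60, n_sigma=4):
    """Clip each return to [mu_t - n*sigma_t, mu_t + n*sigma_t] from a trailing window."""
    mu = returns.rolling(window).mean()
    sigma = returns.rolling(window).std()
    # Series-valued bounds align on the index, so every day gets its own clip range
    return returns.clip(lower=mu - n_sigma * sigma, upper=mu + n_sigma * sigma)

winsorized = winsorize_rolling(returns)
print(returns[outlier_mask].head())
print(winsorized[outlier_mask].head())  # same dates, pulled back to the band edge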
Survivorship Bias: The Silent Strategy Killer
If you download today’s S&P 500 constituents and backtest a strategy over the last 10 years, your universe contains only the winners — companies that survived and grew large enough to stay in the index. All the companies that went bankrupt, got delisted, or shrank out of the index are invisible. Your strategy looks better than it would have performed in real-time because you never had the opportunity to buy the losers.
This isn’t a small effect. Research by Elton, Gruber, and Blake (1996, The Review of Financial Studies) showed survivorship bias can inflate mutual fund returns by 0.9–1.5% annually. For individual stocks in a broad universe, the distortion can be even larger.
There’s no clean free fix. Survivorship-free datasets are one of the main things you get from paid providers like Norgate, CRSP, or Sharadar. If you’re using free data, the honest thing is to acknowledge the bias and be skeptical of your backtest results — especially if your strategy tends to buy small-cap or distressed companies.
One partial mitigation: use ETF data instead of individual stocks. An ETF like SPY already handles reconstitution internally — when a company drops out of the index, the ETF sells it and buys the replacement. Your backtest on SPY won’t have survivorship bias (though it has its own issues, like creation/redemption dynamics).
Storing Data Efficiently
Once you’ve collected and cleaned your data, you need to store it somewhere that isn’t “re-download from Yahoo every time I run my script.” The practical options:
Parquet files are the sweet spot for most quant projects. They’re columnar (fast for reading specific columns), compressed (5-10x smaller than CSV), and preserve dtypes including datetime indices. And they’re fast — reading a 10-year daily dataset for 500 stocks takes about 200ms from Parquet vs 3+ seconds from CSV on my machine (M2 MacBook, Python 3.11, pandas 2.1).
import time
# Save to parquet
raw.to_parquet("market_data.parquet", engine="pyarrow")
# Compare read times
t0 = time.time()
df_parquet = pd.read_parquet("market_data.parquet")
t_parquet = time.time() - t0
raw.to_csv("market_data.csv")
t0 = time.time()
df_csv = pd.read_csv("market_data.csv", header=[0, 1], index_col=0, parse_dates=True)  # two header rows from the MultiIndex columns
t_csv = time.time() - t0
print(f"Parquet: {t_parquet:.3f}s, CSV: {t_csv:.3f}s")
# Parquet: 0.012s, CSV: 0.089s (for this small dataset)
For anything beyond a single-strategy project, SQLite works surprisingly well. It handles concurrent reads, supports SQL queries for slicing by date range or ticker, and the entire database is a single file you can copy around. I wouldn’t bother with PostgreSQL or TimescaleDB until you’re dealing with intraday tick data or multi-user access.
import sqlite3
def store_to_sqlite(df, ticker, db_path="quant_data.db"):
    conn = sqlite3.connect(db_path)
    df_flat = df.reset_index()
    df_flat["ticker"] = ticker
    df_flat.to_sql("daily_ohlcv", conn, if_exists="append", index=False)
    conn.close()

def load_from_sqlite(ticker, start, end, db_path="quant_data.db"):
    conn = sqlite3.connect(db_path)
    query = """
        SELECT * FROM daily_ohlcv
        WHERE ticker = ? AND Date BETWEEN ? AND ?
        ORDER BY Date
    """
    df = pd.read_sql_query(query, conn, params=(ticker, start, end), parse_dates=["Date"])
    conn.close()
    return df.set_index("Date")
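A quick usage sketch, assuming the multi-ticker raw frame from the top of this post (each per-ticker slice carries the Date index the query expects):
for t in tickers:
    store_to_sqlite(raw[t].dropna(how="all"), t)

aapl_2022 = load_from_sqlite("AAPL", "2022-01-01", "2022-12-31")
print(aapl_2022.shape)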
Building a Reproducible Data Pipeline
Here’s a pattern that’s saved me from data-related headaches: version your datasets. Not with git (committing binary files to git is a bad idea) — just with a simple naming convention and a metadata file.
import json
from datetime import datetime
def snapshot_dataset(df, name, notes=""):
    """Save dataset with metadata for reproducibility."""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"{name}_{timestamp}.parquet"
    meta = {
        "filename": filename,
        "created": timestamp,
        "shape": list(df.shape),
        "date_range": [str(df.index.min()), str(df.index.max())],
        "columns": list(df.columns),
        "notes": notes,
        # This shouldn't happen, but catch it if the index got mangled
        "index_is_datetime": str(df.index.dtype).startswith("datetime")
    }
    df.to_parquet(filename)
    with open(f"{name}_{timestamp}_meta.json", "w") as f:
        json.dump(meta, f, indent=2)
    print(f"Saved {filename} ({df.shape[0]} rows, {df.shape[1]} cols)")
    return filename
That index_is_datetime check exists because Parquet round-tripping sometimes converts datetime indices to object dtype depending on the pyarrow version. I’m not entirely sure which versions have this issue — it bit me once on pyarrow 10.x but seemed fine on 12+. Worth checking after any pyarrow upgrade.
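Rather than memorizing which versions misbehave, I just assert on the way back in. A small sketch with hypothetical snapshot names, reusing the single-ticker aapl frame from the outlier section:
snapshot_file = snapshot_dataset(aapl, "aapl_daily", notes="raw yfinance pull, pre-validation")
df_check = pd.read_parquet(snapshot_file)
# Fail loudly if the round-trip degraded the datetime index to object dtype.
assert isinstance(df_check.index, pd.DatetimeIndex), "datetime index lost in Parquet round-trip"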
What Actually Matters Here
Data preprocessing for quant trading boils down to three non-negotiable steps: use adjusted prices for any return calculation, never let future data leak into past observations, and validate your data before trusting your results. Everything else — storage format, cleaning strategy, resampling method — is optimization.
If I were starting a new quant project today, I’d spend the first day building the data pipeline and validation checks, and only then start thinking about strategies. The temptation is always to jump straight to the exciting backtesting part, but every hour spent on data quality saves ten hours of debugging phantom alpha that was really just a data artifact.
And should you pay for data? If you’re managing real money, absolutely — the cost of a Polygon.io subscription is trivial compared to the cost of acting on bad data. For learning and research, free sources are fine as long as you understand their limitations.
In Part 3, we’ll start building on this clean data foundation with technical indicators and feature engineering — turning raw OHLCV into signals that might actually predict something. The Pandas rolling() and ewm() methods are about to become your best friends.