Series Recap
Throughout this AI-Based Financial Text Mining series, we have built a comprehensive toolkit for extracting insights from financial text. We started with sentiment analysis of financial news using FinBERT (Episode 1), mapped market volatility to global headlines (Episode 2), decoded central bank speeches with NLP (Episode 3), and extracted alpha signals from social media (Episode 4). In this final episode, we tackle one of the most data-rich and analytically demanding tasks in financial NLP: automating the summarization of earnings calls and SEC filings using large language models.
Earnings calls and 10-K/10-Q filings contain hundreds of pages of dense financial language. Analysts spend countless hours reading these documents to extract key metrics, forward guidance, and risk factors. LLMs offer a path to automate this process — but the stakes are high. A hallucinated revenue figure or a misinterpreted risk factor can lead to costly investment decisions.
The SEC EDGAR Ecosystem
What Is EDGAR?
The Electronic Data Gathering, Analysis, and Retrieval (EDGAR) system is the SEC’s public database where companies file mandatory disclosures. The two most important filing types for our purposes are:
| Filing Type | Frequency | Content |
|---|---|---|
| 10-K | Annual | Comprehensive financial report, audited statements, risk factors, MD&A |
| 10-Q | Quarterly | Unaudited quarterly financials, updated risk factors, interim MD&A |
| 8-K | Event-driven | Material events (earnings releases, M&A, leadership changes) |
| DEF 14A | Annual | Proxy statements, executive compensation |
Kaggle Dataset: The SEC EDGAR Database (10-K/10-Q) dataset on Kaggle provides pre-downloaded filings with structured metadata. This is an excellent starting point if you want to skip the API retrieval step and jump straight into text processing.
Retrieving Filings from the EDGAR API
The SEC publishes free, public JSON endpoints for ticker-to-CIK lookups and filing metadata. Let's build a retrieval module:
import requests
import time
from dataclasses import dataclass
from typing import Optional
# SEC requires a User-Agent header identifying you
HEADERS = {
"User-Agent": "YourName your.email@example.com",
"Accept-Encoding": "gzip, deflate",
}
# SEC allows at most 10 requests per second; 0.12 s between requests stays safely under
RATE_LIMIT_DELAY = 0.12
@dataclass
class SECFiling:
"""Represents a single SEC filing with metadata."""
company: str
cik: str
filing_type: str
date: str
accession_number: str
document_url: str
def get_cik(ticker: str) -> str:
"""Look up a company's CIK number from its ticker symbol."""
url = "https://efts.sec.gov/LATEST/search-index?q=%s&dateRange=custom&startdt=2024-01-01&enddt=2025-01-01&forms=10-K" % ticker
# Use the company tickers JSON endpoint instead
tickers_url = "https://www.sec.gov/files/company_tickers.json"
resp = requests.get(tickers_url, headers=HEADERS)
resp.raise_for_status()
data = resp.json()
for entry in data.values():
if entry["ticker"].upper() == ticker.upper():
# CIK must be zero-padded to 10 digits
return str(entry["cik_str"]).zfill(10)
raise ValueError(f"Ticker {ticker} not found")
def get_recent_filings(cik: str, filing_type: str = "10-K", count: int = 5) -> list[dict]:
"""Retrieve recent filings metadata for a given CIK."""
url = f"https://data.sec.gov/submissions/CIK{cik}.json"
resp = requests.get(url, headers=HEADERS)
resp.raise_for_status()
time.sleep(RATE_LIMIT_DELAY)
filings_data = resp.json()
recent = filings_data["filings"]["recent"]
results = []
for i in range(len(recent["form"])):
if recent["form"][i] == filing_type:
accession = recent["accessionNumber"][i].replace("-", "")
doc_name = recent["primaryDocument"][i]
results.append(SECFiling(
company=filings_data["name"],
cik=cik,
filing_type=filing_type,
date=recent["filingDate"][i],
accession_number=recent["accessionNumber"][i],
document_url=f"https://www.sec.gov/Archives/edgar/data/{cik}/{accession}/{doc_name}"
))
if len(results) >= count:
break
return results
def download_filing_text(filing: SECFiling) -> str:
"""Download the full text of a filing."""
resp = requests.get(filing.document_url, headers=HEADERS)
resp.raise_for_status()
time.sleep(RATE_LIMIT_DELAY)
return resp.text
# Example: get Apple's recent 10-K filings
cik = get_cik("AAPL")
filings = get_recent_filings(cik, "10-K", count=3)
for f in filings:
print(f"{f.date} | {f.filing_type} | {f.company}")
Important: The SEC enforces a strict rate limit of 10 requests per second. Always include a proper User-Agent header with your name and email, or your IP may be temporarily blocked.
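When polling EDGAR repeatedly, it pays to centralize pacing and retries instead of scattering `time.sleep` calls. A minimal sketch (the retry count and backoff schedule are illustrative choices, not SEC guidance):

```python
import time
import requests

def polite_get(url: str, max_retries: int = 3, delay: float = 0.12) -> requests.Response:
    """GET with SEC-friendly pacing and simple exponential backoff."""
    for attempt in range(max_retries):
        time.sleep(delay * (2 ** attempt))  # pace every request; back off on retries
        resp = requests.get(url, headers=HEADERS)
        # 403 and 429 are the typical signals that EDGAR is throttling you
        if resp.status_code in (403, 429):
            continue
        resp.raise_for_status()
        return resp
    raise RuntimeError(f"Giving up on {url} after {max_retries} attempts")
```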
Chunking Strategies for Long Financial Documents
10-K filings can easily exceed 100,000 tokens — well beyond the context window of most LLMs. Even models with 128K or 200K context windows struggle with the full document because of attention dilution: the model “pays less attention” to information in the middle of very long inputs.
The Chunking Problem
Let $D$ be a document with $T$ total tokens. If the LLM context window is $C$ tokens, we need to split $D$ into $N$ chunks:

$$N = \left\lceil \frac{T}{C - R} \right\rceil$$

where $R$ is the token budget reserved for the prompt and output. For a 100K-token filing with a 16K context window and 2K reserved for prompt/output:

$$N = \left\lceil \frac{100{,}000}{16{,}000 - 2{,}000} \right\rceil = 8 \text{ chunks}$$
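The same arithmetic in code, so you can budget chunks before calling a model (a trivial helper, included for convenience):

```python
import math

def num_chunks(total_tokens: int, context_window: int, reserved: int) -> int:
    """How many chunks a document needs, given a context window and reserved budget."""
    return math.ceil(total_tokens / (context_window - reserved))

print(num_chunks(100_000, 16_000, 2_000))  # -> 8
```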
Section-Aware Chunking
Naive fixed-size chunking breaks documents at arbitrary points, often splitting tables or sentences. A better approach is section-aware chunking that respects the document’s structure:
import re
from bs4 import BeautifulSoup
from dataclasses import dataclass
@dataclass
class DocumentSection:
"""A semantically meaningful section of a financial document."""
title: str
content: str
token_estimate: int
def extract_sections_from_10k(html_text: str) -> list[DocumentSection]:
"""Parse a 10-K filing into semantic sections.
10-K filings follow a standard structure defined by SEC regulation.
We use this structure to create meaningful chunks.
"""
soup = BeautifulSoup(html_text, "html.parser")
text = soup.get_text(separator="\n", strip=True)
# Standard 10-K section headers (SEC regulation S-K)
section_patterns = [
(r"(?i)item\s*1[^0-9ABab].*?business", "Business Overview"),
(r"(?i)item\s*1a.*?risk\s*factors", "Risk Factors"),
(r"(?i)item\s*1b.*?unresolved\s*staff", "Unresolved Staff Comments"),
(r"(?i)item\s*2.*?properties", "Properties"),
(r"(?i)item\s*3.*?legal\s*proceedings", "Legal Proceedings"),
(r"(?i)item\s*5.*?market", "Market Information"),
(r"(?i)item\s*6.*?selected\s*financial", "Selected Financial Data"),
(r"(?i)item\s*7[^a].*?discussion\s*and\s*analysis", "MD&A"),
(r"(?i)item\s*7a.*?quantitative", "Market Risk Disclosures"),
(r"(?i)item\s*8.*?financial\s*statements", "Financial Statements"),
]
sections = []
# Find section boundaries
boundaries = []
for pattern, label in section_patterns:
match = re.search(pattern, text)
if match:
boundaries.append((match.start(), label))
boundaries.sort(key=lambda x: x[0])
# Extract content between boundaries
for i, (start, label) in enumerate(boundaries):
end = boundaries[i + 1][0] if i + 1 < len(boundaries) else len(text)
content = text[start:end].strip()
# Rough token estimate: ~4 characters per token for English
token_est = len(content) // 4
sections.append(DocumentSection(
title=label,
content=content,
token_estimate=token_est
))
return sections
def chunk_section(section: DocumentSection, max_tokens: int = 12000,
overlap_tokens: int = 500) -> list[str]:
"""Split a large section into overlapping chunks.
Uses paragraph boundaries to avoid splitting mid-sentence.
Overlap ensures context continuity between chunks.
"""
if section.token_estimate <= max_tokens:
return [section.content]
paragraphs = section.content.split("\n\n")
chunks = []
current_chunk = []
current_tokens = 0
for para in paragraphs:
para_tokens = len(para) // 4
if current_tokens + para_tokens > max_tokens and current_chunk:
chunks.append("\n\n".join(current_chunk))
# Keep last few paragraphs for overlap
overlap_text = ""
overlap_paras = []
for p in reversed(current_chunk):
if len(overlap_text) // 4 < overlap_tokens:
overlap_paras.insert(0, p)
overlap_text = "\n\n".join(overlap_paras)
else:
break
current_chunk = overlap_paras
current_tokens = len(overlap_text) // 4
current_chunk.append(para)
current_tokens += para_tokens
if current_chunk:
chunks.append("\n\n".join(current_chunk))
return chunks
The key sections for financial analysis are typically Item 1A (Risk Factors), Item 7 (MD&A), and Item 8 (Financial Statements). When summarizing a filing, prioritize these sections.
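To tie the pieces together, here is a short usage sketch that parses a downloaded filing, keeps only these high-value sections, and chunks them; it assumes the `filings` list from the retrieval example above:

```python
# Parse one filing and chunk its priority sections
html_text = download_filing_text(filings[0])
sections = extract_sections_from_10k(html_text)

priority = {"Risk Factors", "MD&A", "Financial Statements"}
for section in sections:
    if section.title in priority:
        chunks = chunk_section(section, max_tokens=12000)
        print(f"{section.title}: ~{section.token_estimate:,} tokens -> {len(chunks)} chunk(s)")
```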
Prompt Engineering for Financial Summarization
Structured Extraction Prompts
Generic summarization prompts produce vague outputs. Financial summarization requires structured extraction that targets specific metrics and categories:
import json
from typing import Optional
def build_extraction_prompt(section_text: str, section_name: str,
company: str, filing_date: str) -> str:
"""Build a structured extraction prompt for financial document analysis."""
return f"""You are a senior financial analyst reviewing {company}'s SEC filing
dated {filing_date}. Analyze the following {section_name} section and extract
structured information.
IMPORTANT RULES:
- Only extract information explicitly stated in the text
- If a metric is not mentioned, use null — NEVER estimate or infer
- All dollar amounts must include the unit (millions, billions)
- Flag any forward-looking statements with [FORWARD-LOOKING]
Extract the following as JSON:
{{
"revenue": {{
"reported": "string or null",
"yoy_change": "string or null",
"guidance": "string or null [FORWARD-LOOKING]"
}},
"earnings": {{
"eps_reported": "string or null",
"eps_estimate_beat": "string or null",
"net_income": "string or null"
}},
"key_metrics": [
{{"metric": "name", "value": "value", "context": "brief explanation"}}
],
"risk_factors": [
{{"risk": "description", "severity": "high/medium/low", "new": true/false}}
],
"management_outlook": {{
"tone": "bullish/neutral/bearish",
"key_themes": ["theme1", "theme2"],
"guidance_changes": "string or null"
}},
"notable_quotes": [
{{"quote": "exact quote from text", "speaker": "name/role if known", "significance": "why it matters"}}
]
}}
DOCUMENT SECTION:
{section_text}
"""
def build_summarization_prompt(section_text: str, company: str) -> str:
"""Build a summarization prompt optimized for financial content."""
return f"""Summarize the following section from {company}'s SEC filing.
Your summary must:
1. Lead with the single most important finding
2. Include all quantitative data points (exact numbers, percentages)
3. Highlight any changes from prior periods
4. Note any new risk factors or discontinued items
5. Keep the summary under 300 words
6. Use bullet points for clarity
7. End with an "Analyst Note" — one sentence on what this means for investors
Do NOT add information that is not in the source text.
SOURCE TEXT:
{section_text}
"""
Multi-Pass Summarization
For documents that span multiple chunks, a single-pass approach loses context. Instead, use a hierarchical summarization strategy:
def hierarchical_summarize(chunks: list[str], company: str,
llm_call: callable) -> str:
"""Summarize a long document using hierarchical multi-pass approach.
Pass 1: Summarize each chunk independently
Pass 2: Synthesize chunk summaries into a coherent final summary
"""
# Pass 1: Individual chunk summaries
chunk_summaries = []
for i, chunk in enumerate(chunks):
prompt = f"""Summarize chunk {i+1}/{len(chunks)} of {company}'s filing.
Focus on: quantitative data, key changes, forward-looking statements.
Keep to 200 words. Preserve all numbers exactly.
CHUNK:
{chunk}"""
summary = llm_call(prompt)
chunk_summaries.append(summary)
# Pass 2: Synthesis
combined = "\n\n---\n\n".join(
[f"**Chunk {i+1} Summary:**\n{s}" for i, s in enumerate(chunk_summaries)]
)
synthesis_prompt = f"""You are given {len(chunks)} partial summaries of
{company}'s SEC filing. Synthesize these into ONE coherent executive summary.
Rules:
- Eliminate redundancy across chunks
- Maintain all unique quantitative data points
- Structure: Overview → Financial Highlights → Risk Assessment → Outlook
- Maximum 500 words
- If chunk summaries contain contradictory information, flag it explicitly
PARTIAL SUMMARIES:
{combined}"""
return llm_call(synthesis_prompt)
The quality of hierarchical summarization can be measured by the information retention rate:

$$\mathrm{IRR} = \frac{|F_{\text{summary}} \cap F_{\text{source}}|}{|F_{\text{source}}|}$$

where $F_{\text{summary}}$ is the set of factual claims in the summary and $F_{\text{source}}$ is the set of factual claims in the source document. A good financial summarizer should achieve an IRR close to 1 for key metrics.
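As a toy illustration of the metric, facts here are treated as normalized strings; a production system would match claims with an NLI model rather than exact set membership:

```python
def information_retention_rate(summary_facts: set[str], source_facts: set[str]) -> float:
    """Fraction of source facts preserved in the summary."""
    if not source_facts:
        return 0.0
    return len(summary_facts & source_facts) / len(source_facts)

source = {"revenue $383B", "iPhone -2% YoY", "Services +14% YoY", "gross margin 46.6%"}
summary = {"revenue $383B", "Services +14% YoY", "gross margin 46.6%"}
print(information_retention_rate(summary, source))  # 0.75
```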
Fine-Tuned Models vs. Zero-Shot LLMs
Comparison Framework
There are two main approaches to financial document summarization:
| Aspect | Fine-Tuned Models | Zero-Shot LLMs (GPT-4, Claude) |
|---|---|---|
| Setup Cost | High (data labeling, training) | Low (prompt engineering only) |
| Domain Accuracy | Very high for trained patterns | Good, but may miss domain nuances |
| Hallucination Risk | Lower on trained distributions | Higher, especially for numbers |
| Flexibility | Limited to trained task format | Highly flexible output formats |
| Latency | Fast (smaller models) | Slower (large model inference) |
| Cost per Query | Low (self-hosted) | High (API pricing) |
| Maintenance | Retraining needed for new patterns | Prompt updates only |
FinBERT and Domain-Specific Models
As we explored in Episode 1, FinBERT excels at sentiment classification but is not designed for long-form summarization. For summarization, consider:
- FinBART: Fine-tuned BART for financial summarization
- SEC-BERT: Pre-trained on SEC filings for better domain understanding
- LLaMA-Finance: Open-source LLM fine-tuned on financial corpora
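Whichever checkpoint you pick, loading and running it follows the standard Hugging Face pattern. A minimal sketch using `facebook/bart-large-cnn` as a stand-in checkpoint (substitute a finance-tuned model once you have chosen one); it assumes the `sections` list from the chunking example:

```python
from transformers import pipeline

# Stand-in checkpoint; swap in a finance-tuned summarizer if available
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

chunk = sections[0].content[:3000]  # BART accepts ~1024 tokens, so pass one small chunk
result = summarizer(chunk, max_length=150, min_length=50, do_sample=False)
print(result[0]["summary_text"])
```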
Zero-Shot Evaluation
import time
from dataclasses import dataclass
@dataclass
class EvalResult:
"""Evaluation result for a single model on a summarization task."""
model: str
factual_accuracy: float # % of facts correctly stated
completeness: float # % of key metrics captured
hallucination_count: int # Number of fabricated claims
latency_seconds: float
cost_usd: float
def evaluate_summarization(source_text: str, summary: str,
ground_truth_facts: list[str]) -> dict:
"""Evaluate a financial summary against ground truth facts.
This is a simplified evaluation — production systems should use
human evaluators or NLI-based fact verification.
"""
facts_found = 0
facts_missing = []
for fact in ground_truth_facts:
        # Require every keyword of the fact to appear; `any` would count a fact
        # as found when a single common word matched. Replace with an NLI model
        # for production.
        if all(keyword in summary.lower() for keyword in fact.lower().split()):
facts_found += 1
else:
facts_missing.append(fact)
completeness = facts_found / len(ground_truth_facts) if ground_truth_facts else 0
return {
"completeness": completeness,
"facts_found": facts_found,
"facts_missing": facts_missing,
"total_facts": len(ground_truth_facts)
}
# Example ground truth for Apple 10-K
ground_truth = [
"total net revenue $383 billion",
"iPhone revenue declined 2%",
"Services segment grew 14%",
"gross margin 46.6%",
"share repurchase $77 billion",
"Greater China revenue risk",
"AI and machine learning investment"
]
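Running the evaluator against a candidate summary then looks like this (the summary text is a made-up stand-in):

```python
# source_text is unused by this simplified evaluator, so an empty string is fine
candidate_summary = (
    "Apple reported total net revenue of $383 billion. The Services segment grew 14%, "
    "and gross margin reached 46.6%."
)
result = evaluate_summarization("", candidate_summary, ground_truth)
print(f"Completeness: {result['completeness']:.0%}")
for fact in result["facts_missing"]:
    print(f"  MISSING: {fact}")
```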
Practical finding: In our testing, Claude and GPT-4 achieve 85-92% completeness on key metric extraction from 10-K filings, but hallucination rates climb to 8-15% when the model is asked to provide specific dollar amounts not prominently featured in the text. Always verify extracted numbers against source documents.
Building an Automated Pipeline
Let’s combine everything into an end-to-end pipeline that monitors for new filings, processes them, and sends alerts:
import json
import hashlib
import logging
from datetime import datetime, timedelta
from pathlib import Path
from typing import Optional
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("filing_pipeline")
class FilingPipeline:
"""Automated pipeline: filing detection -> extraction -> summarization -> alert."""
def __init__(self, watchlist: list[str], llm_call: callable,
alert_callback: callable, cache_dir: str = "./cache"):
"""
Args:
watchlist: List of ticker symbols to monitor
llm_call: Function that takes a prompt string and returns LLM response
alert_callback: Function to send alerts (Slack, email, etc.)
cache_dir: Directory to cache processed filings
"""
self.watchlist = watchlist
self.llm_call = llm_call
self.alert_callback = alert_callback
self.cache_dir = Path(cache_dir)
self.cache_dir.mkdir(exist_ok=True)
self.processed_filings: set[str] = self._load_processed()
def _load_processed(self) -> set[str]:
"""Load set of already-processed filing accession numbers."""
processed_file = self.cache_dir / "processed.json"
if processed_file.exists():
return set(json.loads(processed_file.read_text()))
return set()
def _save_processed(self):
"""Persist processed filings to disk."""
processed_file = self.cache_dir / "processed.json"
processed_file.write_text(json.dumps(list(self.processed_filings)))
def _filing_hash(self, filing: SECFiling) -> str:
"""Generate unique hash for a filing."""
return hashlib.sha256(
f"{filing.cik}:{filing.accession_number}".encode()
).hexdigest()[:16]
def check_new_filings(self) -> list[SECFiling]:
"""Check all watchlist tickers for new filings."""
new_filings = []
for ticker in self.watchlist:
try:
cik = get_cik(ticker)
# Check both 10-K and 10-Q
for filing_type in ["10-K", "10-Q"]:
filings = get_recent_filings(cik, filing_type, count=3)
for filing in filings:
fhash = self._filing_hash(filing)
if fhash not in self.processed_filings:
new_filings.append(filing)
except Exception as e:
logger.error(f"Error checking {ticker}: {e}")
return new_filings
def process_filing(self, filing: SECFiling) -> dict:
"""Full processing pipeline for a single filing."""
logger.info(f"Processing: {filing.company} {filing.filing_type} ({filing.date})")
# Step 1: Download
raw_text = download_filing_text(filing)
logger.info(f"Downloaded {len(raw_text)} characters")
# Step 2: Extract sections
sections = extract_sections_from_10k(raw_text)
logger.info(f"Extracted {len(sections)} sections")
# Step 3: Prioritize key sections
priority_sections = ["Risk Factors", "MD&A", "Financial Statements"]
key_sections = [s for s in sections if s.title in priority_sections]
# Step 4: Chunk and summarize each key section
section_results = {}
for section in key_sections:
chunks = chunk_section(section, max_tokens=12000)
if len(chunks) == 1:
# Single chunk — direct extraction
prompt = build_extraction_prompt(
chunks[0], section.title, filing.company, filing.date
)
result = self.llm_call(prompt)
else:
# Multiple chunks — hierarchical summarization
result = hierarchical_summarize(
chunks, filing.company, self.llm_call
)
section_results[section.title] = result
# Step 5: Generate executive summary
exec_prompt = f"""Based on the following section analyses of {filing.company}'s
{filing.filing_type} filing dated {filing.date}, generate a concise executive summary
for investment professionals.
Format:
## {filing.company} {filing.filing_type} Summary ({filing.date})
### Key Highlights
- (3-5 bullet points)
### Financial Metrics
- (all extracted numbers)
### Risk Assessment
- (top 3 risks with severity)
### Analyst Take
- (2-3 sentences, actionable)
SECTION ANALYSES:
{json.dumps(section_results, indent=2, default=str)}"""
executive_summary = self.llm_call(exec_prompt)
# Step 6: Cache result
fhash = self._filing_hash(filing)
result = {
"filing": {
"company": filing.company,
"type": filing.filing_type,
"date": filing.date,
"accession": filing.accession_number
},
"sections": section_results,
"executive_summary": executive_summary,
"processed_at": datetime.now().isoformat()
}
cache_file = self.cache_dir / f"{fhash}.json"
cache_file.write_text(json.dumps(result, indent=2, default=str))
self.processed_filings.add(fhash)
self._save_processed()
return result
def run(self):
"""Main pipeline execution loop."""
new_filings = self.check_new_filings()
logger.info(f"Found {len(new_filings)} new filings")
for filing in new_filings:
try:
result = self.process_filing(filing)
# Send alert
self.alert_callback(
title=f"New {filing.filing_type}: {filing.company}",
summary=result["executive_summary"],
metadata=result["filing"]
)
logger.info(f"Successfully processed: {filing.company}")
except Exception as e:
logger.error(f"Failed to process {filing.company}: {e}")
# Example usage
def slack_alert(title: str, summary: str, metadata: dict):
"""Send alert to Slack channel (placeholder implementation)."""
print(f"\n{'='*60}")
print(f"ALERT: {title}")
print(f"Date: {metadata['date']}")
print(f"{'='*60}")
print(summary)
def mock_llm_call(prompt: str) -> str:
"""Replace with actual LLM API call."""
# In production, use OpenAI, Anthropic, or local model
return "[LLM response would appear here]"
# Initialize and run
pipeline = FilingPipeline(
watchlist=["AAPL", "MSFT", "GOOGL", "AMZN", "NVDA"],
llm_call=mock_llm_call,
alert_callback=slack_alert
)
pipeline.run()
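In production you would schedule `run()` rather than invoke it once. A minimal polling-loop sketch (the six-hour interval is an arbitrary choice):

```python
import time

POLL_INTERVAL_SECONDS = 6 * 60 * 60  # check for new filings every six hours

def run_forever(pipeline: FilingPipeline):
    """Poll EDGAR on a fixed interval; an error in one cycle doesn't kill the loop."""
    while True:
        try:
            pipeline.run()
        except Exception as e:
            logger.error(f"Pipeline cycle failed: {e}")
        time.sleep(POLL_INTERVAL_SECONDS)
```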
Pipeline Architecture Diagram
┌─────────────────┐ ┌──────────────────┐ ┌───────────────────┐
│ SEC EDGAR API │────▶│ Filing Detector │────▶│ Section Parser │
│ (10-K / 10-Q) │ │ (new filings) │ │ (BeautifulSoup) │
└─────────────────┘ └──────────────────┘ └───────────────────┘
│
▼
┌─────────────────┐ ┌──────────────────┐ ┌───────────────────┐
│ Alert System │◀────│ Summary Engine │◀────│ Chunking Engine │
│ (Slack/Email) │ │ (Hierarchical) │ │ (Section-aware) │
└─────────────────┘ └──────────────────┘ └───────────────────┘
│
▼
┌──────────────────┐
│ Cache / Storage │
│ (JSON on disk) │
└──────────────────┘
Regulatory Considerations and Hallucination Risks
The Hallucination Problem in Financial NLP
Hallucination in financial contexts is not just an inconvenience — it can have legal and financial consequences. The primary types of hallucinations in financial summarization:
| Type | Example | Risk Level |
|---|---|---|
| Numeric Fabrication | Inventing a revenue figure not in the filing | Critical |
| Attribution Error | Assigning a quote to the wrong executive | High |
| Temporal Confusion | Mixing current and prior period numbers | High |
| Causal Invention | Creating cause-effect relationships not stated | Medium |
| Omission Bias | Systematically omitting negative information | Medium |
Mitigation Strategies
def verify_numeric_claims(summary: str, source_text: str) -> list[dict]:
"""Cross-reference numbers in the summary against the source document.
Extracts all numeric values from the summary and checks if they
appear in the source text. Flags unverified numbers.
"""
import re
# Extract numbers with context from summary
number_pattern = r'[\$]?\d+[\.,]?\d*\s*(?:billion|million|percent|%|bps)'
summary_numbers = re.findall(number_pattern, summary, re.IGNORECASE)
verification_results = []
for num_str in summary_numbers:
# Normalize the number for matching
normalized = num_str.strip().lower()
# Check if this number appears in the source
found = normalized in source_text.lower()
# Also check common reformattings
if not found:
# Try without dollar sign, different spacing, etc.
variants = [
normalized.replace("$", ""),
normalized.replace(",", ""),
normalized.replace("billion", "B"),
normalized.replace("million", "M"),
]
found = any(v in source_text.lower() for v in variants)
verification_results.append({
"claim": num_str,
"verified": found,
"status": "VERIFIED" if found else "UNVERIFIED — REVIEW REQUIRED"
})
return verification_results
def add_confidence_scores(extraction_result: dict) -> dict:
"""Add confidence scores to extracted data based on source grounding.
Confidence scoring:
- HIGH: Exact number found in source text
- MEDIUM: Derived/calculated from source numbers
- LOW: Not directly verifiable in source
"""
# In production, use NLI (Natural Language Inference) models
# such as cross-encoder/nli-deberta-v3-base for fact verification
for key, value in extraction_result.items():
if isinstance(value, dict):
extraction_result[key]["confidence"] = "REQUIRES_VERIFICATION"
return extraction_result
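In practice, you run the numeric verifier on every generated summary and flag anything unverified before it reaches an alert. A short usage sketch with stand-in text:

```python
summary = "Revenue grew to $383 billion, with gross margin at 46.6%."
source = "Total net sales were $383 billion ... Gross margin was 46.6% ..."

for check in verify_numeric_claims(summary, source):
    print(f"{check['status']}: {check['claim']}")
```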
Regulatory Guardrails
SEC Regulation FD (Fair Disclosure) requires that when a company discloses material nonpublic information to select parties, it must disclose that information publicly to all investors at the same time. Your summarization system must only process publicly available filings, never internal documents or pre-release information.
Key regulatory considerations:
- Not Investment Advice: Always include disclaimers that LLM-generated summaries are not investment advice
- Source Attribution: Every claim in a summary should be traceable to a specific section of the original filing
- Timeliness: Summaries of old filings should be clearly dated to prevent stale information from influencing decisions
- Audit Trail: Maintain logs of all LLM inputs and outputs for compliance review
def add_compliance_wrapper(summary: str, filing: SECFiling) -> str:
"""Wrap a generated summary with compliance disclaimers and metadata."""
disclaimer = (
"DISCLAIMER: This summary was generated by an AI system and has not been "
"reviewed by a licensed financial analyst. It is provided for informational "
"purposes only and does not constitute investment advice. Always refer to "
"the original SEC filing for authoritative information."
)
metadata = (
f"Source: {filing.filing_type} filed {filing.date} | "
f"Accession: {filing.accession_number} | "
f"Generated: {datetime.now().isoformat()}"
)
return f"{disclaimer}\n\n{metadata}\n\n---\n\n{summary}"
Working with the Kaggle SEC EDGAR Dataset
The SEC EDGAR Database dataset on Kaggle provides pre-downloaded 10-K and 10-Q filings with structured metadata. Here is how to use it effectively:
import pandas as pd
from pathlib import Path
# After downloading from Kaggle
dataset_path = Path("./sec-edgar-dataset/")
# Load the metadata index
index_df = pd.read_csv(dataset_path / "full_index.csv")
print(f"Total filings: {len(index_df):,}")
print(f"Filing types: {index_df['form_type'].value_counts().head()}")
# Filter for recent 10-K filings from major companies
recent_10k = index_df[
(index_df["form_type"] == "10-K") &
(index_df["date_filed"] >= "2023-01-01")
].copy()
print(f"Recent 10-K filings: {len(recent_10k):,}")
# Load a specific filing
sample_filing = recent_10k.iloc[0]
filing_text = (dataset_path / sample_filing["filename"]).read_text(
encoding="utf-8", errors="ignore"
)
# Process with our pipeline
sections = extract_sections_from_10k(filing_text)
for section in sections:
print(f"{section.title}: ~{section.token_estimate:,} tokens")
This dataset is particularly useful for benchmarking your summarization pipeline against a large corpus of filings without hitting SEC rate limits during development.
Conclusion
Automating earnings call and SEC filing summarization with LLMs is one of the most high-impact applications of financial NLP. In this episode, we built a complete pipeline — from retrieving filings through the EDGAR API, to section-aware chunking, structured prompt engineering, hierarchical summarization, and automated alerting.
The key takeaways from this series finale:
- Section-aware chunking dramatically outperforms naive fixed-size splitting for financial documents
- Structured extraction prompts with explicit JSON schemas produce far more reliable outputs than open-ended summarization requests
- Hierarchical multi-pass summarization preserves information across long documents that exceed context windows
- Hallucination verification is non-negotiable in financial applications — always cross-reference extracted numbers against source text
- Regulatory compliance must be built into the pipeline from day one, not bolted on later
Looking back at the full series, we have covered the spectrum of financial text mining: from sentiment analysis with FinBERT, through news-volatility mapping, central bank speech decoding, and social media signal extraction, to this final automated filing summarization pipeline. Each technique addresses a different data source and analytical need, but they share a common thread: financial text is uniquely challenging because the cost of errors is measured in dollars, not just accuracy points.
The future of financial NLP lies in combining these approaches — using FinBERT sentiment as a signal layer, LLM-based summarization for depth, and cross-source verification for reliability. Build carefully, verify relentlessly, and always keep a human in the loop for critical decisions.