Part 4: Extracting Alpha Signals from Social Media (Twitter/X)

Updated Feb 6, 2026

Introduction

In the previous episodes of this series, we explored sentiment analysis of financial news with FinBERT, mapped market volatility to global headlines, and decoded central bank speeches using NLP. Now we turn our attention to one of the most challenging yet potentially rewarding data sources in financial text mining: social media.

Social media platforms, particularly Twitter/X, have become vital information channels for financial markets. Retail traders, institutional investors, influencers, and even CEOs use these platforms to share opinions, rumors, and analysis. For cryptocurrency markets especially, social sentiment can drive significant price movements within minutes.

In this episode, we’ll build a practical system to extract alpha signals from Bitcoin-related tweets using the Bitcoin Tweets Sentiment Dataset from Kaggle. We’ll cover data cleaning challenges unique to social media, implement real-time sentiment pipelines, compare different sentiment analysis approaches, and backtest a sentiment-based trading strategy.

The Bitcoin Tweets Sentiment Dataset

The Bitcoin Tweets Sentiment Dataset available on Kaggle contains thousands of tweets mentioning Bitcoin, labeled with sentiment scores. This dataset provides an excellent starting point for exploring social media-based alpha generation.

import pandas as pd
import numpy as np
import re
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Bitcoin tweets dataset
# Download from: https://www.kaggle.com/datasets/kaushiksuresh147/bitcoin-tweets
df = pd.read_csv('Bitcoin_tweets.csv')

# Examine the structure
print(df.head())
print(f"\nDataset shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nDate range: {df['date'].min()} to {df['date'].max()}")

Typically, this dataset includes tweet text, timestamps, user information, and engagement metrics (likes, retweets). Understanding the temporal distribution and quality of this data is crucial before building trading signals.
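Before computing any signals, a quick daily-volume profile exposes gaps and collection artifacts. The sketch below runs on a small synthetic frame standing in for the Kaggle dataframe (only the `date` column name is taken from the dataset):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Kaggle dataframe (same 'date' column)
rng = np.random.default_rng(42)
stamps = pd.date_range('2021-02-05', periods=500, freq='h')
demo = pd.DataFrame({'date': rng.choice(stamps, size=2000)})

# Daily tweet volume: gaps or near-empty days flag collection problems
demo['timestamp'] = pd.to_datetime(demo['date'])
daily_volume = demo.set_index('timestamp').resample('D').size()

print(f"Days covered: {len(daily_volume)}")
print(f"Median tweets/day: {daily_volume.median():.0f}")
print(f"Days below 25% of median volume: {(daily_volume < 0.25 * daily_volume.median()).sum()}")
```

Unusually quiet days often mean the collector was down rather than that the market went silent, so they deserve scrutiny before feeding sentiment into a strategy.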

Cleaning Social Media Data: The Noise Challenge

Social media text presents unique preprocessing challenges compared to formal financial news. We must handle URLs, mentions, hashtags, emojis, slang, sarcasm, and potential bot activity.

Advanced Text Cleaning Pipeline

import emoji
from textblob import TextBlob
import langdetect

class TweetCleaner:
    def __init__(self):
        # Common crypto slang mappings
        self.slang_dict = {
            'hodl': 'hold',
            'btfd': 'buy the dip',
            'fomo': 'fear of missing out',
            'fud': 'fear uncertainty doubt',
            'rekt': 'wrecked',
            'moon': 'price increase',
            'lambo': 'wealthy',
            'whale': 'large investor',
            'shill': 'promote'
        }

    def remove_noise(self, text):
        """Remove URLs, mentions, and special characters"""
        if pd.isna(text):
            return ""

        # Remove URLs
        text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)

        # Remove mentions but keep context
        text = re.sub(r'@\w+', '', text)

        # Keep hashtags but remove the # symbol (semantic value)
        text = re.sub(r'#(\w+)', r'\1', text)

        # Remove extra whitespace
        text = re.sub(r'\s+', ' ', text).strip()

        return text

    def handle_emojis(self, text):
        """Convert emojis to text for sentiment analysis"""
        return emoji.demojize(text, delimiters=("", ""))

    def normalize_slang(self, text):
        """Replace crypto slang with standard terms"""
        words = text.lower().split()
        normalized = [self.slang_dict.get(word, word) for word in words]
        return ' '.join(normalized)

    def detect_language(self, text):
        """Filter non-English tweets"""
        try:
            return langdetect.detect(text) == 'en'
        except Exception:  # langdetect raises on empty/undecidable text
            return False

    def clean_tweet(self, text):
        """Full cleaning pipeline"""
        text = self.remove_noise(text)
        text = self.handle_emojis(text)
        text = self.normalize_slang(text)
        return text

# Apply cleaning
cleaner = TweetCleaner()
df['cleaned_text'] = df['text'].apply(cleaner.clean_tweet)
df['is_english'] = df['cleaned_text'].apply(cleaner.detect_language)

# Filter English tweets only
df_clean = df[df['is_english'] & (df['cleaned_text'].str.len() > 10)].copy()
print(f"Tweets after cleaning: {len(df_clean)} ({len(df_clean)/len(df)*100:.1f}%)")

Bot Detection and Quality Filtering

Not all social media accounts are equal. Bots, spam accounts, and low-quality posters can introduce noise into our signals.

class BotDetector:
    def __init__(self):
        self.bot_keywords = ['bot', 'automated', 'crypto bot', 'trading bot']

    def calculate_bot_score(self, row):
        """Heuristic bot detection score"""
        score = 0

        # Suspicious username patterns
        if 'user_name' in row:
            username = str(row['user_name']).lower()
            if any(kw in username for kw in self.bot_keywords):
                score += 3
            if re.search(r'\d{4,}', username):  # Many numbers
                score += 2

        # Posting volume (if available; the threshold is dataset-dependent)
        if 'user_tweet_count' in row and row['user_tweet_count'] > 50:
            score += 2

        # Repetitive content: excessive hashtags in the raw text
        if 'text' in row:
            hashtag_count = str(row['text']).count('#')
            if hashtag_count > 5:
                score += 2

        return score

    def is_likely_bot(self, row, threshold=5):
        return self.calculate_bot_score(row) >= threshold

bot_detector = BotDetector()
df_clean['bot_score'] = df_clean.apply(bot_detector.calculate_bot_score, axis=1)
df_clean = df_clean[df_clean['bot_score'] < 5].copy()

print(f"Tweets after bot filtering: {len(df_clean)}")

Sentiment Analysis: VADER vs Transformers

For social media text, we need sentiment models that can handle informal language, slang, and abbreviations. Let’s compare two approaches.

VADER: Lexicon-Based Sentiment

VADER (Valence Aware Dictionary and sEntiment Reasoner) was specifically designed for social media text and performs well on short, informal content.

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

vader_analyzer = SentimentIntensityAnalyzer()

def get_vader_sentiment(text):
    """Get VADER compound sentiment score [-1, 1]"""
    scores = vader_analyzer.polarity_scores(text)
    return scores['compound']

df_clean['vader_sentiment'] = df_clean['cleaned_text'].apply(get_vader_sentiment)

# Categorize sentiment
def categorize_sentiment(score):
    if score >= 0.05:
        return 'positive'
    elif score <= -0.05:
        return 'negative'
    else:
        return 'neutral'

df_clean['vader_category'] = df_clean['vader_sentiment'].apply(categorize_sentiment)

print(df_clean['vader_category'].value_counts())
print(f"\nMean VADER sentiment: {df_clean['vader_sentiment'].mean():.3f}")

Transformer-Based Sentiment: Twitter-RoBERTa

For comparison, let’s use a transformer model fine-tuned on Twitter data.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from scipy.special import softmax
import torch

class TransformerSentimentAnalyzer:
    def __init__(self, model_name='cardiffnlp/twitter-roberta-base-sentiment-latest'):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.model.eval()

    def analyze(self, text, max_length=128):
        """Get sentiment score using transformer model"""
        # Tokenize and truncate
        encoded = self.tokenizer(
            text, 
            truncation=True, 
            max_length=max_length, 
            return_tensors='pt'
        )

        with torch.no_grad():
            output = self.model(**encoded)

        # Get probabilities
        scores = output.logits[0].numpy()
        probs = softmax(scores)

        # Convert to compound score [-1, 1]
        # Labels: 0=negative, 1=neutral, 2=positive
        compound = (probs[2] - probs[0])

        return compound

# Apply transformer sentiment (sample for speed)
transformer_analyzer = TransformerSentimentAnalyzer()

# For demonstration, analyze a sample
sample_size = 1000
df_sample = df_clean.sample(min(sample_size, len(df_clean))).copy()  # .copy() avoids SettingWithCopyWarning
df_sample['transformer_sentiment'] = df_sample['cleaned_text'].apply(
    transformer_analyzer.analyze
)

# Compare VADER vs Transformer
print("\nCorrelation between VADER and Transformer sentiment:")
print(f"{df_sample['vader_sentiment'].corr(df_sample['transformer_sentiment']):.3f}")

Handling Sarcasm and Context

Sarcasm remains one of the hardest challenges in sentiment analysis. Consider this example:

sarcastic_examples = [
    "Bitcoin crashing again! What a surprise! #crypto",
    "Oh great, another dip. Just what I needed today.",
    "Yeah, sure, Bitcoin will hit 100k tomorrow."
]

print("Sarcasm Detection Examples:\n")
for text in sarcastic_examples:
    cleaned = cleaner.clean_tweet(text)
    vader = get_vader_sentiment(cleaned)
    print(f"Text: {text}")
    print(f"VADER: {vader:.3f} ({categorize_sentiment(vader)})")
    print()

VADER often misclassifies sarcastic tweets because it relies on lexical cues. For production systems, consider:

  1. Sarcasm detection models as a preprocessing step
  2. User history analysis (chronic pessimists vs optimists)
  3. Engagement patterns (sarcastic tweets often get different reaction patterns)
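A lightweight first line of defense is a lexical cue filter that flags likely-sarcastic tweets for downweighting before they reach VADER. This is a crude illustrative heuristic (the cue list is invented for this sketch), not a substitute for a trained irony classifier:

```python
import re

# Illustrative sarcasm cues (hand-picked for this sketch; a trained
# irony classifier is far more reliable in production)
SARCASM_CUES = [
    r'\boh,? great\b',
    r'\byeah,? sure\b',
    r'\bjust what i needed\b',
    r'\bwhat a surprise\b',
]

def sarcasm_cue_count(text):
    """Count lexical sarcasm cues; >=1 suggests downweighting the tweet."""
    t = text.lower()
    return sum(bool(re.search(pattern, t)) for pattern in SARCASM_CUES)

print(sarcasm_cue_count("Oh great, another dip. Just what I needed today."))  # 2 cues
print(sarcasm_cue_count("Bitcoin broke resistance on strong volume."))        # 0 cues
```

In practice you would downweight (not discard) flagged tweets, since cue lists produce false positives on genuinely enthusiastic posts.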

Building Alpha Signals from Sentiment

Raw sentiment scores alone rarely provide tradeable alpha. We need to construct sophisticated signals that capture market dynamics.

Signal 1: Sentiment Momentum

Sentiment momentum captures the rate of change in collective opinion.

def calculate_sentiment_momentum(df, window_hours=6):
    """
    Calculate sentiment momentum over rolling windows
    Momentum = current_sentiment - lagged_sentiment (a difference, since
    mean sentiment crosses zero and a ratio-based change would blow up)
    """
    # Ensure datetime format
    df['timestamp'] = pd.to_datetime(df['date'])
    df = df.sort_values('timestamp')

    # Resample to hourly aggregates ('1h'; uppercase '1H' is deprecated in recent pandas)
    hourly = df.set_index('timestamp').resample('1h').agg({
        'vader_sentiment': 'mean',
        'user_followers': 'sum'  # Total reach
    }).ffill()

    # Calculate momentum
    hourly['sentiment_momentum'] = (
        hourly['vader_sentiment'].diff(periods=window_hours)
    )

    # Z-score normalization
    hourly['momentum_zscore'] = (
        (hourly['sentiment_momentum'] - hourly['sentiment_momentum'].mean()) / 
        hourly['sentiment_momentum'].std()
    )

    return hourly

sentiment_momentum = calculate_sentiment_momentum(df_clean)
print(sentiment_momentum[['vader_sentiment', 'sentiment_momentum', 'momentum_zscore']].tail(10))

The momentum z-score formula:

z_t = \frac{m_t - \mu_m}{\sigma_m}

Where:
z_t is the normalized momentum at time t
m_t is the raw sentiment momentum
\mu_m is the mean momentum over the lookback period
\sigma_m is the standard deviation of momentum

Signal 2: Volume-Weighted Sentiment

Not all tweets carry equal weight. Tweets from accounts with more followers or higher engagement should have greater influence.

def calculate_vws(df):
    """
    Volume-Weighted Sentiment (VWS)
    Weight each tweet by follower count and engagement
    """
    # Calculate engagement score
    df['engagement'] = (
        df.get('user_favourites', 0) + 
        df.get('user_retweet_count', 0) * 2  # Retweets count more
    )

    # Weight by followers + engagement
    df['weight'] = df.get('user_followers', 1) * (1 + np.log1p(df['engagement']))

    # Calculate VWS per time window
    df['timestamp'] = pd.to_datetime(df['date'])
    df = df.sort_values('timestamp')

    hourly_vws = df.set_index('timestamp').resample('1h').apply(
        lambda x: np.average(x['vader_sentiment'], weights=x['weight'])
        if len(x) > 0 and x['weight'].sum() > 0 else np.nan
    ).ffill()

    return hourly_vws

vws = calculate_vws(df_clean)
print(f"\nVolume-Weighted Sentiment Stats:")
print(vws.describe())

The Volume-Weighted Sentiment formula:

VWS_t = \frac{\sum_{i=1}^{N_t} w_i \cdot s_i}{\sum_{i=1}^{N_t} w_i}

Where:
VWS_t is the volume-weighted sentiment at time t
N_t is the number of tweets in time window t
w_i is the weight of tweet i (followers × engagement)
s_i is the sentiment score of tweet i

Signal 3: Contrarian Indicator

Extreme sentiment often precedes reversals. When everyone is extremely bullish, markets may be overbought.

def calculate_contrarian_signal(df, extreme_threshold=1.5):
    """
    Generate contrarian signals from extreme sentiment
    Returns -1 (sell) when sentiment too positive, +1 (buy) when too negative
    """
    df['timestamp'] = pd.to_datetime(df['date'])
    hourly = df.set_index('timestamp').resample('1h')['vader_sentiment'].mean()

    # Rolling z-score
    rolling_mean = hourly.rolling(window=24).mean()
    rolling_std = hourly.rolling(window=24).std()
    zscore = (hourly - rolling_mean) / rolling_std

    # Contrarian signal
    signal = pd.Series(0, index=zscore.index)
    signal[zscore > extreme_threshold] = -1  # Too bullish -> sell signal
    signal[zscore < -extreme_threshold] = 1  # Too bearish -> buy signal

    return signal, zscore

contrarian_signal, sentiment_zscore = calculate_contrarian_signal(df_clean)
print(f"\nContrarian signals generated:")
print(contrarian_signal.value_counts())

Building a Real-Time Sentiment Pipeline

For production trading systems, we need real-time sentiment scoring as tweets arrive.

from collections import deque
import time

class RealtimeSentimentPipeline:
    def __init__(self, window_minutes=60, min_tweets=10):
        self.window_minutes = window_minutes
        self.min_tweets = min_tweets
        self.tweet_buffer = deque(maxlen=1000)
        self.cleaner = TweetCleaner()
        self.vader = SentimentIntensityAnalyzer()

    def process_tweet(self, tweet_data):
        """Process incoming tweet and update sentiment metrics"""
        # Clean and score
        cleaned_text = self.cleaner.clean_tweet(tweet_data['text'])
        sentiment = self.vader.polarity_scores(cleaned_text)['compound']

        # Add to buffer with timestamp
        self.tweet_buffer.append({
            'timestamp': datetime.now(),
            'sentiment': sentiment,
            'weight': tweet_data.get('followers', 1)
        })

        return self.get_current_metrics()

    def get_current_metrics(self):
        """Calculate real-time sentiment metrics"""
        now = datetime.now()
        cutoff = now - timedelta(minutes=self.window_minutes)

        # Filter recent tweets
        recent = [t for t in self.tweet_buffer if t['timestamp'] > cutoff]

        if len(recent) < self.min_tweets:
            return None

        # Calculate metrics
        sentiments = [t['sentiment'] for t in recent]
        weights = [t['weight'] for t in recent]

        metrics = {
            'timestamp': now,
            'tweet_count': len(recent),
            'mean_sentiment': np.mean(sentiments),
            'vws': np.average(sentiments, weights=weights),
            'sentiment_std': np.std(sentiments),
            'positive_ratio': sum(1 for s in sentiments if s > 0.05) / len(sentiments)
        }

        return metrics

# Example usage (simulation)
pipeline = RealtimeSentimentPipeline(window_minutes=60)

print("Simulating real-time pipeline:\n")
for n, (_, row) in enumerate(df_clean.head(100).iterrows()):
    tweet_data = {
        'text': row['text'],
        'followers': row.get('user_followers', 1000)
    }

    metrics = pipeline.process_tweet(tweet_data)

    if metrics and n % 20 == 0:  # Print every 20 tweets (positional count, not index label)
        print(f"Processed {metrics['tweet_count']} tweets")
        print(f"Mean sentiment: {metrics['mean_sentiment']:.3f}")
        print(f"VWS: {metrics['vws']:.3f}")
        print(f"Positive ratio: {metrics['positive_ratio']:.2%}\n")

Backtesting a Sentiment-Based Strategy

Let’s build and backtest a simple trading strategy using our sentiment signals.

import yfinance as yf

class SentimentTradingStrategy:
    def __init__(self, initial_capital=10000):
        self.capital = initial_capital
        self.position = 0  # BTC position size
        self.trades = []

    def generate_signals(self, sentiment_df, price_df):
        """
        Generate trading signals from sentiment
        Strategy: Buy when sentiment momentum positive, sell when negative
        """
        # Merge sentiment and price data
        merged = pd.merge_asof(
            price_df.sort_index(),
            sentiment_df.sort_index(),
            left_index=True,
            right_index=True,
            direction='backward'
        )

        # Generate signals
        merged['signal'] = 0
        merged.loc[merged['momentum_zscore'] > 0.5, 'signal'] = 1  # Buy
        merged.loc[merged['momentum_zscore'] < -0.5, 'signal'] = -1  # Sell

        # Price returns (stop-loss/take-profit logic could hook in here)
        merged['returns'] = merged['Close'].pct_change()

        return merged

    def backtest(self, signals_df):
        """
        Execute backtest with transaction costs
        """
        portfolio_value = []
        transaction_cost = 0.001  # 0.1% per trade

        for idx, row in signals_df.iterrows():
            price = row['Close']
            signal = row['signal']

            # Execute trades
            if signal == 1 and self.position == 0:  # Buy
                self.position = self.capital * (1 - transaction_cost) / price
                self.capital = 0
                self.trades.append(('BUY', idx, price))

            elif signal == -1 and self.position > 0:  # Sell
                self.capital = self.position * price * (1 - transaction_cost)
                self.position = 0
                self.trades.append(('SELL', idx, price))

            # Calculate portfolio value
            if self.position > 0:
                value = self.position * price
            else:
                value = self.capital

            portfolio_value.append(value)

        return pd.Series(portfolio_value, index=signals_df.index)

# Download Bitcoin price data
btc_prices = yf.download('BTC-USD', start='2021-01-01', end='2023-12-31')

# Run backtest
strategy = SentimentTradingStrategy(initial_capital=10000)
signals = strategy.generate_signals(sentiment_momentum, btc_prices)
portfolio = strategy.backtest(signals)

# Calculate performance metrics
total_return = (portfolio.iloc[-1] / 10000 - 1) * 100
max_drawdown = ((portfolio.cummax() - portfolio) / portfolio.cummax()).max() * 100
sharpe = (portfolio.pct_change().mean() / portfolio.pct_change().std()) * np.sqrt(365)

print(f"\n=== Backtest Results ===")
print(f"Total Return: {total_return:.2f}%")
print(f"Max Drawdown: {max_drawdown:.2f}%")
print(f"Sharpe Ratio: {sharpe:.2f}")
print(f"Number of Trades: {len(strategy.trades)}")
print(f"\nBuy & Hold Return: {(btc_prices['Close'].iloc[-1] / btc_prices['Close'].iloc[0] - 1) * 100:.2f}%")

Performance Visualization

fig, axes = plt.subplots(3, 1, figsize=(14, 10))

# Plot 1: Portfolio value vs Buy & Hold
axes[0].plot(portfolio.index, portfolio.values, label='Sentiment Strategy', linewidth=2)
buy_hold = 10000 * (btc_prices['Close'] / btc_prices['Close'].iloc[0])
axes[0].plot(buy_hold.index, buy_hold.values, label='Buy & Hold', linewidth=2, alpha=0.7)
axes[0].set_title('Portfolio Value: Sentiment Strategy vs Buy & Hold', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Portfolio Value ($)')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Plot 2: Sentiment momentum with signals
axes[1].plot(sentiment_momentum.index, sentiment_momentum['momentum_zscore'])
axes[1].axhline(y=0.5, color='g', linestyle='--', alpha=0.5, label='Buy threshold')
axes[1].axhline(y=-0.5, color='r', linestyle='--', alpha=0.5, label='Sell threshold')
axes[1].set_title('Sentiment Momentum Z-Score', fontsize=14, fontweight='bold')
axes[1].set_ylabel('Z-Score')
axes[1].legend()
axes[1].grid(alpha=0.3)

# Plot 3: Drawdown
drawdown = (portfolio.cummax() - portfolio) / portfolio.cummax() * 100
axes[2].fill_between(drawdown.index, 0, drawdown.values, color='red', alpha=0.3)
axes[2].set_title('Strategy Drawdown', fontsize=14, fontweight='bold')
axes[2].set_ylabel('Drawdown (%)')
axes[2].set_xlabel('Date')
axes[2].grid(alpha=0.3)

plt.tight_layout()
plt.show()

Critical Challenges and Pitfalls

1. Survivorship Bias

Our dataset only includes tweets we could collect. Deleted tweets, banned accounts, and private accounts create survivorship bias.

def detect_survivorship_bias(df):
    """
    Check for signs of survivorship bias in tweet data
    """
    # Check for gaps in timeline
    df['timestamp'] = pd.to_datetime(df['date'])
    df = df.sort_values('timestamp')

    time_gaps = df['timestamp'].diff()
    large_gaps = time_gaps[time_gaps > pd.Timedelta(hours=6)]

    print(f"Detected {len(large_gaps)} large time gaps (>6 hours)")
    print(f"These may indicate missing data or collection issues\n")

    # Check for sudden drops in tweet volume
    hourly_volume = df.set_index('timestamp').resample('1h').size()
    volume_drops = hourly_volume.pct_change()
    severe_drops = volume_drops[volume_drops < -0.5]

    print(f"Detected {len(severe_drops)} periods with >50% volume drop")

    return large_gaps, severe_drops

gaps, drops = detect_survivorship_bias(df_clean)

Mitigation strategies:
– Use multiple data sources (Twitter API, Reddit, Telegram)
– Track data quality metrics over time
– Be cautious during periods with data gaps
– Validate signals against price data
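The second mitigation, tracking data quality over time, can be as simple as a daily report of volume, user diversity, and duplicate-text share. A sketch (the column names `timestamp`, `user_name`, and `text` are assumed to mirror the dataset):

```python
import pandas as pd

def daily_quality_report(df):
    """Per-day quality metrics worth charting over the collection period."""
    g = df.set_index('timestamp').resample('D')
    return pd.DataFrame({
        'tweet_count': g.size(),
        'unique_users': g['user_name'].nunique(),
        # Share of tweets whose text duplicates another tweet that day
        'dup_text_ratio': g['text'].apply(lambda s: 1 - s.nunique() / max(len(s), 1)),
    })

# Tiny synthetic demo frame
demo = pd.DataFrame({
    'timestamp': pd.to_datetime(['2021-02-01 01:00', '2021-02-01 02:00',
                                 '2021-02-02 03:00', '2021-02-02 04:00']),
    'user_name': ['alice', 'bob', 'alice', 'alice'],
    'text': ['buy btc', 'buy btc', 'hodl strong', 'to the moon'],
})
print(daily_quality_report(demo))
```

Sudden jumps in the duplicate ratio or collapses in unique users are red flags: the signal for those days reflects the collector (or a spam wave), not the market.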

2. Data Snooping Bias

When we optimize parameters on the same dataset we test on, we risk overfitting to historical noise.

from sklearn.model_selection import TimeSeriesSplit

def walk_forward_validation(sentiment_df, price_df, param_grid):
    """
    Walk-forward validation to avoid look-ahead bias
    Train on expanding window, test on next period
    """
    tscv = TimeSeriesSplit(n_splits=5)
    results = []

    for train_idx, test_idx in tscv.split(sentiment_df):
        # Split data
        train_sentiment = sentiment_df.iloc[train_idx]
        test_sentiment = sentiment_df.iloc[test_idx]

        train_prices = price_df.iloc[train_idx]
        test_prices = price_df.iloc[test_idx]

        # Optimize parameters on training set
        best_params = optimize_strategy_params(train_sentiment, train_prices, param_grid)

        # Test on out-of-sample data
        test_strategy = SentimentTradingStrategy()
        test_signals = test_strategy.generate_signals(test_sentiment, test_prices)
        test_portfolio = test_strategy.backtest(test_signals)

        test_return = (test_portfolio.iloc[-1] / 10000 - 1) * 100
        results.append({
            'fold': len(results) + 1,
            'return': test_return,
            'params': best_params
        })

    return pd.DataFrame(results)

def optimize_strategy_params(sentiment_df, price_df, param_grid):
    """
    Grid search for optimal strategy parameters
    """
    best_return = -np.inf
    best_params = None

    for momentum_threshold in param_grid['momentum_threshold']:
        for vws_weight in param_grid['vws_weight']:
            # Test parameter combination -- placeholder only: a full
            # implementation would generate signals with these parameters
            # and backtest them to obtain return_value
            strategy = SentimentTradingStrategy()
            return_value = 0.0  # replace with the backtested return for this combo

            if return_value > best_return:
                best_return = return_value
                best_params = {
                    'momentum_threshold': momentum_threshold,
                    'vws_weight': vws_weight
                }

    return best_params

# Example parameter grid
param_grid = {
    'momentum_threshold': [0.3, 0.5, 0.7, 1.0],
    'vws_weight': [0.3, 0.5, 0.7]
}

print("Running walk-forward validation...")
print("This ensures we don't peek into the future during optimization.\n")

3. Market Impact and Slippage

Our backtest assumes we can trade at observed prices without moving the market. In reality, large orders cause slippage.

def estimate_slippage(trade_size_usd, market_depth_data):
    """
    Estimate slippage based on order book depth
    Simplified model: slippage increases with trade size
    """
    # Typical BTC market depth (example values)
    depth_levels = [
        (10000, 0.0001),   # Up to $10k: 0.01% slippage
        (50000, 0.0005),   # $10k-$50k: 0.05%
        (100000, 0.001),   # $50k-$100k: 0.1%
        (500000, 0.003),   # $100k-$500k: 0.3%
    ]

    for threshold, slippage_rate in depth_levels:
        if trade_size_usd <= threshold:
            return slippage_rate

    return 0.005  # >$500k: 0.5% slippage

# Adjust backtest for realistic slippage
print("Example slippage costs:")
for size in [1000, 10000, 50000, 100000]:
    slippage = estimate_slippage(size, None)
    cost = size * slippage
    print(f"Trade size ${size:,}: {slippage*100:.2f}% slippage (${cost:.2f})")

Advanced Extensions

For production-grade systems, consider these enhancements:

Multi-Source Sentiment Fusion

class MultiSourceSentiment:
    def __init__(self):
        self.sources = {
            'twitter': 0.4,   # Weights based on reliability
            'reddit': 0.3,
            'telegram': 0.2,
            'news': 0.1
        }

    def fuse_sentiments(self, source_sentiments):
        """
        Combine sentiment from multiple sources
        Uses weighted average with confidence scores
        """
        weighted_sum = 0
        total_weight = 0

        for source, sentiment_data in source_sentiments.items():
            if source in self.sources:
                weight = self.sources[source] * sentiment_data['confidence']
                weighted_sum += weight * sentiment_data['score']
                total_weight += weight

        return weighted_sum / total_weight if total_weight > 0 else 0
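A quick worked example of the fusion arithmetic, inlined so the numbers are easy to follow (the scores and confidences are made up):

```python
# Same weighted-average logic as MultiSourceSentiment.fuse_sentiments,
# inlined with made-up inputs to show the arithmetic
source_weights = {'twitter': 0.4, 'reddit': 0.3}
observations = {
    'twitter': {'score': 0.6, 'confidence': 0.9},   # bullish, high confidence
    'reddit':  {'score': -0.2, 'confidence': 0.5},  # mildly bearish, low confidence
}

weighted_sum = sum(source_weights[s] * o['confidence'] * o['score']
                   for s, o in observations.items())
total_weight = sum(source_weights[s] * o['confidence']
                   for s, o in observations.items())

fused = weighted_sum / total_weight
print(f"Fused sentiment: {fused:.3f}")  # (0.216 - 0.030) / 0.51 -> 0.365
```

Note how the low-confidence Reddit reading barely dents the high-confidence Twitter signal: confidence scales each source's static weight.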

Sentiment-Price Causality Testing

from statsmodels.tsa.stattools import grangercausalitytests

def test_granger_causality(sentiment_series, price_returns, max_lag=12):
    """
    Test if sentiment Granger-causes price movements
    """
    # Prepare data
    data = pd.DataFrame({
        'price_returns': price_returns,
        'sentiment': sentiment_series
    }).dropna()

    # Run Granger causality test (the verbose argument is deprecated in
    # recent statsmodels; the returned dict holds per-lag test statistics)
    print("Testing if sentiment Granger-causes price returns:\n")
    results = grangercausalitytests(data[['price_returns', 'sentiment']], max_lag)

    return results

# Example usage
if len(sentiment_momentum) > 50:
    btc_returns = btc_prices['Close'].pct_change()
    aligned_sentiment = sentiment_momentum['vader_sentiment'].reindex(btc_returns.index, method='ffill')

    print("\nGranger Causality Test:")
    print("Null hypothesis: Sentiment does NOT Granger-cause price returns")
    # test_granger_causality(aligned_sentiment, btc_returns)

Conclusion

Extracting alpha signals from social media represents both a significant opportunity and a substantial challenge in modern quantitative finance. In this episode, we’ve covered the complete pipeline from raw tweets to backtested trading strategies.

Key takeaways:

  1. Data quality is paramount: Social media data requires aggressive cleaning to remove bots, spam, and noise. The quality of your signals depends entirely on the quality of your input data.

  2. Sentiment models matter: VADER works well for quick, lexicon-based analysis of informal text, while transformer models like Twitter-RoBERTa provide more nuanced understanding at the cost of computational resources. Choose based on your latency and accuracy requirements.

  3. Sophisticated signal construction: Raw sentiment scores rarely work. Momentum signals, volume-weighted sentiment, and contrarian indicators capture different market dynamics and can be combined for robust alpha generation.

  4. Real-time infrastructure: Production systems need efficient pipelines that process tweets as they arrive, maintain rolling windows of metrics, and integrate seamlessly with trading systems.

  5. Rigorous validation: Survivorship bias, data snooping, and unrealistic assumptions can destroy strategy performance in live trading. Use walk-forward validation, account for transaction costs and slippage, and test causality relationships.

The Bitcoin Tweets Sentiment Dataset from Kaggle provides an excellent sandbox for experimentation, but remember that real-world deployment requires handling streaming data, maintaining infrastructure, and constantly adapting to changing market dynamics.

In the final episode of this series, we’ll explore automating earnings call summarization with large language models, showing how modern AI can extract structured insights from long-form financial audio and text. This will complete our journey through the landscape of financial text mining, from news sentiment to social media signals to corporate communications.

AI-Based Financial Text Mining Series (4/5)
