The core idea of self-supervised learning is to generate supervisory signals directly from the data itself. It excels in scenarios where labeled data is scarce but raw, unlabeled data is abundant. This article focuses on architecture: first clarify the data flow, key modules, and output layers—then revisit the underlying formulas or code.
I will separately examine pretraining tasks and downstream tasks to verify that representations truly transfer—not merely that pretraining metrics look good.
Self-supervised learning trains models on large volumes of unlabeled data without any human-annotated labels. In this article, we explore commonly used model architectures in self-supervised learning and look at how each is applied in a specific task.
Model Architectures for Self-Supervised Learning
In self-supervised learning, models are typically built on standard deep network families such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformers. Below are several widely adopted architectures:
While reading this article, treat the sequence “Model Architectures → Variational Autoencoder (VAE) → Case Study → Contrastive Learning Models” as a verification thread: first identify the object, path, and evidence, then return to the case studies, code, or evaluation metrics for cross-checking.
1. Variational Autoencoder (VAE)
A VAE is a generative model that learns a latent distribution over the input data. It is trained by maximizing the evidence lower bound (ELBO), which balances reconstruction fidelity against a regularization term that keeps the encoder's approximate posterior close to the prior.
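In the usual notation (an assumption here, since the article does not fix symbols), the ELBO for a single input x is written with q_φ(z|x) as the encoder's approximate posterior, p_θ(x|z) as the decoder likelihood, and p(z) as the prior, typically a standard normal:

```latex
\mathcal{L}(\theta, \phi; x)
  = \mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log p_\theta(x \mid z) \right]
  - D_{\mathrm{KL}}\!\left( q_\phi(z \mid x) \,\|\, p(z) \right)
```

The first term rewards accurate reconstruction; the second penalizes posteriors that drift away from the prior.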
In self-supervised learning, VAEs are trained by reconstructing input data—thereby encouraging the model to learn meaningful, compact representations.
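The ELBO objective described above can be turned into a loss function. The following is a minimal sketch, assuming a Bernoulli decoder (sigmoid outputs scored with binary cross-entropy), a standard normal prior, and an encoder that produces a mean and a log-variance for each latent dimension. The name `vae_loss` and the tensor names are illustrative and do not come from the article.

```python
import torch
import torch.nn.functional as F

def vae_loss(recon_x, x, z_mean, z_logvar):
    """Negative ELBO averaged over the batch (assumes Bernoulli decoder, N(0, I) prior)."""
    # Reconstruction term: binary cross-entropy summed over all pixels
    recon = F.binary_cross_entropy(recon_x, x, reduction="sum")
    # Closed-form KL divergence between a diagonal Gaussian and N(0, I)
    kl = -0.5 * torch.sum(1 + z_logvar - z_mean.pow(2) - z_logvar.exp())
    return (recon + kl) / x.size(0)

# Usage with random stand-in tensors (not real data)
torch.manual_seed(0)
x = torch.rand(8, 784)                                 # inputs in [0, 1]
recon_x = torch.rand(8, 784).clamp(1e-6, 1 - 1e-6)     # decoder outputs in (0, 1)
z_mean = torch.zeros(8, 20)
z_logvar = torch.zeros(8, 20)
loss = vae_loss(recon_x, x, z_mean, z_logvar)
```

With a zero mean and zero log-variance the posterior equals the prior, so the KL term vanishes and the loss reduces to the reconstruction term alone.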
Case Study
Suppose we have a collection of unlabeled handwritten digit images. We can train a VAE to generate new digit-like samples; these generated images—or more commonly, the learned latent representations—can then be leveraged for downstream classification tasks.
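import torch
from torch import nn

class VAE(nn.Module):
    def __init__(self, latent_dim=20):
        super().__init__()
        # Shared encoder trunk
        self.encoder = nn.Sequential(
            nn.Linear(784, 400),
            nn.ReLU(),
        )
        # Separate heads for the mean and log-variance of q(z|x)
        self.fc_mean = nn.Linear(400, latent_dim)
        self.fc_logvar = nn.Linear(400, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 400),
            nn.ReLU(),
            nn.Linear(400, 784),
            nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.encoder(x)
        z_mean, z_logvar = self.fc_mean(h), self.fc_logvar(h)
        z = self.reparameterize(z_mean, z_logvar)
        # Return the statistics too; the KL term of the loss needs them
        return self.decoder(z), z_mean, z_logvar

    def reparameterize(self, z_mean, z_logvar):
        std = torch.exp(0.5 * z_logvar)  # log-variance -> standard deviation
        eps = torch.randn_like(std)
        return z_mean + eps * std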
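To check whether learned latent representations actually help a downstream classifier, a common approach is a linear probe: freeze the encoder and train only a linear layer on its features. The sketch below is illustrative only. The encoder is an untrained stand-in with the same layer sizes as above, the data is synthetic, and the names `probe`, `feats`, and `accuracy` are hypothetical; in practice you would load pretrained weights and a real labeled set.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in for a pretrained encoder (hypothetical; load trained weights in practice)
encoder = nn.Sequential(nn.Linear(784, 400), nn.ReLU(), nn.Linear(400, 20))
for p in encoder.parameters():
    p.requires_grad_(False)  # freeze: only the probe is trained
encoder.eval()

# Synthetic stand-in for a small labeled dataset (10 classes)
x = torch.rand(256, 784)
y = torch.randint(0, 10, (256,))

# Features are fixed, so compute them once without tracking gradients
with torch.no_grad():
    feats = encoder(x)

probe = nn.Linear(20, 10)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

losses = []
for _ in range(50):
    optimizer.zero_grad()
    loss = loss_fn(probe(feats), y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

accuracy = (probe(feats).argmax(dim=1) == y).float().mean().item()
```

Because the encoder stays frozen, any accuracy gain on held-out data can be attributed to the representation rather than to fine-tuning. With the random labels used here the number itself is meaningless; the point is the evaluation structure.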
2. Contrastive Learning Models
Contrastive learning trains models by distinguishing positive pairs (e.g., augmented views of the same sample) from negative pairs (views of different samples). SimCLR and MoCo are two prominent contrastive learning frameworks.
The model learns representations by maximizing similarity between positive pairs while minimizing similarity across negative pairs—typically using cosine similarity and a temperature-scaled InfoNCE loss.
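Written out, one common form of this loss for a batch of N positive pairs is given below. The symbols are assumptions chosen for illustration: z_n and z'_n are the embeddings of two views of sample n, sim is cosine similarity, and τ is the temperature.

```latex
\mathcal{L}_{\text{InfoNCE}}
  = -\frac{1}{N} \sum_{n=1}^{N}
    \log \frac{\exp\!\left( \mathrm{sim}(z_n, z'_n) / \tau \right)}
              {\sum_{m=1}^{N} \exp\!\left( \mathrm{sim}(z_n, z'_m) / \tau \right)}
```

Frameworks differ in which negatives enter the denominator, so treat this as one representative variant rather than the exact loss of any particular framework.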
Case Study
For an image classification task, contrastive learning can pretrain feature extractors effectively. Here’s a minimal implementation:
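import torch
import torch.nn.functional as F

def contrastive_loss(z_i, z_j, temperature=0.5):
    """One-directional InfoNCE: row k of z_i should match row k of z_j."""
    batch_size = z_i.size(0)
    # Pairwise cosine similarities, shape (batch_size, batch_size), scaled by temperature
    sim_matrix = F.cosine_similarity(z_i.unsqueeze(1), z_j.unsqueeze(0), dim=-1) / temperature
    # The positive for row k lies on the diagonal, so the target class is k
    labels = torch.arange(batch_size, device=z_i.device)
    # Row-wise cross-entropy over the similarity matrix is the InfoNCE loss
    loss = F.cross_entropy(sim_matrix, labels)
    return loss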
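A quick sanity check for a contrastive loss is to confirm that matched views score lower than unrelated ones. The sketch below uses a symmetric variant of InfoNCE, in which each view serves as the anchor in turn; this is a swapped-in variation, not the function above. The name `info_nce` and the random embeddings are hypothetical stand-ins for real encoder outputs.

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.5):
    """Symmetric InfoNCE over two batches of embeddings with matching row order."""
    # After L2 normalization, the dot product equals cosine similarity
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature
    labels = torch.arange(z_a.size(0), device=z_a.device)
    # Average the loss with view a as anchor and with view b as anchor
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

torch.manual_seed(0)
z = torch.randn(32, 128)

# "Positive" views: the same embeddings with small noise added
aligned = info_nce(z, z + 0.05 * torch.randn_like(z))
# Unrelated embeddings: row k of one batch has no relation to row k of the other
unrelated = info_nce(z, torch.randn(32, 128))
```

If `aligned` does not come out clearly below `unrelated`, the pairing of rows or the label construction is usually the first thing to inspect.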
3. Self-Supervised Transformers
In natural language processing (NLP), the Transformer architecture has become foundational for self-supervised learning. BERT and GPT are both Transformer-based models pretrained with self-supervised objectives: BERT uses masked language modeling (MLM) together with next-sentence prediction, while GPT uses autoregressive next-token prediction. Both objectives derive their targets from the text itself and yield rich contextual representations.
Case Study: BERT
BERT is pretrained by masking random tokens in the input text and predicting them. Because each masked token must be inferred from the words on both its left and its right, the model learns bidirectional contextual representations.
from transformers import BertTokenizer, BertForMaskedLM
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
# Input example
inputs = tokenizer("The cat sat on the [MASK].", return_tensors="pt")
outputs = model(**inputs)
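Here, [MASK] marks the token the model must predict. The call above returns logits over the vocabulary for every position; it does not compute a loss, because no target tokens were supplied. During pretraining, a cross-entropy loss compares the predictions at the masked positions with the original tokens, and backpropagation computes the gradients used to update the model weights. To obtain a loss from this model, pass the target token ids through the labels argument.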
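The masking step and the loss at masked positions can be illustrated without downloading a model. The following is a simplified sketch under stated assumptions: the vocabulary size, the mask token id, and the helper `mask_tokens` are hypothetical, random logits stand in for model output, and every selected token is replaced by the mask token (BERT's actual recipe also leaves some selected tokens unchanged or swaps in random ones). The label value -100 is the default `ignore_index` of PyTorch's cross-entropy.

```python
import torch
import torch.nn.functional as F

def mask_tokens(input_ids, mask_token_id, mask_prob=0.15, generator=None):
    """Return (masked_ids, labels); labels are -100 wherever no token was masked."""
    selected = torch.rand(input_ids.shape, generator=generator) < mask_prob
    labels = input_ids.clone()
    labels[~selected] = -100            # ignored by cross_entropy
    masked_ids = input_ids.clone()
    masked_ids[selected] = mask_token_id
    return masked_ids, labels

g = torch.Generator().manual_seed(0)
vocab_size, mask_token_id = 1000, 999   # hypothetical values
# Random token ids in [0, 999), so none collide with the mask id
input_ids = torch.randint(0, 999, (4, 16), generator=g)
masked_ids, labels = mask_tokens(input_ids, mask_token_id, generator=g)

# Random logits stand in for the model's output of shape (batch, seq_len, vocab)
logits = torch.randn(4, 16, vocab_size, generator=g)
# Only masked positions contribute; positions labeled -100 are skipped
loss = F.cross_entropy(
    logits.reshape(-1, vocab_size), labels.reshape(-1), ignore_index=-100
)
```

Restricting the loss to masked positions is what makes the task self-supervised: the targets are the original tokens, recovered from the unmasked input rather than from human labels.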
When reviewing “Model Architectures for Self-Supervised Learning”, consolidate key concepts, procedural steps, and observable outcomes onto a single page for efficient revision.
When practicing “Model Architectures for Self-Supervised Learning”, explicitly write down the input conditions, processing actions, and observable outcomes together—making future review and debugging straightforward.
Summary
Self-supervised learning leverages purpose-built model architectures to extract rich, transferable features from unlabeled data—offering a powerful alternative where labeling is costly, impractical, or infeasible. In upcoming articles, we’ll dive deeper into real-world adoption patterns and concrete use cases.
After finishing “Model Architectures for Self-Supervised Learning”, reflect on three questions:
- What problem does it solve?
- Which step is most error-prone?
- Can I implement and run a small working example end-to-end?