If you’ve ever watched a motion capture system struggle with a person’s fingers, or seen a segmentation model fail to distinguish teeth from gums, you already understand why human-centric computer vision is hard. Humans are not just another object category: they come with articulated structure, fine surface details, and enormous variation in pose, clothing, lighting, and ethnicity. Getting a single model to understand all of that at once, across arbitrary real-world images, is genuinely difficult.
Meta AI’s research team has introduced Sapiens2, the second generation of its foundation model family for human-centric vision. Trained on a newly curated dataset of 1 billion human images, offered in model sizes from 0.4B to 5B parameters, and designed to operate at native 1K resolution with hierarchical variants supporting 4K, Sapiens2 is a substantial leap over its predecessor across every benchmark the team evaluated.


What Sapiens2 Is Trying to Solve
The original Sapiens model relied primarily on Masked Autoencoder (MAE) pretraining. MAE works by masking a large portion of input image patches, 75% in this case, and training the model to reconstruct the missing pixels. This forces the model to learn spatial details and textures, which is useful for dense prediction tasks like segmentation or depth estimation.
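To make the mechanism concrete, here is a minimal sketch of MAE-style masking and reconstruction on patch tokens, written in plain PyTorch. It assumes a generic ViT-style pipeline; the function names and shapes are illustrative and are not taken from the Sapiens2 code.

```python
# Minimal MAE-style sketch: drop 75% of patch tokens, reconstruct only the
# masked ones. `tokens` is assumed to be [batch, num_patches, dim].
import torch

def random_mask(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random (1 - mask_ratio) subset of patch tokens per image."""
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)   # random score per patch
    ids_shuffle = noise.argsort(dim=1)               # lowest scores are kept
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=tokens.device)
    mask.scatter_(1, ids_keep, 0.0)                  # 1 = masked, 0 = visible
    return visible, mask

def mae_loss(pred_pixels, target_pixels, mask):
    """Pixel reconstruction loss computed only over the masked patches."""
    per_patch = ((pred_pixels - target_pixels) ** 2).mean(dim=-1)  # [B, N]
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)
```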

The problem is that MAE, as a form of masked image modeling (MIM), learns largely through compression. It doesn’t naturally learn high-level semantics. It can tell you what something looks like, but not necessarily what it means in the context of a human body. That’s where contrastive learning (CL) methods like DINO and SimCLR shine: they organize representations semantically by training the model to treat different views of the same image as similar and views of different images as distinct.
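As a concrete illustration of that idea, here is a compact SimCLR-style NT-Xent loss in PyTorch: two augmented views of each image are pulled together while every other image in the batch is pushed away. This is a generic example of contrastive learning, not the specific loss Sapiens2 uses.

```python
# Generic contrastive (NT-Xent / SimCLR-style) loss sketch.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    """z1, z2: [B, D] embeddings of two augmented views of the same B images."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    z = torch.cat([z1, z2], dim=0)                    # [2B, D]
    sim = z @ z.t() / temperature                     # pairwise similarities
    sim.fill_diagonal_(float("-inf"))                 # ignore self-similarity
    B = z1.shape[0]
    # the positive for sample i is its other augmented view (i + B or i - B)
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])
    return F.cross_entropy(sim, targets.to(z.device))
```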
But CL has its own tradeoff. Its aggressive augmentations, such as color jitter and blurring, can strip away appearance cues like skin tone or lighting conditions that are critical for tasks such as albedo estimation (recovering the true color of a surface independent of lighting). This is what the research team calls representation drift.
Sapiens2 addresses this problem directly by combining both objectives: a masked image reconstruction loss (L_MAE) to preserve low-level fidelity, and a global contrastive loss (L_CL) on the [CLS] token using a student-teacher framework based on DINOv3, where the teacher’s parameters are an exponential moving average (EMA) of the student’s. Crucially, color augmentations are not applied to the global views used for the MAE objective, preserving the appearance cues needed for photorealistic tasks. The joint objective is L = L_MAE + λ·L_CL.
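A minimal sketch of how the two objectives can be combined, assuming a DINO-style student/teacher setup with an EMA teacher. The weighting λ, the temperatures, and the function names are illustrative assumptions rather than values from the paper.

```python
# Joint objective sketch: L = L_MAE + lambda * L_CL, with an EMA teacher.
import torch
import torch.nn.functional as F

lam = 1.0  # assumed weighting; the paper's value may differ

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Teacher parameters track an exponential moving average of the student."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1 - momentum)

def contrastive_loss(student_cls, teacher_cls, temp_s=0.1, temp_t=0.04):
    """DINO-style cross-entropy between teacher and student [CLS] distributions.
    (DINO also centers the teacher output; omitted here for brevity.)"""
    t = F.softmax(teacher_cls.detach() / temp_t, dim=-1)
    s = F.log_softmax(student_cls / temp_s, dim=-1)
    return -(t * s).sum(dim=-1).mean()

def joint_loss(mae_recon_loss, student_cls, teacher_cls):
    """Total pretraining loss: reconstruction term plus weighted contrastive term."""
    return mae_recon_loss + lam * contrastive_loss(student_cls, teacher_cls)
```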


The Data: Humans-1B
Getting 1 billion training images right required a multi-stage filtering pipeline. Starting from a web-scale pool of approximately 4 billion images, the Meta team applied bounding box detection, head-pose estimation, aesthetic and realism scoring, CLIP-based feature filtering, and text-overlay detection. The result is a curated corpus where every image contains at least one prominent person with a minimum short-side resolution of 384 pixels.
To ensure diversity, the research team used perceptual hashing and deep-feature nearest-neighbor pruning for deduplication, then clustered visual embeddings and applied selective sampling to balance the dataset across poses, viewpoints, occlusion levels, clothing types, and lighting conditions. No task labels or human-specific priors were injected during pretraining — just images.
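The two curation ideas that are easiest to illustrate are deduplication and cluster-balanced sampling. The sketch below shows the general pattern using scikit-learn k-means; the thresholds, cluster counts, and helper names are assumptions, not Meta’s actual pipeline.

```python
# Sketch of dedup + cluster-balanced sampling over image embeddings.
import numpy as np
from sklearn.cluster import KMeans

def dedup_by_hash(image_hashes):
    """Drop exact perceptual-hash duplicates, keeping the first occurrence.
    A real pipeline would also prune near-duplicates via Hamming-distance
    thresholds and deep-feature nearest neighbors."""
    seen, kept = set(), []
    for idx, h in enumerate(image_hashes):
        if h not in seen:
            seen.add(h)
            kept.append(idx)
    return kept

def balanced_sample(embeddings, n_clusters=1000, per_cluster=1000, seed=0):
    """Cluster visual embeddings and cap how many images each cluster contributes,
    so that over-represented poses/viewpoints do not dominate the corpus."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(embeddings)
    sample = []
    for c in range(n_clusters):
        idxs = np.flatnonzero(labels == c)
        take = min(per_cluster, len(idxs))
        if take > 0:
            sample.extend(rng.choice(idxs, size=take, replace=False).tolist())
    return sample
```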
The Architecture: Scaling to 5B and 4K
Sapiens2 introduces four model sizes: 0.4B, 0.8B, 1B, and 5B parameters, each at native 1K resolution. The 5B model is the highest-FLOPs vision transformer reported to date at 15.722 TFLOPs.
For 4K resolution, the research team adopted a hierarchical windowed attention design. The first K layers apply windowed self-attention locally to capture fine texture and boundaries within spatial windows. A [CLS]-guided pooling step then downsamples the 2D token grid by a spatial stride √ω, and the subsequent L layers apply global self-attention over this reduced sequence. This layout is compatible with MAE-style pretraining because masked tokens can be dropped after the local stage, preventing information from leaking across masked regions — a problem that convolutional backbones typically need masked convolutions to avoid.
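The sketch below shows the general shape of that layout: partitioning the token grid into windows for local attention, then pooling the grid to a shorter sequence before global attention. The [CLS]-guided pooling is simplified here to plain average pooling, and all shapes are illustrative rather than the actual Sapiens2 implementation.

```python
# Hierarchical layout sketch: local windowed attention, then a pooled global stage.
# Assumes H and W are divisible by the window size and the pooling stride.
import torch
import torch.nn as nn

def window_partition(x, H, W, win):
    """[B, H*W, D] -> [B * num_windows, win*win, D] for local self-attention."""
    B, _, D = x.shape
    x = x.view(B, H // win, win, W // win, win, D)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win * win, D)
    return x

def pool_grid(x, H, W, stride):
    """Average-pool the 2D token grid by `stride` to shorten the global sequence
    (a simplification of the [CLS]-guided pooling described above)."""
    B, _, D = x.shape
    x = x.view(B, H, W, D).permute(0, 3, 1, 2)            # [B, D, H, W]
    x = nn.functional.avg_pool2d(x, kernel_size=stride)    # [B, D, H/s, W/s]
    return x.flatten(2).transpose(1, 2)                    # [B, (H/s)*(W/s), D]
```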
The masking strategy itself is also carefully designed: Sapiens2 uses mixed blockwise/patchwise masking (blockwise probability 0.4) at a 75% mask ratio with patch size 16. At 1024×768 resolution (64×48 = 3072 patches), this masks approximately 2304 patches per image, which is enough to create coarse occlusions that regularize MAE while preserving sufficient context for the contrastive objective.
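A small sketch of one plausible reading of that mixed schedule: each sample is masked blockwise with probability 0.4 and patchwise otherwise, always aiming for the 75% budget. The block size and the exact sampling procedure are assumptions for illustration.

```python
# Mixed blockwise/patchwise masking sketch over a 64x48 patch grid.
import torch

def mixed_mask(h=64, w=48, mask_ratio=0.75, blockwise_prob=0.4, block=4):
    n_total = h * w
    n_mask = int(n_total * mask_ratio)          # ~2304 of 3072 patches at 1024x768
    mask = torch.zeros(h, w, dtype=torch.bool)
    if torch.rand(1).item() < blockwise_prob:
        # blockwise: stamp random block x block squares until the budget is met
        # (may slightly overshoot the exact ratio)
        while mask.sum() < n_mask:
            r = torch.randint(0, h - block + 1, (1,)).item()
            c = torch.randint(0, w - block + 1, (1,)).item()
            mask[r:r + block, c:c + block] = True
    else:
        # patchwise: uniformly random individual patches
        idx = torch.randperm(n_total)[:n_mask]
        mask.view(-1)[idx] = True
    return mask  # True = masked
```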
For stability at scale, the architecture incorporates several improvements: RMSNorm replacing LayerNorm, Grouped-Query Attention (GQA) in mid-depth blocks for higher throughput, QK-Norm for robust high-resolution training, and SwiGLU feed-forward layers. The decoder uses pixel-shuffle upsampling for sub-pixel reasoning. Decoder output resolution was also increased from 0.5K to 1K for base backbones, and to 2K for 4K backbones.
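Two of these components are easy to show in isolation. Below is a plain-PyTorch sketch of an RMSNorm layer and a SwiGLU feed-forward block; the dimensions are illustrative and this is not the released Sapiens2 implementation.

```python
# RMSNorm and SwiGLU sketches, two of the stability components listed above.
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # normalize by root-mean-square instead of mean/variance (no centering)
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # SiLU-gated linear unit: silu(x W_gate) * (x W_up), projected back down
        return self.w_down(nn.functional.silu(self.w_gate(x)) * self.w_up(x))
```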
Post-Training: Five Human Tasks, 10× More Supervision
A critical improvement over the original Sapiens is the scale and quality of task-specific supervision. Relative to the first generation, Sapiens2 scales task-specific labels by 10×, typically reaching around 1 million labels per task. After pretraining, the backbone is adapted to five downstream tasks via lightweight task-specific heads, leaving the backbone architecture unchanged:
- Pose Estimation: A 308-keypoint full-body skeleton with dense face (243 keypoints) and hand (40 keypoints) coverage. The research team newly annotated 100K in-the-wild images to complement studio capture data, improving generalization significantly.
- Body-Part Segmentation: 29 semantic classes (extended from 28 by adding eyeglasses), trained with per-pixel weighted cross-entropy combined with a Dice loss for sharper boundaries (see the loss sketch after this list).
- Pointmap Estimation: Rather than predicting relative depth, Sapiens2 regresses a per-pixel 3D pointmap P̂(u) ∈ ℝ³ in the camera frame — a harder task that requires reasoning about camera intrinsics.
- Normal Estimation: Per-pixel surface unit normals, decoded using multiple PixelShuffle layers for artifact-free upsampling.
- Albedo Estimation: Per-pixel diffuse albedo Â(u) ∈ [0,1]³, trained purely on synthetic high-fidelity data and designed to recover true skin tone and clothing color under varying illumination.
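To ground two of these heads, here is a sketch of the segmentation objective (weighted cross-entropy plus Dice) and a PixelShuffle-based dense prediction head of the kind described for normal and albedo decoding. Shapes, class weights, and the upscale factor are assumptions; this is not the paper’s code.

```python
# Task-head sketches: weighted CE + Dice segmentation loss, and a PixelShuffle
# upsampling head producing a 3-channel per-pixel map (e.g. unit normals).
import torch
import torch.nn as nn
import torch.nn.functional as F

def seg_loss(logits, target, class_weights=None, smooth=1.0):
    """logits: [B, C, H, W]; target: [B, H, W] integer class labels in [0, C)."""
    ce = F.cross_entropy(logits, target, weight=class_weights)
    probs = logits.softmax(dim=1)
    onehot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(2, 3))
    union = probs.sum(dim=(2, 3)) + onehot.sum(dim=(2, 3))
    dice = 1.0 - ((2 * inter + smooth) / (union + smooth)).mean()
    return ce + dice

class PixelShuffleHead(nn.Module):
    """Predict a 3-channel map at `upscale` times the feature resolution."""
    def __init__(self, dim, out_ch=3, upscale=4):
        super().__init__()
        self.proj = nn.Conv2d(dim, out_ch * upscale * upscale, kernel_size=1)
        self.shuffle = nn.PixelShuffle(upscale)  # sub-pixel upsampling

    def forward(self, feat):                      # feat: [B, dim, h, w]
        out = self.shuffle(self.proj(feat))       # [B, out_ch, h*up, w*up]
        return F.normalize(out, dim=1)            # unit-length vectors per pixel
```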
Results
The numbers are difficult to argue with. On the 11K-image in-the-wild pose test set, Sapiens2-5B achieves 82.3 mAP compared to 78.3 mAP for Sapiens-2B — a +4 mAP improvement. On body-part segmentation, even the smallest model, Sapiens2-0.4B, scores 79.5 mIoU (+21.3 over Sapiens-2B*), while Sapiens2-5B reaches 82.5 mIoU — a +24.3 mIoU gain over the previous generation’s largest model. The 4K variant, Sapiens2-1B-4K, further pushes segmentation to 81.9 mIoU and 92.0 mAcc, demonstrating the benefit of higher-resolution reasoning.
On surface normal estimation, Sapiens2-0.4B already achieves a mean angular error of 8.63°, outperforming the previous state-of-the-art DAViD-L at 10.73°. The 5B model brings this down further to 6.73°, and the 4K variant reaches 6.98° with a median angular error of just 3.08°.
For albedo estimation, Sapiens2-5B achieves an MAE of 0.012 and a PSNR of 32.61 dB, with consistent improvement across all model sizes. On pointmap estimation, all Sapiens2 model sizes outperform MoGe, which was previously state-of-the-art for monocular geometry estimation.
In dense probing evaluations, where the backbone is frozen and only lightweight decoders are trained with identical hyperparameters, Sapiens2-5B surpasses all baselines across every task, including DINOv3-7B (6.71B parameters), despite Sapiens2 being a human-specialist model evaluated against a general-purpose backbone roughly 1.3× its size.