[paper]
INTRODUCTION
We argue that such human-centric models should satisfy three criteria:
- 1. Generalization, 2. Broad applicability, and 3. High fidelity.
- Generalization ensures robustness to unseen conditions, enabling the model to perform consistently across varied environments.
- Broad applicability indicates the versatility of the model, making it suitable for a wide range of tasks with minimal modifications.
- High fidelity denotes the ability of the model to produce precise, high-resolution outputs, essential for faithful human generation tasks.
Pretraining
We adopt Masked Autoencoders (MAE) for pretraining, for their simplicity and efficiency.
For 2D pose estimation:
we introduce a comprehensive collection of 308 keypoints encompassing the body, hands, feet, surface, and face.
For body-part segmentation:
Additionally, we expand the segmentation class vocabulary to 28 classes, covering body parts such as the hair, tongue, teeth, upper/lower lip, and torso.
For depth and surface normals:
We also utilize human-centric synthetic data for depth and normal estimation, leveraging 600 detailed scans from RenderPeople[84] to generate high-resolution depth maps and surface normals.
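Since depth maps and surface normals are closely related, one can be sanity-checked against the other. As a hedged illustration (the paper renders ground-truth normals directly from the RenderPeople scans rather than deriving them from depth), here is a minimal sketch of estimating per-pixel normals from a depth map via finite differences:

```python
import numpy as np

def normals_from_depth(depth):
    """Estimate per-pixel surface normals from a depth map using finite
    differences. Simplified sketch: assumes an orthographic camera and
    unit pixel spacing, which is not how the paper produces its labels."""
    dz_dy, dz_dx = np.gradient(depth)
    # The normal opposes the depth gradient; fix the z-component to 1,
    # then normalize each pixel's vector to unit length.
    n = np.dstack((-dz_dx, -dz_dy, np.ones_like(depth)))
    n /= np.linalg.norm(n, axis=2, keepdims=True)
    return n
```

On a constant-depth (flat, fronto-parallel) region this yields normals of (0, 0, 1), as expected.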

Dataset, model architecture, parameter count
To study the influence of pretraining data distribution on human-specific tasks, we collect the Humans-300M dataset, featuring 300 million diverse human images.
These unlabelled images are used to pretrain a family of vision transformers [27] from scratch, with parameter counts ranging from 300M to 2B.
Contributions of Sapiens
• We introduce Sapiens, a family of vision transformers pretrained on a large-scale dataset of human images.
• This study shows that simple data curation and large-scale pretraining significantly boost the model’s performance with the same computational budget.
• Our models, fine-tuned with high-quality or even synthetic labels, demonstrate in-the-wild generalization.
• The first 1K resolution model that natively supports high-fidelity inference for human-centric tasks, achieving state-of-the-art performance on benchmarks for
2D pose, body-part segmentation, depth, and normal estimation.
METHOD
- Humans-300M Dataset:
Curated from roughly 1 billion 'in-the-wild' images, filtered to focus exclusively on human beings.
Pre-processing:
Discard images with watermarks, artistic depictions, or any other unnatural elements.
Retain only images where a person bounding-box detector scores above 0.9 and the bounding-box dimensions exceed 300 pixels.
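The curation rule above can be sketched as a simple filter. The detection format (a list of score/box tuples) is an assumption for illustration, not the paper's actual pipeline:

```python
def keep_image(detections, min_score=0.9, min_box_px=300):
    """Return True if any detected person passes the curation thresholds.

    `detections` is a hypothetical detector output: a list of
    (score, (x0, y0, x1, y1)) tuples in pixel coordinates. We require
    both box dimensions to exceed `min_box_px`, one reading of
    "the dimension of the bounding box exceeds 300 pixels".
    """
    for score, (x0, y0, x1, y1) in detections:
        if score > min_score and min(x1 - x0, y1 - y0) > min_box_px:
            return True
    return False
```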
- Pretraining:
MAE approach for pretraining: the model is trained to reconstruct the original human image given only a partial observation (the subset of patches left visible after masking).
The encoder maps the visible patches to a latent space, and the decoder reconstructs the original image from that latent representation.
The pretraining dataset consists of single- and multi-human images, and each image is resized to a square aspect ratio.
As in ViT, images are divided into non-overlapping patches of a fixed patch size. A random subset of patches is masked, with the masking ratio held fixed throughout pretraining.
Each patch token in the model accounts for 0.02% of the image area compared to the 0.4% in standard ViTs.
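The masking step above can be sketched as follows. The 0.75 mask ratio is an assumption (the MAE default; the notes only say the ratio is fixed), and the comment verifies the per-patch area figures: at 1024 px with 16 px patches each token covers (16/1024)² ≈ 0.024% of the image, versus (16/256)² ≈ 0.39% for a 256 px ViT, consistent with the ~0.02% vs ~0.4% numbers above.

```python
import numpy as np

def random_patch_mask(img_size=1024, patch_size=16, mask_ratio=0.75, rng=None):
    """Randomly mask a fixed fraction of non-overlapping patches (MAE-style).

    At img_size=1024, patch_size=16 there are (1024 // 16)**2 = 4096
    patches, each covering (16 / 1024)**2 ~= 0.024% of the image area.
    """
    if rng is None:
        rng = np.random.default_rng()
    n_patches = (img_size // patch_size) ** 2
    n_masked = int(n_patches * mask_ratio)
    mask = np.zeros(n_patches, dtype=bool)
    mask[rng.choice(n_patches, n_masked, replace=False)] = True
    return mask
```

Only the unmasked patches are fed to the encoder; the decoder then predicts the pixels of the masked ones.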
- 2D Pose Estimation:
The model introduces a pose estimation transformer to detect the locations of K keypoints in an input image. The encoder and decoder of the pose estimation transformer are finetuned across
multiple skeletons with K = 308 to capture the full-body pose.
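Pose transformers of this kind typically predict one heatmap per keypoint and decode locations from the per-channel maxima. A hedged sketch of that standard decoding step (the paper's exact decoder is not specified in these notes):

```python
import numpy as np

def decode_keypoints(heatmaps):
    """Decode (x, y) keypoint locations and confidence scores from
    predicted heatmaps of shape (K, H, W) via per-channel argmax.
    For Sapiens, K = 308 covers body, hands, feet, surface, and face."""
    k, h, w = heatmaps.shape
    flat = heatmaps.reshape(k, -1)
    idx = flat.argmax(axis=1)          # flattened index of each peak
    ys, xs = np.divmod(idx, w)         # recover row/column coordinates
    scores = flat.max(axis=1)          # peak value as confidence
    return np.stack([xs, ys], axis=1), scores
```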
- Body-Part Segmentation:
Body-part segmentation (BPS), also known as human parsing, aims to classify each pixel of an input image I into one of C classes. The model classifies human body parts into C = 28 classes.
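The per-pixel classification above reduces, at inference time, to an argmax over class logits. A minimal sketch of that standard decoding (not specific to the paper), which also extracts the binary mask for a single body part:

```python
import numpy as np

def parse_body(logits, part_class):
    """Decode a per-pixel label map from class logits of shape (C, H, W),
    and extract the binary mask for one class index. C = 28 body-part
    classes in Sapiens; the argmax decoding itself is standard."""
    labels = logits.argmax(axis=0)        # (H, W) label map
    return labels, labels == part_class   # full parse + one part's mask
```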