[paper]
INTRODUCTION
We argue that such human-centric models should satisfy three criteria:
- 1. Generalization, 2. Broad applicability, and 3. High fidelity.
- Generalization ensures robustness to unseen conditions, enabling the model to perform consistently across varied environments.
- Broad applicability indicates the versatility of the model, making it suitable for a wide range of tasks with minimal modifications.
- High fidelity denotes the ability of the model to produce precise, high-resolution outputs, essential for faithful human generation tasks.
Pretraining
We adopt Masked Autoencoders (MAE) for pretraining, for their simplicity and efficiency.
For 2D pose estimation:
we introduce a comprehensive collection of 308 keypoints encompassing the body, hands, feet, surface, and face.
For body-part segmentation:
Additionally, we expand the segmentation class vocabulary to 28 classes, covering body parts such as the hair, tongue, teeth, upper/lower lip, and torso.
For depth and surface normals:
We also utilize human-centric synthetic data for depth and normal estimation, leveraging 600 detailed scans from RenderPeople[84] to generate high-resolution depth maps and surface normals.
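Since depth maps and surface normals are closely related, one can be sanity-checked against the other. As a hedged illustration (the paper renders ground-truth normals directly from the RenderPeople scans rather than deriving them from depth), here is a minimal sketch of estimating per-pixel normals from a depth map via finite differences:

```python
import numpy as np

def normals_from_depth(depth):
    """Estimate per-pixel surface normals from a depth map using finite
    differences. Simplified sketch: assumes an orthographic camera and
    unit pixel spacing, which is not how the paper produces its labels."""
    dz_dy, dz_dx = np.gradient(depth)
    # The normal opposes the depth gradient; fix the z-component to 1,
    # then normalize each pixel's vector to unit length.
    n = np.dstack((-dz_dx, -dz_dy, np.ones_like(depth)))
    n /= np.linalg.norm(n, axis=2, keepdims=True)
    return n
```

On a constant-depth (flat, fronto-parallel) region this yields normals of (0, 0, 1), as expected.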

Dataset, model architecture, parameter count
To study the influence of pretraining data distribution on human-specific tasks, we collect the Humans-300M dataset, featuring 300 million diverse human images.
These unlabelled images are used to pretrain a family of vision transformers [27] from scratch, with parameter counts ranging from 300M to 2B.
Contributions of Sapiens
• We introduce Sapiens, a family of vision transformers pretrained on a large-scale dataset of human images.
• This study shows that simple data curation and large-scale pretraining significantly boost the model’s performance with the same computational budget.
• Our models, fine-tuned with high-quality or even synthetic labels, demonstrate in-the-wild generalization.
• The first 1K resolution model that natively supports high-fidelity inference for human-centric tasks, achieving state-of-the-art performance on benchmarks for
2D pose, body-part segmentation, depth, and normal estimation.
METHOD
- Humans-300M Dataset:
Curated from roughly 1 billion 'in-the-wild' images, filtered to focus exclusively on human beings.
Pre-processing:
Discard images with watermarks, artistic depictions, or any other unnatural elements.
Retain only images where a person bounding-box detector scores above 0.9 and the bounding-box dimensions exceed 300 pixels.
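The curation rule above can be sketched as a simple filter. The detection format (a list of score/box tuples) is an assumption for illustration, not the paper's actual pipeline:

```python
def keep_image(detections, min_score=0.9, min_box_px=300):
    """Return True if any detected person passes the curation thresholds.

    `detections` is a hypothetical detector output: a list of
    (score, (x0, y0, x1, y1)) tuples in pixel coordinates. We require
    both box dimensions to exceed `min_box_px`, one reading of
    "the dimension of the bounding box exceeds 300 pixels".
    """
    for score, (x0, y0, x1, y1) in detections:
        if score > min_score and min(x1 - x0, y1 - y0) > min_box_px:
            return True
    return False
```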
- Pretraining:
MAE approach for pretraining: the model is trained to reconstruct the original human image given only a partial observation (the subset of patches left visible after masking).
The encoder maps the visible patches to a latent space, and the decoder reconstructs the original image from that latent representation.
The pretraining dataset consists of single- and multi-human images, and each image is resized to a square aspect ratio.
As in ViT, images are divided into non-overlapping patches of a fixed patch size. A random subset of patches is masked, with the masking ratio held fixed throughout pretraining.
Each patch token in the model accounts for 0.02% of the image area compared to the 0.4% in standard ViTs.
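The masking step above can be sketched as follows. The 0.75 mask ratio is an assumption (the MAE default; the notes only say the ratio is fixed), and the comment verifies the per-patch area figures: at 1024 px with 16 px patches each token covers (16/1024)² ≈ 0.024% of the image, versus (16/256)² ≈ 0.39% for a 256 px ViT, consistent with the ~0.02% vs ~0.4% numbers above.

```python
import numpy as np

def random_patch_mask(img_size=1024, patch_size=16, mask_ratio=0.75, rng=None):
    """Randomly mask a fixed fraction of non-overlapping patches (MAE-style).

    At img_size=1024, patch_size=16 there are (1024 // 16)**2 = 4096
    patches, each covering (16 / 1024)**2 ~= 0.024% of the image area.
    """
    if rng is None:
        rng = np.random.default_rng()
    n_patches = (img_size // patch_size) ** 2
    n_masked = int(n_patches * mask_ratio)
    mask = np.zeros(n_patches, dtype=bool)
    mask[rng.choice(n_patches, n_masked, replace=False)] = True
    return mask
```

Only the unmasked patches are fed to the encoder; the decoder then predicts the pixels of the masked ones.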
- 2D Pose Estimation:
The model introduces a pose estimation transformer to detect the locations of K keypoints in an input image. The encoder and decoder of the pose estimation transformer are finetuned across
multiple skeletons with K = 308 to capture the full-body pose.
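Pose transformers of this kind typically predict one heatmap per keypoint and decode locations from the per-channel maxima. A hedged sketch of that standard decoding step (the paper's exact decoder is not specified in these notes):

```python
import numpy as np

def decode_keypoints(heatmaps):
    """Decode (x, y) keypoint locations and confidence scores from
    predicted heatmaps of shape (K, H, W) via per-channel argmax.
    For Sapiens, K = 308 covers body, hands, feet, surface, and face."""
    k, h, w = heatmaps.shape
    flat = heatmaps.reshape(k, -1)
    idx = flat.argmax(axis=1)          # flattened index of each peak
    ys, xs = np.divmod(idx, w)         # recover row/column coordinates
    scores = flat.max(axis=1)          # peak value as confidence
    return np.stack([xs, ys], axis=1), scores
```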
- Body-Part Segmentation:
Body-part segmentation (BPS), also known as human parsing, aims to classify each pixel of an input image I into one of C classes. The model classifies human body parts into C = 28 classes.
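The per-pixel classification above reduces, at inference time, to an argmax over class logits. A minimal sketch of that standard decoding (not specific to the paper), which also extracts the binary mask for a single body part:

```python
import numpy as np

def parse_body(logits, part_class):
    """Decode a per-pixel label map from class logits of shape (C, H, W),
    and extract the binary mask for one class index. C = 28 body-part
    classes in Sapiens; the argmax decoding itself is standard."""
    labels = logits.argmax(axis=0)        # (H, W) label map
    return labels, labels == part_class   # full parse + one part's mask
```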