
Meta's "Segment Everything" model has been upgraded! It can "understand human language" and process images containing hundreds of objects in just 30 milliseconds

On Wednesday the 19th, Eastern Time, Meta released SAM 3, the third generation of its Segment Anything Model (SAM). The release marks a significant breakthrough: for the first time, users can identify, segment, and track any object in videos using natural language descriptions and image examples. Meta also released SAM 3D, an open-source model for 3D reconstruction, and plans to integrate these technologies into Edits, Instagram's video creation app, and the Meta AI app.
The core innovation of SAM 3 lies in the introduction of a capability called Promptable Concept Segmentation (PCS). Users only need to input natural language prompts like "striped red umbrella," and the model can automatically identify and segment all qualifying instances in images or videos, breaking through the limitations of traditional models that rely on fixed label sets.
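As a rough illustration of what promptable concept segmentation looks like from a caller's point of view, the sketch below uses a hypothetical ConceptSegmenter stand-in; the class, its segment method, and the returned fields are assumptions made for illustration, not SAM 3's published interface.
```python
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class InstanceMask:
    """One segmented instance returned for a concept prompt."""
    mask: np.ndarray                   # boolean HxW array marking the instance's pixels
    score: float                       # confidence that the instance matches the prompt
    box: Tuple[int, int, int, int]     # (x0, y0, x1, y1) bounding box in pixels


class ConceptSegmenter:
    """Illustrative stand-in for a promptable-concept-segmentation model."""

    def segment(self, image: np.ndarray, prompt: str) -> List[InstanceMask]:
        # A real model would detect and segment *every* instance matching the
        # open-vocabulary phrase; this stub simply returns no instances.
        return []


# One short noun phrase returns masks for all matching instances at once,
# rather than a single object selected by a click or a box.
segmenter = ConceptSegmenter()
image = np.zeros((720, 1280, 3), dtype=np.uint8)   # placeholder image
instances = segmenter.segment(image, "striped red umbrella")
print(f"found {len(instances)} matching instance(s)")
```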
In terms of processing speed, the SAM 3 model takes only 30 milliseconds to process a single image containing over 100 objects on an NVIDIA H200 GPU, maintaining near real-time performance in video scenes with about five concurrent target objects.
On Meta's SA-Co benchmark, SAM 3 roughly doubles the performance of existing systems. In zero-shot segmentation on the LVIS dataset, SAM 3 achieves an accuracy of 47.0, well above the previous best of 38.5. In user preference tests, SAM 3's output was preferred roughly three to one over the strongest baseline, OWLv2.
Meta introduced that the aforementioned technological breakthroughs will first be applied to a new feature in Facebook Marketplace called "View in Room," helping users visualize the placement of products in their personal space before purchasing home decor items. Meta also launched the Segment Anything Playground platform, allowing ordinary users without a technical background to experience the capabilities of these cutting-edge AI models.
Breaking the Limitations of Fixed Labels, Supporting Open Vocabulary Segmentation
The biggest challenge faced by traditional image segmentation models is the difficulty in associating natural language with specific visual elements in images. Existing models typically can only segment predefined concepts like "person," but struggle to understand more detailed descriptions like "striped red umbrella."
SAM 3 addresses this limitation by introducing the capability of promptable concept segmentation. The model accepts text prompts in the form of phrases and image example prompts, completely freeing itself from the constraints of fixed label sets. To evaluate large vocabulary detection and segmentation performance, Meta created the SA-Co benchmark dataset, which includes 214,000 unique concepts, 124,000 images, and 1,700 videos, covering more than 50 times the range of existing benchmarks.
The model also supports multiple prompting methods, including simple noun phrases and image examples as concept prompts, as well as the visual prompts (points, boxes, and masks) introduced in SAM 1 and SAM 2. This greatly improves the flexibility and usability of segmentation, especially for rare or hard-to-describe concepts.
SAM 3 can also serve as a perception tool for multimodal large language models, handling more complex prompts such as "a person sitting but not holding a gift box." Used together with a multimodal large language model, SAM 3 outperforms previous work on complex text segmentation benchmarks that require reasoning, such as ReasonSeg and OmniLabel, without training on any referring expression segmentation or reasoning segmentation data.
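To make the "perception tool" idea concrete, here is a minimal sketch of how a multimodal model might decompose a compositional query into simple concept prompts and combine the results; segment_concept, overlaps, and answer_query are hypothetical placeholders, not part of any released API.
```python
from typing import List

import numpy as np


def segment_concept(image: np.ndarray, phrase: str) -> List[dict]:
    """Stand-in for a concept-segmentation call; one dict per matching instance."""
    return []


def overlaps(person: dict, item: dict) -> bool:
    """Stand-in spatial check, e.g. an IoU test between instance boxes."""
    return False


def answer_query(image: np.ndarray) -> List[dict]:
    # Query: "a person sitting but not holding a gift box".
    # The language model plans two simple lookups, then applies the negation itself.
    sitting_people = segment_concept(image, "person sitting")
    gift_boxes = segment_concept(image, "gift box")
    return [
        person for person in sitting_people
        if not any(overlaps(person, box) for box in gift_boxes)
    ]


matches = answer_query(np.zeros((480, 640, 3), dtype=np.uint8))
print(f"{len(matches)} match(es) for the compositional query")
```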
Innovative Data Engine, Human-Machine Collaboration Accelerates by 5 Times
Acquiring high-quality annotated images with segmentation masks and text labels is a significant challenge, especially when it comes to meticulously annotating the occurrence locations of each object category in videos, which is both time-consuming and complex. Building a comprehensive dataset that covers a large and diverse vocabulary across multiple visual domains requires substantial time and resources.
Meta addresses this issue by creating a scalable data engine that combines SAM 3, human annotators, and AI models, significantly accelerating the annotation speed. For negative prompts (concepts that do not exist in images or videos), the annotation speed is approximately 5 times faster than purely manual efforts, and for positive prompts, it is 36% faster even in challenging fine-grained domains. This human-machine hybrid system enables the team to create a large-scale diverse training set containing over 4 million unique concepts.
The pipeline composed of AI models, including SAM 3 and a Llama-based image description system, automatically mines images and videos, generates descriptions, parses the descriptions into text labels, and creates initial segmentation masks. Human and AI annotators then verify and correct these proposals, forming a feedback loop that rapidly expands the coverage of the dataset while continuously improving data quality.
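The sketch below outlines that loop in illustrative Python; every function (mine_media, caption, parse_labels, propose_masks, verify) is a hypothetical placeholder for one pipeline stage, not code from Meta's release.
```python
from typing import List


def mine_media(batch_size: int) -> List[str]:
    """Select candidate images or videos worth annotating."""
    return []


def caption(item: str) -> str:
    """Stand-in for the Llama-based captioner: describe what the media contains."""
    return ""


def parse_labels(description: str) -> List[str]:
    """Turn a free-form caption into candidate noun-phrase labels."""
    return []


def propose_masks(item: str, label: str) -> List[dict]:
    """Stand-in for the segmentation model proposing initial instance masks."""
    return []


def verify(item: str, label: str, masks: List[dict]) -> List[dict]:
    """AI annotators check mask quality and exhaustiveness; humans review
    only the cases the AI annotators are unsure about."""
    return masks


def run_engine_round(batch_size: int = 1000) -> List[dict]:
    accepted = []
    for item in mine_media(batch_size):
        for label in parse_labels(caption(item)):
            accepted.extend(verify(item, label, propose_masks(item, label)))
    # Accepted annotations feed back into training, improving the proposals
    # (and therefore the throughput) of the next round.
    return accepted


print(f"{len(run_engine_round())} annotations accepted in this round")
```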
The AI annotators, built on a specially trained Llama 3.2v model, match or exceed human accuracy on annotation tasks such as verifying mask quality or checking whether every instance of a concept in an image has been exhaustively labeled. By delegating some annotation tasks to AI annotators, throughput more than doubled compared with a purely manual annotation pipeline.
SAM 3D Sets a New Standard for 3D Reconstruction of the Physical World
SAM 3D includes two new industry-leading models: SAM 3D Objects for object and scene reconstruction, and SAM 3D Body for human pose and shape estimation. These two models set a new standard for 3D reconstruction of physical world scenes.
SAM 3D Objects takes a new approach to visually grounded 3D reconstruction and object pose estimation, reconstructing detailed 3D shape, texture, and object layout from a single natural image. Its innovation comes from breaking through long-standing barriers around 3D data of the physical world: by building a powerful data annotation engine combined with a new multi-stage training scheme designed for 3D, the team annotated nearly 1 million distinct images, generating approximately 3.14 million meshes with models in the loop.
In head-to-head human preference tests, SAM 3D Objects wins against other leading models at a rate of at least 5 to 1. Through diffusion shortcuts and other engineering optimizations, the model can return full textured reconstructions of comparable quality in seconds, making near-real-time 3D applications possible, such as serving as a 3D perception module for robots.
SAM 3D Body focuses on accurate 3D human pose and shape estimation from a single image and can handle complex situations involving unusual poses, occlusion, or multiple people. The model supports interactive inputs such as segmentation masks and 2D keypoints, allowing users to guide and control its predictions.
SAM 3D Body achieves accurate and robust 3D human pose and shape estimation by leveraging large-scale, high-quality data. The research team started from a pool of billions of images, drawing on diverse large-scale photo collections, high-quality video from multi-camera capture systems, and professionally constructed synthetic data. A scalable automated data engine then mined high-value images, selecting those with unusual poses and rare capture conditions. The team assembled a high-quality training set of approximately 8 million images and trained the model to be robust to occlusion, rare poses, and diverse clothing. SAM 3D Body delivers a step-change improvement in accuracy and robustness across multiple 3D benchmarks, outperforming previous models.
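As an illustration of the interactive inputs mentioned above, the sketch below shows how guidance for single-image human pose and shape estimation might be packaged; BodyPromptInputs and its fields are assumptions made for the sake of example, not SAM 3D Body's actual interface.
```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

import numpy as np


@dataclass
class BodyPromptInputs:
    """Illustrative guidance inputs for single-image human pose/shape estimation."""
    image: np.ndarray                                    # HxWx3 RGB frame
    person_mask: Optional[np.ndarray] = None             # which pixels belong to the target person
    keypoints_2d: Dict[str, Tuple[int, int]] = field(default_factory=dict)
    # e.g. {"left_wrist": (412, 230)} to pin down an occluded or unusual pose


# The optional mask and keypoints let a user steer the prediction when the
# automatic estimate is ambiguous (occlusion, several overlapping people).
inputs = BodyPromptInputs(
    image=np.zeros((720, 1280, 3), dtype=np.uint8),
    keypoints_2d={"left_wrist": (412, 230)},
)
print(f"{len(inputs.keypoints_2d)} keypoint hint(s) supplied")
```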
Application Expansion to Wildlife Conservation and Marine Research
SAM 3 is already being applied in scientific research. Meta has partnered with Conservation X Labs and Osa Conservation to pair field wildlife monitoring with SAM 3 and build an open, research-grade dataset of raw video. The publicly available SA-FARI dataset contains over 10,000 camera-trap videos covering more than 100 species, with every animal in every frame annotated with bounding boxes and segmentation masks.
FathomNet is a unique research collaboration led by the Monterey Bay Aquarium Research Institute (MBARI) aimed at advancing AI tools for ocean exploration. Segmentation masks customized for underwater images and a new instance segmentation benchmark are now available to the marine research community through the FathomNet database. SA-FARI and FathomNet are accessible to the broader AI community to develop innovative methods for discovering, monitoring, and protecting terrestrial and marine wildlife.
Meta has also partnered with Roboflow to enable users to annotate data, fine-tune, and deploy SAM 3 to meet specific needs. As part of the code release, Meta shared fine-tuning methods for the community to adapt SAM 3 to their use cases.
Despite significant progress, SAM 3 still has limitations in certain scenarios. The model struggles to generalize zero-shot to fine-grained out-of-domain concepts, particularly specialized terminology that requires domain knowledge, such as "platelets" in medical or scientific images. When applied to video, SAM 3 tracks each object in a manner similar to SAM 2, so inference cost grows linearly with the number of tracked objects: each object is processed individually, using only a shared per-frame embedding, with no communication between objects.
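The sketch below illustrates why cost grows with the number of tracked objects under that design: the per-frame embedding is computed once and shared, but each object is propagated by its own independent update, with no cross-object communication. All names here are hypothetical.
```python
from typing import List

import numpy as np


def embed_frame(frame: np.ndarray) -> np.ndarray:
    """Compute the per-frame image embedding once; it is shared by all objects."""
    return np.zeros(256)


def propagate_object(state: dict, frame_embedding: np.ndarray) -> dict:
    """Update a single object's track from the shared embedding only;
    no information is exchanged between objects."""
    return state


def track_video(frames: List[np.ndarray], object_states: List[dict]) -> List[List[dict]]:
    results = []
    for frame in frames:
        embedding = embed_frame(frame)   # cost independent of the object count
        per_object = [propagate_object(s, embedding) for s in object_states]
        results.append(per_object)       # this step scales with len(object_states)
        object_states = per_object
    return results


frames = [np.zeros((720, 1280, 3), dtype=np.uint8)] * 3
tracks = track_video(frames, object_states=[{}, {}])   # two tracked objects
print(f"processed {len(tracks)} frames")
```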

