Meta FAIR is sharing new research artifacts that highlight our commitment to advanced machine intelligence (AMI) and the impact of open source technologies on advancing AI. By sharing our research, we aim to create intelligent systems that can understand and respond to complex human needs, ultimately enhancing our daily lives.
The work we’re sharing includes the Meta PARTNR dataset and benchmark, aimed at building socially intelligent robots that can assist people in everyday tasks, such as grabbing a delivery that’s at the front door or helping around the house.
As part of our continued work with UNESCO to promote linguistic diversity in technology, we’re inviting collaborators to join us in enhancing and expanding machine translation and language technologies. Through collaboration, we want to promote linguistic diversity and inclusivity in the digital world.
We’re also sharing advancements in audio processing, multilingual communication, and language technology—all crucial developments on the path towards achieving AMI.
Meta Fundamental AI Research (FAIR) is focused on achieving advanced machine intelligence (AMI) and using it to power products and innovation for the benefit of everyone. Today, we’re excited to share some of our most recent research and models that support our goal of achieving AMI and our long-held commitment to sharing open and reproducible science.
Meta PARTNR: Unlocking human-robot collaboration
Imagine a world where robots are intuitive partners in our daily lives. They clean up, grab deliveries, and assist in cooking, all while understanding our needs and adapting to the dynamic environment of a bustling home. Today, we’re excited to introduce PARTNR, a research framework that brings us closer to this reality by fueling research in seamless human-robot collaboration. Most robots today operate in isolation, which limits their potential as helpful assistive agents of the future. With PARTNR, we aim to change the status quo by open sourcing a large-scale benchmark, dataset, and model aimed at studying human-robot collaboration on everyday tasks. At its core, PARTNR provides a mechanism for training social robots at scale in simulation before deploying them in the physical world.
PARTNR stands on the shoulders of previous highly impactful work that has been shared with the open source community. It builds on the advances made with Habitat 1.0, which trained virtual robots to navigate in 3D scans of real homes, and Habitat 2.0, which trained virtual robots to clean up houses by rearranging objects. With Habitat 3.0, a simulator designed to train models for human-robot collaboration, we made another leap forward. Habitat 3.0 enabled training human-robot collaboration models at large scale, which isn’t feasible in the physical world due to safety concerns and scalability issues.
We’re also introducing the PARTNR benchmark, which evaluates collaborative robots and how well they perform in both simulated and physical environments. The benchmark consists of 100,000 tasks, including household chores such as cleaning up dishes and toys. We’re also releasing the PARTNR dataset, consisting of human demonstrations of the PARTNR tasks in simulation, which can be used for training embodied AI models. The PARTNR benchmark highlights major shortcomings of existing models, such as poor coordination, failures in task tracking, and failures in recovering from errors. We encourage the academic community to continue to build on our work and fuel progress in the field of human-robot collaboration.
We’ve also made advances in models that can collaborate with humans in both simulation and physical environments. Using large-scale simulation data, we trained a large planning model that outperforms state-of-the-art baselines in terms of both speed and performance: it achieves an 8.6x increase in speed while enabling humans to be 24% more efficient at completing tasks, compared to existing top-performing models. It can interpret long-horizon instructions, breaking complex tasks down into actionable steps and providing meaningful assistance to human users. We’ve successfully deployed this model on Boston Dynamics’ Spot, showcasing its ability to work alongside humans in real-world settings. To enhance transparency and trust, we’ve also developed a mixed reality interface that visualizes the robot’s actions and thought processes, offering a window into its decision making.
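To make the planning setup concrete, here is a minimal, hypothetical sketch of a planner decomposing a long-horizon instruction into skills and executing them against a simulated environment. Every name in it (PartnrTask, SimEnvStub, LLMPlanner, run_episode) is an illustrative stand-in, not the interface defined in the released PARTNR code.

```python
# Hypothetical sketch of the planner-in-the-loop setup PARTNR studies: a planner
# breaks a natural-language instruction into skills and re-plans as the episode
# unfolds. All names are illustrative, not the released PARTNR API.

from dataclasses import dataclass


@dataclass
class PartnrTask:
    instruction: str
    max_steps: int = 200


class SimEnvStub:
    """Stand-in for a Habitat 3.0 environment; tracks which skills have run."""

    def __init__(self, required_skills):
        self.required = set(required_skills)
        self.completed = set()

    def reset(self, task):
        self.completed.clear()
        return {"instruction": task.instruction}

    def step(self, skill):
        self.completed.add(skill)
        done = self.completed >= self.required  # all required skills executed
        return {"last_skill": skill}, done


class LLMPlanner:
    """Breaks a long-horizon instruction into primitive skills (canned here)."""

    def plan(self, instruction, observation):
        return ["navigate(table)", "pick(plate)", "place(plate, sink)"]


def run_episode(env, planner, task):
    """Plan, execute, and re-plan until the task is done or the step budget runs out."""
    obs = env.reset(task)
    steps = 0
    while steps < task.max_steps:
        for skill in planner.plan(task.instruction, obs):
            obs, done = env.step(skill)  # the human partner would act concurrently in simulation
            steps += 1
            if done:
                return True
    return False


if __name__ == "__main__":
    task = PartnrTask("Clear the dishes from the table")
    env = SimEnvStub(["navigate(table)", "pick(plate)", "place(plate, sink)"])
    print("Task completed:", run_episode(env, LLMPlanner(), task))
```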
The potential for innovation and development in the field of human-robot collaboration is vast. With PARTNR, we want to reimagine robots as future partners—not just agents—and jump-start research in this exciting field.
Download the code
Download the dataset
Read the paper
Democratizing language technology for the International Decade of Indigenous Languages
Language is a fundamental part of who we are, and yet many people around the world are excluded from the digital conversation because their language isn’t supported by technology. To help bridge this gap, we’re inviting the linguistics community to collaborate with us on improving and broadening the coverage of Meta’s open source language technologies, including speech recognition and machine translation.
Language Technology Partner Program
We’re seeking partners to collaborate with us on advancing language technologies, including speech recognition and machine translation. Our efforts are especially focused on under-served languages, in support of UNESCO’s work and as part of the private sector’s contribution to digital empowerment through the framework of the International Decade of Indigenous Languages. We’re looking for partners who can contribute 10+ hours of speech recordings with transcriptions, large corpora of written text (200+ sentences), and sets of translated sentences in diverse languages. Partners will work with our teams to help integrate these languages into AI-driven speech recognition and machine translation models, which we intend to open source and make freely available to the community. As a partner, you’ll also gain access to workshops led by our research teams, where you’ll learn how to leverage our open source models to build language technologies. We’re pleased that the Government of Nunavut, Canada, has agreed to work with us on this exciting initiative.
Open source machine translation benchmark
In addition to our Language Technology Partner Program, we’re launching an open source machine translation benchmark made up of sentences carefully crafted by linguistics experts to showcase the diversity of human language. We invite you to access the benchmark in seven languages and contribute translations that will be open sourced for others. Our aim is to collectively build an unprecedented multilingual machine translation benchmark for the community.
Our commitment to supporting more languages and developing open source technologies for them is ongoing. In 2022, we released No Language Left Behind (NLLB), a groundbreaking open source machine translation engine that laid the foundation for future research and development in this area. As the first neural machine translation model for many languages, NLLB paved the way for further innovation. Since its release, the open source community has built upon this work, expanding its capabilities to support dozens of additional languages. We’re also pleased that UNESCO and Hugging Face collaborated with us to build a language translator based on NLLB, which we announced during the United Nations General Assembly week last September. As we continue to develop this technology, we’re excited to collaborate with language communities to enhance and expand machine translation and other language technologies.
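As an illustration of how the community can build on NLLB today, the short sketch below translates a sentence with a publicly released NLLB checkpoint (the distilled 600M variant on the Hugging Face Hub) via the transformers library. Language codes follow NLLB’s FLORES-200 convention, e.g. “eng_Latn” and “fra_Latn”.

```python
# Sketch: translating English to French with an open source NLLB checkpoint
# ("facebook/nllb-200-distilled-600M") using the transformers library.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("Language is a fundamental part of who we are.", return_tensors="pt")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"),  # target language
    max_length=64,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```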
To support digital empowerment, which is a key thematic area of the Global Action Plan of the International Decade of Indigenous Languages, we recently introduced the Massively Multilingual Speech (MMS) project, which scales audio transcription to over 1,100 languages. We’ve continued to improve and expand its capabilities, including the addition of zero-shot speech recognition, which enables the model to transcribe audio in languages it’s never seen before without prior training. These technologies have significant implications for language support and accessibility, particularly for under-served communities.
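For example, the following sketch transcribes 16 kHz audio with the publicly released MMS checkpoint (facebook/mms-1b-all) through the transformers library, switching languages by loading the corresponding adapter. The waveform here is a silent placeholder; in practice you would load real audio, for instance with torchaudio.

```python
# Sketch: speech recognition with the MMS model via transformers, switching the
# target language (e.g. to French, "fra") by loading its adapter. The input is a
# placeholder; substitute real 16 kHz mono audio.

import torch
from transformers import AutoProcessor, Wav2Vec2ForCTC

model_id = "facebook/mms-1b-all"
processor = AutoProcessor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# Switch the tokenizer and model adapter to French; English is the default.
processor.tokenizer.set_target_lang("fra")
model.load_adapter("fra")

waveform = torch.zeros(16_000)  # placeholder: one second of silence at 16 kHz
inputs = processor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
ids = torch.argmax(logits, dim=-1)[0]
print(processor.decode(ids))
```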
By promoting the implementation of the International Decade of Indigenous Languages, we aim to address the challenges posed by the proliferation of English-language models and to work towards equal representation for all languages, contributing to the achievement of the United Nations’ Sustainable Development Goals.
In addition to its potential impact on language support and accessibility, our work has broader implications for the development of AMI. Tackling multilingual problems and under-served languages requires models that can learn from minimal data, and demonstrating this ability marks a crucial step towards creating intelligent systems that can adapt to new situations and learn from experience.
Ultimately, our goal is to create intelligent systems that can understand and respond to complex human needs, regardless of language or cultural background, and build technology that is inclusive of our world’s languages and cultures.
Meta Audiobox Aesthetics: A new standard for audio processing
Traditionally, measuring audio aesthetics has been a complex task due to its subjective nature. Unlike objective metrics like frequency response or signal-to-noise ratio, audio aesthetics require a nuanced understanding of human perception. Today, we’re excited to open source Meta Audiobox Aesthetics, a model that enables the automatic evaluation of audio aesthetics, providing a comprehensive assessment of audio quality across speech, music, and sound. The model predicts scores along four axes: content enjoyment, content usefulness, production complexity, and production quality. Addressing the challenges of subjective audio evaluation leads to better audio content and supports the development of more advanced generative audio models.
Existing evaluation methods often provide submodality-specific results with vague instructions that are difficult to interpret. Audiobox Aesthetics overcomes these limitations by offering a structured approach to audio evaluation.
To develop Audiobox Aesthetics, we designed a comprehensive annotation protocol, resulting in the collection of 562 hours of audio aesthetic data. Our dataset was annotated by professional raters to ensure high-quality data. The annotation process involved evaluating audio samples on a scale of 1 to 10 across four defined metrics: production quality, production complexity, content enjoyment, and content usefulness. This process allowed for the creation of a unified aesthetics score calibrated across different audio modalities, ensuring consistency and reliability in the model’s predictions.
Across the four quality assessment dimensions, Audiobox Aesthetics shows better correlation with human judgement than competing methods across speech, sound, and music (higher is better).
Extensive experiments showed that Audiobox Aesthetics outperformed prior works with higher correlation to human judgement, proving its effectiveness as an automatic metric for quality evaluation. The model, which we’re releasing with a CC-BY 4.0 license, also boosts the quality of various audio generation models through data filtering and quality prompting, achieving significant improvements in text-to-speech, text-to-music, and text-to-sound applications.
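As a rough illustration of the data filtering use case, here’s a short sketch that keeps only clips whose predicted scores clear a threshold. The AestheticScores container, helper function, and example numbers are made up for illustration; only the four axes and the 1-to-10 scale come from the model described above.

```python
# Hypothetical sketch: filtering training clips by predicted aesthetic scores.
# In practice the scores would come from running the released Audiobox Aesthetics
# model on each clip; here they are hard-coded for illustration.

from dataclasses import dataclass


@dataclass
class AestheticScores:
    content_enjoyment: float      # 1-10
    content_usefulness: float     # 1-10
    production_complexity: float  # 1-10
    production_quality: float     # 1-10


def filter_training_clips(scored_clips, min_quality=7.0, min_enjoyment=6.0):
    """Keep clips whose production quality and content enjoyment clear the thresholds."""
    return [
        path
        for path, s in scored_clips
        if s.production_quality >= min_quality and s.content_enjoyment >= min_enjoyment
    ]


clips = [
    ("clip_a.wav", AestheticScores(8.1, 7.4, 5.2, 8.8)),
    ("clip_b.wav", AestheticScores(4.0, 6.1, 3.9, 5.5)),
]
print(filter_training_clips(clips))  # -> ['clip_a.wav']
```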
Audiobox Aesthetics has already been leveraged to enhance Meta Movie Gen, helping to facilitate high-quality multimedia content, further driving progress and innovation in the industry. We hope this work will be used to enhance the quality of audio content and support the development of more sophisticated generative audio models.
Download the model weights and code
Read the paper
WhatsApp Voice Message Transcripts: Unlocking seamless communication
As we continue to build the future of human connection and the technology that makes it possible, we launched an update on WhatsApp to make communication feel even more seamless. Voice Message Transcripts use advanced on-device technology to generate transcripts from audio messages locally and securely, so that personal voice messages remain end-to-end encrypted. Currently, this feature supports English, Spanish, Portuguese, and Russian, broadening its reach across diverse communities.
The development of Voice Message Transcripts was made possible by leveraging insights from FAIR’s Seamless Communication research. By building on this research, WhatsApp can continue to innovate and improve its services, ultimately driving progress towards the goal of achieving AMI with multilingual capabilities. For the public releases of the Seamless M4T models, we extensively explored, developed, and shared best practices for model fine-tuning with the research community. These techniques, along with distillation, were applied and further improved to adapt the models to the genre of WhatsApp voice messages.
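To give a sense of the underlying Seamless research (not WhatsApp’s on-device, distilled implementation), the sketch below transcribes audio with the publicly released SeamlessM4T v2 checkpoint via the transformers library. The waveform is a silent placeholder; real audio would be loaded at 16 kHz, for example with torchaudio.

```python
# Sketch: speech-to-text with the publicly released SeamlessM4T v2 checkpoint
# ("facebook/seamless-m4t-v2-large") via transformers. This illustrates the open
# research models, not the production WhatsApp feature.

import torch
from transformers import AutoProcessor, SeamlessM4Tv2ForSpeechToText

model_id = "facebook/seamless-m4t-v2-large"
processor = AutoProcessor.from_pretrained(model_id)
model = SeamlessM4Tv2ForSpeechToText.from_pretrained(model_id)

waveform = torch.zeros(16_000)  # placeholder: one second of silence at 16 kHz
inputs = processor(audios=waveform.numpy(), sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    tokens = model.generate(**inputs, tgt_lang="eng")  # language of the output transcript
print(processor.decode(tokens[0].tolist(), skip_special_tokens=True))
```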
This advancement enhances people’s experiences while protecting private messaging, and it sets the stage for future innovations in multilingual communication.