Workshop on Multimodal Superintelligence

ALL ARE INVITED! NOW ACCEPTING: Grand Challenge Participation Proposals. Receive up to $20K in compute credits on Lambda.ai and let your ideas come to life!




Overview

The Workshop on Multimodal Superintelligence is a global gathering of researchers, engineers, and visionaries committed to accelerating progress in open-source multimodal intelligence. This initiative is designed to be both collaborative and competitive, encouraging breakthroughs at the intersection of vision, language, audio, and 3D. It is a unique event that welcomes researchers from all disciplines and application areas across multimodal learning.

The Workshop

The workshop invites the broader scientific community to contribute. Research areas of focus are highlighted below:

  • Multimodal foundation and world models
  • Multimodal fusion, alignment, representation learning, co-learning and transfer learning
  • Joint learning of language, vision and audio
  • Multimodal commonsense and reasoning
  • Multimodal healthcare
  • Multimodal educational systems
  • Multimodal RL and control
  • Multimodal AI for science
  • Multimodal and multimedia resources
  • Multimodal dialogue, affect, and social intelligence
  • Creative applications of multimodal learning in e-commerce, art, and other impact areas

Accepted submissions will be featured on the website and presented during the culminating symposium.

The Grand Challenge - Language, Vision, Audio, 3D

An 80-day competition to build the blueprints of some of the most capable multimodal superintelligence systems.

  • Vision: Not to produce bold numbers, but to create breakthrough proofs of concept.
  • Criteria: The idea matters; you have 80 days to show that it works. We do not expect a fully functional foundation model; that comes after the challenge.
  • Who: Up to 15 teams selected from the global AI research community
  • What: Compete to build open-source systems capable of unified multimodal understanding and reasoning
  • When: The challenge lasts 80 days
  • Support: Each team receives a monthly compute grant of $2,000-$5,000, provided by Lambda
  • How: Participants will work under shared benchmarks, with intermediate check-ins, leaderboards, and collaborative opportunities

This is not just a competition — it's a movement to build AI that sees, hears, reads, speaks, and reasons.

Learn More

Further details about participation, paper submission, evaluation criteria, timelines, and open-source infrastructure can be found throughout this website.

Let’s Build the Future — Together.



Important Dates


The following are the important dates for the events:

All deadlines are 11:59 PM anywhere on earth (AoE).

  • Challenge Proposal Deadline: August 1
  • Challenge Proposal Acceptance: August 5
  • Challenge Begins: August 10
  • Workshop Paper Deadline: August 20
  • Workshop Paper Acceptance: September 20
  • Challenge Final GitHub Deadline: November 10
  • Challenge Final Training Deadline (no training afterwards): December 1
  • Workshop Day / Challenge Winners Announced: Early December

Workshop


The workshop track allows submissions covering the following research areas:
  • Multimodal foundation and world models
  • Multimodal fusion, alignment, representation learning, co-learning and transfer learning
  • Joint learning of language, vision and audio
  • Multimodal commonsense and reasoning
  • Multimodal healthcare
  • Multimodal educational systems
  • Multimodal RL and control
  • Multimodal AI for science
  • Multimodal and multimedia resources
  • Multimodal dialogue, affect, and social intelligence
  • Creative applications of multimodal learning in e-commerce, art, and other impact areas

Keynotes

The following are the keynotes for the workshop.

Dr. Kristen Grauman is a Professor in the Department of Computer Science at the University of Texas at Austin. She is an IEEE Fellow, AAAS Fellow, AAAI Fellow, Sloan Fellow, and Microsoft Research New Faculty Fellow, and a recipient of the NSF CAREER and ONR Young Investigator awards, the PAMI Young Researcher Award (2013), the Computers and Thought Award from the International Joint Conference on Artificial Intelligence (IJCAI, 2013), and the Presidential Early Career Award for Scientists and Engineers (PECASE, 2013).

Dr. Amir Zamir is an Assistant Professor of Computer Science at the Swiss Federal Institute of Technology (EPFL). His research interests are in computer vision, machine learning, and perception-for-robotics. Prior to joining EPFL in 2020, he spent time at UC Berkeley, Stanford, and UCF. He has been recognized with the CVPR 2018 Best Paper Award, the CVPR 2016 Best Student Paper Award, a CVPR 2020 Best Paper Award nomination, and the NVIDIA Pioneering Research Award (2018). His research has been covered by popular press outlets such as The New York Times and Forbes.

Dr. Dorsa Sadigh is an Assistant Professor of Computer Science at Stanford University. Her research interests lie at the intersection of robotics, machine learning, and human-AI interaction. She has received the Sloan Fellowship, NSF CAREER Award, ONR Young Investigator Award, AFOSR Young Investigator Award, DARPA Young Faculty Award, Okawa Foundation Fellowship, MIT TR35, and the IEEE RAS Early Academic Career Award.

Submission Portal

Schedule

The following is the workshop schedule.
  • 9:00–9:15: Opening
  • 9:15–10:30: Keynote 1
  • 10:30–11:45: Keynote 2
  • 11:45–13:00: Lunch Break
  • 13:00–14:00: Oral Session for Best Paper and Runner Up
  • 14:00–15:15: Keynote 3
  • 15:15–15:30: Buffer
  • 15:30–17:00: Posters, Panel and Discussion
  • 18:00–21:00: Authors and Organizers Dinner (Optional to Attend)

Grand Challenge


Outline

The Grand Challenge encompasses four core data modalities. Each modality is encountered through a variety of practical “views” that researchers and engineers commonly work with:
  1. Text: UTF-8 encoded English words that appear as automatic transcripts, question–answer pairs, subtitle files, and free-form descriptive paragraphs. Text serves both as a control signal (prompts) and as an evaluation target (e.g. BLEU, ROUGE, BERTScore).
  2. Audio: raw PCM waveforms capturing speech, music, or multi-speaker conversation. The waveform representation preserves prosody and timing cues that are invisible in plain transcripts, enabling tasks such as emotion transfer or speech super-resolution.
  3. Video: either a single RGB frame or a sequence of frames from any domain—documentaries, gaming footage, instructional clips—optionally accompanied by a synchronized audio track and timestamped captions. Video combines spatial and temporal reasoning in a way no other modality can.
  4. 3-D: voxel grids (or voxel sequences) describing geometry in three spatial dimensions. These grids may depict a static object observed from multiple angles or a dynamic scene that evolves over time, suitable for tasks such as view-consistent generation, physics prediction, and embodied planning.

The central goal is to create any-to-any multimodal models. Participants are therefore required to accept prompts containing any subset of the modalities above and to generate any subset in response. Custom data-loader utilities will be provided so that preprocessing effort is minimal and teams can focus on modeling innovations rather than file-format wrangling.
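To make the any-to-any requirement concrete, the sketch below illustrates one hypothetical way a multimodal prompt and its requested outputs could be represented in code. The class name, field names, and shapes are our own assumptions for illustration; the official data-loader utilities may use different conventions.

    # Minimal sketch of a hypothetical any-to-any sample (Python).
    # Field names and shapes are illustrative only; the provided data-loader
    # utilities may differ.
    from dataclasses import dataclass
    from typing import Optional, Tuple
    import numpy as np

    @dataclass
    class MultimodalSample:
        text: Optional[str] = None             # UTF-8 prompt, transcript, or caption
        audio: Optional[np.ndarray] = None     # raw PCM waveform, shape (num_samples,)
        video: Optional[np.ndarray] = None     # RGB frames, shape (T, H, W, 3)
        voxels: Optional[np.ndarray] = None    # voxel grid, shape (D, H, W) or (T, D, H, W)
        target_modalities: Tuple[str, ...] = ("text",)  # modalities the model should generate

    # Example: a text + video + audio prompt asking for a text answer and a 3-D output
    sample = MultimodalSample(
        text="Describe the scene and reconstruct the main object in 3-D.",
        audio=np.zeros(16_000, dtype=np.float32),          # 1 second of silence at 16 kHz
        video=np.zeros((8, 224, 224, 3), dtype=np.uint8),  # 8 RGB frames
        target_modalities=("text", "voxels"),
    )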

You are permitted to specialize in only one or two modalities, but you must justify, with empirical evidence or a compelling theoretical argument, why this specialized route is likely to outperform state-of-the-art systems in that slice of the problem space.

Evaluation metrics (detailed later on this page) combine automatic scores tailored to each output type—e.g. BLEU for text, PESQ/LSD for audio, CLIP retrieval for images and video, Chamfer-L2 or F-score for 3-D—with a final round of expert review to judge qualitative aspects that numbers may miss.
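As one concrete example of these automatic scores, the sketch below computes a symmetric Chamfer-L2 distance between two point sets (for instance, points sampled from a predicted and a reference voxel grid). The point sampling, normalization, and exact variant used by the official evaluation may differ.

    # Minimal sketch of a symmetric Chamfer-L2 distance between point sets (Python).
    # Normalization and sampling choices here are assumptions; the official
    # evaluation scripts may differ.
    import numpy as np

    def chamfer_l2(pred: np.ndarray, ref: np.ndarray) -> float:
        """pred: (N, 3) predicted points; ref: (M, 3) reference points."""
        # Pairwise squared Euclidean distances, shape (N, M)
        d2 = np.sum((pred[:, None, :] - ref[None, :, :]) ** 2, axis=-1)
        # Mean nearest-neighbor distance in both directions
        return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

    # Toy usage with random point clouds
    pred = np.random.rand(512, 3)
    ref = np.random.rand(1024, 3)
    print(f"Chamfer-L2: {chamfer_l2(pred, ref):.4f}")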

The challenge runs for 80 days, including two intermediate checkpoints at days 30 and 60 where teams submit progress snapshots. Upon completion, the organizing chairs will invite a select group of top-performing teams to collaborate with Lambda. Those teams will receive substantial GPU credits and engineering support to scale their work into publicly available, open-source foundation models for the research community.

Participation

The application portal for the Grand Challenge opens on June 20, 2025 and closes on August 1, 2025 at 23:59 anywhere on earth (AoE). Submissions received after the AoE cut-off will not be considered.

Applicants must upload a proposal written in the official NeurIPS 2025 submission format. The main body is limited to three pages; reference pages are unrestricted. Exactly four sections are permitted:

  1. Introduction: Present the overarching narrative of your approach and explain how it departs from prior art.
  2. Idea: Detail the core concept—model architecture, specialised data processing pipelines, and, most importantly, the novelty you claim.
  3. 60-Day Implementation Plan: Provide a week-by-week roadmap for coding, experimentation, and intermediate milestones during the first 60 days.
  4. 20-Day Final Execution Plan: Describe the procedures for the final training run during the last 20 days. No code modifications are allowed during this window.

All material submitted at this stage remains strictly confidential. Only the Grand Challenge chairs will access the files for review, and all submissions will be permanently deleted once the decision process is complete.

Applicants will be notified of acceptance or rejection via e-mail no later than August 5 2025. Teams selected for the next phase will receive detailed onboarding instructions and immediate access to the required compute resources.

Public Train Data

The main task of the grand challenge is building any-to-any multimodal models. Any publicly available training data may be used, in full or in part. Participants are required to disclose the datasets they used so that their work is reproducible. If a publicly available dataset is modified in any way, the modifications must also be released publicly.
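One lightweight way to satisfy the disclosure requirement is a machine-readable manifest released alongside the code. The sketch below is only a suggested format, not an official schema; the example entry uses LibriSpeech purely for illustration.

    # Sketch of a dataset-disclosure manifest written as JSON (Python).
    # The schema is a suggestion only, not an official requirement.
    import json

    manifest = {
        "datasets": [
            {
                "name": "LibriSpeech",                      # public dataset used
                "url": "https://www.openslr.org/12",
                "subset": "train-clean-360",                # portion of the dataset used
                "modifications": "resampled to 16 kHz; leading/trailing silence trimmed",
                "modified_copy_url": "<link to the released modified copy, if any>",
            },
        ],
    }

    with open("data_manifest.json", "w") as f:
        json.dump(manifest, f, indent=2)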

Public Test Evaluation

Measuring model performance on public test data is allowed for publication purposes but is not required. Public test data will not be used to judge participants in the grand challenge; that judgment is based on the private test set. Participants who do use public test sets should take care to make fair comparisons with related work. We suggest spending minimal time on this type of evaluation during the challenge, since the final evaluation uses a test set that is not publicly available.

Private Test Evaluation

The main test platform of the grand challenge is a private test set. This private test set will not be released to the public, even after the challenge is over. The main test bed is QA-style. The private test data is provided to your code through LAILA. The following GitHub repository contains example code that will guide you through the evaluation process.
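The sketch below only illustrates the general shape of a QA-style evaluation loop; the answer_fn entry point and item fields are hypothetical placeholders, and the real interface is defined by LAILA and the linked example code.

    # Generic sketch of a QA-style evaluation loop (Python). The answer_fn signature
    # and item fields are hypothetical; follow the LAILA example code for the actual
    # interface used by the private test harness.
    from typing import Any, Callable, Dict, Iterable

    def evaluate_qa(answer_fn: Callable[[Dict[str, Any]], str],
                    items: Iterable[Dict[str, Any]]) -> float:
        """Exact-match accuracy over items of the form
        {"question": str, "context": {...multimodal inputs...}, "reference": str}."""
        correct, total = 0, 0
        for item in items:
            prediction = answer_fn(item)  # model produces a text answer
            correct += int(prediction.strip().lower() == item["reference"].strip().lower())
            total += 1
        return correct / max(total, 1)

    # Toy usage with a trivial "model"
    toy_items = [{"question": "What is 2 + 2?", "context": {}, "reference": "4"}]
    print(evaluate_qa(lambda item: "4", toy_items))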

Final Submission

Submissions are made through OpenReview. The following portal can be used to submit each team's final report for the grand challenge. Teams must also release a GitHub link that will be used for the final evaluation of the models when determining the winners. The released code must use LAILA in order to participate in the private test evaluation. The GitHub repository should be self-contained and should run on a machine with 8x H100 GPUs with 80 GB of VRAM each (quantization is accepted; any such quantization must be done by the authors).
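As a quick sanity check that a released pipeline fits the stated hardware budget, teams could run something like the following at startup. This is a rough sketch that assumes PyTorch with CUDA is available; the organizers' evaluation machine remains the ground truth.

    # Sketch of a startup check against the stated 8x H100 (80 GB each) budget (Python).
    # Purely illustrative; assumes PyTorch with CUDA is installed.
    import torch

    REQUIRED_GPUS = 8
    REQUIRED_VRAM_GB = 80

    def check_hardware() -> None:
        n = torch.cuda.device_count()
        assert n >= REQUIRED_GPUS, f"expected {REQUIRED_GPUS} GPUs, found {n}"
        for i in range(n):
            total_gb = torch.cuda.get_device_properties(i).total_memory / 1024**3
            # Allow a small margin, since an 80 GB card reports slightly less than 80 GiB
            assert total_gb >= REQUIRED_VRAM_GB * 0.95, (
                f"GPU {i} has only {total_gb:.0f} GB, below the {REQUIRED_VRAM_GB} GB budget"
            )

    if __name__ == "__main__":
        check_hardware()
        print("Hardware check passed.")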

Rules & Regulations

Use of AI

Participants are more than welcome to use AI in any capacity, including but not limited to the following:
  1. Coding, debugging, and consulting.
  2. Data cleanup, augmentation, and verification.
  3. Releasing clean code to GitHub.
  4. Writing the paper.
Up to 100% of the above can be AI-generated. However, participants should verify AI-generated content to make sure it remains true and correct. In particular, the main table of results must remain verifiable and correct.

Policy

Academic Conduct

Lambda condemns harassment, discrimination, and racism. Any such behavior before, during, or after the grand challenge may lead to disqualification or rejection of papers.

Use of AI

Both grand challenge and workshop participants may use AI in experiments and writing, as long as they fact-check the generated content and ensure it reflects reality and reports correct results.

Workshop Staff

Keynotes

  • Kristen Grauman, University of Texas at Austin
  • Dorsa Sadigh, Stanford
  • Amir Zamir, EPFL


Organizers and advisors

  • Amir Zadeh, Staff ML Researcher, Lambda
  • Chuan Li, Chief Science Officer, Lambda
  • Soujanya Poria, Associate Professor, Nanyang Technological University
  • Mohit Iyyer, Associate Professor, University of Maryland
  • Sewon Min, Assistant Professor, UC Berkeley
  • Mohit Bansal, Distinguished Professor, UNC Chapel Hill
  • Joyce Chai, Professor, University of Michigan
  • Shri Narayanan, Professor, University of Southern California
  • Katerina Fragkiadaki, Associate Professor, Carnegie Mellon University
  • Paul Sebexen, Head of Special Projects, Lambda
  • Hamed Zamani, Associate Professor, University of Massachusetts Amherst
  • Tal Daniel, Postdoc, Carnegie Mellon University
  • Thomas Bordes, Head of Marketing, Lambda
  • Jessica Nicholson, ML Engineer, Lambda, University of Bath
  • Ranjay Krishna, Assistant Professor, University of Washington
  • Taylor Gautreaux, Data Research, Lambda


Junior organizers

  • Jed Yang, University of Michigan
  • Mihir Prabhudesai, Carnegie Mellon University
  • Rahul Duggal, Lambda
  • Lea Alcantara, Lambda


Secretary

  • Nicole Espinosa, University Relations, Lambda