Overview
The Workshop on Multimodal Superintelligence is a global gathering of researchers, engineers, and visionaries committed to accelerating progress in open-source multimodal intelligence. The initiative is designed to be both collaborative and competitive, encouraging breakthroughs at the intersection of vision, language, audio, and 3D. It welcomes researchers from all disciplines and application areas across multimodal learning.
The Workshop
The workshop invites the broader scientific community to contribute. Research areas of focus include:
- Multimodal foundation and world models
- Multimodal fusion, alignment, representation learning, co-learning and transfer learning
- Joint learning of language, vision and audio
- Multimodal commonsense and reasoning
- Multimodal healthcare
- Multimodal educational systems
- Multimodal RL and control
- Multimodal AI for science
- Multimodal and multimedia resources
- Multimodal dialogue, affect, and social intelligence
- Creative applications of multimodal learning in e-commerce, art, and other impact areas
Accepted submissions will be featured on the website and presented during the culminating symposium.
The Grand Challenge - Language, Vision, Audio, 3D
A challenge (ending December 10, 2025) to build the blueprints of some of the most capable open-source multimodal superintelligence systems. The first stage (through December 10) focuses on enabling ideas and pushing open source forward. Winners of the first stage may have their models further supported by Lambda and trained for a public foundation model release.
- Vision: Not to produce bold numbers, but to create breakthrough proofs of concept.
- Criteria: The idea matters; you have until December 10 to show that it works. We do not expect a fully functional foundation model. That comes after the challenge.
- Support: Each team receives a monthly compute grant of $2,000-$5,000, provided by Lambda.
This is not just a competition — it's a movement to build AI that sees, hears, reads, speaks, and reasons.
Let's Build the Future of Multimodal Machine Learning — Together.