Overview
The Workshop on Multimodal Superintelligence is a global gathering of researchers, engineers, and visionaries committed to accelerating progress in open-source multimodal intelligence. The initiative is designed to be both collaborative and competitive, encouraging breakthroughs at the intersection of vision, language, audio, and 3D. It welcomes researchers from all disciplines and application areas across multimodal learning.
The Workshop
The workshop invites the broader scientific community to contribute. Research areas of focus include:
- Multimodal foundation and world models
- Multimodal fusion, alignment, representation learning, co-learning and transfer learning
- Joint learning of language, vision and audio
- Multimodal commonsense and reasoning
- Multimodal healthcare
- Multimodal educational systems
- Multimodal RL and control
- Multimodal AI for science
- Multimodal and multimedia resources
- Multimodal dialogue, affect, and social intelligence
- Creative applications of multimodal learning in e-commerce, art, and other impact areas
Accepted submissions will be featured on the website and presented during the culminating symposium.
The Grand Challenge - Language, Vision, Audio, 3D
A challenge (ending December 10, 2025) to build the blueprints of some of the most capable open-source multimodal superintelligence systems. The first stage (through December 10) focuses on enabling ideas and pushing open source forward. Winners of the first stage may have their models further supported by Lambda and trained for a public foundation model release.
- Vision: Not to produce bold numbers, but to create breakthrough proofs of concept.
- Criteria: The idea matters; you have until December 10 to show that it works. We do not expect a fully functional foundation model. That comes after the challenge.
- Support: Each team receives a monthly compute grant of $2,000-$5,000, provided by Lambda.
This is not just a competition — it's a movement to build AI that sees, hears, reads, speaks, and reasons.
Let's Build the Future of Multimodal Machine Learning — Together.