OSKAR: Omnimodal Self-supervised Knowledge Abstraction and Representation
Mohamed Abdelfattah*
Kaouther Messoud*
Alexandre Alahi
École Polytechnique Fédérale de Lausanne (EPFL)
firstname.lastname@epfl.ch
Abstract
We present OSKAR, the first multimodal foundation model based on bootstrapped latent feature prediction. Unlike generative or contrastive methods, it avoids memorizing unnecessary details (e.g., pixels) and does not require negative pairs, large memory banks, or hand-crafted augmentations. We propose a novel pretraining strategy: given masked tokens from multiple modalities, predict a subset of the missing tokens per modality, supervised by momentum-updated uni-modal target encoders. This design uses model capacity efficiently to learn high-level representations while retaining modality-specific information. Further, we propose a scalable design that decouples the compute cost from the number of modalities using a fixed representative token budget (for both input and target tokens) and introduces a parameter-efficient cross-attention predictor that grounds each prediction in the full multimodal context. Without loss of generality, we instantiate OSKAR on video, skeleton, and text modalities. Extensive experimental results show that OSKAR's unified pretrained encoder outperforms models with specialized architectures of similar size in action recognition (RGB, skeleton, frozen, low-shot) and localization, video-text retrieval, and video question answering.
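To make the fixed representative token budget concrete, the sketch below samples a fixed total number of visible (input) and masked (target) tokens and splits it evenly across modalities, so compute does not grow as modalities are added. The function name, the even per-modality split, and the budget values are illustrative assumptions, not the paper's exact sampling procedure.

import torch

def sample_token_budget(tokens_per_modality, input_budget, target_budget, generator=None):
    """Keep a fixed total number of visible (input) and masked (target) tokens,
    split evenly across modalities, so compute stays constant as modalities
    are added. Hypothetical sketch, not the paper's exact sampling scheme."""
    num_mod = len(tokens_per_modality)
    in_per_mod = input_budget // num_mod
    tgt_per_mod = target_budget // num_mod
    visible, targets = {}, {}
    for name, tokens in tokens_per_modality.items():  # tokens: (N_m, D)
        perm = torch.randperm(tokens.shape[0], generator=generator)
        visible[name] = tokens[perm[:in_per_mod]]                          # fed to the fusion encoder
        targets[name] = tokens[perm[in_per_mod:in_per_mod + tgt_per_mod]]  # representations to predict
    return visible, targets

# Example: three modalities with very different token counts share one budget.
feats = {"video": torch.randn(196, 768), "skeleton": torch.randn(50, 768), "text": torch.randn(32, 768)}
vis, tgt = sample_token_budget(feats, input_budget=48, target_budget=24)
print({m: (vis[m].shape[0], tgt[m].shape[0]) for m in feats})  # 16 visible / 8 target tokens each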
Framework
OSKAR Architecture. Multimodal tokens (e.g., video, skeleton, text) are projected into a shared space and routed to two branches: (1) a fusion encoder processes the visible tokens to produce fused representations; (2) modality-specific target encoders generate target embeddings. A predictor estimates the masked representations from the fused tokens and is supervised with an MSE loss against the targets. Target encoders are updated as an EMA of the fusion encoder and are used only during pretraining; the fusion encoder is the unified encoder kept for downstream fine-tuning.
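The minimal PyTorch sketch below illustrates one pretraining step under this design with toy stand-ins: the fusion encoder processes only the visible tokens, a cross-attention predictor estimates the masked representations from the fused context, the loss is MSE against momentum target embeddings, and target encoders track the fusion encoder by EMA. Module names, sizes, the query mechanism, and feeding only the masked tokens to the target encoders are simplifying assumptions, not the paper's exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

D = 64  # toy embedding width; all sizes here are illustrative

class ToyEncoder(nn.Module):
    """Stand-in for the fusion / target encoders (identical architecture so
    EMA weight tracking is well defined)."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D, D)
    def forward(self, x):               # x: (N, D)
        return self.proj(x)

class ToyPredictor(nn.Module):
    """Cross-attention predictor stand-in: per-modality queries attend to the
    fused multimodal context to estimate the masked representations."""
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)
    def forward(self, queries, context):  # queries: (M, D), context: (N, D)
        out, _ = self.attn(queries.unsqueeze(0), context.unsqueeze(0), context.unsqueeze(0))
        return out.squeeze(0)

modalities = ("video", "skeleton", "text")
fusion_encoder = ToyEncoder()
target_encoders = {m: ToyEncoder() for m in modalities}
predictor = ToyPredictor()
optimizer = torch.optim.AdamW(
    list(fusion_encoder.parameters()) + list(predictor.parameters()), lr=1e-4)

def ema_update(momentum=0.996):
    # Target encoders track the fusion encoder via an exponential moving average.
    with torch.no_grad():
        for enc in target_encoders.values():
            for p_t, p_f in zip(enc.parameters(), fusion_encoder.parameters()):
                p_t.mul_(momentum).add_(p_f, alpha=1.0 - momentum)

def pretrain_step(visible, masked, mask_queries):
    # Fusion encoder sees only the visible tokens, concatenated across modalities.
    context = fusion_encoder(torch.cat(list(visible.values()), dim=0))
    loss = 0.0
    for m in masked:
        with torch.no_grad():                       # EMA targets carry no gradient
            target = target_encoders[m](masked[m])
        pred = predictor(mask_queries[m], context)  # predict masked reps from the full context
        loss = loss + F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update()
    return float(loss)

# One toy step: per-modality visible tokens, masked tokens, and positional queries.
vis = {m: torch.randn(16, D) for m in modalities}
msk = {m: torch.randn(8, D) for m in modalities}
qry = {m: torch.randn(8, D) for m in modalities}
print(pretrain_step(vis, msk, qry))

The stop-gradient on the targets and the slow EMA update are the standard ingredients that keep this kind of bootstrapped latent prediction from collapsing without negative pairs.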
Citation
@inproceedings{abdelfattah2025oskar,
  title     = {OSKAR: Omnimodal Self-supervised Knowledge Abstraction and Representation},
  author    = {Abdelfattah, Mohamed and Messoud, Kaouther and Alahi, Alexandre},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2025}
}