OSKAR: Omnimodal Self-supervised Knowledge Abstraction and Representation

Mohamed Abdelfattah* Kaouther Messoud* Alexandre Alahi

École Polytechnique Fédérale de Lausanne (EPFL)
firstname.lastname@epfl.ch



Teaser

OSKAR is a self-supervised multimodal foundation model that learns in the latent space using a fuse-then-predict strategy. It combines multiple modalities to capture paired cross-modal features; latent predictions are then matched against targets produced by modality-specific momentum encoders. This design preserves uni-modal structure while enabling rich cross-modal learning. Remarkably, fine-tuning our unified encoder outperforms specialized models on video, skeleton, and text downstream tasks, demonstrating the versatility of the learned multimodal representations.
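As a rough sketch of the momentum mechanism, the snippet below shows one common way to maintain a modality-specific target encoder as an exponential moving average of the trained encoder. The module names and the momentum value are illustrative placeholders, not OSKAR's exact configuration.

import torch

@torch.no_grad()
def ema_update(online_encoder, target_encoder, momentum=0.996):
    # target <- momentum * target + (1 - momentum) * online
    for p_tgt, p_online in zip(target_encoder.parameters(),
                               online_encoder.parameters()):
        p_tgt.mul_(momentum).add_(p_online, alpha=1.0 - momentum)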

Abstract

We present OSKAR, the first multimodal foundation model based on bootstrapped latent feature prediction. Unlike generative or contrastive methods, it avoids memorizing unnecessary details (e.g., pixels) and does not require negative pairs, large memory banks, or hand-crafted augmentations. We propose a novel pretraining strategy: given masked tokens from multiple modalities, predict a subset of missing tokens per modality, supervised by momentum-updated uni-modal target encoders. This design efficiently uses the model capacity to learn high-level representations while retaining modality-specific information. Further, we propose a scalable design that decouples the compute cost from the number of modalities using a fixed representative token budget, applied to both input and target tokens, and introduces a parameter-efficient cross-attention predictor that grounds each prediction in the full multimodal context. Without loss of generality, we instantiate OSKAR on video, skeleton, and text modalities. Extensive experimental results show that OSKAR's unified pretrained encoder outperforms similarly sized models with specialized architectures in action recognition (RGB, skeleton, frozen, low-shot) and localization, video-text retrieval, and video question answering.
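One way to read the fixed representative token budget is as a per-step subsampling of tokens so that the total sequence length fed to the encoder, and hence its cost, stays constant as modalities are added. The sketch below is only an interpretation with made-up names (sample_token_budget, an even per-modality split); the paper's exact sampling scheme may differ.

import torch

def sample_token_budget(tokens_per_modality, total_budget):
    # Subsample a fixed total number of tokens across modalities so the
    # encoder sequence length does not grow with the number of modalities.
    # tokens_per_modality: dict of modality name -> tensor of shape (B, N_m, D)
    per_modality = total_budget // len(tokens_per_modality)  # even split for simplicity
    sampled = {}
    for name, tok in tokens_per_modality.items():
        B, N, D = tok.shape
        k = min(per_modality, N)
        # Pick a random subset of token indices for each sample in the batch.
        idx = torch.stack([torch.randperm(N, device=tok.device)[:k] for _ in range(B)])
        sampled[name] = torch.gather(tok, 1, idx.unsqueeze(-1).expand(B, k, D))
    return sampled

Under this reading, a budget shared by video, skeleton, and text means that adding a further modality only changes how the budget is divided, not the encoder's input length.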

Framework


OSKAR Architecture. Multimodal tokens (e.g., video, skeleton, text) are projected into a shared space and split into two branches: (1) a fusion encoder processes visible tokens to produce fused representations; (2) modality-specific target encoders generate target embeddings. A predictor estimates the masked representations from the fused tokens and is supervised with an MSE loss against the targets. Target encoders are updated as an EMA of the fusion encoder and are used only during pretraining.
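To make the two-branch flow concrete, here is a minimal PyTorch-style sketch of one pretraining step. The call signatures (a fusion encoder taking a dict of visible tokens, a predictor taking a modality keyword, a dict of per-modality target encoders) are assumptions for illustration, not OSKAR's actual interfaces.

import torch
import torch.nn.functional as F

def pretraining_step(batch, fusion_encoder, predictor, target_encoders):
    # batch: dict of modality name -> (visible_tokens, masked_tokens)
    # 1) Fuse the visible tokens of all modalities into a shared context.
    fused = fusion_encoder({m: vis for m, (vis, _) in batch.items()})
    loss = 0.0
    for modality, (_, masked_tokens) in batch.items():
        # 2) Predict the latent representation of this modality's missing
        #    tokens from the full multimodal context.
        pred = predictor(fused, modality=modality)
        # 3) Targets come from the modality-specific momentum encoder
        #    and receive no gradient.
        with torch.no_grad():
            target = target_encoders[modality](masked_tokens)
        # 4) Regress predictions onto targets in latent space.
        loss = loss + F.mse_loss(pred, target)
    return loss / len(batch)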

Citation

@inproceedings{abdelfattah2024sjepa,
  title     = {OSKAR: Omnimodal Self-supervised Knowledge Abstraction and Representation},
  author    = {Abdelfattah, Mohamed and Messoud, Kaouther and Alahi, Alexandre},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2025}
}
