Bootstrapped latent prediction
OSKAR learns in feature space rather than reconstructing pixels, allowing model capacity to focus on semantic structure and transferable abstractions.
Abstract
We present OSKAR, the first multimodal foundation model based on bootstrapped latent feature prediction. Unlike generative methods, it avoids spending capacity on memorizing unnecessary low-level detail; unlike contrastive methods, it requires no negative pairs, large memory banks, or hand-crafted augmentations.
We propose a novel pretraining strategy: given partially masked token sequences from multiple modalities, predict the latent features of a subset of the missing tokens per modality, supervised by momentum-updated uni-modal target encoders. This design spends model capacity on high-level representations while retaining modality-specific information.
Further, we propose a scalable design that decouples compute cost from the number of modalities using a fixed representative token budget and a parameter-efficient cross-attention predictor. We instantiate OSKAR on video, skeleton, and text modalities, and show that its unified pretrained encoder outperforms specialized models across action recognition, localization, video-text retrieval, and video question answering.
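To make the bootstrapped supervision concrete, the snippet below is a minimal sketch (not the released implementation) of how a momentum-updated uni-modal target encoder can supervise latent prediction. The EMA decay, the smooth-L1 loss, and the stand-in Linear encoder are illustrative assumptions, not OSKAR's actual components.

# Minimal sketch of bootstrapped latent targets: an online prediction is regressed onto
# the output of a momentum-updated (EMA) copy of the uni-modal encoder. Decay value,
# loss, and the Linear stand-in encoder are assumptions for illustration only.
import copy
import torch
import torch.nn.functional as F


def ema_update(online: torch.nn.Module, target: torch.nn.Module, decay: float = 0.999) -> None:
    """Update target-encoder parameters as an exponential moving average of the online ones."""
    with torch.no_grad():
        for p_t, p_o in zip(target.parameters(), online.parameters()):
            p_t.mul_(decay).add_(p_o, alpha=1.0 - decay)


def latent_prediction_loss(pred: torch.Tensor, target_feats: torch.Tensor) -> torch.Tensor:
    """Regress predicted latents onto stop-gradient target features of the masked tokens."""
    return F.smooth_l1_loss(pred, target_feats.detach())


# Usage: the target encoder starts as a frozen copy of the online uni-modal encoder.
online_encoder = torch.nn.Linear(256, 256)          # stand-in for a uni-modal encoder
target_encoder = copy.deepcopy(online_encoder)
for p in target_encoder.parameters():
    p.requires_grad_(False)

tokens = torch.randn(8, 32, 256)                    # (batch, masked tokens, dim)
loss = latent_prediction_loss(online_encoder(tokens), target_encoder(tokens))
loss.backward()
ema_update(online_encoder, target_encoder)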
Method
OSKAR’s end-to-end training recipe. During pretraining, visible tokens from multiple modalities are fused in a shared latent space, while modality-specific predictors regress the latents of masked tokens against targets produced by momentum-updated uni-modal encoders.
The same unified representation then transfers to a broad set of downstream tasks, including action recognition, action localization, text-video retrieval, and video question answering. This is the core promise of OSKAR: a single multimodal pretraining objective with strong transfer across video, motion, and language tasks.
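As a rough illustration of this recipe, the following sketch mimics one pretraining step for two modalities: mask tokens per modality, fuse the visible tokens in a shared space, and regress predicted latents onto momentum target-encoder features. The module names, masking ratio, pooling, and toy per-modality predictor are assumptions made for brevity, not OSKAR's actual architecture.

# Hedged sketch of one OSKAR-style pretraining step with two modalities.
# Only the overall flow (mask -> fuse visible tokens -> predict masked latents against
# momentum target encoders) follows the description above; everything else is a stand-in.
import torch
import torch.nn.functional as F

D = 256
fusion_encoder = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True),
    num_layers=2,
)
# Stand-ins for modality-specific predictors and momentum-updated uni-modal target encoders.
predictors = torch.nn.ModuleDict({m: torch.nn.Linear(D, D) for m in ("video", "skeleton")})
target_encoders = {m: torch.nn.Linear(D, D) for m in ("video", "skeleton")}


def pretrain_step(batch: dict, mask_ratio: float = 0.75) -> torch.Tensor:
    """batch maps modality name -> token tensor of shape (batch, tokens, dim)."""
    visible, masked = {}, {}
    for m, tok in batch.items():
        perm = torch.randperm(tok.shape[1])
        split = int(tok.shape[1] * (1 - mask_ratio))
        visible[m], masked[m] = tok[:, perm[:split]], perm[split:]

    # Fuse the visible tokens of all modalities in a shared latent space.
    fused = fusion_encoder(torch.cat(list(visible.values()), dim=1))
    context = fused.mean(dim=1, keepdim=True)  # pooled multimodal context (placeholder)

    # Each modality's predictor regresses masked-token latents onto the output of that
    # modality's momentum-updated target encoder (no gradient through the targets).
    loss = torch.zeros(())
    for m, tok in batch.items():
        with torch.no_grad():
            targets = target_encoders[m](tok)[:, masked[m]]
        preds = predictors[m](context).expand_as(targets)  # toy predictor for illustration
        loss = loss + F.smooth_l1_loss(preds, targets)
    return loss


loss = pretrain_step({m: torch.randn(2, 64, D) for m in ("video", "skeleton")})
loss.backward()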
OSKAR Features
The model fuses visible multimodal context and predicts missing latent targets for each modality using momentum-updated uni-modal encoders.
OSKAR decouples compute from the number of modalities with a fixed representative token budget and an efficient cross-attention predictor (a minimal sketch follows this list).
Separate target encoders preserve uni-modal structure while still enabling rich alignment and interaction across modalities.
A single pretrained encoder transfers to action recognition, action localization, text-video retrieval, and video question answering.
OSKAR provides one coherent framework for dense, sparse, and symbolic modalities while remaining compatible with standard transformer pipelines.
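The sketch below illustrates the fixed representative token budget mentioned in the list: a constant number of learned query tokens cross-attends to the concatenation of all modality tokens, so adding a modality only lengthens the key/value sequence rather than growing the query budget. The class name, sizes, and single-layer design are assumptions, not the paper's exact predictor.

# Illustrative (assumed) cross-attention predictor with a fixed representative token budget.
import torch


class BudgetedCrossAttentionPredictor(torch.nn.Module):
    def __init__(self, dim: int = 256, budget: int = 32, num_heads: int = 4):
        super().__init__()
        # Fixed number of representative query tokens, independent of modality count.
        self.queries = torch.nn.Parameter(torch.randn(1, budget, dim) * 0.02)
        self.cross_attn = torch.nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = torch.nn.Linear(dim, dim)

    def forward(self, modality_tokens: list) -> torch.Tensor:
        """modality_tokens: list of (batch, tokens_m, dim) tensors, one per modality."""
        context = torch.cat(modality_tokens, dim=1)          # (batch, sum of tokens, dim)
        q = self.queries.expand(context.shape[0], -1, -1)
        summary, _ = self.cross_attn(q, context, context)    # (batch, budget, dim)
        return self.proj(summary)                            # representative tokens


# Adding a modality grows only the key/value length, not the fixed query budget.
predictor = BudgetedCrossAttentionPredictor()
out = predictor([torch.randn(2, 196, 256), torch.randn(2, 50, 256), torch.randn(2, 20, 256)])
print(out.shape)  # torch.Size([2, 32, 256])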
Selected Results
Action Recognition: 86.1% top-1 accuracy on K400
Skeleton Recognition: 91.1% accuracy on NTU120-XSub
Action Localization: 37.9 mAP on AVA
Text-Video Retrieval: 50.4 R@1 on MSRVTT
Video Question Answering: 49.3% accuracy on MSRVTT-QA
OSKAR also performs strongly in the low-sample, low-parameter, and low-label settings reported in the paper.
Citation
If OSKAR is useful in your research, please cite:
@inproceedings{abdelfattah2025oskar,
  title={{OSKAR}: Omnimodal Self-supervised Knowledge Abstraction and Representation},
  author={Mohamed O Abdelfattah and Kaouther Messaoud and Alexandre Alahi},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=LWuhOoHpo5}
}