Bootstrapped latent prediction
OSKAR learns in feature space rather than reconstructing pixels, allowing model capacity to focus on semantic structure and transferable abstractions.
Abstract
We present OSKAR, the first multimodal foundation model based on bootstrapped latent feature prediction. Unlike generative methods, it avoids spending capacity on memorizing unnecessary low-level detail; unlike contrastive methods, it requires no negative pairs, large memory banks, or hand-crafted augmentations.
We propose a novel pretraining strategy: given partially masked token sequences from multiple modalities, predict the latent features of a subset of the missing tokens per modality, supervised by momentum-updated uni-modal target encoders. This design spends model capacity on high-level representations while retaining modality-specific information.
Further, we propose a scalable design that decouples compute cost from the number of modalities using a fixed representative token budget and a parameter-efficient cross-attention predictor. We instantiate OSKAR on video, skeleton, and text modalities, and show that its unified pretrained encoder outperforms specialized models across action recognition, localization, video-text retrieval, and video question answering.
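To make the bootstrapped supervision concrete, the snippet below is a minimal sketch (not the released implementation) of how a momentum-updated uni-modal target encoder can supervise latent prediction. The EMA decay, the smooth-L1 loss, and the stand-in Linear encoder are illustrative assumptions, not OSKAR's actual components.

# Minimal sketch of bootstrapped latent targets: an online prediction is regressed onto
# the output of a momentum-updated (EMA) copy of the uni-modal encoder. Decay value,
# loss, and the Linear stand-in encoder are assumptions for illustration only.
import copy
import torch
import torch.nn.functional as F


def ema_update(online: torch.nn.Module, target: torch.nn.Module, decay: float = 0.999) -> None:
    """Update target-encoder parameters as an exponential moving average of the online ones."""
    with torch.no_grad():
        for p_t, p_o in zip(target.parameters(), online.parameters()):
            p_t.mul_(decay).add_(p_o, alpha=1.0 - decay)


def latent_prediction_loss(pred: torch.Tensor, target_feats: torch.Tensor) -> torch.Tensor:
    """Regress predicted latents onto stop-gradient target features of the masked tokens."""
    return F.smooth_l1_loss(pred, target_feats.detach())


# Usage: the target encoder starts as a frozen copy of the online uni-modal encoder.
online_encoder = torch.nn.Linear(256, 256)          # stand-in for a uni-modal encoder
target_encoder = copy.deepcopy(online_encoder)
for p in target_encoder.parameters():
    p.requires_grad_(False)

tokens = torch.randn(8, 32, 256)                    # (batch, masked tokens, dim)
loss = latent_prediction_loss(online_encoder(tokens), target_encoder(tokens))
loss.backward()
ema_update(online_encoder, target_encoder)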
Method
OSKAR’s end-to-end training recipe. During pretraining, visible tokens from multiple modalities are fused in a shared latent space, while modality-specific predictors regress the latents of masked tokens against targets produced by momentum-updated uni-modal encoders.
The same unified representation then transfers to a broad set of downstream tasks, including action recognition, action localization, text-video retrieval, and video question answering. This is the core promise of OSKAR: a single multimodal pretraining objective with strong transfer across video, motion, and language tasks.
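As a rough illustration of this recipe, the following sketch mimics one pretraining step for two modalities: mask tokens per modality, fuse the visible tokens in a shared space, and regress predicted latents onto momentum target-encoder features. The module names, masking ratio, pooling, and toy per-modality predictor are assumptions made for brevity, not OSKAR's actual architecture.

# Hedged sketch of one OSKAR-style pretraining step with two modalities.
# Only the overall flow (mask -> fuse visible tokens -> predict masked latents against
# momentum target encoders) follows the description above; everything else is a stand-in.
import torch
import torch.nn.functional as F

D = 256
fusion_encoder = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True),
    num_layers=2,
)
# Stand-ins for modality-specific predictors and momentum-updated uni-modal target encoders.
predictors = torch.nn.ModuleDict({m: torch.nn.Linear(D, D) for m in ("video", "skeleton")})
target_encoders = {m: torch.nn.Linear(D, D) for m in ("video", "skeleton")}


def pretrain_step(batch: dict, mask_ratio: float = 0.75) -> torch.Tensor:
    """batch maps modality name -> token tensor of shape (batch, tokens, dim)."""
    visible, masked = {}, {}
    for m, tok in batch.items():
        perm = torch.randperm(tok.shape[1])
        split = int(tok.shape[1] * (1 - mask_ratio))
        visible[m], masked[m] = tok[:, perm[:split]], perm[split:]

    # Fuse the visible tokens of all modalities in a shared latent space.
    fused = fusion_encoder(torch.cat(list(visible.values()), dim=1))
    context = fused.mean(dim=1, keepdim=True)  # pooled multimodal context (placeholder)

    # Each modality's predictor regresses masked-token latents onto the output of that
    # modality's momentum-updated target encoder (no gradient through the targets).
    loss = torch.zeros(())
    for m, tok in batch.items():
        with torch.no_grad():
            targets = target_encoders[m](tok)[:, masked[m]]
        preds = predictors[m](context).expand_as(targets)  # toy predictor for illustration
        loss = loss + F.smooth_l1_loss(preds, targets)
    return loss


loss = pretrain_step({m: torch.randn(2, 64, D) for m in ("video", "skeleton")})
loss.backward()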
OSKAR Features
The model fuses visible multimodal context and predicts missing latent targets for each modality using momentum-updated uni-modal encoders.
OSKAR decouples compute from the number of modalities with a fixed representative token budget and an efficient cross-attention predictor (a minimal sketch follows this list).
Separate target encoders preserve uni-modal structure while still enabling rich alignment and interaction across modalities.
A single pretrained encoder transfers to action recognition, action localization, text-video retrieval, and video question answering.
OSKAR provides one coherent framework for dense, sparse, and symbolic modalities while remaining compatible with standard transformer pipelines.
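The sketch below illustrates the fixed representative token budget mentioned in the list: a constant number of learned query tokens cross-attends to the concatenation of all modality tokens, so adding a modality only lengthens the key/value sequence rather than growing the query budget. The class name, sizes, and single-layer design are assumptions, not the paper's exact predictor.

# Illustrative (assumed) cross-attention predictor with a fixed representative token budget.
import torch


class BudgetedCrossAttentionPredictor(torch.nn.Module):
    def __init__(self, dim: int = 256, budget: int = 32, num_heads: int = 4):
        super().__init__()
        # Fixed number of representative query tokens, independent of modality count.
        self.queries = torch.nn.Parameter(torch.randn(1, budget, dim) * 0.02)
        self.cross_attn = torch.nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = torch.nn.Linear(dim, dim)

    def forward(self, modality_tokens: list) -> torch.Tensor:
        """modality_tokens: list of (batch, tokens_m, dim) tensors, one per modality."""
        context = torch.cat(modality_tokens, dim=1)          # (batch, sum of tokens, dim)
        q = self.queries.expand(context.shape[0], -1, -1)
        summary, _ = self.cross_attn(q, context, context)    # (batch, budget, dim)
        return self.proj(summary)                            # representative tokens


# Adding a modality grows only the key/value length, not the fixed query budget.
predictor = BudgetedCrossAttentionPredictor()
out = predictor([torch.randn(2, 196, 256), torch.randn(2, 50, 256), torch.randn(2, 20, 256)])
print(out.shape)  # torch.Size([2, 32, 256])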
Selected Results
Action Recognition: 86.1% top-1 accuracy on K400
Skeleton Recognition: 91.1% accuracy on NTU120-XSub
Action Localization: 37.9 mAP on AVA
Text-Video Retrieval: 50.4 R@1 on MSRVTT
Video Question Answering: 49.3% accuracy on MSRVTT-QA
OSKAR also performs strongly in the low-sample, low-parameter, and low-label settings reported in the paper.
Citation
If OSKAR is useful in your research, please cite:
@inproceedings{abdelfattah2025oskar,
  title={{OSKAR}: Omnimodal Self-supervised Knowledge Abstraction and Representation},
  author={Mohamed O Abdelfattah and Kaouther Messaoud and Alexandre Alahi},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=LWuhOoHpo5}
}