Hi, I'm Aljosa! I come from the Alpine side of Slovenia. I am a Senior Research Scientist at NVIDIA and an alum of the University of Bonn, RWTH Aachen University, TU Munich, and the Robotics Institute at Carnegie Mellon University.
I started my journey during my Ph.D. at RWTH Aachen University by developing methods for joint 3D stereo-based geometry estimation, ego-pose estimation, and object tracking. I began with canonical objects and pushed the frontier towards tracking and reconstructing any object, demonstrating that these pipelines can power data auto-labeling. I am continuing this journey at NVIDIA, turning years of my foundational academic work in video-based reconstruction, tracking, object mining, and auto-labeling into real-world systems at scale.
Looking ahead, I believe the next frontier is memory: future agents will need to operate not for seconds, but over a lifetime. I lay out this vision in my research statement (2023), establishing (implicit) visual tracking as the key mechanism for building structured, queryable memory at test time. Our recent work on scalable feed-forward 3D reconstruction and structured sparse attention for world modelling is a step in this direction.
Built the groundwork for camera-based 4D scene understanding during my Ph.D. at RWTH Aachen: joint geometry & pose estimation and any-object tracking/discovery. Co-authored HOTA, the standard tracking evaluation metric.
Building on prior work on tracking and reconstructing any object, I started & led the SAL project (open-vocabulary 4D localization & completion), powering auto-labeling for AV perception (featured at NVIDIA GTC 2024).
The next frontier is memory: agents that operate over a lifetime in the physical world. Laid out in my research statement, recently advanced in the context of 3D reconstruction and structured sparse attention for world modelling.
X. Wu, S. Elflein, J. Lucas, O. Russakovsky, L. Leal-Taixé, D. Paschalidou, J. Lorraine, A. Ošep: WorldTrace: Addressable Memory for Video World Models, Preprint, 2026. paper page
S. Elflein, R. Li, S. Agostinho, Ž. Gojčič, L. Leal-Taixé, A. Ošep: VGG-T3: Offline Feed-Forward 3D Reconstruction at Scale, CVPR, 2026. paper page
A. Ošep, T. Meinhardt, F. Ferroni, N. Peri, D. Ramanan, L. Leal-Taixé: Better Call SAL: Towards Learning to Segment Anything in Lidar, ECCV, 2024. paper video page
P. Dendorfer, A. Ošep, A. Milan, K. Schindler, D. Cremers, I. Reid, S. Roth, L. Leal-Taixé: MOTChallenge: A Benchmark for Single-camera Multiple Target Tracking, IJCV, 2020. paper
A. Ošep, P. Voigtlaender, J. Luiten, S. Breuers, B. Leibe: Towards Large-Scale Video Object Mining, ECCV 2018 Workshop on Interactive and Adaptive Learning in an Open World, 2018. paper
A. Ošep, A. Hermans, F. Engelmann, D. Klostermann, M. Mathias, B. Leibe: Multi-Scale Object Candidates for Generic Object Tracking in Street Scenes, ICRA, 2016. paper
D. Mitzel, J. Diesel, A. Ošep, U. Rafi, B. Leibe: A Fixed-Dimensional 3D Shape Representation for Matching Partially Observed Objects in Street Scenes, ICRA, 2015. paper
M. Weinmann, A. Ošep, R. Ruiters, R. Klein: Multi-View Normal Field Integration for 3D Reconstruction of Mirroring Objects, ICCV, 2013. paper
M. Weinmann, R. Ruiters, A. Ošep, C. Schwartz, R. Klein: Fusing Structured Light Consistency and Helmholtz Normals for 3D Reconstruction, BMVC, 2012. paper