Abstract: As humans move around, performing their daily tasks, they are able to recall where they have positioned objects in their environment, even if these objects are currently out of sight. In this paper, we aim to mimic this spatial cognition ability. We thus formulate the task of Out of Sight, Not Out of Mind – 3D tracking active objects using observations captured through an egocentric camera. We introduce Lift, Match and Keep (LMK), a method which lifts partial 2D observations to 3D world coordinates, matches them over time using visual appearance, 3D location and interactions to form object tracks, and keeps these object tracks even when they go out of view of the camera – hence keeping in mind what is out of sight. We test LMK on 100 long videos from EPIC-KITCHENS. Our results demonstrate that spatial cognition is critical for correctly locating objects over short and long time scales. For example, for one long egocentric video, we estimate the 3D location of 50 active objects. Of these, 60% can be correctly positioned in 3D after 2 minutes of leaving the camera view.
We produce scene geometry for 100 videos from EPIC-KITCHENS-100. These are 3D meshes extracted using a classical Multi-View Stereopsis (MVS) pipeline that runs patch matching to find dense correspondences between stereo image pairs, triangulates the correspondences to estimate depth, and fuses the depth estimates into a dense 3D point cloud with surface normals.
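As an illustration of this kind of pipeline, the sketch below uses the pycolmap bindings to COLMAP to go from video frames to a fused dense point cloud. This is a generic MVS recipe under assumed placeholder paths, not the exact pipeline used to produce our geometry.

```python
import pathlib
import pycolmap

# Placeholder paths; point image_dir at extracted video frames.
image_dir = pathlib.Path("frames/")
output_path = pathlib.Path("reconstruction/")
mvs_path = output_path / "mvs"
database_path = output_path / "database.db"
output_path.mkdir(parents=True, exist_ok=True)

# Sparse stage: detect and match features, then run incremental SfM
# to recover camera poses and a sparse point cloud.
pycolmap.extract_features(database_path, image_dir)
pycolmap.match_exhaustive(database_path)
maps = pycolmap.incremental_mapping(database_path, image_dir, output_path)
maps[0].write(output_path)

# Dense stage: undistort images, run patch-match stereo to estimate
# per-view depth maps from dense correspondences, then fuse the depth
# maps into a single point cloud with surface normals (saved as PLY).
pycolmap.undistort_images(mvs_path, output_path, image_dir)
pycolmap.patch_match_stereo(mvs_path)  # requires COLMAP built with CUDA
pycolmap.stereo_fusion(mvs_path / "dense.ply", mvs_path)
```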
We call our method Lift, Match and Keep (LMK). It operates by first lifting 2D observations of objects to 3D world coordinates, matching them over time to form object tracks, and keeping objects in memory when they go out of sight.
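The following is a minimal schematic of this loop, not the authors' implementation: all names and thresholds are hypothetical, and it assumes unit-normalized appearance features, a per-frame depth map, and a world-from-camera pose (R, t) with pinhole intrinsics.

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class Track:
    """One object kept in 3D memory, whether in or out of view."""
    position: np.ndarray    # 3D location in world coordinates
    appearance: np.ndarray  # visual feature used for matching
    last_seen: int          # frame index of last observation

def lift(detection, depth_map, pose, intrinsics):
    """Lift a 2D detection center (u, v) to 3D world coordinates."""
    u, v = detection["center"]
    z = depth_map[int(v), int(u)]
    ray = np.linalg.inv(intrinsics) @ np.array([u, v, 1.0])
    cam_point = ray * z          # back-project into the camera frame
    R, t = pose                  # world-from-camera rotation, translation
    return R @ cam_point + t

def match(tracks, point, feature, max_dist=0.5, min_sim=0.6):
    """Return the best track by appearance and 3D proximity, or None."""
    best, best_sim = None, min_sim
    for tr in tracks:
        if np.linalg.norm(tr.position - point) > max_dist:
            continue
        sim = float(feature @ tr.appearance)  # cosine sim of unit features
        if sim > best_sim:
            best, best_sim = tr, sim
    return best

def step(tracks, detections, depth_map, pose, intrinsics, frame_idx):
    """Process one frame: lift detections, match to tracks, keep the rest."""
    for det in detections:
        point = lift(det, depth_map, pose, intrinsics)
        tr = match(tracks, point, det["feature"])
        if tr is None:
            tracks.append(Track(point, det["feature"], frame_idx))
        else:  # update the matched track in place
            tr.position, tr.appearance, tr.last_seen = point, det["feature"], frame_idx
    # Keep: tracks with no match this frame are retained untouched, so
    # out-of-view objects stay localized at their last known 3D position.
    return tracks
```

Note that the "keep" step is simply the absence of deletion: unmatched tracks persist at their last known 3D position until re-observed.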
@inproceedings{Plizzari2025OSNOM,
  title={Spatial Cognition from Egocentric Video: Out of Sight, Not Out of Mind},
  author={Plizzari, Chiara and Goel, Shubham and Perrett, Toby and Chalk, Jacob and Kanazawa, Angjoo and Damen, Dima},
  booktitle={2025 International Conference on 3D Vision (3DV)},
  year={2025}
}
Acknowledgements
Research at Bristol is supported by EPSRC Fellowship UMPIRE (EP/T004991/1) and EPSRC Programme Grant Visual AI (EP/T028572/1). We particularly thank Jitendra Malik for early discussions and insights on this work. We also thank members of the BAIR community for helpful discussions. This project acknowledges the use of University of Bristol’s Blue Crystal 4 (BC4) HPC facilities.