We are building on a great tradition of video understanding symposia and summits previously held in Europe (2019, 2022) and the US (2019, 2017), and invite you to come together for a much-needed discussion on the next steps for video understanding. Despite tremendous efforts and growth in the number and scale of video datasets, video understanding remains a bottleneck for research.
We therefore invite you to a 1.5-day, invitation-only closed event to exchange ideas, establish close collaborations, and recommend new research directions.
The Video AI Symposium will be held in Central London, at the Google DeepMind offices on Saturday 30 Sep and Sunday 1 Oct 2023. The event will bring together 50 researchers for an opportunity to exchange ideas and connect over a mutual interest in video understanding.
Sponsored by: Google DeepMind, Google Research and Meta AI
Saturday 30 September
08:45-09:30 | Breakfast
09:30-09:45 | Introduction
09:45-10:45 | Rohit Girdhar | Learning visual representations with minimal supervision
            | Limin Wang | Towards building video foundation models
            | Christoph Feichtenhofer | Self-Supervised Video Understanding
10:45-11:15 | Open Discussion | How to learn from raw videos?
11:15-11:45 | Coffee Break
11:45-12:45 | Kristen Grauman | See What I See and Hear What I Hear: First-Person Perception and the Future of AR and Robotics
            | Michael Ryoo | Video Representations for Robot Learning
            | Arsha Nagrani | How can LLMs help with video understanding?
12:45-13:30 | Lunch
13:30-14:10 | Bill Freeman | Watching videos out of the corner of your eye
            | Stratis Gavves | Causal Computer Vision towards Embodied General Intelligence
14:10-14:40 | Open Discussion | One dataset to solve it all - from TikTok to robotics
14:40-15:00 | Coffee Break
15:00-16:00 | Cordelia Schmid | Dense video captioning and beyond
            | Cees Snoek | Towards Human-Aligned Video-AI
            | Dima Damen | Should we still seek fine-grained perception in video?
16:00-16:20 | Coffee Break
16:20-17:00 | Carl Vondrick | System 2 and Video
17:00-17:45 | Open Discussion | The crisis of downstream tasks: are current benchmarks a good measure of research progress?
19:00-22:00 | Dinner
Sunday 1 October
08:45-09:30 | Breakfast
09:30-10:30 | Joseph Tighe | A new benchmark for an embodied AI assistant
            | Adam Harley | Tracking Any Pixel in a Video
            | Carl Doersch | Tracking Any Point
10:30-10:45 | Coffee Break
10:45-11:45 | Gul Varol | Beyond Text Queries for Search: Composed Video Retrieval
            | Andrew Owens | Multimodal Learning from the Bottom Up
            | Jitendra Malik | Unsolved problems in video understanding
11:45-12:00 | Coffee Break
12:00-12:40 | Angela Yao | VideoQA in the Time of Large Language Models
            | Laura Sevilla Lara | Video Understanding Using Less Compute and Less Training Data
12:40-13:15 | Open Discussion | Camera view vs. world view: should video be studied in 3D?
13:15-14:30 | Lunch and preparation to leave for the train
14:30-      | 10-minute walk to St Pancras Station to take the Eurostar train
16:31-19:47 | Eurostar train to Paris Gare du Nord (dep.-arr.)