We are building on a great tradition of video understanding symposia and summits previously held in Europe (2019, 2022) and the US (2017, 2019), and we invite you to come together for a much-needed discussion on the next steps for video understanding. Despite tremendous effort and the growing number and scale of video datasets, video understanding remains a bottleneck for research.
We thus invite you to a 1.5-day, invitation-only closed event to exchange ideas, establish tight collaborations, and recommend new research directions.
The Video AI Symposium will be held in Central London at the Google DeepMind offices on Saturday 30 September and Sunday 1 October 2023. The event will bring together 50 researchers for an opportunity to exchange ideas and connect over a mutual interest in video understanding.
Sponsored by: Google DeepMind, Google Research and Meta AI
Google Research
INRIA, Google Research
University of Oxford, Google DeepMind
UC Berkeley and Meta AI
MIT and Google Research
University of Amsterdam
Meta AI
UC Berkeley
INRIA Paris
Weizmann Institute
University of Oxford and Meta
MIT
Meta AI and UT Austin
Google DeepMind
University of Bristol and Google DeepMind
Meta AI
Meta AI
Salesforce - Stanford University
University of Amsterdam
University of Bonn
KAUST
Columbia University
University of Bonn and MIT-IBM Watson AI Lab
Google DeepMind
Singapore University
Nanjing University
New York University
Stony Brook University and Google DeepMind
University of Michigan
Meta AI
École des Ponts ParisTech
Google DeepMind
Meta AI
Stanford University
Google Research
Google DeepMind
Shanghai Jiao Tong University
University of Edinburgh
University of Oxford
University of Bristol
University of Amsterdam
University of Leiden
Google DeepMind
Google DeepMind
Google DeepMind
Google DeepMind
Google DeepMind
Google DeepMind
Google DeepMind
Google DeepMind
Google DeepMind
Google DeepMind
Google DeepMind
Google DeepMind
Google DeepMind
Google DeepMind
Saturday 30 September

| Time | Session | Topic |
| --- | --- | --- |
| 08:45-09:30 | Breakfast | |
| 09:30-09:45 | Introduction | |
| 09:45-10:45 | Rohit Girdhar | Learning visual representations with minimal supervision |
| | Limin Wang | Towards building video foundation models |
| | Christoph Feichtenhofer | Self-Supervised Video Understanding |
| 10:45-11:15 | Open Discussion | How to learn from Raw Videos? |
| 11:15-11:45 | Coffee Break | |
| 11:45-12:45 | Kristen Grauman | See What I See and Hear What I Hear: First-Person Perception and the Future of AR and Robotics |
| | Michael Ryoo | Video Representations for Robot Learning |
| | Arsha Nagrani | How can LLMs help with video understanding? |
| 12:45-13:30 | Lunch | |
| 13:30-14:10 | Bill Freeman | Watching videos out of the corner of your eye |
| | Stratis Gavves | Causal Computer Vision towards Embodied General Intelligence |
| 14:10-14:40 | Open Discussion | One dataset to solve it all - from tiktok to robotics |
| 14:40-15:00 | Coffee Break | |
| 15:00-16:00 | Cordelia Schmid | Dense video captioning and beyond |
| | Cees Snoek | Towards Human-Aligned Video-AI |
| | Dima Damen | Should we still seek fine-grained perception in video? |
| 16:00-16:20 | Coffee Break | |
| 16:20-17:00 | Carl Vondrick | System 2 and Video |
| 17:00-17:45 | Open Discussion | The crisis of downstream tasks... Are current benchmarks a good measure of research progress? |
| 19:00-22:00 | Dinner | |
Sunday 1 October

| Time | Session | Topic |
| --- | --- | --- |
| 08:45-09:30 | Breakfast | |
| 09:30-10:30 | Joseph Tighe | A new benchmark for an embodied AI assistant |
| | Adam Harley | Tracking Any Pixel in a Video |
| | Carl Doersch | Tracking Any Point |
| 10:30-10:45 | Coffee Break | |
| 10:45-11:45 | Gul Varol | Beyond Text Queries for Search: Composed Video Retrieval |
| | Andrew Owens | Multimodal Learning from the Bottom Up |
| | Jitendra Malik | Unsolved problems in video understanding |
| 11:45-12:00 | Coffee Break | |
| 12:00-12:40 | Angela Yao | VideoQA in the Time of Large Language Models |
| | Laura Sevilla Lara | Video Understanding Using Less Compute and Less Training Data |
| 12:40-13:15 | Open Discussion | Camera view vs world view - should video be studied in 3D? |
| 13:15-14:30 | Lunch and preparation to leave for the train | |
| 14:30- | 10-minute walk to St Pancras Station for the Eurostar train | |
| 16:31-19:47 | Eurostar train to Paris Gare du Nord (departure-arrival) | |