Top 10 Egocentric Video Datasets for Physical AI

Srishty Sharon

Blog Writer

As AI systems move beyond screens and servers into the physical world, one thing becomes clear: models trained only on third-person video aren’t enough. To build robots, AR systems, and embodied AI that truly understand how humans interact with their environments, you need data from the human perspective. That’s where egocentric video datasets come in.

Captured through wearable cameras, AR/VR headsets, and head-mounted devices, egocentric (first-person) video datasets record the world the way a human actually experiences it: hands in frame, attention shifting, objects being picked up, and tasks being completed in real-world environments. This kind of data is becoming critical infrastructure for the next generation of AI.

In this blog, we cover the most influential egocentric video datasets shaping research and commercial AI development, and explain why first-person video datasets are becoming essential for embodied AI training data, robotics, and computer vision data annotation pipelines.

What Makes Egocentric Video Datasets Different?

Unlike fixed-camera or third-person datasets, egocentric datasets capture:

Natural hand and wrist motion during real task execution
Gaze and attention patterns tied to decision-making
Environmental context: lighting changes, occlusions, spatial transitions
Temporal depth: long sequences of tasks unfolding over time, not just isolated clips
Rich human-object interaction data that reflects real-world manipulation, not staged scenarios

They also tend to be multimodal, combining RGB video with audio, depth data, IMU sensors, and gaze tracking. This makes them particularly valuable for foundation models, embodied AI, and robotics systems that need to understand not just what’s happening, but how and why. That complexity also makes egocentric video data annotation significantly more demanding than standard computer vision data annotation.
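One practical consequence of multimodal capture is that each stream usually arrives at its own sampling rate, so sensor samples have to be matched to video frames before annotation. The sketch below shows one simple way to do that, nearest-timestamp matching with a tolerance. The function names, the millisecond timestamps, and the `max_gap` threshold are all illustrative assumptions, not the format of any dataset listed here.

```python
from bisect import bisect_left


def nearest_index(timestamps, t):
    """Return the index of the value in sorted `timestamps` closest to `t`."""
    if not timestamps:
        raise ValueError("timestamps must be non-empty")
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick whichever neighbour is closer; ties go to the earlier sample.
    return i - 1 if t - timestamps[i - 1] <= timestamps[i] - t else i


def align_to_frames(frame_times, sensor_times, max_gap):
    """For each frame time, give the index of the nearest sensor sample,
    or None when the nearest sample is more than `max_gap` away."""
    aligned = []
    for t in frame_times:
        j = nearest_index(sensor_times, t)
        aligned.append(j if abs(sensor_times[j] - t) <= max_gap else None)
    return aligned


# Hypothetical timestamps in milliseconds: ~30 fps frames, irregular gaze samples.
frame_times_ms = [0, 33, 67, 100]
gaze_times_ms = [1, 30, 61, 250]
print(align_to_frames(frame_times_ms, gaze_times_ms, max_gap=20))
# -> [0, 1, 2, None]
```

The last frame maps to `None` because its closest gaze sample is 39 ms away, outside the assumed 20 ms tolerance. Marking such frames explicitly is usually safer than silently pairing them with stale sensor data.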

Top 10 Egocentric Video Datasets

1. Ego4D

Developed by Meta AI in collaboration with academic partners across 9 countries, Ego4D is among the largest and most comprehensive egocentric datasets ever assembled. It captures daily life scenarios, from cooking and shopping to social interactions and construction, with an intentional emphasis on diversity across geography, culture, and demographics.

Key features:

3,600+ hours of first-person video
Multimodal: audio, narration, gaze, hand pose
Benchmarks for episodic memory, forecasting, hand-object interaction, and social interaction
Designed to support long-horizon activity understanding

Best for: Multimodal foundation models, embodied AI research, long-form activity understanding

2. EPIC-KITCHENS

EPIC-KITCHENS is the benchmark dataset for fine-grained kitchen activity recognition. Recorded in participants’ own homes across multiple countries, it captures unscripted, naturalistic cooking and food preparation sequences, making it far more representative of real human behavior than lab-based datasets.

Key features:

100+ hours of unscripted kitchen activity
Dense verb-noun action annotations (e.g., “cut onion,” “open fridge”)
Temporal action localization and segmentation
Object interaction labels

Best for: Action recognition, activity anticipation, skill learning, human-object interaction modeling
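To make the verb-noun idea concrete, here is a minimal sketch of how such annotations might be represented and queried. The `ActionSegment` fields, the helper names, and the sample timings are hypothetical and chosen only for illustration; they are not the dataset's actual annotation schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionSegment:
    """One annotated action: a time span plus a verb and a noun (assumed layout)."""
    start_s: float
    end_s: float
    verb: str
    noun: str


def split_label(label):
    """Split a label such as 'cut onion' into ('cut', 'onion')."""
    verb, _, noun = label.strip().partition(" ")
    noun = noun.strip()
    if not verb or not noun:
        raise ValueError(f"expected '<verb> <noun>', got {label!r}")
    return verb, noun


def actions_at(segments, t):
    """Return every segment active at time t (start inclusive, end exclusive)."""
    return [s for s in segments if s.start_s <= t < s.end_s]


# Hypothetical segments; overlapping spans are allowed.
segments = [
    ActionSegment(0.0, 2.5, *split_label("open fridge")),
    ActionSegment(2.0, 6.0, *split_label("cut onion")),
    ActionSegment(6.0, 7.5, *split_label("open drawer")),
]

verb_counts = Counter(s.verb for s in segments)
print(verb_counts.most_common(1))
# -> [('open', 2)]
print([f"{s.verb} {s.noun}" for s in actions_at(segments, 2.2)])
# -> ['open fridge', 'cut onion']
```

Keeping the verb and noun as separate fields lets a model or an analyst group actions by motion ("open") independently of the object involved, which is the main benefit of this labeling style.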

3. Ego-Exo4D

An extension of the Ego4D project, Ego-Exo4D pairs first-person video with synchronized third-person (exocentric) recordings of the same activities. This dual-perspective approach enables richer cross-view learning and is particularly useful for robotics imitation learning, where a model trained on human demonstrations needs to translate that understanding into robot actions.

Key features:

Synchronized ego + exo multi-camera capture
Skilled activity domains: cooking, bike repair, music, dance, sports (soccer, basketball, bouldering), and health procedures
3D body and hand pose annotations for precise motion tracking

Best for: Imitation learning, cross-view representation learning, human motion understanding

4. EGTEA Gaze+

EGTEA Gaze+ is a gaze-centric kitchen activity dataset that adds a critical layer many datasets miss: where is the person actually looking, and when? By pairing eye fixation data with activity annotations, it supports research into attention, intent estimation, and predictive AI systems.

Key features:

Eye fixation and gaze trajectory labels
Meal preparation task sequences
Action segmentation and object interaction tracking

Best for: Attention modeling, human intent estimation, gaze-aware AI systems
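As a rough illustration of how gaze labels can be tied to object interaction labels, the sketch below checks which annotated object box contains a gaze point. The box format `(x_min, y_min, x_max, y_max)`, the pixel coordinates, and the smallest-box tie-break are assumptions made for this example, not part of the EGTEA Gaze+ annotation format.

```python
def attended_object(gaze_xy, boxes):
    """Return the name of the smallest box containing the gaze point, or None.

    `boxes` maps an object name to (x_min, y_min, x_max, y_max) in pixels.
    Preferring the smallest box is a heuristic: a small object sitting on a
    large surface is more likely to be the target of attention.
    """
    gx, gy = gaze_xy
    hits = [
        ((x1 - x0) * (y1 - y0), name)
        for name, (x0, y0, x1, y1) in boxes.items()
        if x0 <= gx <= x1 and y0 <= gy <= y1
    ]
    return min(hits)[1] if hits else None


# Hypothetical boxes for a single 640x480 frame.
boxes = {
    "counter": (0, 300, 640, 480),
    "knife": (300, 350, 360, 380),
}
print(attended_object((320, 360), boxes))  # -> knife
print(attended_object((10, 10), boxes))    # -> None
```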

5. Something-Something V2

While not purely egocentric, Something-Something V2 (created by Twenty Billion Neurons, now part of Qualcomm) is widely used for teaching AI systems the physics of how humans interact with objects. Rather than labeling “what” the activity is, it focuses on “how”: the nuanced motion and causality of physical interactions.

Key features:

220,000+ video clips of humans performing defined physical actions with objects
Action labels defined by motion type, not object class (e.g., “moving something closer to something”)
Strong focus on temporal reasoning and disambiguation

Best for: Physical reasoning, interaction modeling, temporal action understanding
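The motion-first labeling idea can be sketched as a template with placeholders that are filled in per clip. The helper below is a minimal, hypothetical illustration that uses the word "something" as the placeholder. It is not the dataset's own tooling, and the object names are invented.

```python
PLACEHOLDER = "something"


def fill_template(template, objects):
    """Replace each 'something' in `template` with the next item in `objects`."""
    parts = template.split(PLACEHOLDER)
    slots = len(parts) - 1
    if slots != len(objects):
        raise ValueError(f"template has {slots} slots, got {len(objects)} objects")
    filled = parts[0]
    for obj, rest in zip(objects, parts[1:]):
        filled += obj + rest
    return filled


template = "moving something closer to something"
print(fill_template(template, ["a cup", "a plate"]))
# -> moving a cup closer to a plate
```

Because the class label is the template rather than the filled-in sentence, two clips with entirely different objects share a label whenever the motion is the same.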

6. First-Person Hand Action Benchmark (FPHA)

FPHA focuses specifically on the hands, one of the most important visual signals in egocentric video for robotics and AR applications. Captured using RGB-D cameras and magnetic sensors attached to the hand for precise hand pose ground truth, it supports hand tracking and grasp learning research.

Key features:

1,175 action sequences across 45 hand action categories
RGB-D recordings with synchronized depth data
Hand pose annotations from magnetic sensors
26 different objects used across manipulation tasks

Best for: Hand tracking, gesture recognition, robotics grasp learning

For teams annotating similar hand pose data, iMerit’s Ango Hub Skeleton Tool offers keypoint-based annotation purpose-built for this kind of work.

7. Charades-Ego

Charades-Ego adapts the popular Charades benchmark into an egocentric format by pairing first-person and third-person recordings of the same scripted activities. The cross-view pairing makes it particularly useful for learning viewpoint-invariant representations.

Key features:

Paired ego and exo recordings of scripted indoor activities
Temporal action localization labels
157 activity classes

Best for: Viewpoint adaptation, representation transfer, action localization

8. Aria Everyday Activities Dataset (AEA)

Developed by Meta in conjunction with the Project Aria research glasses, the AEA dataset is designed for wearable AI and spatial computing research. It captures daily activities with rich sensor fusion, including audio, video, IMU, eye tracking, and magnetometer data, making it one of the most multimodal egocentric datasets available.

Key features:

Captured using AR glasses with embedded sensors
Full sensor fusion: video, audio, eye tracking, IMU, barometer
Diverse daily activity scenarios across multiple environments
Designed for 3D scene reconstruction and contextual AI

Best for: Spatial computing, wearable AI, contextual assistants, multimodal perception

9. HOI4D

HOI4D is a 4D (spatiotemporal) egocentric dataset focused on rich human-object interaction in indoor environments. Its combination of RGB-D capture and dense 4D scene annotations makes it particularly valuable for robotics perception and physical AI systems that need to understand object state changes over time.

Key features:

4D spatiotemporal scene labeling
4,000+ egocentric RGB-D video sequences covering 800 object instances
Dense annotations of human-object interactions
3D scene reconstruction across 610 distinct rooms

Best for: Physical AI, robotics perception, 4D scene understanding

10. ADL (Activities of Daily Living)

The Activities of Daily Living (ADL) dataset captures unscripted, routine daily activities in real home environments. It was one of the earlier egocentric datasets to focus on naturalistic behavior, and remains relevant for assistive AI and context-aware computing research.

Key features:

Wearable camera capture across real home environments
Object-use annotations tied to daily routines
Temporal event labeling across extended sequences

Best for: Daily activity recognition, assistive AI, context-aware computing

Conclusion

The ten egocentric video datasets covered here represent the foundation of a rapidly growing research and commercial ecosystem. From large-scale multimodal collections like Ego4D to specialized benchmarks like FPHA and EGTEA Gaze+, each dataset addresses a different dimension of how AI systems can learn from human perspective and behavior.

As physical AI, robotics, and AR/VR systems mature, the demand for high-quality first-person video datasets, both public and custom, will only grow. Organizations that understand what’s available today are better positioned to identify gaps and build the embodied AI training data pipelines they’ll need tomorrow.

Looking to build your own? Learn more about iMerit’s Egocentric Video Data Collection
