
Top 10 Egocentric Video Datasets Advancing Physical AI and Robotics

As AI systems move beyond screens and servers into the physical world, one thing becomes clear: models trained only on third-person video aren’t enough. To build robots, AR systems, and embodied AI that truly understand how humans interact with their environments, you need data from the human perspective. That’s where egocentric video datasets come in.

Captured through wearable cameras, AR/VR headsets, and head-mounted devices, egocentric (first-person) video datasets record the world the way a human actually experiences it: hands in frame, attention shifting, objects being picked up, tasks being completed in real-world environments. This kind of data is becoming critical infrastructure for the next generation of AI.

In this post, we cover the most influential egocentric video datasets shaping research and commercial AI development, and explain why first-person video datasets are becoming essential for embodied AI training data, robotics, and computer vision data annotation pipelines.

What Makes Egocentric Video Datasets Different?

Unlike fixed-camera or third-person datasets, egocentric datasets capture:

  • Natural hand and wrist motion during real task execution
  • Gaze and attention patterns tied to decision-making
  • Environmental context: lighting changes, occlusions, spatial transitions
  • Temporal depth: long sequences of tasks unfolding over time, not just isolated clips
  • Rich human-object interaction data that reflects real-world manipulation, not staged scenarios

They also tend to be multimodal, combining RGB video with audio, depth data, IMU sensors, and gaze tracking. This makes them particularly valuable for foundation models, embodied AI, and robotics systems that need to understand not just what’s happening, but how and why. That complexity also makes egocentric video data annotation significantly more demanding than standard computer vision data annotation.
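
One practical consequence of combining video with IMU or gaze streams is that the sensors rarely sample at the same rate, so readings have to be matched to video frames by timestamp. The sketch below shows one simple way to do that (nearest-timestamp matching). The function names, sampling rates, and values are hypothetical and do not come from any particular dataset or toolkit.

```python
from bisect import bisect_left

def nearest_sample(timestamps, t):
    """Return the index of the entry in a sorted timestamp list closest to t."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick whichever neighbour is closer; ties go to the earlier sample.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1

def align_to_frames(frame_times, sensor_times, sensor_values):
    """Pair each video frame with the nearest-in-time sensor reading."""
    return [sensor_values[nearest_sample(sensor_times, t)] for t in frame_times]

# Hypothetical data: roughly 30 fps frames and an irregularly sampled gaze stream.
frame_times = [0.000, 0.033, 0.067, 0.100]
gaze_times = [0.000, 0.010, 0.040, 0.060, 0.095]
gaze_points = [(0.50, 0.50), (0.51, 0.50), (0.55, 0.48), (0.60, 0.45), (0.62, 0.44)]

aligned = align_to_frames(frame_times, gaze_times, gaze_points)
print(aligned)
```

Nearest-neighbour matching is only one option; pipelines that need smoother signals may interpolate between readings instead.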

Egocentric hand keypoint annotation during dishwashing

Top 10 Egocentric Video Datasets

1. Ego4D

Developed by Meta AI in collaboration with academic partners across 9 countries, Ego4D is among the largest and most comprehensive egocentric datasets ever assembled. It captures daily life scenarios, from cooking and shopping to social interactions and construction, with an intentional emphasis on diversity across geography, culture, and demographics.

Key features:

  • 3,600+ hours of first-person video
  • Multimodal: audio, narration, gaze, hand pose
  • Benchmarks for episodic memory, forecasting, hand-object interaction, and social interaction
  • Designed to support long-horizon activity understanding

Best for: Multimodal foundation models, embodied AI research, long-form activity understanding

2. EPIC-KITCHENS

EPIC-KITCHENS is a standard benchmark dataset for fine-grained kitchen activity recognition. Recorded in participants’ own homes across multiple countries, it captures unscripted, naturalistic cooking and food preparation sequences, making it far more representative of real human behavior than lab-based datasets.

First-person view of washing dishes in a home kitchen, captured through a wearable head-mounted camera for egocentric AI training.

Key features:

  • 100+ hours of unscripted kitchen activity
  • Dense verb-noun action annotations (e.g., “cut onion,” “open fridge”)
  • Temporal action localization and segmentation
  • Object interaction labels

Best for: Action recognition, activity anticipation, skill learning, human-object interaction modeling
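
The verb-noun annotations and temporal localization labels listed above can be pictured as time-stamped segments. The sketch below is a minimal illustration, not the dataset's actual schema: the `ActionSegment` fields are assumptions, and the 0.5 threshold is just a commonly used example value. It scores a predicted segment against a ground-truth one using temporal intersection-over-union (IoU).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ActionSegment:
    """Hypothetical record for one annotated action."""
    verb: str
    noun: str
    start: float  # seconds from the start of the video
    stop: float   # seconds from the start of the video

def temporal_iou(a, b):
    """Overlap duration divided by the combined duration of two segments."""
    intersection = max(0.0, min(a.stop, b.stop) - max(a.start, b.start))
    union = (a.stop - a.start) + (b.stop - b.start) - intersection
    return intersection / union if union > 0 else 0.0

ground_truth = ActionSegment("cut", "onion", 12.0, 18.0)
prediction = ActionSegment("cut", "onion", 14.0, 20.0)

iou = temporal_iou(ground_truth, prediction)
same_label = (prediction.verb, prediction.noun) == (ground_truth.verb, ground_truth.noun)
is_hit = same_label and iou >= 0.5  # example threshold, chosen for illustration

print(iou, is_hit)
```

Here the two segments share 4 seconds out of a combined 8, so the IoU is 0.5 and the prediction counts as a hit at that threshold.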

3. Ego-Exo4D

An extension of the Ego4D project, Ego-Exo4D pairs first-person video with synchronized third-person (exocentric) recordings of the same activities. This dual-perspective approach enables richer cross-view learning and is particularly useful for robotics imitation learning, where a model trained on human demonstrations needs to translate that understanding into robot actions.

Key features:

  • Synchronized ego + exo multi-camera capture
  • Skilled activity domains including cooking, sports, music, dance, bike repair, and health procedures
  • Motion capture integration for precise body and hand tracking

Best for: Imitation learning, cross-view representation learning, human motion understanding

4. EGTEA Gaze+

EGTEA Gaze+ is a gaze-centric kitchen activity dataset that adds a critical layer many datasets miss: where is the person actually looking, and when? By pairing eye fixation data with activity annotations, it supports research into attention, intent estimation, and predictive AI systems.

Key features:

  • Eye fixation and gaze trajectory labels
  • Meal preparation task sequences
  • Action segmentation and object interaction tracking

Best for: Attention modeling, human intent estimation, gaze-aware AI systems
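
Pairing fixations with activity annotations, as described above, comes down to a time-window lookup. The sketch below assumes fixations are stored as (time, x, y) tuples and actions as (label, start, stop) tuples. Both layouts and all values are hypothetical and are not the dataset's file format.

```python
def fixations_by_action(fixations, segments):
    """Group fixations under the action segment whose time window contains them."""
    grouped = {}
    for label, start, stop in segments:
        # Half-open window [start, stop) so a fixation on a boundary is counted once.
        grouped[label] = [f for f in fixations if start <= f[0] < stop]
    return grouped

# Hypothetical data: time in seconds, gaze position in normalized image coordinates.
fixations = [(0.5, 0.4, 0.6), (1.2, 0.5, 0.5), (2.8, 0.7, 0.3), (3.1, 0.2, 0.8)]
segments = [("take knife", 0.0, 2.0), ("cut tomato", 2.0, 4.0)]

grouped = fixations_by_action(fixations, segments)
for label, items in grouped.items():
    print(label, len(items))
```

With fixations grouped per action, it becomes straightforward to ask where attention was directed just before or during a given step.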

5. Something-Something V2

While not purely egocentric, Something-Something V2 (created by Twenty Billion Neurons, now part of Qualcomm) is widely used for teaching AI systems the physics of how humans interact with objects. Rather than labeling “what” the activity is, it focuses on “how”: the nuanced motion and causality of physical interactions.

Key features:

  • 220,000+ video clips of humans performing defined physical actions with objects
  • Action labels defined by motion type, not object class (e.g., “moving something closer to something”)
  • Strong focus on temporal reasoning and disambiguation

Best for: Physical reasoning, interaction modeling, temporal action understanding
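
To make the idea of motion-based labels concrete, the sketch below treats a label as a template with object slots. The class depends only on the template, while the concrete objects vary from clip to clip. The `[something]` placeholder syntax and the helper names are assumptions made for illustration, not a description of the dataset's official tooling.

```python
import re

# Assumed slot syntax for this sketch.
PLACEHOLDER = re.compile(r"\[something\]", re.IGNORECASE)

def fill_template(template, objects):
    """Substitute concrete object names into the template's slots, in order."""
    slots = PLACEHOLDER.findall(template)
    if len(slots) != len(objects):
        raise ValueError(f"expected {len(slots)} objects, got {len(objects)}")
    remaining = iter(objects)
    return PLACEHOLDER.sub(lambda _match: next(remaining), template)

def class_label(template):
    """Drop the brackets to get the object-agnostic class name."""
    return PLACEHOLDER.sub("something", template)

template = "Moving [something] closer to [something]"

caption = fill_template(template, ["cup", "laptop"])
label = class_label(template)
print(caption)
print(label)
```

Two clips with entirely different objects map to the same class as long as the motion pattern matches, which is what pushes a model toward temporal reasoning rather than object recognition.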

6. First-Person Hand Action Benchmark (FPHA)

FPHA focuses specifically on the hands, one of the most important visual signals in egocentric video for robotics and AR applications. Captured using RGB-D cameras and a magnetic sensor glove for precise hand pose ground truth, it supports hand tracking and grasp learning research.

Key features:

  • 1,175 action sequences across 45 hand action categories
  • RGB-D recordings with synchronized depth data
  • Hand pose annotations from magnetic sensors
  • 6 object categories used across manipulation tasks

Best for: Hand tracking, gesture recognition, robotics grasp learning

For teams annotating similar hand pose data, iMerit’s Ango Hub Skeleton Tool offers keypoint-based annotation purpose-built for this kind of work.

7. Charades-Ego

Charades-Ego adapts the popular Charades benchmark into an egocentric format by pairing first-person and third-person recordings of the same scripted activities. The cross-view pairing makes it particularly useful for learning viewpoint-invariant representations.

Split-screen view of the same kitchen activity captured from first-person and third-person perspectives for egocentric AI training.


Key features:

  • Paired ego and exo recordings of scripted indoor activities
  • Temporal action localization labels
  • 157 activity classes

Best for: Viewpoint adaptation, representation transfer, action localization

8. Aria Everyday Activities Dataset (AEA)

Developed by Meta in conjunction with the Project Aria research glasses, the AEA dataset is designed for wearable AI and spatial computing research. It captures daily activities with rich sensor fusion, including audio, video, IMU, eye tracking, and magnetometer data, making it one of the most multimodal egocentric datasets available.

Key features:

  • Captured using AR glasses with embedded sensors
  • Full sensor fusion: video, audio, eye tracking, IMU, barometer
  • Diverse daily activity scenarios across multiple environments
  • Designed for 3D scene reconstruction and contextual AI

Best for: Spatial computing, wearable AI, contextual assistants, multimodal perception

9. HOI4D

HOI4D is a 4D (spatiotemporal) egocentric dataset focused on rich human-object interaction in indoor environments. Its combination of RGB-D capture and dense 4D scene annotations makes it particularly valuable for robotics perception and physical AI systems that need to understand object state changes over time.

Key features:

  • 4D spatiotemporal scene labeling
  • 4,000 egocentric RGB-D video sequences covering 800 object instances
  • Dense annotations of human-object interactions
  • 3D scene reconstruction across 610 distinct rooms

Best for: Physical AI, robotics perception, 4D scene understanding

10. ADL (Activities of Daily Living)

The Activities of Daily Living (ADL) dataset captures unscripted, routine daily activities in real home environments. It was one of the earlier egocentric datasets to focus on naturalistic behavior, and remains relevant for assistive AI and context-aware computing research.

Key features:

  • Wearable camera capture across real home environments
  • Object-use annotations tied to daily routines
  • Temporal event labeling across extended sequences

Best for: Daily activity recognition, assistive AI, context-aware computing

Conclusion

The ten egocentric video datasets covered here represent the foundation of a rapidly growing research and commercial ecosystem. From large-scale multimodal collections like Ego4D to specialized benchmarks like FPHA and EGTEA Gaze+, each dataset addresses a different dimension of how AI systems can learn from the human perspective and human behavior.

As physical AI, robotics, and AR/VR systems mature, the demand for high-quality first-person video datasets, both public and custom, will only grow. Organizations that understand what’s available today are better positioned to identify gaps and build the embodied AI training data pipelines they’ll need tomorrow.

Looking to build your own? Learn more about iMerit’s Egocentric Video Data Collection