Facebook researchers collect thousands of hours of first-person video to train AI


If the AI of the future is, as many tech companies hope, going to see through our eyes in the form of AR glasses and other wearables, it will need to learn how to make sense of the human perspective. We’re certainly used to it, but there is remarkably little first-person video footage of everyday tasks, which is why Facebook collected a few thousand hours of it for a new publicly available dataset.


The challenge Facebook is attempting to get a grip on is that today’s most impressive object and scene recognition models have been trained almost exclusively on third-person perspectives. So a model can recognize a person cooking, but only when it sees that person standing in a kitchen, not when the view is through the person’s eyes. Or it will recognize a bike, but not from the rider’s perspective. It’s a perspective shift that we take for granted, because it’s a natural part of our experience, but one that computers find quite difficult.

The solution to machine learning problems is usually either more or better data, and in this case it can’t hurt to have both. So Facebook contacted research partners around the world to collect first-person video of common activities like cooking, grocery shopping, tying shoelaces, or just hanging out.


The 13 partner universities collected thousands of hours of video from over 700 participants in 9 countries, and it should be said at the outset that they were volunteers who controlled their level of participation and how much of their identity was revealed. Those thousands of hours were whittled down to 3,000 by a research team that watched, edited and hand-annotated the video, while adding its own footage from staged environments that could not be captured in the wild. All of this is described in the research paper.

The footage was captured in a variety of ways, from eyeglass cameras to GoPros and other devices. Some researchers also chose to scan the environment in which the person was working, while others tracked gaze direction and other metrics. All of this is going into a dataset Facebook calls Ego4D, which will be made freely available to the research community at large.


Two pictures, one showing computer vision successfully identifying objects and the other showing it failing in the first person.

“For AI systems to interact with the world the way we do, the AI field needs to evolve to an entirely new paradigm of first-person perception. This means teaching AI to understand activities of daily living through the human eye in the context of real-time motion, interactions and multi-sensory observations,” said lead researcher Kristen Grauman in a Facebook blog post.

As hard as it may be to believe, this research and the Ray-Ban Stories smart shades are completely unrelated, except that Facebook clearly thinks first-person understanding is important across several areas. (The 3D scans, however, could be used in the company’s Habitat AI training simulator.)

“Our research is strongly inspired by applications in augmented reality and robotics,” Grauman told Nerdshala. “First-person perception is critical to enabling AI assistants of the future, especially as wearables like AR glasses become an integral part of how people live and move in everyday life. Think how rewarding it would be if the assistants on your devices could take the cognitive overload out of your life, understanding your world through your eyes.”

The global nature of the collected video is a very deliberate move. To include only footage from a single country or culture would be fundamentally short-sighted. American kitchens look different from French, Rwandan and Japanese ones. Making the same dish with the same ingredients, or doing the same general task (cleaning, exercising), can look very different even between individuals, let alone entire cultures. So, as Facebook’s post states, “Compared to existing data sets, the Ego4D data set offers a greater diversity of scenes, people and activities, which increases the usability of models trained for people across backgrounds, ethnicities, occupations and ages.”

Examples of first-person video on Facebook and the environment where it was taken.

The dataset isn’t the only thing Facebook is releasing. With a leap forward like this in data collection, it is also common to put out a set of benchmarks to test how well a given model is making use of the information. For example, with a set of images of dogs and cats, you’ll want a standard benchmark that tests the model’s efficacy in telling which is which.
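To make the dogs-and-cats example concrete, here is a minimal Python sketch of what such a benchmark boils down to: a fixed, labeled test set plus a scoring function that every model is run against. The names `benchmark_accuracy`, `TEST_SET` and `always_dog` are invented for illustration and are not part of Ego4D or any Facebook release.

```python
from typing import Callable, Sequence


def benchmark_accuracy(
    model: Callable[[str], str],
    examples: Sequence[tuple[str, str]],
) -> float:
    """Return the fraction of labeled examples the model gets right."""
    if not examples:
        raise ValueError("benchmark needs at least one example")
    correct = sum(1 for item, label in examples if model(item) == label)
    return correct / len(examples)


# Hypothetical fixed test set: (image id, true label). Keeping it fixed is
# what lets different models be compared on equal terms.
TEST_SET = [
    ("img_001", "cat"),
    ("img_002", "dog"),
    ("img_003", "dog"),
    ("img_004", "cat"),
]


def always_dog(image_id: str) -> str:
    """A deliberately naive stand-in model that answers "dog" every time."""
    return "dog"


score = benchmark_accuracy(always_dog, TEST_SET)
print(f"accuracy: {score:.2f}")
```

A naive baseline like this one gives later models a floor to beat, which is the role the article describes for the researchers’ own baseline runs.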

In this case things are a bit more complicated. Simply identifying objects from a first-person point of view isn’t that hard (it’s just a different angle, really), and it wouldn’t be that new or useful either. Do you really need a pair of AR glasses to tell you “that’s a tomato”? No: like any other tool, an AR device should be telling you something you don’t know, and to do so it needs a deeper understanding of things like intentions, context and linked actions.

To this end, the researchers came up with five tasks that could, in theory anyway, be accomplished by analyzing this first-person imagery:

  • Episodic memory: tracking objects and concepts in time and space so that arbitrary questions like “where are my keys?” can be answered.
  • Forecasting: understanding sequences of events so that questions like “What’s next in the recipe?” can be answered, or things can be noted in advance, such as “You left your car keys at home.”
  • Hand-object interaction: recognizing how people hold and manipulate objects, and what happens when they do, which may feed into episodic memory or perhaps inform the actions of a robot that has to imitate them.
  • Audio-visual diarization: associating sound with events and objects so that speech or music can be intelligently tracked for situations such as asking what song was playing in the cafe, or what the boss said at the end of the meeting. (“Diarization” is their “word.”)
  • Social interaction: understanding who is talking to whom and what is being said, both for the purpose of informing other processes and for moment-to-moment use such as captioning in a noisy room with many people.
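As a rough illustration of the episodic memory idea, the sketch below assumes some upstream vision model has already turned video into timestamped sightings of labeled objects, and then answers “where are my keys?” by looking up the most recent one. The `Sighting` and `EpisodicLog` names are hypothetical and do not come from the Ego4D paper, whose benchmark works on raw video rather than on ready-made labels.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sighting:
    """One hypothetical detection: what was seen, where, and when."""
    time_s: float   # seconds since the start of the recording
    label: str      # object name, e.g. "keys"
    location: str   # coarse place description


class EpisodicLog:
    """Keeps only the most recent sighting of each object label."""

    def __init__(self) -> None:
        self._last_seen: dict[str, Sighting] = {}

    def observe(self, sighting: Sighting) -> None:
        previous = self._last_seen.get(sighting.label)
        # Ignore out-of-order sightings that are older than what we hold.
        if previous is None or sighting.time_s >= previous.time_s:
            self._last_seen[sighting.label] = sighting

    def where_is(self, label: str) -> Optional[Sighting]:
        """Return the latest sighting of the object, or None if never seen."""
        return self._last_seen.get(label)


log = EpisodicLog()
log.observe(Sighting(12.0, "keys", "kitchen counter"))
log.observe(Sighting(95.5, "keys", "hallway table"))
log.observe(Sighting(40.0, "mug", "desk"))

answer = log.where_is("keys")
if answer is not None:
    print(f"Last saw {answer.label} at {answer.location} ({answer.time_s:.0f}s in)")
```

The lookup itself is trivial; the hard part the benchmark targets is producing reliable sightings from first-person footage in the first place.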

These aren’t the only possible applications or benchmarks, of course, just a set of initial ideas for testing whether a given AI model actually understands what’s happening in a first-person video. The Facebook researchers performed a baseline run on each of the tasks, described in their paper, which serves as a starting point. There’s also a sort of pie-in-the-sky example of what each of these tasks would look like if successful, in a video summarizing the research.

While the 3,000 hours (carefully annotated by hand over 250,000 researcher hours, Grauman was careful to point out) is an order of magnitude more than what is available now, there is still plenty of room to grow, she noted. The team plans to expand the dataset and is actively adding partners as well.

If you’re interested in using the data, keep your eye on the Facebook AI Research blog and maybe contact one of the many people listed on the paper. It will be released over the next few months, once the consortium figures out exactly how to do that.
