Google rolled out a new version of Open Images that adds millions of additional data points.
Along with human action annotations, visual relationships, and image-level labels, it adds a new type of multimodal annotation called localized narratives. With these additions, Google has created localized narratives for about 500,000 Open Images images so far, and it expects V6 to further stimulate progress toward genuine scene understanding.
Today, Google’s Open Images corpus for computer vision tasks got a boost: Open Images V6 expands the dataset with a large set of new visual relationships (e.g., “dog catching a flying disk”), human action annotations (e.g., “person jumping”), and image-level labels (e.g., “paisley”), as well as a new form of multimodal annotation called localized narratives. In Open Images V6, localized narratives are available for 500k of its images. Additionally, to facilitate comparison with previous work, Google also released localized narrative annotations for the full 123k images of the COCO dataset.
Google says this last addition could create “potential avenues of research” for studying how people describe images, which could lead to interface design insights (and subsequent improvements) across the web, desktop, and mobile apps.
In 2016, Google introduced Open Images, a dataset of millions of labeled images spanning thousands of object categories. Major updates arrived in 2018 and 2019, bringing with them 15.4 million bounding boxes for 600 object categories, making the dataset a valuable resource for training the latest deep convolutional neural networks for computer vision tasks. With the introduction of version 5 last May, the Open Images dataset included 9M images annotated with 36M image-level labels, 15.8M bounding boxes, 2.8M instance segmentations, and 391k visual relationships.
“Along with the dataset itself, the associated Open Images challenges have spurred the latest advances in object detection, instance segmentation, and visual relationship detection,” said Jordi Pont-Tuset, a research scientist at Google Research.
As Pont-Tuset explains, one of the motivations behind localized narratives is to leverage the connection between vision and language, which is typically done via image captioning (i.e., images paired with written descriptions of their content). But image captioning lacks visual “grounding.” To mitigate this, some researchers have drawn bounding boxes for the nouns in captions after the fact — in contrast to localized narratives, where every word in the description is grounded.
The localized narratives in Open Images were generated by annotators who provided spoken descriptions of images while hovering a computer mouse over the regions they were describing. The annotators then manually transcribed their descriptions, after which Google researchers aligned them with automatic speech transcriptions, ensuring that the speech, text, and mouse trace were correct and synchronized.
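The result is a per-image record in which every spoken word carries a time window that can be matched against timestamped mouse positions. The sketch below illustrates this alignment; the field names (`caption`, `timed_caption`, `traces`) follow the Localized Narratives release format, but the record itself is a made-up example, not real annotation data.

```python
# Illustrative sketch of a localized-narrative record and how a word
# can be grounded to the mouse-trace points recorded while it was spoken.
# All values here are invented for demonstration.

record = {
    "image_id": "example_0001",  # hypothetical image ID
    "caption": "a dog catching a flying disk",
    # Each utterance carries the time window in which it was spoken.
    "timed_caption": [
        {"utterance": "a dog", "start_time": 0.0, "end_time": 0.8},
        {"utterance": "catching", "start_time": 0.8, "end_time": 1.4},
        {"utterance": "a flying disk", "start_time": 1.4, "end_time": 2.2},
    ],
    # Mouse-trace points: normalized (x, y) coordinates plus a timestamp t.
    "traces": [[
        {"x": 0.20, "y": 0.60, "t": 0.1},
        {"x": 0.25, "y": 0.55, "t": 0.7},
        {"x": 0.50, "y": 0.40, "t": 1.0},
        {"x": 0.80, "y": 0.20, "t": 1.8},
    ]],
}

def trace_points_for(record, utterance):
    """Return the mouse-trace points recorded while `utterance` was spoken."""
    segment = next(s for s in record["timed_caption"]
                   if s["utterance"] == utterance)
    return [p for trace in record["traces"] for p in trace
            if segment["start_time"] <= p["t"] <= segment["end_time"]]

# Every word is grounded: "a dog" maps to the points hovered at 0.0-0.8 s.
points = trace_points_for(record, "a dog")
print(len(points))  # 2 points fall inside that window
```

This is what distinguishes localized narratives from after-the-fact bounding boxes: the grounding falls out of the synchronized timestamps rather than a separate annotation pass.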
“Speaking and pointing simultaneously is very intuitive, which allowed us to give the annotators very vague instructions about the task. We hope that it will further stimulate progress toward genuine scene understanding,” said Pont-Tuset.
This setup creates potential avenues of research for studying how people describe images. For example, annotators exhibit different styles when indicating the spatial extent of an object (circling, scratching, underlining, etc.), and studying these styles could bring valuable insights for the design of new user interfaces.
“To get a sense of the amount of additional data these localized narratives represent, the total length of mouse traces is ~6400 km long, and if read aloud without stopping, all the narratives would take ~1.5 years to listen to!”
New Visual Relationships, Human Actions, and Image-Level Annotations
In addition to the localized narratives, Open Images V6 increases the number of visual relationship annotation types by an order of magnitude (up to 1.4k), adding, for example, “a person riding a skateboard”, “persons holding hands”, and “dog catching a flying disk”.
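Conceptually, each such annotation can be viewed as a (subject, relation, object) triple attached to an image. The sketch below uses invented triples mirroring the examples above; the actual Open Images release stores label IDs and bounding boxes in CSV files, so this is a simplification of the structure, not the file format.

```python
# Hedged sketch: visual relationships as (subject, relation, object)
# triples. The triples below are made up to echo the article's examples.

triples = [
    ("person", "riding", "skateboard"),
    ("persons", "holding", "hands"),
    ("dog", "catching", "flying disk"),
]

def with_relation(triples, relation):
    """Return all triples whose relation matches."""
    return [t for t in triples if t[1] == relation]

print(with_relation(triples, "catching"))
# [('dog', 'catching', 'flying disk')]
```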
People in images have been at the core of computer vision interests since its inception, and understanding what those people are doing is of utmost importance for many applications. That is why Open Images V6 also includes 2.5M annotations of humans performing standalone actions, such as “jumping”, “smiling”, or “laying down”. Google’s image-recognition work extends to healthcare as well; the company recently reported that one of its deep learning systems identifies skin conditions at a level comparable to dermatologists.
In short, Open Images V6 is a significant qualitative and quantitative step towards improving the unified annotations for image classification, object detection, visual relationship detection, and instance segmentation, and takes a novel approach in connecting vision and language with localized narratives. Let’s hope that Open Images V6 will further stimulate progress towards genuine scene understanding.