Credit: NYU Center for Data Science
AI systems, such as GPT-4, can now learn and use human language, but they learn from astronomical amounts of linguistic input – far more than children receive when learning to understand and speak a language. The best AI systems train on texts with billions of words, while children only receive millions of words per year.
Because of this huge data gap, researchers are skeptical that recent advances in AI can tell us much about human learning and development. An ideal test to demonstrate a connection would involve training an AI model, not on massive data from the web, but only on the information received by a single child. What could the model then learn?
A team of researchers from New York University conducted exactly this experiment. They trained a multimodal AI system through the eyes and ears of a single child, using head-mounted camera video recordings from the time he was 6 months old until his second birthday. They examined whether the AI model could learn words and concepts present in a child’s daily experience.
Their findings, reported in the journal Science, show that the model, or neural network, could in fact learn a significant number of words and concepts using limited slices of what the child experienced. Even though the video captured only about 1% of the child’s waking hours, that was enough for genuine language learning.
“We show, for the first time, that a neural network trained on this developmentally realistic input from a single child can learn to link words to their visual counterparts,” says Wai Keen Vong, a research scientist at the Center for Data Science at NYU and the first author of the article.
“Our results demonstrate how recent algorithmic advances, coupled with a child’s naturalistic experience, have the potential to reshape our understanding of early language and concept acquisition.”
“By using AI models to study the real language-learning problem faced by children, we can address classic debates about what ingredients children need to learn words: whether they need language-specific biases, innate knowledge, or just associative learning to get going,” adds Brenden Lake, an assistant professor in NYU’s Center for Data Science and Department of Psychology and the paper’s senior author. “It seems we can get more from just learning than is commonly thought.”
Vong, Lake, and their NYU colleagues Wentao Wang and Emin Orhan analyzed a child’s learning process captured in first-person video, recorded via a lightweight head-mounted camera on a weekly basis from the age of 6 months through 25 months, using more than 60 hours of footage.
Video footage captured by a child wearing a head-mounted camera. Credit: NYU Center for Data Science
The footage contained approximately a quarter of a million word instances (that is, the number of words communicated, many of them repeated), linked to the video frames of what the child saw as those words were spoken, and spanned a wide range of activities across development, including mealtimes, book reading, and play.
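As a rough illustration (not the paper’s actual data format), each training example in such a dataset can be thought of as a camera frame time-aligned with the transcribed utterance spoken around that moment. The class and field names below are hypothetical:

```python
# Hypothetical sketch of one paired training example: a video frame
# time-aligned with the transcribed child-directed utterance around it.
from dataclasses import dataclass

import torch


@dataclass
class PairedExample:
    frame: torch.Tensor      # (3, H, W) image tensor from the head-mounted camera
    utterance: list[str]     # transcribed speech, e.g. ["look", "at", "the", "ball"]
    timestamp_s: float       # time of the frame within the recording, in seconds
```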
The NYU researchers then trained a multimodal neural network with two separate modules: one that processes single video frames (the vision encoder) and another that processes transcribed child-directed speech (the language encoder).
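The following is a minimal sketch of what such a two-encoder setup could look like in PyTorch, using a small convolutional network and an embedding-averaging text encoder as stand-ins; the encoders used in the paper are different and more capable:

```python
# Minimal two-encoder sketch (hypothetical stand-in architectures).
import torch
import torch.nn as nn


class VisionEncoder(nn.Module):
    """Maps a single video frame (3 x H x W) to a fixed-size embedding."""
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.proj(self.backbone(frames))


class LanguageEncoder(nn.Module):
    """Maps a transcribed utterance (padded token IDs) to a fixed-size embedding."""
    def __init__(self, vocab_size: int, embed_dim: int = 512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Average the word embeddings over the utterance, ignoring padding.
        mask = (token_ids != 0).unsqueeze(-1).float()
        summed = (self.embedding(token_ids) * mask).sum(dim=1)
        return summed / mask.sum(dim=1).clamp(min=1.0)
```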
These two encoders were combined and trained using an algorithm called “contrastive learning,” which aims to learn useful input features and their cross-modal associations. For example, when a parent says something within the child’s view, some of the words used likely refer to something the child can see, so understanding is instilled by linking visual and linguistic cues.
“This gives the model a clue as to which words should be associated with which objects,” says Vong. “Combining these cues allows contrastive learning to gradually determine which words belong to which visuals and capture a child’s first word learning.”
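One common way to implement a contrastive objective of this kind is a symmetric loss over a batch of (frame, utterance) pairs, in which matching pairs are pulled together and mismatched pairs are pushed apart. The sketch below is illustrative and assumes embeddings produced by two encoders like those above; it is not the paper’s exact training recipe:

```python
# Illustrative symmetric contrastive loss over (frame, utterance) embedding pairs.
import torch
import torch.nn.functional as F


def contrastive_loss(frame_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Pulls each frame toward its paired utterance and away from the others."""
    frame_emb = F.normalize(frame_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = frame_emb @ text_emb.t() / temperature      # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_frame_to_text = F.cross_entropy(logits, targets)
    loss_text_to_frame = F.cross_entropy(logits.t(), targets)
    return (loss_frame_to_text + loss_text_to_frame) / 2
```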
After training the model, the researchers tested it using the same kinds of assessments used to measure word learning in infants: presenting the model with a target word and a set of four different candidate images, and asking it to select the image that matches the target word.
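A sketch of this four-alternative forced-choice test might look as follows, assuming hypothetical vision_encoder and language_encoder modules like the ones above:

```python
# Sketch of a four-alternative forced-choice trial: embed the target word and
# four candidate images, then pick the image most similar to the word.
import torch
import torch.nn.functional as F


@torch.no_grad()
def forced_choice(vision_encoder, language_encoder,
                  word_tokens: torch.Tensor,       # (1, seq_len) token IDs for the target word
                  candidate_images: torch.Tensor,  # (4, 3, H, W) candidate images
                  ) -> int:
    word_emb = F.normalize(language_encoder(word_tokens), dim=-1)       # (1, D)
    image_embs = F.normalize(vision_encoder(candidate_images), dim=-1)  # (4, D)
    similarities = (image_embs @ word_emb.t()).squeeze(-1)              # (4,)
    return int(similarities.argmax())  # index of the chosen image
```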
Their results showed that the model was able to learn a significant number of words and concepts present in the child’s daily experience. Furthermore, the model could generalize some of the words it learned to visual instances very different from those seen during training, reflecting an aspect of generalization also observed in children when they are tested in the laboratory.
“These results suggest that this aspect of word learning is achievable from the type of naturalistic data that children receive while using relatively generic learning mechanisms such as those found in neural networks,” observes Lake.
More information:
Wai Keen Vong et al, Grounded language acquisition through the eyes and ears of a single child, Science (2024). DOI: 10.1126/science.adi1374. www.science.org/doi/10.1126/science.adi1374
Provided by New York University
Citation: New research shows how child-like language learning is possible using AI tools (February 1, 2024) retrieved February 1, 2024 from
This document is subject to copyright. Except for fair use for private study or research purposes, no part may be reproduced without written permission. The content is provided for information only.