research
Broadly, my research aims to understand the variation in children’s early language environments and its effects on children’s language abilities. Below are a few key research areas and some representative projects.
characterising early language experience
What kinds of language inputs do children receive? How do they vary within and across children? I am interested in collecting naturalistic data from young children and analysing its distributional properties.
With the BabyView project, we capture the audiovisual experiences of young children “in the wild” via egocentric videos. These videos can help us to understand the distributions of input features that children experience, including objects and activities. We can also assess the general alignment of the visual and linguistic streams to understand how children might learn from multimodal input. This method can also be extended to study other settings, including early childhood education contexts.
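As a toy sketch of what “alignment of the visual and linguistic streams” could mean operationally (the timeline, matching rule, and function below are invented for illustration, not the project’s actual pipeline), one might ask how often speech refers to something currently in view:

```python
def alignment(frames):
    """Fraction of moments at which a spoken word names a visible object.

    frames: list of (spoken_words, visible_objects) pairs, where each
    element is a set of lowercase labels. Matching here is exact string
    overlap, a deliberately crude stand-in for referent identification.
    """
    hits = sum(bool(words & objects) for words, objects in frames)
    return hits / len(frames)

# Invented toy timeline: words heard and objects in view at the same moments
timeline = [
    ({"look", "ball"}, {"ball", "rug"}),    # aligned: "ball" is visible
    ({"dinner", "soon"}, {"ball", "rug"}),  # misaligned: talk about absent things
    ({"doggy"}, {"dog"}),                   # misaligned under exact matching
    ({"where", "cup"}, {"cup", "table"}),   # aligned
]
score = alignment(timeline)
```

Even this crude measure makes the core point visible: naturalistic input contains many moments where speech and the visual scene do not line up.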
One particularly interesting context is shared book reading, which contains rich, high-quality language input. We can quantify the distributional differences between book texts and child-directed speech, finding that books contain more complex language than child-directed speech, even the spontaneous speech produced during book-sharing episodes.
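To illustrate the kind of distributional comparison involved (the corpus snippets below are invented, and these two measures are just common examples, not the full analysis), one might contrast lexical diversity and utterance length across registers:

```python
def lexical_stats(utterances):
    """Simple distributional measures over a list of utterance strings."""
    tokens = [w.lower() for u in utterances for w in u.split()]
    ttr = len(set(tokens)) / len(tokens)  # type-token ratio: lexical diversity
    mlu = len(tokens) / len(utterances)   # mean length of utterance, in words
    return {"ttr": round(ttr, 3), "mlu": round(mlu, 2)}

# Invented example sentences, standing in for real corpus samples
book_text = [
    "the enormous crocodile slithered toward the riverbank",
    "secret plans and clever tricks filled his wicked mind",
]
child_directed = [
    "look a crocodile",
    "do you see it",
    "big crocodile",
]

book = lexical_stats(book_text)
cds = lexical_stats(child_directed)
```

On such measures, book text typically shows longer utterances and a more diverse vocabulary than spontaneous child-directed speech.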
early word learning across linguistic environments
What are the cross-linguistic patterns in word learning? How are early vocabularies shaped by experiences of multilingualism? I use statistical modelling to understand word- and child-level predictors of word learning.
Given the large variability in language features, it is notable that children are nonetheless able to acquire language rapidly in the early years of life without much instruction. Investigating the trajectories of word learning across languages suggests that there are both consistencies and variations in early vocabularies across languages. We can capitalise on the regions of consistency to develop new instruments to measure vocabulary in low-resource languages.
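One common summary of such word-learning trajectories is an item’s age of acquisition, the age at which some criterion proportion of children are reported to produce the word. As a minimal sketch (the data points are invented, and real analyses would fit a full growth model rather than interpolate):

```python
def age_of_acquisition(ages, proportions, threshold=0.5):
    """Estimate the age at which the proportion of children producing a
    word first crosses a threshold, via linear interpolation between
    adjacent observed ages."""
    pairs = list(zip(ages, proportions))
    for (a0, p0), (a1, p1) in zip(pairs, pairs[1:]):
        if p0 < threshold <= p1:
            return a0 + (threshold - p0) / (p1 - p0) * (a1 - a0)
    return None  # never crosses the threshold in the observed range

# Invented example: proportion of children producing "dog" at each age (months)
ages = [16, 18, 20, 22, 24]
props = [0.10, 0.30, 0.55, 0.80, 0.95]
aoa = age_of_acquisition(ages, props)
```

Computing such estimates for the same concepts across languages is one way to locate the consistencies (and divergences) in early vocabularies.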
Additional complexities arise when considering children acquiring multiple languages, who have to learn how to manage multiple linguistic systems. We find that the presence of translation equivalents (words for the same concept in different languages) can help bootstrap language learning, especially for younger children. Bilinguals can also help us to tease apart the cognitive, linguistic, and cultural factors shaping early word learning, for example by observing the syntactic composition of their vocabularies. Some ongoing work aims to more systematically investigate the role of language exposure in shaping bilingual vocabularies, in order to understand the relationship between language input and language abilities.
machine learning models as cognitive models
The process of language learning is difficult to model, but recent advances in machine learning have given rise to a potential approach requiring few inductive biases. Can we use machine learning models as plausible models of language learning and language use?
One crucial prerequisite is to evaluate the closeness of language models to the process of human language development. DevBench is a multimodal developmental benchmark that aims to characterise language learning not just in terms of accuracy, but in terms of similarity to human response patterns; we find that vision–language models sometimes recapitulate the developmental trajectories of human language learners. We also observe such a parallel when considering production–comprehension asymmetries in language models and in children, finding that the gap between production and comprehension decreases over training or age.
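To make the notion of “similarity to human response patterns” concrete (the numbers and distributions below are invented; this is just one plausible distance measure, not DevBench’s actual metric suite), one can compare a model’s probabilities over candidate responses against the distribution of human responses:

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) between two discrete distributions over the same options."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Invented trial: proportions of human looks to four candidate images,
# versus two hypothetical models' probabilities over the same images.
human = [0.70, 0.15, 0.10, 0.05]
model_a = [0.65, 0.20, 0.10, 0.05]  # graded pattern resembling the humans
model_b = [0.25, 0.25, 0.25, 0.25]  # uniform responder, far from the human pattern

div_a = kl_divergence(human, model_a)
div_b = kl_divergence(human, model_b)
```

Both models might pick the “correct” image at similar rates, but only the first mirrors the graded human pattern, which is exactly the distinction an accuracy-only evaluation would miss.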
We can also attempt to train vision–language models on naturalistic developmental data to determine whether current algorithms can learn as efficiently from noisy data as humans do. This work is ongoing, but preliminary results suggest that the answer is no: models struggle to learn from multimodal data when the visual and linguistic streams are misaligned, even though such misalignment is rampant in children’s early experiences.
Moving to still more complex language tasks, namely those requiring pragmatics, we find that models can sometimes perform reasonably well on iterated reference games, although they appear to be considerably more sensitive to context than humans are. In ongoing work, we aim to more fully characterise the role of contextual information in vision–language model performance and to understand the features that result in model–human divergence.
open science, meta-science, and big team science
I believe that science is best advanced through information sharing and collaboration, and have worked on several open data repositories and large-scale collaborative endeavours.
One way to promote data sharing and reuse is to aggregate data into centralised open data repositories, where the standardised format and easy access permit novel secondary analyses. Some repositories that I have helped to construct and maintain include Wordbank, a repository of child vocabulary data from Communicative Development Inventories; Peekbank, a repository of child language processing data from looking-while-listening studies; and Refbank, a repository of iterated reference game data.
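The standardisation step at the heart of such repositories can be sketched as mapping each lab’s idiosyncratic records onto one shared schema (the schema, field names, and records below are invented for illustration, not the actual Wordbank or Peekbank formats):

```python
SCHEMA = ["child_id", "age_months", "measure", "value"]

def harmonise(record, field_map):
    """Map a lab-specific record onto the shared schema.

    field_map: {schema_field: lab_specific_field_name}
    """
    return {field: record[field_map[field]] for field in SCHEMA}

# Two hypothetical labs with different column conventions
lab_a = {"subj": "a01", "age": 24, "instrument": "WS", "vocab": 312}
lab_b = {"participant_id": "b07", "months": 18, "form": "WG", "produces": 95}

rows = [
    harmonise(lab_a, {"child_id": "subj", "age_months": "age",
                      "measure": "instrument", "value": "vocab"}),
    harmonise(lab_b, {"child_id": "participant_id", "age_months": "months",
                      "measure": "form", "value": "produces"}),
]
```

Once every contribution lives in the same schema, cross-dataset queries (say, vocabulary size by age across labs) become one-liners rather than bespoke data wrangling, which is what makes secondary analyses cheap.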
I am also a proponent of big team science as I believe it can help us better characterise the robustness and generalisability of scientific findings. I have participated in several ManyBabies projects as an analysis team member, and have also contributed to the Replication Database.