research | alvin tan

Broadly, my research aims to understand variation in children’s early language environments and its effects on their language abilities. Below are a few key research areas and some representative projects.

characterising early language experience

What kinds of language inputs do children receive? How do they vary within and across children? I am interested in collecting naturalistic data from young children and analysing the distributions of such data.

What are the actual audiovisual experiences of young children? How can we quantify and qualify their input “in the wild”?
- Long*, Xiang*, Stojanov*, Sparks, Yin, Keene, Tan, Feng, Zhuang, Marchman, Yamins, & Frank (2024). “The BabyView dataset: High-resolution egocentric videos of infants’ and young children’s everyday experiences.” CCN Proceedings. [preprint]
- Sparks, Long, Keene, Perez, Tan, Marchman, & Frank (2024). “Characterizing contextual variation in children’s preschool language environment using naturalistic egocentric videos.” CogSci Proceedings. [paper]
Shared book reading is a particularly rich source of children’s language input. How do children’s books differ from child-directed speech?
- Dawson, Hsiao, Tan, Banerji, & Nation (2021). “Features of lexical richness in children’s books: Comparisons with child-directed speech.” Language Development Research. [paper]
- Tan, Read, Gamboa, Bang, & Marchman (in prep.). “The power of the page: Comparing richness in text and talk during book sharing with two-year old children.”

early word learning across linguistic environments

What are the cross-linguistic patterns in word learning? How are early vocabularies shaped by experiences of multilingualism? I use statistical modelling to understand word- and child-level predictors of word learning.

How do the different languages heard by a bilingual child affect their language learning?
- Tan, Marchman, & Frank (2024). “The role of translation equivalents in bilingual word learning.” Developmental Science. [paper]
- Tan & Frank (2024). “Syntactic category bias in early bilingual vocabularies.” Bay Area Developmental Symposium. [preprint]
- Tan, Kachergis, Marchman, Frank, Mayor, et al. (in progress). “Exploring the relationship between language exposure and vocabulary in bilingual children.”
What does word learning look like cross-linguistically? What are the consistencies and variations in early vocabulary across languages?
- Tan*, Loukatou*, Braginsky, Mankewitz, & Frank (2024). “Predicting ages of acquisition for children’s early vocabulary across 27 languages and dialects.” CogSci Proceedings. [paper]
- Tan*, Kachergis*, Marchman, Dale, & Frank (2023). “Measuring children’s early vocabulary in low-resource languages using a Swadesh-style word list.” CogSci Proceedings. [abs]

machine learning models as cognitive models

The process of language learning is difficult to model, but recent advances in machine learning have given rise to a potential approach requiring few inductive biases. Can we use machine learning models as plausible models of language learning in young children?

How do we evaluate the closeness of a vision–language model to the process of human language development?
- Tan, Yu, Long, Ma, Murray, Silverman, Yeatman, & Frank (2024). “DevBench: A multimodal developmental benchmark for language learning.” NeurIPS Proceedings. [paper]
Can we train vision–language models on naturalistic developmental training data?
- Tan, Hu, Long, & Frank (in progress). “Training vision–language models from the child’s perspective.”

open science, meta-science, and big team science

I believe that science is best advanced through information sharing and collaboration, and I have worked on several open data repositories and large-scale collaborative endeavours.

What does it look like to aggregate data from various contributors into a centralised open data repository?
- Wordbank, a repository of child vocabulary data from Communicative Development Inventories.
- Peekbank, a repository of child language processing data from looking-while-listening studies.
How do we leverage big team science to work on large-scale distributed projects?
- ManyBabies, a consortium of multi-lab replication efforts for key developmental science findings.
- The Replication Database, a community crowdsourced database for replication studies.