research

Broadly, my research aims to understand the variation in children’s early language environments and its effects on children’s language abilities. Below are a few key research areas and some representative projects.

characterising early language experience

What kinds of language inputs do children receive? How do they vary within and across children? I am interested in collecting naturalistic data from young children and analysing the distributions of such data.

  • What are the actual audiovisual experiences of young children? How can we quantify and qualify their input “in the wild”?
    • Long*, Xiang*, Stojanov*, Sparks, Yin, Keene, Tan, Feng, Zhuang, Marchman, Yamins, & Frank (2024). “The BabyView dataset: High-resolution egocentric videos of infants’ and young children’s everyday experiences.” CCN Proceedings. [paper]
    • Sparks, Long, Keene, Perez, Tan, Marchman, & Frank (2024). “Characterizing contextual variation in children’s preschool language environment using naturalistic egocentric videos.” CogSci Proceedings. [paper]
    • Yang, Sepuri, Tan, Frank, & Long (2025). “Quantifying infants’ everyday experiences with objects in a large corpus of egocentric videos.” CCN Abstracts. [abs]
    • Sepuri*, Aw*, Tan*, Sparks, Marchman, Frank, & Long (2025). “Characterizing young children’s everyday activities using video question-answering models.” NeurIPS DBM Workshop Paper. [paper]
    • Tan*, Yang*, Sepuri, Aw, Sparks, Zi, Marchman, Frank, & Long (2025). “Assessing the alignment between infants’ visual and linguistic experience using multimodal language models.” arXiv preprint. [preprint]
  • Shared book reading is a particularly rich source of children’s language input. How do children’s books differ from child-directed speech?
    • Dawson, Hsiao, Tan, Banerji, & Nation (2021). “Features of lexical richness in children’s books: Comparisons with child-directed speech.” Language Development Research. [paper]
    • Tan, Read, Gamboa, Bang, & Marchman (under review). “The power of the page: Comparing richness in text and talk during book sharing with two-year old children.” [preprint]

early word learning across linguistic environments

What are the cross-linguistic patterns in word learning? How are early vocabularies shaped by experiences of multilingualism? I use statistical modelling to understand word- and child-level predictors of word learning.

  • How do the different languages heard by a bilingual child affect their language learning?
    • Tan, Marchman, & Frank (2024). “The role of translation equivalents in bilingual word learning.” Developmental Science. [paper]
    • Tan & Frank (2025). “Syntactic category bias in early bilingual vocabularies.” CogSci Proceedings. [preprint]
    • Tan, Kachergis, Marchman, Frank, Mayor, et al. (in progress). “Investigating the effect of language exposure on expressive vocabulary in young bilingual children.” [prereg]
  • What does word learning look like cross-linguistically? What are the consistencies and variations in early vocabulary across languages?
    • Tan*, Loukatou*, Braginsky, Mankewitz, & Frank (2024). “Predicting ages of acquisition for children’s early vocabulary across 27 languages and dialects.” CogSci Proceedings. [paper]
    • Kachergis*, Tan*, Marchman, Dale, & Frank (under review). “Measuring children’s early vocabulary in low-resource languages using a Swadesh-style word list.” [preprint]

machine learning models as cognitive models

The process of language learning is difficult to model, but recent advances in machine learning have given rise to a potential approach requiring few inductive biases. Can we use machine learning models as plausible models of language learning and language use?

  • How do we evaluate how closely language models mirror the process of human language development?
    • Tan, Yu, Long, Ma, Murray, Silverman, Yeatman, & Frank (2024). “DevBench: A multimodal developmental benchmark for language learning.” NeurIPS Proceedings. [paper]
    • Hu, Tan, Feng, & Frank (2025). “Language production is harder than comprehension for children and language models.” CogSci Proceedings. [abs]
  • Can we train vision–language models on naturalistic developmental training data?
    • Tan, Hu, Long, & Frank (in progress). “Training vision–language models from the child’s perspective.”
  • How do models and humans compare in more complex language tasks requiring pragmatics?
    • Boyce, Prystawski, Tan, & Frank (2025). “Idiosyncratic but not opaque: Linguistic conventions formed in reference games are interpretable by naïve humans and vision–language models.” CogSci Proceedings. [preprint]
    • Tan*, Prystawski*, Boyce, & Frank (2025). “Context informs pragmatic interpretation in vision–language models.” NeurIPS CogInterp Workshop Paper. [paper]

open science, meta-science, and big team science

I believe that science is best advanced through information sharing and collaboration, and have worked on several open data repositories and large-scale collaborative endeavours.

  • How do we aggregate data from various contributors into a centralised open data repository?
    • Wordbank, a repository of child vocabulary data from Communicative Development Inventories.
    • Peekbank, a repository of child language processing data from looking-while-listening studies.
    • Refbank, a repository of iterated reference game data.
  • How do we leverage big team science to work on large-scale distributed projects?
    • ManyBabies, a consortium of multi-lab replication efforts for key developmental science findings.
    • The Replication Database, a community crowdsourced database for replication studies.