Objects Before Words: Object-First Inductive Biases for Grounding Language in Child-View Video
BabyMind learns word meanings from egocentric child-view video by first building object tracks and then aligning utterances to bags of objects. This object-first bias outperforms prior contrastive baselines on SAYCam and on out-of-distribution tests, suggesting a more human-like pathway for grounded language learning.
Sathira Silva, Abrham Kahsay Gebreselasie