Research
Our group investigates the computational and cognitive foundations of natural language, studying how it arises and functions in both humans and machines. Recently, our focus has turned to vision-language models (VLMs): multimodal systems that aim to capture perceptual and conceptual information in unified representations. These systems not only offer a testbed for theories of meaning, perception, and communication, but also push the boundaries of scalable AI.
We are especially focused on:
- Evaluating Vision-Language Models: We develop rigorous benchmarks and interpretability tools to assess how well VLMs reason about the physical, spatial, temporal, and commonsense structure of the visual world. Our goal is to move beyond surface-level performance and uncover the mechanisms that underpin robust multimodal understanding.
- Designing New Multimodal Architectures: We explore novel ways of fusing visual and linguistic information, seeking architectures that go beyond feature concatenation and better approximate the integrative nature of human perception. We draw inspiration from cognitive neuroscience and deep learning to inform design principles (a toy fusion sketch follows this list).
- Emergent Communication: We study how artificial agents can develop communication systems through interaction, modeling the dynamics of language evolution and learning. These simulations provide insights into the cognitive and social functions of language and offer frameworks for developing more adaptive AI (a minimal signaling-game sketch also follows this list).
- Compositionality: Our research builds theoretical bridges between linguistic notions of compositionality and the empirical behavior of neural models, developing task-independent evaluations that reveal how—and whether—models systematically generalize meaning.
- Grounded Language Learning: We teach models to connect linguistic forms to perceptual and sensory data, aiming to foster context-sensitive and action-relevant language use. This work spans tasks like instruction following, spatial reasoning, and interactive visual dialogue.
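To make the architectural contrast above concrete, the sketch below compares plain feature concatenation with a cross-attention fusion layer in PyTorch. It is a minimal illustration of the design space, not one of our models; all module names, dimensions, and pooling choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Baseline: concatenate pooled image and text features, then project."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, img_feat, txt_feat):
        # img_feat, txt_feat: (batch, dim) pooled embeddings
        return self.proj(torch.cat([img_feat, txt_feat], dim=-1))

class CrossAttentionFusion(nn.Module):
    """Alternative: let text tokens attend over image patches before pooling."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, txt_tokens):
        # img_tokens: (batch, n_patches, dim); txt_tokens: (batch, n_words, dim)
        fused, _ = self.attn(query=txt_tokens, key=img_tokens, value=img_tokens)
        return self.norm(fused + txt_tokens).mean(dim=1)  # pooled joint representation

if __name__ == "__main__":
    img = torch.randn(2, 49, 512)   # e.g. a 7x7 grid of patch embeddings
    txt = torch.randn(2, 12, 512)   # e.g. 12 word-piece embeddings
    print(CrossAttentionFusion()(img, txt).shape)          # torch.Size([2, 512])
    print(ConcatFusion()(img.mean(1), txt.mean(1)).shape)  # torch.Size([2, 512])
```

The cross-attention variant lets each word query the image directly before pooling, rather than merging two already-pooled vectors, which is one way of moving past simple concatenation.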
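As a hedged illustration of the emergent-communication dynamics mentioned above, the following sketch runs a minimal Lewis signaling game with urn-style reinforcement learning. The game size, learning rule, and parameter values are illustrative assumptions, not a description of our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_SIGNALS = 3, 3                  # world states and available signals
sender = np.ones((N_STATES, N_SIGNALS))     # urn weights: state -> signal
receiver = np.ones((N_SIGNALS, N_STATES))   # urn weights: signal -> guessed state

def sample(weights):
    p = weights / weights.sum()
    return rng.choice(len(p), p=p)

for step in range(20_000):
    state = rng.integers(N_STATES)          # sender privately observes a state
    signal = sample(sender[state])          # and emits a signal
    guess = sample(receiver[signal])        # receiver acts on the signal alone
    if guess == state:                      # on success, reinforce both choices
        sender[state, signal] += 1.0
        receiver[signal, guess] += 1.0

# After learning, each state is typically mapped to a distinct signal,
# though the agents can also settle into partial-pooling equilibria.
print(np.round(sender / sender.sum(axis=1, keepdims=True), 2))
print(np.round(receiver / receiver.sum(axis=1, keepdims=True), 2))
```

Even this tiny setup exhibits the core phenomenon: a shared code emerges purely from interaction and reward, with no signal having any meaning in advance.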
Our interdisciplinary methodology draws from linguistics, cognitive science, computer vision, and machine learning to push the boundaries of language understanding in both unimodal and multimodal settings.