
Different Minds Collaborative Virtual Spring Conference
April 8th, 2026
Trainee Presenters
Please join us for an exciting series of talks featuring the trainees of the Different Minds Collaborative.
Ginni Strehle
Vanderbilt University
PI: Dr. Frank Tong
Face-trained deep neural network is severely misaligned with human perceptual judgments of face shape and texture
Recent studies have suggested that human perception of facial similarity can be accurately predicted by face-trained deep neural network (DNN) models (e.g., Jozwik et al., 2022; Dobs et al., 2023). However, it remains unclear whether human observers and DNN models are responding to the same dimensions of facial appearance. We explored this question by leveraging a 3D morphable model of face appearance (Paysan et al., 2009), which represented the shapes and textures of real human faces in separate principal component spaces. This model allowed us to independently and explicitly alter face shape and texture. In a forced-choice task, participants (n=26) viewed a target face and two modified faces that differed along a specific shape or texture principal component. Participants then chose the face that appeared more different from the target. We conducted an analogous comparison on a ResNet-50 model trained on VGGFace2 by computing cosine distances between the penultimate-layer response patterns evoked by those faces. Whereas humans perceived shape alterations as more salient, the model treated texture alterations as more distinct. Next, in a method-of-adjustment task, we generated pairs of faces that differed along multiple dimensions of shape and/or texture. Participants (n=23) traversed these dimensions by adjusting a probe face until they perceived a just-noticeable difference in identity from a reference face. We then compared human adjustment thresholds to the cosine distances between the penultimate-layer response patterns of the median human threshold faces and the associated reference faces. Both the human identity thresholds and the DNN response patterns were more sensitive to face shape than texture in this full dimensional space. However, human-to-human similarity far exceeded DNN-to-human similarity, especially in shape trials (human: r = 0.60; DNN: r = -0.31). Overall, our findings highlight major differences in how DNNs and humans process faces, particularly regarding face shape.
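For readers unfamiliar with this style of model comparison, the sketch below illustrates the general approach rather than the authors' exact pipeline: extracting penultimate-layer response patterns from a ResNet-50 and comparing faces by cosine distance. The checkpoint path, image file names, and preprocessing are placeholders, since torchvision does not ship VGGFace2 weights.

```python
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

# Generic preprocessing; the actual VGGFace2 pipeline may differ.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = models.resnet50()
model.fc = torch.nn.Identity()  # expose the penultimate (2048-d) layer
# Hypothetical checkpoint: torchvision does not ship VGGFace2 weights.
model.load_state_dict(torch.load("resnet50_vggface2.pth"), strict=False)
model.eval()

def penultimate(path: str) -> torch.Tensor:
    """Penultimate-layer response pattern for one face image."""
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return model(x).squeeze(0)

def cosine_distance(a: torch.Tensor, b: torch.Tensor) -> float:
    return 1.0 - F.cosine_similarity(a, b, dim=0).item()

# The model's "choice" is the altered face whose response pattern lies
# farther, in cosine distance, from the target's pattern.
target = penultimate("target.png")  # illustrative file names
d_shape = cosine_distance(target, penultimate("shape_altered.png"))
d_texture = cosine_distance(target, penultimate("texture_altered.png"))
print("model judges more different:", "shape" if d_shape > d_texture else "texture")
```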
Suna Yoolf
University of British Columbia
PI: Dr. Ipek Oruc
Perceptual Narrowing in Face Perception: Effects of Multiethnic Exposure
Face recognition is central to social interactions. Exposure to a heterogeneous face-diet can lead to distinct patterns of face expertise. In our previous work on face memory, our findings supported the experience-limited hypothesis: participants with substantial exposure to both East Asian and Caucasian faces demonstrated native-like expertise in both face categories, without a super-native advantage. However, it remained unclear whether these effects are specific to face memory or extend to face perception. Here, we examine how face-diet impacts face perception. Participants from three groups (DUAL, N=9; MONO-EA, N=8; MONO-CAU, N=13) were recruited based on the Social Exposure Survey, a self-report measure of exposure to East Asian, Caucasian, and African faces, and completed an online odd-one-out face perception task (Pavlovia.org). Five seed faces per category (MR2 Face Database) were used to create morph continua, forming 10 pairs per category. Eleven difficulty levels (80% to 4%) were sampled. Participants completed 10 catch trials and 660 test trials across three sessions, in a 3-alternative forced-choice (3AFC) paradigm. In a 3×3 mixed design with stimulus category (EA, CAU, AF) as the within-subjects factor and exposure (DUAL, MONO-EA, MONO-CAU) as the between-subjects factor, we report discrimination thresholds estimated at 67% criterion accuracy. Preliminary results show lower thresholds for own-race observers and for the DUAL group, relative to other-race observers, for CAU and EA faces. No group differences emerged for AF faces. These trends are consistent with the experience-limited account. Data collection is ongoing, and inferential statistical analyses will follow. Understanding how expertise develops in face perception advances models of perceptual plasticity and high-level form vision, suggesting that exposure shapes perception as well as memory for faces, in an experience-dependent manner.
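The abstract does not specify the fitting procedure, but a threshold "estimated at 67% criterion accuracy" in a 3AFC task is commonly obtained by fitting a psychometric function with a 1/3 guess rate and inverting it at the criterion. A minimal sketch, with illustrative accuracy data standing in for real observers:

```python
import numpy as np
from scipy.optimize import curve_fit

# 3AFC psychometric function: 1/3 guess rate, no lapse term, Weibull shape.
def weibull_3afc(x, alpha, beta):
    return 1/3 + (2/3) * (1 - np.exp(-(x / alpha) ** beta))

# Illustrative data: morph-difference levels (%) and proportion correct.
levels = np.array([4, 8, 12, 16, 24, 32, 40, 48, 60, 70, 80], dtype=float)
accuracy = np.array([0.36, 0.40, 0.46, 0.53, 0.64, 0.72,
                     0.80, 0.86, 0.91, 0.95, 0.97])

(alpha, beta), _ = curve_fit(weibull_3afc, levels, accuracy, p0=[30.0, 2.0])

# Invert the fitted function at the 67%-correct criterion.
criterion = 0.67
threshold = alpha * (-np.log(1 - (criterion - 1/3) / (2/3))) ** (1 / beta)
print(f"Discrimination threshold at 67% correct: {threshold:.1f}% morph difference")
```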
Gillian Rosenberg
Barnard College
PI: Dr. Michelle Greene
The Limits of Learning from Pictures and Text: Vision-Language Models and Embodied Scene Understanding
What information is sufficient to learn the full richness of human scene understanding? The distributional hypothesis holds that the statistical co-occurrence of language and images captures the conceptual knowledge underlying visual cognition. Vision-language models (VLMs) are trained on massive paired text-image corpora but lack embodied experience, making them an ideal test of the distributional hypothesis. We report two experiments comparing descriptions generated by 18 VLMs to those of over 2000 human observers across 15 high-level scene understanding tasks, spanning general knowledge, affordances, sensory experiences, affective responses, and future prediction. Because many tasks lack ground truth answers, we developed a Human-Calibrated Cosine Distance (HCD) metric that measures VLM output similarity to the distribution of human responses, scaled by within-human variability. In Experiment 1, VLMs approached human-level performance on general knowledge tasks, but showed a robust deficit for affordance tasks that resisted prompt engineering and did not improve with newer model releases. In Experiment 2, we tested six mechanistic hypotheses for explaining this affordance gap, finding that the deficit was structural rather than stylistic and was not resolved by providing explicit spatial information. Corpus analyses revealed that image captioning datasets contain sparse agent-addressed affordance language, consistent with Gricean accounts of why embodied knowledge may be systematically underrepresented in language. Together, these findings suggest that distributional learning from images and text is insufficient for affordance-based scene understanding, implying that some dimensions of human visual cognition may require the kind of agent-centered, three-dimensional experience that no photograph or caption can encode.
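The exact formula for the HCD metric is not given in the abstract; the following is a hypothetical reconstruction of the idea as described (VLM-to-human cosine distance scaled by within-human variability), operating on response embeddings from any sentence encoder:

```python
import numpy as np

def cosine_dist(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def hcd(vlm_vec: np.ndarray, human_vecs: np.ndarray) -> float:
    """Hypothetical reconstruction of Human-Calibrated Cosine Distance:
    mean VLM-to-human cosine distance divided by mean within-human
    (pairwise) cosine distance. The published metric may differ in detail."""
    to_humans = np.mean([cosine_dist(vlm_vec, h) for h in human_vecs])
    n = len(human_vecs)
    within = np.mean([cosine_dist(human_vecs[i], human_vecs[j])
                      for i in range(n) for j in range(i + 1, n)])
    return to_humans / within  # ~1.0 means the VLM sits within human variability

# Illustrative use with random stand-ins for sentence-encoder embeddings:
rng = np.random.default_rng(0)
human_embeddings = rng.normal(size=(20, 384))
vlm_embedding = rng.normal(size=384)
print(f"HCD = {hcd(vlm_embedding, human_embeddings):.2f}")
```

Scaling by within-human variability is what makes the measure meaningful for tasks without ground truth: a VLM is penalized only to the extent that it is farther from the human distribution than humans are from each other.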
Ilgin Cebioglu
Newcastle University
PI: Dr. Quoc Vuong
Do chromatic visual-evoked potentials predict chromatic sensitivity in colour vision deficiency?
Congenital colour vision deficiencies (CVD) are defined by alterations in the spectral sensitivities of the long- (L) or middle- (M) wavelength sensitive cone photoreceptors in the retina, resulting in reduced chromatic sensitivity in the L vs M (‘red-green’) chromatic pathway. The N1 visual-evoked potential (VEP) component, associated with chromatic processing, has been shown to have lower amplitudes and longer peak latencies in individuals with CVD. However, it is unclear whether the amplitude and/or latency of the N1 component are reliable indicators of individual differences in chromatic sensitivity. Here we examine whether behavioural measures of chromatic sensitivity are predicted by the amplitude and latency of the N1 component. Thirty-eight participants (19 normal trichromats, 19 with CVD) completed a battery of behavioural colour vision assessments, including the Colour Assessment and Diagnosis (CAD) test. The CAD was used to measure chromatic sensitivity thresholds along the L vs M and S vs LM (‘blue-yellow’) chromatic axes. Participants also completed an EEG task in which they viewed Gabor patches modulated along both chromatic axes at four contrast levels. Stimuli were presented across 65 consecutive trials per condition (300 ms onset, 1000-1200 ms offset). Participants responded with a key press when a rare vertical grating appeared, which served solely as an attentional control. VEPs were obtained by averaging EEG responses to horizontal gratings from occipital electrodes (Oz, O1 and O2) across all trials for each condition. N1 amplitude was defined as the minimum amplitude within 90-220 ms following stimulus onset, and N1 latency as the time of that minimum. Preliminary findings indicate that N1 amplitudes and latencies differ between participants with normal colour vision and CVD, but do not correlate with CAD thresholds along the two chromatic axes. These findings suggest that the N1 component captures group-level differences in chromatic processing but may not provide a reliable neural measure of individual chromatic sensitivity.
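As a concrete illustration of the N1 extraction step, the sketch below finds the minimum amplitude and its latency in a 90-220 ms window of a trial-averaged waveform; the synthetic data and 1 kHz sampling rate are assumptions for demonstration only:

```python
import numpy as np

def n1_peak(erp: np.ndarray, times_ms: np.ndarray, window=(90, 220)):
    """N1 amplitude and latency from a trial-averaged VEP.

    erp      -- averaged waveform (e.g., mean of Oz, O1, O2), in microvolts
    times_ms -- sample times relative to stimulus onset, in ms

    N1 amplitude is the minimum (most negative) value inside the search
    window; N1 latency is the time at which that minimum occurs.
    """
    mask = (times_ms >= window[0]) & (times_ms <= window[1])
    idx = np.argmin(erp[mask])
    return erp[mask][idx], times_ms[mask][idx]

# Illustrative use on a synthetic waveform sampled at 1 kHz:
times = np.arange(-100.0, 500.0)                      # -100 to 499 ms
erp = -5.0 * np.exp(-((times - 150.0) / 30.0) ** 2)   # fake N1 dip near 150 ms
amp, lat = n1_peak(erp, times)
print(f"N1: {amp:.1f} uV at {lat:.0f} ms")
```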
Deepkhushi Baidwan
University of Victoria
PI: Dr. Jim Tanaka
VisDeep: Automating Stimulus Selection for Perceptual Research
Researchers studying perception routinely need stimulus sets that are matched across experimental conditions, for example, selecting face sets that differ in race but are comparable on attractiveness, trustworthiness, and other perceptual dimensions. Traditional approaches rely on manual inspection or matching on a handful of attributes, leaving studies vulnerable to hidden confounds and limiting reproducibility. VisDeep is a Python-based framework that addresses this need by enabling objective, distribution-level stimulus matching across high-dimensional feature spaces using Earth Mover's Distance (EMD). Rather than requiring researchers to hand-pick stimuli or eyeball scatter plots, VisDeep automates the selection process: a researcher specifies the groups they need, the features they care about, and how to weight them, and the tool returns optimally matched subsets with full statistical documentation. As a working example, VisDeep was applied to the Chicago Face Database to select 20 Asian and 20 White male faces matched across neural, physical, and perceptual features, a common requirement in cross-race face perception research. The resulting sets were closely matched (EMD = 0.028) while preserving natural within-group variability, demonstrating that the tool can deliver what researchers typically spend hours doing manually. Because VisDeep operates on any rated feature space, it generalizes beyond faces: any domain where stimuli can be described by measurable attributes (objects, scenes, voices, or biological stimuli) can use the same workflow. This talk will focus on the practical research scenarios VisDeep supports and how it fits into an experimental design pipeline.
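VisDeep's internals are not detailed in the abstract, but the core matching criterion it describes, a distribution-level Earth Mover's Distance over weighted feature spaces, can be sketched with scipy. The feature names, ratings, and weighting scheme below are illustrative, not the tool's actual API:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def weighted_emd(group_a: dict, group_b: dict, weights: dict) -> float:
    """Weighted mean of per-feature 1-D Earth Mover's Distances.

    group_a, group_b -- {feature_name: array of per-stimulus ratings}
    weights          -- {feature_name: relative importance}
    """
    total = sum(weights.values())
    return sum(w * wasserstein_distance(group_a[f], group_b[f])
               for f, w in weights.items()) / total

# Illustrative use: two candidate 20-face subsets rated on two dimensions.
rng = np.random.default_rng(1)
set_a = {"attractiveness": rng.normal(3.5, 0.6, 20),
         "trustworthiness": rng.normal(3.2, 0.5, 20)}
set_b = {"attractiveness": rng.normal(3.5, 0.6, 20),
         "trustworthiness": rng.normal(3.2, 0.5, 20)}
w = {"attractiveness": 1.0, "trustworthiness": 1.0}
print(f"EMD between candidate sets: {weighted_emd(set_a, set_b, w):.3f}")
# A full selection procedure would search over candidate subsets to
# minimize this score while preserving within-group variability.
```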