Machine Learning Classification of Natural Conversational Utterances Using Acoustic Features Drawn from Children with ASD and Typical Controls

Poster Presentation
Friday, May 3, 2019: 11:30 AM-1:30 PM
Room: 710 (Palais des congrès de Montréal)
S. Cho1, M. Liberman2, N. Ryant1, K. Bartley3, R. T. Schultz4 and J. Parish-Morris4, (1)Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, (2)University of Pennsylvania, Philadelphia, PA, (3)Center for Autism Research, The Children's Hospital of Philadelphia, Philadelphia, PA, (4)Center for Autism Research, Children's Hospital of Philadelphia, Philadelphia, PA

Background: The earliest descriptions of autism spectrum disorder (ASD) mention atypical speech patterns, including unusual prosody. Although phonetic properties of speech have been explored in ASD, most prior research samples were either elicited in a highly structured context (e.g., reading sentences or word lists) or drawn from semi-structured clinical interviews with an autism expert (i.e., ADOS evaluations). While valuable, these studies produce results that may not generalize to the everyday conversations that matter most to children on the autism spectrum. In this study, we address a gap in the literature by developing and testing a machine learning (ML) classification approach applied to children’s natural interactions with a naïve conversational partner.


Objectives: (1) Automatically measure phonetic features in the natural conversations of children with ASD and typically developing controls (TD); (2) develop an ML classifier to predict the diagnostic category of the speaker.


Methods: Seventy children with ASD (N=35, 13 females) or TD (N=35, 11 females), matched on IQ (ASD: 105; TD: 107; t=-.53, p=.6) and age (ASD: 11.42; TD: 10.57; t=1.33, p=.19), completed a 5-minute “get-to-know-you” conversation with a novel confederate who was not an autism expert (N=22, 19 females). At the turn level, we extracted 31 acoustic features from each participant’s utterances, using the RAPT algorithm (as implemented in get_f0) for pitch (Talkin, 1995) and using Praat (Boersma & Weenink, 2017) and VoiceSauce (Shue et al., 2011) for other intensity and spectral features (Table). To avoid pitch-tracking errors, we ran the pitch tracker twice: once to estimate each speaker’s modal pitch range, and once to track pitch within the obtained speaker-specific range. Pitch values were normalized from Hz to semitones using each speaker’s 5th percentile as the base. We trained a support vector machine using Scikit-learn (Pedregosa et al., 2011) with a radial basis function kernel and 5-fold cross-validation; all acoustic features were scaled using MinMaxScaler in Scikit-learn. Finally, we applied a simple majority vote over each child’s turn-level classification results to predict that child’s final diagnostic status.
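The classification pipeline above can be sketched as follows. This is a minimal illustration using synthetic stand-in data: the F0 values, feature matrix, and labels are randomly generated placeholders (the actual features came from Praat, VoiceSauce, and get_f0, which are not reproduced here), and `hz_to_semitones` is a hypothetical helper implementing the standard Hz-to-semitone conversion with a speaker-specific base.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def hz_to_semitones(f0_hz, base_hz):
    """Convert pitch from Hz to semitones relative to a base frequency
    (here, the speaker's 5th-percentile F0)."""
    return 12.0 * np.log2(f0_hz / base_hz)

# Per-speaker pitch normalization on a fake F0 track (Hz)
f0 = rng.uniform(120.0, 300.0, size=500)
base = np.percentile(f0, 5)          # speaker-specific 5th percentile
f0_semitones = hz_to_semitones(f0, base)

# Synthetic turn-level data: 400 turns x 31 acoustic features, binary labels
X = rng.normal(size=(400, 31))
y = rng.integers(0, 2, size=400)     # 0 = TD, 1 = ASD (placeholder labels)

# MinMax scaling + RBF-kernel SVM, evaluated with 5-fold cross-validation
clf = make_pipeline(MinMaxScaler(), SVC(kernel="rbf"))
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
```

Wrapping the scaler and classifier in a single pipeline ensures the MinMax scaling is re-fit inside each cross-validation fold, avoiding leakage from held-out turns.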


Results: Our turn-level acoustic classifier correctly identified whether an utterance came from an ASD or TD participant 60.59% of the time. A speaker-level majority vote over the turn-level results correctly distinguished ASD from TD 66.67% of the time. The accuracy of this acoustic/phonetic classifier exceeds the 59% average recall reported for phonetic features at the turn level in prior research (Bone et al., 2016), and is especially promising given that the current data are drawn from natural conversations with naïve interlocutors, which tend to be messier and more variable than speech elicited in structured or clinical contexts.
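The speaker-level aggregation step can be sketched as a simple majority vote over a child's turn-level predictions; `majority_vote` below is a hypothetical helper, and the example labels are illustrative, not study data.

```python
from collections import Counter

def majority_vote(turn_predictions):
    """Reduce one speaker's turn-level labels (0 = TD, 1 = ASD)
    to a single predicted diagnostic status."""
    return Counter(turn_predictions).most_common(1)[0][0]

print(majority_vote([1, 0, 1, 1, 0]))  # → 1 (ASD wins, 3 votes to 2)
```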


Conclusions: This preliminary exploration suggests that acoustic features of natural conversation are useful for distinguishing children’s diagnostic category, and advances the goal of scaling real-world machine learning applications in ASD. Our next step is to combine multiple classifiers, using more sophisticated algorithms, feature selection methods, and an expanded feature set that includes lexical information (e.g., word choice, frequency of non-speech vocalizations, filled pauses), to predict diagnostic status and estimate symptom severity.