Robustness Analysis for Computational Speech Features during Naturalistic Clinician-Child Interactions
Computational analysis of autism diagnostic sessions using automatically extracted features combined with machine learning has provided valuable insights (Bone et al., 2016). Specifically, speech and language features extracted from both the child and the interacting clinician have been shown to be significantly predictive of autism symptom severity (Bone et al., 2015). However, feature extraction depends on knowing who spoke when and about what (i.e., speaker labels, segment durations, and transcripts), which can be time-consuming and expensive to obtain manually. Recent technological developments have enabled accurate automatic speaker segmentation, speaker labeling, and speech-to-text conversion. However, behavioral feature extraction built on these algorithms needs to be refined in order to yield clinically meaningful interpretations for ASD.
Objectives:
We test the feasibility and validity of an automated end-to-end speech processing pipeline (consisting of speech detection, speaker diarization, automatic speech recognition (ASR), and role-assignment modules) for extracting speech and language features from audio of naturalistic clinician-child interactions. By replacing oracle labels with system outputs at each module, we simulated multiple error conditions that result in inaccurate features. For each condition, we studied feature robustness across categories (lexical, turn-taking, prosodic) and the features' ability to predict autism symptom severity.
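To make the error-condition simulation concrete, the sketch below shows one way oracle labels could be swapped for automatic module outputs before re-extracting the same feature set. The module names, data structures, and the single example feature are illustrative placeholders under assumed interfaces, not the actual implementation used in this work.

```python
# Minimal sketch: at each module, oracle labels are replaced by automatic
# outputs before session-level features are re-extracted. All names and
# interfaces here are hypothetical.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class SessionLabels:
    # (start_sec, end_sec, role) for each speech segment; role is "clinician" or "child"
    segments: List[Tuple[float, float, str]]
    # role -> word tokens attributed to that role
    transcripts: Dict[str, List[str]]


def clinician_speaking_fraction(labels: SessionLabels) -> float:
    """One example turn-taking feature: share of speech time held by the clinician."""
    total = sum(end - start for start, end, _ in labels.segments)
    clin = sum(end - start for start, end, role in labels.segments if role == "clinician")
    return clin / total if total > 0 else 0.0


MODULES = ["speech_detection", "diarization", "asr", "role_assignment"]


def features_under_condition(
    oracle: SessionLabels,
    system: Dict[str, Callable[[SessionLabels], SessionLabels]],
    replace_through: str,
) -> Dict[str, float]:
    """Substitute automatic outputs for oracle labels up to `replace_through`,
    then extract the same session-level features from the resulting labels."""
    labels = oracle
    for module in MODULES:
        labels = system[module](labels)
        if module == replace_through:
            break
    return {"clinician_speaking_fraction": clinician_speaking_fraction(labels)}
```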
Methods:
As a preliminary analysis, we selected 27 semi-structured examiner-child interaction sessions drawn from a new treatment-outcome measure (Brief Observation of Social Communication Change [BOSCC; Grzadzinski et al., 2016]; n=24) and a gold-standard diagnostic measure (Autism Diagnostic Observation Schedule-2 [ADOS-2; Lord et al., 2012]; n=3) (age: µ=9.3 years, σ=3.4; verbal IQ: µ=98.6, σ=24.3). Trained annotators labeled speaker boundaries and transcripts, which were used to extract session-level features (oracle features). Next, the speech pipeline was employed to extract an identical set of features (pipeline features), wherein at each module the oracle labels were replaced with the outputs of the preceding modules. Feature robustness was estimated by comparing pipeline features to oracle features. For each type of module error, we used the normalized mean squared error (NMSE) to identify a subset of robust features. Using linear regression models, we compared the ADOS calibrated severity scores (CSS) predicted from pipeline features versus oracle features to assess the predictive power of the former for symptom severity.
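The sketch below illustrates the two comparisons, assuming the per-session feature values and severity scores are already collected into arrays. The NMSE convention shown (squared error normalized by the variance of the oracle values), the robustness threshold, and the use of scikit-learn's LinearRegression are assumptions for illustration rather than the study's exact procedure.

```python
# Sketch of the robustness screening and severity-prediction comparison.
import numpy as np
from sklearn.linear_model import LinearRegression


def nmse(oracle: np.ndarray, pipeline: np.ndarray) -> float:
    """Mean squared error between pipeline and oracle values, normalized by
    the variance of the oracle values across sessions (assumed convention)."""
    return float(np.mean((pipeline - oracle) ** 2) / np.var(oracle))


def robust_features(oracle_feats: dict, pipeline_feats: dict,
                    threshold: float = 0.1) -> list:
    """Keep features whose pipeline values track oracle values closely.
    The threshold value is illustrative."""
    return [name for name in oracle_feats
            if nmse(oracle_feats[name], pipeline_feats[name]) < threshold]


def compare_severity_prediction(X_oracle: np.ndarray, X_pipeline: np.ndarray,
                                css: np.ndarray) -> float:
    """Fit severity regressions on oracle and pipeline features and compare
    the predicted calibrated severity scores via NMSE."""
    pred_oracle = LinearRegression().fit(X_oracle, css).predict(X_oracle)
    pred_pipeline = LinearRegression().fit(X_pipeline, css).predict(X_pipeline)
    return nmse(pred_oracle, pred_pipeline)
```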
Results:
We obtained perfect role assignment for all sessions, word error rates of 45.61% (clinician) and 75.63% (child) from the ASR system, a speaker error rate of 9.42% from the speaker diarization system, and an F-score of 0.90 from the speech detection system. Similar feature subsets were found to be robust under both conditions: (lexical) first-person pronoun use by the clinician, (turn-taking) clinician speaking fraction and utterance lengths, and (prosodic) intonation slope and intercept from both speakers (Table 1). Further, the regression model (adjusted R2=0.26, DF=16) trained on pipeline features predicted CSS values nearly identical to those predicted from oracle features (NMSE=0.026; Fig. 1).
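For reference, the word error rate reported above is the standard word-level edit distance (substitutions, insertions, deletions) between the reference transcript and the ASR output, normalized by the number of reference words. A small self-contained computation is sketched below; the example utterances are illustrative and not drawn from the study data.

```python
# Standard word error rate via word-level edit distance.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


# Illustrative example only (not study transcripts): one deletion and one
# substitution over six reference words gives WER = 2/6.
print(word_error_rate("do you like the blue truck", "you like a blue truck"))
```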
Conclusions:
We built a fully automated speech pipeline to extract behavioral features during semi-structured, naturalistic examiner-child interactions. We identified features that are robust to different module errors and demonstrated that these robust pipeline features retain predictive power similar to that of oracle features. These results are promising for more automated, scalable analyses of speech that remove the need for manually annotated speaker labels.