Combining Supervised and Unsupervised Learning to Subgroup Autism Spectrum Disorder According to Regional Brain Volumes
Objectives: Using both supervised and unsupervised machine learning techniques, we aim to 1) identify the brain volumes most important for distinguishing ASD-DM from individuals with ASD but more typical brain volumes (ASD-N) and 2) use these brain regions to cluster individuals into groups with more homogeneous volumetric neural phenotypes.
Methods: We acquired structural magnetic resonance imaging (MRI) scans of 147 male preschool-aged children with ASD. ASD-DM was defined as a total cerebral volume greater than 1.5 standard deviations above the mean of an established sample of age-matched typically developing (TD) children, resulting in 16 ASD-DM and 131 ASD-N cases. Volumetric data from 239 brain regions were extracted using an automated T1-segmentation pipeline (https://mricloud.org) and normalized for total brain volume. Regional brain volumes were used as input features for classifying ASD-DM versus ASD-N with a RandomForest model. Model accuracy (ACC), specificity (SP) and sensitivity (SN) were estimated using a 10-fold cross-validation scheme, with the synthetic minority oversampling technique (SMOTE) applied to account for class imbalance, as well as in an independent sample of 7 ASD-DM and 36 ASD-N cases. Model significance was assessed via n=1000 permutations of the class labels. The most discriminative features were determined according to mean decrease in accuracy and Out-of-Bag (OOB) error. Hierarchical clustering was then performed on the entire sample using the most discriminative volumetric features in order to identify clusters of individuals with homogeneous volumetric neural phenotypes.
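The evaluation metrics and the label-permutation test described above can be illustrated with a minimal sketch. It is not the pipeline used in the study: the one-feature toy data, the midpoint-threshold rule standing in for the cross-validated RandomForest, and the names `confusion_metrics`, `permutation_p_value`, and `toy_score` are all assumptions made for illustration.

```python
import random


def confusion_metrics(y_true, y_pred, positive="ASD-DM"):
    """Return (accuracy, sensitivity, specificity) for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    acc = (tp + tn) / len(pairs)
    sn = tp / (tp + fn) if tp + fn else float("nan")
    sp = tn / (tn + fp) if tn + fp else float("nan")
    return acc, sn, sp


def permutation_p_value(score_fn, labels, n_permutations=1000, seed=0):
    """One-sided p-value: share of label shuffles scoring >= the observed score.

    score_fn(labels) must refit and score the model on the given labels.
    """
    rng = random.Random(seed)
    observed = score_fn(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if score_fn(shuffled) >= observed:
            hits += 1
    # Adding 1 to numerator and denominator keeps p strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical toy data: a single normalized volume per child (not study data).
toy_feature = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 1.4, 1.5, 1.6, 1.7]
toy_labels = ["ASD-N"] * 6 + ["ASD-DM"] * 4


def toy_score(labels):
    """Stand-in for the classifier: midpoint threshold, balanced accuracy."""
    dm = [x for x, lab in zip(toy_feature, labels) if lab == "ASD-DM"]
    nt = [x for x, lab in zip(toy_feature, labels) if lab != "ASD-DM"]
    threshold = (sum(dm) / len(dm) + sum(nt) / len(nt)) / 2
    preds = ["ASD-DM" if x > threshold else "ASD-N" for x in toy_feature]
    _, sn, sp = confusion_metrics(labels, preds)
    return (sn + sp) / 2


p_value = permutation_p_value(toy_score, toy_labels, n_permutations=1000)
print(f"permutation p-value: {p_value:.4f}")
```

In a real analysis, `score_fn` would rerun the full cross-validation (including any oversampling, applied inside the training folds only) on each shuffled label vector, so that the null distribution reflects the entire modeling procedure.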
Results: RandomForest classified ASD-DM versus ASD-N with a cross-validated ACC=89%/SN=74%/SP=90%. Similar results were observed in the independent sample (ACC=86%/SN=75%/SP=87%). Permutation testing showed all classification results to be significantly above chance level (p<0.05). After ranking features by mean decrease in accuracy, selecting 28 features yielded the lowest OOB error. The top 28 features, which included the superior frontal gyrus, middle temporal gyrus, and posterior cingulate gyrus, were therefore carried forward for hierarchical clustering of the entire sample. Cutting the resulting dendrogram at the second level yielded two clusters containing n=4 ASD-DM/110 ASD-N and n=12 ASD-DM/21 ASD-N, respectively.
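The clustering step, and the cross-tabulation of cluster membership against ASD-DM/ASD-N status, can be sketched as follows. This is a minimal illustration under stated assumptions: average linkage with Euclidean distance is assumed because the linkage and distance metric are not stated here, the two-dimensional points are hypothetical, and `agglomerative_clusters` and `crosstab` are illustrative helper names. Stopping the merges when two clusters remain corresponds to cutting the dendrogram into two branches.

```python
from collections import Counter
from math import dist


def agglomerative_clusters(points, n_clusters):
    """Average-linkage agglomerative clustering; returns one cluster id per point."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise Euclidean distance between the two clusters.
                total = sum(
                    dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the closest pair; a < b, so deleting b keeps index a valid.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    ids = [0] * len(points)
    for cluster_id, members in enumerate(clusters):
        for i in members:
            ids[i] = cluster_id
    return ids


def crosstab(cluster_ids, diagnoses):
    """Count individuals for each (cluster id, diagnosis) combination."""
    return Counter(zip(cluster_ids, diagnoses))


# Hypothetical toy features (two selected volumes per child), not study data.
toy_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
              (3.0, 3.0), (3.1, 2.9), (2.9, 3.2)]
toy_diagnoses = ["ASD-N", "ASD-N", "ASD-DM", "ASD-DM", "ASD-DM", "ASD-N"]

cluster_ids = agglomerative_clusters(toy_points, n_clusters=2)
table = crosstab(cluster_ids, toy_diagnoses)
for (cluster_id, diagnosis), count in sorted(table.items()):
    print(f"cluster {cluster_id}: {count} {diagnosis}")
```

With the full sample, the same cross-tabulation would report how the ASD-DM and ASD-N cases distribute across the two clusters.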
Conclusions: Combining supervised and unsupervised machine learning techniques offers a powerful methodological framework for classifying and grouping individuals across the autism spectrum into more homogeneous, biologically based subgroups. Such techniques represent a valuable tool for future efforts to identify new ASD subgroups with shared biological features.