Can Computational Surveillance of Autism Spectrum Disorder Fuel Big-Data Science?

Warren, Zachary

Background: The research community increasingly recognizes that unraveling the complex heterogeneity associated with autism spectrum disorder (ASD) will require meaningful access to extremely large datasets tied to both phenotypic and genetic data. To this end, a number of initiatives have been established to develop big and/or open data resources and collections for autism research. However, construction of these datasets has often relied on expensive and intensive methods for recruiting and phenotyping children in conjunction with the collection of biological samples. As such, even the most successful consortium collections may fall well short of the numbers and quality of data necessary to help answer extremely important questions.

Objectives:

In the current study, we describe our work using advanced computational methods to model and validate cohorts of individuals with ASD within an existing large electronic health record (EHR) structure that also contains important neurodevelopmental and genetic information (BioVU).

Methods: We describe our process for using machine learning to develop and validate ASD case models across (a) an existing ‘gold-standard’ tertiary clinical research center and (b) sequential cohorts of children identified as part of the CDC ADDM network. We hypothesized that robust models developed from deep and broad data sources could yield hybrid computational surveillance and case identification methods by which existing EHR structures could be leveraged to identify ‘collections’ of individuals with ASD. Notably, this model was employed within a bioinformatics structure that simultaneously banks phenotypic and genetic material as part of standard care.
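A simple baseline for case identification within an EHR is an administrative, code-based rule, whose output can be scored against reference case labels. The following is a minimal sketch of that idea, not the study's method: the record layout, the toy records, the function names, and the choice of the ICD-9-CM 299.x prefix as the screening rule are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Set

# Hypothetical toy records: child ID -> diagnosis codes found in the EHR.
# The codes are illustrative ICD-9-CM style strings, not study data.
ehr_codes: Dict[str, list] = {
    "child_a": ["299.00", "314.01"],
    "child_b": ["315.31"],
    "child_c": ["299.80"],
    "child_d": ["V20.2"],
}

# Hypothetical reference labels, e.g. from a clinician-reviewed registry.
reference_cases: Set[str] = {"child_a", "child_b", "child_c"}


def flag_by_code(codes: Iterable[str], prefix: str = "299.") -> bool:
    """Return True if any diagnosis code starts with the assumed prefix."""
    return any(code.startswith(prefix) for code in codes)


def sensitivity(flagged: Set[str], reference: Set[str]) -> float:
    """Fraction of reference cases that the rule also flagged."""
    if not reference:
        raise ValueError("reference set must not be empty")
    return len(flagged & reference) / len(reference)


flagged = {cid for cid, codes in ehr_codes.items() if flag_by_code(codes)}
print(sorted(flagged), round(sensitivity(flagged, reference_cases), 3))
```

In this toy example the rule misses child_b, who is a reference case without a 299.x code. A learned case model would be scored against the same reference labels so that the two approaches can be compared directly.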

Results:

We successfully linked both our clinical research registry (differentiating over 3,500 individuals with ASD from other complex neurodevelopmental conditions) and our initial 2006 cycle CDC ADDM population (all 8-year-olds in the surrounding 11-county region) to our university EHR bioinformatics structure. Some 73.9% of our population catchment (18,436 of 24,940 children within the birth year) were represented within the EHR. We compared initial model results, predicting case status for the entire population from the clinical research model, against simple administrative (code-based) identification. Both strategies were able to identify large numbers of children quickly (≥68.5% of children identified across models/processes); however, our results highlighted several potent challenges related to applying advanced computational surveillance and supervised or unsupervised learning within the EHR. Importantly, despite having relatively large numbers of children in our system (11.8% of studied EHR population), these numbers were still relatively small given how sparsely the EHR is populated for important variables/codes. As such, many key variables were left data-hungry because of the large amount of variation within each class (e.g., 15,962 medication codes and 4,010 CPT codes for the sample).
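The coverage figure and the sparsity concern can be checked with simple arithmetic. The sketch below uses only the counts reported above. The variable names are hypothetical, and dividing the number of EHR-represented children by the number of distinct codes is an assumed, rough way to illustrate why such variables are data-hungry; the abstract does not state that the code counts refer to exactly this set of children.

```python
# Counts reported in the Results.
ehr_children = 18_436        # children in the catchment represented in the EHR
catchment_children = 24_940  # all children within the birth year
medication_codes = 15_962    # distinct medication codes for the sample
cpt_codes = 4_010            # distinct CPT codes for the sample

# Share of the population catchment represented within the EHR.
coverage = ehr_children / catchment_children

# Rough ratio of children to distinct codes (assumed illustration only).
children_per_med_code = ehr_children / medication_codes
children_per_cpt_code = ehr_children / cpt_codes

print(f"EHR coverage: {coverage:.1%}")
print(f"Children per medication code: {children_per_med_code:.2f}")
print(f"Children per CPT code: {children_per_cpt_code:.2f}")
```

With roughly one child per distinct medication code, most individual codes can appear in only a handful of records, which is consistent with the sparsity described above.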

Conclusions:

Advanced computational methods may prove successful for creating large, meaningful data structures for unraveling the complex neurodevelopmental and genetic underpinnings of ASD. However, such methods will likely require even larger cohorts of individuals for predictive modeling and may still require hybrid methodologies for optimization.

31484 Can Computational Surveillance of Autism Spectrum Disorder Fuel Big-Data Science?