Improving the Identification of ASD Cases from Claims Data Using Machine Learning and Latent Class Analysis

Friday, May 12, 2017: 12:00 PM-1:40 PM
Golden Gate Ballroom (Marriott Marquis Hotel)
M. Brucato1, C. Ladd-Acosta2, R. Musci3, X. Hong1, D. M. Caruso4, M. D. Fallin5, X. Wang6 and E. Stuart7, (1)Johns Hopkins University School of Public Health, Baltimore, MD, (2)Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, (3)Mental Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, (4)Johns Hopkins University, Baltimore, MD, (5)Department of Mental Health, Johns Hopkins School of Public Health, Baltimore, MD, (6)Johns Hopkins University School of Public Health, Baltimore, MD, (7)Johns Hopkins School of Public Health, Baltimore, MD
Background:  Current methods for identifying ASD cases in electronic medical records, particularly when restricted to claims data without access to notes or laboratory results, rely on the presence of specific diagnostic codes (299 and its subcodes in ICD-9-CM). Claims datasets are a potential source of data for studies with large sample sizes, but ICD-9-CM diagnoses are prone to error and have limited sensitivity and specificity compared with gold-standard diagnostic measures. However, the considerable co-morbidity in ASD may allow other, non-ASD diagnoses to be used to improve the identification of potential ASD cases in claims datasets and to distinguish clinically relevant ASD subtypes.

Objectives:  We aim to improve the identification of ASD cases from claims data by using machine learning and latent class analysis to exploit the full variability in ICD-9-CM diagnoses.

Methods: We used data from the Boston Birth Cohort (BBC), a prospective birth cohort that enrolls predominantly urban, low-income minority mothers and their children. We restricted our analysis to the 2992 children with claims data available from the era of ICD-9-CM (1 Oct 2003 – 30 Sept 2015). Of these children, 771 were given the Social Communication Questionnaire (SCQ) by research staff.

In the 771 children with SCQ data, random forests were used to identify the specific ICD-9-CM codes, out of over 2500 unique codes present in the claims data, that were most predictive of the continuous SCQ score. The codes with the highest variable importance in the random forest were then used as indicators in a Latent Class Analysis (LCA) of the full dataset (n=2992).
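
To illustrate the LCA step only, the following is a minimal sketch of a latent class model for binary code indicators, fitted with a plain expectation-maximization loop. It is not the software or the settings used in this study: the function name fit_lca, the deterministic starting values, the small smoothing constants, and the toy data are all assumptions made for illustration.

```python
import math

def fit_lca(X, n_classes, n_iter=100):
    """Fit a latent class model to binary indicators with plain EM.

    X is a list of rows; each row is a list of 0/1 flags (code present or not).
    Returns (priors, item_probs, resp), where resp[i][k] is the posterior
    probability that row i belongs to class k.
    """
    n, m = len(X), len(X[0])
    priors = [1.0 / n_classes] * n_classes
    # Assumed deterministic start: class k begins with a flat code probability
    # of (k + 1) / (n_classes + 1), so higher-numbered classes start code-heavy.
    item_probs = [[(k + 1) / (n_classes + 1)] * m for k in range(n_classes)]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior class membership for each row, computed in log space.
        resp = []
        for row in X:
            log_post = []
            for k in range(n_classes):
                lp = math.log(priors[k])
                for x, p in zip(row, item_probs[k]):
                    lp += math.log(p if x else 1.0 - p)
                log_post.append(lp)
            top = max(log_post)
            weights = [math.exp(v - top) for v in log_post]
            total = sum(weights)
            resp.append([w / total for w in weights])
        # M-step: update class sizes and per-class code probabilities.
        # The small added constants keep every probability strictly in (0, 1).
        for k in range(n_classes):
            n_k = sum(r[k] for r in resp)
            priors[k] = (n_k + 1e-6) / (n + n_classes * 1e-6)
            item_probs[k] = [
                (sum(r[k] * row[j] for r, row in zip(resp, X)) + 0.01) / (n_k + 0.02)
                for j in range(m)
            ]
    return priors, item_probs, resp

# Hypothetical toy data: 90 children with none of 4 indicator codes, 10 with all 4.
X = [[0, 0, 0, 0]] * 90 + [[1, 1, 1, 1]] * 10
priors, item_probs, resp = fit_lca(X, n_classes=2)
# Modal assignment: each child goes to the class with the highest posterior.
assigned = [max(range(2), key=lambda k: r[k]) for r in resp]
```

On this toy input the two classes separate cleanly into roughly 90% and 10% of the rows. Real claims data are far noisier, and LCA fits are commonly repeated from multiple random starts and compared across numbers of classes, which this sketch does not do.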

We compared the LCA-based classification with case identification based simply on the presence of ASD-specific ICD-9-CM hospital diagnosis codes (299.00, 299.80, 299.90) from pediatric outpatient, inpatient, and emergency room visits, which is the current standard in studies that are limited to claims data.
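
The code-based comparison definition can be sketched as a simple lookup. This is an illustration under assumptions: the child identifiers, the layout of the claims as a dictionary of code strings, and the function names are hypothetical; only the three ASD codes come from the text above.

```python
# ASD-specific ICD-9-CM codes named in the abstract.
ASD_CODES = {"299.00", "299.80", "299.90"}

def has_asd_code(claim_codes):
    """Return True if any claim carries one of the ASD-specific codes."""
    return any(code.strip() in ASD_CODES for code in claim_codes)

def flag_children(claims_by_child):
    """Map each child ID to True/False under the code-based definition."""
    return {cid: has_asd_code(codes) for cid, codes in claims_by_child.items()}

# Hypothetical claims: child IDs and the non-ASD codes are made up for illustration.
claims_by_child = {
    "child_a": ["493.90", "299.00"],
    "child_b": ["493.90", "314.01"],
    "child_c": [],
}
flags = flag_children(claims_by_child)
```

Here a single qualifying code on any visit is enough to flag a child, which is what makes this definition sensitive to individual coding errors.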

Results:  Using random forests, we identified the 14 ICD-9-CM codes most predictive of SCQ score. These codes were used as indicators in an LCA of the full dataset (2992 children). LCA identified four classes among the subjects: one class (93% of the sample) had a low probability of a 299 code or other developmental diagnoses (normal class), while another class (4% of the sample) had a high probability of carrying a 299 diagnosis (ASD-type class). The ASD-type class derived from LCA (n=113) was smaller than the group identified by the standard code-based ASD definition (n=120). Nine children carried a 299 code but did not cluster within the ASD-type class, and 2 children did not carry a 299 code but were assigned to the ASD-type class on the basis of other characteristics. These children may represent false positive and false negative ASD cases, respectively, under the standard ICD-9-CM-based identification.
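
The reported counts can be reconciled with a small cross-tabulation of the two case definitions. In the sketch below the child IDs are hypothetical placeholders arranged only to reproduce the counts stated above (2992 children, 120 code-based cases, 9 code-only, 2 LCA-only); the function name cross_tab is likewise an assumption.

```python
def cross_tab(all_ids, code_cases, lca_cases):
    """Count children in each cell of the code-based vs. LCA-based 2x2 table."""
    return {
        "both": len(code_cases & lca_cases),
        "code_only": len(code_cases - lca_cases),
        "lca_only": len(lca_cases - code_cases),
        "neither": len(all_ids - code_cases - lca_cases),
    }

# Hypothetical IDs: the numbering is arbitrary and carries no meaning.
all_ids = set(range(2992))
code_cases = set(range(120))                  # children with a 299 code
lca_cases = set(range(9, 120)) | {120, 121}   # drop 9 code-only, add 2 LCA-only
table = cross_tab(all_ids, code_cases, lca_cases)
```

Under these counts, 120 - 9 = 111 children meet both definitions, and 111 + 2 = 113 matches the reported size of the ASD-type class.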

Conclusions:  Machine learning and LCA show promise in improving ICD-9-CM-based ASD case identification. This may also help in identifying subgroups of children with ASD based on their clinical heterogeneity.

See more of: Epidemiology