Improving the Identification of ASD Cases from Claims Data Using Machine Learning and Latent Class Analysis
Objectives: We aim to improve the identification of autism spectrum disorder (ASD) cases from claims data by using machine learning and latent class analysis to exploit the full variability in ICD-9-CM diagnoses.
Methods: We used data from the Boston Birth Cohort (BBC), a prospective birth cohort that enrolls predominantly urban, low-income minority mothers and their children. We restricted our analysis to the 2992 children with claims data available from the era of ICD-9-CM (1 Oct 2003 – 30 Sept 2015). Of these children, 771 were given the Social Communication Questionnaire (SCQ) by research staff.
Among the 771 children with SCQ data, we used random forests to identify which of the more than 2500 unique ICD-9-CM codes present in the claims data were most predictive of the continuous SCQ score. The codes with the highest variable importance in the random forest were then used as indicators in a latent class analysis (LCA) of the full dataset (n=2992).
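To illustrate the LCA step, the following is a minimal sketch of a latent class model with binary indicators, fitted by expectation-maximization. It is not the software or specification used in this study: the function name `fit_lca`, the smoothing constants, the two-class setting, and the simulated data are all assumptions made for illustration, and the random forest ranking step is not shown.

```python
import math
import random


def fit_lca(X, n_classes, n_iter=200, seed=0):
    """Fit a latent class model with binary (0/1) indicators by EM.

    X is a list of rows, one per child, each a list of 0/1 indicators
    (for example, presence of a selected diagnosis code).
    Returns class weights, per-class indicator probabilities, posterior
    class memberships from the final E-step, and the log-likelihood trace.
    """
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    weights = [1.0 / n_classes] * n_classes
    # Random start for P(indicator = 1 | class), kept away from 0 and 1.
    probs = [[rng.uniform(0.2, 0.8) for _ in range(m)] for _ in range(n_classes)]
    trace = []
    for _ in range(n_iter):
        # E-step: posterior class membership for every row (log-sum-exp).
        post, loglik = [], 0.0
        for row in X:
            joint = []
            for k in range(n_classes):
                lp = math.log(weights[k])
                for j, x in enumerate(row):
                    p = probs[k][j]
                    lp += math.log(p if x else 1.0 - p)
                joint.append(lp)
            top = max(joint)
            denom = top + math.log(sum(math.exp(v - top) for v in joint))
            loglik += denom
            post.append([math.exp(v - denom) for v in joint])
        trace.append(loglik)
        # M-step: lightly smoothed updates so no parameter reaches 0 or 1.
        for k in range(n_classes):
            nk = sum(r[k] for r in post)
            weights[k] = (nk + 1e-3) / (n + n_classes * 1e-3)
            for j in range(m):
                hits = sum(post[i][k] * X[i][j] for i in range(n))
                probs[k][j] = (hits + 1e-3) / (nk + 2e-3)
    return weights, probs, post, trace


# Simulated data (hypothetical): a small class with high indicator rates
# and a large class with low rates. These are not study data.
sim = random.Random(1)
X = []
for i in range(300):
    rates = [0.9, 0.8, 0.7] if i < 30 else [0.02, 0.05, 0.05]
    X.append([1 if sim.random() < r else 0 for r in rates])

weights, probs, post, trace = fit_lca(X, n_classes=2)
# Assign each row to its most probable class.
assigned = [max(range(2), key=lambda k: row[k]) for row in post]
```

In practice the number of classes would be chosen by comparing fit across candidate models, and several random starts would be used to guard against local optima.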
We compared the LCA-based classification with the current standard in studies limited to claims data: the presence of any ASD-specific ICD-9-CM hospital diagnosis code (299.00, 299.80, 299.90) from a pediatric outpatient, inpatient, or emergency room visit.
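The standard code-based rule can be sketched as a simple membership check. The three codes are those listed above; the function names, the data layout (a mapping from child identifier to a list of code strings), and the example records are hypothetical.

```python
# ASD-specific ICD-9-CM codes named in the text.
ASD_CODES = {"299.00", "299.80", "299.90"}


def has_asd_code(claim_codes):
    """Return True if any claim carries one of the ASD-specific codes."""
    return any(code.strip() in ASD_CODES for code in claim_codes)


def flag_children(claims_by_child):
    """Map each child identifier to the standard code-based ASD flag."""
    return {child: has_asd_code(codes) for child, codes in claims_by_child.items()}


# Hypothetical example records, not study data.
example_claims = {
    "child_a": ["382.9", "299.00"],
    "child_b": ["493.90", "315.39"],
    "child_c": [],
}
flags = flag_children(example_claims)
```

This sketch matches codes as exact strings, so it assumes codes are stored with the decimal point and trailing zeros as written above.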
Results: Using random forests, we identified the 14 ICD-9-CM codes most predictive of SCQ score. These codes were used as indicators in an LCA of the full dataset (2992 children). LCA identified four classes among the subjects. One class (93% of the sample) had a low probability of a 299 code or other developmental diagnoses (normal class), while another class (4% of the sample) had a high probability of carrying a 299 diagnosis (ASD-type class). Compared with the group identified by the standard code-based definition (n=120), the ASD-type class derived from LCA was smaller (n=113). Nine children carried a 299 code but did not cluster within the ASD-type class, and two children did not carry a 299 code but were assigned to the ASD-type class on the basis of other characteristics. These children may represent false positive and false negative ASD cases, respectively, under the standard ICD-9-CM-based identification.
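The counts reported above can be reconciled in a 2x2 cross-tabulation of the two definitions. The sketch below uses only the figures stated in the text; the cell counts for agreement on both sides are derived by arithmetic rather than reported directly, and the variable names are illustrative.

```python
n_total = 2992   # children in the full dataset
n_code = 120     # children with a 299 code (standard definition)
code_only = 9    # 299 code, but not in the LCA ASD-type class
lca_only = 2     # in the LCA ASD-type class, but no 299 code

both = n_code - code_only                            # identified by both
n_lca = both + lca_only                              # size of ASD-type class
neither = n_total - both - code_only - lca_only      # identified by neither

crosstab = {
    ("code", "asd_class"): both,
    ("code", "other_class"): code_only,
    ("no_code", "asd_class"): lca_only,
    ("no_code", "other_class"): neither,
}
asd_class_share = n_lca / n_total   # proportion of sample in ASD-type class
```

The derived class size matches the reported n=113, and the class share of roughly 3.8% is consistent with the reported 4%.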
Conclusions: Machine learning and LCA show promise in improving ICD-9-CM-based ASD case identification. This approach may also help identify subgroups of children with ASD based on their clinical heterogeneity.