Integration of Knowledge Extracted from Clinical Notes with Patient-Reported Outcomes and Genetic Reports for Advancing Research into Phelan-McDermid Syndrome
Objectives: To demonstrate the integration, on the PCORI Phelan-McDermid Syndrome (PMS) Data Network (PMS_DN), of (a) knowledge extracted from patient clinical notes, (b) patient-reported outcome (PRO) data sourced from the PMS International Registry (PMSIR), and (c) the content of curated genetic reports.
Methods: Clinical notes were manually curated to remove pages with unusable content such as illegible text, images, ECG readouts, and cursive handwriting. Forms and surveys with checkboxes and formatted questions, which could lead to false-positive errors in the knowledge extraction process, were also removed. The Tesseract optical character recognition tool extracted raw text from the curated notes. The MITRE MIST scrubber and the Scrubber toolkit (in the Apache cTAKES natural language processing engine) then removed Protected Health Information. Apache cTAKES extracted knowledge by identifying occurrences of concepts defined in the Unified Medical Language System and mapping these concepts to concept definitions in 20 clinical terminologies, including ICD-9/10, MeSH, SNOMED, and the Human Phenotype Ontology. Experts can verify the knowledge extracted from clinical notes with a novel validation tool that allows them to cross-check the identified concepts against the raw text. PRO data from the PMSIR, comprising answers to developmental, clinical, and adult behavior questions, were preprocessed for statistical analysis. Extract-Transform-Load pipelines loaded the knowledge extracted from clinical notes and the PRO data, along with curated genetic reports of the PMS patients, into the i2b2-tranSMART clinical data repository.
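The order of the processing stages described in Methods can be illustrated with a minimal Python sketch. It is not the pipeline used in this work: the page labels, the `Page` and `Mention` types, the date-only `scrub` function, the dictionary lookup in `extract`, and the `CONCEPT_A`/`CONCEPT_B` identifiers are all hypothetical stand-ins for manual curation, the MIST and Scrubber tools, Apache cTAKES, and UMLS concepts respectively. The OCR step is passed in as a function so that any engine could be plugged in.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical curator labels marking a page as unusable.
UNUSABLE = {"illegible", "image", "ecg", "handwriting", "form"}

@dataclass
class Page:
    page_id: str
    label: str    # curator-assigned label, e.g. "typed" or "ecg"
    payload: str  # stand-in for the scanned page image

@dataclass
class Mention:
    page_id: str
    code: str     # placeholder concept identifier, not a real UMLS code
    begin: int    # character offsets into the scrubbed page text
    end: int

def curate(pages: List[Page]) -> List[Page]:
    """Drop pages a curator has labelled as unusable."""
    return [p for p in pages if p.label not in UNUSABLE]

def scrub(text: str) -> str:
    """Toy de-identification: masks MM/DD/YYYY dates only (an assumption)."""
    return re.sub(r"\b\d{1,2}/\d{1,2}/\d{4}\b", "[DATE]", text)

def extract(page_id: str, text: str, dictionary: Dict[str, str]) -> List[Mention]:
    """Toy concept finder: case-insensitive whole-word dictionary lookup."""
    mentions = []
    for term, code in dictionary.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            mentions.append(Mention(page_id, code, m.start(), m.end()))
    return mentions

def run_pipeline(
    pages: List[Page],
    ocr: Callable[[Page], str],
    dictionary: Dict[str, str],
) -> Tuple[Dict[str, str], List[Mention]]:
    """Curate, OCR, scrub, then extract; returns scrubbed texts and mentions."""
    texts: Dict[str, str] = {}
    mentions: List[Mention] = []
    for page in curate(pages):
        text = scrub(ocr(page))
        texts[page.page_id] = text
        mentions.extend(extract(page.page_id, text, dictionary))
    return texts, mentions

def context(texts: Dict[str, str], mention: Mention, width: int = 20) -> str:
    """Return the raw text around a mention so a reviewer can cross-check it."""
    text = texts[mention.page_id]
    return text[max(0, mention.begin - width): mention.end + width]

# Illustrative input only; not real patient data.
pages = [
    Page("p1", "typed", "Seen on 03/14/2015. History of seizures and hypotonia."),
    Page("p2", "ecg", "<waveform>"),
]
dictionary = {"seizures": "CONCEPT_A", "hypotonia": "CONCEPT_B"}
texts, mentions = run_pipeline(pages, ocr=lambda p: p.payload, dictionary=dictionary)
```

Storing character offsets with each mention is what makes the cross-check possible: `context(texts, mentions[0])` returns the surrounding text, so a reviewer can judge whether the identified concept is supported by what the note actually says.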
Results: The PMS Foundation (PMSF), a nonprofit organization advocating for PMS research, obtained informed consent to participate in PMSIR from 981 families (461 in the USA). Of these, 557 families (300 in the USA) consented to share their data, including PRO, genetic reports, and clinical notes, with Harvard Medical School. Manual curation of the clinical notes of 114 patients (47,938 pages), sourced from healthcare providers by a third-party vendor, removed 7,618 unusable pages. Apache cTAKES extracted the knowledge content of the remaining 40,320 pages. This knowledge was loaded into PMS_DN along with preprocessed PRO data from 344 patients and curated genetic information from 121 patients. Authorized autism investigators can browse and interrogate the aggregates of this integrated patient data on the PMS_DN Web portal (https://pmsdn.hms.harvard.edu). Investigators with the appropriate IRB clearances can obtain advanced, raw data download privileges on PMS_DN from PMSF.
Conclusions: PMS_DN facilitates research into PMS by providing authorized investigators access to high-quality knowledge extracted from clinical notes, PRO data, and genetic reports, enabling novel insights into the origin, progression, and treatment of this disorder. This exemplifies the potential of collaborations between academic researchers and family organizations such as PMSF to drive clinical research.