Objectives: Our goals were to (1) make high-quality data sets accessible within minutes rather than weeks or months; (2) reduce redundant data manipulations for simple revisions to a research question; and (3) implement a transparent and reproducible mechanism for data regeneration.
Methods: We developed an integrated data management system (iDMS) that can accommodate multiple studies, heterogeneous data types, and multi-site collaboration. We implemented processes that encourage up-front data cleaning at the time of acquisition, including data integrity constraints, improved electronic data capture, and data quality reporting. The iDMS and processes have been adopted by several research groups; we reviewed how two data managers used the iDMS to fulfill a range of investigator requests for clean data sets. The data managers produced multiple data sets over the past year, and the time taken to complete each was recorded and averaged. The recorded time included investigator communications and query revisions.
Results: Our solution was to (1) centralize all data sources into a single database; (2) clean the data once, up-front; and (3) enable data managers to easily construct and save complex queries. These three procedural changes reduced the time required to produce a data set. Quantitative measures are unavailable for the “before” state, but anecdotal evidence strongly suggests that producing each data set using the just-in-time (JIT) process took weeks or months. With the iDMS, the 85 ad hoc data set requests took an average of 5.8 hours each, and 33% of requests took 30 minutes or less to complete. Additionally, data quality was higher as a consequence of up-front cleaning, and staff allocation was lower because redundant data cleaning activities were eliminated. Queries could be saved in the system and later reused for revisions and comparisons.
Conclusions: Adopting an integrated data management (iDM) process with a compatible system can substantially reduce costs and increase data quality. The greatest challenge was integrating data sources with varying degrees of cleanliness. Data cleaning and organization are likely to remain challenging as the number of data sources to manage continues to increase. We expect iDM practices will be most valuable when both the cost of data acquisition and the probability of reuse are high. We believe the efficiency gains described here will accelerate scientific discovery and translation.
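
Illustration: The three procedural changes reported in the Results (a single database, cleaning enforced once at acquisition, and saved queries) can be sketched in a few lines. This is only a minimal sketch under stated assumptions, not the iDMS itself: SQLite stands in for the database, and the table names, column names, value ranges, and the helper `try_insert_visit` are all hypothetical choices made for this example.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with hypothetical study tables."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        -- (1) One central database holds data from every site.
        CREATE TABLE participant (
            participant_id INTEGER PRIMARY KEY,
            site           TEXT NOT NULL,
            birth_year     INTEGER NOT NULL
                           CHECK (birth_year BETWEEN 1900 AND 2100)
        );
        -- (2) Integrity constraints reject bad values at acquisition time.
        CREATE TABLE visit (
            visit_id       INTEGER PRIMARY KEY,
            participant_id INTEGER NOT NULL
                           REFERENCES participant(participant_id),
            visit_date     TEXT NOT NULL,
            systolic_bp    INTEGER CHECK (systolic_bp BETWEEN 50 AND 300)
        );
        -- (3) A saved query, stored as a view, can be rerun or revised later.
        CREATE VIEW site_bp_summary AS
        SELECT p.site,
               COUNT(*)           AS n_visits,
               AVG(v.systolic_bp) AS mean_systolic
        FROM visit AS v
        JOIN participant AS p USING (participant_id)
        GROUP BY p.site;
        """
    )
    return conn


def try_insert_visit(conn, participant_id, visit_date, systolic_bp) -> bool:
    """Insert a visit; return False if an integrity constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO visit (participant_id, visit_date, systolic_bp) "
            "VALUES (?, ?, ?)",
            (participant_id, visit_date, systolic_bp),
        )
    except sqlite3.IntegrityError:
        return False
    conn.commit()
    return True
```

In this sketch an out-of-range measurement or a visit for an unknown participant is refused when it is entered, so a later request for a data set reduces to selecting from the saved view `site_bp_summary` rather than repeating the cleaning.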