Challenges in Clinical Research Informatics: Data Quality and Transferability in Publically Available Databases

Lead Author Type

MBI Masters Student


Dr. Guenter Tusch, tushcg@gvsu.edu

Background: Publicly available databases were established to make large amounts of patient information readily accessible to researchers and clinicians who would otherwise be unable to obtain it. Some of these databases contain tens or even hundreds of thousands of data points and are formatted for easy analysis in programs such as R, SPSS and Excel. With these tools, sub-analyses can be conducted on existing information, adding to the overall knowledge base and providing new insights on a wide range of indications, from genetic conditions to many different forms of cancer.

As a researcher, both academically and professionally, I have used both public and privately held databases. Each has advantages and disadvantages. Electronic medical records contain in-depth, highly individualized information, which is essential for high-quality manuscripts rated at Level II in terms of evidence. However, these records are limited by the number of patients within a single practice or hospital system, and they can lack specific information of interest to the investigator, such as subjective outcomes and specialty-specific scoring tools.

By contrast, I used the Chemical Effects in Biological Systems (CEBS) database for a group research paper in a previous semester. After deciding on our population of interest, our initial plan was to perform a statistical analysis of basic patient demographic data, meaning the most straightforward information: gender, age and ethnicity. We found, to our surprise, that the database of nearly 86,000 patients was severely lacking in this regard. Ultimately, only ethnicity was considered usable, as we were able to demonstrate that the locations we chose tended to be culturally homogeneous.

I found this to be a considerable issue, especially given my professional experience with study protocol and manuscript preparation. While I am fully aware of the requirements and regulations set in place by HIPAA (Health Insurance Portability and Accountability Act), and of the importance of protecting the confidentiality of sensitive patient information, the data points I was concerned with fell well outside the scope of these regulations. I chose this topic for my capstone because it coincides well with new challenges in clinical research information systems, as well as with future professional and scholarly projects.

Purpose: The purpose of this project is to closely analyze the number and quality of manuscripts published from select publicly available databases.

Methods: Using the following databases: Chemical Effects in Biological Systems (CEBS), the National Database for Clinical Trials Related to Mental Illness (NDCT), ClinicalTrials.gov and the National Cardiovascular Data Registry (NCDR), I selected a random sample of 20 publications that used data from these publicly available databases for four specialties (this was reduced to three groups due to a low response rate). Experienced raters were selected and given abstracts and literature excerpts at two intervals, in February and March of 2016. A total of 20 sets were given to each rater. In all cases, sets were randomized to avoid bias.

Results: Using SPSS (v20), a Cohen’s kappa coefficient analysis was conducted to test interrater reliability. The cardiovascular (clinical research) subset showed poor reliability on my testing parameters, with kappa values of 0.274 and 0.178 for rounds 1 and 2. The sociobehavioral (κ = 0.379 and 0.437) and basic research (κ = 0.555 and 0.531) subsets fared somewhat better, with fair to good reliability results. However, the questionnaire system needs to be modified before it can support the comparisons that I hoped to make.
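For readers unfamiliar with the statistic, the following is a minimal Python sketch of how Cohen's kappa is computed for two raters. It is an illustration only, not the SPSS procedure used in this project. The function name `cohens_kappa` and the example ratings are hypothetical and are not taken from the study data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who scored the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    proportion of agreement and p_e is the agreement expected by chance.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must score the same non-empty set of items.")

    n = len(rater_a)

    # Observed agreement: share of items given the same category by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each category, multiply the two raters'
    # marginal proportions, then sum over all categories.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical category for every item;
        # the ratio is undefined, so this sketch reports full agreement.
        return 1.0

    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings (1 = "adequate", 0 = "inadequate") for four items.
example_a = [1, 1, 0, 0]
example_b = [1, 0, 0, 0]
print(cohens_kappa(example_a, example_b))  # 0.5
```

In the example, the raters agree on three of four items (p_o = 0.75), while chance agreement is 0.5, giving a kappa of 0.5. This shows why kappa can be much lower than raw percent agreement.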

Conclusion: I learned a great deal about the complexities of publishing from publicly available data. Many of the studies I used in my examples laid the groundwork for more important prospective work. It is difficult to create a single scale that covers such a broad range of topics. I hope to modify the scales and retest them in a larger population over a longer period of time to address the shortfalls of this preliminary project.