Improving a Diabetes Type 2 Risk Calculator: A Machine Learning Approach

Lead Author Type

MBI Masters Student


Dr. Guenter Tusch, tuschg@gvsu.edu


The National Center for Health Statistics (NCHS) of the Centers for Disease Control and Prevention (CDC) conducts a survey research program, the National Health and Nutrition Examination Survey (NHANES), to track the health, nutritional, and medical status of the United States population (both adults and children) over time. The survey collects demographic data, laboratory and examination results, interviews, and medical questionnaires. The American Diabetes Association (ADA) recommends screening for type 2 diabetes in individuals of any age with one or more risk factors and in all individuals above age 45, with or without risk factors. According to the National Diabetes Fact Sheet of 2007, 5.7 million people in the United States have undiagnosed diabetes. One major reason the ADA's recommendations are not widely followed is the cost of screening and the inconvenience of the testing procedures. One possible solution is a screening test that is economical, convenient, and free of complex procedures. Screening tools for diabetes can be extremely useful in preventing or delaying the onset of the disease. Diabetes prediction models have been built using the NHANES data. The aim of my project is to perform a study similar to the article by Heikes, K.E., et al., Diabetes risk calculator: A simple tool for detecting undiagnosed diabetes and pre-diabetes, 2008, Diabetes Care, 31(5), pp. 1040-1045. The authors use the NHANES III (1988-1994) data to build a paper-based screening tool that predicts undiagnosed diabetes or pre-diabetes from a person's characteristic features. My objective was to perform a similar study using the NHANES 1999-2000 dataset, adding new variables and comparing my model's prediction accuracy with that of the authors' model when applied to a different population from the NHANES data series.
The article uses body measurements, demographics, blood pressure, family history, the two-hour oral glucose tolerance test (OGTT), and fasting plasma glucose (FPG) to predict undiagnosed diabetes or pre-diabetes. I included all of these variables in my model except the OGTT values, because the test was absent from the 1999-2004 NHANES series. I also added new variables that were present in the 1999-2000 dataset: “Ulcer/sore not healed within 4 weeks”, “numbness in hands/feet from past 3 months”, “pain/tingling in hands/feet from past 3 months”, and “pain in either leg while walking”. The resulting classification tree model classifies a person as diabetic or healthy based on the variables used in my analysis. The model was validated against the NHANES 2001-02 dataset. The University of Maryland, the American Diabetes Association (ADA), and others have developed similar screening tests. My model adds four new variables, providing additional characteristic features intended to improve prediction.
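
The train-then-validate workflow described above can be illustrated with a minimal sketch. It is not the project's actual model: it fits a single-split classification tree (a "stump") on one invented numeric feature, assuming a CART-style Gini impurity criterion that the abstract does not specify, and then reports sensitivity and specificity on a separate validation set. The function names, the feature, and all data values are hypothetical and are not taken from NHANES.

```python
from typing import List, Tuple

# Each row is (feature value, label); label 1 = diabetic, 0 = healthy.
Row = Tuple[float, int]
# A stump is (threshold, label predicted at or below it, label predicted above it).
Stump = Tuple[float, int, int]


def gini(labels: List[int]) -> float:
    """Gini impurity of a list of binary labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def majority(labels: List[int]) -> int:
    """Most common binary label (ties resolve to 1)."""
    return 1 if 2 * sum(labels) >= len(labels) else 0


def fit_stump(rows: List[Row]) -> Stump:
    """Pick the threshold that minimises the weighted Gini impurity."""
    values = sorted({v for v, _ in rows})
    all_labels = [y for _, y in rows]
    # Fallback when the feature has a single distinct value: no useful split.
    best: Stump = (values[0], majority(all_labels), majority(all_labels))
    best_score = float("inf")
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0
        left = [y for v, y in rows if v <= t]
        right = [y for v, y in rows if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if score < best_score:
            best, best_score = (t, majority(left), majority(right)), score
    return best


def predict(stump: Stump, value: float) -> int:
    threshold, left_label, right_label = stump
    return left_label if value <= threshold else right_label


def sensitivity_specificity(stump: Stump, rows: List[Row]) -> Tuple[float, float]:
    """Sensitivity (true-positive rate) and specificity (true-negative rate)."""
    tp = sum(1 for v, y in rows if y == 1 and predict(stump, v) == 1)
    fn = sum(1 for v, y in rows if y == 1 and predict(stump, v) == 0)
    tn = sum(1 for v, y in rows if y == 0 and predict(stump, v) == 0)
    fp = sum(1 for v, y in rows if y == 0 and predict(stump, v) == 1)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Invented toy data standing in for a training cohort and a validation cohort.
train: List[Row] = [(85, 0), (90, 0), (95, 0), (98, 0), (100, 0),
                    (105, 0), (110, 1), (120, 1), (126, 1), (130, 1)]
validation: List[Row] = [(92, 0), (104, 0), (109, 1), (112, 0), (115, 1), (125, 1)]

stump = fit_stump(train)
sens, spec = sensitivity_specificity(stump, validation)
print(f"threshold={stump[0]}, sensitivity={sens:.2f}, specificity={spec:.2f}")
```

A full classification tree repeats this split search recursively over many variables. Evaluating on data the tree never saw, as the project does with the 2001-02 survey cycle, is what makes the reported accuracy comparable across models.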
