NHANES data in R

Home/Statistics/R/NHANES data in R

NHANES data in R

Background

NHANES is a national survey of health in the US, conducted every 2 years. Unfortunately, all the data are stored as SAS transport files, and there are many files for each 2-year wave. We are slowly working on a package for NHANES analyses and have put some scripts on GitHub for downloading all of the NHANES data sets and saving them as R files. There are also a couple of functions to make it easier to pull together an analysis file. So, in a few short steps, a researcher can download the data and put together an analysis dataset.

There are over 700 files across all the NHANES waves, and they take up 250 MB (compressed as RDS files). These scripts are a work in progress – there is very little error handling, including no “try catch” to handle downloading issues. (I do have a draft version of the “try catch” approach that will be uploaded soon.)

I was looking for a reason to play with the NHANES data to think about creating some analysis-related functions, and I stumbled across some data about US runners diagnosed with asthma and thyroid disease provided by running coach Alberto Salazar. In his response to a controversy over his coaching methods, he cited a few statistics about the 55 runners he has coached professionally. He cites a prevalence of 8/55 (14.5%) of his runners diagnosed with exercise-induced asthma. He also cites an incidence rate of 5/55 (9.1%) of his runners diagnosed with hypothyroidism after he started coaching them. So, let’s see how this compares to US average values.

Methods

Let’s operationalize these numbers as follows, so we can compare to US national data. How many US adults age 18-35 have ever been diagnosed with asthma? How many have it now? How many have received asthma medication in the last 3 months? Asthma is pretty simple in NHANES – there are variables for each of the questions listed.

For thyroid disease, let’s ask what proportion of US adults age 18-35 would have hypothyroidism if we were to test them? Hypothyroidism is a little more complicated. According to the US Preventive Services Task Force, there is no reason to screen for it (http://www.uspreventiveservicestaskforce.org/Page/Topic/recommendation-summary/thyroid-dysfunction-screening). So, it is certainly something that could exist without being diagnosed. Let’s define hypothyroidism as either subclinical hypothyroidism (elevated thyroid stimulating hormone

[TSH]) with normal serum T4) as well as hypothyroidism (elevated TSH and low T4). In fact, to make it simpler, we will just use TSH above 4.5 mU/L. And we will exclude anyone who reports that they have thyroid disease to make sure these are “unknown” cases.

And, to top it off, let’s focus on the subset of the population who exercises regularly, defined as at least 90 minutes of vigorous recreational activity each week. Not the same as world-class athletes, but the best we can do with the sample. At least we rule out the non-exercisers, and provide a more comparable population.

Code

The code to get started is below. Note that I use the data.table and magrittr packages a lot, so some of the syntax makes use of those. I also wrote another function (“out_tab”) to generate nicer output from the results of survey objects. Also, I combined data across multiple waves of data, from 2007, 2009, and 2011. In 2007, all patients were tested for thyroid disease, but in the other waves, it was part of the fasting subsample. Hence the weights are different. The asthma results are based on the interview, so are just weighted equally for each wave.

library(data.table)
library(survey)
library(devtools)
library(magrittr)
options(survey.lonely.psu = "adjust", scipen = 10)

# nhanes functions
source_url("https://raw.githubusercontent.com/outcomesinsights/nhanes_tools/master/create_datasets.R")

# survey output function
# prints out dataframe of result, se, 95% CI for svy estimates
out_tab <- function(t) {
    x <- 
        cbind(
        coef(t),
        SE(t),
        confint(t)
        ) %>% 
        format(big.mark = ",", digits = 4) %>% 
        as.data.frame
    names(x) <- c("Total", "SE", "2.5%", "97.5%")
    return(x)
}

# thyroid and asthma data sets and labels from 2009 and 2011
listing <- c("mcq", "thyrod", "paq") # these are the NHANES files to open; the "demo" file is assumed since it is required
full_07 <- load_merge(listing, 2007) # load listed datasets and merge together
full_09 <- load_merge(listing, 2009) 
full_11 <- load_merge(listing, 2011) 
full_labs_07 <- load_labs_merge(listing, 2007)  # create dataset of labels
full_labs_09 <- load_labs_merge(listing, 2009)  
full_labs_11 <- load_labs_merge(listing, 2011) 

# get the variables we need and merge both nhanes waves together
# note that 2007 included thyroid testing in all MEC patients.  2009 and 2011 were subsamples in 1/3 of population
keep <- c("LBXTSH1", "PAD660", "MCQ010", "MCQ035", "MCQ051","MCQ170M", "RIDAGEYR", "SDMVPSU", "SDMVSTRA", "WTSA2YR", "WTINT2YR")
keep_07 <- c("LBXTSH1", "PAD660", "MCQ010", "MCQ035", "MCQ051", "MCQ170M", "RIDAGEYR", "SDMVPSU", "SDMVSTRA", "WTMEC2YR", "WTINT2YR")

full <- rbind(
    full_07[, .SD, .SDcols = keep_07][, year := 2007], 
    full_09[, .SD, .SDcols = keep][, year := 2009], 
    full_11[, .SD, .SDcols = keep][, year := 2011], 
    fill = TRUE)

full[, `:=`(
    thyroid_wt = ifelse(year == 2007, 1 / 3 * WTMEC2YR, 1 / 3 * WTSA2YR), # MEC in 2007, otherwise subsample A weights
    asthma_wt  = 1/3 * WTINT2YR)] # correct the weights to combine over 3 waves (6 years)

# create analysis variables
full[, `:=`(
    thyr_prob = ifelse(!is.na(MCQ170M) & MCQ170M == 1, 1, 0), # has current thyroid problem
    hypo = ifelse(!is.na(LBXTSH1) & LBXTSH1 >= 4.5, 1, 0), # TSH
    exer_mins = ifelse(is.na(PAD660), 0, PAD660), # minutes of vigororous exercise per week, including 0 for NA
    exer90 = ifelse(!is.na(PAD660) & PAD660 >= 90, 1, 0), # minutes of vigororous exercise per week
    exer30 = ifelse(!is.na(PAD660) & PAD660 >= 30, 1, 0),
    asthma = ifelse(!is.na(MCQ010) & MCQ010 == 1, 1, 0), # ever diagnosed with asthma
    asthma_now = ifelse(!is.na(MCQ035) & MCQ035 == 1, 1, 0), # have asthma now
    asthma_med = ifelse(!is.na(MCQ051) & MCQ051 == 1, 1, 0), # asthma medication in last 3 months
    agegrp = ifelse(!is.na(RIDAGEYR) & RIDAGEYR < 18, "<18",
                ifelse(!is.na(RIDAGEYR) & RIDAGEYR < 36, "18-35", "36+")) # divide age groups (could use cut)
)]

# create survey objects for thyroid and asthma surveys
asthma_design <- svydesign(id = ~ SDMVPSU, strata = ~ SDMVSTRA, nest = TRUE, weight = ~ asthma_wt, data = full)
thyroid_design <- svydesign(id = ~ SDMVPSU, strata = ~ SDMVSTRA, nest = TRUE, weight = ~ thyroid_wt, data = full[!is.na(thyroid_wt)])
thyroid_design2 <- subset(thyroid_design, subset = (thyr_prob != 1))

Results

How many vigorous exercisers are there in the US? (About 26 million.)

svyby(~ exer90, ~ agegrp, asthma_design, svytotal, na.rm = TRUE) %>% 
    out_tab
           Total         SE       2.5%      97.5%
<18    7,402,523    493,918  6,434,461  8,370,585
18-35 11,572,231    901,320  9,805,676 13,338,787
36+    6,993,640    493,787  6,025,836  7,961,445

Thyroid disease

How many people who do not report having thyroid disease but have an elevated TSH? (Looks like about 8 million.)

svytotal(~ hypo, thyroid_design2, na.rm = TRUE) %>% 
    out_tab
         Total        SE      2.5%     97.5%
hypo 7,933,290   919,789 6,130,537 9,736,043

What is the prevalence of elevated TSH, by age? Looks like it varies between 1% and 4% based on these age groups (it increases with age, particularly in women).

svyby(~ hypo, ~ agegrp, thyroid_design2, svymean, na.rm = TRUE) %>% 
    out_tab
         Total       SE     2.5%    97.5%
<18   0.010759 0.003258 0.004373 0.017145
18-35 0.031558 0.006359 0.019094 0.044022
36+   0.037662 0.004189 0.029453 0.045872

How about the prevalence in 18-35 year old vigorous exercisers? (Looks like about 1.3%, with a 95% CI of 0.07% to 6%.)

svyciprop(~ hypo, subset(thyroid_design2, subset = (agegrp == "18-35" & exer90 == 1)), method = "beta")
                  2.5% 97.5%
hypo 0.013665 0.000738  0.06

Asthma

How many people report ever being diagnosed with asthma? (About 44 million.)
About 25 million have asthma now (at the time of the survey).
And about 16.5 million received treatment in the 3 months before the survey.

svytotal(~ asthma, asthma_design, na.rm = TRUE) %>% 
    out_tab 
                Total         SE       2.5%      97.5% 
asthma     43,671,524  2,015,687 39,720,851 47,622,198

svytotal(~ asthma_now, asthma_design, na.rm = TRUE) %>% 
    out_tab
                Total         SE       2.5%      97.5%
asthma_now 24,825,364  1,366,168 22,147,725 27,503,004

svytotal(~ asthma_med, asthma_design, na.rm = TRUE) %>% 
    out_tab
                Total         SE       2.5%      97.5%
asthma_med 16,467,848    905,845 14,692,425 18,243,271

What is the prevalence of having asthma now? (8-9%)
What is the prevalence of receiving medication? (4-7%)

svyby(~ asthma_now, ~ agegrp, asthma_design, svymean, na.rm = TRUE) %>%
    out_tab
         Total       SE     2.5%    97.5%
<18   0.094497 0.003625 0.087392 0.101602
18-35 0.079893 0.005022 0.070051 0.089735
36+   0.077406 0.004769 0.068059 0.086754

svyby(~ asthma_med, ~ agegrp, asthma_design, svymean, na.rm = TRUE) %>% 
    out_tab
         Total       SE     2.5%    97.5%
<18   0.068994 0.002817 0.063472 0.074515
18-35 0.043684 0.003814 0.036210 0.051159
36+   0.052614 0.003737 0.045290 0.059937

What is the prevalence in 18-35 year old vigorous exercisers? (Looks like about 9% are currently diagnosed and 3% received medication.)

svyciprop(~ asthma_now, subset(asthma_design, subset = (agegrp == "18-35" & exer90 == 1)), method = "beta")
                    2.5% 97.5%
asthma_now 0.0883 0.0645  0.12

svyciprop(~ asthma_med, subset(asthma_design, subset = (agegrp == "18-35" & exer90 == 1)), method = "beta")
                   2.5% 97.5%
asthma_med 0.0326 0.0205  0.05

Conclusion

I found this to be very interesting. It seems that competitive runners have a higher prevalence of asthma and hypothyroidism than the general population of 18-35 year old vigorous exercisers. Note that I did not do any statistical comparisons, and this post is not intended to opine about the use of prescription medications by elite runners. The purpose was to motivate a reason to look at the data, and to point out that NHANES data can be easy to work with, if the right tools are available. In particular, it is really helpful to have some R functions to download all the data, to convert it to R data files, and to do some analyses. This will eventually become a package (I hope). But until then, the scripts are up there to allow others to access the data and answer questions of interest.

By | 2017-05-18T19:11:44+00:00 July 10th, 2015|R|0 Comments

About the Author: