R for Medical Research
Welcome to R for researchers
The purpose of this course is to teach you how to handle, analyze and present research data using R. In the modern era of data science and e-learning, researchers and students are increasingly aware of the importance of handling data themselves, irrespective of their background. Indeed, handling and analyzing data has become an integral part of many professions. The field of data science has exploded in recent years, such that students and researchers in all fields are exposed to data science in some way. The dissemination of powerful software and know-how has made it possible for anyone to handle data.
This course intends to teach you how to analyze and present data for research. The authors of this course are medical researchers with many years experience in teaching research methods, programming and data science.1–5 This includes both experimental studies and observational research. This course is an attempt to provide you with all you need to understand how to design a research study, select appropriate analyses and subsequently performing the analyses and present the research findings. We will be using R, which is an extremely powerful tool for research. There are currently more than 15.000 packages (i.e libraries with useful, ready-made functions) available in R. These packages facilitate all aspect of analyzing data in R.
R is an easy language to learn and it offers very powerful tools for researchers, irrespective of the field. Anyone can learn R in a few days and produce impressive reports.
This course will give you all tools necessary to conduct research, regardless of whether your interest lies in medicine or finance. However, all examples and emphasis will be one methods in medical research.
A quick example
Let us see how quickly we can create the core components of a clinical study. We will be using data on patients with colon cancer.
1. Load data
After loading the survival
package we can now load data on colon cancer patients.
data(colon)
2. View 10 random patients
Let’s see 10 random patients:
sample_n(colon, 10)
id | study | rx | sex | age | obstruct | perfor | adhere | nodes | status | differ | extent | surg | node4 | time | etype | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1390 | 695 | 1 | Lev+5FU | 0 | 51 | 0 | 0 | 0 | 1 | 0 | 2 | 3 | 1 | 0 | 2008 | 1 |
797 | 399 | 1 | Lev+5FU | 1 | 66 | 0 | 0 | 0 | 7 | 0 | 2 | 3 | 0 | 1 | 2540 | 2 |
1088 | 544 | 1 | Lev | 0 | 46 | 0 | 1 | 0 | 7 | 1 | 2 | 3 | 0 | 1 | 313 | 1 |
513 | 257 | 1 | Lev | 0 | 57 | 1 | 0 | 0 | 8 | 1 | 2 | 3 | 0 | 1 | 232 | 2 |
335 | 168 | 1 | Obs | 1 | 50 | 0 | 0 | 0 | 4 | 1 | 3 | 3 | 1 | 0 | 717 | 2 |
284 | 142 | 1 | Lev | 1 | 28 | 0 | 0 | 0 | 9 | 1 | 3 | 3 | 0 | 1 | 246 | 1 |
456 | 228 | 1 | Lev | 1 | 53 | 0 | 0 | 0 | 1 | 1 | 2 | 3 | 0 | 0 | 1976 | 1 |
188 | 94 | 1 | Lev+5FU | 0 | 26 | 0 | 0 | 1 | NA | 0 | 2 | 4 | 1 | 1 | 2869 | 1 |
1598 | 799 | 1 | Obs | 0 | 69 | 0 | 0 | 0 | 1 | 0 | 2 | 2 | 1 | 0 | 1929 | 1 |
198 | 99 | 1 | Lev | 1 | 71 | 0 | 0 | 1 | NA | 1 | 2 | 4 | 0 | 1 | 219 | 1 |
3. Quick overview of all columns (variables, features)
Let us get a quick overview of variable names, types, missingness, means, etc.
summarizeColumns(colon)
name | type | na | mean | disp | median | mad | min | max | nlevs |
---|---|---|---|---|---|---|---|---|---|
id | numeric | 0 | 465.0000000 | 268.2512426 | 465.0 | 343.9632 | 1 | 929 | 0 |
study | numeric | 0 | 1.0000000 | 0.0000000 | 1.0 | 0.0000 | 1 | 1 | 0 |
rx | factor | 0 | NA | 0.6609257 | NA | NA | 608 | 630 | 3 |
sex | numeric | 0 | 0.5209903 | 0.4996937 | 1.0 | 0.0000 | 0 | 1 | 0 |
age | numeric | 0 | 59.7545748 | 11.9456696 | 61.0 | 11.8608 | 18 | 85 | 0 |
obstruct | numeric | 0 | 0.1937567 | 0.3953469 | 0.0 | 0.0000 | 0 | 1 | 0 |
perfor | numeric | 0 | 0.0290635 | 0.1680298 | 0.0 | 0.0000 | 0 | 1 | 0 |
adhere | numeric | 0 | 0.1453175 | 0.3525156 | 0.0 | 0.0000 | 0 | 1 | 0 |
nodes | numeric | 36 | 3.6597146 | 3.5715810 | 2.0 | 1.4826 | 0 | 33 | 0 |
status | numeric | 0 | 0.4951561 | 0.5001111 | 0.0 | 0.0000 | 0 | 1 | 0 |
differ | numeric | 46 | 2.0629139 | 0.5141981 | 2.0 | 0.0000 | 1 | 3 | 0 |
extent | numeric | 0 | 2.8869752 | 0.4880173 | 3.0 | 0.0000 | 1 | 4 | 0 |
surg | numeric | 0 | 0.2658773 | 0.4419182 | 0.0 | 0.0000 | 0 | 1 | 0 |
node4 | numeric | 0 | 0.2744887 | 0.4463764 | 0.0 | 0.0000 | 0 | 1 | 0 |
time | numeric | 0 | 1537.5457481 | 946.7038105 | 1855.0 | 1184.5974 | 8 | 3329 | 0 |
etype | numeric | 0 | 1.5000000 | 0.5001346 | 1.5 | 0.7413 | 1 | 2 | 0 |
4. Create a correlation plot to inspect distributions among 8 selected variables
We will create a correlation matrix using the first 8 columns in the colon dataset. We will also color the data points according to the treatment received, which is specified by the rx
variable.
ggpairs(colon,
columns=1:8,
aes(color = rx))

5. Get a glimpse of missing patterns in data
This will result in an image that resembles our dataframe. The columns represent the variables and the rows the patients. The cells for variables with missing values are drawn in black, as denoted by NA
(Not Available) in the legend.
vis_dat(colon)

6. Use a sophisticated prediction model to impute (fill in) missing data
library(mice)
myImputation <- mice(colon)
colon2 <- complete(myImputation)
vis_dat(colon2)

7. Create a regression model to study association between patient characteristics and survival
We will create a logistic regression model using the glm()
function. The outcome studied is the very first argument (status
) and the variables that we study are listed after ~
(tilde sign), separated by +
marks. We also specify which data frame to use, and that the distribution is binomial (i.e logistic regression). Finally, we tell R to create a publication ready forest plot with variable names, categories, number of observations, odds ratios and p-values.
# First, we create the logistic regression model
my_model <- glm(status ~ age + sex + rx + nodes,
data=colon2,
family="binomial")
# Second, we plot the results
forest_model(my_model)

8. Report the exact regression equation
Our regression model is actually a survival model and the journal requires us to provide the exact regression equation.
extract_eq(my_model, use_coefs = TRUE)
Pretty good, right?
As you can see, R is amazing.