R Objects

R uses objects to store information. Here is a very simple object, which only stores the numeric value 1:

my_number <- 1

We created an object named my_number and assigned (by using the symbol <-) the numeric value 1 to it. Now the object my_number contains information, i.e. a numeric value. Let’s ask R what type of object my_number is:

class(my_number)
## [1] "numeric"

According to the output, R interprets my_number as a “numeric” object, which is satisfactory. Let’s do the same procedure with a character string instead:

my_string <- "This is a string"
class(my_string)
## [1] "character"

This object, on the other hand, is a “character” object, which is also expected. It follows that as we create objects, R will try to guess the nature of the objects. The guessing is done by evaluating the contents of the object. Let’s see how R reacts to a numeric value wrapped in quotation marks:

my_string <- "1"
class(my_string)
## [1] "character"

Interesting! Because we wrapped the number 1 in quotation marks, R believes that the object is a character object. There are many types of objects in R, ranging from data frames to regression objects, and we will discuss them in this chapter.

You’ll start by building simple R objects that represent playing cards and then work your way up to a full-blown table of data. In short, you’ll build the equivalent of an Excel spreadsheet from scratch. When you are finished, your deck of cards will look something like this:

Objects store data in R. Objects can store many types of data.

Atomic Vectors

The simplest type of object in R is an atomic vector. Atomic vectors are everywhere in R, which you will soon notice. Let’s create an atomic vector by using the c() function. This function combines several arguments to form a vector. We will name the object my_first_vector and it will contain six numeric values:

my_first_vector <- c(1, 2, 3, 4, 5, 6)

We can check whether R considers my_first_vector a vector. This is done by using the function is.vector():

# is.vector() tests whether an object is a vector with no attributes other than names
is.vector(my_first_vector)
## [1] TRUE

The function is.vector() returned the value TRUE, which means that my_first_vector is a vector. Note that is.vector() also returns TRUE for lists, so the stricter test for an atomic vector is is.atomic(). Since my_first_vector only contains numeric values, R will also classify it as a numeric vector, which you can check by using the function is.numeric():

# is.numeric() tests whether an object is a numeric vector
is.numeric(my_first_vector)
## [1] TRUE
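As a small supplementary sketch, the following compares is.vector() with is.atomic(). The object my_list is a hypothetical example that is not part of the text above; it is only there to show an object that is a vector but not an atomic vector.

```r
my_first_vector <- c(1, 2, 3, 4, 5, 6)
# Hypothetical example object: a list can hold several data types at once
my_list <- list(1, "a", TRUE)

# is.atomic() is TRUE only for atomic vectors
is.atomic(my_first_vector)
## [1] TRUE
is.atomic(my_list)
## [1] FALSE

# is.vector() is TRUE for lists as well
is.vector(my_list)
## [1] TRUE
```

If you specifically need to know that an object stores a single data type, is.atomic() is therefore the safer check.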

Atomic vectors store values as one-dimensional vectors. Each atomic vector can only store one type of data. To handle all types of data there are six basic types of atomic vectors:

  • double: numeric values with decimals
  • integer: numeric values without decimals
  • character: any character value
  • logical: these vectors can only be TRUE or FALSE.
  • complex and raw are less commonly used.

my_doubles <- c(1.2, 2.4, 32.1)
my_integers <- c(1L, 2L, 3L)
my_characters <- c("A", "B", "Hello", "My phone is dead")
my_logicals <- c(TRUE, TRUE, FALSE, TRUE, FALSE)

You can check whether a vector is any of the above by using the functions is.double(), is.integer(), is.numeric(), is.character() and is.logical(). These functions return the value TRUE or FALSE.

In most cases you will know your variables well. However, if you need to check what type of vector an object is, you can use the typeof() function:

my_doubles <- c(1.2, 2.4, 32.1)
typeof(my_doubles)
## [1] "double"
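The sketch below applies typeof() to one vector of each of the four common types. The helper list all_vectors is an assumption introduced only for this illustration.

```r
my_doubles <- c(1.2, 2.4, 32.1)
my_integers <- c(1L, 2L, 3L)
my_characters <- c("A", "B")
my_logicals <- c(TRUE, FALSE)

# Hypothetical helper: collect the vectors in a list so we can loop over them
all_vectors <- list(my_doubles, my_integers, my_characters, my_logicals)
vector_types <- sapply(all_vectors, typeof)
vector_types
## [1] "double"    "integer"   "character" "logical"

# is.numeric() is TRUE for both doubles and integers
is.numeric(my_doubles)
## [1] TRUE
is.numeric(my_integers)
## [1] TRUE
```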

The length of vectors

The length of a vector is simply the number of elements it contains. Let’s create an atomic vector with 5 numeric values and check the length of it:

my_values <- c(1, 2, 3, 4, 5)
length(my_values)
## [1] 5

Let’s create a vector with a long string to check the length:

my_string <- c("I am becoming a great data scientist and I love it.")
length(my_string)
## [1] 1

The length of this object was 1, which is simply because it contained 1 element, which was the sentence “I am becoming a great data scientist and I love it.”. Let’s add another sentence to that vector:

my_string <- c("I am becoming a great data scientist and I love it.", "My name is difficult to spell.")
length(my_string)
## [1] 2

As evident, these two sentences (or more precisely, these two character strings) are contained in an object of length 2.

An atomic vector can only store one type of data. Let’s challenge this by creating an atomic vector with numeric values, strings and logicals.

# Start by creating the vector
mixed <- c("I love data science.", 1, 2, TRUE, FALSE)
# Print out the result
mixed
## [1] "I love data science." "1"                    "2"                   
## [4] "TRUE"                 "FALSE"
# Check how R interprets the vector
class(mixed)
## [1] "character"

Notice that all arguments, including 1, 2, TRUE and FALSE, are printed with quotation marks. This is because they have been transformed to character strings! This is called coercion, i.e. one data type was automatically transformed to another. The class() function reports that the vector mixed is of type character. So you can no longer perform mathematical operations with the numbers 1 and 2 in the object, since they have been coerced to characters.
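Coercion inside c() follows a fixed order: logical gives way to integer, integer to double, and everything gives way to character. A minimal sketch, with variable names chosen only for this illustration:

```r
# Each call mixes two types; typeof() shows which type wins
logical_and_integer <- typeof(c(TRUE, 1L))
logical_and_integer
## [1] "integer"

integer_and_double <- typeof(c(1L, 2.5))
integer_and_double
## [1] "double"

double_and_character <- typeof(c(2.5, "a"))
double_and_character
## [1] "character"

# TRUE becomes 1 when combined with a number
c(TRUE, 2.5)
## [1] 1.0 2.5
```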

Occasionally you want to create atomic vectors with numbers but then treat the numbers as characters. This can be done either by using quotation marks around the numbers, or by using the function as.character():

Example 1: Creating a numeric vector

my_numbers <- c(1, 2, 3, 4)
# Print it
my_numbers
## [1] 1 2 3 4

Example 2: Creating a character vector

my_characters <- c("1", "2", "3", "4")
# Print it
my_characters
## [1] "1" "2" "3" "4"

Notice how the numbers in the print out are wrapped in quotation marks.

Example 3: Converting numeric to character

my_numbers <- c(1, 2, 3, 4)
my_characters <- as.character(my_numbers)
# Print it
my_characters
## [1] "1" "2" "3" "4"

Hence, you can convert vectors between types. Let’s try the opposite, i.e. going from character to numeric:

Example 4: Converting character to numeric

my_characters <- c("1", "2", "3", "4")
my_numbers <- as.numeric(my_characters)
# Print it
my_numbers
## [1] 1 2 3 4

This turned out well since the numbers appeared in the correct order and without quotation marks (so they are actual numeric values). However, converting between vector types can produce unexpected results, so always check manually whether the conversion yielded the expected results. Let’s try to convert the string Hello to a numeric value:

my_characters <- c("Hello")
my_numbers <- as.numeric(my_characters)
## Warning: NAs introduced by coercion
# Print it
my_numbers
## [1] NA

This returns a warning: “Warning: NAs introduced by coercion”. NA is the symbol for missing values in R, so the warning means that missing values were produced. When we print the object my_numbers we note that it contains only an NA (missing value). So “Hello” could not be coerced to a numeric value.
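One possible way to check a conversion is to look for the NA values it produced. The sketch below assumes a hypothetical input vector raw_values; suppressWarnings() is used only to keep the output tidy, since we inspect the NA values ourselves.

```r
# Hypothetical input: one element cannot be read as a number
raw_values <- c("1", "2", "Hello", "4")
converted <- suppressWarnings(as.numeric(raw_values))
converted
## [1]  1  2 NA  4

# is.na() marks the elements that could not be converted
is.na(converted)
## [1] FALSE FALSE  TRUE FALSE

# Use that to find the offending input values
failed_values <- raw_values[is.na(converted)]
failed_values
## [1] "Hello"

# Many functions can skip missing values if asked to
mean(converted, na.rm = TRUE)
## [1] 2.333333
```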

Double vector

A double vector stores numeric values; small values, large values, negative values and positive values. The numbers can have decimals. R will save most numeric vectors as doubles.

my_numbers <- c(1, 2, 3)
is.double(my_numbers)
## [1] TRUE
your_numbers <- c(1.1, 2.2, 3.3)
is.double(your_numbers)
## [1] TRUE

Even though the object my_numbers consists of whole numbers, R will save it as a double instead of an integer. You can check that:

is.integer(my_numbers)
## [1] FALSE

Double = Numeric

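Doubles may also be referred to as numerics, and class() reports "numeric" for a double vector. You can choose either term you like. Note, however, that is.numeric() returns TRUE for integer vectors as well as for doubles.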

Integer vector

Integer vectors store integers. An integer is a numeric value without decimals. In R you rarely need integers since doubles can contain the same information; so you can usually use doubles instead of integers and most R functions will do this automatically for you. As shown earlier, to save a vector as an integer vector, you need to add the letter L to each value, as follows:

int <- c(1L, 5L, 3L, 10L)
is.integer(int)
## [1] TRUE

If you leave out the L, R will save the vector as a double.

Being explicit with integers

If you want a vector to be of type integer, then you should add the letter L to the integer values, as follows:

int <- c(1L, 5L, 3L, 10L)
is.integer(int)
## [1] TRUE

If we leave out the L then R will create a numeric vector, but not of type integer:

int <- c(1, 5, 3, 10)
is.integer(int)
## [1] FALSE
is.numeric(int)
## [1] TRUE
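The difference between the two numeric types can be made visible with typeof(), and as.integer() converts an existing double vector. A short sketch, with object names that are assumptions made for this illustration:

```r
typeof(1)
## [1] "double"
typeof(1L)
## [1] "integer"

# The values compare as equal, but the objects are not identical (types differ)
1L == 1
## [1] TRUE
identical(1L, 1)
## [1] FALSE

# as.integer() converts doubles to integers; decimals are cut off, not rounded
int_from_double <- as.integer(c(1, 5, 3, 10))
typeof(int_from_double)
## [1] "integer"
as.integer(3.9)
## [1] 3
```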

Having different types of vectors is very useful. Different vectors are used for different purposes. Mathematical operations, for instance, require numeric vectors and do not work on character vectors. We will demonstrate this by using the mean() function, which calculates the arithmetic mean of a series of numeric values:

my_values <- c(5, 10, 15, 20)
my_mean <- mean(my_values)
my_mean
## [1] 12.5

Did you know that we could have saved a line by wrapping the c() function with the mean() function? Here is how:

my_mean <- mean(c(5, 10, 15, 20))
my_mean
## [1] 12.5

Character vector

Texts and symbols are stored in character vectors. The following command creates a character vector by surrounding a string with quotes:

my_string <- c("This is one string")
my_string
## [1] "This is one string"

The character vector my_string stores one string, namely “This is one string”. Hence, a string is an element in a character vector. A string can contain symbols, letters and numbers. Here follows a character vector with 2 strings, of which the second contains numbers and symbols:

my_string <- c("This is one string", "This is string #2")
my_string
## [1] "This is one string" "This is string #2"

Note that if you surround numbers with quotes, they will be treated as characters (and not as numbers!):

my_numbers <- c(1, 2, 3)
my_numbers
## [1] 1 2 3
your_numbers <- c("1", "2", "3")
typeof(your_numbers)
## [1] "character"
your_numbers
## [1] "1" "2" "3"

Note that the numbers in the object your_numbers were surrounded with quotes. That is why R converted the object to a character vector. When printing that object, R shows the quotes to make clear that those are character strings! All character strings are printed using quotes in R.

A character string contains characters and symbols. It is not possible to perform mathematical operations on character strings, even if they contain numbers. Let’s see two examples:

Example 1: Create a string with letters

my_string <- "This is a string"
# Multiply it by two
my_string*2
## Error in my_string * 2 : non-numeric argument to binary operator

Example 2: Create a string with a number

my_string <- "100"
# Multiply it by two
my_string*2
## Error in my_string * 2 : non-numeric argument to binary operator

In both examples R replies that the operation cannot be completed because the argument (my_string) is not numeric (non-numeric).

So character strings can contain numbers but they are not treated as numeric values. Remember that R will depict numbers which are characters by surrounding them with quotes. Actually, anything surrounded by quotes is treated as a character string, irrespective of the contents between the quotes. Not only does R use quotes to denote that the data is of type character, but you must also use quotes when referring to those values. Let’s see a simple example, where we will first create two vectors, which we will combine into a data frame. Then we will try to select all men in that data frame:

# Create a vector called "sex"
sex <- c("Man", "Woman", "Man", "Woman")
# Create a vector called "pressure"
pressure <- c(140, 130, 120, 150)
# Combine vectors into a data frame, using the function data.frame()
my_data_frame <- data.frame(sex, pressure)
# View data frame
my_data_frame
##     sex pressure
## 1   Man      140
## 2 Woman      130
## 3   Man      120
## 4 Woman      150
# Create new data frame only including men
only_men <- subset(my_data_frame, sex=="Man")
# View new data frame
only_men
##   sex pressure
## 1 Man      140
## 3 Man      120

The command that selects all men is: subset(my_data_frame, sex=="Man"). As seen here, Man must be placed in quotes. Otherwise an error will be returned, as seen here:

only_men <- subset(my_data_frame, sex==Man)
## Error in eval(e, x, parent.frame()) : object 'Man' not found

I think that we both agree that the error message is a bit cryptic, which is often the case when working in R. Fortunately, you will gradually recognize the error messages and their meaning. Nevertheless, because we left out the quotation marks, R started to look for an object with the name Man and could not find one, which is why an error was returned.
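To see what the comparison sex == "Man" actually produces, it can help to run it on its own. The sketch below rebuilds the same data and selects the rows with square brackets instead of subset(); the name is_man is an assumption introduced for this illustration.

```r
sex <- c("Man", "Woman", "Man", "Woman")
pressure <- c(140, 130, 120, 150)
my_data_frame <- data.frame(sex, pressure)

# The comparison returns one logical value per row
is_man <- my_data_frame$sex == "Man"
is_man
## [1]  TRUE FALSE  TRUE FALSE

# Keep the rows where the comparison is TRUE (note the comma before the bracket closes)
only_men <- my_data_frame[is_man, ]
only_men
##   sex pressure
## 1 Man      140
## 3 Man      120
```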

Logicals (boolean values)

Some variables are (more or less) binary in nature. If such binary variables can be characterized as TRUE or FALSE, then they are referred to as logicals. Examples follow:

  • Whether a person is dead or alive may be considered a logical variable. In that scenario, the variable could (for example) be named dead and the value set to TRUE if the person is dead and FALSE if the person is alive.
  • Whether a person has cancer or not can also be classified as TRUE or FALSE.
  • You can compare two numbers to see if one is larger or smaller than the other.

Let’s see an example where we ask R if the number 10 is greater than the number 1:

10>1
## [1] TRUE

Some researchers and analysts are more comfortable using 1/0 instead of TRUE/FALSE, which is perfectly fine. Logicals in R are booleans; a boolean value can only take one of two possible values.

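R recognizes TRUE and FALSE if you type them using capital letters, without quotation marks. This tells R to treat the inputs as logical data. R also accepts the shorthand T for TRUE and F for FALSE. Be aware that T and F are ordinary variables that can be overwritten, whereas TRUE and FALSE cannot, so spelling out the full words is the safer habit.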

Let’s see an example where we store logical values in a data frame, using both the full words and the shorthands:

# Create vectors with values
name <- c("Adam", "Umit", "Joanna", "Sarah")
dead <- c(TRUE, FALSE, TRUE, FALSE)
employed <- c(F, T, F, F)
# Create a data frame
my_data_frame <- data.frame(name, dead, employed)
# Print data frame
my_data_frame
##     name  dead employed
## 1   Adam  TRUE    FALSE
## 2   Umit FALSE     TRUE
## 3 Joanna  TRUE    FALSE
## 4  Sarah FALSE    FALSE

As seen above, R accepted the shorthands T and F.
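Because TRUE counts as 1 and FALSE as 0 in calculations, logical vectors are convenient for counting. A minimal sketch using the dead vector from the example above; the names n_dead and share_dead are assumptions for this illustration.

```r
dead <- c(TRUE, FALSE, TRUE, FALSE)

# The sum of a logical vector is the number of TRUE values
n_dead <- sum(dead)
n_dead
## [1] 2

# The mean of a logical vector is the proportion of TRUE values
share_dead <- mean(dead)
share_dead
## [1] 0.5

# Explicit conversion to the 1/0 coding
as.numeric(dead)
## [1] 1 0 1 0
```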

Complex and Raw vectors

You will most likely never use or see these vectors in R, which is why we’ll skip them.

Dates and Times

It is common to work with dates and times when doing data analysis. For example, survival analysis concerns modelling of survival time, which is typically the time interval from the start of observation until the end of observation. Dates and times are also used to assess temporal (time) trends etc. Every data analyst must be comfortable with date and time variables.

It turns out that date and time variables can be handled like numeric variables: every date/time can be assigned a numeric value, which can then be used in various operations. In order to assign a numeric value to a specific date, another date must serve as the reference date, or starting point. In SAS, that date is set to 1960-01-01 00:00:00 GMT. Any later date/time can then be calculated as the number of days/hours/seconds that have elapsed since that starting point. R uses 1970-01-01 as its reference date, although conversion functions such as as.Date() accept an origin argument if your numbers are counted from a different starting point.

To work with dates and times as numeric variables, R uses atomic vectors with the classes POSIXct and POSIXt, which are stored as doubles. Although these vectors hold numeric values, you can use strings to create them. Let’s start by creating two atomic vectors of character type with dates:

# This is a simple character vector
birth_date <- c("1990-01-01", "1999-01-04", "2002-05-05", "2001-12-12")
death_date <- c("2005-02-02", "2009-11-14", "2018-01-01", "2011-11-09")
class(birth_date)
## [1] "character"
class(death_date)
## [1] "character"

R states that both vectors are of class character, which we expected. Let’s convert these character vectors to date vectors by using base R functions:

birth_date <- as.POSIXct(birth_date)
death_date <- as.POSIXct(death_date)
birth_date
## [1] "1990-01-01 CET"  "1999-01-04 CET"  "2002-05-05 CEST" "2001-12-12 CET"
death_date
## [1] "2005-02-02 CET" "2009-11-14 CET" "2018-01-01 CET" "2011-11-09 CET"
class(birth_date)
## [1] "POSIXct" "POSIXt"
class(death_date)
## [1] "POSIXct" "POSIXt"
# We can calculate the time difference between birth and death
difftime(death_date, birth_date)
## Time differences in days
## [1] 5511.000 3967.000 5720.042 3619.000

R has now converted the character strings into date-times, with the classes POSIXct and POSIXt. Using POSIXct, each time is represented by the number of seconds that have passed since midnight (00:00:00) on January 1st 1970 in Coordinated Universal Time (UTC). However, R printed the time differences in days, because difftime() chooses a suitable unit automatically by default. You can instead specify that you want the time difference in seconds, as follows:

difftime(death_date, birth_date, units="secs")
## Time differences in secs
## [1] 476150400 342748800 494211600 312681600
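The third difference above (5720.042 days) is not a whole number of days because the birth date falls in summer time and the death date in winter time, which adds one hour. A sketch of one way to avoid this, assuming you are happy to interpret all dates in UTC; the object names ending in _utc are introduced only for this illustration.

```r
# A POSIXct value is a double counting seconds since 1970-01-01 00:00:00 UTC
one_day_in <- as.POSIXct("1970-01-02 00:00:00", tz = "UTC")
as.numeric(one_day_in)
## [1] 86400
typeof(one_day_in)
## [1] "double"

# Setting tz explicitly keeps every element in the same time zone
birth_utc <- as.POSIXct(c("1990-01-01", "2002-05-05"), tz = "UTC")
death_utc <- as.POSIXct(c("2005-02-02", "2018-01-01"), tz = "UTC")
days_between <- difftime(death_utc, birth_utc, units = "days")
days_between
## Time differences in days
## [1] 5511 5720
```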

R also added time zones, taken from the local settings of the computer; CET is Central European Time and CEST is Central European Summer Time. This may be unexpected, since we would prefer using the same time zone in most cases. Indeed, base R functions for handling dates and times can be complicated, or even problematic. Therefore, Garrett Grolemund, Hadley Wickham and Vitalie Spinu created the lubridate package, which makes handling of dates and times very simple. Let’s install and activate the lubridate package:

#install.packages("lubridate")
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.0.2
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

Lubridate stores dates as the number of days since 1970-01-01. You can get the specific date for 18000 days since 1970-01-01, as follows:

as_date(18000) 
## [1] "2019-04-14"

Hence, 18000 days since 1970-01-01 equals the date 2019-04-14.
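The same day count can be inspected with base R alone, which shows that it is a property of R's Date class rather than something specific to lubridate. A minimal sketch; the object names are assumptions for this illustration.

```r
# Convert a date to its underlying day count
day_count <- as.numeric(as.Date("2019-04-14"))
day_count
## [1] 18000

# And back again, stating the reference date explicitly
date_from_count <- as.Date(18000, origin = "1970-01-01")
date_from_count
## [1] "2019-04-14"

# The reference date itself is day 0
as.numeric(as.Date("1970-01-01"))
## [1] 0
```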

Let’s store some date values using lubridate. There are many functions in lubridate to do this. The principle for using these functions is as follows:

  1. Identify the order of the year (y), month (m), day (d), hour (h), minute (m) and second (s) elements in the data.
  2. Use the function whose name matches that order of elements.

Example:

  • Date format: “1990-01-01”, which represents “YYYY-MM-DD”.
  • Hence, the order is Y, M, D.
  • Appropriate function in lubridate: ymd().
  • The function ymd() will create a date vector from a character vector whose elements are ordered (Y)EAR, (M)ONTH and (D)AY, regardless of which separator is used between these elements.

Let’s test that by creating a data frame with 3 individuals, including their names, birth date, date of first examination and date of death. All the dates will be provided with varying separators in order to test whether the ymd() function understands the date definitions:

# Create the vectors
names <- c("David", "Mohammed", "Christina")
birth <- ymd("1990-01-01", "1999-01-04", "2002-05-05")
exami <- ymd("1990.01.01", "1999.01.04", "2002.05.05")
death <- ymd("2005/02/02", "2009/11/14", "2018/01/01")
# Combine the vectors into a data frame
my_data <- data.frame(names, birth, exami, death)
# View the data frame
my_data
##       names      birth      exami      death
## 1     David 1990-01-01 1990-01-01 2005-02-02
## 2  Mohammed 1999-01-04 1999-01-04 2009-11-14
## 3 Christina 2002-05-05 2002-05-05 2018-01-01

As you can see in the output, R did recognize the different date forms. Let’s change the order of year, month and day in our input; note that we will need to rearrange the order of the letters (y, m, d):

dmy("30-01-2000", "25-12-2001")
## [1] "2000-01-30" "2001-12-25"
mdy("12-25-1999", "09-29-2004")
## [1] "1999-12-25" "2004-09-29"
ymd("20181230", "20191129")
## [1] "2018-12-30" "2019-11-29"

Again, R did parse the dates correctly. There are many more lubridate functions that parse dates; you can view them all in the lubridate package documentation.

In the next example we will parse a date which includes a clock time. To parse this correctly, we will add the letters h (hour), m (minutes) and s (seconds), as follows:

ymd_hms("2011-06-04 12:00:00")
## [1] "2011-06-04 12:00:00 UTC"

You can see that R assigned this to the time zone UTC. R recognizes roughly 600 time zones; each encodes the time zone, daylight saving time, and historical calendar variations for an area. R assigns one time zone per vector. You can get a list of all time zones by executing the following command:

OlsonNames()
##   [1] "Africa/Abidjan"                   "Africa/Accra"                    
##   [3] "Africa/Addis_Ababa"               "Africa/Algiers"                  
##   [5] "Africa/Asmara"                    "Africa/Asmera"                   
##   [7] "Africa/Bamako"                    "Africa/Bangui"                   
##   [9] "Africa/Banjul"                    "Africa/Bissau"                   
##  [11] "Africa/Blantyre"                  "Africa/Brazzaville"              
##  [13] "Africa/Bujumbura"                 "Africa/Cairo"                    
##  [15] "Africa/Casablanca"                "Africa/Ceuta"                    
##  [17] "Africa/Conakry"                   "Africa/Dakar"                    
##  [19] "Africa/Dar_es_Salaam"             "Africa/Djibouti"                 
##  [21] "Africa/Douala"                    "Africa/El_Aaiun"                 
##  [23] "Africa/Freetown"                  "Africa/Gaborone"                 
##  [25] "Africa/Harare"                    "Africa/Johannesburg"             
...
...
...
...
...                  
## [561] "Pacific/Tahiti"                   "Pacific/Tarawa"                  
## [563] "Pacific/Tongatapu"                "Pacific/Truk"                    
## [565] "Pacific/Wake"                     "Pacific/Wallis"                  
## [567] "Pacific/Yap"                      "Poland"                          
## [569] "Portugal"                         "PRC"                             
## [571] "PST8PDT"                          "ROC"                             
## [573] "ROK"                              "Singapore"                       
## [575] "Turkey"                           "UCT"                             
## [577] "Universal"                        "US/Alaska"                       
## [579] "US/Aleutian"                      "US/Arizona"                      
## [581] "US/Central"                       "US/East-Indiana"                 
## [583] "US/Eastern"                       "US/Hawaii"                       
## [585] "US/Indiana-Starke"                "US/Michigan"                     
## [587] "US/Mountain"                      "US/Pacific"                      
## [589] "US/Pacific-New"                   "US/Samoa"                        
## [591] "UTC"                              "W-SU"                            
## [593] "WET"                              "Zulu"                            
## attr(,"Version")
## [1] "2020a"

Let’s set the time zone to US/Pacific:

ymd_hms("2011-06-04 12:00:00", tz="US/Pacific")
## [1] "2011-06-04 12:00:00 PDT"

Calculating time intervals

Let’s create some data with 2 individuals and their birth and death dates:

# Create the vectors
names <- c("David", "Mohammed")
birth <- ymd("1990-01-01", "1999-01-04")
death <- ymd("2005/02/02", "2009/11/14")
# Combine the vectors into a data frame
my_data <- data.frame(names, birth, death)
# View the data frame
my_data
##      names      birth      death
## 1    David 1990-01-01 2005-02-02
## 2 Mohammed 1999-01-04 2009-11-14

We want to create a new variable (column) in our data frame my_data. The new variable, which we call dayslived, should be the time difference between birth and death date.

# Calculate time difference by subtracting the dates from each other
my_data$dayslived <- my_data$death-my_data$birth
# View the data
my_data
##      names      birth      death dayslived
## 1    David 1990-01-01 2005-02-02 5511 days
## 2 Mohammed 1999-01-04 2009-11-14 3967 days
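The result of subtracting two dates is a difftime object, which prints with its unit ("days"). If you need a plain number, for example to express the time lived in years, you can convert it. The sketch below uses base R dates so that it runs without lubridate; dividing by 365.25 is an approximation chosen for this illustration, not an exact calendar calculation.

```r
birth <- as.Date(c("1990-01-01", "1999-01-04"))
death <- as.Date(c("2005-02-02", "2009-11-14"))

# Convert the time difference to a plain numeric vector of days
days_lived <- as.numeric(difftime(death, birth, units = "days"))
days_lived
## [1] 5511 3967

# Approximate number of years lived (365.25 days per year on average)
years_lived <- round(days_lived / 365.25, 1)
years_lived
## [1] 15.1 10.9
```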

We can use the interval() function to define a time interval, which can then be used for various comparisons. Let’s create a time interval variable and call it time_interval:

my_data$time_interval <- interval(my_data$birth, my_data$death)
my_data
##      names      birth      death dayslived                  time_interval
## 1    David 1990-01-01 2005-02-02 5511 days 1990-01-01 UTC--2005-02-02 UTC
## 2 Mohammed 1999-01-04 2009-11-14 3967 days 1999-01-04 UTC--2009-11-14 UTC

You can see that David lived during the interval 1990-01-01 UTC--2005-02-02 UTC. Let’s see if David’s and Mohammed’s lives ever coincided with World War II.

# Define the start and end of WW2 (approximated here to whole calendar years):
ww2start    <- ymd_hms("1939-01-01 00:00:00")
ww2end      <- ymd_hms("1945-12-31 00:00:00")
ww2interval <- interval(ww2start, ww2end)
my_data$expww2 <- int_overlaps(my_data$time_interval, ww2interval)
my_data
##      names      birth      death dayslived                  time_interval
## 1    David 1990-01-01 2005-02-02 5511 days 1990-01-01 UTC--2005-02-02 UTC
## 2 Mohammed 1999-01-04 2009-11-14 3967 days 1999-01-04 UTC--2009-11-14 UTC
##   expww2
## 1  FALSE
## 2  FALSE

It seems that neither of them lived during World War II, since the result was FALSE for both David and Mohammed.
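The idea behind an overlap test can be written out by hand: two intervals overlap if each one starts before the other one ends. The sketch below illustrates that logic with base R dates; it is not necessarily how int_overlaps() is implemented internally, and all object names are assumptions for this illustration.

```r
life_start <- as.Date(c("1990-01-01", "1999-01-04"))
life_end   <- as.Date(c("2005-02-02", "2009-11-14"))
war_start  <- as.Date("1939-01-01")
war_end    <- as.Date("1945-12-31")

# Overlap: the life starts before the war ends AND ends after the war starts
overlaps <- life_start <= war_end & life_end >= war_start
overlaps
## [1] FALSE FALSE

# A hypothetical person who lived from 1920 to 1950 would overlap
as.Date("1920-01-01") <= war_end & as.Date("1950-01-01") >= war_start
## [1] TRUE
```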

Durations

A duration is the time difference between two time points. You calculate durations with lubridate by subtracting date times, for example:

ymd("2018-01-01") - ymd("1999-01-04")
## Time difference of 6937 days

Periods

You can add or subtract periods (such as years, months, days or seconds) by using simple functions, as follows:

ymd(20110101) + years(1)
## [1] "2012-01-01"
ymd(20110101) + months(12)
## [1] "2012-01-01"
ymd(20110101) - days(2)
## [1] "2010-12-30"
ymd(20110101) - seconds(2000)
## [1] "2010-12-31 23:26:40 UTC"

Of course, there are instances where you would prefer using dates as categorical variables instead of numeric variables. The lubridate package is useful when you need to handle time periods, durations or intervals.

Factors

Factors in R are categorical variables, such as gender, hair color, ethnicity, disease status etc. A factor is simply a categorical variable which has 2 or more levels. The levels may have an inherent order, but it is not necessary. Most new users have difficulties distinguishing factor vectors from character vectors. In terms of doing statistics, there is little difference in how R treats factor and character vectors. You usually do not need to convert characters to factors when doing statistical calculations, since most modelling functions (e.g. lm()) will handle characters as factors. When manipulating data frames, however, character vectors and factors are treated very differently. You could run into unexpected errors and warning messages when dealing with characters/factors in R. Generally, when doing data manipulation, you can stick to character vectors. Let’s create a character variable and an identical factor variable:

my_names <- c("David", "Mohammed", "Christina", "Yusuf", "Djemba", "Liu", "Christina")
character_names <- as.character(my_names)
factor_names <- as.factor(my_names)

Let’s create a table for each vector; the table tells us the number of observations for each category of the vector:

table(character_names)
## character_names
## Christina     David    Djemba       Liu  Mohammed     Yusuf 
##         2         1         1         1         1         1
table(factor_names)
## factor_names
## Christina     David    Djemba       Liu  Mohammed     Yusuf 
##         2         1         1         1         1         1

You can see the possible levels for a factor through the levels() function:

levels(character_names)
## NULL
levels(factor_names)
## [1] "Christina" "David"     "Djemba"    "Liu"       "Mohammed"  "Yusuf"

As you can see, you can create tables with both vector types, but the character vector does not have levels (it returns NULL), whereas the factor vector does. Having levels is one of the characteristics of a factor vector, which is useful when doing statistical modelling. You can change the reference level of the factor variable, which may be desirable when doing statistical modelling. The reference level is the category with which the other categories are compared when doing statistical modelling (e.g. linear regression, logistic regression etc.).

For example, you may wish to compare the survival in people with various types of cancers: lung cancer, pancreatic cancer, colorectal cancer and breast cancer. In such a comparison, you would typically use one of these cancer types as the reference and compare the others with that reference. By using the cancer type variable as a factor, you can change the reference level. A simple example follows:

# Create a character vector
cancer_type <- c("breast", "lung", "colorectal", "pancreatic")
# Convert it into a factor vector
cancer_type <- as.factor(cancer_type)
# Check the levels
# Remember that the first level is the reference level
levels(cancer_type)
## [1] "breast"     "colorectal" "lung"       "pancreatic"

Since breast cancer is printed first, it is used as the reference level for this factor variable. We can change that by using the function relevel(), as follows:

# Relevel
cancer_type <- relevel(cancer_type, "pancreatic")
# Check levels again
levels(cancer_type)
## [1] "pancreatic" "breast"     "colorectal" "lung"
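An alternative to relevel() is to state the complete order of the levels when the factor is created, using the levels argument of factor(). The sketch below also shows that a factor stores integer codes pointing at its levels; the name cancer_factor is an assumption for this illustration.

```r
cancer_type <- c("breast", "lung", "colorectal", "pancreatic")

# State the level order explicitly; the first level becomes the reference level
cancer_factor <- factor(
  cancer_type,
  levels = c("pancreatic", "breast", "colorectal", "lung")
)
levels(cancer_factor)
## [1] "pancreatic" "breast"     "colorectal" "lung"

# Internally, each element is stored as the position of its level
as.integer(cancer_factor)
## [1] 2 4 3 1
```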

If you use the base R functions for reading data into R (e.g. the read.csv() function), then R may actually convert character variables into factor variables. In R versions before 4.0.0 this was the default behaviour; since R 4.0.0 the default is stringsAsFactors = FALSE. Automatic conversion can cause confusion and unexpected errors and warnings during data manipulation. In general, it can be advised that you do not let R make factors until you ask for them.
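You can control this behaviour yourself with the stringsAsFactors argument of read.csv(). The sketch below reads a tiny, made-up CSV from a string (via the text argument) rather than from a file, so that it is self-contained; csv_text and the two data frame names are assumptions for this illustration.

```r
# Hypothetical CSV content, supplied as a string instead of a file
csv_text <- "name,sex\nDavid,Man\nChristina,Woman"

# Keep text columns as character vectors
data_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(data_chr$sex)
## [1] "character"

# Ask explicitly for factors
data_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(data_fct$sex)
## [1] "factor"
```

Stating the argument explicitly makes the code behave the same way regardless of the R version it runs on.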

Coercion

R has several built-in mechanisms that convert variables depending on their content. The rules are as follows:

  • If a variable (column) contains a character (i.e. any non-numeric string), it will be coerced (automatically converted) into a character variable.
  • If a variable (column) only contains numeric values, it will be converted into a numeric variable.
  • If a variable (column) only contains logical values, it will be converted into a logical variable (TRUE/FALSE).
  • If a variable (column) contains logical and numeric values, the logical values will be converted into numbers (TRUE = 1, FALSE = 0).

If you ever encounter the situation where you fail to calculate the mean of a numeric variable, then it is likely that R has coerced it into a character vector due to the presence of a character string somewhere along the variable.

Let’s see some examples of coercion. The function c() combines multiple elements into a vector. The function sum() sums all numeric values in a vector. We will apply both functions simultaneously below.

sum(c(TRUE, TRUE, FALSE))
## [1] 2
sum(c(1, 2, 3))
## [1] 6
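# sum() does not accept character vectors, so both calls below return an error
sum(c("1", "2", "3"))
## Error in sum(c("1", "2", "3")) : invalid 'type' (character) of argument
sum(c("1", "2", "3A"))
## Error in sum(c("1", "2", "3A")) : invalid 'type' (character) of argument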

You can explicitly ask R to convert data from one type to another with the as.*() functions, such as as.character(), as.logical() and as.numeric(). R will convert the data whenever there is a sensible way to do so:

as.character(1)
## [1] "1"
as.logical(1)
## [1] TRUE
as.numeric(FALSE)
## [1] 0

A vector, matrix or array can only contain one data type. Data frames or lists are required to store data (variables) of different types. Why do some statistical models (notably in deep learning and machine learning) require data to be stored as matrices or arrays? The answer is that these models are very computationally intensive, and calculations on vectors, matrices and arrays are fast and they can be stored efficiently in the computer’s memory.
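The one-type rule becomes visible when a data frame with mixed column types is converted into a matrix. In the sketch below, patients and patients_matrix are hypothetical names chosen for this illustration.

```r
# A data frame can hold a character column next to a numeric column
patients <- data.frame(
  name = c("Adam", "Joanna"),
  pressure = c(140, 120),
  stringsAsFactors = FALSE
)
class(patients$pressure)
## [1] "numeric"

# A matrix can hold only one type, so everything is coerced to character
patients_matrix <- as.matrix(patients)
typeof(patients_matrix)
## [1] "character"
```

After the conversion the blood pressure values are text, so they would have to be converted back with as.numeric() before any calculation.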

Names

Let’s create a simple data frame with 2 columns (variables), namely sex and blood pressure.

# Create a vector called "Sex"
Sex <- c("Man", "Woman", "Man", "Woman")
# Create a vector called "BloodPressure"
BloodPressure <- c(140, 130, 120, 150)
# Use the function data.frame() to create a data frame
my_data_frame <- data.frame(Sex, BloodPressure)

We now have a regular table, which we can view:

my_data_frame
##     Sex BloodPressure
## 1   Man           140
## 2 Woman           130
## 3   Man           120
## 4 Woman           150

We can check the column names in our data frame using the names() function:

names(my_data_frame)
## [1] "Sex"           "BloodPressure"

There are multiple ways of changing column names, which we will demonstrate in subsequent chapters. One way of doing it is to simply create a new vector (using the c() function) containing variable names and assign that to the names property of the data frame, as follows:

# Assign new names to the columns
names(my_data_frame) <- c("Gender", "Pressure")
# Check results
names(my_data_frame)
## [1] "Gender"   "Pressure"
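If only one column needs a new name, you can index the names vector instead of replacing all names at once. This is a minimal sketch using a hypothetical data frame called patients:

```r
# Hypothetical data frame with two columns
patients <- data.frame(Sex = c("Man", "Woman"),
                       BloodPressure = c(140, 130))

# Rename only the column currently called "BloodPressure"
names(patients)[names(patients) == "BloodPressure"] <- "Pressure"

names(patients)
## [1] "Sex"      "Pressure"
```

Selecting the column by its current name, rather than by position, keeps the code correct even if the column order changes.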

Use names to refer to columns

Use the $ sign to extract a column in a data frame. The following will extract the blood pressure values:

my_data_frame$Pressure
## [1] 140 130 120 150

Let’s create a new variable using data stored in Pressure. The new variable will simply be Pressure multiplied by two, and it will be stored in the same data frame in a column called PressureMultiplied:

# Create new variable
my_data_frame$PressureMultiplied <- my_data_frame$Pressure*2
# Print the data frame
my_data_frame
##   Gender Pressure PressureMultiplied
## 1    Man      140                280
## 2  Woman      130                260
## 3    Man      120                240
## 4  Woman      150                300
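The $ sign is not the only way to refer to a column by name. The sketch below, which uses a hypothetical data frame called patients, compares it with double and single square brackets:

```r
# Hypothetical data frame
patients <- data.frame(Gender = c("Man", "Woman", "Man", "Woman"),
                       Pressure = c(140, 130, 120, 150))

by_dollar <- patients$Pressure       # returns a vector
by_double <- patients[["Pressure"]]  # returns the same vector
by_single <- patients["Pressure"]    # returns a data frame with one column

identical(by_dollar, by_double)
## [1] TRUE
class(by_single)
## [1] "data.frame"
```

Double brackets are convenient when the column name is stored in a variable, since $ only accepts a literal name.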

Matrix

A matrix is a two-dimensional array that stores data in rows and columns. Matrices can only store one type of data. You create matrices using the matrix() function, in which you specify the number of rows and columns, as follows:

# Create a vector, which will be converted into a matrix
my_vector <- c(1, 2, 3, 4, 5, 6)
# Create the matrix
my_matrix <- matrix(my_vector, nrow=2, ncol=3)
# View it
my_matrix
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

The function matrix() will take the vector and fill the matrix one column at a time, until all values have been used. The byrow argument can be used to fill a matrix one row at a time instead, as follows:

# Create the matrix
my_matrix <- matrix(my_vector, nrow=2, ncol=3, byrow=TRUE)
# View it
my_matrix
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
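Once a matrix exists, you can check its dimensions with dim() and extract elements with square brackets, giving the row first and the column second. A small sketch, using a hypothetical matrix m filled by row:

```r
# A 2 x 3 matrix filled one row at a time
m <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

dim(m)     # number of rows, number of columns
## [1] 2 3
m[2, 3]    # row 2, column 3
## [1] 6
m[1, ]     # all of row 1
## [1] 1 2 3
m[, 2]     # all of column 2
## [1] 2 5
```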

Arrays

An array can contain multiple matrices. Let’s create a three-dimensional array with 3 rows, 3 columns and 2 layers (i.e. 2 matrices). We will use the dim argument to define the number of rows, columns and layers:

my_vector <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
my_array <- array(my_vector, dim=c(3, 3, 2))
# View my_array
my_array
## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

As you can see, the vector holds only 9 values while the array has 18 cells, so the array() function recycles the content of the vector to fill both matrices.
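Elements of an array are extracted with one index per dimension. The following sketch uses a hypothetical array a, built the same way as above, to confirm the recycling and to show the indexing:

```r
# 9 values recycled into a 3 x 3 x 2 array
a <- array(1:9, dim = c(3, 3, 2))

length(a)                      # total number of cells
## [1] 18
a[2, 3, 1]                     # row 2, column 3, first matrix
## [1] 8
identical(a[, , 1], a[, , 2])  # both matrices hold the same values
## [1] TRUE
```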

Lists

Vectors, matrices and arrays can only store one type of data. You cannot create an array with numeric values and factors mixed. You must use lists or data frames (see below) to do so. If you try to mix types in a vector, R coerces every element to a single type:

my_vector <- c("Unstable angina", 150, TRUE)
# Print the vector
my_vector
## [1] "Unstable angina" "150"             "TRUE"

As evident, all elements were wrapped with quotation marks (""), which indicates that they have been converted to character strings.

Lists are very versatile, as they can store multiple types of data, of varying lengths. You can create lists consisting of data frames, graphs, formulas and functions. Any R object can be inserted into a list. Let’s create a list:

# Create atomic vectors
blood_pressure <- c(120, 180, 130)
condition <- c("Diabetes", "Cancer", "Heart failure")
dead <- c(TRUE, TRUE, FALSE, TRUE)
# Combine atomic vectors into list
my_list <- list(blood_pressure, condition, dead)
# View my list
my_list
## [[1]]
## [1] 120 180 130
## 
## [[2]]
## [1] "Diabetes"      "Cancer"        "Heart failure"
## 
## [[3]]
## [1]  TRUE  TRUE FALSE  TRUE

This list contains three objects, each labelled with its position in double brackets [[]] in the printed output. As seen above, the first object in the list contains the numeric values 120, 180 and 130. Let’s create a list that also contains a data frame.

# Create a data frame
my_data_frame <- data.frame(country=c("Sweden", "UK", "USA"),
                            treatment=c("warfarin", "aspirin", "eptifibatide"))
# Create a list
another_list <- list(my_data_frame, blood_pressure, condition, dead)
# View the list
another_list
## [[1]]
##   country    treatment
## 1  Sweden     warfarin
## 2      UK      aspirin
## 3     USA eptifibatide
## 
## [[2]]
## [1] 120 180 130
## 
## [[3]]
## [1] "Diabetes"      "Cancer"        "Heart failure"
## 
## [[4]]
## [1]  TRUE  TRUE FALSE  TRUE
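List elements can also be given names, which makes them easier to extract. The sketch below assumes a hypothetical list called patient_info and shows extraction by position and by name:

```r
# A named list; the elements have different types and lengths
patient_info <- list(blood_pressure = c(120, 180, 130),
                     condition = c("Diabetes", "Cancer", "Heart failure"),
                     dead = c(TRUE, TRUE, FALSE, TRUE))

patient_info[[1]]            # first element, by position
## [1] 120 180 130
patient_info$condition       # element by name
## [1] "Diabetes"      "Cancer"        "Heart failure"
patient_info[["dead"]][4]    # fourth value of the element named "dead"
## [1] TRUE
class(patient_info[1])       # single brackets return a list, not the element
## [1] "list"
```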

Data Frames

Data frames are similar to Excel spreadsheets. They are also the most common format for data in medical research. In the vast majority of cases, the rows represent the observations (e.g. individuals), and the columns represent the features of the observations (i.e. variables).

If you study patients, then the rows would conventionally be the individual patients, and the columns would be the variables describing those patients. Data frames are central to R and research in general. Most regression functions, machine learning functions, statistical tests, etc, are tailored to be used on data frames.

You can manually create a data frame by combining vectors. The vectors will become the variables (i.e. columns) in the data frame. Let’s create a data frame with 3 vectors:

# Create a number sequence from 1 to 5
variable1 <- 1:5
# Generate 5 values randomly using the rnorm() function
variable2 <- rnorm(5)
# Create a character vector
variable3 <- c("A", "B", "W", "C", "D")
# Combine them into a data frame using the data.frame() function
my_data_frame <- data.frame(variable1, variable2, variable3)
my_data_frame
##   variable1  variable2 variable3
## 1         1  1.7611362         A
## 2         2  1.0577679         B
## 3         3  0.4517601         W
## 4         4 -1.3073682         C
## 5         5 -1.0834233         D
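
Two properties of data frames are worth noting:

  • Within each column, all values are of the same type. Different columns can hold different types.
  • If you create a data frame using vectors of varying length, R will recycle the shorter vectors, provided that the longest length is a whole multiple of each shorter length. Otherwise data.frame() returns an error.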
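
The recycling behaviour can be sketched as follows. The data frames and column names are hypothetical, and tryCatch() is used only to capture the error message so that the script keeps running:

```r
# Lengths 4 and 2: the shorter vector is recycled twice
ok <- data.frame(id = 1:4, group = c("A", "B"), stringsAsFactors = FALSE)
ok$group
## [1] "A" "B" "A" "B"

# Lengths 3 and 2: 3 is not a multiple of 2, so data.frame() fails
result <- tryCatch(
  data.frame(id = 1:3, group = c("A", "B")),
  error = function(e) conditionMessage(e)
)
result
## [1] "arguments imply differing number of rows: 3, 2"
```

Silent recycling can hide mistakes, so it is usually safer to check that all vectors have the intended length before combining them.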

As with any R object, you can use the str() function to check the structure of a data frame:

str(my_data_frame)
## 'data.frame':	5 obs. of  3 variables:
##  $ variable1: int  1 2 3 4 5
##  $ variable2: num  1.761 1.058 0.452 -1.307 -1.083
##  $ variable3: chr  "A" "B" "W" "C" ...

The str() function returns an output showing the following:

  • variable1 is an integer
  • variable2 is numeric
  • variable3 is a character vector (chr).

Note that variable3 was created as a character variable and, in the output above, it has remained one. This is the default behaviour since R 4.0.0. In earlier versions of R, the data.frame() function converted character vectors into factors by default, which was acceptable in most cases, particularly since most prediction models require data to be either numeric or factor. If you want to be sure that R does not coerce strings to factors, whichever version you are running, you can create the data frame with the argument stringsAsFactors = FALSE:

my_data_frame <- data.frame(variable1, variable2, variable3, stringsAsFactors = FALSE)
str(my_data_frame)
## 'data.frame':	5 obs. of  3 variables:
##  $ variable1: int  1 2 3 4 5
##  $ variable2: num  1.761 1.058 0.452 -1.307 -1.083
##  $ variable3: chr  "A" "B" "W" "C" ...

As the output confirms, variable3 is of character type (chr).
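
If a model does require a factor, a character column can be converted explicitly with factor(). This is a minimal sketch using a hypothetical data frame df:

```r
# Hypothetical data frame with a character column
df <- data.frame(id = 1:5,
                 grade = c("A", "B", "W", "C", "D"),
                 stringsAsFactors = FALSE)

# Convert the character column to a factor
df$grade <- factor(df$grade)

class(df$grade)
## [1] "factor"
levels(df$grade)   # levels are sorted alphabetically by default
## [1] "A" "B" "C" "D" "W"
```

Converting explicitly makes the intent visible in the code and does not depend on which R version is installed.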