4 R Data Types and Structures

The R language supports a broad array of operations such as mathematical calculations, logical analyses, and text manipulation. However, the applicability of a function to a variable depends on the variable’s data type. For instance, an arithmetic function to add two variables won’t work if the variables store text.

R supports various data types, including numeric (x <- 15), character (x <- "Hello!"), and logical (x <- TRUE or x <- FALSE). In addition to single values (scalars), R allows variables to hold collections of numbers or strings using vectors, matrices, lists, or data frames. Advanced data structures such as tibbles, data tables, and xts objects provide additional features beyond traditional data frames.

In this chapter, we will explore the following data types:

  1. Scalar: A single data element, such as a number or a character string.
  2. Vector: A one-dimensional array that contains elements of the same type.
  3. Matrix (matrix): A two-dimensional array with elements of the same type.
  4. List (list): A one-dimensional array capable of storing various data types.
  5. Data Frame (data.frame): A two-dimensional array that can accommodate columns of different types.
  6. Tibble (tbl_df): An enhanced version of data frames, offering user-friendly features.
  7. Data Table (data.table): An optimized data frame extension designed for speed and handling large datasets.
  8. Extensible Time Series (xts): A time-indexed, matrix-like object specifically designed for time series data.

Understanding the data type of variables is crucial because it determines the operations and functions that can be applied to them.

It’s worth noting that R provides so-called generic functions, which are functions that have the same name but perform different actions depending on the class of the data object. These generic functions adapt their behavior based on the input data type, allowing for more flexible and intuitive programming. For example, the summary() function in R is a generic function. When applied to a numeric vector, it provides statistical summaries such as the minimum, maximum, mean, median, and quartiles. However, when applied to a data frame, it gives a summary of each variable, including the minimum, maximum, and quartiles for numerical variables, as well as counts for each level of categorical (factor) variables.
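The following is a minimal sketch of this behavior. The objects scores and survey are made up for illustration and do not appear elsewhere in this chapter:

```r
# Hypothetical example data
scores <- c(2, 4, 4, 7, 10)
survey <- data.frame(score = scores,
                     group = factor(c("a", "b", "a", "b", "a")))

# Same function name, different behavior depending on the class of the input
num_summary <- summary(scores)  # min, quartiles, median, mean, max of a vector
df_summary  <- summary(survey)  # one summary per column of the data frame

num_summary
df_summary

# methods() lists the class-specific versions that summary() can dispatch to
methods(summary)
```

The numeric column is summarized with quantiles and the mean, while the factor column is summarized with counts per level.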

4.1 Scalar

Scalars in R are variables holding single objects. Technically, R has no separate scalar type: a scalar is simply a vector of length one. You can determine an object’s type by applying the class() function to the variable.

Numbers, Characters, and Logical Values

# Numeric (a.k.a. Double)
w <- 5.5  # w is a decimal number.
class(w)  # Returns "numeric".

# Integer
x <- 10L  # The L tells R to store x as an integer instead of a decimal number.
class(x)  # Returns "integer".

# Complex
u <- 3 + 4i # u is a complex number, where 3 is real and 4 is imaginary.
class(u)  # Returns "complex".

# Character
y <- "Hello, World!"  # y is a character string.
class(y)  # Returns "character".

# Logical
z <- TRUE  # z is a logical value.
class(z)  # Returns "logical".
## [1] "numeric"
## [1] "integer"
## [1] "complex"
## [1] "character"
## [1] "logical"

An object’s type dictates which functions can be applied. For example, mathematical functions are applicable to numbers but not characters:

# Mathematical operations
2 + 2  # Results in 4.
3 * 5  # Results in 15.
(1 + 2) * 3  # Results in 9 (parentheses take precedence).

# Logical operations
TRUE & FALSE  # Results in FALSE (logical AND).
TRUE | FALSE  # Results in TRUE (logical OR).

# String operations
paste("Hello", "World!")  # Concatenates strings, results in "Hello World!".
nchar("Hello")  # Counts characters in a string, results in 5.
## [1] 4
## [1] 15
## [1] 9
## [1] FALSE
## [1] TRUE
## [1] "Hello World!"
## [1] 5
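Conversely, applying an arithmetic operator to a character string raises an error. A short sketch, where tryCatch() is used only so that the script keeps running after the error:

```r
# Arithmetic on a character string fails, even if the string looks like a number
result <- tryCatch("2" + 2, error = function(e) conditionMessage(e))
result  # the error message (in an English locale: non-numeric argument to binary operator)

# Converting the string first makes the operation valid
as.numeric("2") + 2  # 4

# is.numeric() and is.character() test the type before a function is applied
is.numeric("2")    # FALSE
is.character("2")  # TRUE
```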

Dates and Times

When conducting economic research, it is common to deal with data types specifically designed for storing date and time information:

# Date
v <- as.Date("2023-06-30")  # v is a Date.
# The default input format is %Y-%m-%d, where
# - %Y is year in 4 digits,
# - %m is month with 2 digits, and 
# - %d is day with 2 digits.
class(v)  # Returns "Date".
## [1] "Date"
# POSIXct (Time)
t <- as.POSIXct("2023-06-30 18:47:10", tz = "America/Chicago")  # t is a POSIXct.
# The default input format is %Y-%m-%d %H:%M:%S, where
# - %H is hour out of 24,
# - %M is minute out of 60, and
# - %S is second out of 60.
# The tz input is the time zone, given as a name from OlsonNames();
# "America/Chicago" is US Central time. Abbreviations such as "CDT" are not
# reliably recognized.
class(t)  # Returns "POSIXct" "POSIXt".
## [1] "POSIXct" "POSIXt"
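Time zone names are platform dependent. The sketch below assumes that the name "America/Chicago" is available on your system, which OlsonNames() lets you check. The variable t_central is introduced here only for illustration:

```r
# Assumption: "America/Chicago" is a valid time zone name on this system
"America/Chicago" %in% OlsonNames()

# A time stamp stored with US Central time as its time zone
t_central <- as.POSIXct("2023-06-30 18:47:10", tz = "America/Chicago")

# Display the same instant in UTC (Central Daylight Time is UTC-5)
format(t_central, tz = "UTC", usetz = TRUE)  # "2023-06-30 23:47:10 UTC"

# POSIXct counts seconds, so adding 3600 moves the time forward by one hour
t_central + 3600
```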

The default input format, %Y-%m-%d or %Y-%m-%d %H:%M:%S, can be changed by specifying a format input. The output format can be adjusted by applying the format() function to the object:

# Date with custom input format:
v <- as.Date("April 6 -- 23", format = "%B %d -- %y")
v  # Returns default output format: %Y-%m-%d.
## [1] "2023-04-06"
format(v, format = "%B %d, %Y")  # Returns a custom output format: "%B %d, %Y".
## [1] "April 06, 2023"

The syntax for different date formats can be found by typing ?strptime in the R console. Some of the most commonly used formats are outlined in the table below:

Table 4.1: Syntax for Date Format
Specification  Description                                                                    Example
%a             Abbreviated weekday                                                            Sun, Thu
%A             Full weekday                                                                   Sunday, Thursday
%b or %h       Abbreviated month                                                              May, Jul
%B             Full month                                                                     May, July
%d             Day of the month, 01-31                                                        27, 07
%j             Day of the year, 001-366                                                       148, 188
%m             Month, 01-12                                                                   05, 07
%U             Week, 00-53, with Sunday as first day of the week                              22, 27
%w             Weekday, 0-6, Sunday is 0                                                      0, 4
%W             Week, 00-53, with Monday as first day of the week                              21, 27
%x             Date, locale-specific
%y             Year without century, 00-99 (on input, 00-68 become 20xx, 69-99 become 19xx)   84, 05
%Y             Year with century                                                              1984, 2005
%C             Century                                                                        19, 20
%D             Date formatted %m/%d/%y                                                        05/27/84
%u             Weekday, 1-7, Monday is 1                                                      7, 4
%n             Newline on output or arbitrary whitespace on input
%t             Tab on output or arbitrary whitespace on input
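A few of these specifications in action. This sketch sticks mostly to numeric codes, because names produced by %A or %B depend on the locale of your system:

```r
d <- as.Date("2023-06-30")

# Numeric format codes do not depend on the locale
format(d, "%j")        # day of the year: "181"
format(d, "%u")        # weekday with Monday = 1: "5" (a Friday)
format(d, "%m/%d/%y")  # "06/30/23"

# Two-digit years on input: 00-68 map to 20xx, 69-99 map to 19xx
as.Date("68-01-01", format = "%y-%m-%d")  # "2068-01-01"
as.Date("69-01-01", format = "%y-%m-%d")  # "1969-01-01"

# Weekday and month names depend on the locale, so output may differ by system
format(d, "%A, %B %d")
```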

Here are some example operations for Date objects:

# Date Operations
date1 <- as.Date("2023-06-30")
date2 <- as.Date("2023-01-01")

# Subtract dates to get the number of days between
days_between <- date1 - date2
days_between
## Time difference of 180 days
# Add days to a date
date_in_future <- date1 + 30
date_in_future
## [1] "2023-07-30"
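A few further operations that are often useful, sketched with the same two dates:

```r
date1 <- as.Date("2023-06-30")
date2 <- as.Date("2023-01-01")

# A difference of dates is a "difftime" object; as.numeric() strips the units
as.numeric(date1 - date2)  # 180

# difftime() lets you choose the units explicitly
difftime(date1, date2, units = "weeks")

# Dates can be compared like numbers
date1 > date2  # TRUE

# seq() with by = "month" steps through calendar months, not fixed 30-day blocks
seq(date2, by = "month", length.out = 3)  # "2023-01-01" "2023-02-01" "2023-03-01"
```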

4.2 Vector

In R, a vector is a homogeneous sequence of elements, meaning they must all be of the same basic type. As such, a vector can hold multiple numbers, but it cannot mix types, such as having both numbers and words. The function c() (for combine) can be used to create a vector:

# Numeric vector
numeric_vector <- c(1, 2, 3, 4, 5)
class(numeric_vector)  # Returns "numeric".

# Character vector
character_vector <- c("Hello", "World", "!")
class(character_vector)  # Returns "character".

# Logical vector
logical_vector <- c(TRUE, FALSE, TRUE)
class(logical_vector)  # Returns "logical".
## [1] "numeric"
## [1] "character"
## [1] "logical"
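If you do pass mixed types to c(), R does not raise an error. Instead, it silently converts (coerces) all elements to one common type. A short sketch:

```r
# Mixing numbers and text: everything becomes character
mixed <- c(1, "a", TRUE)
mixed         # "1" "a" "TRUE"
class(mixed)  # "character"

# Mixing numbers and logicals: TRUE/FALSE become 1/0
class(c(1.5, TRUE))  # "numeric"

# This is why sum() counts the TRUE values in a logical vector
sum(c(TRUE, FALSE, TRUE))  # 2
```

Checking class() after combining values is a simple way to catch unintended coercion.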

The function c() can also be used to add elements to a vector:

# Add elements to existing vector:
x <- c(1, 2, 3)
x <- c(x, 4, 5, 6)
x
## [1] 1 2 3 4 5 6

The seq() function creates a sequence of numbers or dates:

# Create a sequence of numbers:
x <- seq(from = 1, to = 1.5, by = 0.1)
x
## [1] 1.0 1.1 1.2 1.3 1.4 1.5
# Create a sequence of dates:
x <- seq(from = as.Date("2004-05-01"), to = as.Date("2004-12-01"), by = "month")
x
## [1] "2004-05-01" "2004-06-01" "2004-07-01" "2004-08-01"
## [5] "2004-09-01" "2004-10-01" "2004-11-01" "2004-12-01"
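Some related ways to build and use sequences, shown as a brief sketch:

```r
# Fix the number of elements instead of the step size
seq(from = 0, to = 1, length.out = 5)  # 0.00 0.25 0.50 0.75 1.00

# The colon operator is a shortcut for integer steps of 1
1:5  # 1 2 3 4 5

# rep() repeats values
rep(c("a", "b"), times = 2)  # "a" "b" "a" "b"

# Arithmetic on vectors is applied element by element
(1:5) * 2  # 2 4 6 8 10
```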

Missing data is represented as NA (not available). The function is.na() indicates the elements that are missing and anyNA() returns TRUE if the vector contains any missing values:

x <- c(1, 2, NA, NA, 4, 9, 12, 5, 4, NA)
is.na(x)
##  [1] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE
anyNA(x)
## [1] TRUE
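Missing values propagate through most computations, so many functions offer an na.rm argument to drop them first. A sketch using the same vector:

```r
x <- c(1, 2, NA, NA, 4, 9, 12, 5, 4, NA)

sum(is.na(x))          # number of missing values: 3
mean(x)                # NA, because missing values propagate
mean(x, na.rm = TRUE)  # mean of the 7 observed values, i.e. 37 / 7
x[!is.na(x)]           # keep only the observed values
```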

Factors are vectors for categorical data: they restrict entries to a predefined set of categories, called levels:

# Unordered factors, e.g. categories "Male" and "Female":
gender_vector <- c("Male", "Female", "Male", "Male", "Male", "Female", "Male")
factor_gender_vector <- factor(gender_vector)
factor_gender_vector
## [1] Male   Female Male   Male   Male   Female Male  
## Levels: Female Male
# Ordered factors, e.g. categories with ordering Low < Medium < High:
temperature_vector <- c("High", "Low", "Low", "Low", "Medium", "Low", "Low")
factor_temperature_vector <- factor(temperature_vector, 
                                    ordered = TRUE, 
                                    levels = c("Low", "Medium", "High"))
factor_temperature_vector
## [1] High   Low    Low    Low    Medium Low    Low   
## Levels: Low < Medium < High
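Once a factor is defined, its levels can be inspected, counted, and (for ordered factors) compared. A minimal sketch, where temp is just a shorter, illustrative name for the ordered factor above:

```r
temperature_vector <- c("High", "Low", "Low", "Low", "Medium", "Low", "Low")
temp <- factor(temperature_vector, ordered = TRUE,
               levels = c("Low", "Medium", "High"))

levels(temp)       # "Low" "Medium" "High"
table(temp)        # counts per level: Low 5, Medium 1, High 1
as.integer(temp)   # underlying integer codes: 3 1 1 1 2 1 1
temp[1] > temp[2]  # TRUE, because High > Low in an ordered factor
```

Comparisons such as > are only meaningful for ordered factors; for unordered factors R returns NA with a warning.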

4.3 Matrix (matrix)

A matrix in R (matrix) is a two-dimensional array that extends atomic vectors, containing both rows and columns. The elements within a matrix must be of the same data type.

# Create a 3x3 numeric matrix, column-wise:
numeric_matrix <- matrix(1:9, nrow = 3, ncol = 3)
numeric_matrix
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
class(numeric_matrix)  # Returns "matrix" "array".
## [1] "matrix" "array"
typeof(numeric_matrix)  # Returns "integer" because 1:9 is an integer sequence.
## [1] "integer"
# Create a 2x3 character matrix, row-wise:
character_matrix <- matrix(letters[1:6], nrow = 2, ncol = 3, byrow = TRUE)
character_matrix
##      [,1] [,2] [,3]
## [1,] "a"  "b"  "c" 
## [2,] "d"  "e"  "f"
class(character_matrix)  # Returns "matrix" "array".
## [1] "matrix" "array"
typeof(character_matrix)  # Returns "character".
## [1] "character"

To select specific elements, rows, or columns within a matrix, square brackets are used. The cbind() and rbind() functions enable the combination of columns and rows, respectively.

# Print element in the second row and first column:
character_matrix[2, 1]
## [1] "d"
# Print the second row:
character_matrix[2, ]
## [1] "d" "e" "f"
# Combine matrices:
x <- matrix(1:4, nrow = 2, ncol = 2)
y <- matrix(101:104, nrow = 2, ncol = 2)
rbind(x, y)  # Combines matrices x and y row-wise.
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## [3,]  101  103
## [4,]  102  104
cbind(x, y)  # Combines matrices x and y column-wise.
##      [,1] [,2] [,3] [,4]
## [1,]    1    3  101  103
## [2,]    2    4  102  104
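Beyond selecting and combining, numeric matrices support the usual matrix arithmetic. A brief sketch with the 2x2 matrix x from above:

```r
x <- matrix(1:4, nrow = 2, ncol = 2)

dim(x)   # number of rows and columns: 2 2
t(x)     # transpose
x * x    # element-wise product; rows are (1, 9) and (4, 16)
x %*% x  # matrix product; rows are (7, 15) and (10, 22)

# apply() runs a function over rows (1) or columns (2)
apply(x, 1, sum)  # row sums: 4 6
apply(x, 2, sum)  # column sums: 3 7
```

Note the difference between * (element by element) and %*% (matrix multiplication).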

4.4 List (list)

A list (list) in R serves as an ordered collection of objects. In contrast to vectors, elements within a list are not required to be of the same type. Moreover, some list elements may store multiple sub-elements, allowing for complex nested structures. For instance, a single element of a list might itself be a matrix or another list.

# List
my_list <- list(1, "a", TRUE, 1+4i, 
                c(1, 2, 3), matrix(1:8, 2, 4), list("c",4))
names(my_list) <- c("num_1", "char_a", "log_T", "complex_1p4i",
                    "vec", "mat", "list")
my_list
## $num_1
## [1] 1
## 
## $char_a
## [1] "a"
## 
## $log_T
## [1] TRUE
## 
## $complex_1p4i
## [1] 1+4i
## 
## $vec
## [1] 1 2 3
## 
## $mat
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8
## 
## $list
## $list[[1]]
## [1] "c"
## 
## $list[[2]]
## [1] 4
class(my_list)  # Returns "list".
## [1] "list"

The content of elements can be retrieved by using double square brackets:

# Select second element:
my_list[[2]]
## [1] "a"
# Select element named "mat":
my_list[["mat"]]
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8
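Single square brackets behave differently from double square brackets: they return a shorter list rather than the content itself. The sketch below uses a reduced version of my_list to illustrate this, along with the $ shortcut and nested access:

```r
my_list <- list(num_1 = 1, char_a = "a", vec = c(1, 2, 3),
                list = list("c", 4))

class(my_list[2])       # "list": single brackets return a sub-list
class(my_list[[2]])     # "character": double brackets return the content
my_list$vec             # same as my_list[["vec"]]
my_list[["list"]][[1]]  # "c": chained brackets reach nested elements

# Add or remove elements by name
my_list$new <- TRUE
my_list$num_1 <- NULL
names(my_list)          # "char_a" "vec" "list" "new"
```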

4.5 Data Frame (data.frame)

A data frame (data.frame) in R resembles a matrix in its two-dimensional, rectangular structure. However, unlike a matrix, a data frame allows each column to contain a different data type. Therefore, within each column (or vector), the elements must be homogeneous, but different columns can accommodate distinct types. Typically, when importing data into R, the default object type used is a data frame.

# Vectors
student_names <- c("Anna", "Ella", "Sophia")
student_ages <- c(23, 21, 25)
student_grades <- c("A", "B", "A")
student_major <- c("Math", "Biology", "Physics")

# Data frame
students_df <- data.frame(name = student_names, 
                          age = student_ages, 
                          grade = student_grades, 
                          major = student_major)
students_df
##     name age grade   major
## 1   Anna  23     A    Math
## 2   Ella  21     B Biology
## 3 Sophia  25     A Physics
class(students_df)  # Returns "data.frame".
## [1] "data.frame"

Data frames are frequently used for data storage and manipulation in R. The following illustrates some common functions used on data frames:

# Access a column in the data frame
students_df$name

# Alternative way to access a column:
students_df[["name"]]

# Access second row in third column:
students_df[2, 3]
## [1] "Anna"   "Ella"   "Sophia"
## [1] "Anna"   "Ella"   "Sophia"
## [1] "B"
# When selecting just one column, data frame produces a vector
class(students_df[, 3])

# To avoid this, add drop = FALSE
class(students_df[, 3 , drop = FALSE])
## [1] "character"
## [1] "data.frame"
# Add a column to the data frame
students_df$gpa <- c(3.8, 3.5, 3.9)
students_df
##     name age grade   major gpa
## 1   Anna  23     A    Math 3.8
## 2   Ella  21     B Biology 3.5
## 3 Sophia  25     A Physics 3.9
# Subset the data frame
students_df[students_df$age > 22 & students_df$gpa > 3.6, ]
##     name age grade   major gpa
## 1   Anna  23     A    Math 3.8
## 3 Sophia  25     A Physics 3.9
# Number of columns and rows
ncol(students_df)
nrow(students_df)

# Column and row names
colnames(students_df)
rownames(students_df)
## [1] 5
## [1] 3
## [1] "name"  "age"   "grade" "major" "gpa"  
## [1] "1" "2" "3"
# Change column names
colnames(students_df) <- c("Name", "Age", "Grade", "Major", "GPA")
students_df
##     Name Age Grade   Major GPA
## 1   Anna  23     A    Math 3.8
## 2   Ella  21     B Biology 3.5
## 3 Sophia  25     A Physics 3.9
# Take a look at the data type of each column
str(students_df)
## 'data.frame':    3 obs. of  5 variables:
##  $ Name : chr  "Anna" "Ella" "Sophia"
##  $ Age  : num  23 21 25
##  $ Grade: chr  "A" "B" "A"
##  $ Major: chr  "Math" "Biology" "Physics"
##  $ GPA  : num  3.8 3.5 3.9
# Take a look at the data in a separate window
View(students_df)

These examples illustrate just a few of the operations you can perform with data frames in R. With additional libraries like dplyr, tidyr, and data.table, more complex manipulations are possible.
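Before turning to those packages, note that base R already covers sorting and simple grouped summaries. The sketch below rebuilds a small version of the student data with lowercase column names so that it runs on its own:

```r
students_df <- data.frame(name  = c("Anna", "Ella", "Sophia"),
                          age   = c(23, 21, 25),
                          grade = c("A", "B", "A"),
                          gpa   = c(3.8, 3.5, 3.9))

# Sort rows by age, youngest first
sorted_df <- students_df[order(students_df$age), ]
sorted_df

# Class of every column
sapply(students_df, class)

# Average GPA by grade
gpa_by_grade <- aggregate(gpa ~ grade, data = students_df, FUN = mean)
gpa_by_grade
```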

4.6 Tibble (tbl_df)

A tibble (tbl_df) is a more convenient version of a data frame. It is part of the tibble package in the tidyverse collection of R packages. To use tibbles, you need to install the tibble package by executing install.packages("tibble") in your console. Don’t forget to include library("tibble") at the beginning of your R script.

To create a tibble, you can use the tibble() function. Here’s an example:

# Load R package
library("tibble")

# Create a new tibble
tib <- tibble(name = letters[1:3], 
              id = sample(1:5, 3),
              age = sample(18:70, 3),
              sex = factor(c("M", "F", "F")))
tib
## # A tibble: 3 × 4
##   name     id   age sex  
##   <chr> <int> <int> <fct>
## 1 a         4    39 M    
## 2 b         3    33 F    
## 3 c         2    69 F
class(tib)
## [1] "tbl_df"     "tbl"        "data.frame"

One advantage of tibbles is that they make it easy to calculate and create new columns. Here’s an example:

tib <- tibble(tib, idvage = id/age)
tib
## # A tibble: 3 × 5
##   name     id   age sex   idvage
##   <chr> <int> <int> <fct>  <dbl>
## 1 a         4    39 M     0.103 
## 2 b         3    33 F     0.0909
## 3 c         2    69 F     0.0290

Unlike data.frame(), which by default converts non-standard column names into valid R names, tibbles keep column names exactly as given. You can use special characters or numbers as column names. Here’s an example:

tibble(`:)` = "smile", ` ` = "space", `2000` = "number")
## # A tibble: 1 × 3
##   `:)`  ` `   `2000`
##   <chr> <chr> <chr> 
## 1 smile space number
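For comparison, the sketch below shows how data.frame() treats such names and how to access them in a tibble. It assumes the tibble package is installed; the object tb and its column names are made up for illustration:

```r
library("tibble")

# data.frame() rewrites non-standard names by default ...
names(data.frame(`2000` = "number"))                       # "X2000"
# ... unless check.names = FALSE is set
names(data.frame(`2000` = "number", check.names = FALSE))  # "2000"

# Columns with non-standard names are accessed with backticks or [[ ]]
tb <- tibble(`2000` = "number", `my col` = 1)
tb$`2000`       # "number"
tb[["my col"]]  # 1
```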

Another way to create a tibble is with the tribble() function. It allows you to define column headings using formulas starting with ~ and separate entries with commas. Here’s an example:

tribble(
    ~x, ~y, ~z,
    "a", 2, 3.6,
    "b", 1, 8.5
)
## # A tibble: 2 × 3
##   x         y     z
##   <chr> <dbl> <dbl>
## 1 a         2   3.6
## 2 b         1   8.5

For additional functions, you can refer to this cheat sheet on tibble and dplyr.

Tidyverse

The tibble package is part of the tidyverse environment, which is a collection of R packages with a shared design philosophy, grammar, and data structures. To install tidyverse, execute install.packages("tidyverse"), which includes tibble, readr, dplyr, tidyr, ggplot2, and more. Key functions in the tidyverse include select(), filter(), mutate(), arrange(), count(), group_by(), and summarize(). An interesting operator in the tidyverse is the pipe operator %>%, which allows you to chain functions together in a readable and sequential manner. With the pipe operator, you can order the functions as they are applied, making your code more expressive and easier to understand. Here’s an example:

library("tidyverse")
x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907)

# Apply several functions to x:
y <- round(exp(diff(log(x))), 1)
y
## [1]  3.3  1.8  1.6  0.5  0.3  0.1 48.8  1.1
# Perform the same computations using pipe operators:
y <- x %>% log() %>% diff() %>% exp() %>% round(1)
y
## [1]  3.3  1.8  1.6  0.5  0.3  0.1 48.8  1.1

By using the %>% operator, each function is applied to the previous result, simplifying the code and improving its readability.
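The pipe is most useful in combination with the data manipulation functions listed above. The following sketch assumes the dplyr package is installed; the data set econ and its values are invented for illustration:

```r
library("dplyr")

# Hypothetical data for illustration
econ <- tibble(country = c("A", "A", "B", "B"),
               year    = c(2020, 2021, 2020, 2021),
               gdp     = c(100, 110, 200, 190))

gdp_summary <- econ %>%
    filter(year >= 2020) %>%             # keep rows that satisfy a condition
    mutate(gdp_index = gdp / 100) %>%    # add a new column
    group_by(country) %>%                # define groups
    summarize(mean_gdp = mean(gdp)) %>%  # collapse to one row per group
    arrange(desc(mean_gdp))              # sort the result, largest first
gdp_summary
```

Reading the pipeline from top to bottom gives the order in which the steps are applied.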

To delve deeper into the tidyverse, explore their official website: www.tidyverse.org. Another resource is the R-Bootcamp, available at r-bootcamp.netlify.app. Additionally, DataCamp provides a comprehensive skill track devoted to the tidyverse, named Tidyverse Fundamentals with R.

4.7 Data Table (data.table)

A data table (data.table) is similar to a data frame but with more advanced features for data manipulation. In fact, data.table and tibble can be considered competitors, with each offering enhancements over the standard data frame. While data tables offer high-speed functions and are optimized for large datasets, tibbles from the tidyverse are slower but are more user-friendly. The syntax used in data.table functions may seem esoteric, differing from that used in tidyverse. Like tibble, data.table is not a part of base R. It requires the installation of the data.table package via install.packages("data.table"), followed by library("data.table") at the beginning of your script.

To create a data table, you can use the data.table() function. Here’s an example:

# Load R package
library("data.table")

# Create a new data.table:
dt <- data.table(name = letters[1:3], 
                 id = sample(1:5,3),
                 age = sample(18:70,3), 
                 sex = factor(c("M", "F", "F")))
dt
##    name id age sex
## 1:    a  3  47   M
## 2:    b  4  25   F
## 3:    c  1  57   F
class(dt)
## [1] "data.table" "data.frame"

Columns in a data table can be referenced directly, and new variables can be created using the := operator:

# Selection with data frame vs. data table:
df <- as.data.frame(dt) # create a data frame for comparison
df[df$sex == "M", ] # select with data frame
##   name id age sex
## 1    a  3  47   M
dt[sex == "M", ] # select with data table
##    name id age sex
## 1:    a  3  47   M
# Variable assignment with data frame vs. data table:
df$id_over_age <- df$id / df$age # assign with data frame
dt[, id_over_age := id / age] # assign with data table

You can select multiple variables with a list:

dt[, list(sex, age)]
##    sex age
## 1:   M  47
## 2:   F  25
## 3:   F  57

Multiple variables can be assigned simultaneously, where the LHS of the := operator is a character vector of new variable names, and the RHS is a list of operations:

dt[, c("id_times_age", "id_plus_age") := list(id * age, id + age)]
dt
##    name id age sex id_over_age id_times_age id_plus_age
## 1:    a  3  47   M  0.06382979          141          50
## 2:    b  4  25   F  0.16000000          100          29
## 3:    c  1  57   F  0.01754386           57          58

Many operations in data analysis need to be done by group (e.g. calculating average unemployment by year). In such cases, data table introduces a third dimension to perform these operations. Specifically, the data table syntax is DT[i,j,by] with options to

  • subset rows using i (which rows?),
  • manipulate columns with j (what to do?), and
  • group according to by (grouped by what?).

Here is an example:

# Produce table with average age by sex:
dt[, mean(age), by = sex]
##    sex V1
## 1:   M 47
## 2:   F 41
# Do the same but name the columns "Gender" and "Age by Gender":
dt[, list(`Age by Gender` = mean(age)), by = list(Gender = sex)]
##    Gender Age by Gender
## 1:      M            47
## 2:      F            41
# Assign a new variable with average age by sex named "age_by_sex":
dt[, age_by_sex := mean(age), by = sex]
dt
##    name id age sex id_over_age id_times_age id_plus_age
## 1:    a  3  47   M  0.06382979          141          50
## 2:    b  4  25   F  0.16000000          100          29
## 3:    c  1  57   F  0.01754386           57          58
##    age_by_sex
## 1:         47
## 2:         41
## 3:         41
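The three parts i, j, and by can be combined in a single call. The sketch below uses a small made-up table (dt2) with fixed values, so the results are reproducible, and also illustrates that := changes a data table in place:

```r
library("data.table")

# Deterministic example data (the values are made up)
dt2 <- data.table(name = letters[1:4],
                  age  = c(47, 25, 57, 31),
                  sex  = factor(c("M", "F", "F", "M")))

# i, j, and by together: among people older than 30,
# count observations (.N) and average the age, separately by sex
res <- dt2[age > 30, .(n = .N, mean_age = mean(age)), by = sex]
res

# := modifies a data table in place; copy() creates an independent object first
dt3 <- copy(dt2)
dt3[, age_next_year := age + 1]
names(dt2)  # unchanged: "name" "age" "sex"
names(dt3)  # includes "age_next_year"
```

Here .() is shorthand for list(), and .N is the number of rows in each group.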

For additional information about data tables and their powerful features, check out the Intro to Data Table documentation and this cheat sheet for data.table functions. Furthermore, DataCamp provides several courses on data.table.

4.8 Extensible Time Series (xts)

xts (extensible time series) objects are specialized data structures designed for time series data. These are datasets where each observation corresponds to a specific timestamp. xts objects attach an index to the data, aligning each data point with its associated time. This functionality simplifies data manipulation and minimizes potential errors:

Figure 4.1: Data with Index. Source: DataCamp (https://learn.datacamp.com/courses/manipulating-time-series-data-with-xts-and-zoo-in-r).

The index attached to an xts object is usually a Date or POSIXct vector, maintaining the data in chronological order from earliest to latest. If you wish to sort data (such as stock prices) by another variable (like trade volume), you’ll first need to convert the xts object back to a data frame, as xts objects preserve the time order. xts objects are built upon zoo objects (Zeileis’ Ordered Observations), another class of time-indexed data structures. xts objects enhance these base structures by providing additional features.

Like tibble and data.table, xts is not included in base R. To use it, you need to install the xts package using install.packages("xts"), then include library("xts") at the start of your script.

To create an xts object, use the xts() function which associates data with a time index (order.by = time_index):

# Load R package
library("xts")

# Create a new xts object from a matrix:
data <- matrix(1:4, ncol = 2, nrow = 2, 
               dimnames = list(NULL, c("a", "b")))
data
##      a b
## [1,] 1 3
## [2,] 2 4
time_index <- as.Date(c("2020-06-01", "2020-07-01"))
time_index
## [1] "2020-06-01" "2020-07-01"
dxts <- xts(x = data, order.by = time_index)
dxts
##            a b
## 2020-06-01 1 3
## 2020-07-01 2 4
class(dxts)
## [1] "xts" "zoo"
tclass(dxts)
## [1] "Date"
# Extract time index
index(dxts)
## [1] "2020-06-01" "2020-07-01"
# Extract data without time index
coredata(dxts)
##      a b
## [1,] 1 3
## [2,] 2 4
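Because of the time index, xts objects can be subset with date strings. The sketch below assumes the xts package is installed and uses invented prices and volumes (px); it also shows the conversion to a data frame that is needed before sorting by a variable other than time:

```r
library("xts")

# Made-up monthly prices and volumes
idx <- as.Date(c("2020-06-01", "2020-07-01", "2020-08-01"))
px  <- xts(cbind(price = c(10, 12, 11), volume = c(300, 100, 200)),
           order.by = idx)

px["2020-07"]   # all observations in July 2020
px["2020-07/"]  # July 2020 onwards
diff(px$price)  # change from the previous period, NA for the first row

# To sort by something other than time, convert to a data frame first
px_df <- data.frame(date = index(px), coredata(px))
px_sorted <- px_df[order(px_df$volume), ]
px_sorted
```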

To delve deeper into xts and zoo objects, consider reading the guides Manipulating Time Series Data in R with xts & zoo and Time Series in R: Quick Reference. Additionally, DataCamp provides in-depth courses on these topics.

If you’re working within the tidyverse environment, the R package tidyquant offers seamless integration with xts and zoo. Lastly, this handy cheat sheet provides a quick reference on xts and zoo functions.