4 R Data Types and Structures
The R language supports a broad array of operations such as mathematical calculations, logical analyses, and text manipulation. However, the applicability of a function to a variable depends on the variable’s data type. For instance, an arithmetic function to add two variables won’t work if the variables store text.
R supports various data types, including numeric (x <- 15), character (x <- "Hello!"), and logical (x <- TRUE or x <- FALSE). In addition to single values (scalars), R allows variables to hold collections of numbers or strings using vectors, matrices, lists, or data frames. Advanced data structures such as tibbles, data tables, and xts objects provide additional features beyond traditional data frames.
In this chapter, we will explore the following data types and structures:
- Scalar: A single data element, such as a number or a character string.
- Vector: A one-dimensional array that contains elements of the same type.
- Matrix (matrix): A two-dimensional array with elements of the same type.
- List (list): A one-dimensional array capable of storing various data types.
- Data Frame (data.frame): A two-dimensional array that can accommodate columns of different types.
- Tibble (tbl_df): An enhanced version of data frames, offering user-friendly features.
- Data Table (data.table): An optimized data frame extension designed for speed and handling large datasets.
- Extensible Time Series (xts): A time-indexed matrix specifically designed for time series data.
Understanding the data type of variables is crucial because it determines the operations and functions that can be applied to them.
It’s worth noting that R provides so-called generic functions, which keep a single name but perform different actions depending on the class of the data object they receive. This allows for more flexible and intuitive programming. For example, the summary() function in R is a generic function. When applied to a numeric vector, it provides statistical summaries such as the mean, median, and quartiles. When applied to a data frame, it gives a summary of each variable: the minimum, maximum, and quartiles for numerical variables, and counts per level for categorical variables.
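To make the behavior of summary() described above concrete, here is a minimal sketch. The vector ages and the factor majors are hypothetical example data, not taken from the text.

```r
# Hypothetical example data (not from the text).
ages   <- c(23, 21, 25)
majors <- factor(c("Math", "Biology", "Math"))

# Numeric vector: summary() reports min, quartiles, median, mean, and max.
num_summary <- summary(ages)
num_summary

# Data frame: summary() treats each column according to its type,
# giving statistics for "age" and counts per level for "major".
df_summary <- summary(data.frame(age = ages, major = majors))
df_summary
```

The same call, summary(), is used in both cases; R selects the appropriate behavior from the class of the argument.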
4.1 Scalar
Scalars in R are variables holding a single value; technically, R stores them as vectors of length one. You can determine an object’s type by applying the class() function to the variable.
Numbers, Characters, and Logical Values
# Numeric (a.k.a. Double)
w <- 5.5 # w is a decimal number.
class(w) # Returns "numeric".
# Integer
x <- 10L # The L tells R to store x as an integer instead of a decimal number.
class(x) # Returns "integer".
# Complex
u <- 3 + 4i # u is a complex number, where 3 is real and 4 is imaginary.
class(u) # Returns "complex".
# Character
y <- "Hello, World!" # y is a character string.
class(y) # Returns "character".
# Logical
z <- TRUE # z is a logical value.
class(z) # Returns "logical".
## [1] "numeric"
## [1] "integer"
## [1] "complex"
## [1] "character"
## [1] "logical"
An object’s type dictates which functions can be applied. For example, mathematical functions are applicable to numbers but not characters:
# Mathematical operations
2 + 2 # Results in 4.
3 * 5 # Results in 15.
(1 + 2) * 3 # Results in 9 (parentheses take precedence).
# Logical operations
TRUE & FALSE # Results in FALSE (logical AND).
TRUE | FALSE # Results in TRUE (logical OR).
# String operations
paste("Hello", "World!") # Concatenates strings, results in "Hello World!".
nchar("Hello") # Counts characters in a string, results in 5.
## [1] 4
## [1] 15
## [1] 9
## [1] FALSE
## [1] TRUE
## [1] "Hello World!"
## [1] 5
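The claim that mathematical functions do not apply to characters can be checked directly. The following is a small sketch, assuming you want the script to continue after the error, which is why the failing call is wrapped in tryCatch().

```r
# Adding a character string and a number raises an error.
# tryCatch() captures the error message instead of stopping the script.
result <- tryCatch("2" + 2, error = function(e) conditionMessage(e))
result

# Converting the string to a number first makes the addition valid.
converted_sum <- as.numeric("2") + 2
converted_sum

# is.numeric() and is.character() test the type before a function is applied.
is.numeric("2")
is.character("2")
```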
Dates and Times
When conducting economic research, it is common to deal with data types specifically designed for storing date and time information:
# Date
v <- as.Date("2023-06-30") # v is a Date.
# The default input format is %Y-%m-%d, where
# - %Y is year in 4 digits,
# - %m is month with 2 digits, and
# - %d is day with 2 digits.
class(v) # Returns "Date".
## [1] "Date"
# POSIXct (Time)
t <- as.POSIXct("2023-06-30 18:47:10", tz = "America/Chicago") # t is a POSIXct.
# The default input format is %Y-%m-%d %H:%M:%S, where
# - %H is hour out of 24,
# - %M is minute out of 60, and
# - %S is second out of 60.
# The tz input is the time zone, given as a name listed by OlsonNames().
# "America/Chicago" observes Central Daylight Time (CDT) on this date.
class(t) # Returns "POSIXct".
## [1] "POSIXct" "POSIXt"
The default input format, %Y-%m-%d or %Y-%m-%d %H:%M:%S, can be changed by specifying the format argument. The output format can be adjusted by applying the format() function to the object:
# Date with custom input format:
v <- as.Date("April 6 -- 23", format = "%B %d -- %y")
v # Returns default output format: %Y-%m-%d.
# Date with custom output format:
format(v, format = "%B %d, %Y")
## [1] "2023-04-06"
## [1] "April 06, 2023"
The syntax for different date formats can be found by typing ?strptime in the R console. Some of the most commonly used formats are outlined in the table below:
Specification | Description | Example |
---|---|---|
%a | Abbreviated weekday | Sun, Thu |
%A | Full weekday | Sunday, Thursday |
%b or %h | Abbreviated month | May, Jul |
%B | Full month | May, July |
%d | Day of the month, 01-31 | 27, 07 |
%j | Day of the year, 001-366 | 148, 188 |
%m | Month, 01-12 | 05, 07 |
%U | Week, 00-53, with Sunday as first day of the week | 22, 27 |
%w | Weekday, 0-6, Sunday is 0 | 0, 4 |
%W | Week, 00-53, with Monday as first day of the week | 21, 27 |
%x | Date, locale-specific | |
%y | Year without century, 00-99; on input, 00 to 68 are prefixed by 20 and 69 to 99 by 19 | 84, 05 |
%Y | Year with century | 1984, 2005 |
%C | Century | 19, 20 |
%D | Date formatted %m/%d/%y | 05/27/84 |
%u | Weekday, 1-7, Monday is 1 | 7, 4 |
%n | Newline on output or arbitrary whitespace on input | |
%t | Tab on output or arbitrary whitespace on input | |
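As a brief illustration of the specifications in the table, the sketch below parses and formats the date May 27, 1984, the first example value used in the table. Only locale-independent specifications are used, so the results should not depend on the language settings of your system.

```r
# Parse a date written as day/month/two-digit year.
d <- as.Date("27/05/84", format = "%d/%m/%y")
d  # printed in the default output format %Y-%m-%d

# Format the same date with several specifications from the table.
format(d, "%j")        # day of the year
format(d, "%u")        # weekday number, Monday is 1
format(d, "%m/%d/%Y")  # month/day/four-digit year

# Two-digit years on input: 00 to 68 become 20xx, 69 to 99 become 19xx.
two_digit_years <- as.Date(c("01/01/68", "01/01/69"), format = "%d/%m/%y")
format(two_digit_years, "%Y")
```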
Here are some example operations for Date objects:
# Date Operations
date1 <- as.Date("2023-06-30")
date2 <- as.Date("2023-01-01")
# Subtract dates to get the number of days between
days_between <- date1 - date2
days_between
# Add 30 days to a date
date1 + 30
## Time difference of 180 days
## [1] "2023-07-30"
4.2 Vector
In R, a vector is a homogeneous sequence of elements, meaning they must all be of the same basic type. As such, a vector can hold multiple numbers, but it cannot mix types, such as having both numbers and words. The function c() (for combine) can be used to create a vector:
# Numeric vector
numeric_vector <- c(1, 2, 3, 4, 5)
class(numeric_vector) # Returns "numeric".
# Character vector
character_vector <- c("Hello", "World", "!")
class(character_vector) # Returns "character".
# Logical vector
logical_vector <- c(TRUE, FALSE, TRUE)
class(logical_vector) # Returns "logical".
## [1] "numeric"
## [1] "character"
## [1] "logical"
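Because a vector cannot mix types, c() silently converts its inputs to a common type when they differ. The sketch below shows this conversion; the variable names mixed_1 and mixed_2 are chosen for illustration only.

```r
# Mixing numbers, strings, and logicals: every element becomes a character string.
mixed_1 <- c(1, "a", TRUE)
mixed_1
class(mixed_1)

# Mixing numbers and logicals: TRUE becomes 1 and FALSE becomes 0.
mixed_2 <- c(1.5, TRUE, FALSE)
mixed_2
class(mixed_2)
```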
The function c() can also be used to add elements to a vector:
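The code for this step is not shown in the text. A minimal sketch that is consistent with the output below, assuming the numeric vector defined earlier, could look like this:

```r
numeric_vector <- c(1, 2, 3, 4, 5)

# Append the value 6 to the end of the existing vector.
numeric_vector <- c(numeric_vector, 6)
numeric_vector
```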
## [1] 1 2 3 4 5 6
The seq() function creates a sequence of numbers or dates:
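The call that generates the numeric sequence is not shown in the text. One call that yields the printed values, under the assumption that the step size is 0.1, is:

```r
# Create a sequence of numbers from 1 to 1.5 in steps of 0.1:
x <- seq(from = 1, to = 1.5, by = 0.1)
x
```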
## [1] 1.0 1.1 1.2 1.3 1.4 1.5
# Create a sequence of dates:
x <- seq(from = as.Date("2004-05-01"), to = as.Date("2004-12-01"), by = "month")
x
## [1] "2004-05-01" "2004-06-01" "2004-07-01" "2004-08-01"
## [5] "2004-09-01" "2004-10-01" "2004-11-01" "2004-12-01"
Missing data is represented as NA (not available). The function is.na() indicates the elements that are missing and anyNA() returns TRUE if the vector contains any missing values:
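The vector used for this example is not shown in the text. The sketch below uses hypothetical values, chosen only so that the positions of the missing entries match the output that follows.

```r
# Hypothetical vector with missing values in positions 3, 4, and 10.
x <- c(1, 2, NA, NA, 5, 6, 7, 8, 9, NA)

is.na(x)  # TRUE for each missing element
anyNA(x)  # TRUE if at least one element is missing
```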
## [1] FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE
## [1] TRUE
Factors are vectors that restrict entries to one of a predefined set of categories, called levels. In that sense they generalize logical vectors, whose entries are restricted to TRUE and FALSE:
# Unordered factors, e.g. categories "Male" and "Female":
gender_vector <- c("Male", "Female", "Male", "Male", "Male", "Female", "Male")
factor_gender_vector <- factor(gender_vector)
factor_gender_vector
## [1] Male Female Male Male Male Female Male
## Levels: Female Male
# Ordered factors, e.g. categories with ordering Low < Medium < High:
temperature_vector <- c("High", "Low", "Low", "Low", "Medium", "Low", "Low")
factor_temperature_vector <- factor(temperature_vector,
                                    ordered = TRUE,
                                    levels = c("Low", "Medium", "High"))
factor_temperature_vector
## [1] High Low Low Low Medium Low Low
## Levels: Low < Medium < High
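Once a factor is defined, its levels can be inspected, counted, and, for ordered factors, compared. The following sketch rebuilds the temperature factor from above under the shorter, illustrative name temperature.

```r
temperature_vector <- c("High", "Low", "Low", "Low", "Medium", "Low", "Low")
temperature <- factor(temperature_vector,
                      ordered = TRUE,
                      levels = c("Low", "Medium", "High"))

levels(temperature)      # the predefined categories, in order
table(temperature)       # number of observations per category
temperature[1] > temperature[2]  # comparisons respect the ordering: High > Low
as.integer(temperature)  # internal integer codes (Low = 1, Medium = 2, High = 3)
```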
4.3 Matrix (matrix)
A matrix in R (matrix) is a two-dimensional array that extends atomic vectors, containing both rows and columns. The elements within a matrix must be of the same data type.
# Create a 3x3 numeric matrix, column-wise:
numeric_matrix <- matrix(1:9, nrow = 3, ncol = 3)
numeric_matrix
class(numeric_matrix) # Returns "matrix" "array".
typeof(numeric_matrix) # Returns "integer".
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
## [1] "matrix" "array"
## [1] "integer"
# Create a 2x3 character matrix, row-wise:
character_matrix <- matrix(letters[1:6], nrow = 2, ncol = 3, byrow = TRUE)
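character_matrix
class(character_matrix) # Returns "matrix" "array".
typeof(character_matrix) # Returns "character".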
## [,1] [,2] [,3]
## [1,] "a" "b" "c"
## [2,] "d" "e" "f"
## [1] "matrix" "array"
## [1] "character"
To select specific elements, rows, or columns within a matrix, square brackets are used. The cbind() and rbind() functions enable the combination of columns and rows, respectively.
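The selection code is not shown in the text. A minimal sketch that matches the output below, assuming the character matrix defined earlier, is:

```r
character_matrix <- matrix(letters[1:6], nrow = 2, ncol = 3, byrow = TRUE)

# Select the element in row 2, column 1:
character_matrix[2, 1]

# Select the entire second row (leave the column index empty):
character_matrix[2, ]
```

Leaving the row index empty instead, as in character_matrix[, 3], selects an entire column.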
## [1] "d"
## [1] "d" "e" "f"
# Combine matrices:
x <- matrix(1:4, nrow = 2, ncol = 2)
y <- matrix(101:104, nrow = 2, ncol = 2)
rbind(x, y) # Combines matrices x and y row-wise.
cbind(x, y) # Combines matrices x and y column-wise.
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
## [3,] 101 103
## [4,] 102 104
## [,1] [,2] [,3] [,4]
## [1,] 1 3 101 103
## [2,] 2 4 102 104
4.4 List (list)
A list (list) in R serves as an ordered collection of objects. In contrast to vectors, elements within a list are not required to be of the same type. Moreover, some list elements may store multiple sub-elements, allowing for complex nested structures. For instance, a single element of a list might itself be a matrix or another list.
# List
my_list <- list(1, "a", TRUE, 1+4i,
c(1, 2, 3), matrix(1:8, 2, 4), list("c",4))
names(my_list) <- c("num_1", "char_a", "log_T", "complex_1p4i",
"vec", "mat", "list")
my_list
class(my_list) # Returns "list".
## $num_1
## [1] 1
##
## $char_a
## [1] "a"
##
## $log_T
## [1] TRUE
##
## $complex_1p4i
## [1] 1+4i
##
## $vec
## [1] 1 2 3
##
## $mat
## [,1] [,2] [,3] [,4]
## [1,] 1 3 5 7
## [2,] 2 4 6 8
##
## $list
## $list[[1]]
## [1] "c"
##
## $list[[2]]
## [1] 4
## [1] "list"
The content of elements can be retrieved by using double square brackets:
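The retrieval code is not shown in the text. The sketch below uses a shortened version of the list defined above, keeping only three of its elements, and reproduces the two results that follow.

```r
# Shortened version of the list above (three of its elements).
my_list <- list(char_a = "a",
                vec = c(1, 2, 3),
                mat = matrix(1:8, 2, 4))

# Double square brackets return the content of an element:
my_list[["char_a"]]
my_list[["mat"]]
```

Single square brackets, as in my_list["char_a"], return a list of length one rather than the element itself, and my_list$char_a is a shorthand for my_list[["char_a"]].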
## [1] "a"
## [,1] [,2] [,3] [,4]
## [1,] 1 3 5 7
## [2,] 2 4 6 8
4.5 Data Frame (data.frame)
A data frame (data.frame) in R resembles a matrix in its two-dimensional, rectangular structure. However, unlike a matrix, a data frame allows each column to contain a different data type. Therefore, within each column (or vector), the elements must be homogeneous, but different columns can accommodate distinct types. Typically, when importing data into R, the default object type used is a data frame.
# Vectors
student_names <- c("Anna", "Ella", "Sophia")
student_ages <- c(23, 21, 25)
student_grades <- c("A", "B", "A")
student_major <- c("Math", "Biology", "Physics")
# Data frame
students_df <- data.frame(name = student_names,
age = student_ages,
grade = student_grades,
major = student_major)
students_df
class(students_df) # Returns "data.frame".
## name age grade major
## 1 Anna 23 A Math
## 2 Ella 21 B Biology
## 3 Sophia 25 A Physics
## [1] "data.frame"
Data frames are frequently used for data storage and manipulation in R. The following illustrates some common functions used on data frames:
# Access a column in the data frame
students_df$name
# Alternative way to access a column:
students_df[["name"]]
# Access second row in third column:
students_df[2, 3]
## [1] "Anna" "Ella" "Sophia"
## [1] "Anna" "Ella" "Sophia"
## [1] "B"
# When selecting just one column, data frame produces a vector
class(students_df[, 3])
# To avoid this, add drop = FALSE
class(students_df[, 3 , drop = FALSE])
## [1] "character"
## [1] "data.frame"
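The code that adds the gpa column and selects a subset of rows is not shown in the text. The following sketch reproduces the two tables below; the filter condition (grade equal to "A") is an assumption, since other conditions would select the same two rows.

```r
students_df <- data.frame(name = c("Anna", "Ella", "Sophia"),
                          age = c(23, 21, 25),
                          grade = c("A", "B", "A"),
                          major = c("Math", "Biology", "Physics"))

# Add a new column to the data frame
students_df$gpa <- c(3.8, 3.5, 3.9)
students_df

# Keep only the rows that satisfy a condition (assumed here: grade "A")
students_df[students_df$grade == "A", ]
```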
## name age grade major gpa
## 1 Anna 23 A Math 3.8
## 2 Ella 21 B Biology 3.5
## 3 Sophia 25 A Physics 3.9
## name age grade major gpa
## 1 Anna 23 A Math 3.8
## 3 Sophia 25 A Physics 3.9
# Number of columns and rows
ncol(students_df)
nrow(students_df)
# Column and row names
colnames(students_df)
rownames(students_df)
## [1] 5
## [1] 3
## [1] "name" "age" "grade" "major" "gpa"
## [1] "1" "2" "3"
# Change column names
colnames(students_df) <- c("Name", "Age", "Grade", "Major", "GPA")
students_df
# Display the structure of the data frame
str(students_df)
## Name Age Grade Major GPA
## 1 Anna 23 A Math 3.8
## 2 Ella 21 B Biology 3.5
## 3 Sophia 25 A Physics 3.9
## 'data.frame': 3 obs. of 5 variables:
## $ Name : chr "Anna" "Ella" "Sophia"
## $ Age : num 23 21 25
## $ Grade: chr "A" "B" "A"
## $ Major: chr "Math" "Biology" "Physics"
## $ GPA : num 3.8 3.5 3.9
These examples illustrate just a few of the operations you can perform with data frames in R. With additional libraries like dplyr, tidyr, and data.table, more complex manipulations are possible.
4.6 Tibble (tbl_df)
A tibble (tbl_df) is a more convenient version of a data frame. It is part of the tibble package in the tidyverse collection of R packages. To use tibbles, you need to install the tibble package by executing install.packages("tibble") in your console. Don’t forget to include library("tibble") at the beginning of your R script.
To create a tibble, you can use the tibble() function. Here’s an example:
# Load R package
library("tibble")
# Create a new tibble
tib <- tibble(name = letters[1:3],
id = sample(1:5, 3),
age = sample(18:70, 3),
sex = factor(c("M", "F", "F")))
tib
class(tib) # Returns "tbl_df" "tbl" "data.frame".
## # A tibble: 3 × 4
## name id age sex
## <chr> <int> <int> <fct>
## 1 a 4 39 M
## 2 b 3 33 F
## 3 c 2 69 F
## [1] "tbl_df" "tbl" "data.frame"
One advantage of tibbles is that they make it easy to calculate and create new columns. Here’s an example:
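The code for this example is not shown in the text. One way to obtain the table below, sketched here under the assumption that the tibble package is installed, is to define the new column inside tibble() itself, where it can refer to columns created earlier in the same call. The id and age values are fixed to those printed above rather than drawn with sample(), so the result is reproducible.

```r
library("tibble")

# Columns defined earlier in the call (id, age) can be used to compute idvage.
tib <- tibble(name = letters[1:3],
              id = c(4L, 3L, 2L),
              age = c(39L, 33L, 69L),
              sex = factor(c("M", "F", "F")),
              idvage = id / age)
tib
```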
## # A tibble: 3 × 5
## name id age sex idvage
## <chr> <int> <int> <fct> <dbl>
## 1 a 4 39 M 0.103
## 2 b 3 33 F 0.0909
## 3 c 2 69 F 0.0290
Unlike data.frame(), which by default converts column names into syntactically valid R names, tibble() keeps non-standard column names as they are. You can use special characters or numbers as column names. Here’s an example:
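The code for this example is not shown in the text. A sketch that yields the table below, assuming the tibble package is installed, encloses each non-standard name in backticks:

```r
library("tibble")

# Backticks allow names that are not valid R identifiers.
tb <- tibble(`:)` = "smile", ` ` = "space", `2000` = "number")
tb
```

The same backtick notation is needed to refer to such a column later, for example tb$`2000`.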
## # A tibble: 1 × 3
## `:)` ` ` `2000`
## <chr> <chr> <chr>
## 1 smile space number
Another way to create a tibble is with the tribble() function. It allows you to define column headings using formulas starting with ~ and separate entries with commas. Here’s an example:
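The code for this example is not shown in the text. A sketch consistent with the table below, assuming the tibble package is installed, lays out the data row by row:

```r
library("tibble")

# The first line defines the column names; each following line is one row.
tb <- tribble(
  ~x,  ~y, ~z,
  "a", 2,  3.6,
  "b", 1,  8.5
)
tb
```

This row-wise layout keeps small tables readable in code, since each line of the call corresponds to one row of the result.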
## # A tibble: 2 × 3
## x y z
## <chr> <dbl> <dbl>
## 1 a 2 3.6
## 2 b 1 8.5
For additional functions in tibble and dplyr, you can refer to the corresponding cheat sheet.
Tidyverse
The tibble package is part of the tidyverse environment, which is a collection of R packages with a shared design philosophy, grammar, and data structures. To install tidyverse, execute install.packages("tidyverse"), which includes tibble, readr, dplyr, tidyr, ggplot2, and more. Key functions in the tidyverse include select(), filter(), mutate(), arrange(), count(), group_by(), and summarize(). An interesting operator in the tidyverse is the pipe operator %>%, which allows you to chain functions together in a readable and sequential manner. With the pipe operator, you can order the functions as they are applied, making your code more expressive and easier to understand. Here’s an example:
library("tidyverse")
x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907)
# Apply several functions to x:
y <- round(exp(diff(log(x))), 1)
y
## [1] 3.3 1.8 1.6 0.5 0.3 0.1 48.8 1.1
# Perform the same computations using pipe operators:
y <- x %>% log() %>% diff() %>% exp() %>% round(1)
y
## [1] 3.3 1.8 1.6 0.5 0.3 0.1 48.8 1.1
By using the %>% operator, each function is applied to the previous result, simplifying the code and improving its readability.
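The pipe is most useful in combination with the key functions listed above. The sketch below chains filter(), mutate(), arrange(), and select() from the dplyr package; the data frame students and its values are hypothetical and only serve as an illustration.

```r
library("dplyr")

# Hypothetical example data.
students <- data.frame(name = c("Anna", "Ella", "Sophia"),
                       age = c(23, 21, 25),
                       gpa = c(3.8, 3.5, 3.9))

top_students <- students %>%
  filter(gpa > 3.6) %>%                # keep rows with a GPA above 3.6
  mutate(age_next_year = age + 1) %>%  # create a new column
  arrange(desc(gpa)) %>%               # sort by GPA, highest first
  select(name, gpa, age_next_year)     # keep only these columns
top_students
```

Each step receives the data frame produced by the previous step as its first argument, so the pipeline reads in the order in which the operations are applied.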
To delve deeper into the tidyverse, explore their official website: www.tidyverse.org. Another resource is the R-Bootcamp, available at r-bootcamp.netlify.app. Additionally, DataCamp provides a comprehensive skill track devoted to the tidyverse, named Tidyverse Fundamentals with R.
4.7 Data Table (data.table)
A data table (data.table) is similar to a data frame but with more advanced features for data manipulation. In fact, data.table and tibble can be considered competitors, with each offering enhancements over the standard data frame. While data tables offer high-speed functions and are optimized for large datasets, tibbles from the tidyverse are slower but more user-friendly. The syntax used in data.table functions may seem esoteric, differing from that used in tidyverse. Like tibble, data.table is not a part of base R. It requires the installation of the data.table package via install.packages("data.table"), followed by library("data.table") at the beginning of your script.
To create a data table, you can use the data.table() function. Here’s an example:
# Load R package
library("data.table")
# Create a new data.table:
dt <- data.table(name = letters[1:3],
id = sample(1:5,3),
age = sample(18:70,3),
sex = factor(c("M", "F", "F")))
dt
class(dt) # Returns "data.table" "data.frame".
## name id age sex
## 1: a 3 47 M
## 2: b 4 25 F
## 3: c 1 57 F
## [1] "data.table" "data.frame"
Columns in a data table can be referenced directly, and new variables can be created using the := operator:
# Selection with data frame vs. data table:
df <- as.data.frame(dt) # create a data frame for comparison
df[df$sex == "M", ] # select with data frame
dt[sex == "M"] # select with data table
## name id age sex
## 1 a 3 47 M
## name id age sex
## 1: a 3 47 M
# Variable assignment with data frame vs. data table:
df$id_over_age <- df$id / df$age # assign with data frame
dt[, id_over_age := id / age] # assign with data table
You can select multiple variables with a list:
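The selection code is not shown in the text. A sketch that yields the table below, assuming the data.table package is installed, passes a list of column names as the second argument. The id and age values are fixed to those printed above rather than drawn with sample().

```r
library("data.table")

dt <- data.table(name = letters[1:3],
                 id = c(3L, 4L, 1L),
                 age = c(47L, 25L, 57L),
                 sex = factor(c("M", "F", "F")))

# Select the columns sex and age:
selected <- dt[, list(sex, age)]
selected
```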
## sex age
## 1: M 47
## 2: F 25
## 3: F 57
Multiple variables can be assigned simultaneously, where the LHS of the := operator is a character vector of new variable names, and the RHS is a list of operations:
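The assignment code is not shown in the text. A sketch consistent with the table below, assuming the data.table package is installed and using the id and age values printed above, is:

```r
library("data.table")

dt <- data.table(name = letters[1:3],
                 id = c(3L, 4L, 1L),
                 age = c(47L, 25L, 57L),
                 sex = factor(c("M", "F", "F")))
dt[, id_over_age := id / age]

# Assign two new variables at once:
dt[, c("id_times_age", "id_plus_age") := list(id * age, id + age)]
dt
```

The := operator modifies dt in place, so no reassignment with <- is needed.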
## name id age sex id_over_age id_times_age id_plus_age
## 1: a 3 47 M 0.06382979 141 50
## 2: b 4 25 F 0.16000000 100 29
## 3: c 1 57 F 0.01754386 57 58
Many operations in data analysis need to be done by group (e.g. calculating average unemployment by year). In such cases, data table introduces a third dimension to perform these operations. Specifically, the data table syntax is DT[i, j, by] with options to
- subset rows using i (which rows?),
- manipulate columns with j (what to do?), and
- group according to by (grouped by what?).
Here is an example:
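The first call of this example is not shown in the text. A sketch that reproduces the table below, assuming the data.table package is installed and using the id and age values printed above, is:

```r
library("data.table")

dt <- data.table(name = letters[1:3],
                 id = c(3L, 4L, 1L),
                 age = c(47L, 25L, 57L),
                 sex = factor(c("M", "F", "F")))

# Average age by sex (i is left empty, j computes the mean, by defines the groups).
# The unnamed result column receives the default name V1.
avg_age <- dt[, mean(age), by = sex]
avg_age
```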
## sex V1
## 1: M 47
## 2: F 41
# Do the same but name the columns "Gender" and "Age by Gender":
dt[, list(`Age by Gender` = mean(age)), by = list(Gender = sex)]
## Gender Age by Gender
## 1: M 47
## 2: F 41
# Assign a new variable with average age by sex named "age_by_sex":
dt[, age_by_sex := mean(age), by = sex]
dt
## name id age sex id_over_age id_times_age id_plus_age
## 1: a 3 47 M 0.06382979 141 50
## 2: b 4 25 F 0.16000000 100 29
## 3: c 1 57 F 0.01754386 57 58
## age_by_sex
## 1: 47
## 2: 41
## 3: 41
For additional information about data tables and their powerful features, check out the Intro to Data Table documentation and this cheat sheet for data.table functions. Furthermore, DataCamp provides several courses on data.table.
4.8 Extensible Time Series (xts)
xts (extensible time series) objects are specialized data structures designed for time series data. These are datasets where each observation corresponds to a specific timestamp. xts objects attach an index to the data, aligning each data point with its associated time. This functionality simplifies data manipulation and minimizes potential errors:
Figure 4.1: Data with Index. Source: DataCamp.
The index attached to an xts object is usually a Date or POSIXct vector, maintaining the data in chronological order from earliest to latest. If you wish to sort data (such as stock prices) by another variable (like trade volume), you’ll first need to convert the xts object back to a data frame, as xts objects preserve the time order. xts objects are built upon zoo objects (Zeileis’ Ordered Observations), another class of time-indexed data structures. xts objects enhance these base structures by providing additional features.
Like tibble and data.table, xts is not included in base R. To use it, you need to install the xts package using install.packages("xts"), then include library("xts") at the start of your script.
To create an xts object, use the xts() function, which associates data with a time index (order.by = time_index):
# Load R package
library("xts")
# Create a new xts object from a matrix:
data <- matrix(1:4, ncol = 2, nrow = 2,
dimnames = list(NULL, c("a", "b")))
data
## a b
## [1,] 1 3
## [2,] 2 4
## [1] "2020-06-01" "2020-07-01"
## a b
## 2020-06-01 1 3
## 2020-07-01 2 4
## [1] "xts" "zoo"
## [1] "Date"
## [1] "2020-06-01" "2020-07-01"
## a b
## [1,] 1 3
## [2,] 2 4
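The calls that produce the output after the matrix are not shown in the text. The sketch below is one way to reproduce that sequence, assuming the xts package is installed. The name time_index follows the order.by argument mentioned above, while dxts is a hypothetical name for the resulting object.

```r
library("xts")

data <- matrix(1:4, ncol = 2, nrow = 2,
               dimnames = list(NULL, c("a", "b")))

# Create a time index with one date per row of the matrix:
time_index <- as.Date(c("2020-06-01", "2020-07-01"))
time_index

# Combine data and time index into an xts object:
dxts <- xts(x = data, order.by = time_index)
dxts
class(dxts)
class(index(dxts))

# Extract the time index and the underlying data separately:
index(dxts)
coredata(dxts)
```

The functions index() and coredata() split the object back into its two components, the time index and the data matrix.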
To delve deeper into xts and zoo objects, consider reading the guides Manipulating Time Series Data in R with xts & zoo and Time Series in R: Quick Reference. Additionally, DataCamp provides in-depth courses on these topics.
If you’re working within the tidyverse environment, the R package tidyquant offers seamless integration with xts and zoo. Lastly, this handy cheat sheet provides a quick reference on xts and zoo functions.