|>
data tidy(...) |>
filter(...) |>
select(...) |>
arrange(...) |>
modify(...) |> # mutate in the code
rename(...) |>
group(...) |>
count(...) |>
summarise(...) |>
plot(...) # actually ggplot
Introduction to tidyverse: manipulation with dplyr package
Practical workbooks of Data Programming in Master in Computational Social Science (2024-2025)
Within {tidyverse}
we will use the {dplyr}
package for the preprocessing process of the data.
The idea is that the code is as readable as possible, as if it were a list of instructions that when read tells us in a very obvious way what it is doing.
All the preprocessing process we are going to perform is on the assumption that our data is in tidydata
Remember that in {tidyverse}
the pipe operator defined as |>
(ctrl+shift+M) will be key: it will be a pipe that traverses the data and transforms it.
Let us practice with the starwars
dataset from the {dplyr}
package.
library(tidyverse)
starwars
1 Actions by rows
1.1 Sampling
One of the most common operations is what is known in statistics as sampling: a selection or filtering of records (rows) (a subsample).
Non-random (by quota): based on logical conditions on the records (
filter()
).Non-random (intentional/discretionary): based on a position (
slice()
).Simple random (
slice_sample()
).Stratified (
group_by()
+slice_sample()
).
1.1.1 Filter rows: filter()
|>
data filtro(condition)
|>
starwars filter(condition)
The simplest action by rows is when filter records based on some logical condition: with filter()
only individuals meeting certain conditions will be selected (non-random sampling by conditions).
==
,!=
: equal or different to (|> filter(variable == "a")
)>
,<
: greater or less than (|> filter(variable < 3)
)>=
,<=
: greater or equal or less or equal than (|> filter(variable >= 5)
)%in%
: values belong to a set of discrete options (|> filter(variable %in% c("blue", "green"))
)between(variable, val1, val2)
: if continuous values are inside of a range (|> filter(between(variable, 160, 180))
)
These logical conditions can be combined in different ways (and, or, or mutually exclusive).
Remember that inside filter()
there must always be something that returns a vector of logical values.
How would you go about… filter the characters with brown eyes? What type of variable is it? –> The eye_color
variable is qualitative so it is represented by texts.
|>
starwars filter(eye_color == "brown")
# A tibble: 21 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Leia Or… 150 49 brown light brown 19 fema… femin…
2 Biggs D… 183 84 black light brown 24 male mascu…
3 Han Solo 180 80 brown fair brown 29 male mascu…
4 Yoda 66 17 white green brown 896 male mascu…
5 Boba Fe… 183 78.2 black fair brown 31.5 male mascu…
6 Lando C… 177 79 black dark brown 31 male mascu…
7 Arvel C… NA NA brown fair brown NA male mascu…
8 Wicket … 88 20 brown brown brown 8 male mascu…
9 Padmé A… 185 45 brown light brown 46 fema… femin…
10 Quarsh … 183 NA black dark brown 62 male mascu…
# ℹ 11 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
How would you go about… filter the characters that do not have brown eyes?
|>
starwars filter(eye_color != "brown")
# A tibble: 66 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Owen La… 178 120 brown, gr… light blue 52 male mascu…
6 Beru Wh… 165 75 brown light blue 47 fema… femin…
7 R5-D4 97 32 <NA> white, red red NA none mascu…
8 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
9 Anakin … 188 84 blond fair blue 41.9 male mascu…
10 Wilhuff… 180 NA auburn, g… fair blue 64 male mascu…
# ℹ 56 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
How would you go about … filter characters that have brown or blue eyes?
|>
starwars filter(eye_color %in% c("blue", "brown"))
# A tibble: 40 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 Leia Or… 150 49 brown light brown 19 fema… femin…
3 Owen La… 178 120 brown, gr… light blue 52 male mascu…
4 Beru Wh… 165 75 brown light blue 47 fema… femin…
5 Biggs D… 183 84 black light brown 24 male mascu…
6 Anakin … 188 84 blond fair blue 41.9 male mascu…
7 Wilhuff… 180 NA auburn, g… fair blue 64 male mascu…
8 Chewbac… 228 112 brown unknown blue 200 male mascu…
9 Han Solo 180 80 brown fair brown 29 male mascu…
10 Jek Ton… 180 110 brown fair blue NA <NA> <NA>
# ℹ 30 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
Note that %in%
is equivalent to concatenating several ==
with a conjunction or (|
)
|>
starwars filter(eye_color == "blue" | eye_color == "brown")
# A tibble: 40 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 Leia Or… 150 49 brown light brown 19 fema… femin…
3 Owen La… 178 120 brown, gr… light blue 52 male mascu…
4 Beru Wh… 165 75 brown light blue 47 fema… femin…
5 Biggs D… 183 84 black light brown 24 male mascu…
6 Anakin … 188 84 blond fair blue 41.9 male mascu…
7 Wilhuff… 180 NA auburn, g… fair blue 64 male mascu…
8 Chewbac… 228 112 brown unknown blue 200 male mascu…
9 Han Solo 180 80 brown fair brown 29 male mascu…
10 Jek Ton… 180 110 brown fair blue NA <NA> <NA>
# ℹ 30 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
How would you go about … filter the characters that are between 120 and 160 cm? What type of variable is it? –> The variable height
is a continuous quantitative variable so we must filter by ranges of values (intervals) –> we will use between()
.
|>
starwars filter(between(height, 120, 160))
# A tibble: 6 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Leia Org… 150 49 brown light brown 19 fema… femin…
2 Mon Moth… 150 NA auburn fair blue 48 fema… femin…
3 Nien Nunb 160 68 none grey black NA male mascu…
4 Watto 137 NA black blue, grey yellow NA male mascu…
5 Gasgano 122 NA none white, bl… black NA male mascu…
6 Cordé 157 NA brown light brown NA <NA> <NA>
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
How would you… filter characters that have eyes and are not human?
|>
starwars filter(eye_color == "brown" & species != "Human")
# A tibble: 3 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Yoda 66 17 white green brown 896 male mascu…
2 Wicket S… 88 20 brown brown brown 8 male mascu…
3 Eeth Koth 171 NA black brown brown NA male mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
How would you… filter characters that have eyes and are not human, or are over 60 years old? Think it through: the parentheses are important: \((a+b)*c\) is not the same as \(a+(b*c)\).
|>
starwars filter((eye_color == "brown" & species != "Human") | birth_year > 60)
# A tibble: 18 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 C-3PO 167 75 <NA> gold yellow 112 none mascu…
2 Wilhuff… 180 NA auburn, g… fair blue 64 male mascu…
3 Chewbac… 228 112 brown unknown blue 200 male mascu…
4 Jabba D… 175 1358 <NA> green-tan… orange 600 herm… mascu…
5 Yoda 66 17 white green brown 896 male mascu…
6 Palpati… 170 75 grey pale yellow 82 male mascu…
7 Wicket … 88 20 brown brown brown 8 male mascu…
8 Qui-Gon… 193 89 brown fair blue 92 male mascu…
9 Finis V… 170 NA blond fair blue 91 male mascu…
10 Quarsh … 183 NA black dark brown 62 male mascu…
11 Shmi Sk… 163 NA black fair brown 72 fema… femin…
12 Mace Wi… 188 84 none dark brown 72 male mascu…
13 Ki-Adi-… 198 82 white pale yellow 92 male mascu…
14 Eeth Ko… 171 NA black brown brown NA male mascu…
15 Cliegg … 183 NA brown fair blue 82 male mascu…
16 Dooku 193 80 white fair brown 102 male mascu…
17 Bail Pr… 191 NA black tan brown 67 male mascu…
18 Jango F… 183 79 black tan brown 66 male mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
1.1.2 Drop missings: drop_na()
|>
data drop_missings(var1, var2, ...)
|>
starwars drop_na(var1, var2, ...)
There is a special filter for one of the most common operations in debugging: remove absent. For this we can use inside a filter is.na()
, which returns TRUE/FALSE
depending on whether it is absent, or …
Use drop_na()
: if we do not specify a variable, it removes records with missing in any variable. Later on we will see how to impute those missing
|>
starwars drop_na(mass, height)
# A tibble: 59 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Leia Or… 150 49 brown light brown 19 fema… femin…
6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
7 Beru Wh… 165 75 brown light blue 47 fema… femin…
8 R5-D4 97 32 <NA> white, red red NA none mascu…
9 Biggs D… 183 84 black light brown 24 male mascu…
10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
# ℹ 49 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
|>
starwars drop_na()
# A tibble: 29 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 Darth V… 202 136 none white yellow 41.9 male mascu…
3 Leia Or… 150 49 brown light brown 19 fema… femin…
4 Owen La… 178 120 brown, gr… light blue 52 male mascu…
5 Beru Wh… 165 75 brown light blue 47 fema… femin…
6 Biggs D… 183 84 black light brown 24 male mascu…
7 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
8 Anakin … 188 84 blond fair blue 41.9 male mascu…
9 Chewbac… 228 112 brown unknown blue 200 male mascu…
10 Han Solo 180 80 brown fair brown 29 male mascu…
# ℹ 19 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
1.1.3 Slices of data: slice()
|> slice(positions) data
|> slice(positions) starwars
Sometimes we may be interested in performing a non-random discretionary sampling, or in other words, filter by position: with slice(positions)
we can select specific rows by passing as argument a index vector.
# 1st row
|> slice(1) starwars
# A tibble: 1 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sky… 172 77 blond fair blue 19 male mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
# from the 7th to the 9th row
|> slice(7:9) starwars
# A tibble: 3 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Beru Whi… 165 75 brown light blue 47 fema… femin…
2 R5-D4 97 32 <NA> white, red red NA none mascu…
3 Biggs Da… 183 84 black light brown 24 male mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
# 2, 7, 10 and 31th rows
|> slice(c(2, 7, 10, 31)) starwars
# A tibble: 4 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 C-3PO 167 75 <NA> gold yellow 112 none mascu…
2 Beru Whi… 165 75 brown light blue 47 fema… femin…
3 Obi-Wan … 182 77 auburn, w… fair blue-gray 57 male mascu…
4 Qui-Gon … 193 89 brown fair blue 92 male mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
We have some default options:
- with
slice_head(n = ...)
andslice_tail(n = ...)
we can get the header and tail of the table
|> slice_head(n = 2) starwars
# A tibble: 2 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sky… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
|> slice_tail(n = 2) starwars
# A tibble: 2 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 BB8 NA NA none none black NA none mascu…
2 Captain … NA NA none none unknown NA fema… femin…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
- with
slice_max()
andslice_min()
we get the rows with smallest/largest value of a variable (if tie, all unlesswith_ties = FALSE
) which we indicate inorder_by = ...
.
|> slice_min(mass, n = 2) starwars
# A tibble: 2 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Ratts Ty… 79 15 none grey, blue unknown NA male mascu…
2 Yoda 66 17 white green brown 896 male mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
|> slice_max(height, n = 2) starwars
# A tibble: 2 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Yarael P… 264 NA none white yellow NA male mascu…
2 Tarfful 234 136 brown brown blue NA male mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
1.1.4 Random sampling: slice_sample()
|>
data slice_aleatorias(positions)
|>
starwars slice_sample(positions)
The so-called simple random sampling is based on selecting individuals randomly, so that each one has certain probabilities of being selected. With slice_sample(n = ...)
we can randomly extract n (a priori equiprobable) records.
|> slice_sample(n = 2) starwars
# A tibble: 2 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 BB8 NA NA none none black NA none mascul…
2 Ric Olié 183 NA brown fair blue NA male mascul…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
“Random” does not imply equiprobable: a normal die is just as random as a trick die. There are no things “more random” than others, they simply have different underlying probability laws.
We can also indicate the proportion of data to sample (instead of the number) and if we want it to be with replacement (that can be repeated).
# 5% of random rows with replacement
|>
starwars slice_sample(prop = 0.05, replace = TRUE)
# A tibble: 4 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Ratts Ty… 79 15 none grey, blue unknown NA male mascu…
2 Boba Fett 183 78.2 black fair brown 31.5 male mascu…
3 Shaak Ti 178 57 none red, blue… black NA fema… femin…
4 Lando Ca… 177 79 black dark brown 31 male mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
As we said, “random” is not the same as “equiprobable”, so we can pass a probability vector. For example, let’s force that it is very improbable to draw a row other than the first two rows
|>
starwars slice_sample(n = 2, weight_by = c(0.495, 0.495, rep(0.01/85, 85)))
# A tibble: 2 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sky… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
1.1.4.1 sample()
The slice_sample()
function is simply a {tidyverse}
integration of the basic R
function known as sample()
that allows us to sample elements
For example, let’s sample 10 rolls of a die, telling it
- support of our random variable (allowed values in
x
) - sample size (
size
) - replacement (if
TRUE
then they can come out repeated, as in the case of the die).
sample(x = 1:6, size = 10, replace = TRUE)
[1] 4 1 5 6 5 3 1 5 1 6
The previous option generates events of a random variable equiprobable but as before, we can assign a vector of probabilities or mass function to it with the argument prob = ...
.
sample(x = 1:6, size = 50, replace = TRUE,
prob = c(0.5, 0.2, 0.1, 0.1, 0.05, 0.05))
[1] 2 3 2 1 1 1 5 4 2 1 5 3 4 2 1 4 1 2 1 1 2 1 1 2 6 3 3 3 2 1 2 4 1 1 1 1 1 1
[39] 1 4 1 3 4 2 2 1 1 5 3 1
How would you make the following statement?
Suppose that seasonal flu episodes have been studied in a city. Let \(X_m\) and \(X_p\) be random variables such that \(X_m=1\) if the mother has flu, \(X_m=0\) if the mother does not have flu, \(X_p=1\) if the father has flu and \(X_p=0\) if the father does not have flu. The theoretical model associated with this type of epidemics indicates that the joint distribution is given by \(P(X_m = 1, X_p=1)=0.02\), \(P(X_m = 1, X_p=0)=0.08\), \(P(X_m = 1, X_p=0)=0. 1\) and \(P(X_m = 0, X_p=0)=0.8\)
Generate a sample of size \(n = 1000\) (support "10"
, "01"
, "00"
and "11"
) by making use of runif()
and by making use of sample()
.
1.2 Sort by rows: arrange()
|> sort(var1, var2, ...) data
|> arrange(var1, var2, ...) starwars
We can also order by rows according to some variable with arrange()
.
|> arrange(mass) starwars
# A tibble: 87 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Ratts T… 79 15 none grey, blue unknown NA male mascu…
2 Yoda 66 17 white green brown 896 male mascu…
3 Wicket … 88 20 brown brown brown 8 male mascu…
4 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
5 R5-D4 97 32 <NA> white, red red NA none mascu…
6 Sebulba 112 40 none grey, red orange NA male mascu…
7 Padmé A… 185 45 brown light brown 46 fema… femin…
8 Dud Bolt 94 45 none blue, grey yellow NA male mascu…
9 Wat Tam… 193 48 none green, gr… unknown NA male mascu…
10 Sly Moo… 178 48 none pale white NA <NA> <NA>
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
By from lowest to highest but we can reverse the order with desc()
.
|> arrange(desc(height)) starwars
# A tibble: 87 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Yarael … 264 NA none white yellow NA male mascu…
2 Tarfful 234 136 brown brown blue NA male mascu…
3 Lama Su 229 88 none grey black NA male mascu…
4 Chewbac… 228 112 brown unknown blue 200 male mascu…
5 Roos Ta… 224 82 none grey orange NA male mascu…
6 Grievous 216 159 none brown, wh… green, y… NA male mascu…
7 Taun We 213 NA none grey black NA fema… femin…
8 Rugor N… 206 NA none green orange NA male mascu…
9 Tion Me… 206 80 none grey black NA male mascu…
10 Darth V… 202 136 none white yellow 41.9 male mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
|> arrange(mass, desc(height)) starwars
# A tibble: 87 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Ratts T… 79 15 none grey, blue unknown NA male mascu…
2 Yoda 66 17 white green brown 896 male mascu…
3 Wicket … 88 20 brown brown brown 8 male mascu…
4 R5-D4 97 32 <NA> white, red red NA none mascu…
5 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
6 Sebulba 112 40 none grey, red orange NA male mascu…
7 Padmé A… 185 45 brown light brown 46 fema… femin…
8 Dud Bolt 94 45 none blue, grey yellow NA male mascu…
9 Wat Tam… 193 48 none green, gr… unknown NA male mascu…
10 Sly Moo… 178 48 none pale white NA <NA> <NA>
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
1.3 Remove duplicates: distinct()
|> no_duplicates(var1, var2, ...) data
|> distinct(var1, var2, ...) starwars
Many times we will need to make sure that there are no duplicates in some variable (DNI) and we can delete duplicate rows with distinct()
.
|> distinct(sex) starwars
# A tibble: 5 × 1
sex
<chr>
1 male
2 none
3 female
4 hermaphroditic
5 <NA>
To keep all the columns of the table we will use .keep_all = TRUE
.
|> distinct(sex, .keep_all = TRUE) starwars
# A tibble: 5 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sky… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 Leia Org… 150 49 brown light brown 19 fema… femin…
4 Jabba De… 175 1358 <NA> green-tan… orange 600 herm… mascu…
5 Jek Tono… 180 110 brown fair blue NA <NA> <NA>
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
1.4 Including rows: bind_rows()
|> include_rows(tibble2) tibble1
|> bind_rows(tibble2) tibble1
Finally, we can bind new rows with bind_rows()
with new observations in table (if columns do not match fill with absent)
<- tibble("name" = c("javi", "laura"), "age" = c(33, 50))
data data
# A tibble: 2 × 2
name age
<chr> <dbl>
1 javi 33
2 laura 50
|> bind_rows(tibble("name" = c("carlos", NA), "cp" = c(28045, 28019))) data
# A tibble: 4 × 3
name age cp
<chr> <dbl> <dbl>
1 javi 33 NA
2 laura 50 NA
3 carlos NA 28045
4 <NA> NA 28019
1.5 💻 It’s your turn
Try to perform the following exercises without looking at the solutions
📝 Select from the starwars set only those characters that are androids or whose species
value is unknown.
Code
|>
starwars filter(species == "Droid" | is.na(species))
📝 Select from the starwars set only the characters whose weight is between 65 and 90 kg.
Code
|> filter(between(mass, 65, 90)) starwars
📝 After clearing absent in all variables, select from the starwars set only the characters that are human and come from Tatooine.
Code
|>
starwars drop_na() |>
filter(species == "Human" & homeworld == "Tatooine")
📝 Select from the original starwars set non-human characters, male
in sex and measuring between 120 and 170 cm, or characters with brown or red eyes.
Code
|>
starwars filter((species != "Human" & sex == "male" &
between(height, 120, 170)) |
%in% c("brown", "red")) eye_color
📝 Look for information in the str_detect()
function help of the {stringr}
package (loaded in {tidyverse}
). Tip: test the functions you are going to use with some test vector beforehand so that you can check how they work. After you know what it does, filter out only those characters with the last name Skywalker
.
Code
|> filter(str_detect(name, "Skywalker")) starwars
📝 Select only the characters that are human and brown-eyed, then sort them in descending height and ascending weight.
Code
|>
starwars filter(eye_color == "brown" & species == "Human") |>
arrange(height, desc(mass))
📝 Randomly extracts 3 records.
Code
|> slice_sample(n = 3) starwars
📝 Extracts 10% of the records randomly.
Code
|> slice_sample(prop = 0.1) starwars
📝R andomly draws 10 characters but in such a way that the probability of each character being drawn is proportional to its weight (heavier, more likely).
Code
|>
starwars drop_na(mass) |>
slice_sample(n = 10, weight_by = mass)
📝 Select the 3 oldest characters.
Code
|> slice_max(birth_year, n = 3) starwars
📝 To find out what unique values are in the hair color, remove duplicates of the hair_color
variable by first removing the missing ones from the hair_color
variable.
Code
|>
starwars drop_na(hair_color) |>
distinct(hair_color)
📝 Of the characters that are human and taller than 160 cm, eliminate duplicates in eye color, eliminate absent in weight, select the 3 tallest, and order from tallest to shortest in weight. Return the table.
Code
|>
starwars filter(species == "Human" & height > 160) |>
distinct(eye_color, .keep_all = TRUE) |>
drop_na(mass) |>
slice_max(height, n = 3) |>
arrange(desc(mass))
2 🐣 Case study I: airquality
Let’s go back to a known dataset: in the {datasets}
package (already installed by default) we had several datasets and one of them was airquality
which we already worked with. The data captures daily measurements (n = 153 observations) of air quality in New York, from May to September 1973.
At that time we worked it from the R base perspective and extracted some variables from it. The objective now will be to work it from the {tidyverse}
perspective looking at the differences from one form to the other.
library(datasets)
airquality
2.1 Question 1
Converts to tibble. Access only the first 5 records. Then access the first, second, fifth and tenth.
Code
library(tidyverse)
<- as_tibble(airquality)
airquality
# from 1 to 5
|>
airquality slice(1:5)
# other way
|>
airquality slice(c(1, 2, 3, 4, 5))
# first, second, fifth and tenth
|>
airquality slice(c(1, 2, 5, 10))
# Notice that the main difference with respect to R base
# is that now we don't need [...] and most importantly:
# the code itself is readable since it says almost verbatim
# what you want to do.
2.2 Question 2
Access only the May temperature records. Then access the elements for May, April and June.
Code
# may
|>
airquality filter(Month == 5)
# april, may, june
|>
airquality filter(Month %in% c(4, 5, 6))
2.3 Question 3
How many records do we have for May and April?
Code
# May's records
|>
airquality filter(Month == 5) |>
nrow()
# April's records
|>
airquality filter(Month == 4) |>
nrow()
2.4 Question 4
With August data only, sort the resulting dataset by temperature (coldest first, warmest second). Then sort it in reverse order
Code
|>
airquality filter(Month == 8) |>
arrange(temp)
|>
airquality filter(Month == 8) |>
arrange(desc(temp))
2.5 Question 5
Remove the absent (
NA
) from all variables (there cannot be any record left that has absent in any of the columns). Do this both in R Base (forbid|>
,filter
, etc) and in tidyverse. In both cases save the result inairquality_sin_NA
.
Code
# tidyverse
<-
airquality_sin_NA |>
airquality drop_na()
# R base
<-
airquality_sin_NA !is.na(airquality$Ozone) & !is.na(airquality$Solar.R) &
airquality[!is.na(airquality$Wind) & !is.na(airquality$Temp) &
!is.na(airquality$Month) & !is.na(airquality$Day), ]
2.6 Question 6
With the data without
NA
, keep only the summer data (June, July, August and September) and, with those data, sort the records from lowest to highest temperature and, in case of a tie, from highest to lowest ozone. Do this both in R Base (forbid|>
,filter
, etc) and in tidyverse.
Code
# tidyverse
|>
airquality_sin_NA filter(Month %in% c(6, 7, 8, 9)) |>
arrange(Temp, desc(Ozone))
# r base
<- airquality_sin_NA[airquality_sin_NA$Month %in% c(6, 7, 8, 9), ]
aux_data order(aux_data$Temp, aux_data$Ozone, decreasing = c(FALSE, TRUE)), ] aux_data[
2.7 Question 7
With the data without
NA
, randomly select 10% of the records so that the records with higher temperature have more weight (more likely to come out). Do this both in R Base (forbid|>
,filter
, etc) and in tidyverse.
Code
# tidyverse
|>
airquality_sin_NA slice_sample(prop = 0.1, weight_by = Temp)
# r base
sample(1:nrow(airquality_sin_NA),
airquality_sin_NA[size = nrow(airquality_sin_NA)*0.1,
prob = airquality_sin_NA$Temp), ]
3 Actions by columns
3.1 Select columns: select()
|> select(var1, var2, ...) data
Up to now all operations performed (even if we used column info) were by rows. In the case of columns, the simplest action is to select variables by name with select()
, giving as arguments the column names without quotes.
|> select(name, hair_color) starwars
# A tibble: 87 × 2
name hair_color
<chr> <chr>
1 Luke Skywalker blond
2 C-3PO <NA>
3 R2-D2 <NA>
4 Darth Vader none
5 Leia Organa brown
6 Owen Lars brown, grey
7 Beru Whitesun Lars brown
8 R5-D4 <NA>
9 Biggs Darklighter black
10 Obi-Wan Kenobi auburn, white
# ℹ 77 more rows
The select()
function allows us to select several variables at once, including concatenating their names as if they were numerical indexes with :
|> select(name:eye_color) starwars
# A tibble: 87 × 6
name height mass hair_color skin_color eye_color
<chr> <int> <dbl> <chr> <chr> <chr>
1 Luke Skywalker 172 77 blond fair blue
2 C-3PO 167 75 <NA> gold yellow
3 R2-D2 96 32 <NA> white, blue red
4 Darth Vader 202 136 none white yellow
5 Leia Organa 150 49 brown light brown
6 Owen Lars 178 120 brown, grey light blue
7 Beru Whitesun Lars 165 75 brown light blue
8 R5-D4 97 32 <NA> white, red red
9 Biggs Darklighter 183 84 black light brown
10 Obi-Wan Kenobi 182 77 auburn, white fair blue-gray
# ℹ 77 more rows
And we can deselect columns with -
in front of it
|> select(-mass, -(eye_color:starships)) starwars
# A tibble: 87 × 4
name height hair_color skin_color
<chr> <int> <chr> <chr>
1 Luke Skywalker 172 blond fair
2 C-3PO 167 <NA> gold
3 R2-D2 96 <NA> white, blue
4 Darth Vader 202 none white
5 Leia Organa 150 brown light
6 Owen Lars 178 brown, grey light
7 Beru Whitesun Lars 165 brown light
8 R5-D4 97 <NA> white, red
9 Biggs Darklighter 183 black light
10 Obi-Wan Kenobi 182 auburn, white fair
# ℹ 77 more rows
We have also reserved words: everything()
all variables….
|> select(mass, homeworld, everything()) starwars
# A tibble: 87 × 14
mass homeworld name height hair_color skin_color eye_color birth_year sex
<dbl> <chr> <chr> <int> <chr> <chr> <chr> <dbl> <chr>
1 77 Tatooine Luke… 172 blond fair blue 19 male
2 75 Tatooine C-3PO 167 <NA> gold yellow 112 none
3 32 Naboo R2-D2 96 <NA> white, bl… red 33 none
4 136 Tatooine Dart… 202 none white yellow 41.9 male
5 49 Alderaan Leia… 150 brown light brown 19 fema…
6 120 Tatooine Owen… 178 brown, gr… light blue 52 male
7 75 Tatooine Beru… 165 brown light blue 47 fema…
8 32 Tatooine R5-D4 97 <NA> white, red red NA none
9 84 Tatooine Bigg… 183 black light brown 24 male
10 77 Stewjon Obi-… 182 auburn, w… fair blue-gray 57 male
# ℹ 77 more rows
# ℹ 5 more variables: gender <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
…and last_col()
to refer to last column.
|> select(name:mass, homeworld, last_col()) starwars
# A tibble: 87 × 5
name height mass homeworld starships
<chr> <int> <dbl> <chr> <list>
1 Luke Skywalker 172 77 Tatooine <chr [2]>
2 C-3PO 167 75 Tatooine <chr [0]>
3 R2-D2 96 32 Naboo <chr [0]>
4 Darth Vader 202 136 Tatooine <chr [1]>
5 Leia Organa 150 49 Alderaan <chr [0]>
6 Owen Lars 178 120 Tatooine <chr [0]>
7 Beru Whitesun Lars 165 75 Tatooine <chr [0]>
8 R5-D4 97 32 Tatooine <chr [0]>
9 Biggs Darklighter 183 84 Tatooine <chr [1]>
10 Obi-Wan Kenobi 182 77 Stewjon <chr [5]>
# ℹ 77 more rows
We can also play with patterns in the name, those that begin with a prefix (starts_with()
), [end with a suffix]{. hl-purple} (ends_with()
), contain text (contains()
) or fulfill a regular expression (matches()
).
# variables which col name finish as "color" and contains sex and gender
|> select(ends_with("color"), matches("sex|gender")) starwars
# A tibble: 87 × 5
hair_color skin_color eye_color sex gender
<chr> <chr> <chr> <chr> <chr>
1 blond fair blue male masculine
2 <NA> gold yellow none masculine
3 <NA> white, blue red none masculine
4 none white yellow male masculine
5 brown light brown female feminine
6 brown, grey light blue male masculine
7 brown light blue female feminine
8 <NA> white, red red none masculine
9 black light brown male masculine
10 auburn, white fair blue-gray male masculine
# ℹ 77 more rows
We can even select by numeric range if we have variables with a prefix and numbers.
<-
data tibble("wk1" = c(115, 141, 232), "wk2" = c(7, NA, 17),
"wk3" = c(95, 162, NA), "wk4" = c(11, 19, 15),
"wk5" = c(NA, 262, 190), "wk6" = c(21, 15, 23))
With num_range()
we can select with a prefix and a numeric sequence.
|> select(num_range("wk", 1:4)) data
# A tibble: 3 × 4
wk1 wk2 wk3 wk4
<dbl> <dbl> <dbl> <dbl>
1 115 7 95 11
2 141 NA 162 19
3 232 17 NA 15
Finally, we can select columns by datatatype using where()
and inside a function that returns a logical value of datatype.
# just numeric and string columns
|> select(where(is.numeric) | where(is.character)) starwars
# A tibble: 87 × 11
height mass birth_year name hair_color skin_color eye_color sex gender
<int> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 172 77 19 Luke Sk… blond fair blue male mascu…
2 167 75 112 C-3PO <NA> gold yellow none mascu…
3 96 32 33 R2-D2 <NA> white, bl… red none mascu…
4 202 136 41.9 Darth V… none white yellow male mascu…
5 150 49 19 Leia Or… brown light brown fema… femin…
6 178 120 52 Owen La… brown, gr… light blue male mascu…
7 165 75 47 Beru Wh… brown light blue fema… femin…
8 97 32 NA R5-D4 <NA> white, red red none mascu…
9 183 84 24 Biggs D… black light brown male mascu…
10 182 77 57 Obi-Wan… auburn, w… fair blue-gray male mascu…
# ℹ 77 more rows
# ℹ 2 more variables: homeworld <chr>, species <chr>
3.2 Move columns: relocate()
|>
data move(var1, after = var2)
|>
starwars relocate(var1, .after = var2)
To facilitate the relocation of variables we have a function for it, relocate()
, indicating in .after
or .before
behind or in front of which columns we want to move them.
|> relocate(species, .before = name) starwars
# A tibble: 87 × 14
species name height mass hair_color skin_color eye_color birth_year sex
<chr> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 Human Luke S… 172 77 blond fair blue 19 male
2 Droid C-3PO 167 75 <NA> gold yellow 112 none
3 Droid R2-D2 96 32 <NA> white, bl… red 33 none
4 Human Darth … 202 136 none white yellow 41.9 male
5 Human Leia O… 150 49 brown light brown 19 fema…
6 Human Owen L… 178 120 brown, gr… light blue 52 male
7 Human Beru W… 165 75 brown light blue 47 fema…
8 Droid R5-D4 97 32 <NA> white, red red NA none
9 Human Biggs … 183 84 black light brown 24 male
10 Human Obi-Wa… 182 77 auburn, w… fair blue-gray 57 male
# ℹ 77 more rows
# ℹ 5 more variables: gender <chr>, homeworld <chr>, films <list>,
# vehicles <list>, starships <list>
3.3 Rename: rename()
|> rename(new = old) data
|> rename(new = old) starwars
Sometimes we may also want to modify the “meta-information” of the data, renaming columns. To do this we will use rename()
by typing first the new name and then the old.
|> rename(nombre = name, altura = height, peso = mass) starwars
# A tibble: 87 × 14
nombre altura peso hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Leia Or… 150 49 brown light brown 19 fema… femin…
6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
7 Beru Wh… 165 75 brown light blue 47 fema… femin…
8 R5-D4 97 32 <NA> white, red red NA none mascu…
9 Biggs D… 183 84 black light brown 24 male mascu…
10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
3.4 Extract columns: pull()
|> extract(var) data
|> pull(var) starwars
If you look at the output of the select()
still a tibble table, it preserves the nature of our data.
|> select(name) starwars
# A tibble: 87 × 1
name
<chr>
1 Luke Skywalker
2 C-3PO
3 R2-D2
4 Darth Vader
5 Leia Organa
6 Owen Lars
7 Beru Whitesun Lars
8 R5-D4
9 Biggs Darklighter
10 Obi-Wan Kenobi
# ℹ 77 more rows
Sometimes we will not want such a structure but literally extract the column in a VECTOR, something we can do with pull()
.
|> pull(name) starwars
[1] "Luke Skywalker" "C-3PO" "R2-D2"
[4] "Darth Vader" "Leia Organa" "Owen Lars"
[7] "Beru Whitesun Lars" "R5-D4" "Biggs Darklighter"
[10] "Obi-Wan Kenobi" "Anakin Skywalker" "Wilhuff Tarkin"
[13] "Chewbacca" "Han Solo" "Greedo"
[16] "Jabba Desilijic Tiure" "Wedge Antilles" "Jek Tono Porkins"
[19] "Yoda" "Palpatine" "Boba Fett"
[22] "IG-88" "Bossk" "Lando Calrissian"
[25] "Lobot" "Ackbar" "Mon Mothma"
[28] "Arvel Crynyd" "Wicket Systri Warrick" "Nien Nunb"
[31] "Qui-Gon Jinn" "Nute Gunray" "Finis Valorum"
[34] "Padmé Amidala" "Jar Jar Binks" "Roos Tarpals"
[37] "Rugor Nass" "Ric Olié" "Watto"
[40] "Sebulba" "Quarsh Panaka" "Shmi Skywalker"
[43] "Darth Maul" "Bib Fortuna" "Ayla Secura"
[46] "Ratts Tyerel" "Dud Bolt" "Gasgano"
[49] "Ben Quadinaros" "Mace Windu" "Ki-Adi-Mundi"
[52] "Kit Fisto" "Eeth Koth" "Adi Gallia"
[55] "Saesee Tiin" "Yarael Poof" "Plo Koon"
[58] "Mas Amedda" "Gregar Typho" "Cordé"
[61] "Cliegg Lars" "Poggle the Lesser" "Luminara Unduli"
[64] "Barriss Offee" "Dormé" "Dooku"
[67] "Bail Prestor Organa" "Jango Fett" "Zam Wesell"
[70] "Dexter Jettster" "Lama Su" "Taun We"
[73] "Jocasta Nu" "R4-P17" "Wat Tambor"
[76] "San Hill" "Shaak Ti" "Grievous"
[79] "Tarfful" "Raymus Antilles" "Sly Moore"
[82] "Tion Medon" "Finn" "Rey"
[85] "Poe Dameron" "BB8" "Captain Phasma"
3.5 💻 It’s your turn
Try to perform the following exercises without looking at the solutions
📝 Filter the set of characters and keep only those that do not have a missing data in the height
variable. With the data obtained from the previous filter, select only the variables name, height, as well as all those variables that CONTAIN the word color in their name.
Code
<-
starwars_2 |>
starwars drop_na(height) |>
select(name, height, contains("color"))
📝 With the data obtained from the previous Exercise, translate the names of the columns into Spanish (or your motherlanguage).
Code
|>
starwars_2 rename(nombre = name, altura = height, color_pelo = hair_color,
color_piel = skin_color, color_ojos = eye_color)
📝 With the data obtained from Exercise 1, place the hair color variable just after the name variable.
Con los data obtenidos del Exercise anterior, coloca la variable de color de pelo justo detrás de la variable de nombres.
Code
|>
starwars_2 relocate(hair_color, .after = name)
📝 With the data obtained from the Exercise 1, check how many unique modalities there are in the hair color variable (without using unique()
).
Code
|>
starwars_2 distinct(hair_color)
📝 From the original data set, it removes the list type columns, and then removes duplicates in the eye_color
variable. After removing duplicates it extracts that column into a vector.
Code
|>
starwars select(-where(is.list)) |>
distinct(eye_color, .keep_all = TRUE) |>
pull(eye_color)
📝 From the original starwars dataset, with only the characters whose height is known, extract in a vector with that variable.
Code
|>
starwars drop_na(height) |>
pull(height)
📝 After obtaining the vector from the previous Exercise, use this vector to randomly sample 50% of the data so that the probability of each character being chosen is inversely proportional to their height (shorter, more options).
Code
<-
heights |>
starwars drop_na(height) |>
pull(height)
|>
starwars slice_sample(prop = 0.5, weight_by = 1/heights)
3.6 Modify columns: mutate()
|> modify(new_var = funcion()) data
|> mutate(new_var = function()) starwars
In many occasions we will want to modify or create variables with mutate()
. Let’s create for example a new variable height_m
with the height in meters.
|> mutate(height_m = height / 100) starwars
# A tibble: 87 × 15
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Leia Or… 150 49 brown light brown 19 fema… femin…
6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
7 Beru Wh… 165 75 brown light blue 47 fema… femin…
8 R5-D4 97 32 <NA> white, red red NA none mascu…
9 Biggs D… 183 84 black light brown 24 male mascu…
10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
# ℹ 77 more rows
# ℹ 6 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>, height_m <dbl>
In addition with the optional arguments we can reposition the modified column
|>
starwars mutate(height_m = height / 100,
BMI = mass / (height_m^2), .before = name)
# A tibble: 87 × 16
height_m BMI name height mass hair_color skin_color eye_color birth_year
<dbl> <dbl> <chr> <int> <dbl> <chr> <chr> <chr> <dbl>
1 1.72 26.0 Luke … 172 77 blond fair blue 19
2 1.67 26.9 C-3PO 167 75 <NA> gold yellow 112
3 0.96 34.7 R2-D2 96 32 <NA> white, bl… red 33
4 2.02 33.3 Darth… 202 136 none white yellow 41.9
5 1.5 21.8 Leia … 150 49 brown light brown 19
6 1.78 37.9 Owen … 178 120 brown, gr… light blue 52
7 1.65 27.5 Beru … 165 75 brown light blue 47
8 0.97 34.0 R5-D4 97 32 <NA> white, red red NA
9 1.83 25.1 Biggs… 183 84 black light brown 24
10 1.82 23.2 Obi-W… 182 77 auburn, w… fair blue-gray 57
# ℹ 77 more rows
# ℹ 7 more variables: sex <chr>, gender <chr>, homeworld <chr>, species <chr>,
# films <list>, vehicles <list>, starships <list>
When we apply mutate()
, we must remember that the operations are performed vector by vector, element by element, so the function we use inside must return a vector of equal length. Otherwise, it will return a constant.
|>
starwars mutate(constante = mean(mass, na.rm = TRUE), .before = name)
# A tibble: 87 × 15
constante name height mass hair_color skin_color eye_color birth_year sex
<dbl> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 97.3 Luke… 172 77 blond fair blue 19 male
2 97.3 C-3PO 167 75 <NA> gold yellow 112 none
3 97.3 R2-D2 96 32 <NA> white, bl… red 33 none
4 97.3 Dart… 202 136 none white yellow 41.9 male
5 97.3 Leia… 150 49 brown light brown 19 fema…
6 97.3 Owen… 178 120 brown, gr… light blue 52 male
7 97.3 Beru… 165 75 brown light blue 47 fema…
8 97.3 R5-D4 97 32 <NA> white, red red NA none
9 97.3 Bigg… 183 84 black light brown 24 male
10 97.3 Obi-… 182 77 auburn, w… fair blue-gray 57 male
# ℹ 77 more rows
# ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
# films <list>, vehicles <list>, starships <list>
3.7 Recategorize
3.7.1 if_else()
We can also combine mutate()
with the if_else()
control expression to recategorize the variable: if a condition is met, it does one thing, otherwise another.
|>
starwars mutate(human = if_else(species == "Human", "Human", "Not Human"),
.after = name) |>
select(name:mass)
# A tibble: 87 × 4
name human height mass
<chr> <chr> <int> <dbl>
1 Luke Skywalker Human 172 77
2 C-3PO Not Human 167 75
3 R2-D2 Not Human 96 32
4 Darth Vader Human 202 136
5 Leia Organa Human 150 49
6 Owen Lars Human 178 120
7 Beru Whitesun Lars Human 165 75
8 R5-D4 Not Human 97 32
9 Biggs Darklighter Human 183 84
10 Obi-Wan Kenobi Human 182 77
# ℹ 77 more rows
3.7.2 case_when()
For more complex categorizations we have case_when()
, for example, to create a category of characters based on their height.
|>
starwars drop_na(height) |>
mutate(altura = case_when(height < 120 ~ "dwarf",
< 160 ~ "short",
height < 180 ~ "normal",
height < 200 ~ "tall",
height TRUE ~ "giant"), .before = name)
# A tibble: 81 × 15
altura name height mass hair_color skin_color eye_color birth_year sex
<chr> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 normal Luke Sk… 172 77 blond fair blue 19 male
2 normal C-3PO 167 75 <NA> gold yellow 112 none
3 dwarf R2-D2 96 32 <NA> white, bl… red 33 none
4 giant Darth V… 202 136 none white yellow 41.9 male
5 short Leia Or… 150 49 brown light brown 19 fema…
6 normal Owen La… 178 120 brown, gr… light blue 52 male
7 normal Beru Wh… 165 75 brown light blue 47 fema…
8 dwarf R5-D4 97 32 <NA> white, red red NA none
9 tall Biggs D… 183 84 black light brown 24 male
10 tall Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male
# ℹ 71 more rows
# ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
# films <list>, vehicles <list>, starships <list>
3.8 nest data
We can also nest or embed datasets inside each other. Imagine that we have a dataset of x
and y
variables with 2 records, another one with the same variables but only one record and another one with 3 records.
<- tibble("x" = c(0, 2), "y" = c(-1, NA))
data_1 <- tibble("x" = c(NA), "y" = c(5))
data_2 <- tibble("x" = c(-2, 6, 7), "y" = c(1.5, NA, -2)) data_3
So far the only way we know how to bind the 3 datasets is by using bind_rows()
(by the way, if you use the argument .id = “variable_name”
we can make it add a new variable that tells us to which dataset each row belonged.
<- bind_rows(data_1, data_2, data_3, .id = "dataset")
data data
# A tibble: 6 × 3
dataset x y
<chr> <dbl> <dbl>
1 1 0 -1
2 1 2 NA
3 2 NA 5
4 3 -2 1.5
5 3 6 NA
6 3 7 -2
However, in many occasions we will want to have all 3 in the same object BUT each dataset on its own: an object (a list) that stores the 3 datasets separated from each other. To do this we will use the nest()
function indicating which common variables form the datasets (in this case x
and y
).
<-
data_nest |>
data nest(data = c(x, y))
data_nest
# A tibble: 3 × 2
dataset data
<chr> <list>
1 1 <tibble [2 × 2]>
2 2 <tibble [1 × 2]>
3 3 <tibble [3 × 2]>
Note that now data_nest
is a list as each stored dataset could have different lengths. To unnest we can use unnest() indicating the column containing the datasets
|> unnest(cols = c(data)) data_nest
# A tibble: 6 × 3
dataset x y
<chr> <dbl> <dbl>
1 1 0 -1
2 1 2 NA
3 2 NA 5
4 3 -2 1.5
5 3 6 NA
6 3 7 -2
3.9 💻 It’s your turn
Try to perform the following exercises without looking at the solutions
📝 Select only the variables name, height and as well as all those variables related to the color, while keeping only those that are not absent in the height.
Code
|>
starwars select(name, height, contains("color")) |>
drop_na(height)
📝 With the data obtained from the previous Exercise, translate the names of the columns into Spanish or your mother language.
Code
|>
starwars select(name, height, contains("color")) |>
drop_na(height) |>
rename(nombre = name, altura = height,
color_pelo = eye_color, color_piel = skin_color,
color_pelo = hair_color)
📝 With the data obtained from the previous Exercise, place the hair color variable just after the name variable.
Code
|>
starwars select(name, height, contains("color")) |>
drop_na(height) |>
rename(nombre = name, altura = height,
color_pelo = eye_color, color_piel = skin_color,
color_pelo = hair_color) |>
relocate(color_pelo, .after = nombre)
📝 With the original data, check how many unique modalities there are in the hair color variable.
Code
|>
starwars distinct(hair_color) |>
nrow()
📝 From the original dataset, select only the numeric and text variables. Then define a new variable called under_18
to recategorize the age variable: TRUE
if under age and FALSE
if not.
Code
|>
starwars select(where(is.numeric) | where(is.character)) |>
mutate(under_18 = birth_year < 18)
📝 From the original dataset, create a new column named auburn
that tells us TRUE if the hair color contains that word and FALSE otherwise (reminder str_detect()
).
Code
|>
starwars mutate(auburn = str_detect(hair_color, "auburn"))
📝 From the original dataset, include a column that calculates BMI. After that, create a new variable that values NA
if not human, underweight
below 18, normal
between 18 and 30, overweight
above 30.
Code
|>
starwars mutate(IMC = mass / ((height/100)^2),
IMC_recat = case_when(species != "Human" ~ NA,
< 18 ~ "underweight",
IMC < 30 ~ "normal",
IMC TRUE ~ "overweight"),
.after = name)
4 🐣 Case study I: CIS survey (feminism)
pending to submit
5 🐣 Caso study II: Taylor Swift
Let’s go back to the analysis of Taylor Swift songs we did in the previous installment.
library(taylor)
taylor_album_songs
# A tibble: 240 × 29
album_name ep album_release track_number track_name artist featuring
<chr> <lgl> <date> <int> <chr> <chr> <chr>
1 Taylor Swift FALSE 2006-10-24 1 Tim McGraw Taylo… <NA>
2 Taylor Swift FALSE 2006-10-24 2 Picture To Bu… Taylo… <NA>
3 Taylor Swift FALSE 2006-10-24 3 Teardrops On … Taylo… <NA>
4 Taylor Swift FALSE 2006-10-24 4 A Place In Th… Taylo… <NA>
5 Taylor Swift FALSE 2006-10-24 5 Cold As You Taylo… <NA>
6 Taylor Swift FALSE 2006-10-24 6 The Outside Taylo… <NA>
7 Taylor Swift FALSE 2006-10-24 7 Tied Together… Taylo… <NA>
8 Taylor Swift FALSE 2006-10-24 8 Stay Beautiful Taylo… <NA>
9 Taylor Swift FALSE 2006-10-24 9 Should've Sai… Taylo… <NA>
10 Taylor Swift FALSE 2006-10-24 10 Mary's Song (… Taylo… <NA>
# ℹ 230 more rows
# ℹ 22 more variables: bonus_track <lgl>, promotional_release <date>,
# single_release <date>, track_release <date>, danceability <dbl>,
# energy <dbl>, key <int>, loudness <dbl>, mode <int>, speechiness <dbl>,
# acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
# tempo <dbl>, time_signature <int>, duration_ms <int>, explicit <lgl>,
# key_name <chr>, mode_name <chr>, key_mode <chr>, lyrics <list>
The difference is that now we will try to do everything from a tidyverse view instead of R base (it is interesting that you try to translate from one to the other to know how to master both views)
5.1 Question 1
How many songs are stored? How many features are stored for each song?
Code
|> nrow()
taylor_album_songs |> ncol() taylor_album_songs
Note that now we separate data vs action: data + pipe + action.
5.2 Question 2
Get the name of the (unique) albums contained in the dataset. How many are there (do not count them “by hand”)?
Code
|> distinct(album_name)
taylor_album_songs |>
taylor_album_songs distinct(album_name) |>
nrow()
5.3 Question 3
In how many songs is there a collaboration with another artist (if there is a collaboration, its name is stored in
featuring
)?
Code
# number of unique collaborations
|>
taylor_album_songs drop_na(featuring) |>
nrow()
# number of unique collaborators
|>
taylor_album_songs drop_na(featuring) |>
summarise(n_collabs = n_distinct(featuring))
5.4 Question 4
Create a new
tibble
with only the variablesalbum_name
,album_release
,track_name
,featuring
andduration_ms
. After that it sorts by date from newest to oldest.
Code
# Write the code you consider
<-
nuevo_tb |>
taylor_album_songs select(album_name, album_release, track_name, featuring, duration_ms)
|>
nuevo_tb arrange(desc(album_release))
5.5 Question 5
Add to the previous dataset two new variables with the month and year of release (use the
album_release
variable). Think about how you could determine in which month it has released more albums
Code
# Write the code you consider
library(lubridate)
<-
nuevo_tb |>
nuevo_tb mutate(month = month(album_release),
year = year(album_release))
|>
nuevo_tb count(month, sort = TRUE)
5.6 Question 6
Get the average duration of the songs in minutes (variable
duration_ms
in milliseconds).
Code
|>
nuevo_tb drop_na(duration_ms) |>
summarise(avg_dur = mean(duration_ms/60000))
6 🐣 Caso study III: The Lord of the Rings
To practice some {dplyr}
functions we are going to use data from the Lord of the Rings trilogy movies. We will load the data directly from the web (Github in this case), without going through the computer before, simply indicating as path the web where the file is
The Fellowship of the Ring -> https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Fellowship_Of_The_Ring.csv
The 2 Towers -> https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Two_Towers.csv
The Return of the King -> https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Return_Of_The_King.csv.
Code
library(readr)
<-
lotr_1 read_csv(file = "https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Fellowship_Of_The_Ring.csv")
Rows: 3 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Film, Race
dbl (2): Female, Male
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
<-
lotr_2 read_csv(file = "https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Two_Towers.csv")
Rows: 3 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Film, Race
dbl (2): Female, Male
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
<-
lotr_3 read_csv(file = "https://raw.githubusercontent.com/jennybc/lotr-tidy/master/data/The_Return_Of_The_King.csv")
Rows: 3 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Film, Race
dbl (2): Female, Male
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
6.1 Question 1
Analyze each file. How many rows do we have? How many columns?
Code
nrow(lotr_1)
ncol(lotr_1)
nrow(lotr_2)
ncol(lotr_2)
nrow(lotr_3)
ncol(lotr_3)
6.2 Question 2
Bind the 3 tibble into one. After joining them together use the
clean_names()
function of the{janitor}
package (it is useful to standardize column names always the same).
Code
<-
lotr bind_rows(lotr_1, lotr_2, lotr_3) |>
::clean_names()
janitor lotr
6.3 Question 3
The numbers stored in
female
andmale
actually represent the amount of words that, in each saga and in each race, each of the sexes says (the amount of dialogue of each type of character). Is it tidy data? If not, convert to that format.
Code
# is not since female, male should be data, not variable name
# right now we have the same variable (count_word, for example)
# separated into two columns (one for gender). To convert it
# to tidy data just pivot to make the table longer
# by creating a variable that carries the gender information.
<-
lotr_tidy |>
lotr pivot_longer(cols = c("female", "male"), names_to = "gender",
values_to = "word_count")
lotr_tidy
6.4 Question 4
Save the unified dataset in tidy format in a
.csv
file.
Code
write_csv(lotr_tidy, file = "./datos/lotr_tidy.csv")
6.5 Question 5
Change the movie titles to lowercase and replace the
film
variable with the new value
Code
<-
lotr_tidy |>
lotr_tidy mutate(film = str_to_lower(film))
lotr_tidy
6.6 Question 6
Filter only the data for male Hobbits and sort by number of words (from most to least). In which of the sagas did they talk the most?
Code
|>
lotr_tidy filter(race == "Hobbit" & gender == "male") |>
arrange(desc(word_count))
6.7 Question 7
Separate each of the datasets so that there is only one object called
lotr_nest
with the nested datasets inside (a tibble with two columns, one identifying the dataset, i.e. the movie, and the other with the nested dataset).
Code
<-
lotr_nest |>
lotr_tidy nest(data = c(race, gender, word_count))
lotr_nest
7 Summaries
7.1 count()
|> count(var1, var2) data
|> count(var1, var2) starwars
So far we have only transformed or queried the data but we have not generated statistics. Let’s start with the simple: how to count (frequencies)? When used alone count()
will simply return the number of records, but when used with count()
variables it calculates what is known as frequencies: number of elements of each modality.
|> count(sex) starwars
# A tibble: 5 × 2
sex n
<chr> <int>
1 female 16
2 hermaphroditic 1
3 male 60
4 none 6
5 <NA> 4
Also if we pass several variables it calculates what is known as a contiguity table. With sort = TRUE
it will return the ordered count (most frequent first).
|> count(sex, gender, sort = TRUE) starwars
# A tibble: 6 × 3
sex gender n
<chr> <chr> <int>
1 male masculine 60
2 female feminine 16
3 none masculine 5
4 <NA> <NA> 4
5 hermaphroditic masculine 1
6 none feminine 1
7.2 group_by()
|>
data group(var1, var2) |>
some_action() |>
ungroup()
|>
starwars group_by(var1, var2) |>
some_action() |>
ungroup()
One of the most powerful functions to combine with the actions seen is group_by()
, which will allow us to group our records beforehand.
|>
starwars group_by(sex) |>
count() |>
ungroup()
# A tibble: 5 × 2
sex n
<chr> <int>
1 female 16
2 hermaphroditic 1
3 male 60
4 none 6
5 <NA> 4
When applying group_by()
it is important to understand that it DOES NOT MODIFY the data, but creates a group variable (sub-tables for each group) that will modify future actions: the operations will be applied to each sub-table separately
For example, imagine that we want to extract the highest character with slice_max()
.
|> slice_max(height) starwars
# A tibble: 1 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Yarael P… 264 NA none white yellow NA male mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
What if we want to extract the tallest character but…of each of the sexes?
|>
starwars group_by(sex) |>
slice_max(height) |>
ungroup()
# A tibble: 5 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Taun We 213 NA none grey black NA fema… femin…
2 Jabba De… 175 1358 <NA> green-tan… orange 600 herm… mascu…
3 Yarael P… 264 NA none white yellow NA male mascu…
4 IG-88 200 140 none metal red 15 none mascu…
5 Gregar T… 185 85 black dark brown NA <NA> <NA>
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
The web https://tidydatatutor.com/ allows to visualize the operations of {tidyverse}
(doing with the old pipe %>%
)
You should always remember to make ungroup to remove the created group variable.
The “new” version of {dplyr}
now allows to include the group variable in the call to many functions with the argument by = ...
or .by = ...
.
|> slice_max(height, by = sex) starwars
# A tibble: 5 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Yarael P… 264 NA none white yellow NA male mascu…
2 IG-88 200 140 none metal red 15 none mascu…
3 Taun We 213 NA none grey black NA fema… femin…
4 Jabba De… 175 1358 <NA> green-tan… orange 600 herm… mascu…
5 Gregar T… 185 85 black dark brown NA <NA> <NA>
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
7.3 Row-by-row: rowwise()
A very useful option used before an operation is also rowwise()
: every operation that comes afterwards will be applied on each row separately. For example, let’s define a dummy set of grades.
<- tibble("maths" = c(7.5, 8, 9.1, 3),
grades "language" = c(8, 6, 6.5, 9.2))
If we apply the average directly the value will be identical since it has done the global average, but we would like to get an average per record. For that we will use rowwise()
.
|>
grades rowwise() |>
mutate(ave_grades = mean(c(maths, language)))
# A tibble: 4 × 3
# Rowwise:
maths language ave_grades
<dbl> <dbl> <dbl>
1 7.5 8 7.75
2 8 6 7
3 9.1 6.5 7.8
4 3 9.2 6.1
7.4 summarise()
|> simple_summary() data
|> summarise() starwars
Finally we have summarise()
, which will allow us to get statistical summaries. For example, let’s calculate the average of the heights.
|>
starwars drop_na(height) |>
summarise(ave_height = mean(height))
# A tibble: 1 × 1
ave_height
<dbl>
1 175.
Notice that mutate()
returns as many rows as original records, while with summarise()
it calculates a new summary dataset, only including what is indicated.
If we also combine this with the grouping of group_by()
or .by = ...
, in a few lines of code you can get disaggregated statistics.
|>
starwars drop_na(sex, height, mass) |>
summarise(ave_height = mean(height),
ave_mass = mean(mass),
.by = sex)
# A tibble: 4 × 3
sex ave_height ave_mass
<chr> <dbl> <dbl>
1 male 178. 80.2
2 none 140 69.8
3 female 172. 54.7
4 hermaphroditic 175 1358
7.5 reframe()
|> complex_summary() data
|> reframe() starwars
In the new {dplyr}
they have included reframe()
to avoid summarise()
problems when we return more than one value per variable in more complex summaries.
|>
starwars drop_na(mass) |>
summarise(quantile(mass))
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
always returns an ungrouped data frame and adjust accordingly.
# A tibble: 5 × 1
`quantile(mass)`
<dbl>
1 15
2 55.6
3 79
4 84.5
5 1358
|>
starwars drop_na(mass) |>
reframe(quantile(mass))
# A tibble: 5 × 1
`quantile(mass)`
<dbl>
1 15
2 55.6
3 79
4 84.5
5 1358
7.5.1 across()
One trick is to make use of selectors across()
and where()
. The former allows us to act on several columns by name (with mutate()
or summarise()
).
|> summarise(ave = across(height:mass, mean, na.rm = TRUE), .by = sex) starwars
Warning: There was 1 warning in `summarise()`.
ℹ In argument: `ave = across(height:mass, mean, na.rm = TRUE)`.
ℹ In group 1: `sex = "male"`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.
# Previously
across(a:b, mean, na.rm = TRUE)
# Now
across(a:b, \(x) mean(x, na.rm = TRUE))
# A tibble: 5 × 2
sex ave$height $mass
<chr> <dbl> <dbl>
1 male 179. 80.2
2 none 131. 69.8
3 female 172. 54.7
4 hermaphroditic 175 1358
5 <NA> 175 81
The second, where()
, allows us to do the same but selecting by type.
|>
starwars summarise(across(where(is.numeric), mean, na.rm = TRUE), .by = c(sex, gender))
# A tibble: 6 × 5
sex gender height mass birth_year
<chr> <chr> <dbl> <dbl> <dbl>
1 male masculine 179. 80.2 84.8
2 none masculine 140 69.8 53.3
3 female feminine 172. 54.7 47.2
4 hermaphroditic masculine 175 1358 600
5 <NA> <NA> 175 81 NaN
6 none feminine 96 NaN NaN
7.6 💻 It’s your turn
Try to perform the following exercises without looking at the solutions
📝 Calculate how many characters there are of each species, ordered from most to least frequent.
Code
|> count(species, sort = TRUE) starwars
📝 After eliminating missing variables for weight and height, add a new variable to calculate the BMI of each character, and determine the average BMI of our characters disaggregated by gender.
Code
|>
starwars drop_na(mass, height) |>
mutate(BMI = mass / ((height/100)^2)) |>
summarise(ave_BMI = mean(BMI), .by = sex)
📝 Obtain the youngest character for each gender.
Code
|> # reminder that birth_year is in fact the age
starwars slice_min(birth_year, by = sex)
📝 Get the age of the youngest and oldest character of each sex.
Code
|>
starwars drop_na(birth_year) |>
summarise(min(birth_year), max(birth_year), .by = sex)
📝 Determine the number of characters in each decade (take a look at round()
, first without disaggregating and then disaggregated by sex.
Code
|>
starwars count(birth_decade = round(birth_year, -1))
8 🐣 Case study I: billboard
We are going to do a summary of what we learned in {tidyverse}
with the billboard table of the {tidyr}
package. The dataset represents something similar to the top 40 (but American version and a top 100 instead of 40): for each artist and song we store the date when it entered the ranking, and the position it occupied in the ranking in each of the weeks (wk1
, wk2
, …).
billboard
# A tibble: 317 × 8
artist track date.entered wk1 wk2 wk3 wk4 wk5
<chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 Pac Baby Don't Cry (Ke… 2000-02-26 87 82 72 77 87
2 2Ge+her The Hardest Part O… 2000-09-02 91 87 92 NA NA
3 3 Doors Down Kryptonite 2000-04-08 81 70 68 67 66
4 3 Doors Down Loser 2000-10-21 76 76 72 69 67
5 504 Boyz Wobble Wobble 2000-04-15 57 34 25 17 17
6 98^0 Give Me Just One N… 2000-08-19 51 39 34 26 26
7 A*Teens Dancing Queen 2000-07-08 97 97 96 95 100
8 Aaliyah I Don't Wanna 2000-01-29 84 62 51 41 38
9 Aaliyah Try Again 2000-03-18 59 53 38 28 21
10 Adams, Yolanda Open My Heart 2000-08-26 76 76 74 69 68
# ℹ 307 more rows
8.1 Question 1
Select only the first 52 weeks.
Code
<-
billboard_year |>
billboard select(artist:date.entered, num_range("wk", 1:52))
8.2 Question 2
Then convert the dataset to tidydata with the appropriate formats and types for each variable (e.g., the resulting week variable should be a number; look for options in
pivot_longer()
) with no missing data.
Code
<-
billboard_tidy |>
billboard_year pivot_longer(cols = "wk1":"wk52", names_to = "week",
values_to = "rank", names_prefix = "wk",
values_drop_na = TRUE) |>
mutate(week = as.numeric(week))
8.3 Question 3
Extract the (different) artists appearing in the table, including how many times each one appears.
Code
|>
billboard_tidy count(artist, sort = TRUE)
8.4 Question 4
Determines how many different songs each artist has in the ranking (a song that appears many weeks only counts as one song).
Code
|>
billboard_tidy distinct(artist, track) |>
count(artist, sort = TRUE)
8.5 Question 5
Determine the 5 songs that appear the most weeks in the ranking.
Code
|>
billboard_tidy count(track) |>
slice_max(n, n = 5)
8.6 Question 6
Determine for each artist the song that appears the most weeks in the ranking.
Code
|>
billboard_tidy group_by(artist) |>
count(track) |>
slice_max(n, n = 1) |>
ungroup()
8.7 Question 7
Calculates the highest position each song has been in. Calculates the highest position an artist has been in.
Code
|>
billboard_tidy slice_min(rank, by = track, with_ties = FALSE) |>
select(artist, track, rank)
|>
billboard_tidy slice_min(rank, by = artist, with_ties = FALSE) |>
select(artist, rank)
8.8 Question 8
Get a summary table with the average ranking of each artist (counting only the highest ranking achieved by their songs), as well as the number of (different) songs they have placed in the top 100.
Code
|>
billboard_tidy slice_min(rank, by = c(artist, track), with_ties = FALSE) |>
summarise(ave_rank = mean(rank), n_songs = n(), .by = artist)
8.9 Question 9
Perform stratified random sampling, extracting 50% of the data.
Code
|>
billboard_tidy slice_sample(prop = 0.5)
9 🐣 Case study II: soccer
Let’s continue practicing what we learned in {tidyverse}
with the file futbol.csv
, where we have data of the players of the 5 main men’s soccer leagues, from 2005 to 2019, compiling different statistics. The data has been extracted directly using the {worldfootballR}
package, which allows us to extract data from https://www.fbref.com.
<- read_csv(file = "./datos/futbol.csv") data
Rows: 40393 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): team, league, player, country, position
dbl (11): season, date_birth, minutes_playing, minutes_per_match_playing, go...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data
# A tibble: 40,393 × 16
season team league player country position date_birth minutes_playing
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 2005 Ajaccio Ligue 1 Djamel Ab… ALG MF 1986 30
2 2005 Ajaccio Ligue 1 Gaspar Az… POR DF 1975 1224
3 2005 Ajaccio Ligue 1 Yacine Be… ALG MF 1981 140
4 2005 Ajaccio Ligue 1 Nicolas B… FRA MF 1976 892
5 2005 Ajaccio Ligue 1 Marcelinh… BRA MF 1971 704
6 2005 Ajaccio Ligue 1 Cyril Cha… FRA FW 1979 1308
7 2005 Ajaccio Ligue 1 Xavier Co… FRA DF 1974 2734
8 2005 Ajaccio Ligue 1 Renaud Co… FRA MF 1980 896
9 2005 Ajaccio Ligue 1 Yohan Dem… FRA DF,MF 1978 2658
10 2005 Ajaccio Ligue 1 Christoph… FRA DF 1973 74
# ℹ 40,383 more rows
# ℹ 8 more variables: minutes_per_match_playing <dbl>, goals <dbl>,
# assist <dbl>, goals_minus_pk <dbl>, pk <dbl>, pk_attemp <dbl>,
# yellow_card <dbl>, red_card <dbl>
The variables capture the following information:
season
,team
,league
: season, team and league.player
,country
,position
,date_birth
: name, country, position and birth of year of each player.minutes_playing
,matches
: total minutes playing and 90’ matches playedgoals
,assist
: goals and assists.pk
,pk_attemp
,goals_minus_pk
: penalties, penalties attempted and goals excluding penalties.yellow_card
,red_card
: yellow/red cards.
data
# A tibble: 40,393 × 16
season team league player country position date_birth minutes_playing
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 2005 Ajaccio Ligue 1 Djamel Ab… ALG MF 1986 30
2 2005 Ajaccio Ligue 1 Gaspar Az… POR DF 1975 1224
3 2005 Ajaccio Ligue 1 Yacine Be… ALG MF 1981 140
4 2005 Ajaccio Ligue 1 Nicolas B… FRA MF 1976 892
5 2005 Ajaccio Ligue 1 Marcelinh… BRA MF 1971 704
6 2005 Ajaccio Ligue 1 Cyril Cha… FRA FW 1979 1308
7 2005 Ajaccio Ligue 1 Xavier Co… FRA DF 1974 2734
8 2005 Ajaccio Ligue 1 Renaud Co… FRA MF 1980 896
9 2005 Ajaccio Ligue 1 Yohan Dem… FRA DF,MF 1978 2658
10 2005 Ajaccio Ligue 1 Christoph… FRA DF 1973 74
# ℹ 40,383 more rows
# ℹ 8 more variables: minutes_per_match_playing <dbl>, goals <dbl>,
# assist <dbl>, goals_minus_pk <dbl>, pk <dbl>, pk_attemp <dbl>,
# yellow_card <dbl>, red_card <dbl>
9.1 Question 1
Include a new variable indicating the age of each player in each season (e.g. if he was born in 1989 and the season is 2017, he is put 18 years old). Place this column after the date of birth.
Code
<-
datos |>
datos mutate(age = season - date_birth, .after = "date_birth")
datos
9.2 Question 2
Calculate the goals and assists (separately) scored every 90 minutes (and include them as variables). For this you should only consider players who have played more than 30 minutes during the season and have played more than 5 matches on average.
Code
<-
datos |>
datos filter(minutes_playing > 30 & matches > 5) |>
mutate(goals_per_90 = 90 * goals / minutes_playing,
assist_per_90 = 90 * assist / minutes_playing)
datos
9.3 Question 3
Includes a new variable
pk_fails
that calculates the number of missed penalties. It also includes a new variablegoals_minus_pk_per_90
that calculates goals per 90 minutes but excludes penalty goals.
Code
<-
datos |>
datos mutate(pk_fails = pk_attemp - pk,
goals_minus_pk_per_90 = 90 * goals_minus_pk / minutes_playing)
datos
9.4 Question 4
Create a new variable that codes in
role
whether a player is a “main” (more than 30 slots of 90’s played on average), “recurrent” (between 20 and 30 minutes), “substitute” (between 7 and 20 minutes) or “sporadic” (less than 7).
Code
<-
datos |>
datos mutate(rol = case_when(matches > 30 ~ "main",
> 20 ~ "recurrent",
matches > 7 ~ "substitute",
matches TRUE ~ "sporadic"))
datos
9.5 Question 5
Calculate the average goals and average goals per 90 minutes for each player category (the one created in
role
). Which role performs better (i.e. more goals per 90 minutes)?
Code
|>
datos summarise(mean_goals = mean(goals, na.rm = TRUE),
mean_goals_per_90 = mean(goals_per_90, na.rm = TRUE),
.by = "rol")
9.6 Question 6
Determine in the whole dataset the 10 players with the best scoring average per 90 minutes. Which Spanish players are in the top 10?
Code
# top 10
|>
datos slice_max(goals_per_90, n = 10)
|>
datos slice_max(goals_per_90, n = 10) |>
filter(country == "ESP")
9.7 Question 7
Calculate the total number of goals in each of the major leagues, in which league were the most goals scored during all these years? Get the summary table ordered from most to least goals.
Code
|>
datos summarise(total_goals = sum(goals, na.rm = TRUE), .by = league) |>
arrange(desc(total_goals))
Get the total number of goals per league and season ordered from most to least
Code
|>
datos summarise(total_goals = sum(goals, na.rm = TRUE), .by = c(season, league)) |>
arrange(desc(total_goals))
9.8 Question 8
Calculate for each league and each year, the most effective player (more average goals per game). In how many seasons was it Ronaldo? In how many Messi?
Code
|>
datos slice_max(goals_per_90, by = c(season, league))
|>
datos slice_max(goals_per_90, by = c(season, league)) |>
count(player == "Cristiano Ronaldo")
|>
datos slice_max(goals_per_90, by = c(season, league)) |>
count(player == "Lionel Messi")
9.9 Question 9
Calculate the number of seasons that Cristiano Ronaldo averaged more goals per game than Messi (provided that both played in that season). Think how to redesign the dataset in order to make the calculation (simplify the table first only to what you are interested in).
Code
|>
datos filter(player %in% c("Cristiano Ronaldo", "Lionel Messi")) |>
select(season, player, goals_per_90) |>
pivot_wider(names_from = "player", values_from = "goals_per_90") |>
drop_na() |>
::clean_names() |>
janitorcount(cristiano_ronaldo > lionel_messi)
9.10 Question 10
Determine which player with a main role, and who is neither a goalkeeper nor a defender (containing GK or DF), averages fewer goals per game in each season and league. Determine the one who scores the most.
Code
|>
datos filter(rol == "main" & !str_detect(position, "GK|DF")) |>
slice_min(goals_per_90, by = c(season, league), with_ties = FALSE)
|>
datos filter(rol == "main" & !str_detect(position, "GK|DF")) |>
slice_max(goals_per_90, by = c(season, league), with_ties = FALSE)
10 🐣 Case study III: speechs
Although we cannot do arithmetic operations with them, some operations we can do with the text strings will be important. For that we will use the {stringr}
package (within the same {lubridate}
“universe of packages”).
library(stringr)
10.1 Basic utilities
10.1.1 Length
The most obvious is a function that, given a string, gives us the length. For this we can use the function str_length()
.
str_length("abc")
[1] 3
It is important to note that it counts both numbers and spaces, as well as non-alphanumeric characters.
str_length("abc 123 *")
[1] 9
Also if the text is absent it returns absent (remember: NA
is absent, “NA”
is a string of text more)
str_length(NA)
[1] NA
The package functions are prepared to be vectorized which means that if we apply a function to a vector of two strings it applies it to both in the same way.
str_length(c("abc", "defghi"))
[1] 3 6
10.1.2 Order
Another very common is order text strings (in alphabetical order). For this we can use the str_order()
function, distinguishing ..._sort()
and ..._order()
as with numbers.
<- c("y", "i", "k")
x str_order(x)
[1] 2 3 1
str_sort(x)
[1] "i" "k" "y"
10.2 Handling
10.2.1 Extract sub strings
- Extract sub strings: given a text string,
str_sub(text, start = ..., end = ...)
extracts the substring from thestart
position toend
(if negative it starts counting down).
str_sub("abcd efg", start = 4, end = 6)
[1] "d e"
str_sub("abcd efg", start = 5)
[1] " efg"
str_sub("abcd efg", start = 4, end = -2)
[1] "d ef"
- Extraer subcadenas: the
str_sub()
function allows you to apply it vectorially to multiple strings, and even use it to assign values.
<- c("abcdef", "ghifjk")
x str_sub(x, start = 3, end = -2)
[1] "cde" "ifj"
str_sub(x, start = -1, end = -1)
[1] "f" "k"
str_sub(x, start = 2, end = 2) <- "*"
10.2.2 Duplicate strings
- Duplicate strings: with
str_dup(..., times = ...)
, given a string (or several strings), we can repeat a stringtimes
times.
str_dup("abc", times = 3)
[1] "abcabcabc"
<- c("abcdef", "ghifjk")
x str_dup(x, times = c(2, 5))
[1] "abcdefabcdef" "ghifjkghifjkghifjkghifjkghifjk"
10.2.3 Concatenate strings
- Concatenate strings: with
str_c
we can concatenate different text strings (withsep = ...
we indicate the character that will be used as separator)
str_c("Morning", "My name is Javier")
[1] "MorningMy name is Javier"
str_c("Morning", "My name is Javier", sep = ". ")
[1] "Morning. My name is Javier"
10.2.4 Lower/upper case
- Lower/upper case: with
str_to_...()
we can convert texts to uppercase (..._upper
), to lowercase (..._lower
) and to title (..._title
, first letter of each word in uppercase).
str_to_upper("My name is Javi")
[1] "MY NAME IS JAVI"
str_to_lower("My name is Javi")
[1] "my name is javi"
str_to_title("My name is Javi")
[1] "My Name Is Javi"
10.2.5 Replace
- Replace:
str_replace()
searches for a given pattern in a text string and, if found, replaces it with a replacement string.
str_replace(c("javi", "sandra", "carlos"), pattern = "i", replacement = "*")
[1] "jav*" "sandra" "carlos"
. . .
With str_replace_all()
we replace all matches (by default only the first one is replaced).
str_replace(c("javi", "sandra", "carlos"), pattern = "a", replacement = "*")
[1] "j*vi" "s*ndra" "c*rlos"
str_replace_all(c("javi", "sandra", "carlos"), pattern = "a", replacement = "*")
[1] "j*vi" "s*ndr*" "c*rlos"
10.3 Blank spaces
10.3.1 Filling spaces
- Filling: the
str_pad()
function fills a string with spaces (at the beginning by default) so that it has the specified width. Withside = “both”
as an extra argument it adds them on both sides. Withside = “right”
it adds them at the end. Withpad = ...
we can decide if we want to fill with another type of character (space by default).
str_pad("abc", width = 6)
[1] " abc"
str_pad("abc", 12, side = "both")
[1] " abc "
str_pad("abc", 6, side = "right", pad = "*")
[1] "abc***"
. . .
If width
is less than the length of the string, it does nothing.
str_pad("abc", width = 2)
[1] "abc"
10.3.2 Remove/trim blank spaces
- Remove blank spaces: with
str_trim()
we can remove whitespace at the beginning and at the end of the string. If we addside = ...
we can change whether we want to remove them only at the end or at the beginning (by default, both). Withstr_squish()
we change any succession of whitespace in the middle of the text to just one (and remove at the beginning and end).
str_trim(" abcde fghi ")
[1] "abcde fghi"
str_trim(" abcde ")
[1] "abcde"
str_trim(" abcde ", side = "left")
[1] "abcde "
str_squish(" abcde fghi ")
[1] "abcde fghi"
10.4 Patterns
10.4.1 Detect. Regular expr.
- Detect: with
str_detect()
we can detect whether or not a text string contains a sequence of characters
str_detect(c("javi álvarez", "javi reyes", "sandra reyes"), pattern = "javi")
[1] TRUE TRUE FALSE
str_detect(c("javi álvarez", "javi reyes", "sandra reyes"), pattern = "reyes")
[1] FALSE TRUE TRUE
str_detect(c("javi álvarez", "javi reyes", "sandra reyes"), pattern = "carlos")
[1] FALSE FALSE FALSE
- Regular expr: not only will we be able to detect simple patterns, but we can also make use of what are known as regular expressions, indicating for example that we want to locate any pattern that is at least a letter
str_detect(c("a", "ab", "abc", "abcd"), pattern = "[a-z]")
[1] TRUE TRUE TRUE TRUE
. . .
If we indicate {n}
after the square brackets, we can detect those strings with n consecutive letters.
str_detect(c("a", "ab", "abc", "abcd"), pattern = "[a-z]{3}")
[1] FALSE FALSE TRUE TRUE
- Regular expr: a good command of these expressions can be very useful, for example, to detect correct formats in ID cards or telephone numbers (from Madrid, for example).
We will consider that a correct DNI format is one followed by 8 numbers ([0-9]{8}
) followed directly by a capital letter ([A-Z]
).
str_detect(c("5055A198-W", "50508040W", "5050505W", "50508040-W"),
pattern = "[0-9]{8}[A-Z]")
[1] FALSE TRUE FALSE FALSE
. . .
We can search for different patterns at the same time by concatenating them with a |
.
str_detect(c("5055A198-W", "50508040W", "5050505W", "50508040-W"),
pattern = "[0-9]{8}[A-Z]|[0-9]{8}[-][A-Z]")
[1] FALSE TRUE FALSE TRUE
10.4.2 Count
- Count: with
str_count()
we can count how many times the same pattern appears
str_count(c("abcd defg", "ab defg", "ab cd"), pattern = "[a-z]{4}")
[1] 2 1 0
10.4.3 Locate positions
- Locate positions:
str_locate()
allows us to locate the first position where a pattern occurs. Withstr_locate_all()
we get all of them.
str_locate(c("abcde abcd", "cba", "*a*"), pattern = "a")
start end
[1,] 1 1
[2,] 3 3
[3,] 2 2
str_locate_all(c("abcde abcd", "cba", "*a*"), pattern = "a")
[[1]]
start end
[1,] 1 1
[2,] 7 7
[[2]]
start end
[1,] 3 3
[[3]]
start end
[1,] 2 2
10.4.4 Extract patterns
- Extract patterns: with
str_extract()
we can extract patterns (withstr_extract_all()
all of them) from a text string.
str_extract(c("DNI: 5050W", "DNI: 50558040W, DNI: 50558080-W", "DNI: 50558080-W"),
pattern = "[0-9]{8}[A-Z]|[0-9]{8}[-][A-Z]")
[1] NA "50558040W" "50558080-W"
str_extract_all(c("DNI: 5050W", "DNI: 50558040W, DNI: 50558080-W", "DNI: 50558080-W"),
pattern = "[0-9]{8}[A-Z]|[0-9]{8}[-][A-Z]")
[[1]]
character(0)
[[2]]
[1] "50558040W" "50558080-W"
[[3]]
[1] "50558080-W"
10.4.5 Split strings
- Split: with
str_split()
we can locate a pattern and split the text string whenever it appears (withstr_split_fixed()
we can split it into a specific number of pieces).
str_split(c("a-b-c", "ab-c-d-e"), pattern = "-")
[[1]]
[1] "a" "b" "c"
[[2]]
[1] "ab" "c" "d" "e"
str_split_fixed(c("a-b-c", "ab-c-d-e"), pattern = "-", n = 2)
[,1] [,2]
[1,] "a" "b-c"
[2,] "ab" "c-d-e"
. . .
If we use boundary()
as a pattern we can split based on characters, phrases, words, etc.
<- "Esto es una frase. Y esto otra."
x str_split(x, boundary("word"))
[[1]]
[1] "Esto" "es" "una" "frase" "Y" "esto" "otra"
str_split(x, boundary("sentence"))
[[1]]
[1] "Esto es una frase. " "Y esto otra."
10.5 💻 Case study III: speechs
Try to perform the following exercises without looking at the solutions
The dataset will be discursos
(speeches, extracted from https://github.com/lirondos/discursos-de-navidad) where are stored the christmas speeches of the heads of state of Spain from 1946 to 2021 (in dictatorship and in democracy)
load(file = "./datos/discursos.RData")
# discursos = speeches in spanish
All solutions are designed in R base. Try yourself to “translate” to tidyverse.
10.5.1 Question 1
Converts all speech to lowercase.
Code
# Convert to lowercase
$texto <- str_to_lower(discursos$texto) discursos
10.5.2 Question 2
Remove punctuation marks such as “:”, “,”, “.”, “;”, “¡”, “!”, “¿” and “?”. Then remove leading, trailing and middle spaces if they exist, leaving only one of them.
Code
# Remove punctuation marks
$texto <-
discursosstr_remove_all(discursos$texto, pattern = "\\:|\\,|\\.|\\;|\\¡|\\!|\\¿|\\?")
# Then we eliminate spaces at the front, at the back and in the middle, leaving just one space
$texto <- str_squish(discursos$texto) discursos
10.5.3 Question 3
Create a new variable
long
with the length of each speech.
Code
# new variable
$long <- str_length(discursos$texto) discursos
10.5.4 Question 4
Add a new variable
n_words
with the number of words of each speech. Hint: if after dividing each speech into words you uselength()
it will surely return 76 since it has stored it in a data type called list. To calculate the length of each of the 76 elements of the list we will uselengths()
.
Code
<- list("a" = 1:2, "b" = 1:3, "c" = 1:4)
list_object length(list_object)
lengths(list_object)
# We divide and apply lengths
$n_words <- lengths(str_split(discursos$texto, boundary("word"))) discursos
10.5.5 Question 5
Determine the 5 years with the greatest length, and the 5 years with the least number of words.
Code
# 5 years with the longest length (we use order to obtain indexes)
$year[order(discursos$long, decreasing = TRUE)[1:5]]
discursos
# 5 years with fewer words
$year[order(discursos$n_words)[1:5]] discursos
10.5.6 Question 6
It incorporates a new variable called
spain
that calculates the number of times that “españoles”, “españolas” or “españa” is said in the speech. Determine the 5 years where these words are least mentioned.
Code
$spain <- str_count(discursos$texto, pattern = "españoles|españolas|españa")
discursos$year[order(discursos$spain)[1:5]] discursos
10.5.7 Question 7
Of the 76 years calculate the number of speeches in which the words “woman” or “women” are mentioned more than the words “man” or “men”.
Code
sum(str_count(discursos$texto, pattern = "mujer|mujeres") >
str_count(discursos$texto, pattern = "hombre|hombres"))
10.5.8 Question 8
Detect the speeches where “Catalonia”, “Catalans”, “Catalan” or “Catalonian” appear and keep from the database only those that comply with it
Code
str_detect(discursos$texto, pattern = "cataluña|catalanes|catalán|catalanas"), ] discursos[