library(tibble)
tibble("height" = c(1.7, 1.8, 1.6), "weight" = c(80, 75, 70), "BMI" = weight / (height^2))
# A tibble: 3 × 3
height weight BMI
<dbl> <dbl> <dbl>
1 1.7 80 27.7
2 1.8 75 23.1
3 1.6 70 27.3
Practical workbooks of Data Programming in Master in Computational Social Science (2024-2025)
As commented, our final database format will be the tibble
type object, an enhanced data.frame
.
library(tibble)
tibble("height" = c(1.7, 1.8, 1.6), "weight" = c(80, 75, 70), "BMI" = weight / (height^2))
# A tibble: 3 × 3
height weight BMI
<dbl> <dbl> <dbl>
1 1.7 80 27.7
2 1.8 75 23.1
3 1.6 70 27.3
Metainformation: in the header it automatically tells us the number of rows and columns, and the type of each variable.
Recursivity: allows to define the variables sequentially (as we have seen).
Consistency: if you access a column that does not exist it warns you with a warning.
By rows: allows to create by rows with tribble()
.
So far, everything we have done in R
has been done in the programming paradigm known as R base. When R
was born as a language, many of those who programmed in it imitated forms and methodologies inherited from other languages, based on the use of
Loops for and while
Dollar $
to access to the variables
Structures if-else
And although knowing these structures can be interesting in some cases, in most cases they are obsolete and we will be able to avoid them (especially loops) since R
is specially designed to work in a functional way (instead of element-by-element).
In this context of functional programming, a decade ago {tidyverse}
was born, a “universe” of packages to guarantee an efficient, coherent and lexicographically simple to understand workflow, based on the idea that our data is clean and tidy.
{tibble}
: optimizing data.frame{tidyr}
: data cleaning{readr}
: load rectangular data (.csv), {readxl}
: import .xls and .xlsx files{dplyr}
: grammar for debugging{stringr}
: text handling{purrr}
: list handling{forcats}
: qualitative handling{ggplot2}
: data visualization{lubridate}
: date management{rvest}
: web scraping{tidymodels}
: modeling/predictionTidy datasets are all alike, but every messy dataset is messy in its own way (Hadley Wickham, Chief Scientist en RStudio)
TIDYVERSE
The universe of {tidyverse}
packages is based on the idea introduced by Hadley Wickham (the God we pray to) of standardizing the format of data to
The first thing will therefore be to understand what the tidydata sets are, since the whole {tidyverse}
is based on the data being standardized.
In {tidyverse}
the operator pipe (pipe) defined as |>
(ctrl+shift+M) will be key: it will be a pipe that traverses the data and transforms it.
In R base, if we want to apply three functions first()
, second()
and third()
in order, it would be
third(second(first(data)))
In {tidyverse}
we can read from left to right and separate data from the actions
|> first() |> second() |> third() data
Since version 4.1.0 of R
we have |>
, a native pipe available outside tidyverse, replacing the old pipe %>%
which depended on the {magrittr}
package (quite problematic).
The main advantage is that the code is very readable (almost literal) and you can do large operations on the data with very little code.
|>
data tidy(...) |>
filter(...) |>
select(...) |>
arrange(...) |>
modify(...) |>
rename(...) |>
group(...) |>
count(...) |>
summarise(...) |>
plot(...)
But what does the non-tidy (messy) data look like? Let’s load the table4a
table from the {tidyr}
package (we already have it loaded from the {tidyverse}
environment).
library(tidyr)
table4a
# A tibble: 3 × 3
country `1999` `2000`
<chr> <dbl> <dbl>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
What could be wrong?
❎ Each row represents two observations (1999 and 2000) → the columns 1999
and 2000
should actually themselves be values of a variable and not column names.
We will include a new column that stores the year and another one that stores the value of the variable of interest in each of those years. And we will do it with the pivot_longer()
function: pivot the table to long format:
|>
table4a pivot_longer(cols = c("1999", "2000"), names_to = "year", values_to = "cases")
# A tibble: 6 × 3
country year cases
<chr> <chr> <dbl>
1 Afghanistan 1999 745
2 Afghanistan 2000 2666
3 Brazil 1999 37737
4 Brazil 2000 80488
5 China 1999 212258
6 China 2000 213766
cols
: name of the variables to pivot.names_to
: name of the new variable to which we send the header of the table (the names).values_to
: name of the new variable to which we are going to send the data.Let’s see another example with table table2
.
table2
# A tibble: 12 × 4
country year type count
<chr> <dbl> <chr> <dbl>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
What could be wrong?
❎ Each observation is divided into two rows → the records with the same year should be the same
What we will do will be the opposite: with pivot_wider()
we will widen the table
|> pivot_wider(names_from = type, values_from = count) table2
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
Let’s see another example with table table3
.
table3
# A tibble: 6 × 3
country year rate
<chr> <dbl> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
What could be wrong?
❎ Each cell contains several values
What we will do is make use of the separate()
function to send separate each value to a different column.
|> separate(rate, into = c("cases", "pop")) table3
# A tibble: 6 × 4
country year cases pop
<chr> <dbl> <chr> <chr>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
Notice that the data, although it has separated them, kept them as text when in fact they should be numeric variables. For this we can add the optional argument convert = TRUE
.
|> separate(rate, into = c("cases", "pop"), convert = TRUE) table3
# A tibble: 6 × 4
country year cases pop
<chr> <dbl> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
Let’s see the last example with table table5
.
table5
# A tibble: 6 × 4
country century year rate
<chr> <chr> <chr> <chr>
1 Afghanistan 19 99 745/19987071
2 Afghanistan 20 00 2666/20595360
3 Brazil 19 99 37737/172006362
4 Brazil 20 00 80488/174504898
5 China 19 99 212258/1272915272
6 China 20 00 213766/1280428583
What could be wrong?
❎ We have same values divided in two columns
We will use unite()
to unite the values of century and year in the same column
|> unite(col = whole_year, century, year, sep = "") table5
# A tibble: 6 × 3
country whole_year rate
<chr> <chr> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
We can also nest datasets inside another one: imagine we have a dataset with variables x
and y
, with two records, another with one and another with 3 of them.
<-
data tibble("dataset" = c(1, 1, 2, 3, 3, 3),
"x" = c(0, 2, NA, -2, 6, 7),
"y" = c(-1, NA, 5, 1.5, NA, -2))
data
# A tibble: 6 × 3
dataset x y
<dbl> <dbl> <dbl>
1 1 0 -1
2 1 2 NA
3 2 NA 5
4 3 -2 1.5
5 3 6 NA
6 3 7 -2
Everything that has an equal value in dataset
should form its own tibble
so let’s create one inside the one we have
For it we will use the function nest()
indicating it which variables form the datasets that will be nested (in this case variables x
and y
). Notice that inside what it stores is a variable of type list (since each dataset has a different length).
<-
data_nest |>
data nest(data = c(x, y))
data_nest
# A tibble: 3 × 2
dataset data
<dbl> <list>
1 1 <tibble [2 × 2]>
2 2 <tibble [1 × 2]>
3 3 <tibble [3 × 2]>
To unnest just use the unnest()
function indicating the column to unnest.
|> unnest(cols = c(data)) data_nest
# A tibble: 6 × 3
dataset x y
<dbl> <dbl> <dbl>
1 1 0 -1
2 1 2 NA
3 2 NA 5
4 3 -2 1.5
5 3 6 NA
6 3 7 -2
In the {tidyr}
package we have the world_bank_pop
dataset which contains data from the World Bank about population per country from 2000 to 2018.
library(tidyr)
world_bank_pop
# A tibble: 1,064 × 20
country indicator `2000` `2001` `2002` `2003` `2004` `2005` `2006`
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ABW SP.URB.TOTL 4.16e4 4.20e+4 4.22e+4 4.23e+4 4.23e+4 4.24e+4 4.26e+4
2 ABW SP.URB.GROW 1.66e0 9.56e-1 4.01e-1 1.97e-1 9.46e-2 1.94e-1 3.67e-1
3 ABW SP.POP.TOTL 8.91e4 9.07e+4 9.18e+4 9.27e+4 9.35e+4 9.45e+4 9.56e+4
4 ABW SP.POP.GROW 2.54e0 1.77e+0 1.19e+0 9.97e-1 9.01e-1 1.00e+0 1.18e+0
5 AFE SP.URB.TOTL 1.16e8 1.20e+8 1.24e+8 1.29e+8 1.34e+8 1.39e+8 1.44e+8
6 AFE SP.URB.GROW 3.60e0 3.66e+0 3.72e+0 3.71e+0 3.74e+0 3.81e+0 3.81e+0
7 AFE SP.POP.TOTL 4.02e8 4.12e+8 4.23e+8 4.34e+8 4.45e+8 4.57e+8 4.70e+8
8 AFE SP.POP.GROW 2.58e0 2.59e+0 2.61e+0 2.62e+0 2.64e+0 2.67e+0 2.70e+0
9 AFG SP.URB.TOTL 4.31e6 4.36e+6 4.67e+6 5.06e+6 5.30e+6 5.54e+6 5.83e+6
10 AFG SP.URB.GROW 1.86e0 1.15e+0 6.86e+0 7.95e+0 4.59e+0 4.47e+0 5.03e+0
# ℹ 1,054 more rows
# ℹ 11 more variables: `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>,
# `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>,
# `2016` <dbl>, `2017` <dbl>
What could be wrong?
First of all, we can see that we effectively have the same variable in 18 columns: population. What we should have is a column called pop
with these values and another year
indicating to which year corresponds the measurement. And for this we will do it with pivot_longer()
.
<-
world_bank_pop_tidy |>
world_bank_pop pivot_longer(cols = -(country:indicator), names_to = "year", values_to = "value")
world_bank_pop_tidy
# A tibble: 19,152 × 4
country indicator year value
<chr> <chr> <chr> <dbl>
1 ABW SP.URB.TOTL 2000 41625
2 ABW SP.URB.TOTL 2001 42025
3 ABW SP.URB.TOTL 2002 42194
4 ABW SP.URB.TOTL 2003 42277
5 ABW SP.URB.TOTL 2004 42317
6 ABW SP.URB.TOTL 2005 42399
7 ABW SP.URB.TOTL 2006 42555
8 ABW SP.URB.TOTL 2007 42729
9 ABW SP.URB.TOTL 2008 42906
10 ABW SP.URB.TOTL 2009 43079
# ℹ 19,142 more rows
world_bank_pop_tidy
# A tibble: 19,152 × 4
country indicator year value
<chr> <chr> <chr> <dbl>
1 ABW SP.URB.TOTL 2000 41625
2 ABW SP.URB.TOTL 2001 42025
3 ABW SP.URB.TOTL 2002 42194
4 ABW SP.URB.TOTL 2003 42277
5 ABW SP.URB.TOTL 2004 42317
6 ABW SP.URB.TOTL 2005 42399
7 ABW SP.URB.TOTL 2006 42555
8 ABW SP.URB.TOTL 2007 42729
9 ABW SP.URB.TOTL 2008 42906
10 ABW SP.URB.TOTL 2009 43079
# ℹ 19,142 more rows
Is everything correct?
If you notice we have two types of population measures, total ...TOTL
and its growth ...GROW
, but in addition we have them for each country in global SP.POP...
and only in urban area SP.URB...
.
unique(world_bank_pop_tidy$indicator)
[1] "SP.URB.TOTL" "SP.URB.GROW" "SP.POP.TOTL" "SP.POP.GROW"
What should be done?
We will separate this variable into 3: one for the prefix SP
(which we will eliminate later), one for the area (POP/URB
) and one for the value (variable
), which can be total or growth.
<-
world_bank_pop_tidy2 |>
world_bank_pop_tidy separate(indicator, c("dummy", "area", "variable"))
$dummy <- NULL
world_bank_pop_tidy2 world_bank_pop_tidy2
# A tibble: 19,152 × 5
country area variable year value
<chr> <chr> <chr> <chr> <dbl>
1 ABW URB TOTL 2000 41625
2 ABW URB TOTL 2001 42025
3 ABW URB TOTL 2002 42194
4 ABW URB TOTL 2003 42277
5 ABW URB TOTL 2004 42317
6 ABW URB TOTL 2005 42399
7 ABW URB TOTL 2006 42555
8 ABW URB TOTL 2007 42729
9 ABW URB TOTL 2008 42906
10 ABW URB TOTL 2009 43079
# ℹ 19,142 more rows
This can be done in a simpler way by indicating in the variable that we want to eliminate that it is NA
inside separate()
.
<-
world_bank_pop_tidy |>
world_bank_pop_tidy separate(indicator, c(NA, "area", "variable"))
world_bank_pop_tidy
# A tibble: 19,152 × 5
country area variable year value
<chr> <chr> <chr> <chr> <dbl>
1 ABW URB TOTL 2000 41625
2 ABW URB TOTL 2001 42025
3 ABW URB TOTL 2002 42194
4 ABW URB TOTL 2003 42277
5 ABW URB TOTL 2004 42317
6 ABW URB TOTL 2005 42399
7 ABW URB TOTL 2006 42555
8 ABW URB TOTL 2007 42729
9 ABW URB TOTL 2008 42906
10 ABW URB TOTL 2009 43079
# ℹ 19,142 more rows
Have we got it yet? Think carefully: does each variable have its own column?
If you actually look at the total population and growth variables, they should be different variables, since they even have different units: one is inhabitants, the other is percentage points. To do the reverse of the initial operation, pivot_wider()
(later we will use a tremendously useful function, {janitor}
’s clean_names()
which unifies variable names).
<-
world_bank_pop_tidy |>
world_bank_pop_tidy pivot_wider(names_from = "variable", values_from = "value") |>
::clean_names()
janitor world_bank_pop_tidy
# A tibble: 9,576 × 5
country area year totl grow
<chr> <chr> <chr> <dbl> <dbl>
1 ABW URB 2000 41625 1.66
2 ABW URB 2001 42025 0.956
3 ABW URB 2002 42194 0.401
4 ABW URB 2003 42277 0.197
5 ABW URB 2004 42317 0.0946
6 ABW URB 2005 42399 0.194
7 ABW URB 2006 42555 0.367
8 ABW URB 2007 42729 0.408
9 ABW URB 2008 42906 0.413
10 ABW URB 2009 43079 0.402
# ℹ 9,566 more rows
The complete code would be this: short, concise, readable and self-descriptive.
<-
world_bank_pop_tidy |>
world_bank_pop pivot_longer(cols = -(country:indicator), names_to = "year", values_to = "value") |>
separate(indicator, c(NA, "area", "variable")) |>
pivot_wider(names_from = "variable", values_from = "value") |>
::clean_names()
janitor world_bank_pop_tidy
# A tibble: 9,576 × 5
country area year totl grow
<chr> <chr> <chr> <dbl> <dbl>
1 ABW URB 2000 41625 1.66
2 ABW URB 2001 42025 0.956
3 ABW URB 2002 42194 0.401
4 ABW URB 2003 42277 0.197
5 ABW URB 2004 42317 0.0946
6 ABW URB 2005 42399 0.194
7 ABW URB 2006 42555 0.367
8 ABW URB 2007 42729 0.408
9 ABW URB 2008 42906 0.413
10 ABW URB 2009 43079 0.402
# ℹ 9,566 more rows
In{tidyr}
package we have who2
dataset (World Health Organization dataset)
library(tidyr)
who2
# A tibble: 7,240 × 58
country year sp_m_014 sp_m_1524 sp_m_2534 sp_m_3544 sp_m_4554 sp_m_5564
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan 1980 NA NA NA NA NA NA
2 Afghanistan 1981 NA NA NA NA NA NA
3 Afghanistan 1982 NA NA NA NA NA NA
4 Afghanistan 1983 NA NA NA NA NA NA
5 Afghanistan 1984 NA NA NA NA NA NA
6 Afghanistan 1985 NA NA NA NA NA NA
7 Afghanistan 1986 NA NA NA NA NA NA
8 Afghanistan 1987 NA NA NA NA NA NA
9 Afghanistan 1988 NA NA NA NA NA NA
10 Afghanistan 1989 NA NA NA NA NA NA
# ℹ 7,230 more rows
# ℹ 50 more variables: sp_m_65 <dbl>, sp_f_014 <dbl>, sp_f_1524 <dbl>,
# sp_f_2534 <dbl>, sp_f_3544 <dbl>, sp_f_4554 <dbl>, sp_f_5564 <dbl>,
# sp_f_65 <dbl>, sn_m_014 <dbl>, sn_m_1524 <dbl>, sn_m_2534 <dbl>,
# sn_m_3544 <dbl>, sn_m_4554 <dbl>, sn_m_5564 <dbl>, sn_m_65 <dbl>,
# sn_f_014 <dbl>, sn_f_1524 <dbl>, sn_f_2534 <dbl>, sn_f_3544 <dbl>,
# sn_f_4554 <dbl>, sn_f_5564 <dbl>, sn_f_65 <dbl>, ep_m_014 <dbl>, …
Is it tidy data? Why?
First step for tidy data: we must pivot the table (tip: use paper and pen to sketch how the database should look like) so that there is a column called cases
(since all columns starting from year
is actually the same, cases of a disease).
<-
who_tidy |>
who2 pivot_longer(cols = -(country:year), names_to = "type", values_to = "cases")
who_tidy
# A tibble: 405,440 × 4
country year type cases
<chr> <dbl> <chr> <dbl>
1 Afghanistan 1980 sp_m_014 NA
2 Afghanistan 1980 sp_m_1524 NA
3 Afghanistan 1980 sp_m_2534 NA
4 Afghanistan 1980 sp_m_3544 NA
5 Afghanistan 1980 sp_m_4554 NA
6 Afghanistan 1980 sp_m_5564 NA
7 Afghanistan 1980 sp_m_65 NA
8 Afghanistan 1980 sp_f_014 NA
9 Afghanistan 1980 sp_f_1524 NA
10 Afghanistan 1980 sp_f_2534 NA
# ℹ 405,430 more rows
If you notice there are a lot of rows that do not make sense to keep because we have no cases! Investigate the pivot_longer()
options to see how we can directly remove them in the pivot.
<-
who_tidy |>
who2 pivot_longer(cols = -(country:year), names_to = "type", values_to = "cases",
values_drop_na = TRUE)
who_tidy
# A tibble: 76,046 × 4
country year type cases
<chr> <dbl> <chr> <dbl>
1 Afghanistan 1997 sp_m_014 0
2 Afghanistan 1997 sp_m_1524 10
3 Afghanistan 1997 sp_m_2534 6
4 Afghanistan 1997 sp_m_3544 3
5 Afghanistan 1997 sp_m_4554 5
6 Afghanistan 1997 sp_m_5564 2
7 Afghanistan 1997 sp_m_65 0
8 Afghanistan 1997 sp_f_014 5
9 Afghanistan 1997 sp_f_1524 38
10 Afghanistan 1997 sp_f_2534 36
# ℹ 76,036 more rows
Now in type
we have coded the information as diagnosis_sex_age
. How to separate it in 3 columns? Investigate both separate()
and pivot_longer()
options.
# separate
<-
who_tidy |>
who_tidy separate(col = "type", into = c("diagnosis", "sex", "age"))
# pivot_longer
<-
who_tidy |>
who2 pivot_longer(cols = -(country:year), names_to = c("diagnosis", "sex", "age"),
values_to = "cases", values_drop_na = TRUE,
names_sep = "_")
who_tidy
# A tibble: 76,046 × 6
country year diagnosis sex age cases
<chr> <dbl> <chr> <chr> <chr> <dbl>
1 Afghanistan 1997 sp m 014 0
2 Afghanistan 1997 sp m 1524 10
3 Afghanistan 1997 sp m 2534 6
4 Afghanistan 1997 sp m 3544 3
5 Afghanistan 1997 sp m 4554 5
6 Afghanistan 1997 sp m 5564 2
7 Afghanistan 1997 sp m 65 0
8 Afghanistan 1997 sp f 014 5
9 Afghanistan 1997 sp f 1524 38
10 Afghanistan 1997 sp f 2534 36
# ℹ 76,036 more rows
Finally, separate in two (age_inf
, age_sup
) the age range (which are numbers). Think about how to do it since it is not always 4 numbers (if there is no upper age range defined, put Inf
as an upper bound).
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
<-
who_tidy |>
who_tidy separate(col = "age", into = c("age_inf", "age_sup"), sep = -2, convert = TRUE)
$age_inf <- if_else(is.na(who_tidy$age_inf), 65, who_tidy$age_inf)
who_tidy$age_sup <- if_else(who_tidy$age_sup == 65, Inf, who_tidy$age_sup)
who_tidy who_tidy
# A tibble: 76,046 × 7
country year diagnosis sex age_inf age_sup cases
<chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 Afghanistan 1997 sp m 0 14 0
2 Afghanistan 1997 sp m 15 24 10
3 Afghanistan 1997 sp m 25 34 6
4 Afghanistan 1997 sp m 35 44 3
5 Afghanistan 1997 sp m 45 54 5
6 Afghanistan 1997 sp m 55 64 2
7 Afghanistan 1997 sp m 65 Inf 0
8 Afghanistan 1997 sp f 0 14 5
9 Afghanistan 1997 sp f 15 24 38
10 Afghanistan 1997 sp f 25 34 36
# ℹ 76,036 more rows
Try to perform the following exercises without looking at the solutions
📝 Use the original who2
dataset from the {tidyr}
package and try to answer the question: how many cases of tuberculosis were there in Spain in 1995 among women? Do it without converting it to tidydata. After that, pivot in a simple way and compare the code to be done when we have tidydata to when we don’t. Which one is more readable if you didn’t know R
? Which one has higher error probability?
# messy data
sum(who2[who2$country == "Spain" & who2$year == 1995,
names(who2)[str_detect(names(who2), "f_")]], na.rm = TRUE)
# tidy data (at this moment)
sum(who_tidy[who_tidy$country == "Spain" &
$year == 1995 &
who_tidy$sex == "f", ]$cases)
who_tidy
# in the future
|>
who_tidy filter(country == "Spain" & year == 1995 & sex == "f") |>
summarise(sum(cases))
📝 Using who_tidy
determine which sex has had more cases, men or women? Create a new variable avg_age
(mean age of the interval): if the range is 25 to 34, the mean age will be \((25 + 34)/2 = 29.5\) (if Inf
above, NA
)
# f vs m
sum(who_tidy[who_tidy$sex == "m", ]$cases)
sum(who_tidy[who_tidy$sex == "f", ]$cases)
# ave age
$ave_age <-
who_tidyif_else(is.infinite(who_tidy$age_sup), NA, (who_tidy$age_inf + who_tidy$age_sup)/2)
📝 If we must choose a country in which we have the lowest probability of infection, which country, between the United Kingdom (United Kingdom of Great Britain and Northern Ireland
) and France (similar population), had the fewest cases in the most recent year (whichever it was, even if the table was updated)?
<- who_tidy[who_tidy$year == max(who_tidy$year), ]
last_cases sum(last_cases[last_cases$country == "United Kingdom of Great Britain and Northern Ireland", "cases"])
sum(last_cases[last_cases$country == "France", "cases"])
# better France
📝 Take a look at table table4b
in package {tidyr}
. Is it tidydata? If not, what is wrong, how to convert it to tidy data in case it is not already?
|>
table4b pivot_longer(cols = "1999":"2000", names_to = "year",
values_to = "cases")
📝 Take a look at the billboard
table in the {tidyr}
package. Is it tidydata? If not, what is wrong, how to convert it to tidy data in case it is not already?
|>
billboard pivot_longer(cols = "wk1":"wk76",
names_to = "week",
names_prefix = "wk",
values_to = "position",
values_drop_na = TRUE)
Let’s perform a case study with the relig_income
table of the {tidyr}
package. As indicated in the ? relig_income
help, the table represents the number of people in each annual income bracket (20k = 20 000$) and in each religion.
library(tidyr)
relig_income
# A tibble: 18 × 11
religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Agnostic 27 34 60 81 76 137 122
2 Atheist 12 27 37 52 35 70 73
3 Buddhist 27 21 30 34 33 58 62
4 Catholic 418 617 732 670 638 1116 949
5 Don’t k… 15 14 15 11 10 35 21
6 Evangel… 575 869 1064 982 881 1486 949
7 Hindu 1 9 7 9 11 34 47
8 Histori… 228 244 236 238 197 223 131
9 Jehovah… 20 27 24 24 21 30 15
10 Jewish 19 19 25 25 30 95 69
11 Mainlin… 289 495 619 655 651 1107 939
12 Mormon 29 40 48 51 56 112 85
13 Muslim 6 7 9 10 9 23 16
14 Orthodox 13 17 23 32 32 47 38
15 Other C… 9 7 11 13 13 14 18
16 Other F… 20 33 40 46 49 63 46
17 Other W… 5 2 3 4 2 7 3
18 Unaffil… 217 299 374 365 341 528 407
# ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
# `Don't know/refused` <dbl>
Is it tidy data? Why?
# It is not since we should really only have one variable of
# income and we have it divided into 11: all of them is the same variable only
# it takes on a different value.
How to convert it to tidy data? Do whatever you consider to end up with a three column table: religion, income and people
# The idea is to pivot all the income columns
# We will make the table longer and less wide so....
<-
relig_tidy |>
relig_income pivot_longer(cols = "<$10k":"Don't know/refused", names_to = "income",
values_to = "people")
relig_tidy
Using the tidy table above, separate the
income
variable into two, the lower and upper income limits (income_inf
andincome_sup
).
|>
relig_tidy separate(income, into = c("income_inf", "income_sup"), sep = "-")
If you look at the first column the
“$10k”
should be an upper dimension, not a lower one. How do you tell it to separate that case?
# We will prompt you to separate whether you find `“-”` or `“<”`
# (we use `|` to separate both options).
|>
relig_tidy # Separate by -
separate(income, into = c("income_inf", "income_sup"), sep = "-|<")
Now think about how we can convert the income limits to numerics (removing symbols, letters, etc.).
# To do this we will use the `{stringr}` package, specifically the
# `str_remove_all()` function, to which we can pass the characters
# we want to remove (`$` being a reserved character must be indicated with `$`).
library(stringr)
$income_inf <-
relig_tidystr_remove_all(relig_tidy$income_inf, "\\$|>|k")
$income_sup <-
relig_tidystr_remove_all(relig_tidy$income_sup, "\\$|>|k")
relig_tidy
Where we have
“Don't now/refused”
…. what should we have?
# It should be missing data: if it contains such a phrase, `NA`
# otherwise its value
$income_inf <-
relig_tidyif_else(str_detect(relig_tidy$income_inf, "refused"), NA, relig_tidy$income_inf)
$income_sup <-
relig_tidyif_else(str_detect(relig_tidy$income_sup, "refused"), NA, relig_tidy$income_sup)
# In the first line, that `“”` should also be `NA``
$income_inf <-
relig_tidyif_else(relig_tidy$income_inf == "", NA, relig_tidy$income_inf)
$income_suop <-
relig_tidyif_else(relig_tidy$income_sup == "", NA, relig_tidy$income_sup)
If you also notice that the numbers are actually characters, how to convert them?
$income_inf <- as.numeric(relig_tidy$income_inf)
relig_tidy$income_sup <- as.numeric(relig_tidy$income_sup)
relig_tidy relig_tidy
Can you think of any way to “quantify numerically” the missing values we have in this case?
# If you notice actually when there is absent at the lower limit
# we could actually put a 0 (no one can earn less than that)
# and when we have it at the upper limit it would be `Inf`.
$income_inf <-
relig_tidyif_else(is.na(relig_tidy$income_inf), 0, relig_tidy$income_inf)
$income_sup <-
relig_tidyif_else(is.na(relig_tidy$income_sup), Inf, relig_tidy$income_sup)
relig_tidy
Why was it important to have it in a tidy format? We will see it later when visualizing the data but this already allows us to perform very quick filters with very little code. For example: how many agnostic people with income greater than (or equal to) 30 do we have?
sum(relig_tidy$people[relig_tidy$religion == "Agnostic" & relig_tidy$income_inf >= 30])