Introduction to tidyverse: tidy data

Practical workbooks of Data Programming in Master in Computational Social Science (2024-2025)

Author

Javier Álvarez Liébana

1 Introduction to tidy data

As commented, our final database format will be the tibble type object, an enhanced data.frame.

library(tibble)
tibble("height" = c(1.7, 1.8, 1.6), "weight" = c(80, 75, 70), "BMI" = weight / (height^2))

# A tibble: 3 × 3
  height weight   BMI
   <dbl>  <dbl> <dbl>
1    1.7     80  27.7
2    1.8     75  23.1
3    1.6     70  27.3

Metainformation: in the header it automatically tells us the number of rows and columns, and the type of each variable.
Recursivity: allows to define the variables sequentially (as we have seen).
Consistency: if you access a column that does not exist it warns you with a warning.
By rows: allows to create by rows with tribble().

1.1 R base vs Tidyverse

So far, everything we have done in R has been done in the programming paradigm known as R base. When R was born as a language, many of those who programmed in it imitated forms and methodologies inherited from other languages, based on the use of

Loops for and while
Dollar $ to access to the variables
Structures if-else

And although knowing these structures can be interesting in some cases, in most cases they are obsolete and we will be able to avoid them (especially loops) since R is specially designed to work in a functional way (instead of element-by-element).

In this context of functional programming, a decade ago {tidyverse} was born, a “universe” of packages to guarantee an efficient, coherent and lexicographically simple to understand workflow, based on the idea that our data is clean and tidy.

{tibble}: optimizing data.frame
{tidyr}: data cleaning
{readr}: load rectangular data (.csv), {readxl}: import .xls and .xlsx files
{dplyr}: grammar for debugging
{stringr}: text handling
{purrr}: list handling
{forcats}: qualitative handling
{ggplot2}: data visualization
{lubridate}: date management
{rvest}: web scraping
{tidymodels}: modeling/prediction

1.2 Basic idea: tidy data

Tidy datasets are all alike, but every messy dataset is messy in its own way (Hadley Wickham, Chief Scientist en RStudio)

TIDYVERSE

The universe of {tidyverse} packages is based on the idea introduced by Hadley Wickham (the God we pray to) of standardizing the format of data to

systematize debugging
make it easier simpler to manipulate
legible code. :::

1.2.1 Rules in the tidy universe

The first thing will therefore be to understand what the tidydata sets are, since the whole {tidyverse} is based on the data being standardized.

Each variable in a single column
Each individual in a different row
Each cell with a single value
Each dataset in a tibble
If we want to join multiple datasets we must have a common (key) column.

1.2.2 Pipe

In {tidyverse} the operator pipe (pipe) defined as |> (ctrl+shift+M) will be key: it will be a pipe that traverses the data and transforms it.

In R base, if we want to apply three functions first(), second() and third() in order, it would be

third(second(first(data)))

In {tidyverse} we can read from left to right and separate data from the actions

data |> first() |> second() |> third()

Important

Since version 4.1.0 of R we have |>, a native pipe available outside tidyverse, replacing the old pipe %>% which depended on the {magrittr} package (quite problematic).

The main advantage is that the code is very readable (almost literal) and you can do large operations on the data with very little code.

data |>
  tidy(...) |>
  filter(...) |>
  select(...) |>
  arrange(...) |>
  modify(...) |>
  rename(...) |>
  group(...) |>
  count(...) |>
  summarise(...) |>
  plot(...)

1.3 Messy data: examples

1.3.1 pivot_longer()

But what does the non-tidy (messy) data look like? Let’s load the table4a table from the {tidyr} package (we already have it loaded from the {tidyverse} environment).

library(tidyr)
table4a

# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

What could be wrong?

❎ Each row represents two observations (1999 and 2000) → the columns 1999 and 2000 should actually themselves be values of a variable and not column names.

We will include a new column that stores the year and another one that stores the value of the variable of interest in each of those years. And we will do it with the pivot_longer() function: pivot the table to long format:

table4a |> 
  pivot_longer(cols = c("1999", "2000"), names_to = "year", values_to = "cases")

# A tibble: 6 × 3
  country     year   cases
  <chr>       <chr>  <dbl>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766

cols: name of the variables to pivot.
names_to: name of the new variable to which we send the header of the table (the names).
values_to: name of the new variable to which we are going to send the data.

1.3.2 pivot_wider()

Let’s see another example with table table2.

table2

# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

What could be wrong?

❎ Each observation is divided into two rows → the records with the same year should be the same

What we will do will be the opposite: with pivot_wider() we will widen the table

table2 |>  pivot_wider(names_from = type, values_from = count)

# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

1.3.3 separate()

Let’s see another example with table table3.

table3

# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

What could be wrong?

❎ Each cell contains several values

What we will do is make use of the separate() function to send separate each value to a different column.

table3 |> separate(rate, into = c("cases", "pop"))

# A tibble: 6 × 4
  country      year cases  pop       
  <chr>       <dbl> <chr>  <chr>     
1 Afghanistan  1999 745    19987071  
2 Afghanistan  2000 2666   20595360  
3 Brazil       1999 37737  172006362 
4 Brazil       2000 80488  174504898 
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Notice that the data, although it has separated them, kept them as text when in fact they should be numeric variables. For this we can add the optional argument convert = TRUE.

table3 |> separate(rate, into = c("cases", "pop"), convert = TRUE)

# A tibble: 6 × 4
  country      year  cases        pop
  <chr>       <dbl>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

1.3.4 unite()

Let’s see the last example with table table5.

table5

# A tibble: 6 × 4
  country     century year  rate             
  <chr>       <chr>   <chr> <chr>            
1 Afghanistan 19      99    745/19987071     
2 Afghanistan 20      00    2666/20595360    
3 Brazil      19      99    37737/172006362  
4 Brazil      20      00    80488/174504898  
5 China       19      99    212258/1272915272
6 China       20      00    213766/1280428583

What could be wrong?

❎ We have same values divided in two columns

We will use unite() to unite the values of century and year in the same column

table5 |> unite(col = whole_year, century, year, sep = "")

# A tibble: 6 × 3
  country     whole_year rate             
  <chr>       <chr>      <chr>            
1 Afghanistan 1999       745/19987071     
2 Afghanistan 2000       2666/20595360    
3 Brazil      1999       37737/172006362  
4 Brazil      2000       80488/174504898  
5 China       1999       212258/1272915272
6 China       2000       213766/1280428583

1.4 Nest data

We can also nest datasets inside another one: imagine we have a dataset with variables x and y, with two records, another with one and another with 3 of them.

data <-
  tibble("dataset" = c(1, 1, 2, 3, 3, 3), 
         "x" = c(0, 2, NA, -2, 6, 7),
         "y" = c(-1, NA, 5, 1.5, NA, -2))
data

# A tibble: 6 × 3
  dataset     x     y
    <dbl> <dbl> <dbl>
1       1     0  -1  
2       1     2  NA  
3       2    NA   5  
4       3    -2   1.5
5       3     6  NA  
6       3     7  -2

Everything that has an equal value in dataset should form its own tibble so let’s create one inside the one we have

For it we will use the function nest() indicating it which variables form the datasets that will be nested (in this case variables x and y). Notice that inside what it stores is a variable of type list (since each dataset has a different length).

data_nest <-
  data |>
  nest(data = c(x, y))
data_nest

# A tibble: 3 × 2
  dataset data            
    <dbl> <list>          
1       1 <tibble [2 × 2]>
2       2 <tibble [1 × 2]>
3       3 <tibble [3 × 2]>

To unnest just use the unnest() function indicating the column to unnest.

data_nest |> unnest(cols = c(data))

# A tibble: 6 × 3
  dataset     x     y
    <dbl> <dbl> <dbl>
1       1     0  -1  
2       1     2  NA  
3       2    NA   5  
4       3    -2   1.5
5       3     6  NA  
6       3     7  -2

1.5 Example: world bank pop

In the {tidyr} package we have the world_bank_pop dataset which contains data from the World Bank about population per country from 2000 to 2018.

library(tidyr)
world_bank_pop

# A tibble: 1,064 × 20
   country indicator      `2000`  `2001`  `2002`  `2003`  `2004`  `2005`  `2006`
   <chr>   <chr>           <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 ABW     SP.URB.TOTL    4.16e4 4.20e+4 4.22e+4 4.23e+4 4.23e+4 4.24e+4 4.26e+4
 2 ABW     SP.URB.GROW    1.66e0 9.56e-1 4.01e-1 1.97e-1 9.46e-2 1.94e-1 3.67e-1
 3 ABW     SP.POP.TOTL    8.91e4 9.07e+4 9.18e+4 9.27e+4 9.35e+4 9.45e+4 9.56e+4
 4 ABW     SP.POP.GROW    2.54e0 1.77e+0 1.19e+0 9.97e-1 9.01e-1 1.00e+0 1.18e+0
 5 AFE     SP.URB.TOTL    1.16e8 1.20e+8 1.24e+8 1.29e+8 1.34e+8 1.39e+8 1.44e+8
 6 AFE     SP.URB.GROW    3.60e0 3.66e+0 3.72e+0 3.71e+0 3.74e+0 3.81e+0 3.81e+0
 7 AFE     SP.POP.TOTL    4.02e8 4.12e+8 4.23e+8 4.34e+8 4.45e+8 4.57e+8 4.70e+8
 8 AFE     SP.POP.GROW    2.58e0 2.59e+0 2.61e+0 2.62e+0 2.64e+0 2.67e+0 2.70e+0
 9 AFG     SP.URB.TOTL    4.31e6 4.36e+6 4.67e+6 5.06e+6 5.30e+6 5.54e+6 5.83e+6
10 AFG     SP.URB.GROW    1.86e0 1.15e+0 6.86e+0 7.95e+0 4.59e+0 4.47e+0 5.03e+0
# ℹ 1,054 more rows
# ℹ 11 more variables: `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>,
#   `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>,
#   `2016` <dbl>, `2017` <dbl>

What could be wrong?

First of all, we can see that we effectively have the same variable in 18 columns: population. What we should have is a column called pop with these values and another year indicating to which year corresponds the measurement. And for this we will do it with pivot_longer().

world_bank_pop_tidy <-
  world_bank_pop |> 
  pivot_longer(cols = -(country:indicator), names_to = "year", values_to = "value")
world_bank_pop_tidy

# A tibble: 19,152 × 4
   country indicator   year  value
   <chr>   <chr>       <chr> <dbl>
 1 ABW     SP.URB.TOTL 2000  41625
 2 ABW     SP.URB.TOTL 2001  42025
 3 ABW     SP.URB.TOTL 2002  42194
 4 ABW     SP.URB.TOTL 2003  42277
 5 ABW     SP.URB.TOTL 2004  42317
 6 ABW     SP.URB.TOTL 2005  42399
 7 ABW     SP.URB.TOTL 2006  42555
 8 ABW     SP.URB.TOTL 2007  42729
 9 ABW     SP.URB.TOTL 2008  42906
10 ABW     SP.URB.TOTL 2009  43079
# ℹ 19,142 more rows

world_bank_pop_tidy

# A tibble: 19,152 × 4
   country indicator   year  value
   <chr>   <chr>       <chr> <dbl>
 1 ABW     SP.URB.TOTL 2000  41625
 2 ABW     SP.URB.TOTL 2001  42025
 3 ABW     SP.URB.TOTL 2002  42194
 4 ABW     SP.URB.TOTL 2003  42277
 5 ABW     SP.URB.TOTL 2004  42317
 6 ABW     SP.URB.TOTL 2005  42399
 7 ABW     SP.URB.TOTL 2006  42555
 8 ABW     SP.URB.TOTL 2007  42729
 9 ABW     SP.URB.TOTL 2008  42906
10 ABW     SP.URB.TOTL 2009  43079
# ℹ 19,142 more rows

Is everything correct?

If you notice we have two types of population measures, total ...TOTL and its growth ...GROW, but in addition we have them for each country in global SP.POP... and only in urban area SP.URB....

unique(world_bank_pop_tidy$indicator)

[1] "SP.URB.TOTL" "SP.URB.GROW" "SP.POP.TOTL" "SP.POP.GROW"

What should be done?

We will separate this variable into 3: one for the prefix SP (which we will eliminate later), one for the area (POP/URB) and one for the value (variable), which can be total or growth.

world_bank_pop_tidy2 <-
  world_bank_pop_tidy |>
  separate(indicator, c("dummy", "area", "variable"))
world_bank_pop_tidy2$dummy <- NULL
world_bank_pop_tidy2

# A tibble: 19,152 × 5
   country area  variable year  value
   <chr>   <chr> <chr>    <chr> <dbl>
 1 ABW     URB   TOTL     2000  41625
 2 ABW     URB   TOTL     2001  42025
 3 ABW     URB   TOTL     2002  42194
 4 ABW     URB   TOTL     2003  42277
 5 ABW     URB   TOTL     2004  42317
 6 ABW     URB   TOTL     2005  42399
 7 ABW     URB   TOTL     2006  42555
 8 ABW     URB   TOTL     2007  42729
 9 ABW     URB   TOTL     2008  42906
10 ABW     URB   TOTL     2009  43079
# ℹ 19,142 more rows

This can be done in a simpler way by indicating in the variable that we want to eliminate that it is NA inside separate().

world_bank_pop_tidy <-
  world_bank_pop_tidy |>
  separate(indicator, c(NA, "area", "variable"))
world_bank_pop_tidy

# A tibble: 19,152 × 5
   country area  variable year  value
   <chr>   <chr> <chr>    <chr> <dbl>
 1 ABW     URB   TOTL     2000  41625
 2 ABW     URB   TOTL     2001  42025
 3 ABW     URB   TOTL     2002  42194
 4 ABW     URB   TOTL     2003  42277
 5 ABW     URB   TOTL     2004  42317
 6 ABW     URB   TOTL     2005  42399
 7 ABW     URB   TOTL     2006  42555
 8 ABW     URB   TOTL     2007  42729
 9 ABW     URB   TOTL     2008  42906
10 ABW     URB   TOTL     2009  43079
# ℹ 19,142 more rows

Have we got it yet? Think carefully: does each variable have its own column?

If you actually look at the total population and growth variables, they should be different variables, since they even have different units: one is inhabitants, the other is percentage points. To do the reverse of the initial operation, pivot_wider() (later we will use a tremendously useful function, {janitor}’s clean_names() which unifies variable names).

world_bank_pop_tidy <-
  world_bank_pop_tidy |>
  pivot_wider(names_from = "variable", values_from = "value") |> 
  janitor::clean_names()
world_bank_pop_tidy

# A tibble: 9,576 × 5
   country area  year   totl   grow
   <chr>   <chr> <chr> <dbl>  <dbl>
 1 ABW     URB   2000  41625 1.66  
 2 ABW     URB   2001  42025 0.956 
 3 ABW     URB   2002  42194 0.401 
 4 ABW     URB   2003  42277 0.197 
 5 ABW     URB   2004  42317 0.0946
 6 ABW     URB   2005  42399 0.194 
 7 ABW     URB   2006  42555 0.367 
 8 ABW     URB   2007  42729 0.408 
 9 ABW     URB   2008  42906 0.413 
10 ABW     URB   2009  43079 0.402 
# ℹ 9,566 more rows

The complete code would be this: short, concise, readable and self-descriptive.

world_bank_pop_tidy <-
  world_bank_pop |> 
  pivot_longer(cols = -(country:indicator), names_to = "year", values_to = "value") |> 
  separate(indicator, c(NA, "area", "variable")) |> 
  pivot_wider(names_from = "variable", values_from = "value") |> 
  janitor::clean_names()
world_bank_pop_tidy

# A tibble: 9,576 × 5
   country area  year   totl   grow
   <chr>   <chr> <chr> <dbl>  <dbl>
 1 ABW     URB   2000  41625 1.66  
 2 ABW     URB   2001  42025 0.956 
 3 ABW     URB   2002  42194 0.401 
 4 ABW     URB   2003  42277 0.197 
 5 ABW     URB   2004  42317 0.0946
 6 ABW     URB   2005  42399 0.194 
 7 ABW     URB   2006  42555 0.367 
 8 ABW     URB   2007  42729 0.408 
 9 ABW     URB   2008  42906 0.413 
10 ABW     URB   2009  43079 0.402 
# ℹ 9,566 more rows

1.6 Example: who dataset

In{tidyr} package we have who2 dataset (World Health Organization dataset)

library(tidyr)
who2

# A tibble: 7,240 × 58
   country      year sp_m_014 sp_m_1524 sp_m_2534 sp_m_3544 sp_m_4554 sp_m_5564
   <chr>       <dbl>    <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
 1 Afghanistan  1980       NA        NA        NA        NA        NA        NA
 2 Afghanistan  1981       NA        NA        NA        NA        NA        NA
 3 Afghanistan  1982       NA        NA        NA        NA        NA        NA
 4 Afghanistan  1983       NA        NA        NA        NA        NA        NA
 5 Afghanistan  1984       NA        NA        NA        NA        NA        NA
 6 Afghanistan  1985       NA        NA        NA        NA        NA        NA
 7 Afghanistan  1986       NA        NA        NA        NA        NA        NA
 8 Afghanistan  1987       NA        NA        NA        NA        NA        NA
 9 Afghanistan  1988       NA        NA        NA        NA        NA        NA
10 Afghanistan  1989       NA        NA        NA        NA        NA        NA
# ℹ 7,230 more rows
# ℹ 50 more variables: sp_m_65 <dbl>, sp_f_014 <dbl>, sp_f_1524 <dbl>,
#   sp_f_2534 <dbl>, sp_f_3544 <dbl>, sp_f_4554 <dbl>, sp_f_5564 <dbl>,
#   sp_f_65 <dbl>, sn_m_014 <dbl>, sn_m_1524 <dbl>, sn_m_2534 <dbl>,
#   sn_m_3544 <dbl>, sn_m_4554 <dbl>, sn_m_5564 <dbl>, sn_m_65 <dbl>,
#   sn_f_014 <dbl>, sn_f_1524 <dbl>, sn_f_2534 <dbl>, sn_f_3544 <dbl>,
#   sn_f_4554 <dbl>, sn_f_5564 <dbl>, sn_f_65 <dbl>, ep_m_014 <dbl>, …

Is it tidy data? Why?

First step for tidy data: we must pivot the table (tip: use paper and pen to sketch how the database should look like) so that there is a column called cases (since all columns starting from year is actually the same, cases of a disease).

who_tidy <-
  who2 |> 
  pivot_longer(cols = -(country:year), names_to = "type", values_to = "cases")
who_tidy

# A tibble: 405,440 × 4
   country      year type      cases
   <chr>       <dbl> <chr>     <dbl>
 1 Afghanistan  1980 sp_m_014     NA
 2 Afghanistan  1980 sp_m_1524    NA
 3 Afghanistan  1980 sp_m_2534    NA
 4 Afghanistan  1980 sp_m_3544    NA
 5 Afghanistan  1980 sp_m_4554    NA
 6 Afghanistan  1980 sp_m_5564    NA
 7 Afghanistan  1980 sp_m_65      NA
 8 Afghanistan  1980 sp_f_014     NA
 9 Afghanistan  1980 sp_f_1524    NA
10 Afghanistan  1980 sp_f_2534    NA
# ℹ 405,430 more rows

If you notice there are a lot of rows that do not make sense to keep because we have no cases! Investigate the pivot_longer() options to see how we can directly remove them in the pivot.

who_tidy <-
  who2 |> 
  pivot_longer(cols = -(country:year), names_to = "type", values_to = "cases",
               values_drop_na = TRUE)
who_tidy

# A tibble: 76,046 × 4
   country      year type      cases
   <chr>       <dbl> <chr>     <dbl>
 1 Afghanistan  1997 sp_m_014      0
 2 Afghanistan  1997 sp_m_1524    10
 3 Afghanistan  1997 sp_m_2534     6
 4 Afghanistan  1997 sp_m_3544     3
 5 Afghanistan  1997 sp_m_4554     5
 6 Afghanistan  1997 sp_m_5564     2
 7 Afghanistan  1997 sp_m_65       0
 8 Afghanistan  1997 sp_f_014      5
 9 Afghanistan  1997 sp_f_1524    38
10 Afghanistan  1997 sp_f_2534    36
# ℹ 76,036 more rows

Now in type we have coded the information as diagnosis_sex_age. How to separate it in 3 columns? Investigate both separate() and pivot_longer() options.

# separate
who_tidy <-
  who_tidy |> 
  separate(col = "type", into = c("diagnosis", "sex", "age"))

# pivot_longer
who_tidy <-
  who2 |> 
  pivot_longer(cols = -(country:year), names_to = c("diagnosis", "sex", "age"),
               values_to = "cases", values_drop_na = TRUE,
               names_sep = "_")
who_tidy

# A tibble: 76,046 × 6
   country      year diagnosis sex   age   cases
   <chr>       <dbl> <chr>     <chr> <chr> <dbl>
 1 Afghanistan  1997 sp        m     014       0
 2 Afghanistan  1997 sp        m     1524     10
 3 Afghanistan  1997 sp        m     2534      6
 4 Afghanistan  1997 sp        m     3544      3
 5 Afghanistan  1997 sp        m     4554      5
 6 Afghanistan  1997 sp        m     5564      2
 7 Afghanistan  1997 sp        m     65        0
 8 Afghanistan  1997 sp        f     014       5
 9 Afghanistan  1997 sp        f     1524     38
10 Afghanistan  1997 sp        f     2534     36
# ℹ 76,036 more rows

Finally, separate in two (age_inf, age_sup) the age range (which are numbers). Think about how to do it since it is not always 4 numbers (if there is no upper age range defined, put Inf as an upper bound).

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

who_tidy <-
  who_tidy |> 
  separate(col = "age", into = c("age_inf", "age_sup"), sep = -2, convert = TRUE)
who_tidy$age_inf <- if_else(is.na(who_tidy$age_inf), 65, who_tidy$age_inf)
who_tidy$age_sup <- if_else(who_tidy$age_sup == 65, Inf, who_tidy$age_sup)
who_tidy

# A tibble: 76,046 × 7
   country      year diagnosis sex   age_inf age_sup cases
   <chr>       <dbl> <chr>     <chr>   <dbl>   <dbl> <dbl>
 1 Afghanistan  1997 sp        m           0      14     0
 2 Afghanistan  1997 sp        m          15      24    10
 3 Afghanistan  1997 sp        m          25      34     6
 4 Afghanistan  1997 sp        m          35      44     3
 5 Afghanistan  1997 sp        m          45      54     5
 6 Afghanistan  1997 sp        m          55      64     2
 7 Afghanistan  1997 sp        m          65     Inf     0
 8 Afghanistan  1997 sp        f           0      14     5
 9 Afghanistan  1997 sp        f          15      24    38
10 Afghanistan  1997 sp        f          25      34    36
# ℹ 76,036 more rows

1.7 💻 It’s your turn

Try to perform the following exercises without looking at the solutions

📝 Use the original who2 dataset from the {tidyr} package and try to answer the question: how many cases of tuberculosis were there in Spain in 1995 among women? Do it without converting it to tidydata. After that, pivot in a simple way and compare the code to be done when we have tidydata to when we don’t. Which one is more readable if you didn’t know R? Which one has higher error probability?

Code

# messy data
sum(who2[who2$country == "Spain" & who2$year == 1995,
     names(who2)[str_detect(names(who2), "f_")]], na.rm = TRUE)

# tidy data (at this moment)
sum(who_tidy[who_tidy$country == "Spain" &
           who_tidy$year == 1995 &
           who_tidy$sex == "f", ]$cases)

# in the future
who_tidy |> 
  filter(country == "Spain" & year == 1995 & sex == "f") |> 
  summarise(sum(cases))

📝 Using who_tidy determine which sex has had more cases, men or women? Create a new variable avg_age (mean age of the interval): if the range is 25 to 34, the mean age will be $(25 + 34)/2 = 29.5$ (if Inf above, NA)

Code

# f vs m
sum(who_tidy[who_tidy$sex == "m", ]$cases)
sum(who_tidy[who_tidy$sex == "f", ]$cases)

# ave age
who_tidy$ave_age <- 
  if_else(is.infinite(who_tidy$age_sup), NA, (who_tidy$age_inf + who_tidy$age_sup)/2)

📝 If we must choose a country in which we have the lowest probability of infection, which country, between the United Kingdom (United Kingdom of Great Britain and Northern Ireland) and France (similar population), had the fewest cases in the most recent year (whichever it was, even if the table was updated)?

Code

last_cases <- who_tidy[who_tidy$year == max(who_tidy$year), ]
sum(last_cases[last_cases$country == "United Kingdom of Great Britain and Northern Ireland", "cases"])
sum(last_cases[last_cases$country == "France", "cases"])
# better France

📝 Take a look at table table4b in package {tidyr}. Is it tidydata? If not, what is wrong, how to convert it to tidy data in case it is not already?

Code

table4b |>
  pivot_longer(cols = "1999":"2000", names_to = "year",
               values_to = "cases")

📝 Take a look at the billboard table in the {tidyr} package. Is it tidydata? If not, what is wrong, how to convert it to tidy data in case it is not already?

Code

billboard |>
  pivot_longer(cols = "wk1":"wk76",
               names_to = "week",
               names_prefix = "wk",
               values_to = "position",
               values_drop_na = TRUE)

2 🐣 Case study: religion income

Let’s perform a case study with the relig_income table of the {tidyr} package. As indicated in the ? relig_income help, the table represents the number of people in each annual income bracket (20k = 20 000$) and in each religion.

library(tidyr)
relig_income

# A tibble: 18 × 11
   religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
   <chr>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
 1 Agnostic      27        34        60        81        76       137        122
 2 Atheist       12        27        37        52        35        70         73
 3 Buddhist      27        21        30        34        33        58         62
 4 Catholic     418       617       732       670       638      1116        949
 5 Don’t k…      15        14        15        11        10        35         21
 6 Evangel…     575       869      1064       982       881      1486        949
 7 Hindu          1         9         7         9        11        34         47
 8 Histori…     228       244       236       238       197       223        131
 9 Jehovah…      20        27        24        24        21        30         15
10 Jewish        19        19        25        25        30        95         69
11 Mainlin…     289       495       619       655       651      1107        939
12 Mormon        29        40        48        51        56       112         85
13 Muslim         6         7         9        10         9        23         16
14 Orthodox      13        17        23        32        32        47         38
15 Other C…       9         7        11        13        13        14         18
16 Other F…      20        33        40        46        49        63         46
17 Other W…       5         2         3         4         2         7          3
18 Unaffil…     217       299       374       365       341       528        407
# ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
#   `Don't know/refused` <dbl>

2.1 Question 1

Is it tidy data? Why?

Code

# It is not since we should really only have one variable of
# income and we have it divided into 11: all of them is the same variable only
# it takes on a different value.

2.2 Question 2

How to convert it to tidy data? Do whatever you consider to end up with a three column table: religion, income and people

Code

# The idea is to pivot all the income columns
# We will make the table longer and less wide so....
relig_tidy <-
  relig_income |>
  pivot_longer(cols = "<$10k":"Don't know/refused", names_to = "income",
               values_to = "people")
relig_tidy

2.3 Question 3

Using the tidy table above, separate the income variable into two, the lower and upper income limits (income_inf and income_sup).

Code

relig_tidy |>
  separate(income, into = c("income_inf", "income_sup"), sep = "-")

2.4 Question 4

If you look at the first column the “$10k” should be an upper dimension, not a lower one. How do you tell it to separate that case?

Code

# We will prompt you to separate whether you find `“-”` or `“<”`
# (we use `|` to separate both options).
relig_tidy |>
  # Separate by -
  separate(income, into = c("income_inf", "income_sup"), sep = "-|<")

2.5 Question 5

Now think about how we can convert the income limits to numerics (removing symbols, letters, etc.).

Code

# To do this we will use the `{stringr}` package, specifically the
# `str_remove_all()` function, to which we can pass the characters 
# we want to remove (`$` being a reserved character must be indicated with `$`).
library(stringr)
relig_tidy$income_inf <-
  str_remove_all(relig_tidy$income_inf, "\\$|>|k")
relig_tidy$income_sup <-
  str_remove_all(relig_tidy$income_sup, "\\$|>|k")

relig_tidy

2.6 Question 6

Where we have “Don't now/refused”…. what should we have?

Code

# It should be missing data: if it contains such a phrase, `NA`
# otherwise its value
relig_tidy$income_inf <-
  if_else(str_detect(relig_tidy$income_inf, "refused"), NA, relig_tidy$income_inf)
relig_tidy$income_sup <-
  if_else(str_detect(relig_tidy$income_sup, "refused"), NA, relig_tidy$income_sup)

# In the first line, that `“”` should also be `NA``
relig_tidy$income_inf <-
  if_else(relig_tidy$income_inf == "", NA, relig_tidy$income_inf)
relig_tidy$income_suop <-
  if_else(relig_tidy$income_sup == "", NA, relig_tidy$income_sup)

2.7 Question 7

If you also notice that the numbers are actually characters, how to convert them?

Code

relig_tidy$income_inf <- as.numeric(relig_tidy$income_inf)
relig_tidy$income_sup <- as.numeric(relig_tidy$income_sup)
relig_tidy

2.8 Question 8

Can you think of any way to “quantify numerically” the missing values we have in this case?

Code

# If you notice actually when there is absent at the lower limit
# we could actually put a 0 (no one can earn less than that)
# and when we have it at the upper limit it would be `Inf`.
relig_tidy$income_inf <-
  if_else(is.na(relig_tidy$income_inf), 0, relig_tidy$income_inf)
relig_tidy$income_sup <-
  if_else(is.na(relig_tidy$income_sup), Inf, relig_tidy$income_sup)
relig_tidy

2.9 Question 9

Why was it important to have it in a tidy format? We will see it later when visualizing the data but this already allows us to perform very quick filters with very little code. For example: how many agnostic people with income greater than (or equal to) 30 do we have?

Code

sum(relig_tidy$people[relig_tidy$religion == "Agnostic" & relig_tidy$income_inf >= 30])