Joins and imports: how to combine, import and export files in R. Factors and lists

Practical workbooks of Data Programming in the Master in Computational Social Science (2024-2025)

Author

Javier Álvarez Liébana

1 Joins

When working with data we will not always have the information in a single table, and sometimes, we will be interested in cross-referencing information from different sources.

For this we will use a classic of every data-handling language: the famous join, a tool that allows us to cross one or more tables, making use of an identifying column in each of them.

table_1 |>
  xxx_join(table_2, by = "id")
  • inner_join(): only records with id in both tables survive.

  • full_join(): keeps all records in both tables.

  • left_join(): keeps all records of the first table and looks up which ones also have their id in the second one (when there is no match, the fields from the 2nd table are filled with NA).

  • right_join(): keeps all records in the second table, and searches which ones have id also in the first one.

Let’s test the various joins with a simple example

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
tb_1 <- tibble("key" = 1:3, "val_x" = c("x1", "x2", "x3"))
tb_2 <- tibble("key" = c(1, 2, 4), "val_y" = c("y1", "y2", "y3"))
tb_1
# A tibble: 3 × 2
    key val_x
  <int> <chr>
1     1 x1   
2     2 x2   
3     3 x3   
tb_2
# A tibble: 3 × 2
    key val_y
  <dbl> <chr>
1     1 y1   
2     2 y2   
3     4 y3   

1.1 left_join()

Imagine that we want to incorporate into tb_1 the information from tb_2, identifying the records by the key column (by = "key", the column used for the crossing): we want to keep all the records of the first table and look up which ones have a matching value of key in the second one.

tb_1 |> 
  left_join(tb_2, by = "key")
# A tibble: 3 × 3
    key val_x val_y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     3 x3    <NA> 

Notice that the records of the first table whose key was not found in the second one have been assigned NA (missing) in the incoming fields.

1.2 right_join()

The right_join() performs the opposite operation: we now incorporate into tb_2 the information from tb_1, identifying the records by the key column: we want to keep all the records of the second table and look up which ones have a matching value of key in the first one.

tb_1 |> 
  right_join(tb_2, by = "key")
# A tibble: 3 × 3
    key val_x val_y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     4 <NA>  y3   

Notice that now it is the records of the second table whose key was not found in the first one that are assigned NA.

1.3 keys and suffixes

The key columns we will use for the crossover will not always be named the same.

tb_1 <- tibble("key_1" = 1:3, "val_x" = c("x1", "x2", "x3"))
tb_2 <- tibble("key_2" = c(1, 2, 4), "val_y" = c("y1", "y2", "y3"))
  • by = c("key_1" = "key_2"): we indicate in which column of each table the keys we are going to cross are found.
# Left
tb_1  |> 
  left_join(tb_2, by = c("key_1" = "key_2"))
# A tibble: 3 × 3
  key_1 val_x val_y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     3 x3    <NA> 
# Right
tb_1 |> 
  right_join(tb_2, by = c("key_1" = "key_2"))
# A tibble: 3 × 3
  key_1 val_x val_y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     4 <NA>  y3   
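As a side note: the dplyr version attached above (1.1.4) also accepts the join_by() helper, which expresses the key pairing with unquoted column names. A minimal sketch, equivalent to the left join above:

```r
library(dplyr)

tb_1 <- tibble("key_1" = 1:3, "val_x" = c("x1", "x2", "x3"))
tb_2 <- tibble("key_2" = c(1, 2, 4), "val_y" = c("y1", "y2", "y3"))

# join_by(key_1 == key_2) is equivalent to by = c("key_1" = "key_2")
res <- tb_1 |>
  left_join(tb_2, by = join_by(key_1 == key_2))
res
```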

We can also cross on several columns at the same time (it will treat as the same record those sharing the whole set of keys), with by = c("var1_t1" = "var1_t2", "var2_t1" = "var2_t2", ...). Let’s modify the previous example

tb_1 <- tibble("k_11" = 1:3, "k_12" = c("a", "b", "c"),  "val_x" = c("x1", "x2", "x3"))
tb_2 <- tibble("k_21" = c(1, 2, 4), "k_22" = c("a", "b", "e"), "val_y" = c("y1", "y2", "y3"))
# Left
tb_1 |> 
  left_join(tb_2,
            by = c("k_11" = "k_21", "k_12" = "k_22"))
# A tibble: 3 × 4
   k_11 k_12  val_x val_y
  <dbl> <chr> <chr> <chr>
1     1 a     x1    y1   
2     2 b     x2    y2   
3     3 c     x3    <NA> 
# Right
tb_1 |> 
  right_join(tb_2,
             by = c("k_11" = "k_21", "k_12" = "k_22"))
# A tibble: 3 × 4
   k_11 k_12  val_x val_y
  <dbl> <chr> <chr> <chr>
1     1 a     x1    y1   
2     2 b     x2    y2   
3     4 e     <NA>  y3   

It could also happen that, when crossing two tables, there are value columns with the same name

tb_1 <- tibble("key_1" = 1:3, "val" = c("x1", "x2", "x3"))
tb_2 <- tibble("key_2" = c(1, 2, 4), "val" = c("y1", "y2", "y3"))
# Left
tb_1 |>
  left_join(tb_2, by = c("key_1" = "key_2"))
# A tibble: 3 × 3
  key_1 val.x val.y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     3 x3    <NA> 

Notice that by default it adds the suffixes .x and .y to tell us which table each column comes from. These suffixes can be specified in the optional argument suffix = ..., which allows us to distinguish the variables of one table from the other.

# Left
tb_1 |>
  left_join(tb_2, by = c("key_1" = "key_2"),
            suffix = c("_table1", "_table2"))
# A tibble: 3 × 3
  key_1 val_table1 val_table2
  <dbl> <chr>      <chr>     
1     1 x1         y1        
2     2 x2         y2        
3     3 x3         <NA>      

1.4 full_join()

The two previous cases belong to what are known as outer joins: crosses that keep observations appearing in at least one table. The third outer join is full_join(), which keeps the observations of both tables, adding the rows that do not match the other table.

tb_1  |> 
  full_join(tb_2, by = c("key_1" = "key_2"))
# A tibble: 4 × 3
  key_1 val.x val.y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     3 x3    <NA> 
4     4 <NA>  y3   

1.5 inner_join()

Opposite to the outer joins is what is known as an inner join, with inner_join(): a join in which only the observations appearing in both tables are kept, i.e. only the records that are matched survive.

tb_1 |> 
  inner_join(tb_2,  by = c("key_1" = "key_2"))
# A tibble: 2 × 3
  key_1 val.x val.y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   

Note that, in terms of records, inner_join() is indeed commutative: the order of the tables does not matter, and the only thing that changes is the order of the resulting columns.

tb_1 |> 
  inner_join(tb_2, by = c("key_1" = "key_2"))
# A tibble: 2 × 3
  key_1 val.x val.y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
tb_2 |> inner_join(tb_1, by = c("key_2" = "key_1"))
# A tibble: 2 × 3
  key_2 val.x val.y
  <dbl> <chr> <chr>
1     1 y1    x1   
2     2 y2    x2   

1.6 semi/anti_join()

Finally we have two interesting tools to filter (rather than cross) records: semi_join() and anti_join(). The semi join keeps in the first table the records whose key is also in the second table (like an inner join but without adding the information from the second table). The anti join does just the opposite (it keeps those that are not).

# semijoin
tb_1 |> 
  semi_join(tb_2, by = c("key_1" = "key_2"))
# A tibble: 2 × 2
  key_1 val  
  <int> <chr>
1     1 x1   
2     2 x2   
# antijoin
tb_1 |> 
  anti_join(tb_2, by = c("key_1" = "key_2"))
# A tibble: 1 × 2
  key_1 val  
  <int> <chr>
1     3 x3   
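To internalize what semi and anti joins do, with a single key column they can be sketched as plain filters on key membership (a didactic equivalence, not how you would normally write them):

```r
library(dplyr)

tb_1 <- tibble("key_1" = 1:3, "val" = c("x1", "x2", "x3"))
tb_2 <- tibble("key_2" = c(1, 2, 4), "val" = c("y1", "y2", "y3"))

# semi_join keeps the rows of tb_1 whose key appears in tb_2 ...
res_semi <- tb_1 |> filter(key_1 %in% tb_2$key_2)
# ... and anti_join keeps the complement
res_anti <- tb_1 |> filter(!key_1 %in% tb_2$key_2)
```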

1.7 💻 It’s your turn

For the exercises we will use the tables available in the package {nycflights13}.

library(nycflights13)
  • airlines: name of airlines (with their abbreviation).
  • airports: airport data (names, longitude, latitude, altitude, etc).
  • flights, planes: flight and aircraft data.
  • weather: hourly weather data.

📝 From the package {nycflights13}, incorporate into the flights table the airline data in airlines. We want to keep all flight records, adding the airline information from the airlines table.

Code
flights_airlines <-
  flights |> 
  left_join(airlines, by = "carrier")
flights_airlines

📝 To the table obtained from the crossing in the previous exercise, add the aircraft data in planes, keeping only those flights for which we have information on their aircraft (and vice versa).

Code
flights_airlines_planes <- 
  flights_airlines |> 
  inner_join(planes, by = "tailnum")
flights_airlines_planes

📝 Repeat the previous exercise but keep both year variables (one contains the year of the flight, the other the year of construction of the aircraft), distinguishing them from each other

Code
flights_airlines_planes <- 
  flights_airlines |>
  inner_join(planes, by = "tailnum",
             suffix = c("_flight", "_build_aircraft"))
flights_airlines_planes

📝 To the table obtained in the previous exercise, add the longitude and latitude of the airports in airports, distinguishing between the latitude/longitude of the destination and origin airports.

Code
flights_airlines_planes |> 
  left_join(airports |> select(faa, lat, lon),
            by = c("origin" = "faa")) |> 
  rename(lat_origin = lat, lon_origin = lon) |> 
  left_join(airports |> select(faa, lat, lon),
            by = c("dest" = "faa")) |> 
  rename(lat_dest = lat, lon_dest = lon)

📝 Filter from airports only those airports from which flights depart. Repeat the process filtering only those airports where flights arrive.

Code
airports |> 
  semi_join(flights, by = c("faa" = "origin"))
airports |> 
  semi_join(flights, by = c("faa" = "dest"))

📝 For how many flights do we not have information about the aircraft? Beforehand, remove the flights that do not have an aircraft ID (i.e. with tailnum equal to NA).

Code
flights |>
  drop_na(tailnum) |>
  anti_join(planes, by = "tailnum") |>
  count(tailnum, sort = TRUE) 

2 🐣 Case study I: Beatles and Rolling Stones

We will use to practice simple joins the band_members and band_instruments datasets already included in the {dplyr} package.

library(dplyr)
band_members
# A tibble: 3 × 2
  name  band   
  <chr> <chr>  
1 Mick  Stones 
2 John  Beatles
3 Paul  Beatles
band_instruments
# A tibble: 3 × 2
  name  plays 
  <chr> <chr> 
1 John  guitar
2 Paul  bass  
3 Keith guitar

In the first one we have a series of artists and the band they belong to; in the second one we have a series of artists and the instrument they play. Beyond performing the requested actions, try to picture beforehand what final table you would obtain

2.1 Question 1

Given the table band_members, incorporate the information on which instrument each member plays (from band_instruments), for the members you have in that table.

Code
left_join_band <-
  band_members |> 
  left_join(band_instruments, by = "name")

2.2 Question 2

Given the band_members and band_instruments tables, what kind of join should you do to have a complete table, with no missing values, where all band members have their instrument information and every instrument has an associated member?

Code
inner_join_band <-
  band_members |>
  inner_join(band_instruments, by = "name")

2.3 Question 3

Given the band_instruments table, how can we incorporate the information on who plays each instrument (when we know it)?

Code
right_join_band <-
  band_members |>
  right_join(band_instruments, by = "name")

# other option
left_join_instruments <-
  band_instruments |> 
  left_join(band_members, by = "name")

2.4 Question 4

Given the band_members and band_instruments tables, what kind of join should you do to have a table with all the info, both members and instruments, even if there are members whose instrument you do not know, and instruments whose carrier you do not know?

Code
full_join_band <-
  band_members |>
  full_join(band_instruments, by = "name")

3 🐣 Case study II: income by municipalities

In the file municipios.csv we have stored the information of the municipalities of Spain as of 2019.

# 2019 data
mun_data <- read_csv(file = "./datos/municipios.csv")
Rows: 8212 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): codauto, ine.ccaa.name, cpro, ine.prov.name, cmun, name, LAU_CODE, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

On the other hand, in the file renta_mun we have the average per capita income of each administrative unit (municipalities, districts, provinces, autonomous communities,…) for different years.

renta_mun <- read_csv(file = "./datos/renta_mun.csv")
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 55273 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Unidad, codigo_ine
dbl (5): 2019, 2018, 2017, 2016, 2015

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Before we start let’s normalize variable names using clean_names() from the {janitor} package.

mun_data <-
  mun_data |> 
  janitor::clean_names()
renta_mun <-
  renta_mun |> 
  janitor::clean_names()

3.1 Question 1

Convert renta_mun to tidy data, obtaining a table of 4 columns: unit, year, income and codigo_ine (no missing values and each variable of the correct type)

Code
renta_mun_tidy <-
  renta_mun |> 
  pivot_longer(cols = contains("x"), names_to = "year",
               values_to = "income", names_prefix = "x",
               names_transform = list(year = as.numeric),
               values_drop_na = TRUE)

3.2 Question 2

If you look at the table above we have data from different administrative units that are not always municipalities. Knowing that all municipalities have a 5-character code, filter only those records that correspond to municipal units.

Code
renta_mun_tidy <-
  renta_mun_tidy |>
  filter(str_detect(codigo_ine, pattern = "[0-9]{5}") & 
           str_length(codigo_ine) == 5)

3.3 Question 3

Then properly separate the unit variable into two columns: one with the code (which you already have so one of the two should be removed) and the name. Remove any remaining spaces (take a look at the {stringr} package options).

Code
renta_mun_tidy <-
  renta_mun_tidy |>
  separate(col = "unidad", into = c("cod_rm", "name"), sep = 5) |> 
  select(-cod_rm) |> 
  mutate(name = str_trim(name)) 

3.4 Question 4

In which year was the average income highest? And lowest? What was the median income of the municipalities of Spain in 2019?

summary_renta <-
  renta_mun_tidy |> 
  summarise("mean_income" = mean(income, na.rm = TRUE),
            .by = year)
summary_renta |>
  slice_min(mean_income, n = 1)
# A tibble: 1 × 2
   year mean_income
  <dbl>       <dbl>
1  2015      10083.
summary_renta |>
  slice_max(mean_income, n = 1)
# A tibble: 1 × 2
   year mean_income
  <dbl>       <dbl>
1  2019      11672.
renta_mun_tidy |> 
  filter(year == 2019) |> 
  summarise("median_income" = median(income, na.rm = TRUE))
# A tibble: 1 × 1
  median_income
          <dbl>
1         11462

3.5 Question 5

Do whatever you consider necessary to obtain the province with the highest average income in 2019 and the one with the lowest. Be sure to get their names.

Code
summary_by_prov <-
  renta_mun_tidy |> 
  filter(year == 2019) |> 
  left_join(mun_data, by = "codigo_ine", suffix = c("", "_rm")) |> 
  select(-contains("rm")) |> 
  summarise("mean_by_prov" = mean(income, na.rm = TRUE),
            .by = c("cpro", "ine_prov_name"))

summary_by_prov |> 
  slice_max(mean_by_prov, n = 1)
# A tibble: 1 × 3
  cpro  ine_prov_name mean_by_prov
  <chr> <chr>                <dbl>
1 20    Gipuzkoa            15890.
Code
summary_by_prov |> 
  slice_min(mean_by_prov, n = 1)
# A tibble: 1 × 3
  cpro  ine_prov_name mean_by_prov
  <chr> <chr>                <dbl>
1 06    Badajoz              8805.

3.6 Question 6

Obtain, for each ccaa, the name of the municipality with the highest income in 2019.

Code
renta_mun_tidy |> 
  filter(year == 2019) |> 
  left_join(mun_data, by = "codigo_ine", suffix = c("", "_rm")) |> 
  select(-contains("rm")) |> 
  slice_max(income, by = "codauto")
# A tibble: 19 × 10
   name  codigo_ine  year income codauto ine_ccaa_name cpro  ine_prov_name cmun 
   <chr> <chr>      <dbl>  <dbl> <chr>   <chr>         <chr> <chr>         <chr>
 1 Nava… 40904       2019  20242 07      Castilla y L… 40    Segovia       904  
 2 Alma… 10019       2019  13940 11      Extremadura   10    Cáceres       019  
 3 Olei… 15058       2019  16447 12      Galicia       15    Coruña, A     058  
 4 Lauk… 48053       2019  21672 16      País Vasco    48    Bizkaia       053  
 5 Lumb… 26091       2019  24007 17      Rioja, La     26    Rioja, La     091  
 6 Murc… 30030       2019  11631 14      Murcia, Regi… 30    Murcia        030  
 7 Raja… 08178       2019  23401 09      Cataluña      08    Barcelona     178  
 8 Aloc… 19023       2019  19700 08      Castilla - L… 19    Guadalajara   023  
 9 Argu… 22037       2019  19219 02      Aragón        22    Huesca        037  
10 Alcu… 04009       2019  14196 01      Andalucía     04    Almería       009  
11 Aran… 31023       2019  15517 15      Navarra, Com… 31    Navarra       023  
12 Sant… 35021       2019  15982 05      Canarias      35    Palmas, Las   021  
13 Roca… 46216       2019  17872 10      Comunitat Va… 46    Valencia/Val… 216  
14 Pozu… 28115       2019  26367 13      Madrid, Comu… 28    Madrid        115  
15 Vall… 07063       2019  19565 04      Balears, Ill… 07    Balears, Ill… 063  
16 Vald… 39093       2019  15574 06      Cantabria     39    Cantabria     093  
17 Teve… 33072       2019  15409 03      Asturias, Pr… 33    Asturias      072  
18 Ceuta 51001       2019  12038 18      Ceuta         51    Ceuta         001  
19 Meli… 52001       2019  11463 19      Melilla       52    Melilla       001  
# ℹ 1 more variable: lau_code <chr>

4 Import and export files

So far we have only used data already loaded in packages but many times we will need to import data externally. One of the main strengths of R is that we can import data very easily in different formats:

  • R native formats: .rda, .RData and .rds formats.

  • Rectangular (tabular) data: .csv and .tsv formats

  • Untabulated data: .txt format.

  • Data in Excel: .xls and .xlsx formats

  • Data from SAS/Stata/SPSS: .sas7bdat, .sav and .dat formats

  • Data from Google Drive

  • Data from APIs: AEMET, catastro, Twitter, Spotify, etc.

4.1 R native formats

The simplest files to import into R (and usually the ones taking up the least disk space) are those in its own native formats: .RData, .rda and .rds. To load the first two we simply use the native function load(), providing it the file path.

  • RData file: we are going to import the file world_bank_pop.RData, which includes the dataset world_bank_pop
load("./datos/world_bank_pop.RData")
world_bank_pop
# A tibble: 1,064 × 20
   country indicator      `2000`  `2001`  `2002`  `2003`  `2004`  `2005`  `2006`
   <chr>   <chr>           <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 ABW     SP.URB.TOTL    4.16e4 4.20e+4 4.22e+4 4.23e+4 4.23e+4 4.24e+4 4.26e+4
 2 ABW     SP.URB.GROW    1.66e0 9.56e-1 4.01e-1 1.97e-1 9.46e-2 1.94e-1 3.67e-1
 3 ABW     SP.POP.TOTL    8.91e4 9.07e+4 9.18e+4 9.27e+4 9.35e+4 9.45e+4 9.56e+4
 4 ABW     SP.POP.GROW    2.54e0 1.77e+0 1.19e+0 9.97e-1 9.01e-1 1.00e+0 1.18e+0
 5 AFE     SP.URB.TOTL    1.16e8 1.20e+8 1.24e+8 1.29e+8 1.34e+8 1.39e+8 1.44e+8
 6 AFE     SP.URB.GROW    3.60e0 3.66e+0 3.72e+0 3.71e+0 3.74e+0 3.81e+0 3.81e+0
 7 AFE     SP.POP.TOTL    4.02e8 4.12e+8 4.23e+8 4.34e+8 4.45e+8 4.57e+8 4.70e+8
 8 AFE     SP.POP.GROW    2.58e0 2.59e+0 2.61e+0 2.62e+0 2.64e+0 2.67e+0 2.70e+0
 9 AFG     SP.URB.TOTL    4.31e6 4.36e+6 4.67e+6 5.06e+6 5.30e+6 5.54e+6 5.83e+6
10 AFG     SP.URB.GROW    1.86e0 1.15e+0 6.86e+0 7.95e+0 4.59e+0 4.47e+0 5.03e+0
# ℹ 1,054 more rows
# ℹ 11 more variables: `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>,
#   `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>,
#   `2016` <dbl>, `2017` <dbl>
  • .rda file: we will import the airquality dataset from airquality.rda
load("./datos/airquality.rda")
airquality |> as_tibble()
# A tibble: 153 × 6
   Ozone Solar.R  Wind  Temp Month   Day
   <int>   <int> <dbl> <int> <int> <int>
 1    41     190   7.4    67     5     1
 2    36     118   8      72     5     2
 3    12     149  12.6    74     5     3
 4    18     313  11.5    62     5     4
 5    NA      NA  14.3    56     5     5
 6    28      NA  14.9    66     5     6
 7    23     299   8.6    65     5     7
 8    19      99  13.8    59     5     8
 9     8      19  20.1    61     5     9
10    NA     194   8.6    69     5    10
# ℹ 143 more rows

Note that files loaded with load() are placed directly into the environment (under their originally saved names), and not only datasets can be loaded: load() can restore multiple objects of any kind (not just tabular data)

Native .rda and .RData files are a proper way to save your environment.

load(file = "./datos/multiple_objects.rda")
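For reference, a file like multiple_objects.rda could have been created with the base function save(), which stores any set of named objects (the objects below are illustrative, not the real contents of the file):

```r
# Two arbitrary (hypothetical) objects: a data frame and a character vector
tb <- data.frame(x = 1:3)
v <- c("a", "b")

# save() writes them together; here to a temporary file
path <- tempfile(fileext = ".rda")
save(tb, v, file = path)

# load() restores both objects under their original names
rm(tb, v)
load(file = path)
```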
  • .rds files: for this type we must use readRDS(), passing the path in the file argument. In this case we are going to import lung cancer data from the North Central Cancer Treatment Group. Note that an .rds file stores a single object (here, one dataset)
lung_cancer <-
  readRDS(file = "./datos/NCCTG_lung_cancer.rds") |>
  as_tibble()
lung_cancer
# A tibble: 228 × 10
    inst  time status   age   sex ph.ecog ph.karno pat.karno meal.cal wt.loss
   <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>    <dbl>     <dbl>    <dbl>   <dbl>
 1     3   306      2    74     1       1       90       100     1175      NA
 2     3   455      2    68     1       0       90        90     1225      15
 3     3  1010      1    56     1       0       90        90       NA      15
 4     5   210      2    57     1       1       90        60     1150      11
 5     1   883      2    60     1       0      100        90       NA       0
 6    12  1022      1    74     1       1       50        80      513       0
 7     7   310      2    68     2       2       70        60      384      10
 8    11   361      2    71     2       2       60        80      538       1
 9     1   218      2    53     1       1       70        80      825      16
10     7   166      2    61     1       2       70        70      271      34
# ℹ 218 more rows
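The export counterpart of readRDS() is the base function saveRDS(), which writes exactly one object; unlike load(), on reading you choose the name yourself. A minimal sketch with a toy data frame (illustrative, not the real lung cancer data):

```r
# A toy data frame standing in for any dataset
lung_small <- data.frame(time = c(306, 455), status = c(2, 2))

# saveRDS() stores a single object; readRDS() returns it
path <- tempfile(fileext = ".rds")
saveRDS(lung_small, file = path)
same_data <- readRDS(file = path)
identical(lung_small, same_data)
```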
Important

Paths must always be free of spaces, ñ and accents.
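One way to reduce path problems (an extra tip, not used elsewhere in this workbook) is to build paths with the base helper file.path(), which joins the pieces with the correct separator:

```r
# Build the path from clean components instead of typing it by hand
ruta <- file.path(".", "datos", "chickens.csv")
ruta
```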

4.2 Tabular data: readr

The {readr} package within the {tidyverse} environment contains several useful functions for loading rectangular data (without formatting).

  • read_csv(): .csv files whose separator is a comma.
  • read_csv2(): .csv files whose separator is a semicolon.
  • read_tsv(): files whose separator is a tab.
  • read_table(): files whose separator is a space.
  • read_delim(): generic function for character-delimited files.

All of them need the file path as an argument, plus other optional arguments (whether to skip a header, the decimal separator, etc). See more at https://readr.tidyverse.org/

4.2.1 .csv, .tsv

The main advantage of {readr} is that it automates formatting to go from a flat (unformatted) file to a tibble (in rows and columns, with formatting).

  • File .csv: with read_csv() we will load comma-separated files, passing the path in file = .... Let’s import the chickens.csv dataset (about cartoon chickens, why not). If you look at the output, it reports the type of each variable.
library(readr)
chickens <- read_csv(file = "./datos/chickens.csv")
Rows: 5 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): chicken, sex, motto
dbl (1): eggs_laid

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
chickens
# A tibble: 5 × 4
  chicken                 sex     eggs_laid motto                               
  <chr>                   <chr>       <dbl> <chr>                               
1 Foghorn Leghorn         rooster         0 That's a joke, ah say, that's a jok…
2 Chicken Little          hen             3 The sky is falling!                 
3 Ginger                  hen            12 Listen. We'll either die free chick…
4 Camilla the Chicken     hen             7 Bawk, buck, ba-gawk.                
5 Ernie The Giant Chicken rooster         0 Put Captain Solo in the cargo hold. 

The variable formats will normally be set automatically by read_csv(), and we can query them with spec().

spec(chickens)
cols(
  chicken = col_character(),
  sex = col_character(),
  eggs_laid = col_double(),
  motto = col_character()
)

Although it usually guesses well automatically, we can specify the formats explicitly in col_types = list() (in list format, with a col_xxx() for each variable; in this example eggs_laid will be imported as character).

chickens <-
  read_csv(file = "./datos/chickens.csv",
           col_types = list(col_character(), col_character(),
                            col_character(), col_character()))
chickens
# A tibble: 5 × 4
  chicken                 sex     eggs_laid motto                               
  <chr>                   <chr>   <chr>     <chr>                               
1 Foghorn Leghorn         rooster 0         That's a joke, ah say, that's a jok…
2 Chicken Little          hen     3         The sky is falling!                 
3 Ginger                  hen     12        Listen. We'll either die free chick…
4 Camilla the Chicken     hen     7         Bawk, buck, ba-gawk.                
5 Ernie The Giant Chicken rooster 0         Put Captain Solo in the cargo hold. 

We can even indicate which variables we want to select (without loading the rest into memory), by specifying them in col_select = ....

chickens <-
  read_csv(file = "./datos/chickens.csv",
           col_select = c(chicken, sex, eggs_laid))
Rows: 5 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): chicken, sex
dbl (1): eggs_laid

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
chickens
# A tibble: 5 × 3
  chicken                 sex     eggs_laid
  <chr>                   <chr>       <dbl>
1 Foghorn Leghorn         rooster         0
2 Chicken Little          hen             3
3 Ginger                  hen            12
4 Camilla the Chicken     hen             7
5 Ernie The Giant Chicken rooster         0

4.2.2 .txt

What happens when the separator is not correct?

If we use read_csv(), it expects the column separator to be a comma but, as you can see with the following .txt, it interprets everything as a single column: the file has no commas, so it does not know where to split

datos_txt <- read_csv(file = "./datos/massey-rating.txt")
Rows: 10 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): UCC PAY LAZ KPK  RT   COF BIH DII ENG ACU Rank Team            Conf

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(datos_txt)
[1] 10  1
as_tibble(datos_txt)
# A tibble: 10 × 1
   `UCC PAY LAZ KPK  RT   COF BIH DII ENG ACU Rank Team            Conf`
   <chr>                                                                
 1 1   1   1   1   1     1   1   1   1   1    1 Ohio St          B10    
 2 2   2   2   2   2     2   2   2   4   2    2 Oregon           P12    
 3 3   4   3   4   3     4   3   4   2   3    3 Alabama          SEC    
 4 4   3   4   3   4     3   5   3   3   4    4 TCU              B12    
 5 6   6   6   5   5     7   6   5   6  11    5 Michigan St      B10    
 6 7   7   7   6   7     6  11   8   7   8    6 Georgia          SEC    
 7 5   5   5   7   6     8   4   6   5   5    7 Florida St       ACC    
 8 8   8   9   9  10     5   7   7  10   7    8 Baylor           B12    
 9 9  11   8  13  11    11  12   9  14   9    9 Georgia Tech     ACC    
10 13  10  13  11   8     9  10  11   9  10   10 Mississippi      SEC   

To handle other separators we have:

  • read_csv2() when the separator is a semicolon, read_tsv() when the separator is a tab and read_table() when it is a space.

  • read_delim() for the general case.

datos_txt <- read_table(file = "./datos/massey-rating.txt")

── Column specification ────────────────────────────────────────────────────────
cols(
  UCC = col_double(),
  PAY = col_double(),
  LAZ = col_double(),
  KPK = col_double(),
  RT = col_double(),
  COF = col_double(),
  BIH = col_double(),
  DII = col_double(),
  ENG = col_double(),
  ACU = col_double(),
  Rank = col_double(),
  Team = col_character(),
  Conf = col_character()
)
Warning: 10 parsing failures.
row col   expected     actual                        file
  1  -- 13 columns 15 columns './datos/massey-rating.txt'
  2  -- 13 columns 14 columns './datos/massey-rating.txt'
  3  -- 13 columns 14 columns './datos/massey-rating.txt'
  4  -- 13 columns 14 columns './datos/massey-rating.txt'
  5  -- 13 columns 15 columns './datos/massey-rating.txt'
... ... .......... .......... ...........................
See problems(...) for more details.
as_tibble(datos_txt)
# A tibble: 10 × 13
     UCC   PAY   LAZ   KPK    RT   COF   BIH   DII   ENG   ACU  Rank Team  Conf 
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
 1     1     1     1     1     1     1     1     1     1     1     1 Ohio  St   
 2     2     2     2     2     2     2     2     2     4     2     2 Oreg… P12  
 3     3     4     3     4     3     4     3     4     2     3     3 Alab… SEC  
 4     4     3     4     3     4     3     5     3     3     4     4 TCU   B12  
 5     6     6     6     5     5     7     6     5     6    11     5 Mich… St   
 6     7     7     7     6     7     6    11     8     7     8     6 Geor… SEC  
 7     5     5     5     7     6     8     4     6     5     5     7 Flor… St   
 8     8     8     9     9    10     5     7     7    10     7     8 Bayl… B12  
 9     9    11     8    13    11    11    12     9    14     9     9 Geor… Tech 
10    13    10    13    11     8     9    10    11     9    10    10 Miss… SEC  
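read_delim() has not been shown yet; it takes the separator in its delim argument. A minimal sketch parsing an inline literal string (wrapped in I(), supported since readr 2.0):

```r
library(readr)

# Pipe-separated data supplied literally instead of from a file
datos <- read_delim(I("team|points\nOhio St|10\nOregon|9"),
                    delim = "|", show_col_types = FALSE)
datos
```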

4.3 Excel data (.xls, .xlsx)

Another essential import package will be {readxl}, for importing data from Excel. Three functions will be key:

  • read_xls() specific to .xls, read_xlsx() specific to .xlsx.
  • read_excel(): for both .xls and .xlsx.

We are going to import deaths.xlsx with celebrity death records.

library(readxl)
deaths <- read_xlsx(path = "./datos/deaths.xlsx")
New names:
• `` -> `...2`
• `` -> `...3`
• `` -> `...4`
• `` -> `...5`
• `` -> `...6`
deaths
# A tibble: 18 × 6
   `Lots of people`             ...2                     ...3  ...4  ...5  ...6 
   <chr>                        <chr>                    <chr> <chr> <chr> <chr>
 1 simply cannot resist writing <NA>                     <NA>  <NA>  <NA>  some…
 2 at                           the                      top   <NA>  of    thei…
 3 or                           merging                  <NA>  <NA>  <NA>  cells
 4 Name                         Profession               Age   Has … Date… Date…
 5 David Bowie                  musician                 69    TRUE  17175 42379
 6 Carrie Fisher                actor                    60    TRUE  20749 42731
 7 Chuck Berry                  musician                 90    TRUE  9788  42812
 8 Bill Paxton                  actor                    61    TRUE  20226 42791
 9 Prince                       musician                 57    TRUE  21343 42481
10 Alan Rickman                 actor                    69    FALSE 16854 42383
11 Florence Henderson           actor                    82    TRUE  12464 42698
12 Harper Lee                   author                   89    FALSE 9615  42419
13 Zsa Zsa Gábor                actor                    99    TRUE  6247  42722
14 George Michael               musician                 53    FALSE 23187 42729
15 Some                         <NA>                     <NA>  <NA>  <NA>  <NA> 
16 <NA>                         also like to write stuff <NA>  <NA>  <NA>  <NA> 
17 <NA>                         <NA>                     at t… bott… <NA>  <NA> 
18 <NA>                         <NA>                     <NA>  <NA>  <NA>  too! 
deaths |> slice(1:6)
# A tibble: 6 × 6
  `Lots of people`             ...2       ...3  ...4     ...5          ...6     
  <chr>                        <chr>      <chr> <chr>    <chr>         <chr>    
1 simply cannot resist writing <NA>       <NA>  <NA>     <NA>          some not…
2 at                           the        top   <NA>     of            their sp…
3 or                           merging    <NA>  <NA>     <NA>          cells    
4 Name                         Profession Age   Has kids Date of birth Date of …
5 David Bowie                  musician   69    TRUE     17175         42379    
6 Carrie Fisher                actor      60    TRUE     20749         42731    

A very common misfortune is that the file begins with some kind of comment or text, forcing us to skip those rows.

We can skip these rows directly in the load with skip = ... (indicating the number of rows to skip).

deaths <- read_xlsx(path = "./datos/deaths.xlsx", skip = 4)
deaths
# A tibble: 14 × 6
   Name          Profession Age   `Has kids` `Date of birth`     `Date of death`
   <chr>         <chr>      <chr> <chr>      <dttm>              <chr>          
 1 David Bowie   musician   69    TRUE       1947-01-08 00:00:00 42379          
 2 Carrie Fisher actor      60    TRUE       1956-10-21 00:00:00 42731          
 3 Chuck Berry   musician   90    TRUE       1926-10-18 00:00:00 42812          
 4 Bill Paxton   actor      61    TRUE       1955-05-17 00:00:00 42791          
 5 Prince        musician   57    TRUE       1958-06-07 00:00:00 42481          
 6 Alan Rickman  actor      69    FALSE      1946-02-21 00:00:00 42383          
 7 Florence Hen… actor      82    TRUE       1934-02-14 00:00:00 42698          
 8 Harper Lee    author     89    FALSE      1926-04-28 00:00:00 42419          
 9 Zsa Zsa Gábor actor      99    TRUE       1917-02-06 00:00:00 42722          
10 George Micha… musician   53    FALSE      1963-06-25 00:00:00 42729          
11 Some          <NA>       <NA>  <NA>       NA                  <NA>           
12 <NA>          also like… <NA>  <NA>       NA                  <NA>           
13 <NA>          <NA>       at t… bottom,    NA                  <NA>           
14 <NA>          <NA>       <NA>  <NA>       NA                  too!           

In addition, with col_names = ... we can rename the columns directly on import (if we provide names, the 1st line of the file is already assumed to be data).

deaths <-
  read_xlsx(path = "./datos/deaths.xlsx", skip = 5,
            col_names = c("name", "profession", "age", "kids", "birth", "death"))
deaths
# A tibble: 14 × 6
   name               profession           age   kids  birth               death
   <chr>              <chr>                <chr> <chr> <dttm>              <chr>
 1 David Bowie        musician             69    TRUE  1947-01-08 00:00:00 42379
 2 Carrie Fisher      actor                60    TRUE  1956-10-21 00:00:00 42731
 3 Chuck Berry        musician             90    TRUE  1926-10-18 00:00:00 42812
 4 Bill Paxton        actor                61    TRUE  1955-05-17 00:00:00 42791
 5 Prince             musician             57    TRUE  1958-06-07 00:00:00 42481
 6 Alan Rickman       actor                69    FALSE 1946-02-21 00:00:00 42383
 7 Florence Henderson actor                82    TRUE  1934-02-14 00:00:00 42698
 8 Harper Lee         author               89    FALSE 1926-04-28 00:00:00 42419
 9 Zsa Zsa Gábor      actor                99    TRUE  1917-02-06 00:00:00 42722
10 George Michael     musician             53    FALSE 1963-06-25 00:00:00 42729
11 Some               <NA>                 <NA>  <NA>  NA                  <NA> 
12 <NA>               also like to write … <NA>  <NA>  NA                  <NA> 
13 <NA>               <NA>                 at t… bott… NA                  <NA> 
14 <NA>               <NA>                 <NA>  <NA>  NA                  too! 

Sometimes Excel dates come through incorrectly formatted as serial numbers (surprise): we can use convertToDate() from the {openxlsx} package to convert them.

library(openxlsx)
deaths$death <- convertToDate(deaths$death)
Warning in convertToDate(deaths$death): NAs introduced by coercion
deaths
# A tibble: 14 × 6
   name               profession      age   kids  birth               death     
   <chr>              <chr>           <chr> <chr> <dttm>              <date>    
 1 David Bowie        musician        69    TRUE  1947-01-08 00:00:00 2016-01-10
 2 Carrie Fisher      actor           60    TRUE  1956-10-21 00:00:00 2016-12-27
 3 Chuck Berry        musician        90    TRUE  1926-10-18 00:00:00 2017-03-18
 4 Bill Paxton        actor           61    TRUE  1955-05-17 00:00:00 2017-02-25
 5 Prince             musician        57    TRUE  1958-06-07 00:00:00 2016-04-21
 6 Alan Rickman       actor           69    FALSE 1946-02-21 00:00:00 2016-01-14
 7 Florence Henderson actor           82    TRUE  1934-02-14 00:00:00 2016-11-24
 8 Harper Lee         author          89    FALSE 1926-04-28 00:00:00 2016-02-19
 9 Zsa Zsa Gábor      actor           99    TRUE  1917-02-06 00:00:00 2016-12-18
10 George Michael     musician        53    FALSE 1963-06-25 00:00:00 2016-12-25
11 Some               <NA>            <NA>  <NA>  NA                  NA        
12 <NA>               also like to w… <NA>  <NA>  NA                  NA        
13 <NA>               <NA>            at t… bott… NA                  NA        
14 <NA>               <NA>            <NA>  <NA>  NA                  NA        
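Those serial numbers are not arbitrary: Excel stores dates as day counts from its epoch, 1899-12-30, which is the same origin convertToDate() uses. As a minimal sketch, the conversion can also be done in base R without any extra package:

```r
# Excel stores dates as serial day counts from 1899-12-30,
# so base R can convert a serial number directly
excel_serial <- 42379  # David Bowie's date of death above

as.Date(excel_serial, origin = "1899-12-30")
#> [1] "2016-01-10"
```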

We can also load an Excel file with several sheets: to indicate the sheet (either by its name or by its number) we will use the argument sheet = ....

mtcars <- read_xlsx(path = "./datos/datasets.xlsx", sheet = "mtcars")
mtcars
# A tibble: 32 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# ℹ 22 more rows

We can even indicate the range of cells to load with range = ....

iris <- read_xlsx(path = "./datos/datasets.xlsx", sheet = "iris", range = "C1:E4")
iris
# A tibble: 3 × 3
  Petal.Length Petal.Width Species
         <dbl>       <dbl> <chr>  
1          1.4         0.2 setosa 
2          1.4         0.2 setosa 
3          1.3         0.2 setosa 

4.4 Import from SAS/STATA/SPSS

The {haven} package, within the tidyverse orbit, will allow us to import files from the 3 most important proprietary statistical packages: SAS, SPSS and Stata.

library(haven)

# SAS
iris_sas <- read_sas(data_file = "./datos/iris.sas7bdat")

# SPSS
iris_spss <- read_sav(file = "./datos/iris.sav")

# Stata
iris_stata <- read_dta(file = "./datos/iris.dta")

4.5 Export

In the same way that we can import, we can also export.

  • export to .RData (the recommended option for objects created in R). Remember that this extension can only be read from R. To do so, just use save(object, file = path).
table <- tibble("a" = 1:4, "b" = 1:4)
save(table, file = "./datos/table.RData")
rm(table) # delete the object
load("./datos/table.RData")
table
# A tibble: 4 × 2
      a     b
  <int> <int>
1     1     1
2     2     2
3     3     3
4     4     4
  • export multiple objects to a single .RData file
table <- tibble("a" = 1:4, "b" = 1:4)
a <- 1
b <- c("javi", "sandra")
save(table, a, b, file = "./datos/mult_obj.RData")
rm(list = c("a", "b", "table"))
load("./datos/mult_obj.RData")
table
# A tibble: 4 × 2
      a     b
  <int> <int>
1     1     1
2     2     2
3     3     3
4     4     4
  • export to .csv. To do this we simply use write_csv(object, file = path).
write_csv(table, file = "./datos/table.csv")
read_csv(file = "./datos/table.csv")
Rows: 4 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): a, b

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 4 × 2
      a     b
  <dbl> <dbl>
1     1     1
2     2     2
3     3     3
4     4     4

4.6 Import from new sources

4.6.1 Website

One of the main advantages of R is that we can use all the previous import functions directly from a website, without needing to download anything manually: instead of a local path we simply pass the link. For example, we are going to download the covid data from the ISCIII (https://cnecovid.isciii.es/covid19/#documentaci%C3%B3n-y-datos)

covid_data <-
  read_csv(file = "https://cnecovid.isciii.es/covid19/resources/casos_hosp_uci_def_sexo_edad_provres.csv")
covid_data
Rows: 500 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): provincia_iso, sexo, grupo_edad
dbl  (4): num_casos, num_hosp, num_uci, num_def
date (1): fecha

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 500 × 8
   provincia_iso sexo  grupo_edad fecha      num_casos num_hosp num_uci num_def
   <chr>         <chr> <chr>      <date>         <dbl>    <dbl>   <dbl>   <dbl>
 1 A             H     0-9        2020-01-01         0        0       0       0
 2 A             H     10-19      2020-01-01         0        0       0       0
 3 A             H     20-29      2020-01-01         0        0       0       0
 4 A             H     30-39      2020-01-01         0        0       0       0
 5 A             H     40-49      2020-01-01         0        0       0       0
 6 A             H     50-59      2020-01-01         0        0       0       0
 7 A             H     60-69      2020-01-01         0        0       0       0
 8 A             H     70-79      2020-01-01         0        0       0       0
 9 A             H     80+        2020-01-01         0        0       0       0
10 A             H     NC         2020-01-01         0        0       0       0
# ℹ 490 more rows

4.6.2 Wikipedia

The {rvest} package, one of the most useful in the {tidyverse} orbit, allows us to import directly from an html file. For example, to extract Wikipedia tables we just use read_html() to import the html, html_element("table") to extract the table objects, and html_table() to convert the html table into a tibble.

library(rvest)

Attaching package: 'rvest'
The following object is masked from 'package:readr':

    guess_encoding
wiki_jump <- 'https://en.wikipedia.org/wiki/Men%27s_long_jump_world_record_progression'
wiki_jump |> read_html() |> 
  html_element("table") |> 
  html_table()
# A tibble: 19 × 5
   Mark                       Wind   Athlete                  Place        Date 
   <chr>                      <chr>  <chr>                    <chr>        <chr>
 1 7.61 m (24 ft 11+1⁄2 in)   ""     Peter O'Connor (IRE)     Dublin, Ire… 5 Au…
 2 7.69 m (25 ft 2+3⁄4 in)    ""     Edward Gourdin (USA)     Cambridge, … 23 J…
 3 7.76 m (25 ft 5+1⁄2 in)    ""     Robert LeGendre (USA)    Paris, Fran… 7 Ju…
 4 7.89 m (25 ft 10+1⁄2 in)   ""     DeHart Hubbard (USA)     Chicago, Un… 13 J…
 5 7.90 m (25 ft 11 in)       ""     Edward Hamm (USA)        Cambridge, … 7 Ju…
 6 7.93 m (26 ft 0 in)        "0.0"  Sylvio Cator (HAI)       Paris, Fran… 9 Se…
 7 7.98 m (26 ft 2 in)        "0.5"  Chuhei Nambu (JPN)       Tokyo, Japan 27 O…
 8 8.13 m (26 ft 8 in)        "1.5"  Jesse Owens (USA)        Ann Arbor, … 25 M…
 9 8.21 m (26 ft 11 in)       "0.0"  Ralph Boston (USA)       Walnut, Uni… 12 A…
10 8.24 m (27 ft 1⁄4 in)      "1.8"  Ralph Boston (USA)       Modesto, Un… 27 M…
11 8.28 m (27 ft 1+3⁄4 in)    "1.2"  Ralph Boston (USA)       Moscow, Sov… 16 J…
12 8.31 m (27 ft 3 in) A      "−0.1" Igor Ter-Ovanesyan (URS) Yerevan, So… 10 J…
13 8.33 m (27 ft 3+3⁄4 in)[2] ""     Phil Shinnick (USA)      Modesto, Un… 25 M…
14 8.31 m (27 ft 3 in)        "0.0"  Ralph Boston (USA)       Kingston, J… 15 A…
15 8.34 m (27 ft 4+1⁄4 in)    "1.0"  Ralph Boston (USA)       Los Angeles… 12 S…
16 8.35 m (27 ft 4+1⁄2 in)[5] "0.0"  Ralph Boston (USA)       Modesto, Un… 29 M…
17 8.35 m (27 ft 4+1⁄2 in) A  "0.0"  Igor Ter-Ovanesyan (URS) Mexico City… 19 O…
18 8.90 m (29 ft 2+1⁄4 in) A  "2.0"  Bob Beamon (USA)         Mexico City… 18 O…
19 8.95 m (29 ft 4+1⁄4 in)    "0.3"  Mike Powell (USA)        Tokyo, Japan 30 A…

4.6.3 Google Drive

Another available option (especially if we collaborate with other people) is to import from a Google Drive spreadsheet, making use of read_sheet() from the {googlesheets4} package.

The first time, you will be asked to grant the tidyverse permission to interact with your Drive.

library(googlesheets4)
google_sheet <-
  read_sheet("https://docs.google.com/spreadsheets/d/1Uz38nHjl3bmftxDpcXj--DYyPo1I39NHVf-xjeg1_wI/edit?usp=sharing")
! Using an auto-discovered, cached token.
  To suppress this message, modify your code or options to clearly consent to
  the use of a cached token.
  See gargle's "Non-interactive auth" vignette for more details:
  <https://gargle.r-lib.org/articles/non-interactive-auth.html>
ℹ The googlesheets4 package is using a cached token for 'javalv09@ucm.es'.
✔ Reading from "import-google-sheet".
✔ Range 'Hoja 1'.
google_sheet
# A tibble: 3 × 4
  a         b     c d     
  <chr> <dbl> <dbl> <chr> 
1 d1      0.5     1 hombre
2 d2     -0.3     2 hombre
3 d3      1.1     3 mujer 

4.6.4 API (owid)

Another interesting option is downloading data from an API: an intermediary between an app or data provider and our R session. For example, let's load the {owidR} package, which allows us to download data from https://ourworldindata.org/. The owid_covid() function, almost without us noticing, loads more than 400 000 records with more than 60 variables from 238 countries.

library(owidR)
owid_covid() |> as_tibble()
# A tibble: 7 × 67
  iso_code continent location    date       total_cases new_cases
  <chr>    <chr>     <chr>       <IDate>          <int>     <int>
1 AFG      Asia      Afghanistan 2020-01-05           0         0
2 AFG      Asia      Afghanistan 2020-01-06           0         0
3 AFG      Asia      Afghanistan 2020-01-07           0         0
4 AFG      Asia      Afghanistan 2020-01-08           0         0
5 AFG      Asia      Afghanistan 2020-01-09           0         0
6 AFG      Asia      Afghanistan 2020-01-10           0         0
7 AFG      Asia      Afghanistan 2020-01-11           0         0
# ℹ 61 more variables: new_cases_smoothed <dbl>, total_deaths <int>,
#   new_deaths <int>, new_deaths_smoothed <dbl>, total_cases_per_million <dbl>,
#   new_cases_per_million <dbl>, new_cases_smoothed_per_million <dbl>,
#   total_deaths_per_million <dbl>, new_deaths_per_million <dbl>,
#   new_deaths_smoothed_per_million <dbl>, reproduction_rate <dbl>,
#   icu_patients <int>, icu_patients_per_million <dbl>, hosp_patients <int>,
#   hosp_patients_per_million <dbl>, weekly_icu_admissions <int>, …

4.6.5 API (aemet)

On many occasions, to connect to an API we will first have to register and obtain a key; this is the case of the {climaemet} package for accessing Spanish meteorological data (https://opendata.aemet.es/centrodedescargas/inicio).

Once we have the API key we register it in our RStudio to be able to use it in the future.

library(climaemet)

# Api key
apikey <- "eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJqYXZhbHYwOUB1Y20uZXMiLCJqdGkiOiI4YTU1ODUxMS01MTE3LTQ4MTYtYmM4OS1hYmVkNDhiODBkYzkiLCJpc3MiOiJBRU1FVCIsImlhdCI6MTY2NjQ2OTcxNSwidXNlcklkIjoiOGE1NTg1MTEtNTExNy00ODE2LWJjODktYWJlZDQ4YjgwZGM5Iiwicm9sZSI6IiJ9.HEMR77lZy2ASjmOxJa8ppx2J8Za1IViurMX3p1reVBU"

aemet_api_key(apikey, install = TRUE)

With this package we can search for stations to find out both their location and their identifier code within the AEMET network

stations <- aemet_stations()
stations
# A tibble: 947 × 7
   indicativo indsinop nombre                 provincia altitud longitud latitud
   <chr>      <chr>    <chr>                  <chr>       <dbl>    <dbl>   <dbl>
 1 B013X      "08304"  ESCORCA, LLUC          ILLES BA…     490     2.89    39.8
 2 B051A      "08316"  SÓLLER, PUERTO         ILLES BA…       5     2.69    39.8
 3 B087X      ""       BANYALBUFAR            ILLES BA…      60     2.51    39.7
 4 B103B      "99103"  ANDRATX - SANT ELM     ILLES BA…      52     2.37    39.6
 5 B158X      ""       CALVIÀ, ES CAPDELLÀ    ILLES BA…      50     2.47    39.6
 6 B228       "08301"  PALMA, PUERTO          ILLES BA…       3     2.63    39.6
 7 B236C      ""       PALMA, UNIVERSIDAD     ILLES BA…      95     2.64    39.6
 8 B248       "08303"  SIERRA DE ALFABIA, BU… ILLES BA…    1030     2.71    39.7
 9 B275E      "08302"  SON BONET, AEROPUERTO  BALEARES       49     2.71    39.6
10 B278       "08306"  PALMA DE MALLORCA, AE… ILLES BA…       8     2.74    39.6
# ℹ 937 more rows

For example, to get data from the station at El Prat airport (Barcelona), the code to provide is "0076", obtaining hourly data

aemet_last_obs("0076")
# A tibble: 13 × 23
   idema   lon fint                 prec   alt  vmax    vv    dv   lat  dmax
   <chr> <dbl> <dttm>              <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 0076   2.07 2024-11-02 07:00:00   0       4  11.3   3.6   360  41.3   340
 2 0076   2.07 2024-11-02 08:00:00   0       4  10.3   4.2    10  41.3   340
 3 0076   2.07 2024-11-02 09:00:00   0       4  10.8   4.4    10  41.3    30
 4 0076   2.07 2024-11-02 10:00:00   0.2     4  12.4   4.7   360  41.3    10
 5 0076   2.07 2024-11-02 11:00:00   0       4  11.8   4.7    10  41.3    10
 6 0076   2.07 2024-11-02 12:00:00   0       4  14.9  10.7    70  41.3    70
 7 0076   2.07 2024-11-02 13:00:00   0.3     4  16    11.9    70  41.3    80
 8 0076   2.07 2024-11-02 14:00:00   0       4  18    14.1    80  41.3    80
 9 0076   2.07 2024-11-02 15:00:00   0       4  17    13.4    90  41.3    80
10 0076   2.07 2024-11-02 16:00:00   0       4  14.9  11.4    90  41.3    80
11 0076   2.07 2024-11-02 17:00:00   0       4  13.4   9.9    90  41.3    80
12 0076   2.07 2024-11-02 18:00:00   0       4  11.8   7.3    80  41.3    80
13 0076   2.07 2024-11-02 19:00:00   0       4   9.8   6.4    80  41.3    80
# ℹ 13 more variables: ubi <chr>, pres <dbl>, hr <dbl>, stdvv <dbl>, ts <dbl>,
#   pres_nmar <dbl>, tamin <dbl>, ta <dbl>, tamax <dbl>, tpr <dbl>, vis <dbl>,
#   stddv <dbl>, inso <dbl>

4.6.6 API (US census)

One of the most useful tools of recent years is {tidycensus}: a package that facilitates downloading United States census data from R.

library(tidycensus)
  • get_decennial(): to access the census data (US Decennial Census), done every 10 years (years 2000, 2010 and 2020).

  • get_acs(): to access the annual and quinquennial (5 years) ACS (American Community Survey) (census != survey)

  • get_estimates(): to access the annual population, birth and death estimates (census != survey)

  • get_pums(): to access the microdata (unaggregated data) of the ACS (anonymized at the individual level)

  • get_flows(): to access the migration flow data.


For example, we will download the census data (get_decennial()) at the state level (geography = "state") for the total population (variables = "P001001") for the year 2010 (see the available variables in tidycensus::load_variables()).

total_population_10 <-
  get_decennial(geography = "state", 
  variables = "P001001",
  year = 2010)
Getting data from the 2010 decennial Census
Using Census Summary File 1
total_population_10
# A tibble: 52 × 4
   GEOID NAME        variable    value
   <chr> <chr>       <chr>       <dbl>
 1 01    Alabama     P001001   4779736
 2 02    Alaska      P001001    710231
 3 04    Arizona     P001001   6392017
 4 05    Arkansas    P001001   2915918
 5 06    California  P001001  37253956
 6 22    Louisiana   P001001   4533372
 7 21    Kentucky    P001001   4339367
 8 08    Colorado    P001001   5029196
 9 09    Connecticut P001001   3574097
10 10    Delaware    P001001    897934
# ℹ 42 more rows

4.6.7 Other options

4.7 💻 Your turn

Try to solve the following exercises without looking at the solutions.

📝 Export the who dataset we have used in previous exercises to a native R format in the data folder of the project

Code
library(tidyr)
save(who, file = "./datos/who.RData")

📝 Load the who dataset from the data folder (import the file created in the previous exercise)

Code
load("./datos/who.RData")

📝 Repeat the same (export and import) in 4 formats: .csv, .xlsx, .sav (SPSS) and .dta (Stata)

Code
# csv
library(readr)
write_csv(who, file = "./datos/who.csv")
who_data <- read_csv(file = "./datos/who.csv")

# excel
library(openxlsx)
write.xlsx(who, file = "./datos/who.xlsx")
who_data <- read_xlsx(path = "./datos/who.xlsx")

# spss and stata
library(haven)
write_sav(who, path = "./datos/who.sav")
who_data <- read_spss(file = "./datos/who.sav")

write_dta(who, path = "./datos/who.dta")
who_data <- read_dta(file = "./datos/who.dta")

📝 Repeat the loading of who.csv, but selecting only the first 4 columns already at load time.

Code
who_select <-
  read_csv(file = "./datos/who.csv",
           col_select = c("country", "iso2", "iso3", "year"))

5 🐣 Case study I: CIS survey

📊 Data

We are going to put into practice the loading and preprocessing of a file generated by one of the most widely used statistical packages (SPSS). The file contains data from the CIS (Centro de Investigaciones Sociológicas) barometer «Perceptions on equality between men and women and gender stereotypes», whose fieldwork was carried out from November 6 to 14 (4000 interviews of people of both sexes over 16 years old, in 1174 municipalities and 50 provinces).

5.1 Question 1

Load the .sav file that you have in the CIS-feminismo subfolder inside the data folder. After loading, normalize the variable names with the {janitor} package.

Code
library(haven)
data <-
  read_sav(file = "./datos/CIS-feminismo/3428.sav") |> 
  janitor::clean_names()

5.2 Question 2

Calculate, using the tidyverse, the number of distinct values of each variable and eliminate those that only take a constant value. First get the names of those variables and then use them inside a select().

Code
rm_variables <-
  data |> 
  summarise(across(everything(), n_distinct)) |> 
  pivot_longer(cols = everything(), names_to = "variable", values_to = "n_values") |> 
  filter(n_values <= 1) |> 
  pull(variable)
data <-
  data |> 
  # all_of() is required to use an external vector in selections
  select(-all_of(rm_variables))

5.3 Question 3

Perform the same action to detect the variables that have a different value for each individual (all values distinct). These columns are important, as they will be the ids of the respondents, but one is enough (if there are several, delete all but one).

Code
rm_variables <-
  data |> 
  summarise(across(everything(), n_distinct)) |> 
  pivot_longer(cols = everything(), names_to = "variable", values_to = "n_values") |> 
  filter(n_values == nrow(data)) |> 
  pull(variable)
data <-
  data |> 
  # all of them except the first one
  select(-all_of(rm_variables[-1]))

5.4 Question 4

To understand what the variables mean, you have in the CIS-feminismo folder the questionnaire given to each respondent in cues3428.pdf and the sampling data sheet in FT3428.pdf. In these documents you will see that, for example, almost all questionnaire variables begin with p..., although we also have other types of variables related to household chores (tarhog...), children (hijomenor...), childcare (cuidadohijos...), care tasks in general (tareascuid...) and other variables related to the ideological scale, religion, voting recall, etc. In addition, all variables starting with ia_xxx refer to codes for possible incidents during data collection, and the variables peso... to the type of weighting used in the sampling. Proceed to eliminate these last two types of variables.

Code
data <-
  data |> 
  select(-contains("ia_"), -contains("peso"))

5.5 Question 5

Calculate the number of interviews per autonomous community (ccaa) and extract the 5 with the fewest interviews.

Code
data |>
  count(ccaa, sort = TRUE)

data |>
  count(ccaa) |> 
  slice_min(order_by = n, n = 5)

5.6 Question 6

Use the {datapasta} package (see https://javieralvarezliebana.es/docencia/mucss-data-programming/slides/#/data-pasta) to copy the population of each autonomous community from the INE page and import it into a tibble

Code
# this code should be automatically generated by the addin in RStudio
# after installing the R package
population <- 
  tibble::tribble(~Total, ~`47.385.107`,
                    "01 Andalucía",   "8.472.407",
                       "02 Aragón",   "1.326.261",
      "03 Asturias, Principado de",   "1.011.792",
               "04 Balears, Illes",   "1.173.008",
                     "05 Canarias",   "2.172.944",
                    "06 Cantabria",     "584.507",
              "07 Castilla y León",   "2.383.139",
         "08 Castilla - La Mancha",   "2.049.562",
                     "09 Cataluña",   "7.763.362",
         "10 Comunitat Valenciana",   "5.058.138",
                  "11 Extremadura",   "1.059.501",
                      "12 Galicia",   "2.695.645",
         "13 Madrid, Comunidad de",   "6.751.251",
            "14 Murcia, Región de",   "1.518.486",
  "15 Navarra, Comunidad Foral de",     "661.537",
                   "16 País Vasco",   "2.213.993",
                    "17 Rioja, La",     "319.796",
                        "18 Ceuta",      "83.517",
                      "19 Melilla",      "86.261"
  )

5.7 Question 7

Rename the variables of the population dataset properly, convert the population to a number and separate the autonomous community variable into two, one with the code and another with the name (according to the INE), processing and trimming the names properly. See the previous exercises on the {stringr} package.

Code
population <-
  population |> 
  rename(ccaa = Total, pop = `47.385.107`) |> 
  # First to convert to numeric we need to remove dots
  mutate("pop" = as.numeric(str_replace_all(pop, "\\.", ""))) |> 
  # if we use " " as sep, we split also some names
  separate(col = ccaa, into = c("cod_INE", "name"), sep = 2) |> 
  # From stringr package, str_squish() removes whitespace at the start 
  # and end, and replaces all internal whitespace with a single space.
  mutate(name = str_squish(name))

5.8 Question 8

Add the population information to our CIS table. Then calculate a summary table with the ratio between the percentage of the Spanish population that each community represents and the percentage of respondents from each community (with respect to the total). Which are the 3 most over-represented and which are the 3 least? Note that a ratio higher than 1 implies a lower proportion of respondents from that community than its population share would warrant (it is under-represented)

Code
data <-
  data |> 
  # you need first to convert cod_INE as numeric since both 
  # key columns should be of same type
  left_join(population |> mutate("cod_INE" = as.numeric(cod_INE)),
            by = c("ccaa" = "cod_INE")) |> 
  relocate(name, pop, .after = "ccaa")

prop_surv_pop <-
  data |>
  summarise("prop_surv" = n() / sum(nrow(data)),
            "pop" = unique(pop),
            .by = ccaa) |> 
  mutate("prop_pop" = pop / sum(pop),
         "ratio" = prop_pop / prop_surv)

# overrepresented
prop_surv_pop |> 
  slice_min(ratio, n = 3)

# underrepresented
prop_surv_pop |> 
  slice_max(ratio, n = 3)

6 Lists

We have already seen that lists are objects in R that allow us to store collections of variables of different types (as with data.frame and tibble) but also of different lengths, with totally heterogeneous structures (a list can even contain another list).

name <- "Javi"
age <- 34
marks <- c(7, 8, 5, 3, 10, 9)
parents <- c("Paloma", "Goyo")

list_var <- list("name" = name, "age" = age, "marks" = marks, "parents" = parents)
list_var$name
[1] "Javi"
list_var$marks
[1]  7  8  5  3 10  9

We can also make lists containing other lists, so that to access each level we must use the [[]] operator.

list_of_lists <- list("list_1" = list_var[1:2], "list_2" = list_var[3:4])
names(list_of_lists)
[1] "list_1" "list_2"
names(list_of_lists[[1]])
[1] "name" "age" 
list_of_lists[[1]][[1]]
[1] "Javi"

We are allowed to store n-dimensional data!

 

One of the disadvantages is that a list cannot be vectorized immediately, so any arithmetic operation applied directly to a list will give an error.

data <- list("a" = 1:5, "b" = 10:20)
data / 2
Error in data/2: non-numeric argument to binary operator

For this purpose, one of the common (but outdated) options is to make use of the lapply() family.

lapply(data, FUN = function(x) { x / 2})
$a
[1] 0.5 1.0 1.5 2.0 2.5

$b
 [1]  5.0  5.5  6.0  6.5  7.0  7.5  8.0  8.5  9.0  9.5 10.0

By default, the output of lapply() is always a list of the same length as the input.
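As a base-R aside (not part of the course examples above): collapsing that list output back into an atomic vector requires an extra step, for instance vapply(), sketched here with the same data list.

```r
data <- list("a" = 1:5, "b" = 10:20)

# vapply() applies the function and flattens the result into an
# atomic vector of the declared type (one double per list element)
vapply(data, FUN = mean, FUN.VALUE = numeric(1))
#>  a  b 
#>  3 15
```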

A more flexible and versatile option is to make use of the {purrr} package of the {tidyverse} environment.

library(purrr)

This package is intended to mimic the functional programming style of other languages such as Scala, or the map-reduce strategy of Hadoop (originally from Google).

The simplest function of the {purrr} package is map(), which applies a given function to each of the elements of a list. Let's see a first example applied to vectors

  • map() allows us to “map” each list and apply the function element by element (if applicable).
x <- list("x1" = 1:4, "x2" = 11:20)
map(x, sqrt) 
$x1
[1] 1.000000 1.414214 1.732051 2.000000

$x2
 [1] 3.316625 3.464102 3.605551 3.741657 3.872983 4.000000 4.123106 4.242641
 [9] 4.358899 4.472136
Be careful

With vectors we have default vectorization because R performs element-by-element operations. Note that, by default, the output of map() is a list.

Let’s look at another example. Define in a list two samples from 2 normal distributions, of different sample size and different mean. Compute the mean of each one.

Code
x <- list(rnorm(n = 1500, mean = 0, sd = 0.7),
          rnorm(n = 2800, mean = 2, sd = 1.5))
map(x, mean)
[[1]]
[1] -0.0126662

[[2]]
[1] 1.978925

What if we want to calculate the mean of their squared values?

Code
map(x, function(x) { mean(x^2) })
[[1]]
[1] 0.453184

[[2]]
[1] 6.155281

In addition to being more readable and efficient, with {purrr} we can decide the output format after the operation

  • output as numeric (double) vector with map_dbl()
  • output as numeric (int) vector with map_int()
  • output as character vector with map_chr()
  • output as logical vector with map_lgl()
library(glue)
map_dbl(x, mean)
[1] -0.0126662  1.9789246
map_chr(x, function(x) { glue("Mean is {round(mean(x), 5)}") })
[1] "Mean is -0.01267" "Mean is 1.97892" 
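
The remaining two variants work the same way; a minimal sketch (using a fresh list y, not the samples above) of map_int() and map_lgl():

```r
library(purrr)

y <- list("a" = 1:4, "b" = 11:20)

# map_int() returns a (named) integer vector
map_int(y, length)

# map_lgl() returns a (named) logical vector
map_lgl(y, function(v) { any(v > 5) })
```

Names are preserved, so the output is a named vector with one entry per list element.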
c(x[[1]][3], x[[2]][3])
[1] 0.4777273 3.3570897

Also, if you pass map() a number instead of a function, it will return the i-th element of each list element.

map_dbl(x, 3)
[1] 0.4777273 3.3570897
list_dummy <- list("a" = dplyr::starwars, "b" = tidyr::billboard)

We can also use pluck() to access to the i-th element of a list

list_dummy |> 
  pluck(1)
# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
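
pluck() is not limited to positional indices: a small sketch (with a hypothetical nested list) showing that it also accepts names and can chain several indices to dig into nested structures:

```r
library(purrr)

# A hypothetical nested list for illustration
nested <- list("a" = list("b" = 1:3))

# Access by name...
nested |> pluck("a", "b")

# ...and chain indices to reach the second element of nested$a$b
nested |> pluck("a", "b", 2)
```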

We can also generalize this to functions that need two arguments (binary operations), applied to a pair of lists, with map2()

x <- list("a" = 1:3, "b" = 4:6)
y <- list("c" = c(-1, 4, 0), "b" = c(5, -4, -1))
map2(x, y, function(x, y) { x^2 + y^2})
$a
[1]  2 20  9

$b
[1] 41 41 37

We can obtain the output as a data.frame by adding list_rbind() or list_cbind(), which convert a list of data frames into a single table (binding rows or columns, respectively).

x <- c("a", "b", "c")
y <- 1:3
map2(x, y, function(x, y) { tibble(x, y) }) |> list_rbind()
# A tibble: 3 × 2
  x         y
  <chr> <int>
1 a         1
2 b         2
3 c         3
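
The column-wise counterpart is not shown above; a minimal sketch of list_cbind() (with two hypothetical one-column tibbles), assuming each element of the list is a data frame with distinct column names:

```r
library(purrr)
library(tibble)

# list_cbind() binds a list of data frames column-wise
list(tibble("a" = 1:2), tibble("b" = 3:4)) |>
  list_cbind()
```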

We can generalize it further with pmap_xxx() which allows us to use multiple arguments (multiple lists).

x <- list(1, 1, 1)
y <- list(10, 20, 30)
z <- list(100, 200, 300)
pmap_dbl(list(x, y, z), sum)
[1] 111 221 331

We also have iterators that take inputs but return nothing visible: walk() (a single input argument), walk2() (two arguments) and pwalk() (multiple arguments). All of them return their first input invisibly; we call them for their side effects (printing, saving files, plotting) rather than for a return value.

list("a" = 1:3, "b" = 4:6) |>
  map2(list("a" = 11:13, "b" = 14:16),
       function(x, y) { x + y }) |> 
  walk(print)
[1] 12 14 16
[1] 18 20 22
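
A minimal sketch of pwalk() under the same idea, iterating over two hypothetical parallel lists purely for the printing side effect:

```r
library(purrr)

# pwalk() iterates over several parallel lists for side effects only
# (here, printing one formatted line per pair of elements)
letters_list <- list("a", "b")
numbers_list <- list(1, 2)

pwalk(list(letters_list, numbers_list),
      function(x, y) { print(paste0(x, ": ", y)) })
```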

6.1 💻 It’s your turn

📝 Define a list of 4 elements of different types and access the second of them (I will include one that is a tibble so that you can see that in a list there is room for everything).

Code
list_example <-
  list("name" = "Javier", "cp" = 28019,
       "siblings" = TRUE,
       "marks" = tibble("maths" = c(7.5, 8, 9),
                        "lang" = c(10, 5, 6)))
list_example

📝 From the list above, access the elements in positions 1 and 4.

Code
list_example[c(1, 4)]

list_example$name
list_example$marks

list_example[c("name", "marks")]

📝 Load the starwars dataset from the {dplyr} package and access the second movie that appears in starwars$films (for each character). Determine which ones do not appear in more than one movie.

second_film <- map(starwars$films, 2)
map_lgl(second_film, is.null)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE
[13] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE
[25]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE
[37]  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
[49]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE
[61]  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
[73]  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
[85]  TRUE  TRUE  TRUE

7 Quali: factors (forcats package)

In statistics, when we talk about qualitative variables, we will call levels or modalities the different values that these data can take. For example, in the case of the sex variable of the starwars set, we have 4 allowed levels: female, hermaphroditic, male and none (in addition to missing data).

starwars |> count(sex)
# A tibble: 5 × 2
  sex                n
  <chr>          <int>
1 female            16
2 hermaphroditic     1
3 male              60
4 none               6
5 <NA>               4

These kinds of variables are known in R as factors, and the fundamental package to deal with them is {forcats} (from the {tidyverse} environment).

This package allows us to fix the levels (stored internally as levels) that a given categorical variable can take, so that mistakes and errors in data collection and generation cannot slip through. It also makes analysis less computationally expensive for searches and comparisons, since factors receive a different treatment than plain text strings.
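
A minimal sketch of that protection, using base factor() with an intentional typo ("PPP" is not an allowed level):

```r
# Values outside the allowed levels become NA, so a typo in data
# collection surfaces immediately instead of silently creating
# a new category
x <- factor(c("PP", "PSOE", "PPP"), levels = c("PP", "PSOE", "SUMAR"))
x
```

The misspelled "PPP" is converted to NA and the set of levels stays exactly as declared.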

Let’s see a simple example simulating a party variable taking the values "PP", "PSOE" and "SUMAR" (of size 15)

set.seed(1234567)
party <- sample(x = c("PP", "PSOE", "SUMAR"), size = 15, replace = TRUE)
party
 [1] "PP"    "PSOE"  "SUMAR" "PP"    "PP"    "SUMAR" "PSOE"  "SUMAR" "SUMAR"
[10] "SUMAR" "PP"    "PSOE"  "PP"    "PSOE"  "SUMAR"

The party variable is currently of type text, of type chr, something we can check with class(party).

class(party)
[1] "character"

From a statistical and computational point of view, for R this variable right now is just a text variable. But statistically, a variable stored as strings is not the same as a categorical variable that can only take those 3 levels. How do we convert it to a factor?

By making use of the as_factor() function from the {forcats} package.

library(tidyverse)
party_fct <- tibble("id" = 1:length(party),
                    "party" = as_factor(party))
party_fct
# A tibble: 15 × 2
      id party
   <int> <fct>
 1     1 PP   
 2     2 PSOE 
 3     3 SUMAR
 4     4 PP   
 5     5 PP   
 6     6 SUMAR
 7     7 PSOE 
 8     8 SUMAR
 9     9 SUMAR
10    10 SUMAR
11    11 PP   
12    12 PSOE 
13    13 PP   
14    14 PSOE 
15    15 SUMAR

Not only has the class of the variable changed: now, below the stored values, the line Levels: ... appears. These are the modalities or levels of our qualitative variable.

party_fct |> pull(party)
 [1] PP    PSOE  SUMAR PP    PP    SUMAR PSOE  SUMAR SUMAR SUMAR PP    PSOE 
[13] PP    PSOE  SUMAR
Levels: PP PSOE SUMAR

Imagine that we are defining the database of deputies in Congress and that the PSOE deputies did not attend the plenary session that day: although our variable does not take that value THAT DAY, PSOE is still an allowed level in the database (even if we filter it out, because the variable is a factor, the level remains: we have no such records right now, but it is an allowed level).

party_fct |> 
  filter(party %in% c("PP", "SUMAR")) |> 
  pull(party)
 [1] PP    SUMAR PP    PP    SUMAR SUMAR SUMAR SUMAR PP    PP    SUMAR
Levels: PP PSOE SUMAR

With factor() function we can explicitly specify the names of the modalities and using levels = ... we can explicitly tell it the “order” of the modalities

party_fct <-
  tibble(id = 1:length(party),
         party = factor(party, levels = c("SUMAR", "PP", "PSOE")))
party_fct |> pull(party)
 [1] PP    PSOE  SUMAR PP    PP    SUMAR PSOE  SUMAR SUMAR SUMAR PP    PSOE 
[13] PP    PSOE  SUMAR
Levels: SUMAR PP PSOE

The previous “order” only determines which level will be counted/plotted first; we do not (yet) have an ordinal variable

party_fct$party < "SUMAR"
Warning in Ops.factor(party_fct$party, "SUMAR"): '<' not meaningful for factors
 [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

What if we want to define a qualitative variable ORDINAL? Inside factor() we must indicate that ordered = TRUE.

marks <- c("A", "E", "F", "B", "A+", "A", "C", "C", "D", "B", "A", "C", "C", "E", "F", "D", "A+")
marks_database <-
  tibble("student" = 1:length(marks),
         "marks" =
           factor(marks, levels = c("F", "E", "D", "C", "B", "A", "A+"),
                  ordered = TRUE))
marks_database |> pull(marks)
 [1] A  E  F  B  A+ A  C  C  D  B  A  C  C  E  F  D  A+
Levels: F < E < D < C < B < A < A+

What changes? Notice that now, although the variable is still qualitative, we can make comparisons and sort the records, because there is a hierarchy between the modalities.

marks_database |> filter(marks >= "B")
# A tibble: 7 × 2
  student marks
    <int> <ord>
1       1 A    
2       4 B    
3       5 A+   
4       6 A    
5      10 B    
6      11 A    
7      17 A+   

If we want to drop a level that is unused at that moment (and exclude it from the definition), we can do so with fct_drop().

marks_database |> 
  filter(marks %in% c("F", "E", "D", "C", "B", "A")) |> 
  pull(marks)
 [1] A E F B A C C D B A C C E F D
Levels: F < E < D < C < B < A < A+

marks_database |> 
  filter(marks %in% c("F", "E", "D", "C", "B", "A")) |> 
  mutate(marks = fct_drop(marks)) |>  
  pull(marks)
 [1] A E F B A C C D B A C C E F D
Levels: F < E < D < C < B < A

Just as we can delete levels, we can also add new ones (even if there is no data for that level at that time) with fct_expand().

marks_database |> 
  mutate(marks = fct_expand(marks, c("F-", "A+", "A++"))) %>% 
  pull(marks)
 [1] A  E  F  B  A+ A  C  C  D  B  A  C  C  E  F  D  A+
Levels: F < E < D < C < B < A < A+ < F- < A++

In addition, with fct_explicit_na() (now deprecated in favour of fct_na_value_to_level()) we can assign an explicit level to the missing values so that they are included in analyses and visualizations.

fct_explicit_na(factor(c("a", "b", NA)))
Warning: `fct_explicit_na()` was deprecated in forcats 1.0.0.
ℹ Please use `fct_na_value_to_level()` instead.
[1] a         b         (Missing)
Levels: a b (Missing)
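
As the warning suggests, a sketch of the same operation with the non-deprecated replacement, fct_na_value_to_level():

```r
library(forcats)

# Promote NA to an explicit level without triggering the
# deprecation warning shown above
fct_na_value_to_level(factor(c("a", "b", NA)), level = "(Missing)")
```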

Even once defined we can reorder the levels with fct_relevel().

marks_database_expand <- 
  marks_database |>  
  mutate(marks = fct_expand(marks, c("F-", "A+", "A++"))) |> 
  pull(marks)

marks_database_expand |> 
  fct_relevel(c("F-", "F", "E", "D", "C", "B", "A", "A+", "A++"))
 [1] A  E  F  B  A+ A  C  C  D  B  A  C  C  E  F  D  A+
Levels: F- < F < E < D < C < B < A < A+ < A++

This way of working with qualitative variables allows us to give a theoretical definition of our database, and we can even count values that do not yet exist (but could), making use of fct_count().

marks_database |> 
  mutate(marks = fct_expand(marks, c("F-", "A+", "A++"))) |> 
  pull(marks) |> 
  fct_count()
# A tibble: 9 × 2
  f         n
  <ord> <int>
1 F         2
2 E         2
3 D         2
4 C         4
5 B         2
6 A         3
7 A+        2
8 F-        0
9 A++       0

The levels can also be sorted by frequency with fct_infreq().

marks_database |> 
  mutate(marks = fct_infreq(marks)) |> 
  pull(marks)
 [1] A  E  F  B  A+ A  C  C  D  B  A  C  C  E  F  D  A+
Levels: C < A < F < E < D < B < A+
marks_database |> 
  mutate(marks = fct_infreq(marks)) |> 
  pull(marks) |> 
  fct_count()
# A tibble: 7 × 2
  f         n
  <ord> <int>
1 C         4
2 A         3
3 F         2
4 E         2
5 D         2
6 B         2
7 A+        2

Sometimes we will want to group levels, for example disallowing levels that do not occur a minimum number of times, with fct_lump_min(..., min = ...): observations that do not reach the threshold go to a generic level called Other (configurable through the other_level argument).

marks_database |> 
  pull(marks) %>% 
  fct_lump_min(min = 3)
 [1] A     Other Other Other Other A     C     C     Other Other A     C    
[13] C     Other Other Other Other
Levels: C < A < Other
marks_database |> 
  pull(marks) |>
  fct_lump_min(min = 3,
               other_level = "Less frequent")
 [1] A             Less frequent Less frequent Less frequent Less frequent
 [6] A             C             C             Less frequent Less frequent
[11] A             C             C             Less frequent Less frequent
[16] Less frequent Less frequent
Levels: C < A < Less frequent

We can do something equivalent but based on its relative frequency with fct_lump_prop().

marks_database |> 
  pull(marks) |> 
  fct_lump_prop(prop = 0.15,
                other_level = "less frequent")
 [1] A             less frequent less frequent less frequent less frequent
 [6] A             C             C             less frequent less frequent
[11] A             C             C             less frequent less frequent
[16] less frequent less frequent
Levels: C < A < less frequent

We can apply this to our datasets to recategorize variables very quickly.

starwars |>  
  drop_na(species) |> 
  mutate(species =
           fct_lump_min(species, min = 3,
                        other_level = "Others")) |>  
  count(species)
# A tibble: 4 × 2
  species     n
  <fct>   <int>
1 Droid       6
2 Gungan      3
3 Human      35
4 Others     39

With fct_reorder() we can also indicate that we want to order the levels according to a function applied to another variable.

starwars_factor <- 
  starwars |>  
  drop_na(height, species) |> 
  mutate(species =
           fct_lump_min(species, min = 3,
                        other_level = "Others"))
starwars_factor |>  pull(species)
 [1] Human  Droid  Droid  Human  Human  Human  Human  Droid  Human  Human 
[11] Human  Human  Others Human  Others Others Human  Others Human  Human 
[21] Droid  Others Human  Human  Others Human  Others Others Human  Others
[31] Human  Human  Gungan Gungan Gungan Human  Others Others Human  Human 
[41] Others Others Others Others Others Others Others Human  Others Others
[51] Others Others Others Others Others Others Human  Others Others Others
[61] Human  Human  Human  Human  Others Others Others Others Human  Droid 
[71] Others Others Others Others Others Human  Others
Levels: Droid Gungan Human Others
starwars_factor |> 
  mutate(species = fct_reorder(species, height, mean)) |> 
  pull(species)
 [1] Human  Droid  Droid  Human  Human  Human  Human  Droid  Human  Human 
[11] Human  Human  Others Human  Others Others Human  Others Human  Human 
[21] Droid  Others Human  Human  Others Human  Others Others Human  Others
[31] Human  Human  Gungan Gungan Gungan Human  Others Others Human  Human 
[41] Others Others Others Others Others Others Others Human  Others Others
[51] Others Others Others Others Others Others Human  Others Others Others
[61] Human  Human  Human  Human  Others Others Others Others Human  Droid 
[71] Others Others Others Others Others Human  Others
Levels: Droid Others Human Gungan

7.1 💻 It’s your turn

📝 Given the variable months defined below (defined as a character vector), convert this variable to factor (just that)

months <- c("Jan", "Feb", "Mar", "Apr")
Code
months <- c("Jan", "Feb", "Mar", "Apr")
months_fct <- as_factor(months)
months_fct

📝 Given the variable months defined below, convert it to a factor, indicating the levels correctly.

months <- c(NA, "Apr", "Jan", "Oct", "Jul", "Jan", "Sep", NA, "Feb", "Dic",
           "Jul", "Mar", "Jan", "Mar", "Feb", "Apr", "May", "Oct", "Sep",  NA,
           "Dic", "Jul", "Nov", "Feb", "Oct", "Jun", "Sep", "Oct", "Oct", "Sep")
Code
months_fct <-
  factor(months,
         levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dic"))
months_fct

📝 Count how many values there are for each month but keep in mind that they are factors (maybe there are unused levels and you should get a 0 for them).

Code
months_fct |> fct_count()

📝 Since there are missing values, indicate that the missing value is a thirteenth level labeled "missing".

Code
months_fct <- 
  months_fct |> 
  fct_na_value_to_level(level = "missing")
months_fct

📝 Remove unused levels.

Code
months_fct <- 
  months_fct %>%
  fct_drop()
months_fct

📝 Sort the levels by frequency of occurrence.

Code
months_fct |> 
  fct_infreq()

📝 Group levels so that any level that does not appear at least 7% of the time is grouped in a level called "other months".

Code
months_fct <-
  months_fct |> 
  fct_lump_prop(prop = 0.07, other_level = "other months")
months_fct 

8 🐣 Case study II: CIS survey

📊 Data

We are going to work again with the file containing data from the CIS (Centro de Investigaciones Sociológicas) barometer «Perceptions on equality between men and women and gender stereotypes», whose fieldwork was carried out from November 6 to 14 (4000 interviews with people of both sexes over 16 years old, across 1174 municipalities and 50 provinces).

8.1 Question 1

Run the entire code of the previous case study

library(haven)
data <-
  read_sav(file = "./datos/CIS-feminismo/3428.sav") |> 
  janitor::clean_names()

rm_variables <-
  data |> 
  summarise(across(everything(), n_distinct)) |> 
  pivot_longer(cols = everything(), names_to = "variable", values_to = "n_values") |> 
  filter(n_values <= 1) |> 
  pull(variable)
data <-
  data |> 
  select(-all_of(rm_variables))

rm_variables <-
  data |> 
  summarise(across(everything(), n_distinct)) |> 
  pivot_longer(cols = everything(), names_to = "variable", values_to = "n_values") |> 
  filter(n_values == nrow(data)) |> 
  pull(variable)
data <-
  data |> 
  # all of them except the first one
  select(-all_of(rm_variables[-1]))

data <-
  data |> 
  select(-contains("ia_"), -contains("peso"))

population <- 
  tibble::tribble(~Total, ~`47.385.107`,
                    "01 Andalucía",   "8.472.407",
                       "02 Aragón",   "1.326.261",
      "03 Asturias, Principado de",   "1.011.792",
               "04 Balears, Illes",   "1.173.008",
                     "05 Canarias",   "2.172.944",
                    "06 Cantabria",     "584.507",
              "07 Castilla y León",   "2.383.139",
         "08 Castilla - La Mancha",   "2.049.562",
                     "09 Cataluña",   "7.763.362",
         "10 Comunitat Valenciana",   "5.058.138",
                  "11 Extremadura",   "1.059.501",
                      "12 Galicia",   "2.695.645",
         "13 Madrid, Comunidad de",   "6.751.251",
            "14 Murcia, Región de",   "1.518.486",
  "15 Navarra, Comunidad Foral de",     "661.537",
                   "16 País Vasco",   "2.213.993",
                    "17 Rioja, La",     "319.796",
                        "18 Ceuta",      "83.517",
                      "19 Melilla",      "86.261"
  )

population <-
  population |> 
  rename(ccaa = Total, pop = `47.385.107`) |> 
  # First to convert to numeric we need to remove dots
  mutate("pop" = as.numeric(str_replace_all(pop, "\\.", ""))) |> 
  # if we use " " as sep, we split also some names
  separate(col = ccaa, into = c("cod_INE", "name"), sep = 2) |> 
  # From stringr package, str_squish() removes whitespace at the start 
  # and end, and replaces all internal whitespace with a single space.
  mutate(name = str_squish(name))

data <-
  data |> 
  # you need first to convert cod_INE as numeric since both 
  # key columns should be of same type
  left_join(population |> mutate("cod_INE" = as.numeric(cod_INE)),
            by = c("ccaa" = "cod_INE")) |> 
  relocate(name, pop, .after = "ccaa")

8.2 Question 2

Notice that many columns are of type <dbl+lbl>, a new data type inherited from SPSS to be able to have the numeric value (e.g. 1 for the autonomous community ccaa of Andalusia) but also the label («Andalucía»). You can see more in the {labelled} package but for now we are going to use to_factor() to be able to work with those labels if we want.

data |> 
  # numeric mode
  count(ccaa == 1)
# A tibble: 2 × 2
  `ccaa == 1`     n
  <lgl>       <int>
1 FALSE        3385
2 TRUE          620
data |> 
  # labels mode
  count(labelled::to_factor(ccaa) == "Andalucía")
# A tibble: 2 × 2
  `labelled::to_factor(ccaa) == "Andalucía"`     n
  <lgl>                                      <int>
1 FALSE                                       3385
2 TRUE                                         620

The variables escfeminis and sitconv actually correspond to questions P16 and P18 of the questionnaire. Rename them as p16 and p18.

Code
data <- 
  data |> 
  rename(p16 = escfeminis, p18 = sitconv)

8.3 Question 3

The variables tarhogentrev_1, tarhogentrev_2, tarhogentrev_8, tarhogentrev_9, tarhogentrev_hh, tarhogentrev_mm refer to the same question: time spent on household chores.

  • The variable tarhogentrev_1 stores a 1 if the respondent answered in hours (otherwise 2)

  • The variable tarhogentrev_2 stores a 1 if the respondent answered in minutes (otherwise 2)

  • The variable tarhogentrev_8 stores a 1 if the respondent answered “can’t say” (otherwise 2)

  • The variable tarhogentrev_9 stores a 1 if the respondent did not want to answer.

  • The variables tarhogentrev_hh and tarhogentrev_mm store the amounts (in hours or minutes, or 99 or 98 if the respondent did not know or did not want to answer).

Use the information from these 6 variables so that it is contained in a single variable p19 with the number of minutes spent (NA if it cannot be known), and delete all the original ones

Code
data <- 
  data |>
  mutate("p19" =
           if_else(tarhogentrev_1 == 1, tarhogentrev_hh * 60,
                   if_else(tarhogentrev_2 == 1, tarhogentrev_mm, NA)),
         .after = p18) |>
  select(-contains("tarhogentrev"))
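
As a side note, the nested if_else() pattern above can also be written with dplyr::case_when(), which is often easier to read as the number of cases grows. A hedged sketch with hypothetical stand-in vectors (unit, hh and mm are not real survey columns; here 1 means answered in hours and 2 in minutes):

```r
library(dplyr)

# Hypothetical stand-ins for the indicator and amount columns
unit <- c(1, 2, 1, 9)
hh   <- c(2, NA, 1, NA)
mm   <- c(NA, 30, NA, NA)

# Conditions are checked in order; .default covers everything else
case_when(unit == 1 ~ hh * 60,
          unit == 2 ~ mm,
          .default = NA)
```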

8.4 Question 4

Perform the same recoding with the variables tarhogparej_1, tarhogparej_2, tarhogparej_8, tarhogparej_9, tarhogparej_hh, tarhogparej_mm (time spent on household chores of the couple) and group their info in p20.

data <- 
  data |> 
  mutate("p20" =
           if_else(tarhogparej_1 == 1, tarhogparej_hh * 60,
                   if_else(tarhogparej_2 == 1, tarhogparej_mm, NA)),
         .after = p19) |>
  select(-contains("tarhogparej"))

8.5 Question 5

The variable hijomenor_1 contains a 1 if the respondent has children, and the number of children is stored in hijomenor_n; if hijomenor_97 contains a 1 we should impute 0 children; if hijomenor_99 is 1 we should impute NA. Collect the info from the 4 variables in a single variable called p21.

data <- 
  data |> 
  mutate("p21" =
           if_else(hijomenor_1 == 1, hijomenor_n,
                   if_else(hijomenor_97 == 1, 0, NA)),
         .after = p20) |>
  select(-contains("hijomenor"))

8.6 Question 6

The following variables (cuidadohijos_xxx and cuidadohijospar_xxx) code something similar with respect to time spent on childcare (of the respondent and his/her partner). Summarize the information in two variables p21a and p21b as we have done with p19 and p20.

data <- 
  data |> 
  mutate("p21a" =
           if_else(cuidadohijos_1 == 1, cuidadohijos_hh * 60,
                   if_else(cuidadohijos_2 == 1, cuidadohijos_mm, NA)),
         "p21b" =
           if_else(cuidadohijospar_1 == 1, cuidadohijospar_hh * 60,
                   if_else(cuidadohijospar_2 == 1, cuidadohijospar_mm, NA)),
         .after = p21) |>
  select(-contains("cuidadohijos"))

8.7 Question 7

Do the same with tareascui_xxx, regarding the time spent on care and attention of a sick and/or dependent person in a working day, and store the info in p22a. Bear in mind that p22 stores whether the respondent has a sick or dependent person in their care; if p22 is No (2) or N.C. (9), the value of p22a should be NA.

data <- 
  data |> 
  mutate("p22a" =
           if_else(p22 %in% c(2, 9), NA,
                   if_else(tareascui_1 == 1, tareascui_hh * 60,
                           if_else(tareascui_2 == 1, tareascui_mm, NA))),
         .after = p22) |>
  select(-contains("tareascui"))

8.8 Question 8

Finally, the variables escideol, participaciong, recuvotog, escuela, nivelestentrev, religion, practicarelig6, sitlab, relalab, cno11, tipojor, ingreshog, clasesocial, paisnac and paisnac2 should be renamed as p24, p25, p25a, p26, p26a, p27, p27a, p28, p28a, p28b, p28c, p29, p30, p31 and p31a. In addition, the variable sinceridad (the subjective sincerity the pollster perceived in the respondent) and the variables recuvotogr and recuerdo (vote-recall information we already have elsewhere) should be deleted. Finally, move the variables estudios, cno11r and clasesub (studies, occupation according to Social Security coding and subjective class identification) before the survey questions.

Code
data <-
  data |> 
  rename(p24 = escideol, p25 = participaciong, p25a = recuvotog,
         p26 = escuela, p26a = nivelestentrev, p27 = religion,
         p27a = practicarelig6, p28 = sitlab, p28a = relalab,
         p28b = cno11, p28c = tipojor, p29 = ingreshog,
         p30 = clasesocial, p31 = paisnac, p31a = paisnac2) |> 
  select(-(sinceridad:recuerdo)) |> 
  relocate(estudios, cno11r, clasesub, .before = p0)
data
# A tibble: 4,005 × 107
   registro ccaa          name     pop prov    mun      capital tamuni  tipo_tel
      <dbl> <dbl+lbl>     <chr>  <dbl> <dbl+l> <dbl+lb> <dbl+l> <dbl+l> <dbl+lb>
 1     2821 1 [Andalucía] Anda… 8.47e6 4 [Alm… 13 [Alm… 2 [Cap… 5 [100… 2 [Móvi…
 2    19046 1 [Andalucía] Anda… 8.47e6 4 [Alm… 13 [Alm… 2 [Cap… 5 [100… 1 [Fijo]
 3    20465 1 [Andalucía] Anda… 8.47e6 4 [Alm… 13 [Alm… 2 [Cap… 5 [100… 2 [Móvi…
 4    31462 1 [Andalucía] Anda… 8.47e6 4 [Alm… 13 [Alm… 2 [Cap… 5 [100… 2 [Móvi…
 5    33504 1 [Andalucía] Anda… 8.47e6 4 [Alm… 13 [Alm… 2 [Cap… 5 [100… 2 [Móvi…
 6    35292 1 [Andalucía] Anda… 8.47e6 4 [Alm… 13 [Alm… 2 [Cap… 5 [100… 2 [Móvi…
 7    37633 1 [Andalucía] Anda… 8.47e6 4 [Alm… 13 [Alm… 2 [Cap… 5 [100… 2 [Móvi…
 8    40018 1 [Andalucía] Anda… 8.47e6 4 [Alm… 13 [Alm… 2 [Cap… 5 [100… 1 [Fijo]
 9    41526 1 [Andalucía] Anda… 8.47e6 4 [Alm… 13 [Alm… 2 [Cap… 5 [100… 2 [Móvi…
10    62946 1 [Andalucía] Anda… 8.47e6 4 [Alm… 13 [Alm… 2 [Cap… 5 [100… 1 [Fijo]
# ℹ 3,995 more rows
# ℹ 98 more variables: sexo <dbl+lbl>, edad <dbl+lbl>, estudios <dbl+lbl>,
#   cno11r <dbl+lbl>, clasesub <dbl+lbl>, p0 <dbl+lbl>, p1 <dbl+lbl>,
#   p2 <dbl+lbl>, p3_1 <dbl+lbl>, p3_2 <dbl+lbl>, p3_3 <dbl+lbl>,
#   p3_4 <dbl+lbl>, p3_5 <dbl+lbl>, p4 <dbl+lbl>, p5 <dbl+lbl>, p6_1 <dbl+lbl>,
#   p6_2 <dbl+lbl>, p6_3 <dbl+lbl>, p7_1 <dbl+lbl>, p7_2 <dbl+lbl>,
#   p7_3 <dbl+lbl>, p7_4 <dbl+lbl>, p7_5 <dbl+lbl>, p7_6 <dbl+lbl>, …

8.9 Question 9

Note that there are variables in the questionnaire, such as p0 and p1, which, although coded numerically, should really be treated as qualitative. For example, if you look at the questionnaire, the variable p0 corresponds to the question P0.c, which says “First of all I would like to ask you if you have…

  • Spanish nationality –> 1

  • Spanish nationality and other –> 2

  • Other nationality –> 3

data |> count(p0)
# A tibble: 3 × 2
  p0                                      n
  <dbl+lbl>                           <int>
1 1 [La nacionalidad española]         3711
2 2 [La nacionalidad española y otra]   151
3 3 [Otra nacionalidad]                 143

If we were to leave the variable as it is, numerically, we would be assuming two incorrect things:

  1. There is a hierarchy (having Spanish nationality is better or worse than having another nationality).

  2. This hierarchy is also quantified numerically: since 2 is twice 1, it would imply that, in some way, having two nationalities is “twice as much” as having only Spanish nationality.

If you look at the questionnaire, all the variables named p{number} (without the underscore _) are of this type. The others correspond to more complex items (a series of questions to be assessed on a scale that is qualitative ordinal, which we will consider convertible to an ordered factor).

Convert the p{number} to qualitative (nominal) with its label and the p{number}_x to ordinal qualitative (with its number). In the latter, note that “don’t know/no answer” responses are coded as 98 and 99 (and should be NA).

Code
library(labelled)
data <-
  data |> 
  # $ anchors the end of the name: p4 or p16 match, but p3_1 does not
  mutate(across(matches("p[0-9]{1}$|p[0-9]{2}$"), to_factor),
         across(matches("p[0-9]{1}_|p[0-9]{2}_"),
                function(x) { factor(if_else(x %in% c(98, 99), NA, x),
                                     ordered = TRUE) } ))

In some compound questions with only 5 options, the NA are coded as 7, 8 or 9.

data <-
  data |> 
  # only the compound p{number}_x variables (note the underscore)
  mutate(across(matches("p[0-9]{1}_|p[0-9]{2}_"),
                function(x) { factor(if_else(x %in% c(7, 8, 9), NA, x),
                                     ordered = TRUE) } ))

8.10 Question 10

Calculate a two-dimensional table of relative frequencies between the variable sexo and the variable p4 (degree of inequality between men and women in Spain), normalizing as appropriate to answer the following question: what percentage of women believe that inequalities are very large or quite large? And of men?

Code
table_freq <- table(to_factor(data$sexo), data$p4)
prop.table(table_freq, margin = 1)
# 67% of women
# 48.7% of men

8.11 Question 11

Calculate a measure of association between these variables and detail conclusions.

Code
data |> 
  summarise("sig_chisq" = chisq.test(to_factor(sexo), p4)$p.value)
# tiny p-value --> enough evidence against the null hyp -->
# answers to p4 depend on sex

8.12 Question 12

Create a simple graph with ggplot() that represents the variable p4 as a function of sex, making the differences between sexes visible.

Code
# first option: geom_bar with abs freqs
ggplot(data |>
         mutate("sexo" = to_factor(sexo))) +
  geom_bar(aes(x = p4, fill = p4), alpha = 0.7) +
  ggthemes::scale_fill_colorblind() +
  facet_wrap(~sexo) +
  theme_minimal()

# second option: geom_col with rel freqs
ggplot(data |>
         mutate("sexo" = to_factor(sexo)) |> 
         count(sexo, p4) |> 
         mutate("porc" = n/sum(n))) +
  geom_col(aes(x = p4, y = porc, fill = p4), alpha = 0.7) +
  scale_y_continuous(labels = scales::label_percent()) +
  ggthemes::scale_fill_colorblind() +
  facet_wrap(~sexo) +
  theme_minimal()

# third option: horizontal bars
ggplot(data |>
         mutate("sexo" = to_factor(sexo)) |> 
         count(sexo, p4) |> 
         mutate("porc" = n/sum(n)) |> 
         mutate("porc" = if_else(sexo == "Hombre", -porc, porc))) +
  geom_col(aes(y = p4, x = porc, fill = p4), alpha = 0.7) +
  scale_x_continuous(labels = scales::label_percent()) +
  ggthemes::scale_fill_colorblind() +
  facet_wrap(~sexo, scales = "free_x") +
  theme_minimal()

# fourth option: fill geom bar
ggplot(data |>
         mutate("sexo" = to_factor(sexo))) +
  geom_bar(aes(x = sexo, fill = p4), position = "fill",
           alpha = 0.7) +
  ggthemes::scale_fill_colorblind() +
  theme_minimal()

8.13 Question 13

Perform the same analysis, but first regroup p4 into only 3 levels: grandes (Muy grandes and Bastante grandes), pequeñas (Pequeñas and Casi inexistentes) and NS/NC.

Code
data <-
  data |> 
  mutate("p4" =
           fct_collapse(p4,
                        "grandes" = c("Muy grandes", "Bastante grandes"),
                        "NS/NC" = c("N.S.", "N.C."),
                        other_level = "pequeñas"))

table_freq <- table(to_factor(data$sexo), data$p4)
prop.table(table_freq, margin = 1)
# 67% of women
# 48.7% of men

data |>
  summarise("sig_chisq" = chisq.test(to_factor(sexo), p4)$p.value)

ggplot(data |>
         mutate("sexo" = to_factor(sexo))) +
  geom_bar(aes(x = sexo, fill = p4), position = "fill",
           alpha = 0.7) +
  ggthemes::scale_fill_colorblind() +
  theme_minimal()

8.14 Question 14

Question p16 captures, on a scale of 0 to 10 (where 0 means “not at all feminist” and 10 means “very feminist”) the self-placement of the respondents. How is this variable distributed by gender? Make a frequency table

Code
prop.table(table(to_factor(data$sexo), data$p16), margin = 1)
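Remember what `margin = 1` does here: proportions are computed within each row (each sex), not over the whole table. A minimal sketch with made-up data:

```r
# toy contingency table (invented values)
tab <- table(
  sexo = c("Mujer", "Mujer", "Hombre", "Hombre", "Hombre"),
  resp = c("si", "no", "si", "si", "no")
)
prop.table(tab, margin = 1)  # each row now sums to 1
```

With `margin = 2` each column would sum to 1, and with no `margin` the whole table sums to 1.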

What % of men and women place themselves above 7?

Code
data |> 
  mutate("p16" = factor(p16, ordered = TRUE)) |> 
  group_by(sexo) |> 
  count(p16 > 7) |> 
  mutate("porc" = 100*n/sum(n)) |> 
  ungroup()
# 48.3% among men
# 62.3% among women
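Converting p16 to an ordered factor is what makes the comparison `p16 > 7` meaningful: R matches the value against the factor’s levels and compares positions. A quick illustration with toy values:

```r
# ordered factor on a 0-10 scale (invented responses)
p <- factor(c(5, 8, 10, 3), levels = 0:10, ordered = TRUE)
p > 7
#> FALSE  TRUE  TRUE FALSE
```

On an unordered factor the same comparison would return NA with a warning, since `<` and `>` are only defined for ordered factors.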

Question p19 captures the minutes spent on household chores. Convert it to numeric (“No minutes” as 0).

Code
data <-
  data |>
  # labels that are not numbers are coerced to NA (hence the warning)
  mutate("p19" = as.numeric(as.character(p19)))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `p19 = as.numeric(as.character(p19))`.
Caused by warning:
! NAs introduced by coercion
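The warning above is expected: `as.character()` first recovers the labels, and any label that is not a number (such as “No minutes”) becomes NA when coerced. Beware of calling `as.numeric()` directly on the factor, which returns the internal level codes rather than the values. A toy illustration:

```r
f <- factor(c("30", "60", "No minutes"))
as.numeric(as.character(f))  # 30 60 NA (plus a coercion warning)
as.numeric(f)                # the internal codes, NOT the values
```

This is why the two-step `as.numeric(as.character())` idiom is needed when a factor stores numbers as labels.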

After that, first calculate the average disaggregated by sex, and then the average disaggregated by sex and p16. What conclusions do you draw? Make a graph if you consider it appropriate.

Code
mean_sex <-
  data |> 
  summarise("avg_min" = mean(p19, na.rm = TRUE), .by = sexo)
# 175 minutes in women vs 129 in men

mean_sex_feminism <-
  data |> 
  summarise("avg_min" = mean(p19, na.rm = TRUE),
            .by = c(sexo, p16))
ggplot(mean_sex_feminism) +
  geom_col(aes(x = p16, y = avg_min, fill = avg_min),
           alpha = 0.7) +
  scale_fill_viridis_c(option = "magma") +
  facet_wrap(~to_factor(sexo)) +
  theme_minimal()
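As a side note, the `.by` argument used above is a per-operation alternative to `group_by()` + `ungroup()`: the grouping applies only to that `summarise()` call and the result comes back ungrouped. A minimal sketch with invented data:

```r
library(dplyr)

d <- tibble(g = c("a", "a", "b"), x = c(1, 3, 10))
d |> summarise("avg" = mean(x), .by = g)
# equivalent to: d |> group_by(g) |> summarise("avg" = mean(x)) |> ungroup()
```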

Code
# We can see how, among men, the average minutes stay roughly
# constant regardless of self-placement on the feminism scale

8.15 Question 15

We will compute, in a new variable, the difference in minutes spent on household chores between the respondent (p19) and their partner (p20).

Code
data <- 
  data |>
  mutate("p20" = as.numeric(as.character(p20))) |> 
  mutate("diff_min" = p19 - p20)
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `p20 = as.numeric(as.character(p20))`.
Caused by warning:
! NAs introduced by coercion

Repeat the previous analysis using this difference in minutes.

Code
mean_sex <-
  data |> 
  filter(p17a %in% c(1, 2) & as.numeric(p17a) != as.numeric(sexo)) |>
  summarise("avg_diff_min" = mean(diff_min, na.rm = TRUE), .by = sexo)
# +63.7 minutes in women vs -40.6 in men

mean_sex_feminism <-
  data |> 
  filter(p17a %in% c(1, 2) & as.numeric(p17a) != as.numeric(sexo)) |>
  summarise("avg_diff_min" = mean(diff_min, na.rm = TRUE),
            .by = c(sexo, p16))

ggplot(mean_sex_feminism) +
  geom_col(aes(x = p16, y = avg_diff_min, fill = avg_diff_min),
           alpha = 0.7) +
  scale_fill_viridis_c(option = "magma") +
  facet_wrap(~to_factor(sexo)) +
  theme_minimal()