Joins and imports: how to combine, import and export files in R. Factors and lists
Practical workbooks of Data Programming in Master in Computational Social Science (2024-2025)
1 Joins
When working with data we will not always have the information in a single table, and sometimes, we will be interested in cross-referencing information from different sources.
For this we will use a classic of every language that handles data: the famous join, a tool that allows us to cross one or several tables using an identifying column of each one of them, following the pattern table_1 |> xxx_join(table_2, by = id).
- inner_join(): only records whose id appears in both tables survive.
- full_join(): keeps all records of both tables.
- left_join(): keeps all records of the first table and looks for which ones also have their id in the second one (when missing, the fields of the second table are filled with NA).
- right_join(): keeps all records of the second table and looks for which ones also have their id in the first one.
Let’s test the various joins with a simple example
library(tidyverse)── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
tb_1 <- tibble("key" = 1:3, "val_x" = c("x1", "x2", "x3"))
tb_2 <- tibble("key" = c(1, 2, 4), "val_y" = c("y1", "y2", "y3"))
tb_1# A tibble: 3 × 2
key val_x
<int> <chr>
1 1 x1
2 2 x2
3 3 x3
tb_2# A tibble: 3 × 2
key val_y
<dbl> <chr>
1 1 y1
2 2 y2
3 4 y3
1.1 left_join()
Imagine that we want to incorporate into tb_1 the information from tb_2, identifying the records by the key column (by = "key", the column used for the cross): we want to keep all the records of the first table and look for which ones have the same value of key also in the second one.
tb_1 |>
left_join(tb_2, by = "key")# A tibble: 3 × 3
key val_x val_y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 3 x3 <NA>
Notice that the records of the first table whose key was not found in the second one have been given missing values (NA).
1.2 right_join()
The right_join() performs the opposite operation: we now incorporate into tb_2 the information from tb_1, identifying the records by the key column: we want to keep all the records of the second table and look for which ones have the same value of key also in the first one.
tb_1 |>
right_join(tb_2, by = "key")# A tibble: 3 × 3
key val_x val_y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 4 <NA> y3
Notice that now it is the records of the second table whose key was not found in the first one that are given missing values.
1.3 keys and suffixes
The key columns we will use for the crossover will not always be named the same.
tb_1 <- tibble("key_1" = 1:3, "val_x" = c("x1", "x2", "x3"))
tb_2 <- tibble("key_2" = c(1, 2, 4), "val_y" = c("y1", "y2", "y3"))
With by = c("key_1" = "key_2") we indicate in which column of each table the keys to be crossed are.
# Left
tb_1 |>
left_join(tb_2, by = c("key_1" = "key_2"))# A tibble: 3 × 3
key_1 val_x val_y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 3 x3 <NA>
# Right
tb_1 |>
right_join(tb_2, by = c("key_1" = "key_2"))# A tibble: 3 × 3
key_1 val_x val_y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 4 <NA> y3
We can also cross by several columns at the same time (records are considered equal when they share the whole set of keys), with by = c("var1_t1" = "var1_t2", "var2_t1" = "var2_t2", ...). Let's modify the previous example
tb_1 <- tibble("k_11" = 1:3, "k_12" = c("a", "b", "c"), "val_x" = c("x1", "x2", "x3"))
tb_2 <- tibble("k_21" = c(1, 2, 4), "k_22" = c("a", "b", "e"), "val_y" = c("y1", "y2", "y3"))# Left
tb_1 |>
left_join(tb_2,
by = c("k_11" = "k_21", "k_12" = "k_22"))# A tibble: 3 × 4
k_11 k_12 val_x val_y
<dbl> <chr> <chr> <chr>
1 1 a x1 y1
2 2 b x2 y2
3 3 c x3 <NA>
# Right
tb_1 |>
right_join(tb_2,
by = c("k_11" = "k_21", "k_12" = "k_22"))# A tibble: 3 × 4
k_11 k_12 val_x val_y
<dbl> <chr> <chr> <chr>
1 1 a x1 y1
2 2 b x2 y2
3 4 e <NA> y3
It could also happen that when crossing two tables, there are columns of values named the same
tb_1 <- tibble("key_1" = 1:3, "val" = c("x1", "x2", "x3"))
tb_2 <- tibble("key_2" = c(1, 2, 4), "val" = c("y1", "y2", "y3"))# Left
tb_1 |>
left_join(tb_2, by = c("key_1" = "key_2"))# A tibble: 3 × 3
key_1 val.x val.y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 3 x3 <NA>
Notice that by default it adds the suffixes .x and .y to tell us which table each column comes from. These suffixes can be specified in the optional argument suffix = ..., which allows us to distinguish the variables of one table from those of the other.
# Left
tb_1 |>
left_join(tb_2, by = c("key_1" = "key_2"),
suffix = c("_table1", "_table2"))# A tibble: 3 × 3
key_1 val_table1 val_table2
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 3 x3 <NA>
1.4 full_join()
The two previous cases are examples of what are known as outer joins: crosses that keep observations appearing in at least one of the tables. The third outer join is full_join(), which keeps the observations of both tables, adding the rows that do not match the other table.
tb_1 |>
full_join(tb_2, by = c("key_1" = "key_2"))# A tibble: 4 × 3
key_1 val.x val.y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 3 x3 <NA>
4 4 <NA> y3
1.5 inner_join()
Opposite to the outer joins is what is known as an inner join, with inner_join(): a cross in which only the observations appearing in both tables are kept, i.e. only the matched records survive.
tb_1 |>
inner_join(tb_2, by = c("key_1" = "key_2"))# A tibble: 2 × 3
key_1 val.x val.y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
Note that, in terms of records, inner_join() is commutative, so the order of the tables does not matter: the only thing that changes is the order of the columns added.
tb_1 |>
inner_join(tb_2, by = c("key_1" = "key_2"))# A tibble: 2 × 3
key_1 val.x val.y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
tb_2 |> inner_join(tb_1, by = c("key_2" = "key_1"))# A tibble: 2 × 3
key_2 val.x val.y
<dbl> <chr> <chr>
1 1 y1 x1
2 2 y2 x2
1.6 semi/anti_join()
Finally we have two useful tools to filter (not cross) records: semi_join() and anti_join(). The semi join keeps the records of the first table whose key is also in the second table (like an inner join but without adding the info from the second table). The anti join does just the opposite (keeps those that are not).
# semijoin
tb_1 |>
semi_join(tb_2, by = c("key_1" = "key_2"))# A tibble: 2 × 2
key_1 val
<int> <chr>
1 1 x1
2 2 x2
# antijoin
tb_1 |>
anti_join(tb_2, by = c("key_1" = "key_2"))# A tibble: 1 × 2
key_1 val
<int> <chr>
1 3 x3
1.7 💻 It’s your turn
For the exercises we will use the tables available in the package {nycflights13}.
library(nycflights13)
- airlines: name of airlines (with their abbreviation).
- airports: airport data (names, longitude, latitude, altitude, etc).
- flights, planes: flight and aircraft data.
- weather: hourly weather data.
📝 From the package {nycflights13}, incorporate into the flights table the airline data in airlines. We want to keep all flight records, adding the airline information from the airlines table.
Code
flights_airlines <-
flights |>
left_join(airlines, by = "carrier")
flights_airlines
📝 To the table obtained from the crossing of the previous section, include the data of the aircraft in planes, but keeping only those flights for which we have information on their aircraft (and vice versa).
Code
flights_airlines_planes <-
flights_airlines |>
inner_join(planes, by = "tailnum")
flights_airlines_planes
📝 Repeat the previous exercise but keeping both year variables (in one it is the year of the flight, in the other the year of construction of the aircraft), and distinguishing them from each other.
Code
flights_airlines_planes <-
flights_airlines |>
inner_join(planes, by = "tailnum",
suffix = c("_flight", "_build_aircraft"))
flights_airlines_planes
📝 To the table obtained from the previous exercise, add the longitude and latitude of the airports in airports, distinguishing between the latitude/longitude of the airport at destination and at origin.
Code
flights_airlines_planes |>
left_join(airports |> select(faa, lat, lon),
by = c("origin" = "faa")) |>
rename(lat_origin = lat, lon_origin = lon) |>
left_join(airports |> select(faa, lat, lon),
by = c("dest" = "faa")) |>
rename(lat_dest = lat, lon_dest = lon)
📝 Filter from airports only those airports from which flights depart. Repeat the process filtering only those airports where flights arrive.
Code
airports |>
semi_join(flights, by = c("faa" = "origin"))
airports |>
semi_join(flights, by = c("faa" = "dest"))
📝 For how many flights do we have no information about their aircraft? Eliminate beforehand the flights that do not have an aircraft ID (i.e. those with tailnum equal to NA).
Code
flights |>
drop_na(tailnum) |>
anti_join(planes, by = "tailnum") |>
count(tailnum, sort = TRUE)
2 🐣 Case study I: Beatles and Rolling Stones
We will use to practice simple joins the band_members and band_instruments datasets already included in the {dplyr} package.
library(dplyr)
band_members# A tibble: 3 × 2
name band
<chr> <chr>
1 Mick Stones
2 John Beatles
3 Paul Beatles
band_instruments# A tibble: 3 × 2
name plays
<chr> <chr>
1 John guitar
2 Paul bass
3 Keith guitar
In the first table we have a series of artists and the band they belong to; in the second one, a series of artists and the instrument they play. Beyond performing the requested actions, try to visualize which final table you would obtain.
2.1 Question 1
Given the band_members table, incorporate the information of which instrument each member plays (from band_instruments).
Code
left_join_band <-
band_members |>
left_join(band_instruments, by = "name")
2.2 Question 2
Given the band_members and band_instruments tables, what kind of join should you do to have a complete table, with no missing values, where all band members have their instrument information and every instrument has a member associated with it?
Code
inner_join_band <-
band_members |>
inner_join(band_instruments, by = "name")
2.3 Question 3
Given the band_instruments table, how can we incorporate the information of who plays each instrument (in case we know it)?
Code
right_join_band <-
band_members |>
right_join(band_instruments, by = "name")
# other option
left_join_instruments <-
band_instruments |>
left_join(band_members, by = "name")
2.4 Question 4
Given the band_members and band_instruments tables, what kind of join should you do to have a table with all the info, both members and instruments, even if there are members whose instrument you do not know and instruments whose player you do not know?
Code
full_join_band <-
band_members |>
full_join(band_instruments, by = "name")
3 🐣 Case study II: income by municipalities
In the file municipios.csv we have stored the information of the municipalities of Spain as of 2019.
- The variable LAU_CODE represents the code as local administrative unit in the EU (see more at https://ec.europa.eu/eurostat/web/nuts/local-administrative-units).
- The variable codigo_ine is formed by joining the province code and the municipality code (each province has a code from 1 to 52; it does not depend on the ccaa).
# 2019 data
mun_data <- read_csv(file = "./datos/municipios.csv")Rows: 8212 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): codauto, ine.ccaa.name, cpro, ine.prov.name, cmun, name, LAU_CODE, ...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
On the other hand, in the file renta_mun we have the average per capita income of each administrative unit (municipalities, districts, provinces, autonomous communities,…) for different years.
renta_mun <- read_csv(file = "./datos/renta_mun.csv")Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
Rows: 55273 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Unidad, codigo_ine
dbl (5): 2019, 2018, 2017, 2016, 2015
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Before we start let’s normalize variable names using clean_names() from the {janitor} package.
mun_data <-
mun_data |>
janitor::clean_names()
renta_mun <-
renta_mun |>
janitor::clean_names()
3.1 Question 1
Convert renta_mun to tidydata, obtaining a table of 4 columns: unidad, year, income and codigo_ine (with no missing values and each column of the correct type).
Code
renta_mun_tidy <-
renta_mun |>
pivot_longer(cols = contains("x"), names_to = "year",
values_to = "income", names_prefix = "x",
names_transform = list(year = as.numeric),
values_drop_na = TRUE)
3.2 Question 2
If you look at the table above we have data from different administrative units that are not always municipalities. Knowing that all municipalities have a 5-character code, filter only those records that correspond to municipal units.
Code
renta_mun_tidy <-
renta_mun_tidy |>
filter(str_detect(codigo_ine, pattern = "[0-9]{5}") &
str_length(codigo_ine) == 5)
3.3 Question 3
Then properly separate the unidad variable into two columns: one with the code (which you already have, so one of the two should be removed) and one with the name. Remove any remaining spaces (take a look at the {stringr} package options).
Code
renta_mun_tidy <-
renta_mun_tidy |>
separate(col = "unidad", into = c("cod_rm", "name"), sep = 5) |>
select(-cod_rm) |>
mutate(name = str_trim(name))
3.4 Question 4
In which year was the average income highest? And lowest? What was the median income of municipalities in Spain in 2019?
summary_renta <-
renta_mun_tidy |>
summarise("mean_income" = mean(income, na.rm = TRUE),
.by = year)
summary_renta |>
slice_min(mean_income, n = 1)# A tibble: 1 × 2
year mean_income
<dbl> <dbl>
1 2015 10083.
summary_renta |>
slice_max(mean_income, n = 1)# A tibble: 1 × 2
year mean_income
<dbl> <dbl>
1 2019 11672.
renta_mun_tidy |>
filter(year == 2019) |>
summarise("median_income" = median(income, na.rm = TRUE))# A tibble: 1 × 1
median_income
<dbl>
1 11462
3.5 Question 5
Do whatever you consider necessary to obtain the province with the highest average income in 2019 and the one with the lowest. Make sure to get its name.
Code
summary_by_prov <-
renta_mun_tidy |>
filter(year == 2019) |>
left_join(mun_data, by = "codigo_ine", suffix = c("", "_rm")) |>
select(-contains("rm")) |>
summarise("mean_by_prov" = mean(income, na.rm = TRUE),
.by = c("cpro", "ine_prov_name"))
summary_by_prov |>
slice_max(mean_by_prov, n = 1)# A tibble: 1 × 3
cpro ine_prov_name mean_by_prov
<chr> <chr> <dbl>
1 20 Gipuzkoa 15890.
Code
summary_by_prov |>
slice_min(mean_by_prov, n = 1)# A tibble: 1 × 3
cpro ine_prov_name mean_by_prov
<chr> <chr> <dbl>
1 06 Badajoz 8805.
3.6 Question 6
Obtain from each ccaa the name of the municipality with the highest income in 2019.
Code
renta_mun_tidy |>
filter(year == 2019) |>
left_join(mun_data, by = "codigo_ine", suffix = c("", "_rm")) |>
select(-contains("rm")) |>
slice_max(income, by = "codauto")# A tibble: 19 × 10
name codigo_ine year income codauto ine_ccaa_name cpro ine_prov_name cmun
<chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
1 Nava… 40904 2019 20242 07 Castilla y L… 40 Segovia 904
2 Alma… 10019 2019 13940 11 Extremadura 10 Cáceres 019
3 Olei… 15058 2019 16447 12 Galicia 15 Coruña, A 058
4 Lauk… 48053 2019 21672 16 País Vasco 48 Bizkaia 053
5 Lumb… 26091 2019 24007 17 Rioja, La 26 Rioja, La 091
6 Murc… 30030 2019 11631 14 Murcia, Regi… 30 Murcia 030
7 Raja… 08178 2019 23401 09 Cataluña 08 Barcelona 178
8 Aloc… 19023 2019 19700 08 Castilla - L… 19 Guadalajara 023
9 Argu… 22037 2019 19219 02 Aragón 22 Huesca 037
10 Alcu… 04009 2019 14196 01 Andalucía 04 Almería 009
11 Aran… 31023 2019 15517 15 Navarra, Com… 31 Navarra 023
12 Sant… 35021 2019 15982 05 Canarias 35 Palmas, Las 021
13 Roca… 46216 2019 17872 10 Comunitat Va… 46 Valencia/Val… 216
14 Pozu… 28115 2019 26367 13 Madrid, Comu… 28 Madrid 115
15 Vall… 07063 2019 19565 04 Balears, Ill… 07 Balears, Ill… 063
16 Vald… 39093 2019 15574 06 Cantabria 39 Cantabria 093
17 Teve… 33072 2019 15409 03 Asturias, Pr… 33 Asturias 072
18 Ceuta 51001 2019 12038 18 Ceuta 51 Ceuta 001
19 Meli… 52001 2019 11463 19 Melilla 52 Melilla 001
# ℹ 1 more variable: lau_code <chr>
4 Import and export files
So far we have only used data already loaded in packages but many times we will need to import data externally. One of the main strengths of R is that we can import data very easily in different formats:
- R native formats: .rda, .RData and .rds.
- Rectangular (tabular) data: .csv and .tsv formats.
- Untabulated data: .txt format.
- Data in Excel: .xls and .xlsx formats.
- Data from SAS/Stata/SPSS: .sas7bdat, .sav and .dta formats.
- Data from Google Drive.
- Data from API's: aemet, catastro, twitter, spotify, etc.
4.1 R native formats
The simplest files to import into R (and which usually take up less disk space) are its own native extensions: files in .RData, .rda and .rds formats. To load the former we simply need to use the native function load() by providing it the file path.
.RData file: we are going to import the file world_bank_pop.RData, which includes the dataset world_bank_pop
load("./datos/world_bank_pop.RData")
world_bank_pop# A tibble: 1,064 × 20
country indicator `2000` `2001` `2002` `2003` `2004` `2005` `2006`
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ABW SP.URB.TOTL 4.16e4 4.20e+4 4.22e+4 4.23e+4 4.23e+4 4.24e+4 4.26e+4
2 ABW SP.URB.GROW 1.66e0 9.56e-1 4.01e-1 1.97e-1 9.46e-2 1.94e-1 3.67e-1
3 ABW SP.POP.TOTL 8.91e4 9.07e+4 9.18e+4 9.27e+4 9.35e+4 9.45e+4 9.56e+4
4 ABW SP.POP.GROW 2.54e0 1.77e+0 1.19e+0 9.97e-1 9.01e-1 1.00e+0 1.18e+0
5 AFE SP.URB.TOTL 1.16e8 1.20e+8 1.24e+8 1.29e+8 1.34e+8 1.39e+8 1.44e+8
6 AFE SP.URB.GROW 3.60e0 3.66e+0 3.72e+0 3.71e+0 3.74e+0 3.81e+0 3.81e+0
7 AFE SP.POP.TOTL 4.02e8 4.12e+8 4.23e+8 4.34e+8 4.45e+8 4.57e+8 4.70e+8
8 AFE SP.POP.GROW 2.58e0 2.59e+0 2.61e+0 2.62e+0 2.64e+0 2.67e+0 2.70e+0
9 AFG SP.URB.TOTL 4.31e6 4.36e+6 4.67e+6 5.06e+6 5.30e+6 5.54e+6 5.83e+6
10 AFG SP.URB.GROW 1.86e0 1.15e+0 6.86e+0 7.95e+0 4.59e+0 4.47e+0 5.03e+0
# ℹ 1,054 more rows
# ℹ 11 more variables: `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>,
# `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>,
# `2016` <dbl>, `2017` <dbl>
.rda file: we will import the airquality dataset from airquality.rda
load("./datos/airquality.rda")
airquality |> as_tibble()# A tibble: 153 × 6
Ozone Solar.R Wind Temp Month Day
<int> <int> <dbl> <int> <int> <int>
1 41 190 7.4 67 5 1
2 36 118 8 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9
10 NA 194 8.6 69 5 10
# ℹ 143 more rows
Note that files loaded with load() are automatically placed in the environment (under the names they were originally saved with), and not only datasets can be loaded: load() allows us to load multiple objects (not only tabular data).
Native .rda and .RData files are a proper way to save your environment.
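As a quick sketch of this behaviour (using a temporary file instead of the course's ./datos folder, so the path here is illustrative), save() can bundle several objects of different types and load() restores them all under their original names, returning those names invisibly:

```r
library(tibble)

# save several objects of different types into a single .rda file
tb <- tibble("a" = 1:3, "b" = c("x", "y", "z"))
v <- c(10, 20)
msg <- "hello"
path <- tempfile(fileext = ".rda")
save(tb, v, msg, file = path)

# remove them and load them back: load() returns the names it restored
rm(tb, v, msg)
loaded <- load(path)
sort(loaded)
# -> "msg" "tb"  "v"
```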
load(file = "./datos/multiple_objects.rda")
.rds files: for this type we must use readRDS(), providing the path in the file argument. In this case we are going to import lung cancer data from the North Central Cancer Treatment Group. Note that an .rds file imports a single object
lung_cancer <-
readRDS(file = "./datos/NCCTG_lung_cancer.rds") |>
as_tibble()
lung_cancer# A tibble: 228 × 10
inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 306 2 74 1 1 90 100 1175 NA
2 3 455 2 68 1 0 90 90 1225 15
3 3 1010 1 56 1 0 90 90 NA 15
4 5 210 2 57 1 1 90 60 1150 11
5 1 883 2 60 1 0 100 90 NA 0
6 12 1022 1 74 1 1 50 80 513 0
7 7 310 2 68 2 2 70 60 384 10
8 11 361 2 71 2 2 60 80 538 1
9 1 218 2 53 1 1 70 80 825 16
10 7 166 2 61 1 2 70 70 271 34
# ℹ 218 more rows
The paths must always be without spaces, ñ, or accents.
4.2 Tabular data: readr
The {readr} package within the {tidyverse} environment contains several useful functions for loading rectangular data (without formatting).
- read_csv(): .csv files whose separator is a comma.
- read_csv2(): semicolon.
- read_tsv(): tab.
- read_table(): space.
- read_delim(): generic function for character-delimited files.
All of them need the file path as argument, plus other optional arguments (whether to skip a header, decimal marks, etc). See more at https://readr.tidyverse.org/
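As an illustration of those optional arguments (a minimal sketch: the data is passed as a literal string wrapped in I() instead of a file path, so the values are made up), European-style files often combine a semicolon separator with a decimal comma, which read_csv2() assumes by default via its locale:

```r
library(readr)

# semicolon-separated data with decimal commas, passed as literal text
txt <- I("id;value\n1;3,14\n2;2,50")
datos <- read_csv2(file = txt, show_col_types = FALSE)
datos$value
# -> 3.14 2.50
```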
4.2.1 .csv, .tsv
The main advantage of {readr} is that it automates formatting to go from a flat (unformatted) file to a tibble (in rows and columns, with formatting).
- .csv files: with read_csv() we will load comma-separated files, passing the path as argument in file = .... Let's import the chickens.csv dataset (about cartoon chickens, why not). If you look at the output, it tells us the type of each variable.
library(readr)
chickens <- read_csv(file = "./datos/chickens.csv")Rows: 5 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): chicken, sex, motto
dbl (1): eggs_laid
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
chickens# A tibble: 5 × 4
chicken sex eggs_laid motto
<chr> <chr> <dbl> <chr>
1 Foghorn Leghorn rooster 0 That's a joke, ah say, that's a jok…
2 Chicken Little hen 3 The sky is falling!
3 Ginger hen 12 Listen. We'll either die free chick…
4 Camilla the Chicken hen 7 Bawk, buck, ba-gawk.
5 Ernie The Giant Chicken rooster 0 Put Captain Solo in the cargo hold.
The variable formats will normally be detected automatically by read_csv(), and we can query them with spec().
spec(chickens)cols(
chicken = col_character(),
sex = col_character(),
eggs_laid = col_double(),
motto = col_character()
)
Although it usually gets it right automatically, we can specify the format explicitly in col_types = list() (in list format, with a col_xxx() for each variable; for example, here eggs_laid will be imported as character).
chickens <-
read_csv(file = "./datos/chickens.csv",
col_types = list(col_character(), col_character(),
col_character(), col_character()))
chickens# A tibble: 5 × 4
chicken sex eggs_laid motto
<chr> <chr> <chr> <chr>
1 Foghorn Leghorn rooster 0 That's a joke, ah say, that's a jok…
2 Chicken Little hen 3 The sky is falling!
3 Ginger hen 12 Listen. We'll either die free chick…
4 Camilla the Chicken hen 7 Bawk, buck, ba-gawk.
5 Ernie The Giant Chicken rooster 0 Put Captain Solo in the cargo hold.
We can even indicate which variables to select (without occupying memory with the rest), by indicating them in col_select = ....
chickens <-
read_csv(file = "./datos/chickens.csv",
col_select = c(chicken, sex, eggs_laid))Rows: 5 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): chicken, sex
dbl (1): eggs_laid
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
chickens# A tibble: 5 × 3
chicken sex eggs_laid
<chr> <chr> <dbl>
1 Foghorn Leghorn rooster 0
2 Chicken Little hen 3
3 Ginger hen 12
4 Camilla the Chicken hen 7
5 Ernie The Giant Chicken rooster 0
4.2.2 .txt
What happens when the separator is not the expected one?
If we use read_csv(), it expects the separator between columns to be a comma but, as you can see with the following .txt, it interprets everything as a single column: the file has no commas, so it does not know where to split.
datos_txt <- read_csv(file = "./datos/massey-rating.txt")Rows: 10 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): UCC PAY LAZ KPK RT COF BIH DII ENG ACU Rank Team Conf
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(datos_txt)[1] 10 1
as_tibble(datos_txt)# A tibble: 10 × 1
`UCC PAY LAZ KPK RT COF BIH DII ENG ACU Rank Team Conf`
<chr>
1 1 1 1 1 1 1 1 1 1 1 1 Ohio St B10
2 2 2 2 2 2 2 2 2 4 2 2 Oregon P12
3 3 4 3 4 3 4 3 4 2 3 3 Alabama SEC
4 4 3 4 3 4 3 5 3 3 4 4 TCU B12
5 6 6 6 5 5 7 6 5 6 11 5 Michigan St B10
6 7 7 7 6 7 6 11 8 7 8 6 Georgia SEC
7 5 5 5 7 6 8 4 6 5 5 7 Florida St ACC
8 8 8 9 9 10 5 7 7 10 7 8 Baylor B12
9 9 11 8 13 11 11 12 9 14 9 9 Georgia Tech ACC
10 13 10 13 11 8 9 10 11 9 10 10 Mississippi SEC
For these cases we have read_csv2() when the separator is a semicolon, read_tsv() when it is a tab, read_table() when it is a space, and read_delim() for the general case.
datos_txt <- read_table(file = "./datos/massey-rating.txt")
── Column specification ────────────────────────────────────────────────────────
cols(
UCC = col_double(),
PAY = col_double(),
LAZ = col_double(),
KPK = col_double(),
RT = col_double(),
COF = col_double(),
BIH = col_double(),
DII = col_double(),
ENG = col_double(),
ACU = col_double(),
Rank = col_double(),
Team = col_character(),
Conf = col_character()
)
Warning: 10 parsing failures.
row col expected actual file
1 -- 13 columns 15 columns './datos/massey-rating.txt'
2 -- 13 columns 14 columns './datos/massey-rating.txt'
3 -- 13 columns 14 columns './datos/massey-rating.txt'
4 -- 13 columns 14 columns './datos/massey-rating.txt'
5 -- 13 columns 15 columns './datos/massey-rating.txt'
... ... .......... .......... ...........................
See problems(...) for more details.
as_tibble(datos_txt)# A tibble: 10 × 13
UCC PAY LAZ KPK RT COF BIH DII ENG ACU Rank Team Conf
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 1 1 1 1 1 1 1 1 1 1 1 Ohio St
2 2 2 2 2 2 2 2 2 4 2 2 Oreg… P12
3 3 4 3 4 3 4 3 4 2 3 3 Alab… SEC
4 4 3 4 3 4 3 5 3 3 4 4 TCU B12
5 6 6 6 5 5 7 6 5 6 11 5 Mich… St
6 7 7 7 6 7 6 11 8 7 8 6 Geor… SEC
7 5 5 5 7 6 8 4 6 5 5 7 Flor… St
8 8 8 9 9 10 5 7 7 10 7 8 Bayl… B12
9 9 11 8 13 11 11 12 9 14 9 9 Geor… Tech
10 13 10 13 11 8 9 10 11 9 10 10 Miss… SEC
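The generic read_delim() covers any other single-character separator via delim = .... A minimal sketch (again with a literal string via I() instead of a file on disk, so the data here is made up):

```r
library(readr)

# pipe-delimited data: declare the separator explicitly
txt <- I("team|conf\nOhio St|B10\nOregon|P12")
datos <- read_delim(file = txt, delim = "|", show_col_types = FALSE)
datos
```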
4.3 Excel data (.xls, .xlsx)
Another key import package will be the {readxl} package for importing data from Excel. Three functions will be key:
- read_xls(): specific to .xls.
- read_xlsx(): specific to .xlsx.
- read_excel(): for both .xls and .xlsx.
We are going to import deaths.xlsx with celebrity death records.
library(readxl)
deaths <- read_xlsx(path = "./datos/deaths.xlsx")New names:
• `` -> `...2`
• `` -> `...3`
• `` -> `...4`
• `` -> `...5`
• `` -> `...6`
deaths# A tibble: 18 × 6
`Lots of people` ...2 ...3 ...4 ...5 ...6
<chr> <chr> <chr> <chr> <chr> <chr>
1 simply cannot resist writing <NA> <NA> <NA> <NA> some…
2 at the top <NA> of thei…
3 or merging <NA> <NA> <NA> cells
4 Name Profession Age Has … Date… Date…
5 David Bowie musician 69 TRUE 17175 42379
6 Carrie Fisher actor 60 TRUE 20749 42731
7 Chuck Berry musician 90 TRUE 9788 42812
8 Bill Paxton actor 61 TRUE 20226 42791
9 Prince musician 57 TRUE 21343 42481
10 Alan Rickman actor 69 FALSE 16854 42383
11 Florence Henderson actor 82 TRUE 12464 42698
12 Harper Lee author 89 FALSE 9615 42419
13 Zsa Zsa Gábor actor 99 TRUE 6247 42722
14 George Michael musician 53 FALSE 23187 42729
15 Some <NA> <NA> <NA> <NA> <NA>
16 <NA> also like to write stuff <NA> <NA> <NA> <NA>
17 <NA> <NA> at t… bott… <NA> <NA>
18 <NA> <NA> <NA> <NA> <NA> too!
deaths |> slice(1:6)# A tibble: 6 × 6
`Lots of people` ...2 ...3 ...4 ...5 ...6
<chr> <chr> <chr> <chr> <chr> <chr>
1 simply cannot resist writing <NA> <NA> <NA> <NA> some not…
2 at the top <NA> of their sp…
3 or merging <NA> <NA> <NA> cells
4 Name Profession Age Has kids Date of birth Date of …
5 David Bowie musician 69 TRUE 17175 42379
6 Carrie Fisher actor 60 TRUE 20749 42731
A very common misfortune is that there is some kind of comment or text at the beginning of the file, forcing us to skip those rows.
We can skip these rows directly in the load with skip = ... (indicating the number of rows to skip).
deaths <- read_xlsx(path = "./datos/deaths.xlsx", skip = 4)
deaths# A tibble: 14 × 6
Name Profession Age `Has kids` `Date of birth` `Date of death`
<chr> <chr> <chr> <chr> <dttm> <chr>
1 David Bowie musician 69 TRUE 1947-01-08 00:00:00 42379
2 Carrie Fisher actor 60 TRUE 1956-10-21 00:00:00 42731
3 Chuck Berry musician 90 TRUE 1926-10-18 00:00:00 42812
4 Bill Paxton actor 61 TRUE 1955-05-17 00:00:00 42791
5 Prince musician 57 TRUE 1958-06-07 00:00:00 42481
6 Alan Rickman actor 69 FALSE 1946-02-21 00:00:00 42383
7 Florence Hen… actor 82 TRUE 1934-02-14 00:00:00 42698
8 Harper Lee author 89 FALSE 1926-04-28 00:00:00 42419
9 Zsa Zsa Gábor actor 99 TRUE 1917-02-06 00:00:00 42722
10 George Micha… musician 53 FALSE 1963-06-25 00:00:00 42729
11 Some <NA> <NA> <NA> NA <NA>
12 <NA> also like… <NA> <NA> NA <NA>
13 <NA> <NA> at t… bottom, NA <NA>
14 <NA> <NA> <NA> <NA> NA too!
In addition, with col_names = ... we can already rename the columns at import (if we provide names, it assumes the first row read is already data, which is why we now skip 5 rows instead of 4).
deaths <-
read_xlsx(path = "./datos/deaths.xlsx", skip = 5,
col_names = c("name", "profession", "age", "kids", "birth", "death"))
deaths# A tibble: 14 × 6
name profession age kids birth death
<chr> <chr> <chr> <chr> <dttm> <chr>
1 David Bowie musician 69 TRUE 1947-01-08 00:00:00 42379
2 Carrie Fisher actor 60 TRUE 1956-10-21 00:00:00 42731
3 Chuck Berry musician 90 TRUE 1926-10-18 00:00:00 42812
4 Bill Paxton actor 61 TRUE 1955-05-17 00:00:00 42791
5 Prince musician 57 TRUE 1958-06-07 00:00:00 42481
6 Alan Rickman actor 69 FALSE 1946-02-21 00:00:00 42383
7 Florence Henderson actor 82 TRUE 1934-02-14 00:00:00 42698
8 Harper Lee author 89 FALSE 1926-04-28 00:00:00 42419
9 Zsa Zsa Gábor actor 99 TRUE 1917-02-06 00:00:00 42722
10 George Michael musician 53 FALSE 1963-06-25 00:00:00 42729
11 Some <NA> <NA> <NA> NA <NA>
12 <NA> also like to write … <NA> <NA> NA <NA>
13 <NA> <NA> at t… bott… NA <NA>
14 <NA> <NA> <NA> <NA> NA too!
Sometimes Excel dates come incorrectly formatted (surprise): we can use convertToDate() from the {openxlsx} package to convert them.
library(openxlsx)
deaths$death <- convertToDate(deaths$death)Warning in convertToDate(deaths$death): NAs introduced by coercion
deaths# A tibble: 14 × 6
name profession age kids birth death
<chr> <chr> <chr> <chr> <dttm> <date>
1 David Bowie musician 69 TRUE 1947-01-08 00:00:00 2016-01-10
2 Carrie Fisher actor 60 TRUE 1956-10-21 00:00:00 2016-12-27
3 Chuck Berry musician 90 TRUE 1926-10-18 00:00:00 2017-03-18
4 Bill Paxton actor 61 TRUE 1955-05-17 00:00:00 2017-02-25
5 Prince musician 57 TRUE 1958-06-07 00:00:00 2016-04-21
6 Alan Rickman actor 69 FALSE 1946-02-21 00:00:00 2016-01-14
7 Florence Henderson actor 82 TRUE 1934-02-14 00:00:00 2016-11-24
8 Harper Lee author 89 FALSE 1926-04-28 00:00:00 2016-02-19
9 Zsa Zsa Gábor actor 99 TRUE 1917-02-06 00:00:00 2016-12-18
10 George Michael musician 53 FALSE 1963-06-25 00:00:00 2016-12-25
11 Some <NA> <NA> <NA> NA NA
12 <NA> also like to w… <NA> <NA> NA NA
13 <NA> <NA> at t… bott… NA NA
14 <NA> <NA> <NA> <NA> NA NA
We can also load an Excel with several sheets: to indicate the sheet (either by its name or by its number) we will use the argument sheet = ....
mtcars <- read_xlsx(path = "./datos/datasets.xlsx", sheet = "mtcars")
mtcars# A tibble: 32 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ℹ 22 more rows
We can even indicate the range of cells to load with range = ....
iris <- read_xlsx(path = "./datos/datasets.xlsx", sheet = "iris", range = "C1:E4")
iris# A tibble: 3 × 3
Petal.Length Petal.Width Species
<dbl> <dbl> <chr>
1 1.4 0.2 setosa
2 1.4 0.2 setosa
3 1.3 0.2 setosa
4.4 Import from SAS/STATA/SPSS
The {haven} package, within the tidyverse orbit, allows us to import files from the three most important commercial statistical packages: SAS, SPSS and Stata.
library(haven)
# SAS
iris_sas <- read_sas(data_file = "./datos/iris.sas7bdat")
# SPSS
iris_spss <- read_sav(file = "./datos/iris.sav")
# Stata
iris_stata <- read_dta(file = "./datos/iris.dta")4.5 Export
In the same way that we can import, we can also export:
- Export to .RData (the recommended option for variables created in R). Remember that this extension can only be read in R. To do so, just use save(object, file = path).
table <- tibble("a" = 1:4, "b" = 1:4)
save(table, file = "./datos/table.RData")
rm(table) # eliminar
load("./datos/table.RData")
table# A tibble: 4 × 2
a b
<int> <int>
1 1 1
2 2 2
3 3 3
4 4 4
- Export multiple objects to .RData:
table <- tibble("a" = 1:4, "b" = 1:4)
a <- 1
b <- c("javi", "sandra")
save(table, a, b, file = "./datos/mult_obj.RData")
rm(list = c("a", "b", "table"))
load("./datos/mult_obj.RData")
table# A tibble: 4 × 2
a b
<int> <int>
1 1 1
2 2 2
3 3 3
4 4 4
- Export to .csv. To do this we simply use write_csv(object, file = path).
write_csv(table, file = "./datos/table.csv")
read_csv(file = "./datos/table.csv")Rows: 4 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): a, b
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 4 × 2
a b
<dbl> <dbl>
1 1 1
2 2 2
3 3 3
4 4 4
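A counterpart for Excel files is write.xlsx() from {openxlsx}, a package already used above for importing; a minimal sketch (the output path is illustrative):

```r
library(openxlsx)
# Export the same tibble to .xlsx in the data folder
write.xlsx(table, file = "./datos/table.xlsx")
```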
4.6 Import from new sources
4.6.1 Website
One of the main advantages of R is that we can use all the previous import functions directly on a web resource, without any manual download: instead of a local path we pass the link. For example, let's download the covid data from ISCIII (https://cnecovid.isciii.es/covid19/#documentaci%C3%B3n-y-datos)
covid_data <-
read_csv(file = "https://cnecovid.isciii.es/covid19/resources/casos_hosp_uci_def_sexo_edad_provres.csv")
covid_dataRows: 500 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): provincia_iso, sexo, grupo_edad
dbl (4): num_casos, num_hosp, num_uci, num_def
date (1): fecha
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 500 × 8
provincia_iso sexo grupo_edad fecha num_casos num_hosp num_uci num_def
<chr> <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl>
1 A H 0-9 2020-01-01 0 0 0 0
2 A H 10-19 2020-01-01 0 0 0 0
3 A H 20-29 2020-01-01 0 0 0 0
4 A H 30-39 2020-01-01 0 0 0 0
5 A H 40-49 2020-01-01 0 0 0 0
6 A H 50-59 2020-01-01 0 0 0 0
7 A H 60-69 2020-01-01 0 0 0 0
8 A H 70-79 2020-01-01 0 0 0 0
9 A H 80+ 2020-01-01 0 0 0 0
10 A H NC 2020-01-01 0 0 0 0
# ℹ 490 more rows
4.6.2 Wikipedia
The {rvest} package, one of the most useful in the {tidyverse}, allows us to import directly from HTML. For example, to extract Wikipedia tables we just use read_html() to import the HTML, html_element("table") to extract the table objects, and html_table() to convert the HTML table to a tibble.
library(rvest)
Attaching package: 'rvest'
The following object is masked from 'package:readr':
guess_encoding
wiki_jump <- 'https://en.wikipedia.org/wiki/Men%27s_long_jump_world_record_progression'
wiki_jump |> read_html() |>
html_element("table") |>
html_table()# A tibble: 19 × 5
Mark Wind Athlete Place Date
<chr> <chr> <chr> <chr> <chr>
1 7.61 m (24 ft 11+1⁄2 in) "" Peter O'Connor (IRE) Dublin, Ire… 5 Au…
2 7.69 m (25 ft 2+3⁄4 in) "" Edward Gourdin (USA) Cambridge, … 23 J…
3 7.76 m (25 ft 5+1⁄2 in) "" Robert LeGendre (USA) Paris, Fran… 7 Ju…
4 7.89 m (25 ft 10+1⁄2 in) "" DeHart Hubbard (USA) Chicago, Un… 13 J…
5 7.90 m (25 ft 11 in) "" Edward Hamm (USA) Cambridge, … 7 Ju…
6 7.93 m (26 ft 0 in) "0.0" Sylvio Cator (HAI) Paris, Fran… 9 Se…
7 7.98 m (26 ft 2 in) "0.5" Chuhei Nambu (JPN) Tokyo, Japan 27 O…
8 8.13 m (26 ft 8 in) "1.5" Jesse Owens (USA) Ann Arbor, … 25 M…
9 8.21 m (26 ft 11 in) "0.0" Ralph Boston (USA) Walnut, Uni… 12 A…
10 8.24 m (27 ft 1⁄4 in) "1.8" Ralph Boston (USA) Modesto, Un… 27 M…
11 8.28 m (27 ft 1+3⁄4 in) "1.2" Ralph Boston (USA) Moscow, Sov… 16 J…
12 8.31 m (27 ft 3 in) A "−0.1" Igor Ter-Ovanesyan (URS) Yerevan, So… 10 J…
13 8.33 m (27 ft 3+3⁄4 in)[2] "" Phil Shinnick (USA) Modesto, Un… 25 M…
14 8.31 m (27 ft 3 in) "0.0" Ralph Boston (USA) Kingston, J… 15 A…
15 8.34 m (27 ft 4+1⁄4 in) "1.0" Ralph Boston (USA) Los Angeles… 12 S…
16 8.35 m (27 ft 4+1⁄2 in)[5] "0.0" Ralph Boston (USA) Modesto, Un… 29 M…
17 8.35 m (27 ft 4+1⁄2 in) A "0.0" Igor Ter-Ovanesyan (URS) Mexico City… 19 O…
18 8.90 m (29 ft 2+1⁄4 in) A "2.0" Bob Beamon (USA) Mexico City… 18 O…
19 8.95 m (29 ft 4+1⁄4 in) "0.3" Mike Powell (USA) Tokyo, Japan 30 A…
4.6.3 Google Drive
Another available option (especially if we collaborate with other people) is to import from a Google Drive spreadsheet, making use of read_sheet() from the {googlesheets4} package.
The first time, you will be asked to grant the tidyverse API permission to interact with your Drive
library(googlesheets4)
google_sheet <-
read_sheet("https://docs.google.com/spreadsheets/d/1Uz38nHjl3bmftxDpcXj--DYyPo1I39NHVf-xjeg1_wI/edit?usp=sharing")! Using an auto-discovered, cached token.
To suppress this message, modify your code or options to clearly consent to
the use of a cached token.
See gargle's "Non-interactive auth" vignette for more details:
<https://gargle.r-lib.org/articles/non-interactive-auth.html>
ℹ The googlesheets4 package is using a cached token for 'javalv09@ucm.es'.
✔ Reading from "import-google-sheet".
✔ Range 'Hoja 1'.
google_sheet# A tibble: 3 × 4
a b c d
<chr> <dbl> <dbl> <chr>
1 d1 0.5 1 hombre
2 d2 -0.3 2 hombre
3 d3 1.1 3 mujer
4.6.4 API (owid)
Another interesting option is downloading data from an API: an intermediary between an app or data provider and our R session. For example, let's load the {owidR} package, which allows us to download data from https://ourworldindata.org/. The owid_covid() function loads, almost without us noticing, more than 400 000 records with more than 60 variables from 238 countries.
library(owidR)
owid_covid() |> as_tibble()# A tibble: 7 × 67
iso_code continent location date total_cases new_cases
<chr> <chr> <chr> <IDate> <int> <int>
1 AFG Asia Afghanistan 2020-01-05 0 0
2 AFG Asia Afghanistan 2020-01-06 0 0
3 AFG Asia Afghanistan 2020-01-07 0 0
4 AFG Asia Afghanistan 2020-01-08 0 0
5 AFG Asia Afghanistan 2020-01-09 0 0
6 AFG Asia Afghanistan 2020-01-10 0 0
7 AFG Asia Afghanistan 2020-01-11 0 0
# ℹ 61 more variables: new_cases_smoothed <dbl>, total_deaths <int>,
# new_deaths <int>, new_deaths_smoothed <dbl>, total_cases_per_million <dbl>,
# new_cases_per_million <dbl>, new_cases_smoothed_per_million <dbl>,
# total_deaths_per_million <dbl>, new_deaths_per_million <dbl>,
# new_deaths_smoothed_per_million <dbl>, reproduction_rate <dbl>,
# icu_patients <int>, icu_patients_per_million <dbl>, hosp_patients <int>,
# hosp_patients_per_million <dbl>, weekly_icu_admissions <int>, …
4.6.5 API (aemet)
On many occasions, to connect to an API we will first have to register and obtain a key; this is the case of the {climaemet} package to access Spanish meteorological data (https://opendata.aemet.es/centrodedescargas/inicio).
Once we have the API key we register it in our RStudio to be able to use it in the future.
library(climaemet)
# Api key
apikey <- "eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJqYXZhbHYwOUB1Y20uZXMiLCJqdGkiOiI4YTU1ODUxMS01MTE3LTQ4MTYtYmM4OS1hYmVkNDhiODBkYzkiLCJpc3MiOiJBRU1FVCIsImlhdCI6MTY2NjQ2OTcxNSwidXNlcklkIjoiOGE1NTg1MTEtNTExNy00ODE2LWJjODktYWJlZDQ4YjgwZGM5Iiwicm9sZSI6IiJ9.HEMR77lZy2ASjmOxJa8ppx2J8Za1IViurMX3p1reVBU"
aemet_api_key(apikey, install = TRUE)With this package we can search for stations to find out both their location and their identifier code within the AEMET network
stations <- aemet_stations()
stations# A tibble: 947 × 7
indicativo indsinop nombre provincia altitud longitud latitud
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 B013X "08304" ESCORCA, LLUC ILLES BA… 490 2.89 39.8
2 B051A "08316" SÓLLER, PUERTO ILLES BA… 5 2.69 39.8
3 B087X "" BANYALBUFAR ILLES BA… 60 2.51 39.7
4 B103B "99103" ANDRATX - SANT ELM ILLES BA… 52 2.37 39.6
5 B158X "" CALVIÀ, ES CAPDELLÀ ILLES BA… 50 2.47 39.6
6 B228 "08301" PALMA, PUERTO ILLES BA… 3 2.63 39.6
7 B236C "" PALMA, UNIVERSIDAD ILLES BA… 95 2.64 39.6
8 B248 "08303" SIERRA DE ALFABIA, BU… ILLES BA… 1030 2.71 39.7
9 B275E "08302" SON BONET, AEROPUERTO BALEARES 49 2.71 39.6
10 B278 "08306" PALMA DE MALLORCA, AE… ILLES BA… 8 2.74 39.6
# ℹ 937 more rows
For example, to get data from the station at El Prat airport (Barcelona), the code to provide is "0076", obtaining hourly data
aemet_last_obs("0076")# A tibble: 13 × 23
idema lon fint prec alt vmax vv dv lat dmax
<chr> <dbl> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0076 2.07 2024-11-02 07:00:00 0 4 11.3 3.6 360 41.3 340
2 0076 2.07 2024-11-02 08:00:00 0 4 10.3 4.2 10 41.3 340
3 0076 2.07 2024-11-02 09:00:00 0 4 10.8 4.4 10 41.3 30
4 0076 2.07 2024-11-02 10:00:00 0.2 4 12.4 4.7 360 41.3 10
5 0076 2.07 2024-11-02 11:00:00 0 4 11.8 4.7 10 41.3 10
6 0076 2.07 2024-11-02 12:00:00 0 4 14.9 10.7 70 41.3 70
7 0076 2.07 2024-11-02 13:00:00 0.3 4 16 11.9 70 41.3 80
8 0076 2.07 2024-11-02 14:00:00 0 4 18 14.1 80 41.3 80
9 0076 2.07 2024-11-02 15:00:00 0 4 17 13.4 90 41.3 80
10 0076 2.07 2024-11-02 16:00:00 0 4 14.9 11.4 90 41.3 80
11 0076 2.07 2024-11-02 17:00:00 0 4 13.4 9.9 90 41.3 80
12 0076 2.07 2024-11-02 18:00:00 0 4 11.8 7.3 80 41.3 80
13 0076 2.07 2024-11-02 19:00:00 0 4 9.8 6.4 80 41.3 80
# ℹ 13 more variables: ubi <chr>, pres <dbl>, hr <dbl>, stdvv <dbl>, ts <dbl>,
# pres_nmar <dbl>, tamin <dbl>, ta <dbl>, tamax <dbl>, tpr <dbl>, vis <dbl>,
# stddv <dbl>, inso <dbl>
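Beyond the latest observations, {climaemet} also offers historical series. A minimal sketch (assuming the aemet_daily_clim() function and a registered API key; the station and dates are illustrative, check the package documentation for the exact arguments):

```r
library(climaemet)
# Hypothetical example: daily climate values for El Prat ("0076")
# over January 2024 (requires a valid AEMET API key already installed)
daily <- aemet_daily_clim(station = "0076",
                          start = "2024-01-01",
                          end = "2024-01-31")
daily
```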
4.6.6 API (US census)
One of the most useful tools of recent years is {tidycensus}: a package that facilitates downloading United States census data from R.
library(tidycensus)
- get_decennial(): to access the US Decennial Census data, collected every 10 years (2000, 2010 and 2020).
- get_acs(): to access the annual and five-year ACS (American Community Survey) data (census != survey).
- get_estimates(): to access the annual population, birth and death estimates (census != survey).
- get_pums(): to access the ACS microdata (unaggregated data, anonymized at the individual level).
- get_flows(): to access the migration flow data.
For example, we will download the census data (get_decennial()) at the state level (geography = "state") for the total population variable (variables = "P001001") for the year 2010 (see the available variables in tidycensus::load_variables()).
total_population_10 <-
get_decennial(geography = "state",
variables = "P001001",
year = 2010)Getting data from the 2010 decennial Census
Using Census Summary File 1
total_population_10# A tibble: 52 × 4
GEOID NAME variable value
<chr> <chr> <chr> <dbl>
1 01 Alabama P001001 4779736
2 02 Alaska P001001 710231
3 04 Arizona P001001 6392017
4 05 Arkansas P001001 2915918
5 06 California P001001 37253956
6 22 Louisiana P001001 4533372
7 21 Kentucky P001001 4339367
8 08 Colorado P001001 5029196
9 09 Connecticut P001001 3574097
10 10 Delaware P001001 897934
# ℹ 42 more rows
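The analogous call for survey data uses get_acs(); a sketch (here "B19013_001", the ACS median household income variable, is used for illustration; a Census API key registered with census_api_key() may be required):

```r
library(tidycensus)
# Median household income by state from the 5-year ACS
median_income <-
  get_acs(geography = "state",
          variables = "B19013_001",
          year = 2021)
median_income
```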
4.6.7 Other options
- {chessR}: data from chess matches. See https://github.com/JaseZiv/chessR
- {spotifyr}: data from Spotify. See https://www.rcharlie.com/spotifyr/
- {gtrendsR}: data from Google Trends. See https://github.com/PMassicotte/gtrendsR
- {scholar}: data from Google Scholar. See https://github.com/jkeirstead/scholar
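As a taste of these packages, a minimal sketch with {gtrendsR} (the keyword, region and time window are illustrative; an internet connection is required):

```r
library(gtrendsR)
# Relative search interest for "data science" in Spain over the last 12 months
trends <- gtrends(keyword = "data science", geo = "ES", time = "today 12-m")
head(trends$interest_over_time)
```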
4.7 💻 Your turn
Try to solve the following exercises without looking at the solutions
📝 Export the who dataset we have used in previous exercises to a native R format in the data folder of the project
Code
library(tidyr)
save(who, file = "./datos/who.RData")📝 Load the who dataset but from the data folder (import the file created in the previous exercise)
Code
load("./datos/who.RData")📝 Repeat the same (export and import) in 4 formats: .csv, .xlsx, .sav (spss) and .dta (stata)
Code
# csv
library(readr)
write_csv(who, file = "./datos/who.csv")
who_data <- read_csv(file = "./datos/who.csv")
# excel
library(openxlsx)
write.xlsx(who, file = "./datos/who.xlsx")
who_data <- read_xlsx(path = "./datos/who.xlsx")
# spss and stata
library(haven)
write_sav(who, path = "./datos/who.sav")
who_data <- read_spss(file = "./datos/who.sav")
write_dta(who, path = "./datos/who.dta")
who_data <- read_dta(file = "./datos/who.dta")📝 Repeat the loading of who.csv but select only the first 4 columns already at load time.
Code
who_select <-
read_csv(file = "./datos/who.csv",
col_select = c("country", "iso2", "iso3", "year"))5 🐣 Case study I: CIS survey
We are going to put into practice the loading and preprocessing of a file generated by one of the most widely used software packages (SPSS). The file contains data from the CIS (Centro de Investigaciones Sociológicas) barometer «Perceptions on equality between men and women and gender stereotypes», whose fieldwork was carried out from November 6 to 14 (4000 interviews of people of both sexes over 16 years old, in 1174 municipalities of 50 provinces).
5.1 Question 1
Load the .sav file that you have in the CIS-feminismo subfolder inside the data folder. After loading, normalize the variable names with the {janitor} package.
Code
library(haven)
data <-
read_sav(file = "./datos/CIS-feminismo/3428.sav") |>
janitor::clean_names()5.2 Question 2
Calculate, using tidyverse, the number of distinct values of each variable and eliminate those that only take a single constant value. First get the names of those variables and then use them inside a select().
Code
rm_variables <-
data |>
summarise(across(everything(), n_distinct)) |>
pivot_longer(cols = everything(), names_to = "variable", values_to = "n_values") |>
filter(n_values <= 1) |>
pull(variable)
data <-
data |>
select(-rm_variables)Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
# Was:
data %>% select(rm_variables)
# Now:
data %>% select(all_of(rm_variables))
See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
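Following the advice in the warning, the selection can be written with all_of() so that the external vector is used safely (same result, no warning):

```r
# Safer version of the previous selection: wrap the external
# character vector rm_variables in all_of() before negating it
data <-
  data |>
  select(-all_of(rm_variables))
```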
5.3 Question 3
Perform the same action to detect those variables that take a different value for each individual (all values distinct). These columns are important since they will be the respondent ids, but one is enough (if there are several, keep one and delete the rest).
Code
rm_variables <-
data |>
summarise(across(everything(), n_distinct)) |>
pivot_longer(cols = everything(), names_to = "variable", values_to = "n_values") |>
filter(n_values == nrow(data)) |>
pull(variable)
data <-
data |>
# all of them except one of them
select(-rm_variables[-1])5.4 Question 4
To understand what the variables mean, you have in the CIS-feminismo folder the questionnaire given to each respondent in cues3428.pdf and the sampling data sheet in FT3428.pdf. In these documents you will see how, for example, the questionnaire variables almost all begin with p..., although we also have other types of variables related to household chores (tarhog...), children (hijomenor...), childcare (cuidadohijos...), care tasks in general (tareascuid...) and other variables related to the ideological scale, religion, voting recall, etc. In addition, all variables starting with ia_xxx refer to codes for possible incidents during data collection, and the variables peso... to the type of weighting used in the sampling. Proceed to eliminate these last two types of variables.
Code
data <-
data |>
select(-contains("ia_"), -contains("peso"))5.5 Question 5
Calculate the number of interviews per autonomous community (
ccaa) and extract the 5 with the fewest interviews.
Code
data |>
count(ccaa, sort = TRUE)
data |>
count(ccaa) |>
slice_min(order_by = n, n = 5)5.6 Question 6
Use the
{datapasta} package (see https://javieralvarezliebana.es/docencia/mucss-data-programming/slides/#/data-pasta) to copy the population of each autonomous community from the INE page and import it into a tibble
Code
# this code should be automatically generated by the addin in RStudio
# after installing the R package
population <-
tibble::tribble(~Total, ~`47.385.107`,
"01 Andalucía", "8.472.407",
"02 Aragón", "1.326.261",
"03 Asturias, Principado de", "1.011.792",
"04 Balears, Illes", "1.173.008",
"05 Canarias", "2.172.944",
"06 Cantabria", "584.507",
"07 Castilla y León", "2.383.139",
"08 Castilla - La Mancha", "2.049.562",
"09 Cataluña", "7.763.362",
"10 Comunitat Valenciana", "5.058.138",
"11 Extremadura", "1.059.501",
"12 Galicia", "2.695.645",
"13 Madrid, Comunidad de", "6.751.251",
"14 Murcia, Región de", "1.518.486",
"15 Navarra, Comunidad Foral de", "661.537",
"16 País Vasco", "2.213.993",
"17 Rioja, La", "319.796",
"18 Ceuta", "83.517",
"19 Melilla", "86.261"
)5.7 Question 7
Rename the variables of the population dataset properly, convert the population to a number, and separate the autonomous community variable into two: one with the code and another with the name (according to INE), properly processing and trimming the names. See the previous exercises about the {stringr} package.
Code
population <-
population |>
rename(ccaa = Total, pop = `47.385.107`) |>
# First to convert to numeric we need to remove dots
mutate("pop" = as.numeric(str_replace_all(pop, "\\.", ""))) |>
# if we use " " as sep, we split also some names
separate(col = ccaa, into = c("cod_INE", "name"), sep = 2) |>
# From stringr package, str_squish() removes whitespace at the start
# and end, and replaces all internal whitespace with a single space.
mutate(name = str_squish(name))5.8 Question 8
Add the population information to our CIS table. Then compute a summary table with the ratio between the percentage of the population each community represents (out of the total of Spain) and the percentage of respondents from each community (out of the total). Which are the 3 most over-represented and which the 3 most under-represented? Note that if the ratio is higher than 1, there is a lower proportion of respondents from that community than its population would warrant (under-represented)
Code
data <-
data |>
# you need first to convert cod_INE as numeric since both
# key columns should be of same type
left_join(population |> mutate("cod_INE" = as.numeric(cod_INE)),
by = c("ccaa" = "cod_INE")) |>
relocate(name, pop, .after = "ccaa")
prop_surv_pop <-
data |>
summarise("prop_surv" = n() / nrow(data),
"pop" = unique(pop),
.by = ccaa) |>
mutate("prop_pop" = pop / sum(pop),
"ratio" = prop_pop / prop_surv)
# overrepresented
prop_surv_pop |>
slice_min(ratio, n = 3)
# underrepresented
prop_surv_pop |>
slice_max(ratio, n = 3)6 Lists
We have already seen that lists are objects in R that allow us to store collections of variables of different types (as data.frame and tibble do) but also of different lengths, with totally heterogeneous structures (a list can even contain another list).
name <- "Javi"
age <- 34
marks <- c(7, 8, 5, 3, 10, 9)
parents <- c("Paloma", "Goyo")
list_var <- list("name" = name, "age" = age, "marks" = marks, "parents" = parents)list_var$name[1] "Javi"
list_var$marks[1] 7 8 5 3 10 9
We can also make lists with other lists inside, so that to access each level we must use the [[]] operator.
list_of_lists <- list("list_1" = list_var[1:2], "list_2" = list_var[3:4])
names(list_of_lists)[1] "list_1" "list_2"
names(list_of_lists[[1]])[1] "name" "age"
list_of_lists[[1]][[1]][1] "Javi"
We are allowed to store n-dimensional data!
One of the disadvantages is that a list cannot be vectorized directly, so any arithmetic operation applied to a list will throw an error.
data <- list("a" = 1:5, "b" = 10:20)
data / 2Error in data/2: non-numeric argument to binary operator
For this purpose, one of the classic base R options is to make use of the lapply() family.
lapply(data, FUN = function(x) { x / 2})$a
[1] 0.5 1.0 1.5 2.0 2.5
$b
[1] 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0
By default, the output of lapply() is always a list of the same length as the input.
A more flexible and versatile option is to make use of the {purrr} package of the {tidyverse} environment.
library(purrr)This package is intended to mimic the functional programming style of other languages such as Scala, or Hadoop's map-reduce strategy (originally from Google).
The simplest function of the {purrr} package is the map() function, which applies a vectorized function to each of the elements of a list. Let’s see a first example applied to vectors
map()allows us to “map” each list and apply the function element by element (if applicable).
x <- list("x1" = 1:4, "x2" = 11:20)
map(x, sqrt) $x1
[1] 1.000000 1.414214 1.732051 2.000000
$x2
[1] 3.316625 3.464102 3.605551 3.741657 3.872983 4.000000 4.123106 4.242641
[9] 4.358899 4.472136
With vectors we have a default vectorization because R performs element-by-element operations. Note that, by default, the output of map is a list.
Let’s look at another example. Define in a list two samples from 2 normal distributions, of different sample size and different mean. Compute the mean of each one.
Code
x <- list(rnorm(n = 1500, mean = 0, sd = 0.7),
rnorm(n = 2800, mean = 2, sd = 1.5))
map(x, mean)[[1]]
[1] -0.0126662
[[2]]
[1] 1.978925
What if we want to calculate the mean of their squared values?
Code
map(x, function(x) { mean(x^2) })[[1]]
[1] 0.453184
[[2]]
[1] 6.155281
In addition to being more readable and efficient, with {purrr} we can decide the output format after the operation
- output as numeric (double) vector with map_dbl()
- output as numeric (int) vector with map_int()
- output as character vector with map_chr()
- output as logical vector with map_lgl()
library(glue)
map_dbl(x, mean)[1] -0.0126662 1.9789246
map_chr(x, function(x) { glue("Mean is {round(mean(x), 5)}") })[1] "Mean is -0.01267" "Mean is 1.97892"
c(x[[1]][3], x[[2]][3])[1] 0.4777273 3.3570897
Also, if you pass it a number i instead of a function, it will return the i-th element of each list element.
map_dbl(x, 3)[1] 0.4777273 3.3570897
list_dummy <- list("a" = dplyr::starwars, "b" = tidyr::billboard)We can also use pluck() to access the i-th element of a list
list_dummy |>
pluck(1)# A tibble: 87 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Leia Or… 150 49 brown light brown 19 fema… femin…
6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
7 Beru Wh… 165 75 brown light blue 47 fema… femin…
8 R5-D4 97 32 <NA> white, red red NA none mascu…
9 Biggs D… 183 84 black light brown 24 male mascu…
10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
We also have the option of generalizing it to be able to use functions that need two arguments in the form of a list (binary operations), with map2()
x <- list("a" = 1:3, "b" = 4:6)
y <- list("c" = c(-1, 4, 0), "b" = c(5, -4, -1))
map2(x, y, function(x, y) { x^2 + y^2})$a
[1] 2 20 9
$b
[1] 41 41 37
We can obtain the output as a data.frame by adding list_rbind() or list_cbind(), which bind the elements of a list of tables into a single table (by rows or by columns, respectively).
x <- c("a", "b", "c")
y <- 1:3
map2(x, y, function(x, y) { tibble(x, y) }) |> list_rbind()# A tibble: 3 × 2
x y
<chr> <int>
1 a 1
2 b 2
3 c 3
We can generalize it further with pmap_xxx() which allows us to use multiple arguments (multiple lists).
x <- list(1, 1, 1)
y <- list(10, 20, 30)
z <- list(100, 200, 300)
pmap_dbl(list(x, y, z), sum)[1] 111 221 331
We also have iterators that, although they take inputs, do not visibly return anything: walk() (just one input argument), walk2() (two arguments) and pwalk() (multiple arguments). All of them return their input invisibly; we call them for their side effects (such as printing) rather than for their return value.
list("a" = 1:3, "b" = 4:6) |>
map2(list("a" = 11:13, "b" = 14:16),
function(x, y) { x + y }) |>
walk(print)[1] 12 14 16
[1] 18 20 22
6.1 💻 It’s your turn
📝 Define a list of 4 elements of different types and access the second of them (I will include one that is a tibble so that you can see that in a list there is room for everything).
Code
list_example <-
list("name" = "Javier", "cp" = 28019,
"siblings" = TRUE,
"marks" = tibble("maths" = c(7.5, 8, 9),
"lang" = c(10, 5, 6)))
list_example📝 From the list above, access the elements that occupy places 1 and 4 of the list defined above.
Code
list_example[c(1, 4)]
list_example$name
list_example$marks
list_example[c("name", "marks")]📝 Load the starwars dataset from the {dplyr} package and access the second movie that appears in starwars$films (for each character). Determine which ones do not appear in more than one movie.
second_film <- map(starwars$films, 2)
map_lgl(second_film, is.null) [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
[13] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE
[25] TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE
[37] TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE
[49] TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE
[61] TRUE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
[73] TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
[85] TRUE TRUE TRUE
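To get the characters' names rather than a logical vector, one option (a sketch reusing the same idea as above; assumes {tidyverse} is loaded and dplyr::starwars is available) is to index name with the logical mask:

```r
library(tidyverse)
# Characters whose films list has no 2nd element, i.e. only one film
starwars$name[map_lgl(map(starwars$films, 2), is.null)]
```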
7 Quali: factors (forcats package)
In statistics, when we talk about qualitative variables, we will call levels or modalities the different values that these data can take. For example, in the case of the sex variable of the starwars set, we have 4 allowed levels: female, hermaphroditic, male and none (in addition to missing data).
starwars |> count(sex)# A tibble: 5 × 2
sex n
<chr> <int>
1 female 16
2 hermaphroditic 1
3 male 60
4 none 6
5 <NA> 4
These kinds of variables are known in R as factors, and the fundamental package to deal with them is {forcats} (from the {tidyverse} environment).
This package allows us to set the levels (stored internally as levels) that a given categorical variable can take, so that mistakes and errors in data collection and generation are prevented. It also makes analysis less computationally expensive when doing searches and comparisons, since factors receive a different treatment than normal text strings.
Let’s see a simple example simulating a party variable taking the values "PP", "PSOE" and "SUMAR" (of size 15)
set.seed(1234567)
party <- sample(x = c("PP", "PSOE", "SUMAR"), size = 15, replace = TRUE)
party [1] "PP" "PSOE" "SUMAR" "PP" "PP" "SUMAR" "PSOE" "SUMAR" "SUMAR"
[10] "SUMAR" "PP" "PSOE" "PP" "PSOE" "SUMAR"
The party variable is currently of type text (chr), something we can check with class(party).
class(party)[1] "character"
From a statistical and computational point of view, for R this variable right now is just another character variable. But statistically, a string variable is not the same as a categorical variable that can only take those 3 levels. How do we convert it to a factor?
By making use of the as_factor() function from the {forcats} package.
library(tidyverse)
party_fct <- tibble("id" = 1:length(party),
"party" = as_factor(party))
party_fct# A tibble: 15 × 2
id party
<int> <fct>
1 1 PP
2 2 PSOE
3 3 SUMAR
4 4 PP
5 5 PP
6 6 SUMAR
7 7 PSOE
8 8 SUMAR
9 9 SUMAR
10 10 SUMAR
11 11 PP
12 12 PSOE
13 13 PP
14 14 PSOE
15 15 SUMAR
Not only has the class of the variable changed, but now, below the stored values, the line Levels: ... appears: these are the modalities or levels of our qualitative variable.
party_fct |> pull(party) [1] PP PSOE SUMAR PP PP SUMAR PSOE SUMAR SUMAR SUMAR PP PSOE
[13] PP PSOE SUMAR
Levels: PP PSOE SUMAR
Imagine that we are defining the database of deputies in Congress and that the PSOE deputies did not attend the plenary session that day: although our variable does not take that value THAT DAY, PSOE is still an allowed level in the database (so even if we filter it out, because it is a factor, the level remains: we do not observe it now, but it is still an allowed level).
party_fct |>
filter(party %in% c("PP", "SUMAR")) |>
pull(party) [1] PP SUMAR PP PP SUMAR SUMAR SUMAR SUMAR PP PP SUMAR
Levels: PP PSOE SUMAR
With the factor() function we can explicitly specify the names of the modalities, and using levels = ... we can explicitly tell it the "order" of the modalities
party_fct <-
tibble(id = 1:length(party),
party = factor(party, levels = c("SUMAR", "PP", "PSOE")))
party_fct |> pull(party) [1] PP PSOE SUMAR PP PP SUMAR PSOE SUMAR SUMAR SUMAR PP PSOE
[13] PP PSOE SUMAR
Levels: SUMAR PP PSOE
The previous "order" is only in the sense of which level will be counted/plotted first; we do not (yet) have an ordinal variable
party_fct$party < "SUMAR"Warning in Ops.factor(party_fct$party, "SUMAR"): '<' not meaningful for factors
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
What if we want to define a qualitative variable ORDINAL? Inside factor() we must indicate that ordered = TRUE.
marks <- c("A", "E", "F", "B", "A+", "A", "C", "C", "D", "B", "A", "C", "C", "E", "F", "D", "A+")
marks_database <-
tibble("student" = 1:length(marks),
"marks" =
factor(marks, levels = c("F", "E", "D", "C", "B", "A", "A+"),
ordered = TRUE))
marks_database |> pull(marks) [1] A E F B A+ A C C D B A C C E F D A+
Levels: F < E < D < C < B < A < A+
What changes? Although the variable is still qualitative, we can now make comparisons and order the records because there is a hierarchy between the modalities.
marks_database |> filter(marks >= "B")# A tibble: 7 × 2
student marks
<int> <ord>
1 1 A
2 4 B
3 5 A+
4 6 A
5 10 B
6 11 A
7 17 A+
If we want to drop a level unused at that moment (one we want to exclude from the definition), we can do it with fct_drop().
marks_database |>
filter(marks %in% c("F", "E", "D", "C", "B", "A")) |>
pull(marks) [1] A E F B A C C D B A C C E F D
Levels: F < E < D < C < B < A < A+
marks_database |>
filter(marks %in% c("F", "E", "D", "C", "B", "A")) |>
mutate(marks = fct_drop(marks)) |>
pull(marks) [1] A E F B A C C D B A C C E F D
Levels: F < E < D < C < B < A
Just as we can delete levels, we can expand the existing levels (even if there is no data for the new level at that time) with fct_expand().
marks_database |>
mutate(marks = fct_expand(marks, c("F-", "A+", "A++"))) %>%
pull(marks) [1] A E F B A+ A C C D B A C C E F D A+
Levels: F < E < D < C < B < A < A+ < F- < A++
In addition, with fct_explicit_na() (now deprecated in favor of fct_na_value_to_level()) we can assign a level to the missing values so that they are included in analyses and visualizations.
fct_explicit_na(factor(c("a", "b", NA)))Warning: `fct_explicit_na()` was deprecated in forcats 1.0.0.
ℹ Please use `fct_na_value_to_level()` instead.
[1] a b (Missing)
Levels: a b (Missing)
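As the warning suggests, the modern equivalent is fct_na_value_to_level(); a minimal sketch:

```r
library(forcats)
# Turn NA values into an explicit "(Missing)" level, without the warning
fct_na_value_to_level(factor(c("a", "b", NA)), level = "(Missing)")
```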
Even once defined we can reorder the levels with fct_relevel().
marks_database_expand <-
marks_database |>
mutate(marks = fct_expand(marks, c("F-", "A+", "A++"))) |>
pull(marks)
marks_database_expand |>
fct_relevel(c("F-", "F", "E", "D", "C", "B", "A", "A+", "A++")) [1] A E F B A+ A C C D B A C C E F D A+
Levels: F- < F < E < D < C < B < A < A+ < A++
This way of working with qualitative variables allows us to give a theoretical definition of our database, and we can even count values that do not yet exist (but could), making use of fct_count().
marks_database |>
mutate(marks = fct_expand(marks, c("F-", "A+", "A++"))) |>
pull(marks) |>
fct_count()# A tibble: 9 × 2
f n
<ord> <int>
1 F 2
2 E 2
3 D 2
4 C 4
5 B 2
6 A 3
7 A+ 2
8 F- 0
9 A++ 0
The levels can also be sorted by frequency with fct_infreq().
marks_database |>
mutate(marks = fct_infreq(marks)) |>
pull(marks) [1] A E F B A+ A C C D B A C C E F D A+
Levels: C < A < F < E < D < B < A+
marks_database |>
mutate(marks = fct_infreq(marks)) |>
pull(marks) |>
fct_count()# A tibble: 7 × 2
f n
<ord> <int>
1 C 4
2 A 3
3 F 2
4 E 2
5 D 2
6 B 2
7 A+ 2
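A related helper is fct_rev(), which reverses the current level order; combined with fct_infreq() it gives a least-frequent-first ordering, often handy for horizontal bar charts. A toy sketch:

```r
library(forcats)

f <- factor(c("a", "a", "b", "c", "c", "c"))

# most frequent first
levels(fct_infreq(f))
#> [1] "c" "a" "b"

# reversed: least frequent first
levels(fct_rev(fct_infreq(f)))
#> [1] "b" "a" "c"
```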
Sometimes we will want to group levels, for example, collapsing levels that do not occur a minimum number of times with fct_lump_min(..., min = ...): observations that do not meet the threshold go to a generic level called Other (the name can be changed with the other_level argument).
marks_database |>
pull(marks) %>%
fct_lump_min(min = 3) [1] A Other Other Other Other A C C Other Other A C
[13] C Other Other Other Other
Levels: C < A < Other
marks_database |>
pull(marks) |>
fct_lump_min(min = 3,
other_level = "Less frequent") [1] A Less frequent Less frequent Less frequent Less frequent
[6] A C C Less frequent Less frequent
[11] A C C Less frequent Less frequent
[16] Less frequent Less frequent
Levels: C < A < Less frequent
We can do something equivalent but based on its relative frequency with fct_lump_prop().
marks_database |>
pull(marks) |>
fct_lump_prop(prop = 0.15,
other_level = "less frequent") [1] A less frequent less frequent less frequent less frequent
[6] A C C less frequent less frequent
[11] A C C less frequent less frequent
[16] less frequent less frequent
Levels: C < A < less frequent
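A third lumping variant is fct_lump_n(), which keeps only the n most frequent levels and sends the rest to Other. A toy sketch:

```r
library(forcats)

f <- factor(c("a", "a", "a", "b", "b", "c", "d"))

# keep the 2 most frequent levels, lump the rest
fct_lump_n(f, n = 2)
#> [1] a     a     a     b     b     Other Other
#> Levels: a b Other
```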
We can apply this to our datasets to recategorize variables very quickly.
starwars |>
drop_na(species) |>
mutate(species =
fct_lump_min(species, min = 3,
other_level = "Others")) |>
count(species)# A tibble: 4 × 2
species n
<fct> <int>
1 Droid 6
2 Gungan 3
3 Human 35
4 Others 39
With fct_reorder() we can also indicate that we want to order the factors according to a function applied to another variable.
starwars_factor <-
starwars |>
drop_na(height, species) |>
mutate(species =
fct_lump_min(species, min = 3,
other_level = "Others"))starwars_factor |> pull(species) [1] Human Droid Droid Human Human Human Human Droid Human Human
[11] Human Human Others Human Others Others Human Others Human Human
[21] Droid Others Human Human Others Human Others Others Human Others
[31] Human Human Gungan Gungan Gungan Human Others Others Human Human
[41] Others Others Others Others Others Others Others Human Others Others
[51] Others Others Others Others Others Others Human Others Others Others
[61] Human Human Human Human Others Others Others Others Human Droid
[71] Others Others Others Others Others Human Others
Levels: Droid Gungan Human Others
starwars_factor |>
mutate(species = fct_reorder(species, height, mean)) |>
pull(species) [1] Human Droid Droid Human Human Human Human Droid Human Human
[11] Human Human Others Human Others Others Human Others Human Human
[21] Droid Others Human Human Others Human Others Others Human Others
[31] Human Human Gungan Gungan Gungan Human Others Others Human Human
[41] Others Others Others Others Others Others Others Human Others Others
[51] Others Others Others Others Others Others Human Others Others Others
[61] Human Human Human Human Others Others Others Others Human Droid
[71] Others Others Others Others Others Human Others
Levels: Droid Others Human Gungan
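As a side note, fct_reorder() sorts ascending by the summary value by default; the .desc argument reverses this, and .fun accepts any summary function (the default is the median). A toy sketch:

```r
library(forcats)

f <- factor(c("low", "low", "high", "high"))
x <- c(1, 2, 10, 12)

# ascending by group mean (low: 1.5, high: 11)
levels(fct_reorder(f, x, .fun = mean))
#> [1] "low"  "high"

# descending
levels(fct_reorder(f, x, .fun = mean, .desc = TRUE))
#> [1] "high" "low"
```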
7.1 💻 It’s your turn
📝 Given the variable months defined below (defined as a character vector), convert this variable to factor (just that)
months <- c("Jan", "Feb", "Mar", "Apr")Code
months <- c("Jan", "Feb", "Mar", "Apr")
months_fct <- as_factor(months)
months_fct
📝 Given the variable months defined below, convert this variable to a factor, this time indicating the levels correctly.
months <- c(NA, "Apr", "Jan", "Oct", "Jul", "Jan", "Sep", NA, "Feb", "Dic",
"Jul", "Mar", "Jan", "Mar", "Feb", "Apr", "May", "Oct", "Sep", NA,
"Dic", "Jul", "Nov", "Feb", "Oct", "Jun", "Sep", "Oct", "Oct", "Sep")Code
months_fct <-
factor(months,
levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dic"))
months_fct
📝 Count how many values there are for each month, but keep in mind that they are factors (there may be unused levels, for which you should get a 0).
Code
months_fct |> fct_count()
📝 Since there are missing values, convert them into an explicit thirteenth level labeled "missing".
Code
months_fct <-
  months_fct |>
  fct_na_value_to_level(level = "missing") # replaces the deprecated fct_explicit_na()
months_fct
📝 Remove unused levels.
Code
months_fct <-
  months_fct |>
  fct_drop()
months_fct
📝 Sort the levels by frequency of occurrence.
Code
months_fct |>
fct_infreq()
📝 Group levels so that any level that does not appear at least 7% of the time is grouped into a level called "other months".
Code
months_fct <-
months_fct |>
fct_lump_prop(prop = 0.07, other_level = "other months")
months_fct
8 🐣 Case study II: CIS survey
We are going to work again with the file containing data from the CIS (Centro de Investigaciones Sociológicas) barometer «Perceptions on equality between men and women and gender stereotypes», whose fieldwork was carried out from November 6 to 14 (4000 interviews of people of both sexes over 16 years old in 1174 municipalities and 50 provinces).
8.1 Question 1
Run the entire code of the previous case study
library(haven)
data <-
read_sav(file = "./datos/CIS-feminismo/3428.sav") |>
janitor::clean_names()
rm_variables <-
data |>
summarise(across(everything(), n_distinct)) |>
pivot_longer(cols = everything(), names_to = "variable", values_to = "n_values") |>
filter(n_values <= 1) |>
pull(variable)
data <-
data |>
select(-rm_variables)
rm_variables <-
data |>
summarise(across(everything(), n_distinct)) |>
pivot_longer(cols = everything(), names_to = "variable", values_to = "n_values") |>
filter(n_values == nrow(data)) |>
pull(variable)
data <-
data |>
# all of them except one of them
select(-rm_variables[-1])
data <-
data |>
select(-contains("ia_"), -contains("peso"))
population <-
tibble::tribble(~Total, ~`47.385.107`,
"01 Andalucía", "8.472.407",
"02 Aragón", "1.326.261",
"03 Asturias, Principado de", "1.011.792",
"04 Balears, Illes", "1.173.008",
"05 Canarias", "2.172.944",
"06 Cantabria", "584.507",
"07 Castilla y León", "2.383.139",
"08 Castilla - La Mancha", "2.049.562",
"09 Cataluña", "7.763.362",
"10 Comunitat Valenciana", "5.058.138",
"11 Extremadura", "1.059.501",
"12 Galicia", "2.695.645",
"13 Madrid, Comunidad de", "6.751.251",
"14 Murcia, Región de", "1.518.486",
"15 Navarra, Comunidad Foral de", "661.537",
"16 País Vasco", "2.213.993",
"17 Rioja, La", "319.796",
"18 Ceuta", "83.517",
"19 Melilla", "86.261"
)
population <-
population |>
rename(ccaa = Total, pop = `47.385.107`) |>
# First to convert to numeric we need to remove dots
mutate("pop" = as.numeric(str_replace_all(pop, "\\.", ""))) |>
# if we use " " as sep, we split also some names
separate(col = ccaa, into = c("cod_INE", "name"), sep = 2) |>
# From stringr package, str_squish() removes whitespace at the start
# and end, and replaces all internal whitespace with a single space.
mutate(name = str_squish(name))
data <-
data |>
# you need first to convert cod_INE as numeric since both
# key columns should be of same type
left_join(population |> mutate("cod_INE" = as.numeric(cod_INE)),
by = c("ccaa" = "cod_INE")) |>
relocate(name, pop, .after = "ccaa")
8.2 Question 2
Notice that many columns are of type <dbl+lbl>, a data type inherited from SPSS that stores both the numeric value (e.g. 1 for the autonomous community ccaa of Andalusia) and the label («Andalucía»). You can see more in the {labelled} package, but for now we are going to use to_factor() to be able to work with those labels if we want.
data |>
# numeric mode
count(ccaa == 1)# A tibble: 2 × 2
`ccaa == 1` n
<lgl> <int>
1 FALSE 3385
2 TRUE 620
data |>
# labels mode
count(labelled::to_factor(ccaa) == "Andalucía")# A tibble: 2 × 2
`labelled::to_factor(ccaa) == "Andalucía"` n
<lgl> <int>
1 FALSE 3385
2 TRUE 620
The variables escfeminis and sitconv actually correspond to questions P16 and P18 of the questionnaire. Rename them as p16 and p18.
Code
data <-
data |>
rename(p16 = escfeminis, p18 = sitconv)
8.3 Question 3
The variables tarhogentrev_1, tarhogentrev_2, tarhogentrev_8, tarhogentrev_9, tarhogentrev_hh, tarhogentrev_mm refer to the same question: time spent on household chores.
The variable tarhogentrev_1 stores a 1 if the respondent answered in hours (otherwise 2).
The variable tarhogentrev_2 stores a 1 if the respondent answered in minutes (otherwise 2).
The variable tarhogentrev_8 stores a 1 if the respondent answered “can’t say” (otherwise 2).
The variable tarhogentrev_9 stores a 1 if the respondent did not want to answer.
The variables tarhogentrev_hh and tarhogentrev_mm store the amounts (in hours or minutes, or 99 or 98 if the respondent did not know or did not want to answer).
Use the information from these 6 variables so that it is contained in a single variable p19 with the number of minutes spent (NA if it cannot be known), and delete all the others.
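As an aside before the proposed solution: nested if_else() calls like those used below can also be written flat with case_when() (dplyr ≥ 1.1). A sketch under the column codings described above; recode_minutes() and the toy table are hypothetical:

```r
library(dplyr)

# hypothetical helper: same logic as nested if_else(), written flat
# with case_when() (column names as in the exercise statement)
recode_minutes <- function(df) {
  df |>
    mutate(p19 = case_when(
      tarhogentrev_1 == 1 ~ tarhogentrev_hh * 60,  # answered in hours
      tarhogentrev_2 == 1 ~ tarhogentrev_mm,       # answered in minutes
      .default = NA                                # can't say / no answer
    ))
}

# quick check on a toy table
toy <- tibble(tarhogentrev_1 = c(1, 2, 2),
              tarhogentrev_2 = c(2, 1, 2),
              tarhogentrev_hh = c(2, NA, NA),
              tarhogentrev_mm = c(NA, 30, NA))
recode_minutes(toy)$p19
#> [1] 120  30  NA
```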
Code
data <-
data |>
mutate("p19" =
if_else(tarhogentrev_1 == 1, tarhogentrev_hh * 60,
if_else(tarhogentrev_2 == 1, tarhogentrev_mm, NA)),
.after = p18) |>
select(-contains("tarhogentrev"))
8.4 Question 4
Perform the same recoding with the variables tarhogparej_1, tarhogparej_2, tarhogparej_8, tarhogparej_9, tarhogparej_hh, tarhogparej_mm (time spent by the respondent’s partner on household chores) and group their info in p20.
data <-
data |>
mutate("p20" =
if_else(tarhogparej_1 == 1, tarhogparej_hh * 60,
if_else(tarhogparej_2 == 1, tarhogparej_mm, NA)),
.after = p19) |>
select(-contains("tarhogparej"))
8.5 Question 5
The variable hijomenor_1 contains a 1 if the respondent has children, and the number of children is in hijomenor_n; if hijomenor_97 contains a 1 we should impute 0 as the number of children; if hijomenor_99 is 1 we should impute NA. Collect the info of the 4 variables in a single variable called p21.
data <-
data |>
mutate("p21" =
if_else(hijomenor_1 == 1, hijomenor_n,
if_else(hijomenor_97 == 1, 0, NA)),
.after = p20) |>
select(-contains("hijomenor"))
8.6 Question 6
The following variables (cuidadohijos_xxx and cuidadohijospar_xxx) code something similar with respect to time spent on childcare (by the respondent and his/her partner). Summarize the information in two variables p21a and p21b as we did with p19 and p20.
data <-
data |>
mutate("p21a" =
if_else(cuidadohijos_1 == 1, cuidadohijos_hh * 60,
if_else(cuidadohijos_2 == 1, cuidadohijos_mm, NA)),
"p21b" =
if_else(cuidadohijospar_1 == 1, cuidadohijospar_hh * 60,
if_else(cuidadohijospar_2 == 1, cuidadohijospar_mm, NA)),
.after = p21) |>
select(-contains("cuidadohijos"))
8.7 Question 7
Do the same with tareascui_xxx regarding the time spent on the care of a sick and/or dependent person on a working day, and store the info in p22a. Note that p22 stores whether the respondent has a sick or dependent person in their care; if p22 is No (2) or N.C. (9), the value of p22a should be NA.
data <-
data |>
mutate("p22a" =
if_else(p22 %in% c(2, 9), NA,
if_else(tareascui_1 == 1, tareascui_hh * 60,
if_else(tareascui_2 == 1, tareascui_mm, NA))),
.after = p22) |>
select(-contains("tareascui"))
8.8 Question 8
Finally, the variables escideol, participaciong, recuvotog, escuela, nivelestentrev, religion, practicarelig6, sitlab, relalab, cno11, tipojor, ingreshog, clasesocial, paisnac, paisnac2 should be renamed as p24, p25, p25a, p26, p26a, p27, p27a, p28, p28a, p28b, p28c, p29, p30, p31 and p31a. In addition, the variable sinceridad collects the degree of sincerity that the pollster has perceived in the respondent, and the variables recuvotogr and recuerdo capture vote-recall information that we have already collected elsewhere. Rename the former set, delete the latter, and move the variables estudios, cno11r, clasesub (studies, occupation according to Social Security coding and subjective class identification) before the survey questions.
Code
data <-
data |>
rename(p24 = escideol, p25 = participaciong, p25a = recuvotog,
p26 = escuela, p26a = nivelestentrev, p27 = religion,
p27a = practicarelig6, p28 = sitlab, p28a = relalab,
p28b = cno11, p28c = tipojor, p29 = ingreshog,
p30 = clasesocial, p31 = paisnac, p31a = paisnac2) |>
select(-(sinceridad:recuerdo)) |>
relocate(estudios, cno11r, clasesub, .before = p0)
data# A tibble: 4,005 × 107
registro ccaa name pop prov mun capital tamuni tipo_tel
<dbl> <dbl+lbl> <chr> <dbl> <dbl+l> <dbl+lb> <dbl+l> <dbl+l> <dbl+lb>
1 2821 1 [Andalucía] Anda… 8.47e6 4 [Alm… 13 [Alm… 2 [Cap… 5 [100… 2 [Móvi…
2 19046 1 [Andalucía] Anda… 8.47e6 4 [Alm… 13 [Alm… 2 [Cap… 5 [100… 1 [Fijo]
3 20465 1 [Andalucía] Anda… 8.47e6 4 [Alm… 13 [Alm… 2 [Cap… 5 [100… 2 [Móvi…
4 31462 1 [Andalucía] Anda… 8.47e6 4 [Alm… 13 [Alm… 2 [Cap… 5 [100… 2 [Móvi…
5 33504 1 [Andalucía] Anda… 8.47e6 4 [Alm… 13 [Alm… 2 [Cap… 5 [100… 2 [Móvi…
6 35292 1 [Andalucía] Anda… 8.47e6 4 [Alm… 13 [Alm… 2 [Cap… 5 [100… 2 [Móvi…
7 37633 1 [Andalucía] Anda… 8.47e6 4 [Alm… 13 [Alm… 2 [Cap… 5 [100… 2 [Móvi…
8 40018 1 [Andalucía] Anda… 8.47e6 4 [Alm… 13 [Alm… 2 [Cap… 5 [100… 1 [Fijo]
9 41526 1 [Andalucía] Anda… 8.47e6 4 [Alm… 13 [Alm… 2 [Cap… 5 [100… 2 [Móvi…
10 62946 1 [Andalucía] Anda… 8.47e6 4 [Alm… 13 [Alm… 2 [Cap… 5 [100… 1 [Fijo]
# ℹ 3,995 more rows
# ℹ 98 more variables: sexo <dbl+lbl>, edad <dbl+lbl>, estudios <dbl+lbl>,
# cno11r <dbl+lbl>, clasesub <dbl+lbl>, p0 <dbl+lbl>, p1 <dbl+lbl>,
# p2 <dbl+lbl>, p3_1 <dbl+lbl>, p3_2 <dbl+lbl>, p3_3 <dbl+lbl>,
# p3_4 <dbl+lbl>, p3_5 <dbl+lbl>, p4 <dbl+lbl>, p5 <dbl+lbl>, p6_1 <dbl+lbl>,
# p6_2 <dbl+lbl>, p6_3 <dbl+lbl>, p7_1 <dbl+lbl>, p7_2 <dbl+lbl>,
# p7_3 <dbl+lbl>, p7_4 <dbl+lbl>, p7_5 <dbl+lbl>, p7_6 <dbl+lbl>, …
8.9 Question 9
Note that there are variables in the questionnaire, such as p0 and p1, that although coded numerically should really be treated as qualitative. For example, if you look at the questionnaire, the variable p0 corresponds to question P0.c, which says “First of all I would like to ask you if you have…
Spanish nationality –> 1
Spanish nationality and other –> 2
Other nationality –> 3
data |> count(p0)# A tibble: 3 × 2
p0 n
<dbl+lbl> <int>
1 1 [La nacionalidad española] 3711
2 2 [La nacionalidad española y otra] 151
3 3 [Otra nacionalidad] 143
If we were to leave the variable as it is, numerically, we would be assuming two incorrect things:
There is a hierarchy (having Spanish nationality is better or worse than having another nationality).
This hierarchy is also quantified numerically: since 2 is double 1, it would imply that, in some way, having two nationalities is “twice as much” as having only the Spanish one.
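A tiny sketch of the issue: the numeric codes let us compute a meaningless "average nationality", while a nominal factor only allows counting and equality comparisons (labels here paraphrase P0.c):

```r
# numeric coding suggests an order and magnitudes that do not exist
codes <- c(1, 2, 3, 1)
mean(codes)  # "average nationality": syntactically valid, semantically void
#> [1] 1.75

# as a nominal factor, only counting / equality tests make sense
nat <- factor(codes, levels = 1:3,
              labels = c("Spanish", "Spanish and other", "Other"))
table(nat)
```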
If you look at the questionnaire, all the variables named p{number} (without the underscore _) are of this type. The others correspond to batteries of questions rated on a scale that is ordinal qualitative, so we will consider that they can be converted to ordered factors.
Convert the p{number} variables to qualitative (nominal) factors with their labels, and the p{number}_x variables to ordinal qualitative factors (with their numbers). In the latter, note that “don’t know/no answer” responses are coded as 98 and 99 (and should be NA).
Code
library(labelled)
data <-
data |>
# with $ to specify "not more than..."
mutate(across(matches("p[0-9]{1}$|p[0-9]{2}$"), to_factor),
across(matches("p[0-9]{1}_|p[0-9]{2}_"),
function(x) { factor(if_else(x %in% c(98, 99), NA, x),
ordered = TRUE) } ))
In some compound questions with only 5 options, the NA are coded as 7, 8 or 9.
data <-
data |>
# with $ to specify "not more than..."
mutate(across(matches("p[0-9]{1}_|p[0-9]{2}_"),
function(x) { factor(if_else(x %in% c(7, 8, 9), NA, x),
ordered = TRUE) } ))
8.10 Question 10
Calculate a two-dimensional table of relative frequencies between the variable sexo and the variable p4 (degree of inequality between men and women in Spain), normalizing as you consider appropriate to answer the following question: what percentage of women believe that inequalities are very large or quite large? And of men?
Code
table_freq <- table(to_factor(data$sexo), data$p4)
prop.table(table_freq, margin = 1)
# 67% of women
# 48.7% of men
8.11 Question 11
Calculate a measure of association between these variables and detail conclusions.
Code
data |>
summarise("sig_chisq" = chisq.test(to_factor(sexo), p4)$p.value)
# tiny p-value --> enough evidence against null hyp -->
# answers to p4 depend on sex
8.12 Question 12
Create a simple graph in ggplot() that represents the variable p4 as a function of sex, allowing the differences between the sexes to be appreciated.
Code
# first option: geom_bar with abs freqs
ggplot(data |>
mutate("sexo" = to_factor(sexo))) +
geom_bar(aes(x = p4, fill = p4), alpha = 0.7) +
ggthemes::scale_fill_colorblind() +
facet_wrap(~sexo) +
theme_minimal()
# second option: geom_col with rel freqs
ggplot(data |>
mutate("sexo" = to_factor(sexo)) |>
count(sexo, p4) |>
mutate("porc" = n/sum(n))) +
geom_col(aes(x = p4, y = porc, fill = p4), alpha = 0.7) +
scale_y_continuous(labels = scales::label_percent()) +
ggthemes::scale_fill_colorblind() +
facet_wrap(~sexo) +
theme_minimal()
# third option: horizontal bars
ggplot(data |>
mutate("sexo" = to_factor(sexo)) |>
count(sexo, p4) |>
mutate("porc" = n/sum(n)) |>
mutate("porc" = if_else(sexo == "Hombre", -porc, porc))) +
geom_col(aes(y = p4, x = porc, fill = p4), alpha = 0.7) +
scale_x_continuous(labels = scales::label_percent()) +
ggthemes::scale_fill_colorblind() +
facet_wrap(~sexo, scales = "free_x") +
theme_minimal()
# fourth option: fill geom bar
ggplot(data |>
mutate("sexo" = to_factor(sexo))) +
geom_bar(aes(x = sexo, fill = p4), position = "fill",
alpha = 0.7) +
ggthemes::scale_fill_colorblind() +
theme_minimal()
8.13 Question 13
Perform the same analysis but first regroup p4 into only 3 levels: grandes (Muy grandes and Bastante grandes), pequeñas (Pequeñas and Casi inexistentes) and NS/NC.
Code
data <-
data |>
mutate("p4" =
         fct_collapse(p4,
                      "grandes" = c("Muy grandes", "Bastante grandes"),
                      "NS/NC" = c("N.S.", "N.C."),
                      other_level = "pequeñas"))
table_freq <- table(to_factor(data$sexo), data$p4)
prop.table(table_freq, margin = 1)
# 67% of women
# 48.7% of men
data |>
summarise("sig_chisq" = chisq.test(to_factor(sexo), p4)$p.value)
ggplot(data |>
mutate("sexo" = to_factor(sexo))) +
geom_bar(aes(x = sexo, fill = p4), position = "fill",
alpha = 0.7) +
ggthemes::scale_fill_colorblind() +
theme_minimal()
8.14 Question 14
Question p16 captures, on a scale of 0 to 10 (where 0 means “not at all feminist” and 10 means “very feminist”), the self-placement of the respondents. How is this variable distributed by sex? Make a frequency table.
Code
prop.table(table(to_factor(data$sexo), data$p16), margin = 1)
What % of men and women declare themselves above 7?
Code
data |>
mutate("p16" = factor(p16, ordered = TRUE)) |>
group_by(sexo) |>
count(p16 > 7) |>
mutate("porc" = 100*n/sum(n)) |>
ungroup()
# 48.3% among men
# 62.3% among women
Question p19 captured the minutes spent on household chores. Convert it to numeric (“No minutes” as 0).
Code
data <-
data |>
mutate("p19" = as.numeric(as.character(p19)))Warning: There was 1 warning in `mutate()`.
ℹ In argument: `p19 = as.numeric(as.character(p19))`.
Caused by warning:
! NAs introduced by coercion
After that, first calculate the average disaggregated by sex, and then the average disaggregated by sex and p16. What conclusions do you draw? Make a graph if you consider it appropriate.
Code
mean_sex <-
data |>
summarise("avg_min" = mean(p19, na.rm = TRUE), .by = sexo)
# 175 minutes in women vs 129 in men
mean_sex_feminism <-
data |>
summarise("avg_min" = mean(p19, na.rm = TRUE),
.by = c(sexo, p16))
ggplot(mean_sex_feminism) +
geom_col(aes(x = p16, y = avg_min, fill = avg_min),
alpha = 0.7) +
scale_fill_viridis_c(option = "magma") +
facet_wrap(~to_factor(sexo)) +
theme_minimal()
Code
# We can see how for men the average minutes remains constant,
# regardless of their self-placement on the feminism scale
8.15 Question 15
We will calculate in a new variable the difference in minutes spent at home between the respondent (p19) and his/her partner (p20).
Code
data <-
data |>
mutate("p20" = as.numeric(as.character(p20))) |>
mutate("diff_min" = p19 - p20)Warning: There was 1 warning in `mutate()`.
ℹ In argument: `p20 = as.numeric(as.character(p20))`.
Caused by warning:
! NAs introduced by coercion
Repeat the previous analysis using this difference in minutes.
Code
mean_sex <-
data |>
filter(p17a %in% c(1, 2) & as.numeric(p17a) != as.numeric(sexo)) |>
summarise("avg_diff_min" = mean(diff_min, na.rm = TRUE), .by = sexo)
# +63.7 minutes in women vs -40.6 in men
mean_sex_feminism <-
data |>
filter(p17a %in% c(1, 2) & as.numeric(p17a) != as.numeric(sexo)) |>
summarise("avg_diff_min" = mean(diff_min, na.rm = TRUE),
.by = c(sexo, p16))
ggplot(mean_sex_feminism) +
geom_col(aes(x = p16, y = avg_diff_min, fill = avg_diff_min),
alpha = 0.7) +
scale_fill_viridis_c(option = "magma") +
facet_wrap(~to_factor(sexo)) +
theme_minimal()