table_1 |> xxx_join(table_2, by = id)
Joins and imports: how to combine, import and export files in R. Factors and lists
Practical workbooks of Data Programming in Master in Computational Social Science (2024-2025)
1 Joins
When working with data we will not always have the information in a single table, and sometimes, we will be interested in cross-referencing information from different sources.
For this we will use a classic of every language that handles data: the famous join, a tool that will allow us to cross one or several tables, making use of a identifying column of each one of them.
- inner_join(): only records with id in both tables survive.
- full_join(): keeps all records in both tables.
- left_join(): keeps all the records of the first table, and looks for which ones have id also in the second one (when there is no match, it fills the fields of the 2nd table with NA).
- right_join(): keeps all records in the second table, and looks for which ones have id also in the first one.
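As a quick sketch of how the four joins differ in which rows survive, here is a minimal comparison with two toy tables (independent of the ones used below):

```r
library(dplyr)

# Two toy tables whose keys only partially overlap
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y3"))

nrow(inner_join(x, y, by = "key")) # 2 rows: only keys 1 and 2 are in both
nrow(full_join(x, y, by = "key"))  # 4 rows: keys 1, 2, 3 and 4
nrow(left_join(x, y, by = "key"))  # 3 rows: all keys of x
nrow(right_join(x, y, by = "key")) # 3 rows: all keys of y
```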
Let’s test the various joins with a simple example.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
tb_1 <- tibble("key" = 1:3, "val_x" = c("x1", "x2", "x3"))
tb_2 <- tibble("key" = c(1, 2, 4), "val_y" = c("y1", "y2", "y3"))
tb_1
# A tibble: 3 × 2
key val_x
<int> <chr>
1 1 x1
2 2 x2
3 3 x3
tb_2
# A tibble: 3 × 2
key val_y
<dbl> <chr>
1 1 y1
2 2 y2
3 4 y3
1.1 left_join()
Imagine that we want to incorporate into tb_1 the information from tb_2, identifying the records by the key column (by = "key", the column used for the cross): we want to keep all the records of the first table and look for which ones also have the same value of key in the second one.
tb_1 |>
  left_join(tb_2, by = "key")
# A tibble: 3 × 3
key val_x val_y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 3 x3 <NA>
Notice that the records in the first table whose key was not found in the second have been given the value NA.
1.2 right_join()
The right_join() performs the opposite operation: we now incorporate into tb_2 the information from tb_1, identifying the records by the key column: we want to keep all the records of the second table and look for which ones have an id (same value of key) also in the first table.
tb_1 |>
  right_join(tb_2, by = "key")
# A tibble: 3 × 3
key val_x val_y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 4 <NA> y3
Notice that now it is the records of the second table whose key was not found in the first that are given the value NA.
1.3 keys and suffixes
The key columns we will use for the crossover will not always be named the same.
tb_1 <- tibble("key_1" = 1:3, "val_x" = c("x1", "x2", "x3"))
tb_2 <- tibble("key_2" = c(1, 2, 4), "val_y" = c("y1", "y2", "y3"))
With by = c("key_1" = "key_2") we indicate in which column of each table the keys to be crossed are located.
# Left
tb_1 |>
  left_join(tb_2, by = c("key_1" = "key_2"))
# A tibble: 3 × 3
key_1 val_x val_y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 3 x3 <NA>
# Right
tb_1 |>
  right_join(tb_2, by = c("key_1" = "key_2"))
# A tibble: 3 × 3
key_1 val_x val_y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 4 <NA> y3
We can also cross over several columns at the same time (a record is considered the same when it has the same set of keys), with by = c("var1_t1" = "var1_t2", "var2_t1" = "var2_t2", ...). Let’s modify the previous example.
tb_1 <- tibble("k_11" = 1:3, "k_12" = c("a", "b", "c"), "val_x" = c("x1", "x2", "x3"))
tb_2 <- tibble("k_21" = c(1, 2, 4), "k_22" = c("a", "b", "e"), "val_y" = c("y1", "y2", "y3"))
# Left
tb_1 |>
  left_join(tb_2,
            by = c("k_11" = "k_21", "k_12" = "k_22"))
# A tibble: 3 × 4
k_11 k_12 val_x val_y
<dbl> <chr> <chr> <chr>
1 1 a x1 y1
2 2 b x2 y2
3 3 c x3 <NA>
# Right
tb_1 |>
  right_join(tb_2,
             by = c("k_11" = "k_21", "k_12" = "k_22"))
# A tibble: 3 × 4
k_11 k_12 val_x val_y
<dbl> <chr> <chr> <chr>
1 1 a x1 y1
2 2 b x2 y2
3 4 e <NA> y3
It could also happen that, when crossing two tables, there are value columns with the same name.
tb_1 <- tibble("key_1" = 1:3, "val" = c("x1", "x2", "x3"))
tb_2 <- tibble("key_2" = c(1, 2, 4), "val" = c("y1", "y2", "y3"))
# Left
tb_1 |>
  left_join(tb_2, by = c("key_1" = "key_2"))
# A tibble: 3 × 3
key_1 val.x val.y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 3 x3 <NA>
Notice that by default it adds the suffixes .x and .y to tell us which table each column comes from. These suffixes can be specified in the optional argument suffix = ..., which allows us to distinguish the variables of one table from those of the other.
# Left
tb_1 |>
  left_join(tb_2, by = c("key_1" = "key_2"),
            suffix = c("_table1", "_table2"))
# A tibble: 3 × 3
key_1 val_table1 val_table2
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 3 x3 <NA>
1.4 full_join()
The two previous cases are examples of what are known as outer joins: crosses that keep observations appearing in at least one of the tables. The third outer join is full_join(), which keeps the observations of both tables, adding the rows that do not match the other table.
tb_1 |>
  full_join(tb_2, by = c("key_1" = "key_2"))
# A tibble: 4 × 3
key_1 val.x val.y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 3 x3 <NA>
4 4 <NA> y3
1.5 inner_join()
In contrast to the outer joins stands the inner join, with inner_join(): a cross in which only the observations appearing in both tables survive, i.e. only matched records are kept.
tb_1 |>
  inner_join(tb_2, by = c("key_1" = "key_2"))
# A tibble: 2 × 3
key_1 val.x val.y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
Note that, in terms of records, inner_join() is commutative: the order of the tables does not matter; the only thing that changes is the order of the columns it adds.
tb_1 |>
  inner_join(tb_2, by = c("key_1" = "key_2"))
# A tibble: 2 × 3
key_1 val.x val.y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
tb_2 |>
  inner_join(tb_1, by = c("key_2" = "key_1"))
# A tibble: 2 × 3
key_2 val.x val.y
<dbl> <chr> <chr>
1 1 y1 x1
2 2 y2 x2
1.6 semi/anti_join()
Finally, we have two interesting tools to filter (not cross) records: semi_join() and anti_join(). The semi join keeps in the first table the records whose key is also in the second table (like an inner join but without adding the info from the second table). The anti join does just the opposite (keeping those that are not).
# semi join
tb_1 |>
  semi_join(tb_2, by = c("key_1" = "key_2"))
# A tibble: 2 × 2
key_1 val
<int> <chr>
1 1 x1
2 2 x2
# anti join
tb_1 |>
  anti_join(tb_2, by = c("key_1" = "key_2"))
# A tibble: 1 × 2
key_1 val
<int> <chr>
1 3 x3
1.7 💻 It’s your turn
For the exercises we will use the tables available in the {nycflights13} package.
library(nycflights13)
- airlines: name of airlines (with their abbreviation).
- airports: airport data (names, longitude, latitude, altitude, etc).
- flights, planes: flight and aircraft data.
- weather: hourly weather data.
📝 From the {nycflights13} package, incorporate into the flights table the airline data in airlines. We want to keep all flight records, adding the airline information from the airlines table.
Code
flights_airlines <-
  flights |>
  left_join(airlines, by = "carrier")
flights_airlines
📝 To the table obtained from the cross in the previous section, add the aircraft data in planes, but keep only those flights for which we have information on their aircraft (and vice versa).
Code
flights_airlines_planes <-
  flights_airlines |>
  inner_join(planes, by = "tailnum")
flights_airlines_planes
📝 Repeat the previous exercise but keep both year variables (one is the year of the flight, the other the year of construction of the aircraft), distinguishing them from each other.
Code
flights_airlines_planes <-
  flights_airlines |>
  inner_join(planes, by = "tailnum",
             suffix = c("_flight", "_build_aircraft"))
flights_airlines_planes
📝 To the table obtained in the previous exercise, add the longitude and latitude of the airports in airports, distinguishing between the latitude/longitude of the destination and origin airports.
Code
flights_airlines_planes |>
  left_join(airports |> select(faa, lat, lon),
            by = c("origin" = "faa")) |>
  rename(lat_origin = lat, lon_origin = lon) |>
  left_join(airports |> select(faa, lat, lon),
            by = c("dest" = "faa")) |>
  rename(lat_dest = lat, lon_dest = lon)
📝 Filter from airports only those airports from which flights depart. Then repeat the process, filtering only those airports at which flights arrive.
Code
airports |>
  semi_join(flights, by = c("faa" = "origin"))

airports |>
  semi_join(flights, by = c("faa" = "dest"))
📝 For how many flights do we have no information about the aircraft? First remove the flights without an aircraft ID (tailnum equal to NA).
Code
flights |>
  drop_na(tailnum) |>
  anti_join(planes, by = "tailnum") |>
  count(tailnum, sort = TRUE)
2 🐣 Case study I: Beatles and Rolling Stones
To practice simple joins we will use the band_members and band_instruments datasets already included in the {dplyr} package.
library(dplyr)
band_members
# A tibble: 3 × 2
name band
<chr> <chr>
1 Mick Stones
2 John Beatles
3 Paul Beatles
band_instruments
# A tibble: 3 × 2
name plays
<chr> <chr>
1 John guitar
2 Paul bass
3 Keith guitar
In the first one we have a series of artists and the band they belong to; in the second, a series of artists and the instrument they play. Beyond performing the requested actions, try to visualize which final table you would obtain.
2.1 Question 1
Given the band_members table, incorporate the information about which instrument each of its members plays (band_instruments).
Code
left_join_band <-
  band_members |>
  left_join(band_instruments, by = "name")
2.2 Question 2
Given the band_members and band_instruments tables, what kind of join should you do to get a complete table, with no missing values, where all band members have their instrument information and every instrument has a member associated with it?
Code
inner_join_band <-
  band_members |>
  inner_join(band_instruments, by = "name")
2.3 Question 3
Given the band_instruments table, how can we incorporate the information of who plays each instrument (in case we know it)?
Code
right_join_band <-
  band_members |>
  right_join(band_instruments, by = "name")

# other option
left_join_instruments <-
  band_instruments |>
  left_join(band_members, by = "name")
2.4 Question 4
Given the band_members and band_instruments tables, what kind of join should you do to get a table with all the info, both members and instruments, even if there are members whose instrument we do not know and instruments whose player we do not know?
Code
full_join_band <-
  band_members |>
  full_join(band_instruments, by = "name")
3 🐣 Case study II: income by municipalities
In the file municipios.csv
we have stored the information of the municipalities of Spain as of 2019.
- The variable LAU_code represents the code as a local administrative unit in the EU (see more at https://ec.europa.eu/eurostat/web/nuts/local-administrative-units).
- The variable codigo_ine is formed by joining the province code and the municipality code (each province has a code from 1 to 52; it does not depend on the ccaa).
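As a sketch of how such a code is assembled (with made-up codes, not taken from the file): the province code is zero-padded to 2 digits and the municipality code to 3, which together give the 5-character codigo_ine.

```r
# Hypothetical example: province code 6, municipality code 83
cpro <- 6
cmun <- 83

# Zero-pad each part and paste them into the 5-character code
codigo_ine <- sprintf("%02d%03d", cpro, cmun)
codigo_ine # "06083"
```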
# 2019 data
mun_data <- read_csv(file = "./datos/municipios.csv")
Rows: 8212 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): codauto, ine.ccaa.name, cpro, ine.prov.name, cmun, name, LAU_CODE, ...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
On the other hand, in the file renta_mun
we have the average per capita income of each administrative unit (municipalities, districts, provinces, autonomous communities,…) for different years.
renta_mun <- read_csv(file = "./datos/renta_mun.csv")
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
Rows: 55273 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Unidad, codigo_ine
dbl (5): 2019, 2018, 2017, 2016, 2015
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Before we start, let’s normalize the variable names using clean_names() from the {janitor} package.
mun_data <-
  mun_data |>
  janitor::clean_names()

renta_mun <-
  renta_mun |>
  janitor::clean_names()
3.1 Question 1
Convert renta_mun to tidy data, obtaining a table of 4 columns: unidad, year, income and codigo_ine (with no missing values and each column of the correct type).
Code
renta_mun_tidy <-
  renta_mun |>
  pivot_longer(cols = contains("x"), names_to = "year",
               values_to = "income", names_prefix = "x",
               names_transform = list(year = as.numeric),
               values_drop_na = TRUE)
3.2 Question 2
If you look at the table above we have data from different administrative units that are not always municipalities. Knowing that all municipalities have a 5-character code, filter only those records that correspond to municipal units.
Code
renta_mun_tidy <-
  renta_mun_tidy |>
  filter(str_detect(codigo_ine, pattern = "[0-9]{5}") &
           str_length(codigo_ine) == 5)
3.3 Question 3
Then properly separate the unidad variable into two columns: one with the code (which you already have, so one of the two should be removed) and one with the name. Remove any remaining spaces (take a look at the {stringr} package options).
Code
renta_mun_tidy <-
  renta_mun_tidy |>
  separate(col = "unidad", into = c("cod_rm", "name"), sep = 5) |>
  select(-cod_rm) |>
  mutate(name = str_trim(name))
3.4 Question 4
In which year was the average income highest? And lowest? What was the median income of municipalities in Spain in 2019?
summary_renta <-
  renta_mun_tidy |>
  summarise("mean_income" = mean(income, na.rm = TRUE),
            .by = year)

summary_renta |>
  slice_min(mean_income, n = 1)
# A tibble: 1 × 2
year mean_income
<dbl> <dbl>
1 2015 10083.
summary_renta |>
  slice_max(mean_income, n = 1)
# A tibble: 1 × 2
year mean_income
<dbl> <dbl>
1 2019 11672.
renta_mun_tidy |>
  filter(year == 2019) |>
  summarise("median_income" = median(income, na.rm = TRUE))
# A tibble: 1 × 1
median_income
<dbl>
1 11462
3.5 Question 5
Do whatever you consider necessary to obtain the province with the highest average income in 2019 and the one with the lowest. Be sure to get its name.
Code
summary_by_prov <-
  renta_mun_tidy |>
  filter(year == 2019) |>
  left_join(mun_data, by = "codigo_ine", suffix = c("", "_rm")) |>
  select(-contains("rm")) |>
  summarise("mean_by_prov" = mean(income, na.rm = TRUE),
            .by = c("cpro", "ine_prov_name"))

summary_by_prov |>
  slice_max(mean_by_prov, n = 1)
# A tibble: 1 × 3
cpro ine_prov_name mean_by_prov
<chr> <chr> <dbl>
1 20 Gipuzkoa 15890.
Code
summary_by_prov |>
  slice_min(mean_by_prov, n = 1)
# A tibble: 1 × 3
cpro ine_prov_name mean_by_prov
<chr> <chr> <dbl>
1 06 Badajoz 8805.
3.6 Question 6
Obtain, for each ccaa, the name of the municipality with the highest income in 2019.
Code
renta_mun_tidy |>
  filter(year == 2019) |>
  left_join(mun_data, by = "codigo_ine", suffix = c("", "_rm")) |>
  select(-contains("rm")) |>
  slice_max(income, by = "codauto")
# A tibble: 19 × 10
name codigo_ine year income codauto ine_ccaa_name cpro ine_prov_name cmun
<chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
1 Nava… 40904 2019 20242 07 Castilla y L… 40 Segovia 904
2 Alma… 10019 2019 13940 11 Extremadura 10 Cáceres 019
3 Olei… 15058 2019 16447 12 Galicia 15 Coruña, A 058
4 Lauk… 48053 2019 21672 16 País Vasco 48 Bizkaia 053
5 Lumb… 26091 2019 24007 17 Rioja, La 26 Rioja, La 091
6 Murc… 30030 2019 11631 14 Murcia, Regi… 30 Murcia 030
7 Raja… 08178 2019 23401 09 Cataluña 08 Barcelona 178
8 Aloc… 19023 2019 19700 08 Castilla - L… 19 Guadalajara 023
9 Argu… 22037 2019 19219 02 Aragón 22 Huesca 037
10 Alcu… 04009 2019 14196 01 Andalucía 04 Almería 009
11 Aran… 31023 2019 15517 15 Navarra, Com… 31 Navarra 023
12 Sant… 35021 2019 15982 05 Canarias 35 Palmas, Las 021
13 Roca… 46216 2019 17872 10 Comunitat Va… 46 Valencia/Val… 216
14 Pozu… 28115 2019 26367 13 Madrid, Comu… 28 Madrid 115
15 Vall… 07063 2019 19565 04 Balears, Ill… 07 Balears, Ill… 063
16 Vald… 39093 2019 15574 06 Cantabria 39 Cantabria 093
17 Teve… 33072 2019 15409 03 Asturias, Pr… 33 Asturias 072
18 Ceuta 51001 2019 12038 18 Ceuta 51 Ceuta 001
19 Meli… 52001 2019 11463 19 Melilla 52 Melilla 001
# ℹ 1 more variable: lau_code <chr>
4 Import and export files
So far we have only used data already loaded in packages, but many times we will need to import data from external files. One of the main strengths of R is that we can import data very easily in different formats:
- R native formats: .rda, .RData and .rds files.
- Rectangular (tabular) data: .csv and .tsv formats.
- Untabulated data: .txt format.
- Data in Excel: .xls and .xlsx formats.
- Data from SAS/Stata/SPSS: .sas7bdat, .sav and .dat formats.
- Data from Google Drive.
- Data from APIs: aemet, catastro, twitter, spotify, etc.
4.1 R native formats
The simplest files to import into R (and those that usually take up less disk space) are its own native extensions: files in .RData, .rda and .rds formats. To load the first two we simply use the native function load(), providing it the file path.
- .RData file: we are going to import the file world_bank_pop.RData, which includes the dataset world_bank_pop.
load("./datos/world_bank_pop.RData")
world_bank_pop
# A tibble: 1,064 × 20
country indicator `2000` `2001` `2002` `2003` `2004` `2005` `2006`
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ABW SP.URB.TOTL 4.16e4 4.20e+4 4.22e+4 4.23e+4 4.23e+4 4.24e+4 4.26e+4
2 ABW SP.URB.GROW 1.66e0 9.56e-1 4.01e-1 1.97e-1 9.46e-2 1.94e-1 3.67e-1
3 ABW SP.POP.TOTL 8.91e4 9.07e+4 9.18e+4 9.27e+4 9.35e+4 9.45e+4 9.56e+4
4 ABW SP.POP.GROW 2.54e0 1.77e+0 1.19e+0 9.97e-1 9.01e-1 1.00e+0 1.18e+0
5 AFE SP.URB.TOTL 1.16e8 1.20e+8 1.24e+8 1.29e+8 1.34e+8 1.39e+8 1.44e+8
6 AFE SP.URB.GROW 3.60e0 3.66e+0 3.72e+0 3.71e+0 3.74e+0 3.81e+0 3.81e+0
7 AFE SP.POP.TOTL 4.02e8 4.12e+8 4.23e+8 4.34e+8 4.45e+8 4.57e+8 4.70e+8
8 AFE SP.POP.GROW 2.58e0 2.59e+0 2.61e+0 2.62e+0 2.64e+0 2.67e+0 2.70e+0
9 AFG SP.URB.TOTL 4.31e6 4.36e+6 4.67e+6 5.06e+6 5.30e+6 5.54e+6 5.83e+6
10 AFG SP.URB.GROW 1.86e0 1.15e+0 6.86e+0 7.95e+0 4.59e+0 4.47e+0 5.03e+0
# ℹ 1,054 more rows
# ℹ 11 more variables: `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>,
# `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>,
# `2016` <dbl>, `2017` <dbl>
- .rda file: we will import the airquality dataset from airquality.rda.
load("./datos/airquality.rda")
airquality |> as_tibble()
# A tibble: 153 × 6
Ozone Solar.R Wind Temp Month Day
<int> <int> <dbl> <int> <int> <int>
1 41 190 7.4 67 5 1
2 36 118 8 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9
10 NA 194 8.6 69 5 10
# ℹ 143 more rows
Note that files loaded with load() are placed automatically in the environment (with the originally saved names), and not only datasets can be loaded: load() allows us to load multiple objects (not only tabular data). Native .rda and .RData files are therefore a proper way to save your environment.
load(file = "./datos/multiple_objects.rda")
- .rds files: for this type we must use readRDS(), providing an argument file with the path. In this case we are going to import lung cancer data from the North Central Cancer Treatment Group. Note that .rds files store a single object.
lung_cancer <-
  readRDS(file = "./datos/NCCTG_lung_cancer.rds") |>
  as_tibble()
lung_cancer
# A tibble: 228 × 10
inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 306 2 74 1 1 90 100 1175 NA
2 3 455 2 68 1 0 90 90 1225 15
3 3 1010 1 56 1 0 90 90 NA 15
4 5 210 2 57 1 1 90 60 1150 11
5 1 883 2 60 1 0 100 90 NA 0
6 12 1022 1 74 1 1 50 80 513 0
7 7 310 2 68 2 2 70 60 384 10
8 11 361 2 71 2 2 60 80 538 1
9 1 218 2 53 1 1 70 80 825 16
10 7 166 2 61 1 2 70 70 271 34
# ℹ 218 more rows
The paths must always be without spaces, ñ, or accents.
4.2 Tabular data: readr
The {readr} package within the {tidyverse} ecosystem contains several useful functions for loading rectangular data (without formatting).
- read_csv(): .csv files whose separator is a comma.
- read_csv2(): semicolon.
- read_tsv(): tab.
- read_table(): space.
- read_delim(): generic function for character-delimited files.
All of them need the file path as an argument, plus other optional arguments (whether to skip a header, the decimal mark, etc). See more at https://readr.tidyverse.org/
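A small self-contained check of those separator conventions (using a temporary file rather than ./datos): read_csv2() expects semicolons as separator and, following the European convention, a comma as decimal mark.

```r
library(readr)

# A semicolon-separated file with decimal commas
tmp <- tempfile(fileext = ".csv")
writeLines(c("city;temp", "Madrid;21,5", "Oslo;3,2"), tmp)

# read_csv2() splits on ";" and parses "21,5" as the number 21.5
datos <- read_csv2(file = tmp, show_col_types = FALSE)
datos$temp # 21.5 3.2
```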
4.2.1 .csv, .tsv
The main advantage of {readr}
is that it automates formatting to go from a flat (unformatted) file to a tibble (in rows and columns, with formatting).
- .csv files: with read_csv() we will load comma-separated files, passing the path as argument in file = .... Let’s import the chickens.csv dataset (about cartoon chickens, why not). If you look at the output, it tells us the type of each variable.
library(readr)
chickens <- read_csv(file = "./datos/chickens.csv")
Rows: 5 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): chicken, sex, motto
dbl (1): eggs_laid
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
chickens
# A tibble: 5 × 4
chicken sex eggs_laid motto
<chr> <chr> <dbl> <chr>
1 Foghorn Leghorn rooster 0 That's a joke, ah say, that's a jok…
2 Chicken Little hen 3 The sky is falling!
3 Ginger hen 12 Listen. We'll either die free chick…
4 Camilla the Chicken hen 7 Bawk, buck, ba-gawk.
5 Ernie The Giant Chicken rooster 0 Put Captain Solo in the cargo hold.
The variable formatting will normally be done automatically by read_csv(), and we can query it with spec().
spec(chickens)
cols(
chicken = col_character(),
sex = col_character(),
eggs_laid = col_double(),
motto = col_character()
)
Although it usually gets it right automatically, we can specify the format explicitly in col_types = list() (in list format, with col_xxx() for each type of variable; for example, eggs_laid will now be imported as character).
chickens <-
  read_csv(file = "./datos/chickens.csv",
           col_types = list(col_character(), col_character(),
                            col_character(), col_character()))
chickens
# A tibble: 5 × 4
chicken sex eggs_laid motto
<chr> <chr> <chr> <chr>
1 Foghorn Leghorn rooster 0 That's a joke, ah say, that's a jok…
2 Chicken Little hen 3 The sky is falling!
3 Ginger hen 12 Listen. We'll either die free chick…
4 Camilla the Chicken hen 7 Bawk, buck, ba-gawk.
5 Ernie The Giant Chicken rooster 0 Put Captain Solo in the cargo hold.
We can even indicate which variables we want to select (without using memory for the rest), by indicating it in col_select = ....
chickens <-
  read_csv(file = "./datos/chickens.csv",
           col_select = c(chicken, sex, eggs_laid))
Rows: 5 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): chicken, sex
dbl (1): eggs_laid
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
chickens
# A tibble: 5 × 3
chicken sex eggs_laid
<chr> <chr> <dbl>
1 Foghorn Leghorn rooster 0
2 Chicken Little hen 3
3 Ginger hen 12
4 Camilla the Chicken hen 7
5 Ernie The Giant Chicken rooster 0
4.2.2 .txt
What happens when the separator is not correct?
If we use read_csv(), it expects the separator between columns to be a comma but, as you can see with the following .txt, it interprets everything as a single column: the file has no commas and read_csv() does not know where to split.
datos_txt <- read_csv(file = "./datos/massey-rating.txt")
Rows: 10 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): UCC PAY LAZ KPK RT COF BIH DII ENG ACU Rank Team Conf
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(datos_txt)
[1] 10 1
as_tibble(datos_txt)
# A tibble: 10 × 1
`UCC PAY LAZ KPK RT COF BIH DII ENG ACU Rank Team Conf`
<chr>
1 1 1 1 1 1 1 1 1 1 1 1 Ohio St B10
2 2 2 2 2 2 2 2 2 4 2 2 Oregon P12
3 3 4 3 4 3 4 3 4 2 3 3 Alabama SEC
4 4 3 4 3 4 3 5 3 3 4 4 TCU B12
5 6 6 6 5 5 7 6 5 6 11 5 Michigan St B10
6 7 7 7 6 7 6 11 8 7 8 6 Georgia SEC
7 5 5 5 7 6 8 4 6 5 5 7 Florida St ACC
8 8 8 9 9 10 5 7 7 10 7 8 Baylor B12
9 9 11 8 13 11 11 12 9 14 9 9 Georgia Tech ACC
10 13 10 13 11 8 9 10 11 9 10 10 Mississippi SEC
For this we have:

- read_csv2() when the separator is a semicolon,
- read_tsv() when the separator is a tab,
- read_table() when the separator is a space,
- read_delim() in general.
datos_txt <- read_table(file = "./datos/massey-rating.txt")
── Column specification ────────────────────────────────────────────────────────
cols(
UCC = col_double(),
PAY = col_double(),
LAZ = col_double(),
KPK = col_double(),
RT = col_double(),
COF = col_double(),
BIH = col_double(),
DII = col_double(),
ENG = col_double(),
ACU = col_double(),
Rank = col_double(),
Team = col_character(),
Conf = col_character()
)
Warning: 10 parsing failures.
row col expected actual file
1 -- 13 columns 15 columns './datos/massey-rating.txt'
2 -- 13 columns 14 columns './datos/massey-rating.txt'
3 -- 13 columns 14 columns './datos/massey-rating.txt'
4 -- 13 columns 14 columns './datos/massey-rating.txt'
5 -- 13 columns 15 columns './datos/massey-rating.txt'
... ... .......... .......... ...........................
See problems(...) for more details.
as_tibble(datos_txt)
# A tibble: 10 × 13
UCC PAY LAZ KPK RT COF BIH DII ENG ACU Rank Team Conf
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 1 1 1 1 1 1 1 1 1 1 1 Ohio St
2 2 2 2 2 2 2 2 2 4 2 2 Oreg… P12
3 3 4 3 4 3 4 3 4 2 3 3 Alab… SEC
4 4 3 4 3 4 3 5 3 3 4 4 TCU B12
5 6 6 6 5 5 7 6 5 6 11 5 Mich… St
6 7 7 7 6 7 6 11 8 7 8 6 Geor… SEC
7 5 5 5 7 6 8 4 6 5 5 7 Flor… St
8 8 8 9 9 10 5 7 7 10 7 8 Bayl… B12
9 9 11 8 13 11 11 12 9 14 9 9 Geor… Tech
10 13 10 13 11 8 9 10 11 9 10 10 Miss… SEC
4.3 Excel data (.xls, .xlsx)
Another key import package is {readxl}, for importing data from Excel. Three functions will be key:
- read_xls(): specific to .xls.
- read_xlsx(): specific to .xlsx.
- read_excel(): for both .xls and .xlsx.
We are going to import deaths.xlsx
with celebrity death records.
library(readxl)
deaths <- read_xlsx(path = "./datos/deaths.xlsx")
New names:
• `` -> `...2`
• `` -> `...3`
• `` -> `...4`
• `` -> `...5`
• `` -> `...6`
deaths
# A tibble: 18 × 6
`Lots of people` ...2 ...3 ...4 ...5 ...6
<chr> <chr> <chr> <chr> <chr> <chr>
1 simply cannot resist writing <NA> <NA> <NA> <NA> some…
2 at the top <NA> of thei…
3 or merging <NA> <NA> <NA> cells
4 Name Profession Age Has … Date… Date…
5 David Bowie musician 69 TRUE 17175 42379
6 Carrie Fisher actor 60 TRUE 20749 42731
7 Chuck Berry musician 90 TRUE 9788 42812
8 Bill Paxton actor 61 TRUE 20226 42791
9 Prince musician 57 TRUE 21343 42481
10 Alan Rickman actor 69 FALSE 16854 42383
11 Florence Henderson actor 82 TRUE 12464 42698
12 Harper Lee author 89 FALSE 9615 42419
13 Zsa Zsa Gábor actor 99 TRUE 6247 42722
14 George Michael musician 53 FALSE 23187 42729
15 Some <NA> <NA> <NA> <NA> <NA>
16 <NA> also like to write stuff <NA> <NA> <NA> <NA>
17 <NA> <NA> at t… bott… <NA> <NA>
18 <NA> <NA> <NA> <NA> <NA> too!
deaths |> slice(1:6)
# A tibble: 6 × 6
`Lots of people` ...2 ...3 ...4 ...5 ...6
<chr> <chr> <chr> <chr> <chr> <chr>
1 simply cannot resist writing <NA> <NA> <NA> <NA> some not…
2 at the top <NA> of their sp…
3 or merging <NA> <NA> <NA> cells
4 Name Profession Age Has kids Date of birth Date of …
5 David Bowie musician 69 TRUE 17175 42379
6 Carrie Fisher actor 60 TRUE 20749 42731
A very common misfortune is to find some kind of comment or text at the beginning of the file, forcing us to skip those rows. We can skip them directly at load time with skip = ... (indicating the number of rows to skip).
deaths <- read_xlsx(path = "./datos/deaths.xlsx", skip = 4)
deaths
# A tibble: 14 × 6
Name Profession Age `Has kids` `Date of birth` `Date of death`
<chr> <chr> <chr> <chr> <dttm> <chr>
1 David Bowie musician 69 TRUE 1947-01-08 00:00:00 42379
2 Carrie Fisher actor 60 TRUE 1956-10-21 00:00:00 42731
3 Chuck Berry musician 90 TRUE 1926-10-18 00:00:00 42812
4 Bill Paxton actor 61 TRUE 1955-05-17 00:00:00 42791
5 Prince musician 57 TRUE 1958-06-07 00:00:00 42481
6 Alan Rickman actor 69 FALSE 1946-02-21 00:00:00 42383
7 Florence Hen… actor 82 TRUE 1934-02-14 00:00:00 42698
8 Harper Lee author 89 FALSE 1926-04-28 00:00:00 42419
9 Zsa Zsa Gábor actor 99 TRUE 1917-02-06 00:00:00 42722
10 George Micha… musician 53 FALSE 1963-06-25 00:00:00 42729
11 Some <NA> <NA> <NA> NA <NA>
12 <NA> also like… <NA> <NA> NA <NA>
13 <NA> <NA> at t… bottom, NA <NA>
14 <NA> <NA> <NA> <NA> NA too!
In addition, with col_names = ... we can rename the columns already at import time (if we provide names, it assumes the 1st line is already data).
deaths <-
read_xlsx(path = "./datos/deaths.xlsx", skip = 5,
col_names = c("name", "profession", "age", "kids", "birth", "death"))
deaths
# A tibble: 14 × 6
name profession age kids birth death
<chr> <chr> <chr> <chr> <dttm> <chr>
1 David Bowie musician 69 TRUE 1947-01-08 00:00:00 42379
2 Carrie Fisher actor 60 TRUE 1956-10-21 00:00:00 42731
3 Chuck Berry musician 90 TRUE 1926-10-18 00:00:00 42812
4 Bill Paxton actor 61 TRUE 1955-05-17 00:00:00 42791
5 Prince musician 57 TRUE 1958-06-07 00:00:00 42481
6 Alan Rickman actor 69 FALSE 1946-02-21 00:00:00 42383
7 Florence Henderson actor 82 TRUE 1934-02-14 00:00:00 42698
8 Harper Lee author 89 FALSE 1926-04-28 00:00:00 42419
9 Zsa Zsa Gábor actor 99 TRUE 1917-02-06 00:00:00 42722
10 George Michael musician 53 FALSE 1963-06-25 00:00:00 42729
11 Some <NA> <NA> <NA> NA <NA>
12 <NA> also like to write … <NA> <NA> NA <NA>
13 <NA> <NA> at t… bott… NA <NA>
14 <NA> <NA> <NA> <NA> NA too!
Sometimes Excel dates come incorrectly formatted (surprise): we can use convertToDate() from the {openxlsx} package to convert them.
library(openxlsx)
deaths$death <- convertToDate(deaths$death)
Warning in convertToDate(deaths$death): NAs introduced by coercion
deaths
# A tibble: 14 × 6
name profession age kids birth death
<chr> <chr> <chr> <chr> <dttm> <date>
1 David Bowie musician 69 TRUE 1947-01-08 00:00:00 2016-01-10
2 Carrie Fisher actor 60 TRUE 1956-10-21 00:00:00 2016-12-27
3 Chuck Berry musician 90 TRUE 1926-10-18 00:00:00 2017-03-18
4 Bill Paxton actor 61 TRUE 1955-05-17 00:00:00 2017-02-25
5 Prince musician 57 TRUE 1958-06-07 00:00:00 2016-04-21
6 Alan Rickman actor 69 FALSE 1946-02-21 00:00:00 2016-01-14
7 Florence Henderson actor 82 TRUE 1934-02-14 00:00:00 2016-11-24
8 Harper Lee author 89 FALSE 1926-04-28 00:00:00 2016-02-19
9 Zsa Zsa Gábor actor 99 TRUE 1917-02-06 00:00:00 2016-12-18
10 George Michael musician 53 FALSE 1963-06-25 00:00:00 2016-12-25
11 Some <NA> <NA> <NA> NA NA
12 <NA> also like to w… <NA> <NA> NA NA
13 <NA> <NA> at t… bott… NA NA
14 <NA> <NA> <NA> <NA> NA NA
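What convertToDate() is doing under the hood is simple offset arithmetic: Excel stores dates as serial day counts and, accounting for Excel's historical leap-year quirk, base R can convert them with the origin 1899-12-30, no extra package needed. The serial numbers below are the ones that appeared in the imported column:

```r
# Excel serial day numbers as they appeared in the death column
serials <- c(42379, 42731)

# Days since Excel's effective origin, 1899-12-30
as.Date(serials, origin = "1899-12-30")
# "2016-01-10" "2016-12-27"
```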
We can also load an Excel file with several sheets: to indicate the sheet (either by its name or by its number) we will use the argument sheet = ....
mtcars <- read_xlsx(path = "./datos/datasets.xlsx", sheet = "mtcars")
mtcars
# A tibble: 32 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ℹ 22 more rows
We can even indicate the range of cells to load with range = ...
.
iris <- read_xlsx(path = "./datos/datasets.xlsx", sheet = "iris", range = "C1:E4")
iris
# A tibble: 3 × 3
Petal.Length Petal.Width Species
<dbl> <dbl> <chr>
1 1.4 0.2 setosa
2 1.4 0.2 setosa
3 1.3 0.2 setosa
4.4 Import from SAS/STATA/SPSS
The {haven} package, within the tidyverse orbit, will allow us to import files from the 3 most important commercial (paid) statistical packages: SAS, SPSS and Stata.
library(haven)
# SAS
iris_sas <- read_sas(data_file = "./datos/iris.sas7bdat")

# SPSS
iris_spss <- read_sav(file = "./datos/iris.sav")

# Stata
iris_stata <- read_dta(file = "./datos/iris.dta")
4.5 Export
In the same way that we can import files, we can also export them:
- Exported in .RData (recommended option for objects created in R). Remember that this format can only be read from R. To do so, just use save(object, file = path).
table <- tibble("a" = 1:4, "b" = 1:4)
save(table, file = "./datos/table.RData")
rm(table) # remove
load("./datos/table.RData")
table
# A tibble: 4 × 2
a b
<int> <int>
1 1 1
2 2 2
3 3 3
4 4 4
- Exported in .RData with multiple objects.
table <- tibble("a" = 1:4, "b" = 1:4)
a <- 1
b <- c("javi", "sandra")
save(table, a, b, file = "./datos/mult_obj.RData")
rm(list = c("a", "b", "table"))
load("./datos/mult_obj.RData")
table
# A tibble: 4 × 2
a b
<int> <int>
1 1 1
2 2 2
3 3 3
4 4 4
- Exported in .csv. To do this we simply use write_csv(object, file = path).
write_csv(table, file = "./datos/table.csv")
read_csv(file = "./datos/table.csv")
Rows: 4 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): a, b
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 4 × 2
a b
<dbl> <dbl>
1 1 1
2 2 2
3 3 3
4 4 4
4.6 Import from new sources
4.6.1 Website
One of the main advantages of R is that we can use all the previous import functions directly on a web resource, without performing the download manually: instead of passing a local path, we pass the link. For example, we are going to download the covid data from ISCIII (https://cnecovid.isciii.es/covid19/#documentaci%C3%B3n-y-datos)
covid_data <-
  read_csv(file = "https://cnecovid.isciii.es/covid19/resources/casos_hosp_uci_def_sexo_edad_provres.csv")
covid_data
Rows: 500 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): provincia_iso, sexo, grupo_edad
dbl (4): num_casos, num_hosp, num_uci, num_def
date (1): fecha
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 500 × 8
provincia_iso sexo grupo_edad fecha num_casos num_hosp num_uci num_def
<chr> <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl>
1 A H 0-9 2020-01-01 0 0 0 0
2 A H 10-19 2020-01-01 0 0 0 0
3 A H 20-29 2020-01-01 0 0 0 0
4 A H 30-39 2020-01-01 0 0 0 0
5 A H 40-49 2020-01-01 0 0 0 0
6 A H 50-59 2020-01-01 0 0 0 0
7 A H 60-69 2020-01-01 0 0 0 0
8 A H 70-79 2020-01-01 0 0 0 0
9 A H 80+ 2020-01-01 0 0 0 0
10 A H NC 2020-01-01 0 0 0 0
# ℹ 490 more rows
4.6.2 Wikipedia
The {rvest} package, one of the most useful in the tidyverse orbit, allows us to import directly from HTML. For example, to import Wikipedia tables we just need read_html() to read the HTML, html_element("table") to extract the table nodes, and html_table() to convert the HTML table to a tibble.
library(rvest)
Attaching package: 'rvest'
The following object is masked from 'package:readr':
guess_encoding
wiki_jump <- 'https://en.wikipedia.org/wiki/Men%27s_long_jump_world_record_progression'
wiki_jump |>
  read_html() |>
  html_element("table") |>
  html_table()
# A tibble: 19 × 5
Mark Wind Athlete Place Date
<chr> <chr> <chr> <chr> <chr>
1 7.61 m (24 ft 11+1⁄2 in) "" Peter O'Connor (IRE) Dublin, Ire… 5 Au…
2 7.69 m (25 ft 2+3⁄4 in) "" Edward Gourdin (USA) Cambridge, … 23 J…
3 7.76 m (25 ft 5+1⁄2 in) "" Robert LeGendre (USA) Paris, Fran… 7 Ju…
4 7.89 m (25 ft 10+1⁄2 in) "" DeHart Hubbard (USA) Chicago, Un… 13 J…
5 7.90 m (25 ft 11 in) "" Edward Hamm (USA) Cambridge, … 7 Ju…
6 7.93 m (26 ft 0 in) "0.0" Sylvio Cator (HAI) Paris, Fran… 9 Se…
7 7.98 m (26 ft 2 in) "0.5" Chuhei Nambu (JPN) Tokyo, Japan 27 O…
8 8.13 m (26 ft 8 in) "1.5" Jesse Owens (USA) Ann Arbor, … 25 M…
9 8.21 m (26 ft 11 in) "0.0" Ralph Boston (USA) Walnut, Uni… 12 A…
10 8.24 m (27 ft 1⁄4 in) "1.8" Ralph Boston (USA) Modesto, Un… 27 M…
11 8.28 m (27 ft 1+3⁄4 in) "1.2" Ralph Boston (USA) Moscow, Sov… 16 J…
12 8.31 m (27 ft 3 in) A "−0.1" Igor Ter-Ovanesyan (URS) Yerevan, So… 10 J…
13 8.33 m (27 ft 3+3⁄4 in)[2] "" Phil Shinnick (USA) Modesto, Un… 25 M…
14 8.31 m (27 ft 3 in) "0.0" Ralph Boston (USA) Kingston, J… 15 A…
15 8.34 m (27 ft 4+1⁄4 in) "1.0" Ralph Boston (USA) Los Angeles… 12 S…
16 8.35 m (27 ft 4+1⁄2 in)[5] "0.0" Ralph Boston (USA) Modesto, Un… 29 M…
17 8.35 m (27 ft 4+1⁄2 in) A "0.0" Igor Ter-Ovanesyan (URS) Mexico City… 19 O…
18 8.90 m (29 ft 2+1⁄4 in) A "2.0" Bob Beamon (USA) Mexico City… 18 O…
19 8.95 m (29 ft 4+1⁄4 in) "0.3" Mike Powell (USA) Tokyo, Japan 30 A…
4.6.3 Google Drive
Another option available (especially if we collaborate with other people) is to import from a Google Drive spreadsheet, making use of read_sheet() from the {googlesheets4} package.
The first time, you will be asked to grant the tidyverse API permission to interact with your Drive.
library(googlesheets4)
google_sheet <-
  read_sheet("https://docs.google.com/spreadsheets/d/1Uz38nHjl3bmftxDpcXj--DYyPo1I39NHVf-xjeg1_wI/edit?usp=sharing")
! Using an auto-discovered, cached token.
To suppress this message, modify your code or options to clearly consent to
the use of a cached token.
See gargle's "Non-interactive auth" vignette for more details:
<https://gargle.r-lib.org/articles/non-interactive-auth.html>
ℹ The googlesheets4 package is using a cached token for 'javalv09@ucm.es'.
✔ Reading from "import-google-sheet".
✔ Range 'Hoja 1'.
google_sheet
# A tibble: 3 × 4
a b c d
<chr> <dbl> <dbl> <chr>
1 d1 0.5 1 hombre
2 d2 -0.3 2 hombre
3 d3 1.1 3 mujer
4.6.4 API (owid)
Another interesting option is downloading data from an API: an intermediary between an app or data provider and our R session. For example, let's load the {owidR} package, which allows us to download data from https://ourworldindata.org/. The owid_covid() function, almost without us noticing, downloads more than 400 000 records with more than 60 variables from 238 countries.
library(owidR)
owid_covid() |> as_tibble()
# A tibble: 7 × 67
iso_code continent location date total_cases new_cases
<chr> <chr> <chr> <IDate> <int> <int>
1 AFG Asia Afghanistan 2020-01-05 0 0
2 AFG Asia Afghanistan 2020-01-06 0 0
3 AFG Asia Afghanistan 2020-01-07 0 0
4 AFG Asia Afghanistan 2020-01-08 0 0
5 AFG Asia Afghanistan 2020-01-09 0 0
6 AFG Asia Afghanistan 2020-01-10 0 0
7 AFG Asia Afghanistan 2020-01-11 0 0
# ℹ 61 more variables: new_cases_smoothed <dbl>, total_deaths <int>,
# new_deaths <int>, new_deaths_smoothed <dbl>, total_cases_per_million <dbl>,
# new_cases_per_million <dbl>, new_cases_smoothed_per_million <dbl>,
# total_deaths_per_million <dbl>, new_deaths_per_million <dbl>,
# new_deaths_smoothed_per_million <dbl>, reproduction_rate <dbl>,
# icu_patients <int>, icu_patients_per_million <dbl>, hosp_patients <int>,
# hosp_patients_per_million <dbl>, weekly_icu_admissions <int>, …
4.6.5 API (aemet)
On many occasions, to connect to an API we will first have to register and obtain a key; this is the case of the {climaemet} package to access Spanish meteorological data (https://opendata.aemet.es/centrodedescargas/inicio).
Once we have the API key, we register it in RStudio so we can use it in the future.
library(climaemet)
# API key
apikey <- "eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJqYXZhbHYwOUB1Y20uZXMiLCJqdGkiOiI4YTU1ODUxMS01MTE3LTQ4MTYtYmM4OS1hYmVkNDhiODBkYzkiLCJpc3MiOiJBRU1FVCIsImlhdCI6MTY2NjQ2OTcxNSwidXNlcklkIjoiOGE1NTg1MTEtNTExNy00ODE2LWJjODktYWJlZDQ4YjgwZGM5Iiwicm9sZSI6IiJ9.HEMR77lZy2ASjmOxJa8ppx2J8Za1IViurMX3p1reVBU"
aemet_api_key(apikey, install = TRUE)
With this package we can search for stations to find out both their characteristics and their identifier code within the AEMET network.
stations <- aemet_stations()
stations
# A tibble: 947 × 7
indicativo indsinop nombre provincia altitud longitud latitud
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 B013X "08304" ESCORCA, LLUC ILLES BA… 490 2.89 39.8
2 B051A "08316" SÓLLER, PUERTO ILLES BA… 5 2.69 39.8
3 B087X "" BANYALBUFAR ILLES BA… 60 2.51 39.7
4 B103B "99103" ANDRATX - SANT ELM ILLES BA… 52 2.37 39.6
5 B158X "" CALVIÀ, ES CAPDELLÀ ILLES BA… 50 2.47 39.6
6 B228 "08301" PALMA, PUERTO ILLES BA… 3 2.63 39.6
7 B236C "" PALMA, UNIVERSIDAD ILLES BA… 95 2.64 39.6
8 B248 "08303" SIERRA DE ALFABIA, BU… ILLES BA… 1030 2.71 39.7
9 B275E "08302" SON BONET, AEROPUERTO BALEARES 49 2.71 39.6
10 B278 "08306" PALMA DE MALLORCA, AE… ILLES BA… 8 2.74 39.6
# ℹ 937 more rows
For example, to get data from the station at El Prat airport (Barcelona), the code to provide is "0076"; we obtain hourly data.
aemet_last_obs("0076")
# A tibble: 13 × 23
idema lon fint prec alt vmax vv dv lat dmax
<chr> <dbl> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0076 2.07 2024-11-02 07:00:00 0 4 11.3 3.6 360 41.3 340
2 0076 2.07 2024-11-02 08:00:00 0 4 10.3 4.2 10 41.3 340
3 0076 2.07 2024-11-02 09:00:00 0 4 10.8 4.4 10 41.3 30
4 0076 2.07 2024-11-02 10:00:00 0.2 4 12.4 4.7 360 41.3 10
5 0076 2.07 2024-11-02 11:00:00 0 4 11.8 4.7 10 41.3 10
6 0076 2.07 2024-11-02 12:00:00 0 4 14.9 10.7 70 41.3 70
7 0076 2.07 2024-11-02 13:00:00 0.3 4 16 11.9 70 41.3 80
8 0076 2.07 2024-11-02 14:00:00 0 4 18 14.1 80 41.3 80
9 0076 2.07 2024-11-02 15:00:00 0 4 17 13.4 90 41.3 80
10 0076 2.07 2024-11-02 16:00:00 0 4 14.9 11.4 90 41.3 80
11 0076 2.07 2024-11-02 17:00:00 0 4 13.4 9.9 90 41.3 80
12 0076 2.07 2024-11-02 18:00:00 0 4 11.8 7.3 80 41.3 80
13 0076 2.07 2024-11-02 19:00:00 0 4 9.8 6.4 80 41.3 80
# ℹ 13 more variables: ubi <chr>, pres <dbl>, hr <dbl>, stdvv <dbl>, ts <dbl>,
# pres_nmar <dbl>, tamin <dbl>, ta <dbl>, tamax <dbl>, tpr <dbl>, vis <dbl>,
# stddv <dbl>, inso <dbl>
4.6.6 API (US census)
One of the most useful tools of recent years is {tidycensus}: a package that facilitates downloading United States census data from R.
library(tidycensus)
- get_decennial(): to access the US Decennial Census data, collected every 10 years (2000, 2010 and 2020).
- get_acs(): to access the annual and five-yearly ACS (American Community Survey) data (census != survey).
- get_estimates(): to access the annual population, birth and death estimates (census != survey).
- get_pums(): to access the ACS microdata (unaggregated data, anonymized at the individual level).
- get_flows(): to access the migration flow data.
Import from API (US census)
For example, we will download the census data (get_decennial()) at the state level (geography = "state") for the population variable (variables = "P001001") for the year 2010 (see the available variables in tidycensus::load_variables()).
total_population_10 <-
  get_decennial(geography = "state",
                variables = "P001001",
                year = 2010)
Getting data from the 2010 decennial Census
Using Census Summary File 1
total_population_10
# A tibble: 52 × 4
GEOID NAME variable value
<chr> <chr> <chr> <dbl>
1 01 Alabama P001001 4779736
2 02 Alaska P001001 710231
3 04 Arizona P001001 6392017
4 05 Arkansas P001001 2915918
5 06 California P001001 37253956
6 22 Louisiana P001001 4533372
7 21 Kentucky P001001 4339367
8 08 Colorado P001001 5029196
9 09 Connecticut P001001 3574097
10 10 Delaware P001001 897934
# ℹ 42 more rows
4.6.7 Other options
- {chessR}: data from chess matches. See https://github.com/JaseZiv/chessR
- {spotifyr}: data from Spotify. See https://www.rcharlie.com/spotifyr/
- {gtrendsR}: data from Google Trends. See https://github.com/PMassicotte/gtrendsR
- {scholar}: data from Google Scholar. See https://github.com/jkeirstead/scholar
4.7 💻 Your turn
Try to solve the following exercises without looking at the solutions
📝 Export the who dataset we have used in previous exercises to a native R format in the data folder of the project.
Code
library(tidyr)
save(who, file = "./datos/who.RData")
📝 Load the who dataset, but from the data folder (import the file created in the previous exercise).
Code
load("./datos/who.RData")
📝 Repeat the same (export and import) in 4 formats: .csv, .xlsx, .sav (SPSS) and .dta (Stata).
Code
# csv
library(readr)
write_csv(who, file = "./datos/who.csv")
who_data <- read_csv(file = "./datos/who.csv")

# excel
library(openxlsx)
write.xlsx(who, file = "./datos/who.xlsx")
who_data <- read_xlsx(path = "./datos/who.xlsx")

# spss and stata
library(haven)
write_sav(who, path = "./datos/who.sav")
who_data <- read_spss(file = "./datos/who.sav")

write_dta(who, path = "./datos/who.dta")
who_data <- read_dta(file = "./datos/who.dta")
📝 Repeat the loading of who.csv
but only select the first 4 columns already in the load.
Code
who_select <-
  read_csv(file = "./datos/who.csv",
           col_select = c("country", "iso2", "iso3", "year"))
5 🐣 Case study I: CIS survey
We are going to put into practice the loading and preprocessing of a file generated by one of the most widely used statistical packages (SPSS). The file contains data from the CIS (Centro de Investigaciones Sociológicas) barometer «Perceptions on equality between men and women and gender stereotypes», whose fieldwork was carried out from November 6 to 14 (4000 interviews of people of both sexes over 16 years old, in 1174 municipalities and 50 provinces).
5.1 Question 1
Load the .sav file that you have in the CIS-feminismo subfolder inside the data folder. After loading, normalize the variable names with the {janitor} package.
Code
library(haven)
data <-
  read_sav(file = "./datos/CIS-feminismo/3428.sav") |>
  janitor::clean_names()
5.2 Question 2
Calculate using tidyverse the number of distinct values of each variable and eliminate those that only have a constant value. First get the name of those variables and then use them inside a
select()
.
Code
rm_variables <-
  data |>
  summarise(across(everything(), n_distinct)) |>
  pivot_longer(cols = everything(), names_to = "variable", values_to = "n_values") |>
  filter(n_values <= 1) |>
  pull(variable)

data <-
  data |>
  # all_of() avoids the tidyselect deprecation warning about external vectors
  select(-all_of(rm_variables))
5.3 Question 3
Perform the same action to detect those variables that have a different value for each individual (all values are different). These columns are important, they will be the ids of the respondent, but one is enough (if there are several, delete one of them).
Code
rm_variables <-
  data |>
  summarise(across(everything(), n_distinct)) |>
  pivot_longer(cols = everything(), names_to = "variable", values_to = "n_values") |>
  filter(n_values == nrow(data)) |>
  pull(variable)

data <-
  data |>
  # keep one of them, drop the rest
  select(-all_of(rm_variables[-1]))
5.4 Question 4
To understand what the variables mean, you have in the CIS-feminismo folder the questionnaire given to each respondent (cues3428.pdf) and the sampling data sheet (FT3428.pdf). In these documents you will see that, for example, almost all the questionnaire variables begin with p..., although we also have other types of variables related to household chores (tarhog...), children (hijomenor...), childcare (cuidadohijos...), care tasks in general (tareascuid...) and other variables related to the ideological scale, religion, voting record, etc. In addition, all variables starting with ia_xxx refer to codes for possible incidents during data collection, and the peso... variables to the type of weighting used in the sampling. Proceed to eliminate these last two types of variables.
Code
data <-
  data |>
  select(-contains("ia_"), -contains("peso"))
5.5 Question 5
Calculate the number of interviews per autonomous community (ccaa) and extract the 5 with the fewest interviews.
Code
data |>
  count(ccaa, sort = TRUE)

data |>
  count(ccaa) |>
  slice_min(n, n = 5)
5.6 Question 6
Use the {datapasta} package (see https://javieralvarezliebana.es/docencia/mucss-data-programming/slides/#/data-pasta) to copy the population of each autonomous community from the INE page and import it into a tibble.
Code
# this code should be automatically generated by the addin in RStudio
# after installing the {datapasta} package
population <-
  tibble::tribble(~Total, ~`47.385.107`,
                  "01 Andalucía", "8.472.407",
                  "02 Aragón", "1.326.261",
                  "03 Asturias, Principado de", "1.011.792",
                  "04 Balears, Illes", "1.173.008",
                  "05 Canarias", "2.172.944",
                  "06 Cantabria", "584.507",
                  "07 Castilla y León", "2.383.139",
                  "08 Castilla - La Mancha", "2.049.562",
                  "09 Cataluña", "7.763.362",
                  "10 Comunitat Valenciana", "5.058.138",
                  "11 Extremadura", "1.059.501",
                  "12 Galicia", "2.695.645",
                  "13 Madrid, Comunidad de", "6.751.251",
                  "14 Murcia, Región de", "1.518.486",
                  "15 Navarra, Comunidad Foral de", "661.537",
                  "16 País Vasco", "2.213.993",
                  "17 Rioja, La", "319.796",
                  "18 Ceuta", "83.517",
                  "19 Melilla", "86.261")
5.7 Question 7
Rename the variables of the population dataset properly, convert the population to a number, and separate the autonomous community variable in two: one with the code and another with the name (according to INE), properly processing and trimming the names. See the previous exercises about the {stringr} package.
Code
population <-
  population |>
  rename(ccaa = Total, pop = `47.385.107`) |>
  # First, to convert to numeric, we need to remove the dots
  mutate("pop" = as.numeric(str_replace_all(pop, "\\.", ""))) |>
  # if we used " " as sep, we would also split some names
  separate(col = ccaa, into = c("cod_INE", "name"), sep = 2) |>
  # From the stringr package, str_squish() removes whitespace at the start
  # and end, and replaces all internal whitespace with a single space.
  mutate(name = str_squish(name))
5.8 Question 8
Add the population information to our CIS table. Then calculate a summary table with the ratio between the percentage of the population each community represents (of the total of Spain) and the percentage of respondents from each community (with respect to the total). Which 3 communities are the most over-represented and which 3 the least? Note that a ratio higher than 1 implies a lower proportion of respondents from that community than its population share would warrant (under-represented).
Code
data <-
  data |>
  # you first need to convert cod_INE to numeric since both
  # key columns should be of the same type
  left_join(population |> mutate("cod_INE" = as.numeric(cod_INE)),
            by = c("ccaa" = "cod_INE")) |>
  relocate(name, pop, .after = "ccaa")

prop_surv_pop <-
  data |>
  summarise("prop_surv" = n() / nrow(data),
            "pop" = unique(pop),
            .by = ccaa) |>
  mutate("prop_pop" = pop / sum(pop),
         "ratio" = prop_pop / prop_surv)

# over-represented
prop_surv_pop |>
  slice_min(ratio, n = 3)

# under-represented
prop_surv_pop |>
  slice_max(ratio, n = 3)
6 Lists
We have already seen that lists are an object in R that allows us to store collections of variables of different type (as with data.frame
and tibble
) but also different lengths, with totally heterogeneous structures (even a list can have inside it another list).
name <- "Javi"
age <- 34
marks <- c(7, 8, 5, 3, 10, 9)
parents <- c("Paloma", "Goyo")

list_var <- list("name" = name, "age" = age, "marks" = marks, "parents" = parents)
list_var$name
[1] "Javi"
list_var$marks
[1] 7 8 5 3 10 9
We can also make lists with other lists inside, so that to access each level we must use the [[]]
operator.
list_of_lists <- list("list_1" = list_var[1:2], "list_2" = list_var[3:4])
names(list_of_lists)
[1] "list_1" "list_2"
names(list_of_lists[[1]])
[1] "name" "age"
list_of_lists[[1]][[1]]
[1] "Javi"
We are allowed to store n-dimensional data!
One of the disadvantages is that a list cannot be vectorized immediately, so any arithmetic operation applied directly to a list will give an error.
data <- list("a" = 1:5, "b" = 10:20)
data / 2
Error in data/2: non-numeric argument to binary operator
For this purpose, one of the common (but outdated) options is to make use of the lapply()
family.
lapply(data, FUN = function(x) { x / 2})
$a
[1] 0.5 1.0 1.5 2.0 2.5
$b
[1] 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0
By default, the output of lapply() is always a list of the same length as the input.
A more flexible and versatile option is to make use of the {purrr}
package of the {tidyverse}
environment.
library(purrr)
This package is intended to mimic the functional programming of other languages such as Scala, or the map-reduce strategy (from Google) used in Hadoop.
The simplest function of the {purrr}
package is the map()
function, which applies a vectorized function to each of the elements of a list. Let’s see a first example applied to vectors
map()
allows us to “map” each list and apply the function element by element (if applicable).
x <- list("x1" = 1:4, "x2" = 11:20)
map(x, sqrt)
$x1
[1] 1.000000 1.414214 1.732051 2.000000
$x2
[1] 3.316625 3.464102 3.605551 3.741657 3.872983 4.000000 4.123106 4.242641
[9] 4.358899 4.472136
With vectors we have a default vectorization because R
performs element-by-element operations. Note that, by default, the output of map
is a list.
Let’s look at another example. Define in a list two samples from 2 normal distributions, of different sample size and different mean. Compute the mean of each one.
Code
x <- list(rnorm(n = 1500, mean = 0, sd = 0.7),
          rnorm(n = 2800, mean = 2, sd = 1.5))
map(x, mean)
[[1]]
[1] -0.0126662
[[2]]
[1] 1.978925
What if we want to calculate the mean of their squared values?
Code
map(x, function(x) { mean(x^2) })
[[1]]
[1] 0.453184
[[2]]
[1] 6.155281
In addition to being more readable and efficient, with {purrr}
we can decide the output format after the operation
- output as numeric (double) vector with map_dbl()
- output as integer vector with map_int()
- output as character vector with map_chr()
- output as logical vector with map_lgl()
library(glue)
map_dbl(x, mean)
[1] -0.0126662 1.9789246
map_chr(x, function(x) { glue("Mean is {round(mean(x), 5)}") })
[1] "Mean is -0.01267" "Mean is 1.97892"
c(x[[1]][3], x[[2]][3])
[1] 0.4777273 3.3570897
Also, if you pass it a number instead of a function, it will return the ith element of each list.
map_dbl(x, 3)
[1] 0.4777273 3.3570897
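The two typed variants from the list above not shown in the example (map_int() and map_lgl()) work the same way; a minimal sketch on a freshly defined list (the name lst and the threshold of 15 are just for illustration):

```r
library(purrr)

lst <- list("x1" = 1:4, "x2" = 11:20)

# output as integer vector: length of each element
map_int(lst, length)

# output as logical vector: does any element exceed 15?
map_lgl(lst, function(v) { any(v > 15) })
```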
list_dummy <- list("a" = dplyr::starwars, "b" = tidyr::billboard)
We can also use pluck() to access the i-th element of a list.
list_dummy |>
  pluck(1)
# A tibble: 87 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Leia Or… 150 49 brown light brown 19 fema… femin…
6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
7 Beru Wh… 165 75 brown light blue 47 fema… femin…
8 R5-D4 97 32 <NA> white, red red NA none mascu…
9 Biggs D… 183 84 black light brown 24 male mascu…
10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
We also have the option of generalizing it to be able to use functions that need two arguments in the form of a list (binary operations), with map2()
x <- list("a" = 1:3, "b" = 4:6)
y <- list("c" = c(-1, 4, 0), "b" = c(5, -4, -1))
map2(x, y, function(x, y) { x^2 + y^2 })
$a
[1] 2 20 9
$b
[1] 41 41 37
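The typed suffixes seen before (map_dbl(), map_chr(), ...) also exist for map2(); a small sketch (redefining the same x and y) where each pair of elements collapses to a single number:

```r
library(purrr)

x <- list("a" = 1:3, "b" = 4:6)
y <- list("c" = c(-1, 4, 0), "b" = c(5, -4, -1))

# map2_dbl() flattens the result into a numeric vector, so the
# function must return a single number for each pair of elements
map2_dbl(x, y, function(x, y) { sum(x * y) })
```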
We can obtain the output in the form of data.frame
by adding list_rbind()
or list_cbind()
, which converts a list into a table.
x <- c("a", "b", "c")
y <- 1:3
map2(x, y, function(x, y) { tibble(x, y) }) |> list_rbind()
# A tibble: 3 × 2
x y
<chr> <int>
1 a 1
2 b 2
3 c 3
We can generalize it further with pmap_xxx()
which allows us to use multiple arguments (multiple lists).
x <- list(1, 1, 1)
y <- list(10, 20, 30)
z <- list(100, 200, 300)
pmap_dbl(list(x, y, z), sum)
[1] 111 221 331
We also have iterators that take inputs but return nothing new: walk() (just one input argument), walk2() (two arguments) and pwalk() (multiple arguments). All of them return their input invisibly; we call them for a function's side effects rather than its return value.
list("a" = 1:3, "b" = 4:6) |>
map2(list("a" = 11:13, "b" = 14:16),
function(x, y) { x + y }) |>
walk(print)
[1] 12 14 16
[1] 18 20 22
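As a sketch of the two-argument version, walk2() can print a small report per pair of elements (the labels and message format below are just for illustration):

```r
library(purrr)
library(glue)

labels <- c("a", "b")
values <- list(1:3, 4:6)

# walk2() iterates over both inputs in parallel purely for the side
# effect (printing); the return value of the function is discarded
walk2(labels, values,
      function(lab, v) { print(glue("{lab}: sum = {sum(v)}")) })
```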
6.1 💻 It’s your turn
📝 Define a list of 4 elements of different types and access the second of them (I will include one that is a tibble so that you can see that in a list there is room for everything).
Code
list_example <-
  list("name" = "Javier", "cp" = 28019,
       "siblings" = TRUE,
       "marks" = tibble("maths" = c(7.5, 8, 9),
                        "lang" = c(10, 5, 6)))
list_example

# access the second element
list_example[[2]]
📝 From the list above, access the elements that occupy places 1 and 4 of the list defined above.
Code
list_example[c(1, 4)]

list_example$name
list_example$marks

list_example[c("name", "marks")]
📝 Load the starwars
dataset from the {dplyr}
package and access the second movie that appears in starwars$films
(for each character). Determine which ones do not appear in more than one movie.
second_film <- map(starwars$films, 2)
map_lgl(second_film, is.null)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
[13] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE
[25] TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE
[37] TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE
[49] TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE
[61] TRUE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
[73] TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
[85] TRUE TRUE TRUE
7 Quali: factors (forcats package)
In statistics, when we talk about qualitative variables, we will call levels or modalities the different values that these data can take. For example, in the case of the sex
variable of the starwars
set, we have 4 allowed levels: female
, hermaphroditic
, male
and none
(in addition to missing data).
starwars |> count(sex)
# A tibble: 5 × 2
sex n
<chr> <int>
1 female 16
2 hermaphroditic 1
3 male 60
4 none 6
5 <NA> 4
These kinds of variables are known in R
as factors, and the fundamental package to deal with them is {forcats}
(from the {tidyverse}
environment).
This package allows us to set the levels (stored internally as levels) that a given categorical variable can take, so that mistakes and errors in data collection and generation cannot slip in. It also makes analysis less computationally expensive when doing searches and comparisons, since factors get a different treatment than normal text strings.
Let’s see a simple example simulating a party
variable taking the values "PP"
, "PSOE"
and "SUMAR"
(of size 15)
set.seed(1234567)
party <- sample(x = c("PP", "PSOE", "SUMAR"), size = 15, replace = TRUE)
party
[1] "PP" "PSOE" "SUMAR" "PP" "PP" "SUMAR" "PSOE" "SUMAR" "SUMAR"
[10] "SUMAR" "PP" "PSOE" "PP" "PSOE" "SUMAR"
The party variable is currently of type text (chr), something we can check with class(party).
class(party)
[1] "character"
From a statistical and computational point of view, for R this variable right now is equivalent to any other text variable. But statistically, a string variable is not the same as a categorical variable that can only take those 3 levels. How do we convert it to a factor?
By making use of the as_factor()
function from the {forcats}
package.
library(tidyverse)
party_fct <- tibble("id" = 1:length(party),
                    "party" = as_factor(party))
party_fct
# A tibble: 15 × 2
id party
<int> <fct>
1 1 PP
2 2 PSOE
3 3 SUMAR
4 4 PP
5 5 PP
6 6 SUMAR
7 7 PSOE
8 8 SUMAR
9 9 SUMAR
10 10 SUMAR
11 11 PP
12 12 PSOE
13 13 PP
14 14 PSOE
15 15 SUMAR
Not only has the class of the variable changed: now, below the stored values, the line Levels: ... appears. These are the modalities or levels of our qualitative variable.
party_fct |> pull(party)
[1] PP PSOE SUMAR PP PP SUMAR PSOE SUMAR SUMAR SUMAR PP PSOE
[13] PP PSOE SUMAR
Levels: PP PSOE SUMAR
Imagine that we are defining the database of deputies in Congress and that the PSOE deputies did not attend the plenary session that day: although our variable does not take that value THAT DAY, PSOE is still an allowed level in the database (even if we filter it out, because the variable is a factor the level remains: we have no such value now, but it is an allowed level).
party_fct |>
  filter(party %in% c("PP", "SUMAR")) |>
  pull(party)
[1] PP SUMAR PP PP SUMAR SUMAR SUMAR SUMAR PP PP SUMAR
Levels: PP PSOE SUMAR
With the factor() function we can explicitly specify the names of the modalities, and using levels = ... we can explicitly set the “order” of the modalities.
party_fct <-
  tibble(id = 1:length(party),
         party = factor(party, levels = c("SUMAR", "PP", "PSOE")))
party_fct |> pull(party)
[1] PP PSOE SUMAR PP PP SUMAR PSOE SUMAR SUMAR SUMAR PP PSOE
[13] PP PSOE SUMAR
Levels: SUMAR PP PSOE
The previous “order” only determines which level is counted/plotted first; we do not (yet) have an ordinal variable.
party_fct$party < "SUMAR"
Warning in Ops.factor(party_fct$party, "SUMAR"): '<' not meaningful for factors
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
What if we want to define an ORDINAL qualitative variable? Inside factor() we must indicate ordered = TRUE.
marks <- c("A", "E", "F", "B", "A+", "A", "C", "C", "D", "B", "A", "C", "C", "E", "F", "D", "A+")
marks_database <-
  tibble("student" = 1:length(marks),
         "marks" =
           factor(marks, levels = c("F", "E", "D", "C", "B", "A", "A+"),
                  ordered = TRUE))
marks_database |> pull(marks)
[1] A E F B A+ A C C D B A C C E F D A+
Levels: F < E < D < C < B < A < A+
What changes? Although the variable is still qualitative, we can now make comparisons and order the records, because there is a hierarchy between the modalities.
marks_database |> filter(marks >= "B")
# A tibble: 7 × 2
student marks
<int> <ord>
1 1 A
2 4 B
3 5 A+
4 6 A
5 10 B
6 11 A
7 17 A+
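Since the levels now carry a hierarchy, other order-based operations work too; a minimal base-R sketch with a small subset of the marks (the subset is just for illustration):

```r
few_marks <- factor(c("A", "E", "F", "B"),
                    levels = c("F", "E", "D", "C", "B", "A", "A+"),
                    ordered = TRUE)

# sorting and extremes respect the declared hierarchy of levels
sort(few_marks)
max(few_marks)
```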
If we want to tell it to drop an unused level at that moment (and that we want to exclude from the definition) we can do it with fct_drop()
.
marks_database |>
  filter(marks %in% c("F", "E", "D", "C", "B", "A")) |>
  pull(marks)
[1] A E F B A C C D B A C C E F D
Levels: F < E < D < C < B < A < A+
marks_database |>
  filter(marks %in% c("F", "E", "D", "C", "B", "A")) |>
  mutate(marks = fct_drop(marks)) |>
  pull(marks)
[1] A E F B A C C D B A C C E F D
Levels: F < E < D < C < B < A
Just as we can delete levels we can expand existing levels (even if there is no data for that level at that time) with fct_expand()
.
marks_database |>
  mutate(marks = fct_expand(marks, c("F-", "A+", "A++"))) |>
  pull(marks)
[1] A E F B A+ A C C D B A C C E F D A+
Levels: F < E < D < C < B < A < A+ < F- < A++
In addition, with fct_explicit_na() we can assign an explicit level to the missing values so that they are included in analyses and visualizations.
fct_explicit_na(factor(c("a", "b", NA)))
Warning: `fct_explicit_na()` was deprecated in forcats 1.0.0.
ℹ Please use `fct_na_value_to_level()` instead.
[1] a b (Missing)
Levels: a b (Missing)
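As the warning itself suggests, the current equivalent is fct_na_value_to_level(); a minimal sketch (the "(Missing)" label is our choice via the level argument):

```r
library(forcats)

# same result without the deprecation warning: NA values become
# an explicit "(Missing)" level of the factor
fct_na_value_to_level(factor(c("a", "b", NA)), level = "(Missing)")
```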
Even once defined we can reorder the levels with fct_relevel()
.
marks_database_expand <-
  marks_database |>
  mutate(marks = fct_expand(marks, c("F-", "A+", "A++"))) |>
  pull(marks)

marks_database_expand |>
  fct_relevel(c("F-", "F", "E", "D", "C", "B", "A", "A+", "A++"))
[1] A E F B A+ A C C D B A C C E F D A+
Levels: F- < F < E < D < C < B < A < A+ < A++
This way of working with qualitative variables allows us to give a theoretical definition of our database: we can even count values that do not yet exist (but could), making use of fct_count().
marks_database |>
  mutate(marks = fct_expand(marks, c("F-", "A+", "A++"))) |>
  pull(marks) |>
  fct_count()
# A tibble: 9 × 2
f n
<ord> <int>
1 F 2
2 E 2
3 D 2
4 C 4
5 B 2
6 A 3
7 A+ 2
8 F- 0
9 A++ 0
The levels can also be sorted by frequency with fct_infreq().
marks_database |>
  mutate(marks = fct_infreq(marks)) |>
  pull(marks)
[1] A E F B A+ A C C D B A C C E F D A+
Levels: C < A < F < E < D < B < A+
marks_database |>
  mutate(marks = fct_infreq(marks)) |>
  pull(marks) |>
  fct_count()
# A tibble: 7 × 2
f n
<ord> <int>
1 C 4
2 A 3
3 F 2
4 E 2
5 D 2
6 B 2
7 A+ 2
Sometimes we will want to group levels, for example, dropping levels that do not occur a minimum number of times, with fct_lump_min(..., min = ...) (observations that do not meet the threshold go to a generic level called Other, although this can be changed with the other_level argument).
marks_database |>
  pull(marks) |>
  fct_lump_min(min = 3)
[1] A Other Other Other Other A C C Other Other A C
[13] C Other Other Other Other
Levels: C < A < Other
marks_database |>
  pull(marks) |>
  fct_lump_min(min = 3,
               other_level = "Less frequent")
[1] A Less frequent Less frequent Less frequent Less frequent
[6] A C C Less frequent Less frequent
[11] A C C Less frequent Less frequent
[16] Less frequent Less frequent
Levels: C < A < Less frequent
We can do something equivalent based on the relative frequency with fct_lump_prop().
marks_database |>
  pull(marks) |>
  fct_lump_prop(prop = 0.15,
                other_level = "less frequent")
[1] A less frequent less frequent less frequent less frequent
[6] A C C less frequent less frequent
[11] A C C less frequent less frequent
[16] less frequent less frequent
Levels: C < A < less frequent
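Besides fct_lump_min() and fct_lump_prop(), forcats also provides fct_lump_n(), which keeps only the n most frequent levels. A small self-contained sketch (the toy vector is made up for illustration):

```r
library(forcats)

x <- factor(c("A", "A", "A", "B", "B", "C", "D"))
# keep the 2 most frequent levels; everything else goes to "Other"
fct_lump_n(x, n = 2)
```

Here "A" (3 occurrences) and "B" (2) survive, while "C" and "D" are lumped into "Other".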
We can apply this to our datasets to recategorize variables very quickly.
starwars |>
  drop_na(species) |>
  mutate(species =
           fct_lump_min(species, min = 3,
                        other_level = "Others")) |>
  count(species)
# A tibble: 4 × 2
species n
<fct> <int>
1 Droid 6
2 Gungan 3
3 Human 35
4 Others 39
With fct_reorder() we can also order the levels of a factor according to a function applied to another variable.
starwars_factor <-
  starwars |>
  drop_na(height, species) |>
  mutate(species =
           fct_lump_min(species, min = 3,
                        other_level = "Others"))

starwars_factor |> pull(species)
[1] Human Droid Droid Human Human Human Human Droid Human Human
[11] Human Human Others Human Others Others Human Others Human Human
[21] Droid Others Human Human Others Human Others Others Human Others
[31] Human Human Gungan Gungan Gungan Human Others Others Human Human
[41] Others Others Others Others Others Others Others Human Others Others
[51] Others Others Others Others Others Others Human Others Others Others
[61] Human Human Human Human Others Others Others Others Human Droid
[71] Others Others Others Others Others Human Others
Levels: Droid Gungan Human Others
starwars_factor |>
  mutate(species = fct_reorder(species, height, mean)) |>
  pull(species)
[1] Human Droid Droid Human Human Human Human Droid Human Human
[11] Human Human Others Human Others Others Human Others Human Human
[21] Droid Others Human Human Others Human Others Others Human Others
[31] Human Human Gungan Gungan Gungan Human Others Others Human Human
[41] Others Others Others Others Others Others Others Human Others Others
[51] Others Others Others Others Others Others Human Others Others Others
[61] Human Human Human Human Others Others Others Others Human Droid
[71] Others Others Others Others Others Human Others
Levels: Droid Others Human Gungan
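A related helper worth knowing: fct_rev() simply reverses the current order of the levels (handy, for example, to flip the axis order in a ggplot). A minimal sketch on a toy factor:

```r
library(forcats)

f <- factor(c("low", "mid", "high"), levels = c("low", "mid", "high"))
# reverse the level order: high, mid, low
fct_rev(f)
```

fct_reorder() also accepts .desc = TRUE if you want the computed ordering reversed directly.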
7.1 💻 It’s your turn
📝 Given the variable months defined below (as a character vector), convert this variable to a factor (just that).
months <- c("Jan", "Feb", "Mar", "Apr")
Code
months <- c("Jan", "Feb", "Mar", "Apr")
months_fct <- as_factor(months)
months_fct
📝 Given the variable months defined below, convert this variable to a factor, indicating the levels correctly.
months <- c(NA, "Apr", "Jan", "Oct", "Jul", "Jan", "Sep", NA, "Feb", "Dic",
            "Jul", "Mar", "Jan", "Mar", "Feb", "Apr", "May", "Oct", "Sep", NA,
            "Dic", "Jul", "Nov", "Feb", "Oct", "Jun", "Sep", "Oct", "Oct", "Sep")
Code
months_fct <-
  factor(months,
         levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dic"))
months_fct
📝 Count how many values there are for each month, keeping in mind that they are factors (there may be unused levels, for which you should get a 0).
Code
months_fct |> fct_count()
📝 Since there are missing values, make them an explicit thirteenth level labeled "missing".
Code
months_fct <-
  months_fct |>
  fct_explicit_na(na_level = "missing")
months_fct
📝 Remove unused levels.
Code
months_fct <-
  months_fct |>
  fct_drop()
months_fct
📝 Sort the levels by frequency of occurrence.
Code
months_fct |>
  fct_infreq()
📝 Group the levels so that any level that does not appear at least 7% of the time is grouped into a level called "other months".
Code
months_fct <-
  months_fct |>
  fct_lump_prop(prop = 0.07, other_level = "other months")
months_fct
8 🐣 Case study II: CIS survey
We are going to work again with the file containing data from the CIS (Centro de Investigaciones Sociológicas) barometer «Perceptions on equality between men and women and gender stereotypes», whose fieldwork was carried out from November 6 to 14 (4000 interviews of both sexes, over 16 years old, in 1174 municipalities and 50 provinces).
8.1 Question 1
Run the entire code of the previous case study.
library(haven)

data <-
  read_sav(file = "./datos/CIS-feminismo/3428.sav") |>
  janitor::clean_names()

rm_variables <-
  data |>
  summarise(across(everything(), n_distinct)) |>
  pivot_longer(cols = everything(), names_to = "variable", values_to = "n_values") |>
  filter(n_values <= 1) |>
  pull(variable)

data <-
  data |>
  # all_of() avoids the tidyselect deprecation warning for external vectors
  select(-all_of(rm_variables))

rm_variables <-
  data |>
  summarise(across(everything(), n_distinct)) |>
  pivot_longer(cols = everything(), names_to = "variable", values_to = "n_values") |>
  filter(n_values == nrow(data)) |>
  pull(variable)

data <-
  data |>
  # all of them except one of them
  select(-all_of(rm_variables[-1]))

data <-
  data |>
  select(-contains("ia_"), -contains("peso"))
population <-
  tibble::tribble(~Total, ~`47.385.107`,
                  "01 Andalucía", "8.472.407",
                  "02 Aragón", "1.326.261",
                  "03 Asturias, Principado de", "1.011.792",
                  "04 Balears, Illes", "1.173.008",
                  "05 Canarias", "2.172.944",
                  "06 Cantabria", "584.507",
                  "07 Castilla y León", "2.383.139",
                  "08 Castilla - La Mancha", "2.049.562",
                  "09 Cataluña", "7.763.362",
                  "10 Comunitat Valenciana", "5.058.138",
                  "11 Extremadura", "1.059.501",
                  "12 Galicia", "2.695.645",
                  "13 Madrid, Comunidad de", "6.751.251",
                  "14 Murcia, Región de", "1.518.486",
                  "15 Navarra, Comunidad Foral de", "661.537",
                  "16 País Vasco", "2.213.993",
                  "17 Rioja, La", "319.796",
                  "18 Ceuta", "83.517",
                  "19 Melilla", "86.261")
population <-
  population |>
  rename(ccaa = Total, pop = `47.385.107`) |>
  # First, to convert to numeric we need to remove dots
  mutate("pop" = as.numeric(str_replace_all(pop, "\\.", ""))) |>
  # if we use " " as sep, we also split some names
  separate(col = ccaa, into = c("cod_INE", "name"), sep = 2) |>
  # From the stringr package, str_squish() removes whitespace at the start
  # and end, and replaces all internal whitespace with a single space.
  mutate(name = str_squish(name))
data <-
  data |>
  # we need first to convert cod_INE to numeric since both
  # key columns should be of the same type
  left_join(population |> mutate("cod_INE" = as.numeric(cod_INE)),
            by = c("ccaa" = "cod_INE")) |>
  relocate(name, pop, .after = "ccaa")
8.2 Question 2
Notice that many columns are of type <dbl+lbl>, a data type inherited from SPSS that stores both the numeric value (e.g. 1 for the autonomous community ccaa of Andalusia) and its label («Andalucía»). You can see more in the {labelled} package, but for now we will use to_factor() to be able to work with those labels when we want.
data |>
  # numeric mode
  count(ccaa == 1)
# A tibble: 2 × 2
`ccaa == 1` n
<lgl> <int>
1 FALSE 3385
2 TRUE 620
data |>
  # labels mode
  count(labelled::to_factor(ccaa) == "Andalucía")
# A tibble: 2 × 2
`labelled::to_factor(ccaa) == "Andalucía"` n
<lgl> <int>
1 FALSE 3385
2 TRUE 620
The variables escfeminis and sitconv actually correspond to questions P16 and P18 of the questionnaire. Rename them as p16 and p18.
Code
data <-
  data |>
  rename(p16 = escfeminis, p18 = sitconv)
8.3 Question 3
The variables tarhogentrev_1, tarhogentrev_2, tarhogentrev_8, tarhogentrev_9, tarhogentrev_hh and tarhogentrev_mm refer to the same question: time spent on household chores.
The variable tarhogentrev_1 stores a 1 if the respondent answered in hours (otherwise 2).
The variable tarhogentrev_2 stores a 1 if the respondent answered in minutes (otherwise 2).
The variable tarhogentrev_8 stores a 1 if the respondent answered "can't say" (otherwise 2).
The variable tarhogentrev_9 stores a 1 if the respondent did not want to answer.
The variables tarhogentrev_hh and tarhogentrev_mm store the amounts (in hours or minutes, or 99 or 98 if the respondent did not know or did not want to answer).
Combine the information from these six variables into a single variable p19 with the number of minutes spent (NA if it cannot be known), and delete all the others.
Code
data <-
  data |>
  mutate("p19" =
           if_else(tarhogentrev_1 == 1, tarhogentrev_hh * 60,
                   if_else(tarhogentrev_2 == 1, tarhogentrev_mm, NA)),
         .after = p18) |>
  select(-contains("tarhogentrev"))
8.4 Question 4
Perform the same recoding with the variables tarhogparej_1, tarhogparej_2, tarhogparej_8, tarhogparej_9, tarhogparej_hh, tarhogparej_mm (time spent on household chores by the respondent's partner) and group their info in p20.
data <-
  data |>
  mutate("p20" =
           if_else(tarhogparej_1 == 1, tarhogparej_hh * 60,
                   if_else(tarhogparej_2 == 1, tarhogparej_mm, NA)),
         .after = p19) |>
  select(-contains("tarhogparej"))
8.5 Question 5
The variable hijomenor_1 contains a 1 if the respondent has children, and the number of children is in hijomenor_n; if hijomenor_97 contains a 1 we should impute 0 as the number of children; if hijomenor_99 is 1 we should impute NA. Collect the info of the 4 variables in a single variable called p21.
data <-
  data |>
  mutate("p21" =
           if_else(hijomenor_1 == 1, hijomenor_n,
                   if_else(hijomenor_97 == 1, 0, NA)),
         .after = p20) |>
  select(-contains("hijomenor"))
8.6 Question 6
The following variables (cuidadohijos_xxx and cuidadohijospar_xxx) code something similar with respect to the time spent on childcare (by the respondent and by his/her partner). Summarize the information in two variables p21a and p21b, as we did with p19 and p20.
data <-
  data |>
  mutate("p21a" =
           if_else(cuidadohijos_1 == 1, cuidadohijos_hh * 60,
                   if_else(cuidadohijos_2 == 1, cuidadohijos_mm, NA)),
         "p21b" =
           if_else(cuidadohijospar_1 == 1, cuidadohijospar_hh * 60,
                   if_else(cuidadohijospar_2 == 1, cuidadohijospar_mm, NA)),
         .after = p21) |>
  select(-contains("cuidadohijos"))
8.7 Question 7
Do the same with tareascuid_xxx, regarding the time spent on the care and attention of a sick and/or dependent person in a working day, and store the info in p22a. Note that p22 stores whether the respondent has a sick or dependent person in his/her care; if p22 is No (2) or N.C. (9), the value of p22a should be NA.
data <-
  data |>
  mutate("p22a" =
           if_else(p22 %in% c(2, 9), NA,
                   if_else(tareascui_1 == 1, tareascui_hh * 60,
                           if_else(tareascui_2 == 1, tareascui_mm, NA))),
         .after = p22) |>
  select(-contains("tareascui"))
8.8 Question 8
Finally, the variables escideol, participaciong, recuvotog, escuela, nivelestentrev, religion, practicarelig6, sitlab, relalab, cno11, tipojor, ingreshog, clasesocial, paisnac and paisnac2 should be renamed as p24, p25, p25a, p26, p26a, p27, p27a, p28, p28a, p28b, p28c, p29, p30, p31 and p31a. In addition, the variable sinceridad collects the subjective sincerity that the pollster perceived in the respondent, and the variables recuvotogr and recuerdo capture vote recall information that we have already collected elsewhere. Rename the former, delete the latter, and move the variables estudios, cno11r and clasesub (studies, occupation according to Social Security coding, and subjective class identification) before the survey questions.
Code
data <-
  data |>
  rename(p24 = escideol, p25 = participaciong, p25a = recuvotog,
         p26 = escuela, p26a = nivelestentrev, p27 = religion,
         p27a = practicarelig6, p28 = sitlab, p28a = relalab,
         p28b = cno11, p28c = tipojor, p29 = ingreshog,
         p30 = clasesocial, p31 = paisnac, p31a = paisnac2) |>
  select(-(sinceridad:recuerdo)) |>
  relocate(estudios, cno11r, clasesub, .before = p0)

data
# A tibble: 4,005 × 107
registro ccaa name pop prov mun capital tamuni tipo_tel
<dbl> <dbl+lbl> <chr> <dbl> <dbl+l> <dbl+lb> <dbl+l> <dbl+l> <dbl+lb>
1 2821 1 [Andalucía] Anda… 8.47e6 4 [Alm… 13 [Alm… 2 [Cap… 5 [100… 2 [Móvi…
2 19046 1 [Andalucía] Anda… 8.47e6 4 [Alm… 13 [Alm… 2 [Cap… 5 [100… 1 [Fijo]
3 20465 1 [Andalucía] Anda… 8.47e6 4 [Alm… 13 [Alm… 2 [Cap… 5 [100… 2 [Móvi…
4 31462 1 [Andalucía] Anda… 8.47e6 4 [Alm… 13 [Alm… 2 [Cap… 5 [100… 2 [Móvi…
5 33504 1 [Andalucía] Anda… 8.47e6 4 [Alm… 13 [Alm… 2 [Cap… 5 [100… 2 [Móvi…
6 35292 1 [Andalucía] Anda… 8.47e6 4 [Alm… 13 [Alm… 2 [Cap… 5 [100… 2 [Móvi…
7 37633 1 [Andalucía] Anda… 8.47e6 4 [Alm… 13 [Alm… 2 [Cap… 5 [100… 2 [Móvi…
8 40018 1 [Andalucía] Anda… 8.47e6 4 [Alm… 13 [Alm… 2 [Cap… 5 [100… 1 [Fijo]
9 41526 1 [Andalucía] Anda… 8.47e6 4 [Alm… 13 [Alm… 2 [Cap… 5 [100… 2 [Móvi…
10 62946 1 [Andalucía] Anda… 8.47e6 4 [Alm… 13 [Alm… 2 [Cap… 5 [100… 1 [Fijo]
# ℹ 3,995 more rows
# ℹ 98 more variables: sexo <dbl+lbl>, edad <dbl+lbl>, estudios <dbl+lbl>,
# cno11r <dbl+lbl>, clasesub <dbl+lbl>, p0 <dbl+lbl>, p1 <dbl+lbl>,
# p2 <dbl+lbl>, p3_1 <dbl+lbl>, p3_2 <dbl+lbl>, p3_3 <dbl+lbl>,
# p3_4 <dbl+lbl>, p3_5 <dbl+lbl>, p4 <dbl+lbl>, p5 <dbl+lbl>, p6_1 <dbl+lbl>,
# p6_2 <dbl+lbl>, p6_3 <dbl+lbl>, p7_1 <dbl+lbl>, p7_2 <dbl+lbl>,
# p7_3 <dbl+lbl>, p7_4 <dbl+lbl>, p7_5 <dbl+lbl>, p7_6 <dbl+lbl>, …
8.9 Question 9
Note that there are variables in the questionnaire, such as p0 and p1, that although coded numerically should really be treated as qualitative. For example, if you look at the questionnaire, the variable p0 corresponds to question P0.c, which asks "First of all I would like to ask you if you have…":
Spanish nationality --> 1
Spanish nationality and other --> 2
Other nationality --> 3
data |> count(p0)
# A tibble: 3 × 2
p0 n
<dbl+lbl> <int>
1 1 [La nacionalidad española] 3711
2 2 [La nacionalidad española y otra] 151
3 3 [Otra nacionalidad] 143
If we were to leave the variable as it is, numerically, we would be assuming two incorrect things:
There is a hierarchy (having Spanish nationality is better or worse than having another nationality).
This hierarchy is also quantified numerically: since 2 is double 1, it implies that, in some way, having two nationalities is "twice as much" as having only Spanish nationality.
If you look at the questionnaire, all the variables named p{number} (without the underscore _) are of this type. The others correspond to a more complex question (a battery of items to be assessed on a scale that is qualitative ordinal, and we will consider that it can be converted to an ordinal factor).
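The fix for a nominal variable like p0 is to store it as an (unordered) factor whose labels carry the meaning. A minimal base-R sketch with made-up values (the English labels are illustrative, not the ones stored in the dataset):

```r
# toy numeric coding of nationality, as in question P0.c
p0_num <- c(1, 2, 3, 1, 1)
p0_fct <- factor(p0_num, levels = 1:3,
                 labels = c("Spanish", "Spanish and other", "Other"))
p0_fct
# now the variable is nominal: no hierarchy, no arithmetic on the codes
```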
Convert the p{number} variables to qualitative (nominal) factors with their labels, and the p{number}_x variables to ordinal qualitative factors (with their numbers). In the latter, note that "don't know/no answer" responses are coded as 98 and 99 (and should be NA).
Code
library(labelled)

data <-
  data |>
  # with $ to specify "not more than..."
  mutate(across(matches("p[0-9]{1}$|p[0-9]{2}$"), to_factor),
         across(matches("p[0-9]{1}_|p[0-9]{2}_"),
                function(x) { factor(if_else(x %in% c(98, 99), NA, x),
                                     ordered = TRUE) }))
In some compound questions with only 5 options, the NA are coded as 7, 8 or 9.
data <-
  data |>
  # with $ to specify "not more than..."
  mutate(across(matches("p[0-9]{1}_|p[0-9]{2}_"),
                function(x) { factor(if_else(x %in% c(7, 8, 9), NA, x),
                                     ordered = TRUE) }))
8.10 Question 10
Calculate a two-dimensional table of relative frequencies between the variable sexo and the variable p4 (degree of inequality between men and women in Spain), normalizing as you consider appropriate to answer the following question: what percentage of women believe that inequalities are very large or quite large? And of men?
Code
table_freq <- table(to_factor(data$sexo), data$p4)
prop.table(table_freq, margin = 1)
# 67% of women
# 48.7% of men
8.11 Question 11
Calculate a measure of association between these variables and detail conclusions.
Code
data |>
  summarise("sig_chisq" = chisq.test(to_factor(sexo), p4)$p.value)
# tiny p-value --> enough evidence against the null hyp -->
# answers to p4 depend on sex
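The chi-squared p-value only tells us whether there is evidence of association, not how strong it is. A common companion effect-size measure is Cramér's V, sqrt(chi2 / (n * (k - 1))) with k the smaller table dimension, which ranges from 0 (no association) to 1. A base-R sketch on a toy contingency table (made-up data for illustration):

```r
# toy two-way table: sex vs a yes/no answer
tb <- table(c("m", "m", "f", "f", "f", "m"),
            c("yes", "no", "yes", "yes", "no", "no"))
chi <- suppressWarnings(chisq.test(tb, correct = FALSE))
n <- sum(tb)
cramers_v <- sqrt(as.numeric(chi$statistic) / (n * (min(dim(tb)) - 1)))
cramers_v
```

The same computation could be applied to table(to_factor(data$sexo), data$p4) to quantify how strong the detected association actually is.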
8.12 Question 12
Create a simple ggplot() graph representing the variable p4 by sex, so that the differences between the sexes can be appreciated.
Code
# first option: geom_bar with abs freqs
ggplot(data |>
         mutate("sexo" = to_factor(sexo))) +
  geom_bar(aes(x = p4, fill = p4), alpha = 0.7) +
  ggthemes::scale_fill_colorblind() +
  facet_wrap(~sexo) +
  theme_minimal()

# second option: geom_col with rel freqs
ggplot(data |>
         mutate("sexo" = to_factor(sexo)) |>
         count(sexo, p4) |>
         mutate("porc" = n/sum(n))) +
  geom_col(aes(x = p4, y = porc, fill = p4), alpha = 0.7) +
  scale_y_continuous(labels = scales::label_percent()) +
  ggthemes::scale_fill_colorblind() +
  facet_wrap(~sexo) +
  theme_minimal()

# third option: horizontal bars
ggplot(data |>
         mutate("sexo" = to_factor(sexo)) |>
         count(sexo, p4) |>
         mutate("porc" = n/sum(n)) |>
         mutate("porc" = if_else(sexo == "Hombre", -porc, porc))) +
  geom_col(aes(y = p4, x = porc, fill = p4), alpha = 0.7) +
  scale_x_continuous(labels = scales::label_percent()) +
  ggthemes::scale_fill_colorblind() +
  facet_wrap(~sexo, scales = "free_x") +
  theme_minimal()

# fourth option: fill geom_bar
ggplot(data |>
         mutate("sexo" = to_factor(sexo))) +
  geom_bar(aes(x = sexo, fill = p4), position = "fill",
           alpha = 0.7) +
  ggthemes::scale_fill_colorblind() +
  theme_minimal()
8.13 Question 13
Perform the same analysis, but first regroup p4 into only 3 levels: grandes (Muy grandes and Bastante grandes), pequeñas (Pequeñas and Casi inexistentes) and NS/NC.
Code
data <-
  data |>
  mutate("p4" =
           fct_collapse(p4,
                        "grandes" = c("Muy grandes", "Bastante grandes"),
                        "NS/NC" = c("N.S.", "N.C."),
                        other_level = "pequeñas"))

table_freq <- table(to_factor(data$sexo), data$p4)
prop.table(table_freq, margin = 1)
# 67% of women
# 48.7% of men

data |>
  summarise("sig_chisq" = chisq.test(to_factor(sexo), p4)$p.value)

ggplot(data |>
         mutate("sexo" = to_factor(sexo))) +
  geom_bar(aes(x = sexo, fill = p4), position = "fill",
           alpha = 0.7) +
  ggthemes::scale_fill_colorblind() +
  theme_minimal()
8.14 Question 14
Question p16 captures, on a scale of 0 to 10 (where 0 means "not at all feminist" and 10 means "very feminist"), the self-placement of the respondents. How is this variable distributed by gender? Make a frequency table.
Code
prop.table(table(to_factor(data$sexo), data$p16), margin = 1)
What % of men and women declare themselves above 6?
Code
data |>
  mutate("p16" = factor(p16, ordered = TRUE)) |>
  group_by(sexo) |>
  count(p16 > 7) |>
  mutate("porc" = 100*n/sum(n)) |>
  ungroup()
# 48.3% between men
# 62.3% between women
Question p19 captured the minutes spent on household chores. Convert it to numeric ("No minutes" as 0).
Code
data <-
  data |>
  mutate("p19" = as.numeric(as.character(p19)))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `p19 = as.numeric(as.character(p19))`.
Caused by warning:
! NAs introduced by coercion
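The as.character() step is essential here: calling as.numeric() directly on a factor returns the internal level codes, not the values shown as labels. A minimal base-R sketch:

```r
f <- factor(c("10", "20", "30"))
as.numeric(f)                # internal level codes: 1 2 3
as.numeric(as.character(f))  # actual values: 10 20 30
```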
After that, first calculate the average disaggregated by sex, and then the average disaggregated by sex and p16. What conclusions do you draw? Make a graph if you consider it appropriate.
Code
mean_sex <-
  data |>
  summarise("avg_min" = mean(p19, na.rm = TRUE), .by = sexo)
# 175 minutes in women vs 129 in men

mean_sex_feminism <-
  data |>
  summarise("avg_min" = mean(p19, na.rm = TRUE),
            .by = c(sexo, p16))

ggplot(mean_sex_feminism) +
  geom_col(aes(x = p16, y = avg_min, fill = avg_min),
           alpha = 0.7) +
  scale_fill_viridis_c(option = "magma") +
  facet_wrap(~to_factor(sexo)) +
  theme_minimal()
Code
# We can see how, in men, the average minutes remain roughly
# constant regardless of self-placement on the feminism scale
8.15 Question 15
We will calculate, in a new variable, the difference in minutes spent at home between the respondent (p19) and his/her partner (p20).
Code
data <-
  data |>
  mutate("p20" = as.numeric(as.character(p20))) |>
  mutate("diff_min" = p19 - p20)
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `p20 = as.numeric(as.character(p20))`.
Caused by warning:
! NAs introduced by coercion
Repeat the previous analysis using this difference in minutes.
Code
mean_sex <-
  data |>
  filter(p17a %in% c(1, 2) & as.numeric(p17a) != as.numeric(sexo)) |>
  summarise("avg_diff_min" = mean(diff_min, na.rm = TRUE), .by = sexo)
# +63.7 minutes in women vs -40.6 in men

mean_sex_feminism <-
  data |>
  filter(p17a %in% c(1, 2) & as.numeric(p17a) != as.numeric(sexo)) |>
  summarise("avg_diff_min" = mean(diff_min, na.rm = TRUE),
            .by = c(sexo, p16))

ggplot(mean_sex_feminism) +
  geom_col(aes(x = p16, y = avg_diff_min, fill = avg_diff_min),
           alpha = 0.7) +
  scale_fill_viridis_c(option = "magma") +
  facet_wrap(~to_factor(sexo)) +
  theme_minimal()