---
title: "Spatial Joins Extended"
author: "Robin Lovelace, Jakub Nowosad, Jannes Muenchow"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Spatial Joins Extended}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

This vignette provides some further detail on the Vector attribute joining section (see https://geocompr.robinlovelace.net/attr.html#vector-attribute-joining ) of [the Geocomputation with R book](https://geocompr.github.io/).

This vignette requires the following packages to be installed and attached:

```{r, message=FALSE}
library(sf)
library(spData)
library(dplyr)
```

We will use an `sf` object `north_america` with country codes (`iso_a2`), names and geometries, as well as a `data.frame` object `wb_north_america` containing information about urban population and unemployment for three countries.
Note that `north_america` contains data about Canada, Greenland and the United States but the World Bank dataset (`wb_north_america`) contains information about Canada, Mexico and the United States:

```{r}
north_america = world %>%
  filter(subregion == "Northern America") %>%
  dplyr::select(iso_a2, name_long)
north_america
```

```{r}
wb_north_america = worldbank_df %>% 
  filter(name %in% c("Canada", "Mexico", "United States")) %>%
  dplyr::select(name, iso_a2, urban_pop, unemploy = unemployment)
wb_north_america
```

We will use a left join to combine the two datasets.
Left joins are the most commonly used operation for adding attributes to spatial data, as they return all observations from the left object (`north_america`) and the matched observations from the right object (`wb_north_america`) in new columns.
Rows in the left object without matches in the right (`Greenland` in this case) result in `NA` values.

To join two objects we need to specify a key.
This is a variable (or a set of variables) that uniquely identifies each observation (row). 
The `by` argument of **dplyr**'s join functions lets you identify the key variable. 
In simple cases, a single, unique variable exist in both objects like the `iso_a2` column in our example (you may need to rename columns with identifying information for this to work):

```{r}
left_join1 = north_america %>% 
  left_join(wb_north_america, by = "iso_a2")
left_join1
```

This has created a spatial dataset with the new variables added.
The utility of this is shown in the figure below, which shows the unemployment rate (a World Bank variable) across the countries of North America.

```{r unemploy, echo=FALSE, fig.cap="Figure 1. The unemployment rate (taken from World Bank statistics) in Canada and the United States to illustrate the utility of joining attribute data on to spatial datasets.", fig.width=5}
# tmap::qtm(left_join1, "unemploy", fill.breaks = c(6, 6.5, 7), fill.title="Unemployment rate: ",
#           projection = "+proj=aea +lat_1=20 +lat_2=60 +lat_0=40 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs")
library(tmap)
tm_shape(left_join1) + 
  tm_polygons("unemploy", breaks = c(6, 6.5, 7), title = "Unemployment rate: ") +
  tm_layout(legend.position = c("right", "bottom"))
```

It is also possible to join objects by different variables.
Both of the datasets have variables with names of countries, but they are named differently.
The `north_america` has a `name_long` column and the `wb_north_america` has a `name` column.
In these cases a named vector, such as `c("name_long" = "name")`, can specify the connection:

```{r}
left_join2 = north_america %>% 
  left_join(wb_north_america, by = c("name_long" = "name"))
left_join2
```

Note that the result contains two duplicated variables - `iso_a2.x` and `iso_a2.y` because both `x` and `y` objects have the column `iso_a2`.
This can be solved by specifying all the keys:

```{r}
left_join3 = north_america %>% 
  left_join(wb_north_america, by = c("iso_a2", "name_long" = "name"))
left_join3
```

Joins also work when a data frame is the first argument.
However, for them to work we need to drop the `sf` class.

```{r}
left_join4 = wb_north_america %>%
  left_join(st_drop_geometry(north_america), by = c("iso_a2"))
left_join4
```

```{r}
class(left_join4)
```

In contrast to `left_join()`, `inner_join()` keeps only observations from the left object (`north_america`) where there are matching observations in the right object (`wb_north_america`). 
All columns from the left and right object are still kept:

```{r}
inner_join1 = north_america %>% 
  inner_join(wb_north_america, by = c("iso_a2", "name_long" = "name"))
inner_join1
```