The {dm} package offers functions to work with relational data models in
R. A common task with multiple separate tables that share an attribute
is merging their data.

This document introduces you to the joining functions of {dm} and shows
how to apply them using data from the
[{nycflights13}](https://github.com/tidyverse/nycflights13) package.

[Relational data
models](https://cynkra.github.io/dm/articles/howto-dm-theory#model)
consist of multiple tables that are linked with [foreign
keys](https://cynkra.github.io/dm/articles/howto-dm-theory#fk). They are
the building blocks for joining tables. Read more about relational data
models in the vignette [“Introduction to Relational Data
Models”](https://cynkra.github.io/dm/articles/howto-dm-theory).

First, we load the packages that we need:

``` r
library(dm)
library(tidyverse)
```

## Data: nycflights13

To explore joining with {dm}, we’ll use the {nycflights13} data with
its tables `flights`, `planes`, `airlines` and `airports`.

This dataset contains information about the 336 776 flights that
departed from New York City in 2013, with 3322 different planes and 1458
airports involved. The data comes from the US Bureau of Transportation
Statistics, and is documented in `?nycflights13`.

First, we have to create a `dm` object from the {nycflights13} data.
This is implemented with `dm::dm_nycflights13()`.

A [data model
object](https://cynkra.github.io/dm/articles/tech-dm-class.html#class-dm)
contains the data as well as metadata.

If you would like to create a `dm` from other tables, please look at
`?dm` and the function `new_dm()`.

``` r
dm <- dm_nycflights13()
```
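For tables of your own, a minimal sketch could look like the following. The toy tables `flights_toy` and `airlines_toy` are made up for illustration, and the sketch assumes that `dm()`, `dm_add_pk()` and `dm_add_fk()` are available with these signatures in your installed version of {dm}; check `?dm` if in doubt.

``` r
library(dm)
library(dplyr)

# Hypothetical toy tables (not part of nycflights13)
flights_toy <- tibble(flight = 1:3, carrier = c("DL", "UA", "DL"))
airlines_toy <- tibble(carrier = c("DL", "UA"), name = c("Delta", "United"))

toy_dm <-
  dm(flights_toy, airlines_toy) %>%
  # `carrier` uniquely identifies a row in the parent table
  dm_add_pk(airlines_toy, carrier) %>%
  # `flights_toy$carrier` refers to the primary key of `airlines_toy`
  dm_add_fk(flights_toy, carrier, airlines_toy)

# One foreign key relation is now registered
dm_get_all_fks(toy_dm)
```

Declaring the keys up front is what later lets {dm} work out the join columns on its own.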

## Joining a `dm` object

{dm} allows you to join two tables of a `dm` object based on a shared
column. You can use all join functions that you know from the [{dplyr}
package](https://dplyr.tidyverse.org/reference/join.html). Currently
{dplyr} supports four types of mutating joins, two types of filtering
joins, and a nesting join. See `?dplyr::join` for details.
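To make the two main join families concrete, here is a small sketch using plain {dplyr} on two toy tibbles. The tables `flights_toy` and `airlines_toy` and their contents are made up for illustration and are not the {nycflights13} data.

``` r
library(dplyr)

# Hypothetical toy tables
flights_toy <- tibble(
  flight  = c(1L, 2L, 3L),
  carrier = c("DL", "UA", "XX") # "XX" has no match in airlines_toy
)
airlines_toy <- tibble(
  carrier = c("DL", "UA", "AA"),
  name    = c("Delta", "United", "American")
)

# Mutating joins add columns from the second table
left  <- left_join(flights_toy, airlines_toy, by = "carrier")  # keeps all flights
inner <- inner_join(flights_toy, airlines_toy, by = "carrier") # keeps matched flights only

# Filtering joins keep or drop rows without adding columns
semi <- semi_join(flights_toy, airlines_toy, by = "carrier") # flights with a match
anti <- anti_join(flights_toy, airlines_toy, by = "carrier") # flights without a match
```

In `left`, the unmatched flight keeps its row and gets `NA` for `name`; in `semi` and `anti`, the column set of `flights_toy` is unchanged.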

### How it works

A join is the combination of two tables based on shared information. In
technical terms, the two tables to be merged must be directly connected
by a [foreign key
relation](https://cynkra.github.io/dm/articles/howto-dm-theory#fk).

The existing links can be inspected in two ways:

1.  Visually, by drawing the data model with `dm_draw()`

``` r
dm %>%
  dm_draw()
```

![](/home/kirill/git/R/dm/vignettes/out/tech-dm-join_files/figure-gfm/unnamed-chunk-3-1.png)<!-- -->

The directed arrows show explicitly which columns of the different
tables are related.

2.  Printed to the console by calling `dm_get_all_fks()`

``` r
dm %>%
  dm_get_all_fks()
#> # A tibble: 4 x 4
#>   child_table child_fk_cols     parent_table parent_key_cols   
#>   <chr>       <keys>            <chr>        <keys>           
#> 1 flights     carrier           airlines     carrier          
#> 2 flights     origin            airports     faa              
#> 3 flights     tailnum           planes       tailnum          
#> 4 flights     origin, time_hour weather      origin, time_hour
```
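Each row of this table can be read like the `by` argument of a {dplyr} join: the child column on the left, the parent key column on the right. As a rough sketch of the second row (`origin` in `flights` referring to `faa` in `airports`), using made-up toy tables rather than the real data:

``` r
library(dplyr)

# Hypothetical toy tables
flights_toy <- tibble(flight = c(1L, 2L), origin = c("JFK", "EWR"))
airports_toy <- tibble(
  faa  = c("EWR", "JFK", "LGA"),
  name = c("Newark Liberty Intl", "John F Kennedy Intl", "La Guardia")
)

# Child column `origin` is matched against parent key column `faa`
joined <- left_join(flights_toy, airports_toy, by = c("origin" = "faa"))
```

Because the key columns have different names, the mapping has to be spelled out in plain {dplyr}; with a `dm` object, this information is taken from the stored foreign key.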

### Joining Examples

Let’s look at some examples:

**Add a column with airline names from the `airlines` table to the
`flights` table.**

``` r
dm_joined <-
  dm %>%
  dm_join_to_tbl(flights, airlines, join = left_join)
dm_joined
#> # A tibble: 11,227 x 20
#>     year month   day dep_time sched_dep_time dep_delay arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>
#>  1  2013     1    10        3           2359         4      426
#>  2  2013     1    10       16           2359        17      447
#>  3  2013     1    10      450            500       -10      634
#>  4  2013     1    10      520            525        -5      813
#>  5  2013     1    10      530            530         0      824
#>  6  2013     1    10      531            540        -9      832
#>  7  2013     1    10      535            540        -5     1015
#>  8  2013     1    10      546            600       -14      645
#>  9  2013     1    10      549            600       -11      652
#> 10  2013     1    10      550            600       -10      649
#> # … with 11,217 more rows, and 13 more variables: sched_arr_time <int>,
#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, name <chr>
```

As you can see below, the `dm_joined` dataframe has one more column than
the `flights` table. The difference is the `name` column from the
`airlines` table.

``` r
dm %>%
  tbl("flights") %>%
  names()
#>  [1] "year"           "month"          "day"            "dep_time"      
#>  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
#>  [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
#> [13] "origin"         "dest"           "air_time"       "distance"      
#> [17] "hour"           "minute"         "time_hour"

dm %>%
  tbl("airlines") %>%
  names()
#> [1] "carrier" "name"

dm_joined %>%
  names()
#>  [1] "year"           "month"          "day"            "dep_time"      
#>  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
#>  [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
#> [13] "origin"         "dest"           "air_time"       "distance"      
#> [17] "hour"           "minute"         "time_hour"      "name"
```

The result is not a `dm` object anymore, but a conventional dataframe:

``` r
dm_joined %>%
  class()
#> [1] "tbl_df"     "tbl"        "data.frame"
```

Another example:

**Get all flights that can’t be matched with airline names.**

We expect the flights data from the {nycflights13} package to be clean
and well organized, so no flights should remain. You can check this with
an `anti_join`:

``` r
dm %>%
  dm_join_to_tbl(flights, airlines, join = anti_join)
#> # A tibble: 0 x 19
#> # … with 19 variables: year <int>, month <int>, day <int>, dep_time <int>,
#> #   sched_dep_time <int>, dep_delay <dbl>, arr_time <int>,
#> #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
```
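For contrast, this is roughly what the check looks like when the data is not clean. The sketch uses plain {dplyr} on made-up toy tables in which the hypothetical carrier code `"ZZ"` has no entry in the parent table:

``` r
library(dplyr)

# Hypothetical toy tables with a dangling foreign key value "ZZ"
flights_toy <- tibble(
  flight  = 1:4,
  carrier = c("DL", "ZZ", "UA", "ZZ")
)
airlines_toy <- tibble(
  carrier = c("DL", "UA"),
  name    = c("Delta", "United")
)

# Rows of the child table without a partner in the parent table
unmatched <- anti_join(flights_toy, airlines_toy, by = "carrier")

# Summarise the offending key values
unmatched_keys <- count(unmatched, carrier)
```

A non-empty `unmatched` table points to rows that violate the foreign key relation.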

An example with filtering on a `dm` and then merging:

**Get all flights from Delta Air Lines in May 2013 and join all the
airports data into the `flights` table.**

Currently, you must call `dm_apply_filters()` after piping your filter
conditions. Only then are the underlying tables and key relations
updated, so that the join operates on the filtered data. We are working
towards removing this inconvenience, see
[\#62](https://github.com/cynkra/dm/issues/62).

``` r
dm_nycflights13() %>%
  dm_filter(airlines, name == "Delta Air Lines Inc.") %>%
  dm_filter(flights, month == 5) %>%
  dm_apply_filters() %>%
  dm_join_to_tbl(flights, airports, join = left_join)
#> # A tibble: 136 x 26
#>     year month   day dep_time sched_dep_time dep_delay arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>
#>  1  2013     5    10      554            600        -6      739
#>  2  2013     5    10      556            600        -4      825
#>  3  2013     5    10      606            610        -4      743
#>  4  2013     5    10      625            630        -5      843
#>  5  2013     5    10      632            635        -3      847
#>  6  2013     5    10      653            700        -7      923
#>  7  2013     5    10      654            700        -6     1001
#>  8  2013     5    10      656            700        -4     1008
#>  9  2013     5    10      656            700        -4      911
#> 10  2013     5    10      657            700        -3     1006
#> # … with 126 more rows, and 19 more variables: sched_arr_time <int>,
#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, name <chr>, lat <dbl>, lon <dbl>,
#> #   alt <dbl>, tz <dbl>, dst <chr>, tzone <chr>
```
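Conceptually, a filter on one table restricts the connected tables to the rows that still have a match. The following {dplyr} sketch illustrates that idea on made-up toy tables; it is an approximation of the concept, not {dm}'s actual implementation.

``` r
library(dplyr)

# Hypothetical toy tables
airlines_toy <- tibble(
  carrier = c("DL", "UA"),
  name    = c("Delta Air Lines Inc.", "United Air Lines Inc.")
)
flights_toy <- tibble(
  flight  = 1:4,
  month   = c(5L, 5L, 6L, 5L),
  carrier = c("DL", "UA", "DL", "DL"),
  origin  = c("JFK", "EWR", "LGA", "LGA")
)
airports_toy <- tibble(
  faa   = c("EWR", "JFK", "LGA"),
  tzone = rep("America/New_York", 3)
)

# Filter on the parent table ...
delta <- filter(airlines_toy, name == "Delta Air Lines Inc.")

result <- flights_toy %>%
  filter(month == 5) %>%
  # ... restricts the child table to rows that still have a match
  semi_join(delta, by = "carrier") %>%
  # only then are the airport columns joined in
  left_join(airports_toy, by = c("origin" = "faa"))
```

Only the Delta flights in May remain, each with its airport columns attached.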

A last example:

**Merge all tables into one big table.**

Sometimes you need everything in one place. In this case you can use the
`dm_flatten_to_tbl()` function. It joins all the tables of your `dm`
object into one wide table. All you have to do is specify the starting
table. The subsequent joins are determined by the foreign key links.

``` r
dm_nycflights13() %>%
  dm_select_tbl(-weather) %>%
  dm_flatten_to_tbl(start = flights)
#> Renamed columns:
#> * year -> flights.year, planes.year
#> * name -> airlines.name, airports.name
#> # A tibble: 11,227 x 35
#>    flights.year month   day dep_time sched_dep_time dep_delay arr_time
#>           <int> <int> <int>    <int>          <int>     <dbl>    <int>
#>  1         2013     1    10        3           2359         4      426
#>  2         2013     1    10       16           2359        17      447
#>  3         2013     1    10      450            500       -10      634
#>  4         2013     1    10      520            525        -5      813
#>  5         2013     1    10      530            530         0      824
#>  6         2013     1    10      531            540        -9      832
#>  7         2013     1    10      535            540        -5     1015
#>  8         2013     1    10      546            600       -14      645
#>  9         2013     1    10      549            600       -11      652
#> 10         2013     1    10      550            600       -10      649
#> # … with 11,217 more rows, and 28 more variables: sched_arr_time <int>,
#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, airlines.name <chr>,
#> #   airports.name <chr>, lat <dbl>, lon <dbl>, alt <dbl>, tz <dbl>,
#> #   dst <chr>, tzone <chr>, planes.year <int>, type <chr>,
#> #   manufacturer <chr>, model <chr>, engines <int>, seats <int>,
#> #   speed <int>, engine <chr>
```

Be aware that all column names need to be unique. `dm_flatten_to_tbl()`
takes care of renaming the relevant columns automatically and prints a
message if anything was changed, e.g. `name -> airlines.name`.
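To see why the renaming is needed, consider joining two tables that both carry a `name` column. Plain {dplyr} would fall back to `.x` and `.y` suffixes. The sketch below instead prefixes the clashing column with its table name before joining, mimicking the `airlines.name` and `airports.name` pattern from the output above. The toy tables are made up, and the manual `rename()` calls only imitate the result, not how {dm} does it internally.

``` r
library(dplyr)

# Hypothetical toy tables; both parent tables have a `name` column
flights_toy <- tibble(flight = 1L, carrier = "DL", origin = "JFK")
airlines_toy <- tibble(carrier = "DL", name = "Delta Air Lines Inc.")
airports_toy <- tibble(faa = "JFK", name = "John F Kennedy Intl")

flat <- flights_toy %>%
  # disambiguate `name` before each join
  left_join(rename(airlines_toy, airlines.name = name), by = "carrier") %>%
  left_join(rename(airports_toy, airports.name = name), by = c("origin" = "faa"))

names(flat)
```

The prefixed names keep the origin of each column visible in the flattened table.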
