Skip to contents

Introduction

The following guide will assume you have R installed. I also highly recommend working in RStudio. If you need help getting those installed or are unfamiliar with how RStudio is laid out, please see this section of Lee Sharpe’s guide.

A quick word if you’re new to programming: all of this is happening in R. Obviously, you need to install R on your computer to do any of this. Make sure you save what you’re doing in a script (in RStudio, File –> New File –> R script) so you can save your work and run multiple lines of code at once. To run code from a script, highlight what you want, and press control + enter or press the Run button in the top of the editor (see Lee’s guide). If you don’t highlight anything and press control + enter, the currently selected line will run. As you go through your R journey, you might get stuck and have to google a bunch of things, but that’s totally okay and normal. That’s how I got started!

Setup

First, you need to install the magic packages. You only need to run this step once on a given computer. For these you can just type them into the RStudio console (look for the Console pane in RStudio) directly since you’re never going to be doing this again.

Install packages

install.packages("tidyverse", type = "binary")
install.packages("ggrepel", type = "binary")
install.packages("nflreadr", type = "binary")
install.packages("nflplotR", type = "binary")

Load packages

Okay, now here’s the stuff you’re going to want to start putting into your R script. The following loads tidyverse, which contains a lot of helper functions for working with data and ggrepel for making figures, along with nflreadr (which allows one to quickly download nflfastR data, along with a lot of other data). Finally, nflplotR makes plotting easier.

This one is optional but makes R prefer not to display numbers in scientific notation, which I find very annoying:

options(scipen = 9999)

Load data

This will load the full play by play for the 2019 season (including playoffs). We’ll get to how to get more seasons later. Note that this is downloading pre-cleaned data from the nflfastR data repository using the load_pbp() function included in nflreadr, which is much faster than building pbp from scratch.

data <- load_pbp(2019)

Basics: how to look at your data

Dimensions

Before moving forward, here are a few ways to get a sense of what’s in a dataframe. We can check the dimensions of the data, and this tells us that there are 48034 rows (i.e., plays) in the data and 372 columns (variables):

dim(data)
#> [1] 48034   372

str displays the structure of the dataframe:

str(data[1:10])
#> nflvrs_d [48,034 × 10] (S3: nflverse_data/tbl_df/tbl/data.table/data.frame)
#>  $ play_id     : num [1:48034] 1 36 51 79 100 121 148 185 214 239 ...
#>  $ game_id     : chr [1:48034] "2019_01_ATL_MIN" "2019_01_ATL_MIN" "2019_01_ATL_MIN" "2019_01_ATL_MIN" ...
#>  $ old_game_id : chr [1:48034] "2019090804" "2019090804" "2019090804" "2019090804" ...
#>  $ home_team   : chr [1:48034] "MIN" "MIN" "MIN" "MIN" ...
#>  $ away_team   : chr [1:48034] "ATL" "ATL" "ATL" "ATL" ...
#>  $ season_type : chr [1:48034] "REG" "REG" "REG" "REG" ...
#>  $ week        : int [1:48034] 1 1 1 1 1 1 1 1 1 1 ...
#>  $ posteam     : chr [1:48034] NA "ATL" "ATL" "ATL" ...
#>  $ posteam_type: chr [1:48034] NA "away" "away" "away" ...
#>  $ defteam     : chr [1:48034] NA "MIN" "MIN" "MIN" ...
#>  - attr(*, "nflverse_timestamp")= POSIXct[1:1], format: "2022-09-27 11:19:31"
#>  - attr(*, "nflverse_type")= chr "play by play data"
#>  - attr(*, "nflfastR_version")=Classes 'package_version', 'numeric_version'  hidden list of 1
#>   ..$ : int [1:4] 4 4 0 9010

In the above, I’ve added in the [1:10], which selects only the first 10 columns, otherwise the list is extremely long (remember from above that there are 372 columns!). Normally, you would just type str(data).

You can similarly take a glimpse at your data:

glimpse(data[1:10])
#> Rows: 48,034
#> Columns: 10
#> $ play_id      <dbl> 1, 36, 51, 79, 100, 121, 148, 185, 214, 239, 255, 277, 29…
#> $ game_id      <chr> "2019_01_ATL_MIN", "2019_01_ATL_MIN", "2019_01_ATL_MIN", …
#> $ old_game_id  <chr> "2019090804", "2019090804", "2019090804", "2019090804", "…
#> $ home_team    <chr> "MIN", "MIN", "MIN", "MIN", "MIN", "MIN", "MIN", "MIN", "…
#> $ away_team    <chr> "ATL", "ATL", "ATL", "ATL", "ATL", "ATL", "ATL", "ATL", "…
#> $ season_type  <chr> "REG", "REG", "REG", "REG", "REG", "REG", "REG", "REG", "…
#> $ week         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ posteam      <chr> NA, "ATL", "ATL", "ATL", "ATL", "ATL", "MIN", "MIN", "MIN…
#> $ posteam_type <chr> NA, "away", "away", "away", "away", "away", "home", "home…
#> $ defteam      <chr> NA, "MIN", "MIN", "MIN", "MIN", "MIN", "ATL", "ATL", "ATL…

Where again I’m only showing the first 10 columns. The usual command would be glimpse(data).

Variable names

Another very useful command is to get the names of the variables in the data, which you would get by entering names(data) (I won’t show here because, again, it is 372 columns).

That is a lot to work with!

Viewer

One more way to look at your data is with the View() function. If you’re coming from an Excel background, this will help you feel more at home as a way to see what’s in the data.

View(data)

This will open the viewer in RStudio in a new panel. Try it out yourself! Since there are so many columns, the Viewer won’t show them all. To pick which columns to view, you can select some:

data %>%
  select(home_team, away_team, posteam, desc) %>%
  View()

The %>% thing lets you pipe together a bunch of different commands. So we’re taking our data, “select”ing a few variables we want to look at, and then Viewing. Again, I can’t display the results of that here, but try it out yourself!

Head + manipulation

To start, let’s just look at the first few rows (the “head”) of the data.

data %>% 
  select(posteam, defteam, desc, rush, pass) %>% 
  head()
#> ── nflverse play by play data ──────────────────────────────────────────────────
#>  Data updated: 2022-09-27 11:19:31 UTC
#> # A tibble: 6 × 5
#>   posteam defteam desc                                                rush  pass
#>   <chr>   <chr>   <chr>                                              <dbl> <dbl>
#> 1 NA      NA      GAME                                                   0     0
#> 2 ATL     MIN     5-D.Bailey kicks 65 yards from MIN 35 to end zone…     0     0
#> 3 ATL     MIN     (15:00) 2-M.Ryan sacked at ATL 17 for -8 yards (5…     0     1
#> 4 ATL     MIN     (14:20) 24-D.Freeman right tackle to ATL 21 for 4…     1     0
#> 5 ATL     MIN     (13:41) (Shotgun) 2-M.Ryan scrambles left end to …     0     1
#> 6 ATL     MIN     (12:59) 5-M.Bosher punt is BLOCKED by 50-E.Wilson…     0     0

A couple things. “desc” is the important variable that lists the description of what happened on the play, and head says to show the first few rows (the “head” of the data). Since this is already sorted by game, these are the first 6 rows from a week 1 game, ATL @ MIN. To make code easier to read, people often put each part of a pipe on a new line, which is useful when working with more complicated functions. We could run:

data %>% select(posteam, defteam, desc, rush, pass) %>% head()

And it would return the exact same output as the one written out in multiple lines, but the code isn’t as easy to read.

We’ve covered select, and the next important function to learn is filter, which lets you filter the data to what you want. The following returns only plays that are run plays and pass plays; i.e., no punts, kickoffs, field goals, or dead ball penalties (e.g. false starts) where we don’t know what the attempted play was.

data %>% 
  filter(rush == 1 | pass == 1) %>%
  select(posteam, desc, rush, pass, name, passer, rusher, receiver) %>% 
  head()
#> ── nflverse play by play data ──────────────────────────────────────────────────
#>  Data updated: 2022-09-27 11:19:31 UTC
#> # A tibble: 6 × 8
#>   posteam desc                            rush  pass name  passer rusher recei…¹
#>   <chr>   <chr>                          <dbl> <dbl> <chr> <chr>  <chr>  <chr>  
#> 1 ATL     (15:00) 2-M.Ryan sacked at AT…     0     1 M.Ry… M.Ryan NA     NA     
#> 2 ATL     (14:20) 24-D.Freeman right ta…     1     0 D.Fr… NA     D.Fre… NA     
#> 3 ATL     (13:41) (Shotgun) 2-M.Ryan sc…     0     1 M.Ry… M.Ryan NA     NA     
#> 4 MIN     (12:53) 33-D.Cook right end p…     1     0 D.Co… NA     D.Cook NA     
#> 5 MIN     (12:32) 8-K.Cousins pass shor…     0     1 K.Co… K.Cou… NA     D.Cook 
#> 6 MIN     (11:57) 8-K.Cousins pass shor…     0     1 K.Co… K.Cou… NA     A.Thie…
#> # … with abbreviated variable name ¹​receiver

Compared to the first time we did this, the opening line for the start of the game, the kickoff, and the punt are now gone. Note that if you’re checking whether a variable is equal to something, we need to use the double equals sign == like above. There’s probably some technical reason for this [shrug emoji]. Also, the character | is used for “or”, and & for “and”. So rush == 1 | pass == 1 means “rush or pass”.

Note that the rush, pass, name, passer, rusher, and receiver columns are all nflfastR creations, where we have provided these to make working with the data easier. As we can see above, passer is filled in for all dropbacks (including sacks and scrambles, which also have pass = 1), and name is equal to the passer on pass plays and the rusher on rush plays. Think of this as the primary player involved on a play.

What if we wanted to view special teams plays? Again, we can use filter:

data %>% 
  filter(special == 1) %>%
  select(down, ydstogo, desc) %>% 
  head()
#> ── nflverse play by play data ──────────────────────────────────────────────────
#>  Data updated: 2022-09-27 11:19:31 UTC
#> # A tibble: 6 × 3
#>    down ydstogo desc                                                            
#>   <dbl>   <dbl> <chr>                                                           
#> 1    NA       0 5-D.Bailey kicks 65 yards from MIN 35 to end zone, Touchback.   
#> 2     4       2 (12:59) 5-M.Bosher punt is BLOCKED by 50-E.Wilson, Center-47-J.…
#> 3    NA       0 (Kick formation) 5-D.Bailey extra point is GOOD, Center-58-A.Cu…
#> 4    NA       0 5-D.Bailey kicks 67 yards from MIN 35 to ATL -2. 38-K.Barner to…
#> 5    NA       0 (Kick formation) 5-D.Bailey extra point is GOOD, Center-58-A.Cu…
#> 6    NA       0 5-D.Bailey kicks 65 yards from MIN 35 to end zone, Touchback.

Fourth down plays?

data %>% 
  filter(down == 4) %>%
  select(down, ydstogo, desc) %>% 
  head()
#> ── nflverse play by play data ──────────────────────────────────────────────────
#>  Data updated: 2022-09-27 11:19:31 UTC
#> # A tibble: 6 × 3
#>    down ydstogo desc                                                            
#>   <dbl>   <dbl> <chr>                                                           
#> 1     4       2 (12:59) 5-M.Bosher punt is BLOCKED by 50-E.Wilson, Center-47-J.…
#> 2     4      19 (2:38) 5-M.Bosher punts 33 yards to MIN 8, Center-47-J.Harris, …
#> 3     4      20 (12:33) 2-B.Colquitt punts 51 yards to ATL 17, Center-58-A.Cutt…
#> 4     4      27 (1:49) 5-M.Bosher punts 45 yards to MIN 10, Center-47-J.Harris,…
#> 5     4      10 (:49) 2-B.Colquitt punts 57 yards to ATL 33, Center-58-A.Cuttin…
#> 6     4       1 (10:56) 2-B.Colquitt punts 42 yards to ATL 10, Center-58-A.Cutt…

Fourth down plays that aren’t special teams plays?

data %>% 
  filter(down == 4 & special == 0) %>%
  select(down, ydstogo, desc) %>% 
  head()
#> ── nflverse play by play data ──────────────────────────────────────────────────
#>  Data updated: 2022-09-27 11:19:31 UTC
#> # A tibble: 6 × 3
#>    down ydstogo desc                                                            
#>   <dbl>   <dbl> <chr>                                                           
#> 1     4       5 (9:25) (Shotgun) 2-M.Ryan pass deep left to 18-C.Ridley for 20 …
#> 2     4       2 (4:39) (Punt formation) PENALTY on MIN, Delay of Game, 5 yards,…
#> 3     4       2 (1:27) (No Huddle, Shotgun) 2-M.Ryan pass short left to 11-J.Jo…
#> 4     4       1 (2:59) (Punt formation) Direct snap to 41-A.Levine.  41-A.Levin…
#> 5     4       3 (9:30) (Shotgun) 3-R.Griffin pass short left to 89-M.Andrews fo…
#> 6     4       1 (3:55) 17-J.Allen FUMBLES (Aborted) at NYJ 37, RECOVERED by NYJ…

So far, we’ve just been taking a look at the initial dataset we downloaded, but none of our results are preserved. To save a new dataframe of just the plays we want, we need to use <- to assign a new dataframe. Let’s save a new dataframe that’s just run plays and pass plays with non-missing EPA, called pbp_rp.

pbp_rp <- data %>%
  filter(rush == 1 | pass == 1, !is.na(epa))

In the above, !is.na(epa) means to exclude plays with missing (na) EPA. The ! symbol is often used by computer folk to negate something, so is.na(epa) means “EPA is missing” and !is.na(epa) means “EPA is not missing”, which we have used above.

Some basic stuff: Part 1

Okay, we have a big dataset where we call dropbacks pass plays and non-dropbacks rush plays. Now we actually want to, like, do stuff.

Group by and Summarize

Let’s take a look at how various Cowboys’ running backs fared on run plays in 2019:

pbp_rp %>%
    filter(posteam == "DAL", rush == 1) %>%
    group_by(rusher) %>%
    summarize(
      mean_epa = mean(epa), success_rate = mean(success), ypc = mean(yards_gained), plays = n()
      ) %>%
    arrange(-mean_epa) %>%
    filter(plays > 20)
#> # A tibble: 3 × 5
#>   rusher     mean_epa success_rate   ypc plays
#>   <chr>         <dbl>        <dbl> <dbl> <int>
#> 1 D.Prescott   0.288         0.591  6.41    22
#> 2 T.Pollard   -0.0265        0.456  5.08    90
#> 3 E.Elliott   -0.0412        0.411  4.39   309

There’s a lot going on here. We’ve covered filter already. The group_by function is an extremely useful function that, well, groups by what you tell it – in this case the rusher. Summarize is useful for collapsing the data down to a summary of what you’re looking at, and here, while grouping by player, we’re summarizing the mean of EPA, success, yardage (a bad rushing stat, but since we’re here), and getting the number of plays using n(), which returns the number in a group. Unsurprisingly, Prescott was much more effective as a rusher in 2019 than the running backs, and there was no meaningful difference between Pollard and Elliott in efficiency.

If you check the PFR team stats page, you’ll notice that the above doesn’t match up with the official stats. This is because nflfastR computes EPA and provides player names on plays with penalties and on two-point conversions. So if wanting to match the official stats, we need to restrict to down <= 4 (to excluded two-point conversions, which have down listed as NA) and play_type = run (to exclude penalties, which are play_type = no_play):

pbp_rp %>%
    filter(posteam == "DAL", down <= 4, play_type == 'run') %>%
    group_by(rusher) %>%
    summarize(
      mean_epa = mean(epa), success_rate = mean(success), ypc=mean(yards_gained), plays=n()
      ) %>%
    filter(plays > 20)
#> # A tibble: 3 × 5
#>   rusher     mean_epa success_rate   ypc plays
#>   <chr>         <dbl>        <dbl> <dbl> <int>
#> 1 D.Prescott   0.288         0.591  6.41    22
#> 2 E.Elliott   -0.0185        0.422  4.51   301
#> 3 T.Pollard   -0.0210        0.453  5.29    86

Now we exactly match PFR: Zeke has 301 carries at 4.5 yards/carry, and Pollard has 86 carries for 5.3 yards/carry. Note that we still aren’t matching Dak’s stats to PFR because the NFL classifies scrambles as rush attempts and nflfastR does not.

Manipulating columns: mutate, if_else, and case_when

Let’s say we want to make a new column, named home, which is equal to 1 if the team with the ball is the home team. Let’s introduce another extremely useful function, if_else:

pbp_rp %>%
  mutate(
    home = if_else(posteam == home_team, 1, 0)
  ) %>%
  select(posteam, home_team, home) %>%
  head(10)
#> ── nflverse play by play data ──────────────────────────────────────────────────
#>  Data updated: 2022-09-27 11:19:31 UTC
#> # A tibble: 10 × 3
#>    posteam home_team  home
#>    <chr>   <chr>     <dbl>
#>  1 ATL     MIN           0
#>  2 ATL     MIN           0
#>  3 ATL     MIN           0
#>  4 MIN     MIN           1
#>  5 MIN     MIN           1
#>  6 MIN     MIN           1
#>  7 ATL     MIN           0
#>  8 ATL     MIN           0
#>  9 ATL     MIN           0
#> 10 MIN     MIN           1

mutate is R’s word for creating a new column (or overwriting an existing one); in this case, we’ve created a new column called home. The above uses if_else, which uses the following pattern: condition (in this case, posteam == home_team), value if condition is true (in this case, if posteam == home_team, it is 1), and value if the condition is false (0). So we could use this to, for example, look at average EPA/play by home and road teams:

pbp_rp %>%
  mutate(
    home = if_else(posteam == home_team, 1, 0)
  ) %>%
  group_by(home) %>%
  summarize(epa = mean(epa))
#> # A tibble: 2 × 2
#>    home     epa
#>   <dbl>   <dbl>
#> 1     0  0.0215
#> 2     1 -0.0158

Note that EPA/play is similar for home teams and away teams because home is already built into the nflfastR EPA model, so this result is expected. Actually, away EPA/play is actually somewhat higher, presumably because away teams out-performed their usual in 2019 as homefield advantage continues to decline generally.

if_else is nice if you’re creating a new column based on a simple condition. But what if you need to do something more complicated? case_when is a good option. Here’s how it works:

pbp_rp %>%
  filter(!is.na(cp)) %>%
  mutate(
    depth = case_when(
      air_yards < 0 ~ "Negative",
      air_yards >= 0 & air_yards < 10 ~ "Short",
      air_yards >= 10 & air_yards < 20 ~ "Medium",
      air_yards >= 20 ~ "Deep"
    )
  ) %>%
  group_by(depth) %>%
  summarize(cp = mean(cp))
#> # A tibble: 4 × 2
#>   depth       cp
#>   <chr>    <dbl>
#> 1 Deep     0.367
#> 2 Medium   0.573
#> 3 Negative 0.847
#> 4 Short    0.718

Note the new syntax for case_when: we have condition (for the first one, air yards less than 0), followed by ~, followed by assignment (for the first one, “Negative”). In the above, we created 4 bins based on air yards and got average completion probability (cp) based on the nflfastR model. Unsurprisingly, cp is lower the longer downfield a throw goes.

A basic figure

Now that we’ve gained some skills at manipulating data, let’s put it to use by making things. Which teams were the most pass-heavy in the first half on early downs with win probability between 20 and 80, excluding the final 2 minutes of the half when everyone is pass-happy?

schotty <- pbp_rp %>%
    filter(wp > .20 & wp < .80 & down <= 2 & qtr <= 2 & half_seconds_remaining > 120) %>%
    group_by(posteam) %>%
    summarize(mean_pass = mean(pass), plays = n()) %>%
    arrange(-mean_pass)
schotty
#> # A tibble: 32 × 3
#>    posteam mean_pass plays
#>    <chr>       <dbl> <int>
#>  1 KC          0.691   388
#>  2 MIA         0.594   288
#>  3 NO          0.585   325
#>  4 LA          0.584   329
#>  5 CHI         0.563   309
#>  6 CLE         0.555   272
#>  7 CAR         0.554   271
#>  8 TB          0.551   321
#>  9 GB          0.550   291
#> 10 ARI         0.548   325
#> # … with 22 more rows

Again, we’ve already used filter, group_by, and summarize. The new function we are using here is arrange, which sorts the data by the variable(s) given. The minus sign in front of mean_pass means to sort in descending order.

Let’s make our first figure:

ggplot(schotty, aes(x=reorder(posteam,-mean_pass), y=mean_pass)) +
        geom_text(aes(label=posteam))

This image is kind of a mess – we still need a title, axis labels, etc – but gets the point across. We’ll get to that other stuff later. But more importantly, we made something interesting using nflfastR data! The “reorder” sorts the teams according to pass rate, with the “-” again saying to do it in descending order. “aes” is short for “aesthetic”, which is R’s weird way of asking which variables should go on the x and y axes.

Looking at the figure, the Chiefs will never have playoff success until they establish the run.

Loading multiple seasons

Because all the data is stored in the data repository, it is very fast to load data from multiple seasons.

pbp <- load_pbp(2015:2019)

This loads play-by-play data from the 2015 through 2019 seasons.

Let’s make sure we got it all. By now, you should understand what this is doing:

pbp %>%
  group_by(season) %>%
  summarize(n = n())
#> # A tibble: 5 × 2
#>   season     n
#>    <int> <int>
#> 1   2015 48869
#> 2   2016 48419
#> 3   2017 47997
#> 4   2018 47874
#> 5   2019 48034

So each season has about 48,000 plays. Just for fun, let’s look at the various play types:

pbp %>%
  group_by(play_type) %>%
  summarize(n = n())
#> # A tibble: 10 × 2
#>    play_type       n
#>    <chr>       <int>
#>  1 extra_point  6240
#>  2 field_goal   5156
#>  3 kickoff     13614
#>  4 no_play     22745
#>  5 pass        99986
#>  6 punt        12083
#>  7 qb_kneel     2090
#>  8 qb_spike      340
#>  9 run         68129
#> 10 NA          10810

Figures with QB stats

Let’s do some stuff with quarterbacks:

qbs <- pbp %>%
  filter(season_type == "REG", !is.na(epa)) %>%
  group_by(id, name) %>%
  summarize(
    epa = mean(qb_epa),
    cpoe = mean(cpoe, na.rm = T),
    n_dropbacks = sum(pass),
    n_plays = n(),
    team = last(posteam)
  ) %>%
  ungroup() %>%
  filter(n_dropbacks > 100 & n_plays > 1000)
#> `summarise()` has grouped output by 'id'. You can override using the `.groups`
#> argument.

Lots of new stuff here. First, we’re grouping by id and name to make sure we’re getting unique players; i.e., if two players have the same name (like Javorius Allen and Josh Allen both being J.Allen), we are also using their id to differentiate them. qb_epa is an nflfastR creation that is equal to EPA in all instances except for when a pass is completed and a fumble is lost, in which case a QB gets “credit” for the play up to the spot the fumble was lost (making EPA function like passing yards). The last part in the summarize comment gets the last team that a player was observed playing with.

My way of getting a dataset with only quarterbacks without joining to external roster data is to make sure they hit some number of dropbacks. In this case, filtering with n_dropbacks > 100 makes sure we’re only including quarterbacks. The ungroup() near the end is good practice after grouping to make sure you don’t get weird behavior with the data you created down the line.

Let’s make some more figures. The load_teams() function is provided in the nflreadr package, so since we have already loaded the package, it’s ready to use.

load_teams()
#> ── nflverse team graphics ──────────────────────────────────────────────────────
#>  Data updated: 2022-11-05 12:45:54 UTC
#> # A tibble: 32 × 16
#>    team_abbr team_name   team_id team_…¹ team_…² team_…³ team_…⁴ team_…⁵ team_…⁶
#>    <chr>     <chr>       <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
#>  1 ARI       Arizona Ca… 3800    Cardin… NFC     NFC We… #97233F #000000 #ffb612
#>  2 ATL       Atlanta Fa… 0200    Falcons NFC     NFC So… #A71930 #000000 #a5acaf
#>  3 BAL       Baltimore … 0325    Ravens  AFC     AFC No… #241773 #9E7C0C #9e7c0c
#>  4 BUF       Buffalo Bi… 0610    Bills   AFC     AFC Ea… #00338D #C60C30 #0c2e82
#>  5 CAR       Carolina P… 0750    Panthe… NFC     NFC So… #0085CA #000000 #bfc0bf
#>  6 CHI       Chicago Be… 0810    Bears   NFC     NFC No… #0B162A #C83803 #0b162a
#>  7 CIN       Cincinnati… 0920    Bengals AFC     AFC No… #FB4F14 #000000 #000000
#>  8 CLE       Cleveland … 1050    Browns  AFC     AFC No… #FF3C00 #311D00 #a5acaf
#>  9 DAL       Dallas Cow… 1200    Cowboys NFC     NFC Ea… #002244 #B0B7BC #acc0c6
#> 10 DEN       Denver Bro… 1400    Broncos AFC     AFC We… #002244 #FB4F14 #00234c
#> # … with 22 more rows, 7 more variables: team_color4 <chr>,
#> #   team_logo_wikipedia <chr>, team_logo_espn <chr>, team_wordmark <chr>,
#> #   team_conference_logo <chr>, team_league_logo <chr>,
#> #   team_logo_squared <chr>, and abbreviated variable names ¹​team_nick,
#> #   ²​team_conf, ³​team_division, ⁴​team_color, ⁵​team_color2, ⁶​team_color3

Let’s join this to the qbs dataframe we created:

qbs <- qbs %>%
  left_join(load_teams(), by = c('team' = 'team_abbr'))

left_join means keep all the rows from the left dataframe (the first one provided, qbs), and join those rows to available rows in the other dataframe. We also need to provide the joining variables, team from qbs and team_abbr from load_teams(). Why do we have to type by = c('team' = 'team_abbr')? Who knows, but it’s what left_join requires as instructions for how to match.

With team color dots

Now we can make a figure!

qbs %>%
  ggplot(aes(x = cpoe, y = epa)) +
  #horizontal line with mean EPA
  geom_hline(yintercept = mean(qbs$epa), color = "red", linetype = "dashed", alpha=0.5) +
  #vertical line with mean CPOE
  geom_vline(xintercept =  mean(qbs$cpoe), color = "red", linetype = "dashed", alpha=0.5) +
  #add points for the QBs with the right colors
  #cex controls point size and alpha the transparency (alpha = 1 is normal)
  geom_point(color = qbs$team_color, cex=qbs$n_plays / 350, alpha = .6) +
  #add names using ggrepel, which tries to make them not overlap
  geom_text_repel(aes(label=name)) +
  #add a smooth line fitting cpoe + epa
  stat_smooth(geom='line', alpha=0.5, se=FALSE, method='lm')+
  #titles and caption
  labs(x = "Completion % above expected (CPOE)",
       y = "EPA per play (passes, rushes, and penalties)",
       title = "Quarterback Efficiency, 2015 - 2019",
       caption = "Data: @nflfastR") +
  #uses the black and white ggplot theme
  theme_bw() +
  #center title with hjust = 0.5
  theme(
    plot.title = element_text(size = 14, hjust = 0.5, face = "bold")
  ) +
  #make ticks look nice
  #if this doesn't work, `install.packages('scales')`
  scale_y_continuous(breaks = scales::pretty_breaks(n = 10)) +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 10))

This looks complicated, but is just a way of getting a bunch of different stuff on the same plot: we have lines for averages, dots, names, etc. I added comments above to explain what is going on, but in practice for making figures I usually just copy and paste stuff and/or google what I need.

With team logos

We could also make the same plot with team logos:

qbs %>%
  ggplot(aes(x = cpoe, y = epa)) +
  #horizontal line with mean EPA
  geom_hline(yintercept = mean(qbs$epa), color = "red", linetype = "dashed", alpha=0.5) +
  #vertical line with mean CPOE
  geom_vline(xintercept =  mean(qbs$cpoe), color = "red", linetype = "dashed", alpha=0.5) +
  #add points for the QBs with the logos (this uses nflplotR package)
  geom_nfl_logos(aes(team_abbr = team), width = qbs$n_plays / 45000, alpha = 0.75) +
  #add names using ggrepel, which tries to make them not overlap
  geom_text_repel(aes(label=name)) +
  #add a smooth line fitting cpoe + epa
  stat_smooth(geom='line', alpha=0.5, se=FALSE, method='lm')+
  #titles and caption
  labs(x = "Completion % above expected (CPOE)",
       y = "EPA per play (passes, rushes, and penalties)",
       title = "Quarterback Efficiency, 2015 - 2019",
       caption = "Data: @nflfastR") +
  theme_bw() +
  #center title
  theme(
    plot.title = element_text(size = 14, hjust = 0.5, face = "bold")
  ) +
  #make ticks look nice
  scale_y_continuous(breaks = scales::pretty_breaks(n = 10)) +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 10))

The only changes we’ve made are to use geom_nfl_logos instead of geom_point (how to figure out the right size for the images in the width part? Trial and error).

This figure would look better with fewer players shown, but the point of this is explaining how to do stuff, so let’s call this good enough.

Team tiers plot

If it’s helpful, here are a few notes about the chart originally shown here, which like the above uses nflplotR for team logos.

library(nflplotR)
# get pbp and filter to regular season rush and pass plays
pbp <- nflreadr::load_pbp(2005) %>%
  dplyr::filter(season_type == "REG") %>%
  dplyr::filter(!is.na(posteam) & (rush == 1 | pass == 1))
# offense epa
offense <- pbp %>%
  dplyr::group_by(team = posteam) %>%
  dplyr::summarise(off_epa = mean(epa, na.rm = TRUE))
# defense epa
defense <- pbp %>%
  dplyr::group_by(team = defteam) %>%
  dplyr::summarise(def_epa = mean(epa, na.rm = TRUE))
# make figure
offense %>%
  dplyr::inner_join(defense, by = "team") %>%
  ggplot2::ggplot(aes(x = off_epa, y = def_epa)) +
  # tier lines
  ggplot2::geom_abline(slope = -1.5, intercept = (4:-3)/10, alpha = .2) +
  # nflplotR magic
  nflplotR::geom_mean_lines(aes(h_var = off_epa, v_var = def_epa)) +
  nflplotR::geom_nfl_logos(aes(team_abbr = team), width = 0.07, alpha = 0.7) +
  ggplot2::labs(
    x = "Offense EPA/play",
    y = "Defense EPA/play",
    caption = "Data: @nflfastR",
    title = "2005 NFL Offensive and Defensive EPA per Play"
  ) +
  ggplot2::theme_bw() +
  ggplot2::theme(
    plot.title = ggplot2::element_text(size = 12, hjust = 0.5, face = "bold")
  ) +
  ggplot2::scale_y_reverse()

Everything else should be comprehensible by now!

A few more things on plotting

There are two ways to view plots. One is in the RStudio Viewer, which shows up in RStudio when you plot something. If plots in your RStudio viewer look ugly and pixelated, you probably need to install the Cairo package and then set that as the default viewer by doing Tools –> Global Options –> General –> Graphics –> Backend: Set to Cairo.

The other is to save a .png with your preferred dimensions and resolution. For example, ggsave("test.png", width = 16, height = 9, units = "cm") would save the current plot as “test.png” with the units specified (you can view all the ggsave options here).

One more note: the RStudio Viewer can take a long time to preview ggplots, especially if you’re doing things like adding images. If you’re getting frustrated with a plot taking a long time to display, you can take advantage of ggpreview from nflplotR. To do this, first save the plot to an object and then run ggpreview on it (if this doesn’t make sense, see the examples here).

Real life example: let’s make a win total model

I’m going to try to go through the process of cleaning and joining multiple data sets to try to get a sense of how I would approach something like this, step-by-step.

Get team wins each season

We’re going to cheat a little and take advantage of Lee Sharpe’s famous games file. Most of this stuff has been added into nflfastR, but it’s easier working with this file where each game is one row. If you’re curious, the triple colon is a way to access what is referred to as non-exported functions in a package. Think of this as like a secret menu (why is this secret? Sometimes package developers want to limit the number of exported functions as to be not overwhelming).

games <- nflreadr::load_schedules()
str(games)
#> nflvrs_d [6,409 × 45] (S3: nflverse_data/tbl_df/tbl/data.table/data.frame)
#>  $ game_id         : chr [1:6409] "1999_01_MIN_ATL" "1999_01_KC_CHI" "1999_01_PIT_CLE" "1999_01_OAK_GB" ...
#>  $ season          : int [1:6409] 1999 1999 1999 1999 1999 1999 1999 1999 1999 1999 ...
#>  $ game_type       : chr [1:6409] "REG" "REG" "REG" "REG" ...
#>  $ week            : int [1:6409] 1 1 1 1 1 1 1 1 1 1 ...
#>  $ gameday         : chr [1:6409] "1999-09-12" "1999-09-12" "1999-09-12" "1999-09-12" ...
#>  $ weekday         : chr [1:6409] "Sunday" "Sunday" "Sunday" "Sunday" ...
#>  $ gametime        : chr [1:6409] NA NA NA NA ...
#>  $ away_team       : chr [1:6409] "MIN" "KC" "PIT" "OAK" ...
#>  $ away_score      : int [1:6409] 17 17 43 24 14 3 10 30 25 28 ...
#>  $ home_team       : chr [1:6409] "ATL" "CHI" "CLE" "GB" ...
#>  $ home_score      : int [1:6409] 14 20 0 28 31 41 19 28 24 20 ...
#>  $ location        : chr [1:6409] "Home" "Home" "Home" "Home" ...
#>  $ result          : int [1:6409] -3 3 -43 4 17 38 9 -2 -1 -8 ...
#>  $ total           : int [1:6409] 31 37 43 52 45 44 29 58 49 48 ...
#>  $ overtime        : int [1:6409] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ old_game_id     : chr [1:6409] "1999091210" "1999091206" "1999091213" "1999091208" ...
#>  $ gsis            : int [1:6409] 598 597 604 602 591 603 592 600 588 596 ...
#>  $ nfl_detail_id   : chr [1:6409] NA NA NA NA ...
#>  $ pfr             : chr [1:6409] "199909120atl" "199909120chi" "199909120cle" "199909120gnb" ...
#>  $ pff             : int [1:6409] NA NA NA NA NA NA NA NA NA NA ...
#>  $ espn            : chr [1:6409] "190912001" "190912003" "190912005" "190912009" ...
#>  $ away_rest       : int [1:6409] 7 7 7 7 7 7 7 7 7 7 ...
#>  $ home_rest       : int [1:6409] 7 7 7 7 7 7 7 7 7 7 ...
#>  $ away_moneyline  : int [1:6409] NA NA NA NA NA NA NA NA NA NA ...
#>  $ home_moneyline  : int [1:6409] NA NA NA NA NA NA NA NA NA NA ...
#>  $ spread_line     : num [1:6409] -4 -3 -6 9 -3 5.5 3.5 7 -3 9.5 ...
#>  $ away_spread_odds: int [1:6409] NA NA NA NA NA NA NA NA NA NA ...
#>  $ home_spread_odds: int [1:6409] NA NA NA NA NA NA NA NA NA NA ...
#>  $ total_line      : num [1:6409] 49 38 37 43 45.5 49 38 44.5 37 42 ...
#>  $ under_odds      : int [1:6409] NA NA NA NA NA NA NA NA NA NA ...
#>  $ over_odds       : int [1:6409] NA NA NA NA NA NA NA NA NA NA ...
#>  $ div_game        : int [1:6409] 0 0 1 0 1 0 1 1 1 0 ...
#>  $ roof            : chr [1:6409] "dome" "outdoors" "outdoors" "outdoors" ...
#>  $ surface         : chr [1:6409] "astroturf" "grass" "grass" "grass" ...
#>  $ temp            : int [1:6409] NA 80 78 67 NA 76 NA 73 75 NA ...
#>  $ wind            : int [1:6409] NA 12 12 10 NA 8 NA 5 3 NA ...
#>  $ away_qb_id      : chr [1:6409] "00-0003761" "00-0006300" "00-0015700" "00-0005741" ...
#>  $ home_qb_id      : chr [1:6409] "00-0002876" "00-0010560" "00-0004230" "00-0005106" ...
#>  $ away_qb_name    : chr [1:6409] "Randall Cunningham" "Elvis Grbac" "Kordell Stewart" "Rich Gannon" ...
#>  $ home_qb_name    : chr [1:6409] "Chris Chandler" "Shane Matthews" "Ty Detmer" "Brett Favre" ...
#>  $ away_coach      : chr [1:6409] "Dennis Green" "Gunther Cunningham" "Bill Cowher" "Jon Gruden" ...
#>  $ home_coach      : chr [1:6409] "Dan Reeves" "Dick Jauron" "Chris Palmer" "Ray Rhodes" ...
#>  $ referee         : chr [1:6409] "Gerry Austin" "Phil Luckett" "Bob McElwee" "Tony Corrente" ...
#>  $ stadium_id      : chr [1:6409] "ATL00" "CHI98" "CLE00" "GNB00" ...
#>  $ stadium         : chr [1:6409] "Georgia Dome" "Soldier Field" "Cleveland Browns Stadium" "Lambeau Field" ...
#>  - attr(*, "nflverse_type")= chr "games and schedules"
#>  - attr(*, "nflverse_timestamp")= POSIXct[1:1], format: "2022-11-05 12:46:34"

To start, we want to create a dataframe where each row is a team-season observation, listing how many games they won. There are multiple ways to do this, but I’m going to just take the home and away results and bind together. As an example, here’s what the home results look like:

home <- games %>%
  filter(game_type == 'REG') %>%
  select(season, week, home_team, result) %>%
  rename(team = home_team)
home %>% head(5)
#> ── nflverse games and schedules ────────────────────────────────────────────────
#>  Data updated: 2022-11-05 12:46:34 UTC
#> # A tibble: 5 × 4
#>   season  week team  result
#>    <int> <int> <chr>  <int>
#> 1   1999     1 ATL       -3
#> 2   1999     1 CHI        3
#> 3   1999     1 CLE      -43
#> 4   1999     1 GB         4
#> 5   1999     1 IND       17

Note that we used rename to change home_team to team.

away <- games %>%
  filter(game_type == 'REG') %>%
  select(season, week, away_team, result) %>%
  rename(team = away_team) %>%
  mutate(result = -result)
away %>% head(5)
#> ── nflverse games and schedules ────────────────────────────────────────────────
#>  Data updated: 2022-11-05 12:46:34 UTC
#> # A tibble: 5 × 4
#>   season  week team  result
#>    <int> <int> <chr>  <int>
#> 1   1999     1 MIN        3
#> 2   1999     1 KC        -3
#> 3   1999     1 PIT       43
#> 4   1999     1 OAK       -4
#> 5   1999     1 BUF      -17

For away teams, we need to flip the result since result is given from the perspective of the home team. Now let’s make a columns called win based on the result.

results <- bind_rows(home, away) %>%
  arrange(week) %>%
  mutate(
    win = case_when(
      result > 0 ~ 1,
      result < 0 ~ 0,
      result == 0 ~ 0.5
    )
  )

results %>% filter(season == 2019 & team == 'SEA')
#> ── nflverse games and schedules ────────────────────────────────────────────────
#>  Data updated: 2022-11-05 12:46:34 UTC
#> # A tibble: 16 × 5
#>    season  week team  result   win
#>     <int> <int> <chr>  <int> <dbl>
#>  1   2019     1 SEA        1     1
#>  2   2019     2 SEA        2     1
#>  3   2019     3 SEA       -6     0
#>  4   2019     4 SEA       17     1
#>  5   2019     5 SEA        1     1
#>  6   2019     6 SEA        4     1
#>  7   2019     7 SEA      -14     0
#>  8   2019     8 SEA        7     1
#>  9   2019     9 SEA        6     1
#> 10   2019    10 SEA        3     1
#> 11   2019    12 SEA        8     1
#> 12   2019    13 SEA        7     1
#> 13   2019    14 SEA      -16     0
#> 14   2019    15 SEA        6     1
#> 15   2019    16 SEA      -14     0
#> 16   2019    17 SEA       -5     0

Doing the results %>% filter(season == 2019 & team == 'SEA') part at the end isn’t actually for saving the data in a new form, but just making sure the previous step did what I wanted. This is a good habit to get into: frequently inspect your data and make sure it looks like you think it should.

Now that we have the dataframe we wanted, we can get team wins by season easily:

team_wins <- results %>%
  group_by(team, season) %>%
  summarize(
    wins = sum(win),
    point_diff = sum(result)) %>%
  ungroup()
#> `summarise()` has grouped output by 'team'. You can override using the
#> `.groups` argument.

team_wins %>%
  arrange(-wins) %>%
  head(5)
#> # A tibble: 5 × 4
#>   team  season  wins point_diff
#>   <chr>  <int> <dbl>      <int>
#> 1 NE      2007    16        315
#> 2 CAR     2015    15        192
#> 3 GB      2011    15        201
#> 4 PIT     2004    15        121
#> 5 BAL     2019    14        249

Again, we’re making sure the data looks like it “should” by checking the 5 seasons with the most wins, and making sure it looks right.

Now that the team-season win and point differential data is ready, we need to go back to the nflfastR data to get EPA/play.

Get team EPA by season

Let’s start by getting data from every season from the nflfastR data repository:

pbp <- load_pbp(1999:2019) %>%
    filter(rush == 1 | pass == 1, season_type == "REG", !is.na(epa), !is.na(posteam), posteam != "") %>%
    select(season, posteam, pass, defteam, epa)

I’m being pretty aggressive with dropping rows and columns (filter and select) because otherwise loading this all into memory can be painful on the computer. But this is all we need for what we’re doing. Note that I’m only keeping regular season games here (season_type == "REG") since this is how this analysis is usually done.

Now we can get EPA/play on offense and defense. Let’s break it out by pass and rush too. I don’t remember how to do some of this so let’s do it in steps. We know we need to group by team, season, and pass, so there’s the beginning:

pbp %>%
  group_by(posteam, season, pass) %>% 
  summarize(epa = mean(epa)) %>%
  head(4)
#> `summarise()` has grouped output by 'posteam', 'season'. You can override using
#> the `.groups` argument.
#> # A tibble: 4 × 4
#> # Groups:   posteam, season [2]
#>   posteam season  pass     epa
#>   <chr>    <int> <dbl>   <dbl>
#> 1 ARI       1999     0 -0.201 
#> 2 ARI       1999     1 -0.162 
#> 3 ARI       2000     0 -0.240 
#> 4 ARI       2000     1 -0.0718

But this makes two rows per team-season. How to get each team-season on the same row? pivot_wider is what we need:

pbp %>%
  group_by(posteam, season, pass) %>% 
  summarize(epa = mean(epa)) %>%
  pivot_wider(names_from = pass, values_from = epa) %>%
  head(4)
#> `summarise()` has grouped output by 'posteam', 'season'. You can override using
#> the `.groups` argument.
#> # A tibble: 4 × 4
#> # Groups:   posteam, season [4]
#>   posteam season    `0`     `1`
#>   <chr>    <int>  <dbl>   <dbl>
#> 1 ARI       1999 -0.201 -0.162 
#> 2 ARI       2000 -0.240 -0.0718
#> 3 ARI       2001 -0.177  0.0740
#> 4 ARI       2002 -0.134 -0.0661

This one is hard to wrap my head around so I usually open up the reference page, read the example, and pray that what I try works. In this case it did. Hooray! This turned our two-lines-per-team dataframe into one, with the 0 column being pass == 0 (run plays) and the 1 column pass == 1.

Now let’s rename to something more sensible and save:

offense <- pbp %>%
  group_by(posteam, season, pass) %>% 
  summarize(epa = mean(epa)) %>%
  pivot_wider(names_from = pass, values_from = epa) %>%
  rename(off_pass_epa = `1`, off_rush_epa = `0`)
#> `summarise()` has grouped output by 'posteam', 'season'. You can override using
#> the `.groups` argument.

Note that variable names that are numbers need to be surrounded in tick marks for this to work.

Now we can repeat the same process for defense:

defense <- pbp %>%
  group_by(defteam, season, pass) %>% 
  summarize(epa = mean(epa)) %>%
  pivot_wider(names_from = pass, values_from = epa) %>%
  rename(def_pass_epa = `1`, def_rush_epa = `0`)
#> `summarise()` has grouped output by 'defteam', 'season'. You can override using
#> the `.groups` argument.

Let’s do another sanity check looking at the top 5 pass offenses and defenses:

#top 5 offenses
offense %>%
  arrange(-off_pass_epa) %>%
  head(5)
#> # A tibble: 5 × 4
#> # Groups:   posteam, season [5]
#>   posteam season off_rush_epa off_pass_epa
#>   <chr>    <int>        <dbl>        <dbl>
#> 1 NE        2007      0.00380        0.422
#> 2 IND       2004     -0.00125        0.413
#> 3 GB        2011     -0.114          0.412
#> 4 KC        2018      0.0209         0.348
#> 5 DEN       2013     -0.0296         0.343

#top 5 defenses
defense %>%
  arrange(def_pass_epa) %>%
  head(5)
#> # A tibble: 5 × 4
#> # Groups:   defteam, season [5]
#>   defteam season def_rush_epa def_pass_epa
#>   <chr>    <int>        <dbl>        <dbl>
#> 1 TB        2002      -0.0756       -0.292
#> 2 NE        2019      -0.168        -0.241
#> 3 JAX       2017      -0.141        -0.223
#> 4 NYJ       2009      -0.104        -0.220
#> 5 LA        2003      -0.0548       -0.214

The top pass defenses (2002 TB, 2017 JAX, 2019 NE) and offenses (2007 Pats, 2004 Colts, 2011 Packers) definitely check out!

Fix team names and join

Now we’re ready to bind it all together. Actually, let’s make sure all the team names are ready too.

team_wins %>%
  group_by(team) %>%
  summarize(n=n()) %>%
  arrange(n)
#> # A tibble: 35 × 2
#>    team      n
#>    <chr> <int>
#>  1 LV        3
#>  2 LAC       6
#>  3 LA        7
#>  4 STL      17
#>  5 SD       18
#>  6 HOU      21
#>  7 OAK      21
#>  8 ARI      24
#>  9 ATL      24
#> 10 BAL      24
#> # … with 25 more rows

Nope, not yet, we need to fix the Raiders, Rams, and Chargers, which are LV, LA, and LAC in nflfastR.

team_wins <- team_wins %>%
  mutate(
    team = case_when(
      team == 'OAK' ~ 'LV',
      team == 'SD' ~ 'LAC',
      team == 'STL' ~ 'LA',
      TRUE ~ team
    )
  )

The TRUE statement at the bottom says that if none of the above cases are found, keep team the same. Let’s make sure this worked:

team_wins %>%
  group_by(team) %>%
  summarize(n=n()) %>%
  arrange(n)
#> # A tibble: 32 × 2
#>    team      n
#>    <chr> <int>
#>  1 HOU      21
#>  2 ARI      24
#>  3 ATL      24
#>  4 BAL      24
#>  5 BUF      24
#>  6 CAR      24
#>  7 CHI      24
#>  8 CIN      24
#>  9 CLE      24
#> 10 DAL      24
#> # … with 22 more rows

HOU has 3 fewer seasons because it didn’t exist from 1999 through 2001, which is fine, and all the other team names have number of seasons that they should. Okay NOW we can join:

data <- team_wins %>%
  left_join(offense, by = c('team' = 'posteam', 'season')) %>%
  left_join(defense, by = c('team' = 'defteam', 'season'))

data %>%
  filter(team == 'SEA' & season >= 2012)
#> # A tibble: 11 × 8
#>    team  season  wins point_diff off_rush_epa off_pass_epa def_rush_epa def_pa…¹
#>    <chr>  <int> <dbl>      <int>        <dbl>        <dbl>        <dbl>    <dbl>
#>  1 SEA     2012  11          167     -0.00476       0.213       -0.0738  -0.0434
#>  2 SEA     2013  13          186     -0.101         0.188       -0.126   -0.165 
#>  3 SEA     2014  12          140      0.0216        0.139       -0.231   -0.0364
#>  4 SEA     2015  10          146     -0.105         0.249       -0.148   -0.0421
#>  5 SEA     2016  10.5         62     -0.126         0.102       -0.207    0.0588
#>  6 SEA     2017   9           34     -0.192         0.0584      -0.122   -0.0139
#>  7 SEA     2018  10           81     -0.0273        0.210       -0.130    0.0738
#>  8 SEA     2019  11            7     -0.136         0.119       -0.0930   0.0745
#>  9 SEA     2020  12           88     NA            NA           NA       NA     
#> 10 SEA     2021   7           29     NA            NA           NA       NA     
#> 11 SEA     2022  NA           NA     NA            NA           NA       NA     
#> # … with abbreviated variable name ¹​def_pass_epa

Now we’re getting really close to doing what we want! Next we need to create new columns for prior year EPA, and let’s do point differential too.

data <- data %>% 
  arrange(team, season) %>%
  group_by(team) %>% 
  mutate(
    prior_off_rush_epa = lag(off_rush_epa),
    prior_off_pass_epa = lag(off_pass_epa),
    prior_def_rush_epa = lag(def_rush_epa),
    prior_def_pass_epa = lag(def_pass_epa),
    prior_point_diff = lag(point_diff)
  ) %>% 
  ungroup()

data %>%
  head(5)
#> # A tibble: 5 × 13
#>   team  season  wins point_diff off_ru…¹ off_p…² def_r…³ def_p…⁴ prior…⁵ prior…⁶
#>   <chr>  <int> <dbl>      <int>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#> 1 ARI     1999     6       -137   -0.201 -0.162  -0.0105 0.00198  NA     NA     
#> 2 ARI     2000     3       -233   -0.240 -0.0718  0.0333 0.180    -0.201 -0.162 
#> 3 ARI     2001     7        -48   -0.177  0.0740 -0.0689 0.0802   -0.240 -0.0718
#> 4 ARI     2002     5       -155   -0.134 -0.0661 -0.0192 0.165    -0.177  0.0740
#> 5 ARI     2003     4       -227   -0.219 -0.120  -0.0627 0.191    -0.134 -0.0661
#> # … with 3 more variables: prior_def_rush_epa <dbl>, prior_def_pass_epa <dbl>,
#> #   prior_point_diff <int>, and abbreviated variable names ¹​off_rush_epa,
#> #   ²​off_pass_epa, ³​def_rush_epa, ⁴​def_pass_epa, ⁵​prior_off_rush_epa,
#> #   ⁶​prior_off_pass_epa

Finally! Now we have the data in place and can start doing things with it.

Correlations and regressions

data %>% 
  select(-team, -season) %>%
  cor(use="complete.obs") %>%
  round(2)
#>                     wins point_diff off_rush_epa off_pass_epa def_rush_epa
#> wins                1.00       0.92         0.43         0.70        -0.29
#> point_diff          0.92       1.00         0.49         0.76        -0.33
#> off_rush_epa        0.43       0.49         1.00         0.40         0.05
#> off_pass_epa        0.70       0.76         0.40         1.00        -0.01
#> def_rush_epa       -0.29      -0.33         0.05        -0.01         1.00
#> def_pass_epa       -0.57      -0.62        -0.04        -0.10         0.31
#> prior_off_rush_epa  0.23       0.26         0.32         0.23         0.02
#> prior_off_pass_epa  0.29       0.32         0.18         0.46        -0.01
#> prior_def_rush_epa -0.12      -0.15         0.03        -0.04         0.27
#> prior_def_pass_epa -0.18      -0.20        -0.07        -0.05         0.05
#> prior_point_diff    0.36       0.41         0.22         0.36        -0.09
#>                    def_pass_epa prior_off_rush_epa prior_off_pass_epa
#> wins                      -0.57               0.23               0.29
#> point_diff                -0.62               0.26               0.32
#> off_rush_epa              -0.04               0.32               0.18
#> off_pass_epa              -0.10               0.23               0.46
#> def_rush_epa               0.31               0.02              -0.01
#> def_pass_epa               1.00              -0.10               0.00
#> prior_off_rush_epa        -0.10               1.00               0.41
#> prior_off_pass_epa         0.00               0.41               1.00
#> prior_def_rush_epa         0.16               0.05              -0.01
#> prior_def_pass_epa         0.28              -0.01              -0.09
#> prior_point_diff          -0.19               0.47               0.76
#>                    prior_def_rush_epa prior_def_pass_epa prior_point_diff
#> wins                            -0.12              -0.18             0.36
#> point_diff                      -0.15              -0.20             0.41
#> off_rush_epa                     0.03              -0.07             0.22
#> off_pass_epa                    -0.04              -0.05             0.36
#> def_rush_epa                     0.27               0.05            -0.09
#> def_pass_epa                     0.16               0.28            -0.19
#> prior_off_rush_epa               0.05              -0.01             0.47
#> prior_off_pass_epa              -0.01              -0.09             0.76
#> prior_def_rush_epa               1.00               0.32            -0.34
#> prior_def_pass_epa               0.32               1.00            -0.60
#> prior_point_diff                -0.34              -0.60             1.00

We’ve covered select, but here we see a new use where a minus sign de-selects variables (we need to de-select team name for correlation to work because it doesn’t work for character strings, and correlation with the season number itself is meaningless). We’ve run the correlation on this dataframe, removing missing values, and then rounding to 2 digits. Not surprisingly, we see that wins in the current season are more strongly related to passing offense EPA than rushing EPA or defense EPA, and prior offense carries more predictive power than prior defense. Pass offense is more stable year to year (0.46) than rush offense (0.32), pass defense (0.28), or rush defense (0.27).

I’m actually surprised that the values for passing offense aren’t higher relative to the others. Maybe it was because most of our prior results come from the nflscrapR era (2009 - 2019)? Let’s check what this looks like since 2009 relative to earlier seasons:

message("2009 through 2019")
#> 2009 through 2019
data %>% 
  filter(season >= 2009) %>%
  select(wins, point_diff, off_pass_epa, off_rush_epa, prior_point_diff, prior_off_pass_epa, prior_off_rush_epa) %>%
  cor(use="complete.obs") %>%
  round(2)
#>                    wins point_diff off_pass_epa off_rush_epa prior_point_diff
#> wins               1.00       0.92         0.73         0.40             0.43
#> point_diff         0.92       1.00         0.79         0.47             0.44
#> off_pass_epa       0.73       0.79         1.00         0.37             0.39
#> off_rush_epa       0.40       0.47         0.37         1.00             0.19
#> prior_point_diff   0.43       0.44         0.39         0.19             1.00
#> prior_off_pass_epa 0.34       0.36         0.45         0.11             0.78
#> prior_off_rush_epa 0.24       0.25         0.16         0.24             0.45
#>                    prior_off_pass_epa prior_off_rush_epa
#> wins                             0.34               0.24
#> point_diff                       0.36               0.25
#> off_pass_epa                     0.45               0.16
#> off_rush_epa                     0.11               0.24
#> prior_point_diff                 0.78               0.45
#> prior_off_pass_epa               1.00               0.35
#> prior_off_rush_epa               0.35               1.00
message("1999 through 2008")
#> 1999 through 2008
data %>% 
  filter(season < 2009) %>%
  select(wins, point_diff, off_pass_epa, off_rush_epa, prior_point_diff, prior_off_pass_epa, prior_off_rush_epa) %>%
  cor(use="complete.obs") %>%
  round(2)
#>                    wins point_diff off_pass_epa off_rush_epa prior_point_diff
#> wins               1.00       0.92         0.68         0.47             0.28
#> point_diff         0.92       1.00         0.73         0.51             0.36
#> off_pass_epa       0.68       0.73         1.00         0.47             0.34
#> off_rush_epa       0.47       0.51         0.47         1.00             0.25
#> prior_point_diff   0.28       0.36         0.34         0.25             1.00
#> prior_off_pass_epa 0.23       0.29         0.45         0.30             0.74
#> prior_off_rush_epa 0.23       0.28         0.30         0.40             0.50
#>                    prior_off_pass_epa prior_off_rush_epa
#> wins                             0.23               0.23
#> point_diff                       0.29               0.28
#> off_pass_epa                     0.45               0.30
#> off_rush_epa                     0.30               0.40
#> prior_point_diff                 0.74               0.50
#> prior_off_pass_epa               1.00               0.48
#> prior_off_rush_epa               0.48               1.00

Yep, that seems to be the case. So in the more recent period, passing offense has become slightly more stable but more predictive of following-year success, while at the same time rushing offense has become substantially less stable and less predictive of future team success.

Now let’s do a basic regression of wins on prior offense and defense EPA/play. Maybe we should only look at this more recent period to fit our model since it’s more relevant for 2020. In the real world, we would be more rigorous about making decisions like this, but let’s proceed anyway.

data <- data %>% filter(season >= 2009)

fit <- lm(wins ~ prior_off_pass_epa  + prior_off_rush_epa + prior_def_pass_epa + prior_def_rush_epa, data = data)

summary(fit)
#> 
#> Call:
#> lm(formula = wins ~ prior_off_pass_epa + prior_off_rush_epa + 
#>     prior_def_pass_epa + prior_def_rush_epa, data = data)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -7.7039 -1.8906  0.0644  2.2506  7.0871 
#> 
#> Coefficients:
#>                    Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept)          7.9634     0.3884  20.504 < 0.0000000000000002 ***
#> prior_off_pass_epa   6.5867     1.2805   5.144          0.000000433 ***
#> prior_off_rush_epa   5.9976     2.2712   2.641              0.00861 ** 
#> prior_def_pass_epa  -4.0619     1.6449  -2.469              0.01397 *  
#> prior_def_rush_epa  -5.1581     2.3329  -2.211              0.02763 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.86 on 379 degrees of freedom
#>   (64 observations deleted due to missingness)
#> Multiple R-squared:  0.1639, Adjusted R-squared:  0.155 
#> F-statistic: 18.57 on 4 and 379 DF,  p-value: 0.00000000000005992

I’m actually pretty surprised passing offense isn’t higher here. How does this compare to simply using point differential?

fit2 <- lm(wins ~ prior_point_diff, data = data)

summary(fit2)
#> 
#> Call:
#> lm(formula = wins ~ prior_point_diff, data = data)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -7.1664 -1.9700  0.1115  2.2317  7.4260 
#> 
#> Coefficients:
#>                  Estimate Std. Error t value            Pr(>|t|)    
#> (Intercept)      8.038462   0.136630  58.834 <0.0000000000000002 ***
#> prior_point_diff 0.013270   0.001345   9.869 <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.787 on 414 degrees of freedom
#>   (32 observations deleted due to missingness)
#> Multiple R-squared:  0.1904, Adjusted R-squared:  0.1885 
#> F-statistic: 97.39 on 1 and 414 DF,  p-value: < 0.00000000000000022

So R2 is somewhat higher for just point differential. This isn’t surprising as we’ve thrown away special teams plays and haven’t attempted to make any adjustments for things like fumble luck that we know can improve EPA’s predictive power.

Predictions

Now let’s get the predictions from the EPA model:

preds <- predict(fit, data %>% filter(season == 2020)) %>%
  #was just a vector, need a tibble to bind
  as_tibble() %>%
  #make the column name make sense
  rename(prediction = value) %>%
  round(1) %>%
  #get names
  bind_cols(
    data %>% filter(season == 2020) %>% select(team)
  )

preds %>%
  arrange(-prediction) %>%
  head(5)
#> # A tibble: 5 × 2
#>   prediction team 
#>        <dbl> <chr>
#> 1       11.4 BAL  
#> 2       10.2 SF   
#> 3        9.8 NE   
#> 4        9.6 DAL  
#> 5        9.6 NO

This mostly checks out.

What if we just used simple point differential to predict?

preds2 <- predict(fit2, data %>% filter(season == 2020)) %>%
  #was just a vector, need a tibble to bind
  as_tibble() %>%
  #make the column name make sense
  rename(prediction = value) %>%
  round(1) %>%
  #get names
  bind_cols(
    data %>% filter(season == 2020) %>% select(team)
  )

preds2 %>%
  arrange(-prediction) %>%
  head(5)
#> # A tibble: 5 × 2
#>   prediction team 
#>        <dbl> <chr>
#> 1       11.3 BAL  
#> 2       10.6 NE   
#> 3       10.3 SF   
#> 4        9.9 KC   
#> 5        9.6 NO

Not surprisingly, this looks pretty similar. These are very basic models that don’t incorporate schedule, roster changes, etc. For example, a better model would take into account Tom Brady no longer playing for the Patriots. But hopefully this has been useful!

Next Steps

You now should know enough to be able to tackle a great deal of questions using nflfastR data. A good way to build up skills is to take interesting things you see and try to replicate them (for making figures, this will also involve a heavy dose of googling stuff).

Looking at others’ code is also a good way to learn. One option is to look through the nflfastR code base, much of which you should now understand what it’s doing. For example, here is the function that cleans up the data and prepares it for later stages: there’s a heavy dose of mutate, group_by, arrange, lag, if_else, and case_when.

Resources: The gold standards

This is an R package so this section is pretty R heavy.