Related to: Data Computing, “Collective Properties of Cases”, Ch. 14

Linear regression lines

ggplot makes it easy to add linear regression lines to a plot. A linear regression line is a very simple way to visualize the direction and magnitude of a relationship between two variables. It helps you say things like: “If variable X goes up by one unit, then variable Y tends to also go up by Z number of units” or “If variable X goes up by one unit, then variable Y tends to go down by Z number of units”.
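To make the "goes up by Z units" reading concrete, here is a minimal sketch using base R's lm() on made-up data. The toy data frame and its built-in slope of about 2 are assumptions for illustration only and are not part of the Minneapolis dataset.

```r
set.seed(42)

# Hypothetical toy data: y rises by about 2 units for each 1-unit increase in x
x <- 1:100
y <- 5 + 2 * x + rnorm(100, sd = 3)
toy <- data.frame(x = x, y = y)

# lm() fits a straight line through the points by least squares
fit <- lm(y ~ x, data = toy)

# The coefficient on x is the slope: the expected change in y
# for a one-unit increase in x
slope <- unname(coef(fit)["x"])
print(slope)
```

The printed slope should land close to the value of 2 that was built into the toy data, which is the "Z" in the sentences above.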

It is important to note that sometimes this type of linear relationship can be a bit too simple to effectively sum up more complex relationships between variables. To address this concern, we’ll look at an alternative approach in a minute. But for now, let’s look at how to add a simple linear regression line to a ggplot graphic.

Before we get started, let’s again load both the dplyr and ggplot2 libraries, as well as the Minneapolis buildings energy benchmarking dataset:

library(dplyr)
library(ggplot2)

data <- read.csv("../datasets/mpls_energy_benchmarking_2015.csv", header=TRUE)

To add a linear regression line to your graphic, add the stat_smooth() glyph to the code for your plot and pass it the argument method='lm'. ‘lm’ stands for “linear model”. This argument tells ggplot to fit a linear regression line to the X and Y variables in your aesthetics arguments and then add that line to your graphic:

ggplot(data, aes(x=year_built, y=site_EUI)) +
  geom_point() +
  stat_smooth(method='lm')

You can now see that ggplot has added a nearly flat line describing the relationship between the year built and site energy use intensity (EUI) for the buildings in our dataset. From the plot above, it looks like most of the buildings have a site EUI between 0 and about 300 kBtu/sq. ft. There are a few outliers, however, that have a site EUI over 300 kBtu/sq. ft. Let’s see which buildings these are:

data %>% filter(site_EUI > 300)
##              org_name                               prop_name
## 1 City of Minneapolis Water Treatment and Distribution Campus
## 2             Private                                 Private
## 3             Private                                 Private
## 4             Private                                 Private
## 5             Private                                 Private
##   public_private                 address zip_code energy_star_score
## 1         Public 4500 Marshall Street NE    55421                 0
## 2        Private        323 Stinson Blvd    55413                 1
## 3        Private          524 23Rd Ave S    55454                 0
## 4        Private             17 7th St S    55402                 0
## 5        Private          2430 3rd Ave S    55404                 0
##                                 prop_type floor_area floor_area_parking
## 1 Drinking Water Treatment & Distribution     650000                  0
## 2                                  Office      50826                  0
## 3   Hospital (General Medical & Surgical)      82807                  0
## 4                                 Parking       2870             892673
## 5                                  Museum      85840                  0
##   year_built total_GHG_emissions site_EUI weather_normalized_site_EUI
## 1       1930               30269      313                         327
## 2       1969                3856      400                         403
## 3       1990                2981      368                         368
## 4       1963                 542      920                         920
## 5       2005                3335      377                         379
##   source_EUI weather_normalized_source_EUI water_use
## 1        740                           750         0
## 2       1170                          1171      1121
## 3        604                           604      5954
## 4       2887                          2887       419
## 5        654                           653      4550

Now let’s restrict the limits of the Y axis so that it ranges from 0 to 300 kBtu/sq. ft. This will eliminate the handful of outliers that have very high site EUI values. We will want to make a note about these outliers elsewhere in the analysis, but restricting the Y axis limits for now will allow you to “zoom in” and see more details of the relationship between the buildings’ year built and their site EUI values:

ggplot(data, aes(x=year_built, y=site_EUI)) +
  geom_point() +
  ylim(0, 300) +
  stat_smooth(method='lm')

The plot above is “zoomed in” a bit, making it a little easier to see the regression line. You can now see that the line is quite flat and that the points scatter more or less randomly above and below it. This suggests that there is no consistent relationship between the year a building was built and its site EUI. Our buildings’ year_built, then, does not help explain much, if any, of the variation across the buildings’ site_EUI values.
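One detail about ylim() is worth knowing: it sets the limits of the Y scale, so points outside the range are dropped before the regression line is fitted (ggplot2 prints a warning about removed rows). If you would rather keep the outliers in the fit and only change the visible window, coord_cartesian() is the usual alternative. The sketch below compares the two on a made-up data frame; the toy data and its single extreme value are assumptions for illustration, not values from the Minneapolis dataset.

```r
library(ggplot2)

set.seed(1)

# Hypothetical toy data: 50 ordinary points plus one extreme outlier at the end
toy <- data.frame(x = 1:51,
                  y = c(rnorm(50, mean = 100, sd = 20), 900))

# ylim() removes the outlier from the data, so the line is fitted without it
p_drop <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  stat_smooth(method = 'lm') +
  ylim(0, 300)

# coord_cartesian() only zooms the view, so the outlier still pulls on the line
p_zoom <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  stat_smooth(method = 'lm') +
  coord_cartesian(ylim = c(0, 300))

print(p_drop)
print(p_zoom)
```

In the second plot the line should tilt upward toward the hidden outlier, while in the first it stays close to flat. Which behavior you want depends on whether the outliers belong in the model.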

Activity A: Make a new linear regression line

Now you are interested in whether the floor area of a building is related to its level of greenhouse gas emissions. You suspect that buildings with larger floor areas have higher greenhouse gas emissions than buildings with smaller floor areas. Create a plot with the following aesthetics and components:

  • X variable: floor_area

  • Y variable: total_GHG_emissions

  • Regression: single linear regression line relating the X & Y variables

Does the plot offer any evidence to support your hypothesis that buildings with larger floor areas tend to have higher levels of greenhouse gas emissions?

Grouped linear regression lines

Now, let’s add a color (col=) argument to the aesthetic. This will group the data, generate a separate regression line for each group, and then color the points and regression lines based on these groups. For example, we can try using the public_private variable as our color grouping variable. This will let us discern if there’s a difference in the relationship for public vs. private buildings when comparing the buildings’ year built with their site EUI values:

ggplot(data, aes(x=year_built, y=site_EUI, col=public_private)) +
  geom_point() +
  ylim(0, 300) +
  stat_smooth(method='lm')
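Grouping by color amounts to fitting one linear model per group. As a rough sketch of that idea, the base R code below splits a made-up data frame by group and extracts each group's slope. The toy data, the group labels "A" and "B", and their slopes of about +1 and -1 are assumptions for illustration only.

```r
set.seed(7)

# Hypothetical toy data: two groups whose relationships point in opposite directions
toy <- data.frame(group = rep(c("A", "B"), each = 50),
                  x = rep(1:50, times = 2))
toy$y <- ifelse(toy$group == "A", 1, -1) * toy$x + rnorm(100, sd = 2)

# Split the rows by group, fit a separate linear model to each piece,
# and keep only the slope (the coefficient on x)
slopes <- sapply(split(toy, toy$group),
                 function(d) coef(lm(y ~ x, data = d))[["x"]])
print(slopes)
```

Each element of slopes corresponds to one colored line in a grouped plot. If the slopes differ noticeably between groups, the relationship between X and Y depends on the grouping variable.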