Learn how to add a basic linear regression line to a ggplot graphic and interpret the result

Learn how to add more complex lines to ggplot graphics, including: 1) linear regression lines separated by observation groups, and 2) smoothed curve lines

**Related to:** *Data Computing*, “Collective Properties of Cases”, Ch. 14

ggplot makes it easy to add linear regression lines to a plot. A **linear regression line** is a very simple way to visualize the **direction** and **magnitude** of a relationship between two variables. It helps you say things like: “If variable X goes up by one unit, then variable Y tends to also go up by Z number of units” or “If variable X goes up by one unit, then variable Y tends to go down by Z number of units”.
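To make that interpretation concrete, here is a minimal sketch using base R's `lm()` on a small made-up data frame. The `toy` data and its column names `x` and `y` are assumptions for illustration only and are not part of the benchmarking dataset. The slope coefficient plays the role of “Z” in the sentences above:

```r
# Hypothetical toy data, made up for illustration only
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1)
)

# Fit a linear model of y on x (ordinary least squares)
fit <- lm(y ~ x, data = toy)

# The coefficient on x is the slope: the expected change in y
# when x goes up by one unit
slope <- unname(coef(fit)["x"])
slope
```

For this toy data the slope is close to 2, so you could say: “If X goes up by one unit, then Y tends to go up by about 2 units.”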

It is important to note that sometimes this type of linear relationship can be a bit *too* simple to effectively sum up more complex relationships between variables. To address this concern, we’ll look at an alternative approach in a minute. But for now, let’s look at how to add a simple linear regression line to a ggplot graphic.

Before we get started, let’s again load both the dplyr and ggplot2 packages, as well as the Minneapolis buildings energy benchmarking dataset:

```
library(dplyr)
library(ggplot2)
data <- read.csv("../datasets/mpls_energy_benchmarking_2015.csv", header=TRUE)
```

To add a linear regression line to your graphic, add the `geom_smooth()` glyph to the code for your plot and pass it the argument `method='lm'` (`stat_smooth()`, which appears in later examples, behaves the same way here). ‘lm’ stands for “linear model”, which tells ggplot to generate a **linear regression line** representing the relationship between the X and Y variables in your aesthetics arguments and then add this line to your graphic:

```
ggplot(data, aes(x=year_built, y=site_EUI)) +
  geom_point() +
  geom_smooth(method='lm')
```

You can now see that ggplot has added a line indicating a very flat-looking relationship between the year built and site energy use intensity (EUI) for the buildings in our dataset. From the plot above, it looks like most of the buildings have a site EUI between 0 and about 300 kBtu/sq. ft. There are a few outliers, however, that have a site EUI over 300 kBtu/sq. ft. Let’s see which buildings these are:

`data %>% filter(site_EUI > 300)`

```
## org_name prop_name
## 1 City of Minneapolis Water Treatment and Distribution Campus
## 2 Private Private
## 3 Private Private
## 4 Private Private
## 5 Private Private
## public_private address zip_code energy_star_score
## 1 Public 4500 Marshall Street NE 55421 0
## 2 Private 323 Stinson Blvd 55413 1
## 3 Private 524 23Rd Ave S 55454 0
## 4 Private 17 7th St S 55402 0
## 5 Private 2430 3rd Ave S 55404 0
## prop_type floor_area floor_area_parking
## 1 Drinking Water Treatment & Distribution 650000 0
## 2 Office 50826 0
## 3 Hospital (General Medical & Surgical) 82807 0
## 4 Parking 2870 892673
## 5 Museum 85840 0
## year_built total_GHG_emissions site_EUI weather_normalized_site_EUI
## 1 1930 30269 313 327
## 2 1969 3856 400 403
## 3 1990 2981 368 368
## 4 1963 542 920 920
## 5 2005 3335 377 379
## source_EUI weather_normalized_source_EUI water_use
## 1 740 750 0
## 2 1170 1171 1121
## 3 604 604 5954
## 4 2887 2887 419
## 5 654 653 4550
```

Let’s now change something: let’s restrict the limits of the Y axis so it ranges from 0 to 300 kBtu/sq. ft. This removes the handful of outliers with very high site EUI values from the plot. Note that `ylim()` drops the points outside the limits before the regression line is fit (ggplot prints a warning about the removed rows), so the outliers no longer influence the line either. We will want to make a note about these outliers elsewhere in the analysis, but restricting the Y axis limits for now lets you “zoom in” and see more detail in the relationship between the buildings’ year built and their site EUI values:

```
ggplot(data, aes(x=year_built, y=site_EUI)) +
  geom_point() +
  ylim(0, 300) +
  stat_smooth(method='lm')
```

The plot above is “zoomed in” a bit, making it a little easier to see the regression line. You can now see that the line is quite flat and the points seem to scatter pretty randomly above and below the regression line. This indicates that there doesn’t seem to be a consistent relationship between the year a building was built and its site EUI. Our buildings’ `year_built`, then, *does not* help explain much, if any, of the variation across the buildings’ `site_EUI` values.
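One caveat about the `ylim()` call used above: it discards out-of-range points before the line is fit, so the regression is computed without the outliers. If you would rather keep every building in the fit and only change the visible range, `coord_cartesian()` is the usual alternative. The sketch below contrasts the two on a small made-up data frame. The `toy` data is hypothetical; only its column names mirror the benchmarking dataset:

```r
library(ggplot2)

# Hypothetical toy data standing in for the benchmarking dataset;
# the last building is an extreme outlier
toy <- data.frame(
  year_built = c(1900, 1920, 1940, 1960, 1980, 2000),
  site_EUI   = c(80, 95, 70, 110, 90, 920)
)

# ylim() removes out-of-range points BEFORE the regression is fit,
# so the outlier does not affect this line
p_ylim <- ggplot(toy, aes(x = year_built, y = site_EUI)) +
  geom_point() +
  ylim(0, 300) +
  geom_smooth(method = "lm", formula = y ~ x)

# coord_cartesian() only zooms the view; the fit still uses every point,
# so the outlier pulls this line upward
p_zoom <- ggplot(toy, aes(x = year_built, y = site_EUI)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  coord_cartesian(ylim = c(0, 300))
```

Printing `p_ylim` and `p_zoom` side by side should show two noticeably different lines. Which one is appropriate depends on whether you consider the outliers part of the relationship you are trying to describe.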

Now you are interested in whether the floor area of a building is related to its level of greenhouse gas emissions. You suspect that buildings with larger floor areas have higher greenhouse gas emissions than buildings with smaller floor areas. Create a plot with the following aesthetics and components:

X variable: floor_area

Y variable: total_GHG_emissions

Regression: single linear regression line relating the X & Y variables

Does the plot offer any evidence to support your hypothesis that buildings with larger floor areas tend to have higher levels of greenhouse gas emissions?

Now, let’s add a color (`col=`) argument to the aesthetic. This will group the data, generate a separate regression line for each group, and then color the points and regression lines based on these groups. For example, we can try using the `public_private` variable as our color grouping variable. This will let us discern if there’s a difference in the relationship for public vs. private buildings when comparing the buildings’ year built with their site EUI values:

```
ggplot(data, aes(x=year_built, y=site_EUI, col=public_private)) +
  geom_point() +
  ylim(0, 300) +
  stat_smooth(method='lm')
```
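Adding a color grouping makes ggplot fit one regression per group. As a rough numerical counterpart, the sketch below fits a separate `lm()` to each group of a small made-up data frame and pulls out the slopes. The `toy` values are hypothetical; only the column names mirror the dataset:

```r
# Hypothetical toy data: four public and four private buildings
toy <- data.frame(
  public_private = rep(c("Public", "Private"), each = 4),
  year_built     = rep(c(1950, 1960, 1970, 1980), times = 2),
  site_EUI       = c(100, 110, 120, 130,   # Public: rises 1 unit per year
                     200, 180, 160, 140)   # Private: falls 2 units per year
)

# Split the data by group, fit one linear model per group,
# and keep only the slope on year_built
slopes <- sapply(split(toy, toy$public_private), function(group_data) {
  fit <- lm(site_EUI ~ year_built, data = group_data)
  unname(coef(fit)["year_built"])
})
slopes
```

Each element of `slopes` corresponds to one colored line in a grouped plot: a positive value means the line for that group rises with `year_built`, and a negative value means it falls.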