Related to: Data Computing, “Introduction to Data Graphics”, Ch. 5


Time to check out your data’s “profile pictures”!

Now we’re finally at the point where you can start visualizing your data! The basic visualizations we’ll look at here can be a helpful part of your exploratory analysis. In fact, you can think of them a bit like “profile pictures” for your data: they help you understand how the data looks from different angles. (In real dating, you probably would have started looking at profile pictures a little earlier on in the process, but for data dating, we’ll save the fun stuff for the second date.)

So, let’s start by reading in the data:

Boxplot: The classic “downward camera angle selfie” profile pic

One way you can start sizing up your data is to create some boxplots. Boxplots are great at giving you a rough estimate of the spread of your data: What is the maximum value for a variable? What is the minimum? Where is the middle (median) value? The downside of boxplots is that they only reveal a small set of basic information about your data, but they are a nice place to start before moving on to more complicated plots.

If you remember the summary() function we looked at earlier, a boxplot is essentially a visual representation of most of those values: minimum, 1st quartile, median, 3rd quartile, and maximum. A boxplot has the following structure:

For example, let’s say you want to understand more about the range of scores our buildings received on the “energy_star_score” variable. You can create a boxplot for this variable using the boxplot() function. When you’re done, you should see a boxplot indicating that the median Energy Star score for the buildings in our dataset is about 75, and the 1st and 3rd quartile scores are about 50 and 90, respectively:
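The original command isn’t shown here, so here is a minimal sketch of what a single-variable boxplot call can look like. It assumes your data frame is called data (as in the commands later in this lesson), and it uses a tiny made-up stand-in dataset so that it runs on its own. The scores below are invented for illustration and will not match the real buildings data.

```r
# Toy stand-in for the buildings dataset (values are made up for illustration)
data <- data.frame(energy_star_score = c(42, 50, 68, 75, 77, 90, 95, NA))

# Draw a boxplot of one numeric variable; missing values (NA) are ignored
boxplot(data$energy_star_score, ylab = "Energy Star score")

# The five numbers behind the plot:
# minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(data$energy_star_score)
five
```

With your real dataset loaded, only the boxplot() line is needed; fivenum() is just a quick way to double-check the numbers that the box and whiskers are drawn from.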


You can also group your boxplot, to examine how values vary between groups. For example, we have a variable called “org_name” in our buildings dataset that indicates which organization is responsible for managing the building (or, if it’s a private building, it simply lists “Private”). Maybe we’re wondering how a building’s Energy Star rating tends to vary based on the organization that the building is affiliated with. Let’s group buildings with the same “org_name” together, then, and see how that changes our boxplot:

boxplot(data$energy_star_score ~ data$org_name)

If your plot is too small to view in RStudio, or if it looks like parts of the plot are scrunched together, you can click the “Zoom” button directly above the plot to open up a new window with a zoomed-in version of the plot. Once the zoom window is open, you can also stretch and shrink the size of the window to stretch and shrink the plot.

Looking at the plot above, it seems that Hennepin County buildings tend to have higher Energy Star scores than any of the other groups of buildings. “But wait a minute… that boxplot has an empty space in it!” you may be saying. And you’re right: the “Minneapolis Park and Recreation Board” has no associated boxplot in the chart you just created. This could be for one of two reasons: 1) there are not enough building observations within the “Minneapolis Park and Recreation Board” group to be able to calculate a meaningful boxplot, or 2) there are missing values (“NA”) within the group. In general, each group needs two or more non-missing values to produce a meaningful box: a group with a single non-missing value collapses to a flat line, and a group with none leaves an empty slot. So, let’s try to figure out what is going on.
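One quick way to get a first hint is to look at what boxplot() hands back: besides drawing the plot, it invisibly returns a list that includes the number of non-missing values used for each box. The sketch below uses a made-up toy data frame (the group names come from this lesson, but the scores are invented), so treat it as an illustration, not as the real buildings data.

```r
# Toy stand-in for the buildings dataset (scores are made up for illustration)
data <- data.frame(
  org_name = factor(c("Hennepin County", "Hennepin County", "Hennepin County",
                      "Private", "Private", "Private",
                      "Minneapolis Park and Recreation Board",
                      "Minneapolis Park and Recreation Board")),
  energy_star_score = c(85, 90, 95, 40, 60, 80, NA, NA)
)

# Grouped boxplot using the formula interface with a data argument;
# the summary behind the plot is also returned (invisibly)
bp <- boxplot(energy_star_score ~ org_name, data = data)

# Number of non-missing scores that went into each box, labeled by group
non_missing <- setNames(bp$n, bp$names)
non_missing
```

A group that shows 0 here has nothing to draw, which is exactly the kind of empty slot you see in the plot.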

First, you can use the table() function to check to see how many buildings in the dataset fall within each “org_name” group:
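The original command isn’t shown here; a minimal sketch of this kind of table() call, assuming the data frame is called data, could look like the following. The toy rows are made up so the example runs on its own, so the counts will not match the real dataset.

```r
# Toy stand-in for the buildings dataset (made-up rows, group names from this lesson)
data <- data.frame(
  org_name = c("Private", "Private", "Hennepin County",
               "Minneapolis Park and Recreation Board")
)

# Count how many rows (buildings) fall into each org_name group
counts <- table(data$org_name)
counts
```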


After you run the command above, you should see that there are 49 buildings that are affiliated with the “Minneapolis Park and Recreation Board”. That should be enough to create a boxplot, so clearly there is something else going on. Next, you can view your data by clicking on the name of your dataset from the “Environment” tab inside of RStudio. This will open up a preview tab where you can view the data. Click the sort arrows next to “org_name” to sort your data based on the “org_name” variable:

screenshot of data preview tab in RStudio with column sort arrows

Finally, scroll down and look for the buildings that are affiliated with the “Minneapolis Park and Recreation Board”: what do you see in the “energy_star_score” column for these buildings? It looks like most of them are missing values (“NA”). For some reason, it seems that Minneapolis Park and Recreation Board buildings do not generally receive “energy_star_score” ratings. That explains why you don’t see a boxplot for this group of buildings in the grouped boxplot you just created!

This is a good lesson: anytime you group your data based on a variable like this, it’s always good to check how many values fall into each group before plotting or doing further analysis. If you don’t have enough observations in a particular group, or you have missing observations within the group, you will need to account for that during your analysis.
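If you would rather not scroll through the data preview, you can do the same check with table() by counting missing and non-missing scores side by side for each group. This is only a sketch on a made-up toy data frame (the scores are invented), and the labels org and missing_score are names chosen here for readability.

```r
# Toy stand-in for the buildings dataset (scores are made up for illustration)
data <- data.frame(
  org_name = c("Private", "Private", "Private",
               "Hennepin County", "Hennepin County",
               "Minneapolis Park and Recreation Board",
               "Minneapolis Park and Recreation Board"),
  energy_star_score = c(60, 75, 88, 80, 92, NA, NA)
)

# Cross-tabulate group membership against "is the score missing?"
# Column FALSE = scores present, column TRUE = scores missing (NA)
check <- table(org = data$org_name,
               missing_score = is.na(data$energy_star_score))
check
```

Any group whose FALSE column is 0 or 1 will not get a meaningful box in a grouped boxplot.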

Activity A: Practice the table() function and examine missing values

Use the table() function again, this time on data$prop_type. Using the same process as above, try to decide: would you be able to create a grouped boxplot displaying the “energy_star_score” distribution for buildings whose “prop_type” is “Parking”? Why or why not? How about buildings whose “prop_type” is “Fire Station”? Why or why not?

Histogram: The full-body mirror shot

Another way of examining your data is to look at a histogram, which helps you learn more about both the spread and the shape of your data: What values are most common in the data? What values are rare in the data? Do most of the values fall in the middle, or do they lie on separate extremes of the data’s range? A histogram reveals some of the same basic information a boxplot can show you, plus some additional information about the data’s shape. Here’s a quick comparison between boxplots and histograms, and how to interpret the minimum, 1st quartile, median, 3rd quartile, and maximum values on each:
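To try the comparison yourself, here is a minimal sketch that draws a histogram with hist() and prints the matching summary() values for the same variable. It uses a made-up toy data frame (the scores are invented for illustration), and the bin width of 10 is just one reasonable choice for a 0 to 100 score.

```r
# Toy stand-in for the buildings dataset (values are made up for illustration)
data <- data.frame(
  energy_star_score = c(12, 35, 48, 55, 61, 68, 72, 75,
                        77, 81, 84, 88, 90, 93, 97, NA)
)

# Histogram: each bar shows how many scores fall into a 10-point bin;
# missing values (NA) are dropped before counting
h <- hist(data$energy_star_score,
          breaks = seq(0, 100, by = 10),
          main = "Energy Star scores",
          xlab = "Energy Star score")

# The same minimum, quartiles, median, and maximum a boxplot would show
summary(data$energy_star_score)
```

Tall bars mark the most common score ranges, while the summary() output tells you where the boxplot’s box and whiskers would sit along the same axis.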