Related to: Data Computing, “Tidy Data”, Ch. 1; “From Strings to Numbers”, p. 131; “Factors or Strings?”, p. 133; “Wide versus Narrow Data Layouts”, Ch. 11


Find yourself some data

The first step in your own data visualization adventure is to find some interesting data! Fortunately, this shouldn’t be too tricky. In recent years, organizations and governments at every level, from city to state to national, have paid increasing attention to making their data “open”. This is good news for you, because it is becoming easier and easier to find interesting datasets online!

Here are a few places you can look for datasets that may be interesting to analyze and visualize:

Tidy up your data

When dealing with a new dataset, you will first want to make sure the data is in a “tidy” format. For example, here is a dataset that is not very “tidy”. On the surface it may look pretty, but its current layout is very difficult for R to read. Notice how it has irregularly shaped columns and rows, and several header lines:

screenshot of untidy dataset with multiple heading rows, cells that span multiple columns, and time series information arranged side-by-side

If you run into a dataset like this, you may need to use a spreadsheet program like Microsoft Excel or OpenOffice to cut and paste your data into a tidier format before reading it into R. To make your dataset “tidy”, do the following:

  1. Make sure the cells of your data are shaped as a regular rectangular array, with exactly one cell present for every row and column. Eliminate irregularly-shaped cells that span multiple columns or rows.

  2. Fix your column names. Each column name must start with a letter and should not contain special characters (“&”, “$”, “#”, “@”, “!”).

  3. If your data contains time series information (years, months, etc.), format the data so that each time period has its own full set of rows, with one row of values for each observation of interest. This is what we call a “long” or “narrow” data format. (Note: The “long” format produces many rows and can look less compact, which is why many data providers prefer to publish time series data in “wide” format. When creating data visualizations, however, ggplot generally works best with “long” format data, so be sure to restructure your data before reading it into R to make it easier to visualize using ggplot.)

  4. Make sure your dataset is saved as a .CSV file. If your dataset came in an Excel (.XLSX) or other format, you will need to choose File > Save As… > then select Comma Separated Values (.csv) as your format.
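
To make steps 2 through 4 more concrete, here is a rough sketch in base R. The tiny table is made up purely for illustration, and the names `wide`, `long`, `region`, `value`, and `csv_path` are hypothetical. It is meant to show what the “wide” and “long” layouts look like and how a problem column name can be spotted, not to replace cleaning up a messy spreadsheet by hand.

```r
# A made-up "wide" table: one row per region, one column per year.
# check.names = FALSE keeps the year columns exactly as typed.
wide <- data.frame(
  region = c("A", "B"),
  `2019` = c(10, 30),
  `2020` = c(20, 40),
  check.names = FALSE
)

# Step 2: a column name that starts with a digit is not a valid R name.
# make.names() returns the name R would substitute, so a mismatch flags a problem.
bad_names <- names(wide)[names(wide) != make.names(names(wide))]
print(bad_names)

# Step 3: stack the year columns into a "long" table,
# with one row for each combination of region and year.
long <- reshape(
  wide,
  direction = "long",
  varying   = c("2019", "2020"),  # the columns to stack
  v.names   = "value",            # name of the column holding the stacked values
  timevar   = "year",             # name of the new time column
  times     = c(2019, 2020),
  idvar     = "region"
)
rownames(long) <- NULL
print(long)

# Step 4: save the result as a CSV file and read it back (a temporary file here).
csv_path <- tempfile(fileext = ".csv")
write.csv(long, csv_path, row.names = FALSE)
tidy <- read.csv(csv_path)
str(tidy)
```

In the long table every region appears once per year, and all column names start with a letter, which matches the layout described in the steps above.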

When you’re done, you should end up with a dataset that is tidy and ready to be read into R: