library(tidyverse)
library(readxl)
library(skimr)
# Packages for manipulating name strings and guessing genders based on names
library(tm)
library(gender)
# Note: Need to install the 'genderdata' package from the ROpenSci repository -- not available on CRAN
# install.packages("genderdata", repos = "http://packages.ropensci.org")
library(genderdata)
# Packages for timing
library(beepr)
library(tictoc)
hr_2019 <- read_excel("./data/fiscal-year-2019.xlsx", sheet="FY19 HR INFO", na = "'', '-'")
Expecting numeric in J22556 / R22556C10: got '-'
# Convert these two columns to Date type files, instead of the strange character fields
# that Excel date columns sometimes get parsed as
hr_2019$ORIGINAL_HIRE_DATE <- as.Date(as.numeric(hr_2019$ORIGINAL_HIRE_DATE), origin = "1899-12-30")
NAs introduced by coercion
hr_2019$LAST_HIRE_DATE <- as.Date(as.numeric(hr_2019$LAST_HIRE_DATE), origin = "1899-12-30")
NAs introduced by coercion
# This column seems to import as POSIXct by default--not sure why it's different from the
# other two. The only thing we have to do is convert it to an R Date type.
hr_2019$JOB_ENTRY_DATE <- as.Date(hr_2019$JOB_ENTRY_DATE)
#earnings_2019 <- read_excel("./data/fiscal-year-2019.xlsx", sheet="FY19 EARNINGS")
Salaries are listed on either a biweekly or hourly basis: table(hr_2019$COMP_FREQUENCY_DESC). To be able to compare across salary types, we can standardize all salaries into an hourly rate. This will become our main response variable:
hr_2019 <- hr_2019 %>%
mutate(COMP_RATE_STND_HOURLY = ifelse(COMP_FREQUENCY_CODE == 'H', COMPENSATION_RATE, COMPENSATION_RATE / 80))
We can convert the date variables to a format that makes them slightly more interpretable. We’ll use the end of the calendar year (December 31, 2019) as a reference time frame:
hr_2019 <- hr_2019 %>%
mutate(YRS_SINCE_ORIGINAL_HIRE = as.numeric((as.Date('2019-12-31') - ORIGINAL_HIRE_DATE)) / 365,
YRS_SINCE_LAST_HIRE = as.numeric((as.Date('2019-12-31') - LAST_HIRE_DATE)) / 365,
YRS_SINCE_JOB_ENTRY = as.numeric((as.Date('2019-12-31') - JOB_ENTRY_DATE)) / 365)
Clean up the postal codes so they’re all only 5 digits long:
hr_2019 <- hr_2019 %>%
mutate(LOCATION_POSTAL_CODE_CLEANED = substr(LOCATION_POSTAL_CODE, 1, 5))
There is one TEMPORARY_ID value that appears to be shared across two different individuals. Let’s create a new UNIQUE_ID field that contains the TEMPORARY_ID plus the first letter of the person’s last name to create field that is truly distinct, in case we need to count or group by this field at some point:
hr_2019 %>% group_by(TEMPORARY_ID) %>%
summarise(distinct_names = length(unique(EMPLOYEE_NAME))) %>%
filter(distinct_names > 1)
hr_2019$UNIQUE_ID <- paste(hr_2019$TEMPORARY_ID, substring(hr_2019$EMPLOYEE_NAME, 1, 1), sep="")
We have a lot of roles that have a COMPENSATION_RATE value of ‘0’. When we examine these more closely broken down by agency, we see that most of these are concentrated in ‘MN St Colleges & Universities’, where ~40% of roles have a compensation rate of ‘0’ listed!
hr_2019 %>% group_by(AGENCY_NAME) %>%
summarise(pct_of_roles_where_comp_rate_equals_zero =
sum(ifelse(COMPENSATION_RATE == 0, 1, 0)) / n() ) %>%
filter(pct_of_roles_where_comp_rate_equals_zero > 0) %>%
arrange(desc(pct_of_roles_where_comp_rate_equals_zero))
This is profound enough that it seems likely to skew any conclusions we would make about this agency, so to keep thing simple for now, let’s filter this agency out of the rest of the analysis:
hr_2019 <- hr_2019 %>% filter(AGENCY_NAME != 'MN St Colleges & Universities')
The skim() function shows how many missing values we have within each column, along with some other helpful information. It looks like the LAST_HIRE_DATE is often missing, but we have pretty consistent data for pretty much everything else:
skim(hr_2019)
── Data Summary ────────────────────────
Values
Name hr_2019
Number of rows 47759
Number of columns 39
_______________________
Column type frequency:
character 25
Date 3
numeric 11
________________________
Group variables None
── Variable type: character ─────────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate min max empty n_unique whitespace
1 TEMPORARY_ID 0 1 12 12 0 46721 0
2 EMPLOYEE_NAME 0 1 5 41 0 46614 0
3 AGENCY_NBR 0 1 3 3 0 91 0
4 AGENCY_NAME 0 1 6 30 0 91 0
5 DEPARTMENT_NBR 0 1 7 7 0 6879 0
6 DEPARTMENT_NAME 0 1 3 30 0 6038 0
7 BRANCH_CODE 0 1 1 1 0 4 0
8 BRANCH_NAME 0 1 5 11 0 4 0
9 JOB_TITLE 0 1 3 30 0 1642 0
10 LOCATION_NBR 0 1 5 5 0 2029 0
11 LOCATION_NAME 0 1 1 30 0 1713 0
12 LOCATION_POSTAL_CODE 0 1 1 9 0 662 0
13 REG_TEMP_CODE 0 1 1 1 0 8 0
14 REG_TEMP_DESC 0 1 7 16 0 8 0
15 CLASSIFIED_CODE 0 1 1 1 0 3 0
16 CLASSIFIED_DESC 0 1 10 12 0 3 0
17 FULL_PART_TIME_CODE 0 1 1 1 0 3 0
18 FULL_PART_TIME_DESC 0 1 9 12 0 3 0
19 SALARY_PLAN_GRID 0 1 1 4 0 89 0
20 COMP_FREQUENCY_CODE 0 1 1 1 0 2 0
21 COMP_FREQUENCY_DESC 0 1 6 8 0 2 0
22 BARGAINING_UNIT_NAME 0 1 7 30 0 30 0
23 ACTIVE_ON_JUNE_30_2019 0 1 2 3 0 2 0
24 LOCATION_POSTAL_CODE_CLEANED 0 1 1 5 0 373 0
25 UNIQUE_ID 0 1 13 13 0 46721 0
── Variable type: Date ──────────────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate min max median n_unique
1 ORIGINAL_HIRE_DATE 27 0.999 1949-06-13 2019-06-11 2010-10-12 7529
2 LAST_HIRE_DATE 7236 0.848 1978-08-08 2019-06-27 2014-08-11 4586
3 JOB_ENTRY_DATE 0 1 1965-02-03 2019-07-02 2016-06-15 5430
── Variable type: numeric ───────────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
1 RECORD_NBR 0 1 0.0928 0.370 0 0 0 0 7
2 JOB_CODE 0 1 3297. 2660. 1 1346. 2867 3677 9366
3 SALARY_GRADE_RANGE 0 1 27.6 25.3 1 8 16 57 96
4 MAX_SALARY_STEP 0 1 10.7 4.02 1 10 12 13 19
5 COMPENSATION_RATE 0 1 79.9 522. 0 22.3 28.2 38.6 7706.
6 POSITION_FTE 0 1 0.942 0.177 0 1 1 1 1.33
7 BARGAINING_UNIT_NBR 0 1 217. 25.6 201 206 214 216 335
8 COMP_RATE_STND_HOURLY 0 1 31.2 12.3 0 22.3 28.2 38.6 171.
9 YRS_SINCE_ORIGINAL_HIRE 27 0.999 12.5 10.8 0.556 3.35 9.22 19.5 70.6
10 YRS_SINCE_LAST_HIRE 7236 0.848 7.55 6.41 0.512 2.16 5.39 12.2 41.4
11 YRS_SINCE_JOB_ENTRY 0 1 6.10 6.56 0.499 1.61 3.55 7.90 54.9
hist
1 ▇▁▁▁▁
2 ▆▇▁▁▃
3 ▇▁▁▃▁
4 ▂▁▂▇▁
5 ▇▁▁▁▁
6 ▁▁▁▇▁
7 ▇▁▁▁▁
8 ▇▃▁▁▁
9 ▇▃▁▁▁
10 ▇▃▂▁▁
11 ▇▂▁▁▁
Let’s filter the dataset to drop the LAST_HIRE_DATE columns, where we see a high number of NAs. Then, we can scan across the remaining columns and filter the dataset to include all rows where we have complete cases for the remaining columns. Note: Normally we would want to be a little more careful about filtering out incomplete cases before inspecting them more closely, but since our goal is to learn multilevel modeling without getting tripped up by errors due to missing data, we’ll consider this a “didactic shortcut”.
hr_2019 <- hr_2019 %>% select(-LAST_HIRE_DATE, -YRS_SINCE_LAST_HIRE)
hr_2019 <- hr_2019 %>% filter(complete.cases(.))
Here’s how we’ve narrowed down the dataset after cleaning:
74,304 (all roles, unfiltered) > 47,759 (after filtering out “MN St Colleges & Universities”) > 47,732 (after filtering out incomplete cases)
The public employees dataset doesn’t come with gender as a delivered field. Because we are interested in seeing if there are any gender-related trends around compensation, the best we will be able to do is guess the gender of each employee based on their first name. The gender and genderdata packages can help out with this. These packages leverage information from the Social Security Administration on which names were most frequently associated with which gender at different eras in history.
First, we need to extract the first name and last name into separate columns:
hr_2019 <- hr_2019 %>%
mutate(LAST_NAME = removePunctuation(str_extract(EMPLOYEE_NAME, "^(.+?),")),
FIRST_NAME = removePunctuation(str_extract(EMPLOYEE_NAME, ",([A-Z|a-z]+)")))
Next, we define a get_gender() function that creates a mapping from each first name in the dataset to the most likely gender corresponding to that first name. This function takes a long time to run, so we’ll make sure to distinct() the set of names before passing them through to the function. We’ll also add some code that times how long it takes the function to run and emits a “beep” when it’s finished. (This is a convenient time to go take a coffee break!) When the results are ready, we’ll save them to a .csv file that we can use from session to session, so we don’t have to run this function more than once.
get_gender <- function(first_name) {
results_df <- gender(as.character(first_name), method = "ssa", years = c(1950, 2012))
gender_estimate <- as.character(results_df$gender)
# If we didn't get a valid result, return 'unknown', otherwise return the result
if(length(gender_estimate) == 0) {
return("unknown")
} else {
return(results_df$gender)
}
}
# Get all distinct first names and guess their genders (so we can join this back to the HR data after making the guesses)
# Runtime: ~22 min
tic("guessing names")
gender_guesses <- hr_2019 %>%
#filter(BARGAINING_UNIT_NAME == "Corrections Officers") %>% # test on a single bargaining unit to see if it's working
select(FIRST_NAME) %>%
distinct(FIRST_NAME) %>%
mutate(GENDER = map_chr(FIRST_NAME, get_gender))
toc()
beepr::beep()
write.csv(gender_guesses, file="./data/gender_guesses.csv", row.names = FALSE)
Join the gender data back to the original dataset:
gender_guesses <- read.csv("./data/gender_guesses_backup.csv", header=TRUE)
hr_2019 <- left_join(hr_2019, gender_guesses, by=c("FIRST_NAME"))
Column `FIRST_NAME` joining character vector and factor, coercing into character vector
Here are what the results from the gender package look like when it guesses the gender based on a name. You can see that it’s guessing the gender based on a simple “maximum likelihood” assessment of whether the name occurs more frequently in males vs. females based on the U.S. Social Security Administration’s baby name data. For now, we’ll use the guesses “as is”, but more sophisticated analyses may want to figure out ways to adjust for the level of uncertainty present in these guesses:
gender("River", method = "ssa", years = c(1950, 2012))
Just for fun…what are the most popular names?
hr_2019 %>% group_by(FIRST_NAME) %>%
summarise(count = length(unique(TEMPORARY_ID))) %>%
arrange(desc(count)) %>%
top_n(10)
Selecting by count
Let’s inspect the names for which our gender package wasn’t able to make a gender guess. It looks like we’re getting more ‘unknown’ values for names that may be more frequently associated with individuals with an immigrant background, or whose families have chosen to give them names that are outside of the American mainstream. In fact, this ‘unknown’ appears like it could act as a pretty strong proxy for something other than gender. We need to proceed with caution! This is a source of bias, and we need to take it into account when performing gender-based analyses with this column moving forward.
gender_guesses %>% filter(GENDER == 'unknown') %>% select(FIRST_NAME) %>% top_n(10)
Selecting by FIRST_NAME