Module 5 Cleaning Data in R
It’s commonly said that data scientists spend 80% of their time cleaning and manipulating data and only 20% of their time analyzing it. The time spent cleaning is vital since analyzing dirty data can lead you to draw inaccurate conclusions.
In this course, you’ll learn how to clean dirty data. Using R, you’ll learn how to identify values that don’t look right and fix dirty data by converting data types, filling in missing values, and using fuzzy string matching. As you learn, you’ll brush up on your skills by working with real-world datasets, including bike-share trips, customer asset portfolios, and restaurant reviews—developing the skills you need to go from raw data to awesome insights as quickly and accurately as possible!
5.1 1: Common Data Problems
In this chapter, you’ll learn how to overcome some of the most common dirty data problems. You’ll convert data types, apply range constraints to remove future data points, and remove duplicated data points to avoid double-counting.
5.1.1 Video: Data type constraints
Errors can be introduced by typos and misspellings. Dirty data can enter the data science workflow before we even access the data, and if we don’t address these errors early on, they can follow us all the way through the workflow.
A quick reminder of data type constraints: each column should hold a single data type (e.g. character, numeric, integer, factor, logical, or date), and that type determines which operations make sense on it.
5.1.2 Question: Common data types
Correctly identifying the type of your data is one of the easiest ways to avoid hampering your analysis due to data type constraints in the long run.
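To make this concrete, here is a minimal sketch (with made-up values, not from the course data) showing how class() reveals the most common types:
class("29 minutes")          # "character": text
class(13.2)                  # "numeric": decimals
class(5L)                    # "integer": whole numbers
class(TRUE)                  # "logical": TRUE/FALSE
class(factor("Small"))       # "factor": categories
class(as.Date("2020-01-01")) # "Date": calendar dates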
5.1.3 Converting data types
Throughout this chapter, you’ll be working with San Francisco bike-share ride data called bike_share_rides. It contains information on the start and end stations of each trip, the trip duration, and some user information.
Before beginning to analyze any dataset, it’s important to take a look at the different types of columns you’ll be working with, which you can do using glimpse().
In this exercise, you’ll take a look at the data types contained in bike_share_rides and see how an incorrect data type can flaw your analysis.
dplyr was previously loaded, so we just load assertive. bike_share_rides is not available, so code requiring it will not be run.
# Load assertive package
library(assertive)
# Glimpse at bike_share_rides
glimpse(bike_share_rides)
# Summary of user_birth_year
summary(bike_share_rides$user_birth_year)
Question
The summary statistics of user_birth_year don’t seem to offer much useful information about the different birth years in our dataset. Why do you think that is?
- The user_birth_year column is not of the correct type and should be converted to a character.
- The user_birth_year column has an infinite set of possible values and should be converted to a factor.
- The user_birth_year column represents groupings of data and should be converted to a factor. [Correct]
# Convert user_birth_year to factor: user_birth_year_fct
bike_share_rides <- bike_share_rides %>%
  mutate(user_birth_year_fct = as.factor(user_birth_year))
# Assert user_birth_year_fct is a factor
assert_is_factor(bike_share_rides$user_birth_year_fct)
# Summary of user_birth_year_fct
summary(bike_share_rides$user_birth_year_fct)
Looking at the new summary statistics, more riders were born in 1988 than any other year.
5.1.4 Trimming strings
In the previous exercise, you were able to identify the correct data type and convert user_birth_year to that type, allowing you to extract counts that gave you a bit more insight into the dataset.
Another common dirty data problem is having extra bits like percent signs or periods in numbers, causing them to be read in as characters. In order to crunch these numbers, the extra bits need to be removed and the numbers converted from character to numeric. In this exercise, you’ll need to convert the duration column from character to numeric, but before this can happen, the word "minutes" needs to be removed from each value.
# Load stringr package
library(stringr)
bike_share_rides <- bike_share_rides %>%
  # Remove 'minutes' from duration: duration_trimmed
  mutate(duration_trimmed = str_remove(duration, "minutes"),
         # Convert duration_trimmed to numeric: duration_mins
         duration_mins = as.numeric(duration_trimmed))
# Glimpse at bike_share_rides
glimpse(bike_share_rides)
# Assert duration_mins is numeric
assert_is_numeric(bike_share_rides$duration_mins)
# Calculate mean duration
mean(bike_share_rides$duration_mins)
By removing characters and converting to a numeric type, you were able to figure out that the average ride duration is about 13 minutes - not bad for a city like San Francisco!
5.1.5 Video: Range constraints
What should we do about values outside of a variable’s allowed range? (A small sketch of each option follows the list.)
- Remove these rows from the data frame, but only when a small proportion of the values are out of range; otherwise we would significantly increase the amount of bias in our dataset.
- Treat these values as missing (NA).
- Replace them with the range limit, e.g. replace a rating of 6 on a 1 to 5 rating scale with 5.
- Replace them with another value based on domain knowledge and/or knowledge of the dataset, e.g. replace with the average rating.
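Here is a minimal base-R sketch of those options on a made-up vector of ratings (which replacement is appropriate depends on your data):
# Hypothetical ratings on a 1 to 5 scale, with one out-of-range value
ratings <- c(4, 5, 3, 6, 2)

# Treat out-of-range values as missing
replace(ratings, ratings > 5, NA)

# Replace with the range limit
pmin(ratings, 5)

# Replace with a value based on knowledge of the dataset, e.g. the mean of the valid ratings
replace(ratings, ratings > 5, mean(ratings[ratings <= 5]))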
5.1.6 Ride duration constraints
Values that are out of range can throw off an analysis, so it’s important to catch them early on. In this exercise, you’ll be examining the duration_min column more closely. Bikes are not allowed to be kept out for more than 24 hours, or 1440 minutes, at a time, but issues with some of the bikes caused inaccurate recording of the time they were returned.
In this exercise, you’ll replace erroneous data with the range limit (1440 minutes); however, you could just as easily replace these values with NAs.
# Create breaks
breaks <- c(min(bike_share_rides$duration_min), 0, 1440, max(bike_share_rides$duration_min))
# Create a histogram of duration_min
ggplot(bike_share_rides, aes(duration_min)) +
geom_histogram(breaks = breaks)
# duration_min_const: replace vals of duration_min > 1440 with 1440
bike_share_rides <- bike_share_rides %>%
  mutate(duration_min_const = replace(duration_min, duration_min > 1440, 1440))
# Make sure all values of duration_min_const are between 0 and 1440
assert_all_are_in_closed_range(bike_share_rides$duration_min_const, lower = 0, upper = 1440)
The method of replacing erroneous data with the range limit works well, but you could just as easily replace these values with NAs or something else instead.
5.1.7 Back to the future
Something has gone wrong and it looks like you have data with dates from the future, which is way outside of the date range you expected to be working with. To fix this, you’ll need to remove any rides from the dataset that have a date in the future. Before you can do this, the date column needs to be converted from a character to a Date. Having these as Date objects will make it much easier to figure out which rides are from the future, since R makes it easy to check if one Date object is before (<) or after (>) another.
# Convert date to Date type
bike_share_rides <- bike_share_rides %>%
  mutate(date = as.Date(date))
# Make sure all dates are in the past
assert_all_are_in_past(bike_share_rides$date)
# Filter for rides that occurred before or on today's date
bike_share_rides_past <- bike_share_rides %>%
  filter(date <= today())
# Make sure all dates from bike_share_rides_past are in the past
assert_all_are_in_past(bike_share_rides_past$date)
Handling data from the future like this is much easier than trying to verify the data’s correctness by time traveling.
5.1.8 Video: Uniqueness constraints
- Full duplicates: Two rows have the exact same entries in every column.
- Partial duplicates: Two rows have the exact same entries in some columns.
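To make the distinction concrete, here is a toy dplyr sketch (made-up rows, not the course data) contrasting the two cases:
library(dplyr)

# Rows 1 and 2 are full duplicates; rows 3 and 4 share a ride_id but differ in
# duration, so they are only partial duplicates
rides <- tibble(ride_id  = c(1, 1, 2, 2),
                duration = c(10, 10, 25, 27))

sum(duplicated(rides))                      # 1 full duplicate row
sum(duplicated(rides$ride_id))              # 2 repeated ride_ids (full + partial)
distinct(rides)                             # drops only the full duplicate
distinct(rides, ride_id, .keep_all = TRUE)  # keeps the first row per ride_id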
5.1.9 Full duplicates
You’ve been notified that an update has been made to the bike sharing data pipeline to make it more efficient, but that duplicates are more likely to be generated as a result. To make sure that you can continue using the same scripts to run your weekly analyses about ride statistics, you’ll need to ensure that any duplicates in the dataset are removed first.
When multiple rows of a data frame share the same values for all columns, they’re full duplicates of each other. Removing duplicates like this is important, since having the same value repeated multiple times can alter summary statistics like the mean and median. Each ride, including its ride_id, should be unique.
# Count the number of full duplicates
sum(duplicated(bike_share_rides))
# Remove duplicates
bike_share_rides_unique <- distinct(bike_share_rides)
# Count the full duplicates in bike_share_rides_unique
sum(duplicated(bike_share_rides_unique))
Removing full duplicates will ensure that summary statistics aren’t altered by repeated data points.
5.1.10 Removing partial duplicates
Now that you’ve identified and removed the full duplicates, it’s time to check for partial duplicates. Partial duplicates are a bit tricker to deal with than full duplicates. In this exercise, you’ll first identify any partial duplicates and then practice the most common technique to deal with them, which involves dropping all partial duplicates, keeping only the first.
# Find duplicated ride_ids
bike_share_rides %>%
  # Count the number of occurrences of each ride_id
count(ride_id) %>%
# Filter for rows with a count > 1
filter(n > 1)
# Remove full and partial duplicates
bike_share_rides_unique <- bike_share_rides %>%
  # Only based on ride_id instead of all cols
distinct(ride_id, .keep_all = TRUE)
# Find duplicated ride_ids in bike_share_rides_unique
bike_share_rides_unique %>%
  # Count the number of occurrences of each ride_id
count(ride_id) %>%
# Filter for rows with a count > 1
filter(n > 1)
It’s important to consider the data you’re working with before removing partial duplicates, since sometimes it’s expected that there will be partial duplicates in a dataset, such as if the same customer makes multiple purchases.
5.1.11 Aggregating partial duplicates
Another way of handling partial duplicates is to compute a summary statistic of the values that differ between partial duplicates, such as mean, median, maximum, or minimum. This can come in handy when you’re not sure how your data was collected and want an average, or if based on domain knowledge, you’d rather have too high of an estimate than too low of an estimate (or vice versa).
bike_share_rides %>%
  # Group by ride_id and date
  group_by(ride_id, date) %>%
  # Add duration_min_avg column
  mutate(duration_min_avg = mean(duration_min)) %>%
  # Remove duplicates based on ride_id and date, keep all cols
  distinct(ride_id, date, .keep_all = TRUE) %>%
  # Remove duration_min column
  select(-duration_min)
Aggregation of partial duplicates allows you to keep some information about all data points instead of keeping information about just one data point.
5.2 2: Categorical and Text Data
Categorical and text data can often be some of the messiest parts of a dataset due to their unstructured nature. In this chapter, you’ll learn how to fix whitespace and capitalization inconsistencies in category labels, collapse multiple categories into one, and reformat strings for consistency.
5.2.1 Video: Checking membership
In R, categorical variables are stored as factors. Internally, factors are stored as numbers, and each number has a corresponding label, called a level.
- How many levels do the categorical variables above have?
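As a minimal, made-up illustration of how a factor stores its labels and underlying codes:
# A made-up categorical variable stored as a factor
sizes <- factor(c("Small", "Large", "Large", "Medium"),
                levels = c("Small", "Medium", "Large"))

levels(sizes)      # the allowed labels: "Small", "Medium", "Large"
nlevels(sizes)     # 3 levels
as.integer(sizes)  # the underlying codes: 1 3 3 2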
How do we end up with values outside those allowed by our factors?
Filtering joins are a type of join that keeps or removes observations from the first table, but doesn’t add any new columns.
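Here is a small sketch with two made-up tables; the exercise below applies the same idea to sfo_survey and dest_sizes:
library(dplyr)

# Toy tables: one observation has a dest_size that is not in the reference table
survey <- tibble(id = 1:3, dest_size = c("Small", "Huge", "Large"))
sizes  <- tibble(dest_size = c("Small", "Medium", "Large", "Hub"))

anti_join(survey, sizes, by = "dest_size")  # rows whose dest_size is NOT in sizes
semi_join(survey, sizes, by = "dest_size")  # rows whose dest_size IS in sizes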
5.2.2 Question: Members only
So far in the course, you’ve learned about a number of different problems you can run into when you have dirty data, including
- data type constraints,
- range constraints,
- uniqueness constraints, and
- membership constraints.
It’s important to be able to correctly identify the type of problem you’re dealing with so that you can treat it correctly. In this exercise, you’ll practice identifying these problems by mapping dirty data scenarios to their constraint type.
Being able to identify what kinds of errors are in your data is important so that you know how to go about fixing them.
5.2.3 Not a member
Now that you’ve practiced identifying membership constraint problems, it’s time to fix these problems in a new dataset. Throughout this chapter, you’ll be working with a dataset called sfo_survey
, containing survey responses from passengers taking flights from San Francisco International Airport (SFO). Participants were asked questions about the airport’s cleanliness, wait times, safety, and their overall satisfaction.
There were a few issues during data collection that resulted in some inconsistencies in the dataset. In this exercise, you’ll be working with the dest_size
column, which categorizes the size of the destination airport that the passengers were flying to. A data frame called dest_sizes
is available that contains all the possible destination sizes. Your mission is to find rows with invalid dest_sizes
and remove them from the data frame.
# Count the number of occurrences of dest_size
sfo_survey %>%
  count(dest_size)
# Find bad dest_size rows
sfo_survey %>%
  # Join with dest_sizes data frame to get bad dest_size rows
anti_join(dest_sizes, by = "dest_size") %>%
# Select id, airline, destination, and dest_size cols
select(id, airline, destination, dest_size)
# Remove bad dest_size rows
sfo_survey %>%
  # Join with dest_sizes
semi_join(dest_sizes, by = "dest_size") %>%
# Count the number of each dest_size
count(dest_size)
Anti-joins can help you identify the rows that are causing issues, and semi-joins can remove the issue-causing rows. In the next lesson, you’ll learn about other ways to deal with bad values so that you don’t have to lose rows of data.
5.2.4 Video: Categorical data problems
5.2.5 Identifying inconsistency
In the video exercise, you learned about different kinds of inconsistencies that can occur within categories, making it look like a variable has more categories than it should.
In this exercise, you’ll continue working with the sfo_survey
dataset. You’ll examine the dest_size
column again as well as the cleanliness
column and determine what kind of issues, if any, these two categorical variables face.
# Count dest_size
sfo_survey %>%
  count(dest_size)
# Count cleanliness
sfo_survey %>%
  count(cleanliness)
In the next exercise, you’ll fix these inconsistencies to get more accurate counts.
5.2.6 Correcting inconsistency
Now that you’ve identified that dest_size
has whitespace inconsistencies and cleanliness
has capitalization inconsistencies, you’ll use the new tools at your disposal to fix the inconsistent values in sfo_survey
instead of removing the data points entirely, which could add bias to your dataset if more than 5% of the data points need to be dropped.
# Add new columns to sfo_survey
sfo_survey <- sfo_survey %>%
  # dest_size_trimmed: dest_size without whitespace
mutate(dest_size_trimmed = str_trim(dest_size),
# cleanliness_lower: cleanliness converted to lowercase
cleanliness_lower = str_to_lower(cleanliness))
# Count values of dest_size_trimmed
sfo_survey %>%
  count(dest_size_trimmed)
# Count values of cleanliness_lower
sfo_survey %>%
  count(cleanliness_lower)
You were able to convert seven-category data into four-category data, which will help your analysis go more smoothly.
5.2.7 Collapsing categories
One of the tablets that participants filled out the sfo_survey on was not properly configured, allowing the response for dest_region to be free text instead of a dropdown menu. This resulted in some inconsistencies in the dest_region variable that you’ll need to correct in this exercise to ensure that the numbers you report to your boss are as accurate as possible.
# Count categories of dest_region
sfo_survey %>%
  count(dest_region)
# Categories to map to Europe
<- c("EU", "eur", "Europ")
europe_categories
# Add a new col dest_region_collapsed
sfo_survey %>%
  # Map all categories in europe_categories to Europe
mutate(dest_region_collapsed = fct_collapse(dest_region,
Europe = europe_categories)) %>%
# Count categories of dest_region_collapsed
count(dest_region_collapsed)
You’ve reduced the number of categories from 12 to 9, and you can now be confident that 401 of the survey participants were heading to Europe.
5.2.8 Video: Cleaning text data
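The exercises in this lesson lean on a handful of stringr functions; here is a quick sketch of them on made-up phone numbers:
library(stringr)

str_detect("184-555-0193", "-")               # TRUE: the string contains a hyphen
str_detect("(184) 555 0193", fixed("("))      # TRUE: fixed() treats "(" literally, not as regex
str_remove_all("(184) 555-0193", fixed("("))  # drop every "("
str_replace_all("184-555-0193", "-", " ")     # "184 555 0193"
str_length("184-555-0193")                    # 12 characters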
5.2.9 Detecting inconsistent text data
You’ve recently received some news that the customer support team wants to ask the SFO survey participants some follow-up questions. However, the auto-dialer that the call center uses isn’t able to parse all of the phone numbers since they’re all in different formats. After some investigation, you found that some phone numbers are written with hyphens (-
) and some are written with parentheses ((
,)
). In this exercise, you’ll figure out which phone numbers have these issues so that you know which ones need fixing.
# Filter for rows with "-" in the phone column
sfo_survey %>%
  filter(str_detect(phone, "-"))
# Filter for rows with "(" or ")" in the phone column
sfo_survey %>%
  filter(str_detect(phone, fixed("(")) | str_detect(phone, fixed(")")))
Now that you’ve identified the inconsistencies in the phone
column, it’s time to remove unnecessary characters to make the follow-up survey go as smoothly as possible.
5.2.10 Replacing and removing
In the last exercise, you saw that the phone column of sfo_survey is plagued with unnecessary parentheses and hyphens. The customer support team has requested that all phone numbers be in the format “123 456 7890”. In this exercise, you’ll use your new stringr skills to fulfill this request.
# Remove parentheses from phone column
phone_no_parens <- sfo_survey$phone %>%
  # Remove "("s
str_remove_all(fixed("(")) %>%
# Remove ")"s
str_remove_all(fixed(")"))
# Add phone_no_parens as column
sfo_survey %>%
  mutate(phone_no_parens = phone_no_parens)
# Add phone_no_parens as column
sfo_survey %>%
  mutate(phone_no_parens = phone_no_parens,
# Replace all hyphens in phone_no_parens with spaces
phone_clean = str_replace_all(phone_no_parens, "-", " "))
Now that your phone numbers are all in a single format, the machines in the call center will be able to auto-dial the numbers, making it easier to ask participants follow-up questions.
5.2.11 Invalid phone numbers
The customer support team is grateful for your work so far, but during their first day of calling participants, they ran into some phone numbers that were invalid. In this exercise, you’ll remove any rows with invalid phone numbers so that these faulty numbers don’t keep slowing the team down.
# Check out the invalid numbers
sfo_survey %>%
  filter(str_length(phone) != 12)
# Remove rows with invalid numbers
sfo_survey %>%
  filter(str_length(phone) == 12)
Thanks to your savvy string skills, the follow-up survey will be done in no time!
5.3 3: Advanced Data Problems
In this chapter, you’ll dive into more advanced data cleaning problems, such as ensuring that weights are all written in kilograms instead of pounds. You’ll also gain invaluable skills that will help you verify that values have been added correctly and that missing values don’t negatively impact your analyses.
5.3.1 Date uniformity
In this chapter, you work at an asset management company and you’ll be working with the accounts dataset, which contains information about each customer, the amount in their account, and the date their account was opened. Your boss has asked you to calculate some summary statistics about the average value of each account and whether the age of the account is associated with a higher or lower account value. Before you can do this, you need to make sure that the accounts dataset you’ve been given doesn’t contain any uniformity problems. In this exercise, you’ll investigate the date_opened column and clean it up so that all the dates are in the same format.
# Check out the accounts data frame
head(accounts)
# Define the date formats
<- c("%Y-%m-%d", "%B %d, %Y")
formats
# Convert dates to the same format
accounts %>%
  mutate(date_opened_clean = parse_date_time(date_opened, orders = formats))
Now that the date_opened dates are in the same format, you’ll be able to use them for some plotting in the next exercise.
5.3.2 Currency uniformity
Now that your dates are in order, you’ll need to correct any unit differences. When you first plot the data, you’ll notice that there’s a group of very high values and a group of relatively lower values. The bank has two different offices, one in New York and one in Tokyo, so you suspect that the accounts managed by the Tokyo office are in Japanese yen instead of U.S. dollars. Luckily, you have a data frame called account_offices that indicates which office manages each customer’s account, so you can use this information to figure out which totals need to be converted from yen to dollars.
The formula to convert yen to dollars is USD = JPY / 104.
# Scatter plot of opening date and total amount
accounts %>%
  ggplot(aes(x = date_opened, y = total)) +
geom_point()
# Left join accounts and account_offices by id
accounts %>%
  left_join(account_offices, by = "id") %>%
# Convert totals from the Tokyo office to USD
mutate(total_usd = ifelse(office == "Tokyo", total / 104, total)) %>%
# Scatter plot of opening date vs total_usd
ggplot(aes(x = date_opened, y = total_usd)) +
geom_point()
The points in your last scatter plot all fall within a much smaller range now and you’ll be able to accurately assess the differences between accounts from different countries.
5.3.3 Video: Cross field validation
- Cross field validation
- Does this value make sense based on other values?
Link: https://www.buzzfeednews.com/article/katienotopoulos/graphs-that-lied-to-us
(Note: to impute a value means to fill in a missing or inconsistent value by inference from the other values in the dataset, e.g. replacing it with an average.)
5.3.4 Validating totals
In this lesson, you’ll continue to work with the accounts
data frame, but this time, you have a bit more information about each account. There are three different funds that account holders can store their money in. In this exercise, you’ll validate whether the total
amount in each account is equal to the sum of the amount in fund_A
, fund_B
, and fund_C.
If there are any accounts that don’t match up, you can look into them further to see what went wrong in the bookkeeping that led to inconsistencies.
# Find invalid totals
accounts %>%
  # theoretical_total: sum of the three funds
mutate(theoretical_total = fund_A + fund_B + fund_C) %>%
# Find accounts where total doesn't match theoretical_total
filter(theoretical_total != total)
By using cross field validation, you’ve been able to detect values that don’t make sense. How you choose to handle these values will depend on the dataset.
5.3.5 Validating age
Now that you found some inconsistencies in the total amounts, you’re suspicious that there may also be inconsistencies in the acct_age column, and you want to see if these inconsistencies are related. Using the skills you learned from the video exercise, you’ll need to validate the age of each account and see if rows with inconsistent acct_age values are the same ones that had inconsistent totals.
# Find invalid acct_age
accounts %>%
  # theoretical_age: age of acct based on date_opened
mutate(theoretical_age = floor(as.numeric(date_opened %--% today(), "years"))) %>%
# Filter for rows where acct_age is different from theoretical_age
filter(theoretical_age != acct_age)
There are three accounts that all have ages off by one year, but none of them are the same as the accounts that had total inconsistencies, so it looks like these two bookkeeping errors may not be related.
5.3.6 Video: Completeness
5.3.7 Question: Types of missingness
You just learned about the three flavors of missing data: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). In this exercise, you’ll solidify your new knowledge by mapping examples to the types of missingness.
5.3.8 Visualizing missing data
Dealing with missing data is one of the most common tasks in data science. There are a variety of types of missingness, as well as a variety of types of solutions to missing data.
You just received a new version of the accounts data frame containing data on the amount held and amount invested for new and existing customers. However, there are rows with missing inv_amount values.
You know for a fact that most customers below 25 do not have investment accounts yet, and suspect it could be driving the missingness.
# Visualize the missing values by column (vis_miss() comes from the visdat package)
library(visdat)
vis_miss(accounts)
accounts %>%
  # missing_inv: Is inv_amount missing?
mutate(missing_inv = is.na(inv_amount)) %>%
# Group by missing_inv
group_by(missing_inv) %>%
# Calculate mean age for each missing_inv group
summarize(avg_age = mean(age, na.rm=TRUE))
Since the average age for missing_inv = TRUE is 22 and the average age for missing_inv = FALSE is 44, it is likely that the inv_amount variable is missing mostly in young customers.
# Sort by age and visualize missing vals
accounts %>%
  arrange(age) %>%
vis_miss()
Investigating summary statistics based on missingness is a great way to determine if data is missing completely at random or missing at random.
5.3.9 Treating missing data
In this exercise, you’re working with another version of the accounts
data that contains missing values for both the cust_id
and acct_amount
columns.
You want to figure out how many unique customers the bank has, as well as the average amount held by customers. You know that rows with missing cust_id
don’t really help you, and that on average, the acct_amount
is usually 5 times the amount of inv_amount.
In this exercise, you will drop rows of accounts
with missing cust_ids
, and impute missing values of inv_amount
with some domain knowledge.
We’ll need to install and load the assertive
package.
# Create accounts_clean
accounts_clean <- accounts %>%
  # Filter to remove rows with missing cust_id
filter(!is.na(cust_id)) %>%
# Add new col acct_amount_filled with replaced NAs
mutate(acct_amount_filled = ifelse(is.na(acct_amount), inv_amount * 5, acct_amount))
# Assert that cust_id has no missing vals
assert_all_are_not_na(accounts_clean$cust_id)
# or
sum(is.na(accounts_clean$cust_id))
# Assert that acct_amount_filled has no missing vals
assert_all_are_not_na(accounts_clean$acct_amount_filled)
Since your assertions passed, there’s no missing data left, and you can definitely bank on nailing your analysis!
5.4 4: Record Linkage
Record linkage is a powerful technique used to merge multiple datasets together, used when values have typos or different spellings. In this chapter, you’ll learn how to link records by calculating the similarity between strings—you’ll then use your new skills to join two restaurant review datasets into one clean master dataset.
5.4.1 Calculating distance
In the video exercise, you saw how to use Damerau-Levenshtein distance to identify how similar two strings are. As a reminder, Damerau-Levenshtein distance is the minimum number of steps needed to get from String A to String B, using these operations:
- Insertion of a new character.
- Deletion of an existing character.
- Substitution of an existing character.
- Transposition of two existing consecutive characters.
Substituting and inserting is the best way to get from "puffin" to "muffins". In the next exercise, you’ll calculate string distances using R functions.
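As a quick sanity check, here is a minimal sketch using the stringdist package (loaded in the next exercise): one substitution plus one insertion gives a distance of 2.
library(stringdist)

stringdist("puffin", "muffins", method = "dl")  # 2: substitute "p" -> "m", insert "s"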
5.4.2 Small distance, small difference
In the video exercise, you learned that there are multiple ways to calculate how similar or different two strings are. Now you’ll practice using the stringdist package to compute string distances using various methods. It’s important to be familiar with different methods, as some methods work better on certain datasets, while others work better on other datasets.
library(stringdist)
##
## Attaching package: 'stringdist'
## The following object is masked from 'package:tidyr':
##
## extract
# Calculate Damerau-Levenshtein distance
stringdist("las angelos", "los angeles", method = "dl")
## [1] 2
# Calculate LCS distance
stringdist("las angelos", "los angeles", method = "lcs")
## [1] 4
The Jaccard method:
1. Count the total number of distinct characters that appear in either string.
2. Count the number of those characters that appear in both strings.
3. Divide the latter by the former to get the Jaccard similarity; the Jaccard distance reported by stringdist() is 1 minus this similarity.
# Calculate Jaccard distance
stringdist("las angelos", "los angeles", method = "jaccard")
## [1] 0
As there are no elements in the first string that aren’t in the second string, and vice versa, the Jaccard method finds no difference between the two strings. In this way, the Jaccard method treats the strings more as sets of letters rather than sequences of letters.
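You can verify this by hand; the sketch below assumes stringdist()’s default of comparing single characters (q = 1):
# Distinct characters in each string
a <- unique(strsplit("las angelos", "")[[1]])
b <- unique(strsplit("los angeles", "")[[1]])

length(intersect(a, b)) / length(union(a, b))      # Jaccard similarity: 1
1 - length(intersect(a, b)) / length(union(a, b))  # Jaccard distance: 0, matching stringdist()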
In the next exercise, you’ll use Damerau-Levenshtein distance to map typo-ridden cities to their true spellings.
5.4.3 Fixing typos with string distance
In this chapter, one of the datasets you’ll be working with, zagat, is a set of restaurants in New York, Los Angeles, Atlanta, San Francisco, and Las Vegas. The data is from Zagat, a company that collects restaurant reviews, and includes the restaurant names, addresses, phone numbers, as well as other restaurant information.
The city column contains the name of the city that the restaurant is located in. However, there are a number of typos throughout the column. Your task is to map each city to one of the five correctly-spelled cities contained in the cities data frame.
We’ll need to install and load fuzzyjoin
package.
library(fuzzyjoin)
# Count the number of each city variation
zagat %>%
  count(city)
# Join zagat and cities and look at results
zagat %>%
  # Left join based on stringdist using city and city_actual cols
stringdist_left_join(cities, by = c("city" = "city_actual")) %>%
# Select the name, city, and city_actual cols
select(name, city, city_actual)
Now that you’ve created consistent spelling for each city, it will be much easier to compute summary statistics by city.
5.4.4 Video: Generating and comparing pairs
5.4.5 Exercise: Link or join?
Similar to joins, record linkage is the act of linking data from different sources regarding the same entity. But unlike joins, record linkage does not require exact matches between different pairs of data, and instead can find close matches using string similarity. This is why record linkage is effective when there is no common unique key, such as a unique identifier, that you can rely on to join the data sources.
Don’t make things more complicated than they need to be: record linkage is a powerful tool, but it’s more complex than using a traditional join.
5.4.6 Pair blocking
Zagat and Fodor’s are both companies that gather restaurant reviews. The zagat
and fodors
datasets both contain information about various restaurants, including addresses, phone numbers, and cuisine types. Some restaurants appear in both datasets, but don’t necessarily have the same exact name or phone number written down. In this chapter, you’ll work towards figuring out which restaurants appear in both datasets.
The first step towards this goal is to generate pairs of records so that you can compare them. In this exercise, you’ll first generate all possible pairs, and then use your newly-cleaned city column as a blocking variable.
# Load reclin
library(reclin)
# Generate all possible pairs
pair_blocking(zagat, fodors)
# Generate all possible pairs
pair_blocking(zagat, fodors, blocking_var = "city")
By using city as a blocking variable, you were able to reduce the number of pairs you’ll need to compare from 165,230 pairs to 40,532.
5.4.7 Comparing pairs
Now that you’ve generated the pairs of restaurants, it’s time to compare them. You can easily customize how you perform your comparisons using the by and default_comparator arguments. There’s no right answer as to what each should be set to, so in this exercise, you’ll try a couple options out.
# Generate pairs
pair_blocking(zagat, fodors, blocking_var = "city") %>%
# Compare pairs by name using lcs()
compare_pairs(by ="name",
default_comparator = lcs())
# Generate pairs
pair_blocking(zagat, fodors, blocking_var = "city") %>%
# Compare pairs by name, phone, and addr using jaro_winkler()
compare_pairs(by = c("name", "phone", "addr"),
default_comparator = jaro_winkler())
Choosing a comparator and the columns to compare is highly dataset-dependent, so it’s best to try out different combinations to see which works best on the dataset you’re working with. Next, you’ll build on your string comparison skills and learn about record linkage!
5.4.8 Video: Scoring and linking
5.4.9 Score then select or select then score?
Record linkage requires a number of steps that can be difficult to keep straight. In this exercise, you’ll solidify your knowledge of the record linkage process so that it’s a breeze when you code it yourself!
5.4.10 Putting it together
During this chapter, you’ve cleaned up the city
column of zagat
using string similarity, as well as generated and compared pairs of restaurants from zagat
and fodors.
The end is near - all that’s left to do is score and select pairs and link the data together, and you’ll be able to begin your analysis in no time!
# Create pairs
pair_blocking(zagat, fodors, blocking_var = "city") %>%
# Compare pairs
compare_pairs(by = "name", default_comparator = jaro_winkler()) %>%
# Score pairs
score_problink() %>%
# Select pairs
select_n_to_m() %>%
# Link data
link()
Now that your two datasets are merged, you can use the data to figure out if there are certain characteristics that make a restaurant more likely to be reviewed by Zagat or Fodor’s.