Module 5 Cleaning Data in R

It’s commonly said that data scientists spend 80% of their time cleaning and manipulating data and only 20% of their time analyzing it. The time spent cleaning is vital since analyzing dirty data can lead you to draw inaccurate conclusions.

In this course, you’ll learn how to clean dirty data. Using R, you’ll learn how to identify values that don’t look right and fix dirty data by converting data types, filling in missing values, and using fuzzy string matching. As you learn, you’ll brush up on your skills by working with real-world datasets, including bike-share trips, customer asset portfolios, and restaurant reviews—developing the skills you need to go from raw data to awesome insights as quickly and accurately as possible!

5.1 1: Common Data Problems

In this chapter, you’ll learn how to overcome some of the most common dirty data problems. You’ll convert data types, apply range constraints to remove future data points, and remove duplicated data points to avoid double-counting.

5.1.1 Video: Data type constraints

Errors can be introduced by typos and misspellings. Dirty data can enter the data science workflow before we even access the data, and if we don’t address these errors early on, they can follow us all the way through the workflow.

Errors appearing throughout the data science workflow. Source: DataCamp

A quick reminder of data type constraints:

Data types vs. data types in R. Source: DataCamp
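A minimal, hedged sketch (with made-up values) of how type constraints show up in R:

# Values that look numeric or date-like may still be stored as character
class("2020-07-01")    # "character" - looks like a date, but isn't one yet
class(TRUE)            # "logical"
class(3.14)            # "numeric"

# Commas force numbers to be read in as character
revenue <- c("1,400", "2,100")
# mean(revenue) returns NA with a warning; strip the comma, then convert
mean(as.numeric(gsub(",", "", revenue)))
## [1] 1750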

5.1.2 Question: Common data types

Solution. Source: DataCamp

Correctly identifying the type of your data is one of the easiest ways to keep data type constraints from hampering your analysis later on.

5.1.3 Converting data types

Throughout this chapter, you’ll be working with San Francisco bike share ride data called bike_share_rides. It contains information on start and end stations of each trip, the trip duration, and some user information.

Before beginning to analyze any dataset, it’s important to take a look at the different types of columns you’ll be working with, which you can do using glimpse().

In this exercise, you’ll take a look at the data types contained in bike_share_rides and see how an incorrect data type can compromise your analysis.

dplyr was previously loaded, so we just load assertive. The bike_share_rides dataset is not available locally, so code that requires it is shown but not run.

# Load assertive package
library(assertive)
# Glimpse at bike_share_rides
glimpse(bike_share_rides)

# Summary of user_birth_year
summary(bike_share_rides$user_birth_year)

Question

The summary statistics of user_birth_year don’t seem to offer much useful information about the different birth years in our dataset. Why do you think that is?

  • The user_birth_year column is not of the correct type and should be converted to a character.

  • The user_birth_year column has an infinite set of possible values and should be converted to a factor.

  • The user_birth_year column represents groupings of data and should be converted to a factor. [Correct]

# Convert user_birth_year to factor: user_birth_year_fct
bike_share_rides <- bike_share_rides %>%
  mutate(user_birth_year_fct = as.factor(user_birth_year))

# Assert user_birth_year_fct is a factor
assert_is_factor(bike_share_rides$user_birth_year_fct)

# Summary of user_birth_year_fct
summary(bike_share_rides$user_birth_year_fct)

Looking at the new summary statistics, more riders were born in 1988 than any other year.

5.1.4 Trimming strings

In the previous exercise, you were able to identify the correct data type and convert user_birth_year to the correct type, allowing you to extract counts that gave you a bit more insight into the dataset.

Another common dirty data problem is having extra bits like percent signs or periods in numbers, causing them to be read in as characters. In order to be able to crunch these numbers, the extra bits need to be removed and the numbers need to be converted from character to numeric. In this exercise, you’ll need to convert the duration column from character to numeric, but before this can happen, the word "minutes" needs to be removed from each value.

# Load stringr package
library(stringr)
## Warning: package 'stringr' was built under R version 3.5.3
bike_share_rides <- bike_share_rides %>%
  # Remove 'minutes' from duration: duration_trimmed
  mutate(duration_trimmed = str_remove(duration, "minutes"),
         # Convert duration_trimmed to numeric: duration_mins
         duration_mins = as.numeric(duration_trimmed))

# Glimpse at bike_share_rides
glimpse(bike_share_rides)

# Assert duration_mins is numeric
assert_is_numeric(bike_share_rides$duration_mins)

# Calculate mean duration
mean(bike_share_rides$duration_mins)

By removing characters and converting to a numeric type, you were able to figure out that the average ride duration is about 13 minutes - not bad for a city like San Francisco!

5.1.5 Video: Range constraints

What to do about values outside of a variable’s allowed range? (The options below are sketched in code after the list.)

  • We could remove these rows from the data frame, but this should only be done when a small proportion of the values are out of range; otherwise we would significantly increase the amount of bias in our dataset.
  • Treat these values as missing (NA).
  • Replace them with the range limit, e.g. replace a rating of 6 on a 1 to 5 rating scale with 5.
  • Replace with other value based on domain knowledge and/or knowledge of dataset, e.g. replace with average rating.
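A minimal sketch of these options, using a hypothetical vector of ratings on a 1 to 5 scale:

ratings <- c(4, 5, 6, 3, 8, 2)   # hypothetical data: 6 and 8 are out of range

# Remove the out-of-range values entirely (only if few values are affected)
ratings[ratings <= 5]

# Treat them as missing
replace(ratings, ratings > 5, NA)

# Replace them with the range limit
pmin(ratings, 5)

# Replace them with a value based on domain knowledge, e.g. the mean of the valid ratings
replace(ratings, ratings > 5, mean(ratings[ratings <= 5]))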

5.1.6 Ride duration constraints

Values that are out of range can throw off an analysis, so it’s important to catch them early on. In this exercise, you’ll be examining the duration_min column more closely. Bikes are not allowed to be kept out for more than 24 hours, or 1440 minutes at a time, but issues with some of the bikes caused inaccurate recording of the time they were returned.

In this exercise, you’ll replace erroneous data with the range limit (1440 minutes); however, you could just as easily replace these values with NAs.

# Create breaks
breaks <- c(min(bike_share_rides$duration_min), 0, 1440, max(bike_share_rides$duration_min))

# Create a histogram of duration_min
ggplot(bike_share_rides, aes(duration_min)) +
  geom_histogram(breaks = breaks)

# duration_min_const: replace vals of duration_min > 1440 with 1440
bike_share_rides <- bike_share_rides %>%
  mutate(duration_min_const = replace(duration_min, duration_min > 1440, 1440))

# Make sure all values of duration_min_const are between 0 and 1440
assert_all_are_in_closed_range(bike_share_rides$duration_min_const, lower = 0, upper = 1440)

The method of replacing erroneous data with the range limit works well, but you could just as easily replace these values with NAs or something else instead.

5.1.7 Back to the future

Something has gone wrong and it looks like you have data with dates from the future, which is way outside of the date range you expected to be working with. To fix this, you’ll need to remove any rides from the dataset that have a date in the future. Before you can do this, the date column needs to be converted from a character to a Date. Having these as Date objects will make it much easier to figure out which rides are from the future, since R makes it easy to check if one Date object is before (<) or after (>) another.

library(lubridate)   # for today()

# Convert date to Date type
bike_share_rides <- bike_share_rides %>%
  mutate(date = as.Date(date))

# Make sure all dates are in the past
assert_all_are_in_past(bike_share_rides$date)

# Filter for rides that occurred before or on today's date
bike_share_rides_past <- bike_share_rides %>%
  filter(date <= today())

# Make sure all dates from bike_share_rides_past are in the past
assert_all_are_in_past(bike_share_rides_past$date)

Handling data from the future like this is much easier than trying to verify the data’s correctness by time traveling.

5.1.8 Video: Uniqueness constraints

Duplicate entries. Source: DataCamp

Source of duplicates. Source: DataCamp

  • Full duplicates: Two rows have the exact same entries in every column.
  • Partial duplicates: Two rows have the exact same entries in some columns.
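A small sketch with a hypothetical data frame that shows the difference between the two:

library(dplyr)

rides <- tibble::tribble(
  ~ride_id, ~duration,
  1,        10,
  1,        10,    # full duplicate of the row above (every column matches)
  2,        15,
  2,        16     # partial duplicate: same ride_id, different duration
)

sum(duplicated(rides))                       # 1: counts only the full duplicate
distinct(rides)                              # drops the full duplicate
distinct(rides, ride_id, .keep_all = TRUE)   # also drops the partial duplicate, keeping the first row per ride_id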

5.1.9 Full duplicates

You’ve been notified that an update has been made to the bike sharing data pipeline to make it more efficient, but that duplicates are more likely to be generated as a result. To make sure that you can continue using the same scripts to run your weekly analyses about ride statistics, you’ll need to ensure that any duplicates in the dataset are removed first.

When multiple rows of a data frame share the same values for all columns, they’re full duplicates of each other. Removing duplicates like this is important, since having the same value repeated multiple times can alter summary statistics like the mean and median. Each ride, including its ride_id, should be unique.

# Count the number of full duplicates
sum(duplicated(bike_share_rides))

# Remove duplicates
bike_share_rides_unique <- distinct(bike_share_rides)

# Count the full duplicates in bike_share_rides_unique
sum(duplicated(bike_share_rides_unique))

Removing full duplicates will ensure that summary statistics aren’t altered by repeated data points.

5.1.10 Removing partial duplicates

Now that you’ve identified and removed the full duplicates, it’s time to check for partial duplicates. Partial duplicates are a bit trickier to deal with than full duplicates. In this exercise, you’ll first identify any partial duplicates and then practice the most common technique to deal with them, which involves dropping all partial duplicates, keeping only the first.

# Find duplicated ride_ids
bike_share_rides %>% 
  # Count the number of occurrences of each ride_id
  count(ride_id) %>% 
  # Filter for rows with a count > 1
  filter(n > 1)

# Remove full and partial duplicates
bike_share_rides_unique <- bike_share_rides %>%
  # Only based on ride_id instead of all cols
  distinct(ride_id, .keep_all = TRUE)

# Find duplicated ride_ids in bike_share_rides_unique
bike_share_rides_unique %>%
  # Count the number of occurrences of each ride_id
  count(ride_id) %>%
  # Filter for rows with a count > 1
  filter(n > 1)

It’s important to consider the data you’re working with before removing partial duplicates, since sometimes it’s expected that there will be partial duplicates in a dataset, such as if the same customer makes multiple purchases.

5.1.11 Aggregating partial duplicates

Another way of handling partial duplicates is to compute a summary statistic of the values that differ between partial duplicates, such as the mean, median, maximum, or minimum. This can come in handy when you’re not sure how your data was collected and want an average, or if, based on domain knowledge, you’d rather err on the side of too high an estimate than too low (or vice versa).

bike_share_rides %>%
  # Group by ride_id and date
  group_by(ride_id, date) %>%
  # Add duration_min_avg column
  mutate(duration_min_avg = mean(duration_min)) %>%
  # Remove duplicates based on ride_id and date, keep all cols
  distinct(ride_id, date, .keep_all = TRUE) %>%
  # Remove duration_min column
  select(-duration_min)

Aggregation of partial duplicates allows you to keep some information about all data points instead of keeping information about just one data point.


5.2 2: Categorical and Text Data

Categorical and text data can often be some of the messiest parts of a dataset due to their unstructured nature. In this chapter, you’ll learn how to fix whitespace and capitalization inconsistencies in category labels, collapse multiple categories into one, and reformat strings for consistency.

5.2.1 Video: Checking membership

In R, categorical data are stored as factors. Internally, a factor is stored as integers, and each integer has a corresponding label, called a level.
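A tiny sketch of a factor’s internals, using a made-up vector:

sizes <- factor(c("Small", "Large", "Small", "Medium"))

levels(sizes)       # the labels: "Large" "Medium" "Small"
as.integer(sizes)   # the underlying integer codes: 3 1 3 2
nlevels(sizes)      # the number of levels: 3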

Factors. Source: DataCamp

  • How many levels do the categorical variables above have?

Values that don’t belong. Source: DataCamp

How do we end up with values outside those allowed by our factors?

Source: DataCamp

Filtering joins are a type of join that keeps or removes observations from the first table, but doesn’t add any new columns.

Semi-joins vs. anti-joins. Source: DataCamp
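A minimal sketch of the two filtering joins, using hypothetical data frames:

library(dplyr)

ratings <- tibble::tibble(rating = c("good", "ok", "awesome"))
allowed <- tibble::tibble(rating = c("good", "ok", "bad"))

# semi_join(): keep rows of ratings whose value appears in allowed
semi_join(ratings, allowed, by = "rating")   # good, ok

# anti_join(): keep rows of ratings whose value does NOT appear in allowed
anti_join(ratings, allowed, by = "rating")   # awesome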

Values that aren’t factors. Source: DataCamp

5.2.2 Question: Members only

So far in the course, you’ve learned about a number of different problems you can run into when you have dirty data, including

  • data type constraints,
  • range constraints,
  • uniqueness constraints, and
  • membership constraints.

It’s important to be able to correctly identify the type of problem you’re dealing with so that you can treat it correctly. In this exercise, you’ll practice identifying these problems by mapping dirty data scenarios to their constraint type.

Solution. Source: DataCamp

Being able to identify what kinds of errors are in your data is important so that you know how to go about fixing them.

5.2.3 Not a member

Now that you’ve practiced identifying membership constraint problems, it’s time to fix these problems in a new dataset. Throughout this chapter, you’ll be working with a dataset called sfo_survey, containing survey responses from passengers taking flights from San Francisco International Airport (SFO). Participants were asked questions about the airport’s cleanliness, wait times, safety, and their overall satisfaction.

There were a few issues during data collection that resulted in some inconsistencies in the dataset. In this exercise, you’ll be working with the dest_size column, which categorizes the size of the destination airport that the passengers were flying to. A data frame called dest_sizes is available that contains all the possible destination sizes. Your mission is to find rows with invalid dest_sizes and remove them from the data frame.

# Count the number of occurrences of dest_size
sfo_survey %>%
  count(dest_size)

# Find bad dest_size rows
sfo_survey %>% 
  # Join with dest_sizes data frame to get bad dest_size rows
  anti_join(dest_sizes, by = "dest_size") %>%
  # Select id, airline, destination, and dest_size cols
  select(id, airline, destination, dest_size)

# Remove bad dest_size rows
sfo_survey %>% 
  # Join with dest_sizes
  semi_join(dest_sizes, by = "dest_size") %>%
  # Count the number of each dest_size
  count(dest_size)

Anti-joins can help you identify the rows that are causing issues, and semi-joins can remove the issue-causing rows. In the next lesson, you’ll learn about other ways to deal with bad values so that you don’t have to lose rows of data.

5.2.4 Video: Categorical data problems

Categorical data problems Source: DataCamp

5.2.5 Identifying inconsistency

In the video exercise, you learned about different kinds of inconsistencies that can occur within categories, making it look like a variable has more categories than it should.

In this exercise, you’ll continue working with the sfo_survey dataset. You’ll examine the dest_size column again as well as the cleanliness column and determine what kind of issues, if any, these two categorical variables face.

# Count dest_size
sfo_survey %>%
  count(dest_size)

# Count cleanliness
sfo_survey %>%
  count(cleanliness)

In the next exercise, you’ll fix these inconsistencies to get more accurate counts.

5.2.6 Correcting inconsistency

Now that you’ve identified that dest_size has whitespace inconsistencies and cleanliness has capitalization inconsistencies, you’ll use the new tools at your disposal to fix the inconsistent values in sfo_survey instead of removing the data points entirely, which could add bias to your dataset if more than 5% of the data points need to be dropped.

# Add new columns to sfo_survey
sfo_survey <- sfo_survey %>%
  # dest_size_trimmed: dest_size without whitespace
  mutate(dest_size_trimmed = str_trim(dest_size),
         # cleanliness_lower: cleanliness converted to lowercase
         cleanliness_lower = str_to_lower(cleanliness))

# Count values of dest_size_trimmed
sfo_survey %>%
  count(dest_size_trimmed)

# Count values of cleanliness_lower
sfo_survey %>%
  count(cleanliness_lower)

You were able to convert seven-category data into four-category data, which will help your analysis go more smoothly.

5.2.7 Collapsing categories

One of the tablets that participants used to fill out the sfo_survey was not properly configured, allowing the response for dest_region to be entered as free text instead of being selected from a dropdown menu. This resulted in some inconsistencies in the dest_region variable that you’ll need to correct in this exercise to ensure that the numbers you report to your boss are as accurate as possible.

# Count categories of dest_region
sfo_survey %>%
  count(dest_region)

library(forcats)   # for fct_collapse()

# Categories to map to Europe
europe_categories <- c("EU", "eur", "Europ")

# Add a new col dest_region_collapsed
sfo_survey %>%
  # Map all categories in europe_categories to Europe
  mutate(dest_region_collapsed = fct_collapse(dest_region, 
                                     Europe = europe_categories)) %>%
  # Count categories of dest_region_collapsed
  count(dest_region_collapsed)

You’ve reduced the number of categories from 12 to 9, and you can now be confident that 401 of the survey participants were heading to Europe.

5.2.8 Video: Cleaning text data

Text data. Source: DataCamp

Unstructured data problems. Source: DataCamp

More complex text problems. Source: DataCamp

5.2.9 Detecting inconsistent text data

You’ve recently received some news that the customer support team wants to ask the SFO survey participants some follow-up questions. However, the auto-dialer that the call center uses isn’t able to parse all of the phone numbers since they’re all in different formats. After some investigation, you found that some phone numbers are written with hyphens (-) and some are written with parentheses ("(" and ")"). In this exercise, you’ll figure out which phone numbers have these issues so that you know which ones need fixing.

# Filter for rows with "-" in the phone column
sfo_survey %>%
  filter(str_detect(phone, "-"))

# Filter for rows with "(" or ")" in the phone column
sfo_survey %>%
  filter(str_detect(phone, fixed("(")) | str_detect(phone, fixed(")")))
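As an aside, "(" is a special character in regular expressions, which is why fixed() is used above to match it literally. A tiny illustration with a made-up number:

str_detect("(123) 456-7890", fixed("("))   # TRUE: fixed() matches the literal parenthesis
str_detect("(123) 456-7890", "\\(")        # TRUE: escaping the parenthesis also works
# str_detect("(123) 456-7890", "(")        # error: "(" opens an (unclosed) regex group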

Now that you’ve identified the inconsistencies in the phone column, it’s time to remove unnecessary characters to make the follow-up survey go as smoothly as possible.

5.2.10 Replacing and removing

In the last exercise, you saw that the phone column of sfo_survey is plagued with unnecessary parentheses and hyphens. The customer support team has requested that all phone numbers be in the format “123 456 7890”. In this exercise, you’ll use your new stringr skills to fulfill this request.

# Remove parentheses from phone column
phone_no_parens <- sfo_survey$phone %>%
  # Remove "("s
  str_remove_all(fixed("(")) %>%
  # Remove ")"s
  str_remove_all(fixed(")"))

# Add phone_no_parens as column
sfo_survey %>%
  mutate(phone_no_parens = phone_no_parens)

# Add phone_no_parens as column
sfo_survey %>%
  mutate(phone_no_parens = phone_no_parens,
  # Replace all hyphens in phone_no_parens with spaces
         phone_clean = str_replace_all(phone_no_parens, "-", " "))

Now that your phone numbers are all in a single format, the machines in the call center will be able to auto-dial the numbers, making it easier to ask participants follow-up questions.

5.2.11 Invalid phone numbers

The customer support team is grateful for your work so far, but during their first day of calling participants, they ran into some phone numbers that were invalid. In this exercise, you’ll remove any rows with invalid phone numbers so that these faulty numbers don’t keep slowing the team down.

# Check out the invalid numbers
sfo_survey %>%
  filter(str_length(phone) != 12)

# Remove rows with invalid numbers
sfo_survey %>%
  filter(str_length(phone) == 12)

Thanks to your savvy string skills, the follow-up survey will be done in no time!


5.3 3: Advanced Data Problems

In this chapter, you’ll dive into more advanced data cleaning problems, such as ensuring that weights are all written in kilograms instead of pounds. You’ll also gain invaluable skills that will help you verify that values have been added correctly and that missing values don’t negatively impact your analyses.

Uniformity. Source: DataCamp

Sources of uniformity issues. Source: DataCamp

What to do about a lack of uniformity? Source: DataCamp

Date uniformity. Source: DataCamp

Ambiguous dates. Source: DataCamp
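A small sketch of why ambiguous dates are dangerous: the same string parses to two different dates depending on the assumed order (using lubridate and a made-up date):

library(lubridate)

# Is "02-04-2021" the 2nd of April or the 4th of February?
dmy("02-04-2021")   # "2021-04-02"
mdy("02-04-2021")   # "2021-02-04"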

5.3.1 Date uniformity

In this chapter, you work at an asset management company and you’ll be working with the accounts dataset, which contains information about each customer, the amount in their account, and the date their account was opened. Your boss has asked you to calculate some summary statistics about the average value of each account and whether the age of the account is associated with a higher or lower account value. Before you can do this, you need to make sure that the accounts dataset you’ve been given doesn’t contain any uniformity problems. In this exercise, you’ll investigate the date_opened column and clean it up so that all the dates are in the same format.

# Check out the accounts data frame
head(accounts)

# Define the date formats
formats <- c("%Y-%m-%d", "%B %d, %Y")

# Convert dates to the same format (parse_date_time() is from lubridate)
accounts %>%
  mutate(date_opened_clean = parse_date_time(date_opened, orders = formats))

Now that the date_opened dates are in the same format, you’ll be able to use them for some plotting in the next exercise.

5.3.2 Currency uniformity

Now that your dates are in order, you’ll need to correct any unit differences. When you first plot the data, you’ll notice that there’s a group of very high values, and a group of relatively lower values. The bank has two different offices - one in New York, and one in Tokyo, so you suspect that the accounts managed by the Tokyo office are in Japanese yen instead of U.S. dollars. Luckily, you have a data frame called account_offices that indicates which office manages each customer’s account, so you can use this information to figure out which totals need to be converted from yen to dollars.

The formula to convert yen to dollars is USD = JPY / 104.

# Scatter plot of opening date and total amount
accounts %>%
  ggplot(aes(x = date_opened, y = total)) +
  geom_point()

# Left join accounts and account_offices by id
accounts %>%
  left_join(account_offices, by="id") %>%
  # Convert totals from the Tokyo office to USD
  mutate(total_usd = ifelse(office == "Tokyo", total / 104, total)) %>%
  # Scatter plot of opening date vs total_usd
  ggplot(aes(x = date_opened, y = total_usd)) +
    geom_point()

The points in your last scatter plot all fall within a much smaller range now and you’ll be able to accurately assess the differences between accounts from different countries.

5.3.3 Video: Cross field validation

  • Cross field validation
  • Does this value make sense based on other values?

Link: https://www.buzzfeednews.com/article/katienotopoulos/graphs-that-lied-to-us

What to do with nonsense data. Source: DataCamp

(Note: to impute a value means to fill it in by inference from other information in the dataset, for example replacing a missing or nonsensical value with the column mean or a value derived from other fields.)
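A minimal sketch of one common imputation strategy (mean imputation) on a hypothetical numeric vector:

x <- c(12, 15, NA, 14, NA)

# Replace the missing values with the mean of the observed values
ifelse(is.na(x), mean(x, na.rm = TRUE), x)
## [1] 12.00000 15.00000 13.66667 14.00000 13.66667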

5.3.4 Validating totals

In this lesson, you’ll continue to work with the accounts data frame, but this time, you have a bit more information about each account. There are three different funds that account holders can store their money in. In this exercise, you’ll validate whether the total amount in each account is equal to the sum of the amount in fund_A, fund_B, and fund_C. If there are any accounts that don’t match up, you can look into them further to see what went wrong in the bookkeeping that led to inconsistencies.

# Find invalid totals
accounts %>%
  # theoretical_total: sum of the three funds
  mutate(theoretical_total = fund_A + fund_B + fund_C) %>%
  # Find accounts where total doesn't match theoretical_total
  filter(theoretical_total != total)

By using cross field validation, you’ve been able to detect values that don’t make sense. How you choose to handle these values will depend on the dataset.

5.3.5 Validating age

Now that you’ve found some inconsistencies in the total amounts, you’re suspicious that there may also be inconsistencies in the acct_age column, and you want to see if these inconsistencies are related. Using the skills you learned from the video exercise, you’ll need to validate the age of each account and see if rows with inconsistent acct_ages are the same ones that had inconsistent totals.

# Find invalid acct_age
accounts %>%
  # theoretical_age: age of acct based on date_opened
  mutate(theoretical_age = floor(as.numeric(date_opened %--% today(), "years"))) %>%
  # Filter for rows where acct_age is different from theoretical_age
  filter(theoretical_age != acct_age)

There are three accounts that all have ages off by one year, but none of them are the same as the accounts that had total inconsistencies, so it looks like these two bookkeeping errors may not be related.

5.3.6 Video: Completeness

What is missing data. Source: DataCamp

Types of missingness. Source: DataCamp

Dealing with missing data. Source: DataCamp

5.3.7 Question: Types of missingness

You just learned about the three flavors of missing data: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). In this exercise, you’ll solidify your new knowledge by mapping examples to the types of missingness.

Question types of missingness. Source: DataCamp

5.3.8 Visualizing missing data

Dealing with missing data is one of the most common tasks in data science. There are a variety of types of missingness, as well as a variety of types of solutions to missing data.

You just received a new version of the accounts data frame containing data on the amount held and amount invested for new and existing customers. However, there are rows with missing inv_amount values.

You know for a fact that most customers under 25 do not have investment accounts yet, and you suspect that this could be driving the missingness.

library(visdat)   # for vis_miss()
# Visualize the missing values by column
vis_miss(accounts)

accounts %>%
  # missing_inv: Is inv_amount missing?
  mutate(missing_inv = is.na(inv_amount)) %>%
  # Group by missing_inv
  group_by(missing_inv) %>%
  # Calculate mean age for each missing_inv group
  summarize(avg_age = mean(age, na.rm=TRUE))

Since the average age for missing_inv = TRUE is 22 and the average age for missing_inv = FALSE is 44, it is likely that the inv_amount variable is missing mostly in young customers.

# Sort by age and visualize missing vals
accounts %>%
  arrange(age) %>%
  vis_miss()

Investigating summary statistics based on missingness is a great way to determine if data is missing completely at random or missing at random.

5.3.9 Treating missing data

In this exercise, you’re working with another version of the accounts data that contains missing values for both the cust_id and acct_amount columns.

You want to figure out how many unique customers the bank has, as well as the average amount held by customers. You know that rows with missing cust_id don’t really help you, and that on average, the acct_amount is usually 5 times the amount of inv_amount.

In this exercise, you will drop rows of accounts with missing cust_ids, and impute missing values of inv_amount with some domain knowledge.

We’ll need to install and load the assertive package.

# Create accounts_clean
accounts_clean <- accounts %>%
  # Filter to remove rows with missing cust_id
  filter(!is.na(cust_id)) %>%
  # Add new col acct_amount_filled with replaced NAs
  mutate(acct_amount_filled = ifelse(is.na(acct_amount), inv_amount * 5, acct_amount))

# Assert that cust_id has no missing vals
assert_all_are_not_na(accounts_clean$cust_id)
# or
sum(is.na(accounts_clean$cust_id))

# Assert that acct_amount_filled has no missing vals
assert_all_are_not_na(accounts_clean$acct_amount_filled)

Since your assertions passed, there’s no missing data left, and you can definitely bank on nailing your analysis!


5.4 4: Record Linkage

Record linkage is a powerful technique used to merge multiple datasets together, used when values have typos or different spellings. In this chapter, you’ll learn how to link records by calculating the similarity between strings—you’ll then use your new skills to join two restaurant review datasets into one clean master dataset.

Minimum edit distance. Source: DataCamp

Edit distance = 1. Source: DataCamp

Edit distance = 4. Source: DataCamp

Types of edit distance. Source: DataCamp

Comparing strings to clean data. Source: DataCamp

5.4.1 Calculating distance

In the video exercise, you saw how to use Damerau-Levenshtein distance to identify how similar two strings are. As a reminder, Damerau-Levenshtein distance is the minimum number of steps needed to get from String A to String B, using these operations:

  • Insertion of a new character.
  • Deletion of an existing character.
  • Substitution of an existing character.
  • Transposition of two existing consecutive characters.

Substituting and inserting is the best way to get from "puffin" to "muffins". In the next exercise, you’ll calculate string distances using R functions.
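A quick preview with the stringdist package (introduced properly in the next exercise); the value should be 2, matching one substitution plus one insertion:

library(stringdist)

# "puffin" -> "muffin" (substitute p with m) -> "muffins" (insert the trailing "s")
stringdist("puffin", "muffins", method = "dl")
## [1] 2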

5.4.2 Small distance, small difference

In the video exercise, you learned that there are multiple ways to calculate how similar or different two strings are. Now you’ll practice using the stringdist package to compute string distances using various methods. It’s important to be familiar with different methods, as some methods work better on certain datasets, while others work better on other datasets.

library(stringdist)
## 
## Attaching package: 'stringdist'
## The following object is masked from 'package:tidyr':
## 
##     extract
# Calculate Damerau-Levenshtein distance
stringdist("las angelos", "los angeles", method = "dl")
## [1] 2
# Calculate LCS distance
stringdist("las angelos", "los angeles", method = "lcs")
## [1] 4

The Jaccard method:

  1. Count the total number of distinct elements (by default, single characters) that appear across the two strings.
  2. Count how many of those elements appear in both strings.
  3. The Jaccard distance is 1 minus the ratio of the second count to the first.

# Calculate Jaccard distance
stringdist("las angelos", "los angeles", method = "jaccard")
## [1] 0

As there are no elements in the first string that aren’t in the second string, and vice versa, the Jaccard method finds no difference between the two strings. In this way, the Jaccard method treats the strings more as sets of letters rather than sequences of letters.
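If you want the Jaccard method to look at more than single letters, stringdist’s q argument compares sets of q-grams instead; a hedged sketch (the exact value shown is approximate):

# Compare sets of two-character chunks (bigrams) rather than single letters
stringdist("las angelos", "los angeles", method = "jaccard", q = 2)
# nonzero (roughly 0.33), since the two strings' bigram sets differ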

In the next exercise, you’ll use Damerau-Levenshtein distance to map typo-ridden cities to their true spellings.

5.4.3 Fixing typos with string distance

In this chapter, one of the datasets you’ll be working with, zagat, is a set of restaurants in New York, Los Angeles, Atlanta, San Francisco, and Las Vegas. The data is from Zagat, a company that collects restaurant reviews, and includes the restaurant names, addresses, phone numbers, as well as other restaurant information.

The city column contains the name of the city that the restaurant is located in. However, there are a number of typos throughout the column. Your task is to map each city to one of the five correctly-spelled cities contained in the cities data frame.

We’ll need to install and load the fuzzyjoin package.

library(fuzzyjoin)
# Count the number of each city variation
zagat %>%
  count(city)

# Join zagat and cities and look at results
zagat %>%
  # Left join based on stringdist using city and city_actual cols
  stringdist_left_join(cities, by = c("city" = "city_actual")) %>%
  # Select the name, city, and city_actual cols
  select(name, city, city_actual)

Now that you’ve created consistent spelling for each city, it will be much easier to compute summary statistics by city.

5.4.4 Video: Generating and comparing pairs

When joins won’t work. Source: DataCamp

What is record linkage? Source: DataCamp

5.4.6 Pair blocking

Zagat and Fodor’s are both companies that gather restaurant reviews. The zagat and fodors datasets both contain information about various restaurants, including addresses, phone numbers, and cuisine types. Some restaurants appear in both datasets, but don’t necessarily have the same exact name or phone number written down. In this chapter, you’ll work towards figuring out which restaurants appear in both datasets.

The first step towards this goal is to generate pairs of records so that you can compare them. In this exercise, you’ll first generate all possible pairs, and then use your newly-cleaned city column as a blocking variable.

# Load reclin
library(reclin)

# Generate all possible pairs
pair_blocking(zagat, fodors)

# Generate all possible pairs
pair_blocking(zagat, fodors, blocking_var = "city")

By using city as a blocking variable, you were able to reduce the number of pairs you’ll need to compare from 165,230 pairs to 40,532.

5.4.7 Comparing pairs

Now that you’ve generated the pairs of restaurants, it’s time to compare them. You can easily customize how you perform your comparisons using the by and default_comparator arguments. There’s no right answer as to what each should be set to, so in this exercise, you’ll try a couple options out.

# Generate pairs
pair_blocking(zagat, fodors, blocking_var = "city") %>%
  # Compare pairs by name using lcs()
  compare_pairs(by = "name",
      default_comparator = lcs())

# Generate pairs
pair_blocking(zagat, fodors, blocking_var = "city") %>%
  # Compare pairs by name, phone, and addr using jaro_winkler()
  compare_pairs(by = c("name", "phone", "addr"),
      default_comparator = jaro_winkler())

Choosing a comparator and the columns to compare is highly dataset-dependent, so it’s best to try out different combinations to see which works best on the dataset you’re working with. Next, you’ll build on your string comparison skills and learn about record linkage!

5.4.8 Video: Scoring and linking

5.4.9 Score then select or select then score?

Record linkage requires a number of steps that can be difficult to keep straight. In this exercise, you’ll solidify your knowledge of the record linkage process so that it’s a breeze when you code it yourself!

Question order for record linkage. Source: DataCamp

5.4.10 Putting it together

During this chapter, you’ve cleaned up the city column of zagat using string similarity, as well as generated and compared pairs of restaurants from zagat and fodors. The end is near - all that’s left to do is score and select pairs and link the data together, and you’ll be able to begin your analysis in no time!

# Create pairs
pair_blocking(zagat, fodors, blocking_var = "city") %>%
  # Compare pairs
  compare_pairs(by = "name", default_comparator = jaro_winkler()) %>%
  # Score pairs
  score_problink() %>%
  # Select pairs
  select_n_to_m() %>%
  # Link data 
  link()

Now that your two datasets are merged, you can use the data to figure out if there are certain characteristics that make a restaurant more likely to be reviewed by Zagat or Fodor’s.

5.4.11 Video: Congratulations!

Module 5, chapter 1 summary. Source: DataCamp

Module 5, chapter 2 summary. Source: DataCamp

Module 5, chapter 3 summary. Source: DataCamp

Module 5, chapter 4 summary. Source: DataCamp

Module 5 - further courses. Source: DataCamp