Module 4 Data Manipulation with dplyr
Say you’ve found a great dataset and would like to learn more about it. How can you start to answer the questions you have about the data? You can use dplyr to answer those questions—it can also help with basic transformations of your data. You’ll also learn to aggregate your data and add, remove, or change the variables. Along the way, you’ll explore a dataset containing information about counties in the United States. You’ll finish the course by applying these tools to the babynames dataset to explore trends of baby names in the United States.
4.1 Transforming Data with dplyr
Learn verbs you can use to transform your data, including select, filter, arrange, and mutate. You’ll use these functions to modify the counties dataset to view particular observations and answer questions about the data.
4.1.1 Video: The counties dataset
4.1.2 Question: Understanding your data
4.1.2.1 Load the counties dataset and the dplyr package
counties <- read.csv("data_acs2015_county_data.csv")
library("dplyr")
Take a look at the counties dataset using the glimpse() function.
glimpse(counties)
## Observations: 3,141
## Variables: 40
## $ census_id <int> 1001, 1003, 1005, 1007, 1009, 1011, 10...
## $ state <fct> Alabama, Alabama, Alabama, Alabama, Al...
## $ county <fct> Autauga, Baldwin, Barbour, Bibb, Bloun...
## $ region <fct> South, South, South, South, South, Sou...
## $ metro <fct> , , , , , , , , , , , , , , , , , , , ...
## $ population <int> 55221, 195121, 26932, 22604, 57710, 10...
## $ men <int> 26745, 95314, 14497, 12073, 28512, 566...
## $ women <int> 28476, 99807, 12435, 10531, 29198, 501...
## $ hispanic <dbl> 2.6, 4.5, 4.6, 2.2, 8.6, 4.4, 1.2, 3.5...
## $ white <dbl> 75.8, 83.1, 46.2, 74.5, 87.9, 22.2, 53...
## $ black <dbl> 18.5, 9.5, 46.7, 21.4, 1.5, 70.7, 43.8...
## $ native <dbl> 0.4, 0.6, 0.2, 0.4, 0.3, 1.2, 0.1, 0.2...
## $ asian <dbl> 1.0, 0.7, 0.4, 0.1, 0.1, 0.2, 0.4, 0.9...
## $ pacific <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0...
## $ citizens <int> 40725, 147695, 20714, 17495, 42345, 80...
## $ income <int> 51281, 50254, 32964, 38678, 45813, 319...
## $ income_err <int> 2391, 1263, 2973, 3995, 3141, 5884, 17...
## $ income_per_cap <int> 24974, 27317, 16824, 18431, 20532, 175...
## $ income_per_cap_err <int> 1080, 711, 798, 1618, 708, 2055, 714, ...
## $ poverty <dbl> 12.9, 13.4, 26.7, 16.8, 16.7, 24.6, 25...
## $ child_poverty <dbl> 18.6, 19.2, 45.3, 27.9, 27.2, 38.4, 39...
## $ professional <dbl> 33.2, 33.1, 26.8, 21.5, 28.5, 18.8, 27...
## $ service <dbl> 17.0, 17.7, 16.1, 17.9, 14.1, 15.0, 16...
## $ office <dbl> 24.2, 27.1, 23.1, 17.8, 23.9, 19.7, 21...
## $ construction <dbl> 8.6, 10.8, 10.8, 19.0, 13.5, 20.1, 10....
## $ production <dbl> 17.1, 11.2, 23.1, 23.7, 19.9, 26.4, 23...
## $ drive <dbl> 87.5, 84.7, 83.8, 83.2, 84.9, 74.9, 84...
## $ carpool <dbl> 8.8, 8.8, 10.9, 13.5, 11.2, 14.9, 12.4...
## $ transit <dbl> 0.1, 0.1, 0.4, 0.5, 0.4, 0.7, 0.0, 0.2...
## $ walk <dbl> 0.5, 1.0, 1.8, 0.6, 0.9, 5.0, 0.8, 1.2...
## $ other_transp <dbl> 1.3, 1.4, 1.5, 1.5, 0.4, 1.7, 0.6, 1.2...
## $ work_at_home <dbl> 1.8, 3.9, 1.6, 0.7, 2.3, 2.8, 1.7, 2.7...
## $ mean_commute <dbl> 26.5, 26.4, 24.1, 28.8, 34.9, 27.5, 24...
## $ employed <int> 23986, 85953, 8597, 8294, 22189, 3865,...
## $ private_work <dbl> 73.6, 81.5, 71.8, 76.8, 82.0, 79.5, 77...
## $ public_work <dbl> 20.9, 12.3, 20.8, 16.1, 13.5, 15.1, 16...
## $ self_employed <dbl> 5.5, 5.8, 7.3, 6.7, 4.2, 5.4, 6.2, 5.0...
## $ family_work <dbl> 0.0, 0.4, 0.1, 0.4, 0.4, 0.0, 0.2, 0.1...
## $ unemployment <dbl> 7.6, 7.5, 17.6, 8.3, 7.7, 18.0, 10.9, ...
## $ land_area <dbl> 594.44, 1589.78, 884.88, 622.58, 644.7...
4.1.3 Selecting columns
Select the following four columns from the counties dataset: state, county, population, and poverty.
You don’t need to save the result to a variable.
4.1.3.1 Select the columns
counties %>%
  select(state, county, population, poverty)
Recall that if you want to keep the data you’ve selected, you can use assignment to create a new table.
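For instance, a minimal sketch of saving a selection under a new name (the name counties_small is just for illustration):

```r
# Assigning the result keeps the selected columns for later use;
# printing without assignment only displays them
counties_small <- counties %>%
  select(state, county, population, poverty)

glimpse(counties_small)
```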
4.1.4 Video: The filter and arrange verbs
4.1.5 Arranging observations
Here you see the counties_selected dataset with a few interesting variables selected. These variables (private_work, public_work, and self_employed) describe whether people work for the government, for private companies, or for themselves.
In these exercises, you’ll sort these observations to find the most interesting cases.
counties_selected <- counties %>%
  select(state, county, population, private_work, public_work, self_employed)
4.1.5.1 Add a verb to sort in descending order of public_work
counties_selected %>%
  arrange(desc(public_work))
We sorted the counties in descending order according to public_work. What if we were interested in observations from counties that have a large population, or within a specific state? Let’s take a look at that next!
4.1.6 Filtering for conditions
You use the filter() verb to get only the observations that match a particular condition, or that match multiple conditions.
counties_selected <- counties %>%
  select(state, county, population)
4.1.6.1 Filter for counties with a population above 1000000
counties_selected %>%
  filter(population > 1000000)
4.1.6.2 Filter for counties in the state of California that have a population above 1000000
counties_selected %>%
  filter(state == "California",
         population > 1000000)
## state county population
## 1 California Alameda 1584983
## 2 California Contra Costa 1096068
## 3 California Los Angeles 10038388
## 4 California Orange 3116069
## 5 California Riverside 2298032
## 6 California Sacramento 1465832
## 7 California San Bernardino 2094769
## 8 California San Diego 3223096
## 9 California Santa Clara 1868149
Now you know that there are 9 counties in the state of California with a population greater than one million. In the next exercise, you’ll practice filtering and then sorting a dataset to focus on specific observations!
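filter() conditions aren’t limited to a single state: a sketch of matching several states at once with the %in% operator (the states chosen here are just for illustration):

```r
# %in% matches any value in the vector, so this keeps the
# million-plus-population counties from either state
counties_selected %>%
  filter(state %in% c("California", "Texas"),
         population > 1000000)
```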
4.1.7 Filtering and arranging
We’re often interested in both filtering and sorting a dataset, to focus on observations of particular interest. Here, you’ll find counties that are extreme examples of what fraction of the population works in the private sector.
counties_selected <- counties %>%
  select(state, county, population, private_work, public_work, self_employed)
4.1.7.1 Filter for Texas and more than 10000 people; sort in descending order of private_work
counties_selected %>%
  filter(state == "Texas", population > 10000) %>%
  arrange(desc(private_work))
## state county population private_work public_work
## 1 Texas Gregg 123178 84.7 9.8
## 2 Texas Collin 862215 84.1 10.0
## 3 Texas Dallas 2485003 83.9 9.5
## 4 Texas Harris 4356362 83.4 10.1
## 5 Texas Andrews 16775 83.1 9.6
## 6 Texas Tarrant 1914526 83.1 11.4
## 7 Texas Titus 32553 82.5 10.0
## 8 Texas Denton 731851 82.2 11.9
## 9 Texas Ector 149557 82.0 11.2
## 10 Texas Moore 22281 82.0 11.7
## 11 Texas Jefferson 252872 81.9 13.4
## 12 Texas Fort Bend 658331 81.7 12.5
## 13 Texas Panola 23900 81.5 11.7
## 14 Texas Midland 151290 81.2 11.4
## 15 Texas Potter 122352 81.2 12.4
## 16 Texas Frio 18168 81.1 15.4
## 17 Texas Johnson 155450 80.9 13.3
## 18 Texas Smith 217552 80.9 12.7
## 19 Texas Orange 83217 80.8 13.7
## 20 Texas Harrison 66417 80.6 13.9
## 21 Texas Brazoria 331741 80.5 14.4
## 22 Texas Calhoun 21666 80.5 12.8
## 23 Texas Austin 28886 80.4 11.8
## 24 Texas Montgomery 502586 80.4 11.7
## 25 Texas Jim Wells 41461 80.3 12.6
## 26 Texas Hutchinson 21858 80.2 13.9
## 27 Texas Ochiltree 10642 80.1 10.4
## 28 Texas Victoria 90099 80.1 12.5
## 29 Texas Hardin 55375 80.0 15.7
## 30 Texas Bexar 1825502 79.6 14.8
## 31 Texas Gray 22983 79.6 13.3
## 32 Texas McLennan 241505 79.5 14.7
## 33 Texas Newton 14231 79.4 15.8
## 34 Texas Ward 11225 79.3 15.0
## 35 Texas Grimes 26961 79.2 15.0
## 36 Texas Lamar 49566 79.2 13.9
## 37 Texas Rusk 53457 79.2 12.8
## 38 Texas Wharton 41264 78.9 12.3
## 39 Texas Williamson 473592 78.9 14.9
## 40 Texas Marion 10248 78.7 15.2
## 41 Texas Hood 53171 78.6 11.0
## 42 Texas Hunt 88052 78.6 14.4
## 43 Texas Ellis 157058 78.5 14.2
## 44 Texas Upshur 40096 78.5 13.0
## 45 Texas Grayson 122780 78.4 13.8
## 46 Texas Liberty 77486 78.4 13.6
## 47 Texas Atascosa 47050 78.1 14.3
## 48 Texas Waller 45847 78.1 16.3
## 49 Texas Deaf Smith 19245 78.0 13.6
## 50 Texas Chambers 37251 77.9 15.9
## 51 Texas Jasper 35768 77.7 15.5
## 52 Texas Scurry 17238 77.7 16.2
## 53 Texas Parmer 10004 77.6 12.5
## 54 Texas San Patricio 66070 77.6 17.9
## 55 Texas Taylor 134435 77.5 16.5
## 56 Texas Fayette 24849 77.4 13.4
## 57 Texas Kaufman 109289 77.4 16.7
## 58 Texas Matagorda 36598 77.4 15.6
## 59 Texas Comal 119632 77.2 13.2
## 60 Texas Gonzales 20172 77.2 13.2
## 61 Texas Hill 34923 77.1 15.0
## 62 Texas Hockley 23322 77.1 15.6
## 63 Texas Guadalupe 143460 77.0 17.9
## 64 Texas Cooke 38761 76.9 14.8
## 65 Texas Rockwall 85536 76.9 16.2
## 66 Texas Wise 61243 76.9 14.9
## 67 Texas Gaines 18916 76.8 12.0
## 68 Texas Tom Green 115056 76.8 16.5
## 69 Texas Erath 40039 76.7 15.3
## 70 Texas Hopkins 35645 76.7 13.2
## 71 Texas Palo Pinto 27921 76.7 14.1
## 72 Texas Nueces 352060 76.6 16.3
## 73 Texas Parker 121418 76.6 15.2
## 74 Texas Lubbock 290782 76.5 16.8
## 75 Texas Travis 1121645 76.5 16.0
## 76 Texas Eastland 18328 76.4 15.1
## 77 Texas Montague 19478 76.4 13.3
## 78 Texas Angelina 87748 76.2 17.8
## 79 Texas Jackson 14486 76.1 13.6
## 80 Texas Nolan 15061 76.1 17.1
## 81 Texas Cass 30328 76.0 15.9
## 82 Texas Duval 11577 76.0 17.2
## 83 Texas Hale 35504 75.9 16.4
## 84 Texas Galveston 308163 75.8 18.2
## 85 Texas Bosque 17971 75.7 12.8
## 86 Texas Henderson 79016 75.6 13.9
## 87 Texas Bowie 93155 75.5 19.4
## 88 Texas Aransas 24292 75.4 12.4
## 89 Texas Randall 126782 75.4 17.5
## 90 Texas Morris 12700 75.3 17.5
## 91 Texas Rains 11037 75.3 15.6
## 92 Texas Shelby 25725 75.3 16.8
## 93 Texas Brown 37833 75.2 16.5
## 94 Texas Sabine 10440 75.1 18.1
## 95 Texas Wood 42712 75.0 15.6
## 96 Texas Gillespie 25398 74.8 10.5
## 97 Texas Navarro 48118 74.8 17.3
## 98 Texas Caldwell 39347 74.7 18.6
## 99 Texas Tyler 21462 74.6 19.8
## 100 Texas Webb 263251 74.6 19.4
## 101 Texas Wichita 131957 74.6 19.2
## 102 Texas Lee 16664 74.5 16.7
## 103 Texas Burnet 44144 74.1 14.2
## 104 Texas El Paso 831095 73.9 20.0
## 105 Texas Kendall 37361 73.9 15.7
## 106 Texas Lamb 13742 73.8 16.1
## 107 Texas Camp 12516 73.7 16.0
## 108 Texas Freestone 19586 73.6 19.0
## 109 Texas Medina 47392 73.6 17.3
## 110 Texas Washington 34236 73.6 18.1
## 111 Texas Hays 177562 73.5 19.4
## 112 Texas Live Oak 11873 73.5 14.2
## 113 Texas Lavaca 19549 73.4 14.4
## 114 Texas Zapata 14308 73.3 18.1
## 115 Texas Fannin 33748 73.1 19.8
## 116 Texas Young 18329 73.1 15.7
## 117 Texas Cherokee 51167 72.9 20.4
## 118 Texas Kerr 50149 72.8 16.7
## 119 Texas Cameron 417947 72.7 18.2
## 120 Texas Colorado 20757 72.6 17.3
## 121 Texas Clay 10479 72.4 17.0
## 122 Texas DeWitt 20540 72.4 17.7
## 123 Texas Franklin 10599 72.3 13.7
## 124 Texas Comanche 13623 72.2 15.0
## 125 Texas Howard 36105 72.0 23.0
## 126 Texas Llano 19323 72.0 11.4
## 127 Texas Nacogdoches 65531 72.0 20.0
## 128 Texas Wilson 45509 71.8 20.6
## 129 Texas Van Zandt 52736 71.6 16.8
## 130 Texas Bastrop 76948 71.3 20.1
## 131 Texas Callahan 13532 71.2 17.0
## 132 Texas Falls 17410 70.8 20.5
## 133 Texas Hidalgo 819217 70.7 17.2
## 134 Texas Bee 32659 70.5 22.7
## 135 Texas Dimmit 10682 70.5 21.9
## 136 Texas Blanco 10723 70.4 17.5
## 137 Texas Zavala 12060 70.4 23.8
## 138 Texas Terry 12687 70.2 19.7
## 139 Texas Anderson 57915 69.8 23.6
## 140 Texas Robertson 16532 69.8 22.4
## 141 Texas Karnes 14879 69.6 22.0
## 142 Texas Kleberg 32029 69.5 25.9
## 143 Texas Burleson 17293 69.4 21.7
## 144 Texas Leon 16819 69.4 14.5
## 145 Texas Bandera 20796 69.2 19.0
## 146 Texas Bell 326041 69.2 25.7
## 147 Texas San Jacinto 27023 69.1 23.0
## 148 Texas Maverick 56548 68.8 25.0
## 149 Texas Trinity 14405 68.8 17.8
## 150 Texas Jones 19978 68.7 20.5
## 151 Texas Polk 46113 68.7 22.2
## 152 Texas Pecos 15807 68.6 24.5
## 153 Texas Dawson 13542 68.5 21.9
## 154 Texas Milam 24344 68.5 21.3
## 155 Texas Red River 12567 68.5 17.4
## 156 Texas Brazos 205271 68.3 25.6
## 157 Texas Starr 62648 68.2 22.5
## 158 Texas Houston 22949 67.8 22.1
## 159 Texas Reeves 14179 67.8 28.3
## 160 Texas Runnels 10445 67.7 19.9
## 161 Texas Willacy 22002 66.9 23.9
## 162 Texas Uvalde 26952 66.8 23.8
## 163 Texas Wilbarger 13158 66.0 29.6
## 164 Texas Val Verde 48980 65.9 28.9
## 165 Texas Lampasas 20219 65.5 26.5
## 166 Texas Madison 13838 65.1 22.4
## self_employed
## 1 5.4
## 2 5.8
## 3 6.4
## 4 6.3
## 5 6.8
## 6 5.4
## 7 7.4
## 8 5.7
## 9 6.7
## 10 5.9
## 11 4.5
## 12 5.7
## 13 6.9
## 14 7.1
## 15 6.2
## 16 3.4
## 17 5.7
## 18 6.3
## 19 5.4
## 20 5.4
## 21 5.0
## 22 6.7
## 23 7.7
## 24 7.8
## 25 7.1
## 26 5.8
## 27 9.2
## 28 7.1
## 29 4.2
## 30 5.6
## 31 7.0
## 32 5.7
## 33 4.8
## 34 5.3
## 35 5.7
## 36 6.7
## 37 7.9
## 38 8.6
## 39 6.0
## 40 5.9
## 41 9.9
## 42 7.0
## 43 7.3
## 44 8.3
## 45 7.5
## 46 7.8
## 47 7.5
## 48 5.6
## 49 7.9
## 50 6.1
## 51 6.7
## 52 6.0
## 53 9.3
## 54 4.4
## 55 5.7
## 56 9.0
## 57 5.8
## 58 7.0
## 59 9.1
## 60 9.1
## 61 7.6
## 62 7.1
## 63 4.8
## 64 8.1
## 65 6.7
## 66 8.1
## 67 11.2
## 68 6.6
## 69 7.9
## 70 9.8
## 71 9.0
## 72 6.9
## 73 8.0
## 74 6.5
## 75 7.4
## 76 7.8
## 77 9.9
## 78 5.9
## 79 9.3
## 80 6.5
## 81 8.1
## 82 6.4
## 83 7.4
## 84 5.8
## 85 11.2
## 86 10.4
## 87 5.1
## 88 12.2
## 89 7.1
## 90 7.1
## 91 9.0
## 92 7.8
## 93 8.2
## 94 6.8
## 95 9.0
## 96 14.3
## 97 7.8
## 98 6.7
## 99 5.3
## 100 5.8
## 101 6.0
## 102 8.4
## 103 11.7
## 104 6.0
## 105 10.4
## 106 9.9
## 107 9.9
## 108 7.3
## 109 9.0
## 110 8.3
## 111 7.0
## 112 12.1
## 113 12.1
## 114 8.7
## 115 7.1
## 116 10.8
## 117 6.5
## 118 10.4
## 119 8.9
## 120 9.0
## 121 10.4
## 122 9.6
## 123 13.3
## 124 12.3
## 125 4.9
## 126 16.4
## 127 7.9
## 128 7.6
## 129 11.1
## 130 8.3
## 131 9.4
## 132 8.4
## 133 11.9
## 134 6.7
## 135 7.6
## 136 11.8
## 137 5.6
## 138 9.8
## 139 6.4
## 140 7.6
## 141 7.8
## 142 4.4
## 143 8.4
## 144 15.7
## 145 11.6
## 146 5.0
## 147 7.6
## 148 6.1
## 149 12.1
## 150 10.3
## 151 8.7
## 152 6.0
## 153 9.5
## 154 9.8
## 155 13.8
## 156 5.8
## 157 8.6
## 158 9.8
## 159 3.9
## 160 12.1
## 161 9.1
## 162 9.4
## 163 4.4
## 164 5.1
## 165 8.0
## 166 11.8
## [ reached getOption("max.print") -- omitted 3 rows ]
You’ve learned how to filter and sort a dataset to answer questions about the data. Notice that you only need to slightly modify your code if you are interested in sorting the observations by a different column.
4.1.8 Video: Mutate
4.1.9 Calculating the number of government employees
In the video, you used the unemployment
variable, which is a percentage, to calculate the number of unemployed people in each county. In this exercise, you’ll do the same with another percentage variable: public_work
.
The code provided already selects the state, county, population, and public_work columns.
counties_selected <- counties %>%
  select(state, county, population, public_work)
4.1.9.1 Add a public_workers column, then sort in descending order of it
counties_selected %>%
  mutate(public_workers = public_work * population / 100) %>%
  arrange(desc(public_workers))
It looks like Los Angeles is the county with the most government employees.
4.1.10 Calculating the percentage of women in a county
The dataset includes columns for the total number (not percentage) of men and women in each county. You could use these, along with the population variable, to compute the fraction of men (or women) within each county.
In this exercise, you’ll select the relevant columns yourself.
4.1.10.1 Select the columns state, county, population, men, and women
counties_selected <- counties %>%
  select(state, county, population, men, women)
4.1.10.2 Calculate proportion_women as the fraction of the population made up of women
counties_selected %>%
  mutate(proportion_women = women / population)
Notice that the proportion_women variable was added as a column to the counties_selected dataset, and the data now has 6 columns instead of 5.
4.1.11 Select, mutate, filter, and arrange
In this exercise, you’ll put together everything you’ve learned in this chapter (select(), mutate(), filter(), and arrange()) to find the counties with the highest proportion of men.
counties %>%
  # Select the five columns
  select(state, county, population, men, women) %>%
  # Add the proportion_men variable
  mutate(proportion_men = men / population) %>%
  # Filter for population of at least 10,000
  filter(population >= 10000) %>%
  # Arrange proportion of men in descending order
  arrange(desc(proportion_men))
## state county population men
## 1 Virginia Sussex 11864 8130
## 2 California Lassen 32645 21818
## 3 Georgia Chattahoochee 11914 7940
## 4 Louisiana West Feliciana 15415 10228
## 5 Florida Union 15191 9830
## 6 Texas Jones 19978 12652
## 7 Missouri DeKalb 12782 8080
## 8 Texas Madison 13838 8648
## 9 Virginia Greensville 11760 7303
## 10 Texas Anderson 57915 35469
## 11 Arkansas Lincoln 14062 8596
## 12 Texas Bee 32659 19722
## 13 Florida Hamilton 14395 8671
## 14 Illinois Lawrence 16665 9997
## 15 Mississippi Tallahatchie 14959 8907
## 16 Mississippi Greene 14129 8409
## 17 Texas Karnes 14879 8799
## 18 Texas Frio 18168 10723
## 19 Florida Gulf 15785 9235
## 20 Florida Franklin 11628 6800
## 21 Texas Walker 69330 40484
## 22 Georgia Telfair 16416 9502
## 23 Colorado Fremont 46809 27003
## 24 Louisiana Allen 25653 14784
## 25 Ohio Noble 14508 8359
## 26 Georgia Tattnall 25302 14529
## 27 Louisiana Claiborne 16639 9524
## 28 Texas Pecos 15807 9003
## 29 Missouri Pulaski 53443 30390
## 30 Texas Howard 36105 20496
## 31 Illinois Johnson 12829 7267
## 32 Texas Dawson 13542 7670
## 33 Florida DeSoto 34957 19756
## 34 Texas Reeves 14179 8004
## 35 Florida Taylor 22685 12781
## 36 Tennessee Bledsoe 13686 7677
## 37 Louisiana Grant 22362 12523
## 38 Florida Wakulla 31128 17366
## 39 Tennessee Morgan 21794 12150
## 40 Kentucky Morgan 13428 7477
## 41 Florida Bradford 27223 15150
## 42 California Kings 150998 83958
## 43 Kentucky Martin 12631 7020
## 44 Virginia Buckingham 17068 9476
## 45 California Del Norte 27788 15418
## 46 Missouri Pike 18517 10273
## 47 Florida Jackson 48900 27124
## 48 Mississippi Yazoo 27911 15460
## 49 Florida Glades 13272 7319
## 50 New York Franklin 51280 28253
## 51 Pennsylvania Union 44958 24755
## 52 Tennessee Wayne 16897 9295
## 53 Colorado Summit 28940 15912
## 54 Texas Grimes 26961 14823
## 55 Georgia Macon 14045 7720
## 56 Ohio Madison 43456 23851
## 57 Arkansas St. Francis 27345 14996
## 58 Michigan Chippewa 38586 21156
## 59 Kentucky McCreary 18001 9863
## 60 Florida Washington 24629 13478
## 61 Virginia Prince George 37380 20440
## 62 Florida Calhoun 14615 7991
## 63 North Carolina Avery 17695 9673
## 64 Indiana Sullivan 21111 11540
## 65 Oregon Malheur 30551 16697
## 66 Wyoming Carbon 15739 8592
## 67 Maryland Somerset 25980 14150
## 68 Tennessee Hardeman 26253 14297
## 69 Florida Dixie 16091 8746
## 70 Florida Hardee 27468 14920
## 71 Texas Willacy 22002 11947
## 72 North Dakota Williams 29619 16066
## 73 Colorado Logan 21928 11889
## 74 Michigan Houghton 36660 19876
## 75 Texas Tyler 21462 11629
## 76 North Carolina Onslow 183753 99526
## 77 New York Wyoming 41446 22442
## 78 Illinois Randolph 33069 17902
## 79 California Amador 36995 20012
## 80 Texas Live Oak 11873 6421
## 81 Michigan Gogebic 15824 8556
## 82 Oklahoma Hughes 13785 7450
## 83 Texas Scurry 17238 9312
## 84 Georgia Dooly 14293 7717
## 85 Wisconsin Adams 20451 11033
## 86 Florida Okeechobee 39255 21176
## 87 North Carolina Greene 21328 11499
## 88 Missouri St. Francois 66010 35586
## 89 Texas Terry 12687 6835
## 90 Colorado Gunnison 15651 8427
## 91 Missouri Mississippi 14208 7650
## 92 Alabama Barbour 26932 14497
## 93 Indiana Miami 36211 19482
## 94 South Carolina Edgefield 26466 14234
## 95 Oklahoma Okfuskee 12248 6585
## 96 Virginia Powhatan 28207 15159
## 97 Louisiana East Feliciana 19855 10666
## 98 Oklahoma Beckham 23300 12515
## 99 Texas Polk 46113 24757
## 100 Wisconsin Jackson 20543 11022
## 101 Alaska Fairbanks North Star Borough 99705 53477
## 102 Indiana Perry 19414 10407
## 103 Arizona Graham 37407 20049
## 104 Kentucky Clay 21300 11415
## 105 Tennessee Johnson 18017 9643
## 106 Colorado Chaffee 18309 9798
## 107 Louisiana Catahoula 10247 5483
## 108 Georgia Charlton 13130 7024
## 109 Florida Holmes 19635 10501
## 110 Alaska Kodiak Island Borough 13973 7468
## 111 Kansas Leavenworth 78227 41806
## 112 Alabama Bibb 22604 12073
## 113 Minnesota Pine 29218 15605
## 114 Colorado Grand 14411 7689
## 115 South Carolina Marlboro 27993 14930
## 116 Kansas Riley 75022 40002
## 117 Texas Gray 22983 12253
## 118 Texas Houston 22949 12227
## 119 Oklahoma Woodward 20986 11172
## 120 Ohio Marion 65943 35022
## 121 Georgia Butts 23445 12451
## 122 Pennsylvania Wayne 51642 27420
## 123 Illinois Perry 21810 11572
## 124 Michigan Gratiot 41878 22213
## 125 Colorado Eagle 52576 27887
## 126 Indiana Putnam 37650 19967
## 127 Missouri Cooper 17593 9328
## 128 Colorado Pitkin 17420 9235
## 129 Wyoming Goshen 13544 7180
## 130 Alabama Bullock 10678 5660
## 131 Florida Jefferson 14198 7522
## 132 Florida Hendry 38363 20318
## 133 Virginia Nottoway 15711 8320
## 134 Georgia Dodge 21180 11215
## 135 Massachusetts Nantucket 10556 5589
## 136 Michigan Ionia 64064 33917
## 137 Wyoming Sublette 10117 5353
## 138 Wisconsin Juneau 26494 14018
## 139 Florida Monroe 75901 40159
## 140 Pennsylvania Huntingdon 45906 24284
## 141 Illinois Lee 35027 18524
## 142 Louisiana Winn 14855 7851
## 143 Ohio Pickaway 56515 29866
## 144 Kentucky Oldham 63037 33307
## 145 Texas Fannin 33748 17810
## 146 New York Seneca 35144 18527
## 147 Texas Rusk 53457 28172
## 148 Florida Madison 18729 9869
## 149 Colorado Park 16189 8525
## 150 Florida Gilchrist 16992 8946
## 151 Florida Baker 27135 14277
## 152 Alaska Bethel Census Area 17776 9351
## 153 Virginia Manassas Park city 15625 8219
## 154 Nevada Humboldt 17067 8971
## 155 Wisconsin Waushara 24321 12783
## 156 Texas DeWitt 20540 10795
## 157 Oklahoma Atoka 13906 7307
## 158 Idaho Idaho 16312 8571
## 159 Ohio Ross 77334 40627
## 160 Oklahoma Texas 21588 11340
## 161 Utah Sanpete 28261 14845
## 162 Illinois Fayette 22136 11622
## 163 Tennessee Hickman 24283 12745
## 164 Virginia Southampton 18410 9661
## 165 Virginia Brunswick 16930 8882
## 166 Virginia Lunenburg 12558 6588
## women proportion_men
## 1 3734 0.6852664
## 2 10827 0.6683412
## 3 3974 0.6664428
## 4 5187 0.6635096
## 5 5361 0.6470937
## 6 7326 0.6332966
## 7 4702 0.6321389
## 8 5190 0.6249458
## 9 4457 0.6210034
## 10 22446 0.6124320
## 11 5466 0.6112928
## 12 12937 0.6038764
## 13 5724 0.6023619
## 14 6668 0.5998800
## 15 6052 0.5954275
## 16 5720 0.5951589
## 17 6080 0.5913704
## 18 7445 0.5902136
## 19 6550 0.5850491
## 20 4828 0.5847953
## 21 28846 0.5839319
## 22 6914 0.5788255
## 23 19806 0.5768762
## 24 10869 0.5763069
## 25 6149 0.5761649
## 26 10773 0.5742234
## 27 7115 0.5723902
## 28 6804 0.5695578
## 29 23053 0.5686432
## 30 15609 0.5676776
## 31 5562 0.5664510
## 32 5872 0.5663861
## 33 15201 0.5651515
## 34 6175 0.5644968
## 35 9904 0.5634119
## 36 6009 0.5609382
## 37 9839 0.5600125
## 38 13762 0.5578900
## 39 9644 0.5574929
## 40 5951 0.5568216
## 41 12073 0.5565147
## 42 67040 0.5560206
## 43 5611 0.5557755
## 44 7592 0.5551910
## 45 12370 0.5548438
## 46 8244 0.5547875
## 47 21776 0.5546830
## 48 12451 0.5539035
## 49 5953 0.5514617
## 50 23027 0.5509555
## 51 20203 0.5506250
## 52 7602 0.5500977
## 53 13028 0.5498272
## 54 12138 0.5497941
## 55 6325 0.5496618
## 56 19605 0.5488540
## 57 12349 0.5484001
## 58 17430 0.5482818
## 59 8138 0.5479140
## 60 11151 0.5472411
## 61 16940 0.5468165
## 62 6624 0.5467670
## 63 8022 0.5466516
## 64 9571 0.5466345
## 65 13854 0.5465288
## 66 7147 0.5459051
## 67 11830 0.5446497
## 68 11956 0.5445854
## 69 7345 0.5435337
## 70 12548 0.5431775
## 71 10055 0.5429961
## 72 13553 0.5424221
## 73 10039 0.5421835
## 74 16784 0.5421713
## 75 9833 0.5418414
## 76 84227 0.5416293
## 77 19004 0.5414757
## 78 15167 0.5413529
## 79 16983 0.5409380
## 80 5452 0.5408069
## 81 7268 0.5406977
## 82 6335 0.5404425
## 83 7926 0.5402019
## 84 6576 0.5399146
## 85 9418 0.5394846
## 86 18079 0.5394472
## 87 9829 0.5391504
## 88 30424 0.5391001
## 89 5852 0.5387404
## 90 7224 0.5384320
## 91 6558 0.5384291
## 92 12435 0.5382816
## 93 16729 0.5380133
## 94 12232 0.5378221
## 95 5663 0.5376388
## 96 13048 0.5374198
## 97 9189 0.5371947
## 98 10785 0.5371245
## 99 21356 0.5368768
## 100 9521 0.5365331
## 101 46228 0.5363522
## 102 9007 0.5360565
## 103 17358 0.5359692
## 104 9885 0.5359155
## 105 8374 0.5352167
## 106 8511 0.5351466
## 107 4764 0.5350834
## 108 6106 0.5349581
## 109 9134 0.5348103
## 110 6505 0.5344593
## 111 36421 0.5344191
## 112 10531 0.5341090
## 113 13613 0.5340886
## 114 6722 0.5335508
## 115 13063 0.5333476
## 116 35020 0.5332036
## 117 10730 0.5331332
## 118 10722 0.5327901
## 119 9814 0.5323549
## 120 30921 0.5310950
## 121 10994 0.5310727
## 122 24222 0.5309632
## 123 10238 0.5305823
## 124 19665 0.5304217
## 125 24689 0.5304131
## 126 17683 0.5303320
## 127 8265 0.5302109
## 128 8185 0.5301378
## 129 6364 0.5301240
## 130 5018 0.5300618
## 131 6676 0.5297929
## 132 18045 0.5296249
## 133 7391 0.5295653
## 134 9965 0.5295090
## 135 4967 0.5294619
## 136 30147 0.5294237
## 137 4764 0.5291094
## 138 12476 0.5291009
## 139 35742 0.5290971
## 140 21622 0.5289940
## 141 16503 0.5288492
## 142 7004 0.5285089
## 143 26649 0.5284615
## 144 29730 0.5283722
## 145 15938 0.5277350
## 146 16617 0.5271739
## 147 25285 0.5270030
## 148 8860 0.5269368
## 149 7664 0.5265921
## 150 8046 0.5264831
## 151 12858 0.5261470
## 152 8425 0.5260464
## 153 7406 0.5260160
## 154 8096 0.5256343
## 155 11538 0.5255952
## 156 9745 0.5255599
## 157 6599 0.5254566
## 158 7741 0.5254414
## 159 36707 0.5253446
## 160 10248 0.5252918
## 161 13416 0.5252822
## 162 10514 0.5250271
## 163 11538 0.5248528
## 164 8749 0.5247691
## 165 8048 0.5246308
## 166 5970 0.5246058
## [ reached getOption("max.print") -- omitted 2272 rows ]
Notice that Sussex County in Virginia is more than two-thirds male: this is because of two men’s prisons in the county.
4.2 Aggregating Data
Now that you know how to transform your data, you’ll want to know more about how to aggregate your data to make it more interpretable. You’ll learn a number of functions you can use to take many observations in your data and summarize them, including count, group_by, summarize, ungroup, and top_n.
4.2.1 Video: The count verb
4.2.2 Counting by region
The counties dataset contains columns for region, state, population, and the number of citizens, which we selected and saved as the counties_selected table. In this exercise, you’ll focus on the region column.
counties_selected <- counties %>%
  select(region, state, population, citizens)
4.2.2.1 Use count to find the number of counties in each region
counties_selected %>%
  count(region, sort = TRUE)
## # A tibble: 4 x 2
## region n
## <fct> <int>
## 1 South 1420
## 2 North Central 1055
## 3 West 448
## 4 Northeast 218
Since the results have been arranged, you can see that the South has the greatest number of counties.
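Passing sort = TRUE is shorthand for counting and then arranging; a sketch of the longer equivalent:

```r
# Equivalent to count(region, sort = TRUE):
# count() creates a column named n, which we then sort on
counties_selected %>%
  count(region) %>%
  arrange(desc(n))
```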
4.2.3 Counting citizens by state
You can weight your count by a particular variable rather than finding the number of counties. In this case, you’ll find the number of citizens in each state.
counties_selected <- counties %>%
  select(region, state, population, citizens)
4.2.3.1 Find number of counties per state, weighted by citizens
counties_selected %>%
  count(state, wt = citizens, sort = TRUE)
## # A tibble: 50 x 2
## state n
## <fct> <int>
## 1 California 24280349
## 2 Texas 16864962
## 3 Florida 13933052
## 4 New York 13531404
## 5 Pennsylvania 9710416
## 6 Illinois 8979999
## 7 Ohio 8709050
## 8 Michigan 7380136
## 9 North Carolina 7107998
## 10 Georgia 6978660
## # ... with 40 more rows
From our result, we can see that California is the state with the most citizens.
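A weighted count is itself shorthand for grouping and summing; a sketch of the equivalent group_by()/summarize() pipeline:

```r
# Equivalent to count(state, wt = citizens, sort = TRUE):
counties_selected %>%
  group_by(state) %>%
  summarize(n = sum(citizens)) %>%
  arrange(desc(n))
```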
4.2.4 Mutating and counting
You can combine multiple verbs to answer increasingly complicated questions about your data. For example: “What are the US states where the most people walk to work?”
You’ll use the walk column, which gives the percentage of people in each county who walk to work, to add a new column and count based on it.
counties_selected <- counties %>%
  select(state, population, walk)
counties_selected %>%
  # Add population_walk containing the total number of people who walk to work
  mutate(population_walk = population * walk / 100) %>%
  # Count weighted by the new column
  count(state, wt = population_walk, sort = TRUE)
## # A tibble: 50 x 2
## state n
## <fct> <dbl>
## 1 New York 1237938.
## 2 California 1017964.
## 3 Pennsylvania 505397.
## 4 Texas 430793.
## 5 Illinois 400346.
## 6 Massachusetts 316765.
## 7 Florida 284723.
## 8 New Jersey 273047.
## 9 Ohio 266911.
## 10 Washington 239764.
## # ... with 40 more rows
We can see that while California had the largest total population, New York state has the largest number of people who walk to work.
4.2.5 Video: The group by, summarize and ungroup verbs
4.2.5.1 Summarizing
The summarize() verb is very useful for collapsing a large dataset into a single observation.
counties_selected <- counties %>%
  select(county, population, income, unemployment)
4.2.5.2 Summarize to find minimum population, maximum unemployment, and average income
counties_selected %>%
  summarize(min_population = min(population),
            max_unemployment = max(unemployment),
            average_income = mean(income))
## min_population max_unemployment average_income
## 1 85 29.4 NA
If we wanted to take this a step further, we could use filter() to determine the specific counties that returned the values for min_population and max_unemployment.
counties_selected %>%
  filter(population == min(population))
## county population income unemployment
## 1 Kalawao 85 66250 0
counties_selected %>%
  filter(unemployment == max(unemployment))
## county population income unemployment
## 1 Corson 4149 31676 29.4
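Notice that average_income came back as NA above: mean() returns NA whenever its input contains missing values, and the income column has some. A sketch of the same summary that drops the missing values with na.rm = TRUE:

```r
counties_selected %>%
  summarize(min_population = min(population),
            max_unemployment = max(unemployment),
            # na.rm = TRUE ignores counties with missing income
            average_income = mean(income, na.rm = TRUE))
```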
4.2.6 Summarizing by state
Another interesting column is land_area, which gives the land area in square miles. Here, you’ll summarize both population and land area by state, with the purpose of finding each state’s density (in people per square mile).
counties_selected <- counties %>%
  select(state, county, population, land_area)
4.2.6.1 Add a density column, then sort in descending order
counties_selected %>%
  group_by(state) %>%
  summarize(total_area = sum(land_area),
            total_population = sum(population)) %>%
  mutate(density = total_population / total_area) %>%
  arrange(desc(density))
## # A tibble: 50 x 4
## state total_area total_population density
## <fct> <dbl> <int> <dbl>
## 1 New Jersey 7354. 8904413 1211.
## 2 Rhode Island 1034. 1053661 1019.
## 3 Massachusetts 7800. 6705586 860.
## 4 Connecticut 4842. 3593222 742.
## 5 Maryland 9707. 5930538 611.
## 6 Delaware 1949. 926454 475.
## 7 New York 47126. 19673174 417.
## 8 Florida 53625. 19645772 366.
## 9 Pennsylvania 44743. 12779559 286.
## 10 Ohio 40861. 11575977 283.
## # ... with 40 more rows
Looks like New Jersey and Rhode Island are the “most crowded” of the US states, with more than a thousand people per square mile.
4.2.7 Summarizing by state and region
You can group by multiple columns instead of grouping by one. Here, you’ll practice aggregating by state and region, and notice how useful it is for performing multiple aggregations in a row.
counties_selected <- counties %>%
  select(region, state, county, population)
4.2.7.1 Calculate the average_pop and median_pop columns
counties_selected %>%
  group_by(region, state) %>%
  summarize(total_pop = sum(population)) %>%
  summarize(average_pop = mean(total_pop), median_pop = median(total_pop))
## # A tibble: 4 x 3
## region average_pop median_pop
## <fct> <dbl> <dbl>
## 1 North Central 5628866. 5580644
## 2 Northeast 5600438. 2461161
## 3 South 7369565. 4804098
## 4 West 5723364. 2798636
It looks like the South has the highest average_pop of about 7369565, while the North Central region has the highest median_pop of 5580644.
4.2.8 Video: The top_n verb
4.2.9 Selecting a county from each region
Previously, you used the walk column, which gives the percentage of people in each county who walk to work, to add a new column and then count the total number of people who walk to work in each county.
Now, you’re interested in finding the county within each region with the highest percentage of citizens who walk to work.
counties_selected <- counties %>%
  select(region, state, county, metro, population, walk)
4.2.9.1 Group by region and find the greatest number of citizens who walk to work
counties_selected %>%
  group_by(region) %>%
  top_n(1, walk)
## # A tibble: 4 x 6
## # Groups: region [4]
## region state county metro population walk
## <fct> <fct> <fct> <fct> <int> <dbl>
## 1 West Alaska Aleutians East Bo~ Nonmet~ 3304 71.2
## 2 Northeast New York New York Metro 1629507 20.7
## 3 North Centr~ North Dako~ McIntosh Nonmet~ 2759 17.5
## 4 South Virginia Lexington city Nonmet~ 7071 31.7
Notice that three of the four counties where the most people walk to work are low-population nonmetro counties, but that New York City also pops up!
4.2.10 Finding the highest-income state in each region
You’ve been learning to combine multiple dplyr
verbs together. Here, you’ll combine group_by()
, summarize()
, and top_n()
to find the state in each region with the highest income.
When you group by multiple columns and then summarize, it’s important to remember that the summarize “peels off” one of the groups, but leaves the rest on. For example, if you group_by(X, Y)
then summarize
, the result will still be grouped by X
.
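This peeling behavior can be checked directly on a toy tibble with group_vars(), which reports the remaining grouping after each step (a minimal sketch, not part of the exercise):

```r
library(dplyr)

toy <- tibble(X = c("a", "a", "b"), Y = c(1, 2, 1), value = c(10, 20, 30))

grouped <- toy %>%
  group_by(X, Y) %>%
  summarize(total = sum(value))

# summarize() peeled off the innermost group (Y); only X remains
group_vars(grouped)  # "X"
```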
counties_selected <- counties %>%
  select(region, state, county, population, income)
counties_selected %>%
  group_by(region, state) %>%
  # Calculate average income
  summarise(average_income = mean(income)) %>%
  # Find the highest income state in each region
  top_n(1, average_income)
## # A tibble: 4 x 3
## # Groups: region [4]
## region state average_income
## <fct> <fct> <dbl>
## 1 North Central North Dakota 55575.
## 2 Northeast New Jersey 73014.
## 3 South Maryland 69200.
## 4 West Hawaii 64879
From our results, we can see that New Jersey in the Northeast is the state with the highest average_income
of 73014.
4.2.11 Using summarize, top_n, and count together
In this chapter, you’ve learned to use five dplyr
verbs related to aggregation: count()
, group_by()
, summarize()
, ungroup()
, and top_n()
. In this exercise, you’ll use all of them to answer a question: In how many states do more people live in metro areas than non-metro areas?
Recall that the metro
column has one of the two values “Metro” (for high-density city areas) or “Nonmetro” (for suburban and country areas).
counties_selected <- counties %>%
  select(state, metro, population)
4.2.11.1 Count the states with more people in Metro or Nonmetro areas
counties_selected %>%
  group_by(state, metro) %>%
  summarize(total_pop = sum(population)) %>%
  top_n(1, total_pop) %>%
  ungroup() %>%
  count(metro)
## # A tibble: 1 x 2
## metro n
## <fct> <int>
## 1 "" 50
In the original DataCamp data, 44 states have more people living in Metro areas and 6 states have more people living in Nonmetro areas. In this local copy the metro column was read in as empty strings (see the glimpse() output above), so all 50 states collapse into a single group.
4.3 Selecting and Transforming Data
Learn advanced methods to select and transform columns. Also learn about select helpers, which are functions that specify criteria for columns you want to choose, as well as the rename and transmute verbs.
4.3.1 Video: Selecting
4.3.2 Selecting columns
Using the select verb, we can answer interesting questions about our dataset by focusing in on related groups of variables. The colon (:) is useful for getting many columns at a time.
4.3.2.1 Glimpse the counties table
glimpse(counties)
## Observations: 3,141
## Variables: 40
## $ census_id <int> 1001, 1003, 1005, 1007, 1009, 1011, 10...
## $ state <fct> Alabama, Alabama, Alabama, Alabama, Al...
## $ county <fct> Autauga, Baldwin, Barbour, Bibb, Bloun...
## $ region <fct> South, South, South, South, South, Sou...
## $ metro <fct> , , , , , , , , , , , , , , , , , , , ...
## $ population <int> 55221, 195121, 26932, 22604, 57710, 10...
## $ men <int> 26745, 95314, 14497, 12073, 28512, 566...
## $ women <int> 28476, 99807, 12435, 10531, 29198, 501...
## $ hispanic <dbl> 2.6, 4.5, 4.6, 2.2, 8.6, 4.4, 1.2, 3.5...
## $ white <dbl> 75.8, 83.1, 46.2, 74.5, 87.9, 22.2, 53...
## $ black <dbl> 18.5, 9.5, 46.7, 21.4, 1.5, 70.7, 43.8...
## $ native <dbl> 0.4, 0.6, 0.2, 0.4, 0.3, 1.2, 0.1, 0.2...
## $ asian <dbl> 1.0, 0.7, 0.4, 0.1, 0.1, 0.2, 0.4, 0.9...
## $ pacific <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0...
## $ citizens <int> 40725, 147695, 20714, 17495, 42345, 80...
## $ income <int> 51281, 50254, 32964, 38678, 45813, 319...
## $ income_err <int> 2391, 1263, 2973, 3995, 3141, 5884, 17...
## $ income_per_cap <int> 24974, 27317, 16824, 18431, 20532, 175...
## $ income_per_cap_err <int> 1080, 711, 798, 1618, 708, 2055, 714, ...
## $ poverty <dbl> 12.9, 13.4, 26.7, 16.8, 16.7, 24.6, 25...
## $ child_poverty <dbl> 18.6, 19.2, 45.3, 27.9, 27.2, 38.4, 39...
## $ professional <dbl> 33.2, 33.1, 26.8, 21.5, 28.5, 18.8, 27...
## $ service <dbl> 17.0, 17.7, 16.1, 17.9, 14.1, 15.0, 16...
## $ office <dbl> 24.2, 27.1, 23.1, 17.8, 23.9, 19.7, 21...
## $ construction <dbl> 8.6, 10.8, 10.8, 19.0, 13.5, 20.1, 10....
## $ production <dbl> 17.1, 11.2, 23.1, 23.7, 19.9, 26.4, 23...
## $ drive <dbl> 87.5, 84.7, 83.8, 83.2, 84.9, 74.9, 84...
## $ carpool <dbl> 8.8, 8.8, 10.9, 13.5, 11.2, 14.9, 12.4...
## $ transit <dbl> 0.1, 0.1, 0.4, 0.5, 0.4, 0.7, 0.0, 0.2...
## $ walk <dbl> 0.5, 1.0, 1.8, 0.6, 0.9, 5.0, 0.8, 1.2...
## $ other_transp <dbl> 1.3, 1.4, 1.5, 1.5, 0.4, 1.7, 0.6, 1.2...
## $ work_at_home <dbl> 1.8, 3.9, 1.6, 0.7, 2.3, 2.8, 1.7, 2.7...
## $ mean_commute <dbl> 26.5, 26.4, 24.1, 28.8, 34.9, 27.5, 24...
## $ employed <int> 23986, 85953, 8597, 8294, 22189, 3865,...
## $ private_work <dbl> 73.6, 81.5, 71.8, 76.8, 82.0, 79.5, 77...
## $ public_work <dbl> 20.9, 12.3, 20.8, 16.1, 13.5, 15.1, 16...
## $ self_employed <dbl> 5.5, 5.8, 7.3, 6.7, 4.2, 5.4, 6.2, 5.0...
## $ family_work <dbl> 0.0, 0.4, 0.1, 0.4, 0.4, 0.0, 0.2, 0.1...
## $ unemployment <dbl> 7.6, 7.5, 17.6, 8.3, 7.7, 18.0, 10.9, ...
## $ land_area <dbl> 594.44, 1589.78, 884.88, 622.58, 644.7...
counties %>%
  # Select state, county, population, and industry-related columns
  select(state, county, population, professional:production) %>%
  # Arrange service in descending order
  arrange(desc(service))
Notice that when you select a group of related variables, it’s easy to find the insights you’re looking for.
4.3.3 Select helpers
In the video you learned about the select helper starts_with()
. Another select helper is ends_with()
, which finds the columns that end with a particular string.
counties %>%
  # Select the state, county, population, and those ending with "work"
  select(state, county, population, ends_with("work")) %>%
  # Filter for counties that have at least 50% of people engaged in public work
  filter(public_work >= 50)
## state county population private_work
## 1 Alaska Kusilvak Census Area 7914 45.4
## 2 Alaska Lake and Peninsula Borough 1474 42.2
## 3 Alaska Yukon-Koyukuk Census Area 5644 33.3
## 4 California Lassen 32645 42.6
## 5 Hawaii Kalawao 85 25.0
## 6 North Dakota Sioux 4380 32.9
## 7 South Dakota Oglala Lakota 14153 29.5
## 8 South Dakota Todd 9942 34.4
## 9 Wisconsin Menominee 4451 36.8
## public_work family_work
## 1 53.8 0.3
## 2 51.6 0.2
## 3 61.7 0.0
## 4 50.5 0.1
## 5 64.1 0.0
## 6 56.8 0.1
## 7 66.2 0.0
## 8 55.0 0.8
## 9 59.1 0.4
It looks like only a few counties have more than half the population working for the government.
4.3.4 Video: The rename verb
4.3.5 Renaming a column after count
The rename() verb is often useful for changing the name of a column that comes out of another verb, such as count(). In this exercise, you'll rename the n column from count() (which you learned about in Chapter 2) to something more descriptive.
# Rename the n column to num_counties
counties %>%
  count(state) %>%
  rename(num_counties = n)
## # A tibble: 50 x 2
## state num_counties
## <fct> <int>
## 1 Alabama 67
## 2 Alaska 29
## 3 Arizona 15
## 4 Arkansas 75
## 5 California 58
## 6 Colorado 64
## 7 Connecticut 8
## 8 Delaware 3
## 9 Florida 67
## 10 Georgia 159
## # ... with 40 more rows
Notice the difference between column names in the output from the first step to the second step. Don’t forget, using rename()
isn’t the only way to choose a new name for a column!
4.3.6 Renaming a column as part of a select
rename()
isn’t the only way you can choose a new name for a column: you can also choose a name as part of a select()
.
# Select state, county, and poverty as poverty_rate
counties %>%
  select(state, county, poverty_rate = poverty)
As you can see, we were able to select the three columns of interest from our dataset, and rename one of them, using only the select() verb!
4.3.7 Video: The transmute verb
4.3.8 Question: Choosing among verbs
Source: DataCamp
Recall, you can think of transmute()
as a combination of select()
and mutate()
, since you are getting back a subset of columns, but you are transforming and changing them at the same time.
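That equivalence can be checked on a small made-up tibble (df and its columns are hypothetical, just for illustration):

```r
library(dplyr)

df <- tibble(a = 1:3, b = c(10, 20, 30))

# transmute() in one step...
t1 <- df %>% transmute(a, b2 = b * 2)

# ...matches mutate() followed by select()
t2 <- df %>% mutate(b2 = b * 2) %>% select(a, b2)

identical(t1, t2)  # TRUE
```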
4.3.9 Using transmute
As you learned in the video, the transmute verb allows you to control which variables you keep, which variables you calculate, and which variables you drop.
counties %>%
  # Keep the state, county, and population columns, and add a density column
  transmute(state, county, population, density = population / land_area) %>%
  # Filter for counties with a population greater than one million
  filter(population > 1000000) %>%
  # Sort density in ascending order
  arrange(density)
## state county population density
## 1 California San Bernardino 2094769 104.4411
## 2 Nevada Clark 2035572 257.9472
## 3 California Riverside 2298032 318.8841
## 4 Arizona Maricopa 4018143 436.7480
## 5 Florida Palm Beach 1378806 699.9868
## 6 California San Diego 3223096 766.1943
## 7 Washington King 2045756 966.9999
## 8 Texas Travis 1121645 1132.7459
## 9 Florida Hillsborough 1302884 1277.0743
## 10 Florida Orange 1229039 1360.4142
## 11 Florida Miami-Dade 2639042 1390.6382
## 12 Michigan Oakland 1229503 1417.0332
## 13 California Santa Clara 1868149 1448.0653
## 14 Utah Salt Lake 1078958 1453.5728
## 15 Texas Bexar 1825502 1472.3928
## 16 California Sacramento 1465832 1519.5638
## 17 Florida Broward 1843152 1523.5305
## 18 California Contra Costa 1096068 1530.9495
## 19 New York Suffolk 1501373 1646.1521
## 20 Pennsylvania Allegheny 1231145 1686.3152
## 21 Massachusetts Middlesex 1556116 1902.7610
## 22 Missouri St. Louis 1001327 1971.8925
## 23 Maryland Montgomery 1017859 2071.9776
## 24 California Alameda 1584983 2144.7092
## 25 Minnesota Hennepin 1197776 2163.6518
## 26 Texas Tarrant 1914526 2216.8873
## 27 Ohio Franklin 1215761 2284.4492
## 28 California Los Angeles 10038388 2473.8011
## 29 Texas Harris 4356362 2557.3309
## 30 Ohio Cuyahoga 1263189 2762.9410
## 31 Texas Dallas 2485003 2852.1291
## 32 Virginia Fairfax 1128722 2886.9785
## 33 Michigan Wayne 1778969 2906.4322
## 34 California Orange 3116069 3941.5472
## 35 New York Nassau 1354612 4757.6988
## 36 Illinois Cook 5236393 5539.2223
## 37 Pennsylvania Philadelphia 1555072 11596.3609
## 38 New York Queens 2301139 21202.7919
## 39 New York Bronx 1428357 33927.7197
## 40 New York Kings 2595259 36645.8486
## 41 New York New York 1629507 71375.6899
Looks like San Bernardino is the lowest-density county with a population above one million.
4.3.10 Question: Matching verbs to their definitions
We’ve learned a number of new verbs in this chapter that you can use to modify and change the variables you have.
- rename: Leaves the columns you don’t mention alone; doesn’t allow you to calculate or change values.
- transmute: Must mention all the columns you keep; allows you to calculate or change values.
- mutate: Leaves the columns you don’t mention alone; allows you to calculate or change values.
Let’s continue practicing using these verbs to gain a better understanding of the differences between them.
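The differences are easiest to see side by side on a small made-up tibble (df and its values are hypothetical):

```r
library(dplyr)

df <- tibble(state = c("Ohio", "Iowa"), women = c(40, 55), population = c(100, 110))

df %>% rename(pop = population)                     # all columns kept; one renamed
df %>% mutate(fraction_women = women / population)  # all columns kept; one calculated column added
df %>% transmute(state, fraction_women = women / population)  # only the columns you mention survive
```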
4.3.11 Choosing among the four verbs
In this chapter you’ve learned about the four verbs: select, mutate, transmute, and rename. Here, you’ll choose the appropriate verb for each situation. You won’t need to change anything inside the parentheses.
# Change the name of the unemployment column
counties %>%
  rename(unemployment_rate = unemployment)
# Keep the state and county columns, and the columns containing poverty
counties %>%
  select(state, county, contains("poverty"))
# Calculate the fraction_women column without dropping the other columns
counties %>%
  mutate(fraction_women = women / population)
# Keep only the state, county, and employment_rate columns
counties %>%
  transmute(state, county, employment_rate = employed / population)
Now you know which verb to choose depending on whether you want to keep, drop, rename, or change a variable in the dataset.
4.4 Case Study: The babynames Dataset
Work with a new dataset that represents the names of babies born in the United States each year. Learn how to use grouped mutates and window functions to ask and answer more complex questions about your data. And use a combination of dplyr and ggplot2 to make interesting graphs to further explore your data.
4.4.1 Video: The babynames data
library(babynames)
library(dplyr)
library(tidyr)
babynames <- babynames %>%
  select(year, name, number) %>%
  group_by(year, name) %>%
  summarise(number = sum(number)) %>%
  # ungroup() %>%
  arrange(year, name)
babynames
## # A tibble: 1,756,284 x 3
## # Groups: year [138]
## year name number
## <dbl> <chr> <int>
## 1 1880 Aaron 102
## 2 1880 Ab 5
## 3 1880 Abbie 71
## 4 1880 Abbott 5
## 5 1880 Abby 6
## 6 1880 Abe 50
## 7 1880 Abel 9
## 8 1880 Abigail 12
## 9 1880 Abner 27
## 10 1880 Abraham 81
## # ... with 1,756,274 more rows
4.4.2 Filtering and arranging for one year
The dplyr verbs you’ve learned are useful for exploring data. For instance, you could find out the most common names in a particular year.
babynames %>%
  # Filter for the year 1990
  filter(year == 1990) %>%
  # Sort the number column in descending order
  arrange(desc(number))
## # A tibble: 22,678 x 3
## # Groups: year [1]
## year name number
## <dbl> <chr> <int>
## 1 1990 Michael 65560
## 2 1990 Christopher 52520
## 3 1990 Jessica 46615
## 4 1990 Ashley 45797
## 5 1990 Matthew 44925
## 6 1990 Joshua 43382
## 7 1990 Brittany 36650
## 8 1990 Amanda 34504
## 9 1990 Daniel 33963
## 10 1990 David 33862
## # ... with 22,668 more rows
It looks like the most common names for babies born in the US in 1990 were Michael, Christopher, and Jessica.
4.4.3 Using top_n with babynames
You saw that you could use filter()
and arrange()
to find the most common names in one year. However, you could also use group_by
and top_n
to find the most common name in every year.
# Find the most common name in each year
babynames %>%
  group_by(year) %>%
  top_n(1, number)
## # A tibble: 138 x 3
## # Groups: year [138]
## year name number
## <dbl> <chr> <int>
## 1 1880 John 9701
## 2 1881 John 8795
## 3 1882 John 9597
## 4 1883 John 8934
## 5 1884 John 9428
## 6 1885 Mary 9166
## 7 1886 Mary 9921
## 8 1887 Mary 9935
## 9 1888 Mary 11804
## 10 1889 Mary 11689
## # ... with 128 more rows
It looks like John was the most common name in 1880, and Mary was the most common name for a while after that.
4.4.4 Visualizing names with ggplot2
The dplyr
package is very useful for exploring data, but it’s especially useful when combined with other tidyverse
packages like ggplot2
. (As of tidyverse 1.3.0, the following packages are included in the core tidyverse: dplyr
, ggplot2
, tidyr
, readr
, purrr
, tibble
, stringr
, forcats
. To make sure you are able to access the package, install all the packages in the tidyverse by running install.packages("tidyverse")
, then run library(tidyverse)
to load the core tidyverse and make it available in your current R session.)
# Filter for the names Steven, Thomas, and Matthew
selected_names <- babynames %>%
  filter(name %in% c("Steven", "Thomas", "Matthew"))
# Plot the names using a different color for each name
ggplot(selected_names, aes(x = year, y = number, color = name)) +
geom_line()
It looks like names like Steven and Thomas were common in the 1950s, but Matthew became common more recently.
4.4.5 Video: Grouped mutates
4.4.6 Finding the year each name is most common
In an earlier video, you learned how to filter for a particular name to determine the frequency of that name over time. Now, you’re going to explore which year each name was the most common.
To do this, you’ll be combining the grouped mutate approach with a top_n
.
# Calculate the fraction of people born each year with the same name
babynames %>%
  group_by(year) %>%
  mutate(year_total = sum(number)) %>%
  ungroup() %>%
  mutate(fraction = number / year_total) %>%
  # Find the year each name is most common
  group_by(name) %>%
  top_n(1, fraction)
## # A tibble: 97,310 x 5
## # Groups: name [97,310]
## year name number year_total fraction
## <dbl> <chr> <int> <int> <dbl>
## 1 1880 Abbott 5 201484 0.0000248
## 2 1880 Abe 50 201484 0.000248
## 3 1880 Adelbert 28 201484 0.000139
## 4 1880 Adella 26 201484 0.000129
## 5 1880 Agustus 5 201484 0.0000248
## 6 1880 Albert 1493 201484 0.00741
## 7 1880 Albertus 5 201484 0.0000248
## 8 1880 Alcide 7 201484 0.0000347
## 9 1880 Alonzo 122 201484 0.000606
## 10 1880 Amos 128 201484 0.000635
## # ... with 97,300 more rows
Notice that the results are sorted by year, then name, so the first few entries are names starting with the letter A that were most popular in the 1880s.
4.4.7 Adding the total and maximum for each name
In the video, you learned how you could group by the year and use mutate()
to add a total for that year.
In these exercises, you’ll learn to normalize by a different, but also interesting metric: you’ll divide each name by the maximum for that name. This means that every name will peak at 1.
Once you add the new columns, the result will still be grouped by name. This splits it into over 97,000 groups (one per name), which actually makes later steps like mutate slower.
babynames %>%
  group_by(name) %>%
  mutate(name_total = sum(number),
         name_max = max(number)) %>%
  # Ungroup the table
  ungroup() %>%
  # Add the fraction_max column containing the number divided by the name maximum
  mutate(fraction_max = number / name_max)
## # A tibble: 1,756,284 x 6
## year name number name_total name_max fraction_max
## <dbl> <chr> <int> <int> <int> <dbl>
## 1 1880 Aaron 102 579589 15411 0.00662
## 2 1880 Ab 5 362 41 0.122
## 3 1880 Abbie 71 21716 536 0.132
## 4 1880 Abbott 5 1020 59 0.0847
## 5 1880 Abby 6 57756 2048 0.00293
## 6 1880 Abe 50 9158 280 0.179
## 7 1880 Abel 9 50236 3245 0.00277
## 8 1880 Abigail 12 357031 15948 0.000752
## 9 1880 Abner 27 7641 202 0.134
## 10 1880 Abraham 81 88852 2575 0.0315
## # ... with 1,756,274 more rows
This tells you, for example, that the name Abe was at 17.9% of its peak in the year 1880.
4.4.8 Visualizing the normalized change in popularity
You picked a few names and calculated each of them as a fraction of its peak. This is a way of “normalizing” a name, focusing on the relative change within each name rather than on the name’s overall popularity.
In this exercise, you’ll visualize the normalized popularity of each name. Your work from the previous exercise, names_normalized
, has been provided for you.
names_normalized <- babynames %>%
  group_by(name) %>%
  mutate(name_total = sum(number),
         name_max = max(number)) %>%
  ungroup() %>%
  mutate(fraction_max = number / name_max)
# Filter for the names Steven, Thomas, and Matthew
names_filtered <- names_normalized %>%
  filter(name %in% c("Steven", "Thomas", "Matthew"))
# Visualize these names over time
ggplot(names_filtered, aes(x = year, y = fraction_max, color = name)) +
geom_line()
As you can see, the line for each name hits a peak at 1, although the peak year differs for each name.
4.4.9 Video: Window functions
The code shown at the end of the video (finding the variable difference
for all names) contains a mistake. Can you spot it? Below is the corrected code.
# As we did before, but now naming it babynames_fraction
babynames_fraction <- babynames %>%
  group_by(year) %>%
  mutate(year_total = sum(number)) %>%
  ungroup() %>%
  mutate(fraction = number / year_total)
# Just for Matthew
babynames_fraction %>%
  filter(name == "Matthew") %>%
  arrange(year) %>%
  # Display change in prevalence from year to year
  mutate(difference = fraction - lag(fraction)) %>%
  # Arrange in descending order
  arrange(desc(difference))
## # A tibble: 138 x 6
## year name number year_total fraction difference
## <dbl> <chr> <int> <int> <dbl> <dbl>
## 1 1981 Matthew 43531 3459182 0.0126 0.00154
## 2 1983 Matthew 50531 3462826 0.0146 0.00138
## 3 1971 Matthew 22653 3432585 0.00660 0.000982
## 4 1967 Matthew 13629 3395130 0.00401 0.000894
## 5 1973 Matthew 24658 3017412 0.00817 0.000835
## 6 1974 Matthew 27332 3040409 0.00899 0.000818
## 7 1978 Matthew 34468 3174268 0.0109 0.000741
## 8 1972 Matthew 23066 3143627 0.00734 0.000738
## 9 1968 Matthew 15915 3378876 0.00471 0.000696
## 10 1982 Matthew 46333 3507664 0.0132 0.000625
## # ... with 128 more rows
# For all names
babynames_fraction %>%
  group_by(name) %>%
  # Display change in prevalence from year to year
  mutate(difference = fraction - lag(fraction)) %>%
  # Arrange in order of name, then year
  arrange(name, year)
## # A tibble: 1,756,284 x 6
## # Groups: name [97,310]
## year name number year_total fraction difference
## <dbl> <chr> <int> <int> <dbl> <dbl>
## 1 2007 Aaban 5 3994007 0.00000125 NA
## 2 2009 Aaban 6 3815638 0.00000157 3.21e-7
## 3 2010 Aaban 9 3690700 0.00000244 8.66e-7
## 4 2011 Aaban 11 3651914 0.00000301 5.74e-7
## 5 2012 Aaban 11 3650462 0.00000301 1.20e-9
## 6 2013 Aaban 14 3637310 0.00000385 8.36e-7
## 7 2014 Aaban 16 3696311 0.00000433 4.80e-7
## 8 2015 Aaban 15 3688687 0.00000407 -2.62e-7
## 9 2016 Aaban 9 3652968 0.00000246 -1.60e-6
## 10 2017 Aaban 11 3546301 0.00000310 6.38e-7
## # ... with 1,756,274 more rows
4.4.10 Using ratios to describe the frequency of a name
In the video, you learned how to find the difference in the frequency of a baby name between consecutive years. What if instead of finding the difference, you wanted to find the ratio?
You’ll start with the babynames_fraction
data already, so that you can consider the popularity of each name within each year.
babynames_ratio <- babynames_fraction %>%
  # Arrange the data in order of name, then year
  arrange(name, year) %>%
  # Group the data by name
  group_by(name) %>%
  # Add a ratio column that contains the ratio between each year
  mutate(ratio = fraction / lag(fraction))
Notice that the first observation for each name is missing a ratio, since there is no previous year.
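A minimal sketch of lag()'s behavior shows why: the first value in each group has no predecessor, so lag() returns NA there.

```r
library(dplyr)

tibble(year = c(1880, 1881, 1882),
       fraction = c(0.01, 0.02, 0.04)) %>%
  mutate(ratio = fraction / lag(fraction))
# ratio: NA, 2, 2
```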
4.4.11 Biggest jumps in a name
Previously, you added a ratio
column to describe the ratio of the frequency of a baby name between consecutive years to describe the changes in the popularity of a name. Now, you’ll look at a subset of that data, called babynames_ratios_filtered
, to look further into the names that experienced the biggest jumps in popularity in consecutive years.
babynames_ratios_filtered <- babynames_ratio %>%
  filter(fraction >= 0.00001)
babynames_ratios_filtered %>%
  # Extract the largest ratio from each name
  top_n(1, ratio) %>%
  # Sort the ratio column in descending order
  arrange(desc(ratio)) %>%
  # Filter for fractions greater than or equal to 0.001
  filter(fraction >= 0.001)
## # A tibble: 155 x 6
## # Groups: name [155]
## year name number year_total fraction ratio
## <dbl> <chr> <int> <int> <dbl> <dbl>
## 1 1957 Tammy 4398 4200007 0.00105 16.3
## 2 1912 Woodrow 1854 988064 0.00188 9.99
## 3 1931 Marlene 2599 2104071 0.00124 8.97
## 4 1898 Dewey 1219 381458 0.00320 6.48
## 5 2010 Bentley 4001 3690700 0.00108 6.21
## 6 1884 Grover 809 243462 0.00332 5.68
## 7 1984 Jenna 5898 3487820 0.00169 5.01
## 8 1991 Mariah 5200 3894329 0.00134 4.78
## 9 1943 Cheryl 2894 2822127 0.00103 4.75
## 10 1989 Ethan 4067 3843559 0.00106 4.37
## # ... with 145 more rows
Some of these can be interpreted: for example, Grover Cleveland was a president elected in 1884.
4.4.12 Video: Congratulations!
You’ll find all of these skills valuable in these other DataCamp courses:
- Exploratory Data Analysis in R: Case Study
- Working with Data in the Tidyverse
- Machine Learning in the Tidyverse
- Categorical Data in the Tidyverse
4.5 Challenge
4.6 Solutions
sw <- dplyr::starwars # Create sw dataframe
head(sw)
## # A tibble: 6 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke~ 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl~ red 33 <NA>
## 4 Dart~ 202 136 none white yellow 41.9 male
## 5 Leia~ 150 49 brown light brown 19 female
## 6 Owen~ 178 120 brown, gr~ light blue 52 male
## # ... with 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
4.6.1 Question 1
dim(sw) # 1.a.
## [1] 87 13
dim.sw <- nrow(sw)*ncol(sw) # 1.b.
4.6.2 Question 2
na.cells <- sum(is.na(sw)) # 2.a.
round(100*na.cells/dim.sw,1) # 2.b.
## [1] 8.9
colnames(sw[,colSums(is.na(sw)) == max(colSums(is.na(sw)))]) # 2.c.
## [1] "birth_year"
filter(sw, is.na(homeworld)==TRUE & (is.na(birth_year)==FALSE | is.na(mass)==FALSE)) # 2.d.
## # A tibble: 3 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Yoda 66 17 white green brown 896 male
## 2 IG-88 200 140 none metal red 15 none
## 3 Qui-~ 193 89 brown fair blue 92 male
## # ... with 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
4.6.3 Question 3
length(unique(sw$species)) # 3.
## [1] 38
4.6.4 Question 4
length(which(sw$species == "Human")) # 4.a.i
## [1] 35
sw %>% # 4.a.ii.
  filter(species=="Human") %>%
  group_by(gender) %>%
  count()
## # A tibble: 2 x 2
## # Groups: gender [2]
## gender n
## <chr> <int>
## 1 female 9
## 2 male 26
filter(sw, is.na(gender) | (gender != "male" & gender != "female")) %>% # 4.b.
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 6
sw %>% # 4.c.
  group_by(species, gender) %>% # Creates piles of similarly sexed members of the same species
  count() %>% # Produces a table of tallies of the above groups
  group_by(species) %>% # Groups within this table of tallies
  count() %>% # Produces a tally of the numbers of members of the groups above
  filter(n > 1)
## # A tibble: 5 x 2
## # Groups: species [5]
## species n
## <chr> <int>
## 1 Droid 2
## 2 Human 2
## 3 Kaminoan 2
## 4 Twi'lek 2
## 5 <NA> 2
4.6.5 Question 5
tb <- sw %>% # 5.a.
  group_by(gender) %>%
  count()
my_cols <- unikn::usecol(c(Pinky, Seegruen, Seeblau, Grau))
ggplot(sw, aes(x = gender)) + # 5.b.
  geom_bar(aes(fill = gender)) +
  labs(title = "Sex distribution of the sw universe",
       tag = "5.b.",
       x = "Gender",
       y = "Frequency",
       caption = "Using raw data sw.") +
  scale_fill_manual(name = "Gender:", values = my_cols, na.value = "black") +
  ds4psy::theme_ds4psy()
ggplot(tb, aes(x = gender, y = n)) + # 5.c., adding y = n to the aes()
  geom_bar(aes(fill = gender), stat = "identity") + # Adding stat = "identity" to geom_bar()
  labs(title = "Sex distribution of the sw universe",
       tag = "5.c.",
       x = "Gender",
       y = "Frequency",
       caption = "Using raw data sw.") +
  scale_fill_manual(name = "Gender:", values = my_cols, na.value = "black") +
  ds4psy::theme_ds4psy()
4.6.6 Question 6
sw %>% # 6.a.
  group_by(homeworld) %>%
  count() %>%
  arrange(desc(n))
## # A tibble: 49 x 2
## # Groups: homeworld [49]
## homeworld n
## <chr> <int>
## 1 Naboo 11
## 2 Tatooine 10
## 3 <NA> 10
## 4 Alderaan 3
## 5 Coruscant 3
## 6 Kamino 3
## 7 Corellia 2
## 8 Kashyyyk 2
## 9 Mirial 2
## 10 Ryloth 2
## # ... with 39 more rows
sw %>% # 6.b.
  filter(homeworld == "Naboo", eye_color == "orange") %>%
  summarise(n = n(),
            mn_height = mean(height))
## # A tibble: 1 x 2
## n mn_height
## <int> <dbl>
## 1 3 209.
4.6.7 Question 7
sw %>% # 7.
  filter(species == "Droid") %>%
  summarise(n = n(),
            not_NA_h = sum(!is.na(height)),
            md_height = median(height, na.rm = TRUE),
            mn_height = mean(height, na.rm = TRUE),
            sd_height = sd(height, na.rm = TRUE))
## # A tibble: 1 x 5
## n not_NA_h md_height mn_height sd_height
## <int> <int> <dbl> <dbl> <dbl>
## 1 5 4 132 140 52.0
4.6.8 Question 8
h_m <- sw %>% # 8.a.
  group_by(species) %>%
  summarise(mn_height = mean(height, na.rm = TRUE),
            mn_mass = mean(mass, na.rm = TRUE))
h_m %>% # 8.b.
  arrange(mn_height) %>%
  slice(1:3)
## # A tibble: 3 x 3
## species mn_height mn_mass
## <chr> <dbl> <dbl>
## 1 Yoda's species 66 17
## 2 Aleena 79 15
## 3 Ewok 88 20
h_m %>% # 8.c.
  arrange(desc(mn_mass)) %>%
  slice(1:3)
## # A tibble: 3 x 3
## species mn_height mn_mass
## <chr> <dbl> <dbl>
## 1 Hutt 175 1358
## 2 Kaleesh 216 159
## 3 Wookiee 231 124
4.6.9 Question 9
sw %>% # 9.a.
  filter(!is.na(species)) %>%
  group_by(species) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  slice(1:3)
## # A tibble: 3 x 2
## species n
## <chr> <int>
## 1 Human 35
## 2 Droid 5
## 3 Gungan 3
sw %>% # 9.b.
  select(name, homeworld, mass) %>%
  group_by(homeworld) %>%
  mutate(mn_mass = mean(mass, na.rm = TRUE),
         lighter = mass < (mn_mass * 0.8)) %>%
  filter(lighter == TRUE)
## # A tibble: 5 x 5
## # Groups: homeworld [4]
## name homeworld mass mn_mass lighter
## <chr> <chr> <dbl> <dbl> <lgl>
## 1 R2-D2 Naboo 32 64.2 TRUE
## 2 Leia Organa Alderaan 49 64 TRUE
## 3 R5-D4 Tatooine 32 85.4 TRUE
## 4 Yoda <NA> 17 82 TRUE
## 5 Padmé Amidala Naboo 45 64.2 TRUE