本节使用数据集starwars。
#加载包dplyr
library(dplyr)
starwars
# A tibble: 87 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Leia Or… 150 49 brown light brown 19 fema… femin…
6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
7 Beru Wh… 165 75 brown light blue 47 fema… femin…
8 R5-D4 97 32 <NA> white, red red NA none mascu…
9 Biggs D… 183 84 black light brown 24 male mascu…
10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
1 dplyr函数概况
1.1 dplyr函数的分类
所有的dplyr函数,第1项参数都是数据框(data frame 或者 tibble)
dplyr提供了一系列管理数据的函数,根据管理的对象不同,可以将这些函数分为对列的管理、对行的管理、对行组的管理三类。
- 列的管理 Columns:
select()
changes whether or not a column is included.rename()
changes the name of columns.mutate()
changes the values of columns and creates new columns.relocate()
changes the order of the columns.
- 行的管理 Rows:
filter()
chooses rows based on column values.slice()
chooses rows based on location.arrange()
changes the order of the rows.
- 行组的管理Groups of rows:
summarise()
collapses a group into a single row.count()
top_n()
1.2 dplyr函数的共性 Commonalities
- 第1项参数是data frame.
- 第2项参数描述动作。
- 输出结果是一个新的data frame.
2 列的管理Column
2.1 select()
- contains()
- starts_with()
- ends_with()
- last_col()
- ?select_helpers
#names()列出数据框所有变量名
names(starwars)
[1] "name" "height" "mass" "hair_color" "skin_color"
[6] "eye_color" "birth_year" "sex" "gender" "homeworld"
[11] "species" "films" "vehicles" "starships"
#选择需要的变量,创建新的数据框
<- starwars %>% select(
starwars_selected
name, height, mass, sex, species, films) starwars_selected
# A tibble: 87 × 6
name height mass sex species films
<chr> <int> <dbl> <chr> <chr> <list>
1 Luke Skywalker 172 77 male Human <chr [5]>
2 C-3PO 167 75 none Droid <chr [6]>
3 R2-D2 96 32 none Droid <chr [7]>
4 Darth Vader 202 136 male Human <chr [4]>
5 Leia Organa 150 49 female Human <chr [5]>
6 Owen Lars 178 120 male Human <chr [3]>
7 Beru Whitesun lars 165 75 female Human <chr [3]>
8 R5-D4 97 32 none Droid <chr [1]>
9 Biggs Darklighter 183 84 male Human <chr [1]>
10 Obi-Wan Kenobi 182 77 male Human <chr [6]>
# ℹ 77 more rows
#选择一个范围,变量名1:变量2
<- starwars %>%
data1 select(name:mass)
data1
# A tibble: 87 × 3
name height mass
<chr> <int> <dbl>
1 Luke Skywalker 172 77
2 C-3PO 167 75
3 R2-D2 96 32
4 Darth Vader 202 136
5 Leia Organa 150 49
6 Owen Lars 178 120
7 Beru Whitesun lars 165 75
8 R5-D4 97 32
9 Biggs Darklighter 183 84
10 Obi-Wan Kenobi 182 77
# ℹ 77 more rows
#设置条件 contains()
<- starwars %>%
data2 select(name, species, contains("color"))
data2
# A tibble: 87 × 5
name species hair_color skin_color eye_color
<chr> <chr> <chr> <chr> <chr>
1 Luke Skywalker Human blond fair blue
2 C-3PO Droid <NA> gold yellow
3 R2-D2 Droid <NA> white, blue red
4 Darth Vader Human none white yellow
5 Leia Organa Human brown light brown
6 Owen Lars Human brown, grey light blue
7 Beru Whitesun lars Human brown light blue
8 R5-D4 Droid <NA> white, red red
9 Biggs Darklighter Human black light brown
10 Obi-Wan Kenobi Human auburn, white fair blue-gray
# ℹ 77 more rows
2.2 mutate()
在数据集中追加新的变量。
#生成新变量
%>%
starwars_selected mutate(BMI = mass/((height/100)^2)) %>%
arrange(desc(BMI))
# A tibble: 87 × 7
name height mass sex species films BMI
<chr> <int> <dbl> <chr> <chr> <list> <dbl>
1 Jabba Desilijic Tiure 175 1358 hermaphroditic Hutt <chr> 443.
2 Dud Bolt 94 45 male Vulptereen <chr> 50.9
3 Yoda 66 17 male Yoda's species <chr> 39.0
4 Owen Lars 178 120 male Human <chr> 37.9
5 IG-88 200 140 none Droid <chr> 35
6 R2-D2 96 32 none Droid <chr> 34.7
7 Grievous 216 159 male Kaleesh <chr> 34.1
8 R5-D4 97 32 none Droid <chr> 34.0
9 Jek Tono Porkins 180 110 male Human <chr> 34.0
10 Darth Vader 202 136 male Human <chr> 33.3
# ℹ 77 more rows
starwars_selected
# A tibble: 87 × 6
name height mass sex species films
<chr> <int> <dbl> <chr> <chr> <list>
1 Luke Skywalker 172 77 male Human <chr [5]>
2 C-3PO 167 75 none Droid <chr [6]>
3 R2-D2 96 32 none Droid <chr [7]>
4 Darth Vader 202 136 male Human <chr [4]>
5 Leia Organa 150 49 female Human <chr [5]>
6 Owen Lars 178 120 male Human <chr [3]>
7 Beru Whitesun lars 165 75 female Human <chr [3]>
8 R5-D4 97 32 none Droid <chr [1]>
9 Biggs Darklighter 183 84 male Human <chr [1]>
10 Obi-Wan Kenobi 182 77 male Human <chr [6]>
# ℹ 77 more rows
%>%
starwars_selected mutate(wt_rank = min_rank(desc(mass))) %>%
arrange((wt_rank))
# A tibble: 87 × 7
name height mass sex species films wt_rank
<chr> <int> <dbl> <chr> <chr> <list> <int>
1 Jabba Desilijic Tiure 175 1358 hermaphroditic Hutt <chr> 1
2 Grievous 216 159 male Kaleesh <chr> 2
3 IG-88 200 140 none Droid <chr> 3
4 Darth Vader 202 136 male Human <chr> 4
5 Tarfful 234 136 male Wookiee <chr> 4
6 Owen Lars 178 120 male Human <chr> 6
7 Bossk 190 113 male Trandoshan <chr> 7
8 Chewbacca 228 112 male Wookiee <chr> 8
9 Jek Tono Porkins 180 110 male Human <chr> 9
10 Dexter Jettster 198 102 male Besalisk <chr> 10
# ℹ 77 more rows
#overwrite the variable
%>%
starwars_selected mutate(height = height/100) %>%
arrange(desc(height))
# A tibble: 87 × 6
name height mass sex species films
<chr> <dbl> <dbl> <chr> <chr> <list>
1 Yarael Poof 2.64 NA male Quermian <chr [1]>
2 Tarfful 2.34 136 male Wookiee <chr [1]>
3 Lama Su 2.29 88 male Kaminoan <chr [1]>
4 Chewbacca 2.28 112 male Wookiee <chr [5]>
5 Roos Tarpals 2.24 82 male Gungan <chr [1]>
6 Grievous 2.16 159 male Kaleesh <chr [1]>
7 Taun We 2.13 NA female Kaminoan <chr [1]>
8 Rugor Nass 2.06 NA male Gungan <chr [1]>
9 Tion Medon 2.06 80 male Pau'an <chr [1]>
10 Darth Vader 2.02 136 male Human <chr [4]>
# ℹ 77 more rows
#everything() selects all variable.
%>%
starwars mutate(height_m = height / 100) %>%
select(height_m, height, everything())
# A tibble: 87 × 15
height_m height name mass hair_color skin_color eye_color birth_year sex
<dbl> <int> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 1.72 172 Luke … 77 blond fair blue 19 male
2 1.67 167 C-3PO 75 <NA> gold yellow 112 none
3 0.96 96 R2-D2 32 <NA> white, bl… red 33 none
4 2.02 202 Darth… 136 none white yellow 41.9 male
5 1.5 150 Leia … 49 brown light brown 19 fema…
6 1.78 178 Owen … 120 brown, gr… light blue 52 male
7 1.65 165 Beru … 75 brown light blue 47 fema…
8 0.97 97 R5-D4 32 <NA> white, red red NA none
9 1.83 183 Biggs… 84 black light brown 24 male
10 1.82 182 Obi-W… 77 auburn, w… fair blue-gray 57 male
# ℹ 77 more rows
# ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
# films <list>, vehicles <list>, starships <list>
2.3 rename()
#重命名变量,rename(新名=旧名)
%>%
starwars_selected rename(weight = mass)
# A tibble: 87 × 6
name height weight sex species films
<chr> <int> <dbl> <chr> <chr> <list>
1 Luke Skywalker 172 77 male Human <chr [5]>
2 C-3PO 167 75 none Droid <chr [6]>
3 R2-D2 96 32 none Droid <chr [7]>
4 Darth Vader 202 136 male Human <chr [4]>
5 Leia Organa 150 49 female Human <chr [5]>
6 Owen Lars 178 120 male Human <chr [3]>
7 Beru Whitesun lars 165 75 female Human <chr [3]>
8 R5-D4 97 32 none Droid <chr [1]>
9 Biggs Darklighter 183 84 male Human <chr [1]>
10 Obi-Wan Kenobi 182 77 male Human <chr [6]>
# ℹ 77 more rows
#在select 的时候重命名
<- starwars %>%
data3 select(name, height, weight = mass)
data3
# A tibble: 87 × 3
name height weight
<chr> <int> <dbl>
1 Luke Skywalker 172 77
2 C-3PO 167 75
3 R2-D2 96 32
4 Darth Vader 202 136
5 Leia Organa 150 49
6 Owen Lars 178 120
7 Beru Whitesun lars 165 75
8 R5-D4 97 32
9 Biggs Darklighter 183 84
10 Obi-Wan Kenobi 182 77
# ℹ 77 more rows
2.4 transmute()
transmute()是select()和mutate()的结合。
Transmute is like a combination of select and mutate: you are getting back just a subset of the columns, but you are transforming and changing them at the same time.
<- starwars %>%
data4 filter(height != "NA") %>%
transmute(name, height, mass, bmi = mass/((height/100)^2))
data4
# A tibble: 81 × 4
name height mass bmi
<chr> <int> <dbl> <dbl>
1 Luke Skywalker 172 77 26.0
2 C-3PO 167 75 26.9
3 R2-D2 96 32 34.7
4 Darth Vader 202 136 33.3
5 Leia Organa 150 49 21.8
6 Owen Lars 178 120 37.9
7 Beru Whitesun lars 165 75 27.5
8 R5-D4 97 32 34.0
9 Biggs Darklighter 183 84 25.1
10 Obi-Wan Kenobi 182 77 23.2
# ℹ 71 more rows
2.5 relocate()
调整列的排列顺序。
%>% relocate(sex:homeworld, .before = height) starwars
# A tibble: 87 × 14
name sex gender homeworld height mass hair_color skin_color eye_color
<chr> <chr> <chr> <chr> <int> <dbl> <chr> <chr> <chr>
1 Luke Sky… male mascu… Tatooine 172 77 blond fair blue
2 C-3PO none mascu… Tatooine 167 75 <NA> gold yellow
3 R2-D2 none mascu… Naboo 96 32 <NA> white, bl… red
4 Darth Va… male mascu… Tatooine 202 136 none white yellow
5 Leia Org… fema… femin… Alderaan 150 49 brown light brown
6 Owen Lars male mascu… Tatooine 178 120 brown, gr… light blue
7 Beru Whi… fema… femin… Tatooine 165 75 brown light blue
8 R5-D4 none mascu… Tatooine 97 32 <NA> white, red red
9 Biggs Da… male mascu… Tatooine 183 84 black light brown
10 Obi-Wan … male mascu… Stewjon 182 77 auburn, w… fair blue-gray
# ℹ 77 more rows
# ℹ 5 more variables: birth_year <dbl>, species <chr>, films <list>,
# vehicles <list>, starships <list>
starwars
# A tibble: 87 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Leia Or… 150 49 brown light brown 19 fema… femin…
6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
7 Beru Wh… 165 75 brown light blue 47 fema… femin…
8 R5-D4 97 32 <NA> white, red red NA none mascu…
9 Biggs D… 183 84 black light brown 24 male mascu…
10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
%>%
starwars select(name, species, everything())
# A tibble: 87 × 14
name species height mass hair_color skin_color eye_color birth_year sex
<chr> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 Luke S… Human 172 77 blond fair blue 19 male
2 C-3PO Droid 167 75 <NA> gold yellow 112 none
3 R2-D2 Droid 96 32 <NA> white, bl… red 33 none
4 Darth … Human 202 136 none white yellow 41.9 male
5 Leia O… Human 150 49 brown light brown 19 fema…
6 Owen L… Human 178 120 brown, gr… light blue 52 male
7 Beru W… Human 165 75 brown light blue 47 fema…
8 R5-D4 Droid 97 32 <NA> white, red red NA none
9 Biggs … Human 183 84 black light brown 24 male
10 Obi-Wa… Human 182 77 auburn, w… fair blue-gray 57 male
# ℹ 77 more rows
# ℹ 5 more variables: gender <chr>, homeworld <chr>, films <list>,
# vehicles <list>, starships <list>
starwars
# A tibble: 87 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Leia Or… 150 49 brown light brown 19 fema… femin…
6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
7 Beru Wh… 165 75 brown light blue 47 fema… femin…
8 R5-D4 97 32 <NA> white, red red NA none mascu…
9 Biggs D… 183 84 black light brown 24 male mascu…
10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
2.6 if_else()
对变量值进行条件转换。
#加载包tidyr,drop_na()属于包tidyr
library(tidyr)
<- starwars_selected %>%
size_of_weight drop_na(mass) %>%
mutate(weight_size = if_else(mass >100,
"large",
"small"))
size_of_weight
# A tibble: 59 × 7
name height mass sex species films weight_size
<chr> <int> <dbl> <chr> <chr> <list> <chr>
1 Luke Skywalker 172 77 male Human <chr [5]> small
2 C-3PO 167 75 none Droid <chr [6]> small
3 R2-D2 96 32 none Droid <chr [7]> small
4 Darth Vader 202 136 male Human <chr [4]> large
5 Leia Organa 150 49 female Human <chr [5]> small
6 Owen Lars 178 120 male Human <chr [3]> large
7 Beru Whitesun lars 165 75 female Human <chr [3]> small
8 R5-D4 97 32 none Droid <chr [1]> small
9 Biggs Darklighter 183 84 male Human <chr [1]> small
10 Obi-Wan Kenobi 182 77 male Human <chr [6]> small
# ℹ 49 more rows
2.7 recode()
对变量值进行重新编码。
# overwrite the original variable.
%>%
size_of_weight mutate(weight_size = recode(weight_size,
"large" =1,
"small" =2))
# A tibble: 59 × 7
name height mass sex species films weight_size
<chr> <int> <dbl> <chr> <chr> <list> <dbl>
1 Luke Skywalker 172 77 male Human <chr [5]> 2
2 C-3PO 167 75 none Droid <chr [6]> 2
3 R2-D2 96 32 none Droid <chr [7]> 2
4 Darth Vader 202 136 male Human <chr [4]> 1
5 Leia Organa 150 49 female Human <chr [5]> 2
6 Owen Lars 178 120 male Human <chr [3]> 1
7 Beru Whitesun lars 165 75 female Human <chr [3]> 2
8 R5-D4 97 32 none Droid <chr [1]> 2
9 Biggs Darklighter 183 84 male Human <chr [1]> 2
10 Obi-Wan Kenobi 182 77 male Human <chr [6]> 2
# ℹ 49 more rows
size_of_weight
# A tibble: 59 × 7
name height mass sex species films weight_size
<chr> <int> <dbl> <chr> <chr> <list> <chr>
1 Luke Skywalker 172 77 male Human <chr [5]> small
2 C-3PO 167 75 none Droid <chr [6]> small
3 R2-D2 96 32 none Droid <chr [7]> small
4 Darth Vader 202 136 male Human <chr [4]> large
5 Leia Organa 150 49 female Human <chr [5]> small
6 Owen Lars 178 120 male Human <chr [3]> large
7 Beru Whitesun lars 165 75 female Human <chr [3]> small
8 R5-D4 97 32 none Droid <chr [1]> small
9 Biggs Darklighter 183 84 male Human <chr [1]> small
10 Obi-Wan Kenobi 182 77 male Human <chr [6]> small
# ℹ 49 more rows
2.8 duplicated()
# Create a Data Frame
<- data.frame(x1 = c(1:5),
df6 x2 = letters[1:5],
x3 = seq(2,10,2),
x1 = LETTERS[22:26],
check.names = FALSE)
df6
x1 x2 x3 x1
1 1 a 2 V
2 2 b 4 W
3 3 c 6 X
4 4 d 8 Y
5 5 e 10 Z
# Find Duplicate Column Names
<- duplicated(colnames(df6))
duplicated_names
# Remove Duplicate Column Names
!duplicated_names] df6[
x1 x2 x3
1 1 a 2
2 2 b 4
3 3 c 6
4 4 d 8
5 5 e 10
2.9 as.factor()
将字符串转换为因子。
%>%
starwars_selected mutate(vore = as.factor(species)) %>%
glimpse()
Rows: 87
Columns: 7
$ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organ…
$ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 228,…
$ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.0, …
$ sex <chr> "male", "none", "none", "male", "female", "male", "female", "n…
$ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Human",…
$ films <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return of…
$ vore <fct> Human, Droid, Droid, Human, Human, Human, Human, Droid, Human,…
3 行的管理Rows
3.1 arrange()
将个案按某个变量排序。
#升序
%>% arrange(height) starwars_selected
# A tibble: 87 × 6
name height mass sex species films
<chr> <int> <dbl> <chr> <chr> <list>
1 Yoda 66 17 male Yoda's species <chr [5]>
2 Ratts Tyerell 79 15 male Aleena <chr [1]>
3 Wicket Systri Warrick 88 20 male Ewok <chr [1]>
4 Dud Bolt 94 45 male Vulptereen <chr [1]>
5 R2-D2 96 32 none Droid <chr [7]>
6 R4-P17 96 NA none Droid <chr [2]>
7 R5-D4 97 32 none Droid <chr [1]>
8 Sebulba 112 40 male Dug <chr [1]>
9 Gasgano 122 NA male Xexto <chr [1]>
10 Watto 137 NA male Toydarian <chr [2]>
# ℹ 77 more rows
#降序 desc()
%>%
starwars_selected arrange(desc(height))
# A tibble: 87 × 6
name height mass sex species films
<chr> <int> <dbl> <chr> <chr> <list>
1 Yarael Poof 264 NA male Quermian <chr [1]>
2 Tarfful 234 136 male Wookiee <chr [1]>
3 Lama Su 229 88 male Kaminoan <chr [1]>
4 Chewbacca 228 112 male Wookiee <chr [5]>
5 Roos Tarpals 224 82 male Gungan <chr [1]>
6 Grievous 216 159 male Kaleesh <chr [1]>
7 Taun We 213 NA female Kaminoan <chr [1]>
8 Rugor Nass 206 NA male Gungan <chr [1]>
9 Tion Medon 206 80 male Pau'an <chr [1]>
10 Darth Vader 202 136 male Human <chr [4]>
# ℹ 77 more rows
# 查看前6行
head(starwars_selected)
# A tibble: 6 × 6
name height mass sex species films
<chr> <int> <dbl> <chr> <chr> <list>
1 Luke Skywalker 172 77 male Human <chr [5]>
2 C-3PO 167 75 none Droid <chr [6]>
3 R2-D2 96 32 none Droid <chr [7]>
4 Darth Vader 202 136 male Human <chr [4]>
5 Leia Organa 150 49 female Human <chr [5]>
6 Owen Lars 178 120 male Human <chr [3]>
# see a few values from all the columns
glimpse(starwars_selected)
Rows: 87
Columns: 6
$ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organ…
$ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 228,…
$ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.0, …
$ sex <chr> "male", "none", "none", "male", "female", "male", "female", "n…
$ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Human",…
$ films <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return of…
3.2 filter()
提取满足条件的个案。
#列出变量的取值种类
unique(starwars$species)
[1] "Human" "Droid" "Wookiee" "Rodian"
[5] "Hutt" "Yoda's species" "Trandoshan" "Mon Calamari"
[9] "Ewok" "Sullustan" "Neimodian" "Gungan"
[13] NA "Toydarian" "Dug" "Zabrak"
[17] "Twi'lek" "Vulptereen" "Xexto" "Toong"
[21] "Cerean" "Nautolan" "Tholothian" "Iktotchi"
[25] "Quermian" "Kel Dor" "Chagrian" "Geonosian"
[29] "Mirialan" "Clawdite" "Besalisk" "Kaminoan"
[33] "Aleena" "Skakoan" "Muun" "Togruta"
[37] "Kaleesh" "Pau'an"
#提取species == "Droid"的个案
%>%
starwars_selected filter(species == "Droid") %>%
arrange(height)
# A tibble: 6 × 6
name height mass sex species films
<chr> <int> <dbl> <chr> <chr> <list>
1 R2-D2 96 32 none Droid <chr [7]>
2 R4-P17 96 NA none Droid <chr [2]>
3 R5-D4 97 32 none Droid <chr [1]>
4 C-3PO 167 75 none Droid <chr [6]>
5 IG-88 200 140 none Droid <chr [1]>
6 BB8 NA NA none Droid <chr [1]>
#提取height < 200的个案
%>%
starwars_selected arrange(desc(height)) %>%
filter(height < 200)
# A tibble: 70 × 6
name height mass sex species films
<chr> <int> <dbl> <chr> <chr> <list>
1 Ki-Adi-Mundi 198 82 male Cerean <chr [3]>
2 Dexter Jettster 198 102 male Besalisk <chr [1]>
3 Jar Jar Binks 196 66 male Gungan <chr [2]>
4 Kit Fisto 196 87 male Nautolan <chr [3]>
5 Mas Amedda 196 NA male Chagrian <chr [2]>
6 Qui-Gon Jinn 193 89 male Human <chr [1]>
7 Dooku 193 80 male Human <chr [2]>
8 Wat Tambor 193 48 male Skakoan <chr [1]>
9 Nute Gunray 191 90 male Neimodian <chr [3]>
10 Bail Prestor Organa 191 NA male Human <chr [2]>
# ℹ 60 more rows
#用逗号分隔多个提取条件
%>%
starwars_selected filter(height < 200, species == "Human") %>%
arrange(desc(height))
# A tibble: 30 × 6
name height mass sex species films
<chr> <int> <dbl> <chr> <chr> <list>
1 Qui-Gon Jinn 193 89 male Human <chr [1]>
2 Dooku 193 80 male Human <chr [2]>
3 Bail Prestor Organa 191 NA male Human <chr [2]>
4 Anakin Skywalker 188 84 male Human <chr [3]>
5 Mace Windu 188 84 male Human <chr [3]>
6 Raymus Antilles 188 79 male Human <chr [2]>
7 Gregar Typho 185 85 male Human <chr [1]>
8 Biggs Darklighter 183 84 male Human <chr [1]>
9 Boba Fett 183 78.2 male Human <chr [3]>
10 Cliegg Lars 183 NA male Human <chr [1]>
# ℹ 20 more rows
starwars_selected
# A tibble: 87 × 6
name height mass sex species films
<chr> <int> <dbl> <chr> <chr> <list>
1 Luke Skywalker 172 77 male Human <chr [5]>
2 C-3PO 167 75 none Droid <chr [6]>
3 R2-D2 96 32 none Droid <chr [7]>
4 Darth Vader 202 136 male Human <chr [4]>
5 Leia Organa 150 49 female Human <chr [5]>
6 Owen Lars 178 120 male Human <chr [3]>
7 Beru Whitesun lars 165 75 female Human <chr [3]>
8 R5-D4 97 32 none Droid <chr [1]>
9 Biggs Darklighter 183 84 male Human <chr [1]>
10 Obi-Wan Kenobi 182 77 male Human <chr [6]>
# ℹ 77 more rows
%>%
starwars_selected filter((species == "Human"|species == "Droid") &
< 200 ) %>%
height arrange(desc(height))
# A tibble: 34 × 6
name height mass sex species films
<chr> <int> <dbl> <chr> <chr> <list>
1 Qui-Gon Jinn 193 89 male Human <chr [1]>
2 Dooku 193 80 male Human <chr [2]>
3 Bail Prestor Organa 191 NA male Human <chr [2]>
4 Anakin Skywalker 188 84 male Human <chr [3]>
5 Mace Windu 188 84 male Human <chr [3]>
6 Raymus Antilles 188 79 male Human <chr [2]>
7 Gregar Typho 185 85 male Human <chr [1]>
8 Biggs Darklighter 183 84 male Human <chr [1]>
9 Boba Fett 183 78.2 male Human <chr [3]>
10 Cliegg Lars 183 NA male Human <chr [1]>
# ℹ 24 more rows
starwars_selected
# A tibble: 87 × 6
name height mass sex species films
<chr> <int> <dbl> <chr> <chr> <list>
1 Luke Skywalker 172 77 male Human <chr [5]>
2 C-3PO 167 75 none Droid <chr [6]>
3 R2-D2 96 32 none Droid <chr [7]>
4 Darth Vader 202 136 male Human <chr [4]>
5 Leia Organa 150 49 female Human <chr [5]>
6 Owen Lars 178 120 male Human <chr [3]>
7 Beru Whitesun lars 165 75 female Human <chr [3]>
8 R5-D4 97 32 none Droid <chr [1]>
9 Biggs Darklighter 183 84 male Human <chr [1]>
10 Obi-Wan Kenobi 182 77 male Human <chr [6]>
# ℹ 77 more rows
#%in% 属于
%>%
starwars_selected filter((species %in% c("Human","Droid")) &
< 200 ) %>%
height arrange(desc(height))
# A tibble: 34 × 6
name height mass sex species films
<chr> <int> <dbl> <chr> <chr> <list>
1 Qui-Gon Jinn 193 89 male Human <chr [1]>
2 Dooku 193 80 male Human <chr [2]>
3 Bail Prestor Organa 191 NA male Human <chr [2]>
4 Anakin Skywalker 188 84 male Human <chr [3]>
5 Mace Windu 188 84 male Human <chr [3]>
6 Raymus Antilles 188 79 male Human <chr [2]>
7 Gregar Typho 185 85 male Human <chr [1]>
8 Biggs Darklighter 183 84 male Human <chr [1]>
9 Boba Fett 183 78.2 male Human <chr [3]>
10 Cliegg Lars 183 NA male Human <chr [1]>
# ℹ 24 more rows
starwars_selected
# A tibble: 87 × 6
name height mass sex species films
<chr> <int> <dbl> <chr> <chr> <list>
1 Luke Skywalker 172 77 male Human <chr [5]>
2 C-3PO 167 75 none Droid <chr [6]>
3 R2-D2 96 32 none Droid <chr [7]>
4 Darth Vader 202 136 male Human <chr [4]>
5 Leia Organa 150 49 female Human <chr [5]>
6 Owen Lars 178 120 male Human <chr [3]>
7 Beru Whitesun lars 165 75 female Human <chr [3]>
8 R5-D4 97 32 none Droid <chr [1]>
9 Biggs Darklighter 183 84 male Human <chr [1]>
10 Obi-Wan Kenobi 182 77 male Human <chr [6]>
# ℹ 77 more rows
#filter(!is.na(height))保留height没有缺失值的个案
%>%
starwars_selected arrange(desc(height)) %>%
filter(!is.na(height))
# A tibble: 81 × 6
name height mass sex species films
<chr> <int> <dbl> <chr> <chr> <list>
1 Yarael Poof 264 NA male Quermian <chr [1]>
2 Tarfful 234 136 male Wookiee <chr [1]>
3 Lama Su 229 88 male Kaminoan <chr [1]>
4 Chewbacca 228 112 male Wookiee <chr [5]>
5 Roos Tarpals 224 82 male Gungan <chr [1]>
6 Grievous 216 159 male Kaleesh <chr [1]>
7 Taun We 213 NA female Kaminoan <chr [1]>
8 Rugor Nass 206 NA male Gungan <chr [1]>
9 Tion Medon 206 80 male Pau'an <chr [1]>
10 Darth Vader 202 136 male Human <chr [4]>
# ℹ 71 more rows
3.3 slice()
slice() lets you index rows by their (integer) locations. It allows you to select, remove, and duplicate rows.
slice_min() and slice_max() select rows with highest or lowest values of a variable.
#提取第5到第10行
%>% slice(5:10) starwars
# A tibble: 6 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Leia Org… 150 49 brown light brown 19 fema… femin…
2 Owen Lars 178 120 brown, gr… light blue 52 male mascu…
3 Beru Whi… 165 75 brown light blue 47 fema… femin…
4 R5-D4 97 32 <NA> white, red red NA none mascu…
5 Biggs Da… 183 84 black light brown 24 male mascu…
6 Obi-Wan … 182 77 auburn, w… fair blue-gray 57 male mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
#提取前3行
%>% slice_head(n = 3) starwars
# A tibble: 3 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sky… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
#提取后3行
%>% slice_tail(n = 3) starwars
# A tibble: 3 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 BB8 NA NA none none black NA none mascu…
2 Captain … NA NA unknown unknown unknown NA <NA> <NA>
3 Padmé Am… 165 45 brown light brown 46 fema… femin…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
#随机抽样3行
%>% slice_sample(n = 5) starwars
# A tibble: 5 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Cordé 157 NA brown light brown NA fema… femin…
2 Dud Bolt 94 45 none blue, grey yellow NA male mascu…
3 Dormé 165 NA brown light brown NA fema… femin…
4 Ayla Sec… 178 55 none blue hazel 48 fema… femin…
5 Captain … NA NA unknown unknown unknown NA <NA> <NA>
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
#随机抽样比例0.1
%>% slice_sample(prop = 0.1) starwars
# A tibble: 8 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Tion Med… 206 80 none grey black NA male mascu…
2 Finis Va… 170 NA blond fair blue 91 male mascu…
3 Nute Gun… 191 90 none mottled g… red NA male mascu…
4 San Hill 191 NA none grey gold NA male mascu…
5 Yoda 66 17 white green brown 896 male mascu…
6 Gregar T… 185 85 black dark brown NA male mascu…
7 Finn NA NA black dark dark NA male mascu…
8 Dooku 193 80 white fair brown 102 male mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
#切取某个变量最大或最小的几个个案
%>%
starwars filter(!is.na(height)) %>%
slice_max(height, n = 3)
# A tibble: 3 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Yarael P… 264 NA none white yellow NA male mascu…
2 Tarfful 234 136 brown brown blue NA male mascu…
3 Lama Su 229 88 none grey black NA male mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
3.4 distinct()
剔除重复的行。
## 剔除所有列的值都相同的行
<- data.frame(ID =c(1,2,3,4,2),
df4 name = c("John","Bill","Jim","Kate", "Bill"))
df4
ID name
1 1 John
2 2 Bill
3 3 Jim
4 4 Kate
5 2 Bill
<- distinct(df4)
df4 df4
ID name
1 1 John
2 2 Bill
3 3 Jim
4 4 Kate
## 剔除某一列的值相同的行
<- data.frame(ID =c(1,2,3,4,2),
df5 name = c("John","Bill","Jim","Kate", "Alex"))
df5
ID name
1 1 John
2 2 Bill
3 3 Jim
4 4 Kate
5 2 Alex
<- distinct(df5, ID,.keep_all = TRUE)
df5 df5
ID name
1 1 John
2 2 Bill
3 3 Jim
4 4 Kate
unique(df4)
ID name
1 1 John
2 2 Bill
3 3 Jim
4 4 Kate
4 分组和概要 groupby and summarize
4.1 count()
The simple way you can aggregate data is to count it: to find out the number of observations.
%>%
starwars count()
# A tibble: 1 × 1
n
<int>
1 87
#按性别计数
%>%
starwars count(sex)
# A tibble: 5 × 2
sex n
<chr> <int>
1 female 16
2 hermaphroditic 1
3 male 60
4 none 6
5 <NA> 4
#计数和排序
%>%
starwars count(sex, sort = TRUE)
# A tibble: 5 × 2
sex n
<chr> <int>
1 male 60
2 female 16
3 none 6
4 <NA> 4
5 hermaphroditic 1
#按性别分组,各组个案的体重之和
%>%
starwars count(sex, wt = mass, sort = TRUE)
# A tibble: 5 × 2
sex n
<chr> <dbl>
1 male 3564.
2 hermaphroditic 1358
3 female 492.
4 none 279
5 <NA> 48
4.2 group_by()
对个案分组。
4.3 summarize()
summarise() and summarize() are synonyms.
summary functions
- sum()
- mean()
- median()
- min()
- max()
- n()
%>%
starwars_selected filter(height != "NA", mass != "NA", sex != "NA") %>%
group_by(sex) %>%
summarize(number = n(), median_height = median(height),
median_weight = median(mass)) %>%
arrange(desc(number))
# A tibble: 4 × 4
sex number median_height median_weight
<chr> <int> <dbl> <dbl>
1 male 44 183 80
2 female 9 168 55
3 none 4 132 53.5
4 hermaphroditic 1 175 1358
4.4 top_n()
%>%
starwars_selected filter(height != "NA", mass != "NA", sex != "NA") %>%
group_by(sex) %>%
top_n(1,height)
# A tibble: 4 × 6
# Groups: sex [4]
name height mass sex species films
<chr> <int> <dbl> <chr> <chr> <list>
1 Jabba Desilijic Tiure 175 1358 hermaphroditic Hutt <chr [3]>
2 IG-88 200 140 none Droid <chr [1]>
3 Adi Gallia 184 50 female Tholothian <chr [2]>
4 Tarfful 234 136 male Wookiee <chr [1]>
%>%
starwars_selected filter(height != "NA", mass != "NA", sex != "NA") %>%
group_by(sex) %>%
top_n(2,mass) %>%
arrange(sex)
# A tibble: 8 × 6
# Groups: sex [4]
name height mass sex species films
<chr> <int> <dbl> <chr> <chr> <list>
1 Beru Whitesun lars 165 75 female Human <chr [3]>
2 Shaak Ti 178 57 female Togruta <chr [2]>
3 Jabba Desilijic Tiure 175 1358 hermaphroditic Hutt <chr [3]>
4 Darth Vader 202 136 male Human <chr [4]>
5 Grievous 216 159 male Kaleesh <chr [1]>
6 Tarfful 234 136 male Wookiee <chr [1]>
7 C-3PO 167 75 none Droid <chr [6]>
8 IG-88 200 140 none Droid <chr [1]>
4.5 缺失值 Missing data
#有missing data, 无法计算均值
mean(starwars$height)
[1] NA
mean(starwars$height,na.rm = TRUE)
[1] 174.358
%>%
starwars select(name, gender, hair_color, height) %>%
na.omit()
# A tibble: 73 × 4
name gender hair_color height
<chr> <chr> <chr> <int>
1 Luke Skywalker masculine blond 172
2 Darth Vader masculine none 202
3 Leia Organa feminine brown 150
4 Owen Lars masculine brown, grey 178
5 Beru Whitesun lars feminine brown 165
6 Biggs Darklighter masculine black 183
7 Obi-Wan Kenobi masculine auburn, white 182
8 Anakin Skywalker masculine blond 188
9 Wilhuff Tarkin masculine auburn, grey 180
10 Chewbacca masculine brown 228
# ℹ 63 more rows
starwars
# A tibble: 87 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Leia Or… 150 49 brown light brown 19 fema… femin…
6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
7 Beru Wh… 165 75 brown light blue 47 fema… femin…
8 R5-D4 97 32 <NA> white, red red NA none mascu…
9 Biggs D… 183 84 black light brown 24 male mascu…
10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
# 查看有缺失值的个案
%>%
starwars select(name, gender, hair_color, height) %>%
filter(!complete.cases(.))
# A tibble: 14 × 4
name gender hair_color height
<chr> <chr> <chr> <int>
1 C-3PO masculine <NA> 167
2 R2-D2 masculine <NA> 96
3 R5-D4 masculine <NA> 97
4 Greedo masculine <NA> 173
5 Jabba Desilijic Tiure masculine <NA> 175
6 Arvel Crynyd masculine brown NA
7 Ric Olié <NA> brown 183
8 Quarsh Panaka <NA> black 183
9 Sly Moore <NA> none 178
10 Finn masculine black NA
11 Rey feminine brown NA
12 Poe Dameron masculine brown NA
13 BB8 masculine none NA
14 Captain Phasma <NA> unknown NA
starwars
# A tibble: 87 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Leia Or… 150 49 brown light brown 19 fema… femin…
6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
7 Beru Wh… 165 75 brown light blue 47 fema… femin…
8 R5-D4 97 32 <NA> white, red red NA none mascu…
9 Biggs D… 183 84 black light brown 24 male mascu…
10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
%>%
starwars select(name, gender, hair_color, height) %>%
filter(!complete.cases(.)) %>%
drop_na(height)
# A tibble: 8 × 4
name gender hair_color height
<chr> <chr> <chr> <int>
1 C-3PO masculine <NA> 167
2 R2-D2 masculine <NA> 96
3 R5-D4 masculine <NA> 97
4 Greedo masculine <NA> 173
5 Jabba Desilijic Tiure masculine <NA> 175
6 Ric Olié <NA> brown 183
7 Quarsh Panaka <NA> black 183
8 Sly Moore <NA> none 178
starwars
# A tibble: 87 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Leia Or… 150 49 brown light brown 19 fema… femin…
6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
7 Beru Wh… 165 75 brown light blue 47 fema… femin…
8 R5-D4 97 32 <NA> white, red red NA none mascu…
9 Biggs D… 183 84 black light brown 24 male mascu…
10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
%>%
starwars select(name, gender, hair_color, height) %>%
filter(complete.cases(.)) %>%
mutate(hair_color2 = replace_na(hair_color, "none"))
# A tibble: 73 × 5
name gender hair_color height hair_color2
<chr> <chr> <chr> <int> <chr>
1 Luke Skywalker masculine blond 172 blond
2 Darth Vader masculine none 202 none
3 Leia Organa feminine brown 150 brown
4 Owen Lars masculine brown, grey 178 brown, grey
5 Beru Whitesun lars feminine brown 165 brown
6 Biggs Darklighter masculine black 183 black
7 Obi-Wan Kenobi masculine auburn, white 182 auburn, white
8 Anakin Skywalker masculine blond 188 blond
9 Wilhuff Tarkin masculine auburn, grey 180 auburn, grey
10 Chewbacca masculine brown 228 brown
# ℹ 63 more rows
starwars
# A tibble: 87 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Leia Or… 150 49 brown light brown 19 fema… femin…
6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
7 Beru Wh… 165 75 brown light blue 47 fema… femin…
8 R5-D4 97 32 <NA> white, red red NA none mascu…
9 Biggs D… 183 84 black light brown 24 male mascu…
10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
5 总结
These five functions provide the basis of a language of data manipulation.
- reorder the rows (arrange())
- pick observations and variables of interest (filter() and select())
- add new variables that are functions of existing variables (mutate())
- collapse many values to a summary (summarise()).
5.1 本章作业
数据集:mpg {ggplot2}
用select()函数从mpg提取5个变量⽣成新的数据框。
用mutate()函数在数据框中追加新的变量,将油耗变量cty和hwy(miles per gallon)转换成转公里/升(kilometers per liter)的油耗指标。
任选mpg中的某个变量,用if_else()函数对该变量的数值进⾏条件转换。
设置三个筛选条件,用filter()函数从mpg筛选个案⽣成新的数据框。
提交要求:R script