Lecture 5: Data Management

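Code

Teaching videos

The select() and mutate() functions

Recoding variable values conditionally with if_else()

Filtering cases with filter()

How to drop cases that contain missing values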
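
The solutions further down do not cover the last video topic, dropping cases with missing values, so here is a minimal sketch of one common approach. It assumes the tidyr and dplyr packages; the `survey` data frame and its columns are hypothetical toy data, not part of the course material.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with missing values in two columns
survey <- tibble::tibble(
  id     = 1:4,
  age    = c(23, NA, 31, 40),
  income = c(50, 62, NA, 48)
)

# Drop every case that has a missing value in ANY column
complete_all <- survey %>% drop_na()

# Drop only the cases where `age` is missing
complete_age <- survey %>% drop_na(age)

# The same selection written with filter()
complete_age_filter <- survey %>% filter(!is.na(age))
```

Restricting drop_na() to named columns keeps cases that are incomplete only on variables the analysis does not use.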

Exercises

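Exercise 1: the ggplot2::mpg dataset

Hint: load the tidyverse package and run data(mpg) to make the mpg data frame available.

1.1 Use select() to extract five variables from mpg into a new data frame.

1.2 Use mutate() to append new variables to the data frame, converting the fuel-economy variables cty and hwy (miles per gallon) to kilometers per liter.

1.3 Pick any variable in mpg and use if_else() to recode its values conditionally.

1.4 Set three filter conditions and use filter() to select the matching cases from mpg, saving them to a new data frame.

Submit your R script.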
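
Exercise 2: the nycflights13::flights dataset

Hint: load the nycflights13 package and run data(flights) to make the flights data frame available.

2.1 Based on distance (in miles), classify flights as short-, medium-, or long-haul. Create a new variable distance_group with the values "short", "medium", and "long", then count the flights in each category.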

  • Short-haul flight: flight distance under 500 miles.

  • Medium-haul flight: flight distance between 500 and 1,550 miles.

  • Long-haul flight: flight distance over 1,550 miles.
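
The three-way rule in 2.1 can be sketched with case_when(). This is a minimal illustration on a hypothetical toy data frame (`toy_flights`), not an official solution; it assumes that "between 500 and 1,550 miles" includes both endpoints, which the exercise does not state explicitly.

```r
library(dplyr)

# Hypothetical toy data; with the real data, start from nycflights13::flights
toy_flights <- tibble::tibble(
  distance = c(200, 499, 500, 1000, 1550, 1551, 2475)
)

toy_flights <- toy_flights %>%
  mutate(distance_group = case_when(
    distance < 500   ~ "short",   # under 500 miles
    distance <= 1550 ~ "medium",  # 500 to 1,550 miles (endpoints assumed inclusive)
    distance > 1550  ~ "long"     # over 1,550 miles
  ))

# Number of flights in each category
group_counts <- toy_flights %>% count(distance_group)
group_counts
```

case_when() evaluates its conditions in order, so the second branch only sees distances of 500 miles or more. A missing distance matches no branch and stays NA.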

2.2 If you want to avoid delays as much as possible, at what time of day should your flight depart?

2.3 Which airline (carrier) has the longest average delay?
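
Questions 2.2 and 2.3 are both grouped summaries. The sketch below shows the pattern on a hypothetical toy data frame that borrows the flights column names hour, carrier, and dep_delay. Measuring delay by dep_delay (rather than arr_delay) is an assumption; the exercise does not say which to use.

```r
library(dplyr)

# Hypothetical toy data; with the real data, start from nycflights13::flights
toy_flights <- tibble::tibble(
  hour      = c(6, 6, 18, 18, 18),
  carrier   = c("AA", "UA", "AA", "UA", "UA"),
  dep_delay = c(-2, 4, 30, NA, 50)
)

# 2.2: mean departure delay by scheduled departure hour, smallest first
delay_by_hour <- toy_flights %>%
  group_by(hour) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE), n = n()) %>%
  arrange(mean_delay)

# 2.3: mean departure delay by carrier, largest first
delay_by_carrier <- toy_flights %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(mean_delay))
```

Setting na.rm = TRUE matters here: without it, a single missing delay makes the mean of the whole group NA.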

Solutions

Exercise 1

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(gt)

1.1 Use select() to extract five variables from mpg into a new data frame.

#library(gt)
data(mpg)

mpg %>% 
  select(manufacturer:cyl) %>% 
  head() # show the first 6 rows
# A tibble: 6 × 5
  manufacturer model displ  year   cyl
  <chr>        <chr> <dbl> <int> <int>
1 audi         a4      1.8  1999     4
2 audi         a4      1.8  1999     4
3 audi         a4      2    2008     4
4 audi         a4      2    2008     4
5 audi         a4      2.8  1999     6
6 audi         a4      2.8  1999     6
mpg %>% 
  select(1:5) %>% 
  head()
# A tibble: 6 × 5
  manufacturer model displ  year   cyl
  <chr>        <chr> <dbl> <int> <int>
1 audi         a4      1.8  1999     4
2 audi         a4      1.8  1999     4
3 audi         a4      2    2008     4
4 audi         a4      2    2008     4
5 audi         a4      2.8  1999     6
6 audi         a4      2.8  1999     6
mpg %>% 
  select(4,5,7:9) %>% 
  head()
# A tibble: 6 × 5
   year   cyl drv     cty   hwy
  <int> <int> <chr> <int> <int>
1  1999     4 f        18    29
2  1999     4 f        21    29
3  2008     4 f        20    31
4  2008     4 f        21    30
5  1999     6 f        16    26
6  1999     6 f        18    26
mpg %>% 
  select(year,cyl,drv:hwy) %>% 
  head()
# A tibble: 6 × 5
   year   cyl drv     cty   hwy
  <int> <int> <chr> <int> <int>
1  1999     4 f        18    29
2  1999     4 f        21    29
3  2008     4 f        20    31
4  2008     4 f        21    30
5  1999     6 f        16    26
6  1999     6 f        18    26
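
Besides positions and names, select() also accepts selection helpers. The sketch below is an optional variation on the answers above, not part of the original solution; it loads mpg straight from ggplot2 so that it runs on its own.

```r
library(dplyr)
data(mpg, package = "ggplot2")

# Select by type: model plus every integer column
by_type <- mpg %>% select(model, where(is.integer))
names(by_type)

# Select by name pattern: model, every column starting with "c", and hwy
by_name <- mpg %>% select(model, starts_with("c"), hwy)
names(by_name)
```

Helpers keep working if columns are reordered, which position-based calls such as select(1:5) do not.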

1.2 Use mutate() to append new variables to the data frame, converting the fuel-economy variables cty and hwy (miles per gallon) to kilometers per liter. (1 mile per gallon ≈ 0.425 kilometers per liter)

mpgnew <- mpg %>% mutate(cty.kpl = 0.425*cty, 
               hwy.kpl = 0.425*hwy)

head(mpgnew) %>% gt()
manufacturer  model  displ  year  cyl  trans       drv  cty  hwy  fl  class    cty.kpl  hwy.kpl
audi          a4     1.8    1999  4    auto(l5)    f    18   29   p   compact  7.650    12.325
audi          a4     1.8    1999  4    manual(m5)  f    21   29   p   compact  8.925    12.325
audi          a4     2.0    2008  4    manual(m6)  f    20   31   p   compact  8.500    13.175
audi          a4     2.0    2008  4    auto(av)    f    21   30   p   compact  8.925    12.750
audi          a4     2.8    1999  6    auto(l5)    f    16   26   p   compact  6.800    11.050
audi          a4     2.8    1999  6    manual(m5)  f    18   26   p   compact  7.650    11.050
mpg$hwy.kpl <- 0.425*mpg$hwy
mpg$cty.kpl <- 0.425*mpg$cty

head(mpg) %>% gt()
manufacturer  model  displ  year  cyl  trans       drv  cty  hwy  fl  class    hwy.kpl  cty.kpl
audi          a4     1.8    1999  4    auto(l5)    f    18   29   p   compact  12.325   7.650
audi          a4     1.8    1999  4    manual(m5)  f    21   29   p   compact  12.325   8.925
audi          a4     2.0    2008  4    manual(m6)  f    20   31   p   compact  13.175   8.500
audi          a4     2.0    2008  4    auto(av)    f    21   30   p   compact  12.750   8.925
audi          a4     2.8    1999  6    auto(l5)    f    16   26   p   compact  11.050   6.800
audi          a4     2.8    1999  6    manual(m5)  f    18   26   p   compact  11.050   7.650
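
The factor 0.425 is a rounded value. A short check of where it comes from, assuming US liquid gallons (an assumption, though a natural one for US fuel-economy figures), followed by an optional across() variant that converts both columns in one step:

```r
library(dplyr)
data(mpg, package = "ggplot2")

km_per_mile      <- 1.609344     # kilometers in one international mile
liter_per_gallon <- 3.785411784  # liters in one US liquid gallon (assumed unit)

# miles/gallon * (km/mile) / (liter/gallon) = km/liter
mpg_to_kpl <- km_per_mile / liter_per_gallon
round(mpg_to_kpl, 4)

# Convert cty and hwy together; .names builds cty.kpl and hwy.kpl
converted <- mpg %>%
  mutate(across(c(cty, hwy), ~ .x * mpg_to_kpl, .names = "{.col}.kpl"))
```

Using the unrounded factor changes the results only from the fourth significant digit onward, so the tables above remain a fair approximation.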

1.3 Pick any variable in mpg and use if_else() to recode its values conditionally.

mpgnew <- mpg %>% 
  mutate(transmission = 
           if_else(substring(trans, 1,4) == "auto", 
                   "auto","manual"))

head(mpgnew) %>% gt()
manufacturer  model  displ  year  cyl  trans       drv  cty  hwy  fl  class    hwy.kpl  cty.kpl  transmission
audi          a4     1.8    1999  4    auto(l5)    f    18   29   p   compact  12.325   7.650    auto
audi          a4     1.8    1999  4    manual(m5)  f    21   29   p   compact  12.325   8.925    manual
audi          a4     2.0    2008  4    manual(m6)  f    20   31   p   compact  13.175   8.500    manual
audi          a4     2.0    2008  4    auto(av)    f    21   30   p   compact  12.750   8.925    auto
audi          a4     2.8    1999  6    auto(l5)    f    16   26   p   compact  11.050   6.800    auto
audi          a4     2.8    1999  6    manual(m5)  f    18   26   p   compact  11.050   7.650    manual
# two-way recode with if_else()

mpg$transmission <- if_else(
  substring(mpg$trans, 1,4) == "auto", 
  "auto","manual") 

# multi-way recode with case_when()
mpg$drive <- case_when(
  mpg$drv == "f" ~ "front-wheel",
  mpg$drv == "r" ~ "rear-wheel",
  mpg$drv == "4" ~ "four-wheel")


head(mpg) %>% gt()
manufacturer  model  displ  year  cyl  trans       drv  cty  hwy  fl  class    hwy.kpl  cty.kpl  transmission  drive
audi          a4     1.8    1999  4    auto(l5)    f    18   29   p   compact  12.325   7.650    auto          front-wheel
audi          a4     1.8    1999  4    manual(m5)  f    21   29   p   compact  12.325   8.925    manual        front-wheel
audi          a4     2.0    2008  4    manual(m6)  f    20   31   p   compact  13.175   8.500    manual        front-wheel
audi          a4     2.0    2008  4    auto(av)    f    21   30   p   compact  12.750   8.925    auto          front-wheel
audi          a4     2.8    1999  6    auto(l5)    f    16   26   p   compact  11.050   6.800    auto          front-wheel
audi          a4     2.8    1999  6    manual(m5)  f    18   26   p   compact  11.050   7.650    manual        front-wheel
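
One detail of case_when() worth knowing: a value that matches none of the conditions becomes NA. The sketch below shows this on a hypothetical vector of drive codes ("x" is an invented unexpected code) and uses the `.default` argument, which assumes dplyr 1.1.0 or later, to give unmatched values an explicit label.

```r
library(dplyr)

# Hypothetical codes; "x" stands in for an unexpected value
drv_codes <- c("f", "r", "4", "x")

# Without a default, the unmatched "x" becomes NA
drive_na <- case_when(
  drv_codes == "f" ~ "front-wheel",
  drv_codes == "r" ~ "rear-wheel",
  drv_codes == "4" ~ "four-wheel"
)

# With .default, unmatched values get an explicit label instead
drive_default <- case_when(
  drv_codes == "f" ~ "front-wheel",
  drv_codes == "r" ~ "rear-wheel",
  drv_codes == "4" ~ "four-wheel",
  .default = "other"
)
```

In mpg itself drv only takes the values f, r, and 4, so the solution above produces no NA; the default matters when recoding messier data.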

1.4 Set three filter conditions and use filter() to select the matching cases from mpg into a new data frame.

# the & operator combines conditions; a row is kept only if all of them are TRUE.

mpg %>% 
  filter(class == "suv" & 
           cyl == 8 & 
          year == 2008) %>%
  head() %>%
  gt()
manufacturer  model               displ  year  cyl  trans     drv  cty  hwy  fl  class  hwy.kpl  cty.kpl  transmission  drive
chevrolet     c1500 suburban 2wd  5.3    2008  8    auto(l4)  r    14   20   r   suv    8.500    5.950    auto          rear-wheel
chevrolet     c1500 suburban 2wd  5.3    2008  8    auto(l4)  r    11   15   e   suv    6.375    4.675    auto          rear-wheel
chevrolet     c1500 suburban 2wd  5.3    2008  8    auto(l4)  r    14   20   r   suv    8.500    5.950    auto          rear-wheel
chevrolet     c1500 suburban 2wd  6.0    2008  8    auto(l4)  r    12   17   r   suv    7.225    5.100    auto          rear-wheel
chevrolet     k1500 tahoe 4wd     5.3    2008  8    auto(l4)  4    14   19   r   suv    8.075    5.950    auto          four-wheel
chevrolet     k1500 tahoe 4wd     5.3    2008  8    auto(l4)  4    11   14   e   suv    5.950    4.675    auto          four-wheel
mpg %>% 
  filter(substr(trans, 1, 4) == "auto" & 
        class %in% c("compact", "subcompact") &
          !fl %in% c("c" ,"d" ,"e")) %>%
  head() %>%
  gt()
manufacturer  model       displ  year  cyl  trans     drv  cty  hwy  fl  class    hwy.kpl  cty.kpl  transmission  drive
audi          a4          1.8    1999  4    auto(l5)  f    18   29   p   compact  12.325   7.650    auto          front-wheel
audi          a4          2.0    2008  4    auto(av)  f    21   30   p   compact  12.750   8.925    auto          front-wheel
audi          a4          2.8    1999  6    auto(l5)  f    16   26   p   compact  11.050   6.800    auto          front-wheel
audi          a4          3.1    2008  6    auto(av)  f    18   27   p   compact  11.475   7.650    auto          front-wheel
audi          a4 quattro  1.8    1999  4    auto(l5)  4    16   25   p   compact  10.625   6.800    auto          four-wheel
audi          a4 quattro  2.0    2008  4    auto(s6)  4    19   27   p   compact  11.475   8.075    auto          four-wheel
# the | operator combines conditions; a row is kept if at least one of them is TRUE.

mpg %>% 
  filter(class == "suv" | 
           cyl == 8 |
          year == 2008)  %>% 
  head() %>% 
  gt()
manufacturer  model       displ  year  cyl  trans       drv  cty  hwy  fl  class    hwy.kpl  cty.kpl  transmission  drive
audi          a4          2.0    2008  4    manual(m6)  f    20   31   p   compact  13.175   8.500    manual        front-wheel
audi          a4          2.0    2008  4    auto(av)    f    21   30   p   compact  12.750   8.925    auto          front-wheel
audi          a4          3.1    2008  6    auto(av)    f    18   27   p   compact  11.475   7.650    auto          front-wheel
audi          a4 quattro  2.0    2008  4    manual(m6)  4    20   28   p   compact  11.900   8.500    manual        four-wheel
audi          a4 quattro  2.0    2008  4    auto(s6)    4    19   27   p   compact  11.475   8.075    auto          four-wheel
audi          a4 quattro  3.1    2008  6    auto(s6)    4    17   25   p   compact  10.625   7.225    auto          four-wheel
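
As a side note to the answers above, filter() treats comma-separated conditions as if they were joined by &. A minimal, self-contained check of this, which also saves the result to a new data frame as the exercise asks (the object names are illustrative):

```r
library(dplyr)
data(mpg, package = "ggplot2")

# Three conditions joined with &
suv_and <- mpg %>% filter(class == "suv" & cyl == 8 & year == 2008)

# The same three conditions separated by commas
suv_comma <- mpg %>% filter(class == "suv", cyl == 8, year == 2008)

# Both forms keep exactly the same rows
all.equal(suv_and, suv_comma)

# Joining with | keeps any row that meets at least one condition,
# so it can never return fewer rows than the & version
suv_or <- mpg %>% filter(class == "suv" | cyl == 8 | year == 2008)
c(and = nrow(suv_and), or = nrow(suv_or))
```

Because the printed tables above pass through head(), they show at most six rows; nrow() reports the full size of each filtered result.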

Further resources

R for Data Science, 2nd edition: https://r4ds.hadley.nz/data-tidy

R code style

How to quickly preview variable distributions in R: summarytools::dfSummary()

R code style: how to write clean, readable code