1 Data Types

1.1 Numeric

Numeric: Numbers with a decimal or fractional part are stored as the numeric data type. A whole number written without the L suffix, such as 2, is also numeric.

a <- 1.2
a
[1] 1.2
# class() shows the data type of a
class(a)
[1] "numeric"

1.2 Integer

Integer: Whole numbers without a decimal part can be stored as the integer data type. R does not create integers by default: a literal such as 2 is numeric, so you must either convert explicitly with as.integer() or append the L suffix to the literal.

# as.integer() truncates toward zero rather than rounding
int <- as.integer(2.7)
print(int)
[1] 2
class(int)
[1] "integer"
# The L suffix marks the literal as an integer
b <- 2L
b
[1] 2
class(b)
[1] "integer"
c <- 2
c
[1] 2
class(c)
[1] "numeric"
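
The difference between truncating and rounding, and between the two numeric storage types, can be checked with a short sketch. The variable names below are illustrative and not part of the original notes.

```r
# as.integer() truncates toward zero; round() rounds to the nearest whole number
trunc_pos <- as.integer(2.7)    # 2
trunc_neg <- as.integer(-2.7)   # -2
rounded   <- round(2.7)         # 3, but still stored as a double

# class() reports "numeric" for doubles; typeof() shows the storage type
typeof(2L)        # "integer"
typeof(2)         # "double"
typeof(rounded)   # "double"

# is.numeric() is TRUE for both integers and doubles
is.numeric(2L)    # TRUE
is.numeric(2)     # TRUE
```

If the goal is a rounded integer rather than a truncated one, as.integer(round(x)) combines the two steps.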

1.3 Character

Character: Any text enclosed in quotes, whether letters, digits, or a mix of both, is treated by R as the character data type.

d <- "hello"
d
[1] "hello"
class(d)
[1] "character"
e <- "5"
e
[1] "5"
class(e)
[1] "character"
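
Because "5" is text, it cannot be used in arithmetic until it is converted. The following is a minimal sketch of that conversion; e_num and bad are illustrative names, not from the original notes.

```r
e <- "5"
# e + 1 would stop with "non-numeric argument to binary operator"
e_num <- as.numeric(e)   # convert the string to a number
e_num + 1                # 6
class(e_num)             # "numeric"

# Text that does not look like a number becomes NA (with a warning)
bad <- suppressWarnings(as.numeric("hello"))
is.na(bad)               # TRUE

# nchar() counts characters; paste() joins strings with a space by default
nchar("hello")           # 5
paste("hello", "world")  # "hello world"
```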

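1.4 Logical

Logical: A variable that takes the value TRUE or FALSE (a Boolean) is a logical variable. T and F are built-in shorthands, but unlike TRUE and FALSE they are ordinary variables that can be reassigned, so the full names are safer.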

f <- TRUE
g <- F
print(c(f,g))
[1]  TRUE FALSE
class(c(f,g))
[1] "logical"
# Summing a logical vector counts the TRUE values; its mean is the proportion of TRUE values
f1 <- c(TRUE, FALSE, TRUE, TRUE, TRUE)
sum(f1)
[1] 4
mean(f1)
[1] 0.8
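
Logical vectors usually come from comparisons rather than being typed by hand. As a small sketch (scores and passed are made-up illustrative data), the same sum() and mean() idiom then counts how many values meet a condition:

```r
scores <- c(60, 70, 75, 80, 85, 90, 95)   # illustrative data
passed <- scores >= 80                    # comparison yields a logical vector

passed          # FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
sum(passed)     # 4 scores are at least 80
mean(passed)    # 4/7, the share of scores that are at least 80

any(passed)     # TRUE: at least one score qualifies
all(passed)     # FALSE: not every score qualifies
```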

1.5 Factor

Factor: Factors represent categorical (qualitative) data such as colors, good/bad, or course and movie ratings.

1.5.1 Unordered Factor

The levels have no inherent order (nominal, categorical data). By default R sorts the levels alphabetically.

group <- factor(c("red", "blue", "yellow","white"))
group
[1] red    blue   yellow white 
Levels: blue red white yellow

1.5.2 Ordered Factor

The levels have a meaningful order (ordinal data). Check that the level order makes sense.

# Create a factor with the wrong order of levels
sizes <- factor(c("small", "large", "large", "small", "medium"))
sizes
[1] small  large  large  small  medium
Levels: large medium small
class(sizes)
[1] "factor"
# levels() shows the factor levels
levels(sizes)
[1] "large"  "medium" "small" 
# levels = c() sets the order of the factor levels
sizes <- factor(sizes, levels = c("small", "medium", "large"))
sizes
[1] small  large  large  small  medium
Levels: small medium large
# ordered() creates an ordered factor and sets the order of its levels
sizes <- ordered(c("small", "large", "large", "small", "medium"))
sizes <- ordered(sizes, levels = c("small", "medium", "large"))
sizes
[1] small  large  large  small  medium
Levels: small < medium < large
# Number of factor levels
nlevels(sizes)
[1] 3
class(levels(sizes))
[1] "character"
# List all objects in the current workspace
ls()
 [1] "a"     "b"     "c"     "d"     "e"     "f"     "f1"    "g"     "group"
[10] "int"   "sizes"
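
What the level order buys you can be seen in a short sketch. It rebuilds the ordered factor in a single call under the illustrative name sizes_ord, so it does not depend on the objects created above.

```r
sizes_ord <- factor(c("small", "large", "large", "small", "medium"),
                    levels = c("small", "medium", "large"),
                    ordered = TRUE)

# Counts are reported in level order, not alphabetical order
table(sizes_ord)          # small 2, medium 1, large 2

# Internally a factor stores integer codes that index into its levels
as.integer(sizes_ord)     # 1 3 3 1 2

# Comparisons and max()/min() are only meaningful for ordered factors
sizes_ord < "large"       # TRUE FALSE FALSE  TRUE  TRUE
max(sizes_ord)            # large
```

With an unordered factor the comparison returns NA with a warning, which is one reason to declare the order explicitly for ordinal data.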

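2 Data Structures

A data structure describes how data is stored.

2.1 Vector

The atomic vector is the most basic structure for storing data in R. An atomic vector can only hold elements of a single type. Every vector has two key properties: its length and its type.

  • There are four commonly used types of atomic vector, listed in increasing order of generality (R also has complex and raw vectors):

    • logical: TRUE and FALSE

    • integer: whole numbers (written with an L suffix)

    • double: real numbers

    • character: text enclosed in quotes

  • Integer and double vectors are collectively called numeric.

  • R has no "scalar" data structure, only vectors of length 1.

  • A vector stores one-dimensional data, and all of its elements must have the same type.

  • c() creates a vector.

  • length() returns the length of a vector, i.e. the number of elements it contains.

  • Indexing: [] selects an element of a vector.

  • Slicing: [2:5] selects the 2nd through 5th elements.

  • Missing data is represented by NA.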

# Create a numeric vector
mark <- c(60,70,75,80,85,90,95)
mark
[1] 60 70 75 80 85 90 95
class(mark)
[1] "numeric"
length(mark)
[1] 7
# Select an element with []
mark[3]
[1] 75
# An index beyond the length of the vector returns NA
mark[8]
[1] NA
mark[2:4]
[1] 70 75 80
# Create a character vector
name <- c("alex","bill", "lily","harry","steve","john")
name
[1] "alex"  "bill"  "lily"  "harry" "steve" "john" 
class(name)
[1] "character"
# If c() receives both numbers and strings, the numbers are converted to strings
x1 <- c(90,"john")
x1
[1] "90"   "john"
class(x1)
[1] "character"
x2 <- c(90,TRUE)
x2
[1] 90  1
class(x2)
[1] "numeric"
x3 <- c(1:1024)
length(x3)
[1] 1024
# Create a sequence of odd numbers
x4 <- seq(1,20,2)
x4
 [1]  1  3  5  7  9 11 13 15 17 19
# A vector containing missing data
x5 <- c(1:10, NA)
x5
 [1]  1  2  3  4  5  6  7  8  9 10 NA
# Check whether the vector has missing values
is.na(x5)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
anyNA(x5)
[1] TRUE
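
A few further vector operations follow from the points above. This is a small sketch that reuses the mark and x5 values from this section; it shows negative and logical indexing, how NA propagates through a summary, and the coercion order logical < integer < double < character.

```r
mark <- c(60, 70, 75, 80, 85, 90, 95)

mark[-1]            # negative index drops an element: 70 75 80 85 90 95
mark[c(1, 7)]       # several positions at once: 60 95
mark[mark > 80]     # logical index keeps matching elements: 85 90 95

x5 <- c(1:10, NA)
mean(x5)                  # NA, because one value is missing
mean(x5, na.rm = TRUE)    # 5.5, missing values removed first
sum(is.na(x5))            # 1 missing value

# Mixed types are coerced to the most general type present
typeof(c(TRUE, 2L))    # "integer"
typeof(c(2L, 3.5))     # "double"
typeof(c(3.5, "a"))    # "character"
```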

2.2 Matrix

A matrix is a collection of data elements arranged in a two-dimensional rectangular layout.

  • A matrix stores elements of a single data type.

  • A matrix holds two-dimensional data.

  • Creating a matrix:

M <- matrix(vector, nrow=r, ncol=c, byrow=FALSE, dimnames=list(char_vector_rownames, char_vector_colnames))

  • byrow = FALSE (the default) fills the matrix column by column; byrow = TRUE fills it row by row.

# Create matrices
m1 <- matrix(c("alex","bill", "lily","harry","steve","john"),2,3)
m1
     [,1]   [,2]    [,3]   
[1,] "alex" "lily"  "steve"
[2,] "bill" "harry" "john" 
m2 <- matrix(c("alex","bill", "lily","harry","steve","john"),3,2)
m2
     [,1]   [,2]   
[1,] "alex" "harry"
[2,] "bill" "steve"
[3,] "lily" "john" 
m3 <- matrix(c("alex","bill", "lily","harry","steve","john"))
m3
     [,1]   
[1,] "alex" 
[2,] "bill" 
[3,] "lily" 
[4,] "harry"
[5,] "steve"
[6,] "john" 
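# Note: an unnamed T is matched by position to nrow (TRUE is coerced to 1), not to byrow,
# so the result is a 1 x 6 matrix. Write byrow = TRUE to fill by row.
m4 <- matrix(c("alex","bill", "lily","harry","steve","john"),T)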
m4
     [,1]   [,2]   [,3]   [,4]    [,5]    [,6]  
[1,] "alex" "bill" "lily" "harry" "steve" "john"
m5 <- matrix(c("alex","bill", "lily","harry","steve","john"), nrow = 2, ncol = 3, byrow = FALSE,
             dimnames = list(c("class A","class B"), c("group1","group2","group3")))
m5
        group1 group2  group3 
class A "alex" "lily"  "steve"
class B "bill" "harry" "john" 
# Extract elements of the matrix with [row, column]
m5[2,2]
[1] "harry"
m5[1:2,1:2]
        group1 group2 
class A "alex" "lily" 
class B "bill" "harry"
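
The effect of byrow is easiest to see with numbers. The sketch below uses illustrative names (v, by_col, by_row) and fills the same six values both ways.

```r
v <- 1:6

by_col <- matrix(v, nrow = 2, ncol = 3)                # default: fill by column
by_row <- matrix(v, nrow = 2, ncol = 3, byrow = TRUE)  # fill by row

by_col
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

by_row
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6

dim(by_col)     # 2 3 (rows, columns)
by_col[2, ]     # whole second row: 2 4 6
by_col[, 3]     # whole third column: 5 6
t(by_col)       # transpose: a 3 x 2 matrix
```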

2.3 List

A list can hold data of different types, including vectors, functions, matrices, and other lists.

Lists act as general-purpose containers.

x <- list(1, "a", TRUE, 1+4i)
class(x[[1]])
[1] "numeric"
# Name each object in the list
x <- list(x1 = c(1:10), x2 = "a", x3 = c(TRUE,FALSE), x4 = 1+4i)
x
$x1
 [1]  1  2  3  4  5  6  7  8  9 10

$x2
[1] "a"

$x3
[1]  TRUE FALSE

$x4
[1] 1+4i
names(x)
[1] "x1" "x2" "x3" "x4"
list1 <- 1:5
list1
[1] 1 2 3 4 5
list2 <- factor(1:5)
list2
[1] 1 2 3 4 5
Levels: 1 2 3 4 5
list3 <- letters[1:5]
list3
[1] "a" "b" "c" "d" "e"
combined_list <- list(list1, list2, list3)
combined_list
[[1]]
[1] 1 2 3 4 5

[[2]]
[1] 1 2 3 4 5
Levels: 1 2 3 4 5

[[3]]
[1] "a" "b" "c" "d" "e"
# Extract an object from the list with double brackets [[]]
combined_list[[2]]
[1] 1 2 3 4 5
Levels: 1 2 3 4 5
# Extract the 3rd element of the 2nd object: [[2]] selects the object, [3] the element

combined_list[[2]][3]
[1] 3
Levels: 1 2 3 4 5

Flattening the list: unlist() collapses a list into a single atomic vector. Because combined_list mixes numeric and character data, the character type takes precedence, and every element of the flattened vector becomes a character string.

flat_list <- unlist(combined_list)
class(flat_list)
[1] "character"
length(flat_list)
[1] 15
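
Single and double brackets behave differently on a list, which is a common source of confusion. The following sketch uses a named list similar to the one earlier in this section: [ ] returns a smaller list, while [[ ]] and $ return the object itself.

```r
x <- list(x1 = 1:10, x2 = "a", x3 = c(TRUE, FALSE))

class(x["x1"])     # "list": single brackets keep the list wrapper
class(x[["x1"]])   # "integer": double brackets return the stored vector
x$x1[3]            # 3, $name is shorthand for [["name"]]
x$x3[2]            # FALSE

length(x)          # 3 objects in the list
str(x)             # compact overview of every object and its type
```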

2.4 Data Frame

2.4.1 Data Frames vs. Matrices

All elements of a matrix must have the same type. In a data frame, each column has a single type but different columns can have different types, so a data frame can hold a richer mix of data.

2.4.2 Data Frames vs. Lists

A data frame is a special type of list in which every element has the same length (i.e. a data frame is a "rectangular" list).

A data frame holds data in tabular form. Its columns can have different data types: the first column can be character, the second integer, and the third logical.

  • A data frame is created with the data.frame() function.

  • Data frames are the usual result of reading comma-separated (CSV) files and other text files.

  • Data frames are useful for exploring data, data wrangling, plotting, and visualization.
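
Since a data frame is a list of equal-length columns, the list tools from section 2.3 apply to it directly. The sketch below uses a small illustrative data frame named df and assumes R 4.0 or later, where strings are not converted to factors by default.

```r
df <- data.frame(
  Person = c("Alex", "Bill", "John"),
  Age    = c(26, 26, 27)
)

is.list(df)          # TRUE: a data frame is a list underneath
length(df)           # 2, the number of columns (list elements)
names(df)            # "Person" "Age"

df[["Age"]]          # a column extracted like a list element: 26 26 27
df$Age               # the same column via $

sapply(df, class)    # the type of each column
```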

2.4.3 Data Frames vs. Tibbles

A tibble is a data structure from the tidyverse packages. It is simply a data frame (like an Excel table) with more informative printing than the default data frame: the printout shows the dimensions of the data and the type of each column.

Some R functions work only with data frames, only with tibbles, or only with matrices.

data.frame is the older structure; modern data analysis often uses the tibble, an enhanced data.frame.

# Create a data frame
team1_df <- data.frame(
   Person = c("Alex", "Bill","John"),
   Age = c(26, 26, 27),
   Weight = c(72, 65, 90),
   Height = c(175, 170, 180),
   Salary = c(8000, 6000, 7000),
   Sex = as.factor(c("Male","Male","Male"))
)
team1_df
  Person Age Weight Height Salary  Sex
1   Alex  26     72    175   8000 Male
2   Bill  26     65    170   6000 Male
3   John  27     90    180   7000 Male
# Extract an element of the data frame with [row, column]
team1_df[2,3]
[1] 65
# Create a tibble

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.1     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.1     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
team1_tbl <- tibble(
   Person = c("Alex", "Bill","John"),
   Age = c(26, 26, 27),
   Weight = c(72, 65, 90),
   Height = c(175, 170, 180),
   Salary = c(8000, 6000, 7000),
   Sex = as.factor(c("Male","Male","Male"))
)
team1_tbl
# A tibble: 3 × 6
  Person   Age Weight Height Salary Sex  
  <chr>  <dbl>  <dbl>  <dbl>  <dbl> <fct>
1 Alex      26     72    175   8000 Male 
2 Bill      26     65    170   6000 Male 
3 John      27     90    180   7000 Male 

2.4.4 Creating a Data Frame

2.4.4.1 data.frame()

team1 <- data.frame(
   Person = c("Alex", "Bill","John"),
   Age = c(26, 26, 27),
   Weight = c(72, 65, 90),
   Height = c(175, 170, 180),
   Salary = c(8000, 6000, 7000),
   Sex = as.factor(c("Male","Male","Male"))
)
team1
  Person Age Weight Height Salary  Sex
1   Alex  26     72    175   8000 Male
2   Bill  26     65    170   6000 Male
3   John  27     90    180   7000 Male
class(team1)
[1] "data.frame"
str(team1)
'data.frame':   3 obs. of  6 variables:
 $ Person: chr  "Alex" "Bill" "John"
 $ Age   : num  26 26 27
 $ Weight: num  72 65 90
 $ Height: num  175 170 180
 $ Salary: num  8000 6000 7000
 $ Sex   : Factor w/ 1 level "Male": 1 1 1
# Number of rows in the data frame
nrow(team1) 
[1] 3
# Number of columns in the data frame
ncol(team1)
[1] 6
team2 <- data.frame(
   Person = c("Lily", "Kate","Susan"),
   Age = c(25, 24, 26),
   Weight = c(45, 55, 48),
   Height = c(155, 160, 162),
   Salary = c(6500, 8200, 7000),
   Sex = as.factor(c("Female","Female","Female"))
)

# rbind() stacks two data frames by row (the column names must match)
df1 <- rbind(team1, team2)
df1
  Person Age Weight Height Salary    Sex
1   Alex  26     72    175   8000   Male
2   Bill  26     65    170   6000   Male
3   John  27     90    180   7000   Male
4   Lily  25     45    155   6500 Female
5   Kate  24     55    160   8200 Female
6  Susan  26     48    162   7000 Female
# cbind() joins two data frames side by side, by column (the row counts must match)
df2 <- cbind(team1, team2)
df2
  Person Age Weight Height Salary  Sex Person Age Weight Height Salary    Sex
1   Alex  26     72    175   8000 Male   Lily  25     45    155   6500 Female
2   Bill  26     65    170   6000 Male   Kate  24     55    160   8200 Female
3   John  27     90    180   7000 Male  Susan  26     48    162   7000 Female
# View the first 3 rows of the data frame
head(df1,3)
  Person Age Weight Height Salary  Sex
1   Alex  26     72    175   8000 Male
2   Bill  26     65    170   6000 Male
3   John  27     90    180   7000 Male
# View the last 3 rows of the data frame
tail(df1,3)
  Person Age Weight Height Salary    Sex
4   Lily  25     45    155   6500 Female
5   Kate  24     55    160   8200 Female
6  Susan  26     48    162   7000 Female
# View the data type of each column in the data frame
str(df1)
'data.frame':   6 obs. of  6 variables:
 $ Person: chr  "Alex" "Bill" "John" "Lily" ...
 $ Age   : num  26 26 27 25 24 26
 $ Weight: num  72 65 90 45 55 48
 $ Height: num  175 170 180 155 160 162
 $ Salary: num  8000 6000 7000 6500 8200 7000
 $ Sex   : Factor w/ 2 levels "Male","Female": 1 1 1 2 2 2
summary(df1)
    Person               Age            Weight          Height     
 Length:6           Min.   :24.00   Min.   :45.00   Min.   :155.0  
 Class :character   1st Qu.:25.25   1st Qu.:49.75   1st Qu.:160.5  
 Mode  :character   Median :26.00   Median :60.00   Median :166.0  
                    Mean   :25.67   Mean   :62.50   Mean   :167.0  
                    3rd Qu.:26.00   3rd Qu.:70.25   3rd Qu.:173.8  
                    Max.   :27.00   Max.   :90.00   Max.   :180.0  
     Salary         Sex   
 Min.   :6000   Male  :3  
 1st Qu.:6625   Female:3  
 Median :7000             
 Mean   :7117             
 3rd Qu.:7750             
 Max.   :8200             
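
Beyond whole-table summaries, individual columns and subsets of rows are often needed. The sketch below rebuilds a reduced version of df1 (four of its six columns) so that it runs on its own, then selects columns, filters rows with a logical condition, and computes a per-group mean with base R's tapply().

```r
df1 <- data.frame(
  Person = c("Alex", "Bill", "John", "Lily", "Kate", "Susan"),
  Age    = c(26, 26, 27, 25, 24, 26),
  Salary = c(8000, 6000, 7000, 6500, 8200, 7000),
  Sex    = factor(c("Male", "Male", "Male", "Female", "Female", "Female"))
)

df1$Salary                       # one column as a vector

# Rows where the condition is TRUE, all columns
older <- df1[df1$Age > 25, ]     # Alex, Bill, John, Susan

# Rows and columns selected together
df1[df1$Sex == "Female", c("Person", "Salary")]

# Mean salary within each level of Sex
by_sex <- tapply(df1$Salary, df1$Sex, mean)
by_sex                           # Female 7233.333, Male 7000
```

Note the trailing comma in df1[df1$Age > 25, ]: leaving the column position empty keeps every column.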

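3 Homework

3.1 Create a Data Frame

Requirements:

Observation units: 6 classmates

The data frame must include: name, height, weight, place of origin, sex, and the score for one course.

3.2 Create a List

  1. Create a list containing 3 objects whose data types are numeric, character, and logical, respectively.

  2. View the 2nd object in the list.

  3. View the data type of the 2nd object in the list.

  4. View the value of the 3rd element of the 2nd object in the list.

3.3 Explore starwars

Hint: enter starwars in the help search to view its documentation.

  1. How many rows does starwars have?
  2. How many columns does starwars have?
  3. What is the data type of each column in starwars?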