1 Data Types

Numeric
Integer
Character
Logical
Factor

1.1 Numeric

Numeric: Numbers with a decimal part are stored as numeric (double precision). Any number typed without the L suffix is numeric by default, even a whole number such as 2.

a <- 1.2
a

[1] 1.2

# class() returns the data type of a
class(a)

[1] "numeric"

1.2 Integer

Integer: Whole numbers can be stored as integers, but R does not do this automatically: a number typed as 2 is numeric. To create an integer, either convert a value with as.integer() or append the suffix L to a literal, as in 2L.

# as.integer() converts to integer by truncating toward zero, not by rounding
int <- as.integer(2.7)
print(int)

[1] 2

class(int)

[1] "integer"

# The L suffix marks a literal as an integer
b <- 2L
b

[1] 2

class(b)

[1] "integer"

c <- 2
c

[1] 2

class(c)

[1] "numeric"
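
As a quick check of the distinction above, typeof() reports the underlying storage type, which separates integer from double even though class() labels a double as "numeric". This is a minimal sketch in base R; the values are chosen only for illustration.

```r
# typeof() reports the storage type; class() reports "numeric" for doubles
typeof(2L)        # "integer"
typeof(2)         # "double"
class(2)          # "numeric"

# is.integer() tests the storage type, not whether the value is whole
is.integer(2)     # FALSE
is.integer(2L)    # TRUE

# as.integer() truncates toward zero rather than rounding
as.integer(-2.7)  # -2
round(2.7)        # 3 (still stored as a double)
```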

1.3 Character

Character: Any text enclosed in quotes, single or double, is a character value. The text may contain letters, digits, or other symbols, so "5" is a character value, not a number.

d <- "hello"
d

[1] "hello"

class(d)

[1] "character"

e <- "5"
e

[1] "5"

class(e)

[1] "character"

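1.4 Logical

Logical: A logical value is either TRUE or FALSE, like a Boolean in other languages. T and F are accepted as abbreviations, but they are ordinary variables that can be overwritten, so the full names are safer.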

f <- TRUE
g <- F
print(c(f,g))

[1]  TRUE FALSE

class(c(f,g))

[1] "logical"

# Summing a logical vector counts the TRUE values; taking its mean gives the proportion of TRUE values
f1 <- c(TRUE, FALSE, TRUE, TRUE, TRUE)
sum(f1)

[1] 4

mean(f1)

[1] 0.8
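
Because TRUE counts as 1 and FALSE as 0, the same trick works on the result of a comparison. A small sketch with made-up scores (the values and the threshold of 80 are assumptions for illustration):

```r
# A comparison yields a logical vector, so sum() counts the matches
sum(c(60, 70, 75, 80, 85, 90, 95) >= 80)   # 4 values are at least 80

# mean() of the same comparison gives the proportion of matches
mean(c(60, 70, 75, 80, 85, 90, 95) >= 80)  # 4/7, about 0.571
```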

1.5 Factor

Factor: A factor stores categorical (qualitative) data that takes a limited set of values, called levels, such as colors, good/bad, or course and movie ratings.

1.5.1 Unordered Factor

Unordered: the levels are categories with no inherent ranking (nominal data).

group <- factor(c("red", "blue", "yellow","white"))
group

[1] red    blue   yellow white 
Levels: blue red white yellow

1.5.2 Ordered Factor

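Ordered: the levels have a meaningful ranking (ordinal data). By default R sorts levels alphabetically, so check that the level order makes sense.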

# Create a factor with the wrong order of levels
sizes <- factor(c("small", "large", "large", "small", "medium"))
sizes

[1] small  large  large  small  medium
Levels: large medium small

class(sizes)

[1] "factor"

# levels() returns the levels of a factor
levels(sizes)

[1] "large"  "medium" "small"

# levels = c() sets the order of the factor levels
sizes <- factor(sizes, levels = c("small", "medium", "large"))
sizes

[1] small  large  large  small  medium
Levels: small medium large

# ordered() creates an ordered factor; levels = c() sets the ranking
sizes <- ordered(c("small", "large", "large", "small", "medium"))
sizes <- ordered(sizes, levels = c("small", "medium", "large"))
sizes

[1] small  large  large  small  medium
Levels: small < medium < large

# nlevels() returns the number of levels
nlevels(sizes)

[1] 3

class(levels(sizes))

[1] "character"

# List all objects in the current workspace
ls()

 [1] "a"     "b"     "c"     "d"     "e"     "f"     "f1"    "g"     "group"
[10] "int"   "sizes"
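
Returning to the ordered factor from section 1.5.2, the practical benefit of declaring an order is that comparisons become meaningful. The sketch below rebuilds the factor under hypothetical names (sizes_ord and sizes_unord are not part of the original notes) so that it runs on its own.

```r
# Rebuild the ordered factor from section 1.5.2
sizes_ord <- factor(
  c("small", "large", "large", "small", "medium"),
  levels = c("small", "medium", "large"),
  ordered = TRUE
)

# Comparison operators respect the level order
sizes_ord < "large"        # TRUE FALSE FALSE TRUE TRUE
max(sizes_ord)             # large

# as.integer() exposes the underlying level codes
as.integer(sizes_ord)      # 1 3 3 1 2

# The same comparison on an unordered factor returns NA with a warning
sizes_unord <- factor(c("small", "large"))
suppressWarnings(sizes_unord < "large")   # NA NA
```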

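2 Data Structures

A data structure describes how data is stored.

2.1 Vector

The atomic vector is the most basic structure for storing data in R. An atomic vector can hold elements of only one type. Every vector has two key properties: its length and its type.

The four most common types of atomic vector, in order of increasing flexibility, are:
- logical: TRUE and FALSE
- integer: whole numbers (written with an L suffix)
- double: real numbers
- character: text enclosed in quotes
Integer and double are together called numeric.
R has no separate scalar structure; a single value is a vector of length 1.
A vector stores one-dimensional data, and all of its elements must have the same type.
c() creates a vector.
length() returns the number of elements in a vector.
x[i] extracts the i-th element of a vector x.
x[2:5] slices out the second through fifth elements.
Missing data is represented by NA.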

# Create a numeric vector
mark <- c(60,70,75,80,85,90,95)
mark

[1] 60 70 75 80 85 90 95

class(mark)

[1] "numeric"

length(mark)

[1] 7

# Extract an element of the vector with []
mark[3]

[1] 75

# An index beyond the vector's length returns NA
mark[8]

[1] NA

mark[2:4]

[1] 70 75 80

# Create a character vector
name <- c("alex","bill", "lily","harry","steve","john")
name

[1] "alex"  "bill"  "lily"  "harry" "steve" "john"

class(name)

[1] "character"

# If c() receives both numbers and strings, the numbers are converted to strings
x1 <- c(90,"john")
x1

[1] "90"   "john"

class(x1)

[1] "character"

x2 <- c(90,TRUE)
x2

[1] 90  1

class(x2)

[1] "numeric"
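
The two results above follow one general rule: c() coerces its arguments to the most flexible type present. A short base-R sketch of that ordering, with values chosen only for illustration:

```r
# Mixing types in c() coerces to the most flexible type present:
# logical -> integer -> double -> character
typeof(c(TRUE, 2L))        # "integer"
typeof(c(2L, 3.5))         # "double"
typeof(c(3.5, "a"))        # "character"
c(TRUE, "a")               # "TRUE" "a"

# Explicit conversion goes the other way and yields NA where it fails
suppressWarnings(as.numeric(c("90", "john")))   # 90 NA
```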

x3 <- c(1:1024)
length(x3)

[1] 1024

# Create a sequence of odd numbers
x4 <- seq(1,20,2)
x4

 [1]  1  3  5  7  9 11 13 15 17 19

# A vector that contains missing data
x5 <- c(1:10, NA)
x5

 [1]  1  2  3  4  5  6  7  8  9 10 NA

# Check whether the vector contains missing values
is.na(x5)

 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

anyNA(x5)

[1] TRUE
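
Once missing values are detected, the usual next step is to decide how summaries should treat them. A minimal sketch, assuming a vector like the one above (the name x_na is hypothetical):

```r
x_na <- c(1:10, NA)

# Most summaries return NA if any element is missing
sum(x_na)                  # NA
mean(x_na)                 # NA

# na.rm = TRUE drops missing values before computing
sum(x_na, na.rm = TRUE)    # 55
mean(x_na, na.rm = TRUE)   # 5.5

# Count and locate the missing values
sum(is.na(x_na))           # 1
which(is.na(x_na))         # 11
```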

2.2 Matrix

A matrix is a collection of data elements arranged in a two-dimensional rectangular layout.

All elements of a matrix must have the same data type.
A matrix holds two-dimensional data, organized in rows and columns.
General form for creating a matrix:

M <- matrix(vector, nrow=r, ncol=c, byrow=FALSE, dimnames=list(char_vector_rownames, char_vector_colnames))

byrow=FALSE is the default and fills the matrix column by column; set byrow=TRUE to fill it row by row.
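
The effect of byrow is easiest to see with the numbers 1 to 6. A minimal sketch; the names by_col and by_row are hypothetical:

```r
# Default (byrow = FALSE): values run down each column first
by_col <- matrix(1:6, nrow = 2, ncol = 3)
by_col
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

# byrow = TRUE: values run across each row first
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
by_row
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6

dim(by_row)   # 2 3
```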

# Create matrices
m1 <- matrix(c("alex","bill", "lily","harry","steve","john"),2,3)
m1

     [,1]   [,2]    [,3]   
[1,] "alex" "lily"  "steve"
[2,] "bill" "harry" "john"

m2 <- matrix(c("alex","bill", "lily","harry","steve","john"),3,2)
m2

     [,1]   [,2]   
[1,] "alex" "harry"
[2,] "bill" "steve"
[3,] "lily" "john"

m3 <- matrix(c("alex","bill", "lily","harry","steve","john"))
m3

     [,1]   
[1,] "alex" 
[2,] "bill" 
[3,] "lily" 
[4,] "harry"
[5,] "steve"
[6,] "john"

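# Caution: T is matched by position to nrow, not byrow, and is coerced to 1, so this gives a 1 x 6 matrix
m4 <- matrix(c("alex","bill", "lily","harry","steve","john"),T)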
m4

     [,1]   [,2]   [,3]   [,4]    [,5]    [,6]  
[1,] "alex" "bill" "lily" "harry" "steve" "john"

# Name the rows and columns with dimnames
m5 <- matrix(c("alex","bill", "lily","harry","steve","john"), nrow = 2, ncol = 3, byrow = FALSE, dimnames = list(c("class A","class B"), c("group1","group2","group3")))
m5

        group1 group2  group3 
class A "alex" "lily"  "steve"
class B "bill" "harry" "john"

# Extract elements from a matrix
m5[2,2]

[1] "harry"

m5[1:2,1:2]

        group1 group2 
class A "alex" "lily" 
class B "bill" "harry"

2.3 List

A list can hold elements of different types, including vectors, functions, matrices, and other lists.

Lists act as general-purpose containers.

x <- list(1, "a", TRUE, 1+4i)
class(x[[1]])

[1] "numeric"

# Name each element of the list
x <- list(x1 = c(1:10), x2 = "a", x3 = c(TRUE,FALSE), x4 = 1+4i)
x

$x1
 [1]  1  2  3  4  5  6  7  8  9 10

$x2
[1] "a"

$x3
[1]  TRUE FALSE

$x4
[1] 1+4i

names(x)

[1] "x1" "x2" "x3" "x4"

list1 <- 1:5
list1

[1] 1 2 3 4 5

list2 <- factor(1:5)
list2

[1] 1 2 3 4 5
Levels: 1 2 3 4 5

list3 <- letters[1:5]
list3

[1] "a" "b" "c" "d" "e"

combined_list <- list(list1, list2, list3)
combined_list

[[1]]
[1] 1 2 3 4 5

[[2]]
[1] 1 2 3 4 5
Levels: 1 2 3 4 5

[[3]]
[1] "a" "b" "c" "d" "e"

# Extract one element of a list with double square brackets [[ ]]
combined_list[[2]]

[1] 1 2 3 4 5
Levels: 1 2 3 4 5

# Extract the 3rd value of the 2nd element with [[2]][3]

combined_list[[2]][3]

[1] 3
Levels: 1 2 3 4 5
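
The difference between single and double brackets is a common source of confusion, so a side-by-side comparison may help. A minimal sketch with a hypothetical named list called items:

```r
items <- list(nums = 1:5, lets = letters[1:5])

# Single brackets return a shorter list
class(items[1])        # "list"

# Double brackets return the element itself
class(items[[1]])      # "integer"

# Named elements can also be reached with $ or [["name"]]
items$lets             # "a" "b" "c" "d" "e"
items[["lets"]][3]     # "c"
```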

Flattening a list: unlist() collapses a list into a single atomic vector. Because combined_list mixes integer, factor, and character elements, everything is coerced to the most flexible type, so the flattened vector is character.

flat_list <- unlist(combined_list)
class(flat_list)

[1] "character"

length(flat_list)

[1] 15

2.4 Data Frame

2.4.1 Data frame versus matrix

All elements of a matrix must share one type. In a data frame, each column has a single type, but different columns may have different types, so a data frame can hold richer data.

2.4.2 Data frame versus list

A data frame is a special type of list in which every element has the same length (that is, a data frame is a “rectangular” list).

It stores data in tabular form. The columns can have different data types: the first column can be character, the second integer, and the third logical.

A data frame is created with the data.frame() function.
Data frames are widely used to hold data read from comma-separated (CSV) and other text files.
Data frames are useful for exploring, wrangling, plotting, and visualizing data.
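
The claim that a data frame is a rectangular list can be checked directly. A minimal sketch in base R; the data frame people and its values are made up for illustration:

```r
people <- data.frame(
  Person = c("Alex", "Bill", "John"),
  Age = c(26, 26, 27)
)

is.list(people)      # TRUE: a data frame is built on a list
length(people)       # 2: one list element per column
names(people)        # "Person" "Age"

# List-style access returns a whole column as a vector
people[["Age"]]      # 26 26 27
people$Person[2]     # "Bill"

# Columns of unequal length (3 and 2) are rejected with an error
res <- try(data.frame(a = 1:3, b = 1:2), silent = TRUE)
inherits(res, "try-error")   # TRUE
```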

2.4.3 Data frame versus tibble

A tibble is a data structure from the tidyverse family of packages. It is simply a data frame (like an Excel table), but when printed it shows its dimensions and the type of each column, which is more informative than the default data frame output.

Some R functions work only with data frames, only with tibbles, or only with matrices.

data.frame is the older structure; modern data analysis often uses the tibble, an enhanced data.frame.

# Create a data frame
team1_df <- data.frame(
   Person = c("Alex", "Bill","John"),
   Age = c(26, 26, 27),
   Weight = c(72, 65, 90),
   Height = c(175, 170, 180),
   Salary = c(8000, 6000, 7000),
   Sex = as.factor(c("Male","Male","Male"))
)
team1_df

  Person Age Weight Height Salary  Sex
1   Alex  26     72    175   8000 Male
2   Bill  26     65    170   6000 Male
3   John  27     90    180   7000 Male

# Extract an element of the data frame (row 2, column 3)
team1_df[2,3]

[1] 65

# Create a tibble

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.1     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.1     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

team1_tbl <- tibble(
   Person = c("Alex", "Bill","John"),
   Age = c(26, 26, 27),
   Weight = c(72, 65, 90),
   Height = c(175, 170, 180),
   Salary = c(8000, 6000, 7000),
   Sex = as.factor(c("Male","Male","Male"))
)
team1_tbl

# A tibble: 3 × 6
  Person   Age Weight Height Salary Sex  
  <chr>  <dbl>  <dbl>  <dbl>  <dbl> <fct>
1 Alex      26     72    175   8000 Male 
2 Bill      26     65    170   6000 Male 
3 John      27     90    180   7000 Male
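
Since some functions accept only one of these structures, converting between them is a common step. The sketch below sticks to base R so that it runs without extra packages; the tibble package provides as_tibble() for converting to a tibble, which is not shown here. The names body_df, body_mat, and mixed_df are hypothetical.

```r
body_df <- data.frame(
  Weight = c(72, 65, 90),
  Height = c(175, 170, 180)
)

# All-numeric columns convert to a numeric matrix
body_mat <- as.matrix(body_df)
is.matrix(body_mat)  # TRUE
typeof(body_mat)     # "double"

# Adding a character column forces the whole matrix to character
mixed_df <- data.frame(Person = c("Alex", "Bill", "John"), body_df)
typeof(as.matrix(mixed_df))      # "character"

# Converting back gives a data frame again
class(as.data.frame(body_mat))   # "data.frame"
```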

2.4.4 Creating data frames

2.4.4.1 data.frame()

team1 <- data.frame(
   Person = c("Alex", "Bill","John"),
   Age = c(26, 26, 27),
   Weight = c(72, 65, 90),
   Height = c(175, 170, 180),
   Salary = c(8000, 6000, 7000),
   Sex = as.factor(c("Male","Male","Male"))
)
team1

  Person Age Weight Height Salary  Sex
1   Alex  26     72    175   8000 Male
2   Bill  26     65    170   6000 Male
3   John  27     90    180   7000 Male

class(team1)

[1] "data.frame"

str(team1)

'data.frame':   3 obs. of  6 variables:
 $ Person: chr  "Alex" "Bill" "John"
 $ Age   : num  26 26 27
 $ Weight: num  72 65 90
 $ Height: num  175 170 180
 $ Salary: num  8000 6000 7000
 $ Sex   : Factor w/ 1 level "Male": 1 1 1

# Number of rows in the data frame
nrow(team1)

[1] 3

# Number of columns in the data frame
ncol(team1)

[1] 6

team2 <- data.frame(
   Person = c("Lily", "Kate","Susan"),
   Age = c(25, 24, 26),
   Weight = c(45, 55, 48),
   Height = c(155, 160, 162),
   Salary = c(6500, 8200, 7000),
   Sex = as.factor(c("Female","Female","Female"))
)

# rbind() combines two data frames by row
df1 <- rbind(team1, team2)
df1

  Person Age Weight Height Salary    Sex
1   Alex  26     72    175   8000   Male
2   Bill  26     65    170   6000   Male
3   John  27     90    180   7000   Male
4   Lily  25     45    155   6500 Female
5   Kate  24     55    160   8200 Female
6  Susan  26     48    162   7000 Female

# cbind() combines two data frames by column
df2 <- cbind(team1, team2)
df2

  Person Age Weight Height Salary  Sex Person Age Weight Height Salary    Sex
1   Alex  26     72    175   8000 Male   Lily  25     45    155   6500 Female
2   Bill  26     65    170   6000 Male   Kate  24     55    160   8200 Female
3   John  27     90    180   7000 Male  Susan  26     48    162   7000 Female
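
Note that the column-wise result above contains every column name twice, which makes name-based access ambiguous. One possible way to handle this, sketched with two small hypothetical data frames (left, right, both), is to make the names unique with make.unique():

```r
left  <- data.frame(Person = c("Alex", "Bill"), Age = c(26, 26))
right <- data.frame(Person = c("Lily", "Kate"), Age = c(25, 24))
both  <- cbind(left, right)

names(both)                     # "Person" "Age" "Person" "Age"
anyDuplicated(names(both)) > 0  # TRUE

# $ returns only the first column with a matching name
both$Person                     # "Alex" "Bill"

# make.unique() appends suffixes so every column can be addressed
names(both) <- make.unique(names(both))
names(both)                     # "Person" "Age" "Person.1" "Age.1"
both$Person.1                   # "Lily" "Kate"
```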

# View the first 3 rows of the data frame
head(df1,3)

  Person Age Weight Height Salary  Sex
1   Alex  26     72    175   8000 Male
2   Bill  26     65    170   6000 Male
3   John  27     90    180   7000 Male

# View the last 3 rows of the data frame
tail(df1,3)

  Person Age Weight Height Salary    Sex
4   Lily  25     45    155   6500 Female
5   Kate  24     55    160   8200 Female
6  Susan  26     48    162   7000 Female

# View the data type of each column of the data frame
str(df1)

'data.frame':   6 obs. of  6 variables:
 $ Person: chr  "Alex" "Bill" "John" "Lily" ...
 $ Age   : num  26 26 27 25 24 26
 $ Weight: num  72 65 90 45 55 48
 $ Height: num  175 170 180 155 160 162
 $ Salary: num  8000 6000 7000 6500 8200 7000
 $ Sex   : Factor w/ 2 levels "Male","Female": 1 1 1 2 2 2

summary(df1)

    Person               Age            Weight          Height     
 Length:6           Min.   :24.00   Min.   :45.00   Min.   :155.0  
 Class :character   1st Qu.:25.25   1st Qu.:49.75   1st Qu.:160.5  
 Mode  :character   Median :26.00   Median :60.00   Median :166.0  
                    Mean   :25.67   Mean   :62.50   Mean   :167.0  
                    3rd Qu.:26.00   3rd Qu.:70.25   3rd Qu.:173.8  
                    Max.   :27.00   Max.   :90.00   Max.   :180.0  
     Salary         Sex   
 Min.   :6000   Male  :3  
 1st Qu.:6625   Female:3  
 Median :7000             
 Mean   :7117             
 3rd Qu.:7750             
 Max.   :8200
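
When only the column types are needed, sapply() with class() gives a more compact answer than str() or summary(). A minimal sketch, assuming R 4.0 or later (where data.frame() keeps strings as character); the data frame staff and its values are made up:

```r
staff <- data.frame(
  Person = c("Alex", "Lily"),
  Age = c(26, 25),
  Sex = factor(c("Male", "Female"))
)

# One class per column, returned as a named character vector
# Person: "character", Age: "numeric", Sex: "factor"
col_classes <- sapply(staff, class)
col_classes

# Pick out the numeric columns and average them
num_cols <- sapply(staff, is.numeric)
colMeans(staff[, num_cols, drop = FALSE])   # Age 25.5
```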

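3 Homework

3.1 Create a data frame

Requirements:

Observation units: 6 classmates

The data frame must include: name, height, weight, place of origin, sex, and the score in one subject.

3.2 Create a list

Create a list containing 3 elements whose data types are numeric, character, and logical, respectively.
View the 2nd element of the list.
View the data type of the 2nd element of the list.
View the value of the 3rd item in the 2nd element of the list.

3.3 Explore starwars

Hint: enter starwars in the Help pane to view its documentation.

How many rows does starwars have?
How many columns does starwars have?
What is the data type of each column in starwars?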

2 R Data Types and Data Structures

Li Zongzhang

2023-04-27