1 Data Types
- Numeric
- Integer
- Character
- Logical
- Factor
1.1 Numeric
Numeric: Numbers with a decimal part are stored as numeric. A number typed without the L suffix is also numeric by default, even when it has no decimal part.
a <- 1.2
a
[1] 1.2
# class() returns the data type of a
class(a)
[1] "numeric"
1.2 Integer
Integer: Numbers without a decimal part can be stored as integers. R does not do this automatically, though: to create an integer you either call as.integer() on a value or write the number with an L suffix.
# as.integer() converts to integer, truncating the decimal part
int <- as.integer(2.7)
print(int)
[1] 2
class(int)
[1] "integer"
# The L suffix marks the number as an integer
b <- 2L
b
[1] 2
class(b)
[1] "integer"
c <- 2
c
[1] 2
class(c)
[1] "numeric"
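As a small sketch of the difference between the two ways a number can be stored, typeof() reports the underlying storage type, while class() reports "numeric" for doubles. The variable names n_int and n_dbl are illustrative and not part of the original notes.

```r
# typeof() reveals the storage type behind a number
n_int <- 2L   # integer literal
n_dbl <- 2    # double, the default for typed numbers

typeof(n_int)       # "integer"
typeof(n_dbl)       # "double"
is.integer(n_dbl)   # FALSE: 2 without L is not an integer
is.numeric(n_int)   # TRUE: integers also count as numeric

# as.integer() truncates toward zero rather than rounding
as.integer(2.7)     # 2
as.integer(-2.7)    # -2
round(2.7)          # 3, still stored as a double
```

If rounding is what you want, apply round() first and then as.integer().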
1.3 Character
Character: A letter, or any combination of characters enclosed in quotes, is treated by R as character data. The quoted content can be letters or digits.
d <- "hello"
d
[1] "hello"
class(d)
[1] "character"
e <- "5"
e
[1] "5"
class(e)
[1] "character"
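Because a quoted digit is text, it has to be converted before it can be used in arithmetic. The following is a minimal sketch using base R functions; the names e_chr and e_num are illustrative.

```r
# A quoted digit is character data, so convert it before doing arithmetic
e_chr <- "5"
e_num <- as.numeric(e_chr)   # now numeric

class(e_chr)              # "character"
class(e_num)              # "numeric"
e_num + 1                 # 6
nchar("hello")            # 5: number of characters in the string
paste("hello", "world")   # "hello world": joins strings with a space
```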
1.4 Logical
Logical: A variable that can take the values TRUE and FALSE, like a boolean, is called a logical variable.
f <- TRUE
# F is shorthand for FALSE
g <- F
print(c(f,g))
[1] TRUE FALSE
class(c(f,g))
[1] "logical"
# Summing a logical vector counts the TRUE values; the mean gives the proportion of TRUE
f1 <- c(TRUE, FALSE, TRUE, TRUE, TRUE)
sum(f1)
[1] 4
mean(f1)
[1] 0.8
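In practice, logical vectors usually come from comparisons rather than being typed by hand. The sketch below assumes a hypothetical vector of exam scores and shows how sum(), mean() and logical indexing combine.

```r
scores <- c(60, 70, 75, 80, 85, 90, 95)   # hypothetical exam scores
passed <- scores >= 80                     # element-wise comparison gives a logical vector

passed
sum(passed)       # 4 scores are at least 80
mean(passed)      # proportion of scores that are at least 80
scores[passed]    # logical indexing keeps only the TRUE positions
```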
1.5 Factor
Factor: A data type used to represent qualitative (categorical) data such as colors, good & bad, course or movie ratings, etc.
1.5.1 Unordered Factor
The levels have no inherent order: nominal (categorical) data.
group <- factor(c("red", "blue", "yellow","white"))
group
[1] red blue yellow white
Levels: blue red white yellow
1.5.2 Ordered Factor
The levels can be ranked: ordinal data. Check that the level order is sensible.
# Create a factor with the wrong order of levels (the default order is alphabetical)
sizes <- factor(c("small", "large", "large", "small", "medium"))
sizes
[1] small large large small medium
Levels: large medium small
class(sizes)
[1] "factor"
# levels() shows the factor levels
levels(sizes)
[1] "large" "medium" "small"
# levels = c() sets the order of the factor levels
sizes <- factor(sizes, levels = c("small", "medium", "large"))
sizes
[1] small large large small medium
Levels: small medium large
# ordered() creates an ordered factor and sets the order of its levels
sizes <- ordered(c("small", "large", "large", "small", "medium"))
sizes <- ordered(sizes, levels = c("small", "medium", "large"))
sizes
[1] small large large small medium
Levels: small < medium < large
# Number of factor levels
nlevels(sizes)
[1] 3
class(levels(sizes))
[1] "character"
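One practical consequence of an ordered factor is that its values can be compared. The following sketch rebuilds the same data under the illustrative name sizes_ord, using the ordered = TRUE argument of factor() as an alternative to ordered().

```r
sizes_ord <- factor(c("small", "large", "large", "small", "medium"),
                    levels = c("small", "medium", "large"),
                    ordered = TRUE)

sizes_ord < "large"     # TRUE FALSE FALSE TRUE TRUE
table(sizes_ord)        # counts per level, listed in level order
as.integer(sizes_ord)   # underlying integer codes: 1 3 3 1 2
max(sizes_ord)          # large
```

Comparisons such as < and functions such as max() are only meaningful for ordered factors; on an unordered factor, < returns NA with a warning and max() is an error.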
# List all objects in the workspace
ls()
[1] "a" "b" "c" "d" "e" "f" "f1" "g" "group"
[10] "int" "sizes"
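2 Data Structure
A data structure describes how data is stored.
2.1 Vector
The atomic vector is the most basic structure for storing data in R. An atomic vector can only hold elements of a single type. Every vector has two key properties: length and type.
There are four common types of atomic vector (in order of increasing complexity):
Logical: TRUE and FALSE
Integer: whole numbers (written with an L suffix)
Double: real numbers
Character: text enclosed in quotes
Integer and double are collectively called numeric.
R has no "scalar" data structure, only vectors of length 1.
A vector stores one-dimensional data, and all of its elements must have the same data type.
The function c() creates a vector.
The function length() returns the length of a vector, that is, the number of elements it contains.
Access a single element with [].
Slicing: [2:5] selects the 2nd through 5th elements of a vector.
Missing data is represented by NA.
# Create a numeric vector
mark <- c(60,70,75,80,85,90,95)
mark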
[1] 60 70 75 80 85 90 95
class(mark)
[1] "numeric"
length(mark)
[1] 7
# Access an element of the vector with []
mark[3]
[1] 75
# Indexing beyond the length of the vector returns NA
mark[8]
[1] NA
mark[2:4]
[1] 70 75 80
# Create a character vector
name <- c("alex","bill", "lily","harry","steve","john")
name
[1] "alex" "bill" "lily" "harry" "steve" "john"
class(name)
[1] "character"
# If c() receives both numbers and strings, the numbers are converted to strings
x1 <- c(90,"john")
x1
[1] "90" "john"
class(x1)
[1] "character"
x2 <- c(90,TRUE)
x2
[1] 90 1
class(x2)
[1] "numeric"
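The two results above follow from R's coercion order: when types are mixed in c(), every element is converted to the most general type present. A minimal sketch of that order, using only base R:

```r
# Coercion order: logical -> integer -> double -> character
typeof(c(TRUE, 1L))     # "integer"
typeof(c(1L, 2.5))      # "double"
typeof(c(2.5, "a"))     # "character"

# Going the other way must be requested explicitly; text that cannot be
# parsed as a number becomes NA (R also issues a warning, suppressed here)
suppressWarnings(as.numeric(c("90", "john")))   # 90 NA
```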
x3 <- c(1:1024)
length(x3)
[1] 1024
# Create a sequence of odd numbers
x4 <- seq(1,20,2)
x4
[1] 1 3 5 7 9 11 13 15 17 19
# A vector containing missing data
x5 <- c(1:10, NA)
x5
[1] 1 2 3 4 5 6 7 8 9 10 NA
# Check whether the vector contains missing values
is.na(x5)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
anyNA(x5)
[1] TRUE
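Once a vector contains NA, most summary functions return NA unless told to skip the missing values. The sketch below uses an illustrative vector x_na with the same contents as x5 and the na.rm argument that sum() and mean() provide.

```r
x_na <- c(1:10, NA)          # illustrative vector with one missing value

sum(x_na)                    # NA: missing values propagate
sum(x_na, na.rm = TRUE)      # 55
mean(x_na, na.rm = TRUE)     # 5.5
sum(is.na(x_na))             # 1: number of missing values
x_na[!is.na(x_na)]           # the 10 non-missing values
```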
2.2 Matrix
A matrix is a collection of data elements arranged in a two-dimensional rectangular layout.
All elements of a matrix must have the same data type.
A matrix holds two-dimensional information.
Creating a matrix
M <- matrix(vector, nrow=r, ncol=c, byrow=FALSE, dimnames=list(char_vector_rownames, char_vector_colnames))
- byrow=FALSE (the default) fills the matrix column by column; byrow=TRUE fills it row by row
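To make the effect of byrow concrete, the following sketch fills the same six numbers both ways. The names by_col and by_row are illustrative.

```r
by_col <- matrix(1:6, nrow = 2, ncol = 3)                 # default: byrow = FALSE
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

by_col
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

by_row
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6

dim(by_col)   # 2 3: rows, then columns
```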
# Create a matrix
m1 <- matrix(c("alex","bill", "lily","harry","steve","john"),2,3)
m1
[,1] [,2] [,3]
[1,] "alex" "lily" "steve"
[2,] "bill" "harry" "john"
m2 <- matrix(c("alex","bill", "lily","harry","steve","john"),3,2)
m2
[,1] [,2]
[1,] "alex" "harry"
[2,] "bill" "steve"
[3,] "lily" "john"
m3 <- matrix(c("alex","bill", "lily","harry","steve","john"))
m3
[,1]
[1,] "alex"
[2,] "bill"
[3,] "lily"
[4,] "harry"
[5,] "steve"
[6,] "john"
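# nrow = 1 gives a single-row matrix (a bare T in this position is read as nrow = 1, not as byrow = TRUE)
m4 <- matrix(c("alex","bill", "lily","harry","steve","john"), nrow = 1)
m4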
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "alex" "bill" "lily" "harry" "steve" "john"
m5 <- matrix(c("alex","bill", "lily","harry","steve","john"),2,3,byrow = F, dimnames = list(c("class A","class B"),c("group1","group2","group3")))
m5
group1 group2 group3
class A "alex" "lily" "steve"
class B "bill" "harry" "john"
# Extract elements from the matrix
m5[2,2]
[1] "harry"
m5[1:2,1:2]
group1 group2
class A "alex" "lily"
class B "bill" "harry"
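When a matrix has dimnames, elements can also be selected by name, and leaving an index empty selects a whole row or column. The sketch below rebuilds the same matrix under the illustrative name seats.

```r
seats <- matrix(c("alex", "bill", "lily", "harry", "steve", "john"),
                nrow = 2, ncol = 3,
                dimnames = list(c("class A", "class B"),
                                c("group1", "group2", "group3")))

seats["class B", "group2"]   # "harry", same as seats[2, 2]
seats["class A", ]           # the whole first row, as a named character vector
seats[, "group3"]            # the whole third column
dim(seats)                   # 2 3
t(seats)                     # transpose: 3 rows and 2 columns
```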
2.3 List
A list can contain data of different types: vectors, functions, matrices, or even other lists.
Lists act as containers.
x <- list(1, "a", TRUE, 1+4i)
class(x[[1]])
[1] "numeric"
# Name each object in the list
x <- list(x1 = c(1:10), x2 = "a", x3 = c(TRUE,FALSE), x4 = 1+4i)
x
$x1
[1] 1 2 3 4 5 6 7 8 9 10
$x2
[1] "a"
$x3
[1] TRUE FALSE
$x4
[1] 1+4i
names(x)
[1] "x1" "x2" "x3" "x4"
list1 <- 1:5
list1
[1] 1 2 3 4 5
list2 <- factor(1:5)
list2
[1] 1 2 3 4 5
Levels: 1 2 3 4 5
list3 <- letters[1:5]
list3
[1] "a" "b" "c" "d" "e"
combined_list <- list(list1, list2, list3)
combined_list
[[1]]
[1] 1 2 3 4 5
[[2]]
[1] 1 2 3 4 5
Levels: 1 2 3 4 5
[[3]]
[1] "a" "b" "c" "d" "e"
# Extract an object from the list with double square brackets [[]]
combined_list[[2]]
[1] 1 2 3 4 5
Levels: 1 2 3 4 5
# Extract the 3rd element of the 2nd object in the list with [[2]][3]
combined_list[[2]][3]
[1] 3
Levels: 1 2 3 4 5
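The difference between single and double brackets is easy to miss: single brackets return a smaller list, while double brackets (or $ for named elements) return the element itself. A minimal sketch with a hypothetical named list info:

```r
info <- list(x1 = 1:10, x2 = "a", x3 = c(TRUE, FALSE))   # hypothetical list

class(info["x1"])     # "list": single brackets keep the list wrapper
class(info[["x1"]])   # "integer": double brackets return the element itself
info$x3               # TRUE FALSE, shorthand for info[["x3"]]
info[["x1"]][3]       # 3: third element of the first object
length(info)          # 3 objects in the list
```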
Flatten the list with unlist(): One important thing to remember is that because combined_list mixes character and numeric data, the character type takes precedence, and the whole flattened vector becomes character.
flat_list <- unlist(combined_list)
class(flat_list)
[1] "character"
length(flat_list)
[1] 15
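2.4 Data Frame
2.4.1 Differences between a data frame and a matrix
All elements of a matrix must have the same type. In a data frame, each column has a single type, but different columns can have different types, so a data frame can hold a richer mix of data.
2.4.2 Differences between a data frame and a list
A data frame is a special type of list where every element of the list has the same length (i.e. a data frame is a "rectangular" list).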
A data frame holds data in tabular form. The data can be spread across several columns with different data types: the first column can be character, the second integer, and the third logical.
A data frame can be created using the data.frame() function. Data frames are widely used when reading comma-separated (CSV) files and other text files.
Data frames are useful for understanding data, data wrangling, plotting and visualizing.
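Since a data frame is a rectangular list, the list tools from the previous section work on it column by column. The sketch below checks this on a small hypothetical data frame named people.

```r
people <- data.frame(Person = c("Alex", "Bill", "John"),
                     Age = c(26, 26, 27))   # hypothetical data

is.list(people)         # TRUE
typeof(people)          # "list"
length(people)          # 2: one list element per column
people[["Age"]]         # 26 26 27, extracted like a list element
people$Person           # "Alex" "Bill" "John"
sapply(people, class)   # class of each column
```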
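2.4.3 Differences between a data frame and a tibble
A tibble is a data structure that comes from the tidyverse packages. When printed, a tibble shows its dimensions and the data type of each column, which is more information than a data frame displays.
Some R functions only work with data frames, some only with tibbles, and some only with matrices.
data.frame is the older data structure; modern data analysis often uses the tibble (an enhanced data.frame).
The tidyverse uses a structure called a "tibble", which is simply a data frame (like an Excel table) but with more informative printing than the default data frame.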
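Besides printing, one behavioral difference that often explains why a function accepts one structure but not the other is column subsetting. The sketch below assumes the tibble package is installed and uses hypothetical objects people_df and people_tbl.

```r
library(tibble)   # assumes the tibble package is installed

people_df  <- data.frame(Person = c("Alex", "Bill"), Age = c(26, 27))
people_tbl <- as_tibble(people_df)

class(people_df[, "Age"])    # "numeric": a data frame drops to a plain vector
class(people_tbl[, "Age"])   # "tbl_df" "tbl" "data.frame": a tibble stays a tibble
is.data.frame(people_tbl)    # TRUE: a tibble is still a data frame
as.data.frame(people_tbl)    # convert back when a function needs a plain data frame
```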
# Create a data frame
team1_df <- data.frame(
  Person = c("Alex", "Bill","John"),
  Age = c(26, 26, 27),
  Weight = c(72, 65, 90),
  Height = c(175, 170, 180),
  Salary = c(8000, 6000, 7000),
  Sex = as.factor(c("Male","Male","Male"))
)
team1_df
Person Age Weight Height Salary Sex
1 Alex 26 72 175 8000 Male
2 Bill 26 65 170 6000 Male
3 John 27 90 180 7000 Male
# Access an element of the data frame (row 2, column 3)
team1_df[2,3]
[1] 65
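Positions work, but columns are usually easier to address by name, and rows by a condition. The following sketch uses a hypothetical data frame team holding a subset of the columns above.

```r
team <- data.frame(Person = c("Alex", "Bill", "John"),
                   Age = c(26, 26, 27),
                   Weight = c(72, 65, 90))   # hypothetical subset of the columns above

team$Weight                  # 72 65 90: one column as a vector
team[2, "Weight"]            # 65, same as team[2, 3]
team[team$Age > 26, ]        # rows where Age exceeds 26 (John only)
team[, c("Person", "Age")]   # keep two columns
mean(team$Weight)            # average weight
```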
# Create a tibble
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.1 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.1 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
team1_tbl <- tibble(
  Person = c("Alex", "Bill","John"),
  Age = c(26, 26, 27),
  Weight = c(72, 65, 90),
  Height = c(175, 170, 180),
  Salary = c(8000, 6000, 7000),
  Sex = as.factor(c("Male","Male","Male"))
)
team1_tbl
# A tibble: 3 × 6
Person Age Weight Height Salary Sex
<chr> <dbl> <dbl> <dbl> <dbl> <fct>
1 Alex 26 72 175 8000 Male
2 Bill 26 65 170 6000 Male
3 John 27 90 180 7000 Male
2.4.4 Ways to create a data frame
2.4.4.1 data.frame()
team1 <- data.frame(
  Person = c("Alex", "Bill","John"),
  Age = c(26, 26, 27),
  Weight = c(72, 65, 90),
  Height = c(175, 170, 180),
  Salary = c(8000, 6000, 7000),
  Sex = as.factor(c("Male","Male","Male"))
)
team1
Person Age Weight Height Salary Sex
1 Alex 26 72 175 8000 Male
2 Bill 26 65 170 6000 Male
3 John 27 90 180 7000 Male
class(team1)
[1] "data.frame"
str(team1)
'data.frame': 3 obs. of 6 variables:
$ Person: chr "Alex" "Bill" "John"
$ Age : num 26 26 27
$ Weight: num 72 65 90
$ Height: num 175 170 180
$ Salary: num 8000 6000 7000
$ Sex : Factor w/ 1 level "Male": 1 1 1
# Number of rows in the data frame
nrow(team1)
[1] 3
# Number of columns in the data frame
ncol(team1)
[1] 6
team2 <- data.frame(
  Person = c("Lily", "Kate","Susan"),
  Age = c(25, 24, 26),
  Weight = c(45, 55, 48),
  Height = c(155, 160, 162),
  Salary = c(6500, 8200, 7000),
  Sex = as.factor(c("Female","Female","Female"))
)
# rbind() combines two data frames by row
df1 <- rbind(team1, team2)
df1
Person Age Weight Height Salary Sex
1 Alex 26 72 175 8000 Male
2 Bill 26 65 170 6000 Male
3 John 27 90 180 7000 Male
4 Lily 25 45 155 6500 Female
5 Kate 24 55 160 8200 Female
6 Susan 26 48 162 7000 Female
# cbind() combines two data frames by column
df2 <- cbind(team1, team2)
df2
Person Age Weight Height Salary Sex Person Age Weight Height Salary Sex
1 Alex 26 72 175 8000 Male Lily 25 45 155 6500 Female
2 Bill 26 65 170 6000 Male Kate 24 55 160 8200 Female
3 John 27 90 180 7000 Male Susan 26 48 162 7000 Female
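The two functions have different requirements and side effects: rbind() needs matching column names, while cbind() needs matching row counts and keeps duplicated column names as they are. The sketch below illustrates this with two small hypothetical data frames, left and right, and shows one possible way to avoid the duplicated names.

```r
left  <- data.frame(Person = c("Alex", "Bill"), Age = c(26, 26))
right <- data.frame(Person = c("Lily", "Kate"), Age = c(25, 24))

stacked <- rbind(left, right)   # column names must match
wide    <- cbind(left, right)   # row counts must match

dim(stacked)    # 4 2
dim(wide)       # 2 4
names(wide)     # "Person" "Age" "Person" "Age": duplicated names
wide$Age        # 26 26: only the first Age column is returned

# One way to avoid the ambiguity: rename the columns before combining
names(right) <- paste0(names(right), "_2")
names(cbind(left, right))   # "Person" "Age" "Person_2" "Age_2"
```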
# View the first 3 rows of the data frame
head(df1,3)
Person Age Weight Height Salary Sex
1 Alex 26 72 175 8000 Male
2 Bill 26 65 170 6000 Male
3 John 27 90 180 7000 Male
# View the last 3 rows of the data frame
tail(df1,3)
Person Age Weight Height Salary Sex
4 Lily 25 45 155 6500 Female
5 Kate 24 55 160 8200 Female
6 Susan 26 48 162 7000 Female
# View the data type of each column of the data frame
str(df1)
'data.frame': 6 obs. of 6 variables:
$ Person: chr "Alex" "Bill" "John" "Lily" ...
$ Age : num 26 26 27 25 24 26
$ Weight: num 72 65 90 45 55 48
$ Height: num 175 170 180 155 160 162
$ Salary: num 8000 6000 7000 6500 8200 7000
$ Sex : Factor w/ 2 levels "Male","Female": 1 1 1 2 2 2
summary(df1)
Person Age Weight Height
Length:6 Min. :24.00 Min. :45.00 Min. :155.0
Class :character 1st Qu.:25.25 1st Qu.:49.75 1st Qu.:160.5
Mode :character Median :26.00 Median :60.00 Median :166.0
Mean :25.67 Mean :62.50 Mean :167.0
3rd Qu.:26.00 3rd Qu.:70.25 3rd Qu.:173.8
Max. :27.00 Max. :90.00 Max. :180.0
Salary Sex
Min. :6000 Male :3
1st Qu.:6625 Female:3
Median :7000
Mean :7117
3rd Qu.:7750
Max. :8200
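3 Homework
3.1 Create a data frame
Requirements:
Observation units: 6 classmates
The data frame must include: name, height, weight, place of origin, sex, and the score in one subject.
3.2 Create a list
Create a list containing 3 objects whose data types are numeric, character, and logical, respectively.
View the 2nd object in the list.
View the data type of the 2nd object in the list.
View the value of the 3rd element of the 2nd object in the list.
3.3 Inspect the starwars data
Hint: search for starwars in the help pane to see its documentation.
- How many rows does starwars have?
- How many columns does starwars have?
- What is the data type of each column in starwars?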