A First Look at DataFrames for Data Analysis in Julia - Learning the Julia Language

using DataFrames

df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"]) # construct a DataFrame

4 rows × 2 columns

	A	B
	Int64	String
1	1	M
2	2	F
3	3	F
4	4	M

df[!,:B] = String["F","M","F","M"] # R-style indexing; ! means the vector is stored as the column without copying

4-element Array{String,1}:
 "F"
 "M"
 "F"
 "M"

df # the original table has changed as well

4 rows × 2 columns

	A	B
	Int64	String
1	1	F
2	2	M
3	3	F
4	4	M
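
To make the no-copy remark concrete, here is a minimal sketch contrasting `df[!, :B]`, which hands back the stored column itself, with `df[:, :B]`, which returns a copy. The variable names `stored` and `copied` are illustrative and not part of the original notes.

```julia
using DataFrames

df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])

stored = df[!, :B]   # the column vector held by df, no copy
copied = df[:, :B]   # an independent copy of the column

stored[1] = "X"      # writes through to df
copied[2] = "Y"      # leaves df untouched

df.B                 # first entry is now "X", second is still "F"
```

Use `!` when the column should be modified in place or when copying a large column is too costly; use `:` when the result must stay independent of the table.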

df[:,Symbol("B")] # Julia indexes columns by Symbol, much like sym() in R

4-element Array{String,1}:
 "F"
 "M"
 "F"
 "M"

names(df) # get the column names; they cannot be reassigned this way

2-element Array{String,1}:
 "A"
 "B"

propertynames(df)

2-element Array{Symbol,1}:
 :A
 :B
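
Because `names(df)` only returns a fresh vector of names, renaming has to go through a dedicated function. Below is a minimal sketch using `rename!` from DataFrames on a throwaway table; the table `tmp` and the new name `:id` are made up for illustration.

```julia
using DataFrames

tmp = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])

rename!(tmp, :A => :id)   # rename column A to id, in place

names(tmp)                # the first column is now called id
```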

size(df) ## check the dimensions

(4, 2)

push!(df,[5,"M"]) # append a row
df

5 rows × 2 columns

	A	B
	Int64	String
1	1	F
2	2	M
3	3	F
4	4	M
5	5	M

df[!,Not(:A)] # inverse selection: every column except A

5 rows × 1 columns

	B
	String
1	F
2	M
3	F
4	M
5	M

df[df.A .> 3,:] # rows can be selected by condition, but note the broadcasted .>

2 rows × 2 columns

	A	B
	Int64	String
1	4	M
2	5	M
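
Row selection can also be written with `filter`, or with several broadcasted conditions combined via `.&`. The following is a sketch on a fresh copy of the table; the names `people5`, `big` and `big_m` are hypothetical.

```julia
using DataFrames

people5 = DataFrame(A = 1:5, B = ["F", "M", "F", "M", "M"])

# filter takes a `column => predicate` pair; the predicate sees one value at a time
big = filter(:A => a -> a > 3, people5)

# combine conditions elementwise with .& and parenthesize each comparison
big_m = people5[(people5.A .> 1) .& (people5.B .== "M"), :]
```

The parentheses around each comparison matter, because `.&` binds more tightly than `.>` and `.==`.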

When broadcasting with in.(items, collection) or items .∈ collection, both item and collection are broadcasted over, which is often not what is intended. For example, if both arguments are vectors (and the dimensions match), the result is a vector indicating whether each value in collection items is in the value at the corresponding position in collection. To get a vector indicating whether each value in items is in collection, wrap collection in a tuple or a Ref like this: in.(items, Ref(collection)) or items .∈ Ref(collection)

in.([1,2,3,4,5],[1,2,3,4,6]) ## odd syntax: in is applied pairwise, so the two vectors must have equal length

5-element BitArray{1}:
 1
 1
 1
 1
 0

in.([1,2,3,4],Ref([1,2])) ## to test each item against the whole collection (lengths may differ), wrap it in Ref

4-element BitArray{1}:
 1
 1
 0
 0
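
The quoted passage also mentions wrapping the collection in a tuple. Here is a small sketch in plain Julia (no packages needed) suggesting that the wrappers behave the same; `items` and `allowed` are illustrative names.

```julia
items = [1, 2, 3, 4]
allowed = [1, 2]

via_ref   = in.(items, Ref(allowed))   # Ref makes `allowed` act as a scalar
via_tuple = in.(items, (allowed,))     # a 1-tuple does the same
via_op    = items .∈ Ref(allowed)      # operator form of the first line

via_ref == via_tuple == via_op         # all three give the same mask
```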

df[in.(df.A,Ref([1,2])),:] ## in is really just a function: in(x, y) is the same as x in y

2 rows × 2 columns

	A	B
	Int64	String
1	1	F
2	2	M

df = DataFrame(x1=[1, 2], x2=[3, 4], y=[5, 6])

2 rows × 3 columns

	x1	x2	y
	Int64	Int64	Int64
1	1	3	5
2	2	4	6

select(df,Not(:x1)) ## choose columns with select

2 rows × 2 columns

	x2	y
	Int64	Int64
1	3	5
2	4	6

select(df,:x1,:x2=>(x->x*2)=>:x2) # an anonymous function lets you mutate while selecting; the => pairs read as source column => function => target name

2 rows × 2 columns

	x1	x2
	Int64	Int64
1	1	6
2	2	8

select(df,:x2,:x2=>ByRow(sqrt)) ## ByRow(f), another odd piece of syntax: f is applied to each row's value instead of the whole column

2 rows × 2 columns

	x2	x2_sqrt
	Int64	Float64
1	3	1.73205
2	4	2.0

select(df,:x2,:x2=>(x->sqrt.(x))=>Symbol("x2","_sqrt")) # isn't passing a broadcasting function easier to understand?

2 rows × 2 columns

	x2	x2_sqrt
	Int64	Float64
1	3	1.73205
2	4	2.0

transform(df,All()=>+) # All() selects every column

2 rows × 4 columns

	x1	x2	y	x1_x2_y_+
	Int64	Int64	Int64	Int64
1	1	3	5	9
2	2	4	6	12

transform(df,AsTable(:)=>ByRow(sum)=>:sum) # the same row sum written with ByRow

2 rows × 4 columns

	x1	x2	y	sum
	Int64	Int64	Int64	Int64
1	1	3	5	9
2	2	4	6	12
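
The `source => function => target` pairs also accept several source columns, in which case the function receives one argument per column. A hedged sketch follows; the target name `:x1_plus_x2` and the variables `out` and `out2` are invented for this example.

```julia
using DataFrames

df = DataFrame(x1 = [1, 2], x2 = [3, 4], y = [5, 6])

# ByRow calls the function once per row with the values of x1 and x2
out = transform(df, [:x1, :x2] => ByRow((a, b) -> a + b) => :x1_plus_x2)

# without ByRow the function gets whole columns, so broadcast explicitly
out2 = transform(df, [:x1, :x2] => ((a, b) -> a .+ b) => :x1_plus_x2)
```

Both calls should yield the same table; which form reads better is largely a matter of taste.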

df = DataFrame(a = ["a", "None", "b", "None"], b = 1:4, c = ["None", "j", "k", "h"], d = ["x", "y", "None", "z"])

4 rows × 4 columns

	a	b	c	d
	String	Int64	String	String
1	a	1	None	x
2	None	2	j	y
3	b	3	k	None
4	None	4	h	z

replace!(df.a,"None"=>"meiyou") ## replace values within a column, in the spirit of replace_na in R

4-element Array{String,1}:
 "a"
 "meiyou"
 "b"
 "meiyou"

df

4 rows × 4 columns

	a	b	c	d
	String	Int64	String	String
1	a	1	None	x
2	meiyou	2	j	y
3	b	3	k	None
4	meiyou	4	h	z

Joining data

people = DataFrame(ID = [20, 40], Name = ["John Doe", "Jane Doe"])

2 rows × 2 columns

	ID	Name
	Int64	String
1	20	John Doe
2	40	Jane Doe

jobs = DataFrame(ID = [20, 40], Job = ["Lawyer", "Doctor"])

2 rows × 2 columns

	ID	Job
	Int64	String
1	20	Lawyer
2	40	Doctor

innerjoin(people,jobs,on=:ID) ## the available joins are innerjoin, leftjoin, rightjoin, outerjoin

2 rows × 3 columns

	ID	Name	Job
	Int64	String	String
1	20	John Doe	Lawyer
2	40	Jane Doe	Doctor
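
The join functions differ in how they treat keys without a match. Here is a minimal sketch, assuming a second jobs table (`jobs2`, hypothetical, with made-up rows) in which ID 40 is absent:

```julia
using DataFrames

people = DataFrame(ID = [20, 40], Name = ["John Doe", "Jane Doe"])
jobs2  = DataFrame(ID = [20, 60], Job = ["Lawyer", "Astronaut"])

inner = innerjoin(people, jobs2, on = :ID)   # only ID 20 appears in both
left  = leftjoin(people, jobs2, on = :ID)    # keeps every row of people

# Jane Doe has no match, so her Job is filled with missing
jane_job = left[left.ID .== 40, :Job][1]
```

The row is looked up by ID rather than by position, since the row order of a join result is better not relied upon.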