安装好 ClickHouse 之后,先来个例子感受下如何用的,
创建表 与 导入测试数据
CREATE TABLE trips
(
`trip_id` UInt32,
`vendor_id` Enum8('1' = 1, '2' = 2, '3' = 3, '4' = 4, 'CMT' = 5, 'VTS' = 6, 'DDS' = 7, 'B02512' = 10, 'B02598' = 11, 'B02617' = 12, 'B02682' = 13, 'B02764' = 14, '' = 15),
`pickup_date` Date,
`pickup_datetime` DateTime,
`dropoff_date` Date,
`dropoff_datetime` DateTime,
`store_and_fwd_flag` UInt8,
`rate_code_id` UInt8,
`pickup_longitude` Float64,
`pickup_latitude` Float64,
`dropoff_longitude` Float64,
`dropoff_latitude` Float64,
`passenger_count` UInt8,
`trip_distance` Float64,
`fare_amount` Float32,
`extra` Float32,
`mta_tax` Float32,
`tip_amount` Float32,
`tolls_amount` Float32,
`ehail_fee` Float32,
`improvement_surcharge` Float32,
`total_amount` Float32,
`payment_type` Enum8('UNK' = 0, 'CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4),
`trip_type` UInt8,
`pickup` FixedString(25),
`dropoff` FixedString(25),
`cab_type` Enum8('yellow' = 1, 'green' = 2, 'uber' = 3),
`pickup_nyct2010_gid` Int8,
`pickup_ctlabel` Float32,
`pickup_borocode` Int8,
`pickup_ct2010` String,
`pickup_boroct2010` String,
`pickup_cdeligibil` String,
`pickup_ntacode` FixedString(4),
`pickup_ntaname` String,
`pickup_puma` UInt16,
`dropoff_nyct2010_gid` UInt8,
`dropoff_ctlabel` Float32,
`dropoff_borocode` UInt8,
`dropoff_ct2010` String,
`dropoff_boroct2010` String,
`dropoff_cdeligibil` String,
`dropoff_ntacode` FixedString(4),
`dropoff_ntaname` String,
`dropoff_puma` UInt16
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(pickup_date)
ORDER BY pickup_datetime;
插入数据
INSERT INTO trips
SELECT * FROM s3(
'https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_{1..2}.gz',
'TabSeparatedWithNames'
)
由于我没有安装 client 使用的是 dataGrip 通过 JDBC 连接的 clickhouse(方式可查看 Docker 安装 ClickHouse),所以在执行上面的语句的时候会超时(暂时不知道如何设置超时时间),所以先将文件下载下来,使用 docker 容器的客户端再导入
# 下载有点慢,需要耐心等待,每个文件 78M,下载后然后解压
wget https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_1.gz
wget https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_2.gz
下载成功之后,将文件复制到 docker 容器里面
# 找到 clickhouse 容器名称
➜ clickhouse docker ps | grep clickhouse
36705c628e62 clickhouse/clickhouse-server:22.6.4.35 "/entrypoint.sh" About an hour ago Up About an hour 9009/tcp, 0.0.0.0:18123->8123/tcp, 0.0.0.0:19000->9000/tcp clickhouse
ca43ee3855c5 yandex/clickhouse-server:20.3.9.70 "/entrypoint.sh" 3 weeks ago Up 5 days (healthy) 8123/tcp, 9000/tcp, 9009/tcp sentry-self-hosted-clickhouse-1
# 当前目录下有两个文件(解压后的文件)
➜ clickhouse ls
config.yaml data logs trips_1 trips_2
# 将文件复制到容器中
➜ clickhouse docker cp ./trips_1 clickhouse:/
➜ clickhouse docker cp ./trips_2 clickhouse:/
# 进入容器
➜ clickhouse docker exec -it clickhouse /bin/bash
# 这里在容器里面的话,加不加用户名密码都可以
root@36705c628e62:/# clickhouse-client --user root --password root --query "INSERT INTO trips FORMAT TabSeparatedWithNames" < trips_1
root@36705c628e62:/# clickhouse-client --user root --password root --query "INSERT INTO trips FORMAT TabSeparatedWithNames" < trips_2
现在就可以使用 dataGrip(有如何用 dataGrip 链接) 的控制台查询了
select count() from trips;
-- 响应数量为 1999657
dataGrip 中也能直接查看表数据了
下面看看有多快
SELECT DISTINCT(pickup_ntaname) FROM trips
该语句处理 1,999,657 行数据,并返回 190 个值,大约耗时 0.05 秒
default> SELECT DISTINCT(pickup_ntaname) FROM trips
[2022-07-28 16:01:52] 在 58 ms (execution: 31 ms, fetching: 27 ms) 内检索到从 1 开始的 190 行