本文翻译自官方文档 Tutorial Docker 安装 ClickHouse

安装好 ClickHouse 之后,先来个例子感受下如何用的,

创建表 与 导入测试数据

  1. CREATE TABLE trips
  2. (
  3. `trip_id` UInt32,
  4. `vendor_id` Enum8('1' = 1, '2' = 2, '3' = 3, '4' = 4, 'CMT' = 5, 'VTS' = 6, 'DDS' = 7, 'B02512' = 10, 'B02598' = 11, 'B02617' = 12, 'B02682' = 13, 'B02764' = 14, '' = 15),
  5. `pickup_date` Date,
  6. `pickup_datetime` DateTime,
  7. `dropoff_date` Date,
  8. `dropoff_datetime` DateTime,
  9. `store_and_fwd_flag` UInt8,
  10. `rate_code_id` UInt8,
  11. `pickup_longitude` Float64,
  12. `pickup_latitude` Float64,
  13. `dropoff_longitude` Float64,
  14. `dropoff_latitude` Float64,
  15. `passenger_count` UInt8,
  16. `trip_distance` Float64,
  17. `fare_amount` Float32,
  18. `extra` Float32,
  19. `mta_tax` Float32,
  20. `tip_amount` Float32,
  21. `tolls_amount` Float32,
  22. `ehail_fee` Float32,
  23. `improvement_surcharge` Float32,
  24. `total_amount` Float32,
  25. `payment_type` Enum8('UNK' = 0, 'CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4),
  26. `trip_type` UInt8,
  27. `pickup` FixedString(25),
  28. `dropoff` FixedString(25),
  29. `cab_type` Enum8('yellow' = 1, 'green' = 2, 'uber' = 3),
  30. `pickup_nyct2010_gid` Int8,
  31. `pickup_ctlabel` Float32,
  32. `pickup_borocode` Int8,
  33. `pickup_ct2010` String,
  34. `pickup_boroct2010` String,
  35. `pickup_cdeligibil` String,
  36. `pickup_ntacode` FixedString(4),
  37. `pickup_ntaname` String,
  38. `pickup_puma` UInt16,
  39. `dropoff_nyct2010_gid` UInt8,
  40. `dropoff_ctlabel` Float32,
  41. `dropoff_borocode` UInt8,
  42. `dropoff_ct2010` String,
  43. `dropoff_boroct2010` String,
  44. `dropoff_cdeligibil` String,
  45. `dropoff_ntacode` FixedString(4),
  46. `dropoff_ntaname` String,
  47. `dropoff_puma` UInt16
  48. )
  49. ENGINE = MergeTree
  50. PARTITION BY toYYYYMM(pickup_date)
  51. ORDER BY pickup_datetime;

插入数据

  1. INSERT INTO trips
  2. SELECT * FROM s3(
  3. 'https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_{1..2}.gz',
  4. 'TabSeparatedWithNames'
  5. )

由于我没有安装 client 使用的是 dataGrip 通过 JDBC 连接的 clickhouse(方式可查看 Docker 安装 ClickHouse),所以在执行上面的语句的时候会超时(暂时不知道如何设置超时时间),所以先将文件下载下来,使用 docker 容器的客户端再导入

  1. # 下载有点慢,需要耐心等待,每个文件 78M,下载后然后解压
  2. wget https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_1.gz
  3. wget https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_2.gz

下载成功之后,将文件复制到 docker 容器里面

  1. # 找到 clickhouse 容器名称
  2. clickhouse docker ps | grep clickhouse
  3. 36705c628e62 clickhouse/clickhouse-server:22.6.4.35 "/entrypoint.sh" About an hour ago Up About an hour 9009/tcp, 0.0.0.0:18123->8123/tcp, 0.0.0.0:19000->9000/tcp clickhouse
  4. ca43ee3855c5 yandex/clickhouse-server:20.3.9.70 "/entrypoint.sh" 3 weeks ago Up 5 days (healthy) 8123/tcp, 9000/tcp, 9009/tcp sentry-self-hosted-clickhouse-1
  5. # 当前目录下有两个文件(解压后的文件)
  6. clickhouse ls
  7. config.yaml data logs trips_1 trips_2
  8. # 将文件复制到容器中
  9. clickhouse docker cp ./trips_1 clickhouse:/
  10. clickhouse docker cp ./trips_2 clickhouse:/
  11. # 进入容器
  12. clickhouse docker exec -it clickhouse /bin/bash
  13. # 这里在容器里面的话,加不加用户名密码都可以
  14. root@36705c628e62:/# clickhouse-client --user root --password root --query "INSERT INTO trips FORMAT TabSeparatedWithNames" < trips_1
  15. root@36705c628e62:/# clickhouse-client --user root --password root --query "INSERT INTO trips FORMAT TabSeparatedWithNames" < trips_2

现在就可以使用 dataGrip(有如何用 dataGrip 链接) 的控制台查询了

  1. select count() from trips;
  2. -- 响应数量为 1999657

dataGrip 中也能直接查看表数据了
image.png
下面看看有多快

  1. SELECT DISTINCT(pickup_ntaname) FROM trips

该语句处理 1,999,657 行数据,并返回 190 个值,大约耗时 0.05 秒

  1. default> SELECT DISTINCT(pickup_ntaname) FROM trips
  2. [2022-07-28 16:01:52] 58 ms (execution: 31 ms, fetching: 27 ms) 内检索到从 1 开始的 190