1. 购买OSS存储表

image.png
image.png

2. 下载ClickHouse数据文件,并存储到OSS

:::info 通过阿里cloudshell(默认只让用1小时,1小时后开启新的实例)下载并传输到oss ::: image.png

  1. wget http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-complete.csv
  2. aliyun oss cp pp-complete.csv oss://clickhouse-test-sinosoft/pp-complete.csv

:::warning 下载上传时间 1个小时之内 :::

处理完成后一定要删除 cloudshell里面得文件

3. 购买并导入数据

:::info 购买 ::: image.png :::info 创建账户 ::: image.png :::info 开通公网 ,并配置安全组 :::

3.1 创建表

  1. CREATE TABLE uk_price_paid_local on cluster default
  2. (
  3. price UInt32,
  4. date Date,
  5. postcode1 LowCardinality(String),
  6. postcode2 LowCardinality(String),
  7. type Enum8('terraced' = 1, 'semi-detached' = 2, 'detached' = 3, 'flat' = 4, 'other' = 0),
  8. is_new UInt8,
  9. duration Enum8('freehold' = 1, 'leasehold' = 2, 'unknown' = 0),
  10. addr1 String,
  11. addr2 String,
  12. street LowCardinality(String),
  13. locality LowCardinality(String),
  14. town LowCardinality(String),
  15. district LowCardinality(String),
  16. county LowCardinality(String),
  17. category UInt8
  18. ) ENGINE = MergeTree ORDER BY (postcode1, postcode2, addr1, addr2);
  1. CREATE TABLE uk_price_paid_distributed on cluster default
  2. (
  3. price UInt32,
  4. date Date,
  5. postcode1 LowCardinality(String),
  6. postcode2 LowCardinality(String),
  7. type Enum8('terraced' = 1, 'semi-detached' = 2, 'detached' = 3, 'flat' = 4, 'other' = 0),
  8. is_new UInt8,
  9. duration Enum8('freehold' = 1, 'leasehold' = 2, 'unknown' = 0),
  10. addr1 String,
  11. addr2 String,
  12. street LowCardinality(String),
  13. locality LowCardinality(String),
  14. town LowCardinality(String),
  15. district LowCardinality(String),
  16. county LowCardinality(String),
  17. category UInt8
  18. ) ENGINE = Distributed(default, default, uk_price_paid_local, rand());

3.2 查看OSS在线数据

  1. SELECT * FROM oss('oss-cn-hangzhou-internal.aliyuncs.com', '<AK>', '<SK>', 'oss://clickhouse-test-sinosoft/pp-complete.csv', 'CSV','uuid String,price UInt32,time String,postcode String,a String,b String,c String,addr1 String,addr2 String,street String,locality String,town String,district String,county String,d String,e String');

3.3 导入数据

  1. insert into uk_price_paid_distributed SELECT price,
  2. toDate(concat(time,':00')) AS date,
  3. splitByChar(' ', postcode)[1] AS postcode1,
  4. splitByChar(' ', postcode)[2] AS postcode2,
  5. transform(a, ['T', 'S', 'D', 'F', 'O'], ['terraced', 'semi-detached', 'detached', 'flat', 'other']) AS type,
  6. b = 'Y' AS is_new,
  7. transform(c, ['F', 'L', 'U'], ['freehold', 'leasehold', 'unknown']) AS duration,
  8. addr1,
  9. addr2,
  10. street,
  11. locality,
  12. town,
  13. district,
  14. county,
  15. d = 'B' AS category FROM oss('oss-cn-hangzhou-internal.aliyuncs.com', '<AK>', '<SK>', 'oss://clickhouse-test-sinosoft/pp-complete.csv', 'CSV','uuid String,price UInt32,time String,postcode String,a String,b String,c String,addr1 String,addr2 String,street String,locality String,town String,district String,county String,d String,e String');

image.png

:::info 注意: 1个实例分布式表 导入数据大概 250s 官方案例上面说的 40s (因为官方是单机器)
如果更多实例,这个时间上会有提升 :::

3.4 其他查询

  1. SELECT count() FROM uk_price_paid_distributed;
  2. SELECT formatReadableSize(total_bytes) FROM system.tables WHERE name = 'uk_price_paid_local';
  3. -- 每年得平均价格 95ms
  4. SELECT toYear(date) AS year, round(avg(price)) AS price, bar(price, 0, 1000000, 80) FROM uk_price_paid_local GROUP BY year ORDER BY year;
  5. SELECT toYear(date) AS year, round(avg(price)) AS price, bar(price, 0, 1000000, 80) FROM uk_price_paid_distributed GROUP BY year ORDER BY year;
  6. -- 每年得平均价格 伦敦
  7. ------------------ 48ms
  8. SELECT toYear(date) AS year, round(avg(price)) AS price, bar(price, 0, 2000000, 100) FROM uk_price_paid_local WHERE town = 'LONDON' GROUP BY year ORDER BY year;
  9. ------------------62ms
  10. SELECT toYear(date) AS year, round(avg(price)) AS price, bar(price, 0, 2000000, 100) FROM uk_price_paid_distributed WHERE town = 'LONDON' GROUP BY year ORDER BY year;
  11. -- 查询昂贵的社区前100 148ms
  12. SELECT
  13. town,
  14. district,
  15. count() AS c,
  16. round(avg(price)) AS price,
  17. bar(price, 0, 5000000, 100)
  18. FROM uk_price_paid_distributed
  19. WHERE date >= '2020-01-01'
  20. GROUP BY
  21. town,
  22. district
  23. HAVING c >= 100
  24. ORDER BY price DESC
  25. LIMIT 100;

其他资料