https://blog.empathybox.com/post/24415262152/ssds-and-distributed-data-systems
SSDs and Distributed Data Systems, by Jay Kreps (the author of Kafka)
Some facts about SSDs:
- Seeks are eliminated, so random reads and writes are fast
- Sequential reads and writes are also better than a hard drive's, but not by nearly as much
- Random failures are less likely, since there is no mechanical device involved
- Write endurance is limited
- Blocks are large, around 512KB, so writing a single byte effectively rewrites a whole block; traditional disk blocks are 4KB, much smaller (see the sketch below)
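As a rough illustration of why the large erase block matters, here is a back-of-the-envelope write-amplification sketch; the 512KB block size comes from the notes above, while the 100-byte record size is an assumed figure:

```python
# Rough write-amplification estimate for small random writes on an SSD.
erase_block_bytes = 512 * 1024   # erase block size from the notes above (512KB)
record_bytes = 100               # assumed size of one small record

# Worst case: every random record update forces a whole erase block rewrite.
write_amplification = erase_block_bytes / record_bytes
print(f"worst-case write amplification: ~{write_amplification:.0f}x")
# ~5243x here: thousands of bytes hit the flash for every byte the
# application writes, which is why write endurance becomes a design issue.
```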
In the case of a traditional hard drive a single seek may have a latency cost easily 10 or 20x that of a TCP request on a local network. This means a remote cache hit is better than a local cache miss. SSDs bring local disk latency close to remote access latency, so the design should keep more data locally instead of making network requests.
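To make the comparison concrete, a quick sketch with assumed order-of-magnitude latency figures (not numbers from the post):

```python
# Order-of-magnitude latency comparison (assumed typical figures, in ms).
hdd_seek_ms = 10.0   # one random seek on a spinning disk
lan_rtt_ms  = 0.5    # TCP round trip within a local network
ssd_read_ms = 0.1    # one random read on an SSD

# Hard drives: a remote cache hit (one network RTT) beats a local cache miss
# that falls through to disk (one seek) by a wide margin.
print(f"HDD local miss vs remote hit: {hdd_seek_ms / lan_rtt_ms:.0f}x slower")

# SSDs: a local miss is now comparable to (or cheaper than) a network hop,
# so storing more data locally beats fanning out remote requests.
print(f"SSD local miss vs remote hit: {ssd_read_ms / lan_rtt_ms:.1f}x")
```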
For databases, the extreme form of this trend: "The extreme version of this for databases is just abandoning partitioning altogether and storing all data on all machines", but this only works if the application does not write much and the total data volume is limited.
SSDs will also change caching design to some extent, since RAM is comparatively expensive.
SSDs open up more access patterns for workloads that need random access, such as graph processing and social networks.
For offline workloads, MapReduce was designed around sequential IO; with SSDs the trade-offs may look different.
How does this affect database design?
Traditional data structures are no longer the best fit, not so much because of the drop in latency as because of write endurance.
On MLC drives, internal compaction produces latency spikes, but these tend to be large only in relative terms and are small in absolute terms (under 1ms). On both SLC and MLC, each block can only be erased a limited number of times; once the limit is reached, the block stops accepting writes.
One obvious conclusion is that SLC SSDs are priced roughly the same as memory.
If writes are mostly sequential, we can estimate the SSD's lifetime.
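A minimal sketch of such an estimate, assuming a mostly sequential workload so write amplification stays close to 1; the capacity, erase-cycle limit, and write rate below are assumed figures for illustration:

```python
# Rough SSD lifetime estimate for a mostly sequential write workload.
# All input figures are assumed for illustration only.
capacity_gb = 400          # drive capacity
erase_cycles = 3000        # rated program/erase cycles per block (MLC-class)
write_rate_mb_s = 10       # sustained application write rate
write_amplification = 1.1  # close to 1 for large, sequential writes

total_writable_gb = capacity_gb * erase_cycles / write_amplification
seconds = total_writable_gb * 1024 / write_rate_mb_s
years = seconds / (60 * 60 * 24 * 365)
print(f"estimated lifetime: {years:.1f} years")   # about 3.5 years here
```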
It is worth noting that it doesn’t actually matter if the writes to the filesystem are large or small so long as the writes to the physical disk are large.
A better approach is to choose a data structure that matches the SSD: A better option is just to use a storage format that naturally does linear writes. Traditional storage formats include the B+Tree and linear hashing. These formats group data into blocks by key, and hence writes are randomly scattered on the disk unless the write order happens to match the key ordering (which you can't count on except in bulk load situations where you can choose the order in which records are updated).
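A toy sketch of the contrast, with a hypothetical key-ordered block format standing in for B+Tree-style layouts and a simple append-only log standing in for a linear-write format:

```python
# Toy contrast between a key-ordered block format (B+Tree-like) and an
# append-only log; purely illustrative, not a real storage engine.

def blocked_put(blocks, key, value, keys_per_block=100):
    # Data is grouped into blocks by key, so random-key updates scatter
    # across the disk unless the update order matches the key order.
    block_id = key // keys_per_block
    blocks.setdefault(block_id, {})[key] = value
    return block_id  # the block that has to be rewritten

def log_put(log, key, value):
    # Every update is appended at the tail: writes stay linear no matter
    # what order the keys arrive in.
    log.append((key, value))
    return len(log) - 1  # always the next position in the log

blocks, log = {}, []
for key in [9001, 17, 512, 42]:          # arbitrary (random) key order
    print("block touched:", blocked_put(blocks, key, "v"),
          "| log offset:", log_put(log, key, "v"))
```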
When putting an LSM on an SSD, one important factor is whether to flush immediately (i.e. whether to fsync on every write); if so, each write will be small unless the records themselves are fairly large.
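A minimal sketch of that trade-off, assuming a hypothetical append-only log file: fsync on every write keeps each physical write as small as one record, while batching records before fsync keeps the writes to the physical disk large (which, per the quote above, is what actually matters):

```python
import os

def append_fsync_each(f, records):
    # Durable immediately, but each physical write is only one record:
    # small writes unless the records themselves are large.
    for rec in records:
        f.write(rec)
        f.flush()
        os.fsync(f.fileno())

def append_group_commit(f, records):
    # Buffer a batch, then flush once: the write to the physical disk
    # stays large even though the logical records are small.
    for rec in records:
        f.write(rec)
    f.flush()
    os.fsync(f.fileno())

with open("wal.log", "ab") as f:
    append_group_commit(f, [b"record-%d\n" % i for i in range(1000)])
```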
Supposedly cloud vendors will charge based on SSD erase cycles, but in what form and at what price is not yet known.