Azure Cosmos DB is a globally distributed, fully managed database with transparent multi-master replication. It supports multiple APIs, including SQL, Gremlin, MongoDB, Cassandra, Table, and etcd.

https://docs.microsoft.com/en-us/azure/cosmos-db/global-dist-under-the-hood

Global distribution topology diagram

image.png
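A Cosmos database consists of a set of Cosmos containers; the container is the unit of distribution and scalability. The tables, graphs, and other data we create are all Cosmos containers, and their data is indexed automatically on ingestion. Within a region, a container's data is distributed according to its partition key.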
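To make "distributed according to its partition key" concrete, here is a minimal sketch of hash-based placement. The function name `partition_for`, the use of SHA-256, and the fixed partition count are all assumptions for illustration; the notes above do not say how Cosmos DB actually maps keys to physical partitions.

```python
import hashlib


def partition_for(partition_key: str, num_partitions: int) -> int:
    """Map a partition key to a partition index by hashing it.

    Hypothetical helper: the hash function and the modulo scheme are
    illustrative choices, not Cosmos DB's real placement algorithm.
    """
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then bucket it.
    return int.from_bytes(digest[:8], "big") % num_partitions


def place_items(items, key_field: str, num_partitions: int):
    """Group item ids by the partition their key hashes to."""
    placement = {}
    for item in items:
        index = partition_for(item[key_field], num_partitions)
        placement.setdefault(index, []).append(item["id"])
    return placement
```

The point of the sketch is only that items sharing a partition key always land together, which is why the choice of key determines how evenly data and load spread.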

The figure below shows the two dimensions along which a container is distributed: within a region and across regions.
image.png
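A physical partition is implemented by a group of replicas. Each replica belongs to a single tenant (larger tenants get larger and more numerous replicas), and each replica hosts an instance of the Cosmos DB engine. The engine is schema-agnostic, which it achieves by automatically indexing everything, so users don't have to worry about schema or index management.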

Every component of the system is fully asynchronous – no thread ever blocks, and each thread does short-lived work without incurring any unnecessary thread switches. Rate-limiting and back-pressure are plumbed across the entire stack from the admission control to all I/O paths. Cosmos database engine is designed to exploit fine-grained concurrency and to deliver high throughput while operating within frugal amounts of system resources.

The two most important abstractions are replica-sets and partition-sets.

A physical partition is materialized as a self-managed and dynamically load-balanced group of replicas spread across multiple fault domains, called a replica-set

replica-sets

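The membership size N of a replica-set is dynamic: it fluctuates between NMin and NMax depending on failures, administrative operations, and the time it takes replicas to recover. The read and write quorums change along with it.
To spread requests evenly across a given physical partition, two ideas are used:

  • The leader of a replica-set bears a heavier write load than the followers, so it is budgeted more system resources.
  • The read quorum is drawn from the followers whenever possible (this borrows from a paper)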
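The two ideas above can be sketched roughly as follows. Everything here is an assumption for illustration: the bounds `N_MIN` and `N_MAX`, the simple-majority rule in `majority`, and the `Replica` type are hypothetical, and the real quorum sizes depend on the consistency level in ways these notes do not spell out.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical bounds on replica-set membership.
N_MIN, N_MAX = 3, 5


@dataclass(frozen=True)
class Replica:
    name: str
    is_leader: bool = False


def majority(n: int) -> int:
    """Simple-majority quorum size for n members (an assumed rule)."""
    if not N_MIN <= n <= N_MAX:
        raise ValueError(f"membership {n} outside [{N_MIN}, {N_MAX}]")
    return n // 2 + 1


def pick_read_quorum(replicas: List[Replica], size: int) -> List[Replica]:
    """Choose `size` replicas for a read, preferring followers.

    The leader is included only when there are not enough followers.
    """
    if size > len(replicas):
        raise ValueError("quorum larger than replica-set")
    followers = [r for r in replicas if not r.is_leader]
    leaders = [r for r in replicas if r.is_leader]
    return (followers + leaders)[:size]
```

When membership shrinks from 5 to 4 after a failure, `majority` still returns 3, while dropping to 3 members lowers it to 2; that is the sense in which the quorum "changes along with" N in this sketch.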

partition-sets

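A partition-set is a group of physical partitions, one drawn from each region, that together manage the same set of keys across regions.
You can think of a partition-set as a geographically distributed "super replica-set". Like a replica-set, its membership is dynamic: it changes with implicit partition management operations such as adding a new partition to a partition-set or adding a region.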

By virtue of having each of the partitions (of a partition-set) manage the partition-set membership within its own replica-set, the membership is fully decentralized and highly available. During the reconfiguration of a partition-set, the topology of the overlay between physical partitions is also established. The topology is dynamically selected based on the consistency level, geographical distance, and available network bandwidth between the source and the target physical partitions.

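The service lets you configure how many regions accept writes. The system uses a two-level nested consensus protocol: one level runs among the replicas of a replica-set, the other at the partition-set level, and together they guarantee a total order over committed writes.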

conflict resolution

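Update propagation, conflict resolution, and causality tracking are designed on the basis of two papers. In practice the papers were not followed verbatim; many transformations were applied, because what the papers describe falls short of the capabilities Cosmos requires, such as resource governance, stringent SLAs, and bounded staleness consistency.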

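Each physical partition accepts reads and writes from the clients closest to it. A write is persisted before the response is returned to the client, and it is propagated through an anti-entropy channel to the other physical partitions in the same partition-set. The anti-entropy replication frequency is dynamic: it depends on the partition-set topology, the geographic proximity of the physical partitions, and the consistency level. Within a partition-set, a primary commit scheme with a dynamically selected arbiter partition is used.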

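Encoded vector clocks (containing the region ID and logical clocks) are used for causality tracking, and version vectors are used to resolve update conflicts. The topology and the peer-selection algorithm keep the network and storage overhead of conflict resolution as small as possible, while also guaranteeing strict convergence.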
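As a rough illustration of how vector clocks expose causality, the sketch below compares two clocks keyed by region ID. This is the textbook vector-clock comparison, not Cosmos DB's encoded format; the names `compare` and `merge` and the region IDs used in the usage note are assumptions.

```python
from typing import Dict

# Region id -> logical clock value for that region.
VectorClock = Dict[str, int]


def compare(a: VectorClock, b: VectorClock) -> str:
    """Return 'equal', 'before', 'after', or 'concurrent' for a relative to b."""
    regions = set(a) | set(b)
    a_le_b = all(a.get(r, 0) <= b.get(r, 0) for r in regions)
    b_le_a = all(b.get(r, 0) <= a.get(r, 0) for r in regions)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"  # a causally precedes b
    if b_le_a:
        return "after"   # b causally precedes a
    return "concurrent"  # neither dominates: a true conflict


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Element-wise maximum, as used after two versions are reconciled."""
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in set(a) | set(b)}
```

For example, `compare({"eastus": 2}, {"northeurope": 1})` yields `"concurrent"`, which is the case where a conflict resolution policy has to decide the winner.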

There are two conflict resolution policies:

  • LWW (last writer wins), the default
  • Application-defined (custom)
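The two policies can be sketched as below. This is a simplified model, not the service's implementation: `resolve_lww`, `resolve_custom`, and the sample merge function are hypothetical, and the default comparison path `_ts` is assumed here to stand in for the item's timestamp property.

```python
from typing import Callable, Dict, List

Item = Dict[str, object]


def resolve_lww(versions: List[Item], path: str = "_ts") -> Item:
    """Last writer wins: keep the version with the largest value at `path`.

    Ties are broken by list order here, which is an arbitrary choice
    made for this sketch.
    """
    if not versions:
        raise ValueError("no versions to resolve")
    return max(versions, key=lambda v: v[path])


def resolve_custom(versions: List[Item],
                   merge_fn: Callable[[List[Item]], Item]) -> Item:
    """Application-defined policy: delegate the decision to user code."""
    if not versions:
        raise ValueError("no versions to resolve")
    return merge_fn(versions)


def sum_quantities(versions: List[Item]) -> Item:
    """Example custom merge: keep the newest version but add up 'qty'."""
    newest = dict(resolve_lww(versions))
    newest["qty"] = sum(int(v["qty"]) for v in versions)
    return newest
```

LWW is cheap and deterministic but silently discards the losing write; a custom policy such as `sum_quantities` is the kind of thing an application would supply when both writes carry information worth keeping.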

consistency level

https://docs.microsoft.com/en-us/azure/cosmos-db/consistency-levels
TLA+ specification: https://github.com/Azure/azure-cosmos-tla

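Between the two extremes of consistency (strong and eventual), Cosmos DB implements guarantees of varying strength, so users can choose flexibly according to their availability and performance requirements.
Cosmos DB offers five consistency models: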
image.png

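  • strong: provides linearizability
  • bounded staleness (also called time-delayed linearizability): it can be configured in two ways. One is K, the maximum number of versions by which a read may lag behind the writes; the other is T, the maximum time interval by which a read may lag behind the writes. If reads go to a single region only, it is no different from strong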
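A small sketch of what the K and T bounds mean for a single read. The function `within_staleness_bound` and its parameters are hypothetical, and treating the two limits as "whichever is configured must hold" is an assumption of this sketch rather than something stated in these notes.

```python
from typing import Optional


def within_staleness_bound(lag_versions: int,
                           lag_seconds: float,
                           max_versions: Optional[int] = None,
                           max_seconds: Optional[float] = None) -> bool:
    """Check whether a read's lag stays inside the configured bounds.

    lag_versions: how many versions the read is behind the latest write.
    lag_seconds:  how long ago the version being read was superseded.
    max_versions: the K bound, or None if not configured.
    max_seconds:  the T bound, or None if not configured.
    """
    if max_versions is not None and lag_versions > max_versions:
        return False
    if max_seconds is not None and lag_seconds > max_seconds:
        return False
    return True
```

With K = 0 the check only passes for the latest version, which matches the intuition that a tight enough bound degenerates into strong consistency.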

Differences from DocumentDB

https://stackoverflow.com/questions/43932359/what-are-the-differences-between-cosmodb-and-documentdb
Cosmos DB can be understood as an upgraded version of DocumentDB that supports more capabilities; DocumentDB can be migrated to Cosmos DB seamlessly.

30-day trial

image.png

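You can try it free for 30 days. I created a MongoDB Cosmos DB instance (er, I'm not even sure what to call it), distributed across East US and North Europe, with a collection named Items and a maximum throughput of 400 RU/s.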