Reliability

i.e., being fault-tolerant
fault = one component of the system deviating from its spec
failure = the system as a whole stops providing the required service to the user
fault tolerance = preventing faults from causing failures

Hardware Faults

hardware redundancy for individual components
software fault-tolerance techniques that tolerate the loss of entire machines

Software Errors

no quick solution

  • carefully thinking about assumptions and interactions in the system

  • thorough testing

  • process isolation

  • allowing processes to crash and restart

  • measuring, monitoring, and analyzing system behavior in production
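The "allow processes to crash and restart" idea can be sketched as a tiny supervisor loop. This is only an in-process analogue under stated assumptions: real systems usually restart OS processes through a process supervisor or orchestrator, and the names supervise, max_restarts, and crash_log are hypothetical, chosen for illustration.

```python
def supervise(worker, max_restarts=3, crash_log=None):
    """Run worker(); if it crashes, restart it up to max_restarts times.

    worker is any zero-argument callable (a hypothetical stand-in for a process).
    Each crash is appended to crash_log so it can be monitored and analyzed.
    """
    if crash_log is None:
        crash_log = []
    restarts = 0
    while True:
        try:
            return worker()
        except Exception as exc:
            crash_log.append(repr(exc))
            if restarts >= max_restarts:
                # Give up and surface the fault instead of looping forever.
                raise
            restarts += 1


# Usage: a worker that fails twice before succeeding.
attempts = {"count": 0}

def flaky_worker():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient fault")
    return "done"

log = []
result = supervise(flaky_worker, max_restarts=5, crash_log=log)
```

Bounding the number of restarts matters: a bug that crashes the worker every time should end up as a visible failure, not an endless restart loop.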

Human Errors

  • well-designed abstractions, APIs, and admin interfaces

  • sandbox environments for safe experimentation

  • thorough testing at all levels

  • allow quick and easy recovery from human errors

Scalability

Describing Load

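The right load parameters depend on the architecture of your system.
The book uses Twitter's fan-out as an example. A user's home timeline merges their own tweets with tweets from the people they follow. There are two strategies: (1) look up the tweets at read time and merge them, or (2) at write time, also write each new tweet into the timeline cache of every follower, so that a read only hits the cache.
Twitter ended up with a hybrid: most users' tweets are fanned out on write (strategy 2), while tweets from celebrities with very many followers are fetched at read time and merged in (strategy 1).
For this example, the key load parameter is the distribution of followers per user (weighted by how often those users tweet).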
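A minimal in-memory sketch of the two timeline strategies above, to make the read/write trade-off concrete. Everything here is an assumption for illustration (the Timelines class, its method names, a global sequence number in place of timestamps); it is not how Twitter implements it.

```python
from collections import defaultdict


class Timelines:
    def __init__(self):
        self.following = defaultdict(set)  # user -> authors they follow
        self.followers = defaultdict(set)  # author -> users following them
        self.tweets = defaultdict(list)    # author -> [(seq, text)]
        self.cache = defaultdict(list)     # user -> [(seq, author, text)]
        self.seq = 0                       # stand-in for a timestamp

    def follow(self, user, author):
        self.following[user].add(author)
        self.followers[author].add(user)

    def post(self, author, text, fan_out=True):
        self.seq += 1
        self.tweets[author].append((self.seq, text))
        if fan_out:
            # Strategy 2: one write per follower (plus the author's own timeline).
            for reader in self.followers[author] | {author}:
                self.cache[reader].append((self.seq, author, text))

    def timeline_on_read(self, user):
        # Strategy 1: gather and merge at read time; cheap writes, costly reads.
        authors = self.following[user] | {user}
        merged = [(seq, a, text) for a in authors for seq, text in self.tweets[a]]
        return sorted(merged, reverse=True)

    def timeline_from_cache(self, user):
        # Strategy 2: reads are a single cache lookup, newest first.
        return sorted(self.cache[user], reverse=True)


t = Timelines()
t.follow("bob", "alice")
t.post("alice", "hello")
t.post("bob", "hi alice")
```

With fan-out on write, the cost of post grows with the author's follower count, which is why the follower distribution is the load parameter to watch. A hybrid can be approximated by passing fan_out=False for accounts where fan-out on write is too expensive and merging their tweets in at read time.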

Describing Performance

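Two ways to look at it:

  • when load increases and resources stay the same, how is performance affected?

  • when load increases, how many extra resources are needed to keep performance unchanged?

For batch processing, the metric is usually throughput; for online systems, it is usually response time.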

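percentiles: describe response time by the median (p50) and high percentiles (p95, p99, p999) rather than the mean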
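A small sketch of why percentiles say more than the mean. It uses the nearest-rank definition of a percentile; the sample response times are made up for illustration, and the percentile helper is a hypothetical name, not a library function.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up response times in milliseconds; two slow outliers.
response_times_ms = [12, 15, 11, 14, 13, 250, 16, 12, 18, 1200]

mean = sum(response_times_ms) / len(response_times_ms)  # 156.1
p50 = percentile(response_times_ms, 50)                 # 14
p95 = percentile(response_times_ms, 95)                 # 1200
```

The mean (156.1 ms) describes no request anyone actually experienced: the typical request took about 14 ms, while the slowest took over a second. That gap is what the high percentiles expose.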

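An interesting observation from Amazon: the customers with the slowest requests tend to have the most data on their accounts, which means they have made the most purchases, so they are the most valuable customers.
SLO = service level objective
SLA = service level agreement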

Approaches for Coping with Load

scaling up = vertical scaling = moving to a more powerful machine
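scaling out = horizontal scaling = distributing the load across multiple smaller machines
The architecture is highly specific to the application, not generic. 100,000 requests per second at 1 kB each and 3 requests per minute at 2 GB each have the same data throughput, yet they call for very different architectures. Getting the assumptions about the load parameters right is critical. The ability to iterate quickly matters more than scaling for a hypothetical load.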
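The arithmetic behind the two workloads above, as a quick check. Decimal units (1 kB = 1,000 bytes, 1 GB = 1,000,000,000 bytes) are an assumption, and bytes_per_second is a hypothetical helper name.

```python
KB = 1_000
GB = 1_000_000_000


def bytes_per_second(requests, per_seconds, bytes_per_request):
    """Data throughput for `requests` arriving every `per_seconds` seconds."""
    return requests * bytes_per_request / per_seconds


many_small = bytes_per_second(100_000, 1, 1 * KB)  # 100,000 req/s at 1 kB each
few_large = bytes_per_second(3, 60, 2 * GB)        # 3 req/min at 2 GB each
```

Both come out to 100 MB/s, but the first workload is dominated by per-request overhead and concurrency, while the second is dominated by moving and storing large payloads.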

Maintainability

avoid creating legacy software

Operability

Good operations can often work around the limitations of bad software, but good software cannot run reliably with bad operations. Automate wherever possible and avoid manual intervention.

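  • monitoring system health

  • tracking down the cause of problems

  • keeping software and platforms up to date

  • keeping tabs on upstream and downstream systems

  • anticipating and avoiding future problems

  • establishing good practices and tools

  • performing complex maintenance tasks (the example given is moving an application from one platform to another; Docker?)

  • maintaining the security of the system as configuration changes are made

  • defining processes

  • preserving the organization's knowledge about the system (does this mean documentation?)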

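What a data system can do to help:

  • monitoring and visibility

  • automation tooling

  • avoiding single points of failure

  • documentation

  • good default behavior, plus the flexibility to override it when needed

  • self-healing, plus the option for admins to intervene manually

  • predictable behavior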

Simplicity

Mostly about design: clean code plus good abstractions

Evolvability

Agile practices, TDD, and refactoring