Reliability
in other words, fault tolerance
failure = the system as a whole stops providing the required service to the user
fault tolerance = preventing faults from causing failures
Hardware Faults
redundancy
software fault-tolerance techniques
Software Errors
no quick solution
carefully thinking about assumptions and interactions in the system
thorough testing
process isolation
allowing processes to crash and restart
measuring, monitoring, and analyzing system behavior in production
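A minimal sketch of "allow processes to crash and restart", assuming the work runs as a separate child process so that a crash cannot corrupt the supervisor. The function name `run_with_restarts` and the `max_restarts` limit are hypothetical, chosen only for illustration; real supervisors usually add backoff and logging.

```python
import subprocess
import sys


def run_with_restarts(argv, max_restarts=3):
    """Run argv as a child process and restart it whenever it exits non-zero.

    Returns the number of attempts it took to get a clean exit.
    Raises RuntimeError once the restart budget is used up.
    """
    for attempt in range(1, max_restarts + 2):  # first run + max_restarts retries
        result = subprocess.run(argv)
        if result.returncode == 0:
            return attempt
    raise RuntimeError(f"gave up after {max_restarts} restarts")


if __name__ == "__main__":
    # A child that exits cleanly needs no restart.
    attempts = run_with_restarts([sys.executable, "-c", "pass"])
    print("attempts:", attempts)
```

Capping the number of restarts is a deliberate choice here: a process that always crashes should surface as a failure instead of looping forever.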
Human Errors
well designed abstractions
sandbox
test
allow quick and easy recovery from human errors
Scalability
Describing Load
depends on the architecture of your system
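The book uses Twitter's fan-out as an example. A user's timeline is made up of two parts, their own tweets and tweets from the people they follow, and there are two strategies for building it: (1) look up and merge the tweets at read time, or (2) on every write, also push the tweet into the timeline cache of each of the author's followers, so that reads only hit the cache.
Twitter ended up with a hybrid: strategy 2 (fan-out on write) for most users, whose follower counts are small, and strategy 1 (merge at read time) for tweets from celebrities, because pushing one tweet into millions of caches is too expensive.
For this example, the key load parameter is the distribution of followers per user (weighted by how often those users tweet).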
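The two timeline strategies can be sketched in memory as follows. This is an illustration under assumptions, not Twitter's implementation: the class names `ReadTimeMerge` and `FanOutOnWrite`, the global sequence number used for ordering, and the choice to include a user's own tweets in their timeline are all made up for this sketch.

```python
from collections import defaultdict


class ReadTimeMerge:
    """Strategy 1: store tweets per author, merge them when a timeline is read."""

    def __init__(self):
        self.following = defaultdict(set)  # user -> users they follow
        self.tweets = defaultdict(list)    # author -> [(seq, text)]
        self.seq = 0

    def follow(self, follower, followee):
        self.following[follower].add(followee)

    def post(self, author, text):
        self.seq += 1
        self.tweets[author].append((self.seq, text))  # one write per tweet

    def timeline(self, user):
        # Reads are expensive: gather tweets from every followed user, then sort.
        sources = self.following[user] | {user}
        merged = [t for s in sources for t in self.tweets[s]]
        return [text for _, text in sorted(merged, reverse=True)]


class FanOutOnWrite:
    """Strategy 2: push each tweet into every follower's cached timeline."""

    def __init__(self):
        self.followers = defaultdict(set)  # user -> their followers
        self.cache = defaultdict(list)     # user -> cached timeline [(seq, text)]
        self.seq = 0

    def follow(self, follower, followee):
        self.followers[followee].add(follower)

    def post(self, author, text):
        self.seq += 1
        readers = self.followers[author] | {author}
        for reader in readers:             # writes grow with follower count
            self.cache[reader].append((self.seq, text))
        return len(readers)                # number of cache writes performed

    def timeline(self, user):
        # Reads are cheap: the cache is already in posting order.
        return [text for _, text in reversed(self.cache[user])]
```

The return value of `FanOutOnWrite.post` makes the cost visible: one tweet from an account with millions of followers means millions of cache writes, which is why follower count is the load parameter that matters. Note that this sketch does not backfill a cache when someone follows an author after the author has already posted.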
Describing Performance
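two ways to look at it
when load increases and resources stay the same, how is performance affected?
when load increases, how much extra resource is needed to keep performance unchanged?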
percentiles
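An interesting observation from Amazon: the users with the slowest responses tend to have the most data on their accounts, which means they have made the most purchases and are the most valuable customers. This is why high percentiles (tail latencies) matter.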
SLO = service level objectives
SLA = service level agreements
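A small sketch of how percentiles and an SLO check fit together. It uses the nearest-rank definition of a percentile, which is one of several common definitions; the function names `percentile` and `meets_slo` and the sample response times are made up for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def meets_slo(samples, p, threshold_ms):
    """True if the p-th percentile response time is within the threshold."""
    return percentile(samples, p) <= threshold_ms


if __name__ == "__main__":
    # Hypothetical response times in milliseconds, with one slow outlier.
    response_times_ms = [30, 32, 35, 40, 41, 45, 50, 60, 120, 1500]
    print("p50:", percentile(response_times_ms, 50))
    print("p99:", percentile(response_times_ms, 99))
    print("p99 <= 200 ms?", meets_slo(response_times_ms, 99, 200))
```

With these samples the median looks healthy while the 99th percentile is dominated by the single outlier, which is the reason an SLO is usually stated against a percentile and not against the mean.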
Approaches for Coping with Load
scaling up = vertical scaling = moving to a more powerful machine
scaling out = horizontal scaling = distributing the load across multiple smaller machines
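Architectures are highly specific, not generic. 100,000 req/s at 1 kB each and 3 req/min at 2 GB each have the same data throughput, yet call for very different architectures. Getting the load-parameter assumptions right is critical. Being able to iterate quickly matters more than adapting to a hypothetical load.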
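The "same throughput" claim can be checked with a few lines of arithmetic. This assumes decimal units (1 kB = 1,000 bytes, 2 GB = 2,000,000,000 bytes); the helper name `throughput_bytes_per_sec` is made up for this check.

```python
KB = 1_000
GB = 1_000_000_000


def throughput_bytes_per_sec(requests, per_seconds, bytes_per_request):
    """Data throughput for `requests` requests every `per_seconds` seconds."""
    return requests * bytes_per_request / per_seconds


# Many small requests: 100,000 requests per second, 1 kB each.
small = throughput_bytes_per_sec(100_000, 1, 1 * KB)

# Few large requests: 3 requests per minute, 2 GB each.
large = throughput_bytes_per_sec(3, 60, 2 * GB)

if __name__ == "__main__":
    print(small / 1_000_000, "MB/s")
    print(large / 1_000_000, "MB/s")
```

Both workloads come out to 100 MB/s, but the first is dominated by per-request overhead and the second by moving and storing very large objects, so the throughput number alone says little about the right architecture.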
Maintainability
avoid creating legacy software
Operability
good operations can often work around the limitations of bad software, but good software cannot run reliably with bad operations. Automate wherever possible and keep manual intervention to a minimum.
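monitor system health
track down the cause of problems
keep software and platforms up to date
keep an eye on upstream and downstream systems
anticipate and avoid future problems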
good practices and tools
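complex maintenance tasks; the example given is moving an application from one platform to another (Docker, perhaps?)
maintaining the security of the system as configuration changes
defining processes
preserving the organization’s knowledge about the system (does this mean documentation?)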
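what a data system can do
monitoring with good visibility
automation tooling
avoid single points of failure
documentation
good default behavior, plus flexible overrides when needed
self-healing, plus manual intervention by admins when needed
predictable behavior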
Simplicity
mostly about design: clean code + good abstractions
Evolvability
Agile practices, TDD, refactoring