Slide 1
The Architecture Evolution of Twitter Messaging
@sijieg | Twitter
Slide 2
2016-04-22
Slide 3
Agenda
- Background
- Layered Architecture
- Design Details
- Performance
- Scale @Twitter
- Q&A
Slide 4
Publish-Subscribe
- Online services (10s of milliseconds): transaction log, queues, RPC
- Near real-time processing (100s of milliseconds): change propagation, stream computing
- Data delivery for analytics (seconds to minutes): log collection, aggregation
Slide 5
Twitter Messaging in 2012
[Diagram: core business logic (tweets, fanouts, ...) feeding separate systems: Kestrel queues for deferred RPC, Kestrel in front of Gizzard/MySQL databases and BookKeeper for search, and Scribe feeding Kafka and HDFS]
Slide 6
Online Services - Kestrel
- Simple
- Performs well (as long as the queue fits in memory)
- Fan-out queues: one per subscriber
- Reliable reads: per-item transactions
- Cross-DC replication
Slide 7
Online Services - Kestrel Limitations
- Durability is hard to achieve: each queue is a separate file
- Adding subscribers is expensive: a separate physical copy of the queue for each fanout
- Read-behind degrades performance: too many random I/Os
- Scales poorly as the number of queues increases
Slide 8
Stream Computing - Kafka
- Throughput/latency via sequential I/O with a small number of topics
- Avoids data copying: direct network I/O (sendfile)
- Batch compression
- Cross-DC replication (mirroring)
Slide 9
Stream Computing - Kafka Limitations
- Relies on the filesystem page cache
- Limited number of topics: ideally one or a handful of topics per disk
- Performance degrades when a subscriber falls behind: too much random I/O
- No durability and replication (as of 0.7)
Slide 10
Problems
Each of these systems came with its own maintenance overhead:
- Software components: backend, clients, and interop with the rest of the Twitter stack
- Manageability and supportability: deployment, upgrades, hardware maintenance and optimization
- Technical know-how
Slide 11
Rethinking the Messaging Architecture
- Unified stack: tradeoffs for various workloads
- Durable writes, intra-cluster and geo-replication
- Multi-tenancy
- Scale resources independently: cost efficiency
- Ease of manageability
Slide 12
Layered Architecture
- Data Model
- Software Stack
- Data Flow
Slide 13
Log Stream
[Diagram: a log stream is a sequence of log segments (X, X+1, ..., Y); each entry is a batch of records, and each record is a sequence of bytes]
- Entry: a batch of records
- DLSN: (LSSN, Eid, Sid)
- Each record also carries a Sequence ID and a Transaction ID
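The DLSN above totally orders records across log segments: log segment sequence number first, then entry id within the segment, then slot id within the batched entry. A minimal sketch of that ordering, using an illustrative class (not the actual DistributedLog API):

```python
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class DLSN:
    """Illustrative DLSN, ordered by (log segment, entry, slot)."""
    lssn: int  # log segment sequence number
    eid: int   # entry id within the segment
    sid: int   # slot id of the record within the batched entry

# Records in a later log segment always sort after records in an
# earlier one, even though entry ids restart in each new segment.
a = DLSN(lssn=3, eid=120, sid=5)
b = DLSN(lssn=4, eid=0, sid=0)
assert a < b
```

Because the comparison is lexicographic over the three fields, readers can resume from any DLSN regardless of segment boundaries.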
Slide 14
Layered Architecture
[Diagram, top to bottom:]
- APPLICATION: write clients, read clients
- STATELESS SERVING: write proxy (with ownership tracker), read proxy (with routing service)
- CORE: writer, reader
- PERSISTENT STORAGE: BookKeeper (bookie nodes), metadata store (ZooKeeper), cold storage (HDFS)
Slide 15
Messaging Flow
1. Write clients write records
2. The write client transmits its buffer to the write proxy
3. Flush: the proxy writes a batched entry to the bookies
4. The bookies acknowledge the write
5. Commit: the proxy writes a control record
6. The read proxy long-polls the bookies for new data
7. The read proxy issues speculative reads across bookies
8. The read proxy caches records
9. Read clients long-poll the read proxy
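The write side of the flow above (records buffered, flushed as one batched entry, acknowledged on a quorum) can be sketched as follows. The classes and batch sizes are illustrative only; the real write proxy speaks the BookKeeper client protocol:

```python
class Bookie:
    """Toy bookie that durably appends entries (illustrative only)."""
    def __init__(self):
        self.entries = []

    def add_entry(self, entry):
        self.entries.append(entry)
        return True  # ack the write

class WriteProxy:
    """Toy write proxy: buffers records, flushes batched entries."""
    def __init__(self, bookies, batch_size=3):
        self.bookies = bookies
        self.batch_size = batch_size
        self.buffer = []
        self.acked_entries = 0

    def write(self, record):
        # Step 2: records accumulate in a transmit buffer
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Step 3: flush one batched entry to every bookie in the ensemble
        entry = list(self.buffer)
        self.buffer.clear()
        acks = sum(bookie.add_entry(entry) for bookie in self.bookies)
        # Step 4: count the entry as written once a quorum acknowledged it
        if acks >= len(self.bookies) // 2 + 1:
            self.acked_entries += 1

bookies = [Bookie() for _ in range(3)]
proxy = WriteProxy(bookies)
for i in range(6):
    proxy.write(f"record-{i}")

assert proxy.acked_entries == 2          # 6 records -> 2 batched entries
assert len(bookies[0].entries) == 2      # each bookie stored both entries
```

Batching several records into one entry is what lets the proxy trade latency for throughput under different flush policies, as shown later in the performance slides.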
Slide 16
Design Details
- Consistency
- Global Replicated Log
Slide 17
Consistency
- LastAddConfirmed => consistent views among readers
- Fencing => consistent views among writers
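The LastAddConfirmed (LAC) mechanism can be sketched as follows: readers only see entries up to the LAC the writer has advanced, so every reader gets the same view even while later entries are still in flight. This is a minimal toy model, not BookKeeper's actual ledger implementation:

```python
class Ledger:
    """Toy ledger tracking LastAddPushed vs LastAddConfirmed."""
    def __init__(self):
        self.entries = {}            # entry id -> payload (may be unacked)
        self.last_add_pushed = -1    # highest entry id sent to bookies
        self.last_add_confirmed = -1 # highest entry id acked by a quorum

    def add(self, payload):
        self.last_add_pushed += 1
        self.entries[self.last_add_pushed] = payload
        return self.last_add_pushed

    def confirm(self, entry_id):
        # The writer advances LAC once the quorum acked entry_id.
        self.last_add_confirmed = max(self.last_add_confirmed, entry_id)

    def read_visible(self):
        # Readers never see beyond LAC, so all readers agree.
        return [self.entries[i] for i in range(self.last_add_confirmed + 1)]

ledger = Ledger()
for payload in ("a", "b", "c"):
    ledger.add(payload)
ledger.confirm(1)                    # only entries 0 and 1 are acknowledged
assert ledger.read_visible() == ["a", "b"]
```

Entry "c" exists on the bookies (LastAddPushed = 2) but stays invisible until it is confirmed, which is exactly the gap illustrated on the next two slides.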
Slide 18
Consistency - LastAddPushed
[Diagram: the writer adds entries 7-12; LastAddPushed points at the most recently pushed entry]
Slide 19
Consistency - LastAddConfirmed
[Diagram: the writer adds entries 7-12 and advances LastAddConfirmed as adds are acked; readers only read up to LastAddConfirmed, behind LastAddPushed. When ownership changes, the new writer fences the old one before continuing]
Slide 20
Consistency - Fencing
0. Ownership changes from the old writer to a new writer
1. The new writer gets the list of log segments (completed log segment X, inprogress log segment X+1)
2.1 The new writer fences the inprogress log segment X+1 on the bookies
2.2 Any further writes from the old writer are rejected
2.3 The new writer completes the inprogress log segment X+1
3. The new writer starts a new inprogress log segment X+2
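The fencing sequence above hinges on one rule: once a segment is fenced on a bookie, that bookie rejects any further adds, so the old writer cannot sneak in writes after the new writer has recovered the segment. A minimal sketch with a toy bookie (illustrative only):

```python
class Bookie:
    """Toy bookie: fencing a segment rejects all later adds to it."""
    def __init__(self):
        self.fenced = False
        self.entries = []

    def fence(self):
        # Step 2.1: the new writer fences the inprogress segment; the
        # bookie reports how far it got, so recovery knows where to seal.
        self.fenced = True
        return len(self.entries) - 1  # last entry id persisted here

    def add_entry(self, entry):
        if self.fenced:
            # Step 2.2: the old writer's write is rejected.
            raise PermissionError("segment fenced: write rejected")
        self.entries.append(entry)

bookie = Bookie()
bookie.add_entry("e0")        # old writer is still writing
last_id = bookie.fence()      # new writer fences the segment
try:
    bookie.add_entry("e1")    # old writer tries to keep writing
    rejected = False
except PermissionError:
    rejected = True

assert rejected
assert last_id == 0 and bookie.entries == ["e0"]
```

Because fencing happens before the segment is completed (step 2.3), the new writer can seal log segment X+1 at a known last entry and open X+2 without ever racing the old writer.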
Slide 21
Consistency - Ownership Tracking
Ownership tracking (leader election) uses ZooKeeper ephemeral znodes (leases) with aggressive failure detection (within a second):
- tickTime = 500 ms
- session timeout = 1000 ms
Slide 22
Global Replicated Log
- Region-aware data placement
- Cross-region speculative reads
Slide 23
Global Replicated Log
[Diagram: three regions, each with its own ZooKeeper and bookies; write proxies (with an ownership tracker) use a region-aware placement policy to spread data across regions; the reader issues speculative reads against bookies in multiple regions]
Slide 24
Region-Aware Data Placement Policy
Hierarchical data placement:
- Data is spread uniformly across available regions
- Each region uses a rack-aware placement policy
- Acknowledge only when the data is persisted in a majority of regions
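The acknowledgement rule above means losing any single region never loses acknowledged data. A one-function sketch of the majority check (the region names are illustrative):

```python
def acked(persisted_regions, total_regions=3):
    """True once a strict majority of regions has persisted the entry."""
    return len(set(persisted_regions)) >= total_regions // 2 + 1

# With 3 regions, one region alone is not enough; two are.
assert not acked({"us-east"})
assert acked({"us-east", "us-west"})
assert acked({"us-east", "us-west", "eu-west"})
```

Within each region, the rack-aware policy applies the same idea one level down the hierarchy, so an entry also survives the loss of a rack inside any region.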
Slide 25
Cross-Region Speculative Reads
The reader consults the data placement policy for the read order:
- First: the bookie node that is closest to the client
- Second: the closest node that is in a different failure domain (a different rack)
- Third: the bookie node in the closest different region
- ...
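The read-order policy above can be sketched as a sort by failure-domain distance: try the closest replica first, and let each speculative fallback cross a wider domain. This is a simplified model, not the actual DistributedLog placement API:

```python
def read_order(replicas, reader_region, reader_rack):
    """Order (bookie, region, rack) replicas closest-first for the reader."""
    def distance(replica):
        _, region, rack = replica
        if region == reader_region and rack == reader_rack:
            return 0  # first: the node closest to the client
        if region == reader_region:
            return 1  # second: same region, but a different rack
        return 2      # third: a node in a different region
    return sorted(replicas, key=distance)

replicas = [("b1", "eu", "r1"), ("b2", "us", "r2"), ("b3", "us", "r1")]
order = read_order(replicas, reader_region="us", reader_rack="r1")
assert [name for name, _, _ in order] == ["b3", "b2", "b1"]
```

If the first read is slow, the reader speculatively fires the next candidate in this order, so a slow or failed bookie adds latency rather than causing a read failure.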
Slide 26
Performance
- Latency vs Throughput
- Scalability
- Efficiency
Slide 27
Support various workloads with latency/throughput tradeoffs
[Chart: latency and throughput under different flush policies]
Slide 28
Scale with multiple streams (single node vs multiple nodes)
[Charts:]
- At 100k rps, latency increased as the number of streams increased on a single hybrid proxy
- With each stream writing 100 rps, throughput increased linearly with the number of streams
Slide 29
Scale with a large number of fanout readers
[Chart: an analytics application writes 2.45 GB per second, and the data is fanned out 40x to the readers]
Slide 30
Messaging @ Twitter
- Use Cases
- Deployment Scale
Slide 31
Applications at Twitter
- Manhattan key-value store
- Durable deferred RPC
- Real-time search indexing
- Self-served pub-sub system / stream computing
- Reliable cross-datacenter replication
- ...
Slide 32
Scale at Twitter
- One global cluster, and a few local clusters in each DC
- O(10^3) bookie nodes
- O(10^3) global log streams and O(10^4) local log streams
- O(10^6) live log segments
- Pub-Sub: delivers O(1) trillion records per day, roughly O(10) PB per day
Slide 33
Lessons that we learned
- Make the foundation durable and consistent
- Don't trust the filesystem
- Think about workloads and I/O isolation
- Keep persistent state as simple as possible
- ...
Slide 34
DistributedLog is the new messaging foundation
- Layered architecture
  - Separates stateless serving from stateful storage
  - Scales CPU/memory/network (shared Mesos) independently of storage (hybrid Mesos)
- Messaging design
  - Writes/reads isolation
  - Scales writes (fan-in) independently of reads (fan-out)
- Global replicated log
Slide 35
Future
- Open source on GitHub (May): https://github.com/twitter/distributedlog
- Apache incubating ...
Slide 36
Q&A
Contact: guosijie@gmail.com | Twitter: @sijieg | WeChat: guosijie_