Slide 1
The Architecture Evolution of Twitter Messaging
@sijieg | Twitter
Slide 2
2016-04-22
Slide 3
Agenda
- Background
- Layered Architecture
- Design Details
- Performance
- Scale @Twitter
- Q&A
Slide 4
Publish-Subscribe
- Online services (10s of milliseconds): transaction log, queues, RPC
- Near real-time processing (100s of milliseconds): change propagation, stream computing
- Data delivery for analytics (seconds to minutes): log collection, aggregation
Slide 5
Twitter Messaging in 2012
[Diagram: core business logic (tweets, fanouts, ...) feeding separate systems: Kestrel queues for deferred RPC, Kestrel in front of Gizzard/MySQL databases and BookKeeper for search, and Scribe feeding Kafka and HDFS]
Slide 6
Online Services - Kestrel
- Simple
- Performs well (as long as the queue fits in memory)
- Fan-out queues: one per subscriber
- Reliable reads: per-item transactions
- Cross-DC replication
Slide 7
Online Services - Kestrel Limitations
- Durability is hard to achieve: each queue is a separate file
- Adding subscribers is expensive: a separate physical copy of the queue for each fanout
- Read-behind degrades performance: too many random I/Os
- Scales poorly as the number of queues increases
Slide 8
Stream Computing - Kafka
- Throughput/latency via sequential I/O with a small number of topics
- Avoids data copying: direct network I/O (sendfile)
- Batch compression
- Cross-DC replication (mirroring)
Slide 9
Stream Computing - Kafka Limitations
- Relies on the filesystem page cache
- Limited number of topics: ideally one or a handful of topics per disk
- Performance degrades when a subscriber falls behind: too much random I/O
- No durability and replication (as of 0.7)
Slide 10
Problems
Each of these systems came with its own maintenance overhead:
- Software components: backend, clients, and interop with the rest of the Twitter stack
- Manageability and supportability: deployment, upgrades, hardware maintenance and optimization
- Technical know-how
Slide 11
Rethinking the Messaging Architecture
- Unified stack: tradeoffs for various workloads
- Durable writes, intra-cluster and geo-replication
- Multi-tenancy
- Scale resources independently: cost efficiency
- Ease of manageability
Slide 12
Layered Architecture
- Data Model
- Software Stack
- Data Flow
Slide 13
Log Stream
[Diagram: a log stream is a sequence of log segments (X, X+1, ..., Y); each entry is a batch of records, and each record is a sequence of bytes]
- Entry: a batch of records
- DLSN: (LSSN, Eid, Sid)
- Each record also carries a Sequence ID and a Transaction ID
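The DLSN above totally orders records across log segments: log segment sequence number first, then entry id within the segment, then slot id within the batched entry. A minimal sketch of that ordering, using an illustrative class (not the actual DistributedLog API):

```python
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class DLSN:
    """Illustrative DLSN, ordered by (log segment, entry, slot)."""
    lssn: int  # log segment sequence number
    eid: int   # entry id within the segment
    sid: int   # slot id of the record within the batched entry

# Records in a later log segment always sort after records in an
# earlier one, even though entry ids restart in each new segment.
a = DLSN(lssn=3, eid=120, sid=5)
b = DLSN(lssn=4, eid=0, sid=0)
assert a < b
```

Because the comparison is lexicographic over the three fields, readers can resume from any DLSN regardless of segment boundaries.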
Slide 14
Layered Architecture
[Diagram, top to bottom:]
- APPLICATION: write clients, read clients
- STATELESS SERVING: write proxy (with ownership tracker), read proxy (with routing service)
- CORE: writer, reader
- PERSISTENT STORAGE: BookKeeper (bookie nodes), metadata store (ZooKeeper), cold storage (HDFS)
Slide 15
Messaging Flow
1. Write clients write records
2. The write client transmits its buffer to the write proxy
3. Flush: the proxy writes a batched entry to the bookies
4. The bookies acknowledge the write
5. Commit: the proxy writes a control record
6. The read proxy long-polls the bookies for new data
7. The read proxy issues speculative reads across bookies
8. The read proxy caches records
9. Read clients long-poll the read proxy
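The write side of the flow above (records buffered, flushed as one batched entry, acknowledged on a quorum) can be sketched as follows. The classes and batch sizes are illustrative only; the real write proxy speaks the BookKeeper client protocol:

```python
class Bookie:
    """Toy bookie that durably appends entries (illustrative only)."""
    def __init__(self):
        self.entries = []

    def add_entry(self, entry):
        self.entries.append(entry)
        return True  # ack the write

class WriteProxy:
    """Toy write proxy: buffers records, flushes batched entries."""
    def __init__(self, bookies, batch_size=3):
        self.bookies = bookies
        self.batch_size = batch_size
        self.buffer = []
        self.acked_entries = 0

    def write(self, record):
        # Step 2: records accumulate in a transmit buffer
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Step 3: flush one batched entry to every bookie in the ensemble
        entry = list(self.buffer)
        self.buffer.clear()
        acks = sum(bookie.add_entry(entry) for bookie in self.bookies)
        # Step 4: count the entry as written once a quorum acknowledged it
        if acks >= len(self.bookies) // 2 + 1:
            self.acked_entries += 1

bookies = [Bookie() for _ in range(3)]
proxy = WriteProxy(bookies)
for i in range(6):
    proxy.write(f"record-{i}")

assert proxy.acked_entries == 2          # 6 records -> 2 batched entries
assert len(bookies[0].entries) == 2      # each bookie stored both entries
```

Batching several records into one entry is what lets the proxy trade latency for throughput under different flush policies, as shown later in the performance slides.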
Slide 16
Design Details
- Consistency
- Global Replicated Log
Slide 17
Consistency
- LastAddConfirmed => consistent views among readers
- Fencing => consistent views among writers
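The LastAddConfirmed (LAC) mechanism can be sketched as follows: readers only see entries up to the LAC the writer has advanced, so every reader gets the same view even while later entries are still in flight. This is a minimal toy model, not BookKeeper's actual ledger implementation:

```python
class Ledger:
    """Toy ledger tracking LastAddPushed vs LastAddConfirmed."""
    def __init__(self):
        self.entries = {}            # entry id -> payload (may be unacked)
        self.last_add_pushed = -1    # highest entry id sent to bookies
        self.last_add_confirmed = -1 # highest entry id acked by a quorum

    def add(self, payload):
        self.last_add_pushed += 1
        self.entries[self.last_add_pushed] = payload
        return self.last_add_pushed

    def confirm(self, entry_id):
        # The writer advances LAC once the quorum acked entry_id.
        self.last_add_confirmed = max(self.last_add_confirmed, entry_id)

    def read_visible(self):
        # Readers never see beyond LAC, so all readers agree.
        return [self.entries[i] for i in range(self.last_add_confirmed + 1)]

ledger = Ledger()
for payload in ("a", "b", "c"):
    ledger.add(payload)
ledger.confirm(1)                    # only entries 0 and 1 are acknowledged
assert ledger.read_visible() == ["a", "b"]
```

Entry "c" exists on the bookies (LastAddPushed = 2) but stays invisible until it is confirmed, which is exactly the gap illustrated on the next two slides.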
Slide 18
Consistency - LastAddPushed
[Diagram: the writer adds entries 7-12; LastAddPushed points at the most recently pushed entry]
Slide 19
Consistency - LastAddConfirmed
[Diagram: the writer adds entries 7-12 and advances LastAddConfirmed as adds are acked; readers only read up to LastAddConfirmed, behind LastAddPushed. When ownership changes, the new writer fences the old one before continuing]
Slide 20
Consistency - Fencing
0. Ownership changes from the old writer to a new writer
1. The new writer gets the list of log segments (completed log segment X, inprogress log segment X+1)
2.1 The new writer fences the inprogress log segment X+1 on the bookies
2.2 Any further writes from the old writer are rejected
2.3 The new writer completes the inprogress log segment X+1
3. The new writer starts a new inprogress log segment X+2
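The fencing sequence above hinges on one rule: once a segment is fenced on a bookie, that bookie rejects any further adds, so the old writer cannot sneak in writes after the new writer has recovered the segment. A minimal sketch with a toy bookie (illustrative only):

```python
class Bookie:
    """Toy bookie: fencing a segment rejects all later adds to it."""
    def __init__(self):
        self.fenced = False
        self.entries = []

    def fence(self):
        # Step 2.1: the new writer fences the inprogress segment; the
        # bookie reports how far it got, so recovery knows where to seal.
        self.fenced = True
        return len(self.entries) - 1  # last entry id persisted here

    def add_entry(self, entry):
        if self.fenced:
            # Step 2.2: the old writer's write is rejected.
            raise PermissionError("segment fenced: write rejected")
        self.entries.append(entry)

bookie = Bookie()
bookie.add_entry("e0")        # old writer is still writing
last_id = bookie.fence()      # new writer fences the segment
try:
    bookie.add_entry("e1")    # old writer tries to keep writing
    rejected = False
except PermissionError:
    rejected = True

assert rejected
assert last_id == 0 and bookie.entries == ["e0"]
```

Because fencing happens before the segment is completed (step 2.3), the new writer can seal log segment X+1 at a known last entry and open X+2 without ever racing the old writer.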
Slide 21
Consistency - Ownership Tracking
Ownership tracking (leader election) uses ZooKeeper ephemeral znodes (leases) with aggressive failure detection (within a second):
- tickTime = 500 ms
- session timeout = 1000 ms
Slide 22
Global Replicated Log
- Region-aware data placement
- Cross-region speculative reads
Slide 23
Global Replicated Log
[Diagram: three regions, each with its own ZooKeeper and bookies; write proxies (with an ownership tracker) use a region-aware placement policy to spread data across regions; the reader issues speculative reads against bookies in multiple regions]
Slide 24
Region-Aware Data Placement Policy
Hierarchical data placement:
- Data is spread uniformly across available regions
- Each region uses a rack-aware placement policy
- Acknowledge only when the data is persisted in a majority of regions
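The acknowledgement rule above means losing any single region never loses acknowledged data. A one-function sketch of the majority check (the region names are illustrative):

```python
def acked(persisted_regions, total_regions=3):
    """True once a strict majority of regions has persisted the entry."""
    return len(set(persisted_regions)) >= total_regions // 2 + 1

# With 3 regions, one region alone is not enough; two are.
assert not acked({"us-east"})
assert acked({"us-east", "us-west"})
assert acked({"us-east", "us-west", "eu-west"})
```

Within each region, the rack-aware policy applies the same idea one level down the hierarchy, so an entry also survives the loss of a rack inside any region.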
Slide 25
Cross-Region Speculative Reads
The reader consults the data placement policy for the read order:
- First: the bookie node that is closest to the client
- Second: the closest node that is in a different failure domain (a different rack)
- Third: the bookie node in the closest different region
- ...
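The read-order policy above can be sketched as a sort by failure-domain distance: try the closest replica first, and let each speculative fallback cross a wider domain. This is a simplified model, not the actual DistributedLog placement API:

```python
def read_order(replicas, reader_region, reader_rack):
    """Order (bookie, region, rack) replicas closest-first for the reader."""
    def distance(replica):
        _, region, rack = replica
        if region == reader_region and rack == reader_rack:
            return 0  # first: the node closest to the client
        if region == reader_region:
            return 1  # second: same region, but a different rack
        return 2      # third: a node in a different region
    return sorted(replicas, key=distance)

replicas = [("b1", "eu", "r1"), ("b2", "us", "r2"), ("b3", "us", "r1")]
order = read_order(replicas, reader_region="us", reader_rack="r1")
assert [name for name, _, _ in order] == ["b3", "b2", "b1"]
```

If the first read is slow, the reader speculatively fires the next candidate in this order, so a slow or failed bookie adds latency rather than causing a read failure.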
Slide 26
Performance
- Latency vs Throughput
- Scalability
- Efficiency
Slide 27
Support various workloads with latency/throughput tradeoffs
[Chart: latency and throughput under different flush policies]
Slide 28
Scale with multiple streams (single node vs multiple nodes)
[Charts:]
- At 100k rps, latency increased as the number of streams increased on a single hybrid proxy
- With each stream writing 100 rps, throughput increased linearly with the number of streams
Slide 29
Scale with a large number of fanout readers
[Chart: an analytics application writes 2.45 GB per second, and the data is fanned out 40x to the readers]
Slide 30
Messaging @ Twitter
- Use Cases
- Deployment Scale
Slide 31
Applications at Twitter
- Manhattan key-value store
- Durable deferred RPC
- Real-time search indexing
- Self-served pub-sub system / stream computing
- Reliable cross-datacenter replication
- ...
Slide 32
Scale at Twitter
- One global cluster, and a few local clusters in each DC
- O(10^3) bookie nodes
- O(10^3) global log streams and O(10^4) local log streams
- O(10^6) live log segments
- Pub-Sub: delivers O(1) trillion records per day, roughly O(10) PB per day
Slide 33
Lessons that we learned
- Make the foundation durable and consistent
- Don't trust the filesystem
- Think about workloads and I/O isolation
- Keep persistent state as simple as possible
- ...
Slide 34
DistributedLog is the new messaging foundation
- Layered architecture
  - Separates stateless serving from stateful storage
  - Scales CPU/memory/network (shared Mesos) independently of storage (hybrid Mesos)
- Messaging design
  - Writes/reads isolation
  - Scales writes (fan-in) independently of reads (fan-out)
- Global replicated log
Slide 35
Future
- Open source on GitHub (May): https://github.com/twitter/distributedlog
- Apache incubating ...
Slide 36
Q&A
Contact: guosijie@gmail.com | Twitter: @sijieg | WeChat: guosijie_