第1页
Evolution of Open Source Data Infrastructure
Past, Present, and Future
Fangjin Yang
Cofounder @ Imply
第2页
2016-4-23
第3页
Overview
Simpler times with small data The rise of open source Current open source landscape Where are we headed?
第4页
Data Insights
Broadly, we care about two use cases: - OLTP - OLAP
OLTP - business processing - dealing with transactions OLAP - reporting - business intelligence OLAP data - dimensions & measures
第5页
Small Data
第6页
Small Data Analytics
+
Excel
Tableau
- Solutions are very simple - Fast and easy to extract insights - Easy to create different custom visualizations
第7页
Data Growth
MPP databases? - Oracle, Teradata, IBM, Microsoft, etc.
Proprietary databases are expensive!
第8页
The Rise of Hadoop
第9页
Hadoop
Google GFS paper published in 2003 Google MapReduce paper published in 2004 Nutch project started in 2005 at Yahoo Nutch became Hadoop and was open sourced in 2006 Community quickly grew
第10页
Early Open Source Stacks
Data
Database
Applications/users
Data
Hadoop
Applications/users
第11页
Hadoop
Data
Storage (HDFS)
Processing (MapReduce)
Applications/users
第12页
Hadoop
When one technology becomes very adopted, its limitations also become more well known Hadoop is a very flexible solution Most commonly used for data processing Not optimized for many things - many inefficiencies!
第13页
Hadoop
Data
Storage (HDFS)
Processing (MapReduce)
Queries
Applications/users
第14页
Rise of Open Source Data Infrastructure
Things Hadoop isn’t good at: - Fast queries - Deliver (streams of) events - Stream processing - In-memory computation
These limitations led to new technologies to be created
第15页
Data Infrastructure Space Today
第16页
Modern Open Source Stacks
Data
Delivery
Processing
Storage
Querying
Applications/users
第17页
Modern Open Source Stacks
Data
Delivery
Processing
Storage
Querying
Applications/users
第18页
Data Delivery
Data producers
Delivery
Data consumers
第19页
Data Delivery
Focus is storing data for a limited time and delivering it elsewhere
Three different approaches - Apache Kafka - publish/subscribe, transaction queues - RabbitMQ - publish/subscribe, distributed queues - Apache Flume - push-based event delivery
第20页
Data Delivery
第21页
Storage
Data
Delivery
Storage
第22页
Storage
Distributed file systems Store data indefinitely Standard: HDFS Can overlap with delivery systems (e.g. Kafka)
第23页
Processing
Processing systems are designed to transform data
Has overlap with querying systems - Query systems: output set smaller than input set - Processing systems: output set same size as input set - Having separation is more standard nowadays
第24页
Stream Processing
Data
Delivery
Stream Processing
Storage
Data
Delivery
Stream Processing
Querying
第25页
Stream Processing
Systems deal with unbounded messages/events
Different approaches - Spark Streaming - Storm - Samza - Flink - Kafka Streams - etc.
第26页
Stream Processing
第27页
Batch Processing
Data
Delivery
Storage
Batch Processing
第28页
Batch Processing
Manipulate (large) static sets of data Different approaches
- Spark - Hadoop
第29页
Batch Processing
第30页
Querying
Data
Delivery
Processing
Querying
Applications/users
Data
Delivery
Processing
Storage
Querying
Applications/users
第31页
Querying
Largest and most complex (broad range of use cases) Let’s focus on the most common use case:
- Business intelligence/data warehousing/OLAP Significant overlap with storage
- Separation is becoming more common
第32页
SQL-on-Hadoop
Enable ad-hoc queries on different input formats Examples: Impala, Hive, Spark SQL, Drill, Presto
第33页
SQL-on-Hadoop
第34页
SQL-on-Hadoop
Advantages: - Flexible /w full SQL support
Disadvantages: - Slow - serialization/deserialization can have overhead
Many new storage formats - Apache Parquet, Apache Kudu, Apache Arrow, etc.
第35页
Key/Value Stores
Very fast writes Very fast lookups Timeseries databases often have K/V storage engines
第36页
Key/Value Stores
Pre-computation - Pre-compute every possible query - Pre-compute a set of queries - Exponential scaling costs
第37页
Key/Value Stores
Range scans - Primary key: dimensions/attributes - Value: measures/metrics (things to aggregate) - Still too slow!
第38页
Column stores
Load/scan exactly what you need for a query Different compression algorithms for different columns
- Encoding for string columns - Compression for measure columns Different indexes for different columns
第39页
Druid
Targeted for extremely low latency queries - powering user-facing analytic applications Custom column format optimized for event data and BI queries Supports lots of concurrent reads Streaming data ingestion
第40页
So many choices!
Does the project solve your use case? Is it stable? Cheap? Fast? Is there an active and growing community? 10x faster or 10x cheaper -- upgrade!
第41页
The Next Few Years
第42页
General Trends
Number of projects reaching saturation point Streaming computation In-memory computation Standards are slowly emerging
第43页
Future Open Source Stack?
Data
Delivery
Stream Processing
Querying
第44页
Future Open Source Stack?
Data
Delivery
Stream Processing
Querying
第45页
Applications
第46页
Thanks!
imply.io druid.io