开源大数据技术之演进-Evolution of Open Source Data Infrastructure Past, Present, and Future by Fangjin Yang@Imply

AirJD

没有录音文件

00:00/00:00

加收藏

开源大数据技术之演进-Evolution of Open Source Data Infrastructure Past, Present, and Future by Fangjin Yang@Imply

发布者 big_data

发布于 1471566734963 浏览 7998 关键词 开源, 数据库

分享到

第1页

Evolution of Open Source Data Infrastructure

Past, Present, and Future

Fangjin Yang

Cofounder @ Imply

第2页

2016-4-23

第3页

Overview

Simpler times with small data The rise of open source Current open source landscape Where are we headed?

第4页

Data Insights

Broadly, we care about two use cases: - OLTP - OLAP

OLTP - business processing - dealing with transactions OLAP - reporting - business intelligence OLAP data - dimensions & measures

第5页

Small Data

第6页

Small Data Analytics

+

Excel

Tableau

- Solutions are very simple - Fast and easy to extract insights - Easy to create different custom visualizations

第7页

Data Growth

MPP databases? - Oracle, Teradata, IBM, Microsoft, etc.

Proprietary databases are expensive!

第8页

The Rise of Hadoop

第9页

Hadoop

Google GFS paper published in 2003 Google MapReduce paper published in 2004 Nutch project started in 2005 at Yahoo Nutch became Hadoop and was open sourced in 2006 Community quickly grew

第10页

Early Open Source Stacks

Data

Database

Applications/users

Data

Hadoop

Applications/users

第11页

Hadoop

Data

Storage (HDFS)

Processing (MapReduce)

Applications/users

第12页

Hadoop

When one technology becomes very adopted, its limitations also become more well known Hadoop is a very flexible solution Most commonly used for data processing Not optimized for many things - many inefficiencies!

第13页

Hadoop

Data

Storage (HDFS)

Processing (MapReduce)

Queries

Applications/users

第14页

Rise of Open Source Data Infrastructure

Things Hadoop isn’t good at: - Fast queries - Deliver (streams of) events - Stream processing - In-memory computation

These limitations led to new technologies to be created

第15页

Data Infrastructure Space Today

第16页

Modern Open Source Stacks

Data

Delivery

Processing

Storage

Querying

Applications/users

第17页

Modern Open Source Stacks

Data

Delivery

Processing

Storage

Querying

Applications/users

第18页

Data Delivery

Data producers

Delivery

Data consumers

第19页

Data Delivery

Focus is storing data for a limited time and delivering it elsewhere

Three different approaches - Apache Kafka - publish/subscribe, transaction queues - RabbitMQ - publish/subscribe, distributed queues - Apache Flume - push-based event delivery

第20页

Data Delivery

第21页

Storage

Data

Delivery

Storage

第22页

Storage

Distributed file systems Store data indefinitely Standard: HDFS Can overlap with delivery systems (e.g. Kafka)

第23页

Processing

Processing systems are designed to transform data

Has overlap with querying systems - Query systems: output set smaller than input set - Processing systems: output set same size as input set - Having separation is more standard nowadays

第24页

Stream Processing

Data

Delivery

Stream Processing

Storage

Data

Delivery

Stream Processing

Querying

第25页

Stream Processing

Systems deal with unbounded messages/events

Different approaches - Spark Streaming - Storm - Samza - Flink - Kafka Streams - etc.

第26页

Stream Processing

第27页

Batch Processing

Data

Delivery

Storage

Batch Processing

第28页

Batch Processing

Manipulate (large) static sets of data Different approaches

- Spark - Hadoop

第29页

Batch Processing

第30页

Querying

Data

Delivery

Processing

Querying

Applications/users

Data

Delivery

Processing

Storage

Querying

Applications/users

第31页

Querying

Largest and most complex (broad range of use cases) Let’s focus on the most common use case:

- Business intelligence/data warehousing/OLAP Significant overlap with storage

- Separation is becoming more common

第32页

SQL-on-Hadoop

Enable ad-hoc queries on different input formats Examples: Impala, Hive, Spark SQL, Drill, Presto

第33页

SQL-on-Hadoop

第34页

SQL-on-Hadoop

Advantages: - Flexible /w full SQL support

Disadvantages: - Slow - serialization/deserialization can have overhead

Many new storage formats - Apache Parquet, Apache Kudu, Apache Arrow, etc.

第35页

Key/Value Stores

Very fast writes Very fast lookups Timeseries databases often have K/V storage engines

第36页

Key/Value Stores

Pre-computation - Pre-compute every possible query - Pre-compute a set of queries - Exponential scaling costs

第37页

Key/Value Stores

Range scans - Primary key: dimensions/attributes - Value: measures/metrics (things to aggregate) - Still too slow!

第38页

Column stores

Load/scan exactly what you need for a query Different compression algorithms for different columns

- Encoding for string columns - Compression for measure columns Different indexes for different columns

第39页

Druid

Targeted for extremely low latency queries - powering user-facing analytic applications Custom column format optimized for event data and BI queries Supports lots of concurrent reads Streaming data ingestion

第40页

So many choices!

Does the project solve your use case? Is it stable? Cheap? Fast? Is there an active and growing community? 10x faster or 10x cheaper -- upgrade!

第41页

The Next Few Years

第42页

General Trends

Number of projects reaching saturation point Streaming computation In-memory computation Standards are slowly emerging

第43页

Future Open Source Stack?

Data

Delivery

Stream Processing

Querying

第44页

Future Open Source Stack?

Data

Delivery

Stream Processing

Querying

第45页

Applications

第46页

Thanks!

imply.io druid.io