第1页
Data Exploration and Analysis Based on the Elastic Stack
Big Data Platform Architecture
@medcl
第2页
2016-4-22
第3页
About me
• 曾勇 (Medcl)
• Developer @ Elastic
‒ Following Elasticsearch since v0.5 (2010)
‒ Joined Elastic in September 2015
‒ Now on the Beats team
• @medcl
• medcl@elastic.co
• http://github.com/medcl
• Based in Changsha, Hunan, China
第4页
What’s Elastic?
• A distributed startup company, since 2012
‒ HQ: Mountain View, CA and Amsterdam, Netherlands
‒ Employees in 27 countries (and counting), spread across 18 time zones, speaking over 30 languages
• We work on open source projects!
‒ (Luckily, some of them are popular, e.g. Elasticsearch)
• We offer support subscriptions, X-Pack, Cloud, and training
• Find us at https://github.com/elastic and https://www.elastic.co
第5页
Have you heard of “ELK”?
第6页
But ELK is out!
Here I come!
Beats & Packetbeat
ELKB? BELK? LKBE? BKEL?
第7页
Logo
第8页
Release Bonanza
第9页
It’s time to unite!
第10页
The “Elastic Stack”, staying together from v5.0
[Diagram: stack layers: User Interface; Store, Index, & Analyze; Ingest; plus Extensions]
第11页
What can the Elastic Stack do?
第12页
GitHub: Enabling Powerful Search for Both End Users and Developers
https://www.elastic.co/use-cases/github
第13页
NASA: Unlocking Interplanetary Datasets with Real-Time Search
Pic:http://mars.jpl.nasa.gov/msl/multimedia/images/?ImageID=7693
https://www.elastic.co/elasticon/2015/sf/unlocking-interplanetary-datasets-with-real-time-search
第14页
Datadog: Analyzing metrics and time-series data
https://www.elastic.co/use-cases/data-dog
第15页
More: https://www.elastic.co/use-cases
第16页
第17页
Logstash: Collect from diverse inputs
Inputs: Logs, Machine Data, Databases, Message Queues, Social, Web APIs, Sensors
• Collects from diverse sources
– Logs + many others
– Over 200 plugins
• Connects with live streams
– Real-time data
– Wire / transaction data
– Full-packet network capture
http://github.com/elastic/logstash
第18页
第19页
• Beats are lightweight shippers that collect and ship all kinds of operational data to Elasticsearch
‒ Small applications
‒ Installed as agents on your servers
‒ Written in Go
‒ No runtime dependencies
‒ Single purpose
http://github.com/elastic/beats
第20页
Examples of operational data
第21页
Packetbeat: Real-time application monitoring
Let’s go realtime!
Sniffs the traffic between your servers and parses the application-level protocols on the fly.
Built-in protocols:
• HTTP
• MySQL
• PostgreSQL
• Redis
• Thrift-RPC
• MongoDB
• DNS
• Memcache
• ICMP
• AMQP
• …
第22页
winlogbeat!
Forwards Windows Event logs to Elasticsearch
第23页
Filebeat
A more lightweight log shipper
• Generic filtering: flexibly reduce the amount of data sent over the wire and stored
第24页
Topbeat
Like the Unix top command, but sends the output periodically to Elasticsearch. Also works on Windows.
• System-wide: system load, total CPU usage, …
• Per process: state, name, command line, …
• Disk usage: available disks, used and free space, …
第25页
There’s More!
Metricbeat: Connecting Numb3rs
• Listens to the internal “beat” of systems via APIs.
http://github.com/elastic/beat-generator/
第26页
第27页
What’s Kibana?
Kibana is an open source analytics and visualization platform designed to work with Elasticsearch.
http://github.com/elastic/kibana
https://github.com/elastic/generator-kibana-plugin
第28页
Search & Exploration
第29页
Visualization & Dashboard
第30页
第31页
Elasticsearch is an open source, distributed, scalable, highly available, document-oriented, RESTful full-text search engine with real-time search and analytics capabilities.
Thomson Reuters: “107 clusters ~1747 nodes” @Elastic{ON}16
https://speakerdeck.com/elastic/thomson-reuters-research-journalism-finance-and-elastic
Netflix: “~150 clusters totaling ~3,500 nodes hosting ~1.3 PB of data”
http://techblog.netflix.com/2016/02/evolution-of-netflix-data-pipeline.html?m=1
• Real-time analytics
• Time series data analytics
• Logging analytics
• Security analytics
• Fraud detection
• Prediction modeling
• Recommendations
• …
http://github.com/elastic/elasticsearch
第32页
Wait, isn’t Elasticsearch a search engine?
第33页
You know, for search, and analytics!
v0.09.0: Facets
v1.0.0: Aggregation
v2.0.0: Pipeline Aggregation
第34页
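Aggregation
• Analytics: histograms, distributions, statistics, geo, …
• Any data: whatever can be queried can be analyzed
• Near real time: computed on demand, ~1s refresh interval
• Nestable and composable: unlike facets, which have only one level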
第35页
Aggregation
Buckets: Terms, Histogram, Geohash grid, …
Metrics: min/avg/max, Stats, Cardinality, …
SELECT COUNT(*), AVG(score)   <--- Metrics
FROM `table`
GROUP BY province, city       <--- Buckets
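The SQL analogy can be sketched in plain Python: buckets partition the rows by key, and metrics are computed inside each bucket. The sample `rows` and the helper `group_by_with_metrics` are hypothetical names used only for illustration; this is not how Elasticsearch implements aggregations internally.

```python
from collections import defaultdict


def group_by_with_metrics(rows, bucket_keys, metric_field):
    """Bucket rows by `bucket_keys`, then compute COUNT(*) and AVG(metric_field) per bucket."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        key = tuple(row[k] for k in bucket_keys)  # bucket: like GROUP BY province, city
        counts[key] += 1                           # metric: COUNT(*)
        sums[key] += row[metric_field]             # running sum used for AVG(score)
    return {key: {"count": counts[key], "avg": sums[key] / counts[key]} for key in counts}


# Hypothetical sample data, not taken from the slides
rows = [
    {"province": "Hunan", "city": "Changsha", "score": 80},
    {"province": "Hunan", "city": "Changsha", "score": 90},
    {"province": "Beijing", "city": "Beijing", "score": 70},
]
result = group_by_with_metrics(rows, ["province", "city"], "score")
for key, metrics in result.items():
    print(key, metrics)
```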
第36页
Aggregation == the view from 30,000 feet == Patterns
Find some beauty (insights)!
第37页
Example: analyzing PM2.5 data
Or like this!
第38页
{ "city": "北京", "date": "2016-02-08", "aq_level": "严重污染", "aq_rank": 68,
"aqi": 391, "co": 115.5, "no2": 1.888, "o3": 62.2, "pm2_5": 415.7,
"range": "74~500", "so2": 523.5, "location": {"lat": 39.92, "lon": 116.46} }
Data source: http://www.aqistudy.cn
第39页
Structure of an Aggregation
第40页
Air quality statistics (Beijing, full year)
POST demo/_search?size=0
{
  "query": {…},
  "aggs": {
    "aq_stats": { "terms": { "field": "aq_level", "size": 10 } }
  }
}
第41页
Average air quality by city (nested)
{
  "aggs": {
    "city_stats": {
      "terms": { "field": "city", "size": 10 },
      "aggs": {
        "avg_pm25": { "avg": { "field": "pm2_5" } }
      }
    }
  }
}
第42页
30-day air quality trend analysis (Pipeline)
{
  "aggs": {
    "qa_date_histo": {
      "date_histogram": { "field": "date", "interval": "day" },
      "aggs": {
        "the_avg": { "avg": { "field": "pm2_5" } },
        "the_movavg": { "moving_avg": { "buckets_path": "the_avg", "window": 30 } }
      }
    }
  }
}
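A pipeline aggregation works on the output of another aggregation (here, the per-day averages) rather than on documents. A minimal sketch of that idea, assuming a trailing simple moving average that includes the current day; Elasticsearch's exact windowing and models may differ, and the input values are made up:

```python
def moving_average(values, window):
    """Trailing simple moving average over the most recent `window` values (fewer at the start)."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value that slid out of the window
        out.append(running / min(i + 1, window))
    return out


# Hypothetical per-day averages, standing in for the "the_avg" buckets
daily_avg_pm25 = [100.0, 200.0, 300.0, 400.0]
smoothed = moving_average(daily_avg_pm25, window=3)
print(smoothed)
```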
第43页
How Aggregations Work
• Lucene Collector
• Optimized data structures
– Compressed columnar data store (previously FieldData, now DocValues)
– Strings converted to enums (per segment)
• A single pass over your data, along with the query
– No matter how complex your aggregation is
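The "strings converted to enums" point can be illustrated with a small sketch: each distinct string in a segment gets a small integer ordinal, so counting works on integers instead of strings. The helper names and sample values below are assumptions for illustration, not Lucene's actual data structures.

```python
def encode_ordinals(values):
    """Map each string to its position in a sorted, de-duplicated per-segment dictionary."""
    dictionary = sorted(set(values))
    ordinal_of = {term: i for i, term in enumerate(dictionary)}
    return dictionary, [ordinal_of[v] for v in values]


def terms_counts(dictionary, ordinals):
    """Count occurrences by ordinal, resolving ordinals back to strings only at the end."""
    counts = [0] * len(dictionary)
    for o in ordinals:
        counts[o] += 1
    return {dictionary[i]: c for i, c in enumerate(counts)}


# One hypothetical segment's column of city values
segment_values = ["北京", "上海", "Beijing", "上海"]
dictionary, ordinals = encode_ordinals(segment_values)
counts = terms_counts(dictionary, ordinals)
print(dictionary, ordinals, counts)
```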
第44页
How Aggregations Work
[Diagram: a Coordinator node and three Shards]
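The coordinator-and-shards picture suggests a scatter-gather flow: each shard computes a partial result locally and the coordinator merges the partials. A minimal sketch under that assumption, for a terms-style count; `shard_terms`, `coordinator_merge`, and the sample shards are hypothetical.

```python
from collections import Counter


def shard_terms(shard_docs, field, shard_size):
    """Each shard counts terms locally and returns only its top `shard_size` buckets."""
    return Counter(doc[field] for doc in shard_docs).most_common(shard_size)


def coordinator_merge(partials, size):
    """The coordinator sums the per-shard counts and keeps the global top `size` buckets."""
    merged = Counter()
    for partial in partials:
        for term, count in partial:
            merged[term] += count
    return merged.most_common(size)


# Three hypothetical shards holding documents with a "city" field
shards = [
    [{"city": "北京"}, {"city": "北京"}, {"city": "上海"}],
    [{"city": "北京"}, {"city": "广州"}],
    [{"city": "上海"}, {"city": "上海"}, {"city": "北京"}],
]
partials = [shard_terms(docs, "city", shard_size=10) for docs in shards]
top = coordinator_merge(partials, size=2)
print(top)
```

Because each shard returns only its own top buckets, a small `shard_size` can make the merged counts approximate; the sketch uses a value large enough to avoid that.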
第45页
第46页
第47页
第48页
How Aggregations Work
Lucene Index / An ES Shard: Segment, Segment, Segment
Search: Inverted Index
Term      DocID
北京      1,5,7
上海      2,4,6
广州      ,11,19,23
Beijing   3,12,13,15
Aggregation: DocValues
DocID   Field
1       北京
2       上海
3       Beijing
4       上海
Collectors: Aggregation Collector, Top Hits Collector
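The two structures answer opposite questions: the inverted index maps a term to the documents containing it (search), while doc values map a document to its field value (aggregation). A minimal sketch, assuming contiguous doc IDs starting at 1 and using the four documents from the DocValues table; the function names are hypothetical.

```python
from collections import defaultdict

# doc ID -> field value, as in the DocValues table
docs = {1: "北京", 2: "上海", 3: "Beijing", 4: "上海"}


def build_inverted_index(docs):
    """term -> sorted list of doc IDs containing it (used by search)."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        index[docs[doc_id]].append(doc_id)
    return dict(index)


def build_doc_values(docs):
    """Column of values where position doc_id - 1 holds that document's value (used by aggregations)."""
    return [docs[doc_id] for doc_id in sorted(docs)]


def terms_agg(column, matching_ids):
    """Count field values for the matching documents by looking each one up in the column."""
    counts = defaultdict(int)
    for doc_id in matching_ids:
        counts[column[doc_id - 1]] += 1
    return dict(counts)


inverted = build_inverted_index(docs)
column = build_doc_values(docs)
hits = inverted.get("上海", [])           # search: which docs contain the term?
all_counts = terms_agg(column, docs.keys())  # aggregation: what values do the docs hold?
print(hits, all_counts)
```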
第49页
How Aggregations Work
POST demo/_search?size=0
{
  "aggs": {
    "city_stats": {
      "terms": { "field": "city", "size": 10 },
      "aggs": {
        "avg_pm25": { "avg": { "field": "pm2_5" } },
        "max_pm25": { "max": { "field": "pm2_5" } }
      }
    }
  }
}
Aggregator tree: root → terms (city) → avg (pm2_5), max (pm2_5)
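One way to picture the aggregator tree is as nested collectors: every matching document is handed to the terms aggregator once, which routes it to the avg and max aggregators of its bucket. The classes below are a hypothetical sketch of that single-pass idea, not Elasticsearch's actual classes, and the sample documents are made up.

```python
class Avg:
    def __init__(self, field):
        self.field, self.total, self.n = field, 0.0, 0

    def collect(self, doc):
        self.total += doc[self.field]
        self.n += 1

    def result(self):
        return self.total / self.n if self.n else None


class Max:
    def __init__(self, field):
        self.field, self.value = field, None

    def collect(self, doc):
        v = doc[self.field]
        if self.value is None or v > self.value:
            self.value = v

    def result(self):
        return self.value


class Terms:
    def __init__(self, field, sub_factories):
        self.field = field
        self.sub_factories = sub_factories  # name -> callable creating a fresh sub-aggregator
        self.buckets = {}

    def collect(self, doc):
        key = doc[self.field]
        if key not in self.buckets:
            subs = {name: make() for name, make in self.sub_factories.items()}
            self.buckets[key] = {"doc_count": 0, "subs": subs}
        bucket = self.buckets[key]
        bucket["doc_count"] += 1
        for sub in bucket["subs"].values():  # same document, same pass
            sub.collect(doc)

    def result(self):
        return {
            key: {"doc_count": b["doc_count"], **{n: s.result() for n, s in b["subs"].items()}}
            for key, b in self.buckets.items()
        }


root = Terms("city", {"avg_pm25": lambda: Avg("pm2_5"), "max_pm25": lambda: Max("pm2_5")})
for doc in [
    {"city": "北京", "pm2_5": 400.0},
    {"city": "上海", "pm2_5": 50.0},
    {"city": "北京", "pm2_5": 200.0},
]:
    root.collect(doc)  # a single pass over the matching documents
tree_result = root.result()
print(tree_result)
```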
第50页
第51页
第52页
第53页
What’s more?
• Approximate algorithms (fixed memory!)
‒ Unique values: Cardinality (HyperLogLog++)
‒ Percentiles: Percentile (TDigest)
• Controlling efficiency and memory usage
‒ Terms: breadth_first collect mode
‒ Sampler: max docs per shard
• More interesting aggregations!
‒ Significant Terms Aggregation: “the uncommonly common”
‒ Geohash grid
[Diagram: trade-off triangle among Real Time, Exact, and Big Data]
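To make the "fixed memory" idea concrete, here is a minimal sketch of plain HyperLogLog (not the HyperLogLog++ variant named above, and not Elasticsearch's implementation). The register count, the SHA-1 based hash, and the class name are assumptions chosen for illustration; memory stays at 4096 small registers no matter how many values are added.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p                 # number of registers, fixed up front
        self.registers = [0] * self.m

    def add(self, value):
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")      # 64-bit hash
        idx = h >> (64 - self.p)                   # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1  # position of the leftmost 1-bit
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:          # small-range correction (linear counting)
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for i in range(10000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")  # duplicates do not change the estimate
print(round(hll.count()))
```

The estimate is approximate by design; with this register count the typical relative error is on the order of a couple of percent.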
第54页
第55页
DEMO
Kibana, Timelion, Graph
第56页
第57页
第58页
DEMO
第59页
第60页
第61页
Demo
第62页
Demo
第63页
第64页
Community
• Source code & issues: http://github.com/elastic/
• English community: http://discuss.elastic.co
• Chinese community: http://elasticsearch.cn
• Official QQ group: 190605846
• Downloads: https://www.elastic.co/downloads
• Blog: https://www.elastic.co/blog
• Meetups: http://elasticsearch.meetup.com/
• IRC: #elasticsearch, #logstash, #kibana, #beats
• Official Twitter: @elastic
第65页
More questions?
Come find me at the Elastic booth!
第66页
Thanks!
第67页
Appendix
第68页
ES Basic Operations
Indexing (inserting data)
Request:
POST demo/pm2_5/1
{
  "city": "北京", "date": "2016-02-08",
  "aq_level": "严重污染", "aq_rank": 68, "aqi": 391, "co": 115.5,
  "no2": 1.888, "o3": 62.2, "pm2_5": 415.7, "range": "74~500", "so2": 523.5,
  "location": {"lat": 39.92, "lon": 116.46}
}
Response:
{
  "_index": "demo", "_type": "pm2_5", "_id": "1", "_version": 1,
  "_shards": { "total": 2, "successful": 1, "failed": 0 },
  "created": true
}
第69页
Basic Operations
Getting data
GET demo/pm2_5/1
{ "_index": "demo", "_type": "pm2_5", "_id": "1", "_version": 1, "found": true,
  "_source": { "aq_level": "严重污染", "aq_rank": 68,
    "aqi": 391, "city": "北京", "co": 115.5, "date": "2016-02-08", "province": "北京",
    "range": "74~500", "so2": 523.5 } }
第70页
Basic Operations
Deleting data
DELETE demo/pm2_5/1
第71页
Search
Searching via GET (URI) parameters
GET demo/pm2_5/_search?q=北京
第72页
Search
Query PM2.5 data for Beijing
POST demo/pm2_5/_search
{
  "query": { "match": { "city": "北京" } }
}
第73页
Search
Query PM2.5 data for Beijing on 2016-02-08
POST demo/pm2_5/_search
{
  "query": { "bool": { "must": [
    { "term": { "city": { "value": "北京" } } },
    { "term": { "date": { "value": "2016-02-08" } } }
  ] } }
}
第74页
Analysis
Aggregation
POST twitter/tweet/_search
{
  "aggs": {
    "users_stats": { "terms": { "field": "user" } }
  }
}
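Request bodies like the ones in this appendix are plain JSON, so they can be assembled programmatically before being sent. A small sketch using only the standard library; the helper names (`term_clause`, `bool_must`, `terms_agg`) are hypothetical, and actually sending the body to a cluster is deliberately left out.

```python
import json


def term_clause(field, value):
    """Build one exact-match clause, in the same shape as the term clauses in the bool query."""
    return {"term": {field: {"value": value}}}


def bool_must(*clauses):
    """Combine clauses so that all of them must match."""
    return {"bool": {"must": list(clauses)}}


def terms_agg(name, field):
    """Build a single terms aggregation keyed by `name`."""
    return {name: {"terms": {"field": field}}}


body = {
    "query": bool_must(term_clause("city", "北京"), term_clause("date", "2016-02-08")),
    "aggs": terms_agg("aq_stats", "aq_level"),
}
payload = json.dumps(body, ensure_ascii=False)  # JSON text for the request body
print(payload)
```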