第1页
Data Warehouses and Multi-Dimensional Data Analysis
Raimonds Simanovskis
@rsim
第5页
Vampires live here
第6页
500km long beach (310.686 miles)
Other vampires live here
第8页
Data Warehouses and Multi-Dimensional Data Analysis
Raimonds Simanovskis
@rsim
第9页
Sales app example
class Customer < ActiveRecord::Base has_many :orders
end
class Order < ActiveRecord::Base belongs_to :customer has_many :order_items
end
class Product < ActiveRecord::Base belongs_to :product_class has_many :order_items
end
class OrderItem < ActiveRecord::Base belongs_to :order belongs_to :product
end
class ProductClass < ActiveRecord::Base has_many :products
end
第10页
Database schema
第11页
One day CEO asks a question…
What were the total sales amounts
in California in Q1 2014 by product families?
第12页
Total sales amount …
OrderItem.sum("amount")
第13页
… in California …
OrderItem.joins(:order => :customer). where("customers.country" => "USA", "customers.state_province" => "CA"). sum("order_items.amount")
第14页
… in Q1 2014 …
OrderItem.joins(:order => :customer). where("customers.country" => "USA", "customers.state_province" => "CA"). where("extract(year from orders.order_date) = ?", 2014). where("extract(quarter from orders.order_date) = ?", 1). sum("order_items.amount")
第15页
… by product families
OrderItem.joins(:order => :customer). where("customers.country" => "USA", "customers.state_province" => "CA"). where("extract(year from orders.order_date) = ?", 2014). where("extract(quarter from orders.order_date) = ?", 1). joins(:product => :product_class). group("product_classes.product_family"). sum("order_items.amount")
第16页
Generated SQL
OrderItem.joins(:order => :customer). where("customers.country" => "USA", "customers.state_province" => "CA"). where("extract(year from orders.order_date) = ?", 2014). where("extract(quarter from orders.order_date) = ?", 1). joins(:product => :product_class). group("product_classes.product_family"). sum("order_items.amount")
SELECT SUM(order_items.amount) AS sum_order_items_amount, product_classes.product_family AS product_classes_product_family
FROM "order_items" INNER JOIN "orders" ON "orders"."id" = "order_items"."order_id" INNER JOIN "customers" ON "customers"."id" = "orders"."customer_id" INNER JOIN "products" ON "products"."id" = "order_items"."product_id" INNER JOIN "product_classes" ON "product_classes"."id" = "products"."product_class_id" WHERE "customers"."country" = 'USA'
AND "customers"."state_province" = 'CA' AND (extract(YEAR FROM orders.order_date) = 2014) AND (extract(quarter FROM orders.order_date) = 1) GROUP BY product_classes.product_family
第17页
… and also sales cost?
OrderItem.joins(:order => :customer). where("customers.country" => "USA", "customers.state_province" => "CA"). where("extract(year from orders.order_date) = ?", 2014). where("extract(quarter from orders.order_date) = ?", 1). joins(:product => :product_class). group("product_classes.product_family"). select("product_classes.product_family,"+
"SUM(order_items.amount) AS sales_amount,"+ "SUM(order_items.cost) AS sales_cost”). map{|i| i.attributes.compact}
第18页
… and unique customers count?
OrderItem.joins(:order => :customer). where("customers.country" => "USA", "customers.state_province" => "CA"). where("extract(year from orders.order_date) = ?", 2014). where("extract(quarter from orders.order_date) = ?", 1). joins(:product => :product_class). group("product_classes.product_family"). select("product_classes.product_family,"+
"SUM(order_items.amount) AS sales_amount,"+ "SUM(order_items.cost) AS sales_cost,"+ "COUNT(DISTINCT customers.id) AS customers_count"). map{|i| i.attributes.compact}
第19页
Is it clear?
OrderItem.joins(:order => :customer). where("customers.country" => "USA", "customers.state_province" => "CA"). where("extract(year from orders.order_date) = ?", 2014). where("extract(quarter from orders.order_date) = ?", 1). joins(:product => :product_class). group("product_classes.product_family"). select("product_classes.product_family,"+
"SUM(order_items.amount) AS sales_amount,"+ "SUM(order_items.cost) AS sales_cost,"+ "COUNT(DISTINCT customers.id) AS customers_count"). map{|i| i.attributes.compact}
#@%$^&
第20页
Performance slows down
on larger data volumes
$ rails console
>> OrderItem.count
(677.0ms) SELECT COUNT(*) FROM "order_items"
=> 6218022 >> Order.count
6 million rows
(126.0ms) SELECT COUNT(*) FROM "orders"
=> 642362
>> OrderItem.joins(:order => :customer).
joins(:product => :product_class).
group("product_classes.product_family").
select("product_classes.product_family,"+
"SUM(order_items.amount) AS sales_amount,"+
"SUM(order_items.cost) AS sales_cost,"+
"COUNT(DISTINCT customers.id) AS customers_count").
map{|i| i.attributes.compact}
OrderItem Load (25437.0ms) ...
25 seconds
第21页
You should use NoSQL !
第22页
You should use NoSQL !
第23页
Dimensional Modeling
Deliver data that’s understandable to the business users
Deliver fast query performance
第24页
Dimensional Modeling
What were the
total sales amounts
fact or measure
in California
Customer / Region dimension
in Q1 2014
Time dimension
by product families?
Product dimension
第25页
Data Warehouse
“Star schema” with fact and dimension tables
第26页
“Snowflake schema”
第27页
Data Warehouse Models
class Dwh::CustomerDimension < Dwh::Dimension has_many :sales_facts, class_name: “Dwh::SalesFact", foreign_key: "customer_id"
end
class Dwh::TimeDimension < Dwh::Dimension has_many :sales_facts, class_name: “Dwh::SalesFact", foreign_key: "time_id"
end
class Dwh::SalesFact < Dwh::Fact belongs_to :customer, class_name: "Dwh::CustomerDimension" belongs_to :product, class_name: "Dwh::ProductDimension" belongs_to :time, class_name: "Dwh::TimeDimension"
end
class Dwh::ProductDimension < Dwh::Dimension has_many :sales_facts, class_name: "Dwh::SalesFact", foreign_key: "product_id" belongs_to :product_class, class_name: "Dwh::ProductClassDimension"
end
class Dwh::ProductClassDimension < Dwh::Dimension has_many :products, class_name: "Dwh::ProductDimension", foreign_key: "product_class_id"
end
第28页
Load Dimension
class Dwh::CustomerDimension < Dwh::Dimension # ... def self.truncate! connection.execute "TRUNCATE TABLE #{table_name}" end def self.load! truncate! column_names = %w(id full_name city state_province country birth_date gender created_at updated_at) connection.insert %[ INSERT INTO #{table_name} (#{column_names.join(',')}) SELECT #{column_names.join(',')} FROM #{::Customer.table_name} ] end
end
第29页
class Dwh::TimeDimension < Dwh::Dimension
def self.load!
connection.select_values(%[
SELECT DISTINCT order_date FROM #{Order.table_name}
WHERE order_date NOT IN
(SELECT date_value FROM #{table_name})
]).each do |date|
year, month, day = date.year, date.month, date.day
quarter = ((month-1)/3)+1
quarter_name = "Q#{quarter} #{year}"
month_name = date.strftime("%b %Y")
Generate
day_name = date.strftime("%b %d %Y") sql = send :sanitize_sql_array, [
%[
Time
INSERT INTO #{table_name} (id, date_value, year, quarter, month, day, year_name, quarter_name, month_name, day_name)
VALUES
Dimension (?, ?, ?, ?, ?, ?, ?, ?, ?, ?) ],
date_to_id(date), date, year, quarter, month, day,
year.to_s, quarter_name, month_name, day_name
]
connection.insert sql
end
end
end
第30页
Load Facts
class Dwh::SalesFact < Dwh::Fact def self.load! truncate! connection.insert %[ INSERT INTO #{table_name} (customer_id, product_id, time_id, sales_quantity, sales_amount, sales_cost) SELECT o.customer_id, oi.product_id, CAST(to_char(o.order_date, 'YYYYMMDD') AS INTEGER), oi.quantity, oi.amount, oi.cost FROM #{OrderItem.table_name} oi INNER JOIN #{Order.table_name} o ON o.id = oi.order_id ] end
end
第31页
What were the total sales amounts
in California in Q1 2014 by product families?
Dwh::SalesFact. joins(:customer).joins(:product => :product_class).joins(:time). where("d_customers.country" => “USA",
"d_customers.state_province" => "CA"). where("d_time.year" => 2014, "d_time.quarter" => 1). group("d_product_classes.product_family"). sum("sales_amount")
第32页
Two-Dimensional Table
Columns
Rows
Cell
第33页
Multi-Dimensional Data Model
Data cube
Dimension
Dimension
Dimension
Measures
第34页
Multi-Dimensional Data Model
Customer
Sales cube
Product
Time
Measures
Sales quantity
Sales amount
Sales cost
Customers count
第35页
Dimension Hierarchies
All Customers
USA
Canada
WA CA
OR
San Francisco
Los Angeles
All Country
State City
Levels
第36页
Time Dimension
All Times
Q1 JUL
Q2 AUG
Q3 Q4 SEP
AUG 01
AUG 02
All Times
W1 JAN 17
W2 W3 W4
JAN 18
JAN 19
All Year Quarter Month Day
All Year Week Day
Default
hierarchy
Weekly
hierarchy
第37页
OLAP Technologies
On-Line Analytical Processing
Mondrian
http://community.pentaho.com/projects/mondrian/
mondrian-olap gem
https://github.com/rsim/mondrian-olap
第38页
Mondrian::OLAP::Schema.define do cube 'Sales' do table 'f_sales', schema: 'dwh'
mondrian-olap
dimension 'Customer', foreign_key: 'customer_id' do hierarchy all_member_name: 'All Customers', primary_key: 'id' do table 'd_customers', schema: 'dwh' level 'Country', column: 'country' level 'State Province', column: 'state_province' level 'City', column: 'city' level 'Name', column: 'full_name' end
end
schema definition
dimension 'Product', foreign_key: 'product_id' do hierarchy all_member_name: 'All Products', primary_key: 'id', primary_key_table: 'd_products' do join left_key: 'product_class_id', right_key: 'id' do table 'd_products', schema: 'dwh' table 'd_product_classes', schema: 'dwh' end level 'Product Family', table: 'd_product_classes', column: 'product_family' level 'Product Department', table: 'd_product_classes', column: 'product_department' level 'Product Category', table: 'd_product_classes', column: 'product_category' level 'Product Subcategory', table: 'd_product_classes', column: 'product_subcategory' level 'Brand Name', table: 'd_products', column: 'brand_name' level 'Product Name', table: 'd_products', column: 'product_name' end
end
dimension 'Time', foreign_key: 'time_id', type: 'TimeDimension' do hierarchy all_member_name: 'All Time', primary_key: 'id' do table 'd_time', schema: 'dwh' level 'Year', column: 'year', type: 'Numeric', name_column: 'year_name', level_type: 'TimeYears' level 'Quarter', column: 'quarter', type: 'Numeric', name_column: 'quarter_name', level_type: 'TimeQuarters' level 'Month', column: 'month', type: 'Numeric', name_column: 'month_name', level_type: 'TimeMonths' level 'Day', column: 'day', type: 'Numeric', name_column: 'day_name', level_type: 'TimeDays' end
end
measure 'Sales Quantity', column: 'sales_quantity', aggregator: 'sum' measure 'Sales Amount', column: 'sales_amount', aggregator: 'sum' measure 'Sales Cost', column: 'sales_cost', aggregator: ‘sum' measure ‘Customers Count', column: ‘customer_id', aggregator: ‘distinct-count' end end
第39页
What were the total sales amounts
in California in Q1 2014 by product families?
olap.from("Sales"). columns("[Measures].[Sales Amount]"). rows("[Product].[Product Family].Members"). where("[Customer].[USA].[CA]", "[Time].[Quarter].[Q1 2014]")
第40页
MDX Query Language
olap.from("Sales"). columns("[Measures].[Sales Amount]"). rows("[Product].[Product Family].Members"). where("[Customer].[USA].[CA]", "[Time].[Quarter].[Q1 2014]")
SELECT {[Measures].[Sales Amount]} ON COLUMNS, [Product].[Product Family].Members ON ROWS FROM [Sales] WHERE ([Customer].[USA].[CA], [Time].[Quarter].[Q1 2014])
第41页
Results Caching
SELECT {[Measures].[Sales Amount], [Measures].[Sales Cost], [Measures].[Customers Count]} ON COLUMNS,
[Product].[Product Family].Members ON ROWS FROM [Sales] (21713.0ms) SELECT {[Measures].[Sales Amount], [Measures].[Sales Cost],
[Measures].[Customers Count]} ON COLUMNS, [Product].[Product Family].Members ON ROWS FROM [Sales] (10.0ms)
第42页
Additional Attribute Dimension
dimension 'Gender', foreign_key: 'customer_id' do hierarchy all_member_name: 'All Genders', primary_key: 'id' do table 'd_customers', schema: 'dwh' level 'Gender', column: 'gender' do name_expression do sql "CASE d_customers.gender WHEN 'F' THEN ‘Female' WHEN 'M' THEN ‘Male' END" end end end
end
olap.from("Sales"). columns("[Measures].[Sales Amount]"). rows("[Gender].[Gender].Members")
第43页
Dynamic Attribute Dimension
dimension 'Age interval', foreign_key: 'customer_id' do
hierarchy all_member_name: 'All Age', primary_key: 'id' do
table 'd_customers', schema: 'dwh'
level 'Age interval' do
key_expression do
sql %[
CASE
WHEN age(d_customers.birth_date) < interval '20 years'
THEN '< 20 years'
WHEN age(d_customers.birth_date) < interval '30 years'
THEN '20-30 years'
WHEN age(d_customers.birth_date) < interval '40 years'
THEN '30-40 years'
WHEN age(d_customers.birth_date) < interval '50 years'
THEN '40-50 years'
ELSE '50+ years'
END ] end end end end
[Age interval].[<20 years] [Age interval].[20-30 years] [Age interval].[30-40 years] [Age interval].[40-50 years] [Age interval].[50+ years]
第44页
Calculation Formulas
calculated_member 'Profit', dimension: 'Measures', format_string: '#,##0.00', formula: '[Measures].[Sales Amount] - [Measures].[Sales Cost]'
calculated_member 'Margin %', dimension: 'Measures', format_string: '#,##0.00%', formula: '[Measures].[Profit] / [Measures].[Sales Amount]'
olap.from("Sales"). columns("[Measures].[Profit]", "[Measures].[Margin %]"). rows("[Product].[Product Family].Members"). where("[Customer].[USA].[CA]", "[Time].[Quarter].[Q1 2014]")
第45页
Enables Ad-hoc Queries by Users
第46页
ETL process
Extract
Database
REST API
Transform
Load
Data
Warehouse
Dimension1
Dimension2
Measures
Dimension3
Dimension4
第47页
Ruby Tools for ETL
ETL https://github.com/square/ETL Kiba http://www.kiba-etl.org/
第48页
Kiba example
# declare a ruby method here, for quick reusable logic def parse_french_date(date)
Date.strptime(date, '%d/%m/%Y') end # or better, include a ruby file which loads reusable assets # eg: commonly used sources / destinations / transforms, under unit-test require_relative 'common' # declare a source where to take data from (you implement it - see notes below) source MyCsvSource, 'input.csv' # declare a row transform to process a given field transform do |row|
row[:birth_date] = parse_french_date(row[:birth_date]) # return to keep in the pipeline row end # declare another row transform, dismissing rows conditionally by returning nil transform do |row| row[:birth_date].year < 2000 ? row : nil end # declare a row transform as a class, which can be tested properly transform ComplianceCheckTransform, eula: 2015
第49页
Multithreaded ETL
https://github.com/ruby-concurrency/concurrent-ruby
Data source
Extracted
Transformed
Extract
data
Transform
data
Load
ThreadPool
ThreadPool
ThreadPool
Pro-tip: Use
第50页
class Dwh::TimeDimension < Dwh::Dimension
def self.load!
logger.silence do
connection.select_values(%[
SELECT DISTINCT order_date FROM #{Order.table_name}
WHERE order_date NOT IN (SELECT date_value FROM #{table_name})
]).each do |date|
insert_date(date)
end
end
end
def self.insert_date(date) year, month, day = date.year, date.month, date.day
Single
quarter = ((month-1)/3)+1
quarter_name = "Q#{quarter} #{year}" month_name = date.strftime("%b %Y") day_name = date.strftime("%b %d %Y")
threaded
sql = send :sanitize_sql_array, [
%[ INSERT INTO #{table_name} (id, date_value, year, quarter, month, day,
ETL
year_name, quarter_name, month_name, day_name)
VALUES
(?, ?, ?, ?, ?, ?,
?, ?, ?, ?)
],
date_to_id(date), date, year, quarter, month, day,
year.to_s, quarter_name, month_name, day_name
]
connection.insert sql
end
end
第51页
require 'concurrent/executors'
ETL with
class Dwh::TimeDimension < Dwh::Dimension
Thread Pool
def self.parallel_load!(pool_size = 4) logger.silence do insert_date_pool = Concurrent::FixedThreadPool.new(pool_size)
connection.select_values(%[ SELECT DISTINCT order_date FROM #{Order.table_name} WHERE order_date NOT IN (SELECT date_value FROM #{table_name})
]).each do |date| insert_date_pool.post(date) do |date| connection_pool.with_connection do insert_date(date) end end
end
insert_date_pool.shutdown insert_date_pool.wait_for_termination end end
end
第52页
Benchmark!
Dwh::TimeDimension.load! Dwh::TimeDimension.parallel_load!(2) Dwh::TimeDimension.parallel_load!(4) Dwh::TimeDimension.parallel_load!(6) Dwh::TimeDimension.parallel_load!(8)
(5236.0ms) (3450.0ms) (2142.0ms) (2361.0ms) (2826.0ms)
optimal size in this case
Java Mission Control
第53页
Traditional vs Analytical Relational Databases
Optimized for transaction processing
Optimized for analytical queries
第54页
Row-based Storage
第55页
Columnar Storage
http://docs.aws.amazon.com/redshift/latest/dg/c_columnar_storage_disk_mem_mgmnt.html
第56页
Analytical Query Performance
SELECT d_product_classes.product_family,
SUM(f_sales.sales_amount) AS sales_amount,
SUM(f_sales.sales_cost) AS sales_cost,
COUNT(DISTINCT f_sales.customer_id) AS customers_count FROM "dwh"."f_sales"
6 million rows
INNER JOIN "dwh"."d_products" ON "dwh"."d_products"."id" =
"dwh"."f_sales"."product_id"
INNER JOIN "dwh"."d_product_classes" ON "dwh"."d_product_classes"."id" =
"dwh"."d_products"."product_class_id"
GROUP BY d_product_classes.product_family
always ~18 seconds
first ~9 seconds next ~1.5 seconds
第57页
When to use what?
Traditional Fact table size transactional
databases
Analytical columnar databases
< 1M rows OK No big win
1-10M rows
Complex queries slower
OK
10-100M rows
Slow
OK
>100M rows
Very slow OK with tuning
第58页
What did we cover?
Problems with analytical queries Dimensional modeling Star schemas
Mondrian OLAP and MDX ETL – Extract, Transform, Load Analytical columnar databases
第59页
Que s t io n s?
raimonds.simanovskis@gmail.com
@rsim
github.com/rsim
https://github.com/rsim/sales_app_demo