第1页
第2页
MongoDB Best Practices
Jay Runkel
Principal Solutions Architect
jay.runkel@mongodb.com
@jayrunkel
第3页
About Me
Solution Architect
Part of Sales Organization
Work with many organizations new to MongoDB
第4页
Everyone Loves MongoDB’s Flexibility
Document Model
Dynamic Schema
Powerful Query Language
Secondary Indexes
第5页
Everyone Loves MongoDB’s Flexibility
Document Model
Dynamic Schema
Powerful Query Language
Secondary Indexes
第6页
Sometimes Organizations Struggle with Performance
第7页
Good News!
Poor Performance Usually Due to Common (and often simple) mistakes
第8页
Agenda
Quick MongoDB Introduction
Best Practices
Hardware/OS
Schema/Queries
Loading Data
第9页
MongoDB Introduction
第10页
Document Data Model
Relational
MongoDB
{
first_name: ‘Paul’,
surname: ‘Miller’,
city: ‘London’,
location: [45.123,47.232],
cars: [
{ model: ‘Bentley’,
year: 1973,
value: 100000, … },
{ model: ‘Rolls Royce’,
year: 1965,
value: 330000, … }
]
}
Here we have greatly reduced the relational data model for this application to two tables. In reality no database has two tables. It is much more common to have hundreds or thousands of tables. And as a developer where do you begin when you have a complex data modeljQuery1102014390171969579424_1450832537487 If you’re building an app you’re really thinking about just a hand full of common things, like products, and these can be represented in a document much more easily that a complex relational model where the data is broken up in a way that doesn’t really reflect the way you think about the data or write an application.
第11页
Documents are Rich Data Structures
{
first_name: ‘Paul’,
surname: ‘Miller’,
cell: 447557505611,
city: ‘London’,
location: [45.123,47.232],
Profession: [‘banking’, ‘finance’, ‘trader’],
cars: [
{ model: ‘Bentley’,
year: 1973,
value: 100000, … },
{ model: ‘Rolls Royce’,
year: 1965,
value: 330000, … }
]
}
Fields can contain an array of sub-documents
Fields
Typed fields
Fields can
contain arrays
String
Number
Geo-Coordinates
第12页
Do More With Your Data
{
first_name: ‘Paul’,
surname: ‘Miller’,
city: ‘London’,
location: [45.123,47.232],
cars: [
{ model: ‘Bentley’,
year: 1973,
value: 100000, … },
{ model: ‘Rolls Royce’,
year: 1965,
value: 330000, … }
}
}
第13页
Automatic Sharding
Three types: hash-based, range-based, location-aware
Increase or decrease capacity as you go
Automatic balancing
MongoDB provides horizontal scale-out for databases using a technique called sharding, which is trans- parent to applications. Sharding distributes data across multiple physical partitions called shards. Sharding allows MongoDB deployments to address the hardware limitations of a single server, such as bottlenecks in RAM or disk I/O, without adding complexity to the application.
MongoDB supports three types of sharding:
• Range-based Sharding. Documents are partitioned across shards according to the shard key value. Documents with shard key values “close” to one another are likely to be co-located on the same shard. This approach is well suited for applications that need to optimize range- based queries.
• Hash-based Sharding. Documents are uniformly distributed according to an MD5 hash of the shard key value. Documents with shard key values “close” to one another are unlikely to be co-located on the same shard. This approach guarantees a uniform distribution of writes across shards, but is less optimal for range-based queries.
• Tag-aware Sharding. Documents are partitioned according to a user-specified configuration that associates shard key ranges with shards. Users can optimize the physical location of documents for application requirements such as locating data in specific data centers.
MongoDB automatically balances the data in the cluster as the data grows or the size of the cluster increases or decreases.
第14页
Query Routing
Multiple query optimization models
Each sharding option appropriate for different apps
Sharding is transparent to applications; whether there is one or one hundred shards, the application code for querying MongoDB is the same. Applications issue queries to a query router that dispatches the query to the appropriate shards.
For key-value queries that are based on the shard key, the query router will dispatch the query to the shard that manages the document with the requested key. When using range-based sharding, queries that specify ranges on the shard key are only dispatched to shards that contain documents with values within the range. For queries that don’t use the shard key, the query router will dispatch the query to all shards and aggregate and sort the results as appropriate. Multiple query routers can be used with a MongoDB system, and the appropriate number is determined based on performance and availability requirements of the application.
第15页
Replica Sets
Replica Set – 2 to 50 copies
Self-healing shard
Data Center Aware
Addresses availability considerations:
High Availability
Disaster Recovery
Maintenance
Workload Isolation: operational & analytics
High Availability – Ensure application availability during many types of failures
Disaster Recovery – Address the RTO and RPO goals for business continuity
Maintenance – Perform upgrades and other maintenance operations with no application downtime
Secondaries can be used for a variety of applications – failover, hot backup, rolling upgrades, data locality and privacy and workload isolation
第16页
Assumptions
第17页
Assumptions
MongoDB 3.0 or 3.2
第18页
Storage Engine Architecture in 3.2
Content Repo
IoT Sensor Backend
Ad Service
Customer Analytics
Archive
MongoDB Query Language (MQL) + Native Drivers
MongoDB Document Data Model
WT
MMAP
Supported in MongoDB 3.2
Management
Security
In-memory (beta)
Encrypted
3rd party
第19页
Best Practices
Hardware/Operating System
第20页
Servers
Specifications Good Fit For MongoDB?
Correct Number of Servers?
Properly Configured?
第21页
What Type of Servers
RAM
64 256 GB+
Fast IO Systems
RAID-10/SSDs
Many cores
Compress/Uncompress
Encrypt/Decrypt
Aggregation queries
第22页
What about a SAN?
Mostly Random Disk Access
IOPS
Need dedicated IOPS or performance will vary
Configure your SAN properly
Suitability of any IO system will depend upon IOPS
第23页
How Many Servers Do I Need?
How Many Shards Do I Need?
第24页
MongoDB cluster sizing at 30,000 ft
Disk Space
RAM
Query Throughput
第25页
Sum of disk space across shards > greater than required storage size
Disk Space: How Many Shards Do I Need?
第26页
Sum of disk space across shards > greater than required storage size
Disk Space: How Many Shards Do I Need?
Example
Data Size = 9 TB
WiredTiger Compression Ratio: .33
Storage size = 3 TB
Server disk capacity = 2 TB
2 Shards Required
第27页
Working set should fit in RAM
Sum of RAM across shards > Working Set
WorkSet = Indexes plus the set of documents accessed frequently
WorkSet in RAM
Shorter latency
Higher Throughput
RAM: How Many Shards Do I Need?
第28页
Measuring Index Size
db.coll.stats() – index size of collection
Estimate frequently accessed documents
Ex: total size of documents accessedper day
RAM: How Many Shards Do I Need?
第29页
Measuring Index Size
db.coll.stats() – index size of collection
Estimate frequently accessed documents
Ex: total size of documents accessedper day
RAM: How Many Shards Do I Need?
Example
Working Set = 428 GB
Server RAM = 128 GB
428/128 = 3.34
4 Shards Required
第30页
Measure max sustained query rate of a single server (with replication)
build a prototype and measure
Assume sharding overhead of 20-30%
Query Rate: How Many Shards Do I Need?
第31页
Measure max sustained query rate of a single server (with replication)
build a prototype and measure
Assume sharding overhead of 20-30%
Query Rate: How Many Shards Do I Need?
Example
Require: 50K ops/sec
Prototype performance: 20 ops/sec (1 replica set)
4 Shards Required: 80 ops/sec * .7 = 56K ops/sec
第32页
第33页
Configure Them Properly
Default OS Settings Often Don’t Provide Optimal Performance
See MongoDB Production Notes
https://docs.mongodb.org/manual/administration/production-notes
Also Review:
Amazon EC2: https://docs.mongodb.org/ecosystem/platforms/amazon-ec2/
Azure: https://docs.mongodb.org/ecosystem/platforms/windows-azure/
第34页
Server/OS Configuration
Server configuration recommendations
XFS
Turn off atime and diratime
NOOP scheduler
File descriptor limits
Disable transparent huge pages and NUMA
Read ahead of 32
Separate data volumes for data files, the journal, and the log.
Change the default TCP keepalive time to 300 seconds.
第35页
These are important
Ignore them and your performance may suffer
The first 100 lines of the MongoDB logs identifies suboptimal OS settings
第36页
Best Practices
Schema Design
第37页
Don’t Use a Relational Schema
第38页
Taylor MongoDB Schema to Application Workload
Design schema to provide good query performance
Schema design will impact required number of shards!
Application
Query Workload
{
Name: “john”
Height: 12
Address: {…}
}
db.cust.find({…})
db.cust.aggregate({…})
第39页
Compare Alternative Schemas
Build a spreadsheet
Calculate # of shards for each schema
Estimate query performance
# of documents
# of inserts
# of deletes
Required indexes
Number of documents inspected
Number of documents sent across network
第40页
Modeling Decisions
Referencing vs. Embedding
Aggregating data by device, customer, product, etc.
第41页
Referencing
Procedure
{
"_id" : 333,
"date" : "2003-02-09T05:00:00"),
"hospital" : “County Hills”,
"patient" : “John Doe”,
"physician" : “Stephen Smith”,
"type" : ”Chest X-ray",
”result" : 134
}
Results
{
“_id” : 134
"type" : "txt",
"size" : NumberInt(12),
"content" : {
value1: 343,
value2: “abc”,
…
}
}
第42页
Embedding
Procedure
{
"_id" : 333,
"date" : "2003-02-09T05:00:00"),
"hospital" : “County Hills”,
"patient" : “John Doe”,
"physician" : “Stephen Smith”,
"type" : ”Chest X-ray",
”result" : {
"type" : "txt",
"size" : NumberInt(12),
"content" : {
value1: 343,
value2: “abc”,
…
}
}
}
第43页
Embedding
Advantages
Retrieve all relevant information in a single query/document
Avoid implementing joins in application code
Update related information as a single atomic operation
MongoDB doesn’t offer multi-document transactions
Limitations
Large documents mean more overhead if most fields are not relevant
Might mean replicating data
16 MB document size limit
第44页
Atomicity
Document operations are atomic
db.patients.update({_id: 12345},
{$inc : {numProcedures : 1},
$push : {procedures : “proc123”},
$set : {addr.state : “TX”}})
No multi-document transactions
db.beginTransaction();
db.patients.update({_id: 12345}, …);
db.procedure.insert({_id: “proc123”, …});
db.records.insert({_id: “rec123”, …});
db.endTransaction();
第45页
Embedding
Advantages
Retrieve all relevant information in a single query/document
Avoid implementing joins in application code
Update related information as a single atomic operation
MongoDB doesn’t offer multi-document transactions
Limitations
Large documents mean more overhead if most fields are not relevant
16 MB document size limit
第46页
Referencing
Advantages
Smaller documents
Less likely to reach 16 MB document limit
Infrequently accessed information not accessed on every query
No duplication of data
Limitations
Two queries required to retrieve information
Cannot update related information atomically
第47页
One-to-Many & Many-to-Many Relationships
第48页
Schema Alternatives – Do the math?
How complex queries?
How much hardware/shards will I need?
第49页
Vital Sign Monitoring Device
Vital Signs Measured:
Blood Pressure
Pulse
Blood Oxygen Levels
Produces data at regular intervals
Once per minute
第50页
We have a hospital(s) of devices
第51页
Data From Vital Signs Monitoring Device
{
deviceId: 123456,
spO2: 88,
pulse: 74,
bp: [128, 80],
ts: ISODate("2013-10-16T22:07:00.000-0500")
}
One document per minute per device
Relational approach
第52页
Document Per Hour (By minute)
{
deviceId: 123456,
spO2: { 0: 88, 1: 90, …, 59: 92},
pulse: { 0: 74, 1: 76, …, 59: 72},
bp: { 0: [122, 80], 1: [126, 84], …, 59: [124, 78]},
ts: ISODate("2013-10-16T22:00:00.000-0500")
}
Store per-minute data at the hourly level
Update-driven workload
1 document per device per hour
第53页
Characterizing Write Differences
Example: data generated every minute
Recording the data for 1 patient for 1 hour:
Document Per Event
60 inserts
Document Per Hour
1 insert, 59 updates
第54页
Characterizing Read Differences
Want to graph 24 hour of vital signs for a patient:
Read performance is greatly improved
Document Per Event
1440 reads
Document Per Hour
24 reads
第55页
Characterizing Memory and Storage Differences
100K Devices
1 years worth of data
第56页
Characterizing Memory and Storage Differences
100K Devices
1 years worth of data
第57页
Characterizing Memory and Storage Differences
100K Devices
1 years worth of data
第58页
Characterizing Memory and Storage Differences
100K Devices
1 years worth of data
第59页
Indexes
Wrong
None
Too Many
Partial
Index counts
第60页
Best Practices
Loading Data
第61页
Rule of Thumb
To saturate a MongoDB cluster
loader hardware ~= mongodb hardware
Many threads
Many mongos
第62页
Loader Architecture
loader
mongos
primary
primary
primary
secondary
secondary
secondary
secondary
secondary
secondary
第63页
Loader Architecture
loader
mongos
primary
primary
primary
secondary
secondary
secondary
secondary
secondary
secondary
Where are the bottlenecks?
第64页
Loader Architecture
loader
mongos
primary
primary
primary
secondary
secondary
secondary
secondary
secondary
secondary
Where are the bottlenecks?
第65页
Loader Architecture
loader (8)
mongos (4)
primary
primary
primary
secondary
secondary
secondary
secondary
secondary
secondary
loader (8)
mongos (4)
loader (8)
mongos (4)
Use many threads
Use multiple loader servers
第66页
When Sharding
If you care about initial performance, you must pre-split
Otherwise, initial performance will be slow
(hash sharding automatically presplits collection)
第67页
Without presplitting
Shard 1
Shard 2
Shard 3
Shard 4
-∞ … ∞
sh.shardCollection(“records.patients”, {zipcode : 1})
第68页
Without presplitting
Shard 1
Shard 2
Shard 3
Shard 4
-∞ … 11305
64K chunks
Splitting will occur quickly
Balancing occurs much more slowly
The entire query workload Shard 1
11306 … 44506
44507 … ∞
第69页
Without presplitting
Shard 1
Shard 2
Shard 3
Shard 4
-∞ … 11305
11306 … 44506
44507 … ∞
Loader
mongos
第70页
Split collection
Shard 1
Shard 2
Shard 3
Shard 4
Split and distribute empty chunks before loading any data
Evenly distribute query load across cluster
-∞ … 08333
08334 … 16667
16668 … 25000
25001… 33334
33335 … 41668
41669 … 50000
50001 … 58334
58335 … 66668
66669 … 75000
75001 … 83334
88335 … 96668
96669 … 99999
第71页
Split collection
Shard 1
Shard 2
Shard 3
Shard 4
-∞ … 08333
08334 … 16667
16668 … 25000
25001… 33334
33335 … 41668
41669 … 50000
50001 … 58334
58335 … 66668
66669 … 75000
75001 … 83334
88335 … 96668
96669 … 99999
Loader
mongos
第72页
Best Practices
Design Paradigms
第73页
Updates
Find
Modify
Save
Vs.
Update
第74页
Summary
第75页
Best Practices
Use servers with specifications that will provide good MongoDB performance
64+ GB RAM, many cores, many IOPS (RAID-10/SSDs)
Calculate How Many Shards?
Calculate required RAM and Disk Space
Build a prototype to determine the ops/sec capacity of a server
Do the math
Configure OS for Optimal MongoDB Performance
See MongoDB Production Notes
Review logs for warnings (Don’t ignore)
第76页
Best Practices (cont.)
Create a Document Schema
Denormalized
Tailor schema to application workload
Use application queries to guide schema design decisions
Consider alternative schemas
Compare cluster size (# of shards) and performance
Build a spreadsheet
第77页
Best Practices
Loading Data
Loader Hardware ~= MongoDB hardware
Many threads
Many mongos
Pre-split
Ensure query workload is evenly distributed across the cluster from the start
第78页
Questions?
jay.runkel@mongodb.com
@jayrunkel