MongoDB Best Practices(最佳实践) by Jay Runkel

AirJD

没有录音文件

00:00/00:00

加收藏

MongoDB Best Practices(最佳实践) by Jay Runkel

发布者 dber

发布于 1450832509170 浏览 9327 关键词 MongoDB, NoSQL, English

分享到

第1页

第2页

MongoDB Best Practices

Jay Runkel
Principal Solutions Architect
jay.runkel@mongodb.com
@jayrunkel

第3页

About Me

Solution Architect

Part of Sales Organization

Work with many organizations new to MongoDB

第4页

Everyone Loves MongoDB’s Flexibility

Document Model

Dynamic Schema

Powerful Query Language

Secondary Indexes

第5页

Everyone Loves MongoDB’s Flexibility

Document Model

Dynamic Schema

Powerful Query Language

Secondary Indexes

第6页

Sometimes Organizations Struggle with Performance

第7页

Good News!

Poor Performance Usually Due to Common (and often simple) mistakes

第8页

Agenda

Quick MongoDB Introduction

Best Practices

Hardware/OS

Schema/Queries

Loading Data

第9页

MongoDB Introduction

第10页

Document Data Model

Relational

MongoDB

{
  first_name: ‘Paul’,
  surname: ‘Miller’,
  city: ‘London’,
  location: [45.123,47.232],
  cars: [
    { model: ‘Bentley’,
      year: 1973,
      value: 100000, … },
    { model: ‘Rolls Royce’,
      year: 1965,
      value: 330000, … }
  ]
}

Here we have greatly reduced the relational data model for this application to two tables. In reality no database has two tables. It is much more common to have hundreds or thousands of tables. And as a developer where do you begin when you have a complex data modeljQuery1102014390171969579424_1450832537487 If you’re building an app you’re really thinking about just a hand full of common things, like products, and these can be represented in a document much more easily that a complex relational model where the data is broken up in a way that doesn’t really reflect the way you think about the data or write an application.

第11页

Documents are Rich Data Structures

{
  first_name: ‘Paul’,
  surname: ‘Miller’,
  cell: 447557505611,
  city: ‘London’,
  location: [45.123,47.232],
  Profession: [‘banking’, ‘finance’, ‘trader’],
  cars: [
    { model: ‘Bentley’,
      year: 1973,
      value: 100000, … },
    { model: ‘Rolls Royce’,
      year: 1965,
      value: 330000, … }
  ]
}

Fields can contain an array of sub-documents

Fields

Typed fields

Fields can
contain arrays

String

Number

Geo-Coordinates

第12页

Do More With Your Data

第13页

Automatic Sharding

Three types: hash-based, range-based, location-aware
Increase or decrease capacity as you go
Automatic balancing

MongoDB provides horizontal scale-out for databases using a technique called sharding, which is trans- parent to applications. Sharding distributes data across multiple physical partitions called shards. Sharding allows MongoDB deployments to address the hardware limitations of a single server, such as bottlenecks in RAM or disk I/O, without adding complexity to the application.
MongoDB supports three types of sharding:
• Range-based Sharding. Documents are partitioned across shards according to the shard key value. Documents with shard key values “close” to one another are likely to be co-located on the same shard. This approach is well suited for applications that need to optimize range- based queries.
• Hash-based Sharding. Documents are uniformly distributed according to an MD5 hash of the shard key value. Documents with shard key values “close” to one another are unlikely to be co-located on the same shard. This approach guarantees a uniform distribution of writes across shards, but is less optimal for range-based queries.
• Tag-aware Sharding. Documents are partitioned according to a user-specified configuration that associates shard key ranges with shards. Users can optimize the physical location of documents for application requirements such as locating data in specific data centers.
MongoDB automatically balances the data in the cluster as the data grows or the size of the cluster increases or decreases.

第14页

Query Routing

Multiple query optimization models
Each sharding option appropriate for different apps

Sharding is transparent to applications; whether there is one or one hundred shards, the application code for querying MongoDB is the same. Applications issue queries to a query router that dispatches the query to the appropriate shards.

For key-value queries that are based on the shard key, the query router will dispatch the query to the shard that manages the document with the requested key. When using range-based sharding, queries that specify ranges on the shard key are only dispatched to shards that contain documents with values within the range. For queries that don’t use the shard key, the query router will dispatch the query to all shards and aggregate and sort the results as appropriate. Multiple query routers can be used with a MongoDB system, and the appropriate number is determined based on performance and availability requirements of the application.

第15页

Replica Sets

Replica Set – 2 to 50 copies
Self-healing shard
Data Center Aware
Addresses availability considerations:
High Availability
Disaster Recovery
Maintenance
Workload Isolation: operational & analytics

High Availability – Ensure application availability during many types of failures
Disaster Recovery – Address the RTO and RPO goals for business continuity
Maintenance – Perform upgrades and other maintenance operations with no application downtime

Secondaries can be used for a variety of applications – failover, hot backup, rolling upgrades, data locality and privacy and workload isolation

第16页

Assumptions

第17页

Assumptions

MongoDB 3.0 or 3.2

第18页

Storage Engine Architecture in 3.2

Content Repo

IoT Sensor Backend

Ad Service

Customer Analytics

第19页

Best Practices

Hardware/Operating System

第20页

Servers

Specifications Good Fit For MongoDB?

Correct Number of Servers?

Properly Configured?

第21页

What Type of Servers

RAM
64  256 GB+

Fast IO Systems
RAID-10/SSDs

Many cores
Compress/Uncompress
Encrypt/Decrypt
Aggregation queries

第22页

What about a SAN?

Mostly Random Disk Access

IOPS

Need dedicated IOPS or performance will vary

Configure your SAN properly

Suitability of any IO system will depend upon IOPS

第23页

How Many Servers Do I Need?

How Many Shards Do I Need?

第24页

MongoDB cluster sizing at 30,000 ft

Disk Space

RAM

Query Throughput

第25页

Sum of disk space across shards > greater than required storage size

Disk Space: How Many Shards Do I Need?

第26页

Sum of disk space across shards > greater than required storage size

Disk Space: How Many Shards Do I Need?

Example

Data Size = 9 TB
WiredTiger Compression Ratio: .33
Storage size = 3 TB
Server disk capacity = 2 TB

2 Shards Required

第27页

Working set should fit in RAM
Sum of RAM across shards > Working Set

WorkSet = Indexes plus the set of documents accessed frequently

WorkSet in RAM 
Shorter latency
Higher Throughput

RAM: How Many Shards Do I Need?

第28页

Measuring Index Size
db.coll.stats() – index size of collection

Estimate frequently accessed documents
Ex: total size of documents accessedper day

RAM: How Many Shards Do I Need?

第29页

Measuring Index Size
db.coll.stats() – index size of collection

Estimate frequently accessed documents
Ex: total size of documents accessedper day

RAM: How Many Shards Do I Need?

Example

Working Set = 428 GB
Server RAM = 128 GB

428/128 = 3.34

4 Shards Required

第30页

Measure max sustained query rate of a single server (with replication)
build a prototype and measure

Assume sharding overhead of 20-30%

Query Rate: How Many Shards Do I Need?

第31页

Measure max sustained query rate of a single server (with replication)
build a prototype and measure

Assume sharding overhead of 20-30%

Query Rate: How Many Shards Do I Need?

Example

Require: 50K ops/sec
Prototype performance: 20 ops/sec (1 replica set)

4 Shards Required: 80 ops/sec * .7 = 56K ops/sec

第32页

第33页

Configure Them Properly

Default OS Settings Often Don’t Provide Optimal Performance

See MongoDB Production Notes
https://docs.mongodb.org/manual/administration/production-notes

Also Review:
Amazon EC2: https://docs.mongodb.org/ecosystem/platforms/amazon-ec2/
Azure: https://docs.mongodb.org/ecosystem/platforms/windows-azure/

第34页

Server/OS Configuration

Server configuration recommendations
XFS
Turn off atime and diratime
NOOP scheduler
File descriptor limits
Disable transparent huge pages and NUMA
Read ahead of 32
Separate data volumes for data files, the journal, and the log.
Change the default TCP keepalive time to 300 seconds.

第35页

These are important

Ignore them and your performance may suffer

The first 100 lines of the MongoDB logs identifies suboptimal OS settings

第36页

Best Practices

Schema Design

第37页

Don’t Use a Relational Schema

第38页

Taylor MongoDB Schema to Application Workload

Design schema to provide good query performance

Schema design will impact required number of shards!

Application
Query Workload

{
  Name: “john”
  Height: 12
  Address: {…}
}

db.cust.find({…})

db.cust.aggregate({…})

第39页

Compare Alternative Schemas

Build a spreadsheet

Calculate # of shards for each schema

Estimate query performance
# of documents
# of inserts
# of deletes
Required indexes
Number of documents inspected
Number of documents sent across network

第40页

Modeling Decisions

Referencing vs. Embedding

Aggregating data by device, customer, product, etc.

第41页

Referencing

Results

{
    “_id” : 134
    "type" : "txt",
    "size" : NumberInt(12),
    "content" : {
        value1: 343,
        value2: “abc”,
        …
    }
}

第42页

Embedding

Procedure
{
    "_id" : 333,
    "date" : "2003-02-09T05:00:00"),
    "hospital" : “County Hills”,
    "patient" : “John Doe”,
    "physician" : “Stephen Smith”,
    "type" : ”Chest X-ray",
    ”result" : {
        "type" : "txt",
        "size" : NumberInt(12),
        "content" : {
            value1: 343,
            value2: “abc”,
            …
        }
    }
}

第43页

Embedding

Advantages
Retrieve all relevant information in a single query/document
Avoid implementing joins in application code
Update related information as a single atomic operation
MongoDB doesn’t offer multi-document transactions

Limitations
Large documents mean more overhead if most fields are not relevant
Might mean replicating data
16 MB document size limit

第44页

Atomicity

Document operations are atomic
db.patients.update({_id: 12345},
      {$inc : {numProcedures : 1},
            $push : {procedures : “proc123”},
                    $set : {addr.state : “TX”}})

No multi-document transactions

db.beginTransaction();
db.patients.update({_id: 12345}, …);
db.procedure.insert({_id: “proc123”, …});
db.records.insert({_id: “rec123”, …});
db.endTransaction();

第45页

Embedding

第46页

Referencing

Advantages
Smaller documents
Less likely to reach 16 MB document limit
Infrequently accessed information not accessed on every query
No duplication of data

Limitations
Two queries required to retrieve information
Cannot update related information atomically

第47页

One-to-Many & Many-to-Many Relationships

第48页

Schema Alternatives – Do the math?

How complex queries?

How much hardware/shards will I need?

第49页

Vital Sign Monitoring Device

Vital Signs Measured:
Blood Pressure
Pulse
Blood Oxygen Levels

Produces data at regular intervals
Once per minute

第50页

We have a hospital(s) of devices

第51页

Data From Vital Signs Monitoring Device

{
  deviceId: 123456,
  spO2: 88,
  pulse: 74,
  bp: [128, 80],
  ts: ISODate("2013-10-16T22:07:00.000-0500")
}

One document per minute per device
Relational approach

第52页

Document Per Hour (By minute)

{
  deviceId: 123456,
  spO2: { 0: 88, 1: 90, …, 59: 92},
  pulse: { 0: 74, 1: 76, …, 59: 72},
  bp: { 0: [122, 80], 1: [126, 84], …, 59: [124, 78]},
  ts: ISODate("2013-10-16T22:00:00.000-0500")
}

Store per-minute data at the hourly level
Update-driven workload
1 document per device per hour

第53页

Characterizing Write Differences

Example: data generated every minute
Recording the data for 1 patient for 1 hour:

Document Per Event
60 inserts

Document Per Hour
1 insert, 59 updates

第54页

Characterizing Read Differences

Want to graph 24 hour of vital signs for a patient:

Read performance is greatly improved

Document Per Event
1440 reads

Document Per Hour
24 reads

第55页

Characterizing Memory and Storage Differences

100K Devices
1 years worth of data

第56页

Characterizing Memory and Storage Differences

100K Devices
1 years worth of data

第57页

Characterizing Memory and Storage Differences

100K Devices
1 years worth of data

第58页

Characterizing Memory and Storage Differences

100K Devices
1 years worth of data

第59页

Indexes

Wrong

None

Too Many

Partial
Index counts

第60页

Best Practices

Loading Data

第61页

Rule of Thumb

To saturate a MongoDB cluster 
loader hardware ~= mongodb hardware

Many threads

Many mongos

第62页

Loader Architecture

loader

mongos

primary

secondary

第63页

Loader Architecture

loader

mongos

primary

secondary

Where are the bottlenecks?

第64页

Loader Architecture

loader

mongos

primary

secondary

Where are the bottlenecks?

第65页

Loader Architecture

loader (8)

mongos (4)

primary

secondary

loader (8)

mongos (4)

Use many threads

Use multiple loader servers

第66页

When Sharding

If you care about initial performance, you must pre-split

Otherwise, initial performance will be slow

(hash sharding automatically presplits collection)

第67页

Without presplitting

Shard 1

Shard 2

Shard 3

Shard 4

-∞ … ∞

sh.shardCollection(“records.patients”, {zipcode : 1})

第68页

Without presplitting

Shard 1

Shard 2

Shard 3

Shard 4

-∞ … 11305

64K chunks
Splitting will occur quickly
Balancing occurs much more slowly

The entire query workload  Shard 1

11306 … 44506

44507 … ∞

第69页

Without presplitting

Shard 1

Shard 2

Shard 3

Shard 4

-∞ … 11305

11306 … 44506

44507 … ∞

Loader
mongos

第70页

Split collection

Shard 1

Shard 2

Shard 3

Shard 4

Split and distribute empty chunks before loading any data

Evenly distribute query load across cluster

-∞ … 08333

08334 … 16667

16668 … 25000

25001… 33334

33335 … 41668

41669 … 50000

50001 … 58334

58335 … 66668

66669 … 75000

75001 … 83334

88335 … 96668

96669 … 99999

第71页

Split collection

Shard 1

Shard 2

Shard 3

Shard 4

-∞ … 08333

08334 … 16667

16668 … 25000

25001… 33334

33335 … 41668

41669 … 50000

50001 … 58334

58335 … 66668

66669 … 75000

75001 … 83334

88335 … 96668

96669 … 99999

Loader
mongos

第72页

Best Practices

Design Paradigms

第73页

Updates

Find
Modify
Save

Vs.

Update

第74页

Summary

第75页

Best Practices

Use servers with specifications that will provide good MongoDB performance
64+ GB RAM, many cores, many IOPS (RAID-10/SSDs)

Calculate How Many Shards?
Calculate required RAM and Disk Space
Build a prototype to determine the ops/sec capacity of a server
Do the math

Configure OS for Optimal MongoDB Performance
See MongoDB Production Notes
Review logs for warnings (Don’t ignore)

第76页

Best Practices (cont.)

Create a Document Schema
Denormalized

Tailor schema to application workload
Use application queries to guide schema design decisions
Consider alternative schemas
Compare cluster size (# of shards) and performance
Build a spreadsheet

第77页

Best Practices

Loading Data
Loader Hardware ~= MongoDB hardware
Many threads
Many mongos

Pre-split
Ensure query workload is evenly distributed across the cluster from the start

第78页

Questions?

jay.runkel@mongodb.com
@jayrunkel