Preparing Your Data to be Horizontally Scaled with HiveDB

This page provides some basic hints for determining how to partition your data model with HiveDB.

Create a data model

Identify all of the resources that you want to store. In general you probably don’t want to put things in the hive unless you are talking about them in large powers of ten and their growth rate is also expressed as a large power of ten records per week or per month.


Example Domain
Example Data Model

Identify a significant partitioning dimension

HiveDB spreads your data across many shards by dividing it into discrete subsets. You select a significant value that uniquely identifies subsets of data within your domain, a partition dimension, in HiveDB lingo. This strategy works well for data that has a natural split, but does not work equally well for all data sets. It was chosen because it fits the real-world problem that HiveDB was created to solve.
Easily Partitioned Data SetA difficult data set to partition

Things to consider when partitioning your data

  • You cannot join across nodes.
  • Consider the worst case scenario. Can records belonging to a single partition key over flow the capacity of a node?
  • How does your usage pattern compare to the partitioning scheme? Entities may have a relationship in the data model, but if you don’t have to use that relationship to execute a JOIN in the application you can break that relationship in the partitioning scheme. Alternately, you can simply perform two database operations and handle the join in the application.

Example

Both product and image depend on member, and have no other relations except to each other. So, member is a natural partition dimension for this data set.

Enumerate your finding cases

HiveDB does not support arbitrary queries across the entire data set. Instead it uses a system pre-built indexes similar to BerkeleyDB to execute queries that span nodes. So, you must enumerate all of your finding cases that span the data set so that they may be indexed.

Example

Some example use cases that define the queries we will need to index for:

  • The member profile page shows a particular member’s information. Find Member by id
  • The edit my products page shows all of a members products Find products by member
  • The product details page shows information for one product. Find product by id
  • The design details page shows all of the products for a particular design. Find products by image

HiveDB has two different types of indexes primary indexes and secondary indexes.

  • Primary indexes map an id of the partition dimension entity to a node. In the example case the primary index would map member.id → node. Every data set must have a single primary index. Primary index keys are sometimes referred to as partition keys.
  • Secondary indexes map a value of an entity to another index key. Secondary index keys can map to primary index keys or to other secondary index keys. For example a secondary index on product.id would map product.id → member.id.

The four queries listed above define three indexes in HiveDB.

  1. The primary index on member.id → node.id
  2. A secondary index of product.id → member.id
  3. A secondary index of image.id → product.id

4 Comments

  1. gio added these pithy words on November 6, 2007 | Permalink

    There’s a typo in the title “preparing you data” :-)
    Interesting piece of software.

    Best regards,

    G.

  2. Divya B added these pithy words on January 19, 2008 | Permalink

    Could you also give us some information about the following?

    1. Is there any data replication happening in this setup?
    2, How is it handling uniform distribution of data when a new box is added?

  3. britt added these pithy words on January 19, 2008 | Permalink

    @Divya B

    1. HiveDB doesn’t handle replication. In general we defer to MySQL native replication. However, HiveDB does support storing records in more than one location if you want to handle replication within your application.

    2. Currently we don’t provide a facility for back-filling records onto new nodes so that records are evenly distributed. The default assignment algorithm will evenly distribute new records across nodes, but if you want to move existing records onto the new node you will have to write your own implementation of the Migrator interface and use it to balance the nodes.

    We do have plans to implement a balancing and migration tool in the future, we just don’t have it yet.

  4. Ajit added these pithy words on November 26, 2008 | Permalink

    Does hivedb support aggregation of data across shards? For me this is a very interesting use case and if hivedb does it, I would be interested in trying hivedb for my project.

Post a Comment

Your email is never published nor shared. Required fields are marked *

*
*