This page provides some basic hints for determining how to partition your data model with HiveDB.
Create a data model
Identify all of the resources that you want to store. In general you probably don’t want to put things in the hive unless you are talking about them in large powers of ten and their growth rate is also expressed as a large power of ten records per week or per month.
Example Domain

Identify a significant partitioning dimension
HiveDB spreads your data across many shards by dividing it into discrete subsets. You select a significant value that uniquely identifies subsets of data within your domain, a partition dimension, in HiveDB lingo. This strategy works well for data that has a natural split, but does not work equally well for all data sets. It was chosen because it fits the real-world problem that HiveDB was created to solve.


Things to consider when partitioning your data
- You cannot join across nodes.
- Consider the worst case scenario. Can records belonging to a single partition key over flow the capacity of a node?
- How does your usage pattern compare to the partitioning scheme? Entities may have a relationship in the data model, but if you don’t have to use that relationship to execute a JOIN in the application you can break that relationship in the partitioning scheme. Alternately, you can simply perform two database operations and handle the join in the application.
Example
Both product and image depend on member, and have no other relations except to each other. So, member is a natural partition dimension for this data set.
Enumerate your finding cases
HiveDB does not support arbitrary queries across the entire data set. Instead it uses a system pre-built indexes similar to BerkeleyDB to execute queries that span nodes. So, you must enumerate all of your finding cases that span the data set so that they may be indexed.
Example
Some example use cases that define the queries we will need to index for:
- The member profile page shows a particular member’s information. Find Member by id
- The edit my products page shows all of a members products Find products by member
- The product details page shows information for one product. Find product by id
- The design details page shows all of the products for a particular design. Find products by image
HiveDB has two different types of indexes primary indexes and secondary indexes.
- Primary indexes map an id of the partition dimension entity to a node. In the example case the primary index would map member.id → node. Every data set must have a single primary index. Primary index keys are sometimes referred to as partition keys.
- Secondary indexes map a value of an entity to another index key. Secondary index keys can map to primary index keys or to other secondary index keys. For example a secondary index on product.id would map product.id → member.id.
The four queries listed above define three indexes in HiveDB.
- The primary index on member.id → node.id
- A secondary index of product.id → member.id
- A secondary index of image.id → product.id

4 Comments
There’s a typo in the title “preparing you data” :-)
Interesting piece of software.
Best regards,
G.
Could you also give us some information about the following?
1. Is there any data replication happening in this setup?
2, How is it handling uniform distribution of data when a new box is added?
@Divya B
1. HiveDB doesn’t handle replication. In general we defer to MySQL native replication. However, HiveDB does support storing records in more than one location if you want to handle replication within your application.
2. Currently we don’t provide a facility for back-filling records onto new nodes so that records are evenly distributed. The default assignment algorithm will evenly distribute new records across nodes, but if you want to move existing records onto the new node you will have to write your own implementation of the Migrator interface and use it to balance the nodes.
We do have plans to implement a balancing and migration tool in the future, we just don’t have it yet.
Does hivedb support aggregation of data across shards? For me this is a very interesting use case and if hivedb does it, I would be interested in trying hivedb for my project.