Preparing Your Data to be Horizontally Scaled with HiveDB

This page provides some basic hints for determining how to partition your data model with HiveDB.

Create a data model

Identify all of the resources that you want to store. In general you probably don’t want to put things in the hive unless you are talking about them in large powers of ten and their growth rate is also expressed as a large power of ten records per week or per month.

Identify a significant partitioning dimension

HiveDB spreads your data across many shards by dividing it into discrete subsets. You select a significant value that uniquely identifies subsets of data within your domain, a partition dimension, in HiveDB lingo. This strategy works well for data that has a natural split, but does not work equally well for all data sets. It was chosen because it fits the real-world problem that HiveDB was created to solve.

Things to consider when partitioning your data

  • You cannot join across nodes.
  • Consider the worst case scenario. Can records belonging to a single partition key over flow the capacity of a node?
  • How does your usage pattern compare to the partitioning scheme? Entities may have a relationship in the data model, but if you don’t have to use that relationship to execute a JOIN in the application you can break that relationship in the partitioning scheme. Alternately, you can simply perform two database operations and handle the join in the application.

Example
Both product and image depend on member, and have no other relations except to each other. So, member is a natural partition dimension for this data set.

Enumerate your finding cases

HiveDB does not support arbitrary queries across the entire data set. Instead it uses a system pre-built indexes similar to BerkeleyDB to execute queries that span nodes. So, you must enumerate all of your finding cases that span the data set so that they may be indexed.

Example
Some example use cases that define the queries we will need to index for:

  • The member profile page shows a particular member’s information. Find Member by id
  • The edit my products page shows all of a members products Find products by member
  • The product details page shows information for one product. Find product by id
  • The design details page shows all of the products for a particular design. Find products by image

HiveDB has two different types of indexes primary indexes and secondary indexes.

  • Primary indexes map an id of the partition dimension entity to a node. In the example case the primary index would map member.id → node. Every data set must have a single primary index. Primary index keys are sometimes referred to as partition keys.
  • Secondary indexes map a value of an entity to another index key. Secondary index keys can map to primary index keys or to other secondary index keys. For example a secondary index on product.id would map product.id → member.id.

The four queries listed above define three indexes in HiveDB.

      The primary index on member.id → node.id
      A secondary index of product.id → member.id
      A secondary index of image.id → product.id