5 Simple tips for data modeling with Cassandra/DataStax Astra

Chang Xiao
Mar 16, 2022

It is tempting to apply relational data modeling concepts to Cassandra, but doing so can quickly become counterproductive for your application. After some painful data model refactoring for our application, Siggy.ai, here are some quick tips:

1. The partition key does not have to be unique

The simplest design uses a partition key that is the same as the primary key, which gives you 1 row per partition. However, you can often choose a partition key that covers more than 1 row of data. For example, a partition key can be a tag, a date, or even a bucket value (e.g. 1, 2, 3).
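As a rough illustration in plain Python (not CQL), the sketch below groups rows by a non-unique partition key. The row fields, the `bucket_for` helper, and the bucket count are all made up for this example; Cassandra hashes partition keys with its own partitioner.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 3  # assumed bucket count, chosen only for this example


def bucket_for(post_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map an id to a stable bucket number in 1..num_buckets."""
    # crc32 is deterministic across runs, unlike Python's built-in hash().
    return zlib.crc32(post_id.encode("utf-8")) % num_buckets + 1


def group_by_partition(rows, key):
    """Group rows by key(row), the way rows sharing a partition key sit together."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[key(row)].append(row)
    return dict(partitions)


# Hypothetical blog post rows.
posts = [
    {"id": "p1", "tag": "cassandra", "date": "2022-03-16"},
    {"id": "p2", "tag": "cassandra", "date": "2022-03-16"},
    {"id": "p3", "tag": "python", "date": "2022-03-17"},
]

by_tag = group_by_partition(posts, key=lambda r: r["tag"])
by_bucket = group_by_partition(posts, key=lambda r: bucket_for(r["id"]))
```

Here the "cassandra" partition holds two rows, which is perfectly fine: the partition key only decides which rows live together, not which row is unique.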

2. Design the partition key to control your partition size

The partition key determines which nodes store a row: all rows that share a partition key value are stored and replicated together as one partition. The rule of thumb is to keep each partition under 100MB or 100,000 rows of data.
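To make the rule of thumb concrete, here is a back-of-the-envelope sketch in Python. The function names and the average row sizes are assumptions for illustration; real partition sizes depend on your columns, compression, and overhead, so treat the result as a rough estimate only.

```python
import math

MAX_PARTITION_BYTES = 100 * 1024 * 1024  # ~100MB rule of thumb from the text
MAX_PARTITION_ROWS = 100_000             # row-count rule of thumb from the text


def partition_within_limits(rows: int, avg_row_bytes: int) -> bool:
    """True if an estimated partition stays within both rules of thumb."""
    return (rows <= MAX_PARTITION_ROWS
            and rows * avg_row_bytes <= MAX_PARTITION_BYTES)


def buckets_needed(rows: int, avg_row_bytes: int) -> int:
    """Estimate how many buckets would split the data into compliant partitions."""
    by_rows = math.ceil(rows / MAX_PARTITION_ROWS)
    by_bytes = math.ceil(rows * avg_row_bytes / MAX_PARTITION_BYTES)
    return max(1, by_rows, by_bytes)
```

For example, 250,000 rows of about 200 bytes each break the row limit, and `buckets_needed(250_000, 200)` suggests adding a bucket value with 3 buckets to the partition key.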

3. Add columns to the primary key only for uniqueness

The primary key always includes the partition key. When a partition contains more than 1 row, add further fields to the primary key so that each record can be identified uniquely. For example, in a blog table you may use a date field as the partition key. Adding the blog post title and author fields to the primary key will most likely let you query for a unique blog post by date, title, and author.
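One way to picture this is a Python dictionary keyed by the full primary key. The `BlogKey` type and `upsert` helper below are hypothetical names for this sketch; the behavior they mimic is that writing a row whose primary key already exists updates that row instead of creating a second one.

```python
from typing import Dict, NamedTuple


class BlogKey(NamedTuple):
    date: str    # partition key
    title: str   # added to the primary key for uniqueness
    author: str  # added to the primary key for uniqueness


table: Dict[BlogKey, dict] = {}


def upsert(date: str, title: str, author: str, **columns) -> None:
    """Write a row; a row with the same primary key is updated, not duplicated."""
    table.setdefault(BlogKey(date, title, author), {}).update(columns)


upsert("2022-03-16", "Cassandra tips", "Chang", body="draft")
upsert("2022-03-16", "Cassandra tips", "Alex", body="another author")
upsert("2022-03-16", "Cassandra tips", "Chang", body="final")  # same key as the first
```

After these three writes the table holds two rows, both in the "2022-03-16" partition, and the first row's body is "final". If the primary key were the date alone, all three writes would collapse into a single row.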

4. Data ordering with clustering columns

The non-partition-key fields in your primary key (mentioned above) are called clustering columns. They are very useful for range queries (>, <, etc.), and they determine how your data is ordered by default within a partition.

For example, if we had a blog table with a partition key of “date”, we could add a clustering column called “hour” (0–23). You can then query blog posts created on a specific date and within a specific range of hours of that day.

Additionally, you can define the default ordering of data (ascending or descending) on the clustering columns. This means that if you query for the first 10 blog posts from a day, they will be ordered by the hour of that day according to that definition.
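The blog example can be sketched as a small in-memory model in Python. The `BlogTable` class and its method names are invented for this illustration and only imitate the behavior described above: rows are grouped by date, filtered by an hour range, and returned in the table's default order.

```python
from collections import defaultdict


class BlogTable:
    """Toy model: 'date' is the partition key, 'hour' is the clustering column."""

    def __init__(self, descending: bool = False):
        self.descending = descending          # default clustering order
        self._partitions = defaultdict(list)  # date -> list of (hour, title)

    def insert(self, date: str, hour: int, title: str) -> None:
        self._partitions[date].append((hour, title))

    def query(self, date: str, min_hour: int = 0, max_hour: int = 23, limit=None):
        """Rows from one partition within an hour range, in the default order."""
        rows = [r for r in self._partitions.get(date, [])
                if min_hour <= r[0] <= max_hour]
        rows.sort(key=lambda r: r[0], reverse=self.descending)
        return rows[:limit]  # limit=None returns every matching row


blog = BlogTable(descending=True)
blog.insert("2022-03-16", 9, "Morning post")
blog.insert("2022-03-16", 21, "Evening post")
blog.insert("2022-03-16", 14, "Afternoon post")
blog.insert("2022-03-17", 8, "Next day post")

working_hours = blog.query("2022-03-16", min_hour=8, max_hour=15)
latest_two = blog.query("2022-03-16", limit=2)
```

Because this table is defined as descending, `latest_two` returns the evening post first, without the query having to ask for any particular order.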

5. TimeUUID will be your best friend

Lastly, a TimeUUID can often serve as a primary key column that provides both uniqueness and sorting. You can convert a TimeUUID into a date/time value and perform range queries on TimeUUIDs.
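A TimeUUID is a version-1 UUID, which Python's standard `uuid.uuid1()` also generates, so the conversion and a simple range filter can be sketched client-side. The helper names below are my own for this example, and the filtering here happens in Python rather than inside Cassandra.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Version-1 UUID timestamps count 100-nanosecond intervals since 1582-10-15 UTC.
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def timeuuid_to_datetime(u: uuid.UUID) -> datetime:
    """Extract the embedded timestamp from a version-1 (time-based) UUID."""
    if u.version != 1:
        raise ValueError("not a version-1 (time-based) UUID")
    return GREGORIAN_EPOCH + timedelta(microseconds=u.time // 10)


def in_time_range(ids, start: datetime, end: datetime):
    """Return ids whose timestamp falls in [start, end), oldest first."""
    hits = [u for u in ids if start <= timeuuid_to_datetime(u) < end]
    # Sort by the embedded timestamp; comparing UUID objects directly
    # in Python does not give time order.
    return sorted(hits, key=lambda u: u.time)


ids = [uuid.uuid1() for _ in range(3)]
now = datetime.now(timezone.utc)
recent = in_time_range(ids, now - timedelta(minutes=5), now + timedelta(minutes=5))
```

Each id is unique and also carries its creation time, which is why a single TimeUUID column can cover both uniqueness and ordering.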

Useful resources

If you are interested in diving deeper into data modeling best practices, DataStax has a free data modeling course that can help you avoid pitfalls when transitioning to Cassandra.

https://academy.datastax.com/resources/ds220-data-modeling

Chang Xiao

Starter, dev, digital consultant, cyclist, tennis player. Currently focused on data science and specifically recommendation systems.