Reducing your DB footprint with Druid
Apache Druid is described by its developers as a real-time analytics database designed for fast analytics on large datasets which works perfectly for aggregation and reporting queries (such as "group by" queries).
For us in Shareablee, where we need to aggregate millions of records each day to provide the most up-to-date social metrics for our client, this sounds like a perfect tool. When we first incorporated Druid in our stack back in 2019, we were astonished of how well it suited all of our needs, and it is being used more and more in our products ever since.
What really surprised me, when I first started working with Druid, was how efficiently data is stored within the database itself. Nowadays Shareablee is storing tables with 10's of billions of records (and still growing!) build upon fairly standard Druid configuration and we are constantly benefiting from its features that allow us highly reduce database footprint while still providing ridiculously fast queries.
In this article, I would like to discuss an experiment that would show what can be achieved in terms of reducing space using out-of-the-box Druid configuration, based on a small subset of Shareablee's data.
Data representation in Druid
At the first glance, Druid may seem like your good-old friend, a relational database (ie. PostgreSQL). You got a table (in Druid called a datasource) which stores columns with the data in them. The difference is that each of these columns is a separate data structure that is stored individually which helps Druid to scan only these columns, which are actually used in a query.
Druid distinguishes three types of columns:
-
timestamp column,
-
dimension: slice of your query which can be grouped or filtered,
-
metric: representing all sorts of aggregations you will execute on your data; aggregations can be applied while ingesting data, at query time (pre-aggregations) and after query (post-aggregations).
In the table below we can see all of the column types we have mentioned. __time represents our main timestamp column, then content_id, likes, shares and comments are metrics, which we can filter and perform further aggregations, and engagement is a metric, created at ingestion time which shows a sum of likes, shares and comments.
timestamp | dimensions | metrics | |||
---|---|---|---|---|---|
__time | content_id | likes | shares | comments | total_engagement |
2020-01-01 | 215923_321 | 150 | 25 | 75 | 200 |
2020-01-01 | 462727_968 | 200 | 10 | 150 | 360 |
2020-01-02 | 683812_432 | 2000 | 140 | 1800 | 3940 |
Data in Druid is stored using LZ4 algorithm and RoaringBitmap as compression algorithms. By using them it is possible to significantly reduce the dataset size out of the box. Take a look at the example datasource with over 11 millions records ingested into Druid.
Data storage type | Size |
---|---|
.gzipped ndjsons | 8.1 Gb |
data ingested into Druid | 2.11Gb |
As you can see, just by storing data in Druid we have managed to reduce the size of our datasource nearly four times! This chunk of data, is not something that we would call a big set of data. It holds a small collection of our clients data, but it goes back to 2016, so it has a very wide date range. It is also an ongoing collection in which we are appending ~30-40 thousand records each day with many of those being historical data fillups. Should such appends worry us in terms of affecting how much storage we would need to actually store that data? We will discuss it in the next section...
Segmentation
By design Druid stores it's indexes in segments. Each segment represents a "portion" of data within given time-range. That time range is determined based on segmentGranularity parameter, which is being set, when a new datasource is created; it can represent as little as a second
or as much as a year
.
For example, if segmentGranularity is set to day
, all data within the same date should be stored within the same segment.
When new data is ingested into our datasource, a new segment is created; at this point new data can be added to the segment. When there is no more data in given "chunk" of time, a segment is being published: pushed down the processing pipeline, where it can be stored in Druid historical and/or deep storage. If we would like to add some new data for given date, a new segment has to be created.
Let's break it down with an example:
date | content_id | likes | comments |
---|---|---|---|
2020-01-01 | 215923_321 | 100 | 150 |
2020-01-01 | 462727_968 | 200 | 75 |
2020-01-01 | 684822_049 | 150 | 200 |
While ingesting such data in our datasource we would create a segment which stores data for 2020-01-01. When ingestion ends, and there is no more data to process, a segment is being published and can be accessed via query. If we were to take a look into segments list, we would see:
segment_id |
---|
datasource_name_2020-01-01T00:00:00Z_2020-01-01T23:59:59Z_1 |
Then, another set of data appears:
date | content_id | likes | comments |
---|---|---|---|
2020-01-01 | 573729_472 | 20 | 77 |
2020-01-02 | 683812_432 | 2000 | 1300 |
As the 2020-01-01
segment has already been published and became immutable, Druid will create another one to store the data for this day. Another record has a date of 2020-01-02
so this one should be stored within another "time chunk" so a new segment has to be created for it as well.
After this ingestion, our segments list will look similar to:
segment_id |
---|
datasource_name_2020-01-01T00:00:00Z_2020-01-01T23:59:59Z_1 |
datasource_name_2020-01-01T00:00:00Z_2020-01-01T23:59:59Z_2 |
datasource_name_2020-01-02T00:00:00Z_2020-01-02T23:59:59Z_1 |
We ended up with two separate segments, both representing the same date range. For us, in Shareablee, this can be pretty cumbersome. As it was mentioned our data collection pipeline allows us to push historical data which could lead to creating unnecessary segments over and over again, every time we need to populate already existing date range. In our case, this could lead to our datasets being very sparse, with many small segments lying around everywhere and pushing us into loosing all the advantage of the space we just saved by using Druid. For such occasions Druid provides an additional mechanism: compaction tasks.
Druid compaction tasks
Compaction tasks allow Druid to go through your whole dataset and reorganize how particular rows are stored. Because of it's timeseries-lead design all records that are within given grain, depending on your datasource granularity, should be aggregated within the same segment (there are of course limitations to the segment size. The goal is to have as few segments as possible).
Compaction tasks can be quite time-consuming process, as Druid cannot append data to an existing segment and has to push existing data into a new one (depending on how much data Druid has to process in given task). For an uncompacted datasource with many records in it, this can take up to few hours to complete. This is an important step which should be taken into consideration while designing your pipeline to compact your datasources before it gets to big.
If you are ingesting data in a batch fashion, you can compact your datasource after given batch has been processed. For streaming ingestion, you can try Druids' autocompaction feature
Let's take another look on how does compaction impacts the size of our datasource:
Compression | Size |
---|---|
.gziped raw data | 8.1 Gb |
Raw Druid Ingested Datasource | 2.11 Gb |
Druid compacted Datasource | 1.4 Gb |
As you can see, after compaction we have gained another 40% on compression and almost 83% on reducing initial dataset size. Isn't that amazing?!
Wrap-up
This was a very brief and shallow overview on how Apache Druid can reduce the size of your dataset. There are few more tricks and tweaks you can do to better partition and process your data. If you want to read more on the topic Druid docs provide quite detailed explanation on every part of the database.
As you can see, using few simple tricks, combined with clever design of Apache Druid, you can get reduce your storage capacity 5 times.