Digital marketing is directly tied to managing large volumes of data. Ad personalization, conversion tracking, A/B tests, and many other activities constantly generate metrics that are gathered, stored, and actively used for various analytical tasks.
An example: the PropellerAds multisource network can receive billions of ad impressions daily. And this is just a part of the overall data that requires attention. Storing and managing such volumes seems like a tricky task, but is it really, when you have professional teams and know how to automate processes?
We asked two experts from AdTech Holding, Ekaterina Kolmakova (Product Owner of the Analytics Group) and Aleksey Kirilishin (Technical Product Owner of BI), to share their best practices for storing and managing data.
Ekaterina: There are many types of data we require for analytical tasks, but roughly speaking, we can divide them into two big categories: dimensions and measures.
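To make the distinction concrete, here is a minimal illustrative sketch in Python (pandas): dimensions are the attributes you slice and filter by, while measures are the numeric values you aggregate. The column names and figures are invented for the example, not real PropellerAds data.

```python
import pandas as pd

# A tiny, made-up fact table: each row is one aggregated slice of ad activity.
events = pd.DataFrame({
    # Dimensions: attributes we slice and filter by.
    "date":      ["2024-05-01", "2024-05-01", "2024-05-02", "2024-05-02"],
    "ad_format": ["push", "interstitial", "push", "interstitial"],
    "geo":       ["US", "US", "DE", "DE"],
    # Measures: numeric values we aggregate.
    "impressions": [120_000, 80_000, 95_000, 60_000],
    "clicks":      [1_400, 650, 1_100, 480],
    "revenue":     [310.5, 220.0, 270.3, 190.8],
})

# A typical analytical query: group by dimensions, aggregate measures.
report = (
    events.groupby(["date", "ad_format"], as_index=False)
          .agg(impressions=("impressions", "sum"),
               clicks=("clicks", "sum"),
               revenue=("revenue", "sum"))
)
print(report)
```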
The trickiest part here is that we analyze not only our clients' data but also the data of their audiences: the people who view ads and interact with them. We don't keep these people's personal information and cannot identify them, but we can track how they behave with our ads, and the volume of such data is much bigger, as it involves billions of impressions.
Ekaterina: The number of analytical tasks is enormous — but again, we can loosely divide them into two main groups.
The first group is operational reporting. In simple words, it is the detailing and visualization of how the company performs. For example, we send a daily report to the Holding board so that they can track the revenue and other metrics of each project. Operational reports can also serve for building marketing funnels: for instance, the BI team provides us with data on an email campaign, and we analyze the performance of this campaign.
The second group consists of tasks done upon a special request from various departments. Usually, they appear when there is a need to analyze ad rotation, make adjustments, or run tests.
For example, PropellerAds has low-, medium-, and high-activity user cohorts. Sometimes, we need to rearrange these cohorts to ensure more accurate targeting and better performance. To do this, we analyze user behavior in each cohort and prepare a report that gives an idea of the right balance between the activity levels.
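As an illustration of what re-arranging cohorts can look like in practice, here is a hypothetical Python sketch: it assigns users to low-, medium-, and high-activity cohorts using made-up thresholds and summarizes behavior per cohort. The boundaries and metrics are assumptions for the example, not the actual PropellerAds rules.

```python
import pandas as pd

# Hypothetical per-user activity data (events per week); the numbers are invented.
users = pd.DataFrame({
    "user_id": range(1, 9),
    "ad_events_per_week": [1, 3, 7, 12, 18, 25, 40, 55],
    "conversions":        [0, 0, 1, 1, 2, 2, 4, 6],
})

# Assumed cohort boundaries; re-arranging cohorts means tuning these cut points.
bins = [0, 5, 20, float("inf")]
labels = ["low", "medium", "high"]
users["cohort"] = pd.cut(users["ad_events_per_week"], bins=bins, labels=labels)

# Per-cohort summary that a balance report could be built on.
summary = users.groupby("cohort", observed=True).agg(
    users=("user_id", "count"),
    avg_events=("ad_events_per_week", "mean"),
    avg_conversions=("conversions", "mean"),
)
print(summary)
```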
Another example is A/B tests that check various hypotheses. Say there is an assumption that sports websites will generate more ad revenue if we change the default order of ad formats shown to users. To confirm or dispel this idea, we run a split test and compare the particular metrics required for maximum accuracy of the results.
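For a hypothesis like the one above, the split test boils down to comparing a metric between the control and variant groups. A minimal sketch of such a comparison, a two-proportion z-test on CTR with entirely made-up counts, could look like this:

```python
from math import sqrt
from statistics import NormalDist

# Made-up split-test counts: impressions and clicks per variant.
control = {"impressions": 1_000_000, "clicks": 9_800}   # default format order
variant = {"impressions": 1_000_000, "clicks": 10_450}  # changed format order

ctr_a = control["clicks"] / control["impressions"]
ctr_b = variant["clicks"] / variant["impressions"]

# Two-proportion z-test: is the CTR difference larger than noise would explain?
pooled = (control["clicks"] + variant["clicks"]) / (control["impressions"] + variant["impressions"])
se = sqrt(pooled * (1 - pooled) * (1 / control["impressions"] + 1 / variant["impressions"]))
z = (ctr_b - ctr_a) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"CTR control={ctr_a:.4%}, variant={ctr_b:.4%}, z={z:.2f}, p={p_value:.4f}")
```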
How to predict user behavior?
The Definition of Done for our job is always a precise reply to a question: for example, ‘How much did we earn from this?’ or ‘Will we earn more if we change this?’, etc.
Aleksey: To organize data generation properly and flawlessly, we have several dedicated teams, each responsible for a different service, such as Push Notifications or ad rotation.
When discussing data storage, we can divide all data into two other groups. Some data is stored forever: yes, it has been kept since the moment the holding was launched and is never deleted. The other group is data that becomes useless after some time and can be safely erased.
Aleksey: That's right: the volume of data for the last 72 hours alone can reach approximately 95 TB in a single storage system. And, of course, all of our workflows, from data loading to deletion, are automated; otherwise, it wouldn't be possible to process it all as quickly as we do.
To be more precise, we automate the entire pipeline, from data loading to deletion. Besides, we have partly automated the search for and disposal of unused data.
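Conceptually, a partly automated cleanup of this kind scans the storage catalog for data whose retention window has expired and flags it for disposal. The sketch below is a simplified, hypothetical Python version; the dataset names and retention periods are invented.

```python
from datetime import date, timedelta

# Hypothetical retention policy: days to keep each dataset, None = keep forever.
RETENTION_DAYS = {
    "raw_impressions": 3,        # hot raw data, erased quickly
    "ab_test_snapshots": 90,
    "finance_reports": None,     # stored forever
}

# Imitation of a storage catalog: (dataset, partition date) pairs.
partitions = [
    ("raw_impressions", date(2024, 5, 1)),
    ("raw_impressions", date(2024, 5, 6)),
    ("ab_test_snapshots", date(2024, 1, 10)),
    ("finance_reports", date(2019, 3, 2)),
]

def expired(dataset: str, partition_date: date, today: date) -> bool:
    """Return True if the partition is older than its retention window."""
    days = RETENTION_DAYS.get(dataset)
    if days is None:
        return False  # permanent data is never deleted
    return partition_date < today - timedelta(days=days)

today = date(2024, 5, 7)
to_drop = [p for p in partitions if expired(*p, today)]
print("Partitions to dispose of:", to_drop)
```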
Ekaterina: The analytical work can also be automated: we create various scripts and embed them in Tableau workbooks. These scripts add new data daily, so we don't need to load it manually each time there is an update.
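The actual scripts are internal, but the general idea of such a daily refresh is to append only the newest slice of data to the file or extract a workbook reads from. A hypothetical Python sketch with an invented data source:

```python
from datetime import date, timedelta
import pandas as pd

EXTRACT_PATH = "daily_metrics.csv"  # hypothetical file a Tableau workbook points at

def fetch_metrics_for(day: date) -> pd.DataFrame:
    """Placeholder for a query against the real data source."""
    return pd.DataFrame({
        "date": [day.isoformat()],
        "impressions": [1_000_000],   # invented figures
        "revenue": [2_500.0],
    })

def append_yesterday() -> None:
    """Add yesterday's slice so nobody has to reload the workbook data by hand."""
    yesterday = date.today() - timedelta(days=1)
    fresh = fetch_metrics_for(yesterday)
    try:
        existing = pd.read_csv(EXTRACT_PATH)
        combined = pd.concat([existing, fresh], ignore_index=True).drop_duplicates("date")
    except FileNotFoundError:
        combined = fresh
    combined.to_csv(EXTRACT_PATH, index=False)

if __name__ == "__main__":
    append_yesterday()
```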
Aleksey: Generally, the amount of data we work with is not a problem in itself, but it can still raise challenging questions even with automated processes.
For example, as I have already mentioned, we have data that requires permanent retention, and its volume is growing each year. This calls for strategic planning to ensure there is enough space for it: the storage architecture doesn't allow us to simply add a new server to the existing ones. In short, when a new server is introduced, we need to rebalance the whole structure, that is, redistribute the data across multiple servers so that each of them performs optimally.
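Why adding one server forces a full rebalance is easy to see with a toy sharding scheme: if data placement depends on `hash(key) % number_of_servers`, changing the server count reassigns most keys, so the data has to be physically moved. This is a simplified illustration, not a description of the actual storage architecture:

```python
from hashlib import md5

def shard(key: str, servers: int) -> int:
    """Toy placement rule: stable hash of the key modulo the number of servers."""
    return int(md5(key.encode()).hexdigest(), 16) % servers

keys = [f"partition-{i}" for i in range(10_000)]

before = {k: shard(k, 5) for k in keys}   # 5 existing servers
after  = {k: shard(k, 6) for k in keys}   # one new server added

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of partitions would have to move to a different server")
```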
Ekaterina: Another challenge comes up when an analytical request requires deep data, a tremendous amount of information. Just one example: I already mentioned that we need to track data from users who view ads, which can involve billions of requests daily. You can't get a tangible analysis result from a single day of data, but it is also impossible to keep so much data for more than several days.
This is why the analytics department seeks compromise solutions like requesting simplified datamarts that can be stored, for example, for a month.
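A simplified datamart in this sense is essentially a pre-aggregated table: short-lived, event-level data collapsed into one row per day and dimension combination, which is small enough to keep for a month. A hypothetical pandas sketch of that reduction:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200_000  # imitation of raw, event-level data that is too big to keep for long

raw = pd.DataFrame({
    "ts": pd.to_datetime("2024-05-01")
          + pd.to_timedelta(rng.integers(0, 7 * 24 * 3600, size=n), unit="s"),
    "ad_format": rng.choice(["push", "interstitial", "popunder"], size=n),
    "clicked": rng.random(n) < 0.01,
})

# The simplified datamart: one row per day and ad format instead of one row per event.
datamart = (
    raw.assign(day=raw["ts"].dt.date)
       .groupby(["day", "ad_format"], as_index=False)
       .agg(impressions=("clicked", "size"), clicks=("clicked", "sum"))
)

print(f"raw rows: {len(raw):,}, datamart rows: {len(datamart):,}")
```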
Another option is to minimize the required data. Say we get a request to compare a range of metrics within an A/B test, and it becomes clear that the test requires three months of data collection, which would slow down other workflows. So we go for a compromise again: for example, we restrict the number of metrics, leaving only the first-order ones, such as CTR.
Aleksey: Yes, the BI department also faces similar cases related to data storage. Just one of them: we kept a fairly large amount of data related to our clients' performance, and the volumes soon exceeded our capacities. To solve the issue, we roughly estimated the revenue this data brought us and concluded that we could get rid of a sizable piece of it. Long story short, we continued storing only the most important part of this data, so yes, it was again about compromises.
Aleksey: There is a potential solution we might consider in the future if we face a massive data storage crunch. To explain it, we must introduce new terms related to how frequently we access particular information: hot, warm, and cold data.
Getting back to optimization, it can be based on choosing a storage method for each type of data. The data we don't access too often can be kept in cheaper storage: it works more slowly, but since requests to it are very infrequent, this won't hurt the overall workflow. Hot data, on the contrary, can be stored on expensive, much higher-performance servers.
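Such a tiering policy can be expressed as a simple routing rule, for example by how recently the data was accessed. The tier boundaries below are assumptions purely for illustration:

```python
from datetime import date, timedelta

def storage_tier(last_accessed: date, today: date) -> str:
    """Route data to fast or cheap storage depending on how recently it was used."""
    age = today - last_accessed
    if age <= timedelta(days=3):
        return "hot (fast, expensive servers)"
    if age <= timedelta(days=90):
        return "warm (standard storage)"
    return "cold (cheap, slow storage)"

today = date(2024, 5, 7)
for dataset, last_used in [
    ("current ad rotation stats", date(2024, 5, 6)),
    ("recent A/B test snapshots", date(2024, 3, 1)),
    ("old finance archive", date(2023, 11, 20)),
]:
    print(f"{dataset}: {storage_tier(last_used, today)}")
```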
Such tiering is not a priority right now, but it might become our optimization opportunity when data volumes grow too large for our current capacities.
Obviously, working with big data volumes involves many more tasks and challenges, but modern methods make it a seamless process. The professional teams engaged in data management and analysis at AdTech Holding ensure accurate reports, high-quality test results, and top-notch security of all the information the company stores and uses.