Too Much Data: Is It a Challenge for AdTech Businesses?
Digital marketing is inseparable from managing large volumes of data. Ad personalization, conversion tracking, A/B tests, and many other metrics are constantly gathered, stored, and actively used for various analytical tasks.
An example: the PropellerAds multisource network can receive billions of ad impressions daily, and this is just a part of the overall data that requires attention. Storing and managing such volumes seems a tricky task, but is it really so when you have professional teams and know how to automate processes?
We asked experts from AdTech Holding, Ekaterina Kolmakova, Product Owner of the Analytics Group, and Aleksey Kirilishin, Technical Product Owner of BI, to share their best practices for storing and managing data.
On Data Collection and Application
— What data is collected at AdTech Holding?
Ekaterina: There are many types of data we require for analytical tasks, but roughly speaking, we can divide it into two big categories: dimensions and measures.
- Dimensions are data related to customers: for example, their logins, last log-in dates, or the locations they visited the system from.
- Measures involve information about a large number of various events. For example, in the case of an ad network, this includes impressions, revenue, conversions, and other metrics received from clients’ ad campaigns. As new campaigns keep appearing, this data will continue to be collected — for the whole company’s lifetime.
The trickiest part here is that we analyze not only our clients’ data but also the data of their audiences: the people who watch ads and interact with them. We don’t keep the personal information of these people and cannot identify them, but we can track how they behave with our ads, and the volume of such data is much bigger, as it involves billions of impressions.
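The dimensions/measures split Ekaterina describes can be sketched as a toy data model. All names and figures below are hypothetical, and real storage at this scale would of course use a proper warehouse rather than Python dicts:

```python
from collections import defaultdict

# Dimensions: relatively static attributes of customers (hypothetical data).
dimensions = {
    "adv_42": {"login": "acme_ads", "last_login": "2024-05-01", "geo": "DE"},
    "adv_77": {"login": "globex", "last_login": "2024-05-03", "geo": "US"},
}

# Measures: an ever-growing stream of campaign events (hypothetical data).
measures = [
    {"customer": "adv_42", "event": "impression", "revenue": 0.002},
    {"customer": "adv_42", "event": "conversion", "revenue": 1.50},
    {"customer": "adv_77", "event": "impression", "revenue": 0.001},
]

# A report aggregates the measures, then enriches them with dimensions.
revenue_by_customer = defaultdict(float)
for event in measures:
    revenue_by_customer[event["customer"]] += event["revenue"]

for customer, revenue in revenue_by_customer.items():
    geo = dimensions[customer]["geo"]
    print(f"{customer} ({geo}): {revenue:.3f}")
```

The key asymmetry the interview points to is visible even in this toy: the dimensions table stays small, while the measures list grows with every event for the whole company’s lifetime.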
— We will definitely discuss the challenges of data volumes later, but before that, could you please share which tasks are solved with this data?
Ekaterina: The number of analytical tasks is enormous — but again, we can loosely divide them into two main groups.
- Operational reports
In simple words, these detail and visualize how the company performs. For example, we send a daily report to the Holding board so that they can track the revenue and other metrics of each project. Besides, operational reports can serve for building marketing funnels: for instance, the BI team provides us with some email campaign data, and we analyze the performance of that campaign.
- Ad Hoc research and A/B tests
Such tasks are always done in response to specific requests from various departments. Usually, they arise when there is a need to analyze ad rotation, make adjustments, or conduct tests.
For example, PropellerAds has low, medium, and high-activity user cohorts. Sometimes, we must re-arrange these cohorts to ensure more accurate targeting and better performance. To do it, we need to analyze user behavior in each cohort and make a report that will give an idea of the right balance between each activity level.
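One minimal way to picture the cohort re-arrangement described here is ranking users by an activity score and cutting the ranking at adjustable quantiles. The `assign_cohorts` function, the score values, and the quantile thresholds are all assumptions for illustration, not PropellerAds’ actual method:

```python
# Hypothetical sketch: split users into low/medium/high activity cohorts
# by ranking them on an activity score and cutting at chosen quantiles.
def assign_cohorts(scores, low_q=0.5, high_q=0.9):
    """Return {user: cohort} with cohort boundaries at the given quantiles."""
    ranked = sorted(scores, key=scores.get)  # least to most active
    n = len(ranked)
    cohorts = {}
    for rank, user in enumerate(ranked):
        share = (rank + 1) / n  # fraction of users at or below this rank
        if share <= low_q:
            cohorts[user] = "low"
        elif share <= high_q:
            cohorts[user] = "medium"
        else:
            cohorts[user] = "high"
    return cohorts

scores = {f"user{i}": i for i in range(10)}  # toy activity scores 0..9
print(assign_cohorts(scores))
```

Re-balancing the cohorts then amounts to re-running the assignment with different `low_q`/`high_q` thresholds and comparing the resulting performance reports.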
Another example: A/B tests that check various ideas. For example, there is an assumption that sports websites will generate more revenue from ads if we change the default order of ad formats seen by users. To confirm or dispel this idea, we initiate a split test to compare particular metrics required for the maximum accuracy of the results.
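A common way to compare a first-order metric such as CTR between the two arms of such a split test is a two-proportion z-test. The interview does not specify the statistical method, so the sketch below is a generic illustration, and all traffic numbers are invented:

```python
import math

# Hedged sketch: compare CTR between the control (default ad-format order)
# and the variant using a two-proportion z-test. Numbers are made up.
def two_proportion_z(clicks_a, views_a, clicks_b, views_b):
    p_a, p_b = clicks_a / views_a, clicks_b / views_b
    p_pool = (clicks_a + clicks_b) / (views_a + views_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / views_a + 1 / views_b))
    return (p_b - p_a) / se

z = two_proportion_z(clicks_a=4_800, views_a=1_000_000,
                     clicks_b=5_200, views_b=1_000_000)
print(f"z = {z:.2f}")  # |z| > 1.96 would indicate significance at the 5% level
```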
The Definition of Done for our job is always a precise reply to a question: for example, ‘How much did we earn from this?’ or ‘Will we earn more if we change this?’, etc.
On Data Storage, Collection, and Disposal
— So, data is constantly managed and transferred to other departments by the BI team. How do you collect and store it properly?
Aleksey: To organize data generation properly and flawlessly, we have several dedicated teams, each responsible for different services — for example, Push Notifications or ads rotation.
When discussing data storage, we can divide all data into two other groups. Some data is stored forever: it has been kept from the moment the holding was launched and is never deleted. The other group is data that becomes useless after some time and can be safely erased.
— As we already know, the volume of data is quite large, so obviously it all can’t be handled without automation. How do you automate data collection and processing?
Aleksey: That’s right: the data for the last 72 hours can reach approximately 95 TB in a single storage. And, of course, all of our workflows, from data loading to deletion, are automated; otherwise, it wouldn’t be possible to process it all as quickly as we do.
To be more precise, here is what we automate:
- Updating load configurations when data changes (with manual control in some cases).
- Collecting data marts using tools such as Airflow and a hand-crafted Python framework.
- Data cleansing. We don’t delete any data manually; instead, we distribute it into so-called topics, each with its own erasure settings. Depending on these settings, data is cleansed after a particular timespan: for example, 2, 3, or 7 days.
Besides, we have partly automated search and disposal of unused data.
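The topic-based cleansing described above could look roughly like this. The topic names and the `purge` helper are hypothetical, and a real system would delegate retention to the storage layer (per-topic retention settings) rather than filter records in Python:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical per-topic retention windows (2, 3, or 7 days, as in the text).
RETENTION = {
    "impressions_raw": timedelta(days=2),
    "clicks_raw": timedelta(days=3),
    "conversions": timedelta(days=7),
}

def purge(records, now=None):
    """Keep only records younger than their topic's retention window."""
    now = now or datetime.now(timezone.utc)
    return [r for r in records
            if now - r["ts"] < RETENTION[r["topic"]]]

now = datetime(2024, 6, 10, tzinfo=timezone.utc)
records = [
    {"topic": "impressions_raw", "ts": now - timedelta(days=1)},  # kept
    {"topic": "impressions_raw", "ts": now - timedelta(days=3)},  # purged
    {"topic": "conversions", "ts": now - timedelta(days=5)},      # kept
]
print(len(purge(records, now)))  # → 2
```

The point of the topic split is exactly what the code shows: no record is ever deleted by hand; its topic’s settings decide its fate.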
Ekaterina: The analytical work is also automatable: we create various scripts and embed them in Tableau workbooks. These scripts add new data daily, so we don’t need to load it manually each time there is an update.
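The daily-append pattern Ekaterina describes can be sketched as a script that detects which days are missing from an extract and loads only those. The `missing_days` helper and the dates are illustrative assumptions, and Tableau itself is out of scope here:

```python
import datetime

# Hypothetical sketch of incremental daily loading: find the days that are
# not yet in the extract, so only those get loaded on each run.
def missing_days(loaded_days, today):
    """Days between the last loaded date and today that still need loading."""
    last = max(loaded_days)
    gap = (today - last).days
    return [last + datetime.timedelta(days=d) for d in range(1, gap + 1)]

loaded = {datetime.date(2024, 6, 1), datetime.date(2024, 6, 2)}
print(missing_days(loaded, datetime.date(2024, 6, 5)))
# → the three dates 2024-06-03 .. 2024-06-05
```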
— Does it mean that automation totally solves the problem of large data volumes?
Aleksey: Generally, the amount of data we work with is not a problem in itself, but it can still pose challenges even with automated processes.
For example, as I have already mentioned, we have data that requires permanent retention, and its volume grows each year. This calls for strategic planning to ensure enough space for it: the storage architecture doesn’t allow us to simply add a new server to the existing ones. When a new server is introduced, we need to rebalance the whole structure, in other words, redistribute the data across the servers so that each performs optimally.
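A toy experiment shows why adding a server forces such rebalancing. Assuming naive hash-modulo placement for illustration (not necessarily the holding’s actual scheme), most keys map to a different server once the server count changes, so the data has to be physically redistributed:

```python
# Toy illustration: with hash-modulo placement, growing the cluster from
# 5 to 6 servers remaps most keys, which is exactly the rebalancing cost.
def placement(keys, n_servers):
    return {k: hash(k) % n_servers for k in keys}

keys = [f"shard-{i}" for i in range(10_000)]
before = placement(keys, 5)
after = placement(keys, 6)
moved = sum(before[k] != after[k] for k in keys)
print(f"{moved / len(keys):.0%} of data moved")  # typically ~80%
```

Schemes such as consistent hashing exist precisely to shrink this moved fraction, which is why storage architectures are planned strategically rather than grown one server at a time.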
Ekaterina: Another challenge arises when an analytical request requires deep data, meaning a tremendous amount of information. Just one example: I already mentioned that we need to track data from users who watch ads, which can involve billions of daily requests. You can’t get a meaningful analysis result from a single day, but it is also impossible to keep so much data for more than several days.
This is why the analytics department seeks compromise solutions like requesting simplified data marts that can be stored, for example, for a month.
Another option is to minimize the required data. For example, we get a request to compare a range of metrics within an A/B test. It becomes clear that this test requires three months of data collection — and this will slow down the other workflows. This is why we go for a compromise again — for example, consider restricting the number of metrics, leaving only the first-order ones — like CTR.
Aleksey: Yes, the BI department also faces similar cases related to data storage. Just one of them: we kept a rather large amount of data related to our clients’ performance, and its volume soon exceeded our capacities. To solve the issue, we roughly estimated the revenue this data brought in and concluded that we could get rid of a pretty big piece of it. Long story short, we continued storing only the most important part of this data, so yes, it was again about compromises.
On Possible Optimization Solutions
— And what are the other ways of optimization? What can be done to make the workflows more convenient?
Aleksey: There is a potential solution we might consider in the future if we face a massive data storage crunch. To explain it, we must introduce new terms related to how frequently we access particular information: hot, warm, and cold data.
- Hot data is high-priority data collected for the latest couple of weeks. This data is requested more often than any other type — so it requires quick and easy access.
- Warm data covers roughly the last two years. It is used less frequently than hot data but still requires regular access. For example, a sales manager might request statistics for a long-running ad campaign once a quarter.
- Cold data is data we need to store but actually use very rarely — like once a year or so.
Getting back to optimization: it can be done by choosing a storage method for each data type. The data we don’t access often can be kept in cheaper storage; it works slower, but since the requests are very infrequent, this won’t hurt the overall workflow. In contrast, hot data can be stored on expensive servers with much higher performance.
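The tiering policy could be expressed as a simple routing rule based on data age. The `storage_tier` function is an assumption for illustration, with the thresholds (two weeks, about two years) taken from the rough figures in the interview:

```python
from datetime import date, timedelta

# Hypothetical sketch of the hot/warm/cold tiering described above:
# route each record to a storage tier by its age.
def storage_tier(record_date, today=None):
    today = today or date.today()
    age = today - record_date
    if age <= timedelta(weeks=2):
        return "hot"    # fast, expensive storage; frequent access
    if age <= timedelta(days=730):
        return "warm"   # regular but less frequent access
    return "cold"       # cheap, slow storage; accessed about once a year

today = date(2024, 6, 10)
print(storage_tier(date(2024, 6, 5), today))   # → hot
print(storage_tier(date(2023, 1, 1), today))   # → warm
print(storage_tier(date(2020, 1, 1), today))   # → cold
```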
Such optimization is not a priority right now — but this solution might become our optimization opportunity when the data volumes become too large to fit in our current capacities.
To Sum Up
Obviously, working with big data volumes involves many more tasks and challenges — but modern methods allow it to be a seamless process. Professional teams engaged in data management and analysis at AdTech Holding ensure accurate reports, high-quality test results, and top-notch security of all the information kept and used by the company.