The Hadoop ecosystem comprises numerous technologies that work together to provide a powerful, scalable mechanism for analyzing large quantities of data and deriving insight from them. To showcase the flexibility and raw power of queries over large datasets stored in Hadoop, this post walks through an example use case. The specific goal is to produce data related to user retention, an important metric for any product company to analyze and understand.
Motivation: Why User Retention?
Broadly speaking, when equipped with the appropriate tools and data, we can enable our team and our customers to better understand the factors that drive user engagement and to ultimately make decisions that deliver better products to market.
User retention measures speak to the core of product quality by answering a crucial question about how the product resonates with users. In the case of apps (mobile or otherwise), that question is: “how many days does it take for users to stop using (or uninstall) the app?”.
Pinch Media (now Flurry) delivered a formative presentation early in the App Store’s history. Among numerous insights collected from their dataset was the following slide, which detailed patterns in user retention across all apps implementing their tracking SDK:
(This was before the availability of push notifications and numerous other mechanisms that app developers can use to increase stickiness.)
This post explores techniques to recreate the data in this graph using technologies in the Hadoop ecosystem.
Polar Mobile builds an application platform that is tailored specifically to the needs of media companies.
Accurate and timely reporting is a core requirement for our customers, and Hadoop (with Hive) underpins much of our core analytics and reporting infrastructure.
Our implementation requires input from two existing Hive tables: events and subs.
So that this post is generally useful to anyone using or evaluating Hadoop and Hive, the schemas for these tables are provided below. They mirror the tables used in our actual production infrastructure (with some column names omitted for brevity’s sake).
The events table stores the raw event stream that is collected from mobile apps built on our platform. Each row in this table is a discrete event. There are numerous types of events (recorded via the eventType field), including events corresponding to page views and app opens.
All incoming event data is flattened and stored in a single Hive table, with a schema that is similar to the following:
[gist id=1636294 file=events.hql]
(The publicationId field is roughly analogous to a unique application identifier. The omitted metadata and event-specific fields store additional information about the event, including device-specific details such as the device model and the operating system version, and event-specific details such as the amount of time spent performing the particular action.)
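Since the original schema is embedded as a gist that may not render everywhere, the following is a hypothetical sketch of what a comparable table definition could look like. Only eventType, publicationId, and the dateBucket partition come from this post; all other column names and types are illustrative assumptions, not our actual schema:

```sql
-- Hypothetical sketch of an events table. Only eventType, publicationId,
-- and the dateBucket partition are taken from the post; the remaining
-- columns are illustrative assumptions.
CREATE TABLE events (
  eventType      STRING,   -- e.g. 'appOpen', 'pageView'
  publicationId  STRING,   -- roughly a unique application identifier
  subscriberId   STRING,   -- unique user identifier (assumed name)
  eventTimestamp BIGINT    -- event time as epoch seconds (assumed)
)
PARTITIONED BY (dateBucket STRING)   -- hourly partitions: YYYYmmddhh
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
```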
The events data in the Hive warehouse is partitioned by date and time. Partition values take the form YYYYmmddhh, and the corresponding HDFS directory structure follows the same format.
Due to the high volume of incoming events, data is partitioned hourly. This optimization is important, as the most common queries that we perform against this table involve relatively narrow date ranges.
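To illustrate why hourly partitioning pays off: when the partition column appears in the WHERE clause, Hive prunes the scan to only the matching partition directories. A sketch, using the eventType and dateBucket names from above:

```sql
-- Count appOpen events for a single day. Because dateBucket is a
-- partition column, Hive scans only the 24 matching hourly partitions
-- rather than the entire table.
SELECT COUNT(*)
FROM events
WHERE dateBucket >= '2011121100'
  AND dateBucket <  '2011121200'
  AND eventType = 'appOpen';
```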
The subs table stores information about our userbase. Each row in the table represents a unique user (“subscriber”).
The Hive schema for this table is similar to the following:
[gist id=1636294 file=subs.hql]
This table is partitioned by date, with a monthly granularity. Partition values take the form YYYYmm, and the corresponding HDFS directory layout follows the same format. Unlike the events table, partitioning the subs table on a monthly basis is feasible because the number of users is at least one order of magnitude smaller than the number of events generated by the userbase.
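As with the events table, the actual schema is in a gist; a hypothetical sketch of a comparable definition follows. The column names here are illustrative assumptions:

```sql
-- Hypothetical sketch of a subs table; column names are illustrative
-- assumptions, not our actual schema.
CREATE TABLE subs (
  subscriberId     STRING,  -- unique user identifier (assumed)
  publicationId    STRING,  -- app the user subscribed to (assumed)
  subscriptionDate STRING   -- first install date, YYYYmmdd (assumed)
)
PARTITIONED BY (dateBucket STRING)   -- monthly partitions: YYYYmm
STORED AS TEXTFILE;
```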
Measuring User Retention
Earlier, we described how measuring user retention involves answering questions concerning the number of days that it takes for users to stop using a given application.
What percentage of users are active on their nth day of using the app?
We define a user as active on a particular day if the user generates at least one application open (appOpen) event on that day. n is defined relative to the user’s first app installation (“subscription”) date. Thus, for all users, day 0 is the day that the user first installed the app. By definition, all users are active on day n = 0.
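This definition of an active user translates directly into HiveQL. As a sketch (the subscriberId column name is an assumption; eventType and dateBucket come from the schema described above):

```sql
-- Users active on December 11, 2011: anyone with at least one appOpen
-- event that day. substr(dateBucket, 1, 8) reduces the hourly partition
-- value (YYYYmmddhh) to a day (YYYYmmdd).
SELECT DISTINCT subscriberId
FROM events
WHERE eventType = 'appOpen'
  AND substr(dateBucket, 1, 8) = '20111211';
```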
Implementing the Retention Measure
Our goal is to determine how many users use an application on their nth day after installing the app. Using the events data, this problem can be decomposed into two steps.
Step 1 - Building the Intermediary Daily Events Aggregate Table
This query produces an intermediary table named subscriberDailyEvents, which stores aggregate values for each user, counting the number of events (by event type) fired on a particular day.
This intermediary table is stored in Hive because the size is significant. Note that the utility of this table is not limited to the particular retention measurement described in this post; in practice, the data in this table is used in a number of additional queries that are unrelated to user retention.
[gist id=1636294 file=daily-events-aggregate.hql]
In the query, we replace the partition boundary parameters (the last of which is ending_partition) with the appropriate dateBucket values. For example, the starting and ending boundary values may be 2011121100 (December 11, 2011) and 2011121200 (December 12, 2011), respectively. Typically, this query is run daily.
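The actual query lives in the gist; a simplified sketch of such a per-user, per-day aggregation, with assumed column and partition names, might look like the following:

```sql
-- Sketch of a daily per-user event aggregation for one day of data.
-- subscriberDailyEvents and eventType appear in the post; the
-- subscriberId column and eventDay partition name are assumptions.
-- The partition boundaries below are substituted before each daily run.
INSERT OVERWRITE TABLE subscriberDailyEvents PARTITION (eventDay='20111211')
SELECT
  subscriberId,
  publicationId,
  eventType,
  COUNT(*) AS eventCount
FROM events
WHERE dateBucket >= '2011121100'   -- starting boundary (inclusive)
  AND dateBucket <  '2011121200'   -- ending_partition boundary (exclusive)
GROUP BY subscriberId, publicationId, eventType;
```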
Step 2 - Calculating the Retention Aggregate
Next, we create a new aggregate using the intermediary table produced in Step 1. This new aggregate satisfies the retention measurement that we set out to calculate, because it records the number of users that are active on their nth day of using a given app.
[gist id=1636294 file=retention-aggregate.hql]
In this query, a value for num_days must be specified. num_days sets a limit on the number of days of activity that we consider for a group of users. It is important to ensure that all of the users in the set being considered have been subscribed for at least num_days. The set of included users is limited to those that activated within a specific date range.
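Again, the production query is in the gist; the following sketch shows one way such a retention aggregate could be expressed, assuming the subscriberDailyEvents table carries a YYYYmmdd eventDay column and subs carries a YYYYmmdd subscriptionDate (both assumed names), with num_days fixed at 30 for illustration:

```sql
-- Sketch: for users who activated in November 2011 (so that every user
-- has been subscribed at least 30 days), count how many were active on
-- each day n after installing. eventDay and subscriptionDate are
-- assumed YYYYmmdd strings, reformatted for datediff().
SELECT t.n, COUNT(DISTINCT t.subscriberId) AS activeUsers
FROM (
  SELECT
    e.subscriberId,
    datediff(
      from_unixtime(unix_timestamp(e.eventDay, 'yyyyMMdd'), 'yyyy-MM-dd'),
      from_unixtime(unix_timestamp(s.subscriptionDate, 'yyyyMMdd'), 'yyyy-MM-dd')
    ) AS n
  FROM subscriberDailyEvents e
  JOIN subs s ON (e.subscriberId = s.subscriberId)
  WHERE s.subscriptionDate BETWEEN '20111101' AND '20111130'
    AND e.eventType = 'appOpen'
) t
WHERE t.n <= 30   -- num_days: only consider the first 30 days of activity
GROUP BY t.n;
```

Dividing each day's activeUsers count by the cohort size (the count at n = 0) yields the retention percentages plotted in the Pinch Media graph.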
This query is also run on a daily basis, and is dependent on the results of the query described in Step 1.
Finally, the results of the retention aggregate query are dumped into a MySQL database for subsequent analysis and ad-hoc querying.
Numerous improvements are possible that will allow us to derive additional insight from this existing dataset. These improvements include the following:
- The introduction of a complementary measurement that incorporates the average number of page view events for all users that are active on a given day, grouped by the users’ nth day of activity. This measurement will allow us to more precisely see how average engagement levels change over time.
- The incorporation of dimensional analysis, with measurements decomposed into groups of similar users. Groups can be formed based on numerous factors, including installation date, device platform, country, etc. Groups can also be formed based on activity tiers, for example, to compare how overall retention differs between the most active and the least active users.
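As a sketch of the dimensional idea, adding a platform breakdown amounts to carrying one extra column through the aggregates and adding it to the grouping. Both devicePlatform and the pre-joined input table below are hypothetical:

```sql
-- Sketch: retention broken down by device platform. devicePlatform is
-- an assumed column carried through from the raw events' metadata, and
-- subscriberRetentionInput stands in for the joined per-user data
-- produced in Step 2 (both hypothetical names).
SELECT devicePlatform, n, COUNT(DISTINCT subscriberId) AS activeUsers
FROM subscriberRetentionInput
GROUP BY devicePlatform, n;
```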
Hadoop, Hive, and related technologies are formidable tools for unlocking value from data.
Deriving insight from data is an important activity that can empower your team and your customers to build better products and make more informed business decisions. Retention measurements are particularly significant because they paint a detailed picture about the overall stickiness of a product across the entire userbase.
Stay tuned, as we will continue to discuss our analytics and reporting infrastructure in future posts.