Search

ALL  /  Insights  /  Press Releases  /  Careers  /  Engineering  /  Design  /  Stats  /  Life at Polar


Measuring User Retention with Hadoop and Hive

The Hadoop ecosystem is comprised of numerous technologies that can work together to provide a powerful and scalable mechanism for analyzing and deriving insight from large quantities of data. In an effort to showcase the flexibility and raw power of queries that can be performed over large datasets stored in Hadoop, this post is written to demonstrate an example use case. The specific goal is to produce data related to user retention, an important metric for all product companies to analyze and understand.

Motivation: Why User Retention?

Broadly speaking, when equipped with the appropriate tools and data, we can enable our team and our customers to better understand the factors that drive user engagement and to ultimately make decisions that deliver better products to market.

User retention measures speak to the core of product quality by answering a crucial question about how the product resonates with users. In the case of apps (mobile or otherwise), that question is: “how many days does it take for users to stop using (or uninstall) the app?”.

Pinch Media (now Flurry) delivered a formative presentation early in the AppStore’s history. Among numerous insights collected from their dataset was the following slide, which detailed patterns in user retention across all apps implementing their tracking SDK:

Pinch Media: Free Applications - Usage Over Time

(This was before the availability of push notifications and numerous other mechanisms that app developers can use to increase stickiness.)

This post explores techniques to recreate the data in this graph using technologies in the Hadoop ecosystem.

Background

Polar Mobile builds an application platform that is tailored specifically to the needs of media companies.

Accurate and timely reporting is a core requirement for our customers, and Hadoop (with Hive) underpins much of our core analytics and reporting infrastructure.

Data Model

Our implementation requires input from two existing Hive tables (events and subs).

In an effort to ensure that this post is generally useful to anyone using or evaluating Hadoop and Hive, the schemas for these tables are provided. The provided schemas mirror the tables used in our actual production infrastructure (with some column names omitted for brevity’s sake).

Event Data

The events table stores the raw event stream that is collected from mobile apps built off of our platform.

Each row in this table is a discrete event. There are numerous types of events (recorded via the eventType field), including events corresponding to page views and app opens.

All incoming event data is flattened and stored in a single Hive table, with a schema that is similar to the following:

[gist id=1636294 file=events.hql]

(The publicationId field is roughly analogous to a unique application identifier. The omitted metadata and event-specific fields store additional information about the event, including device-specific details such as the device model and the operating system version, and event-specific details such as the amount of time spent performing the particular action.)

The events data in the Hive warehouse is partitioned by date and time. Partition values take the form YYYYmmddhh, and the corresponding HDFS directory structure is in the format YYYY/mm/dd/hh.

Due to a high volume of incoming events, data is partitioned hourly. This optimization is important, as the most common queries that we perform against this table involve relatively narrow date ranges.

 

Subscription Data

The subs table stores information about our userbase. Each row in the table represents a unique user (“subscriber”).

The Hive schema for this table is similar to the following:

[gist id=1636294 file=subs.hql]

This table is partitioned by date, with a monthly granularity. Partition values take the form YYYYmm, and the corresponding HDFS directory layout is YYYY/mm.

Unlike the events table, partitioning the subs table on a monthly basis is feasible because the number of users is at least one order of magnitude smaller than the number of events generated by the userbase.

Measuring User Retention

Earlier, we described how measuring user retention involves answering questions concerning the number of days that it takes for users to stop using a given application.

This retention question can be answered more generally in terms of user activity:

What percentage of users are active on their nth day using the app?

We define a user as active on a particular day if the user generates at least one application open (appOpen) event on that day. n is defined relative to the user’s first app installation (“subscription”) date.

Thus, for all users, Day 0 is the day that the user first installed the app. By definition, all users are active on day n = 0.

Implementing the Retention Measure

Our goal is to determine how many users use an application on their nth day after installing the app. Using the subs and events data, this problem can be decomposed into two steps.

Step 1 - Building the Intermediary Daily Events Aggregate Table

This query produces an intermediary table named subscriberDailyEvents that stores aggregate values for each user, counting the number of events (by event type) fired on a particular day.

This intermediary table is stored in Hive because the size is significant. Note that the utility of this table is not limited to the particular retention measurement described in this post; in practice, the data in this table is used in a number of additional queries that are unrelated to user retention.

[gist id=1636294 file=daily-events-aggregate.hql]

In the query, we replace the starting_partition and ending_partition parameters with the appropriate dateBucket values. For example, the starting_partition and ending_partition values may be 2011121100 (December 11, 2011) and 2011121200 (December 12, 2011), respectively. Typically, this query is run daily.

 

Step 2 - Calculating the Retention Aggregate

Next, we create a new aggregate using the intermediary table produced in Step 1. This new aggregate satisfies the retention measurement that we set out to calculate, because it records the number of users that are active on their nth day of using a given app.

[gist id=1636294 file=retention-aggregate.hql]

In this query, values for first_activation, last_activation, and num_days must be specified. num_days sets a limit on the number of days of activity that we consider for a group of users. It is important to ensure that all of the users in the set being considered have been subscribed for at least num_days. The set of included users is limited to those that activated within a specific date range.

This query is also ran on a daily basis, and is dependent on the results of the query described in Step 1.

Finally, the results of the retention aggregate query are dumped into a MySQL database for subsequent analysis and ad-hoc querying.

Future Improvements

Numerous improvements are possible that will allow us to derive additional insight from this existing dataset. These improvements include the following:

  • The introduction of a complementary measurement that incorporates the average number of page view events for all users that are active on a given day, grouped by the users’ nth day of activity. This measurement will allow us to more precisely see how average engagement levels change over time.
  • The incorporation of dimensional analysis, with measurements decomposed into groups of similar users. Groups can be formed based on numerous factors, including installation date, device platform, country, etc. Groups can also be formed based on activity tiers, for example, to compare how overall retention differs between the most active and the least active users.

Closing Remarks

Hadoop, Hive, and related technologies are formidable tools for unlocking value from data.

Deriving insight from data is an important activity that can empower your team and your customers to build better products and make more informed business decisions. Retention measurements are particularly significant because they paint a detailed picture about the overall stickiness of a product across the entire userbase.

Stay tuned, as we will continue to discuss our analytics and reporting infrastructure in future posts.