Background Image

DARTS Technical Overview

What is DARTS?

DARTS stands for Dynamic and Responsive Targeting System. It is a system that couples concepts from multi-armed bandit algorithms modified for delayed feedback scenarios with a mechanism for fairly allocating targets from each arm.

DARTS operates in a feedback loop where targets are selected and an action is applied (e.g., making a call, showing an ad, etc.), and the results (whether the targets are good and bad) are observed. The results are then used in the next iteration of the loop to change the number of targets that are drawn from each target pool.

There are three main components:

  1. Target Pools:
    • Group targets together based on shared attributes.
    • They are flexible, can be as simple as demographic groups or as complex as machine learning model predictions (see case study).
    • They can overlap or assign quality scores (like prediction probabilities) for individual targets.
    • You don’t know which pool will have the good targets, so they are distributed evenly at first.
  2. Adaptive Module:
    • Dictates how many targets to select in the next round based on prior results.
    • Under the hood, it’s a multi-armed bandit
  3. Allocation Mechanism:
    • Dictates a strategy to pick targets using adaptive module as input.
    • Can pick the allocated targets from each pool all at once.
    • Or you can pick one target at a time, rotating through each pool in a round-robin fashion.
    • The exact strategy you use is dictated by target pool setup and problem domain.

Multi-Armed Bandits

A multi-armed bandit takes its name from the idiomatic phrase for a slot machine, "one-armed bandit." The payoff, or reward, for pulling the arm on the slot machine is unknown to the user, but could be estimated after pulling the arm many times. A multi-armed bandit is an extension of this idea where now the user wants to optimize the payoff from pulling the arms from an array of slot machines.

The optimization process in this scenario relies on balancing how often the user explores pulling the different arms with how often the user exploits the arm with the best known rewards. There are many algorithms, known as explore/exploit policies, that exist to do this: epsilon-greedy, UCB, and Bayes UCB, to name a few. These algorithms are designed to work in an online scenario where the user gets a reward immediately after pulling one of the arms.

An example of this scenario is a casino with many slot machines. The user pulls the arm on a slot machine, observes a reward, computes the next arm to pull based on the chosen explore/exploit policy, and then pulls the next arm. This process is repeated for each arm pull.

Delayed Feedback Scenarios

In the real world, reward feedback may be delayed until after a user pulls multiple arms. These scenarios require that explore/exploit policies are modified to optimize which arms should be pulled in the next round on a batch-by-batch basis. Instead of the explore/exploit algorithm dictating a single arm to pull, it instead dictates the mix of which arms should be pulled the next round.

To extend the casino example, let's say that the user was banned from the casino because of their calculation and arm pulling antics. Instead of the user going inside the casino themselves, they send a friend in with instructions indicating how many times their friend should pull each arm. The friend records the rewards from each pull of each arm. When the friend leaves the casino, they report the rewards to the user, and the user computes how many times the friend should pull each arm the next time they are sent in to the casino.

Allocation Mechanism

In simple cases, as outlined in the example above, explore/exploit algorithms modified for delayed feedback can just dictate how many times each arm should be pulled. Depending on the use-case, however, pulling an arm in a multi-armed bandit problem may be more complicated than simply making the choice and observing a reward. DARTS was designed for a case study where each arm was a machine learning model. Pulling an arm meant picking a prediction from the model and performing an action on that prediction. Only then could a reward be observed. As outlined, this is still fairly straight-forward. We could just allocate the top *n* predictions from each model, where *n* is the number of arm pulls dictated by the explore/exploit algorithm.

However, in our case, the predictions from each model were not completely disjoint, meaning that multiple arms contained some of the same predictions. But the actions required by our problem necessitated disjoint contributions from each model each round. To operate in this context, we developed an allocation mechanism to fairly distribute predictions from each model. The process was not as simple as just choosing how many times each arm should be pulled.

Allocations require re-thinking when predictions (or pools) overlap between arms. If we allocate all predictions from a model at once, we risk disadvantaging the other models by pulling predictions from further down in their pool. To accommodate for this, we allocate from pools in a round-robin fashion. The pool that gets the least allocation goes first and picks one prediction, then the next pool picks one, and so on until all pools have picked. After each arm has picked one, the order is reversed and each arms picks one again. This process repeats until all allocations have been made. If an arm runs out of allocations part way it is excluded from the round-robin.


Rewards are a core part of multi-armed bandit problems. Simply put, a reward is some number quantifying a gain received from pulling one of the arms. The explore-exploit algorithms included in DARTS attempt to maximize rewards received at each time step.

How that reward is calculated depends on the framing on the problem. One common use-case for multi-armed bandits is comparing advertisements. The arms in this scenario are each advertisement. The explore/exploit algorithm dictates which advertisement is shown. The reward is a visitor clicking on an advertisement.

For our case study, the reward was a model making a correct prediction for a conflicted voter. Because our arms had overlapping distributions, each arm received a reward for making a correct prediction regardless of which model contributed it to the target pool. It is up to the user of DARTS to apply rewards according to the needs of their use-case.

Explore/Exploit Algorithms

DARTS ships with three Explore/Exploit Algorithms, epsilon-greedy, UCB1, and Bayes UCB. Each of these algorithms have been modified from the original one-at-a-time online design to accommodate delayed feedback scenarios.

Epsilon Greedy

The epsilon-greedy algorithm is the most common and simplest of the explore/exploit algorithms. A parameter epsilon is chosen to determine the proportion of time the user should explore instead of exploiting the best performing arm. For example, if epsilon is 0.1, then 90% of the time the user will pick the best performing arm (exploit) and 10% of the time the user will choose a different arm (explore). In an online learning scenario, the exploration is simply a random choice from all the arms except the best performing one.

In DARTS, we modified this algorithm to account for delayed feedback scenarios following the methodology from Liu, Downe, and Reed (2019). In this scenario, the best-performing arm is pulled \(N\cdot(1-\epsilon)\) times where \(N\) is the total number of times to pull an arm for the next round. Any remaining arm \(k\) is pulled \(N\cdot\frac{\epsilon}{K-1}\) times where \(K\) is the total number of arms.

The epsilon-greedy policy tends to plateau the quickest in terms of which arm is the most optimal to pull, but that plateau tends to be lower than other algorithms.

Upper Confidence Bounds

There are two related policies that can be used with DARTS surrounding the idea of an upper confidence bound. In the context of a multi-armed bandit problem, the upper confidence bound we care about is the upper confidence bound on the rewards generated by a given arm. Upper Confidence Bound explore/exploit algorithms rely on the confidence the user has in the mean rewards generated by pulling a given arm. As the number of times and arm is pulled increases, the more confident the user is in the mean rewards gained from pulling that arm. In an online setting, the arm with the highest upper confidence bound is picked to be pulled. Arms with wide confidence intervals are more likely to be picked (explored) through the process of updating the upper confidence bound after each pull. Eventually, the choice in arm will stabilize.

To modify Upper Confidence Bound policies for delayed feedback, we simply normalize the upper confidence bound for all the arms and use the result as the percentage of time an arm should be pulled. Because this technique stabilizes over time instead of always picking the best arm, we implemented a greed factor that a user can use to bias the arm pull allocations towards the best performing arm as needed.

The two flavors included in DARTS are UCB1 and Bayes UCB.


In the UCB1 explore/exploit policy, the size of the confidence interval is dictated by a timestep \(t \in [1,\infty)\). In an online scenario, \(t\) is updated with each arm pull, but in a delayed feedback scenario \(t\) online increases with each round of multiple arm pulls. The equation follows:

\[N_{a,t+1}=N_{t+1}\bigg(\frac{\mu_a+\sqrt\frac{2\log_{10} t}{n_{a}}}{\sum_{a\in A} \mu_a+\sqrt\frac{2\log_{10} t}{n_{a}}}\bigg)\]

This equation states that the number of times to pull arm \(a\) during the next round (\(N_{a,t+1}\)) is equal to the total number of pulls next round (\(N_{t+1}\)) multiplied by the mean rewards for arm \(a\) plus the square root of the log of timestep \(t\) divided by the number of times arm \(a\) has been pulled so far.

The UCB1 policy tends to perform much better than epsilon-greedy in the long term, but requires more rounds of play to reach its optimum mix of arms to pull.

Bayes UCB

The Bayes UCB explore/exploit policy relies on the idea of Baye's Rule, that the probability of the mean rewards for an arm is conditioned on the history of rewards received during previous pulls. The Bayes UCB policy assumes that the distribution of rewards for each arm are independent and are drawn from Gaussian distributions. Bayes UCB does not rely on a timestep to implement decay like UCB1 does. Instead, Bayes UCB parameterizes the upper confidence bound calculation with a constant \(c\) that reflects the number of standard deviations from the mean to consider for the upper confidence bound. The equation is similar to that of UCB1:

\[N_{a,t+1}=N_{t+1}\bigg(\frac{\mu_a+\frac{c\sigma_a t}{\sqrt n_{a}}}{\sum_{a\in A}\mu_a+\frac{c\sigma_a t}{\sqrt n_{a}}}\bigg)\]

In this equation \(\sigma_a\) is the standard deviation of the rewards.