Dynamic Cost-Per-Action Mechanisms
and Applications to Online Advertising

Hamid Nazerzadeh*
Amin Saberi*
Rakesh Vohra

March 6, 2008

*Management Science and Engineering Department, Stanford University, Stanford, CA 94305, email: {hamidnz,saberi}@stanford.edu
Department of Managerial Economics and Decision Sciences, Kellogg School of Management, Northwestern University, Evanston, IL 60208, email: {r-vohra@kellogg.northwestern.edu}

Abstract

We study the Cost-Per-Action or Cost-Per-Acquisition (CPA) charging scheme in online advertising. In this scheme, instead of paying per click, the advertisers pay only when a user takes a specific action (e.g. fills out a form) or completes a transaction on their websites.

We focus on designing efficient and incentive compatible mechanisms that use this charging scheme. We describe a mechanism based on a sampling-based learning algorithm that under suitable assumptions is asymptotically individually rational, asymptotically Bayesian incentive compatible and asymptotically ex-ante efficient.

In particular, we demonstrate our mechanism for the case where the utility functions of the advertisers are independent and identically-distributed random variables as well as the case where they evolve like independent reflected Brownian motions.

1 Introduction

Currently, the two main charging models in the online advertising industry are cost-per-impression (CPM) and cost-per-click (CPC). In the CPM model, the advertisers pay the publisher for the impression of their ads. CPM is commonly used in traditional media (e.g. magazines and television) and in banner advertising, and is more suitable when the goal of the advertiser is to increase brand awareness.

A more attractive and more popular charging model in online advertising is the CPC model in which the advertisers pay the publisher only when a user clicks on their ads. In the last few years, there has been a tremendous shift towards the CPC charging model. CPC is adopted by search engines such as Google or Yahoo! for the placement of ads next to search results (also known as sponsored search) and on the website of third-party publishers.

In this paper we will focus on another natural and widely advocated charging scheme known as the Cost-Per-Action or Cost-Per-Acquisition (CPA) model. In this model, instead of paying per click, the advertiser pays only when a user takes a specific action (e.g. fills out a form) or completes a transaction. Recently, several companies like Google, eBay, Amazon, Advertising.com, and Snap.com have started to sell advertising in this way.

CPA models can be the ideal charging scheme, especially for small and risk averse advertisers. We will briefly describe a few advantages of this charging scheme over CPC and refer the reader to [19] for a more detailed discussion.

One of the drawbacks of the CPC scheme is that it requires the advertisers to submit their bids before observing the profits generated by the users clicking on their ads. Learning the expected value of each click, and therefore the right bid for the ad, is a prohibitively difficult task especially in the context of sponsored search in which the advertisers typically bid for thousands of keywords. CPA eliminates this problem because it allows the advertisers to report their payoff after observing the user’s action.

Another drawback of the CPC scheme is its vulnerability to click fraud. Click fraud refers to clicks generated by someone or something with no genuine interest in the advertisement. Such clicks can be generated by the publisher of the content, who has an interest in receiving a share of the revenue of the ad, or by a rival who wishes to increase the cost of advertising for the advertiser. Click fraud is considered by many experts to be the biggest challenge facing the online advertising industry [14, 10, 24, 21]. CPA schemes are less vulnerable because generating a fraudulent action is typically more costly than generating a fraudulent click. For example, an advertiser can define the action as a sale and pay the publisher only when the ad yields profit.

On the other hand, there is a fundamental difference between the CPA and CPC charging models. A click on the ad can be observed by both advertiser and publisher. However, the action of the user is hidden from the publisher and is observable only by the advertiser. Although the publisher can require the advertisers to install software that monitors the actions taking place on their web sites, even moderately sophisticated advertisers can find a way to manipulate the software if they find it sufficiently profitable.

Are the publishers exposed to the manipulation or misreporting of the advertisers in the CPA scheme? Does CPA create an incentive for the advertisers to misreport the number of actions or their payoffs for the actions? The main result of this paper is to give a negative answer to these questions. We design a mechanism that, asymptotically and under reasonable assumptions, removes the incentives of the advertisers to misreport their payoffs. At the same time, our mechanism has the same asymptotic efficiency and hence revenue as the currently used CPC mechanisms. We will use techniques in learning and mechanism design to obtain this result.

In the next section, we will formally describe our model in mechanism design terminology (see [22]). We will refer to advertisers as agents and to the impression of an ad as an item. For simplicity of exposition, we assume only one advertisement slot per page. In section 8 we outline how to extend our results to the case where more than one advertisement can be displayed on each page. Although our work is essentially motivated by online advertising, we believe that the application of our mechanism is not limited to this domain.

1.1 Model

We study the following problem: there are a number of self-interested agents competing for identical items sold repeatedly at times t = 1,2,⋅⋅⋅. At each time t, a mechanism allocates the item to one of the agents. Agents discover their utility for the good only if it is allocated to them. If agent i receives the good at time t, she realizes utility uit (denominated in money) for it and reports (not necessarily truthfully) the realized utility to the mechanism. Then, the mechanism determines how much the agent has to pay for receiving the item. We allow the utility of an agent to change over time. For this environment we are interested in auction mechanisms which have the following four properties.

  1. The mechanism is individually rational in each period.
  2. Agents have an incentive to truthfully report their realized utilities.
  3. The efficiency (and revenue) is, in an appropriate sense, not too small compared to a second price auction.
  4. The correctness of the mechanism does not depend on an a-priori knowledge of the distribution of uit's. This feature is motivated by the Wilson doctrine [25].

The precise manner in which these properties are formalized is described in section 2.

We will build our mechanisms on a sampling-based learning algorithm. The learning algorithm is used to estimate the expected utility of the agents, and consists of two alternating phases: exploration and exploitation. During an exploration phase, the item is allocated for free to a randomly chosen agent. During an exploitation phase, the mechanism allocates the item to the agent with the highest estimated expected utility. After each allocation, the agent who has received the item reports its realized utility. Subsequently, the mechanism updates the estimate of utilities and determines the payment.

We characterize a class of learning algorithms that ensure that the corresponding mechanism has the four desired properties. The main difficulty in obtaining this result is the following: since there is uncertainty about the utilities, it is possible that in some periods the item is allocated to an agent who does not have the highest utility in that period. Hence, the natural second-highest price payment rule would violate individual rationality. On the other hand, if the mechanism does not charge an agent because her reported utility after the allocation is low, it gives her an incentive to shade her reported utility down. Our mechanism solves these problems by using an adaptive, cumulative pricing scheme.

We illustrate our results by identifying simple mechanisms that have the desired properties. We demonstrate these mechanisms for the case in which the uit’s are independent and identically-distributed random variables as well as the case where their expected values evolve like independent reflected Brownian motions. In these cases the mechanism is actually ex-post individually rational.

In our proposed mechanism, the agents do not have to bid for the items. This is advantageous when the bidders themselves are unaware of their utility values. However, in some cases, an agent might have a better estimate of her utility for the item than our mechanism. For this reason, we describe how we can slightly modify our mechanism to allow those agents to bid directly.

1.2 Related Work

There is a large body of interesting results on using machine learning techniques in mechanism design. We only briefly survey the main techniques and ideas and compare them with the approach of this paper.

Most of these works, like [5, 8, 11, 18], consider one-shot games or repeated auctions in which the agents leave the environment after they receive an item. In our setting, we may allocate items to an agent several times and hence we need to consider the strategic behavior of the agents over time. There is also a large literature on regret minimization or expert algorithms. In our context, these algorithms are applicable even if the utilities of the agents change arbitrarily. However, the efficiency (and therefore the revenue) of these algorithms is only comparable to that of mechanisms that allocate the item to the single best agent (expert) (e.g. see [17]). Our goal is more ambitious: our efficiency is close to that of the most efficient allocation, which might allocate the item to different agents at different times. On the other hand, we focus on utility values that change smoothly (e.g. like a Brownian motion).

In a finitely repeated version of the environment considered here, Athey and Segal [2] construct an efficient, budget-balanced mechanism in which truthful revelation in each period is Bayesian incentive compatible. Bapna and Weber [4] consider the infinite horizon version of [2] and propose a class of incentive compatible mechanisms based on the Gittins index (see [12]). Taking a different approach, Bergemann and Välimäki [6] and Cavallo et al. [9] propose an incentive compatible generalization of the Vickrey-Clarke-Groves mechanism based on the marginal contribution of each agent for this environment. All these mechanisms need the exact solution of the underlying optimization problems, and therefore require complete information about the prior of the utilities of the agents; also, they do not apply when the evolution of the utilities of the agents is not stationary over time. This violates the last of our desiderata. For a comprehensive survey of the dynamic mechanism design literature see [23].

In the context of sponsored search, attention has focused on ways of estimating click-through rates. Gonen and Pavlov [13] give a mechanism which learns the click-through rates via sampling and show that truthful bidding is, with high probability, a (weakly) dominant strategy in this mechanism. Along this line, Wortman et al. [26] introduced an exploration scheme for learning advertisers' click-through rates in sponsored search which maintains the equilibrium of the system. In these works, unlike ours, the distribution of the utilities of the agents is assumed to be fixed over time.

Immorlica et al. [15], and later Mahdian and Tomak [19], examine the vulnerability of various procedures for estimating click-through rates, and identify a class of click-through learning algorithms in which fraudulent clicks cannot increase the expected payment per impression by more than o(1). This is under the assumption that the slot of an agent is fixed and the bids of other agents remain constant over time. In contrast, we study conditions which guarantee incentive compatibility and efficiency while the utilities of (all) agents may evolve over time.

2 Definitions and Notation

Suppose n agents compete in each period for a single item. The item is sold repeatedly at times t = 1,2,⋅⋅⋅. Denote by uit the nonnegative utility of agent i for the item at time t. Utilities are denominated in a common monetary scale.

The utilities of agents may evolve over time according to a stochastic process. We assume that for i ≠ j, the evolution of uit and ujt are independent stochastic processes. We also define μit = E[uit | ui1,⋅⋅⋅,ui,t−1]. Throughout this paper, expectations are taken conditioned on the complete history. For simplicity of notation, we omit the terms that denote such conditioning. With this notational convention, it follows, for example, that E[uit] = E[μit]. Here the second expectation is taken over all possible histories.

Let M be a mechanism used to sell the items. At each time, M allocates the item to one of the agents. Let i be the agent who has received the item at time t. Define xit to be the variable indicating the allocation of the item to i at time t. After the allocation, agent i observes her utility, uit, and then reports rit, as her utility for the item, to the mechanism. Note that we do not require an agent to know her utility for possessing the item in advance of acquiring it. The mechanism then determines the payment, denoted by pit.

Definition 1 An agent i is truthful if rit = uit for all times t > 0 such that xit = 1.

Our goal is to design a mechanism which has the following properties. We assume n, the number of agents, is constant.

Individual Rationality:
A mechanism is ex-post individually rational if for any time T > 0 and any agent 1 ≤ i ≤ n, the total payment of agent i does not exceed the sum of her reports:
\[ \sum_{t=1}^{T} x_{it} r_{it} - p_{it} \;\geq\; 0. \]

M is asymptotically ex-ante individually rational if:

\[ \liminf_{T\to\infty} E\Big[\sum_{t=1}^{T} x_{it} r_{it} - p_{it}\Big] \;\geq\; 0. \]
Incentive Compatibility:
This property means that truthfulness defines an asymptotic Bayesian Nash equilibrium. Consider agent i and suppose all agents except i are truthful. Let Ui(T) be the expected total profit of agent i if she is truthful between time 1 and T. Also, let Ûi(T) be the maximum expected profit of agent i under any other strategy. Asymptotic incentive compatibility requires that
\[ \hat{U}_i(T) - U_i(T) = o(U_i(T)). \]
Efficiency and Revenue:
Call a mechanism that allocates at each time t (and for each history) the item to an agent in argmaxi{μit} ex-ante efficient. The total social welfare obtained by an ex-ante efficient mechanism up to time T is E[Σ_{t=1}^{T} maxi{μit}]. If each agent i knew μit for all t, such an allocation could be achieved by a second price auction in an incentive compatible way: have each i report a value of μit, give the item to an agent in argmaxi{μit}, and charge her the second highest μit. Let γt be the second highest μit at time t > 0. Then, the expected revenue of this second price mechanism up to time T is equal to E[Σ_{t=1}^{T} γt].
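As a small worked illustration (the numbers below are ours, chosen only for concreteness): suppose at some time t three agents have conditional expected utilities
\[ \mu_{1t} = 5, \qquad \mu_{2t} = 3, \qquad \mu_{3t} = 1. \]
The ex-ante efficient allocation gives the item to agent 1, and the second highest value is γt = 3; period t then contributes 5 to the welfare benchmark and 3 to the second-price revenue benchmark.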

We will measure the efficiency and the revenue of M by comparing it to the second price mechanism just described. Let W(T) and R(T) be the expected welfare and the expected revenue of mechanism M between time 1 and T, when all agents are truthful, i.e.

\[ W(T) = E\Big[\sum_{t=1}^{T}\sum_{i=1}^{n} x_{it}\,\mu_{it}\Big], \qquad R(T) = E\Big[\sum_{t=1}^{T}\sum_{i=1}^{n} p_{it}\Big]. \]

Then, M is asymptotically ex-ante efficient if:

\[ E\Big[\sum_{t=1}^{T}\max_i\{\mu_{it}\}\Big] - W(T) = o(W(T)). \]
Also, the revenue of M is asymptotically equivalent to the revenue of the second price auction if:
\[ E\Big[\sum_{t=1}^{T}\gamma_t\Big] - R(T) = o(R(T)). \]

3 Proposed Mechanism


  For t = 1, 2, ...
    With probability η(t), explore:
        Uniformly at random, allocate the item to an agent i, 1 ≤ i ≤ n.
        pit ← 0
    With probability 1 − η(t), exploit:
        Randomly allocate the item to an agent i ∈ argmaxi{μ̂it(t)}.
        pit ← Σ_{k=1}^{t−1} yik · min{γ̂k(k), γ̂k(t)} − Σ_{k=1}^{t−1} pik
    rit ← the report of agent i.
    pjt ← 0, for all j ≠ i
Figure 1: Mechanism M

We build our mechanism on top of a learning algorithm that estimates the expected utility of the agents. We refrain from an explicit description of the learning algorithm. Rather, we describe sufficient conditions for a learning algorithm that can be extended to a mechanism with all the properties we seek (see section 4.1). In sections 6 and 7 we give two examples of environments where learning algorithms satisfying these sufficient conditions exist.

The mechanism consists of two phases: explore and exploit. During the explore phase, with probability η(t) ∈ [0,1], the item is allocated for free to a randomly chosen agent. During the exploit phase, the mechanism allocates the item to the agent with the highest estimated expected utility. Afterwards, the agent reports her utility to the mechanism and the mechanism determines the payment. We first formalize our assumptions about the learning algorithm and then we discuss the payment scheme. The mechanism is given in Figure 1.

The learning algorithm samples the uit's at rate η(t) and, based on the history of the reports of agent i, returns an estimate of μit. Let μ̂it(T) be the estimate of the algorithm for μit conditional on the history of the reports up to time T. The history of the reports of agent i up to time T is the sequence of the reported values and times of observation of uit up to but not including time T. Note that we allow T > t. Thus, information at time T > t can be used to revise an estimate of μit made at some earlier time. We assume that the accuracy of the learning algorithm is monotone, i.e., increasing the number of samples, in expectation, only increases the accuracy of the estimates.

We now describe the payment scheme. Let γ̂t(T) = max_{j≠i}{μ̂jt(T)}, where i is the agent who receives the item at time t. We define yit to be the indicator variable of the allocation of the item to agent i during an exploit phase. The payment of agent i at time t, denoted pit, is equal to:

\[ p_{it} = \sum_{k=1}^{t-1} y_{ik}\min\{\hat\gamma_k(k), \hat\gamma_k(t)\} - \sum_{k=1}^{t-1} p_{ik} \]

Therefore, we have:

\[ \sum_{k=1}^{t} p_{ik} = \sum_{k=1}^{t-1} y_{ik}\min\{\hat\gamma_k(k), \hat\gamma_k(t)\}. \]

An agent only pays for items that were allocated to her during the exploit phase, up to but not including time t. At time t, the payment of agent i for the item she received at time k < t is equal to min{γ̂k(k), γ̂k(t)}. The payment scheme emulates the second-price scheme by replacing γt with its estimates. The mechanism reduces the price of an item if it later realizes that the item was overpriced. However, it does not increase the payment when the item was underpriced; this happens when an item, due to estimation error, is allocated to an agent who does not have the highest expected valuation for it. Since the estimates of the learning algorithm for the utilities of the agents become more precise over time, our adaptive cumulative payment scheme allows correction of the mistakes made in the past.
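To make the allocation and payment rules concrete, the following is a minimal Python sketch of mechanism M, using the simple averaging estimator that appears later in Section 6. The agent interface (report), the helper names, and the tie-breaking are illustrative assumptions on our part, not part of the paper.

    import random

    def run_mechanism(agents, eta, T):
        # Sketch of mechanism M (Figure 1). `agents[i].report(t)` returns the
        # utility agent i reports when the item is allocated to her at time t;
        # `eta(t)` is the exploration probability. Assumes at least two agents.
        n = len(agents)
        reports = [[] for _ in range(n)]   # reported utilities of each agent
        wins = [[] for _ in range(n)]      # exploit wins: (time k, gamma_hat_k(k))
        paid = [0.0] * n                   # total charged so far, sum_k p_ik

        def mu_hat(i):
            # estimate of agent i's expected utility from her reports so far
            return sum(reports[i]) / len(reports[i]) if reports[i] else 0.0

        def gamma_hat(i):
            # highest current estimate among the other agents
            return max(mu_hat(j) for j in range(n) if j != i)

        for t in range(1, T + 1):
            if random.random() < eta(t):              # explore: free item
                i = random.randrange(n)
                p_it = 0.0
            else:                                     # exploit
                i = max(range(n), key=mu_hat)
                # re-price every past exploit win at min{gamma_hat_k(k), gamma_hat_k(t)}
                # and charge the difference with what has already been paid
                due = sum(min(g_k, gamma_hat(i)) for (_, g_k) in wins[i])
                p_it = due - paid[i]                  # may be a refund
                wins[i].append((t, gamma_hat(i)))
            paid[i] += p_it
            reports[i].append(agents[i].report(t))    # winner reports her utility
        return paid

The cumulative rule is visible in the exploit branch: the price of every past item is recomputed with the current estimates, and only the difference with the payments already made is charged (or refunded).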

4 Analysis

We start this section by defining Δt. Assume all agents are truthful up to time t. Let Δt be the maximum, over all agents i, of the difference between μit and its estimate using only the reports made during the explore phase. Because the accuracy of the learning algorithm is monotone, for a truthful agent i and any time T ≥ t we have:

\[ E[|\hat\mu_{it}(T) - \mu_{it}|] \;\leq\; E[|\hat\mu_{it}(t) - \mu_{it}|] \;\leq\; E[\Delta_t] \qquad (1) \]

In the inequality above, and in the rest of the paper, the expectations of μ̂it are taken over the evolution of the uit's and the random choices of the mechanism. For simplicity of notation, we omit the terms that denote such conditioning.

In this section, we will relate the performance of the mechanism to the estimation error of the learning algorithm. We start with the individual rationality aspects of the mechanism. Then we show that if Δt is small, agents cannot gain much by deviating from the truthful strategy. We also bound the efficiency loss in terms of Δt.

Theorem 1 For a truthful agent i up to time T, in expectation, the amount that i may be overcharged for the items she received during exploitation is bounded by the total estimation error of the algorithm, i.e.,

\[ E\Big[\sum_{t=1}^{T} y_{it}u_{it}\Big] - E\Big[\sum_{t=1}^{T} p_{it}\Big] \;\geq\; -E\Big[\sum_{t=1}^{T}\Delta_t\Big] \]

Proof :  In fact, we prove a stronger result by showing:

\[ E\Big[\sum_{t=1}^{T} p_{it}\Big] - E\Big[\sum_{t=1}^{T} y_{it}u_{it}\Big] \;\leq\; E\Big[\sum_{t=1}^{T-1} y_{it}\big(\hat\gamma_t(t)-\mu_{it}\big)^{+}\Big] \;\leq\; E\Big[\sum_{t=1}^{T}\Delta_t\Big] \qquad (2) \]
where (z)^+ = max{0, z}.

If yit = 0 then pit = 0. Also, recall that for every time t, because expectations are computed conditioned on the complete history, E[uit] = E[μit]. By the payment rule we have:

\begin{align*}
E\Big[\sum_{t=1}^{T} p_{it}\Big] - E\Big[\sum_{t=1}^{T} y_{it}u_{it}\Big]
&\leq E\Big[\sum_{t=1}^{T-1} y_{it}\big(\min\{\hat\gamma_t(t),\hat\gamma_t(T)\} - \mu_{it}\big)\Big] \\
&\leq E\Big[\sum_{t=1}^{T-1} y_{it}\big(\hat\gamma_t(t) - \mu_{it}\big)\Big] \\
&\leq E\Big[\sum_{t=1}^{T-1} \big(\hat\gamma_t(t) - \mu_{it}\big)^{+}\Big]
\end{align*}
Since yit = 1 indicates that the mechanism allocated the item to agent i at time t, we have γ̂t(t) ≤ μ̂it(t). Plugging this into the inequality above, we get:
\begin{align*}
E\Big[\sum_{t=1}^{T} p_{it}\Big] - E\Big[\sum_{t=1}^{T} y_{it}u_{it}\Big]
&\leq E\Big[\sum_{t=1}^{T-1} \big(\hat\mu_{it}(t) - \mu_{it}\big)^{+}\Big] \\
&\leq E\Big[\sum_{t=1}^{T-1}\Delta_t\Big]
\end{align*}
The last inequality follows from (1).

Now we study the incentive compatibility aspects of the mechanism.

Theorem 2 Suppose all agents are truthful except agent i who is intending to deviate from the truthful strategy. Agent i cannot increase her expected utility up to time T by more than:

\[ 8E\Big[\sum_{t=1}^{T}\Delta_t\Big] + E\Big[\max_{1\leq t\leq T}\{\mu_{it}\}\Big] \qquad (3) \]

Proof :  We bound the expected profit that i could obtain, by deviating from the truthful strategy, from the items she receives during exploitation up to time T. The term E[max_{1≤t≤T}{μit}] in the expression above bounds the outstanding payment of agent i; recall that agents do not pay for the last item they receive during exploitation. Also, note that the exploration rate is independent of the strategy of the agents. Therefore, without loss of generality, we assume that at time T agent i has received an item during exploitation.

Let S be the strategy that i deviates to. Fixing the evolution of all ujt's, 1 ≤ j ≤ n, and all random choices of the mechanism, let DT be the set of times at which i receives the item under strategy S during the exploit phase and before time T. Formally, DT = {t < T | yit = 1, if the strategy of i is S}. Similarly, let CT = {t < T | yit = 1, if i is truthful}. Also, let μ̂′it and γ̂′t denote the estimates of the mechanism when the strategy of i is S. The expected profit i could obtain under strategy S from the items she received during exploitation, up to time T − 1, is equal to:

\begin{align*}
E\Big[\sum_{t\in D_T} \mu_{it} - \min\{\hat\gamma'_t(t),\hat\gamma'_t(T)\}\Big]
= {}& E\Big[\sum_{t\in D_T\cap C_T} \mu_{it} - \min\{\hat\gamma'_t(t),\hat\gamma'_t(T)\}\Big] \\
&+ E\Big[\sum_{t\in D_T\setminus C_T} \mu_{it} - \min\{\hat\gamma'_t(t),\hat\gamma'_t(T)\}\Big] \qquad (4)
\end{align*}

For each time t ≥ 1, we examine two cases:

  1. The first case is when t ∈ DT ∩ CT. We will show that in these cases the “current price” faced by agent i under the two scenarios is close. To this aim, we first observe:
\[ \min\{\hat\gamma'_t(t),\hat\gamma'_t(T)\} = \hat\gamma_t(t) + \min\{\hat\gamma'_t(t)-\hat\gamma_t(t),\ \hat\gamma'_t(T)-\hat\gamma_t(t)\} \]

    Let j ≠ i be the agent with the highest μ̂jt(t), i.e., μ̂jt(t) = max_{j≠i}{μ̂jt(t)} = γ̂t(t). Because i is the winner at time t both in DT and in CT, by the definition of γ̂′t we have μ̂′jt(t) ≤ γ̂′t(t) and μ̂′jt(T) ≤ γ̂′t(T). Plugging these into the equality above we get:

\begin{align*}
\min\{\hat\gamma'_t(t),\hat\gamma'_t(T)\}
&\geq \hat\gamma_t(t) + \min\{\hat\mu'_{jt}(t)-\hat\mu_{jt}(t),\ \hat\mu'_{jt}(T)-\hat\mu_{jt}(t)\} \\
&\geq \hat\gamma_t(t) - |\hat\mu'_{jt}(t)-\hat\mu_{jt}(t)| - |\hat\mu_{jt}(t)-\hat\mu'_{jt}(T)| \\
&= \hat\gamma_t(t) - |\hat\mu_{jt}(t)-\mu_{jt}+\mu_{jt}-\hat\mu'_{jt}(t)| - |\hat\mu_{jt}(t)-\mu_{jt}+\mu_{jt}-\hat\mu'_{jt}(T)| \\
&\geq \hat\gamma_t(t) - 2|\hat\mu_{jt}(t)-\mu_{jt}| - |\mu_{jt}-\hat\mu'_{jt}(t)| - |\mu_{jt}-\hat\mu'_{jt}(T)|
\end{align*}

    Taking expectations of both sides we have:

\begin{align*}
E\big[\min\{\hat\gamma'_t(t),\hat\gamma'_t(T)\}\, I(t\in D_T\cap C_T)\big]
&\geq E\big[\hat\gamma_t(t)\, I(t\in D_T\cap C_T)\big] - E\big[2|\hat\mu_{jt}(t)-\mu_{jt}|\, I(t\in D_T\cap C_T)\big] \\
&\quad - E\big[\big(|\mu_{jt}-\hat\mu'_{jt}(t)| + |\mu_{jt}-\hat\mu'_{jt}(T)|\big)\, I(t\in D_T\cap C_T)\big] \\
&\geq E\big[\hat\gamma_t(t)\, I(t\in D_T\cap C_T)\big] \\
&\quad - E\big[2|\hat\mu_{jt}(t)-\mu_{jt}| + |\mu_{jt}-\hat\mu'_{jt}(t)| + |\mu_{jt}-\hat\mu'_{jt}(T)|\big]
\end{align*}

    Because agent j is truthful, by (1), we get:

\[ E\big[\min\{\hat\gamma'_t(t),\hat\gamma'_t(T)\}\, I(t\in D_T\cap C_T)\big] \;\geq\; E\big[\hat\gamma_t(t)\, I(t\in D_T\cap C_T)\big] - 4E[\Delta_t] \qquad (5) \]
  2. The second case is when t ∈ DT \ CT. We show that in these cases agent i cannot increase her “profit” by much.

    Let j be the agent who would receive the item at time t when agent i is truthful. Therefore μ̂jt(t) ≥ μ̂it(t). Also, μ̂jt(t) ≤ max_{j′≠i}{μ̂j′t(t)} = γ̂′t(t), and similarly μ̂jt(T) ≤ γ̂′t(T). Hence:

\begin{align*}
\mu_{it} - \min\{\hat\gamma'_t(t),\hat\gamma'_t(T)\}
&\leq \hat\mu_{it}(t) + |\mu_{it}-\hat\mu_{it}(t)| - \min\{\hat\gamma'_t(t),\hat\gamma'_t(T)\} \\
&\leq \hat\mu_{jt}(t) + |\mu_{it}-\hat\mu_{it}(t)| - \min\{\hat\mu_{jt}(t),\hat\mu_{jt}(T)\} \\
&= \max\{0,\ \hat\mu_{jt}(t)-\hat\mu_{jt}(T)\} + |\mu_{it}-\hat\mu_{it}(t)| \\
&\leq |\hat\mu_{jt}(t)-\hat\mu_{jt}(T)| + |\mu_{it}-\hat\mu_{it}(t)| \\
&\leq |\hat\mu_{jt}(t)-\mu_{jt}| + |\mu_{jt}-\hat\mu_{jt}(T)| + |\mu_{it}-\hat\mu_{it}(t)|
\end{align*}

    We sum the inequality above over all t ∈ DT \ CT. Because all agents are truthful, taking expectations of both sides and applying (1), we get:

\[ E\Big[\sum_{t\in D_T\setminus C_T} \mu_{it} - \min\{\hat\gamma'_t(t),\hat\gamma'_t(T)\}\Big] \;\leq\; 3E\Big[\sum_{t=1}^{T}\Delta_t\Big] \qquad (6) \]

Substituting inequalities (5) and (6) into (4):

\begin{align*}
E\Big[\sum_{t=1}^{T-1} y_{it}u_{it} - p_{it}\Big]
&\leq E\Big[\sum_{t\in D_T\cap C_T} \mu_{it} - \hat\gamma_t(t)\Big] + 7E\Big[\sum_{t=1}^{T}\Delta_t\Big] \\
&= E\Big[\sum_{t\in C_T} \mu_{it} - \hat\gamma_t(t)\Big] - E\Big[\sum_{t\in C_T\setminus D_T} \mu_{it} - \hat\gamma_t(t)\Big] + 7E\Big[\sum_{t=1}^{T}\Delta_t\Big]
\end{align*}

By inequality (2), −E[Σ_{t∈CT\DT} μit − γ̂t(t)] ≤ E[Σ_{t=1}^{T} Δt]. Therefore:

\[ E\Big[\sum_{t=1}^{T-1} y_{it}u_{it} - p_{it}\Big] - E\Big[\sum_{t\in C_T} u_{it} - p_{it}\Big] \;\leq\; 8E\Big[\sum_{t=1}^{T}\Delta_t\Big] \]
which completes the proof.

We now compare the welfare of our mechanism to that of the efficient mechanism that at every time allocates the item to the agent with the highest expected utility. The expected loss of efficiency during exploration is equal to E[Σ_{t=1}^{T} η(t) maxi{μit}]. In the next theorem, we show that in equilibrium, the efficiency loss during the exploit phase is bounded by a factor of the total estimation error of the learning algorithm.

Theorem 3 Let W(T) denote the expected welfare of mechanism M between time 1 and T. If all the agents are truthful, we have:

\[ W(T) \;\geq\; E\Big[\sum_{t=1}^{T}\max_i\{\mu_{it}\}\Big] - E\Big[\sum_{t=1}^{T}\eta(t)\max_i\{\mu_{it}\} + 2\Delta_t\Big] \]

Proof :  There are two reasons for the loss of efficiency of the mechanism. The first is the loss in welfare during exploration, when the item is allocated randomly to one of the agents. The expected loss in this case is equal to E[Σ_{t=1}^{T} η(t) maxi{μit}].

The other reason is mistakes during exploitation. The error in the estimates may lead to an allocation to an agent who does not value the item the most. Suppose at time t, during exploitation, the mechanism allocated the item to agent j instead of i, i.e., μit > μjt. By the rule of the mechanism we have μ̂jt(t) ≥ μ̂it(t). Subtracting this nonnegative difference, we get:

\[ \mu_{jt} - \mu_{it} \;\geq\; \mu_{jt} - \mu_{it} - \big(\hat\mu_{jt}(t) - \hat\mu_{it}(t)\big) = \big(\mu_{jt} - \hat\mu_{jt}(t)\big) + \big(\hat\mu_{it}(t) - \mu_{it}\big) \]

We sum up this inequality over all such times t, and by inequality (1), we observe that the expected efficiency loss during exploitation is bounded by 2E[Σ_{t=1}^{T} Δt].

Therefore, for the expected welfare of M between time 1 and T we have:

\[ E\Big[\sum_{t=1}^{T}\max_i\{\mu_{it}\}\Big] - W(T) \;\leq\; E\Big[\sum_{t=1}^{T}\eta(t)\max_i\{\mu_{it}\} + 2\Delta_t\Big] \]

4.1 Sufficient Conditions

In this section, using the theorems of the previous section, we give sufficient conditions on the learning algorithm that guarantee asymptotic ex-ante individual rationality, incentive compatibility, and efficiency.

Theorem 4 Suppose the learning algorithm satisfies, for all 1 ≤ i ≤ n and T > 0:
\[ (C1)\qquad E\Big[\max_{1\leq t\leq T}\{\mu_{it}\} + \sum_{t=1}^{T}\Delta_t\Big] = o\Big(E\Big[\sum_{t=1}^{T}\eta(t)\mu_{it}\Big]\Big) \]
Then mechanism M is asymptotically ex-ante individually rational and asymptotically incentive compatible. If, in addition to (C1), the following condition holds
\[ (C2)\qquad E\Big[\sum_{t=1}^{T}\eta(t)\max_i\{\mu_{it}\}\Big] = o\Big(E\Big[\sum_{t=1}^{T}\max_i\{\mu_{it}\}\Big]\Big) \]
then M is asymptotically ex-ante efficient.

Before stating the proof, we observe a natural trade-off between exploitation and exploration rates in our context: higher exploration rates lead to more accurate estimates of the utilities of the agents. Condition (C1) provides us with a lower bound on the exploration rate. On the flip side, condition (C2) gives an upper bound. In the following sections, we will show with two examples how conditions (C1) and (C2) can be used to adjust the exploration rate of a learning algorithm in order to obtain efficiency and incentive compatibility.

Proof :  The expected utility of a truthful agent i up to time T is equal to:

\[ E\Big[\sum_{t=1}^{T} x_{it}u_{it}\Big] \;=\; \frac{1}{n}E\Big[\sum_{t=1}^{T}\eta(t)\mu_{it}\Big] + E\Big[\sum_{t=1}^{T} y_{it}\mu_{it}\Big] \]

Subtracting E[ t=1T pit] from both sides, by Theorem 1 we get:

\begin{align*}
E\Big[\sum_{t=1}^{T} x_{it}u_{it}\Big] - E\Big[\sum_{t=1}^{T} p_{it}\Big]
&= \frac{1}{n}E\Big[\sum_{t=1}^{T}\eta(t)\mu_{it}\Big] + \Big(E\Big[\sum_{t=1}^{T} y_{it}\mu_{it}\Big] - E\Big[\sum_{t=1}^{T} p_{it}\Big]\Big) \\
&\geq \frac{1}{n}E\Big[\sum_{t=1}^{T}\eta(t)\mu_{it}\Big] - E\Big[\sum_{t=1}^{T}\Delta_t\Big]
\end{align*}

Plugging condition (C1) into the equation above yields:

\[ E\Big[\sum_{t=1}^{T} x_{it}u_{it}\Big] - E\Big[\sum_{t=1}^{T} p_{it}\Big] \;\geq\; \Big(\frac{1}{n} - o(1)\Big)E\Big[\sum_{t=1}^{T}\eta(t)\mu_{it}\Big] \qquad (7) \]

Therefore, the mechanism is asymptotically ex-ante individually rational. Moreover, inequality (7) implies that the utility of agent i is Ω(E[Σ_{t=1}^{T} η(t)μit]). Thus, by Theorem 2, if (C1) holds, then the mechanism is asymptotically incentive compatible.

To prove the claim about the efficiency of the mechanism, we invoke Theorem 3. By this theorem and condition (C1) we have:

\begin{align*}
W(T) &\geq E\Big[\sum_{t=1}^{T}\max_i\{\mu_{it}\}\Big] - E\Big[\sum_{t=1}^{T}\eta(t)\max_i\{\mu_{it}\} + 2\Delta_t\Big] \\
&\geq E\Big[\sum_{t=1}^{T}\max_i\{\mu_{it}\}\Big] - (1+o(1))\,E\Big[\sum_{t=1}^{T}\eta(t)\max_i\{\mu_{it}\}\Big]
\end{align*}

Plugging condition (C2) into the equation above we get:

\begin{align*}
E\Big[\sum_{t=1}^{T}\max_i\{\mu_{it}\}\Big] - W(T)
&= O\Big(E\Big[\sum_{t=1}^{T}\eta(t)\max_i\{\mu_{it}\}\Big]\Big) \\
&= o\Big(E\Big[\sum_{t=1}^{T}\max_i\{\mu_{it}\}\Big]\Big)
\end{align*}
which implies asymptotic ex-ante efficiency.

The theorem above shows that under some assumptions, the welfare obtained by the mechanism is asymptotically equivalent to that of the efficient mechanism that at every time allocates the item to the agent with the highest expected utility. We now give a similar condition for the revenue guarantee of the mechanism.

Theorem 5 If in addition to (C1), the following condition holds

\[ (C3)\qquad E\Big[\sum_{t=1}^{T}\eta(t)\max_i\{\mu_{it}\}\Big] = o\Big(E\Big[\sum_{t=1}^{T}\gamma_t\Big]\Big) \]
then the revenue of the mechanism is asymptotically equivalent to the revenue of the efficient mechanism that at every time allocates the item to the agent with the highest expected utility and charges the winning agent the second highest expected utility.

Proof :  There are three reasons for loss of revenue. The first is the loss during exploration, which is equal to E[Σ_{t=1}^{T} η(t)γt] ≤ E[Σ_{t=1}^{T} η(t) maxi{μit}].

Another reason is the estimation error of γt. Let i be the agent who has received the item at time t. We consider two cases:

  1. If i is the agent with the second highest expected utility, then let j be the agent with the highest expected utility. The estimation error of γt is equal to γt − min{γ̂t(t), γ̂t(T)}, and
\begin{align*}
\gamma_t - \min\{\hat\gamma_t(t),\hat\gamma_t(T)\}
&\leq \mu_{it} - \min\{\hat\mu_{jt}(t),\hat\mu_{jt}(T)\} \\
&\leq \mu_{jt} - \min\{\hat\mu_{jt}(t),\hat\mu_{jt}(T)\} \\
&\leq \max\{\mu_{jt}-\hat\mu_{jt}(t),\ \mu_{jt}-\hat\mu_{jt}(T)\} \\
&\leq |\mu_{jt}-\hat\mu_{jt}(t)| + |\mu_{jt}-\hat\mu_{jt}(T)|
\end{align*}
    Therefore in this case, by inequality (1), the expected estimation error of γt is bounded by 2E[Δt].
  2. Otherwise, let j be the agent with the second highest expected utility. Then
\[ \gamma_t - \min\{\hat\gamma_t(t),\hat\gamma_t(T)\} \;\leq\; \mu_{jt} - \min\{\hat\mu_{jt}(t),\hat\mu_{jt}(T)\} \]
    Similar to the previous case, we have:
\[ \gamma_t - \min\{\hat\gamma_t(t),\hat\gamma_t(T)\} \;\leq\; |\mu_{jt}-\hat\mu_{jt}(t)| + |\mu_{jt}-\hat\mu_{jt}(T)| \]
    which bounds the expected estimation error of γt by 2E[Δt].

The third reason contributing to the loss of revenue is the outstanding payments. Agents do not pay for the last item they receive during exploitation. These outstanding payments contribute a loss that is bounded by n·E[max_{1≤t≤T} γt].

For the expected revenue of the mechanism we have:

\begin{align*}
R(T) &= E\Big[\sum_{t=1}^{T}\sum_{i=1}^{n} p_{it}\Big] \\
&\geq E\Big[\sum_{t=1}^{T}(1-\eta(t))\min\{\hat\gamma_t(T),\hat\gamma_t(t)\}\Big] - n\cdot E\Big[\max_{t\leq T}\gamma_t\Big] \\
&\geq E\Big[\sum_{t=1}^{T}(1-\eta(t))(\gamma_t - 4\Delta_t)\Big] - n\cdot E\Big[\max_{t\leq T}\gamma_t\Big] \\
&\geq E\Big[\sum_{t=1}^{T}\gamma_t\Big] - E\Big[\sum_{t=1}^{T}\eta(t)\gamma_t\Big] - E\Big[\sum_{t=1}^{T}4\Delta_t\Big] - n\cdot E\Big[\max_{t\leq T}\gamma_t\Big]
\end{align*}

Plugging in condition (C1), we get:

\[ R(T) \;\geq\; E\Big[\sum_{t=1}^{T}\gamma_t\Big] - (1+o(1))\,E\Big[\sum_{t=1}^{T}\eta(t)\max_i\{\mu_{it}\}\Big] \]

Therefore, by condition (C3)

\[ E\Big[\sum_{t=1}^{T}\gamma_t\Big] - R(T) \;=\; o\Big(E\Big[\sum_{t=1}^{T}\max_i\{\mu_{it}\}\Big]\Big) \]

5 Allowing agents to bid

In mechanism M no agent explicitly bids for an item. Whether an agent receives an item or not depends on the history of her reported utilities and the estimates that M forms from them. This may be advantageous when the bidders themselves are unaware of what their utilities will be. However, when agents possess a better estimate of their utilities, we would like to make use of it. For this reason we describe how to modify M so as to allow agents to bid for an item.

If time t occurs during an exploit phase, let Bt be the set of agents who bid at this time. The mechanism bids on behalf of every agent i ∉ Bt. Denote by bit the bid of agent i ∈ Bt for the item at time t. The modification of M sets bit = μ̂it(t) for i ∉ Bt. Then, the item is allocated at random to one of the agents in argmaxi{bit}.

If i is the agent who received the item at time t, let A = {bjt | j ∈ Bt} ∪ {μjt | j ∉ Bt}. Define γt as the second highest value in A. Let γ̂t(T) be equal to max_{j≠i}{bjt}. The payment of agent i will be

\[ p_{it} \leftarrow \sum_{k=1}^{t-1} y_{ik}\min\{\hat\gamma_k(t), b_{ik}\} - \sum_{k=1}^{t-1} p_{ik}. \]

To incorporate the fact that bidders can bid for an item, we must modify the definition of truthfulness.

Definition 2 Agent i is truthful if:

  1. rit = uit, for all times t ≥ 1 such that xit = 1.
  2. If i bids at time t, then E[|bit − μit|] ≤ E[|μ̂it(t) − μit|].

Note that item 2 does not require agent i to bid her actual expected utility, only that her bid be closer to the mark than the mechanism's estimate. With this modification of the definition, Theorems 4 and 5 continue to hold.
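The allocation step of this modification can be sketched as follows; the function and variable names are ours, and we assume the mechanism simply substitutes its own current estimate for every agent who does not bid.

    def allocate_with_bids(bids, mu_hat_t):
        # `bids` maps an agent index in B_t to her bid b_it; `mu_hat_t[j]` is the
        # mechanism's current estimate for agent j. For j not in B_t the mechanism
        # bids mu_hat_t[j] on her behalf; the item goes to a highest effective bid.
        n = len(mu_hat_t)
        effective = [bids.get(j, mu_hat_t[j]) for j in range(n)]
        winner = max(range(n), key=lambda j: effective[j])
        return winner, effective

The payment rule is then unchanged except that each past exploit win is re-priced at min{γ̂k(t), bik}, as in the display above.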

6 Independent and Identically-Distributed Utilities

In this section, we assume that for each i, the uit's are independent and identically-distributed random variables. For simplicity, we define μi = E[uit], t > 0. Without loss of generality, we also assume 0 < μi ≤ 1.

In this environment, the learning algorithm we use is an ε-greedy algorithm for the multi-armed bandit problem. Let nit = Σ_{k=1}^{t−1} xik denote the number of items allocated to agent i before time t. For ϵ ∈ (0,1), we define:
\[ \eta_\epsilon(t) = \min\{1,\ n\, t^{-\epsilon}\ln^{1+\epsilon} t\} \]
\[ \hat\mu_{it}(T) = \begin{cases} \Big(\sum_{k=1}^{T-1} x_{ik} r_{ik}\Big)\Big/ n_{iT}, & n_{iT} > 0 \\[4pt] 0, & n_{iT} = 0 \end{cases} \]
Call the mechanism based on this learning algorithm Mϵ(iid).
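A direct transcription of these two definitions (the guard at t = 1, where ln t = 0 makes the formula degenerate, is our assumption; the names are illustrative):

    import math

    def eta_iid(t, n, eps):
        # exploration probability eta_eps(t) = min{1, n * t^(-eps) * ln^(1+eps) t}
        if t < 2:
            return 1.0
        return min(1.0, n * t ** (-eps) * math.log(t) ** (1 + eps))

    def mu_hat_iid(observed_reports):
        # empirical average of the reports observed so far (times with x_ik = 1),
        # or 0 if agent i has not yet received any item
        return sum(observed_reports) / len(observed_reports) if observed_reports else 0.0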

Lemma 6 If all agents are truthful, then, under Mϵ(iid),
\[ E[\Delta_t] = O\Big(\frac{1}{\sqrt{t^{1-\epsilon}}}\Big). \]

The proof of this lemma is given in appendix A.

We show that Mϵ(iid), for ϵ ≤ 1/3, satisfies all the desired properties we discussed in the previous section. Moreover, it satisfies a stronger notion of individual rationality: Mϵ(iid) is ex-post individually rational, i.e., for any agent i and all T ≥ 1:
\[ \sum_{t=1}^{T} p_{it} \;\leq\; \sum_{t=1}^{T} x_{it} r_{it} \]

Theorem 7 Mϵ(iid) is ex-post individually rational. Also, for 0 < ϵ ≤ 1/3, Mϵ(iid) is asymptotically incentive compatible and ex-ante efficient.

Proof :  We first prove ex-post individual rationality. It is sufficient to prove it for the periods in which agent i has received the item during an exploit phase. For T such that yiT = 1, we have:
\begin{align*}
\sum_{t=1}^{T} p_{it} &= \sum_{t=1}^{T-1} y_{it}\min\{\hat\gamma_t(t),\hat\gamma_t(T)\} \\
&\leq \sum_{t=1}^{T-1} y_{it}\,\hat\gamma_t(T) \;\leq\; \sum_{t=1}^{T-1} y_{it}\,\hat\mu_{iT}(T) \\
&\leq n_{iT}\,\hat\mu_{iT}(T) \;=\; \sum_{t=1}^{T-1} x_{it} r_{it}
\end{align*}
The second inequality follows because the item is allocated to i at time T, which implies μ̂iT(T) ≥ γ̂t(T); here we use that in this setting the estimates at time T do not depend on t. We complete the proof by showing that conditions (C1) and (C2) hold. Note that μi ≤ 1. By Lemma 6, for ϵ ≤ 1/3:

\[ E\Big[1 + \sum_{t=1}^{T-1}\Delta_t\Big] = O\big(T^{\frac{1+\epsilon}{2}}\big) = o\big(T^{1-\epsilon}\ln^{1+\epsilon}T\big) = O\Big(E\Big[\sum_{t=1}^{T}\eta_\epsilon(t)\,\mu_i\Big]\Big). \]
Therefore, (C1) holds.

The welfare of any mechanism between time 1 and T is bounded by T. For any ϵ > 0, E[1 + Σ_{t=1}^{T−1} Δt + ηϵ(t)] = o(T), which implies (C2).

7 Brownian Motion

In this section, we assume that for each i, 1 ≤ i ≤ n, the evolution of μit is a reflected Brownian motion with mean zero and variance σi²; the reflection barrier is 0. In addition, we assume μi0 = 0 and σi² ≤ σ², for some constant σ. The mechanism observes the values of μit at discrete times t = 1,2,⋅⋅⋅.

In this environment our learning algorithm estimates the reflected Brownian motion using a mean-zero martingale. We define lit as the last time up to time t at which the item was allocated to agent i. This includes both explore and exploit phases. If i has not yet been allocated any item, lit is zero.

\[ \eta_\epsilon(t) = \min\{1,\ n\, t^{-\epsilon}\ln^{2+2\epsilon} t\} \qquad (8) \]
\[ \hat\mu_{it}(T) = \begin{cases} r_{i,l_{it}} & t < T \\ r_{i,l_{i,t-1}} & t = T \\ r_{i,l_{iT}} & t > T \end{cases} \qquad (9) \]
Call this mechanism Mϵ(B). For simplicity, we assume that the agent reports the exact value of μit. It is not difficult to verify that the results in this section hold as long as the expected error of these estimates at time t is o(t^{1/6}).
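A sketch of this estimator and exploration rate, assuming each agent's allocation history is stored as time-ordered (allocation time, reported value) pairs; the cutoff logic mirrors the three cases of equation (9), and the names are ours:

    import math

    def eta_brownian(t, n, eps):
        # exploration probability eta_eps(t) = min{1, n * t^(-eps) * ln^(2+2eps) t}, eq. (8)
        if t < 2:
            return 1.0
        return min(1.0, n * t ** (-eps) * math.log(t) ** (2 + 2 * eps))

    def mu_hat_brownian(history, t, T):
        # history: time-ordered list of (k, r_ik) pairs for agent i. Returns the
        # report at the last allocation before the cutoff of eq. (9): l_it if
        # t < T, l_{i,t-1} if t = T, and l_{iT} if t > T; 0 if no such allocation.
        cutoff = t if t < T else (t - 1 if t == T else T)
        past = [r for (k, r) in history if k <= cutoff]
        return past[-1] if past else 0.0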

We begin analyzing the mechanism by stating some well-known properties of reflected Brownian motions (see [7]).

Proposition 8 Let [Wt, t ≥ 0] be a reflected Brownian motion with mean zero and variance σ², with the reflection barrier at 0. Assume the value of Wt at time t is equal to y:
\[ E[y] = \Theta\big(\sqrt{t\sigma^2}\big) \qquad (10) \]
For T > 0, let z = W_{t+T}. For the probability density function of z − y we have:
\[ \Pr[(z-y)\in dx] \;\leq\; \sqrt{\frac{2}{\pi T\sigma^2}}\; e^{-\frac{x^2}{2T\sigma^2}} \qquad (11) \]
\[ \Pr[|z-y|\geq x] \;\leq\; \sqrt{\frac{8T\sigma^2}{\pi}}\;\frac{1}{x}\; e^{-\frac{x^2}{2T\sigma^2}} \qquad (12) \]
\[ E\big[|z-y|\, I(|z-y|\geq x)\big] \;\leq\; \sqrt{\frac{8T\sigma^2}{\pi}}\; e^{-\frac{x^2}{2T\sigma^2}} \qquad (13) \]

Corollary 9 The expected value of the maximum of μiT, 1 ≤ i ≤ n, is Θ(√T).

Note that in the corollary above, n and σ are constants. Now, similar to Lemma 6, we bound E[ΔT]. The proof is given in Appendix B.

Lemma 10 Suppose that under Mϵ(B) all agents are truthful until time T. Then E[ΔT] = O(T^{ϵ/2}).

Now we are ready to prove the main theorem of this section:

Theorem 11 Mϵ(B) is ex-post individually rational. Also, for 0 < ϵ ≤ 1/3, Mϵ(B) is asymptotically incentive compatible and ex-ante efficient.

Proof :  We first prove ex-post individual rationality. It is sufficient to prove it for the periods in which agent i has received the item during an exploit phase. For T such that yiT = 1, we have:
\begin{align*}
\sum_{t=1}^{T} p_{it} &= \sum_{t=1}^{T-1} y_{it}\min\{\hat\gamma_t(t),\hat\gamma_t(T)\} \\
&\leq \sum_{t=1}^{T-1} y_{it}\,\hat\mu_{it}(t) \;=\; \sum_{t=1}^{T-1} y_{it}\, r_{i,l_{i,t-1}} \\
&\leq \sum_{t=1}^{T} x_{it} r_{it}.
\end{align*}

We complete the proof by showing that conditions (C1) and (C2) hold. By (10), the expected utility of each agent at time t from random exploration is
\[ \Theta\big(\sqrt{t\sigma^2}\; t^{-\epsilon}\ln^{2+2\epsilon} t\big) = \Theta\big(t^{\frac{1}{2}-\epsilon}\ln^{2+2\epsilon} t\big). \]
Therefore, the expected utility up to time T from exploration is Θ(T^{3/2−ϵ} ln^{2+2ϵ} T). By Lemma 10 and Corollary 9:
\[ E\Big[\max_{1\leq t\leq T}\{\mu_{it}\} + \sum_{t=1}^{T-1}\Delta_t\Big] = O\big(T^{1+\frac{\epsilon}{2}}\big). \]
For ϵ ≤ 1/3 we have 3/2 − ϵ ≥ 1 + ϵ/2, and this yields condition (C1).

By Corollary 9, the expected values of maxi{μiT} and γT are Θ(√T). Therefore, the expected welfare of an efficient mechanism between time 1 and T is Θ(T^{3/2}). For any 0 < ϵ < 1, we have:
\[ \Theta\big(T^{\frac{3}{2}}\big) = \omega\big(T^{\frac{3}{2}-\epsilon}\ln^{2+2\epsilon}T + T^{1+\frac{\epsilon}{2}}\big) \]
Therefore condition (C2) holds, and Mϵ(B) is asymptotically ex-ante efficient.

To apply this model to sponsored search, we treat each item as a bundle of search queries. Each time step is defined by the arrival of m queries. The mechanism allocates all m queries to an agent and, after that, the advertiser reports the average utility for these queries. The payment pit is now the price per item, i.e., the advertiser pays m·pit for the bundle of queries. The value of m is chosen such that μit can be estimated with high accuracy.

8 Discussion and Open Problems

In this section we discuss some extensions of the mechanisms.

Multiple Slots To modify M so that it can accommodate multiple slots, we borrow from Gonen and Pavlov [13], who assume there exists a set of conditional distributions which determine the probability that the ad in slot j1 is clicked conditional on the ad in slot j2 being clicked. During the exploit phase, M allocates the slots to the advertisers with the highest expected utilities, and the prices are determined according to Holmstrom's lemma ([20]; see also [1]). The estimates of the utilities are updated based on the reports, using the conditional distributions.

Delayed Reports In some applications, the value of receiving the item is realized at some later date. For example, a user clicks on an ad and visits the website of the advertiser. A couple of days later, she returns to the website and completes a transaction. It is not difficult to adjust the mechanism to accommodate this setting by allowing the advertiser to report with a delay or change her report later.

Creating Multiple Identities When a new advertiser joins the system, in order to learn her utility value our mechanism gives her a few items for free in the explore phase. Therefore our mechanism is vulnerable to advertisers who can create several identities and join the system. It is not clear whether creating a new identity is cheap in our context, because the traffic generated by advertising should eventually be routed to a legitimate business. Still, one way to avoid this problem is to charge users without a reliable history using CPC.

Acknowledgment. We would like to thank Arash Asadpour, Peter Glynn, Ashish Goel, Ramesh Johari, and Thomas Weber for fruitful discussions. The second author acknowledges the support from NSF and a gift from Google.

References

[1]   G. Aggarwal, A. Goel, and R. Motwani. Truthful auctions for pricing search keywords. Proceedings of ACM conference on Electronic Commerce, 2006.

[2]   S. Athey, and I. Segal. An Efficient Dynamic Mechanism. manuscript, 2007.

[3]   P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning archive, Volume 47 , Issue 2-3, 235-256, 2002.

[4]   A. Bapna, and T. Weber. Efficient Dynamic Allocation with Uncertain Valuations. Working Paper, 2006.

[5]   M. Balcan, A. Blum, J. Hartline, and Y. Mansour. Mechanism Design via Machine Learning. Proceedings of 46th Annual IEEE Symposium on Foundations of Computer Science, 2005.

[6]   D. Bergemann, and J. Välimäki. Efficient Dynamic Auctions. Proceedings of Third Workshop on Sponsored Search Auctions, 2007.

[7]   A. Borodin, and P. Salminen. Handbook of Brownian Motion: Facts and Formulae. Springer, 2002.

[8]   A. Blum, V. Kumar, A. Rudra, and F. Wu. Online Learning in Online Auctions. Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete Algorithms, 2003.

[9]   R. Cavallo, D. Parkes, and S. Singh, Efficient Online Mechanism for Persistent, Periodically Inaccessible Self-Interested Agents. Working Paper, 2007.

[10]   K. Crawford. Google CFO: Fraud A Big Threat. CNN/Money, December 2, 2004.

[11]   E. Elkind. Designing And Learning Optimal Finite Support Auctions. Proceedings of ACM-SIAM Symposium on Discrete Algorithms, 2007.

[12]   J. Gittins. Multi-Armed Bandit Allocation Indices. Wiley, New York, NY, 1989.

[13]   R. Gonen, and E. Pavlov. An Incentive-Compatible Multi-Armed Bandit Mechanism. Proceedings of the Twenty-Sixth Annual ACM Symposium on Principles of Distributed Computing, 2007.

[14]   B. Grow, B. Elgin, and M. Herbst. Click Fraud: The dark side of online advertising. BusinessWeek. Cover Story, October 2, 2006.

[15]   N. Immorlica, K. Jain, M. Mahdian, and K. Talwar. Click Fraud Resistant Methods for Learning Click-Through Rates. Proceedings of the 1st Workshop on Internet and Network Economics, 2005.

[16]   B. Kitts, P. Laxminarayan, B. LeBlanc, and R. Meech. A Formal Analysis of Search Auctions Including Predictions on Click Fraud and Bidding Tactics. Workshop on Sponsored Search Auctions, 2005.

[17]   R. Kleinberg. Online Decision Problems With Large Strategy Sets. Ph.D. Thesis, MIT, 2005.

[18]   S. Lahaie, and D. Parkes. Applying Learning Algorithms to Preference Elicitation. Proceedings of the 5th ACM conference on Electronic Commerce, 2004.

[19]   M. Mahdian, and K. Tomak. Pay-per-action model for online advertising. Proceedings of the 3rd International Workshop on Internet and Network Economics, 549-557, 2007.

[20]   P. Milgrom, Putting Auction Theory to Work. Cambridge University Press, 2004.

[21]   D. Mitchell. Click Fraud and Halli-bloggers. New York Times, July 16, 2005.

[22]   N. Nisan, T. Roughgarden, E. Tardos, and V. Vazirani, editors. Algorithmic Game Theory, Cambridge University Press, 2007.

[23]   D. Parkes. Online Mechanisms. In Algorithmic Game Theory (Nisan et al., eds.), 2007.

[24]   B. Stone. When Mice Attack: Internet Scammers Steal Money with “Click Fraud”. Newsweek, January 24, 2005.

[25]   R. Wilson. Game-Theoretic Approaches to Trading Processes. Economic Theory: Fifth World Congress, ed. by T. Bewley, chap. 2, pp. 33-77, Cambridge University Press, Cambridge, 1987.

[26]   J. Wortman, Y. Vorobeychik, L. Li, and J. Langford. Maintaining Equilibria During Exploration in Sponsored Search Auctions. Proceedings of the 3rd International Workshop on Internet and Network Economics, 2007.

A Proof of Lemma 6 (Independent and Identically-Distributed Utilities)

Proof :  We prove the lemma by showing that for any agent i,

\[ \Pr\Big[|\mu_i - \hat\mu_{it}(t)| \geq \frac{\mu_i}{\sqrt{t^{1-\epsilon}}}\Big] = o\Big(\frac{1}{t^c}\Big),\quad \forall c > 0. \]

First, we estimate E[nit]. There exists a constant d such that:

\[ E[n_{it}] \;\geq\; \sum_{k=1}^{t-1}\frac{\eta_\epsilon(k)}{n} = \sum_{k=1}^{t-1}\min\Big\{\frac{1}{n},\ k^{-\epsilon}\ln^{1+\epsilon}k\Big\} \;>\; d\, t^{1-\epsilon}\ln^{1+\epsilon}t \]

By the Chernoff-Hoeffding bound:

\[ \Pr\Big[n_{it} \leq \frac{E[n_{it}]}{2}\Big] \;\leq\; e^{-\frac{d\, t^{1-\epsilon}\ln^{1+\epsilon}t}{8}}. \]

Inequality (1) and the Chernoff-Hoeffding bound imply:

\begin{align*}
\Pr\Big[|\mu_i - \hat\mu_{it}(t)| \geq \frac{\mu_i}{\sqrt{t^{1-\epsilon}}}\Big]
&= \Pr\Big[|\mu_i - \hat\mu_{it}(t)| \geq \frac{\mu_i}{\sqrt{t^{1-\epsilon}}} \,\wedge\, n_{it} \geq \frac{E[n_{it}]}{2}\Big] \\
&\quad + \Pr\Big[|\mu_i - \hat\mu_{it}(t)| \geq \frac{\mu_i}{\sqrt{t^{1-\epsilon}}} \,\wedge\, n_{it} < \frac{E[n_{it}]}{2}\Big] \\
&\leq 2e^{-\frac{\mu_i^2}{t^{1-\epsilon}}\cdot\frac{d\, t^{1-\epsilon}\ln^{1+\epsilon}t}{2}} + e^{-\frac{d\, t^{1-\epsilon}\ln^{1+\epsilon}t}{8}} \\
&= o\Big(\frac{1}{t^c}\Big),\quad \forall c > 0
\end{align*}
Therefore, with probability 1 − o(1/t), for all agents, Δt ≤ 1/√(t^{1−ϵ}). Since the maximum value of uit is 1, E[Δt] = O(1/√(t^{1−ϵ})).

B Proof of Lemma 10 (Brownian Motion)

Proof :  Define Xit = |μi,T − μi,T−t|, where t = T − li,T−1 is the time elapsed since the last allocation to agent i before T. We first prove that Pr[Xit > T^{ϵ/2}] = o(1/T^c) for all c > 0. There exists a constant Td such that, for any time T ≥ Td, the probability that i has not been allocated the item during exploration in the last t̄ steps is at most:
\[ \Pr[T - l_{i,T-1} > \bar{t}\,] \;<\; \big(1 - T^{-\epsilon}\ln^{2+2\epsilon}T\big)^{\bar{t}} \;\leq\; e^{-\frac{\bar{t}\,\ln^{2+2\epsilon}T}{T^{\epsilon}}}. \qquad (14) \]
Let t̄ = T^ϵ/ln^{1+ϵ}T. By equations (12) and (14),
\begin{align*}
\Pr\big[X_{it} > T^{\frac{\epsilon}{2}}\big]
&= \Pr\big[X_{it} > T^{\frac{\epsilon}{2}} \,\wedge\, T - l_{i,T-1} \leq \bar{t}\,\big] + \Pr\big[X_{it} > T^{\frac{\epsilon}{2}} \,\wedge\, T - l_{i,T-1} > \bar{t}\,\big] \\
&= o\Big(\frac{1}{T^c}\Big),\quad \forall c > 0.
\end{align*}

Hence, with high probability, Xit ≤ T^{ϵ/2} for all n agents. If for some of the agents Xit ≥ T^{ϵ/2}, then, by Corollary 9, the expected value of the maximum of μit over these agents is Θ(√T). Therefore, E[maxi{Xit}] = O(T^{ϵ/2}). The lemma follows because E[ΔT] ≤ E[maxi{Xit}].