Dynamic Cost-Per-Action Mechanisms
and Applications to Online Advertising

Hamid Nazerzadeh*
Amin Saberi*
Rakesh Vohra

March 6, 2008

*Management Science and Engineering Department, Stanford University, Stanford, CA 94305, email: {hamidnz,saberi}@stanford.edu
Department of Managerial Economics and Decision Sciences, Kellogg School of Management, Northwestern University, Evanston, IL 60208, email: {r-vohra@kellogg.northwestern.edu}

Abstract

We study the Cost-Per-Action or Cost-Per-Acquisition (CPA) charging scheme in online advertising. In this scheme, instead of paying per click, the advertisers pay only when a user takes a specific action (e.g. fills out a form) or completes a transaction on their websites.

We focus on designing efficient and incentive compatible mechanisms that use this charging scheme. We describe a mechanism based on a sampling-based learning algorithm that under suitable assumptions is asymptotically individually rational, asymptotically Bayesian incentive compatible and asymptotically ex-ante efficient.

In particular, we demonstrate our mechanism for the case where the utility functions of the advertisers are independent and identically-distributed random variables as well as the case where they evolve like independent reflected Brownian motions.

1 Introduction

Currently, the two main charging models in the online advertising industry are cost-per-impression (CPM) and cost-per-click (CPC). In the CPM model, the advertisers pay the publisher for the impression of their ads. CPM is commonly used in traditional media (e.g. magazines and television) and in banner advertising, and is more suitable when the goal of the advertiser is to increase brand awareness.

A more attractive and more popular charging model in online advertising is the CPC model in which the advertisers pay the publisher only when a user clicks on their ads. In the last few years, there has been a tremendous shift towards the CPC charging model. CPC is adopted by search engines such as Google or Yahoo! for the placement of ads next to search results (also known as sponsored search) and on the website of third-party publishers.

In this paper we will focus on another natural and widely advocated charging scheme known as the Cost-Per-Action or Cost-Per-Acquisition (CPA) model. In this model, instead of paying per click, the advertiser pays only when a user takes a specific action (e.g. fills out a form) or completes a transaction. Recently, several companies like Google, eBay, Amazon, Advertising.com, and Snap.com have started to sell advertising in this way.

CPA models can be the ideal charging scheme, especially for small and risk averse advertisers. We will briefly describe a few advantages of this charging scheme over CPC and refer the reader to [19] for a more detailed discussion.

One of the drawbacks of the CPC scheme is that it requires the advertisers to submit their bids before observing the profits generated by the users clicking on their ads. Learning the expected value of each click, and therefore the right bid for the ad, is a prohibitively difficult task especially in the context of sponsored search in which the advertisers typically bid for thousands of keywords. CPA eliminates this problem because it allows the advertisers to report their payoff after observing the user’s action.

Another drawback of the CPC scheme is its vulnerability to click fraud. Click fraud refers to clicks generated by someone or something with no genuine interest in the advertisement. Such clicks can be generated by the publisher of the content, who has an interest in receiving a share of the revenue of the ad, or by a rival who wishes to increase the cost of advertising for the advertiser. Click fraud is considered by many experts to be the biggest challenge facing the online advertising industry [14, 10, 24, 21]. CPA schemes are less vulnerable because generating a fraudulent action is typically more costly than generating a fraudulent click. For example, an advertiser can define the action as a sale and pay the publisher only when the ad yields profit.

On the other hand, there is a fundamental difference between the CPA and CPC charging models. A click on the ad can be observed by both advertiser and publisher. However, the action of the user is hidden from the publisher and is observable only by the advertiser. Although the publisher can require the advertisers to install software that monitors the actions taking place on their web sites, even moderately sophisticated advertisers can find a way to manipulate the software if they find it sufficiently profitable.

Are the publishers exposed to the manipulation or misreporting of the advertisers in the CPA scheme? Does CPA create an incentive for the advertisers to misreport the number of actions or their payoffs for the actions? The main result of this paper is to give a negative answer to these questions. We design a mechanism that, asymptotically and under reasonable assumptions, removes the incentives of the advertisers to misreport their payoffs. At the same time, our mechanism has the same asymptotic efficiency and hence revenue as the currently used CPC mechanisms. We will use techniques in learning and mechanism design to obtain this result.

In the next section, we will formally describe our model in mechanism design terminology (see [22]). We will refer to advertisers as agents and to the impression of an ad as an item. For simplicity of exposition, we assume only one advertisement slot per page. In section 8 we outline how to extend our results to the case where more than one advertisement can be displayed on each page. Although our work is essentially motivated by online advertising, we believe that the application of our mechanism is not limited to this domain.

1.1 Model

We study the following problem: there are a number of self-interested agents competing for identical items sold repeatedly at times t = 1,2,⋅⋅⋅. At each time t, a mechanism allocates the item to one of the agents. Agents discover their utility for the good only if it is allocated to them. If agent i receives the good at time t, she realizes utility uit (denominated in money) for it and reports (not necessarily truthfully) the realized utility to the mechanism. Then, the mechanism determines how much the agent has to pay for receiving the item. We allow the utility of an agent to change over time. For this environment we are interested in auction mechanisms which have the following four properties.

  1. The mechanism is individually rational in each period.
  2. Agents have an incentive to truthfully report their realized utilities.
  3. The efficiency (and revenue) is, in an appropriate sense, not too small compared to a second price auction.
  4. The correctness of the mechanism does not depend on an a-priori knowledge of the distribution of uit's. This feature is motivated by the Wilson doctrine [25].

The precise manner in which these properties are formalized is described in section 2.

We will build our mechanisms on a sampling-based learning algorithm. The learning algorithm is used to estimate the expected utility of the agents, and consists of two alternating phases: exploration and exploitation. During an exploration phase, the item is allocated for free to a randomly chosen agent. During an exploitation phase, the mechanism allocates the item to the agent with the highest estimated expected utility. After each allocation, the agent who has received the item reports its realized utility. Subsequently, the mechanism updates the estimate of utilities and determines the payment.

We characterize a class of learning algorithms that ensure that the corresponding mechanism has the four desired properties. The main difficulty in obtaining this result is the following: since there is uncertainty about the utilities, it is possible that in some periods the item is allocated to an agent who does not have the highest utility in that period. Hence, the natural second-highest price payment rule would violate individual rationality. On the other hand, if the mechanism does not charge an agent because her reported utility after the allocation is low, it gives her an incentive to shade her reported utility down. Our mechanism solves these problems by using an adaptive, cumulative pricing scheme.

We illustrate our results by identifying simple mechanisms that have the desired properties. We demonstrate these mechanisms for the case in which the uit’s are independent and identically-distributed random variables as well as the case where their expected values evolve like independent reflected Brownian motions. In these cases the mechanism is actually ex-post individually rational.

In our proposed mechanism, the agents do not have to bid for the items. This is advantageous when the bidders themselves are unaware of their utility values. However, in some cases, an agent might have a better estimate of her utility for the item than our mechanism. For this reason, we describe how we can slightly modify our mechanism to allow those agents to bid directly.

1.2 Related Work

There is a large body of interesting results on using machine learning techniques in mechanism design. We only briefly survey the main techniques and ideas and compare them with the approach of this paper.

Most of these works, like [5, 8, 11, 18], consider one-shot games or repeated auctions in which the agents leave the environment after they receive an item. In our setting, we may allocate items to an agent several times and hence we need to consider the strategic behavior of the agents over time. There is also a large literature on regret minimization or expert algorithms. In our context, these algorithms are applicable even if the utilities of the agents change arbitrarily. However, the efficiency (and therefore the revenue) of these algorithms is only comparable to that of mechanisms that allocate the item to the single best agent (expert) (e.g. see [17]). Our goal is more ambitious: our efficiency is close to that of the most efficient allocation, which might allocate the item to different agents at different times. On the other hand, we focus on utility values that change smoothly (e.g. like a Brownian motion).

In a finitely repeated version of the environment considered here, Athey and Segal [2] construct an efficient, budget-balanced mechanism in which truthful revelation in each period is Bayesian incentive compatible. Bapna and Weber [4] consider the infinite horizon version of [2] and propose a class of incentive compatible mechanisms based on the Gittins index (see [12]). Taking a different approach, Bergemann and Välimäki [6] and Cavallo et al. [9] propose an incentive compatible generalization of the Vickrey-Clarke-Groves mechanism based on the marginal contribution of each agent for this environment. All these mechanisms need the exact solution of the underlying optimization problems, and therefore require complete information about the prior of the utilities of the agents; also, they do not apply when the evolution of the utilities of the agents is not stationary over time. This violates the last of our desiderata. For a comprehensive survey of the dynamic mechanism design literature see [23].

In the context of sponsored search, attention has focused on ways of estimating click-through rates. Gonen and Pavlov [13] give a mechanism which learns the click-through rates via sampling and show that truthful bidding is, with high probability, a (weakly) dominant strategy in this mechanism. Along this line, Wortman et al. [26] introduced an exploration scheme for learning advertisers' click-through rates in sponsored search which maintains the equilibrium of the system. In these works, unlike ours, the distribution of the utilities of the agents is assumed to be fixed over time.

Immorlica et al. [15], and later Mahdian and Tomak [19], examine the vulnerability of various procedures for estimating click-through rates, and identify a class of click-through learning algorithms in which fraudulent clicks cannot increase the expected payment per impression by more than o(1). This is under the assumption that the slot of an agent is fixed and the bids of other agents remain constant over time. In contrast, we study conditions which guarantee incentive compatibility and efficiency while the utilities of (all) agents may evolve over time.

2 Definitions and Notation

Suppose n agents compete in each period for a single item. The item is sold repeatedly at times t = 1,2,⋅⋅⋅. Denote by uit the nonnegative utility of agent i for the item at time t. Utilities are denominated in a common monetary scale.

The utilities of agents may evolve over time according to a stochastic process. We assume that for i ≠ j, the evolution of uit and ujt are independent stochastic processes. We also define μit = E[uit | ui1,⋅⋅⋅,ui,t−1]. Throughout this paper, expectations are taken conditioned on the complete history. For simplicity of notation, we omit the terms that denote such conditioning. With this notational convention, it follows, for example, that E[uit] = E[μit]. Here the second expectation is taken over all possible histories.

Let M be a mechanism used to sell the items. At each time, M allocates the item to one of the agents. Let i be the agent who has received the item at time t. Define xit to be the variable indicating the allocation of the item to i at time t. After the allocation, agent i observes her utility, uit, and then reports rit, as her utility for the item, to the mechanism. Note that we do not require an agent to know her utility for possessing the item in advance of acquiring it. The mechanism then determines the payment, denoted by pit.

Definition 1 An agent i is truthful if rit = uit for all times t > 0 such that xit = 1.

Our goal is to design a mechanism which has the following properties. We assume n, the number of agents, is constant.

Individual Rationality:
A mechanism is ex-post individually rational if for any time T > 0 and any agent 1 ≤ i ≤ n, the total payment of agent i does not exceed the sum of her reports:
\[ \sum_{t=1}^{T} x_{it} r_{it} - p_{it} \;\geq\; 0. \]

M is asymptotically ex-ante individually rational if:

\[ \liminf_{T\to\infty} E\Big[\sum_{t=1}^{T} x_{it} r_{it} - p_{it}\Big] \;\geq\; 0. \]
Incentive Compatibility:
This property means that truthfulness defines an asymptotic Bayesian Nash equilibrium. Consider agent i and suppose all agents except i are truthful. Let Ui(T) be the expected total profit of agent i if she is truthful between time 1 and T. Also, let Ûi(T) be the maximum expected profit of agent i under any other strategy. Asymptotic incentive compatibility requires that
\[ \hat{U}_i(T) - U_i(T) = o(U_i(T)). \]
Efficiency and Revenue:
Call a mechanism that allocates at each time t (and for each history) the item to an agent in argmaxi{μit} ex-ante efficient. The total social welfare obtained by an ex-ante efficient mechanism up to time T is E[Σ_{t=1}^{T} maxi{μit}]. If each agent i knew μit for all t, such an allocation could be achieved by a second price auction in an incentive compatible way: have each i report a value of μit, give the item to an agent in argmaxi{μit}, and charge her the second highest μit. Let γt be the second highest μit at time t > 0. Then, the expected revenue of this second price mechanism up to time T is equal to E[Σ_{t=1}^{T} γt].
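As a small worked illustration (the numbers below are ours, chosen only for concreteness): suppose at some time t three agents have conditional expected utilities
\[ \mu_{1t} = 5, \qquad \mu_{2t} = 3, \qquad \mu_{3t} = 1. \]
The ex-ante efficient allocation gives the item to agent 1, and the second highest value is γt = 3; period t then contributes 5 to the welfare benchmark and 3 to the second-price revenue benchmark.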

We will measure the efficiency and the revenue of M by comparing it to the second price mechanism just described. Let W(T) and R(T) be the expected welfare and the expected revenue of mechanism M between time 1 and T, when all agents are truthful, i.e.

\[ W(T) = E\Big[\sum_{t=1}^{T}\sum_{i=1}^{n} x_{it}\,\mu_{it}\Big], \qquad R(T) = E\Big[\sum_{t=1}^{T}\sum_{i=1}^{n} p_{it}\Big]. \]

Then, M is asymptotically ex-ante efficient if:

\[ E\Big[\sum_{t=1}^{T}\max_i\{\mu_{it}\}\Big] - W(T) = o(W(T)). \]
Also, the revenue of M is asymptotically equivalent to the revenue of the second price auction if:
\[ E\Big[\sum_{t=1}^{T}\gamma_t\Big] - R(T) = o(R(T)). \]

3 Proposed Mechanism


  For t = 1, 2, ...
    With probability η(t), explore:
        Uniformly at random, allocate the item to an agent i, 1 ≤ i ≤ n.
        pit ← 0
    With probability 1 − η(t), exploit:
        Randomly allocate the item to an agent i ∈ argmaxi{μ̂it(t)}.
        pit ← Σ_{k=1}^{t−1} yik · min{γ̂k(k), γ̂k(t)} − Σ_{k=1}^{t−1} pik
    rit ← the report of agent i.
    pjt ← 0, for all j ≠ i
Figure 1: Mechanism M

We build our mechanism on top of a learning algorithm that estimates the expected utility of the agents. We refrain from an explicit description of the learning algorithm. Rather, we describe sufficient conditions for a learning algorithm that can be extended to a mechanism with all the properties we seek (see section 4.1). In sections 6 and 7 we give two examples of environments where learning algorithms satisfying these sufficient conditions exist.

The mechanism consists of two phases: explore and exploit. During the explore phase, with probability η(t) ∈ [0,1], the item is allocated for free to a randomly chosen agent. During the exploit phase, the mechanism allocates the item to the agent with the highest estimated expected utility. Afterwards, the agent reports her utility to the mechanism and the mechanism determines the payment. We first formalize our assumptions about the learning algorithm and then we discuss the payment scheme. The mechanism is given in Figure 1.

The learning algorithm samples the uit's at rate η(t) and, based on the history of the reports of agent i, returns an estimate of μit. Let μ̂it(T) be the estimate of the algorithm for μit conditional on the history of the reports up to time T. The history of the reports of agent i up to time T is the sequence of the reported values and times of observation of uit up to but not including time T. Note that we allow T > t. Thus, information at time T > t can be used to revise an estimate of μit made at some earlier time. We assume that the accuracy of the learning algorithm is monotone, i.e., increasing the number of samples, in expectation, only increases the accuracy of the estimates.

We now describe the payment scheme. Let γ̂t(T) = max_{j≠i}{μ̂jt(T)}, where i is the agent who receives the item at time t. We define yit to be the indicator variable of the allocation of the item to agent i during an exploit phase. The payment of agent i at time t, denoted pit, is equal to:

\[ p_{it} = \sum_{k=1}^{t-1} y_{ik}\min\{\hat\gamma_k(k), \hat\gamma_k(t)\} - \sum_{k=1}^{t-1} p_{ik} \]

Therefore, we have:

\[ \sum_{k=1}^{t} p_{ik} = \sum_{k=1}^{t-1} y_{ik}\min\{\hat\gamma_k(k), \hat\gamma_k(t)\}. \]

An agent only pays for items that were allocated to her during the exploit phase, up to but not including time t. At time t, the payment of agent i for the item she received at time k < t is equal to min{γ̂k(k), γ̂k(t)}. The payment scheme emulates the second-price scheme by replacing γt with its estimates. The mechanism reduces the price of an item if it later realizes that the item was overpriced. However, it does not increase the payment when the item was underpriced; this happens when an item, due to estimation error, is allocated to an agent who does not have the highest expected valuation for it. Since the estimates of the learning algorithm for the utilities of the agents become more precise over time, our adaptive cumulative payment scheme allows correction of the mistakes made in the past.
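To make the allocation and payment rules concrete, the following is a minimal Python sketch of mechanism M, using the simple averaging estimator that appears later in Section 6. The agent interface (report), the helper names, and the tie-breaking are illustrative assumptions on our part, not part of the paper.

    import random

    def run_mechanism(agents, eta, T):
        # Sketch of mechanism M (Figure 1). `agents[i].report(t)` returns the
        # utility agent i reports when the item is allocated to her at time t;
        # `eta(t)` is the exploration probability. Assumes at least two agents.
        n = len(agents)
        reports = [[] for _ in range(n)]   # reported utilities of each agent
        wins = [[] for _ in range(n)]      # exploit wins: (time k, gamma_hat_k(k))
        paid = [0.0] * n                   # total charged so far, sum_k p_ik

        def mu_hat(i):
            # estimate of agent i's expected utility from her reports so far
            return sum(reports[i]) / len(reports[i]) if reports[i] else 0.0

        def gamma_hat(i):
            # highest current estimate among the other agents
            return max(mu_hat(j) for j in range(n) if j != i)

        for t in range(1, T + 1):
            if random.random() < eta(t):              # explore: free item
                i = random.randrange(n)
                p_it = 0.0
            else:                                     # exploit
                i = max(range(n), key=mu_hat)
                # re-price every past exploit win at min{gamma_hat_k(k), gamma_hat_k(t)}
                # and charge the difference with what has already been paid
                due = sum(min(g_k, gamma_hat(i)) for (_, g_k) in wins[i])
                p_it = due - paid[i]                  # may be a refund
                wins[i].append((t, gamma_hat(i)))
            paid[i] += p_it
            reports[i].append(agents[i].report(t))    # winner reports her utility
        return paid

The cumulative rule is visible in the exploit branch: the price of every past item is recomputed with the current estimates, and only the difference with the payments already made is charged (or refunded).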

4 Analysis

We start this section by defining Δt. Assume all agents are truthful up to time t. Let Δt be the maximum, over all agents i, of the difference between μit and its estimate using only the reports made during the explore phase. Because the accuracy of the learning algorithm is monotone, for a truthful agent i and any time T ≥ t we have:

\[ E[|\hat\mu_{it}(T) - \mu_{it}|] \;\leq\; E[|\hat\mu_{it}(t) - \mu_{it}|] \;\leq\; E[\Delta_t] \qquad (1) \]

In the inequality above, and in the rest of the paper, the expectations of μ̂it are taken over the evolution of the uit's and the random choices of the mechanism. For simplicity of notation, we omit the terms that denote such conditioning.

In this section, we will relate the performance of the mechanism to the estimation error of the learning algorithm. We start with the individual rationality aspects of the mechanism. Then we show that if Δt is small, agents cannot gain much by deviating from the truthful strategy. We also bound the efficiency loss in terms of Δt.

Theorem 1 For a truthful agent i up to time T, in expectation, the amount that i may be overcharged for the items she received during exploitation is bounded by the total estimation error of the algorithm, i.e.,

\[ E\Big[\sum_{t=1}^{T} y_{it}u_{it}\Big] - E\Big[\sum_{t=1}^{T} p_{it}\Big] \;\geq\; -E\Big[\sum_{t=1}^{T}\Delta_t\Big] \]

Proof :  In fact, we prove a stronger result by showing:

\[ E\Big[\sum_{t=1}^{T} p_{it}\Big] - E\Big[\sum_{t=1}^{T} y_{it}u_{it}\Big] \;\leq\; E\Big[\sum_{t=1}^{T-1} y_{it}\big(\hat\gamma_t(t)-\mu_{it}\big)^{+}\Big] \;\leq\; E\Big[\sum_{t=1}^{T}\Delta_t\Big] \qquad (2) \]
where (z)^+ = max{0, z}.

If yit = 0 then pit = 0. Also, recall that for every time t, because expectations are computed conditioned on the complete history, E[uit] = E[μit]. By the payment rule we have:

\begin{align*}
E\Big[\sum_{t=1}^{T} p_{it}\Big] - E\Big[\sum_{t=1}^{T} y_{it}u_{it}\Big]
&\leq E\Big[\sum_{t=1}^{T-1} y_{it}\big(\min\{\hat\gamma_t(t),\hat\gamma_t(T)\} - \mu_{it}\big)\Big] \\
&\leq E\Big[\sum_{t=1}^{T-1} y_{it}\big(\hat\gamma_t(t) - \mu_{it}\big)\Big] \\
&\leq E\Big[\sum_{t=1}^{T-1} \big(\hat\gamma_t(t) - \mu_{it}\big)^{+}\Big]
\end{align*}
Since yit = 1 indicates that the mechanism allocated the item to agent i at time t, we have γ̂t(t) ≤ μ̂it(t). Plugging this into the inequality above, we get:
\begin{align*}
E\Big[\sum_{t=1}^{T} p_{it}\Big] - E\Big[\sum_{t=1}^{T} y_{it}u_{it}\Big]
&\leq E\Big[\sum_{t=1}^{T-1} \big(\hat\mu_{it}(t) - \mu_{it}\big)^{+}\Big] \\
&\leq E\Big[\sum_{t=1}^{T-1}\Delta_t\Big]
\end{align*}
The last inequality follows from (1).

Now we study the incentive compatibility aspects of the mechanism.

Theorem 2 Suppose all agents are truthful except agent i who is intending to deviate from the truthful strategy. Agent i cannot increase her expected utility up to time T by more than:

\[ 8E\Big[\sum_{t=1}^{T}\Delta_t\Big] + E\Big[\max_{1\leq t\leq T}\{\mu_{it}\}\Big] \qquad (3) \]

Proof :  We bound the expected profit that i could obtain, by deviating from the truthful strategy, from the items she receives during exploitation up to time T. The term E[max_{1≤t≤T}{μit}] in the expression above bounds the outstanding payment of agent i; recall that agents do not pay for the last item they receive during exploitation. Also, note that the exploration rate is independent of the strategy of the agents. Therefore, without loss of generality, we assume that at time T agent i has received an item during exploitation.

Let S be the strategy that i deviates to. Fixing the evolution of all ujt's, 1 ≤ j ≤ n, and all random choices of the mechanism, let DT be the set of times at which i receives the item under strategy S during the exploit phase and before time T. Formally, DT = {t < T | yit = 1, if the strategy of i is S}. Similarly, let CT = {t < T | yit = 1, if i is truthful}. Also, let μ̂′it and γ̂′t denote the estimates of the mechanism when the strategy of i is S. The expected profit i could obtain under strategy S from the items she received during exploitation, up to time T − 1, is equal to:

\begin{align*}
E\Big[\sum_{t\in D_T} \mu_{it} - \min\{\hat\gamma'_t(t),\hat\gamma'_t(T)\}\Big]
= {}& E\Big[\sum_{t\in D_T\cap C_T} \mu_{it} - \min\{\hat\gamma'_t(t),\hat\gamma'_t(T)\}\Big] \\
&+ E\Big[\sum_{t\in D_T\setminus C_T} \mu_{it} - \min\{\hat\gamma'_t(t),\hat\gamma'_t(T)\}\Big] \qquad (4)
\end{align*}

For each time t ≥ 1, we examine two cases:

  1. The first case is when t ∈ DT ∩ CT. We will show that in these cases the “current price” faced by agent i under the two scenarios is close. To this aim, we first observe:
\[ \min\{\hat\gamma'_t(t),\hat\gamma'_t(T)\} = \hat\gamma_t(t) + \min\{\hat\gamma'_t(t)-\hat\gamma_t(t),\ \hat\gamma'_t(T)-\hat\gamma_t(t)\} \]

    Let j ≠ i be the agent with the highest μ̂jt(t), i.e., μ̂jt(t) = max_{j≠i}{μ̂jt(t)} = γ̂t(t). Because i is the winner at time t both in DT and in CT, by the definition of γ̂′t we have μ̂′jt(t) ≤ γ̂′t(t) and μ̂′jt(T) ≤ γ̂′t(T). Plugging these into the equality above we get:

\begin{align*}
\min\{\hat\gamma'_t(t),\hat\gamma'_t(T)\}
&\geq \hat\gamma_t(t) + \min\{\hat\mu'_{jt}(t)-\hat\mu_{jt}(t),\ \hat\mu'_{jt}(T)-\hat\mu_{jt}(t)\} \\
&\geq \hat\gamma_t(t) - |\hat\mu'_{jt}(t)-\hat\mu_{jt}(t)| - |\hat\mu_{jt}(t)-\hat\mu'_{jt}(T)| \\
&= \hat\gamma_t(t) - |\hat\mu_{jt}(t)-\mu_{jt}+\mu_{jt}-\hat\mu'_{jt}(t)| - |\hat\mu_{jt}(t)-\mu_{jt}+\mu_{jt}-\hat\mu'_{jt}(T)| \\
&\geq \hat\gamma_t(t) - 2|\hat\mu_{jt}(t)-\mu_{jt}| - |\mu_{jt}-\hat\mu'_{jt}(t)| - |\mu_{jt}-\hat\mu'_{jt}(T)|
\end{align*}

    Taking expectations of both sides we have:

\begin{align*}
E\big[\min\{\hat\gamma'_t(t),\hat\gamma'_t(T)\}\, I(t\in D_T\cap C_T)\big]
&\geq E\big[\hat\gamma_t(t)\, I(t\in D_T\cap C_T)\big] - E\big[2|\hat\mu_{jt}(t)-\mu_{jt}|\, I(t\in D_T\cap C_T)\big] \\
&\quad - E\big[\big(|\mu_{jt}-\hat\mu'_{jt}(t)| + |\mu_{jt}-\hat\mu'_{jt}(T)|\big)\, I(t\in D_T\cap C_T)\big] \\
&\geq E\big[\hat\gamma_t(t)\, I(t\in D_T\cap C_T)\big] \\
&\quad - E\big[2|\hat\mu_{jt}(t)-\mu_{jt}| + |\mu_{jt}-\hat\mu'_{jt}(t)| + |\mu_{jt}-\hat\mu'_{jt}(T)|\big]
\end{align*}

    Because agent j is truthful, by (1), we get:

\[ E\big[\min\{\hat\gamma'_t(t),\hat\gamma'_t(T)\}\, I(t\in D_T\cap C_T)\big] \;\geq\; E\big[\hat\gamma_t(t)\, I(t\in D_T\cap C_T)\big] - 4E[\Delta_t] \qquad (5) \]
  2. The second case is when t ∈ DT \ CT. We show that in these cases agent i cannot increase her “profit” by much.

    Let j be the agent who would receive the item at time t when agent i is truthful. Therefore μ̂jt(t) ≥ μ̂it(t). Also, μ̂jt(t) ≤ max_{j′≠i}{μ̂j′t(t)} = γ̂′t(t), and similarly μ̂jt(T) ≤ γ̂′t(T). Hence:

\begin{align*}
\mu_{it} - \min\{\hat\gamma'_t(t),\hat\gamma'_t(T)\}
&\leq \hat\mu_{it}(t) + |\mu_{it}-\hat\mu_{it}(t)| - \min\{\hat\gamma'_t(t),\hat\gamma'_t(T)\} \\
&\leq \hat\mu_{jt}(t) + |\mu_{it}-\hat\mu_{it}(t)| - \min\{\hat\mu_{jt}(t),\hat\mu_{jt}(T)\} \\
&= \max\{0,\ \hat\mu_{jt}(t)-\hat\mu_{jt}(T)\} + |\mu_{it}-\hat\mu_{it}(t)| \\
&\leq |\hat\mu_{jt}(t)-\hat\mu_{jt}(T)| + |\mu_{it}-\hat\mu_{it}(t)| \\
&\leq |\hat\mu_{jt}(t)-\mu_{jt}| + |\mu_{jt}-\hat\mu_{jt}(T)| + |\mu_{it}-\hat\mu_{it}(t)|
\end{align*}

    We sum the inequality above over all t ∈ DT \ CT. Because all agents are truthful, taking expectations of both sides and applying (1), we get:

\[ E\Big[\sum_{t\in D_T\setminus C_T} \mu_{it} - \min\{\hat\gamma'_t(t),\hat\gamma'_t(T)\}\Big] \;\leq\; 3E\Big[\sum_{t=1}^{T}\Delta_t\Big] \qquad (6) \]

Substituting inequalities (5) and (6) into (4):

\begin{align*}
E\Big[\sum_{t=1}^{T-1} y_{it}u_{it} - p_{it}\Big]
&\leq E\Big[\sum_{t\in D_T\cap C_T} \mu_{it} - \hat\gamma_t(t)\Big] + 7E\Big[\sum_{t=1}^{T}\Delta_t\Big] \\
&= E\Big[\sum_{t\in C_T} \mu_{it} - \hat\gamma_t(t)\Big] - E\Big[\sum_{t\in C_T\setminus D_T} \mu_{it} - \hat\gamma_t(t)\Big] + 7E\Big[\sum_{t=1}^{T}\Delta_t\Big]
\end{align*}

By inequality (2), −E[Σ_{t∈CT\DT} μit − γ̂t(t)] ≤ E[Σ_{t=1}^{T} Δt]. Therefore:

\[ E\Big[\sum_{t=1}^{T-1} y_{it}u_{it} - p_{it}\Big] - E\Big[\sum_{t\in C_T} u_{it} - p_{it}\Big] \;\leq\; 8E\Big[\sum_{t=1}^{T}\Delta_t\Big] \]
which completes the proof.

We now compare the welfare of our mechanism to that of the efficient mechanism that at every time allocates the item to the agent with the highest expected utility. The expected loss of efficiency during exploration is equal to E[Σ_{t=1}^{T} η(t) maxi{μit}]. In the next theorem, we show that in equilibrium, the efficiency loss during the exploit phase is bounded by a factor of the total estimation error of the learning algorithm.

Theorem 3 Let W(T) denote the expected welfare of mechanism M between time 1 and T. If all the agents are truthful, we have:

\[ W(T) \;\geq\; E\Big[\sum_{t=1}^{T}\max_i\{\mu_{it}\}\Big] - E\Big[\sum_{t=1}^{T}\eta(t)\max_i\{\mu_{it}\} + 2\Delta_t\Big] \]

Proof :  There are two reasons for the loss of efficiency of the mechanism. The first is the loss in welfare during exploration, when the item is allocated randomly to one of the agents. The expected loss in this case is equal to E[Σ_{t=1}^{T} η(t) maxi{μit}].

The other reason is mistakes during exploitation. The error in the estimates may lead to an allocation to an agent who does not value the item the most. Suppose at time t, during exploitation, the mechanism allocated the item to agent j instead of i, i.e., μit > μjt. By the rule of the mechanism we have μ̂jt(t) ≥ μ̂it(t). Subtracting this nonnegative difference, we get:

\[ \mu_{jt} - \mu_{it} \;\geq\; \mu_{jt} - \mu_{it} - \big(\hat\mu_{jt}(t) - \hat\mu_{it}(t)\big) = \big(\mu_{jt} - \hat\mu_{jt}(t)\big) + \big(\hat\mu_{it}(t) - \mu_{it}\big) \]

We sum up this inequality over all such times t, and by inequality (1), we observe that the expected efficiency loss during exploitation is bounded by 2E[Σ_{t=1}^{T} Δt].

Therefore, for the expected welfare of M between time 1 and T we have:

\[ E\Big[\sum_{t=1}^{T}\max_i\{\mu_{it}\}\Big] - W(T) \;\leq\; E\Big[\sum_{t=1}^{T}\eta(t)\max_i\{\mu_{it}\} + 2\Delta_t\Big] \]

4.1 Sufficient Conditions

In this section, using the theorems of the previous section, we give sufficient conditions on the learning algorithm that guarantee asymptotic ex-ante individual rationality, incentive compatibility, and efficiency.

Theorem 4 Suppose the learning algorithm satisfies, for all 1 ≤ i ≤ n and T > 0:
\[ (C1)\qquad E\Big[\max_{1\leq t\leq T}\{\mu_{it}\} + \sum_{t=1}^{T}\Delta_t\Big] = o\Big(E\Big[\sum_{t=1}^{T}\eta(t)\mu_{it}\Big]\Big) \]
Then mechanism M is asymptotically ex-ante individually rational and asymptotically incentive compatible. If, in addition to (C1), the following condition holds
\[ (C2)\qquad E\Big[\sum_{t=1}^{T}\eta(t)\max_i\{\mu_{it}\}\Big] = o\Big(E\Big[\sum_{t=1}^{T}\max_i\{\mu_{it}\}\Big]\Big) \]
then M is asymptotically ex-ante efficient.

Before stating the proof, we observe a natural trade-off between exploitation and exploration rates in our context: higher exploration rates lead to more accurate estimates of the utilities of the agents. Condition (C1) provides us with a lower bound on the exploration rate. On the flip side, condition (C2) gives an upper bound. In the following sections, we will show with two examples how conditions (C1) and (C2) can be used to adjust the exploration rate of a learning algorithm in order to obtain efficiency and incentive compatibility.

Proof :  The expected utility of a truthful agent i up to time T is equal to:

\[ E\Big[\sum_{t=1}^{T} x_{it}u_{it}\Big] \;=\; \frac{1}{n}E\Big[\sum_{t=1}^{T}\eta(t)\mu_{it}\Big] + E\Big[\sum_{t=1}^{T} y_{it}\mu_{it}\Big] \]

Subtracting E[ t=1T pit] from both sides, by Theorem 1 we get:

\begin{align*}
E\Big[\sum_{t=1}^{T} x_{it}u_{it}\Big] - E\Big[\sum_{t=1}^{T} p_{it}\Big]
&= \frac{1}{n}E\Big[\sum_{t=1}^{T}\eta(t)\mu_{it}\Big] + \Big(E\Big[\sum_{t=1}^{T} y_{it}\mu_{it}\Big] - E\Big[\sum_{t=1}^{T} p_{it}\Big]\Big) \\
&\geq \frac{1}{n}E\Big[\sum_{t=1}^{T}\eta(t)\mu_{it}\Big] - E\Big[\sum_{t=1}^{T}\Delta_t\Big]
\end{align*}

Plugging condition (C1) into the equation above yields:

\[ E\Big[\sum_{t=1}^{T} x_{it}u_{it}\Big] - E\Big[\sum_{t=1}^{T} p_{it}\Big] \;\geq\; \Big(\frac{1}{n} - o(1)\Big)E\Big[\sum_{t=1}^{T}\eta(t)\mu_{it}\Big] \qquad (7) \]

Therefore, the mechanism is asymptotically ex-ante individually rational. Moreover, inequality (7) implies that the utility of agent i is Ω(E[Σ_{t=1}^{T} η(t)μit]). Thus, by Theorem 2, if (C1) holds, then the mechanism is asymptotically incentive compatible.

To prove the claim about the efficiency of the mechanism, we invoke Theorem 3. By this theorem and condition (C1) we have:

\begin{align*}
W(T) &\geq E\Big[\sum_{t=1}^{T}\max_i\{\mu_{it}\}\Big] - E\Big[\sum_{t=1}^{T}\eta(t)\max_i\{\mu_{it}\} + 2\Delta_t\Big] \\
&\geq E\Big[\sum_{t=1}^{T}\max_i\{\mu_{it}\}\Big] - (1+o(1))\,E\Big[\sum_{t=1}^{T}\eta(t)\max_i\{\mu_{it}\}\Big]
\end{align*}

Plugging condition (C2) into the equation above we get:

\begin{align*}
E\Big[\sum_{t=1}^{T}\max_i\{\mu_{it}\}\Big] - W(T)
&= O\Big(E\Big[\sum_{t=1}^{T}\eta(t)\max_i\{\mu_{it}\}\Big]\Big) \\
&= o\Big(E\Big[\sum_{t=1}^{T}\max_i\{\mu_{it}\}\Big]\Big)
\end{align*}
which implies asymptotic ex-ante efficiency.

The theorem above shows that under some assumptions, the welfare obtained by the mechanism is asymptotically equivalent to that of the efficient mechanism that at every time allocates the item to the agent with the highest expected utility. We now give a similar condition for the revenue guarantee of the mechanism.

Theorem 5 If in addition to (C1), the following condition holds

\[ (C3)\qquad E\Big[\sum_{t=1}^{T}\eta(t)\max_i\{\mu_{it}\}\Big] = o\Big(E\Big[\sum_{t=1}^{T}\gamma_t\Big]\Big) \]
then the revenue of the mechanism is asymptotically equivalent to the revenue of the efficient mechanism that at every time allocates the item to the agent with the highest expected utility and charges the winning agent the second highest expected utility.

Proof :  There are three reasons for loss of revenue. The first is the loss during exploration, which is equal to E[Σ_{t=1}^{T} η(t)γt] ≤ E[Σ_{t=1}^{T} η(t) maxi{μit}].

Another reason is the estimation error of γt. Let i be the agent who has received the item at time t. We consider two cases:

  1. If i is the agent with the second highest expected utility, then let j be the agent with the highest expected utility. The estimation error of γt is equal to γt − min{γ̂t(t), γ̂t(T)}, and
\begin{align*}
\gamma_t - \min\{\hat\gamma_t(t),\hat\gamma_t(T)\}
&\leq \mu_{it} - \min\{\hat\mu_{jt}(t),\hat\mu_{jt}(T)\} \\
&\leq \mu_{jt} - \min\{\hat\mu_{jt}(t),\hat\mu_{jt}(T)\} \\
&\leq \max\{\mu_{jt}-\hat\mu_{jt}(t),\ \mu_{jt}-\hat\mu_{jt}(T)\} \\
&\leq |\mu_{jt}-\hat\mu_{jt}(t)| + |\mu_{jt}-\hat\mu_{jt}(T)|
\end{align*}
    Therefore in this case, by inequality (1), the expected estimation error of γt is bounded by 2E[Δt].
  2. Otherwise, let j be the agent with the second highest expected utility. Then
\[ \gamma_t - \min\{\hat\gamma_t(t),\hat\gamma_t(T)\} \;\leq\; \mu_{jt} - \min\{\hat\mu_{jt}(t),\hat\mu_{jt}(T)\} \]
    Similar to the previous case, we have:
\[ \gamma_t - \min\{\hat\gamma_t(t),\hat\gamma_t(T)\} \;\leq\; |\mu_{jt}-\hat\mu_{jt}(t)| + |\mu_{jt}-\hat\mu_{jt}(T)| \]
    which bounds the expected estimation error of γt by 2E[Δt].

The third reason contributing to the loss of revenue is the outstanding payments. Agents do not pay for the last item they receive during exploitation. These outstanding payments contribute a loss that is bounded by n·E[max_{1≤t≤T} γt].

For the expected revenue of the mechanism we have:

\begin{align*}
R(T) &= E\Big[\sum_{t=1}^{T}\sum_{i=1}^{n} p_{it}\Big] \\
&\geq E\Big[\sum_{t=1}^{T}(1-\eta(t))\min\{\hat\gamma_t(T),\hat\gamma_t(t)\}\Big] - n\cdot E\Big[\max_{t\leq T}\gamma_t\Big] \\
&\geq E\Big[\sum_{t=1}^{T}(1-\eta(t))(\gamma_t - 4\Delta_t)\Big] - n\cdot E\Big[\max_{t\leq T}\gamma_t\Big] \\
&\geq E\Big[\sum_{t=1}^{T}\gamma_t\Big] - E\Big[\sum_{t=1}^{T}\eta(t)\gamma_t\Big] - E\Big[\sum_{t=1}^{T}4\Delta_t\Big] - n\cdot E\Big[\max_{t\leq T}\gamma_t\Big]
\end{align*}

Plugging in condition (C1), we get:

\[ R(T) \;\geq\; E\Big[\sum_{t=1}^{T}\gamma_t\Big] - (1+o(1))\,E\Big[\sum_{t=1}^{T}\eta(t)\max_i\{\mu_{it}\}\Big] \]

Therefore, by condition (C3)

\[ E\Big[\sum_{t=1}^{T}\gamma_t\Big] - R(T) \;=\; o\Big(E\Big[\sum_{t=1}^{T}\max_i\{\mu_{it}\}\Big]\Big) \]

5 Allowing agents to bid

In mechanism M no agent explicitly bids for an item. Whether an agent receives an item or not depends on the history of her reported utilities and the estimates that M forms from them. This may be advantageous when the bidders themselves are unaware of what their utilities will be. However, when agents possess a better estimate of their utilities, we would like to make use of it. For this reason we describe how to modify M so as to allow agents to bid for an item.

If time t occurs during an exploit phase, let Bt be the set of agents who bid at this time. The mechanism bids on behalf of every agent i ∉ Bt. Denote by bit the bid of agent i ∈ Bt for the item at time t. The modification of M sets bit = μ̂it(t) for i ∉ Bt. Then, the item is allocated at random to one of the agents in argmaxi{bit}.

If i is the agent who received the item at time t, let A = {bjt | j ∈ Bt} ∪ {μjt | j ∉ Bt}. Define γt as the second highest value in A. Let γ̂t(T) be equal to max_{j≠i}{bjt}. The payment of agent i will be

\[ p_{it} \leftarrow \sum_{k=1}^{t-1} y_{ik}\min\{\hat\gamma_k(t), b_{ik}\} - \sum_{k=1}^{t-1} p_{ik}. \]

To incorporate the fact that bidders can bid for an item, we must modify the definition of truthfulness.

Definition 2 Agent i is truthful if:

  1. rit = uit, for all times t ≥ 1 such that xit = 1.
  2. If i bids at time t, then E[|bit − μit|] ≤ E[|μ̂it(t) − μit|].

Note that item 2 does not require agent i to bid her actual expected utility, only that her bid be closer to the mark than the mechanism's estimate. With this modification of the definition, Theorems 4 and 5 continue to hold.
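The allocation step of this modification can be sketched as follows; the function and variable names are ours, and we assume the mechanism simply substitutes its own current estimate for every agent who does not bid.

    def allocate_with_bids(bids, mu_hat_t):
        # `bids` maps an agent index in B_t to her bid b_it; `mu_hat_t[j]` is the
        # mechanism's current estimate for agent j. For j not in B_t the mechanism
        # bids mu_hat_t[j] on her behalf; the item goes to a highest effective bid.
        n = len(mu_hat_t)
        effective = [bids.get(j, mu_hat_t[j]) for j in range(n)]
        winner = max(range(n), key=lambda j: effective[j])
        return winner, effective

The payment rule is then unchanged except that each past exploit win is re-priced at min{γ̂k(t), bik}, as in the display above.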

6 Independent and Identically-Distributed Utilities

In this section, we assume that for each i, the uit's are independent and identically-distributed random variables. For simplicity, we define μi = E[uit], t > 0. Without loss of generality, we also assume 0 < μi ≤ 1.

In this environment, the learning algorithm we use is an ε-greedy algorithm for the multi-armed bandit problem. Let nit = Σ_{k=1}^{t−1} xik denote the number of items allocated to agent i before time t. For ϵ ∈ (0,1), we define:
\[ \eta_\epsilon(t) = \min\{1,\ n\, t^{-\epsilon}\ln^{1+\epsilon} t\} \]
\[ \hat\mu_{it}(T) = \begin{cases} \Big(\sum_{k=1}^{T-1} x_{ik} r_{ik}\Big)\Big/ n_{iT}, & n_{iT} > 0 \\[4pt] 0, & n_{iT} = 0 \end{cases} \]
Call the mechanism based on this learning algorithm Mϵ(iid).
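A direct transcription of these two definitions (the guard at t = 1, where ln t = 0 makes the formula degenerate, is our assumption; the names are illustrative):

    import math

    def eta_iid(t, n, eps):
        # exploration probability eta_eps(t) = min{1, n * t^(-eps) * ln^(1+eps) t}
        if t < 2:
            return 1.0
        return min(1.0, n * t ** (-eps) * math.log(t) ** (1 + eps))

    def mu_hat_iid(observed_reports):
        # empirical average of the reports observed so far (times with x_ik = 1),
        # or 0 if agent i has not yet received any item
        return sum(observed_reports) / len(observed_reports) if observed_reports else 0.0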

Lemma 6 If all agents are truthful, then, under Mϵ(iid),
\[ E[\Delta_t] = O\Big(\frac{1}{\sqrt{t^{1-\epsilon}}}\Big). \]

The proof of this lemma is given in appendix A.

We show that Mϵ(iid), for ϵ ≤ 1/3, satisfies all the desired properties we discussed in the previous section. Moreover, it satisfies a stronger notion of individual rationality: Mϵ(iid) is ex-post individually rational, i.e., for any agent i and all T ≥ 1:
\[ \sum_{t=1}^{T} p_{it} \;\leq\; \sum_{t=1}^{T} x_{it} r_{it} \]

Theorem 7 Mϵ(iid) is ex-post individually rational. Also, for 0 < ϵ ≤ 1/3, Mϵ(iid) is asymptotically incentive compatible and ex-ante efficient.

Proof :  We first prove ex-post individual rationality. It is sufficient to prove it for the periods in which agent i has received the item during an exploit phase. For T such that yiT = 1, we have:
\begin{align*}
\sum_{t=1}^{T} p_{it} &= \sum_{t=1}^{T-1} y_{it}\min\{\hat\gamma_t(t),\hat\gamma_t(T)\} \\
&\leq \sum_{t=1}^{T-1} y_{it}\,\hat\gamma_t(T) \;\leq\; \sum_{t=1}^{T-1} y_{it}\,\hat\mu_{iT}(T) \\
&\leq n_{iT}\,\hat\mu_{iT}(T) \;=\; \sum_{t=1}^{T-1} x_{it} r_{it}
\end{align*}
The second inequality follows because the item is allocated to i at time T, which implies μ̂iT(T) ≥ γ̂t(T); here we use that in this setting the estimates at time T do not depend on t. We complete the proof by showing that conditions (C1) and (C2) hold. Note that μi ≤ 1. By Lemma 6, for ϵ ≤ 1/3:

\[ E\Big[1 + \sum_{t=1}^{T-1}\Delta_t\Big] = O\big(T^{\frac{1+\epsilon}{2}}\big) = o\big(T^{1-\epsilon}\ln^{1+\epsilon}T\big) = O\Big(E\Big[\sum_{t=1}^{T}\eta_\epsilon(t)\,\mu_i\Big]\Big). \]
Therefore, (C1) holds.

The welfare of any mechanism between time 1 and T is bounded by T. For any ϵ > 0, E[1 + Σ_{t=1}^{T−1} Δt + ηϵ(t)] = o(T), which implies (C2).

7 Brownian Motion

In this section, we assume that for each i, 1 ≤ i ≤ n, the evolution of μit is a reflected Brownian motion with mean zero and variance σi²; the reflection barrier is 0. In addition, we assume μi0 = 0 and σi² ≤ σ², for some constant σ. The mechanism observes the values of μit at discrete times t = 1,2,⋅⋅⋅.

In this environment our learning algorithm estimates the reflected Brownian motion using a mean-zero martingale. We define lit as the last time up to time t at which the item was allocated to agent i. This includes both explore and exploit phases. If i has not yet been allocated any item, lit is zero.

\[ \eta_\epsilon(t) = \min\{1,\ n\, t^{-\epsilon}\ln^{2+2\epsilon} t\} \qquad (8) \]
\[ \hat\mu_{it}(T) = \begin{cases} r_{i,l_{it}} & t < T \\ r_{i,l_{i,t-1}} & t = T \\ r_{i,l_{iT}} & t > T \end{cases} \qquad (9) \]
Call this mechanism Mϵ(B). For simplicity, we assume that the agent reports the exact value of μit. It is not difficult to verify that the results in this section hold as long as the expected error of these estimates at time t is o(t^{1/6}).
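A sketch of this estimator and exploration rate, assuming each agent's allocation history is stored as time-ordered (allocation time, reported value) pairs; the cutoff logic mirrors the three cases of equation (9), and the names are ours:

    import math

    def eta_brownian(t, n, eps):
        # exploration probability eta_eps(t) = min{1, n * t^(-eps) * ln^(2+2eps) t}, eq. (8)
        if t < 2:
            return 1.0
        return min(1.0, n * t ** (-eps) * math.log(t) ** (2 + 2 * eps))

    def mu_hat_brownian(history, t, T):
        # history: time-ordered list of (k, r_ik) pairs for agent i. Returns the
        # report at the last allocation before the cutoff of eq. (9): l_it if
        # t < T, l_{i,t-1} if t = T, and l_{iT} if t > T; 0 if no such allocation.
        cutoff = t if t < T else (t - 1 if t == T else T)
        past = [r for (k, r) in history if k <= cutoff]
        return past[-1] if past else 0.0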

We begin analyzing the mechanism by stating some well-known properties of reflected Brownian motions (see [7]).

Proposition 8 Let [Wt, t ≥ 0] be a reflected Brownian motion with mean zero and variance σ², with the reflection barrier at 0. Assume the value of Wt at time t is equal to y:
\[ E[y] = \Theta\big(\sqrt{t\sigma^2}\big) \qquad (10) \]
For T > 0, let z = W_{t+T}. For the probability density function of z − y we have:
\[ \Pr[(z-y)\in dx] \;\leq\; \sqrt{\frac{2}{\pi T\sigma^2}}\; e^{-\frac{x^2}{2T\sigma^2}} \qquad (11) \]
\[ \Pr[|z-y|\geq x] \;\leq\; \sqrt{\frac{8T\sigma^2}{\pi}}\;\frac{1}{x}\; e^{-\frac{x^2}{2T\sigma^2}} \qquad (12) \]
\[ E\big[|z-y|\, I(|z-y|\geq x)\big] \;\leq\; \sqrt{\frac{8T\sigma^2}{\pi}}\; e^{-\frac{x^2}{2T\sigma^2}} \qquad (13) \]

Corollary 9 The expected value of the maximum of μiT, 1 ≤ i ≤ n, is Θ(√T).

Note that in the corollary above, n and σ are constants. Now, similar to Lemma 6, we bound E[ΔT]. The proof is given in Appendix B.

Lemma 10 Suppose that under Mϵ(B) all agents are truthful until time T. Then E[ΔT] = O(T^{ϵ/2}).

Now we are ready to prove the main theorem of this section:

Theorem 11 Mϵ(B) is ex-post individually rational. Also, for 0 < ϵ ≤ 1/3, Mϵ(B) is asymptotically incentive compatible and ex-ante efficient.

Proof :  We first prove ex-post individual rationality. It is sufficient to prove it for the periods in which agent i has received the item during an exploit phase. For T such that yiT = 1, we have:
\begin{align*}
\sum_{t=1}^{T} p_{it} &= \sum_{t=1}^{T-1} y_{it}\min\{\hat\gamma_t(t),\hat\gamma_t(T)\} \\
&\leq \sum_{t=1}^{T-1} y_{it}\,\hat\mu_{it}(t) \;=\; \sum_{t=1}^{T-1} y_{it}\, r_{i,l_{i,t-1}} \\
&\leq \sum_{t=1}^{T} x_{it} r_{it}.
\end{align*}

We complete the proof by showing that conditions (C1) and (C2) hold. By (10), the expected utility of each agent at time t from random exploration is
\[ \Theta\big(\sqrt{t\sigma^2}\; t^{-\epsilon}\ln^{2+2\epsilon} t\big) = \Theta\big(t^{\frac{1}{2}-\epsilon}\ln^{2+2\epsilon} t\big). \]
Therefore, the expected utility up to time T from exploration is Θ(T^{3/2−ϵ} ln^{2+2ϵ} T). By Lemma 10 and Corollary 9:
\[ E\Big[\max_{1\leq t\leq T}\{\mu_{it}\} + \sum_{t=1}^{T-1}\Delta_t\Big] = O\big(T^{1+\frac{\epsilon}{2}}\big). \]
For ϵ ≤ 1/3 we have 3/2 − ϵ ≥ 1 + ϵ/2, and this yields condition (C1).

By Corollary 9, the expected values of maxi{μiT} and γT are Θ(√T). Therefore, the expected welfare of an efficient mechanism between time 1 and T is Θ(T^{3/2}). For any 0 < ϵ < 1, we have:
\[ \Theta\big(T^{\frac{3}{2}}\big) = \omega\big(T^{\frac{3}{2}-\epsilon}\ln^{2+2\epsilon}T + T^{1+\frac{\epsilon}{2}}\big) \]
Therefore condition (C2) holds, and Mϵ(B) is asymptotically ex-ante efficient.

To apply this model to sponsored search, we treat each item as a bundle of search queries. Each time step is defined by the arrival of m queries. The mechanism allocates all m queries to an agent and, after that, the advertiser reports the average utility for these queries. The payment pit is now the price per item, i.e., the advertiser pays m·pit for the bundle of queries. The value of m is chosen such that μit can be estimated with high accuracy.

8 Discussion and Open Problems

In this section we discuss some extensions of the mechanisms.

Multiple Slots To modify M so that it can accommodate multiple slots, we borrow from Gonen and Pavlov [13], who assume there exists a set of conditional distributions which determine the probability that the ad in slot j1 is clicked conditional on the ad in slot j2 being clicked. During the exploit phase, M allocates the slots to the advertisers with the highest expected utilities, and the prices are determined according to Holmstrom's lemma ([20]; see also [1]). The estimates of the utilities are updated based on the reports, using the conditional distributions.

Delayed Reports In some applications, the value of receiving the item is realized at some later date. For example, a user clicks on an ad and visits the website of the advertiser. A couple of days later, she returns to the website and completes a transaction. It is not difficult to adjust the mechanism to accommodate this setting by allowing the advertiser to report with a delay or change her report later.

Creating Multiple Identities When a new advertiser joins the system, in order to learn her utility value our mechanism gives her a few items for free in the explore phase. Therefore our mechanism is vulnerable to advertisers who can create several identities and join the system. It is not clear whether creating a new identity is cheap in our context, because the traffic generated by advertising should eventually be routed to a legitimate business. Still, one way to avoid this problem is to charge users without a reliable history using CPC.

Acknowledgment. We would like to thank Arash Asadpour, Peter Glynn, Ashish Goel, Ramesh Johari, and Thomas Weber for fruitful discussions. The second author acknowledges the support from NSF and a gift from Google.

References

[1]   G. Aggarwal, A. Goel, and R. Motwani. Truthful auctions for pricing search keywords. Proceedings of ACM conference on Electronic Commerce, 2006.

[2]   S. Athey, and I. Segal. An Efficient Dynamic Mechanism. manuscript, 2007.

[3]   P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning archive, Volume 47 , Issue 2-3, 235-256, 2002.

[4]   A. Bapna, and T. Weber. Efficient Dynamic Allocation with Uncertain Valuations. Working Paper, 2006.

[5]   M. Balcan, A. Blum, J. Hartline, and Y. Mansour. Mechanism Design via Machine Learning. Proceedings of 46th Annual IEEE Symposium on Foundations of Computer Science, 2005.

[6]   D. Bergemann, and J. Välimäki. Efficient Dynamic Auctions. Proceedings of Third Workshop on Sponsored Search Auctions, 2007.

[7]   A. Borodin, and P. Salminen. Handbook of Brownian Motion: Facts and Formulae. Springer, 2002.

[8]   A. Blum, V. Kumar, A. Rudra, and F. Wu. Online Learning in Online Auctions. Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete Algorithms, 2003.

[9]   R. Cavallo, D. Parkes, and S. Singh, Efficient Online Mechanism for Persistent, Periodically Inaccessible Self-Interested Agents. Working Paper, 2007.

[10]   K. Crawford. Google CFO: Fraud A Big Threat. CNN/Money, December 2, 2004.

[11]   E. Elkind. Designing And Learning Optimal Finite Support Auctions. Proceedings of ACM-SIAM Symposium on Discrete Algorithms, 2007.

[12]   J. Gittins. Multi-Armed Bandit Allocation Indices. Wiley, New York, NY, 1989.

[13]   R. Gonen, and E. Pavlov. An Incentive-Compatible Multi-Armed Bandit Mechanism. Proceedings of the Twenty-Sixth Annual ACM Symposium on Principles of Distributed Computing, 2007.

[14]   B. Grow, B. Elgin, and M. Herbst. Click Fraud: The dark side of online advertising. BusinessWeek. Cover Story, October 2, 2006.

[15]   N. Immorlica, K. Jain, M. Mahdian, and K. Talwar. Click Fraud Resistant Methods for Learning Click-Through Rates. Proceedings of the 1st Workshop on Internet and Network Economics, 2005.

[16]   B. Kitts, P. Laxminarayan, B. LeBlanc, and R. Meech. A Formal Analysis of Search Auctions Including Predictions on Click Fraud and Bidding Tactics. Workshop on Sponsored Search Auctions, 2005.

[17]   R. Kleinberg. Online Decision Problems With Large Strategy Sets. Ph.D. Thesis, MIT, 2005.

[18]   S. Lahaie, and D. Parkes. Applying Learning Algorithms to Preference Elicitation. Proceedings of the 5th ACM conference on Electronic Commerce, 2004.

[19]   M. Mahdian, and K. Tomak. Pay-per-action model for online advertising. Proceedings of the 3rd International Workshop on Internet and Network Economics, 549-557, 2007.

[20]   P. Milgrom, Putting Auction Theory to Work. Cambridge University Press, 2004.

[21]   D. Mitchell. Click Fraud and Halli-bloggers. New York Times, July 16, 2005.

[22]   N. Nisan, T. Roughgarden, E. Tardos, and V. Vazirani, editors. Algorithmic Game Theory, Cambridge University Press, 2007.

[23]   D. Parkes. Online Mechanisms. In Algorithmic Game Theory (Nisan et al., eds.), 2007.

[24]   B. Stone. When Mice Attack: Internet Scammers Steal Money with “Click Fraud”. Newsweek, January 24, 2005.

[25]   R. Wilson. Game-Theoretic Approaches to Trading Processes. Economic Theory: Fifth World Congress, ed. by T. Bewley, chap. 2, pp. 33-77, Cambridge University Press, Cambridge, 1987.

[26]   J. Wortman, Y. Vorobeychik, L. Li, and J. Langford. Maintaining Equilibria During Exploration in Sponsored Search Auctions. Proceedings of the 3rd International Workshop on Internet and Network Economics, 2007.

A Proof of Lemma 6 (Independent and Identically-Distributed Utilities)

Proof :  We prove the lemma by showing that for any agent i,

\[ \Pr\Big[|\mu_i - \hat\mu_{it}(t)| \geq \frac{\mu_i}{\sqrt{t^{1-\epsilon}}}\Big] = o\Big(\frac{1}{t^c}\Big),\quad \forall c > 0. \]

First, we estimate E[nit]. There exists a constant d such that:

\[ E[n_{it}] \;\geq\; \sum_{k=1}^{t-1}\frac{\eta_\epsilon(k)}{n} = \sum_{k=1}^{t-1}\min\Big\{\frac{1}{n},\ k^{-\epsilon}\ln^{1+\epsilon}k\Big\} \;>\; d\, t^{1-\epsilon}\ln^{1+\epsilon}t \]

By the Chernoff-Hoeffding bound:

\[ \Pr\Big[n_{it} \leq \frac{E[n_{it}]}{2}\Big] \;\leq\; e^{-\frac{d\, t^{1-\epsilon}\ln^{1+\epsilon}t}{8}}. \]

Inequality (1) and the Chernoff-Hoeffding bound imply:

\begin{align*}
\Pr\Big[|\mu_i - \hat\mu_{it}(t)| \geq \frac{\mu_i}{\sqrt{t^{1-\epsilon}}}\Big]
&= \Pr\Big[|\mu_i - \hat\mu_{it}(t)| \geq \frac{\mu_i}{\sqrt{t^{1-\epsilon}}} \,\wedge\, n_{it} \geq \frac{E[n_{it}]}{2}\Big] \\
&\quad + \Pr\Big[|\mu_i - \hat\mu_{it}(t)| \geq \frac{\mu_i}{\sqrt{t^{1-\epsilon}}} \,\wedge\, n_{it} < \frac{E[n_{it}]}{2}\Big] \\
&\leq 2e^{-\frac{\mu_i^2}{t^{1-\epsilon}}\cdot\frac{d\, t^{1-\epsilon}\ln^{1+\epsilon}t}{2}} + e^{-\frac{d\, t^{1-\epsilon}\ln^{1+\epsilon}t}{8}} \\
&= o\Big(\frac{1}{t^c}\Big),\quad \forall c > 0
\end{align*}
Therefore, with probability 1 − o(1/t), for all agents, Δt ≤ 1/√(t^{1−ϵ}). Since the maximum value of uit is 1, E[Δt] = O(1/√(t^{1−ϵ})).

B Proof of Lemma 10 (Brownian Motion)

Proof :  Define Xit = |μi,T − μi,T−t|, where t = T − li,T−1 is the time elapsed since the last allocation to agent i before T. We first prove that Pr[Xit > T^{ϵ/2}] = o(1/T^c) for all c > 0. There exists a constant Td such that, for any time T ≥ Td, the probability that i has not been allocated the item during exploration in the last t̄ steps is at most:
\[ \Pr[T - l_{i,T-1} > \bar{t}\,] \;<\; \big(1 - T^{-\epsilon}\ln^{2+2\epsilon}T\big)^{\bar{t}} \;\leq\; e^{-\frac{\bar{t}\,\ln^{2+2\epsilon}T}{T^{\epsilon}}}. \qquad (14) \]
Let t̄ = T^ϵ/ln^{1+ϵ}T. By equations (12) and (14),
\begin{align*}
\Pr\big[X_{it} > T^{\frac{\epsilon}{2}}\big]
&= \Pr\big[X_{it} > T^{\frac{\epsilon}{2}} \,\wedge\, T - l_{i,T-1} \leq \bar{t}\,\big] + \Pr\big[X_{it} > T^{\frac{\epsilon}{2}} \,\wedge\, T - l_{i,T-1} > \bar{t}\,\big] \\
&= o\Big(\frac{1}{T^c}\Big),\quad \forall c > 0.
\end{align*}

Hence, with high probability, Xit ≤ T^{ϵ/2} for all n agents. If for some of the agents Xit ≥ T^{ϵ/2}, then, by Corollary 9, the expected value of the maximum of μit over these agents is Θ(√T). Therefore, E[maxi{Xit}] = O(T^{ϵ/2}). The lemma follows because E[ΔT] ≤ E[maxi{Xit}].