Optimistic knowledge gradient

In statistics, the optimistic knowledge gradient[1] is an approximation policy proposed by Xi Chen, Qihang Lin and Dengyong Zhou in 2013. The policy was created to address the computational intractability of large optimal computing budget allocation problems in binary/multi-class crowd labeling, where each label from the crowd has a certain cost.[2]

Motivation

The optimal computing budget allocation problem is formulated as a Bayesian Markov decision process[3] (MDP). In principle it can be solved with the dynamic programming[4] (DP) algorithm, but DP is computationally intractable for large problems, so the optimistic knowledge gradient policy is used as an approximation.

Consider a budget allocation problem in crowdsourcing. The particular crowdsourcing problem considered here is crowd labeling: a large number of labeling tasks that are hard for machines but easy for human beings are outsourced to an unidentified group of random people in a distributed environment.

Methodology

The goal is to complete these labeling tasks by relying on the power of the crowd. For example, suppose we want to label pictures according to whether the person in each picture is an adult or not. This is a binary (Bernoulli) labeling problem that a human can solve in one or two seconds. However, with tens of thousands of such pictures the task is no longer easy, which is why a crowdsourcing framework is needed to make it fast.

The crowdsourcing framework consists of two steps. In the first step, labels for the items are acquired dynamically from the crowd. Rather than sending every picture to every worker and collecting every response, labels are requested sequentially: at each stage we decide which picture to send next and which worker to hire next, according to the historical labeling results. Each picture can be sent to multiple workers, and each worker can label different pictures. In the second step, after enough labels have been collected for the different pictures, the true label of each picture is inferred from the collected labels. There are multiple ways to do this inference; the simplest is a majority vote.

There is no free lunch, however: each label provided by a worker must be paid for, and the project budget is limited. The question is therefore how to spend the limited budget wisely.
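The majority-vote inference mentioned above can be sketched as follows. This is only an illustration: the item names, the example votes, and the rule of resolving ties toward the positive label are assumptions, not part of the original paper.

```python
def majority_vote(labels):
    """Infer a single label from a list of +1 / -1 votes.

    Ties are resolved toward +1 (an assumption made for this sketch).
    """
    return 1 if sum(labels) >= 0 else -1


# Hypothetical collected labels: each picture was sent to several workers.
collected = {
    "pic_a": [1, 1, -1],
    "pic_b": [-1, -1, 1, -1],
    "pic_c": [1, -1],
}

inferred = {item: majority_vote(votes) for item, votes in collected.items()}
print(inferred)  # {'pic_a': 1, 'pic_b': -1, 'pic_c': 1}
```

Majority voting treats every label equally; the budget-allocation question is which item should receive the next paid label.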

Challenges

Before presenting the mathematical model, the paper describes the challenges involved.

Challenge 1

First, the items differ in how difficult they are to label. In the previous example, some pictures are easy to classify, and for these the labels from the crowd are usually very consistent. If a picture is ambiguous, however, people may disagree with each other, resulting in highly inconsistent labels. More resources may therefore be allocated to such ambiguous tasks.

Challenge 2

A second difficulty is that workers are not perfect. Some workers are irresponsible and simply provide random labels, so the budget should not be spent on such unreliable workers. The problem is that both the difficulty of the pictures and the reliability of the workers are completely unknown at the beginning and can only be estimated during the procedure. This naturally leads to a trade-off between exploration and exploitation. The goal is a reasonably good policy that spends the budget in the right way, that is, one that maximizes the overall accuracy of the final inferred labels.

Mathematical model

For the mathematical model, there are k items, [math]\displaystyle{ i \in \{1,2,\ldots,k\} }[/math], and the total budget is T. Each label is assumed to cost 1, so T labels are collected in total. Each item has a true label [math]\displaystyle{ Z_i }[/math], which is either positive or negative. This is the binary case; the extension to multi-class labeling follows a similar idea. The positive set [math]\displaystyle{ H^* }[/math] is defined as the set of items whose true label is positive. Each item also has a soft-label [math]\displaystyle{ \theta_i }[/math], a number between 0 and 1, defined as the underlying probability of the item being labeled as positive by a member randomly picked from a group of perfect workers.

In this first case, every worker is assumed to be perfect, meaning reliable. Being perfect does not mean that all workers give the same answer, or the right answer; it only means that each tries their best to find the answer they believe is correct. If every worker is perfect and one is picked at random, then with probability [math]\displaystyle{ \theta_i }[/math] that worker believes the item is positive. This is the interpretation of [math]\displaystyle{ \theta_i }[/math]. Accordingly, a label [math]\displaystyle{ Y_i }[/math] is assumed to be drawn from Bernoulli([math]\displaystyle{ \theta_i }[/math]), and [math]\displaystyle{ \theta_i }[/math] must be consistent with the true label: [math]\displaystyle{ \theta_i }[/math] is greater than or equal to 0.5 if and only if the item's true label is positive. The goal is to learn H*, the set of positive items. In other words, we want to build an inferred positive set H from the collected labels so as to maximize:

[math]\displaystyle{ \sum_{i=1}^k (\textbf{1}_{(i\in H)}\textbf{1}_{(i\in H^\star)}+\textbf{1}_{(i\notin H)} \textbf{1}_{(i\notin H^\star)}) }[/math]

It can also be written as:

[math]\displaystyle{ |H\cap H^\star| + |H^c\cap H^{\star c}| }[/math]
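As a small illustration of this objective, the following sketch counts the correctly classified items for a given inferred set. The function name and the example sets are hypothetical.

```python
def labeling_accuracy(H, H_star, k):
    """Return |H ∩ H*| + |H^c ∩ H*^c| for items numbered 1..k."""
    items = set(range(1, k + 1))
    H, H_star = set(H), set(H_star)
    true_positives = H & H_star
    true_negatives = (items - H) & (items - H_star)
    return len(true_positives) + len(true_negatives)


# Items 1 and 2 are correctly inferred positive, item 5 correctly negative.
print(labeling_accuracy({1, 2, 4}, {1, 2, 3}, k=5))  # 3
```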

Step 1: Bayesian decision process

Before presenting the Bayesian framework, the paper uses an example to explain why a Bayesian rather than a frequentist approach is chosen: it allows a prior distribution to be placed on the soft-label [math]\displaystyle{ \theta_i }[/math]. Each [math]\displaystyle{ \theta_i }[/math] is assumed to be drawn from a known Beta prior:

[math]\displaystyle{ \theta_i \sim \mathrm{Beta}(a_i^o,b_i^o) }[/math]

The prior parameters of all items are collected in the matrix:

[math]\displaystyle{ s^o = \left \langle (a_i^o,b_i^o)\right \rangle_{i=1}^k \in \textbf{R}^{k\times2} }[/math]

Because the Beta distribution is the conjugate prior of the Bernoulli distribution, once a new label for item i is obtained, the posterior Beta distribution is updated as follows:

[math]\displaystyle{ \theta_i \sim \mathrm{Beta}(a_i^t,b_i^t) }[/math]
[math]\displaystyle{ y_i\mid \theta_i\sim \mathrm{Bernoulli}(\theta_i) }[/math]
[math]\displaystyle{ \theta_i\mid y_i = 1\sim \mathrm{Beta}(a_i^t+1,b_i^t) }[/math]
[math]\displaystyle{ \theta_i\mid y_i = -1\sim \mathrm{Beta}(a_i^t,b_i^t+1) }[/math]

Which parameter is incremented depends on whether the label is positive or negative.
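The conjugate update can be sketched in a few lines: a positive label increments the first Beta parameter and a negative label increments the second. The function name is hypothetical.

```python
def update_posterior(a, b, y):
    """Return the Beta parameters after observing one label y in {+1, -1}."""
    if y == 1:
        return a + 1, b      # positive label: increment a
    if y == -1:
        return a, b + 1      # negative label: increment b
    raise ValueError("label must be +1 or -1")


# Starting from a uniform Beta(1, 1) prior (an assumption for this example),
# observe the labels +1, +1, -1 for one item.
a, b = 1, 1
for y in (1, 1, -1):
    a, b = update_posterior(a, b, y)
print(a, b)  # 3 2
```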

At a high level, the whole procedure has T stages, [math]\displaystyle{ 0\le t \le T-1 }[/math]. At the current stage we look at the matrix [math]\displaystyle{ s^t }[/math], which summarizes the posterior distribution information for all the [math]\displaystyle{ \theta_i }[/math]:

[math]\displaystyle{ s^t = \left \langle (a_i^t,b_i^t)\right \rangle_{i=1}^k \in \textbf{R}^{k\times2} }[/math]

We then make a decision by choosing the next item to label, [math]\displaystyle{ i_t }[/math], with [math]\displaystyle{ i_t \in \{1,2,\ldots,k\} }[/math].

Depending on whether the received label is positive or negative, the row of the matrix for the chosen item is updated according to the posterior rule:

[math]\displaystyle{ \theta_i \sim \mathrm{Beta}(a_i^t,b_i^t) }[/math]
[math]\displaystyle{ y_i\mid \theta_i\sim \mathrm{Bernoulli}(\theta_i) }[/math]
[math]\displaystyle{ \theta_i\mid y_i = 1\sim \mathrm{Beta}(a_i^t+1,b_i^t) }[/math]
[math]\displaystyle{ \theta_i\mid y_i = -1\sim \mathrm{Beta}(a_i^t,b_i^t+1) }[/math]

This completes the overall framework.
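The sequential procedure can be sketched as a simulation loop. The snippet below is only a skeleton under stated assumptions: the true soft-labels `theta` are known to the simulator, every item starts from a Beta(1, 1) prior, items are indexed from 0, and the placeholder policy `uniform_policy` picks items uniformly at random rather than using the policy proposed in the paper.

```python
import random


def simulate(theta, T, choose_item, seed=0):
    """Run T labeling stages and return the final matrix of Beta parameters."""
    rng = random.Random(seed)
    state = [[1, 1] for _ in theta]          # assumed Beta(1, 1) priors
    for _ in range(T):
        i = choose_item(state, rng)          # decision: which item to label
        y = 1 if rng.random() < theta[i] else -1   # label ~ Bernoulli(theta_i)
        if y == 1:
            state[i][0] += 1                 # positive label updates a_i
        else:
            state[i][1] += 1                 # negative label updates b_i
    return state


def uniform_policy(state, rng):
    """Placeholder policy: choose an item uniformly at random."""
    return rng.randrange(len(state))


final_state = simulate([0.9, 0.5, 0.2], T=30, choose_item=uniform_policy)
print(final_state)
```

Any other allocation policy can be plugged in through the `choose_item` argument, since it only needs the current state to make a decision.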

Step 2: Inference on positive set

Once t labels have been collected, we can infer the positive set [math]\displaystyle{ H_t }[/math] based on the posterior distribution given by [math]\displaystyle{ S^t }[/math]:

[math]\displaystyle{ \begin{align} H_t & = \operatorname{argmax}\limits_{H \subset\{1,2,\ldots,k\}} E \left( \sum_{i=1}^k (\textbf{1}(i\in H)\textbf{1}(i\in H^\star)+\textbf{1}(i\notin H) \textbf{1}(i\notin H^{\star}))\mid S^t\right) \\ & =\operatorname{argmax}\limits_{H \subset\{1,2,\ldots,k\}} \sum_{i=1}^k (\textbf{1}(i\in H) \Pr(i\in H^\star\mid S^t) +\textbf{1}(i\notin H) \Pr(i\notin H^\star \mid S^t)) \\ & =\{i:\Pr(i\in H^\star\mid S^t)\geq0.5\} \end{align} }[/math]

This becomes a binary selection problem: for each item, we check whether the probability of being positive, conditional on [math]\displaystyle{ S^t }[/math], is at least 0.5. If it is, the item is put into the current inferred positive set [math]\displaystyle{ H_t }[/math]. This gives a closed form for the current optimal solution [math]\displaystyle{ H_t }[/math] based on the information in [math]\displaystyle{ S^t }[/math].
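A minimal sketch of this inference step is given below. It assumes integer Beta parameters, for which the posterior probability that the soft-label is at least 0.5 equals a binomial tail sum, so only the standard library is needed. The function names and the example state are hypothetical.

```python
from math import comb


def prob_positive(a, b):
    """Pr(theta >= 0.5) for theta ~ Beta(a, b), with integer a, b >= 1.

    Uses the identity between the Beta CDF and the binomial distribution:
    Pr(theta >= x) = Pr(Binomial(a + b - 1, x) <= a - 1), here with x = 0.5.
    """
    n = a + b - 1
    return sum(comb(n, j) for j in range(a)) / 2 ** n


def infer_positive_set(state):
    """Return the items (numbered from 1) whose posterior probability
    of being positive is at least 0.5."""
    return {i for i, (a, b) in enumerate(state, start=1)
            if prob_positive(a, b) >= 0.5}


state = [(3, 1), (1, 1), (1, 4)]   # rows are (a_i, b_i)
print([prob_positive(a, b) for a, b in state])  # [0.875, 0.5, 0.0625]
print(infer_positive_set(state))                # {1, 2}
```

For non-integer parameters a numerical Beta CDF routine would be needed instead of the binomial sum.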

Having found the optimal solution, the paper then derives the optimal value. Plugging [math]\displaystyle{ H_t }[/math] into the objective gives the optimal value [math]\displaystyle{ \sum_{i=1}^k h(\Pr(i\in H^\star\mid S^t)) }[/math], where

[math]\displaystyle{ h(x) = \max(x,1-x) }[/math]

This is a simple function that chooses the larger of the conditional probabilities of being positive and being negative. Once one more label is obtained for item [math]\displaystyle{ i_t }[/math], we take the difference between this value before and after the new label; the difference simplifies as follows:

[math]\displaystyle{ \begin{align} R(s^t,i_t,y_{i_t}) & = \sum_{i=1}^k h(\Pr(i\in H^\star\mid s^{t+1}))-\sum_{i=1}^k h(\Pr(i\in H^\star\mid s^t)) \\ & = \sum_{i=1}^k h(\Pr(a_i^{t+1},b_i^{t+1}))-\sum_{i=1}^k h(\Pr(a_i^t,b_i^t)). \end{align} }[/math]

The probability of an item being positive depends only on its Beta posterior, so it is a function of the Beta parameters a and b alone, and the reward reduces to:

[math]\displaystyle{ h(\Pr(a_{i_t}^{t+1},b_{i_t}^{t+1}))-h(\Pr(a_{i_t}^t,b_{i_t}^t)) }[/math]

One more label for a particular item changes only that item's posterior, so all terms in the two sums cancel except one. This difference is the change in overall expected accuracy and is defined as the stage-wise reward: the improvement in inference accuracy from one more sample. The label can take two possible values, positive or negative; averaging the reward over these two outcomes gives the expected reward. The knowledge gradient policy chooses the item to label so that the expected reward is maximized:

[math]\displaystyle{ \begin{align} i_t & = \operatorname{argmax}\limits_{i \in\{1,2,\ldots,k\}} E(R(s^t,i,y_i)\mid s^t) \\ & = \operatorname{argmax}\limits_{i \in\{1,2,\ldots,k\}} \left(\frac{a_i^t}{a_i^t+b_i^t} R(s^t,i,1)+\frac{b_i^t}{a_i^t+b_i^t}R(s^t,i,-1)\right) \end{align} }[/math]
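The knowledge gradient selection rule can be sketched as follows, assuming integer Beta parameters and 0-based item indices. The helper names are hypothetical, and `list.index` returns the smallest index among tied maxima, which corresponds to deterministic tie-breaking.

```python
from math import comb


def prob_positive(a, b):
    """Pr(theta >= 0.5) for theta ~ Beta(a, b), integer a, b >= 1."""
    n = a + b - 1
    return sum(comb(n, j) for j in range(a)) / 2 ** n


def h(x):
    return max(x, 1 - x)


def reward(a, b, y):
    """Stage-wise reward R for one more label y on an item with Beta(a, b)."""
    a2, b2 = (a + 1, b) if y == 1 else (a, b + 1)
    return h(prob_positive(a2, b2)) - h(prob_positive(a, b))


def kg_scores(state):
    """Expected stage-wise reward of labeling each item once more."""
    scores = []
    for a, b in state:
        p = a / (a + b)                       # predictive Pr(y = +1)
        scores.append(p * reward(a, b, 1) + (1 - p) * reward(a, b, -1))
    return scores


state = [(1, 1), (2, 1), (3, 3)]
scores = kg_scores(state)
choice = scores.index(max(scores))            # smallest index on ties
print(choice)  # 0
```

In this example the second item has an expected reward of essentially zero: a positive label raises its accuracy term and a negative label lowers it by an offsetting amount.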

When several items attain the maximum, ties must be broken. If ties are broken deterministically, by choosing the smallest index, the policy is not consistent: the inferred positive set [math]\displaystyle{ H_t }[/math] does not converge to the true positive set [math]\displaystyle{ H^* }[/math].

Ties can instead be broken randomly. This works, but the resulting performance is almost the same as that of uniform sampling. The authors' policy is greedier: instead of averaging the two possible stage-wise rewards, it takes the larger of the two, which gives the optimistic knowledge gradient:

[math]\displaystyle{ i_t = \operatorname{argmax}\limits_{i\in\{1,\ldots,k\}} R^+(S^t,i), \qquad R^+(S^t,i) = \max(R(S^t,i,1),R(S^t,i,-1)) }[/math]
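A minimal sketch of the optimistic knowledge gradient selection, under the same assumptions as before (integer Beta parameters, 0-based indices, hypothetical helper names), replaces the average of the two rewards with their maximum:

```python
from math import comb


def prob_positive(a, b):
    """Pr(theta >= 0.5) for theta ~ Beta(a, b), integer a, b >= 1."""
    n = a + b - 1
    return sum(comb(n, j) for j in range(a)) / 2 ** n


def h(x):
    return max(x, 1 - x)


def reward(a, b, y):
    """Stage-wise reward for one more label y in {+1, -1}."""
    a2, b2 = (a + 1, b) if y == 1 else (a, b + 1)
    return h(prob_positive(a2, b2)) - h(prob_positive(a, b))


def optimistic_scores(state):
    """R+ for each item: the larger of the two possible stage-wise rewards."""
    return [max(reward(a, b, 1), reward(a, b, -1)) for a, b in state]


state = [(1, 1), (2, 1), (3, 3)]
scores = optimistic_scores(state)
choice = scores.index(max(scores))
print(scores)  # [0.25, 0.125, 0.15625]
print(choice)  # 0
```

Because the score is the best-case reward, an item such as the second one here, whose two outcomes have rewards of opposite sign, still receives a positive score.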

Under the optimistic knowledge gradient policy, the final inference accuracy converges to 100%. The above assumes that every worker is perfect; in practice, however, workers are not always responsible. For imperfect workers, we again assume k items, [math]\displaystyle{ 1\leq i \leq k }[/math], with soft-labels:

[math]\displaystyle{ \theta_i\in(0,1)\sim \mathrm{Beta}(a_i^o,b_i^o) }[/math]

Here [math]\displaystyle{ \theta_i }[/math] is the probability of item [math]\displaystyle{ i }[/math] being labeled as positive by a perfect worker. There are M workers, [math]\displaystyle{ 1\leq j \leq M }[/math], with [math]\displaystyle{ \rho_j\in (0,1)\sim \mathrm{Beta}(c_j^o,d_j^o) }[/math], where [math]\displaystyle{ \rho_j }[/math] is the probability of worker [math]\displaystyle{ j }[/math] giving the same label as a perfect worker. The distribution of the label [math]\displaystyle{ Z_{ij} }[/math] given by worker [math]\displaystyle{ j }[/math] to item [math]\displaystyle{ i }[/math] is:

[math]\displaystyle{ \Pr(Z_{ij}=1\mid \theta_i,\rho_j) = \Pr(Z_{ij}=1\mid Y_i =1) \Pr(Y_i=1)+\Pr(Z_{ij}=1\mid Y_i = -1)\Pr(Y_i = -1)=\rho_j\theta_i + (1-\rho_j)(1-\theta_i) }[/math]
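The label distribution for an imperfect worker can be illustrated with a short sketch. The function name and the example values are hypothetical.

```python
def prob_label_positive(theta, rho):
    """Pr(Z_ij = 1 | theta_i, rho_j).

    With probability rho the worker agrees with a perfect worker, who labels
    the item positive with probability theta; otherwise the worker gives the
    opposite label.
    """
    return rho * theta + (1 - rho) * (1 - theta)


print(prob_label_positive(0.75, 1.0))  # 0.75: identical to a perfect worker
print(prob_label_positive(0.75, 0.5))  # 0.5: labels carry no information
print(prob_label_positive(0.75, 0.0))  # 0.25: always the opposite label
```

A worker with reliability 0.5 produces positive labels with probability 0.5 whatever the item's soft-label is, which matches the description of a worker who provides random labels.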

The action space is the set of item-worker pairs [math]\displaystyle{ (i,j)\in \{1,2,\ldots,k\}\times\{1,2,\ldots,M\} }[/math], and the entries of the label matrix are [math]\displaystyle{ Z_{ij}\in\{-1,1\} }[/math].

In this setting the posterior is difficult to compute exactly, so variational Bayesian methods[5] can be used to approximate [math]\displaystyle{ \Pr(i\in H^\star\mid S^t) }[/math].

References

  1. Xi Chen, Qihang Lin, Dengyong Zhou. Statistical Decision Making for Optimal Budget Allocation in Crowd Labeling. 16(Jan):1−46, 2015.
  2. Xi Chen, Qihang Lin, Dengyong Zhou. Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28.
  3. Satinder P. Singh. Learning to Solve Markovian Decision Processes.
  4. An Introduction to Dynamic Programming.
  5. Variational-Bayes Repository: a repository of papers, software, and links related to the use of variational methods for approximate Bayesian learning.