Error tolerance (PAC learning)
In PAC learning, error tolerance refers to the ability of an algorithm to learn when the examples it receives have been corrupted in some way. This is a common and important issue, since in many applications it is not possible to access noise-free data. Noise can interfere with the learning process at different levels: the algorithm may receive data that have occasionally been mislabeled, the inputs themselves may be partly corrupted, or the classification of the examples may have been maliciously adulterated.
Notation and the Valiant learning model
In the following, let [math]\displaystyle{ X }[/math] be our [math]\displaystyle{ n }[/math]-dimensional input space. Let [math]\displaystyle{ \mathcal{H} }[/math] be a class of functions that we wish to use in order to learn a [math]\displaystyle{ \{0,1\} }[/math]-valued target function [math]\displaystyle{ f }[/math] defined over [math]\displaystyle{ X }[/math]. Let [math]\displaystyle{ \mathcal{D} }[/math] be the distribution of the inputs over [math]\displaystyle{ X }[/math]. The goal of a learning algorithm [math]\displaystyle{ \mathcal{A} }[/math] is to choose a function [math]\displaystyle{ h \in \mathcal{H} }[/math] that minimizes [math]\displaystyle{ error(h) = P_{x \sim \mathcal{D} }( h(x) \neq f(x)) }[/math]. Let us suppose we have a function [math]\displaystyle{ size(f) }[/math] that measures the complexity of [math]\displaystyle{ f }[/math]. Let [math]\displaystyle{ \text{Oracle}(x) }[/math] be an oracle that, whenever called, returns an example [math]\displaystyle{ x }[/math] and its correct label [math]\displaystyle{ f(x) }[/math].
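Since [math]\displaystyle{ \mathcal{D} }[/math] is generally not available in closed form, [math]\displaystyle{ error(h) }[/math] is in practice estimated from samples. A minimal sketch of this quantity (in Python, assuming for concreteness a uniform distribution over [math]\displaystyle{ \{0,1\}^3 }[/math]; all names are illustrative):

```python
import random

def estimate_error(h, f, draw_x, num_samples=10_000):
    """Monte Carlo estimate of error(h) = P_{x ~ D}(h(x) != f(x)).
    draw_x is assumed to sample one input x from the distribution D."""
    disagreements = sum(h(x) != f(x) for x in (draw_x() for _ in range(num_samples)))
    return disagreements / num_samples

# Illustration: D uniform over {0,1}^3, target f = AND of all bits,
# hypothesis h looks only at the first bit.
draw_x = lambda: tuple(random.randint(0, 1) for _ in range(3))
f = lambda x: int(all(x))
h = lambda x: x[0]
print(estimate_error(h, f, draw_x))  # close to 3/8: h errs when x[0] = 1 but x != (1, 1, 1)
```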
When no noise corrupts the data, we can define learning in the Valiant setting:[1][2]
Definition: We say that [math]\displaystyle{ f }[/math] is efficiently learnable using [math]\displaystyle{ \mathcal{H} }[/math] in the Valiant setting if there exists a learning algorithm [math]\displaystyle{ \mathcal{A} }[/math] that has access to [math]\displaystyle{ \text{Oracle}(x) }[/math] and a polynomial [math]\displaystyle{ p(\cdot,\cdot,\cdot,\cdot) }[/math] such that for any [math]\displaystyle{ 0 \lt \varepsilon \leq 1 }[/math] and [math]\displaystyle{ 0 \lt \delta \leq 1 }[/math] it outputs, in a number of calls to the oracle bounded by [math]\displaystyle{ p\left(\frac{1}{\varepsilon},\frac{1}{\delta},n,\text{size}(f)\right) }[/math] , a function [math]\displaystyle{ h \in \mathcal{H} }[/math] that satisfies with probability at least [math]\displaystyle{ 1-\delta }[/math] the condition [math]\displaystyle{ \text{error}(h) \leq \varepsilon }[/math].
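As a concrete instance, monotone conjunctions are efficiently learnable in this setting by Valiant's elimination algorithm: start from the conjunction of all variables and delete every variable that appears as 0 in some positive example. A minimal sketch (Python; the uniform input distribution and the helper names are assumptions made for illustration):

```python
import random

def oracle(f, n):
    """Noise-free Oracle(x): an example x drawn from D (uniform here) with its correct label f(x)."""
    x = tuple(random.randint(0, 1) for _ in range(n))
    return x, f(x)

def learn_monotone_conjunction(n, f, num_calls):
    """Elimination learner: keep only the variables never seen as 0 in a positive example."""
    relevant = set(range(n))          # hypothesis starts as the conjunction of all variables
    for _ in range(num_calls):
        x, label = oracle(f, n)
        if label == 1:                # a positive example rules out variables set to 0
            relevant = {i for i in relevant if x[i] == 1}
    return lambda x: int(all(x[i] == 1 for i in relevant))

# Target: the conjunction of bits 0 and 2 over n = 4 Boolean variables.
f = lambda x: int(x[0] == 1 and x[2] == 1)
h = learn_monotone_conjunction(4, f, num_calls=200)
print(h((1, 0, 1, 0)), h((0, 1, 1, 1)))  # 1 and 0 with high probability
```

The hypothesis never errs on negative examples, so its error comes only from surviving irrelevant variables, which each positive example prunes independently; this is what makes a polynomial number of oracle calls suffice.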
In the following we will define learnability of [math]\displaystyle{ f }[/math] when data have suffered some modification.[3][4][5]
Classification noise
In the classification noise model[6] a noise rate [math]\displaystyle{ 0 \leq \eta \lt \frac{1}{2} }[/math] is introduced. Then, instead of [math]\displaystyle{ \text{Oracle}(x) }[/math], which always returns the correct label of example [math]\displaystyle{ x }[/math], algorithm [math]\displaystyle{ \mathcal{A} }[/math] can only call a faulty oracle [math]\displaystyle{ \text{Oracle}(x,\eta) }[/math] that flips the label of [math]\displaystyle{ x }[/math] with probability [math]\displaystyle{ \eta }[/math]. As in the Valiant case, the goal of a learning algorithm [math]\displaystyle{ \mathcal{A} }[/math] is to choose a function [math]\displaystyle{ h \in \mathcal{H} }[/math] that minimizes [math]\displaystyle{ error(h) = P_{x \sim \mathcal{D} }( h(x) \neq f(x)) }[/math]. In applications it is difficult to access the real value of [math]\displaystyle{ \eta }[/math], but we assume access to an upper bound [math]\displaystyle{ \eta_B }[/math] on it.[7] Note that if we allow the noise rate to be [math]\displaystyle{ 1/2 }[/math], then learning becomes impossible regardless of the available computation time, because every label conveys no information about the target function.
Definition: We say that [math]\displaystyle{ f }[/math] is efficiently learnable using [math]\displaystyle{ \mathcal{H} }[/math] in the classification noise model if there exists a learning algorithm [math]\displaystyle{ \mathcal{A} }[/math] that has access to [math]\displaystyle{ \text{Oracle}(x,\eta) }[/math] and a polynomial [math]\displaystyle{ p(\cdot,\cdot,\cdot,\cdot,\cdot) }[/math] such that for any [math]\displaystyle{ 0 \leq \eta \lt \frac{1}{2} }[/math], [math]\displaystyle{ 0\lt \varepsilon \leq 1 }[/math] and [math]\displaystyle{ 0\lt \delta \leq 1 }[/math] it outputs, in a number of calls to the oracle bounded by [math]\displaystyle{ p\left(\frac{1}{1-2\eta_B}, \frac{1}{\varepsilon},\frac{1}{\delta},n,size(f)\right) }[/math], a function [math]\displaystyle{ h \in \mathcal{H} }[/math] that satisfies with probability at least [math]\displaystyle{ 1-\delta }[/math] the condition [math]\displaystyle{ error(h) \leq \varepsilon }[/math].
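One standard way to meet this definition for a finite class [math]\displaystyle{ \mathcal{H} }[/math], due to Angluin and Laird,[7] is to draw a sufficiently large sample from the faulty oracle and output the hypothesis with the fewest disagreements. A minimal sketch (Python; the uniform distribution, the fixed sample size, and the tiny hypothesis class are assumptions made for illustration):

```python
import random

def noisy_oracle(f, n, eta):
    """Oracle(x, eta): a correctly drawn example whose label is flipped with probability eta."""
    x = tuple(random.randint(0, 1) for _ in range(n))
    label = f(x)
    if random.random() < eta:
        label = 1 - label             # classification noise corrupts only the label
    return x, label

def minimize_disagreements(hypotheses, sample):
    """Output the hypothesis with the fewest disagreements on the noisy sample;
    for eta < 1/2 and a large enough sample this is close to the target."""
    return min(hypotheses, key=lambda h: sum(h(x) != y for x, y in sample))

n, eta = 3, 0.2
f = lambda x: x[1]                                      # target: the second bit
hypotheses = [lambda x, i=i: x[i] for i in range(n)]    # H: the n single-bit functions
sample = [noisy_oracle(f, n, eta) for _ in range(500)]
h = minimize_disagreements(hypotheses, sample)          # recovers x -> x[1] w.h.p.
```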
Statistical query learning
Statistical Query Learning[8] is a kind of active learning problem in which the learning algorithm [math]\displaystyle{ \mathcal{A} }[/math] can decide whether to request information about the likelihood [math]\displaystyle{ P_{f(x)} }[/math] that a function [math]\displaystyle{ f }[/math] correctly labels example [math]\displaystyle{ x }[/math], and receives an answer accurate within a tolerance [math]\displaystyle{ \alpha }[/math]. Formally, whenever the learning algorithm [math]\displaystyle{ \mathcal{A} }[/math] calls the oracle [math]\displaystyle{ \text{Oracle}(x,\alpha) }[/math], it receives as feedback probability [math]\displaystyle{ Q_{f(x)} }[/math], such that [math]\displaystyle{ Q_{f(x)} - \alpha \leq P_{f(x)} \leq Q_{f(x)} + \alpha }[/math].
Definition: We say that [math]\displaystyle{ f }[/math] is efficiently learnable using [math]\displaystyle{ \mathcal{H} }[/math] in the statistical query learning model if there exists a learning algorithm [math]\displaystyle{ \mathcal{A} }[/math] that has access to [math]\displaystyle{ \text{Oracle}(x,\alpha) }[/math] and polynomials [math]\displaystyle{ p(\cdot,\cdot,\cdot) }[/math], [math]\displaystyle{ q(\cdot,\cdot,\cdot) }[/math], and [math]\displaystyle{ r(\cdot,\cdot,\cdot) }[/math] such that for any [math]\displaystyle{ 0 \lt \varepsilon \leq 1 }[/math] the following hold:
- [math]\displaystyle{ \text{Oracle}(x,\alpha) }[/math] can evaluate [math]\displaystyle{ P_{f(x)} }[/math] in time [math]\displaystyle{ q\left(\frac{1}{\varepsilon},n,size(f)\right) }[/math];
- [math]\displaystyle{ \frac{1}{\alpha} }[/math] is bounded by [math]\displaystyle{ r\left(\frac{1}{\varepsilon},n,size(f)\right) }[/math];
- [math]\displaystyle{ \mathcal{A} }[/math] outputs a function [math]\displaystyle{ h \in \mathcal{H} }[/math] such that [math]\displaystyle{ error(h)\lt \varepsilon }[/math], in a number of calls to the oracle bounded by [math]\displaystyle{ p\left(\frac{1}{\varepsilon},n,size(f)\right) }[/math].
Note that the confidence parameter [math]\displaystyle{ \delta }[/math] does not appear in the definition of learning. The main purpose of [math]\displaystyle{ \delta }[/math] is to allow the learning algorithm a small probability of failure due to an unrepresentative sample; since [math]\displaystyle{ \text{Oracle}(x,\alpha) }[/math] is guaranteed to meet the approximation criterion [math]\displaystyle{ Q_{f(x)} - \alpha \leq P_{f(x)} \leq Q_{f(x)} + \alpha }[/math], a failure probability is no longer needed.
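In practice a statistical query oracle can itself be simulated from random examples: by a Hoeffding bound, averaging a query over roughly [math]\displaystyle{ \ln(2/\delta)/(2\alpha^2) }[/math] samples meets the tolerance [math]\displaystyle{ \alpha }[/math] except with probability [math]\displaystyle{ \delta }[/math], which is the route by which SQ algorithms yield noise-tolerant PAC algorithms. A minimal sketch (Python; the query, target, and distribution are illustrative assumptions):

```python
import math
import random

def stat_oracle(chi, f, draw_x, alpha, delta=0.01):
    """Simulated SQ oracle: returns Q with |Q - P[chi(x, f(x)) = 1]| <= alpha,
    except with probability delta; the sample size follows from Hoeffding's bound."""
    m = math.ceil(math.log(2 / delta) / (2 * alpha ** 2))
    return sum(chi(x, f(x)) for x in (draw_x() for _ in range(m))) / m

# Illustrative query: how often does a positive example have its first bit set?
n = 3
draw_x = lambda: tuple(random.randint(0, 1) for _ in range(n))
f = lambda x: int(x[0] == 1 and x[1] == 1)      # target concept
chi = lambda x, y: int(x[0] == 1 and y == 1)    # predicate over (example, label)
print(stat_oracle(chi, f, draw_x, alpha=0.05))  # close to 1/4 under the uniform D
```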
The statistical query model is strictly weaker than the PAC model: any efficiently SQ-learnable class is efficiently PAC learnable in the presence of classification noise, but there exist efficient PAC-learnable problems such as parity that are not efficiently SQ-learnable.[8]
Malicious classification
In the malicious classification model[9] an adversary generates errors to foil the learning algorithm. This setting describes situations such as error bursts, which may occur when transmission equipment malfunctions repeatedly for a limited time. Formally, algorithm [math]\displaystyle{ \mathcal{A} }[/math] calls an oracle [math]\displaystyle{ \text{Oracle}(x,\beta) }[/math] that with probability [math]\displaystyle{ 1- \beta }[/math] returns a correctly labeled example [math]\displaystyle{ x }[/math] drawn, as usual, from distribution [math]\displaystyle{ \mathcal{D} }[/math] over the input space, but with probability [math]\displaystyle{ \beta }[/math] returns an example drawn from a distribution that is not related to [math]\displaystyle{ \mathcal{D} }[/math]. Moreover, this maliciously chosen example may be strategically selected by an adversary who has knowledge of [math]\displaystyle{ f }[/math], [math]\displaystyle{ \beta }[/math], [math]\displaystyle{ \mathcal{D} }[/math], or the current progress of the learning algorithm.
Definition: Given a bound [math]\displaystyle{ \beta_B\lt \frac{1}{2} }[/math] on [math]\displaystyle{ 0 \leq \beta \lt \frac{1}{2} }[/math], we say that [math]\displaystyle{ f }[/math] is efficiently learnable using [math]\displaystyle{ \mathcal{H} }[/math] in the malicious classification model if there exists a learning algorithm [math]\displaystyle{ \mathcal{A} }[/math] that has access to [math]\displaystyle{ \text{Oracle}(x,\beta) }[/math] and a polynomial [math]\displaystyle{ p(\cdot,\cdot,\cdot,\cdot,\cdot) }[/math] such that for any [math]\displaystyle{ 0 \lt \varepsilon \leq 1 }[/math], [math]\displaystyle{ 0 \lt \delta \leq 1 }[/math] it outputs, in a number of calls to the oracle bounded by [math]\displaystyle{ p\left(\frac{1}{1/2 - \beta_B},\frac{1}{\varepsilon},\frac{1}{\delta},n,size(f)\right) }[/math], a function [math]\displaystyle{ h \in \mathcal{H} }[/math] that satisfies with probability at least [math]\displaystyle{ 1-\delta }[/math] the condition [math]\displaystyle{ error(h) \leq \varepsilon }[/math].
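A minimal sketch of the faulty example source in this model (Python, uniform distribution; the "adversary" below simply inverts the label, whereas the model allows it to return any maliciously chosen pair):

```python
import random

def malicious_oracle(f, n, beta):
    """Oracle(x, beta): with probability 1 - beta, a correctly labeled example from D;
    with probability beta, a pair chosen by the adversary (here a mislabeled point,
    though any example unrelated to D is permitted)."""
    x = tuple(random.randint(0, 1) for _ in range(n))
    if random.random() < beta:
        return x, 1 - f(x)            # stand-in adversarial answer
    return x, f(x)

sample = [malicious_oracle(lambda x: x[0], n=3, beta=0.1) for _ in range(5)]
```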
Errors in the inputs: nonuniform random attribute noise
In the nonuniform random attribute noise[10][11] model the algorithm learns a Boolean function; a malicious oracle [math]\displaystyle{ \text{Oracle}(x,\nu) }[/math] may flip each [math]\displaystyle{ i }[/math]-th bit of example [math]\displaystyle{ x=(x_1,x_2,\ldots,x_n) }[/math] independently with probability [math]\displaystyle{ \nu_i \leq \nu }[/math].
This type of error can irreparably foil the algorithm; in fact, the following theorem holds:
Theorem: In the nonuniform random attribute noise setting, an algorithm [math]\displaystyle{ \mathcal{A} }[/math] can output a function [math]\displaystyle{ h \in \mathcal{H} }[/math] such that [math]\displaystyle{ error(h)\lt \varepsilon }[/math] only if [math]\displaystyle{ \nu \lt 2\varepsilon }[/math].
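A minimal sketch of such an oracle (Python; the uniform distribution and the per-bit rates are illustrative, and the label is computed on the clean input before the attribute bits are corrupted):

```python
import random

def attribute_noise_oracle(f, nus):
    """Oracle(x, nu): the label is computed on the clean input, then bit i of x
    is flipped independently with probability nus[i] (each nus[i] <= nu)."""
    x = [random.randint(0, 1) for _ in range(len(nus))]
    label = f(tuple(x))                                  # correct label, clean attributes
    noisy_x = tuple(b ^ (random.random() < nu) for b, nu in zip(x, nus))
    return noisy_x, label

x, y = attribute_noise_oracle(lambda x: x[0], nus=[0.05, 0.1, 0.0])
```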
References
- ↑ Valiant, L. G. (August 1985). Learning disjunctions of conjunctions. In IJCAI (pp. 560–566).
- ↑ Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134–1142.
- ↑ Laird, P. D. (1988). Learning from good and bad data. Kluwer Academic Publishers.
- ↑ Kearns, M. (1998). Efficient noise-tolerant learning from statistical queries. Journal of the ACM, 45(6), 983–1006.
- ↑ Brunk, C. A., & Pazzani, M. J. (1991). An investigation of noise-tolerant relational concept learning algorithms. In Proceedings of the 8th International Workshop on Machine Learning.
- ↑ Kearns, M. J., & Vazirani, U. V. (1994). An introduction to computational learning theory (chapter 5). MIT Press.
- ↑ Angluin, D., & Laird, P. (1988). Learning from noisy examples. Machine Learning, 2(4), 343–370.
- ↑ Kearns, M. (1998). Efficient noise-tolerant learning from statistical queries. Journal of the ACM, 45(6), 983–1006. www.cis.upenn.edu/~mkearns/papers/sq-journal.pdf
- ↑ Kearns, M., & Li, M. (1993). Learning in the presence of malicious errors. SIAM Journal on Computing, 22(4), 807–837. www.cis.upenn.edu/~mkearns/papers/malicious.pdf
- ↑ Goldman, S. A., & Sloan, R. H. (1991). The difficulty of random attribute noise. Technical Report WUCS-91-29, Washington University, Department of Computer Science.
- ↑ Sloan, R. H. (1989). Computational learning theory: New models and algorithms (Doctoral dissertation, Massachusetts Institute of Technology).
Original source: https://en.wikipedia.org/wiki/Error tolerance (PAC learning).