Barron space

In functional analysis, the Barron space is a Banach space of functions that originated from the study of the universal approximation properties of two-layer neural networks. It has applications in approximation theory and statistical learning theory.

It is named after Andrew R. Barron, who did foundational work on the functional analysis of two-layer neural networks, though he did not define the Barron space in his works.[1]

Setup

We quote the following universal approximation theorem:

Universal approximation theorem — Let $C(X, \mathbb{R}^m)$ denote the set of continuous functions from a subset $X$ of a Euclidean space $\mathbb{R}^n$ to a Euclidean space $\mathbb{R}^m$. Let $\sigma \in C(\mathbb{R}, \mathbb{R})$. Note that $(\sigma \circ x)_i = \sigma(x_i)$, so $\sigma \circ x$ denotes $\sigma$ applied to each component of $x$.

Then $\sigma$ is not polynomial if and only if for every $n \in \mathbb{N}$, $m \in \mathbb{N}$, compact $K \subseteq \mathbb{R}^n$, $f \in C(K, \mathbb{R}^m)$, $\varepsilon > 0$, there exist $k \in \mathbb{N}$, $A \in \mathbb{R}^{k \times n}$, $b \in \mathbb{R}^k$, $C \in \mathbb{R}^{m \times k}$ such that
$$\sup_{x \in K} \| f(x) - g(x) \| < \varepsilon$$
where $g(x) = C \cdot (\sigma \circ (A \cdot x + b))$.

In words, given a subset $X \subseteq \mathbb{R}^n$ and a fixed activation function $\sigma$ that is not a polynomial function, any continuous function of type $X \to \mathbb{R}^m$ can be approximated by a two-layer neural network: a linear layer $(A, b)$, followed by the nonlinear activation $\sigma$, followed by another linear layer $C$. Furthermore, given any compact subset $K \subseteq X$, the approximation can be made arbitrarily good in the uniform norm.
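The shape of such a two-layer network can be sketched in a few lines of numpy; the sizes $n$, $k$, $m$ and the choice of $\tanh$ as a non-polynomial activation below are illustrative assumptions, not values from the theorem:

```python
import numpy as np

# Sketch of the two-layer network g(x) = C·(σ∘(A·x + b)) from the theorem,
# with illustrative sizes n=3 (input), k=8 (hidden), m=2 (output).
rng = np.random.default_rng(0)
n, k, m = 3, 8, 2
A = rng.standard_normal((k, n))   # first linear layer, A ∈ R^{k×n}
b = rng.standard_normal(k)        # bias, b ∈ R^k
C = rng.standard_normal((m, k))   # second linear layer, C ∈ R^{m×k}

sigma = np.tanh  # any fixed non-polynomial activation works in the theorem

def g(x):
    """Apply sigma componentwise between the two linear maps."""
    return C @ sigma(A @ x + b)

y = g(np.ones(n))
print(y.shape)
```

The theorem says that, by choosing $k$ large enough and tuning $A$, $b$, $C$, maps of this form can approximate any continuous function uniformly on a compact set.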

Usually we only consider the case where the neural network has a single output, that is, the case where $m = 1$, since for multiple outputs each output can be approximated separately. We assume $m = 1$ for the rest of the article.

Number of hidden neurons

In the statement of the theorem, the middle layer is the hidden layer. The number k is the number of neurons in the hidden layer. These neurons are called hidden neurons.

Given a compact set $X \subseteq \mathbb{R}^n$, to approximate a generic continuous function to an accuracy of $\epsilon$ in the uniform norm over $X$, $O(\epsilon^{-n})$ hidden neurons are needed. This is a manifestation of the curse of dimensionality.

In a 1993 paper, Barron showed that a large class of continuous functions is much more approximable than a generic continuous function. Specifically, he showed that there is a set of continuous functions such that, given any Borel probability measure $\mu$, only $O(\epsilon^{-2})$ hidden neurons are needed to approximate any function $f$ in this set to an accuracy of $\epsilon$ in the $L^2(\mu)$ norm. In this sense, these functions are nice, in that they can be efficiently approximated by a neural network without hitting the curse of dimensionality.[2][3]

Definition

It is natural to consider the infinite-width limit, where the summation turns into an integral:
$$f(x) := \int c \, \sigma(a^T x + b) \, \rho(da, db, dc)$$
where $(a, b, c)$ takes values in $\mathbb{R}^n \times \mathbb{R} \times \mathbb{R}$, and $\rho$ is a probability distribution over $\mathbb{R}^n \times \mathbb{R} \times \mathbb{R}$.
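A finite-width network can be viewed as a Monte Carlo estimate of this integral: sample neurons from $\rho$ and average. A minimal sketch, assuming (purely for illustration) that $\rho$ is a standard Gaussian on $\mathbb{R}^n \times \mathbb{R} \times \mathbb{R}$:

```python
import numpy as np

# A finite-width network as a Monte Carlo estimate of the integral
# representation f(x) = ∫ c·σ(aᵀx + b) ρ(da, db, dc).
# The distribution ρ below (iid standard Gaussians) is an arbitrary
# illustrative choice, not one singled out by the theory.
rng = np.random.default_rng(1)
n, M = 2, 10_000

a = rng.standard_normal((M, n))  # (a_i, b_i, c_i) ~ ρ
b = rng.standard_normal(M)
c = rng.standard_normal(M)

relu = lambda z: np.maximum(z, 0.0)

def f_M(x):
    # Empirical average over the M sampled neurons approximates the integral.
    return np.mean(c * relu(a @ x + b))

value = f_M(np.ones(n))
print(value)
```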

Different ρ may lead to the same f. That is, the representation of a function as an infinite-width neural network is not unique. However, among these, one may be selected as having lowest regularization loss, as usual in statistical learning theory.

ReLU activation

If $\sigma$ is the ReLU function, the Barron norm is defined as follows.

For any $p \in [1, \infty)$, define the regularization loss of a representation $\rho$ as
$$\|\rho\|_p := \mathbb{E}_{(a,b,c) \sim \rho}\big[ |c \, (\|a\|_1 + |b|)|^p \big]^{1/p}$$
and, if $p = \infty$,
$$\|\rho\|_\infty := \sup_{(a,b,c) \in \operatorname{supp} \rho} |c \, (\|a\|_1 + |b|)|$$
This is defined in analogy to $L^p$ spaces, and is motivated by the previous result on the number of hidden neurons.

The special case $p = 1$ is also called the path norm, since it is interpreted as the path weight $|c \, (\|a\|_1 + |b|)|$, averaged across all paths from the inputs to the output of the neural network.
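For a finite network, treating its $M$ neurons as an empirical distribution $\rho$, the path norm reduces to a simple average; a minimal sketch with arbitrary illustrative weights:

```python
import numpy as np

# Path norm (p = 1 regularization loss) of a finite network, viewing the
# M neurons as an empirical distribution ρ:
#   ‖ρ‖₁ = (1/M) Σᵢ |cᵢ| (‖aᵢ‖₁ + |bᵢ|)
rng = np.random.default_rng(2)
n, M = 4, 16
a = rng.standard_normal((M, n))  # weights of M hidden neurons (illustrative)
b = rng.standard_normal(M)
c = rng.standard_normal(M)

path_norm = np.mean(np.abs(c) * (np.abs(a).sum(axis=1) + np.abs(b)))
print(path_norm)
```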

The $p$-Barron norm of $f$ is defined as
$$\|f\|_{\mathcal{B}_p} := \inf_{\rho \text{ represents } f} \|\rho\|_p$$
The $p$-Barron space over $X \subseteq \mathbb{R}^n$ is the set of continuous functions of type $X \to \mathbb{R}$ with finite $p$-Barron norm.

It can be proven that if $\|f\|_{\mathcal{B}_p} < \infty$ for some $p \in [1, \infty]$, then $\|f\|_{\mathcal{B}_p}$ is the same for all $p \in [1, \infty]$. Therefore, all these norms coincide; we drop the $p$, call their common value the Barron norm, and call the space of such functions the Barron space, written $\mathcal{B}$.[1]

The Barron space is a Banach space.[3]: Thm. 2.3 

Non-ReLU activation

If $\sigma$ is not the ReLU function, then define the $p$-extended Barron norm:
$$\|\rho\|_p := \mathbb{E}_{(a,b,c) \sim \rho}\big[ |c \, (\|a\|_1 + |b| + 1)|^p \big]^{1/p}$$
$$\|f\|_{\tilde{\mathcal{B}}_p} := \inf_{\rho \text{ represents } f} \|\rho\|_p$$
Similarly, define the $p$-extended Barron spaces.

In general, they are not the same for different values of p.

Multilayer version

There is a generalization for multilayer neural networks with ReLU activations.[4]

Properties

Basic properties

Theorem.[1]: Thm. 1  Given $f \in \mathcal{B}$, for any positive integer $M$, there exists a two-layer ReLU network with $M$ hidden neurons
$$f_M(x) = \frac{1}{M} \sum_{i=1}^M c_i \, \mathrm{ReLU}(a_i \cdot x + b_i)$$
such that $\|f_M - f\|_{L^2(\Omega)}^2 \le \frac{3 \|f\|_{\mathcal{B}}^2}{M}$, and $\frac{1}{M} \sum_{i=1}^M |c_i| \, (\|a_i\|_1 + |b_i|) \le 2 \|f\|_{\mathcal{B}}$.
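The $O(1/M)$ rate can be illustrated numerically by fixing a target that is an average of many "teacher" ReLU neurons and approximating it with $M$ neurons subsampled from the teacher; all sizes and distributions below are illustrative assumptions, and this is only a sketch of the Monte Carlo intuition, not the construction used in the proof:

```python
import numpy as np

# Target: average of N teacher ReLU neurons. Approximant: average of M
# neurons sampled uniformly from the teacher. The squared L² error
# (estimated on random test points) should shrink roughly like 1/M.
rng = np.random.default_rng(3)
n, N = 2, 20_000
a = rng.standard_normal((N, n))  # teacher weights (illustrative Gaussians)
b = rng.standard_normal(N)
c = rng.standard_normal(N)
relu = lambda z: np.maximum(z, 0.0)

X = rng.standard_normal((500, n))               # test points for the L² estimate
target = (relu(X @ a.T + b) * c).mean(axis=1)   # target values on test points

def sq_error(M):
    idx = rng.integers(0, N, size=M)            # sample M neurons from the teacher
    approx = (relu(X @ a[idx].T + b[idx]) * c[idx]).mean(axis=1)
    return np.mean((approx - target) ** 2)

e10, e1000 = sq_error(10), sq_error(1000)
print(e10, e1000)  # the error drops as M grows
```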

Theorem.[1]: Thm. 2  (converse to the previous theorem) For any $f$ continuous on $X \subseteq \mathbb{R}^n$, if there exists a sequence of two-layer ReLU networks $f_M$ with $M = 1, 2, \dots$ hidden neurons, converging to $f$ pointwise, whose path norms are uniformly bounded by a single constant $C$:
$$\frac{1}{M} \sum_{i=1}^M |c_i| \, (\|a_i\|_1 + |b_i|) \le C, \quad M = 1, 2, 3, \dots$$
then $f \in \mathcal{B}$ and $\|f\|_{\mathcal{B}} \le C$.

Harmonic analysis

Theorem. For any $f$ continuous on $X \subseteq \mathbb{R}^n$, define
$$\gamma(f) := \inf_{\hat{f}} \int_{\mathbb{R}^n} \|\omega\|_1^2 \, |\hat{f}(\omega)| \, d\omega$$
where $\hat{f}$ ranges over the Fourier transforms of all possible extensions of $f$ to all of $\mathbb{R}^n$. Then, if $\gamma(f) < \infty$, we have $f \in \mathcal{B}$.

Furthermore, we have the explicit upper bound:[1]: Prop. 2
$$\|f\|_{\mathcal{B}} \le 2 \gamma(f) + 2 \|\nabla f(0)\|_1 + 2 |f(0)|$$

Statistical learning theory

Let $S = \{z_1, z_2, \dots, z_m\} \subset Z$ be a sample of points and consider a function class $\mathcal{F}$ of real-valued functions over $Z$. Then, the empirical Rademacher complexity of $\mathcal{F}$ given $S$ is defined as:

$$\mathrm{Rad}_S(\mathcal{F}) = \frac{1}{m} \mathbb{E}_\sigma \left[ \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^m \sigma_i f(z_i) \right| \right]$$

where the $\sigma_i$ are i.i.d. uniform random signs.

Theorem.[1]: Thm. 3  For any $C > 0$, let $\mathcal{F}_C := \{ f \in \mathcal{B} : \|f\|_{\mathcal{B}} \le C \}$. Then
$$\mathrm{Rad}_S(\mathcal{F}_C) \le 2C \sqrt{\frac{2 \ln(2n)}{|S|}}$$
where, as a reminder, $n$ is the number of dimensions of the domain of $f$.
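The expectation over signs in $\mathrm{Rad}_S$ can be estimated by Monte Carlo; a minimal sketch for a small hand-picked class of bounded functions (not the Barron ball $\mathcal{F}_C$ itself, which is infinite-dimensional):

```python
import numpy as np

# Monte Carlo estimate of the empirical Rademacher complexity
#   Rad_S(F) = (1/m) E_σ [ sup_{f∈F} | Σᵢ σᵢ f(zᵢ) | ]
# for a small illustrative class F evaluated on a sample S ⊂ R.
rng = np.random.default_rng(4)
m = 50
S = rng.standard_normal(m)                      # sample z_1, ..., z_m

# F: a handful of functions bounded by 1 (illustrative choice)
F_vals = np.stack([np.sin(S), np.cos(S), np.tanh(S), 0.5 * S / (1 + np.abs(S))])

trials = 2000
sigmas = rng.choice([-1.0, 1.0], size=(trials, m))  # Rademacher sign vectors
sups = np.abs(sigmas @ F_vals.T).max(axis=1)        # sup over F for each draw
rad = sups.mean() / m
print(rad)
```

Since every function in the class is bounded by 1 in absolute value, the estimate must land in $[0, 1]$; richer classes yield larger values.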

This result shows that the space of functions bounded in Barron norm has low Rademacher complexity, which, according to statistical learning theory, means they are highly learnable. This parallels the fact that they are well approximable by a network with few hidden neurons.

References

  1. E, Weinan; Ma, Chao; Wu, Lei (2022-02-01). "The Barron Space and the Flow-Induced Function Spaces for Neural Network Models". Constructive Approximation 55 (1): 369–406. doi:10.1007/s00365-021-09549-y. ISSN 1432-0940.
  2. Barron, A.R. (May 1993). "Universal approximation bounds for superpositions of a sigmoidal function". IEEE Transactions on Information Theory 39 (3): 930–945. doi:10.1109/18.256500. ISSN 0018-9448. Bibcode: 1993ITIT...39..930B.
  3. E, Weinan; Wojtowytsch, Stephan (April 2022). "Representation formulas and pointwise properties for Barron functions". Calculus of Variations and Partial Differential Equations 61 (2). doi:10.1007/s00526-021-02156-6. ISSN 0944-2669.
  4. E, Weinan; Wojtowytsch, Stephan (2020-07-30). "On the Banach spaces associated with multi-layer ReLU networks: Function representation, approximation theory and gradient descent dynamics". http://arxiv.org/abs/2007.15623