Physics:Equation of Artificial Intelligence in the Theory of Entropicity (ToE)
Entropic Learning via Self-Referential Entropy Tracking (SRETA)
Abstract
This page defines a principled Learning Action in which learning is explicitly a change of state guided by Self-Referential Entropy (SRE) toward a given (target) entropy. Instead of adding Shannon entropy and an “irreversible entropy” term, we construct an entropic potential and a gradient-flow action whose minimizer yields irreversible dynamics by design. The result is the Entropic Learning Equation (ELE).
Preliminaries
Objects and Notation
Model state (parameters): [math]\displaystyle{ \phi(t)\in\mathbb{R}^d }[/math].
Predictive distribution: [math]\displaystyle{ p_\phi(y\mid x) }[/math].
Data (or teacher) distribution: [math]\displaystyle{ p^*(y\mid x) }[/math].
Self-Referential Entropy (SRE) of the model’s internal state: [math]\displaystyle{ S_{\mathrm{self}}(\phi) }[/math] (differentiable scalar functional).
Given (target) entropy for the task: [math]\displaystyle{ S_{\mathrm{given}} }[/math].
Positive-definite mobility/metric: [math]\displaystyle{ \Gamma(\phi)\in\mathbb{R}^{d\times d} }[/math].
We use [math]\displaystyle{ \nabla_\phi }[/math] for gradients w.r.t. [math]\displaystyle{ \phi }[/math] and the dot for time derivatives, [math]\displaystyle{ \dot\phi=\tfrac{d\phi}{dt} }[/math].
Choice of Target Entropy [math]\displaystyle{ S_{\mathrm{given}} }[/math] (Guidance)
Nearly deterministic supervision: [math]\displaystyle{ S_{\mathrm{given}}\approx 0 }[/math].
Inherently ambiguous labels: set [math]\displaystyle{ S_{\mathrm{given}}\approx H^*(Y\mid X) }[/math], an empirical conditional entropy estimate.
Representation learning: define [math]\displaystyle{ S_{\mathrm{self}} }[/math] on latents (e.g., codebook, embedding spread) and set [math]\displaystyle{ S_{\mathrm{given}} }[/math] by desired compression/robustness.
Entropic Potential
Define the entropic potential (weights can be nondimensionalized; set to 1 after scaling): [math]\displaystyle{ W(\phi) \;=\; \tfrac{\alpha}{2}\,\big(S_{\mathrm{self}}(\phi)-S_{\mathrm{given}}\big)^2 \;+\; \beta\,\mathbb{E}_{x\sim\mathcal{D}}\Big[\mathrm{KL}\big(p^*(\cdot\mid x)\,\Vert\,p_\phi(\cdot\mid x)\big)\Big], }[/math] where [math]\displaystyle{ \mathrm{KL}\big(p^*(\cdot\mid x)\Vert p_\phi(\cdot\mid x)\big) = \sum_y p^*(y\mid x)\,\log\frac{p^*(y\mid x)}{p_\phi(y\mid x)}. }[/math]
The first term drives SRE alignment: [math]\displaystyle{ S_{\mathrm{self}}(\phi)\to S_{\mathrm{given}} }[/math].
The second term pulls predictions toward data without being “just another entropy addend”.
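The two terms above can be evaluated numerically. The following sketch instantiates [math]\displaystyle{ W(\phi) }[/math] for a single 3-class softmax prediction, taking [math]\displaystyle{ S_{\mathrm{self}} }[/math] to be the predictive entropy; this choice of [math]\displaystyle{ S_{\mathrm{self}} }[/math] and all concrete names are illustrative assumptions, not prescribed by the definitions above.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    # Shannon entropy in nats; small epsilon guards log(0)
    return -np.sum(p * np.log(p + 1e-12))

def kl(p_star, p):
    # KL(p* || p) for discrete distributions
    return np.sum(p_star * np.log((p_star + 1e-12) / (p + 1e-12)))

def W(phi, p_star, S_given, alpha=1.0, beta=1.0):
    p = softmax(phi)          # predictive distribution p_phi
    S_self = entropy(p)       # one admissible S_self: predictive entropy
    return 0.5 * alpha * (S_self - S_given) ** 2 + beta * kl(p_star, p)

p_star = np.array([0.7, 0.2, 0.1])   # "teacher" distribution
print(W(np.zeros(3), p_star, S_given=entropy(p_star)))
```

When [math]\displaystyle{ \phi=\log p^* }[/math] the softmax reproduces [math]\displaystyle{ p^* }[/math], both terms vanish, and [math]\displaystyle{ W=0 }[/math], consistent with [math]\displaystyle{ W }[/math] attaining its minimum at SRE alignment plus perfect data fit.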
Learning Action
We penalize deviation from the desired gradient flow generated by [math]\displaystyle{ W }[/math]: [math]\displaystyle{ \mathcal{L}_{\mathrm{SRETA}}[\phi] \;=\; \int_{0}^{T} \tfrac{1}{2}\, \big\| \dot\phi + \Gamma(\phi)\,\nabla_\phi W(\phi) \big\|^2 \,dt. }[/math]
Boundary conditions (typical): [math]\displaystyle{ \phi(0)=\phi_0 }[/math]; free terminal state or fixed [math]\displaystyle{ \phi(T) }[/math] if desired.
[math]\displaystyle{ \Gamma(\phi) }[/math] sets the geometry and time scale of learning (identity [math]\displaystyle{ \to }[/math] vanilla gradient flow; Fisher metric [math]\displaystyle{ \to }[/math] natural gradient).
Stationarity and the Entropic Learning Equation (ELE)
Minimizing the action gives the gradient-flow ELE: [math]\displaystyle{ \dot\phi = -\,\Gamma(\phi)\,\nabla_\phi W(\phi) = -\,\Gamma(\phi)\left[ \alpha\,\big(S_{\mathrm{self}}(\phi)-S_{\mathrm{given}}\big)\,\nabla_\phi S_{\mathrm{self}}(\phi) \;+\; \beta\,\nabla_\phi\, \mathbb{E}_{x}\big[\mathrm{KL}(p^*\Vert p_\phi)\big] \right]. }[/math]
Built-in Irreversibility (No ad-hoc [math]\displaystyle{ S_{\mathrm{irr}} }[/math])
Define the entropy-production rate [math]\displaystyle{ \sigma(\phi) \;=\; \big(\nabla_\phi W(\phi)\big)^{\top}\, \Gamma(\phi)\, \big(\nabla_\phi W(\phi)\big) \;\ge\; 0. }[/math] Because [math]\displaystyle{ \Gamma(\phi) }[/math] is positive-definite, [math]\displaystyle{ \sigma\ge 0 }[/math] holds identically; moreover, along the ELE trajectory [math]\displaystyle{ \tfrac{d}{dt}W = \nabla_\phi W\cdot\dot\phi = -\sigma \le 0 }[/math], so [math]\displaystyle{ W }[/math] is a Lyapunov function and irreversibility emerges from the dynamics, not from an added penalty.
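These two facts can be checked numerically on a toy potential. The sketch below uses an arbitrary quadratic [math]\displaystyle{ W }[/math] and a randomly constructed positive-definite mobility; all concrete choices are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_W(phi):
    # gradient of a toy quadratic potential W(phi) = 0.5 ||phi - c||^2
    return phi - np.array([1.0, -2.0, 0.5])

A = rng.normal(size=(3, 3))
Gamma = A @ A.T + 3.0 * np.eye(3)   # positive-definite mobility

phi = rng.normal(size=3)
g = grad_W(phi)
sigma = g @ Gamma @ g               # entropy-production rate: nonnegative by construction

# Along the gradient flow phi_dot = -Gamma grad_W, we have dW/dt = grad_W . phi_dot = -sigma
phi_dot = -Gamma @ g
print(sigma >= 0, np.isclose(g @ phi_dot, -sigma))
```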
Equivalent Constrained (Tracking) Form
One can enforce “SRE relaxes toward the target” explicitly via a Lagrange multiplier [math]\displaystyle{ \lambda(t) }[/math]: [math]\displaystyle{ \mathcal{L}_{\mathrm{track}}[\phi,\lambda] = \int\Big[ \tfrac{1}{2}\,\dot\phi^{\top} M\,\dot\phi \;+\; \beta\,\mathbb{E}_{x}\big[\mathrm{KL}(p^*\Vert p_\phi)\big] \;+\; \lambda(t)\,\Big( \tfrac{d}{dt} S_{\mathrm{self}}(\phi) - \kappa\,\big[\,S_{\mathrm{given}}-S_{\mathrm{self}}(\phi)\,\big] \Big) \Big]\,dt, }[/math] with [math]\displaystyle{ \tfrac{d}{dt} S_{\mathrm{self}}(\phi) = \nabla_\phi S_{\mathrm{self}}(\phi)\cdot \dot\phi. }[/math] Stationarity yields first-order dynamics in which [math]\displaystyle{ \tfrac{d}{dt} S_{\mathrm{self}}(\phi) = \kappa\,\big[\,S_{\mathrm{given}}-S_{\mathrm{self}}(\phi)\,\big] }[/math] (i.e., exponential relaxation of SRE toward the target) while simultaneously fitting the data via the KL term. The matrix [math]\displaystyle{ M\succ 0 }[/math] sets inertial weighting; taking the overdamped limit recovers the gradient-flow ELE.
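The constraint itself is a scalar linear ODE, so the claimed exponential relaxation is easy to verify: integrating [math]\displaystyle{ \dot S = \kappa(S_{\mathrm{given}}-S) }[/math] by explicit Euler should track the closed-form solution [math]\displaystyle{ S(t)=S_{\mathrm{given}}+(S_0-S_{\mathrm{given}})e^{-\kappa t} }[/math]. The numerical values below are illustrative.

```python
import numpy as np

# Euler integration of the tracking constraint dS/dt = kappa (S_given - S)
kappa, S_given, S0, dt, T = 2.0, 0.3, 1.5, 1e-3, 3.0
S = S0
for _ in range(int(T / dt)):
    S += dt * kappa * (S_given - S)

# Closed-form exponential relaxation toward the target
exact = S_given + (S0 - S_given) * np.exp(-kappa * T)
print(abs(S - exact))   # small Euler discretization error
```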
Practical Design Choices
Geometry / Optimizer Mapping
Euclidean flow: [math]\displaystyle{ \Gamma(\phi)=\eta I }[/math] (learning rate [math]\displaystyle{ \eta }[/math]), giving [math]\displaystyle{ \dot\phi=-\eta\,\nabla_\phi W(\phi) }[/math].
Natural gradient: [math]\displaystyle{ \Gamma(\phi)=\eta\,F(\phi)^{-1} }[/math], with [math]\displaystyle{ F }[/math] the Fisher information.
Preconditioned/Adam-like: choose [math]\displaystyle{ \Gamma }[/math] diagonal or adaptive from running curvature estimates.
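As a concrete instance of the natural-gradient choice, consider a Bernoulli model with mean parameter [math]\displaystyle{ \theta }[/math], whose Fisher information is [math]\displaystyle{ F(\theta)=1/(\theta(1-\theta)) }[/math]. The sketch below applies [math]\displaystyle{ \Gamma=\eta F^{-1} }[/math] to the KL data term alone (no SRE term), with all specific names and values chosen for illustration.

```python
# Natural-gradient descent on KL(Bern(theta_star) || Bern(theta))
def kl_grad(theta, theta_star):
    # d/dtheta of KL(Bern(theta_star) || Bern(theta))
    return -theta_star / theta + (1 - theta_star) / (1 - theta)

def natural_step(theta, theta_star, eta=0.5):
    F = 1.0 / (theta * (1 - theta))          # Fisher information of Bernoulli(theta)
    return theta - eta * (1.0 / F) * kl_grad(theta, theta_star)

theta, theta_star = 0.9, 0.3
for _ in range(200):
    theta = natural_step(theta, theta_star)
print(theta)   # converges toward theta_star
```

Here the Fisher preconditioning cancels the parameterization-dependent curvature: the update reduces to [math]\displaystyle{ \theta_{k+1}=\theta_k-\eta(\theta_k-\theta^*) }[/math], a uniform linear contraction regardless of where [math]\displaystyle{ \theta }[/math] sits in [math]\displaystyle{ (0,1) }[/math].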
Discrete-Time Approximation (for implementation)
For step size [math]\displaystyle{ \Delta t }[/math]: [math]\displaystyle{ \phi_{k+1} \;=\; \phi_k - \Delta t\,\Gamma(\phi_k)\,\nabla_\phi W(\phi_k). }[/math]
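A minimal implementation of this explicit-Euler update, on a small toy potential with [math]\displaystyle{ \Gamma=I }[/math] and a numerical gradient; the stand-ins for [math]\displaystyle{ S_{\mathrm{self}} }[/math] and the KL data term are illustrative assumptions, not the definitions used elsewhere on this page.

```python
import numpy as np

def W(phi, S_given=1.0, alpha=1.0, beta=1.0):
    S_self = np.log1p(phi @ phi)             # a smooth stand-in for SRE
    data_fit = 0.5 * np.sum((phi - 1.0)**2)  # stand-in for the KL data term
    return 0.5 * alpha * (S_self - S_given)**2 + beta * data_fit

def num_grad(f, phi, eps=1e-6):
    # central-difference gradient, to keep the sketch short
    g = np.zeros_like(phi)
    for i in range(phi.size):
        e = np.zeros_like(phi); e[i] = eps
        g[i] = (f(phi + e) - f(phi - e)) / (2 * eps)
    return g

dt, Gamma = 0.05, np.eye(2)                  # Euclidean mobility Gamma = I
phi = np.array([3.0, -2.0])
history = [W(phi)]
for _ in range(400):
    phi = phi - dt * Gamma @ num_grad(W, phi)   # phi_{k+1} = phi_k - dt Gamma grad W
    history.append(W(phi))

print(history[0], history[-1])               # W decreases along the discrete flow
```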
Scaling and Weights
Non-dimensionalize so that [math]\displaystyle{ W }[/math] is order one near the optimum, then set [math]\displaystyle{ \alpha=\beta=1 }[/math]. Use [math]\displaystyle{ \Gamma }[/math] (or [math]\displaystyle{ \eta }[/math]) to tune the time scale.
Why This Fixes the Original Formulation
Learning is explicitly state change: dynamics appear via [math]\displaystyle{ \dot\phi }[/math].
SRE alignment is the steering signal through [math]\displaystyle{ \big(S_{\mathrm{self}}-S_{\mathrm{given}}\big)\nabla_\phi S_{\mathrm{self}} }[/math].
Irreversibility ([math]\displaystyle{ \sigma\ge 0 }[/math]) is automatic from the quadratic action; no external [math]\displaystyle{ S_{\mathrm{irr}} }[/math] term is required.
Data-fit pressure is principled via KL, orthogonal to SRE alignment.
Minimal Working Set (copy-ready)
Entropic potential:
[math]\displaystyle{ W(\phi)=\tfrac{\alpha}{2}\big(S_{\mathrm{self}}(\phi)-S_{\mathrm{given}}\big)^2 +\beta\,\mathbb{E}_{x}\Big[\mathrm{KL}\big(p^*(\cdot\mid x)\Vert p_\phi(\cdot\mid x)\big)\Big]. }[/math]
Learning action (SRETA): [math]\displaystyle{ \mathcal{L}_{\mathrm{SRETA}}[\phi]=\int_0^T \tfrac{1}{2}\,\big\|\dot\phi+\Gamma(\phi)\nabla_\phi W(\phi)\big\|^2\,dt. }[/math]
Entropic Learning Equation (ELE): [math]\displaystyle{ \dot\phi = -\,\Gamma(\phi)\left[ \alpha\,\big(S_{\mathrm{self}}(\phi)-S_{\mathrm{given}}\big)\,\nabla_\phi S_{\mathrm{self}}(\phi) + \beta\,\nabla_\phi\mathbb{E}_{x}\big[\mathrm{KL}(p^*\Vert p_\phi)\big] \right]. }[/math]
Entropy-production rate: [math]\displaystyle{ \sigma(\phi)=\big(\nabla_\phi W(\phi)\big)^{\top}\Gamma(\phi)\big(\nabla_\phi W(\phi)\big)\ge 0. }[/math]
Constrained tracking form (optional): [math]\displaystyle{ \mathcal{L}_{\mathrm{track}}[\phi,\lambda] = \int\Big[\tfrac{1}{2}\dot\phi^{\top}M\dot\phi +\beta\,\mathbb{E}_{x}\big[\mathrm{KL}(p^*\Vert p_\phi)\big] +\lambda(t)\big(\nabla_\phi S_{\mathrm{self}}(\phi)\cdot\dot\phi-\kappa\big[S_{\mathrm{given}}-S_{\mathrm{self}}(\phi)\big]\big)\Big]\,dt. }[/math]
Remarks
This framework is agnostic to the specific definition of [math]\displaystyle{ S_{\mathrm{self}}(\phi) }[/math] as long as it is smooth; examples include entropy of latent codes, complexity penalties, or energy-based state measures.
The formulation is compatible with stochastic mini-batch estimates of both the KL and gradients.
The same blueprint extends to multi-objective training by summing additional potentials inside [math]\displaystyle{ W(\phi) }[/math] before taking the gradient flow.