Physics:Equation of Artificial Intelligence in the Theory of Entropicity (ToE)
Entropic Learning via Self-Referential Entropy Tracking (SRETA)
Abstract
This page defines a principled Learning Action in which learning is explicitly a change of state guided by Self-Referential Entropy (SRE) toward a given (target) entropy. Instead of adding Shannon entropy and an “irreversible entropy” term, we construct an entropic potential and a gradient-flow action whose minimizer yields irreversible dynamics by design. The result is the Entropic Learning Equation (ELE).
Preliminaries
Objects and Notation
Model state (parameters): [math]\displaystyle{ \phi(t)\in\mathbb{R}^d }[/math].
Predictive distribution: [math]\displaystyle{ p_\phi(y\mid x) }[/math].
Data (or teacher) distribution: [math]\displaystyle{ p^*(y\mid x) }[/math].
Self-Referential Entropy (SRE) of the model’s internal state: [math]\displaystyle{ S_{\mathrm{self}}(\phi) }[/math] (differentiable scalar functional).
Given (target) entropy for the task: [math]\displaystyle{ S_{\mathrm{given}} }[/math].
Positive-definite mobility/metric: [math]\displaystyle{ \Gamma(\phi)\in\mathbb{R}^{d\times d} }[/math].
We use [math]\displaystyle{ \nabla_\phi }[/math] for gradients w.r.t. [math]\displaystyle{ \phi }[/math] and the dot for time derivatives, [math]\displaystyle{ \dot\phi=\tfrac{d\phi}{dt} }[/math].
Choice of Target Entropy [math]\displaystyle{ S_{\mathrm{given}} }[/math] (Guidance)
Nearly deterministic supervision: [math]\displaystyle{ S_{\mathrm{given}}\approx 0 }[/math].
Inherently ambiguous labels: set [math]\displaystyle{ S_{\mathrm{given}}\approx H^*(Y\mid X) }[/math], an empirical conditional entropy estimate.
Representation learning: define [math]\displaystyle{ S_{\mathrm{self}} }[/math] on latents (e.g., codebook, embedding spread) and set [math]\displaystyle{ S_{\mathrm{given}} }[/math] by desired compression/robustness.
Entropic Potential
Define the entropic potential (weights can be nondimensionalized; set to 1 after scaling): [math]\displaystyle{ W(\phi) \;=\; \tfrac{\alpha}{2}\,\big(S_{\mathrm{self}}(\phi)-S_{\mathrm{given}}\big)^2 \;+\; \beta\,\mathbb{E}_{x\sim\mathcal{D}}\Big[\mathrm{KL}\big(p^*(\cdot\mid x)\,\Vert\,p_\phi(\cdot\mid x)\big)\Big], }[/math] where [math]\displaystyle{ \mathrm{KL}\big(p^*(\cdot\mid x)\Vert p_\phi(\cdot\mid x)\big) = \sum_y p^*(y\mid x)\,\log\frac{p^*(y\mid x)}{p_\phi(y\mid x)}. }[/math]
The first term drives SRE alignment: [math]\displaystyle{ S_{\mathrm{self}}(\phi)\to S_{\mathrm{given}} }[/math].
The second term pulls predictions toward data without being “just another entropy addend”.
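The two terms above can be evaluated numerically. The following sketch instantiates [math]\displaystyle{ W(\phi) }[/math] for a single 3-class softmax prediction, taking [math]\displaystyle{ S_{\mathrm{self}} }[/math] to be the predictive entropy; this choice of [math]\displaystyle{ S_{\mathrm{self}} }[/math] and all concrete names are illustrative assumptions, not prescribed by the definitions above.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    # Shannon entropy in nats; small epsilon guards log(0)
    return -np.sum(p * np.log(p + 1e-12))

def kl(p_star, p):
    # KL(p* || p) for discrete distributions
    return np.sum(p_star * np.log((p_star + 1e-12) / (p + 1e-12)))

def W(phi, p_star, S_given, alpha=1.0, beta=1.0):
    p = softmax(phi)          # predictive distribution p_phi
    S_self = entropy(p)       # one admissible S_self: predictive entropy
    return 0.5 * alpha * (S_self - S_given) ** 2 + beta * kl(p_star, p)

p_star = np.array([0.7, 0.2, 0.1])   # "teacher" distribution
print(W(np.zeros(3), p_star, S_given=entropy(p_star)))
```

When [math]\displaystyle{ \phi=\log p^* }[/math] the softmax reproduces [math]\displaystyle{ p^* }[/math], both terms vanish, and [math]\displaystyle{ W=0 }[/math], consistent with [math]\displaystyle{ W }[/math] attaining its minimum at SRE alignment plus perfect data fit.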
Learning Action
We penalize deviation from the desired gradient flow generated by [math]\displaystyle{ W }[/math]: [math]\displaystyle{ \mathcal{L}_{\mathrm{SRETA}}[\phi] \;=\; \int_{0}^{T} \tfrac{1}{2}\, \big\| \dot\phi + \Gamma(\phi)\,\nabla_\phi W(\phi) \big\|^2 \,dt. }[/math]
Boundary conditions (typical): [math]\displaystyle{ \phi(0)=\phi_0 }[/math]; free terminal state or fixed [math]\displaystyle{ \phi(T) }[/math] if desired.
[math]\displaystyle{ \Gamma(\phi) }[/math] sets the geometry and time scale of learning (identity [math]\displaystyle{ \to }[/math] vanilla gradient flow; Fisher metric [math]\displaystyle{ \to }[/math] natural gradient).
Stationarity and the Entropic Learning Equation (ELE)
Minimizing the action gives the gradient-flow ELE: [math]\displaystyle{ \dot\phi = -\,\Gamma(\phi)\,\nabla_\phi W(\phi) = -\,\Gamma(\phi)\left[ \alpha\,\big(S_{\mathrm{self}}(\phi)-S_{\mathrm{given}}\big)\,\nabla_\phi S_{\mathrm{self}}(\phi) \;+\; \beta\,\nabla_\phi\, \mathbb{E}_{x}\big[\mathrm{KL}(p^*\Vert p_\phi)\big] \right]. }[/math]
Built-in Irreversibility (No ad-hoc [math]\displaystyle{ S_{\mathrm{irr}} }[/math])
Define the entropy-production rate [math]\displaystyle{ \sigma(\phi) \;=\; \big(\nabla_\phi W(\phi)\big)^{\top}\, \Gamma(\phi)\, \big(\nabla_\phi W(\phi)\big) \;\ge\; 0. }[/math] Because [math]\displaystyle{ \Gamma(\phi) }[/math] is positive-definite, [math]\displaystyle{ \sigma\ge 0 }[/math] holds identically; moreover, along the ELE trajectory [math]\displaystyle{ \tfrac{d}{dt}W = \nabla_\phi W\cdot\dot\phi = -\sigma \le 0 }[/math], so [math]\displaystyle{ W }[/math] is a Lyapunov function and irreversibility emerges from the dynamics, not from an added penalty.
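These two facts can be checked numerically on a toy potential. The sketch below uses an arbitrary quadratic [math]\displaystyle{ W }[/math] and a randomly constructed positive-definite mobility; all concrete choices are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_W(phi):
    # gradient of a toy quadratic potential W(phi) = 0.5 ||phi - c||^2
    return phi - np.array([1.0, -2.0, 0.5])

A = rng.normal(size=(3, 3))
Gamma = A @ A.T + 3.0 * np.eye(3)   # positive-definite mobility

phi = rng.normal(size=3)
g = grad_W(phi)
sigma = g @ Gamma @ g               # entropy-production rate: nonnegative by construction

# Along the gradient flow phi_dot = -Gamma grad_W, we have dW/dt = grad_W . phi_dot = -sigma
phi_dot = -Gamma @ g
print(sigma >= 0, np.isclose(g @ phi_dot, -sigma))
```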
Equivalent Constrained (Tracking) Form
One can enforce “SRE relaxes toward the target” explicitly via a Lagrange multiplier [math]\displaystyle{ \lambda(t) }[/math]: [math]\displaystyle{ \mathcal{L}_{\mathrm{track}}[\phi,\lambda] = \int\Big[ \tfrac{1}{2}\,\dot\phi^{\top} M\,\dot\phi \;+\; \beta\,\mathbb{E}_{x}\big[\mathrm{KL}(p^*\Vert p_\phi)\big] \;+\; \lambda(t)\,\Big( \tfrac{d}{dt} S_{\mathrm{self}}(\phi) - \kappa\,\big[\,S_{\mathrm{given}}-S_{\mathrm{self}}(\phi)\,\big] \Big) \Big]\,dt, }[/math] with [math]\displaystyle{ \tfrac{d}{dt} S_{\mathrm{self}}(\phi) = \nabla_\phi S_{\mathrm{self}}(\phi)\cdot \dot\phi. }[/math] Stationarity yields first-order dynamics in which [math]\displaystyle{ \tfrac{d}{dt} S_{\mathrm{self}}(\phi) = \kappa\,\big[\,S_{\mathrm{given}}-S_{\mathrm{self}}(\phi)\,\big] }[/math] (i.e., exponential relaxation of SRE toward the target) while simultaneously fitting the data via the KL term. The matrix [math]\displaystyle{ M\succ 0 }[/math] sets inertial weighting; taking the overdamped limit recovers the gradient-flow ELE.
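The constraint itself is a scalar linear ODE, so the claimed exponential relaxation is easy to verify: integrating [math]\displaystyle{ \dot S = \kappa(S_{\mathrm{given}}-S) }[/math] by explicit Euler should track the closed-form solution [math]\displaystyle{ S(t)=S_{\mathrm{given}}+(S_0-S_{\mathrm{given}})e^{-\kappa t} }[/math]. The numerical values below are illustrative.

```python
import numpy as np

# Euler integration of the tracking constraint dS/dt = kappa (S_given - S)
kappa, S_given, S0, dt, T = 2.0, 0.3, 1.5, 1e-3, 3.0
S = S0
for _ in range(int(T / dt)):
    S += dt * kappa * (S_given - S)

# Closed-form exponential relaxation toward the target
exact = S_given + (S0 - S_given) * np.exp(-kappa * T)
print(abs(S - exact))   # small Euler discretization error
```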
Practical Design Choices
Geometry / Optimizer Mapping
Euclidean flow: [math]\displaystyle{ \Gamma(\phi)=\eta I }[/math] (learning rate [math]\displaystyle{ \eta }[/math]), giving [math]\displaystyle{ \dot\phi=-\eta\,\nabla_\phi W(\phi) }[/math].
Natural gradient: [math]\displaystyle{ \Gamma(\phi)=\eta\,F(\phi)^{-1} }[/math], with [math]\displaystyle{ F }[/math] the Fisher information.
Preconditioned/Adam-like: choose [math]\displaystyle{ \Gamma }[/math] diagonal or adaptive from running curvature estimates.
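As a concrete instance of the natural-gradient choice, consider a Bernoulli model with mean parameter [math]\displaystyle{ \theta }[/math], whose Fisher information is [math]\displaystyle{ F(\theta)=1/(\theta(1-\theta)) }[/math]. The sketch below applies [math]\displaystyle{ \Gamma=\eta F^{-1} }[/math] to the KL data term alone (no SRE term), with all specific names and values chosen for illustration.

```python
# Natural-gradient descent on KL(Bern(theta_star) || Bern(theta))
def kl_grad(theta, theta_star):
    # d/dtheta of KL(Bern(theta_star) || Bern(theta))
    return -theta_star / theta + (1 - theta_star) / (1 - theta)

def natural_step(theta, theta_star, eta=0.5):
    F = 1.0 / (theta * (1 - theta))          # Fisher information of Bernoulli(theta)
    return theta - eta * (1.0 / F) * kl_grad(theta, theta_star)

theta, theta_star = 0.9, 0.3
for _ in range(200):
    theta = natural_step(theta, theta_star)
print(theta)   # converges toward theta_star
```

Here the Fisher preconditioning cancels the parameterization-dependent curvature: the update reduces to [math]\displaystyle{ \theta_{k+1}=\theta_k-\eta(\theta_k-\theta^*) }[/math], a uniform linear contraction regardless of where [math]\displaystyle{ \theta }[/math] sits in [math]\displaystyle{ (0,1) }[/math].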
Discrete-Time Approximation (for implementation)
For step size [math]\displaystyle{ \Delta t }[/math]: [math]\displaystyle{ \phi_{k+1} \;=\; \phi_k - \Delta t\,\Gamma(\phi_k)\,\nabla_\phi W(\phi_k). }[/math]
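A minimal implementation of this explicit-Euler update, on a small toy potential with [math]\displaystyle{ \Gamma=I }[/math] and a numerical gradient; the stand-ins for [math]\displaystyle{ S_{\mathrm{self}} }[/math] and the KL data term are illustrative assumptions, not the definitions used elsewhere on this page.

```python
import numpy as np

def W(phi, S_given=1.0, alpha=1.0, beta=1.0):
    S_self = np.log1p(phi @ phi)             # a smooth stand-in for SRE
    data_fit = 0.5 * np.sum((phi - 1.0)**2)  # stand-in for the KL data term
    return 0.5 * alpha * (S_self - S_given)**2 + beta * data_fit

def num_grad(f, phi, eps=1e-6):
    # central-difference gradient, to keep the sketch short
    g = np.zeros_like(phi)
    for i in range(phi.size):
        e = np.zeros_like(phi); e[i] = eps
        g[i] = (f(phi + e) - f(phi - e)) / (2 * eps)
    return g

dt, Gamma = 0.05, np.eye(2)                  # Euclidean mobility Gamma = I
phi = np.array([3.0, -2.0])
history = [W(phi)]
for _ in range(400):
    phi = phi - dt * Gamma @ num_grad(W, phi)   # phi_{k+1} = phi_k - dt Gamma grad W
    history.append(W(phi))

print(history[0], history[-1])               # W decreases along the discrete flow
```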
Scaling and Weights
Non-dimensionalize so that [math]\displaystyle{ W }[/math] is order one near the optimum, then set [math]\displaystyle{ \alpha=\beta=1 }[/math]. Use [math]\displaystyle{ \Gamma }[/math] (or [math]\displaystyle{ \eta }[/math]) to tune the time scale.
Why This Fixes the Original Formulation
Learning is explicitly state change: dynamics appear via [math]\displaystyle{ \dot\phi }[/math].
SRE alignment is the steering signal through [math]\displaystyle{ \big(S_{\mathrm{self}}-S_{\mathrm{given}}\big)\nabla_\phi S_{\mathrm{self}} }[/math].
Irreversibility ([math]\displaystyle{ \sigma\ge 0 }[/math]) is automatic from the quadratic action; no external [math]\displaystyle{ S_{\mathrm{irr}} }[/math] term is required.
Data-fit pressure is principled via KL, orthogonal to SRE alignment.
Minimal Working Set (copy-ready)
Entropic potential:
[math]\displaystyle{ W(\phi)=\tfrac{\alpha}{2}\big(S_{\mathrm{self}}(\phi)-S_{\mathrm{given}}\big)^2 +\beta\,\mathbb{E}_{x}\Big[\mathrm{KL}\big(p^*(\cdot\mid x)\Vert p_\phi(\cdot\mid x)\big)\Big]. }[/math]
Learning action (SRETA): [math]\displaystyle{ \mathcal{L}_{\mathrm{SRETA}}[\phi]=\int_0^T \tfrac{1}{2}\,\big\|\dot\phi+\Gamma(\phi)\nabla_\phi W(\phi)\big\|^2\,dt. }[/math]
Entropic Learning Equation (ELE): [math]\displaystyle{ \dot\phi = -\,\Gamma(\phi)\left[ \alpha\,\big(S_{\mathrm{self}}(\phi)-S_{\mathrm{given}}\big)\,\nabla_\phi S_{\mathrm{self}}(\phi) + \beta\,\nabla_\phi\mathbb{E}_{x}\big[\mathrm{KL}(p^*\Vert p_\phi)\big] \right]. }[/math]
Entropy-production rate: [math]\displaystyle{ \sigma(\phi)=\big(\nabla_\phi W(\phi)\big)^{\top}\Gamma(\phi)\big(\nabla_\phi W(\phi)\big)\ge 0. }[/math]
Constrained tracking form (optional): [math]\displaystyle{ \mathcal{L}_{\mathrm{track}}[\phi,\lambda] = \int\Big[\tfrac{1}{2}\dot\phi^{\top}M\dot\phi +\beta\,\mathbb{E}_{x}\big[\mathrm{KL}(p^*\Vert p_\phi)\big] +\lambda(t)\big(\nabla_\phi S_{\mathrm{self}}(\phi)\cdot\dot\phi-\kappa\big[S_{\mathrm{given}}-S_{\mathrm{self}}(\phi)\big]\big)\Big]\,dt. }[/math]
Remarks
This framework is agnostic to the specific definition of [math]\displaystyle{ S_{\mathrm{self}}(\phi) }[/math] as long as it is smooth; examples include entropy of latent codes, complexity penalties, or energy-based state measures.
The formulation is compatible with stochastic mini-batch estimates of both the KL and gradients.
The same blueprint extends to multi-objective training by summing additional potentials inside [math]\displaystyle{ W(\phi) }[/math] before taking the gradient flow.