Physics:Equation of Artificial Intelligence in the Theory of Entropicity (ToE)
Entropic Learning via Self-Referential Entropy Tracking (SRETA)
That is:
Learning is a change in the internal entropy state of a system [a change in Self-Referential Entropy (SRE)] toward a given [internal or external] reference entropy.
Abstract
This paper defines a principled Learning Action in which learning is explicitly a change of state guided by Self-Referential Entropy (SRE) toward a given (target) entropy. Instead of adding Shannon entropy and an “irreversible entropy” term, we construct an entropic potential and a gradient-flow action whose minimizer yields irreversible dynamics by design. The result is the Entropic Learning Equation (ELE).
This paper is an update on an earlier paper on Artificial Intelligence and Deep Learning in the Theory of Entropicity (ToE).[1]
Preliminaries
Objects and Notation
Model state (parameters): [math]\displaystyle{ \phi(t)\in\mathbb{R}^d }[/math].
Predictive distribution: [math]\displaystyle{ p_\phi(y\mid x) }[/math].
Data (or teacher) distribution: [math]\displaystyle{ p^*(y\mid x) }[/math].
Self-Referential Entropy (SRE) of the model’s internal state: [math]\displaystyle{ S_{\mathrm{self}}(\phi) }[/math] (differentiable scalar functional).
Given (target) entropy for the task: [math]\displaystyle{ S_{\mathrm{given}} }[/math].
Positive-definite mobility/metric: [math]\displaystyle{ \Gamma(\phi)\in\mathbb{R}^{d\times d} }[/math].
We use [math]\displaystyle{ \nabla_\phi }[/math] for gradients w.r.t. [math]\displaystyle{ \phi }[/math] and the dot for time derivatives, [math]\displaystyle{ \dot\phi=\tfrac{d\phi}{dt} }[/math].
Choice of Target Entropy [math]\displaystyle{ S_{\mathrm{given}} }[/math] (Guidance)
Nearly deterministic supervision: [math]\displaystyle{ S_{\mathrm{given}}\approx 0 }[/math].
Inherently ambiguous labels: set [math]\displaystyle{ S_{\mathrm{given}}\approx H^*(Y\mid X) }[/math], an empirical estimate of the conditional entropy.
Representation learning: define [math]\displaystyle{ S_{\mathrm{self}} }[/math] on latents (e.g., codebook, embedding spread) and set [math]\displaystyle{ S_{\mathrm{given}} }[/math] by desired compression/robustness.
Entropic Potential
Define the entropic potential (weights can be nondimensionalized; set to 1 after scaling): [math]\displaystyle{ W(\phi) \;=\; \tfrac{\alpha}{2}\,\big(S_{\mathrm{self}}(\phi)-S_{\mathrm{given}}\big)^2 \;+\; \beta\,\mathbb{E}_{x\sim\mathcal{D}}\!\Big[\mathrm{KL}\big(p^*(\cdot\mid x)\,\Vert\,p_\phi(\cdot\mid x)\big)\Big], }[/math] where [math]\displaystyle{ \mathrm{KL}\big(p^*(\cdot\mid x)\Vert p_\phi(\cdot\mid x)\big) \;=\; \sum_y p^*(y\mid x)\,\log\frac{p^*(y\mid x)}{p_\phi(y\mid x)}. }[/math]
The first term drives SRE alignment: [math]\displaystyle{ S_{\mathrm{self}}(\phi)\to S_{\mathrm{given}} }[/math].
The second term pulls predictions toward data without being “just another entropy addend”.
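For concreteness, here is a minimal NumPy sketch of [math]\displaystyle{ W(\phi) }[/math] for a toy linear softmax model. Taking [math]\displaystyle{ S_{\mathrm{self}} }[/math] to be the mean predictive entropy over the data is an illustrative assumption; the framework only requires a differentiable scalar functional.
<syntaxhighlight lang="python">
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # stabilize exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropic_potential(phi, X, P_star, S_given, alpha=1.0, beta=1.0):
    """W(phi) for a toy linear softmax model p_phi(y|x) = softmax(x @ phi).

    S_self is taken here as the mean Shannon entropy of the model's
    predictive distribution over the data -- one smooth choice among
    many the paper allows.
    """
    P = softmax(X @ phi)                      # p_phi(y|x), shape (n, k)
    eps = 1e-12
    # Self-Referential Entropy (illustrative definition).
    S_self = -np.mean(np.sum(P * np.log(P + eps), axis=-1))
    # Expected KL(p* || p_phi) over the empirical data distribution.
    kl = np.mean(np.sum(P_star * (np.log(P_star + eps) - np.log(P + eps)),
                        axis=-1))
    return 0.5 * alpha * (S_self - S_given) ** 2 + beta * kl
</syntaxhighlight>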
Learning Action
We penalize deviation from the desired gradient flow generated by [math]\displaystyle{ W }[/math]: [math]\displaystyle{ \mathcal{L}_{\mathrm{SRETA}}[\phi] \;=\; \int_{0}^{T} \tfrac{1}{2}\,\big\|\dot\phi + \Gamma(\phi)\,\nabla_\phi W(\phi)\big\|^2\,dt. }[/math]
Boundary conditions (typical): [math]\displaystyle{ \phi(0)=\phi_0 }[/math]; free terminal state or fixed [math]\displaystyle{ \phi(T) }[/math] if desired.
[math]\displaystyle{ \Gamma(\phi) }[/math] sets the geometry and time scale of learning (identity [math]\displaystyle{ \to }[/math] vanilla gradient flow; Fisher metric [math]\displaystyle{ \to }[/math] natural gradient).
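For concreteness, a minimal sketch of the discretized action under a forward-difference scheme with Euclidean [math]\displaystyle{ \Gamma=\eta I }[/math] (both discretization and geometry are assumptions for illustration):
<syntaxhighlight lang="python">
import numpy as np

def sreta_action(traj, grad_W, dt, eta=0.1):
    """Forward-difference discretization of the SRETA action for a
    trajectory traj of shape (T+1, d), with Gamma = eta * I.

    The integrand penalizes deviation of phi_dot from the gradient
    flow -Gamma grad W; it vanishes iff the trajectory follows the flow.
    """
    total = 0.0
    for k in range(len(traj) - 1):
        phi_dot = (traj[k + 1] - traj[k]) / dt
        residual = phi_dot + eta * grad_W(traj[k])
        total += 0.5 * float(residual @ residual) * dt
    return total
</syntaxhighlight>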
Stationarity and the Entropic Learning Equation (ELE)
The integrand is a nonnegative quadratic that vanishes exactly when the trajectory follows the flow, so minimizing the action gives the gradient-flow ELE: [math]\displaystyle{ \dot\phi \;=\; -\,\Gamma(\phi)\,\nabla_\phi W(\phi) \;=\; -\,\Gamma(\phi)\!\left[\alpha\,\big(S_{\mathrm{self}}(\phi)-S_{\mathrm{given}}\big)\,\nabla_\phi S_{\mathrm{self}}(\phi) \;+\; \beta\,\nabla_\phi\,\mathbb{E}_{x}\!\big[\mathrm{KL}(p^*\Vert p_\phi)\big]\right]. }[/math]
Built-in Irreversibility (No ad-hoc [math]\displaystyle{ S_{\mathrm{irr}} }[/math])
Define the entropy-production rate [math]\displaystyle{ \sigma(\phi) \;=\; \big(\nabla_\phi W(\phi)\big)^{\!\top}\,\Gamma(\phi)\,\big(\nabla_\phi W(\phi)\big) \;\ge\; 0. }[/math] Because [math]\displaystyle{ \Gamma(\phi) }[/math] is positive-definite, [math]\displaystyle{ \sigma\ge 0 }[/math] holds identically: irreversibility emerges from the dynamics, not from an added penalty.
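A minimal numeric illustration of this identity (the random positive-definite [math]\displaystyle{ \Gamma }[/math] below is constructed only for the check, not prescribed by the framework):
<syntaxhighlight lang="python">
import numpy as np

def entropy_production(grad_W_phi, Gamma):
    """sigma = g^T Gamma g; nonnegative whenever Gamma is positive-definite."""
    return float(grad_W_phi @ Gamma @ grad_W_phi)

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
Gamma = A @ A.T + 4 * np.eye(4)   # positive-definite by construction
g = rng.normal(size=4)
assert entropy_production(g, Gamma) >= 0.0
</syntaxhighlight>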
Equivalent Constrained (Tracking) Form
One can enforce “SRE relaxes toward the target” explicitly via a Lagrange multiplier [math]\displaystyle{ \lambda(t) }[/math]: [math]\displaystyle{ \mathcal{L}_{\mathrm{track}}[\phi,\lambda] \;=\; \int\!\Big[\tfrac{1}{2}\,\dot\phi^{\!\top} M\,\dot\phi \;+\; \beta\,\mathbb{E}_{x}\!\big[\mathrm{KL}(p^*\Vert p_\phi)\big] \;+\; \lambda(t)\,\Big(\tfrac{d}{dt} S_{\mathrm{self}}(\phi) \;-\; \kappa\,\big[S_{\mathrm{given}}-S_{\mathrm{self}}(\phi)\big]\Big)\Big]\,dt, }[/math] with [math]\displaystyle{ \tfrac{d}{dt} S_{\mathrm{self}}(\phi) \;=\; \nabla_\phi S_{\mathrm{self}}(\phi)\cdot \dot\phi. }[/math] Stationarity yields first-order dynamics in which [math]\displaystyle{ \tfrac{d}{dt} S_{\mathrm{self}}(\phi) \;=\; \kappa\,\big[S_{\mathrm{given}}-S_{\mathrm{self}}(\phi)\big] }[/math] (i.e., exponential relaxation of SRE toward the target) while simultaneously fitting the data via the KL term. The matrix [math]\displaystyle{ M\succ 0 }[/math] sets inertial weighting; taking the overdamped limit recovers the gradient-flow ELE.
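For intuition, the enforced relaxation law is a first-order linear ODE with the closed-form solution [math]\displaystyle{ S_{\mathrm{self}}(\phi(t)) \;=\; S_{\mathrm{given}} + \big(S_{\mathrm{self}}(\phi(0)) - S_{\mathrm{given}}\big)\,e^{-\kappa t}, }[/math] so the SRE gap shrinks exponentially at rate [math]\displaystyle{ \kappa }[/math].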
Practical Design Choices
Geometry / Optimizer Mapping
Euclidean flow: [math]\displaystyle{ \Gamma(\phi)=\eta I }[/math] (learning rate [math]\displaystyle{ \eta }[/math]), giving [math]\displaystyle{ \dot\phi=-\eta\,\nabla_\phi W(\phi) }[/math].
Natural gradient: [math]\displaystyle{ \Gamma(\phi)=\eta\,F(\phi)^{-1} }[/math], with [math]\displaystyle{ F }[/math] the Fisher information.
Preconditioned/Adam-like: choose [math]\displaystyle{ \Gamma }[/math] diagonal or adaptive from running curvature estimates.
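As a sketch, the three geometries differ only in the mobility applied to the same gradient; the Fisher matrix F and curvature accumulator v below are placeholders the caller would supply, not prescribed estimators:
<syntaxhighlight lang="python">
import numpy as np

def ele_step(phi, grad, eta=0.1, mode="euclidean", F=None, v=None):
    """One discrete ELE step, phi <- phi - Gamma(phi) grad W(phi),
    for three illustrative choices of the mobility Gamma."""
    if mode == "euclidean":     # Gamma = eta * I
        return phi - eta * grad
    if mode == "natural":       # Gamma = eta * F^{-1} (F: Fisher matrix)
        return phi - eta * np.linalg.solve(F, grad)
    if mode == "adaptive":      # Gamma diagonal, Adam-like preconditioner
        return phi - eta * grad / (np.sqrt(v) + 1e-8)
    raise ValueError(mode)
</syntaxhighlight>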
Discrete-Time Approximation (for implementation)
For step size [math]\displaystyle{ \Delta t }[/math]: [math]\displaystyle{ \phi_{k+1} \;=\; \phi_k \;-\; \Delta t\,\Gamma(\phi_k)\,\nabla_\phi W(\phi_k). }[/math]
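A minimal sketch of this update with [math]\displaystyle{ \Gamma = I }[/math] and a finite-difference gradient (adequate for toy problems; in practice autodiff would replace numerical_grad):
<syntaxhighlight lang="python">
import numpy as np

def numerical_grad(f, phi, h=1e-5):
    """Central-difference gradient of a scalar function f at phi."""
    g = np.zeros_like(phi)
    for i in range(phi.size):
        e = np.zeros_like(phi)
        e.flat[i] = h
        g.flat[i] = (f(phi + e) - f(phi - e)) / (2 * h)
    return g

def train_ele(W, phi0, dt=0.05, steps=200):
    """Explicit-Euler discretization of the ELE with Gamma = I."""
    phi = phi0.astype(float).copy()
    for _ in range(steps):
        phi -= dt * numerical_grad(W, phi)  # phi_{k+1} = phi_k - dt * grad W
    return phi

# Usage (with the entropic_potential sketch above):
#   W = lambda p: entropic_potential(p, X, P_star, S_given=0.1)
#   phi_star = train_ele(W, np.zeros((d, k)))
</syntaxhighlight>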
Scaling and Weights
Non-dimensionalize so that [math]\displaystyle{ W }[/math] is order-one near the optimum, then set [math]\displaystyle{ \alpha=\beta=1 }[/math]. Use [math]\displaystyle{ \Gamma }[/math] (or [math]\displaystyle{ \eta }[/math]) to tune the time scale.
Why This Fixes the Original Formulation
Learning is explicitly state change: dynamics appear via [math]\displaystyle{ \dot\phi }[/math].
SRE alignment is the steering signal through [math]\displaystyle{ \big(S_{\mathrm{self}}-S_{\mathrm{given}}\big)\nabla_\phi S_{\mathrm{self}} }[/math].
Irreversibility ([math]\displaystyle{ \sigma\ge 0 }[/math]) is automatic from the quadratic action; no external [math]\displaystyle{ S_{\mathrm{irr}} }[/math] term is required.
Data-fit pressure is principled via KL, orthogonal to SRE alignment.
Minimal Working Set (copy-ready)
Entropic potential:
[math]\displaystyle{ W(\phi)=\tfrac{\alpha}{2}\big(S_{\mathrm{self}}(\phi)-S_{\mathrm{given}}\big)^2 +\beta\,\mathbb{E}_{x}\!\Big[\mathrm{KL}\big(p^*(\cdot\mid x)\Vert p_\phi(\cdot\mid x)\big)\Big]. }[/math]
Learning action (SRETA): [math]\displaystyle{ \mathcal{L}_{\mathrm{SRETA}}[\phi]=\int_0^T \tfrac{1}{2}\,\big\|\dot\phi+\Gamma(\phi)\,\nabla_\phi W(\phi)\big\|^2\,dt. }[/math]
Entropic Learning Equation (ELE): [math]\displaystyle{ \dot\phi = -\,\Gamma(\phi)\!\left[\alpha\,\big(S_{\mathrm{self}}(\phi)-S_{\mathrm{given}}\big)\,\nabla_\phi S_{\mathrm{self}}(\phi) + \beta\,\nabla_\phi\mathbb{E}_{x}\!\big[\mathrm{KL}(p^*\Vert p_\phi)\big]\right]. }[/math]
Entropy-production rate: [math]\displaystyle{ \sigma(\phi)=\big(\nabla_\phi W(\phi)\big)^{\!\top}\Gamma(\phi)\,\big(\nabla_\phi W(\phi)\big)\ge 0. }[/math]
Constrained tracking form (optional): [math]\displaystyle{ \mathcal{L}_{\mathrm{track}}[\phi,\lambda] = \int\!\Big[\tfrac{1}{2}\,\dot\phi^{\!\top}M\,\dot\phi +\beta\,\mathbb{E}_{x}\!\big[\mathrm{KL}(p^*\Vert p_\phi)\big] +\lambda(t)\,\big(\nabla_\phi S_{\mathrm{self}}(\phi)\cdot\dot\phi-\kappa\,[S_{\mathrm{given}}-S_{\mathrm{self}}(\phi)]\big)\Big]\,dt. }[/math]
Remarks
This framework is agnostic to the specific definition of [math]\displaystyle{ S_{\mathrm{self}}(\phi) }[/math] as long as it is smooth; examples include entropy of latent codes, complexity penalties, or energy-based state measures.
The formulation is compatible with stochastic mini-batch estimates of both the KL term and its gradients; see the sketch after this list.
The same blueprint extends to multi-objective training by summing additional potentials inside [math]\displaystyle{ W(\phi) }[/math] before taking the gradient flow.
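On the mini-batch point: because [math]\displaystyle{ p^* }[/math] does not depend on [math]\displaystyle{ \phi }[/math], the gradient of the expected KL coincides with the gradient of the expected cross-entropy, so the ordinary mini-batch cross-entropy gradient is an unbiased estimator of [math]\displaystyle{ \nabla_\phi\,\mathbb{E}_{x}\!\big[\mathrm{KL}(p^*\Vert p_\phi)\big] }[/math]. A minimal sketch for the toy linear softmax model used above (the model choice is an illustrative assumption):
<syntaxhighlight lang="python">
import numpy as np

def minibatch_kl_grad(phi, Xb, Pb_star):
    """Unbiased mini-batch gradient of E_x[KL(p* || p_phi)] for a linear
    softmax model. Identical to the cross-entropy gradient, since the
    entropy of p* is constant in phi."""
    logits = Xb @ phi
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=-1, keepdims=True)
    return Xb.T @ (P - Pb_star) / len(Xb)          # d(cross-entropy)/d(phi)
</syntaxhighlight>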
References
- Physics:Artificial Intelligence Formulated by the Theory of Entropicity (ToE). (2025, August 27). HandWiki. Retrieved 03:59, August 27, 2025, from https://handwiki.org/wiki/index.php?title=Physics:Artificial_Intelligence_Formulated_by_the_Theory_of_Entropicity(ToE)&oldid=3742591