
First published on Tuesday, Jun 17, 2025 and last modified on Tuesday, Jun 17, 2025 by François Chaplais.

Hessian Geometry of Latent Space in Generative Models
arXiv
Published version: 10.48550/arXiv.2506.10632

Alexander Lobashev, Glam AI, San Francisco, USA

Dmitry Guskov, Artificial Neural Computing Corp., Weston, FL, USA

Maria Larchenko, Magicly AI, Dubai, UAE

Mikhail Tamm, School of Digital Technologies, Tallinn University, Tallinn, Estonia

Keywords: Machine Learning, ICML

Abstract

This paper presents a novel method for analyzing the latent space geometry of generative models, including statistical physics models and diffusion models, by reconstructing the Fisher information metric. The method approximates the posterior distribution of latent variables given generated samples and uses this to learn the log-partition function, which defines the Fisher metric for exponential families. Theoretical convergence guarantees are provided, and the method is validated on the Ising and TASEP models, outperforming existing baselines in reconstructing thermodynamic quantities. Applied to diffusion models, the method reveals a fractal structure of phase transitions in the latent space, characterized by abrupt changes in the Fisher metric. We demonstrate that while geodesic interpolations are approximately linear within individual phases, this linearity breaks down at phase boundaries, where the diffusion model exhibits a divergent Lipschitz constant with respect to the latent space. These findings provide new insights into the complex structure of diffusion model latent spaces and their connection to phenomena like phase transitions. Our source code is available at https://github.com/alobashev/hessian-geometry-of-diffusion-models.

1 Introduction

State-of-the-art image generation models often exhibit abrupt changes in image appearance during interpolation, indicating a non-smooth latent space [1, 2]. These abrupt transitions have been studied in the community from two different perspectives.

Figure 1. Visualization of Theorem 2. Given samples from \( p(x|t')\) we can approximate the posterior distribution \( p(t|x_1, \dots, x_N)\) . The log-partition function \( \log Z(t)\) defines a Hessian metric on the latent space. Our Theorem 2 guarantees that \( \log Z(t)\) in the limit depends only on the posterior distributions \( p(t|x_{1}, \dots, x_{N})\) , and Theorem 3 offers a way to learn \( \log Z(t)\) from \( p(t|x_{1}, \dots, x_{N})\) .

Riemannian geometry of latent space  Recent work by [31] constructs a latent basis by considering singular vectors of the Jacobian in the feature space, deriving a metric on the latent space via pullback from the Euclidean metric in the feature space. A similar pullback-based approach from the Euclidean metric in image space was used by [11]. It was demonstrated that linear interpolation in the latent space closely approximates geodesic interpolation. [32] studied the Riemannian geometry of GAN latent spaces, deriving metrics under the assumption that the stochastic generator’s mean and variance functions are twice differentiable. Another notable study [3] proposes an approach to automatically discover interpretable directions in a latent space. In [4], the authors verify the union-of-manifolds hypothesis for various image datasets. Additionally, [33] treats latent variables as coordinates on a Riemannian manifold, using the Jacobian-derived metric to construct geodesics in latent spaces obtained through nonlinear dimensionality reduction.

Learning phase transitions in statistical physics  Machine learning methods have been widely applied to the study of classical and quantum statistical physics models [5, 6, 7, 8, 9]. These works primarily focus on determining phase transition boundaries and extracting learned order parameters, which serve to distinguish one phase from another.

In [10], it was observed that the principal components of the mean and standard deviation of latent variables in a VAE trained on 2D Ising model configurations were highly correlated with known physical quantities, suggesting that the VAE implicitly extracts sufficient statistics from the data.

We could unify these two approaches by considering a generative model as a statistical physics system and examining it using information geometry methods.

Our contributions  This work provides a novel method to analyze latent space properties in generative models. Its main contributions are:

  • For two-parametric systems or two-dimensional sections of the latent space, we propose a method to reconstruct the Fisher metric. We provide a theoretical proof of the method’s convergence. The efficiency of the method was tested on exactly solvable statistical physics models: Ising model and TASEP, demonstrating that the reconstructed free energy coincides with the exact solution.
  • Using the proposed approach, we analyze the latent spaces of generative models. We introduce the notion of distinct phases within the diffusion latent space and identify boundaries where the recovered Fisher metric exhibits abrupt changes.
  • We validate the findings of approximately linear geodesics for interpolation, as discussed in [11]. However, our work reveals that this result holds only within a single phase. We extend the analysis to phase transitions, showing that the diffusion model exhibits a divergent Lipschitz constant with respect to the latent space at phase boundaries.

2 Background

Let \( \mathcal{X}\) be a high-dimensional data space and \( \mathcal{S}\) a lower-dimensional latent space. We assume the existence of a stochastic generative mapping from \( \mathcal{S}\) to \( \mathcal{X}\) , defined by the conditional probability distribution \( p(x|t)\) on \( \mathcal{X}\) for each latent vector \( t \in \mathcal{S}\) .

A generative mapping could be a statistical physics model, such as the Ising model, where \( t = (T, H)\) represents the temperature \( T\) and external magnetic field \( H\) , and \( x\) represents a spin configuration on a two-dimensional lattice. Alternatively, a generative mapping could be a trained diffusion model using a stochastic sampler. Here, \( x\) is a generated image and \( t\) is the corresponding latent noise tensor.

2.1 Ising Model

We consider the 2D Ising model [12]. A microstate \( x\) of this model is a set of spin variables \( s_i = \pm 1\) defined on a square lattice of size \( L\times L\) . At equilibrium the probability distribution over the space of microstates is

\[ \begin{equation} p(x|H, T) = \frac{1}{Z(H, T)} e^{\frac{1}{T} \sum_{\langle i,j \rangle}s_{i}s_{j} + \frac{1}{T} H \sum_{i} s_{i}} \end{equation} \]

(1)

where \( H\) and \( T\) are external parameters called the magnetic field and temperature. This model is exactly solvable for \( H=0\) [13, 14, 15], i.e. \( Z(H,T)\) can be found analytically. The model demonstrates a phase transition at \( T_{cr}\approx 2.27\) between the high-temperature disordered state, where the distribution concentrates on microstates whose spins average to zero, and the low-temperature ordered state, where it concentrates on microstates with non-zero average spin.
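As a concrete illustration, the equilibrium distribution above can be sampled with a standard Metropolis scheme. The sketch below is not from the paper's code; it assumes the ferromagnetic convention \( E = -\sum_{\langle i,j\rangle} s_i s_j - H\sum_i s_i\) and periodic boundaries:

```python
import numpy as np

def ising_metropolis(L=16, T=2.27, H=0.0, sweeps=100, rng=None):
    """Sample a 2D Ising microstate via Metropolis updates (periodic boundaries)."""
    rng = np.random.default_rng(rng)
    s = rng.choice([-1, 1], size=(L, L))
    for _ in range(sweeps):
        for _ in range(L * L):
            i, j = rng.integers(L, size=2)
            # Sum of the four nearest neighbours with periodic wrap-around.
            nb = s[(i + 1) % L, j] + s[(i - 1) % L, j] + s[i, (j + 1) % L] + s[i, (j - 1) % L]
            # Energy change of flipping spin (i, j) for E = -sum s_i s_j - H sum s_i.
            dE = 2.0 * s[i, j] * (nb + H)
            if dE <= 0 or rng.random() < np.exp(-dE / T):
                s[i, j] = -s[i, j]
    return s
```

Running this at several \( (T, H)\) grid points yields the kind of microstate dataset the method consumes.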

2.2 Totally Asymmetric Simple Exclusion Process

The totally asymmetric simple exclusion process (TASEP) is a simple model of one-dimensional transport phenomena [16, 17, 18]. A microscopic configuration is a set of particles on a 1D lattice. Each particle can move to the site to the right of it with probability \( p\, dt\) per time \( dt\) provided that it is empty (we put \( p = 1\) without loss of generality). An important special case is open boundary conditions, where a particle is added with probability \( \alpha dt\) per time \( dt\) to the leftmost site provided that it is empty and removed with probability \( \beta dt\) per time \( dt\) from the rightmost site provided that it is occupied. For this boundary condition the probability distribution is known exactly:

\[ \begin{equation} p(x|\alpha, \beta) = \frac{f(x|\alpha, \beta)}{Z(\alpha, \beta)} , \end{equation} \]

(2)

where a microstate \( x\) is a concrete sequence of filled and empty cells. Importantly, the function \( f\) , which is known exactly, does not take the form of the exponential family, Eq. (6). TASEP with free boundaries exhibits a rich phase behavior: for large system sizes, three distinct phases are possible depending on the values of \( \alpha, \beta\) — the low-density phase, the high-density phase, and the maximal-current phase — and the asymptotic “free energy” equals:

\[ \begin{equation} F_{\text{TASEP}}(\alpha, \beta) = \left\{ \begin{array}{ll} \frac{1}{4}, & \alpha>\frac{1}{2},\ \beta>\frac{1}{2};\\ \alpha(1-\alpha), & \alpha<\beta,\ \alpha<\frac{1}{2};\\ \beta(1-\beta), & \beta<\alpha,\ \beta<\frac{1}{2}. \end{array}\right. \end{equation} \]

(3)
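The piecewise free energy of Eq. (3) is straightforward to evaluate; a minimal helper (a hypothetical function name, not from the released code):

```python
import numpy as np

def tasep_free_energy(alpha, beta):
    """Asymptotic TASEP 'free energy' of Eq. (3) as a function of boundary rates."""
    if alpha > 0.5 and beta > 0.5:   # maximal-current phase
        return 0.25
    if alpha < beta:                 # low-density phase (implies alpha < 1/2)
        return alpha * (1.0 - alpha)
    return beta * (1.0 - beta)       # high-density phase (beta < 1/2)
```

The kinks of this surface along \( \alpha=\beta<\frac{1}{2}\) and at \( \alpha,\beta=\frac{1}{2}\) are the phase boundaries the reconstruction should recover.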

2.3 Information Geometry

Fisher metric  The Fisher metric for a distribution \( p(x|t)\) is defined as

\[ \begin{equation} g_{F}(t) = \int_{\mathcal{X}} p(x|t) \nabla_{t} \log p(x| t) (\nabla_{t} \log p(x|t))^{T} dx \end{equation} \]

(4)

The Fisher metric is a Riemannian metric. It equips the space \( \mathcal{S}\) of parameters \( t\) with the structure of a Riemannian manifold \( (\mathcal{S}, g_{F})\) .
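For intuition, Eq. (4) can be checked by Monte Carlo on a toy family. The sketch below (illustrative only, not from the paper) uses \( p(x|t)=\mathcal{N}(t,1)\) , whose score is \( x-t\) and whose Fisher information is exactly 1 for every \( t\) :

```python
import numpy as np

def fisher_mc(t, n=200_000, rng=0):
    """Monte Carlo estimate of Eq. (4) for p(x|t) = N(t, 1): E_x[(d/dt log p)^2]."""
    x = np.random.default_rng(rng).normal(t, 1.0, size=n)
    score = x - t                  # d/dt log p(x|t) for a unit-variance Gaussian
    return np.mean(score ** 2)     # exact value is 1 for every t
```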

Hessian metric  A Riemannian metric \( g\) is called a Hessian metric if it can be expressed as the Hessian of a convex potential \( \phi\) in local coordinates:

\[ \begin{equation} g = \sum\limits_{i,j=1}^{N}\frac{\partial^{2} \phi(x)}{\partial x^{i} \partial x^{j}} dx^{i} dx^{j}. \end{equation} \]

(5)

Exponential Family  The exponential family consists of distributions of the form

\[ \begin{equation} p(x|t) = e^{\langle f(x), t \rangle - \log Z(t)}, \end{equation} \]

(6)

where the partition function \( Z(t)\) is given by

\[ \begin{equation} Z(t) = \int_{\mathcal{X}} e^{\langle f(x), t \rangle} dx. \end{equation} \]

(7)

The function \( f(x)\) is called the unnormalized density in machine learning and the Hamiltonian in statistical physics. It is generally unknown, making direct integration of Eq. (7) impossible.

A key property of exponential families is that their Fisher metric is always Hessian, equaling the Hessian of the log-partition function:

\[ \begin{equation} g_F(t) = \sum\limits_{i,j=1}^{N}\frac{\partial^{2} \log Z(t)}{\partial t^{i} \partial t^{j}} dt^{i} dt^{j} = \nabla^2 \log Z(t) \end{equation} \]

(8)
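This identity is easy to verify numerically on a one-parameter example. The sketch below (illustrative, not from the paper) uses the Poisson family in natural parameters, where \( \log Z(t)=e^{t}\) , so both the finite-difference Hessian of \( \log Z\) and the Monte Carlo variance of the sufficient statistic should equal \( e^{t}\) :

```python
import numpy as np

def log_Z(t):
    """Log-partition of the Poisson family in natural parameters: log Z(t) = e^t."""
    return np.exp(t)

def hessian_fd(f, t, h=1e-4):
    """Second derivative by central finite differences (the 1D case of Eq. 8)."""
    return (f(t + h) - 2.0 * f(t) + f(t - h)) / h**2

def fisher_mc_poisson(t, n=500_000, rng=0):
    """Monte Carlo Fisher information (Eq. 4): variance of the sufficient statistic x."""
    x = np.random.default_rng(rng).poisson(np.exp(t), size=n)
    return x.var()
```

Both estimates agree with \( e^{t}\) , illustrating that for exponential families the Fisher metric is the Hessian of the log-partition function.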

Hessianizability  A natural question arises: when does a Riemannian metric admit a Hessian structure? The Bryant–Amari–Armstrong theorem [19, 20, 21] states that this is always possible locally on two-dimensional analytic manifolds; the result was later extended to the smooth case [21]:

Theorem 1

(Bryant–Amari–Armstrong) Any analytic Riemannian metric on a 2-dimensional manifold locally admits a Hessian representation.

While Theorems 2 and 3 presented in the next section apply to arbitrary dimensions of the data space \( \mathcal{X}\) and latent space \( \mathcal{S}\) , they require a special (exponential) form of the data distribution. Theorem 1 allows us to analyze two-dimensional subspaces of the latent spaces of GANs, diffusion models, and other non-exponential generative models. Notably, the approach proposed below is theoretically justified for any generative model.

3 Method

Figure 2. Visualizing Bregman divergence. (Top) A convex function \( \log Z(t)\) (black curve) and its tangent line (blue) at a point \( t'\) . The Bregman divergence \( D_B(t,t')\) (vertical green segment) measures the difference between \( \log Z(t)\) and the linear approximation at \( t'\) . (Bottom) The exponential distribution \( e^{-D_B(t, t')}\) (blue surface) approximates the posterior distribution on parameters \( p(t| x_{1}, \dots, x_{N})^{\frac{1}{N}}\) , where \( x_{1}, \dots, x_{N} \sim p(x|t')\) , illustrating Theorem 2.

This section describes a method for approximating the Fisher metric on the latent space of a generative model, \( p(x|t)\) , enabling the computation of geodesics for smoother interpolation and the identification of phase transitions.

Given a generative model \( p(x|t)\) we proceed in two steps: (1) approximating the posterior distribution \( p(t|x)\) , and (2) estimating the log-partition function \( \log Z(t)\) by training a network to match this posterior. Theorems 2 and 3 justify this approach for exponential families.

Theorem 2

Let \( \mathcal{X} \) be a space of data samples \( x \in \mathcal{X} \) , and \( S \subset \mathbb{R}^n \) be a compact domain with a continuous prior distribution \( p(t)\) supported on \( S\) . Suppose the conditional distribution of data samples given parameter \( t\) is an exponential family

\[ \begin{equation} p(x|t) = e^{\langle t, f(x) \rangle - \log Z(t)}, \end{equation} \]

(9)

where

\[ \begin{equation} Z(t) = \int_{\mathcal{X}} e^{\langle t, f(x) \rangle} dx \end{equation} \]

(10)

converges for all \( t \in S \) . Let \( x_1, \dots, x_N \sim p(x|t') \) . Then, as \( N\to\infty\) , the posterior distribution satisfies:

\[ \begin{equation} \lim_{N \to \infty} \left(p(t|x_1, \ldots, x_N)\right)^{1/N} \overset{\text{a.s.}}{=} e^{-D_{\log Z}(t, t')} \end{equation} \]

(11)

where \( D_{\log Z}(t, t')\) is the Bregman divergence between exponential family distributions

\[ \begin{equation} \begin{split} &D_{\log Z}(t, t') = \\ &= \log Z(t) - \log Z(t') - \langle \nabla_{t'} \log Z(t'), t - t' \rangle \end{split} \end{equation} \]

(12)

For exponential families, \( D_{\log Z}(t, t')\) coincides with the Kullback–Leibler divergence \( D_\text{KL}(p(x|t') \| p(x|t))\) . Theorem 2 thus implies that, as more data samples are observed, the posterior concentrates on parameters minimizing the KL divergence from the true \( t'\) . For intuition, see Fig. 2.
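This equality can be checked in closed form for the Poisson family (\( \log Z(t)=e^t\) , rate \( \lambda = e^t\) ); the helper names below are hypothetical:

```python
import numpy as np

def bregman_logZ(t, tp):
    """Bregman divergence (Eq. 12) of log Z(t) = e^t, i.e. the Poisson family."""
    return np.exp(t) - np.exp(tp) - np.exp(tp) * (t - tp)

def kl_poisson(lam_p, lam):
    """Closed-form KL(Poisson(lam_p) || Poisson(lam))."""
    return lam_p * np.log(lam_p / lam) + lam - lam_p
```

With \( \lambda' = e^{t'}\) and \( \lambda = e^{t}\) , the two expressions agree term by term.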

Theorem 3

Suppose that the following integral converges to zero

\[ \begin{equation} \int_{S} \int_{S} \left| e^{-D_{\log Z_{1}}(t,t')} - e^{-D_{\log Z_{2}}(t,t')} \right|^{2} dt \, dt' \to 0, \end{equation} \]

(13)

where \( D_{\log Z_{1}}(t,t')\) is the Bregman divergence generated by \( \log Z_{1}\) . Then the Hessian of \( \log Z_{1}(t)\) converges to the Hessian of \( \log Z_{2}(t)\) uniformly in \( t\)

\[ \begin{equation} || \nabla^{2}\log Z_{1}(t) - \nabla^{2}\log Z_{2}(t) || \to 0, \end{equation} \]

(14)

where \( ||\cdot||\) denotes the \( L^2\) norm.

In other words, Eq. 13 is an MSE loss function for training \( \log Z(t)\) . Strictly speaking, it recovers the log-partition function only up to an affine transformation \( \log Z(t) \mapsto \log Z(t) + \langle c, t\rangle + b\) . However, our goal is to examine the metric \( g(t) = \nabla^2 \log Z(t)\) , which does not depend on the affine term. Unfortunately, training with the MSE loss suffers from vanishing gradients during the initial optimization stages. Loss selection for the experiments is covered in Section 3.2.

For simplicity, we assume a uniform parameter distribution \( p(t)\) over \( S\) :

\[ \begin{equation} p(t) = \left\{ \begin{array}{ll} \frac{1}{\text{Vol}(S)}, & t \in S, \\ 0, & \text{otherwise}. \end{array}\right. \end{equation} \]

(15)

This choice avoids bias toward specific regions of \( S\) , ensuring equitable exploration of the latent space.

3.1 Approximation of the Posterior

The first step is illustrated in Fig. 1. During this step we obtain an approximation of the posterior distribution \( p(t | x_1, \dots, x_N)\) from a set of samples.

Training a Mapping  One could deal with this task by directly training a mapping model that takes as input \( x_1, \dots, x_N\) sampled from \( p(x|t')\) and returns a normalized probability distribution on a compact domain \( S\) . This approach is most suitable when the samples are stochastic in nature and no feature extractor can be utilized.

This is the case for the Ising and TASEP statistical systems, where a sample \( x_i\) is a microstate, i.e. one of the many indistinguishable realizations of the system with the given parameters \( t\) . Importantly, such samples can be pixel-wise uncorrelated. Therefore, in this case we use a \( U^2\) -Net [22] trained by maximizing the likelihood of the true parameters \( t'\) . For training details, refer to Appendix B.3.

Figure 3

Using a Feature Extractor  In the image domain, one could use a pre-trained feature extractor. Note that from Theorem 2, it follows that the posterior distribution behaves as \( \sim \exp(-N D_{\text{KL}}(p_1 \parallel p_2))\) . Given a pre-trained feature extractor \( \mathcal{E}\) , we can construct an approximation of \( D_{\text{KL}}\) in terms of a distance between feature vectors:

\[ \begin{equation} \begin{split} D_{\text{KL}}(p(x|t_{1}) \parallel p(x|t_{2})) &\approx d(\mathcal{E}(x_{1}), \mathcal{E}(x_{2})), \\ x_{1} &\sim p(x|t_{1}), \\ x_{2} &\sim p(x|t_{2}). \end{split} \end{equation} \]

(16)

Indeed, if the feature extractor \( \mathcal{E}\) acts as an approximate sufficient statistic, then

\[ \begin{equation} D_{\text{KL}}(p(x|t_{1}) \parallel p(x|t_{2})) = D_{\text{KL}}(p(\mathcal{E}(x)|t_{1}) \parallel p(\mathcal{E}(x)|t_{2})). \end{equation} \]

(17)

If the distribution of features \( p(\mathcal{E}(x)|t_{i})\) under \( t_i\) is approximately normal \( \mathcal{N}(\mu_i, I)\) , then the KL divergence simplifies (up to an additive constant) to the squared difference between the feature means:

\[ \begin{equation} D_{\text{KL}}(p(\mathcal{E}(x)|t_{1}) \parallel p(\mathcal{E}(x)|t_{2})) = \frac{1}{2} \|\mu_{1} - \mu_{2}\|^{2}. \end{equation} \]

(18)

Then the posterior can be approximated as

\[ \begin{equation} \begin{split} p(t|x_{1}, \dots, x_{N}) &\approx e^{- \frac{N}{2}||\mathcal{E}(x)- \mathcal{E}(x')||^{2}}, \\ x \sim p(x|t), \quad & x' \sim p(x|t'). \end{split} \end{equation} \]

(19)

Below, in the Experiments section, we take CLIP as the feature extractor \( \mathcal{E}\) and approximate the posterior distribution from distances between CLIP image embeddings.
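Eq. (19) translates directly into a few lines of numpy. The sketch below is a hypothetical helper (array shapes and names are ours; the CLIP embeddings are assumed precomputed):

```python
import numpy as np

def posterior_from_features(feats, feats_ref, N=1):
    """Normalized posterior over a latent grid, following Eq. (19).

    feats:     (G, d) array, one embedding E(x) per grid point t.
    feats_ref: (d,)  embedding E(x') of a sample drawn at the query latent t'.
    """
    d2 = ((feats - feats_ref) ** 2).sum(axis=1)   # ||E(x) - E(x')||^2 per grid point
    w = np.exp(-0.5 * N * (d2 - d2.min()))        # subtract the min for numerical stability
    return w / w.sum()                            # normalize over the grid
```

The resulting heatmap over the grid is what the log-partition network is trained against in the next section.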

3.2 Approximation of the Fisher Metric

The loss function in Equation 13, while theoretically ensuring Hessian convergence as stated in Theorem 3, suffers from vanishing gradients during the initial optimization stages (see Lemma 5 in the Appendix). To address this problem, we normalize \( e^{-N D_{\log Z}(t,t')} \) to obtain a probability distribution. Then we compare it with the posterior distribution \( p(t \mid x_1, \dots, x_N) \) using the Jensen–Shannon divergence (JSD), whose square root is a proper metric between probability distributions.

Lemma 1

Let \( Z: S \to (0,\infty) \) with \( S \subset \mathbb{R}^n \) a compact domain and \( \log Z \) convex. Then for any fixed \( t \in S \) :

\[ \begin{equation} \begin{split} &\frac{\exp\left(-D_{\log Z}(t, t')\right)}{\int_{S} \exp\left(-D_{\log Z}( s, t')\right) \, ds} = \\ &= \frac{\exp\left(\langle t, \nabla_{t'} \log Z(t') \rangle - \log Z(t)\right)}{\int_{S} \exp\left(\langle s, \nabla_{t'} \log Z(t') \rangle - \log Z(s)\right) \, ds}, \end{split} \end{equation} \]

(20)

where \( D_{\log Z}(t, t')\) is the Bregman divergence.

Following Lemma 1, define a normalized distribution that depends only on the log-partition function:

\[ \begin{equation} p_{\log Z}(t|t') = \frac{\exp\left(\langle t, \nabla_{t'} \log Z(t') \rangle - \log Z(t)\right)}{\int_{S} \exp\left(\langle s, \nabla_{t'} \log Z(t') \rangle - \log Z(s)\right) \, ds} \end{equation} \]

(21)

Now, given the posterior \( p(t|x_1, \dots, x_N)\) , whose approximation was discussed in the previous section, we can train the log-partition function \( \log Z_{\theta}(t)\) by minimizing the loss

\[ \begin{equation} \mathcal{L}_1(\theta) = \int_{\mathcal{S} }D_{\text{JS}}\left(p(t|x_1, \ldots, x_N), p_{\log Z_{\theta}}(t|t')\right) dt', \end{equation} \]

(22)

where \( x_1, \dots, x_N \sim p(x|t')\) . The Jensen–Shannon divergence for distributions \( P\) and \( Q\) is defined as

\[ \begin{equation} D_{\text{JS}}(P, Q) = \frac{1}{2} \left[ D_{\text{KL}}\left(P \,\middle\|\, \frac{P+Q}{2}\right) + D_{\text{KL}}\left(Q \,\middle\|\, \frac{P+Q}{2}\right)\right] \end{equation} \]

(23)
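For discrete distributions on a parameter grid, Eq. (23) is a few lines of numpy; the `eps` guard below is our addition to avoid \( \log 0\) :

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (Eq. 23) between discrete distributions on a grid."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

It is zero for identical distributions and reaches \( \log 2\) for disjoint supports (natural log), which keeps the training signal bounded.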

The resulting approximation of the Fisher metric is

\[ \begin{equation} g_{F}(t) = \nabla^{2} \log Z_{\theta^{*}}(t), \quad \theta^{*} = \operatorname*{argmin}_{\theta} \mathcal{L}_1(\theta) \end{equation} \]

(24)

We model \( \log Z_{\theta}(t)\) by an MLP with 5 hidden layers of width 512 and ReLU activations. We do not require the MLP to be convex, as it converges to a convex function during training.

3.3 Geodesic Approximation

After obtaining the Fisher metric, \( g_F\) , we unlock the ability to explore the intrinsic geometry of our statistical model space. Geodesics, in this context, represent the shortest paths between two probability distributions within this space. They are analogous to straight lines on a flat surface, but in a potentially curved space dictated by the Fisher metric. To find these geodesics, we aim to minimize the curve length, \( L[\gamma(t)]\) , for a smooth curve \( \gamma(t)\) parameterized from \( t=0\) to \( t=1\) and lying within our statistical model space. This curve length is calculated using the Fisher metric as:

\[ \begin{equation} L[\gamma(t)] = \int_{0}^{1} \sqrt{\dot{\gamma}(t)^{T}g_{F}(\gamma(t))\dot{\gamma}(t)}\,dt \end{equation} \]

(25)

Here, \( \gamma(t)\) represents a path in the parameter space, and \( \dot{\gamma}(t) = \frac{d\gamma(t)}{dt}\) is its tangent vector.

We split \( \gamma(t)\) into discrete points \( \{\gamma_0, \gamma_1, \dots, \gamma_N\}\) , where \( \gamma_0\) and \( \gamma_N\) are the starting and ending points of the desired geodesic. The continuous integral is then approximated by a discrete sum of distances between consecutive points, measured with the Fisher metric, and the intermediate points are optimized via Adam to minimize \( L[\gamma]\) (see [11] for details).
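The discretization above can be sketched as a small numpy routine. This is an illustrative reimplementation with numerical gradients rather than the paper's Adam-based optimizer, for a user-supplied metric field `g`:

```python
import numpy as np

def geodesic_discrete(g, a, b, n=12, steps=50, lr=0.1):
    """Discrete geodesic between a and b under a metric field g(t) -> SPD matrix.

    Minimizes the discrete energy sum_k d_k^T g(mid_k) d_k over interior points,
    a standard surrogate for the length functional in Eq. (25).
    """
    path = np.linspace(a, b, n)                      # initialize with linear interpolation
    def energy(p):
        d = np.diff(p, axis=0)
        mid = 0.5 * (p[:-1] + p[1:])
        return sum(dk @ g(mk) @ dk for dk, mk in zip(d, mid))
    for _ in range(steps):
        grad = np.zeros_like(path)
        for i in range(1, n - 1):                    # numerical gradient; endpoints fixed
            for j in range(path.shape[1]):
                e = np.zeros_like(path)
                e[i, j] = 1e-5
                grad[i, j] = (energy(path + e) - energy(path - e)) / 2e-5
        path[1:-1] -= lr * grad[1:-1]
    return path
```

Under a constant (Euclidean) metric the optimizer leaves the linear initialization unchanged, matching the observation that geodesics are approximately straight within a single phase.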

4 Experiments and Results

To support our theoretical reasoning, in this section we validate our method on exactly solvable statistical models, namely the Ising model and TASEP, and compare our estimate of \( \log Z(t)\) with the ground truth. Then we evaluate our method on two-dimensional slices of diffusion models, comparing the path length and curvature of the learned trajectories with those produced by other methods.

Figure 4. 2D Ising model. Partial derivative of the reconstructed free energy with respect to temperature \( \frac{\partial F_{\text{rec}}(T, H)}{\partial T}\) , Partial derivative of the reconstructed free energy with respect to magnetic field \( \frac{\partial F_{\text{rec}}(T, H)}{\partial H}\) and reconstructed free energy. See Appendix B for dataset and training details.
Figure 5. TASEP. Left: reconstructed free energy (red) compared to the exact solution (blue). Our method achieves near-exact agreement except near domain boundaries (Appendix B.1)
Figure 6. Free energy of diffusion model reconstructed with CLIP distance.
Figure 7. The fractal structure of the phase boundary in the interpolation landscape of the diffusion model. The last plot shows parameter variations of \( 10^{-5}\) between neighboring images. Small changes in latent space cause switching between a mountain and a lion.

4.1 Exactly Solvable Statistical Models

We compare against two baselines: (1) posterior-mean-as-statistics (Mean-as-Stat) approach [23, 24], and (2) PCA-VAE, a dimensionality reduction approach following [10] (see Appendix B.1 for details of baselines evaluation).

Table 1 compares the performance of our method (“Convex”) against two baselines on the 2D Ising and TASEP models. We report the root mean squared error (RMSE) for the reconstructed free energy (\( F\) ) and its partial derivatives with respect to the model parameters (\( \frac{dF}{dT}\) , \( \frac{dF}{dH}\) for Ising; \( \frac{dF}{d\alpha}\) , \( \frac{dF}{d\beta}\) for TASEP). Accurate reconstruction of the partial derivatives is critical for identifying phase transitions, where these derivatives diverge. Our method achieves lower RMSE across all metrics in reconstructing thermodynamic quantities.

Fig.4 and Fig.5 demonstrate the results of free energy \( F \sim \log Z(t)\) reconstruction for both systems.

Table 1. Performance results (RMSE) for ISING and TASEP models. F stands for Free Energy.
ISING | F RMSE | \( \frac{dF}{dT}\) RMSE | \( \frac{dF}{dH}\) RMSE
Convex (Ours) | \( \bf{0.0883} \pm \bf{0.0006}\) | \( \bf{0.1106} \pm \bf{0.0002}\) | \( \bf{0.1237} \pm \bf{0.0016}\)
Mean-as-Stat | \( 0.0981 \pm 0.0010\) | \( 0.4766 \pm 0.0023\) | \( 1.0936 \pm 0.0033\)
PCA-VAE | \( 0.1669 \pm 0.0018\) | \( 0.7428 \pm 0.0025\) | \( 0.7988 \pm 0.0022\)
TASEP | F RMSE | \( \frac{dF}{d\alpha}\) RMSE | \( \frac{dF}{d\beta}\) RMSE
Convex (Ours) | \( \bf{0.0112} \pm \bf{0.00008}\) | \( \bf{0.1165} \pm \bf{0.0025}\) | \( \bf{0.1135} \pm \bf{0.0017}\)
Mean-as-Stat | \( 0.0529 \pm 0.0005\) | \( 0.3832 \pm 0.0038\) | \( 0.3833 \pm 0.0031\)
PCA-VAE | \( 0.0524 \pm 0.0006\) | \( 0.3837 \pm 0.0038\) | \( 0.3872 \pm 0.0022\)

4.2 Two-dimensional Slices of a Diffusion Model

The diffusion experiments are based on StableDiffusion 1.5 (Dreamshaper8) [25] with the DDIM scheduler [26]. For generation we use 50 inference steps and a classifier-free guidance scale of \( 5\) . The prompt is “High quality picture, 4k, detailed” and the negative prompt is “blurry, ugly, stock photo”.

To build a two-dimensional latent-space section of the diffusion model we generate 3 random initial latents \( z_0, z_1, z_2\) . We use interpolation between the latent representations:

\[ \text{z} = z_0 + \alpha (z_1 - z_0) + \beta (z_2 - z_0), \]

where \( \alpha\) and \( \beta\) are uniformly sampled from \( U[0,1]\) . In this setup we take three different triplets of initial latents and, for each, generate 60000 images with a fixed prompt and the initial latent uniformly sampled from the interpolation grid. To prevent the interpolation from falling off the Gaussian hypersphere, in all our experiments we normalize the latent vector \( z\) .
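The slice construction and hypersphere normalization can be sketched as follows (a minimal illustration, not the paper's code; rescaling to radius \( \sqrt{\dim z}\) is one common choice for staying on the shell where Gaussian mass concentrates):

```python
import numpy as np

def slice_latent(z0, z1, z2, alpha, beta):
    """2D slice of the initial-noise space with re-projection onto the Gaussian hypersphere."""
    z = z0 + alpha * (z1 - z0) + beta * (z2 - z0)
    # Interpolated latents drift off the radius-sqrt(dim) shell where standard
    # Gaussian mass concentrates; rescale so the model sees an in-distribution input.
    return z * np.sqrt(z.size) / np.linalg.norm(z)
```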

At first we limit ourselves to the deterministic generation process, setting the DDIM parameter \( \eta=0\) .

Using a CLIP Feature Extractor  For this dataset we compute CLIP distances to approximate the posterior distribution as discussed in Sec. 3.1. Following our method, we train a model to reconstruct \( \log Z(t) = \log Z(\alpha,\beta)\) , presented in Fig. 6. The retrieved \( \log Z(t)\) is not smooth and has abrupt changes in its derivatives, shown in Fig. 8(C), reflecting phase transitions in image space. To evaluate the emergent metric, we analyze geodesic paths. Our results demonstrate that geodesics are approximately linear within a single phase but exhibit nonlinear deviations at phase boundaries, Fig. 8(B). The Fisher metric enables the construction of smoother interpolation paths (geodesics) compared to linear interpolation, Fig. 8(A).

We then investigate images near the phase boundaries. Unlike classical statistical physics models with continuous phase boundaries, the latent diffusion model exhibits phase diagrams with fractal boundaries, as illustrated in Fig. 7.

Crucially, we observe that near these boundaries, samples from neighboring phases blend together, leading to a phenomenon where phases permeate one another. Repeated magnification of the boundary shown in Fig. 7 reveals self-similar patterns across scales, from a variation scale of \( 10^{-5}\) down to the float16 accuracy of \( 10^{-8}\) . In other words, near the phase boundaries the diffusion output is highly sensitive to small changes in latent space.

This behaviour is commonly characterized by the Lipschitz constant. The work by Yang et al. [27] discusses the Lipschitz constant of diffusion models with respect to the time variable. However, to the best of our knowledge, the observation of a divergent Lipschitz constant with respect to the latent space is new.

The Fisher metric is no longer analytic or smooth there, meaning the Bryant–Amari–Armstrong theorem does not apply, and the metric cannot be expressed as the Hessian of a convex function.

Using a Unet Mapping  We investigate our method on the diffusion model with the \( U^2\) -Net and observe a distinct outcome. The Unet does not take semantic information into account, relying solely on the pixel-space representation. Therefore it is able to predict the exact parameters \( (\alpha,\beta)\) of image generation. From this perspective the generation is completely deterministic and every image is distinguishable. Thus, the posterior distribution exactly recovers the target one, resulting in a smooth \( \log Z(\alpha,\beta)\) .

Following this result, we evaluate our method on generation with DDIM \( \eta=0.1\) . We observe that adding noise simply blurs the posterior distribution prediction. In the case of \( U^2\) -Net this does not change the learned geometry, following the argument above. In the case of CLIP, however, it smooths the free energy landscape, which therefore does not show the sharp boundaries of the noiseless case.

Figure 8. (A) Geodesic (ours) and linear interpolation between images. Note that our interpolation variant is perceptually smoother. (B) Illustration of 60,000 images generated by StableDiffusion with geodesic plots. (C) First and second derivatives of the log-partition function of the diffusion model. Note that the second derivatives are the diagonal components of the Fisher metric.

Baselines and Metrics  To evaluate our method, we compare it against the approaches of [8] and [11], which represent related work in deterministic generative models. Below, we clarify the key differences and present quantitative results from these comparisons.

[11] study VAEs and compute geodesic curves based on the pullback metric induced by the Euclidean metric in pixel space. [8] consider GAN models and define a metric on the latent space via the pullback of the \( \ell_2\) distance in the feature space of VGG-19 (LPIPS distance). In both cases, the generative models are deterministic: each latent \( Z\) produces only a single image \( X\) .

The key distinction of our work lies in its consideration of a broader class of models, specifically, models with stochastic generation, where a single latent \( Z\) corresponds to a distribution in the image space \( p(X|Z)\) . Diffusion models with stochastic sampling (and all statistical physics models) cannot be addressed within the formalism of [11] or [8].

We compare our algorithm with these baselines in the deterministic sampling regime. We obtain the pullback metric from the Euclidean metric in the space of CLIP embeddings. The Jacobian is estimated via finite differences. We use three evaluation scores: CLIP, pixel, and Perceptual Path Length (PPL) [28], which measure the average path length in CLIP embedding space, pixel space, and the feature space of VGG-19, respectively. These scores are computed as the cumulative \( \ell_2\) distance between consecutive feature vectors (or images for pixel space), then averaged across multiple trajectories (Table 2).
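The trajectory scores above can be sketched in a few lines. This is a minimal illustration (our own, not the paper's evaluation code), assuming each trajectory is given as a \( (T, d)\) array of per-step feature vectors (CLIP embeddings, VGG-19 features, or flattened pixels); the function names are ours:

```python
import numpy as np

def path_length(features):
    # Cumulative l2 distance between consecutive feature vectors.
    # features: (T, d) array of per-step embeddings along one trajectory.
    diffs = np.diff(features, axis=0)
    return np.linalg.norm(diffs, axis=1).sum()

def mean_curvature(features):
    # Mean angular change per unit length between consecutive path segments.
    seg = np.diff(features, axis=0)
    lens = np.linalg.norm(seg, axis=1)
    cosang = np.sum(seg[:-1] * seg[1:], axis=1) / (lens[:-1] * lens[1:])
    angles = np.arccos(np.clip(cosang, -1.0, 1.0))
    return angles.sum() / lens.sum()

# Sanity check on a straight line: zero curvature,
# length equal to the endpoint distance.
line = np.linspace([0.0, 0.0], [3.0, 4.0], 11)
print(path_length(line))     # length of the (3, 4) segment, i.e. ~5.0
print(mean_curvature(line))  # ~0
```

The reported CLIP, pixel, and PPL scores all follow this pattern, differing only in which feature extractor produces the `features` array.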

Table 2. Comparison of trajectory lengths and curvature in deterministic sampling regime.
Metric | Geodesic (Ours) | Linear | Geodesic (Wang/Shao)
CLIP Length | \( \bf{72.3} \pm \bf{4.00}\) | \( 73.6 \pm 3.54\) | \( 73.6 \pm 4.37\)
Pixel Length | \( 2.77 \times 10^{6} \pm 2.38 \times 10^{4}\) | \( 2.76 \times 10^{6} \pm 2.77 \times 10^{4}\) | \( \bf{2.74} \times 10^{6} \pm \bf{3.53} \times 10^{4}\)
Perceptual Path Length | \( \bf{3.12} \pm \bf{0.16}\) | \( 3.17 \pm 0.23\) | \( 3.19 \pm 0.21\)
Mean Curvature | \( \bf{0.367} \pm \bf{0.691} \) | \( 0.00 \pm 0.00\) | \( 1.33 \pm 0.53\)

We observe that our interpolation is on par with the baselines in the deterministic sampling regime. We additionally compute the curvature of each trajectory as the mean angular change per unit length between consecutive path segments. Trajectories constructed with the Wang metric show significantly higher curvature and more frequent turning than those produced by our method. We attribute this behavior to the finite differences, which introduce high-frequency noise into the metric components. For diffusion models, obtaining the Jacobian via backpropagation, as suggested in [11], is impractical due to its high computational cost. During our evaluation the phase boundaries remained stable across all tested feature extractors, so we conclude that the algorithm consistently captures the same phase structure.

Underlying mechanism for phase transition  We believe the observed phase transition behavior is connected to the geometry of the image space. This idea aligns with the findings of [4], which show that natural images lie on a union of disjoint low-dimensional manifolds of varying dimension.

In diffusion models, generation begins from a high-dimensional Gaussian latent distribution. The reverse ODE process maps this distribution onto disjoint, lower-dimensional manifolds corresponding to distinct image modes. Such a transformation—from a unimodal latent space to a multimodal data space with disjoint supports—may result in a diverging Lyapunov exponent or, equivalently, a diverging Lipschitz constant in the generative mapping, indicating phase transitions.

This phenomenon can be illustrated by the following proposition, which simulates a lower-dimensional data manifold with disjoint supports.

Proposition 1

Suppose that the (target) data distribution is a bimodal mixture of two Gaussians, each with variance \( \sigma^2\) :

\[ \begin{equation} p_0(x) = \frac{1}{2}\mathcal{N}(x\mid -1,\sigma^2) + \frac{1}{2}\mathcal{N}(x\mid 1,\sigma^2). \end{equation} \]

(26)

The latent (source) distribution is the standard normal \( \mathcal{N}(x\mid 0,1)\) . Consider the variance-preserving SDE

\[ \begin{equation} dX_t = -\frac{1}{2}\beta X_t\, dt + \sqrt{\beta}\, dW_t. \end{equation} \]

(27)

Then the Lyapunov exponent of the corresponding reverse-time ODE at \( x=0\) has the following form:

\[ \begin{equation} \lambda = \frac{\beta}{2} \left(1 + \frac{1 - \sigma^2}{\sigma^4} \right), \end{equation} \]

(28)

and it diverges to infinity as \( \sigma \to 0\) . In this case, the point \( x=0\) can be interpreted as a phase transition boundary.

Figure 9. Illustration of Proposition 4.1
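Equation (28) can be checked numerically. For the bimodal mixture, a direct computation gives the score derivative at the origin, \( \partial_x s(0) = (1-\sigma^2)/\sigma^4\) , so \( \lambda = \frac{\beta}{2}\left(1 + \partial_x s(0)\right)\) . The following is our own finite-difference sketch of this check (with \( \beta = 1\) ):

```python
import numpy as np

def log_p0(x, sigma):
    # log-density (up to a constant) of the mixture ½N(x|-1, σ²) + ½N(x|1, σ²)
    return np.logaddexp(-(x + 1.0)**2 / (2 * sigma**2),
                        -(x - 1.0)**2 / (2 * sigma**2))

def lyapunov_numeric(sigma, beta=1.0, h=1e-4):
    # λ = (β/2)(1 + ∂_x s(0)) with the score derivative at x = 0
    # estimated by a second central difference of log p0.
    d2 = (log_p0(h, sigma) - 2 * log_p0(0.0, sigma) + log_p0(-h, sigma)) / h**2
    return 0.5 * beta * (1.0 + d2)

def lyapunov_exact(sigma, beta=1.0):
    # Closed form of Eq. (28)
    return 0.5 * beta * (1.0 + (1.0 - sigma**2) / sigma**4)

for sigma in [0.5, 0.3, 0.2]:
    print(sigma, lyapunov_numeric(sigma), lyapunov_exact(sigma))
```

As \( \sigma\) shrinks, both the numerical and the exact values grow without bound, illustrating the divergence at the phase boundary \( x=0\) .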

5 Discussion

In this work we develop a method to reconstruct the Fisher metric for any exponential family distribution; such distributions are typical of statistical physics systems. Our method is based on reconstructing the log-partition function from posterior distributions on the parameter space. We provide theoretical grounding and validate the method experimentally by comparing our reconstruction with exact solutions, outperforming the mean-as-sufficient-statistics and PCA-VAE baselines. Importantly, we do not assume prior knowledge of the unnormalized density or Hamiltonian, enabling our method to operate in more general settings than MCMC-based techniques such as importance sampling.

The Bryant–Amari–Armstrong theorem allows us to go beyond exponential family distributions and apply our method to two-dimensional sections of the latent space of arbitrary generative models.

We extend the notion of phase transitions known from statistical physics, where small changes in parameters lead to significant shifts in system output, to describe abrupt changes in generative models. While recent studies [29, 30] have explored phase transitions in diffusion models with respect to the time step, our work addresses a distinct problem: we treat the latent space as a parameter space and generated images as microstates. Our reconstruction of the Fisher metric for a two-dimensional section of a diffusion model’s latent space indicates the presence of distinct phases. Within a phase, geodesic interpolations between images are approximately linear, consistent with the work of Shao et al. [11]. However, our results reveal that this linearity breaks down at phase boundaries, where the geometry becomes highly nonlinear; in particular, diffusion models exhibit fractal boundaries between phases. We argue that near these boundaries the diffusion model has a divergent Lipschitz constant with respect to the latent variable. This complements the results of Yang et al. [27], who show that diffusion models have a divergent Lipschitz constant with respect to the time variable.

Impact Statement

Our work contributes to the fundamental understanding of generative model latent spaces, offering mathematically grounded tools for detecting phase transitions and constructing smoother geodesic interpolations. While generative models in general raise concerns about misuse, our theoretical focus does not directly enable such risks. Nonetheless, we acknowledge that better exploration of latent space could, in principle, facilitate adversarial prompt crafting or model probing. We expect our contributions to support a better understanding of generative models and to stimulate cross-disciplinary dialogue between the machine learning and information geometry communities.

Appendix

A Proof of theoretical results

Theorem 4

Let \( \mathcal{X} \) be a space of data samples \( x \in \mathcal{X} \) , and let \( S \subset \mathbb{R}^n \) be a compact domain with a continuous prior distribution \( p(t)\) supported on \( S\) . Suppose the conditional distribution of data samples given the parameter \( t\) is an exponential family

\[ \begin{equation} p(x|t) = e^{\langle t, f(x) \rangle - \log Z(t)}, \end{equation} \]

(29)

where

\[ \begin{equation} Z(t) = \int_{\mathcal{X}} e^{\langle t, f(x) \rangle} dx \end{equation} \]

(30)

converges for all \( t \in S \) . Let \( x_1, …, x_N \sim p(x|t') \) . Then, as \( N\to\infty\) the posterior distribution satisfies:

\[ \begin{equation} \lim_{N \to \infty} \left(p(t|x_1, \ldots, x_N)\right)^{1/N} \overset{\text{a.s.}}{=} e^{-D_\text{log Z(t)}(t, t')} = e^{-D_\text{KL}(p(x|t') \| p(x|t))}. \end{equation} \]

(31)

where \( D_\text{log Z(t)}(t, t')\) is the Bregman divergence between exponential family distributions

\[ \begin{equation} \begin{split} &D_\text{log Z(t)}(t, t') = \log Z(t) - \log Z(t') - \langle \nabla_{t'} \log Z(t'), t - t' \rangle \end{split} \end{equation} \]

(32)

Proof

We begin by applying Bayes’ theorem to derive the posterior distribution. Then we use properties of exponential families and the law of large numbers to obtain the desired limit.

By Bayes’ theorem:

\[ \begin{equation} p(t|x_1, \ldots, x_N) = \frac{p(x_1, \ldots, x_N|t)p(t)}{\int_S p(x_1, \ldots, x_N|s)p(s)ds}. \end{equation} \]

(33)

Since the data samples \( x_1, …, x_N\) are i.i.d. given \( t\) , we have:

\[ \begin{equation} \begin{split} p(x_1, \ldots, x_N|t) &= \prod_{i=1}^N p(x_i|t) = \prod_{i=1}^N e^{\langle t, f(x_i) \rangle - \log Z(t)} \\ &= e^{\sum_{i=1}^N \langle t, f(x_i) \rangle - N \log Z(t)}. \end{split} \end{equation} \]

(34)

Substituting into the posterior, we find:

\[ \begin{equation} p(t|x_1, \ldots, x_N)^{1/N} = \frac{e^{\langle t, \frac{1}{N} \sum_{i=1}^N f(x_i) \rangle - \log Z(t) + \frac{1}{N} \log p(t)}}{\left(\int_S e^{N\left[\langle s, \frac{1}{N} \sum_{i=1}^N f(x_i) \rangle - \log Z(s) + \frac{1}{N}\log p(s) \right]} ds \right)^{1/N}}. \end{equation} \]

(35)

Now, consider the limit:

\[ \begin{equation} \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^N f(x_i). \end{equation} \]

(36)

By Kolmogorov’s strong law of large numbers, this converges almost surely to the expectation:

\[ \begin{equation} \frac{1}{N} \sum_{i=1}^N f(x_i) \xrightarrow{\text{a.s.}} \int f(x) p(x|t') dx = \nabla_{t'} \log Z(t'), \end{equation} \]

(37)

where the last equality holds because \( p(x|t)\) is an exponential family distribution.

In other words:

\[ \begin{equation} p\left(\{\omega \in \Omega : \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^N f(x_i(\omega)) = \nabla_{t'} \log Z(t')\} \middle| t' \right) = 1. \end{equation} \]

(38)

By the continuous mapping theorem: if \( X_1, X_2, …\) is a sequence of random variables and \( h: \mathbb{R} \to \mathbb{R}\) is a continuous function, then:

\[ \begin{equation} X_n \xrightarrow{\text{a.s.}} X \implies h(X_n) \xrightarrow{\text{a.s.}} h(X). \end{equation} \]

(39)

Let \( \psi(s, x) = \langle s, x \rangle - \log Z(s)\) . Since the function \( \psi\) is continuous in \( x\) , and we have shown that \( \frac{1}{N} \sum_{i=1}^N f(x_i) \xrightarrow{\text{a.s.}} \nabla_{t'} \log Z(t')\) as \( N \to \infty\) , the continuous mapping theorem implies:

\[ \begin{equation} e^{\langle t, \frac{1}{N} \sum_{i=1}^N f(x_i) \rangle - \log Z(t) + \frac{1}{N}\log p(t)} \xrightarrow{\text{a.s.}} e^{\langle t, \nabla_{t'} \log Z(t') \rangle - \log Z(t)}, \end{equation} \]

(40)

where \( p(t)\) is a continuous, positive function on a compact domain, so \( \log p(t)\) is bounded, which implies \( \frac{1}{N} \log p(t) \to 0\) as \( N \to \infty\) .

Now, let us consider the term in the denominator of the posterior:

\[ \begin{equation} \left( \int_S e^{N\left[\langle s, \frac{1}{N} \sum_{i=1}^N f(x_i) \rangle - \log Z(s) + \frac{1}{N} \log p(s)\right]} ds \right)^{1/N}. \end{equation} \]

(41)

We can rewrite the integral part using Lemma 2. This lemma applies because, according to the strong law of large numbers, the average \( \frac{1}{N} \sum_{i=1}^N f(x_i)\) converges almost surely to a finite value. To utilize the lemma, we define the function:

\[ \begin{equation} \phi(s, N, \omega) = \langle s, \frac{1}{N} \sum_{i=1}^N f(x_i(\omega)) \rangle - \log Z(s) = \psi\left(s, \frac{1}{N} \sum_{i=1}^N f(x_i(\omega))\right). \end{equation} \]

(42)

We may ignore the term \( \frac{1}{N} \log p(s)\) , as it converges to 0 as \( N\) goes to infinity and won’t affect the value of the limit. Then, by Lemma 2, the following limit holds almost surely:

\[ \begin{equation} \lim_{N \to \infty} \frac{1}{N} \log \int_S e^{N \phi(s, N, \omega)} ds \overset{\text{a.s.}}{=} \max_{s \in S} \psi(s, \nabla_{t'} \log Z(t')). \end{equation} \]

(43)

Using this result, we have

\[ \begin{equation} p(t|x_1, \ldots, x_N)^{1/N} = \frac{e^{\langle t, \frac{1}{N} \sum_{i=1}^N f(x_i) \rangle - \log Z(t) + \frac{1}{N} \log p(t)}}{\left(\int_S e^{N\left[\langle s, \frac{1}{N} \sum_{i=1}^N f(x_i) \rangle - \log Z(s) + \frac{1}{N} \log p(s)\right]} ds \right)^{1/N}} \xrightarrow{\text{a.s.}} \frac{e^{\langle t, \nabla_{t'} \log Z(t') \rangle - \log Z(t)}}{e^{\max_{s \in S} \psi(s, \nabla_{t'} \log Z(t'))}}. \end{equation} \]

(44)

Now, observe that we have:

\[ \begin{equation} \max_{s \in S} \psi(s, \nabla_{t'} \log Z(t')) = \max_{s \in S} \left[ \langle s, \nabla_{t'} \log Z(t') \rangle - \log Z(s) \right]. \end{equation} \]

(45)

Since the function \( s \mapsto \langle s, \nabla_{t'} \log Z(t') \rangle - \log Z(s)\) is concave (\( \log Z\) is convex), the maximum over \( S\) is achieved at \( s=t'\) , which allows us to obtain

\[ \begin{equation} \max_{s \in S} \psi(s, \nabla_{t'} \log Z(t')) = \langle t', \nabla_{t'} \log Z(t') \rangle - \log Z(t') \end{equation} \]

(46)

Substituting back into the limit, we arrive at the expression for the Bregman divergence:

\[ \begin{equation} \lim_{N \to \infty} p(t|x_1, \ldots, x_N)^{1/N} \overset{\text{a.s.}}{=} \frac{e^{\langle t, \nabla_{t'} \log Z(t') \rangle - \log Z(t)}}{e^{\langle t', \nabla_{t'} \log Z(t') \rangle - \log Z(t')}} = e^{-D_{B}(t,t')}. \end{equation} \]

(47)

To complete the proof, we need to show that the KL divergence \( D_\text{KL}(p(x|t') \| p(x|t))\) in the theorem statement is equivalent to the Bregman divergence \( D_B(t, t')\) in our result. We compute the KL divergence as follows:

\[ \begin{align*} D_\text{KL}(p(x|t') \| p(x|t)) &= \int p(x|t') \log\frac{p(x|t')}{p(x|t)} dx \\ &= \int p(x|t') [\log p(x|t') - \log p(x|t)] dx \\ &= \int p(x|t') [\langle t', f(x) \rangle - \log Z(t') - (\langle t, f(x) \rangle - \log Z(t))] dx \\ &= \int p(x|t') \langle t', f(x) \rangle dx - \log Z(t') - \int p(x|t') \langle t, f(x) \rangle dx + \log Z(t) \\ &= \langle t', \nabla_{t'} \log Z(t') \rangle - \log Z(t') - \langle t, \nabla_{t'} \log Z(t') \rangle + \log Z(t) \\ &= \langle t' - t, \nabla_{t'} \log Z(t') \rangle + \log Z(t) - \log Z(t') \\ &= D_B(t, t') \end{align*} \]

where we use the property that the expectation of \( f(x)\) under the distribution \( p(x|t')\) is equal to the gradient of the log-partition function \( \mathbb{E}_{x\sim p(x|t')}[f(x)] = \nabla_{t'} \log Z(t')\) .

Thus, we have shown that the limit of the posterior is governed by the Bregman divergence between \( t\) and \( t'\) , consistent with the theorem statement, which concludes the proof.
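The almost-sure limit of Theorem 4 can be illustrated numerically. Below is our own minimal sketch (not from the paper) using the one-dimensional family \( p(x|t) = \mathcal{N}(x \mid t, 1)\) with sufficient statistic \( f(x) = x\) , \( \log Z(t) = t^2/2\) (w.r.t. the \( \mathcal{N}(0,1)\) base measure), and a uniform prior on \( S = [-2, 2]\) ; for this family \( D_\text{KL}(p(x|t') \| p(x|t)) = (t - t')^2/2\) :

```python
import numpy as np

rng = np.random.default_rng(0)
t_true = 0.7            # the true parameter t'
N = 200_000             # number of i.i.d. samples

# Samples x_i ~ p(x|t') = N(t', 1)
x = rng.normal(t_true, 1.0, size=N)

# Grid over the compact parameter domain S = [-2, 2] with uniform prior
t = np.linspace(-2.0, 2.0, 401)
log_Z = 0.5 * t**2

# log-posterior up to an additive constant: N * [ t * mean(f(x)) - log Z(t) ]
log_post = N * (t * x.mean() - log_Z)
log_post -= log_post.max()

# (posterior)^{1/N}, rescaled so its maximum is 1
lhs = np.exp(log_post / N)

# e^{-D_KL(p(x|t') || p(x|t))} = e^{-(t - t')^2 / 2} for unit-variance Gaussians
rhs = np.exp(-0.5 * (t - t_true) ** 2)

print(np.max(np.abs(lhs - rhs)))   # small for large N
```

The rescaling by the maximum absorbs the normalizing constants on both sides of Eq. (31), which is exactly the role played by the denominator term \( \max_{s} \psi(s, \nabla_{t'} \log Z(t'))\) in the proof.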

Lemma 2

Let \( X_i: \Omega \to \mathbb{R}^n\) be i.i.d. measurable functions, \( s \in S\) , \( \omega \in \Omega\) , \( N \in \mathbb{N}\) , and assume the first moment exists, \( \mathbb{E}[X_i] = \mu_x\) with \( \|\mu_x\| < \infty\) . Define:

\[ \begin{equation} \phi(s, N, \omega) = \psi\left(s, \frac{1}{N} \sum_{i=1}^N X_i(\omega)\right), \end{equation} \]

(48)

where \( \psi(s,x)\) is a continuous function of both variables and uniformly continuous in \( x\) , uniformly over \( s\) . Then:

\[ \begin{equation} I \overset{\text{def}}{=} \lim_{N \to \infty} \frac{1}{N} \log \left( \int_S e^{N \psi(s, \frac{1}{N} \sum_{i=1}^N X_i)} ds \right) \overset{\text{a.s.}}{=} \max_{s \in S} \psi(s, \mu_x) \overset{\text{def}}{=} M. \end{equation} \]

(49)

Proof (Proof of Lemma)

Let \( \bar{X}_N = \frac{1}{N} \sum_{i=1}^N X_i(\omega)\) . By Kolmogorov’s strong law of large numbers, we have:

\[ \begin{equation} \bar{X}_N(\omega) \xrightarrow{\text{a.s.}} \mu_x, ~ p\left(\{\omega \in \Omega : \lim_{N \to \infty} \bar{X}_N(\omega) = \mu_x\}\right) = 1. \end{equation} \]

(50)

Since \( \psi(s, x)\) is continuous in \( x\) , by the continuous mapping theorem:

\[ \begin{equation} \bar{X}_N(\omega) \xrightarrow{\text{a.s.}} \mu_x \implies \psi(s, \bar{X}_N(\omega)) \xrightarrow{\text{a.s.}} \psi(s, \mu_x). \end{equation} \]

(51)

Thus, there is a set \( \tilde{\Omega} \subset \Omega\) of full measure such that for all \( \omega \in \tilde{\Omega}\) :

\[ \begin{equation} \lim_{N \to \infty} \psi(s, \bar{X}_N(\omega)) = \psi(s, \mu_x). \end{equation} \]

(52)

In other words, for every \( \omega \in \tilde{\Omega}\) , \( s \in S\) , and \( \epsilon > 0\) , there exists \( N_0(\omega, s, \epsilon)\) such that for all \( N > N_0(\omega, s, \epsilon)\) :

\[ \begin{equation} |\psi(s, \bar{X}_N(\omega)) - \psi(s, \mu_x)| < \epsilon. \end{equation} \]

(53)

Define:

\[ \begin{equation} M_N(\omega) = \max_{s \in S} \psi(s, \bar{X}_N(\omega)). \end{equation} \]

(55)

Since \( \psi(s, x)\) is continuous in \( s\) and \( S\) is a compact domain, \( M_N(\omega) < \infty\) . Moreover, since \( \bar{X}_N(\omega) \xrightarrow{\text{a.s.}} \mu_x\) and both \( \psi\) and \( \max\) are continuous functions:

\[ \begin{equation} M_N(\omega) = \max_{s \in S} \psi(s, \bar{X}_N(\omega)) \xrightarrow{\text{a.s.}} \max_{s \in S} \psi(s, \mu_x) = M. \end{equation} \]

(56)

Now consider:

\[ \begin{equation} \int_S e^{N \psi(s, \bar{X}_N(\omega))} ds \leq \int_S e^{N \max_{s \in S} \psi(s, \bar{X}_N(\omega))} ds = e^{N \max_{s \in S} \psi(s, \bar{X}_N(\omega))} \text{Vol}(S). \end{equation} \]

(57)

Taking the logarithm and dividing by \( N\) :

\[ \begin{equation} \frac{1}{N} \log \int_S e^{N \psi(s, \bar{X}_N(\omega))} ds \leq \frac{1}{N} \log \left(e^{N M_N(\omega)} \text{Vol}(S)\right) = M_N(\omega) + \frac{1}{N} \log \text{Vol}(S). \end{equation} \]

(58)

Finally, taking the limit as \( N \to \infty\) and using \( M_N(\omega) \xrightarrow{\text{a.s.}} M\) :

\[ \begin{equation} \lim_{N \to \infty} \frac{1}{N} \log \int_S e^{N \psi(s, \bar{X}_N(\omega))} ds \leq \lim_{N \to \infty} \left(M_N(\omega) + \frac{1}{N} \log \text{Vol}(S)\right) \overset{\text{a.s.}}{=} M. \end{equation} \]

(59)

Thus, the result holds almost surely:

\[ \begin{equation} \lim_{N \to \infty} \frac{1}{N} \log \int_S e^{N \psi(s, \bar{X}_N(\omega))} ds \leq M = \max_{s \in S} \psi(s, \mu_x). \end{equation} \]

(60)

Thus, almost surely, we have \( I \leq M \) . For the matching lower bound, define:

\[ \begin{equation} K_\epsilon = \{s \in S : |\psi(s, \mu_x) - M| < \epsilon/2\}. \end{equation} \]

(61)

Then:

\[ \begin{equation} \int_S e^{N \psi(s, \bar{X}_N(\omega))} ds \geq \int_{K_\epsilon} e^{N \psi(s, \bar{X}_N(\omega))} ds ~ \forall N, \forall \omega. \end{equation} \]

(62)

Now, consider:

\[ \begin{align} |\psi(s, \bar{X}_N(\omega)) - M| &= |\psi(s, \bar{X}_N(\omega)) - \psi(s, \mu_x) + \psi(s, \mu_x) - M| \\ &\leq |\psi(s, \bar{X}_N(\omega)) - \psi(s, \mu_x)| + |\psi(s, \mu_x) - M|. \\\end{align} \]

(63)

First Term: By the strong law of large numbers and uniform continuity of \( \psi\) , for any \( \epsilon > 0\) , there exists \( N_0\) such that for all \( N > N_0\) and \( s \in K_\epsilon\) :

\[ \begin{equation} |\psi(s, \bar{X}_N(\omega)) - \psi(s, \mu_x)| < \epsilon/2. \end{equation} \]

(64)

This follows because \( \bar{X}_N(\omega) \xrightarrow{\text{a.s.}} \mu_x\) , and \( \psi(s, x)\) is continuous in \( x\) . Hence, for sufficiently large \( N\) , \( \psi(s, \bar{X}_N(\omega))\) converges uniformly to \( \psi(s, \mu_x)\) for \( s \in K_\epsilon\) .

Second Term: By the definition of \( K_\epsilon\) , for all \( s \in K_\epsilon\) :

\[ \begin{equation} |\psi(s, \mu_x) - M| < \epsilon/2. \end{equation} \]

(65)

Combining these results, for all \( \epsilon > 0\) , \( s \in K_\epsilon\) , and \( N > N_0\) :

\[ \begin{equation} |\psi(s, \bar{X}_N(\omega)) - M| < \epsilon. \end{equation} \]

(66)

Thus, for all \( \epsilon > 0\) , \( s \in K_\epsilon\) , and almost all \( \omega \in \Omega\) :

\[ \begin{equation} \psi(s, \bar{X}_N(\omega)) > M - \epsilon. \end{equation} \]

(67)

Therefore:

\[ \begin{equation} \int_{K_\epsilon} e^{N \psi(s, \bar{X}_N(\omega))} ds \geq \int_{K_\epsilon} e^{N (M - \epsilon)} ds = e^{N (M - \epsilon)} \text{Vol}(K_\epsilon). \end{equation} \]

(68)

Taking the logarithm and dividing by \( N\) :

\[ \begin{align} \frac{1}{N} \log \int_{K_\epsilon} e^{N \psi(s, \bar{X}_N(\omega))} ds &\geq \frac{1}{N} \log \left(e^{N (M - \epsilon)} \text{Vol}(K_\epsilon)\right) \\ &= M - \epsilon + \frac{1}{N} \log \text{Vol}(K_\epsilon). \\\end{align} \]

(69)

Taking the limit as \( N \to \infty\) :

\[ \begin{equation} \lim_{N \to \infty} \frac{1}{N} \log \int_{K_\epsilon} e^{N \psi(s, \bar{X}_N(\omega))} ds \geq M - \epsilon. \end{equation} \]

(70)

Combining this with the upper bound \( I \leq M\) , we have:

\[ \begin{equation} M - \epsilon \leq I \leq M. \end{equation} \]

(71)

Since this holds for all \( \epsilon > 0\) , it follows that:

\[ \begin{equation} I = M. \end{equation} \]

(72)
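Lemma 2 is a Laplace-type limit, which can be illustrated numerically. Below is our own minimal sketch with \( \psi(s, x) = sx - s^2/2\) (i.e. \( \log Z(s) = s^2/2\) ) and \( X_i \sim \mathcal{N}(\mu_x, 1)\) , for which \( M = \max_s \psi(s, \mu_x) = \mu_x^2/2\) , attained at \( s = \mu_x\) :

```python
import numpy as np

rng = np.random.default_rng(1)
mu = 0.5                              # E[X_i]
s = np.linspace(-2.0, 2.0, 2001)     # compact domain S = [-2, 2]
ds = s[1] - s[0]

# psi(s, x) = <s, x> - log Z(s) with log Z(s) = s^2/2
psi = lambda s, xbar: s * xbar - 0.5 * s**2

for N in [100, 10_000, 1_000_000]:
    xbar = rng.normal(mu, 1.0, size=N).mean()
    a = N * psi(s, xbar)
    # numerically stable evaluation of (1/N) log ∫_S exp(N psi) ds
    I = (a.max() + np.log(np.exp(a - a.max()).sum() * ds)) / N
    print(N, I, 0.5 * mu**2)
```

As \( N\) grows, the integral concentrates around the maximizer of \( \psi(\cdot, \mu_x)\) and \( I\) approaches \( M = \mu_x^2/2 = 0.125\) .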

Lemma 3

We will show that the following function is bounded

\[ \begin{equation} \phi(s, \omega, N) = \left\langle s, \bar{X}_N(\omega) \right\rangle - \log Z(s), \end{equation} \]

(73)

where \( N \in \mathbb{N} \) and \( s \in S \subset \mathbb{R}^n \) .

Proof

First, we use the Cauchy–Schwarz inequality

\[ \begin{equation} \left| \left\langle s, \bar{X}_N(\omega) \right\rangle \right| \leq \|s\|_2 \cdot \left\| \bar{X}_N(\omega) \right\|_2. \end{equation} \]

(74)

Furthermore, since the \( L_{2}\) norm is a continuous function, and there is almost sure convergence \( \bar{X}_N \rightarrow \mu_x\) , then

\[ \begin{equation} \left\| \bar{X}_N(\omega) \right\|_2 \xrightarrow{\text{a.s.}} \|\mu_x\|_2. \end{equation} \]

(75)

Thus, there exists \( N_0 \) such that for all \( N > N_0 \) ,

\[ \begin{equation} \left| \left\| \bar{X}_N(\omega) \right\|_2 - \|\mu_x\|_2 \right| < \epsilon \implies \left\| \bar{X}_N(\omega) \right\|_2 < \|\mu_x\|_2 + \epsilon. \end{equation} \]

(76)

Since \( \log Z \) is continuous on the compact set \( S \) , there exists \( C_Z \) such that \( |\log Z(s)| < C_Z \) for all \( s \in S \) .

On the compact set \( S \) , there exists \( C_s \) such that \( \|s\|_2 < C_s \) for all \( s \in S \) .

Therefore, for all \( s \in S \) , there exists \( N_0 \) such that for all \( N > N_0 \) ,

\[ \begin{equation} \phi(s, \omega, N) < C = C_s \left( \|\mu_x\|_2 + \epsilon \right) + C_Z. \end{equation} \]

(77)

Lemma 4

We will show that the following function is uniformly continuous

\[ \begin{equation} \phi(s, \omega, N) = \left\langle s, \bar{X}_N(\omega) \right\rangle - \log Z(s), \end{equation} \]

(78)

where \( N \in \mathbb{N} \) and \( s \in S \subset \mathbb{R}^n \) .

Proof

Let \( x_1\) and \( x_2\) be vectors in \( \mathbb{R}^n\) .

\[ \begin{align*} |\psi(s, x_1) - \psi(s, x_2)| &= |\langle s, x_1 \rangle - \log Z(s) - (\langle s, x_2 \rangle - \log Z(s))| \\ &= |\langle s, x_1 \rangle - \langle s, x_2 \rangle| \\ &= |\langle s, x_1 - x_2 \rangle| \\ &= |s| \, |x_1 - x_2| \, |\cos(\theta)| \end{align*} \]

where \( \theta\) is the angle between \( s\) and \( x_1 - x_2\) . Since \( S\) is compact, there exists \( M\) such that \( |s| < M\) for all \( s \in S\) . Thus,

\[ |\psi(s, x_1) - \psi(s, x_2)| \leq M |x_1 - x_2| \]

Given \( \varepsilon > 0\) , we choose \( \delta = \varepsilon / M\) . If \( |x_1 - x_2| < \delta\) , then

\[ |\psi(s, x_1) - \psi(s, x_2)| < \varepsilon \]

Therefore, \( \psi(s,x)\) is uniformly continuous in \( x\) .

Theorem 5

Suppose that the following integral converges to zero

\[ \begin{equation} \int_{\Theta} \int_{\Theta} \left| e^{-D_{\log Z_{1}(t)}(t',t)} - e^{-D_{\log Z_{2}(t)}(t',t)} \right|^{2} dt \, dt' \to 0, \end{equation} \]

(79)

where

\[ D_{\varphi(t)}(t',t) = \varphi(t') - \varphi(t) - \langle \nabla_{t'} \varphi(t'), t' - t \rangle \]

is the Bregman divergence. Then \( \nabla^{2}\log Z_{1}(t)\) converges to \( \nabla^{2}\log Z_{2}(t)\) in the \( L^{2}\) sense:

\[ \begin{equation} || \nabla^{2}\log Z_{1}(t) - \nabla^{2}\log Z_{2}(t) || \to 0, \end{equation} \]

(80)

where \( ||\cdot||\) denotes the \( L^2\) norm.

Proof

Since the integrand is non-negative and its integral tends to zero, the integrand must converge to zero almost everywhere. As the functions involved are continuous, in the limit we have:

\[ \begin{equation} e^{-D_{\log Z_{1}(t)}(t',t)} = e^{-D_{\log Z_{2}(t)}(t',t)} \end{equation} \]

(81)

for all \( t, t' \in \Theta\) . Taking the logarithm of both sides, we get:

\[ \begin{equation} -D_{\log Z_{1}(t)}(t',t) = -D_{\log Z_{2}(t)}(t',t) \end{equation} \]

(82)

which implies

\[ \begin{equation} D_{\log Z_{1}(t)}(t',t) = D_{\log Z_{2}(t)}(t',t). \end{equation} \]

(83)

Substituting the definition of the Bregman divergence, we have:

\[ \begin{align*} &\log Z_{1}(t') - \log Z_{1}(t) - \langle \nabla \log Z_{1}(t), t' - t \rangle \\ &= \log Z_{2}(t') - \log Z_{2}(t) - \langle \nabla \log Z_{2}(t), t' - t \rangle. \end{align*} \]

Rearranging terms gives:

\[ \begin{equation} (\log Z_{1}(t') - \log Z_{2}(t')) - (\log Z_{1}(t) - \log Z_{2}(t)) - \langle \nabla \log Z_{1}(t) - \nabla \log Z_{2}(t), t' - t \rangle = 0. \end{equation} \]

(84)

Let \( \Delta(t) = \log Z_{1}(t) - \log Z_{2}(t)\) . Then the equation becomes:

\[ \begin{equation} \Delta(t') - \Delta(t) - \langle \nabla \Delta(t), t' - t \rangle = 0. \end{equation} \]

(85)

This implies that the first-order Taylor expansion of \( \Delta(t')\) around \( t\) is exact, which holds if and only if \( \Delta(t)\) is an affine function:

\[ \begin{equation} \Delta(t) = \langle c, t \rangle + b, \end{equation} \]

(86)

for some constant vector \( c\) and scalar \( b\) . This means that the two log-partition functions are related by an affine transformation

\[ \begin{equation} \log Z_{1}(t) = \log Z_{2}(t) + \langle c, t \rangle + b. \end{equation} \]

(87)

Since the affine term vanishes under the second derivative, the \( L^{2}\) norm equals zero:

\[ \begin{equation} || \nabla^{2}\log Z_{1}(t) - \nabla^{2}\log Z_{2}(t) || = 0, \end{equation} \]

(88)
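The affine ambiguity at the heart of Theorem 5 — adding an affine function to the potential leaves the Bregman divergence unchanged, so only the Hessian of \( \log Z\) is identifiable — can be checked directly. A minimal sketch (our own, using log-sum-exp as an example of a convex potential):

```python
import numpy as np

rng = np.random.default_rng(2)

def bregman(phi, grad_phi, t, tp):
    # D_phi(t, t') = phi(t) - phi(t') - <grad phi(t'), t - t'>
    return phi(t) - phi(tp) - grad_phi(tp) @ (t - tp)

# phi2 = phi1 + affine term; their Bregman divergences coincide
c, b = np.array([0.3, -1.2]), 0.7
phi1 = lambda t: np.log(np.sum(np.exp(t)))        # a convex "log Z"
g1   = lambda t: np.exp(t) / np.sum(np.exp(t))    # its gradient (softmax)
phi2 = lambda t: phi1(t) + c @ t + b
g2   = lambda t: g1(t) + c

t, tp = rng.normal(size=2), rng.normal(size=2)
print(bregman(phi1, g1, t, tp), bregman(phi2, g2, t, tp))  # coincide
```

This is why the theorem can only conclude equality of the Hessians \( \nabla^2 \log Z\) (the Fisher metric), not of the log-partition functions themselves.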

Lemma 5 (Vanishing Gradient in Bregman MSE Loss)

Consider the loss functional \( L = \int_{S} \int_{S} \left| e^{-D_{\log Z_{1}(t)}(t,t')} - e^{-D_{\log Z_{2}(t)}(t,t')} \right|^{2} dt \, dt'\) . Assume there exists a region \( R \subset S \times S\) where either \( D_{\log Z_{1}(t)}(t,t') \geq C\) or \( D_{\log Z_{2}(t)}(t,t') \geq C\) uniformly for some \( C \gg 1\) . Then the gradient of \( L\) with respect to parameters governing \( \log Z_{1}(t)\) satisfies:

\[ \begin{equation} \|\nabla L\| \leq K e^{-C}, \end{equation} \]

(89)

where \( K > 0\) is a constant depending on the measure of \( R\) and on the Lipschitz constant of \( \log Z_{1}(t)\) , which we assume to be bounded. Thus, gradients vanish exponentially as \( C\) increases.

Proof

The gradient of the loss functional \( L\) with respect to \( \log Z_{1}(t)\) can be formally written (using functional derivatives) as proportional to:

\[ \begin{equation} \nabla L \propto \int_{S} \int_{S} \frac{\delta}{\delta \log Z_{1}(t)} \left| e^{-D_{\log Z_{1}(t)}(t,t')} - e^{-D_{\log Z_{2}(t)}(t,t')} \right|^{2} dt \, dt'. \end{equation} \]

(90)

Expanding the square and taking the derivative with respect to \( \log Z_{1}(t)\) , we focus on the terms involving \( D_{\log Z_{1}(t)}(t,t')\) . The gradient is then proportional to:

\[ \begin{equation} \nabla L \propto \int_{S} \int_{S} \left( e^{-D_{\log Z_{1}(t)}(t,t')} - e^{-D_{\log Z_{2}(t)}(t,t')} \right) \frac{\delta}{\delta \log Z_{1}(t)} e^{-D_{\log Z_{1}(t)}(t,t')} dt \, dt'. \end{equation} \]

(91)

The functional derivative of \( e^{-D_{\log Z_{1}(t)}(t,t')}\) with respect to \( \log Z_{1}(t)\) involves the derivative of the Bregman divergence \( D_{\log Z_{1}(t)}(t,t')\) with respect to \( \log Z_{1}(t)\) . Let us denote \( \mathcal{D}_{1}(t,t') = D_{\log Z_{1}(t)}(t,t')\) and \( \mathcal{D}_{2}(t,t') = D_{\log Z_{2}(t)}(t,t')\) . Then the gradient can be expressed as:

\[ \begin{equation} \nabla L \propto \int_{S} \int_{S} \left( e^{-\mathcal{D}_{1}(t,t')} - e^{-\mathcal{D}_{2}(t,t')} \right) e^{-\mathcal{D}_{1}(t,t')} (-\nabla \mathcal{D}_{1}(t,t')) dt \, dt', \end{equation} \]

(92)

where \( \nabla \mathcal{D}_{1}(t,t')\) represents the gradient of \( D_{\log Z_{1}(t)}(t,t')\) with respect to parameters governing \( \log Z_{1}(t)\) . We consider the magnitude of the integrand within the region \( R \subset S \times S\) . In region \( R\) , either \( D_{\log Z_{1}(t)}(t,t') \geq C\) or \( D_{\log Z_{2}(t)}(t,t') \geq C\) .

Case 1: \( D_{\log Z_{1}(t)}(t,t') \geq C\) . In this case, \( e^{-D_{\log Z_{1}(t)}(t,t')} \leq e^{-C}\) . The magnitude of the integrand is bounded by:

\[ \begin{align*} & \left| \left( e^{-D_{\log Z_{1}(t)}(t,t')} - e^{-D_{\log Z_{2}(t)}(t,t')} \right) e^{-D_{\log Z_{1}(t)}(t,t')} (-\nabla \mathcal{D}_{1}(t,t')) \right| \\ & \leq \left( |e^{-D_{\log Z_{1}(t)}(t,t')}| + |e^{-D_{\log Z_{2}(t)}(t,t')}| \right) |e^{-D_{\log Z_{1}(t)}(t,t')}| \|\nabla \mathcal{D}_{1}(t,t')\| \\ & \leq \left( e^{-C} + e^{-D_{\log Z_{2}(t)}(t,t')} \right) e^{-C} \|\nabla \mathcal{D}_{1}(t,t')\| \\ & \leq (1 + e^{-D_{\log Z_{2}(t)}(t,t')}) e^{-C} \|\nabla \mathcal{D}_{1}(t,t')\|. \end{align*} \]

Assuming that \( \|\nabla \mathcal{D}_{1}(t,t')\|\) is bounded by some constant \( M'\) due to the finite Lipschitz constant of \( \log Z_{1}(t)\) , and since \( e^{-D_{\log Z_{2}(t)}(t,t')} \leq 1\) (for non-negative Bregman divergences), the integrand magnitude is bounded by \( 2 M' e^{-C}\) .

Case 2: \( D_{\log Z_{2}(t)}(t,t') \geq C\) . In this case, \( e^{-D_{\log Z_{2}(t)}(t,t')} \leq e^{-C}\) . The magnitude of the integrand is bounded by:

\[ \begin{align*} & \left| \left( e^{-D_{\log Z_{1}(t)}(t,t')} - e^{-D_{\log Z_{2}(t)}(t,t')} \right) e^{-D_{\log Z_{1}(t)}(t,t')} (-\nabla \mathcal{D}_{1}(t,t')) \right| \\ & \leq \left( |e^{-D_{\log Z_{1}(t)}(t,t')}| + |e^{-D_{\log Z_{2}(t)}(t,t')}| \right) |e^{-D_{\log Z_{1}(t)}(t,t')}| \|\nabla \mathcal{D}_{1}(t,t')\| \\ & \leq \left( e^{-D_{\log Z_{1}(t)}(t,t')} + e^{-C} \right) e^{-D_{\log Z_{1}(t)}(t,t')} \|\nabla \mathcal{D}_{1}(t,t')\| \\ & = \left( e^{-2D_{\log Z_{1}(t)}(t,t')} + e^{-C} e^{-D_{\log Z_{1}(t)}(t,t')} \right) \|\nabla \mathcal{D}_{1}(t,t')\| \\ & \leq (1 + 1) e^{-C} M' = 2 M' e^{-C}, \end{align*} \]

where the final step uses \( e^{-D_{\log Z_{1}(t)}(t,t')} \leq 1\) (valid since \( D_{\log Z_{1}(t)}(t,t') \geq 0\) ) together with the additional assumption that \( D_{\log Z_{1}(t)}(t,t') \geq C/2\) on this part of \( R\) , so that \( e^{-2D_{\log Z_{1}(t)}(t,t')} \leq e^{-C}\) .

In both cases, within the region \( R\) , the integrand’s magnitude is bounded by \( 2 M' e^{-C}\) . Let \( m(R)\) be the measure of region \( R\) . Then the norm of the gradient can be bounded by integrating over \( R\) :

\[ \begin{equation} \|\nabla L\| \leq \int_{R} 2 M' e^{-C} dt \, dt' = 2 M' e^{-C} m(R) = K e^{-C}, \end{equation} \]

(93)

where \( K = 2 M' m(R)\) is a constant dependent on the measure of \( R\) and the bound \( M'\) which is related to the Lipschitz properties of \( \log Z_{1}(t)\) . This shows that the gradient vanishes exponentially as \( C\) increases.
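The exponential decay in Lemma 5 can be seen in a scalar toy version of the loss (our own sketch; \( C\) plays the role of the large divergence, and in this toy both factors of the integrand are exponentially small, so the decay is even faster than the \( e^{-C}\) bound):

```python
import numpy as np

# Scalar toy of the Bregman MSE loss: f(theta) = (e^{-(C+theta)} - e^{-C})^2,
# where C + theta and C play the roles of the two large Bregman divergences.
def grad(C, theta=0.1, h=1e-6):
    f = lambda th: (np.exp(-(C + th)) - np.exp(-C)) ** 2
    # central finite-difference estimate of df/dtheta
    return (f(theta + h) - f(theta - h)) / (2 * h)

for C in [1.0, 5.0, 10.0]:
    print(C, abs(grad(C)))  # magnitude decays roughly like exp(-2C)
```

This vanishing-gradient effect is the practical motivation for working with log-domain quantities when fitting the log-partition function.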

Lemma 6

Let \( Z: S \to (0,\infty) \) with \( S \subset \mathbb{R}^n \) compact domain and \( \log Z \) convex. For any fixed \( t \in S \) :

\[ \begin{equation} \frac{\exp\left(-D_{\log Z}(t', t)\right)}{\int_{S} \exp\left(-D_{\log Z}(s, t)\right) \, ds} = \frac{\exp\left(\langle t', \nabla \log Z(t) \rangle - \log Z(t')\right)}{\int_{S} \exp\left(\langle s, \nabla \log Z(t) \rangle - \log Z(s)\right) \, ds} \end{equation} \]

(94)

Proof

Expand the Bregman divergence \( D_{\log Z}(t',t) = \log Z(t') - \log Z(t) - \langle \nabla \log Z(t), t' - t \rangle \). The numerator can be rewritten as:

\[ \begin{align} \exp(-D_{\log Z}(t',t)) &= \exp\left(-\left(\log Z(t') - \log Z(t) - \langle \nabla \log Z(t), t' - t \rangle\right)\right) \nonumber \\ &= \exp\left(-\log Z(t') + \log Z(t) + \langle \nabla \log Z(t), t' \rangle - \langle \nabla \log Z(t), t \rangle\right) \nonumber \\ &= \exp(\log Z(t)) \exp(-\langle \nabla \log Z(t), t \rangle) \exp(-\log Z(t') + \langle \nabla \log Z(t), t' \rangle) \end{align} \]

(95)

The denominator, integrating over \( s \in S \), becomes:

\[ \begin{align} \int_{S} \exp(-D_{\log Z}(s,t)) \, ds &= \int_{S} \exp\left(-\left(\log Z(s) - \log Z(t) - \langle \nabla \log Z(t), s - t \rangle\right)\right) \, ds \nonumber \\ &= \int_{S} \exp\left(-\log Z(s) + \log Z(t) + \langle \nabla \log Z(t), s \rangle - \langle \nabla \log Z(t), t \rangle\right) \, ds \nonumber \\ &= \exp(\log Z(t)) \exp(-\langle \nabla \log Z(t), t \rangle) \int_{S} \exp(-\log Z(s) + \langle \nabla \log Z(t), s \rangle) \, ds \nonumber \\ &= Z(t) \exp(-\langle \nabla \log Z(t), t \rangle) \int_{S} \frac{\exp(\langle \nabla \log Z(t), s \rangle)}{Z(s)} \, ds \end{align} \]

(96)

Forming the ratio of the numerator (95) and the denominator (96):

\[ \begin{equation} \frac{\exp(-D_{\log Z}(t',t))}{\int_{S} \exp(-D_{\log Z}(s,t)) \, ds} = \frac{Z(t) \exp(-\langle \nabla \log Z(t), t \rangle) \frac{\exp(\langle \nabla \log Z(t), t' \rangle)}{Z(t')}}{Z(t) \exp(-\langle \nabla \log Z(t), t \rangle) \int_{S} \frac{\exp(\langle \nabla \log Z(t), s \rangle)}{Z(s)} \, ds} \end{equation} \]

(97)

Simplifying the ratio:

\[ \begin{equation} \frac{\exp(-D_{\log Z}(t',t))}{\int_{S} \exp(-D_{\log Z}(s,t)) \, ds} = \frac{\frac{\exp(\langle \nabla \log Z(t), t' \rangle)}{Z(t')}}{\int_{S} \frac{\exp(\langle \nabla \log Z(t), s \rangle)}{Z(s)} \, ds} \end{equation} \]

(98)

Rewrite the integrand in exponential form:

\[ \begin{equation} \frac{\exp(\langle \nabla \log Z(t), x \rangle)}{Z(x)} = \exp\left(\langle x, \nabla \log Z(t) \rangle - \log Z(x)\right) = \exp\left(-\left(\log Z(x) - \langle x, \nabla \log Z(t) \rangle\right)\right). \end{equation} \]

(99)

Substituting this back into equation (98):

\[ \begin{equation} \frac{\exp(-D_{\log Z}(t',t))}{\int_{S} \exp(-D_{\log Z}(s,t)) \, ds} = \frac{\exp(\langle t', \nabla \log Z(t) \rangle - \log Z(t'))}{\int_{S} \exp(\langle s, \nabla \log Z(t) \rangle - \log Z(s)) \, ds} \end{equation} \]

(102)

This is exactly the right-hand side of (94), completing the proof.
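The identity can be sanity-checked numerically. The sketch below uses a hypothetical one-dimensional choice \( \log Z(t) = t^2 \) on \( S = [-2, 2] \) (any smooth convex \( \log Z \) would do) and evaluates both sides of the ratio with a simple Riemann sum:

```python
import numpy as np

# Hypothetical convex choice: log Z(t) = t^2, so grad log Z(t) = 2t and
# D_{log Z}(t', t) = (t' - t)^2.
log_Z = lambda t: t**2
grad_log_Z = lambda t: 2.0 * t

def bregman(tp, t):
    # D_{log Z}(t', t) = log Z(t') - log Z(t) - <grad log Z(t), t' - t>
    return log_Z(tp) - log_Z(t) - grad_log_Z(t) * (tp - t)

s = np.linspace(-2.0, 2.0, 40001)   # dense grid over S = [-2, 2]
ds = s[1] - s[0]
t, tp = 0.3, -0.7

# Left-hand side: normalized Bregman-divergence kernel
lhs = np.exp(-bregman(tp, t)) / (np.exp(-bregman(s, t)).sum() * ds)
# Right-hand side: exponential-family form with natural parameter grad log Z(t)
rhs = np.exp(tp * grad_log_Z(t) - log_Z(tp)) / (
    np.exp(s * grad_log_Z(t) - log_Z(s)).sum() * ds)
```

The two integrands differ only by the common factor \( Z(t)\exp(-\langle \nabla \log Z(t), t \rangle) \), which cancels in the ratio, so the two sides agree to numerical precision.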

Proposition 2

Suppose that the (target) data distribution is a bimodal mixture of two Gaussians, each with variance \( \sigma^2\) :

\[ \begin{equation} p_0(x) = \frac{1}{2}\mathcal{N}(x\mid -1,\sigma^2) + \frac{1}{2}\mathcal{N}(x\mid 1,\sigma^2). \end{equation} \]

(103)

The latent (source) distribution is the standard normal \( \mathcal{N}(x\mid 0,1)\) . Consider the variance-preserving SDE

\[ \begin{equation} dX_t = -\frac{1}{2}\beta X_t\, dt + \sqrt{\beta}\, dW_t. \end{equation} \]

(104)

Then the Lyapunov exponent of the corresponding reverse-time ODE at \( x=0\) has the following form:

\[ \begin{equation} \lambda = \frac{\beta}{2} \left(1 + \frac{1 - \sigma^2}{\sigma^4} \right), \end{equation} \]

(105)

and it diverges to infinity as \( \sigma \to 0\) . In this case, the point \( x=0\) can be interpreted as a phase transition boundary.

Proof

We begin with the reverse probability flow ODE:

\[ \begin{equation} \frac{dX_s}{ds} = -f(X_s,t) + \frac{g(t)^2}{2} \nabla_x \log p_t(X_s), \end{equation} \]

(106)

where the drift term is:

\[ \begin{equation} v(x) = -f(x,t) + \frac{g(t)^2}{2} \nabla_x \log p_t(x). \end{equation} \]

(107)

Linearizing around \( x = 0\) , the Lyapunov exponent is defined as:

\[ \begin{equation} \lambda = v'(0) = -f'(0,t) + \frac{g(t)^2}{2} \left.\frac{d^2}{dx^2} \log p_t(x)\right|_{x=0}. \end{equation} \]

(108)

The density \( p_t(x)\) is computed via convolution with the Gaussian noising kernel:

\[ \begin{equation} p_t(x) = \int_{-\infty}^{\infty} p_0(y)\, \mathcal{N}\left(x \,\middle|\, e^{-\frac{1}{2}\beta t}y,\; 1 - e^{-\beta t} \right)dy. \end{equation} \]

(109)

Since the convolution of a Gaussian with a Gaussian is again Gaussian, we obtain:

\[ \begin{equation} p_t(x) = \frac{1}{2\sqrt{2\pi}\sigma_1(t)} A(x), \end{equation} \]

(110)

where

\[ \begin{equation} A(x) = \exp\left(-\frac{(x - \mu(t))^2}{2\sigma_1^2(t)}\right) + \exp\left(-\frac{(x + \mu(t))^2}{2\sigma_1^2(t)}\right), \end{equation} \]

(111)

and

\[ \begin{equation} \sigma_1^2(t) = e^{-\beta t}\sigma^2 + (1 - e^{-\beta t}), ~ \mu(t) = e^{-\frac{1}{2} \beta t}. \end{equation} \]

(112)

Computing the second derivative in the definition of the Lyapunov exponent (at \( x=0\) one finds \( \frac{d^2}{dx^2} \log p_t(x)\big|_{x=0} = \frac{\mu^2(t) - \sigma_1^2(t)}{\sigma_1^4(t)}\) with \( \mu^2(t) = e^{-\beta t}\) ), we get:

\[ \begin{equation} \lambda = \frac{\beta}{2} + \frac{\beta}{2} \cdot \frac{e^{-\beta t} - \sigma_1^2(t)}{\sigma_1^4(t)} = \frac{\beta}{2} \left(1 + \frac{e^{-\beta t} - \sigma_1^2(t)}{\sigma_1^4(t)}\right). \end{equation} \]

(113)

As \( t \to 0\) , we have \( \sigma_1 \to \sigma\) and \( \mu \to 1\) , and thus:

\[ \begin{equation} \lambda = \frac{\beta}{2} \left(1 + \frac{1 - \sigma^2}{\sigma^4} \right), \end{equation} \]

(114)

suggesting divergence of nearby reverse ODE trajectories for small \( \sigma\) , and identifying \( x = 0\) as a potential phase transition boundary.
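The closed-form exponent can be checked against a finite-difference second derivative of \( \log p_t \) at \( x = 0 \). The parameter values \( \beta = 1 \), \( \sigma = 0.5 \), \( t = 10^{-3} \) below are hypothetical:

```python
import numpy as np

# Hypothetical parameters for the bimodal Gaussian mixture example.
beta, sigma, t = 1.0, 0.5, 1e-3
s1sq = np.exp(-beta * t) * sigma**2 + (1.0 - np.exp(-beta * t))  # sigma_1^2(t)
mu = np.exp(-0.5 * beta * t)                                     # mu(t)

def log_p(x):
    # log of the bimodal density p_t, up to an x-independent constant
    return np.logaddexp(-(x - mu)**2 / (2 * s1sq), -(x + mu)**2 / (2 * s1sq))

h = 1e-4
d2 = (log_p(h) - 2 * log_p(0.0) + log_p(-h)) / h**2  # d^2/dx^2 log p_t at x = 0
lam_fd = beta / 2 + (beta / 2) * d2                  # lambda from finite differences
lam_closed = (beta / 2) * (1 + (np.exp(-beta * t) - s1sq) / s1sq**2)
```

For small \( \sigma \) both expressions grow like \( \beta/(2\sigma^4) \), which is the divergence identified at the phase boundary.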

B Experimental Details

B.1 Numerics

Mean-as-Stat. The main idea of the posterior-mean-as-statistics baseline is to predict the parameters \( t\) from the microstate \( s\) by minimizing the regression error:

\[ \begin{equation} L(\phi) = \mathbb{E}_{s\sim \mathbb{P}(s|t), t \sim \mathbb{P}(t) }\left\| f_{\phi}(s)-t \right\|^{2}_{2}. \end{equation} \]

(115)

A function \( f_{\phi}\) learned in this way will predict the mean of the posterior distribution \( \mathbb{P}(t|s)\) .
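The fact that the \( L_2 \)-regression minimizer is the posterior mean can be illustrated on a toy one-dimensional model — a hypothetical Gaussian observation channel standing in for the microstates \( s \) — by comparing a flexible least-squares fit against the exact posterior mean under a flat prior:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (hypothetical): t ~ Uniform(0, 1), s | t ~ N(t, noise^2)
N, noise = 200_000, 0.25
t = rng.uniform(0.0, 1.0, N)
s = t + noise * rng.normal(size=N)

# "f_phi": a cubic polynomial fitted by plain least squares on (s, t) pairs
f_phi = np.poly1d(np.polyfit(s, t, deg=3))

# Exact posterior mean E[t|s] on a grid (flat prior, Gaussian likelihood)
grid = np.linspace(0.0, 1.0, 2001)
def posterior_mean(s0):
    w = np.exp(-(s0 - grid)**2 / (2 * noise**2))
    return (w * grid).sum() / w.sum()
```

At interior points the fitted regressor tracks \( \mathbb{E}[t|s] \) closely, which is exactly the property the posterior-mean-as-statistics baseline relies on.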

To evaluate the quality of the free energy reconstruction, we compute the RMSE between the ground truth free energy \( F_{gt}(t)\) and the reconstructed free energy \( F_{rec}(t)\) :

\[ \begin{equation} \text{RMSE}(F_{rec}, F_{gt}) = \min_{A} \left\| F_{gt}(t) - A(F_{rec}(t)) \right\|_{2}^{2}, \end{equation} \]

(116)

where we minimize over all affine transformations \( A \), accounting for the fact that the free energy is only defined up to such a transformation.
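Since the optimal affine map \( A(F) = aF + b \) under a squared loss is an ordinary least-squares problem, the minimization above reduces to a single `lstsq` call. A minimal sketch:

```python
import numpy as np

def affine_rmse(F_rec, F_gt):
    """RMSE between F_gt and the best affine transform of F_rec.

    Inputs are flattened; the optimal A(F) = a*F + b is found in closed
    form by ordinary least squares."""
    F_rec, F_gt = np.ravel(F_rec), np.ravel(F_gt)
    X = np.stack([F_rec, np.ones_like(F_rec)], axis=1)
    coef, *_ = np.linalg.lstsq(X, F_gt, rcond=None)
    return np.sqrt(np.mean((F_gt - X @ coef) ** 2))
```

When `F_rec` is an exact affine image of `F_gt` the metric vanishes (up to rounding), and it is strictly positive otherwise.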

Ising Model. Since the Ising model lacks an exact free energy solution for \( H \neq 0\) , we construct the ground truth \( F_{\text{ising}}(T, H)\) by numerically integrating the known magnetization \( M(T, H)\) and energy \( E(T, H)\) . The ground truth free energy \( F_{\text{ising}}(T, H)\) must satisfy the following partial derivative relations with respect to temperature and magnetic field:

\[ \begin{equation} \frac{\partial F_{\text{ising}}(T, H)}{\partial T} = E(T, H), ~ \frac{\partial F_{\text{ising}}(T, H)}{\partial H} = M(T, H). \end{equation} \]

(117)

To approximate \( F_{\text{ising}}(T, H)\) , we train a feedforward neural network \( F_{\theta}(T, H)\) on the domain \( H > 0\) , where the free energy is \( C^2\) -smooth. The approximation is extended to \( H < 0\) using the symmetry \( F_{\text{ising}}(T, -H) = F_{\text{ising}}(T, H)\) . The loss function enforces consistency with the partial derivatives:

\[ \begin{equation} \mathcal{L}(\theta) = \int_{\Omega} \left( \left\| \frac{\partial F_{\theta}}{\partial T} - E(T, H) \right\|_{2}^{2} + \left\| \frac{\partial F_{\theta}}{\partial H} - M(T, H) \right\|_{2}^{2} \right) dT dH. \end{equation} \]

(118)
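The derivative-matching idea can be sanity-checked without a neural network: on a grid, \( F \) is recovered from its two partials by path integration with cumulative trapezoid sums. The test function \( F(T,H) = TH + T^2 \) below is a hypothetical smooth stand-in for the Ising free energy:

```python
import numpy as np

T = np.linspace(1.0, 5.0, 200)
H = np.linspace(-2.0, 2.0, 200)
TT, HH = np.meshgrid(T, H, indexing="ij")
F_true = TT * HH + TT**2        # hypothetical smooth "free energy"
E = HH + 2 * TT                 # dF/dT
M = TT                          # dF/dH
dT, dH = T[1] - T[0], H[1] - H[0]

# Integrate dF/dT along the first column (H = H_min) ...
F_rec = np.zeros_like(F_true)
F_rec[1:, 0] = np.cumsum(0.5 * (E[1:, 0] + E[:-1, 0])) * dT
# ... then integrate dF/dH row by row
F_rec[:, 1:] = F_rec[:, [0]] + np.cumsum(0.5 * (M[:, 1:] + M[:, :-1]), axis=1) * dH

# F is only defined up to a constant, so compare after shifting
err = (F_rec - F_true) - (F_rec - F_true)[0, 0]
```

The neural-network loss (118) plays the same role when \( E \) and \( M \) are only known through noisy Monte Carlo estimates and a smooth global fit is needed.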

For the posterior-mean-as-sufficient-statistics method, we adopt a similar approach to integrate the predicted sufficient statistics \( s_{T}(T, H)\) and \( s_{H}(T,H)\) by minimizing the loss:

\[ \begin{equation} \mathcal{L}(\theta) = \int_{\Omega}\left(\left\| \frac{\partial F_{\theta}}{\partial T}-s_{T}(T, H)\right\|^{2}_{2} + \left\| \frac{\partial F_{\theta}}{\partial H}-s_{H}(T, H)\right\|^{2}_{2} \right) dT dH. \end{equation} \]

(119)

After obtaining the ground truth free energy, along with the free energy estimated by our Bayesian thermodynamic integration and the posterior-mean-as-statistics method, we minimize the RMSE with respect to the ground truth over affine transformations to account for the fact that free energy is only defined up to an affine transformation. Finally, we evaluate and compare the resulting errors with the baseline.

TASEP Model. For TASEP, which has an analytical free energy solution, we use a similar neural network approach. The ground truth free energy is given by:

\[ \begin{equation} F_{\text{TASEP}}(\alpha, \beta) = \left\{ \begin{array}{ll} \frac{1}{4}, & \alpha > \frac{1}{2}, \, \beta > \frac{1}{2}; \\ \alpha(1-\alpha), & \alpha < \beta, \, \alpha < \frac{1}{2}; \\ \beta(1-\beta), & \beta < \alpha, \, \beta < \frac{1}{2}. \end{array}\right. \end{equation} \]

(120)
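Equation (120) translates directly into code; the snippet below also makes explicit that \( F \) is continuous across the first-order line \( \alpha = \beta < 1/2 \) and the second-order lines \( \alpha = 1/2 \), \( \beta = 1/2 \):

```python
def F_tasep(alpha, beta):
    """Ground-truth TASEP free energy of eq. (120)."""
    if alpha > 0.5 and beta > 0.5:
        return 0.25                     # maximal-current phase
    if alpha < beta:                    # low-density phase (alpha < 1/2)
        return alpha * (1.0 - alpha)
    return beta * (1.0 - beta)          # high-density phase (beta <= alpha, beta < 1/2)
```

Both expressions reduce to \( 1/4 \) on the second-order boundaries, while the first derivatives jump across \( \alpha = \beta \).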

The network \( F_{\theta}(\alpha, \beta)\) is trained to minimize:

\[ \begin{equation} \mathcal{L}(\theta) = \int_{\Omega} \left( \left\| \frac{\partial F_{\theta}}{\partial \alpha} - J_{\alpha}(\alpha, \beta) \right\|_{2}^{2} + \left\| \frac{\partial F_{\theta}}{\partial \beta} - J_{\beta}(\alpha, \beta) \right\|_{2}^{2} \right) d\alpha d\beta, \end{equation} \]

(121)

where \( J_{\alpha}\) and \( J_{\beta}\) are the particle currents at the boundaries, analogous to energy/magnetization in the Ising model. Finally, the resulting free energy surface is evaluated using the RMSE between the true TASEP free energy and the reconstructed free energy, after accounting for possible affine transformations:

\[ \begin{equation} \text{RMSE}(F_{rec}, F_{gt}) = \min_{A} \left\| F_{gt}(\alpha,\beta) - A(F_{rec}(\alpha,\beta)) \right\|_{2}^{2}. \end{equation} \]

(122)

B.2 Datasets

Ising Model. Our dataset consists of \( N=5.4\times 10^5\) samples of spin configurations on a square lattice of size \( L\times L =128 \times 128\) with periodic boundary conditions. We consider the parameter ranges \( \beta^{-1} = T \in [T_{\min}, T_{\max}]=[1, 5], H \in [H_{\min}, H_{\max}]=[-2, 2]\) , similar to the ranges used in [10]. Each point \( (T, H)\) is sampled uniformly from this rectangle, and a sample spin configuration is then created for these values of temperature and external field by starting from a random initial condition and equilibrating it with Glauber (one-spin Metropolis) dynamics (see, e.g., [18]) for \( 10^{4}\times 128 \times 128 \approx 1.64 \times 10^{8}\) iterations. We represent a spin configuration as a single-channel image with each pixel taking the value \( +1\) or \( -1\) . When constructing target probability distributions we choose \( \sigma=\frac{1}{50}\) and set the discretization \( \mathcal{D}\) of the square \( [T_{\min}, T_{\max}]\times[H_{\min}, H_{\max}] = [1, 5]\times[-2, 2]\) to be a uniform grid with \( L\times L =128 \times 128\) grid cells.
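A minimal version of the equilibration step can be sketched as follows (a toy \( 16 \times 16 \) lattice instead of the production \( 128 \times 128 \); the temperature and field values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy single-spin-flip Metropolis dynamics for the Ising model with an
# external field; L_side, T, H are hypothetical illustration values.
L_side, T, H = 16, 2.0, 0.5
spins = rng.choice([-1, 1], size=(L_side, L_side))

def sweep(spins):
    for _ in range(spins.size):
        i, j = rng.integers(L_side, size=2)
        # sum of the four neighbors with periodic boundary conditions
        nn = (spins[(i + 1) % L_side, j] + spins[(i - 1) % L_side, j]
              + spins[i, (j + 1) % L_side] + spins[i, (j - 1) % L_side])
        dE = 2.0 * spins[i, j] * (nn + H)    # energy cost of flipping (i, j)
        if dE <= 0 or rng.random() < np.exp(-dE / T):
            spins[i, j] *= -1
    return spins

for _ in range(200):   # equilibration sweeps
    sweep(spins)
m = spins.mean()       # magnetization of the equilibrated configuration
```

With \( H > 0 \) and \( T \) below the critical temperature, the equilibrated magnetization aligns with the field.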

An image-to-image network with the U\( ^{2}\) -Net architecture [22] is used to approximate the posterior \( p_{\theta}(t|s)\) . The network takes as input a bundle of \( K_{\text{bundle}}\) images concatenated along the channel dimension and outputs a categorical distribution representing density values at the discrete grid points. For simplicity we choose the discretization \( \mathcal{D}\) to have the same spatial dimensions as the input image. For all our numerical experiments, training was performed on a single Nvidia-HGX compute node with 8 A100 GPUs. We trained U\( ^{2}\) -Net using the Adam optimizer with learning rate \( 10^{-5}\) and a batch size of \( 2048\) for \( N_{\text{U2Net steps}}=20000\) gradient update steps. In all our experiments the training set consists of 80% of the samples and the remaining 20% are used for testing.

TASEP Model. We generate a dataset of \( N=150000\) stationary TASEP configurations on a 1d lattice with \( M=16384\) sites. The rates \( \alpha\) (\( \beta\) ) of adding (removing) particles at the left (right) boundary are sampled from the uniform prior distribution over the square \( [0,1] \times [0,1]\) . To reach the stationary state we start from a random initial condition and perform \( N_{\text{steps}} = 2\times 10^{9} \approx 8 M^2\) move attempts, which is known to be enough to reach stationarity except in a narrow vicinity of the transition line \( \alpha = \beta<1/2\) between the high-density and low-density phases. In that case the stationary state contains a slowly diffusing shock front: of order \( M^2\) move attempts suffice to form the shock, but of order \( M^3\) attempts are needed for it to diffusively explore all possible positions.
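A scaled-down version of this procedure (a toy \( M = 128 \) lattice, with rates chosen in the high-density phase) can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy open-boundary TASEP with random-sequential updates. M, alpha, beta
# are hypothetical; alpha = 0.8, beta = 0.2 puts the system in the
# high-density phase with bulk density 1 - beta = 0.8.
M, alpha, beta = 128, 0.8, 0.2
tau = rng.integers(0, 2, size=M).astype(np.int8)   # random initial condition

for _ in range(8 * M * M):                # ~8 M^2 move attempts
    i = rng.integers(0, M + 1)            # pick one of the M + 1 bonds
    if i == 0:                            # injection at the left boundary
        if tau[0] == 0 and rng.random() < alpha:
            tau[0] = 1
    elif i == M:                          # removal at the right boundary
        if tau[-1] == 1 and rng.random() < beta:
            tau[-1] = 0
    elif tau[i - 1] == 1 and tau[i] == 0: # bulk hop from site i-1 to site i
        tau[i - 1], tau[i] = 0, 1

density = tau.mean()
```

After \( \sim 8M^2 \) attempts the mean occupation approaches the high-density value \( 1 - \beta \), consistent with the stationary phase diagram.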

We reshape the 1d lattice with \( M=16384\) sites into an image of size \( L\times L =128 \times 128\) using raster scan ordering. To construct target probability distributions we set \( \sigma=\frac{1}{150}\) and define the discretization \( \mathcal{D}\) as a uniform grid on \( [\alpha_{\min}, \alpha_{\max}]\times[\beta_{\min}, \beta_{\max}] = [0,1]\times[0,1]\) with \( L\times L =128 \times 128\) grid cells.

B.3 U\( ^{2}\) Net Training

Our task now is to estimate the posterior distribution \( p(t|s)\) using the set of samples \( \{t_k, s_{k}\}\) . To do that, we approximate \( p(t|s)\) by representatives of some parametric family of distributions \( p_{\theta}(t|s)\) , choosing \( \theta\) to optimize a cost function. It is conventional to choose the cost function to maximize the likelihood of the true external parameters \( t_{i}\) given \( s_{i}\) over all samples in the set:

\[ \begin{equation} \begin{split} \text{NegativeLogLikelihood}(\theta) = -\sum_{i=1}^{N}\log(p_{\theta}(t_{i}|s_{i})) \approx \\ \approx -N \mathbb{E}_{\{t,s\} \sim p(t)p(s|t)} \log(p_{\theta}(t|s)), \end{split} \end{equation} \]

(123)

where on the right-hand side we replaced the summands by their expected values. Minimization of the negative log-likelihood can be reinterpreted as minimization of the KL divergence between the target distribution

\[ \begin{equation} p_{\text{target}}(t|t') = \delta (t-t') \end{equation} \]

(124)

(i.e., the reconstructed labels \( t\) are identical to the input labels \( t'\) ) and the predicted distribution \( \int p_{\theta}(t|s)p(s|t')ds\) :

\[ \begin{equation} \mathcal{L}(\theta) = \mathbb{E}_{t' \sim P(t)} \text{KL}\left(p_{\text{target}}(t|t')\,\middle\|\, \int p_{\theta}(t|s)p(s|t')ds\right). \end{equation} \]

(125)

In practice, to avoid divergences it is convenient to replace the “hard label” target distribution (124) with a smoothed distribution

\[ \begin{equation} p_{\text{target}} (t|t') = C \cdot \exp\left(-\frac{1}{2 \sigma^{2}}||t'-t||^{2}\right), \end{equation} \]

(126)

where \( \sigma\) is a smoothing parameter and \( C\) is the normalizing constant.
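Constructing the discretized target (126) takes a few lines; the grid below uses the TASEP settings (\( 128 \times 128 \) cells over \( [0,1]^2 \), \( \sigma = 1/150 \)), with the usual log-sum-exp stabilization:

```python
import numpy as np

L, sigma = 128, 1.0 / 150.0
axis = (np.arange(L) + 0.5) / L               # cell centers of the grid D
A, B = np.meshgrid(axis, axis, indexing="ij")

def target(t1, t2):
    """Discretized smoothed target p_target(t | t') of eq. (126)."""
    logits = -((A - t1)**2 + (B - t2)**2) / (2 * sigma**2)
    p = np.exp(logits - logits.max())         # subtract max for stability
    return p / p.sum()                        # division enforces the constant C

p = target(0.4, 0.6)
```

The normalization over the grid plays the role of the constant \( C \), and the resulting array is exactly the categorical target the U\( ^{2}\) -Net is trained against.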

B.4 Supplementary Figures

Figure 11
Figure 12
Figure 10. Free energy surface of a diffusion model (\( \eta = 0.1\) ) reconstructed using CLIP distance. Noise injection smooths the free energy landscape, suppressing sharp phase boundaries (compare with Fig. 6).
Figure 14
Figure 15
Figure 16
Figure 13. Training examples for free energy reconstruction: ground truth images with known generation parameters, the CLIP-induced latent distribution, and the predicted distribution from our convex model.
Figure 17. First- and second-order derivatives of the diffusion model’s free energy. Discontinuities mark first- and second-order phase transitions. Regions of constant derivative correlate with visually distinct phases.
Figure 18. Free energy derivatives of the diffusion model’s free energy reconstructed with CLIP distance. The overlaid grid highlights regions of constant derivative, correlating with visually distinct phases.

References

[1] Yahui Liu and Enver Sangineto and Yajing Chen and Linchao Bao and Haoxian Zhang and Nicu Sebe and Bruno Lepri and Wei Wang and Marco De Nadai Smoothing the disentangled latent style space for unsupervised image-to-image translation Proceedings of the IEEE/CVF conference on computer vision and pattern recognition 2021 10785–10794

[2] Jiayi Guo and Xingqian Xu and Yifan Pu and Zanlin Ni and Chaofei Wang and Manushree Vasu and Shiji Song and Gao Huang and Humphrey Shi Smooth diffusion: Crafting smooth latent spaces in diffusion models Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024 7548–7558

[3] Hang Li and Chengzhi Shen and Philip Torr and Volker Tresp and Jindong Gu Self-discovering interpretable diffusion latent directions for responsible text-to-image generation Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024 12006–12016

[4] Bradley CA Brown and Anthony L Caterini and Brendan Leigh Ross and Jesse C Cresswell and Gabriel Loaiza-Ganem Verifying the union of manifolds hypothesis for image data arXiv preprint arXiv:2207.02862 2022

[5] Evert PL Van Nieuwenburg and Ye-Hua Liu and Sebastian D Huber Learning phase transitions by confusion Nature Physics 2017 13 5 435–439

[6] Juan Carrasquilla and Roger G Melko Machine learning phases of matter Nature Physics 2017 13 5 431–434

[7] Benno S Rem and Niklas Käming and Matthias Tarnowski and Luca Asteria and Nick Fläschner and Christoph Becker and Klaus Sengstock and Christof Weitenberg Identifying quantum phase transitions using artificial neural networks on experimental data Nature Physics 2019 15 9 917–920

[8] Jielin Wang and Wanzhou Zhang and Tian Hua and Tzu-Chieh Wei Unsupervised learning of topological phase transitions using the Calinski-Harabaz index Physical Review Research 2021 3 1 013074

[9] Askery Canabarro and Felipe Fernandes Fanchini and André Luiz Malvezzi and Rodrigo Pereira and Rafael Chaves Unveiling phase transitions with machine learning Physical Review B 2019 100 4 045129

[10] Nicholas Walker and Ka-Ming Tam and Mark Jarrell Deep learning on the 2-dimensional Ising model to extract the crossover region with a variational autoencoder Scientific reports 2020 10 1 1–12

[11] Hang Shao and Abhishek Kumar and P Thomas Fletcher The riemannian geometry of deep generative models Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops 2018 315–323

[12] Ernst Ising Beitrag zur theorie des ferromagnetismus Zeitschrift für Physik 1925 31 1 253–258

[13] Lars Onsager Crystal statistics. I. A two-dimensional model with an order-disorder transition Physical Review 1944 65 3-4 117

[14] Mark Kac and John C Ward A combinatorial solution of the two-dimensional Ising model Physical Review 1952 88 6 1332

[15] Rodney J Baxter and Ian G Enting 399th solution of the Ising model Journal of Physics A: Mathematical and General 1978 11 12 2463

[16] B. Derrida and M. R. Evans and V. Hakim and V. Pasquier Exact solution of a 1D asymmetric exclusion model using a matrix formulation Journal of Physics A: Mathematical and General 1993 26 7 1493

[17] R. A. Blythe and M. R. Evans Nonequilibrium steady states of matrix-product form: a solver's guide Journal of Physics A: Mathematical and General 2007 40 46 R333

[18] P. L. Krapivsky and S. Redner and E. Ben-Naim A kinetic view of statistical physics Cambridge University Press 2010

[20] Shun-ichi Amari and John Armstrong Curvature of Hessian manifolds Differential Geometry and its Applications 2014 33 1–12

[21] Robert L Bryant Hessianizability of surface metrics arXiv preprint arXiv:2405.06998 2024

[22] Xuebin Qin and Zichen Zhang and Chenyang Huang and Masood Dehghan and Osmar R Zaiane and Martin Jagersand U2-Net: Going deeper with nested U-structure for salient object detection Pattern Recognition 2020 106 107404

[23] Paul Fearnhead and Dennis Prangle Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation Journal of the Royal Statistical Society Series B: Statistical Methodology 2012 74 3 419–474

[24] Bai Jiang and Tung-yu Wu and Charles Zheng and Wing H Wong Learning summary statistic for approximate Bayesian computation via deep neural network Statistica Sinica 2017 1595–1618

[25] Lykon Dreamshaper-8 2023

[26] Jiaming Song and Chenlin Meng and Stefano Ermon Denoising diffusion implicit models arXiv preprint arXiv:2010.02502 2020

[27] Zhantao Yang and Ruili Feng and Han Zhang and Yujun Shen and Kai Zhu and Lianghua Huang and Yifei Zhang and Yu Liu and Deli Zhao and Jingren Zhou and others Lipschitz singularities in diffusion models The Twelfth International Conference on Learning Representations 2023

[28] Tero Karras and Samuli Laine and Timo Aila A style-based generator architecture for generative adversarial networks Proceedings of the IEEE/CVF conference on computer vision and pattern recognition 2019 4401–4410

[29] Antonio Sclocchi and Alessandro Favero and Matthieu Wyart A phase transition in diffusion models reveals the hierarchical nature of data Proceedings of the National Academy of Sciences 2025 122 1 e2408799121

[30] Giulio Biroli and Tony Bonnaire and Valentin De Bortoli and Marc Mézard Dynamical regimes of diffusion models Nature Communications 2024 15 1 9957

[31] Yong-Hyun Park and Mingi Kwon and Jaewoong Choi and Junghyo Jo and Youngjung Uh Understanding the latent space of diffusion models through the lens of riemannian geometry Advances in Neural Information Processing Systems 2023 36 24129–24142

[32] Georgios Arvanitidis and Lars Kai Hansen and Søren Hauberg Latent space oddity: on the curvature of deep generative models arXiv preprint arXiv:1710.11379 2017

[33] Alessandra Tosi and Søren Hauberg and Alfredo Vellido and Neil D Lawrence Metrics for probabilistic geometries arXiv preprint arXiv:1411.7432 2014