If you see this, something is wrong
First published on Wednesday, Jun 3, 2026 and last modified on Wednesday, Jun 3, 2026 by François Chaplais.
Machine Learning, ICML
While Physics-Informed Neural Networks (PINNs) are powerful for solving Partial Differential Equations (PDEs), their training is often paralyzed by gradient pathology. The gradients from the PDE residuals and boundary constraints oppose each other, trapping the model in local minima. Current solutions, such as adaptive weighting or hard constraints, either fail to fundamentally resolve this ill-conditioning or are limited to simple geometries. In this study, we systematically analyze the possible causes of this gradient pathology from the perspectives of loss landscapes and optimization dynamics. Based on the obtained conclusion, we propose Constraint-Aligned loss with Manifold Lifting (CAML). By reformulating all zeroth-order terms into aligned constraints, our method effectively mitigates gradient conflicts. In addition, we introduce a delay factor to help the optimizer skip the high-curvature area. Experiments demonstrate that our CAML significantly enhances numerical stability and efficiency in highly complex PINN problems. Our code is open-sourced on CAML .
Physics-Informed Neural Networks (PINNs) [1] have established themselves as a transformative framework in scientific computing, offering a robust deep learning strategy for solving Partial Differential Equations (PDEs). By embedding PDE residuals and boundary conditions directly into the loss function, this approach significantly mitigates the dependence on labeled training data. Notably, PINNs that rely exclusively on physical prior functions can be considered an innovative meshless numerical method. This paradigm enables the completely self-supervised reconstruction of target physical fields, yielding neural models that generalize effectively without the need for costly data acquisition. In domains such as linear elasticity [2, 3], fluid mechanics [4, 5], heat conduction [6, 7], and electromagnetism [8, 9], such a well-trained model is poised to supersede traditional numerical methods, providing nearly real-time solutions for previously computationally-intensive geometric problems that require an extremely long period of time.
However, compared to traditional data-driven paradigms, employing PINNs to approximate PDE solutions remains non-trivial. A salient limitation is the ill-conditioning of the loss function, which severely hinders convergence, even in relatively elementary scenarios [10]. Literature indicates that the gradients derived from the PDE residuals and the boundary constraints often exhibit conflicting directions, causing the optimization process to be trapped in local minima [11, 12, 13]. Consequently, in problems involving complex geometries or PDE configurations, this pathology prevents the model from accurately recovering the ground truth.
Recent work has addressed this challenge from several distinct perspectives. For example, [16, 12] and [14] focus on designing adaptive loss functions to equilibrate the pathology between boundary conditions and PDE residuals. In contrast, [40, 24] and [25] attempt to enhance the optimizer by dynamically adjusting the optimization trajectory to navigate the non-convex landscape. Another line of work, [26] and [41], enhances performance through a unique configuration of the model. Nevertheless, these methods mainly alleviate convergence issues stemming from gradient ill-conditioning, without fundamentally mitigating the ill-conditioning itself. Alternatively, model-based improvements, such as [27] and [42], eliminate the boundary loss term by enforcing hard constraints. However, such methods typically rely on strict Dirichlet boundary conditions and distance functions, which severely restrict their generalizability to complex geometries and combined boundary conditions.
In this work, we provide the explicit geometric characterization of the PDE-residual loss landscape in PINNs. We show that operator non-uniqueness induces a connected manifold of global minimizers for the PDE residual, forming an ill-conditioned valley in parameter space and intrinsically causing gradient conflicts with boundary losses. Based on this insight, we propose Constraint-Aligned loss with Manifold Lifting (CAML), which directly modifies the geometry of the optimization landscape to mitigate gradient pathology. Our contributions are summarized as follows:
Experiments across various physical fields demonstrate that these innovative methods significantly enhance the stability and efficiency of the model when solving geometrically complex problems or nonlinear boundary conditions.
Conflict of Interest. The authors declare that there are no financial or other conflicts of interest related to this work.
A series of studies has systematically investigated the pathology of PINN loss functions from multiple perspectives, including gradient conflicts between PDE residuals and boundary conditions [11], the ruggedness of the loss landscape [10, 13], and derivative pathologies [14, 15]. These challenges have spurred the development of various mitigation strategies, including model architectures, loss formulations, and optimization methods.
Loss function modifications provide the most direct remedy. Representative approaches include approaches that incorporate gradient information of PDE residuals to regularize steep solution landscapes [16], adversarial \( L^\infty\) -based PINNs for high-dimensional Hamilton-Jacobi-Bellman (HJB) equations [17], meta-learning formulations [18], neural spectral methods [19], and auxiliary-variable decompositions for high-order derivatives [14]. In addition, there are a series of methods [20, 21, 22, 23] that provide implicit regularization through adaptive weights.
Optimizer-based methods attempt to navigate in the non-convex landscape more effectively. DCGD [24] and ConFIG [25] dynamically adjust the optimization trajectories to maintain consistency between update directions and individual loss gradients. [40] leverage Hessian-induced Riemannian metrics to exploit curvature information. [13] and [43] propose optimizers that target pathology induced by differential operators.
Architecture modifications provide implicit regularization. PirateNets [26] introduce adaptive residual connections to stabilize PDE residual minimization under suboptimal initialization. [41] analyzes critical points through a wide neural network theory. [44, 45, 46] and [35] attempted to enhance the prediction accuracy by using advanced architectures.
Despite these advances, existing methods primarily balance gradient magnitudes or adjust optimization trajectories without fundamentally eliminating the directional conflicts between boundary and interior loss terms. The underlying ill-conditioned loss landscape persists.
Another class of methods [27, 28] enforces the calculation of boundary conditions through an ansatz-based hard constraint. By removing boundary terms from the objective, these methods fundamentally avoid gradient conflicts. [47, 48, 42, 49] and [50] further expanded on this and applied this hard constraint method to various fields. Nevertheless, these methods demand careful construction of distance functions and geometry-aware representations for each specific domain, posing significant challenges for generalization to arbitrary complex geometries.
Beyond gradient-based techniques, several optimization-oriented strategies have been proposed to improve convergence efficiency. Curriculum learning [10] reduces training difficulty by progressively organizing samples, while adaptive sampling [29] dynamically reallocates collocation points to enhance efficiency. Complementarily, recent work has analyzed PINN training failures from a gradient-dynamics perspective: [11] employed the annealing algorithm to balance the interactions among different terms in the loss function, and [17] further formalized this phenomenon via neural tangent kernel analysis, proposing adaptive weighting to balance convergence across loss components. While orthogonal to gradient-aware approaches and easily integrated as auxiliary components, these enhancements do not directly resolve the fundamental conflict between boundary conditions and PDE residual gradients.
We consider a general steady-state PDE of the form
(1)
where \( u\) denotes the unknown solution, \( \mathcal{N}\) is the differential operator defining the PDE, \( f\) is the source term, and \( \Omega\) is the computational domain. To fully specify a problem instance, boundary conditions are imposed on \( \Gamma=\partial\Omega\) , which may consist of Dirichlet, Neumann, and Robin components:
(2)
where \( \mathbf{n}\) denotes the normal outward unit vector, and \( g_\mathrm{d}\) , \( g_\mathrm{n}\) , and \( g_\mathrm{r}\) are prescribed boundary value functions. In the Robin condition, \( \alpha\) and \( \beta\) are coefficients of the zeroth- and first-order terms, respectively.
For problems with composite boundary conditions, the standard PINN loss is defined as
(3)
where \( w_\mathrm{res}\) and \( w_\mathrm{bc}\) balance the PDE residual and boundary condition losses. The PDE residual loss is given by
(4)
where \( {x_i}_{i=1}^{N_\mathrm{res}} \subset \Omega\) are interior collocation points and \( u_\theta\) denotes the model approximation parameterized by \( \theta\) .
The boundary condition loss \( \mathcal{L}_\mathrm{bc}\) is defined as the sum of individual boundary losses, \( \mathcal{L}_\mathrm{bc}=\mathcal{L}_\mathrm{d}+\mathcal{L}_\mathrm{n}+\mathcal{L}_\mathrm{r}\) , with
(5)
where \( {x_i}\) are collocation points sampled on \( \Gamma_\mathrm{d}\) , \( \Gamma_\mathrm{n}\) , and \( \Gamma_\mathrm{r}\) , respectively.
Recent studies [11] have shown that the gradient directions associated with the PDE residual, \( \nabla_\theta \mathcal{L}_{\mathrm{res}}\) , and those induced by boundary conditions, \( \nabla_\theta \mathcal{L}_{\mathrm{bc}}\) , often exhibit a negative inner product. This gradient conflict leads to competing optimization trajectories and can significantly impede convergence.
Previous studies [30, 13] have primarily investigated the geometry of the loss landscape in the parameter space. In this section, we instead adopt a complementary perspective by simultaneously considering the loss geometry in both the function space and the parameter space. Let \( \mathcal{F}(\Omega)\) denote an appropriate Sobolev space defined in the domain \( \Omega\) . We define the loss functional in the function space as
(6)
Unlike the complete loss function defined in the parameter space that incorporates boundary constraints in Equation (3), this equation is a theoretical functional for the PDE residual.
Theorem 1 (Existence of Loss Valleys in PDE Residual Term)
If the PDE constraint \( \mathcal{N}[u]=f\) does not uniquely determine a solution in function space, then the PDE residual loss admits a flat and connected set \( \mathcal{S}_\mathrm{res} = {u \in \mathcal{F}(\Omega) \mid \mathcal{N}[u] = f} \) of global minimizers. When represented by an over-parameterized neural network, this set induces a connected manifold of zero residual solutions in parameter space with at least one flat direction.
We provide the strict proof in Appendix A.1. This characteristic reflects the intrinsic non-uniqueness of the differential operator \( \mathcal{N}\) , which is parameterized by a small number of free modes in classical PDEs. Although the zero-residual set may correspond to a simple hyperplane in the function space, its preimage in the parameter space is generally mapped to a highly distorted and non-convex manifold due to the strong nonlinearity of this mapping, as visualized in Figure 1.
This characteristic is fundamentally distinct from the Mean Squared Error (MSE) constraint commonly encountered in classic regression problems, where its objective function admits only a single global optimum. In contrast, the continuous zero-residual set \( \mathcal{S}_{\mathrm{res}}\) of PINN loss yields a ‘valley-like’ loss landscape in the function space. The bottom of this loss valley consists of multiple distinct functions that exactly satisfy the governing equation, yet differ substantially in their absolute values. Consequently, unlike the ‘flat minima’ widely discussed in supervised regression contexts [31, 32, 33], this loss valley originates from the intrinsic non-uniqueness of the PDE solution under insufficient constraints, rather than the redundancy of network parameters. Importantly, even when the PDE loss is nearly zero, a significant optimization trajectory may still be required to reach the physically meaningful solution selected by the boundary conditions.
[13] provided a quantitative analysis of the PINN loss landscape by estimating the Hessian spectral density, revealing severe pathology with large outlier eigenvalues and a high density of near-zero eigenvalues, consistent with our loss valley hypothesis. Following their work, we further analyze the Hessian in both function and parameter space at converged solutions, as shown in Table 1. We observe that the zero-loss solutions in the function space exhibit an infinite Hessian condition number, confirming the existence of the strict zero-residual manifold predicted by Theorem 1. Furthermore, the subspace similarity in the function space reaches \( 100\%\) , verifying the linear solution set shown in Figure 1. After being mapped onto the parameter space, this manifold forms a sharp valley, with a condition number over \( 10^{22}\) . In addition, the low-curvature eigenspaces in parameter space only partially align across different additive constants (\( \approx50\%\) ), indicating that the loss valley forms a broad and geometrically complex manifold, rather than a fixed-dimensional mode. More details about this experiment are in Appendix C.1.
| Condition Number | Subspace Similarity | ||||
| Func. Space | Param Space | Func. Space | Param Space | ||
| 0 | 3.97e-7 | \( \infty\) | \( 1.50 \times 10^{23}\) | \( -\) | \( -\) |
| 1 | 1.07e-5 | \( \infty\) | \( 5.97 \times 10^{22}\) | 100.0% | 50.8% |
| -1 | 5.24e-8 | \( \infty\) | \( 1.05 \times 10^{24}\) | 100.0% | 44.8% |
Boundary conditions can be viewed as constraints in the function space \( \mathcal{S}_\mathrm{bc} = {u \in \mathcal{F}(\Omega) \mid \alpha u + \beta\nabla u \cdot \mathbf{n}|_{\Gamma} = g}\) that intersect the PDE zero-residual solution manifold \( \mathcal S_{\mathrm{res}}\) . The well-posedness of the resulting boundary value problem depends on whether these constraints are sufficient to eliminate the intrinsic invariances of the differential operator. We have visualized this in Figure 4.
Unique Solution. When the zeroth-order term is present (i.e., \( \alpha > 0\) , encompassing Dirichlet and Robin conditions) and standard regularity and coercivity assumptions hold, the constraint removes the additive invariance associated with the PDE operator. Consequently, the intersection \( \mathcal{S}_\mathrm{bc} \cap \mathcal{S}_\mathrm{res}\) reduces to a singleton, yielding a unique solution.
Non-Unique Solution. When only the derivative term is present (i.e., \( \alpha = 0\) , corresponding to the Neumann condition), the constraint does not eliminate the additive constant mode for differential operators depending only on derivatives. As a result, if \( u \in \mathcal{S}_\mathrm{bc} \cap \mathcal{S}_\mathrm{res}\) , then \( u + c\) also belongs to this intersection for any constant \( c \in \mathbb{R}\) . Therefore, the solution is not unique unless normalization is imposed.
In this section, we identify a typical optimization dynamics of standard PINNs from the perspective of gradient competition, as in Figure 1.4.1.2. Note that we do not claim such a decomposition occurs in all possible PINN optimization scenarios. However, for a PDE with large coefficients, it constitutes a representative behavioral pattern, which we demonstrate in Appendix C.2 by an experiment.
As observed by [11], the gradient magnitude of the PDE residual term typically dominates that of the boundary condition term in a PDE with large coefficients, namely \( \|\nabla_\theta \mathcal{L}_{\mathrm{res}}\|\gg \|\nabla_\theta \mathcal{L}_{\mathrm{bc}}\|\) , especially during the early stages of training. Consequently, the optimization dynamics is initially governed by the PDE residual. Under this regime, gradient descent follows
(7)
which drives the model to rapidly fall into the low-loss set \( \mathcal{M}_{\mathrm{res}}={\theta \mid \mathcal{L}_{\mathrm{res}}(\theta)\leq \varepsilon}\) corresponding to the loss valley induced by the PDE residual, where the threshold \( \varepsilon>0\) marks the transition from residual-dominated to boundary-influenced gradients. Since this valley is generally high-dimensional, while the current landing point is determined by both model initialization and loss landscape, the current output value may deviate substantially from the global optimum that satisfies the boundary conditions.
In the subsequent training phase, the model is confined within the low-loss valley \( \mathcal{M}_{\mathrm{res}}\) by the PDE residual gradient \( \nabla_\theta\mathcal{L}_{\mathrm{res}}\) , while the boundary condition gradient \( \nabla_\theta\mathcal{L}_{\mathrm{bc}}\) guides the parameters directly toward the global optimum. Therefore, we establish the following lemma:
Lemma 1 (Gradient Conflict from Perspective of Loss Valley)
Near the PDE loss valley \( \mathcal{M}_{\mathrm{res}}\) , the residual gradient \( \nabla_\theta\mathcal{L}_{\mathrm{res}}\) is normal to this manifold, whereas the boundary condition gradient \( \nabla_\theta\mathcal{L}_{\mathrm{bc}}\) generically contains a normal component pointing in the opposite direction. Then the occurrence of a negative inner product
(8)
occurs generically during the training process. This leads to a structural gradient conflict during PINN optimization.
The proof of this lemma is provided in Appendix A.2. As a result of this conflict, the model parameter is forced to follow a tortuous trajectory along the valley floor rather than a direct path toward the global optimum, leading to a substantial degradation in convergence speed.
A more severe difficulty arises when the PDE residual-induced loss valley exhibits high curvature along the optimization trajectory. In such cases, the gradient signal provided by the boundary conditions is insufficient to overcome the local geometric constraints from the PDE residual, leading to oscillatory and stagnant update directions. These pathological oscillations hinder the smooth traversal of the valley floor and often trap the parameters in sub-optimal local minima, ultimately slowing or preventing convergence to the true solution.
Based on the above analysis, we redesign the PINN loss function to minimize the extent of optimization exploration during Phase II. By expanding the range of feasible solutions and improving the position of entering the valley, we significantly reduce the likelihood of entering regions with high curvature that hinder convergence.
We consider a general steady state PDE given by Equation (1). At the interior collocation points \( {x_i}_{i=1}^{N_{\mathrm{res}}} \subset \Omega\) , its discrete form can be written as
(9)
Similarly, boundary conditions imposed at \( {x_b}_{b=1}^{N_{\mathrm{bc}}} \subset \Gamma\) are expressed in a unified form as
(10)
This formulation encompasses Dirichlet (\( \beta_b=0\) ), Neumann (\( \alpha_b=0\) ), and Robin (\( \alpha_b,\beta_b>0\) ) as special cases.
For a well-posed problem, a PDE and boundary condition with a nonzero zeroth-order term can remove the additive constant ambiguity. However, it also constrains the solution to a fixed point. As discussed in Section 1.4.2, even when the model is already initialized close to the global optimum, the PDE residual term may still steer the parameter into a valley far from the true solution, thus significantly slowing convergence. To address this issue, a possible approach is to relax the constraint range, thereby making it easier for the optimizer to identify a global optimum in the loss valley that simultaneously satisfies PDE and boundary conditions.
Therefore, we introduce a solvable additive constant offset \( c \in \mathbb{R}\) for all zeroth-order terms into the loss formulation, thus transforming the task into a controllable ill-posed problem. By explicitly solving (linear) or performing a small number of iterations (nonlinear) within each backpropagation step, the optimal value of \( c\) is determined, which permits a controlled translation along the additive-invariant direction of the PDE solution manifold during the entire training process. Specifically, we define interior and boundary residuals as
(11)
then their aligned constraint form can be defined as
(12)
Accordingly, the aligned PDE residual loss and boundary condition loss are given by
(13)
The optimal offset \( c\) is then obtained by solving a one-dimensional auxiliary sub-optimization problem
(14)
Linear Case. For linear zeroth-order terms, this problem admits a closed-form solution:
(15)
where \( \gamma_i\) denotes the coefficient of the zeroth-order terms in PDE, i.e., \( \mathcal{Z}[u](x_i)=\gamma_i\,u(x_i)\) . We derived this in Appendix B.1.1. In practice, \( u_\theta\) is explicitly detached from \( c\) to prevent gradient propagation through this offset.
Nonlinear Case. For nonlinear zeroth-order terms, deriving an optimal value for \( c\) generally requires iterative solvers. Here, in the early iteration steps, we recommend using Newton’s method several times to update \( c\) . For more complex PDEs, first-order methods such as gradient descent can also be used as alternatives. We provide details in Appendix B.1.2.
By introducing this additive constant \( c\) , the network output is no longer constrained to a fixed solution but is instead allowed to translate along this constant direction. Therefore, when training converges,
(16)
The true solution \( u^*\) can be obtained from \( u_\theta\) by explicitly eliminating the learned offset \( c\) .
From an optimization standpoint, this aligned constraint formulation substantially enlarges the overlap between the set of boundary-admissible solutions and the PDE-admissible solution manifold. As a result, even if the parameter falls into a loss valley that is far from the original global optimum at Phase I, the optimizer can still efficiently locate an additive shift that is compatible with both constraints within a higher-dimensional expansion. We rigorously prove and discuss this enlargement in Appendix A.3, and provide the full pipeline in Algorithm 1 in Appendix B.2.
| MLP | PirateNets | PINNsFormer | |||||||||||
| Heat | Pois. | NS | Helm. | Heat | Pois. | NS | Helm. | Heat | Pois. | NS | Helm. | ||
| PINN [1] | Stp | 10127\( _{\pm 2839}\) | 8005\( _{\pm 1191}\) | 15375\( _{\pm 3583}\) | 1141\( _{\pm 11}\) | 8678\( _{\pm 5523}\) | 3875\( _{\pm 2445}\) | 1599\( _{\pm 99}\) | 3197\( _{\pm 296}\) | 389\( _{\pm 17}\) | |||
| \( L_2\) | 7.52e-3\( _{\pm 4\text{e-}3}\) | 1.11e-2\( _{\pm 5\text{e-}3}\) | 2.21e-2\( _{\pm 1\text{e-}2}\) | 8.56e-3\( _{\pm 5\text{e-}3}\) | 8.48e-2\( _{\pm 6\text{e-}2}\) | 7.29e-3\( _{\pm 1\text{e-}2}\) | 3.40e-2\( _{\pm 3\text{e-}2}\) | 2.86e-1\( _{\pm 7\text{e-}2}\) | 4.49e-3\( _{\pm 1\text{e-}3}\) | ||||
| \( L^\infty\) -PINN [17] | Stp | 12648\( _{\pm 6212}\) | 4768\( _{\pm 552}\) | 2801\( _{\pm 205}\) | 7721\( _{\pm 2881}\) | 5545\( _{\pm 5413}\) | 1501\( _{\pm 595}\) | 1912\( _{\pm 290}\) | 3142\( _{\pm 231}\) | 415\( _{\pm 97}\) | 3435\( _{\pm 1058}\) | ||
| \( L_2\) | 5.07e-1\( _{\pm 2\text{e-}1}\) | 8.02e-3\( _{\pm 3\text{e-}3}\) | 1.34e-3\( _{\pm 4\text{e-}4}\) | 1.01e-1\( _{\pm 9\text{e-}2}\) | 9.12e-2\( _{\pm 7\text{e-}2}\) | 4.92e-3\( _{\pm 4\text{e-}3}\) | 1.41e-1\( _{\pm 1\text{e-}1}\) | 2.00e-1\( _{\pm 2\text{e-}2}\) | 1.38e-2\( _{\pm 1\text{e-}2}\) | 1.97e-3\( _{\pm 5\text{e-}4}\) | |||
| SA-PINN [34] | Stp | 8958\( _{\pm 2501}\) | 8355\( _{\pm 3867}\) | 6888\( _{\pm 2902}\) | 8959\( _{\pm 2507}\) | 4713\( _{\pm 1043}\) | 9501\( _{\pm 3755}\) | 8930\( _{\pm 5914}\) | 4598\( _{\pm 3340}\) | 1649\( _{\pm 227}\) | 1888\( _{\pm 118}\) | 330\( _{\pm 9}\) | |
| \( L_2\) | 4.71e-3\( _{\pm 2\text{e-}3}\) | 1.95e-1\( _{\pm 2\text{e-}1}\) | 9.79e-3\( _{\pm 4\text{e-}3}\) | 8.07e-3\( _{\pm 5\text{e-}3}\) | 8.64e-2\( _{\pm 1\text{e-}1}\) | 6.13e-1\( _{\pm 5\text{e-}1}\) | 8.71e-2\( _{\pm 6\text{e-}2}\) | 6.44e-3\( _{\pm 4\text{e-}3}\) | 1.23e-2\( _{\pm 1\text{e-}3}\) | 4.41e-2\( _{\pm 1\text{e-}2}\) | 6.76e-3\( _{\pm 5\text{e-}3}\) | ||
| BRDR [21] | Stp | 15213\( _{\pm 2175}\) | 6137\( _{\pm 795}\) | 5462\( _{\pm 605}\) | 13131\( _{\pm 2505}\) | 5245\( _{\pm 1058}\) | 8276\( _{\pm 1326}\) | 9427\( _{\pm 4083}\) | 5831\( _{\pm 6922}\) | 1941\( _{\pm 203}\) | 2333\( _{\pm 214}\) | 367\( _{\pm 13}\) | |
| \( L_2\) | 5.58e-3\( _{\pm 2\text{e-}3}\) | 8.51e-2\( _{\pm 7\text{e-}2}\) | 8.66e-3\( _{\pm 3\text{e-}3}\) | 2.05e-2\( _{\pm 1\text{e-}2}\) | 5.51e-1\( _{\pm 3\text{e-}1}\) | 5.20e-1\( _{\pm 3\text{e-}1}\) | 9.26e-2\( _{\pm 6\text{e-}2}\) | 6.81e-3\( _{\pm 1\text{e-}2}\) | 1.67e-2\( _{\pm 1\text{e-}2}\) | 8.65e-2\( _{\pm 4\text{e-}2}\) | 6.25e-3\( _{\pm 1\text{e-}3}\) | ||
| DB-PINN [23] | Stp | 6988\( _{\pm 329}\) | 9451\( _{\pm 6738}\) | 3981\( _{\pm 294}\) | 11005\( _{\pm 1662}\) | 12689\( _{\pm 6402}\) | 6054\( _{\pm 4412}\) | 2304\( _{\pm 1113}\) | 1159\( _{\pm 132}\) | 331\( _{\pm 36}\) | 4882\( _{\pm 245}\) | ||
| \( L_2\) | 6.82e-2\( _{\pm 1\text{e-}2}\) | 4.56e-1\( _{\pm 2\text{e-}1}\) | 7.89e-3\( _{\pm 2\text{e-}3}\) | 1.22e-2\( _{\pm 1\text{e-}2}\) | 3.12e-1\( _{\pm 2\text{e-}1}\) | 3.47e-2\( _{\pm 2\text{e-}2}\) | 3.69e-3\( _{\pm 2\text{e-}3}\) | 1.67e-3\( _{\pm 8\text{e-}3}\) | 5.69e-3\( _{\pm 2\text{e-}3}\) | 5.37e-3\( _{\pm 2\text{e-}3}\) | |||
| Stp | 1577\( _{\pm 130}\) | 3868\( _{\pm 1932}\) | 1269\( _{\pm 966}\) | 1004\( _{\pm 450}\) | 713\( _{\pm 496}\) | 4880\( _{\pm 726}\) | 2597\( _{\pm 748}\) | 410\( _{\pm 401}\) | 746\( _{\pm 115}\) | 1936\( _{\pm 78}\) | 201\( _{\pm 54}\) | 1659\( _{\pm 1604}\) | |
| CAML (Ours) | \( L_2\) | 1.16e-3\( _{\pm 8\text{e-}5}\) | 5.00e-3\( _{\pm 9\text{e-}4}\) | 4.73e-3\( _{\pm 3\text{e-}4}\) | 1.56e-4\( _{\pm 5\text{e-}5}\) | 4.98e-3\( _{\pm 4\text{e-}3}\) | 3.24e-2\( _{\pm 1\text{e-}2}\) | 2.79e-2\( _{\pm 2\text{e-}2}\) | 1.19e-3\( _{\pm 5\text{e-}4}\) | 4.04e-3\( _{\pm 9\text{e-}4}\) | 8.29e-2\( _{\pm 1\text{e-}2}\) | 4.07e-3\( _{\pm 2\text{e-}3}\) | 9.10e-4\( _{\pm 3\text{e-}4}\) |
From the perspective of optimization dynamics, another way to shorten the optimization path is to directly bypass the high-curvature regions of the PDE loss landscape. Consequently, we define a time-dependent gating function \( \lambda(t)\in[0,1]\) to guide the parameters to rapidly approach the affine subspace that satisfies the boundary conditions at the early stage of training. The total loss can be written as
(17)
A simple and reproducible choice for \( \lambda(t)\) is a piecewise linear schedule:
(18)
where \( t_d\) is the delay step number and \( t_r\) is the ramp length.
Although Equation (17) resembles a dynamic reweighting scheme, the proposed delay factor fundamentally alters the optimization dynamics. Fixed or adaptive weighting strategies still allow the residual term to dominate early training, thereby pulling the parameters into an unfavorable location within the residual valley. In contrast, the delay factor explicitly prevents this, thus reducing the probability of entering a valley that is far from the global optimum.
Our experiments aim to answer the following five questions: RQ1. How does the proposed approach compare with state-of-the-art PINN loss variants in terms of efficiency, stability, and sensitivity to initialization across different physical problems? RQ2. Can the proposed aligned constraint formulation effectively mitigate gradient conflicts between PDE residuals and boundary conditions during PINN training and shorten the optimization path? RQ3. What is the contribution of aligned constraints and delay-residual optimization? RQ4. How does CAML integrate with the existing conflict-aware optimizers? And RQ5. To which scenarios is CAML applicable?
We selected four representative and challenging benchmarks across thermodynamics, fluid mechanics and electromagnetism, including: (i) Heat, a heat conduction problem with composite boundary conditions, (ii) Poisson, a Poisson problem with complex nonlinear boundary conditions, (iii) NS, a steady-state Navier-Stokes problem with a nonlinear operator, and (iv) Helmholtz, a Helmholtz reaction-diffusion problem with complex geometry and high frequency. As baselines, we compare with the following five loss function-based PINN variants: (i) classic PINN [1], (ii) \( L^\infty\) -PINN [17], (iii) SA-PINN [34], (iv) BRDR [21], and (v) DB-PINN [23].
To systematically evaluate the performance of our method across different parameter spaces, we consider three representative model architectures in all benchmarks: (i) an MLP-based PINN, (ii) PirateNets [26], a physics-informed ResNet, and (iii) PINNsFormer [35], a physics-informed Transformer. For all experiments, we keep the network structures and hyperparameter configurations consistent with the original implementations, and only replace the loss function. The optimizer is Adam [36]. We implemented the algorithms using PyTorch [37] and the experiments are conducted on an NVIDIA RTX 5090 GPU.
In each experiment, we selected five random seeds for initialization. In addition to (i) relative \( L_2\) error at the same iteration step \( T_{\min}\) and (ii) the number of iterations required for the relative \( L_2\) error to reach a predefined accuracy threshold, we further introduce the following statistic indicator in part of the experiments to evaluate the stability of optimization: (iii) full-progress gradient cosine similarity
(19)
following previous work [38, 11], which characterizes the degree of conflict between the residual and boundary-condition gradients. The higher this value, the more similar the directions of the two gradients will be. We report the fraction of training iterations (over a fixed total number of iterations) for which \( \cos(\phi)>0\) . All details of the experimental setup are in Appendix C.3.
| MLP | PirateNets | PINNsFormer | |||||
| Heat | Pois. | Heat | Pois. | Heat | Pois. | ||
| PINN | [1] | 5.89% | 3.85% | 1.44% | 4.09% | 5.13% | 7.07% |
| \( L^\infty\) -PINN | [17] | 11.98% | 62.81% | 40.22% | 84.08% | 54.68% | 87.78% |
| SA-PINN | [34] | 4.95% | 9.73% | 5.72% | 11.18% | 5.27% | 6.98% |
| BRDR | [21] | 4.57% | 21.28% | 6.56% | 4.63% | 4.02% | 8.00% |
| DB-PINN | [23] | 0.02% | 3.19% | 7.82% | 10.01% | 5.17% | 16.32% |
| CAML | (Ours) | 39.16% | 52.75% | 39.41% | 31.58% | 41.45% | 55.50% |
RQ1: Performance Analysis. Table 2 presents quantitative comparisons across four PDE benchmarks and three backbone architectures. CAML consistently achieves the best or second-best \( L_2\) accuracy in nearly all settings, while requiring substantially fewer iterations to reach the target error. In contrast, several baselines exhibit large variance or fail entirely when some seeds do not converge within the iteration budget. CAML is successful in most trials and shows significantly reduced variance in both convergence steps and final error. This robustness aligns with the analysis in Section 1.4.2: by enlarging the admissible solution set via manifold lifting, CAML mitigates sensitivity to initialization-dependent minima in the residual valley. Moreover, its consistent performance across architectures indicates that the gains arise from improved optimization geometry induced by aligned constraints, rather than architecture-specific effects.
RQ2: Gradient Conflict Mitigation. Table 3 reports the fraction of training iterations with positive cosine similarity between residual and boundary gradients. For standard PINNs, this fraction is typically less than 10%, indicating persistent gradient conflict throughout training. In contrast, CAML achieves an order-of-magnitude increase across all backbone architectures, demonstrating its effectiveness in alleviating gradient conflicts. Notably, \( L^\infty\) -PINN forces gradient alignment at the most critical points. As a result, the probability of gradient consistency can be statistically higher. However, this alignment is local and passive, and does not fundamentally fix the pathological structure of the loss landscape. Consequently, it is not equivalent to genuine progress toward the correct solution.
Figure 11 presents the relative error and parameter update distance curves of all loss functions for the MLP model on the Heat benchmark. The results indicate that CAML exhibits a substantially faster optimization speed from initialization compared to other baselines. Moreover, excluding the two suboptimal methods (DB-PINN and \( L^\infty\) -PINN), CAML traverses a shorter path in the parameter space.
Figure 1.6.2 shows the changes in \( c\) during the training iterations, which gradually converge to a fixed value as training progresses. In the linear cases, \( c\) admits an analytical optimum, and thus the loss is strictly reduced regardless of its value. Therefore, larger variations in \( c\) at initial stage are, in fact, expected to yield greater stability, as they indicate that CAML actively aligns more pathological conflicts. In the nonlinear case, however, the small errors introduced by Newton’s method can affect the accuracy of the final solution. This is why we introduce \( t_c\) and fix \( c\) once its variation becomes sufficiently small, as stated in Appendix B.1.2, thereby avoiding training noise caused by minor fluctuations.
We also analyze computational overhead, the difference between aligned constraints and learnable bias, as well as parameter sensitivity, as detailed in Appendices C.4, C.5, and C.6.
To evaluate the contribution of each component, we conducted an ablation study under four configurations: (i) original PINN, (ii) Aligned Constraint (AC) only, (iii) Delay-Residual (DR) only, and (iv) full CAML. We conducted experiments on MLP on both Heat and Poisson benchmarks, and all configurations are consistent with main experiment.
RQ3: Ablation Study. We report all results in Table 4. On the Heat benchmark, AC makes the most significant contribution, markedly improving gradient consistency and accelerating convergence. When DR is applied independently, it also yields partial improvements in convergence speed. However, in the full CAML setting, no additional benefit is observed beyond that provided by AC alone. In contrast, for benchmarks such as Poisson with strongly nonlinear boundary conditions, although AC substantially enhances gradient consistency, it still fails to reach the target accuracy within the fixed epoch budget. DR, on the other hand, significantly improves convergence efficiency, and its combination with AC achieves the best overall performance. These results suggest that for problems with relatively simple boundary conditions, AC alone is sufficient to substantially enhance training stability, whereas for more complex boundary conditions, the integration of both mechanisms yields superior performance.
| Heat | Poisson | |||||
| Stp | \( L_2\) | \( \cos(\phi)\) | Stp | \( L_2\) | \( \cos(\phi)\) | |
| PINN | 10127\( _{\pm 2839}\) | 7.52e-3\( _{\pm 4\text{e-}3}\) | 5.89% | \( -\) | \( -\) | 3.85% |
| AC-Only | 1491\( _{\pm 186}\) | 1.15e-3\( _{\pm 6\text{e-}5}\) | 36.12% | \( -\) | \( -\) | 45.22% |
| DR-Only | 8367\( _{\pm 991}\) | 7.16e-3\( _{\pm 4\text{e-}3}\) | 4.89% | 9648\( _{\pm 3075}\) | 4.63e-2\( _{\pm 3\text{e-}2}\) | 14.62% |
| CAML | 1577\( _{\pm 130}\) | 1.16e-3\( _{\pm 8\text{e-}5}\) | 39.16% | 3868\( _{\pm 1932}\) | 5.00e-3\( _{\pm 9\text{e-}4}\) | 52.75% |
In this section, we compare CAML’s performance across four optimizers and verify its compatibility with a range of advanced optimizers. The optimizers we have chosen include: (i) Adam [36], learning rate \( \eta = 1\mathrm{e}{-3}\) , \( \beta_1=0.9\) , \( \beta_2=0.999\) , (ii) L-BFGS [39], we use PyTorch L-BFGS with strong Wolfe line search (default), max history size \( m=50\) , max line-search steps 25, (iii) DCGD [24], same settings as Adam, and (iv) ConFIG [25], same settings as Adam. We used the Heat benchmark with the MLP backbone, following the same setup as the main experiments. Because L-BFGS steps are expensive, we limit the number of L-BFGS iterations to \( T_{\min}^{\text{LBFGS}}=300\) (which typically matches or exceeds the wall-clock time of 6000 Adam steps) and \( T_{\max}^{\text{LBFGS}}=1000\) . In addition, the delayed residuals were disabled in L-BFGS to prevent the gradient explosion caused by the distortion of the Hessian approximation. We also compared the final precision of all optimizers, and Table 5 provides a compact summary.
| Stp | \( L_2@T_{\min}\) | \( L_2@T_{\max}\) | \( \cos(\phi)\) | ||
| Adam | [36] | 1577\( _{\pm 130}\) | 1.16e-3\( _{\pm 8\text{e-}5}\) | 1.12e-3\( _{\pm 4\text{e-}5}\) | 39.16% |
| L-BFGS | [39] | 55\( _{\pm 3}\) | 6.93e-4\( _{\pm 3\text{e-}6}\) | 6.93e-4\( _{\pm 2\text{e-}6}\) | 26.00% |
| DCGD | [24] | 991\( _{\pm 372}\) | 1.15e-3\( _{\pm 8\text{e-}5}\) | 1.08e-3\( _{\pm 3\text{e-}5}\) | 36.37% |
| ConFIG | [25] | 738\( _{\pm 238}\) | 1.13e-3\( _{\pm 6\text{e-}5}\) | 1.06e-3\( _{\pm 2\text{e-}5}\) | 37.14% |
RQ4: Optimizer Selection. We observe that conflict-aware methods (DCGD and ConFIG) consistently provide more favorable optimization dynamics than standard first- and second-order baselines. ConFIG achieves the best trade-off between convergence speed and final accuracy, indicating that explicitly resolving gradient conflicts benefits both efficiency and stability. In addition, L-BFGS achieves the best final accuracy among all optimizers, indicating that second-order curvature information is particularly effective in refining the solution once the optimization trajectory enters a favorable region. Based on these observations, we recommend a two-stage training strategy: first, use a conflict-aware first-order optimizer for rapid convergence, and then switch to L-BFGS for fine-grained refinement.
RQ5: Applicability and Failure Modes. CAML targets the residual-induced loss valley analyzed in Section 1.4, and is most effective when (i) the PDE or BC contains at least one non-trivial zeroth-order term, (ii) PDE coefficients are large enough to induce \( \|\nabla_\theta\mathcal{L}_\text{res}\| \gg \|\nabla_\theta\mathcal{L}_\text{bc}\|\) , or (iii) BCs are composite or strongly nonlinear. Conversely, CAML degenerates to standard PINN under pure Neumann BCs without zeroth-order terms (the offset is annihilated by derivative operators), and yields limited gains when the landscape is already well-conditioned (\( c\approx 0\) throughout training, see our demonstration in Appendix C.7).
This paper explains the gradient pathologies of PINNs under composite boundary conditions from the perspective of loss geometry, showing that gradient conflicts between PDE residuals and boundary terms hinder optimization. To address this, we propose CAML, which aligns constraints via manifold lifting and introduces a residual-delay strategy to stabilize training. Extensive experiments demonstrate that CAML consistently improves convergence and stability across diverse PDEs and network architectures, highlighting its potential as a general PINN loss framework.
Future work will extend this approach to time-dependent (dynamic) PDEs with initial conditions. This will require extending our analytical framework to address the more complex three-way gradient interaction and reformulating the optimal offset as a time-dependent function \( c(t)\) . To achieve this, \( c(t)\) could be solved independently at each time step for linear equations, or approximated via time-domain partitioning and low-dimensional parameterizations for nonlinear cases. See Appendix B.3 for more future work.
Yichen Luo is sponsored by the China Scholarship Council for his PhD study at KTH Royal Institute of Technology, Sweden (No. 202408320060). This research was also supported in part by the National Natural Science Foundation of China (Grant No. 72401232) and XJTLU Technical Service (2024-015).
This paper presents a methodological effort aimed at improving Physics-Informed Neural Networks (PINNs) by mitigating their inherent gradient pathologies. PINNs are increasingly critical in scientific computing, with broad applications ranging from fluid dynamics and climate modeling to advanced manufacturing and thermal engineering. By significantly improving the convergence, efficiency, and reliability of these models, our proposed CAML framework can accelerate complex engineering designs and scientific discoveries. As a fundamental algorithmic and theoretical contribution, this work does not present any immediate ethical concerns or negative societal consequences.
Yichen Luo conceptualized the research, proposed the CAML methodology, conducted the primary theoretical analysis, developed the experimental codebase, performed the majority of the numerical experiments, and wrote the original manuscript. Peiyu Zhu provided general support during the experimental stages of the project. Dongxiao Hu executed a significant portion of the numerical experiments and authored the related work section. Jia Wang refined the manuscript’s presentation and participated in regular project meetings. Tailin Wu verified the formal theoretical proofs and provided critical feedback on the theoretical analysis. Dapeng Lan and Yu Liu participated in broad project discussions. Zhibo Pang supervised the project, acquired funding, and oversaw the final manuscript revision.
In this section, we provide a rigorous justification for the existence of loss valleys in PINNs. The analysis proceeds in three stages: (i) we establish the non-uniqueness of solutions in the functional space induced by the governing differential operator, (ii) we show that this functional non-uniqueness implies the existence of flat, connected sets of global minima of the PINN loss, and (iii) we demonstrate how such functional non-uniqueness can be mapped to the parameter space of a neural network, thereby forming a loss valley.
We first formalize the fact that, in the absence of sufficient boundary or initial conditions, the governing differential operator admits a non-trivial solution set.
Theorem 2 (Non-Triviality of the Solution Space)
Let \( \mathcal{N}\) be a differential operator that defines the PDE constraint \( \mathcal{N}[u]=f\) on a function space \( \mathcal{F}(\Omega)\) . If \( \mathcal{N}\) does not uniquely determine a solution in \( \mathcal{F}(\Omega)\) (e.g., due to missing boundary or initial conditions), then the solution set
(20)
is non-trivial and contains non-isolated solutions. Moreover, around any regular solution \( u^* \in \mathcal{S}_{\mathrm{res}}\) , the solution set locally contains a continuously parameterized family of solutions of dimension at least one.
Proof
Let \( u^*\) be a solution that satisfies \( \mathcal{N}[u^*]=f\) .
(a) Linear case. If \( \mathcal{N}\) is linear and the solution is not uniquely determined, then its kernel \( \ker(\mathcal{N})\) is non-trivial. For any \( v \in \ker(\mathcal{N})\) and scalar \( \alpha \in \mathbb{R}\) ,
(21)
Hence, the solution set contains an affine subspace passing through \( u^*\) and is therefore non-isolated.
(b) Nonlinear case. For nonlinear operators, the solution set need not be globally finite-dimensional. We therefore restrict attention to a local neighborhood of a regular solution. Assume that \( \mathcal{N}\) is Fr
specialChar{39}echet differentiable in a neighborhood of \( u^*\) and that the linearized operator
(22)
has a non-trivial kernel. Suppose further that \( u^*\) is a regular point in the sense that \( D\mathcal{N}[u^*]\) satisfies a constant-rank or regular-value condition (e.g., it is surjective and its kernel is complemented in \( \mathcal{F}(\Omega)\) ).
Under these standard assumptions, Banach-space implicit function or constant-rank theorems imply that the zero set of \( \mathcal{N}\) is locally a smooth embedded submanifold of \( \mathcal{F}(\Omega)\) whose dimension is equal to \( \dim \ker(D\mathcal{N}[u^*]) \ge 1\) . Consequently, there exists a neighborhood \( V\) of \( u^*\) and an open set \( U \subset \mathbb{R}^k\) , \( k \ge 1\) , such that
(23)
where \( \mathbf{c}\) parameterizes a continuous family of solutions. This establishes the local non-uniqueness of the solution space.
We next show that the functional non-uniqueness of solutions induces flat, connected sets of global minima of the PINN loss in function space.
Proposition 1 (Flat Directions in Function Space)
Any continuously parameterized family of solutions
(24)
forms a flat, connected set of global minimizers of loss \( \mathcal{L}_{\mathrm{func}}\) in function space.
Proof
By Theorem 2, the global minimizers of \( \mathcal{L}_{\mathrm{func}}\) coincide with \( \mathcal{S}_{\mathrm{res}} = { u \in \mathcal{F}(\Omega) \mid \mathcal{N}[u] = f }\) .
(a) Global minimality. For any \( \mathbf{c} \in U\) , we have \( u(\cdot;\mathbf{c}) \in \mathcal{S}_{\mathrm{res}}\) , hence \( \mathcal{N}[u(\cdot;\mathbf{c})] = f\) . By assumption,
(25)
Since \( \mathcal{L}_{\mathrm{func}}(u) \ge 0\) for all \( u\) , the global minimum value is \( 0\) , and every element of the family is therefore a global minimizer.
(b) Connectedness. The map \( \mathbf{c} \mapsto u(\cdot;\mathbf{c})\) is continuous by Equation 6 in the main text. If \( U \subset \mathbb{R}^k\) is connected, then its continuous image \( {u(\cdot;\mathbf{c}) \mid \mathbf{c} \in U}\) is also connected.
(c) Flatness. For any smooth curve \( \mathbf{c}(t) \subset U\) , define \( u_t := u(\cdot;\mathbf{c}(t))\) . From the first part,
(26)
Hence the loss remains constant along any such direction, so all tangent directions to the solution family are flat directions. Therefore, the continuously parameterized solution family forms a flat, connected set of global minimizers of \( \mathcal{L}_{\mathrm{func}}\) .
Finally, we show that functional non-uniqueness can be transferred to the parameter space of a neural network, leading to loss valleys in practice.
Let \( u_\theta \in \mathcal{F}(\Omega)\) denote the neural network approximation parameterized by \( \theta \in \mathbb{R}^P\) , and define the parameter-space loss
(27)
Assumption 1 (Representability of a Solution Family)
There exists a \( \mathcal{C}^1\) mapping \( \phi : U \subset \mathbb{R}^k \rightarrow \mathbb{R}^P\) such that
(28)
where \( { u(\cdot;\mathbf{c}) }\) is a solution family as in Proposition 1. Moreover, \( \phi\) is locally non-degenerate in the sense that there exists \( \mathbf{c}^* \in U\) and a direction \( v\in\mathbb{R}^k\) such that
(29)
where \( J\phi(\mathbf{c}^*)\) denotes the Jacobian of \( \phi\) at \( \mathbf{c}^*\) .
Assumption 2 (Local \( \mathcal{C}^2\) Regularity of the Loss)
The parameter-space loss \( \mathcal{L}(\theta)\) is twice continuously differentiable in a neighborhood of \( \Theta_{\mathrm{valley}}\) .
Assumption 1 only requires that the neural network can locally represent a non-isolated solution family, which is consistent with the expressive capacity and over-parameterization of modern neural networks. Assumption 2 is also standard, since PINNs typically adopt globally differentiable activation functions, such as Tanh, Sigmoid, or Sine, together with differentiable PDE operators. Therefore, the resulting residual loss is generally smooth in the relevant parameter region.
Theorem 3 (Existence of Loss Valleys in Parameter Space)
Under Assumptions 1 and 2, the parameter-space loss \( \mathcal{L}(\theta)\) admits a flat, connected set of global minimizers
(30)
Moreover, at any \( \theta^* \in \Theta_{\mathrm{valley}}\) , the Hessian \( \nabla^2_\theta \mathcal{L}(\theta^*)\) has at least one zero eigenvalue corresponding to a flat direction.
Proof
For any \( \mathbf{c} \in U\) ,
(31)
so \( \Theta_{\mathrm{valley}}\) consists entirely of global minimizers. Continuity of \( \phi\) implies that \( \Theta_{\mathrm{valley}}\) is path-connected.
Fix any \( \theta^*=\phi(\mathbf{c}^*)\in \Theta_{\mathrm{valley}}\) . Since \( \theta^*\) is a global minimizer and \( \mathcal{L}\ge 0\) , we have
(32)
By Assumption 1, choose \( v\in\mathbb{R}^k\) such that \( w := J\phi(\mathbf{c}^*)\,v \neq 0\) . Let \( \mathbf{c}(t)=\mathbf{c}^*+tv\) (for \( t\) small) and define \( \theta(t):=\phi(\mathbf{c}(t))\) , so that \( \theta(t)\subset \Theta_{\mathrm{valley}}\) and \( \dot{\theta}(0)=w\neq 0\) .
Then \( \mathcal{L}(\theta(t))\equiv 0\) for all sufficiently small \( t\) , hence
(33)
and by Assumption 2,
(34)
Since \( \theta^*\) is a (global) minimizer and \( \mathcal{L}\) is \( \mathcal{C}^2\) near \( \theta^*\) , the Hessian \( \nabla^2_\theta\mathcal{L}(\theta^*)\) is positive semidefinite. Therefore, the existence of a nonzero \( w\) such that \( w^\top\nabla^2_\theta\mathcal{L}(\theta^*)w=0\) implies that \( \nabla^2_\theta\mathcal{L}(\theta^*)\) has at least one zero eigenvalue.
Remark 1 (Inevitability of Loss Valley)
The above analysis shows that loss valleys in the PDE residual term arise naturally from the structural non-uniqueness of the underlying PDE when insufficient constraints are imposed. Importantly, the argument relies only on local and generic properties of the solution set and does not require global uniqueness or finite-dimensionality of the full solution manifold. This explains why flat loss landscapes are commonly observed in practical PINN training.
In this section, we provide a theoretical explanation for why gradients induced by PDE residual and boundary condition losses tend to be conflicting during PINN optimization. The analysis proceeds in two steps: (i) we characterize the residual-induced low-loss set as a manifold in parameter space, and (ii) we show that boundary-driven descent directions generically contain a normal component opposing the residual gradient, yielding a negative inner product.
Let \( \mathcal{L}_{\mathrm{res}}(\theta)\) and \( \mathcal{L}_{\mathrm{bc}}(\theta)\) denote the PDE residual loss and boundary condition loss, respectively. We define the low-residual set
(35)
where \( \varepsilon>0\) is a threshold that marks the transition from residual-dominated to boundary-influenced gradients. This set corresponds to the residual-induced loss valley discussed in Appendix A.1.
Assumption 3 (Smooth Manifold Approximation)
There exists a neighborhood \( \mathcal{U}\) of \( \mathcal{M}_{\mathrm{res}}\) such that \( \mathcal{M}_{\mathrm{res}}\cap \mathcal{U}\) is a smooth embedded submanifold of \( \mathbb{R}^P\) . Moreover, for every \( \theta \in \mathcal{M}_{\mathrm{res}}\cap \mathcal{U}\) , the gradient \( \nabla_\theta \mathcal{L}_{\mathrm{res}}(\theta)\) is normal to \( \mathcal{M}_{\mathrm{res}}\) .
The validity of this assumption is supported from two aspects. Theoretically, Theorem 2 proves that the zero-residual set \( \mathcal{S}_{\mathrm{res}}\) is indeed a smooth manifold in function space, and the smooth parameterization induced by overparameterized neural networks preserves this local structure. Empirically, the extremely large condition numbers of the Hessian and the presence of clear low-curvature subspaces reported in Table 1 also support the existence of such a manifold structure.
Lemma 2 (Normal–Tangent Decomposition of Boundary Gradients)
Under Assumption 3, for any \( \theta \in \mathcal{M}_{\mathrm{res}} \cap \mathcal{U}\) , the boundary gradient admits an orthogonal decomposition into tangent and normal components with respect to \( \mathcal{M}_{\mathrm{res}}\) ,
(36)
where \( \nabla_\theta^{\top} \mathcal{L}_{\mathrm{bc}}(\theta) \in T_\theta \mathcal{M}_{\mathrm{res}}\) and \( \nabla_\theta^{\perp} \mathcal{L}_{\mathrm{bc}}(\theta) \in N_\theta \mathcal{M}_{\mathrm{res}}\) . Consequently,
(37)
Proof
By definition of orthogonal projection, we have
(38)
with \( P_T(\theta)\nabla_\theta \mathcal{L}_{\mathrm{bc}}(\theta)\in T_\theta \mathcal{M}_{\mathrm{res}}\) and \( P_N(\theta)\nabla_\theta \mathcal{L}_{\mathrm{bc}}(\theta)\in N_\theta \mathcal{M}_{\mathrm{res}}\) . Under Assumption 3, \( \nabla_\theta \mathcal{L}_{\mathrm{res}}(\theta)\in N_\theta \mathcal{M}_{\mathrm{res}}\) , hence it is orthogonal to \( T_\theta \mathcal{M}_{\mathrm{res}}\) . Therefore,
(39)
and the stated identity follows.
We now formalize the fact that the boundary gradient generically contains a normal component opposing the residual gradient.
Assumption 4 (Geometric Complexity of Loss Valleys in Parameter Space)
We assume that the low-loss manifolds induced by the PDE residual \( \mathcal{M}_{\mathrm{res}}\) form highly non-linear, curved, and geometrically complex valleys in the parameter space, as a consequence of the neural network parameterization.
Assumption 5 (Local Normal Preference)
Assume Assumptions 3 and 4. For each relevant point \( \bar\theta \in \mathcal M_{\mathrm{res}}\cap U\) along the Phase II trajectory with \( \nabla_\theta \mathcal L_{\mathrm{res}}(\bar\theta)\neq 0\) , define
(40)
We assume that there exist a local neighborhood \( V_{\bar\theta}\subset U\) of \( \bar\theta\) and a constant \( \kappa_{\bar\theta}>0\) such that, within \( V_{\bar\theta}\) , the boundary-loss gradient has a nonzero local preference toward one normal side of the low-residual manifold:
(41)
or
(42)
In other words, near each relevant point on the Phase II trajectory, the boundary loss exhibits a consistent first-order tendency toward one normal side of \( \mathcal M_{\mathrm{res}}\) . We do not require this preferred side to remain fixed globally along the entire trajectory.
Assumption 4 can be directly observed in Figures 1 and C.1.2, as well as Table 1: the simple linear valley in function space becomes a twisted, non-convex manifold in parameter space. Assumption 5 follows directly from continuity: whenever \( \nabla_\theta \mathcal{L}_{\mathrm{bc}}(\bar\theta)\) has a non-vanishing normal component at \( \bar\theta\) , the continuity of \( \theta \mapsto \langle \nabla_\theta \mathcal{L}_{\mathrm{bc}}(\theta), v(\bar\theta)\rangle\) guaranties a sign-preserving neighborhood \( V_{\bar\theta}\) with \( \kappa_{\bar\theta} = \tfrac{1}{2}|\langle \nabla_\theta \mathcal{L}_{\mathrm{bc}}(\bar\theta), v(\bar\theta)\rangle|\) , while the degenerate case of exact tangency is non-generic and breaks under arbitrarily small perturbations.
Lemma 3 (Gradient Conflict Near the Low-Residual Manifold)
Under Assumptions 3 and 4, consider any \( \theta \in \mathcal{M}_{\mathrm{res}} \cap \mathcal{U}\) such that the boundary gradient has a non-vanishing normal component, i.e., \( \nabla_\theta^{\perp} \mathcal{L}_{\mathrm{bc}}(\theta) \neq \boldsymbol{0}\) . Then the occurrence of a negative inner product
(43)
occurs generically along the optimization trajectory.
Proof
By Lemma 2, only the normal component of the boundary gradient \( \nabla_\theta^{\perp} \mathcal{L}_{\mathrm{bc}}(\theta)\) contributes to the interaction. Moreover, by Assumption 3, \( \nabla_\theta \mathcal L_{\mathrm{res}}(\theta)\in N_\theta \mathcal M_{\mathrm{res}}\) is normal to \( \mathcal M_{\mathrm{res}}\) for all \( \theta\in\mathcal M_{\mathrm{res}}\cap U\) . Let
(44)
be the unit normal given by the residual gradient at \( \theta\) (when \( \nabla_\theta \mathcal L_{\mathrm{res}}(\theta)\neq 0\) ; otherwise the desired strict inequality is trivial to obtain in an arbitrarily small neighborhood).
Under Assumption 5, for each relevant point on the Phase II trajectory, there exists a local neighborhood in which the boundary gradient has a nonzero projection toward one normal side of \( M_{\mathrm{res}}\) . Hence, after possibly restricting the analysis to this local neighborhood and choosing the corresponding orientation of the normal direction, we have either \( \langle \nabla_\theta L_{\mathrm{bc}}(\theta), v(\theta) \rangle < 0\) or \( \langle \nabla_\theta L_{\mathrm{bc}}(\theta), v(\theta) \rangle > 0\) . Without loss of generality, consider the former case (the latter is symmetric). Then
(45)
Because \( v(\theta)\in N_\theta\mathcal M_{\mathrm{res}}\) , the inner product with \( v(\theta)\) depends only on the normal projection of the boundary gradient:
(46)
Hence,
(47)
which, together with Lemma 2, completes the proof of Lemma 3.
Remark 2 (Generic Occurrence of Gradient Conflicts)
In Phase II the iteration is confined near \( \mathcal M_{\mathrm{res}}\) (small residual), while boundary descent attempts to move toward the boundary-preferred side (a nonzero normal component by hypothesis \( \nabla_\theta^\perp\mathcal L_{\mathrm{bc}}(\theta)\neq 0\) ). As soon as the iterate is pushed to that side, \( \mathcal L_{\mathrm{res}}\) increases, and the gradient descent update induced by the residual term, i.e., \( -\nabla_\theta \mathcal L_{\mathrm{res}}\) , drives the iterate back toward the low-residual manifold. Therefore, whenever both mechanisms are active, the residual normal direction and the boundary normal direction are opposite, producing a negative inner product.
Remark 3 (Scope and Empirical Validation)
In practice, owing to the complexity of neural network parameterizations, the geometry of loss valleys in the parameter space is difficult to characterize explicitly. Moreover, different PDEs and boundary conditions give rise to loss valleys with distinct geometric properties. Consequently, the above discussion focuses on high-probability behaviors in generic settings. For specific problem instances, empirical studies are required to quantitatively assess the likelihood of gradient conflicts.
In this section, we complete the theoretical justification of the aligned constraint formulation introduced in the main text. In particular, (i) we show that introducing an additive offset enlarges the set of admissible (low-loss) solutions; (ii) we interpret this effect as a thickening of the boundary-constraint set along an additive direction; and (iii) we discuss how this reduces the traversal required within the residual-induced loss valley.
We now show that optimizing over \( c\) can only decrease the loss, which implies that low-loss regions (sublevel sets) in parameter space become larger.
Let \( u_\theta\) be a neural approximation and define (discrete) aligned losses as in the main text:
(48)
Let \( \mathcal{L}^{\mathrm{std}}(\theta)\) denote the standard PINN loss obtained by fixing \( c=0\) in the same residual expressions:
(49)
Proposition 2 (Sublevel-Set Enlargement)
For all \( \theta\) , \( \overline{\mathcal{L}}^{\mathrm{alg}}(\theta)\le \mathcal{L}^{\mathrm{std}}(\theta)\) . Consequently, for any \( \varepsilon\ge 0\) ,
(50)
Proof
By the definition of \( \overline{\mathcal{L}}^{\mathrm{alg}}(\theta)\) as a minimum over \( c\) ,
(51)
which proves the pointwise inequality. The sublevel-set inclusion Equation (50) follows immediately.
We clarify the operational mechanism of aligned constraints and their effect on the feasible set of PDE-constrained learning.
Assumption 6 (Well-Posedness / Uniqueness of the Governing PDE)
Assume the original PDE with the given boundary condition is well-posed and admits a unique solution \( u^*\in\mathcal{F}(\Omega)\) .
Proposition 3 (Enlarged Feasible Overlap under Aligned Constraints)
Assume that either the PDE or the boundary conditions contain at least one non-trivial zeroth-order term \( \mathcal{Z}\) , and that \( \mathcal{Z}\) depends only on the values of \( u\) . Define the aligned PDE zero-residual set and boundary condition zero-residual set as
(52)
Under Assumption 6, since the original PDE has a unique solution \( u^* \in \mathcal{S}_{\mathrm{res}} \cap \mathcal{S}_{\mathrm{bc}}\) , then
(53)
which implies that the aligned constraints enlarge the intersection of the PDE and boundary-admissible solution sets \( \mathcal{S}_{\mathrm{res}}\) and \( \mathcal{S}_{\mathrm{bc}}\) into an affine subspace obtained by linear shifts of the original solution, analogous to the solution structure of PDEs without zeroth-order terms under pure Neumann boundary conditions.
Proof
Under Assumption 6, if the original problem is well-posed with unique solution \( u^*\in \mathcal{S}_\mathrm{res}\cap \mathcal{S}_\mathrm{bc}\) , then for any \( u\in \mathcal{S}^{\mathrm{alg}}_{\mathrm{res}}(c)\cap \mathcal{S}^{\mathrm{alg}}_{\mathrm{bc}}\) the reconstructed field \( \tilde u=u+c\cdot\mathbf{1}\) satisfies the PDE and boundary condition by definition of \( \mathcal{S}^{\mathrm{alg}}_{\mathrm{res}}\) and \( \mathcal{S}^{\mathrm{alg}}_{\mathrm{bc}}\) . Since derivative terms are invariant under constant shifts,
(54)
and hence
(55)
so \( \tilde u\in \mathcal{S}_\mathrm{res}\cap \mathcal{S}_\mathrm{bc}={u^*}\) and \( u=u^*-c\cdot\mathbf{1}\) .
Corollary 1 (Aligned Minima Contain a Constant-Mode Valley in Parameters)
Under Assumption 6, the reconstructed function \( \widetilde u_{\theta,c}\) is unique at global optimality, but the parameter pair \( (\theta,c)\) need not be. In particular, if the network class is closed under constant shifts (or approximately so), then there may exist multiple \( (\theta,c)\) pairs producing the same \( \widetilde u_{\theta,c}=u^*\) , creating a connected minimizer set in \( (\theta,c)\) -space, and hence a wider low-loss region in \( \theta\) after optimizing out \( c\) .
Remark 4 (Implications for Optimization)
In parameter space, aligned constraints transform a low-dimensional manifold intersection problem into the search over a neighborhood with an explicit degree of freedom given by \( c\) . For each near-residual-minimizing parameter \( \theta\) , an optimal offset \( c^*(\theta)\) can be chosen to reduce boundary violations, thereby improving robustness and convergence of training.
In this section, we derive the closed-form solution for constant offset \( c\) when both the PDE and the boundary conditions have zeroth-order terms that are linear operators. We work with the discrete residuals evaluated at interior points \( {x_i}_{i=1}^{N_{\mathrm{res}}}\subset\Omega\) and boundary points \( {x_b}_{b=1}^{N_{\mathrm{bc}}}\subset\Gamma\) . Assume the zeroth-order PDE term is linear in \( u\) at each interior collocation point:
(56)
Let the boundary zeroth-order coefficient be \( \alpha_b\) at \( x_b\) (consistent with the unified boundary form). After introducing the additive offset \( c\in\mathbb{R}\) only inside zeroth-order terms, the aligned residuals become affine in \( c\) :
(57)
The corresponding weighted aligned objective is
(58)
Differentiate with respect to \( c\) :
(59)
Setting \( \mathcal{J}'(c)=0\) gives the normal equation, we have
(60)
Collecting the terms that multiply \( c\) and defining
(61)
then we can rewrite Equation (60) as
(62)
When \( A>0\) , the quadratic \( \mathcal{J}(c)\) is strictly convex because
(63)
and therefore the stationary point is the unique global minimizer. Solving Equation (62) yields the closed-form optimal offset
(64)
In this section, we consider the case where the zeroth-order PDE operator \( \mathcal{Z}[\cdot]\) is nonlinear in \( u\) . As before, the network output \( u_\theta\) is treated as fixed when optimizing the constant offset \( c\) . After introducing \( c\) only inside zeroth-order terms, the aligned residuals \( \bar r_i(c)\) and \( \bar s_b(c)\) are generally nonlinear functions of \( c\) . The weighted aligned objective \( \mathcal{J}(c)\) is defined as in Equation (58), but no closed-form minimizer exists in this case. We therefore apply Newton’s method for several steps to update \( c\) :
(65)
where \( \mathcal{J}'(c_k)\) and \( \mathcal{J}''(c_k)\) is calculated by the automatic differentiation tool of PyTorch.
In practice, we use a small number of Newton inner iterations. To stabilize early training, we perform more inner steps \( K_{\mathrm{init}}\) in the first optimization iteration, and then reduce to one or two steps \( K_{\mathrm{few}}\) afterwards. After the training has stabilized and converged (\( t \ge t_c\) ), we fix the value of \( c\) and no longer update it in order to further improve accuracy. Throughout, \( c\) is treated as a constant during backpropagation, and gradients are computed only with respect to \( u_\theta\) .
We briefly analyze the local non-degeneracy of the 1D subproblem \( \mathcal{J}(c)\) defined in Equation (14), which underlies the local convergence of Newton’s iteration in Equation (65). Differentiating \( \mathcal{J}\) twice with respect to \( c\) gives
(66)
where \( A(c) := \sum_i [\bar{r}_i'(c)^2 + \bar{r}_i(c)\bar{r}_i''(c)]\) and \( B(c) := \sum_b [\bar{s}_b'(c)^2 + \bar{s}_b(c)\bar{s}_b''(c)]\) , with primes denoting derivatives with respect to \( c\) .
For Dirichlet, Neumann, and Robin boundaries, \( \bar{s}_b(c)\) is affine in \( c\) , so \( B(c) = \sum_b \bar{s}_b'(c)^2 \geq 0\) . The only potential source of negative curvature is the cross term \( \sum_i \bar{r}_i \bar{r}_i''\) in \( A(c)\) , which we cannot guarantee to be non-negative for arbitrary nonlinear \( \mathcal{Z}\) . In practice, however, the signed quantities \( \bar{r}_i \bar{r}_i''\) exhibit partial cancellation across collocation points, whereas \( \bar{r}_i'^2\) accumulates with \( N_{\mathrm{res}}\) , so \( \mathcal{J}''(c^\star) > 0\) holds in the regimes encountered during training and Newton's method retains its quadratic local convergence rate. In the rare case where this fails, the Newton step can be safeguarded by a line search, or replaced by first-order alternatives such as gradient descent or the secant method.
We choose to align the model and solution at constant-mode because it provides the best benefit-to-cost trade-off for the common types of PDE, as it relaxes both the PDE residual and the boundary constraints while keeping the inner solve 1D (closed-form in linear cases). Furthermore, it ingeniously utilizes the additive invariance of the derivative terms, thus allowing the calculation of \( c\) to only consider the influence of zeroth-order terms. Unless for special PDEs, extending the lifting to nontrivial modes would require incorporating derivative terms into the Newton updates and computing higher-order/mixed auto differentiation throughout the network at every step, which is prohibitively expensive and numerically fragile in practice. Therefore, it is difficult to quantify its benefits across different physical fields.
Nevertheless, we do not rule out the possibility that, for certain special classes of PDEs, there exist alignment modes that are both inexpensive and potentially more beneficial. Future work may systematically characterize the structural features of these low-cost, high-gain modes and explore how constant-mode alignment can be extended to broader yet still low-dimensionally parameterized alignment spaces without compromising numerical stability.
In this section, we detail how we generate the two subfigures in Figure 1, which visualize the PDE-residual loss landscape projected onto a 2D subspace, in (i) the function space and (ii) the parameter space. The goal is to highlight that, although the residual-induced valley is relatively simple in the function space, it becomes highly distorted and non-convex after being mapped through a neural parameterization, consistent with the discussion in Section 1.4.1.1.
We consider a one-dimensional Poisson-type equation on the interval \( [0,\pi]\) :
(67)
Given a candidate solution \( u(x)\) , the residual loss over a set of \( N\) collocation points \( {x_i}_{i=1}^N\) is
(68)
For all experiments in Figure 1, we use \( N=100\) uniformly spaced collocation points on \( [0,\pi]\) .
To visualize the loss landscape in the function space, each candidate solution \( u(x)\) is represented by its values at the collocation points:
(69)
Thus, the infinite-dimensional function space is approximated by an \( N\) -dimensional Euclidean space with \( N=100\) .
We generate candidate solutions using a simple sinusoidal ansatz
(70)
where \( \theta = (\omega, \phi, A, B)\) .
Three solutions are constructed by directly setting the parameters:
(71)
Each solution is evaluated on the collocation grid, yielding three vectors \( \mathbf{u}_1, \mathbf{u}_2, \mathbf{u}_3 \in \mathbb{R}^{100}\) . This parameter combination ensures that different solutions only differ in terms of the additive constant \( B\) . In the function space, it forms a straight zero-residual manifold along the dimension of \( B\) , which is the loss valley.
We define a two-dimensional affine subspace in \( \mathbb{R}^{100}\) spanned by the three solution vectors. The center of the plane is chosen as \( \mathbf{c} = \mathbf{u}_1\) . The two orthonormal directions are constructed via Gram–Schmidt:
(72)
Any point on this plane is parameterized by scalars \( (\alpha, \beta)\) :
(73)
Since only the discretized values \( \mathbf{u}(\alpha,\beta)\) are available, the second derivative \( u''(x)\) is approximated using second-order finite differences on the uniform grid. The resulting landscape is visualized using contour plots with logarithmic scaling to capture variations across multiple orders of magnitude. The red curve in Figure 1(a) illustrates a low-loss path within this projected landscape.
To visualize Figure 1(b), we employ a fully connected feedforward neural network with three hidden layers. The network is trained under three analytical solutions defined by Section C.1.1, and each instance is optimized until convergence to a very low residual loss. Specifically, the loss function we use is composed of the following elements:
(74)
where \( \mathcal{L}_\mathrm{data}\) is the solution value obtained from Equation (70).
The experimental hyperparameters are shown in Table 6.
| Category | Hyperparameter | Value |
| Network Architecture | Number of hidden layers | \( 3\) |
| Neurons per hidden layer | \( 20\) | |
| Activation function | tanh | |
| Training Setup | Optimizer | Adam |
| Learning rate | \( 1\times10^{-3}\) | |
| Batch size | Full-batch | |
| Total training steps | \( 3000\) | |
We denote the resulting parameter configurations by \( \theta_1, \theta_2, \theta_3 \in \mathbb{R}^P\) , corresponding to the three independently trained models. A two-dimensional affine subspace in parameter space is constructed with center \( \theta_c = \theta_1\) and directions
(75)
Any parameter point on this plane is given by
(76)
Although the residual loss valley is relatively simple and flat in the function space, the nonlinear mapping from parameters \( \theta\) to functions \( u_\theta(\cdot)\) can significantly distort this structure. Consequently, the projected loss landscape in parameter space exhibits curvature, folding, and local non-convexity, as illustrated in Figure 1(b).
We also conducted experiments with more random initializations, as shown in Figure C.1.2. Note that due to the limitation of two-dimensional slicing, Figure C.1.2 provides a cross-sectional view of the high-dimensional loss landscape, only revealing the rough trajectory of the loss valley rather than indicating that all points along the valley bottom attain zero loss.
We record and analyze the Hessian from the above two experiments in both function and parameter space in Table 1. Let \( f:\mathbb{R}^d \to \mathbb{R}\) be twice continuously differentiable. The condition number of the Hessian matrix \( H(x) = \nabla^2 f(\theta)\) (at a point \( \theta\) ) is defined as
(77)
where \( \lambda_{\max}(\cdot)\) and \( \lambda_{\min}(\cdot)\) denote the largest and smallest eigenvalues, respectively. A larger condition number indicates a more ill-conditioned loss landscape, characterized by elongated, valley-like level sets. In contrast, when the condition number is small, the contour lines become nearly spherical, suggesting that the curvature is more uniform across different directions.
We also compute the subspace similarity of the Hessian to quantify the consistency of low-curvature directions across different constants. Let \( V_i \in \mathbb{R}^{d \times k}\) denote the matrix whose columns are the \( k\) eigenvectors corresponding to the smallest-magnitude eigenvalues of the Hessian at solution \( i\) , forming a near-zero (low-curvature) eigenspace. For the function space, we set \( k = 1\) , and for the parameter space, we set \( k = 100\) . The subspace similarity between solutions \( i\) and \( j\) is defined as
(78)
where \( \|\cdot\|_F\) denotes the Frobenius norm. This quantity lies in \( [0,1]\) , with larger values indicating stronger alignment between the corresponding low-curvature subspaces. By combining the condition number and the subspace similarity, we characterize not only the degree of degeneracy of the loss landscape but also the geometric stability of its flat directions across independently trained solutions.
In the function space, the Hessian exhibits an infinite condition number, and the zero-curvature subspace is strictly aligned, indicating that the loss valley corresponds to a standard straight zero-residual manifold. In the parameter space, although exact zero residuals cannot be achieved due to training error, the condition number remains extremely large. Moreover, the similarity of the low-curvature subspaces in the parameter space stays only around 50% even when \( k\) approaches one-fourth of the total number of network parameters, suggesting that the loss valley is mapped into the parameter space in a highly distorted and non-convex manner.
In this section, we present a minimal yet reproducible numerical experiment that illustrates the Two-Phase behavior of standard PINNs discussed in Section 1.4.2. In particular, we demonstrate how large-amplitude Dirichlet boundary conditions can induce optimization stagnation or oscillations after the PDE residual has already been minimized.
We consider the Poisson equation on the unit square domain \( \Omega = (0,1)^2, \Gamma = \partial \Omega\) , given by
(79)
with Dirichlet boundary conditions
(80)
To enable quantitative evaluation, the exact solution is constructed as
(81)
where the boundary function \( g\) is defined as
(82)
with coefficients \( A = 20, B = 15, C = 8, D = 6\) . This choice intentionally produces a boundary condition with a large amplitude range, significantly increasing the discrepancy between the randomly initialized network output and the true boundary values.
The right-hand side \( f(x,y)\) is computed analytically from \( u^*\) as
(83)
We employ a fully connected feedforward neural network to fit this problem. The hyperparameters are shown in Table 7.
| Category | Hyperparameter | Value |
| Network Architecture | Number of hidden layers | \( 5\) |
| Neurons per hidden layer | \( 80\) | |
| Activation function | tanh | |
| Training Setup | Optimizer | Adam |
| Learning rate | \( 1\times10^{-3}\) | |
| Batch size | Full-batch | |
| Total training steps | \( 6000\) | |
| Sample Mode | Inner collocation points | \( 8000\) |
| Boundary collocation points | \( 600\times 4\) | |
To characterize this failure mode more quantitatively, we recommend monitoring the following diagnostics: (a) The evolution of \( \mathcal{L}_{\text{res}}\) , (b) \( \mathcal{L}_{\text{bc}}\) , (c) the cosine similarity between task gradients \( \cos(\phi)\) , and (d) gradient norm ratio \( \rho_g = \frac{\left\lVert \nabla_{\theta} \mathcal{L}_{\mathrm{res}} \right\rVert}{\left\lVert \nabla_{\theta} \mathcal{L}_{\mathrm{bc}} \right\rVert}\) . Under the above setting, standard PINNs frequently exhibit two-phase behavior, as shown in Figure 25. We have also visualized the optimization trajectory during the training process and superimposed it with the loss landscape, enabling a more intuitive observation of the optimization dynamics, as shown in Figure 30.
Phase I: Rapid PDE Residual Minimization (0–2000 epochs). (a) PDE residual loss: Decreases rapidly by several orders of magnitude and quickly reaches a near-minimal level, indicating that the optimization is strongly attracted to the loss valley induced by the zero-residual PDE manifold. (b) Boundary condition loss: Exhibits transient growth or noticeable fluctuations, suggesting that boundary constraints are not yet effectively enforced during this phase. (c) Gradient angle cosine similarity: Displays large fluctuations whose amplitude gradually diminishes toward zero, reflecting a weakening interaction between the two gradient components. (d) Gradient norm ratio: Significantly greater than one and decreases steadily, confirming the dominant influence of the PDE residual gradient in early training.
Phase II: Boundary-Dominated Valley Traversal (2000–10000 epochs). (a) PDE residual loss: Ceases to decrease monotonically and instead exhibits pronounced oscillations, characteristic of optimization dynamics within a high-curvature region near the bottom of the residual-induced valley. (b) Boundary condition loss: Continues to decrease smoothly and steadily, indicating that boundary constraints increasingly govern the optimization trajectory. (c) Gradient angle cosine similarity: Shows a brief interval of near-zero values immediately after the PDE residual reaches its minimum, implying near-orthogonality between PDE and boundary gradients. Subsequently, boundary gradients begin to steer the optimization; however, the rugged geometry of the residual valley reintroduces significant oscillations. (d) Gradient norm ratio: The mean value is significantly lower than in Phase I, while oscillation amplitude increases substantially, signaling that boundary conditions and PDE residual frequently compete for control of the gradient direction.
We consider a 2D steady-state heat conduction benchmark on a square domain \( \Omega = [0,1]\times[0,1]\) . The goal is to learn/solve a scalar temperature field \( u(x,y)\) that satisfies a homogeneous elliptic PDE in the interior and mixed (Dirichlet/Neumann) boundary conditions on \( \Gamma\) . This benchmark evaluates the ability of a model to handle mixed boundary conditions.
Inside the domain, \( u\) satisfies the Laplace equation
(84)
We impose Dirichlet conditions on the left and bottom boundaries,
(85)
and Neumann conditions on the right and top boundaries (outward normals \( (1,0)\) and \( (0,1)\) ),
(86)
These correspond to a constant prescribed boundary temperature on \( {x=0}\cup{y=0}\) and a constant outward heat flux on \( {x=L}\cup{y=L}\) .
The analytical solution is computed via a truncated Fourier-series representation that satisfies the above mixed boundary conditions:
(87)
with coefficients
(88)
and we use \( M=20\) terms in practice to obtain an accurate numerical approximation of the analytical series solution.
We consider a 2D Poisson benchmark on the unit square \( \Omega = [0,1]\times[0,1]\) . The task is to solve for a scalar field \( u(x,y)\) governed by a Poisson equation with a prescribed source term and Dirichlet boundary conditions on the entire boundary \( \Gamma\) . This benchmark evaluates the ability of a model to handle complex and strong nonlinear boundary conditions.
Inside the domain, \( u\) satisfies
(89)
where the source term is defined as
(90)
with constants \( A=30\) , \( B=25\) .
Dirichlet boundary conditions are imposed on the whole boundary:
(91)
where
(92)
and \( C=18\) , \( D=16\) .
This Poisson problem is constructed via the method of manufactured solutions. The exact solution is given by
(93)
which satisfies the prescribed Dirichlet boundary conditions and yields the source term \( f(x,y)=\Delta u^*(x,y)\) . This analytical solution is used as the ground truth for quantitative error evaluation.
We consider a 2D steady incompressible Navier–Stokes benchmark on the unit square \( \Omega=[0,1]\times[0,1]\) . The objective is to solve for the velocity field \( \boldsymbol{u}(x,y)=(u(x,y),v(x,y))\) and the pressure field \( p(x,y)\) governed by the steady Navier–Stokes equations with a prescribed forcing term and Dirichlet boundary conditions. This benchmark evaluates the ability of a model to handle nonlinear convection, pressure–velocity coupling, and incompressibility constraints.
Inside the domain, the steady incompressible Navier–Stokes equations are given by
(94)
where \( \mathrm{Re}=500\) denotes the Reynolds number and \( \boldsymbol{f}=(f_u,f_v)\) is a known forcing term constructed so that the system admits a closed-form analytical solution.
Dirichlet boundary conditions are imposed on the entire boundary:
(95)
where \( \boldsymbol{u}^*\) is the prescribed exact velocity field. No explicit boundary condition is imposed on the pressure.
The benchmark is defined using a manufactured solution. Let
(96)
where \( (U_x=1,U_y=1)\) is a constant velocity offset. The forcing term \( \boldsymbol{f}\) is analytically derived by substituting \( (\boldsymbol{u}^*,p^*)\) into the Navier–Stokes equations. This exact solution satisfies the incompressibility condition and the prescribed Dirichlet boundary conditions, and is used as the ground truth for error evaluation.
We consider a 2D Helmholtz-type benchmark defined on a bounded, irregular domain \( \Omega\subset\mathbb{R}^2\) . The task is to solve for a scalar field \( u(x,y)\) governed by a variable-coefficient second-order elliptic equation with a reaction term. This benchmark tests models’ ability to handle spatially varying anisotropic operators and nontrivial domain geometries.
Inside the domain, \( u\) satisfies a generalized Helmholtz equation of the form
(97)
where the diffusion tensor \( A(x,y)\in\mathbb{R}^{2\times 2}\) and the reaction coefficient \( q(x,y)\) are given by
(98)
Dirichlet boundary conditions are imposed on the entire boundary:
(99)
where \( u^*\) is a prescribed exact solution.
The benchmark is constructed using a manufactured solution,
(100)
The source term \( f(x,y)\) is obtained analytically by substituting \( u^*\) into the governing equation. This exact solution satisfies the Dirichlet boundary conditions and serves as the ground truth for quantitative error evaluation.
Among all the benchmarks and baselines, the configurations of MLP, PirateNets, and PINNsFormer we used are shown in Table 8, Table 9, and Table 10, respectively.
| Configuration | Value |
| Number of hidden layers | \( 4\) |
| Neurons per hidden layer | \( 64\) |
| Activation function | tanh |
| Configuration | Value |
| Number of hidden layers | \( 3\) |
| Neurons per hidden layer | \( 32\) |
| Activation function | tanh |
| Configuration | Value |
| Number of attention layers in encoder | \( 2\) |
| Number of attention layers in decoder | \( 2\) |
| Head number in attention layer | \( 1\) |
| Neurons per hidden layer | \( 32\) |
| Activation function | WaveAct |
For PINNsFormer in all benchmarks and baselines, we detach the gradients at the encoder input to prevent gradient mixing among neighboring spatial points when automatic differentiation computes the vector–Jacobian product.
Among all the benchmarks and baselines, the hyperparameters we used to train MLP, PirateNets, and PINNsFormer are shown in Table 11, Table 12, and Table 13, respectively.
| Benchmark | \( \eta\) | \( w_\mathrm{res}\) | \( w_\mathrm{bc}\) | \( T_\mathrm{min}\) | \( T_\mathrm{max}\) | \( L_2^\mathrm{stop}\) |
| Heat | 1.0e-3 | 1 | 5 | 6000 | 20000 | 2.0e-3 |
| Poisson | 1.0e-3 | 1 | 100 | 6000 | 20000 | 1.0e-2 |
| NS | 1.0e-3 | 1 | 100 | 6000 | 20000 | 5.0e-3 |
| Helmholtz | 1.0e-3 | 1 | 10 | 4000 | 20000 | 1.0e-3 |
| Benchmark | \( \eta\) | \( w_\mathrm{res}\) | \( w_\mathrm{bc}\) | \( T_\mathrm{min}\) | \( T_\mathrm{max}\) | \( L_2^\mathrm{stop}\) |
| Heat | 1.0e-3 | 1 | 1 | 6000 | 20000 | 1.0e-2 |
| Poisson | 1.0e-3 | 1 | 100 | 6000 | 20000 | 5.0e-2 |
| NS | 3.0e-3 | 1 | 100 | 6000 | 20000 | 1.0e-2 |
| Helmholtz | 1.0e-3 | 1 | 100 | 4000 | 20000 | 5.0e-3 |
| Benchmark | \( \eta\) | \( w_\mathrm{res}\) | \( w_\mathrm{bc}\) | \( T_\mathrm{min}\) | \( T_\mathrm{max}\) | \( L_2^\mathrm{stop}\) |
| Heat | 1.0e-3 | 1 | 1 | 2000 | 10000 | 1.0e-2 |
| Poisson | 1.0e-3 | 1 | 100 | 2000 | 10000 | 5.0e-2 |
| NS | 1.0e-3 | 1 | 100 | 2000 | 10000 | 1.0e-2 |
| Helmholtz | 1.0e-3 | 1 | 100 | 1500 | 10000 | 1.0e-3 |
For CAML, the additional hyperparameters we adopted in different benchmarks are as in Table 14. For other baselines that have additional hyperparameters, we apply the recommended configurations as stated in their original papers.
| Benchmark | \( t_d\) | \( t_r\) | \( K_\mathrm{init}\) | \( K_\mathrm{few}\) | \( t_c\) |
| Heat | 25 | 50 | \( -\) | \( -\) | \( -\) |
| Poisson | 200 | 800 | \( -\) | \( -\) | \( -\) |
| NS | 25 | 50 | 10 | 2 | 1000 |
| Helmholtz | 25 | 50 | \( -\) | \( -\) | \( -\) |
In this section, we present a quantitative analysis of computational costs for both linear and nonlinear CAML PDEs. We use the Heat benchmark as the linear case, and use NS as the nonlinear benchmark. All configurations are consistent with the main experiment. We first measure the time required to compute the closed-form solution for \( c\) in the linear case, as well as the proportion of that time spent in the single iteration. The results are shown in Table 15.
| FP | AD | AC | BP | Percentage | ||
| MLP | Single Iteration (\( ms\) ) | 0.1619 | 1.6322 | 0.1439 | 2.6541 | 3.13% |
| Full Process (\( s\) ) | 1.1971 | 10.7513 | 0.9939 | 17.2778 | 3.29% | |
| PirateNets | Single Iteration (\( ms\) ) | 0.7503 | 7.0681 | 0.1763 | 15.4657 | 0.75% |
| Full Process (\( s\) ) | 6.1116 | 57.5850 | 1.1378 | 116.7246 | 0.63% | |
| PINNsFormer | Single Iteration (\( ms\) ) | 2.3134 | 21.4542 | 0.1487 | 47.0843 | 0.21% |
| Full Process (\( s\) ) | 6.8850 | 66.9518 | 0.4497 | 140.9999 | 0.21% | |
It can be observed that in the linear setting, the additional computational cost introduced by increasing \( c\) is relatively constant. Even for the smallest MLP, it accounts for only approximately 3% of the total cost and can therefore be neglected.
In the nonlinear case, we measured the time required to compute \( c\) under \( K_\mathrm{few}=1\) , \( 2\) , and \( 5\) Newton iterations. The Newton updates are performed only during the first \( 1/6\) of the training process. Once the error stabilizes, the value of \( c\) is fixed. The results are reported in Table 16.
| FP | AD | AC | BP | Percentage | |||
| \( K_{\mathrm{few}}=1\) | MLP | Single Iteration (\( ms\) ) | 0.1799 | 1.4494 | 1.0979 | 2.6280 | 20.50% |
| Full Process (\( s\) ) | 1.2951 | 11.5819 | 3.9947 | 18.6597 | 11.24% | ||
| PirateNets | Single Iteration (\( ms\) ) | 0.7328 | 8.0003 | 1.0322 | 14.8213 | 4.20% | |
| Full Process (\( s\) ) | 6.0822 | 55.2902 | 4.0421 | 112.3832 | 2.27% | ||
| PINNsFormer | Single Iteration (\( ms\) ) | 2.5648 | 23.8952 | 1.8702 | 55.1467 | 2.24% | |
| Full Process (\( s\) ) | 7.2837 | 70.0029 | 0.6324 | 151.3379 | 0.28% | ||
| \( K_{\mathrm{few}}=2\) | MLP | Single Iteration (\( ms\) ) | 0.1895 | 1.4653 | 2.0220 | 2.7328 | 31.55% |
| Full Process (\( s\) ) | 1.4321 | 12.8582 | 8.3121 | 21.5747 | 18.81% | ||
| PirateNets | Single Iteration (\( ms\) ) | 0.7807 | 7.2948 | 2.3412 | 15.0520 | 9.19% | |
| Full Process (\( s\) ) | 6.1321 | 59.8055 | 9.3241 | 116.2348 | 4.87% | ||
| PINNsFormer | Single Iteration (\( ms\) ) | 2.3818 | 21.5656 | 3.0841 | 47.1944 | 4.16% | |
| Full Process (\( s\) ) | 7.8837 | 71.2378 | 1.0790 | 155.3432 | 0.46% | ||
| \( K_{\mathrm{few}}=5\) | MLP | Single Iteration (\( ms\) ) | 0.1614 | 1.5289 | 5.0107 | 2.4561 | 54.72% |
| Full Process (\( s\) ) | 1.3320 | 10.0022 | 19.7822 | 16.9772 | 41.13% | ||
| PirateNets | Single Iteration (\( ms\) ) | 0.7361 | 8.2706 | 5.2069 | 15.1342 | 17.74% | |
| Full Process (\( s\) ) | 6.9929 | 57.9483 | 21.3250 | 118.1247 | 10.43% | ||
| PINNsFormer | Single Iteration (\( ms\) ) | 2.3477 | 22.1690 | 6.0726 | 49.3347 | 7.60% | |
| Full Process (\( s\) ) | 6.4899 | 73.5383 | 2.7600 | 157.1937 | 1.15% | ||
The results indicate that, for nonlinear PDEs, under the typical CAML configuration (\( K_\mathrm{few}=2\) ), the cost of online update \( c\) accounts for approximately 20% of the total training time in lightweight models such as MLP. Since the computational cost of Newton update remains relatively constant across different backbones, this proportion decreases to below 5% for more complex models. Moreover, if the updates of \( c\) were terminated earlier, this proportion would be further reduced.
In this section, we perform an experiment to evaluate the distinction between CAML and explicitly introduced learnable biases in the output layer. Using the Heat benchmark as an illustrative example, we employ the MLP backbone and keep all experimental settings consistent with those in the main experiments. We compare the following configurations: (i) standard PINN with a standard MLP, (ii) standard PINN with an MLP augmented by an explicitly added, learnable output bias, and (iii) CAML with aligned constraints only (to isolate the effect of the delay-residual). The results are shown in Table 17.
| Stp | \( L_2\) | \( \cos(\phi)\) | |
| PINN | 10127\( _{\pm 2839}\) | 7.52e-3\( _{\pm 4\text{e-}3}\) | 5.89% |
| PINN + Learnable Bias | 9264\( _{\pm 1877}\) | 6.53e-3\( _{\pm 2\text{e-}3}\) | 16.13% |
| CAML (AC-Only) | 1491\( _{\pm 186}\) | 1.15e-3\( _{\pm 6\text{e-}5}\) | 36.12% |
As shown in Table 17, introducing a bias term improves over vanilla PINN in all metrics, indicating that increasing the constant-mode flexibility indeed alleviates part of the gradient conflict. However, the improvement remains limited: convergence is still slow, and the final error remains an order of magnitude larger than that of CAML. In contrast, CAML (AC-only) significantly reduces the required training steps, lowers the \( L_2\) error, and substantially increases the gradient alignment ratio. This suggests that simply introducing a trainable bias is insufficient. The key advantage of CAML lies in its per-iteration optimal alignment of the constant mode via an analytic (or low-dimensional) solve, rather than relying on gradient-based adaptation of a bias parameter, which may become trapped in a distorted residual valley.
In this section, we analyze the sensitivity of CAML to its key hyperparameters, including the Newton iteration steps \( K_{\text{init}}\) and \( K_{\text{few}}\) , the constant-fixing iteration \( t_c\) , and the delay-residual parameters \( t_d\) and \( t_r\) .
Newton’s method. The value of \( K_{\text{init}}\) is set higher because, in the first step, \( c\) does not have a warm start, and more iterations are needed to find reasonable initial values. \( K_{\text{few}}\) is set to 2 as a trade-off between accuracy and computational cost (the comparison of costs is shown in Table 15). The selection of \( t_c\) is based on observing when the change in \( c\) stabilizes, thereby avoiding training noise caused by minor fluctuations in the later stage.
Delay-residual schedule. The principle for setting \( t_d\) and \( t_r\) is to ensure that the network roughly satisfies the boundary conditions before the PDE residual term is introduced. To examine the effect of these parameters more systematically, we conducted an additional experiment on the Poisson benchmark with the MLP backbone (5 random seeds). The results are reported in Table 18.
| \( t_d/t_r\) | 0/0 | 40/160 | 80/320 | 120/480 | 160/640 | 200/800 | 800/3200 |
| Stp | \( -\) | 5521 | 4713 | 5595 | 4574 | 3868 | 1984 |
| \( L_2\) | \( -\) | 9.99e-3 | 6.25e-3 | 1.00e-2 | 6.98e-3 | 5.00e-3 | 2.98e-3 |
Overall, we found that for strongly nonlinear boundary conditions, the later the residuals are introduced, the more stable the training becomes. Nevertheless, excessive \( t_d\) may lead to overfitting of the boundary conditions, thereby causing gradient explosion after the introduction of the PDE residual term. Therefore, we suggest setting this value with greater caution.
In this section, we present a failure-case experiment as a consistency check of our theoretical analysis. We consider a well-posed toy Poisson problem with purely Dirichlet boundary conditions. In this case, the model’s initialization is already close to the global optimum, and the PDE loss landscape is relatively smooth, so it will not guide the parameter to a remote valley at the very beginning. In this setting, CAML is not expected to provide a significant improvement over standard PINNs. This experiment clarifies the applicability and limitations of our method.
Consider the Poisson equation on a unit square domain \( \Omega = [0, 1] \times [0, 1]\) :
(101)
where \( u(x, y)\) is the unknown solution and \( f(x, y)\) is the source term. For this case, the source term is chosen as:
(102)
The boundary condition is specified as:
(103)
and the analytical solution for this problem is:
(104)
In this experiment, we measured the performance of the MLP backbone under the original PINN and CAML (AC-only). We simultaneously recorded the final additive offset \( c\) applied by CAML.
| Stp | \( L_2\) | \( \cos(\phi)\) | \( c\) | |
| PINN | 2472\( _{\pm 398}\) | 3.96e-3\( _{\pm 9\text{e-}4}\) | 19.88% | \( -\) |
| CAML (AC-Only) | 2328\( _{\pm 221}\) | 3.92e-3\( _{\pm 6\text{e-}4}\) | 23.21% | 3.16 |
Based on the results in Table 19, the toy Poisson experiment indicates that for well-posed problems with purely Dirichlet boundary conditions, CAML (AC-only) does not provide substantial improvements over standard PINNs. Both methods achieve nearly identical convergence steps and final \( L_2\) errors, with only a modest increase in gradient cosine similarity under CAML. The learned additive offset \( c\) remains small, yielding only minor performance gains. This result is consistent with our theoretical discussion: when initialization has already eliminated additive invariance and the optimization landscape is relatively benign, the optimizer’s search distance within the PDE residual loss valley is already short. Therefore, the advantage of solution set expansion caused by aligned constraints becomes limited.
In this section, we present visualization results to qualitatively assess the performance of CAML integrated into the different backbones. We visualize snapshots of the solution on a two-dimensional domain across different benchmarks.
[1] Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations Journal of Computational Physics 2019 378 686–707
[2] A physics-informed deep learning framework for inversion and surrogate modeling in solid mechanics Computer Methods in Applied Mechanics and Engineering 2021 379 113741
[3] An energy approach to the solution of partial differential equations in computational mechanics via machine learning: Concepts, implementation and applications Computer Methods in Applied Mechanics and Engineering 2020 362 112790
[4] Hidden fluid mechanics: Learning velocity and pressure fields from flow visualizations Science 2020 367 6481 1026–1030
[5] NSFnets Journal of Computational Physics 2021 426 109951
[6] Physics-Informed Neural Networks for Heat Transfer Problems Journal of Heat Transfer 2021 143 6 060801
[7] Deep Physics-Informed Neural Networks for Stratified Forced Convection Heat Transfer in Plane Couette Flow: Toward Sustainable Climate Projections in Atmospheric and Oceanic Boundary Layers Fluids 2025 10 12
[8] Physics-informed neural networks for inverse problems in nano-optics and metamaterials Optics Express 2020 28 8 11618–11633
[9] MaxwellNet APL Photonics 2021
[10] Characterizing possible failure modes in physics-informed neural networks 35th International Conference on Neural Information Processing Systems (NeurIPS) 2021
[11] Understanding and mitigating gradient flow pathologies in physics-informed neural networks SIAM Journal on Scientific Computing 2021 43 5 A3055–A3081
[12] When and why PINNs fail to train: A neural tangent kernel perspective Journal of Computational Physics 2022 449 110768
[13] Challenges in training PINNs: a loss landscape perspective 41st International Conference on Machine Learning (ICML) 2024
[14] Beyond derivative pathology of PINNs: Variable splitting strategy with convergence analysis Journal of Machine Learning Research 2024
[15] How does PDE order affect the convergence of PINNs? 38th International Conference on Neural Information Processing Systems (NeurIPS) 2024
[16] Gradient-enhanced physics-informed neural networks for forward and inverse PDE problems Computer Methods in Applied Mechanics and Engineering 2022 393 114823
[17] Is L2 physics-informed loss always suitable for training physics-informed neural network? 36th International Conference on Neural Information Processing Systems (NeurIPS) 2022
[18] Meta-learning PINN loss functions Journal of Computational Physics 2022 458 111121
[19] Neural Spectral Methods: Self-supervised learning in the spectral domain 12th International Conference on Learning Representations (ICLR) 2024
[20] Physics-Informed Parallel Neural Networks with self-adaptive loss weighting for the identification of continuous structural systems Computer Methods in Applied Mechanics and Engineering 2024 427 117042
[21] Self-adaptive weights based on balanced residual decay rate for physics-informed neural networks and deep operator networks Journal of Computational Physics 2025 542 114226
[22] Physics-informed neural networks with adaptive loss weighting algorithm for solving partial differential equations Computers and Mathematics with Applications 2025 181 216–227
[23] Dual-Balancing for Physics-Informed Neural Networks 34th International Joint Conference on Artificial Intelligence (IJCAI) 2025
[24] Dual Cone Gradient Descent for Training Physics-Informed Neural Networks 38th International Conference on Neural Information Processing Systems (NeurIPS) 2024
[25] ConFIG: Towards Conflict-free Training of Physics Informed Neural Networks 13th International Conference on Learning Representations (ICLR) 2025
[26] PirateNets Journal of Machine Learning Research 2024 25 402 1–51
[27] Surrogate modeling for fluid flows based on physics-constrained deep learning without simulation data Computer Methods in Applied Mechanics and Engineering 2020 361 112732
[28] A unified hard-constraint framework for solving geometrically complex PDEs 36th International Conference on Neural Information Processing Systems (NeurIPS) 2022
[29] PINNACLE 12th International Conference on Learning Representations (ICLR) 2024
[30] Visualizing the loss landscape of neural nets 32nd International Conference on Neural Information Processing Systems (NeurIPS) 2018
[31] On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima 5th International Conference on Learning Representations (ICLR) 2017
[32] Loss surfaces, mode connectivity, and fast ensembling of DNNs 32nd International Conference on Neural Information Processing Systems (NeurIPS) 2018
[33] Sharpness-aware minimization for efficiently improving generalization 9th International Conference on Learning Representations (ICLR) 2021
[34] Self-adaptive physics-informed neural networks Journal of Computational Physics 2023 474 111722
[35] PINN 12th International Conference on Learning Representations (ICLR) 2024
[36] Adam: A Method for Stochastic Optimization 3rd International Conference on Learning Representations (ICLR) 2015
[37] PyTorch 33rd International Conference on Neural Information Processing Systems (NeurIPS) 2019
[38] Adapting Auxiliary Losses Using Gradient Similarity arXiv preprint arXiv:1812.02224 2018
[39] On the limited memory BFGS method for large scale optimization Mathematical Programming 1989 45 503–528
[40] Achieving High Accuracy with PINNs via Energy Natural Gradient Descent 40th International Conference on Machine Learning (ICML) 2023
[41] Physics-informed neural networks: minimizing residual loss with wide networks and effective activations 33rd International Joint Conference on Artificial Intelligence (IJCAI) 2024
[42] Exact imposition of boundary conditions with distance functions in physics-informed deep neural networks Computer Methods in Applied Mechanics and Engineering 2022 389 114333
[43]
[44] Neural Operator: Graph Kernel Network for Partial Differential Equations ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations 2020 Workshop paper
[45] Fourier Neural Operator for Parametric Partial Differential Equations 9th International Conference on Learning Representations (ICLR) 2021
[46] Physics-Informed Transformer Networks NeurIPS 2023 Workshop on The Symbiosis of Deep Learning and Differential Equations III 2023 Workshop paper
[47] PFNN Journal of Computational Physics 2021 428 110085
[48] Physics-Informed Neural Networks with Hard Constraints for Inverse Design SIAM Journal on Scientific Computing 2021 43 B1105–B1132
[49] Automatic boundary fitting framework of boundary dependent physics-informed neural network solving partial differential equation with complex boundary conditions Computer Methods in Applied Mechanics and Engineering 2023 414 116139
[50] Unisolver: PDE-Conditional Transformers Are Universal PDE Solvers 42nd International Conference on Machine Learning (ICML) 2025