Gwern is Wrong About Variational Principles

 

What is a Variational Principle and Why Do We Care?

Firstly, I would like to clarify what I mean by a variational principle: a system of PDEs which arises as the Euler-Lagrange equations of a specific functional. A functional is a mapping from a suitable space of functions (or trajectories, in physical terms) to the real numbers. The Euler-Lagrange equations are obtained by setting the "functional derivative" of a functional to \(0\); the idea is that the functional has a derivative just like regular old functions do, and evaluating to \(0\) is a necessary condition for a given point to be extremal (maximal or minimal) for the functional. The only caveat is that in general the Euler-Lagrange equations are a set of functional equations (which are only sometimes analytically solvable); they may, for example, look like \(mx'' = 0\), where \(x\) is the function for which the functional (in this case \(\int_a^b\frac{m\dot x(t)^2}{2} \ dt\)) is extremal. That particular example shows that Newton's equations are given as the Euler-Lagrange equations for kinetic energy summed over time (in a similar manner, on Riemannian manifolds the theory of geodesics can be treated in terms of extrema of a corresponding 'energy functional').

Variational principles are useful (usually) because they allow us to relate the study of a set of differential equations to the study of extrema of a functional, which is oftentimes easier. For example, if a functional has certain convexity properties, we have a very easy time proving existence and uniqueness of an extremum. So although there are sometimes ways to apply variational techniques in part to the systems we'll discuss, in general those systems are not variational, and what we'd really call "variational calculus" has limited utility, in terms of its usual toolkit, insofar as understanding, deriving, or reasoning about them goes.
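As a quick sanity check of the kinetic-energy example, SymPy can compute the Euler-Lagrange equations symbolically (a minimal sketch; the symbol names here are my own choices):

```python
import sympy as sp
from sympy.calculus.euler import euler_equations

# Euler-Lagrange equation of the kinetic-energy functional
# \int_a^b m*x'(t)^2/2 dt; we expect Newton's equation m*x'' = 0.
t = sp.symbols('t')
m = sp.symbols('m', positive=True)
x = sp.Function('x')

L = m * sp.diff(x(t), t)**2 / 2  # the integrand (Lagrangian)
eqs = euler_equations(L, [x(t)], t)
print(eqs)  # a single equation, equivalent to m*x''(t) = 0
```

Nothing deep is happening: the functional derivative of \(\int \frac{m\dot x^2}{2}\,dt\) is \(-m\ddot x\), and setting it to zero recovers free Newtonian motion.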

What Did Gwern Say?

A brief note on an excerpt from Gwern:

We err if an intentional stance leads us to engage in the pathetic fallacy and say that the river-spirit wants to reunite with the ocean (and we must offer sacrifices lest the dikes breach), but we are correct when we say that the river tries to find the optimal path which minimizes its gravitational or free energy. It is both true, predictively useful, and mathematically equivalent to the other way of formulating it, in terms of ‘forward’ processes computing step by step, atom by atom, and at the end getting the same answer—but typically much easier to solve. (Ted Chiang’s “Story Of Your Life” tries to convey this perspective via fiction.) And this shortcut is a trick we can use universally, for everything from a river flowing downhill to the orbit of a planet to the path of a photon through water minimizing travel time to evolutionary dynamics: instead of trying to understand it step by step, treat the system as a whole via the variational principle as ‘wanting’ to minimize (or maximize) some simple global quantity (a reward), and picking the sequence of actions that does so. (“The river wants to minimize its height, so without simulating it down to the individual water currents, I can look at the map and see that it should ‘choose’ to go left, then right, and then meander over this flat slightly-sloping part. Ah, looks like I was right.”) Then, into this modular trick, just plug in the system and quantity in question, and think like an agent.

Gwern, on the near universal applicability of variational reasoning [1]

There is a serious issue with this line of reasoning, and it is crucial to the argument about variational principles becoming a latent and widespread 'cheat' for machine cognition. Many natural systems (funnily enough, like fluid flow, which was named as an example) do not and cannot obey a variational principle. While fluids may appear classical-mechanical, the Navier-Stokes equations, for example, cannot be obtained from a variational principle. The existence of a variational principle for a given set of PDEs is somewhat akin to exactness for vector fields: integrability constraints must be satisfied (e.g. \(\nabla \times v = 0\) as an integrability constraint for the existence of a potential for a given field) in order for one to exist. In fact, more or less exactly this circumstance holds: an operator \(N(u)\) is the gradient of a functional (i.e. satisfies a variational principle) if and only if (under suitable regularity conditions) its Gateaux derivative is symmetric at every point; equivalently, if path integration (where a path is a continuous mapping from \([0, 1]\) into the space of admissible functions; one wonders whether, on spaces with holes, a similar result would hold up to homotopy with fixed endpoints) depends only upon the starting and ending points [2]. Notably, (general) fluids are an exception to this circumstance.
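To make the symmetry condition concrete, here is a small finite-dimensional sketch (the discretization and the two test operators are my own illustration, not taken from [2]): after discretizing, the Gateaux derivative becomes an ordinary Jacobian, and we can check its symmetry numerically for a 'potential' operator like \(-u''\) versus a Burgers-type advection term \(u u'\), the kind of nonlinearity that obstructs a variational form for fluids.

```python
import numpy as np

# Discretize two operators on a grid and compare the symmetry of their
# Jacobians (the finite-dimensional analogue of the Gateaux derivative).
n = 50
h = 1.0 / (n + 1)
u = np.sin(np.linspace(h, 1 - h, n))  # an arbitrary test point

def jacobian(N, u, eps=1e-6):
    """Central-difference Jacobian of N at u."""
    m = len(u)
    J = np.zeros((m, m))
    for j in range(m):
        du = np.zeros(m)
        du[j] = eps
        J[:, j] = (N(u + du) - N(u - du)) / (2 * eps)
    return J

def neg_laplacian(u):
    """-u'' with zero boundary conditions (gradient of Dirichlet energy)."""
    up = np.pad(u, 1)
    return -(up[2:] - 2 * up[1:-1] + up[:-2]) / h**2

def advection(u):
    """u * u' via centered differences (Burgers-type nonlinearity)."""
    up = np.pad(u, 1)
    return u * (up[2:] - up[:-2]) / (2 * h)

J1 = jacobian(neg_laplacian, u)
J2 = jacobian(advection, u)
print(np.allclose(J1, J1.T))  # True:  symmetric, a potential exists
print(np.allclose(J2, J2.T))  # False: not symmetric, no variational principle
```

The symmetric Jacobian certifies (in this discretized analogue) that \(-u''\) is the gradient of the Dirichlet energy \(\frac{1}{2}\int |u'|^2\), while the advection term fails the test at a generic point.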

Similar issues are present for the Boltzmann equation. I mention it because the confusion Gwern probably had is easiest to raise for the Boltzmann equation with a hard-spheres model: how can a system derived from classical-mechanical laws, or taken as a limit of classical-mechanical behavior, be non-variational when classical mechanics (à la Lagrangian or Hamiltonian mechanics) is? I would reply that we are only seeing pictures (abstractions) of classical behavior when we look at something like the Navier-Stokes equations or the Boltzmann equation. The Boltzmann equation is a great example because it violates many classical notions: for instance, its lack of time reversibility corresponds to extraordinarily large Poincaré recurrence times [3]. Another example of a system where the variational approach has to be supplemented by good old-fashioned stupid forces is friction. The bane of many a conceptual luxury, frictional forces have to go on the right-hand side of Lagrange's equations, \(\frac{d}{dt}\frac{\partial L}{\partial\dot{q}_j} - \frac{\partial L}{\partial q_j} = -\frac{\partial \mathcal{F}}{\partial\dot{q}_j}\), where \(\mathcal{F}\) is the Rayleigh dissipation function [4]. Hard-sphere (or billiard-ball) collisions are another example where, even within classical mechanics, a variational formulation struggles: if we use the usual classical Lagrangian and pick the "wrong" reflection direction after a collision, the action integral is still stationary, which we can see by omitting the collision instant from the domain of integration and varying the before and after halves individually. We would have to give infinite weight to the collision times to discern the correct option, which means dealing with distributions, singular measures, or something else in modeling the problem.
We can approximate this situation by smoothing out the "no intersections" constraint, penalizing would-be collisions with a central force, and then taking the limit as we approach the hard-constraint case. The point is that even in cases which should be simple, variational reasoning can wind up quite troublesome. If I simply told you that the hard-spheres model behaves variationally and you sat down to find a Lagrangian, do you think you would find the variational point of view elegant, simple, or helpful?
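As a concrete illustration of the friction case above, here is a SymPy sketch of the damped harmonic oscillator via the quoted Rayleigh-dissipation form of Lagrange's equations (the parameters \(m, k, c\) are my own illustrative choices):

```python
import sympy as sp

t = sp.symbols('t')
m, k, c = sp.symbols('m k c', positive=True)
q = sp.Function('q')(t)
v = sp.Symbol('v')  # placeholder for q'(t) while differentiating

L = m * v**2 / 2 - k * q**2 / 2   # harmonic-oscillator Lagrangian
F = c * v**2 / 2                  # Rayleigh dissipation function

# d/dt(dL/dv) - dL/dq on the left, -dF/dv on the right, with v -> q'(t)
lhs = sp.diff(sp.diff(L, v).subs(v, sp.diff(q, t)), t) - sp.diff(L, q)
rhs = -sp.diff(F, v).subs(v, sp.diff(q, t))
print(sp.Eq(lhs, rhs))  # m*q'' + k*q = -c*q', the damped oscillator
```

The point of the exercise: the \(-c\dot q\) damping term is not produced by any Lagrangian here; it has to be bolted on through \(\mathcal{F}\), outside the variational machinery proper.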

Uh oh. ‘Predictively useful’, ‘shortcut’, ‘much easier’, ‘universally’—all properties a neural net loves. All natural to it. Why would it try to solve each heterogeneous problem with a separate, computationally-expensive, bag of tricks, when there’s one weird trick AI safety researchers hate, like adopting teleological and variational reasoning?

Gwern, on the proposed high utility of variational reasoning [1]

I think this just does not hold up to inspection, for the reasons outlined above. It is vibes-based reasoning: 1. variational principles are near-universally applicable (not true), so 2. agents will tend towards generalized variational-reasoning capacities (amorphously defined) to achieve 3. a more efficient method of reasoning about a wider variety of situations (only very mildly true). Variational methods are relatively widely applicable, but only relatively. I gave a litany of examples above of pretty basic situations (even in classical-mechanical systems!) where variational reasoning fails, or at the very least struggles. Of course, the principal condition is the restriction to path-independence and symmetry of Gateaux derivatives, as mentioned. Setting that aside, the collisions problem demonstrates the delicacy of applying variational reasoning in the presence of nonsmooth constraints, or nonsmoothness in general. It is not a leap of the imagination to conclude that variational principles are a very useful tool which nonetheless most certainly does not serve as some near-universal method of reasoning about the world. The whole hypothesis here is that variational reasoning with/for optimal control (agentic thinking plus a general variational formulation for dynamics with controls tacked on; I don't presume which sort of optimal control scheme such an intelligence would decide to use in the general case) would lead to general intelligence. But variational reasoning is usually a product intelligences produce to piece together distinct ideas, or to reason abstractly about a given system (e.g. proving existence for PDEs by appeal to conditions for the existence of minimizers of variational principles, when they exist), because although the abstract structure of variational schemes is often quite similar, the specific details and implementation (which must be supplied to understand a problem) are at times ad hoc, bespoke, and only something one finds when one already has some understanding of the dynamics involved.

A serious issue with the entire section is an almost concept-object flip-flop: as if someone who understands variational principles would thereby understand how gravity works, or how any system which can be viewed variationally works. Being able to understand the picture once you know how to turn the details into a variational problem (or any other abstract approach to understanding systems, for that matter) is bottlenecked by being able to perform that transformation to begin with, which will probably require basically understanding the "non-variational" point of view. Or at least, the transformation isn't doing the work for you; the abstract general framework isn't doing much for you. Maybe, at best, the assumption that what you're looking at is already variational takes off some cognitive load, but that is a very abstract benefit. In any case, the core part of the note is done. This is an old, old article, and the argument is old, but I couldn't avoid talking about this specific idea.

No malice is aimed at Gwern, and my intention is not to demean or attack him personally. I just had to say what I wanted to say about this excerpt when I came across it while reading his blog.

Appendix: Optimal Control and Variational Calculus

Gwern does mention, at a few points and mostly in a footnote, the link between reinforcement learning and variational calculus. They are related, but also quite different. You can view reinforcement learning as attacking an optimal control problem (usually structured with dynamic programming) with a deep network [5]. Optimal control and variational calculus are somewhat conflated in Gwern's writing: he talks about maximizing or minimizing reward in the sense of minimizing the functional of choice in a variational problem, but that is quite distinct from agentic thinking. You cannot formulate an optimal control problem as a typical variational problem. The reason is that you essentially have two separate variational problems which behave totally differently, or, even worse, a completely non-variational constraint. By that I mean that while the choice of control is made to optimize a functional, simple functional differentiation does not work, because the choices we make now can have long-term effects in time that are not visible to the functional itself. That is:

\begin{equation} \begin{aligned} \delta_u J(x, u)(\phi) &= \frac{d}{d\varepsilon}\left[l(x(b)) + \int_a^b L(x(t), [u+\varepsilon\phi](t)) \ dt\right]\bigg\rvert_{\varepsilon = 0} \\ &= \int_a^b \frac{\partial L}{\partial u}(x(t), u(t))\,\phi(t) \ dt \end{aligned} \end{equation}

which would imply \(\frac{\partial L}{\partial u} = 0\), which is obviously the incorrect condition. This is because the payoff functional knows nothing about the requirement that \(x(t)\) be driven by the appropriate dynamical system. At best that requirement can be written as another variational constraint (\(\delta F[x] = 0\) for some functional \(F\) whose Euler-Lagrange equations are the differential equations governing the trajectory \(x\)), and at worst it is just an arbitrary system of differential equations. The correct equation is the Hamilton-Jacobi-Bellman equation rather than the simple Euler-Lagrange equations, and the quantity solved for is the value function rather than the trajectory itself. And considering we literally use these methods to train neural nets (with RL), it seems questionable to expect that a system trained with RL would essentially reinvent RL methods for solving individual problems in the wild. Of course, optimal control is even more complicated than variational calculus, so as you can imagine the caveats above apply there too, insofar as expecting optimal control methods to become a universal "cheat" or form of reasoning for runaway intelligences to find and utilize. My favorite exposition on optimal control as it relates to variational calculus is [6], which is also where the above form of the payoff functional was taken from (p. 437).
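The failure of naive functional differentiation can be made concrete with a toy scalar linear-quadratic problem (all numbers below are made up for illustration): pointwise minimization of the running cost in \(u\) gives \(u \equiv 0\), ignoring the control's effect on future states, while the dynamic-programming (Riccati/HJB-style) solution accounts for it.

```python
import numpy as np

# Toy discrete-time LQR: dynamics x_{k+1} = a*x_k + b*u_k, cost
# sum_k (q*x_k^2 + r*u_k^2) plus terminal q*x_T^2. With a > 1 the
# uncontrolled system is unstable, so the greedy choice u = 0 is bad.
a, b, q, r, T = 1.2, 1.0, 1.0, 0.1, 20

# Backward Riccati recursion for the value function V_k(x) = P_k * x^2.
P = q
gains = []
for _ in range(T):
    K = a * b * P / (r + b * b * P)      # optimal feedback u = -K*x
    P = q + r * K**2 + P * (a - b * K)**2
    gains.append(K)
gains.reverse()  # time-forward ordering of the stage gains

def rollout(policy):
    x, cost = 1.0, 0.0
    for k in range(T):
        u = policy(k, x)
        cost += q * x**2 + r * u**2
        x = a * x + b * u
    return cost + q * x**2

greedy = rollout(lambda k, x: 0.0)             # pointwise dL/du = 0
optimal = rollout(lambda k, x: -gains[k] * x)  # dynamic-programming policy
print(greedy > optimal)  # True: the greedy control is far more expensive
```

The greedy policy is exactly what the broken stationarity condition above prescribes; the Riccati gains encode the value function's knowledge of the dynamics, which the payoff functional alone never sees.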

References

[1] Gwern, "The Scaling Hypothesis", accessed 10-10-2025, https://gwern.net/scaling-hypothesis#variational-interpretations
[2] Finlayson, The Method of Weighted Residuals and Variational Principles, 1972, pp. 298-302
[3] McQuarrie, Statistical Mechanics, 1975, pp. 416-417
[4] Goldstein, Classical Mechanics, 1980, p. 24
[5] Dixon et al., Machine Learning in Finance, 2020, pp. 323-329
[6] Clarke, Functional Analysis, Calculus of Variations, and Optimal Control, 2013