Notes for Adjoint Methods
$\newcommand{\E}{\mathbb{E}}$ $\newcommand{\argmax}{\mathrm{argmax}}$ $\newcommand{\Mu}{M}$ $\newcommand{\ba}{\bar{a}}$ $\newcommand{\hx}{\hat{x}}$ $\newcommand{\argmin}{\mathrm{argmin}}$ $\newcommand{\sigmasqr}{\sigma^2}$ This article synthesizes insights from multiple original research papers and both English and Chinese academic sources, with original analysis and derivations by the author.
Todo list
- Translate to Markdown
- Check Equations
- Fix Image & Website References
1. Problem Settings
The goal is to quickly solve:
\[\argmin_p F(x,p) = \int_0^T f(x,p,t) \, dt\]

Naturally, we use gradient descent, so we need a fast way to evaluate $ \nabla_p F(x,p) $, which will be abbreviated as $ d_p F(x,p) $. Note that there is an ODE constraint (in general implicit form):
\[\frac{dx}{dt} = \dot{x} = \bar{h}(x,p,t) \quad \Rightarrow \quad h(x,\dot{x},p,t) = 0\]

and an initial-condition constraint:
\[g(x(0),p) = 0\]
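For concreteness, here is a scalar toy instance (my own running example, not taken from the source papers), which the numerical sketches later in these notes reuse:

\[f(x,p,t) = x^2, \qquad \dot{x} = -px \;\Rightarrow\; h(x,\dot{x},p,t) = \dot{x} + px = 0, \qquad g(x(0),p) = x(0) - x_0 = 0\]

Here $ x(t) = x_0 e^{-pt} $, so the objective has the closed form $ F(p) = x_0^2 (1 - e^{-2pT}) / (2p) $, which is handy for checking gradients.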
2. Adjoint Method (1st Order)

Clearly, we have:
\[d_p F(x,p) = \int_0^T \left[\partial_x f \, d_p x + \partial_p f \right] \, dt\]

Computing $ d_p x $ directly (forward sensitivity) requires one ODE solve per component of $ p $, which is exactly what we want to avoid. Considering the Lagrangian under the constraints, with multipliers $ \lambda $ and $ \mu $, we have:
\[\mathcal{L} = \int_0^T \left[f(x,p,t) + \lambda^T h(x,\dot{x},p,t)\right] \, dt + \mu^T g(x(0),p)\]

where $ \lambda $ is a function of time. Differentiating with respect to $ p $ gives:
\[d_p \mathcal{L} = \int_0^T \left[ \partial_x f \, d_p x + \partial_p f + \lambda^T \left( \partial_x h \, d_p x + \partial_{\dot{x}} h \, d_p \dot{x} + \partial_p h \right) \right] \, dt + \mu^T \left( \partial_{x(0)} g \, d_p x(0) + \partial_p g \right)\]

Now, let's focus on the fourth term in the integrand:
\[\int_0^T \lambda^T \partial_{\dot{x}} h \, d_p \dot{x} \, dt\]

Noticing that
\[d_p \dot{x} = \frac{d}{dp} \left( \frac{dx}{dt} \right) = \frac{d}{dt} \left( \frac{dx}{dp} \right)\]

(which holds assuming continuous partial derivatives, so that the mixed derivatives commute), we perform integration by parts to obtain:
\[\lambda^T \partial_{\dot{x}} h \, d_p x \bigg|_0^T - \int_0^T \left[\dot{\lambda}^T \partial_{\dot{x}} h + \lambda^T d_t \partial_{\dot{x}} h \right] d_p x \, dt\]

Substituting this back into the original expression gives:

\[d_p \mathcal{L} = \int_0^T \left[ \left( \partial_x f + \lambda^T \partial_x h - \dot{\lambda}^T \partial_{\dot{x}} h - \lambda^T d_t \partial_{\dot{x}} h \right) d_p x + \partial_p f + \lambda^T \partial_p h \right] dt + \lambda^T \partial_{\dot{x}} h \, d_p x \bigg|_0^T + \mu^T \left( \partial_{x(0)} g \, d_p x(0) + \partial_p g \right)\]
Notice that the boundary term from the integration by parts splits into contributions at $ t = T $ and $ t = 0 $. Since $ \lambda, \mu $ are arbitrary, and to avoid computing the expensive Jacobian $ d_p x \vert_T $, we set:
\[\lambda(T) = 0\]

Similarly, we define:
\[\mu^T = \lambda^T(0) \, \partial_{\dot{x}} h(0) \, g_{x(0)}^{-1}\]

so that the two terms involving $ d_p x(0) $ cancel. To avoid computing any $ d_p x $ inside the integral, we select $ \lambda $ such that:
\[f_x + \lambda^T (h_x - d_t h_{\dot{x}}) - \dot{\lambda}^T h_{\dot{x}} = 0\]

Thus, we have:
\[d_p \mathcal{L} = \int_0^T \left[ f_p + \lambda^T h_p \right] \, dt + \lambda^T(0) \, h_{\dot{x}}(0) \, g_{x(0)}^{-1} \, g_p\]

Since the constraints $ h = 0 $ and $ g = 0 $ hold along any feasible trajectory, $ \mathcal{L} = F $ for every choice of $ \lambda $ and $ \mu $, so we may use the gradient of $ \mathcal{L} $ for descent:
\[d_p F = d_p \mathcal{L}\]

The procedure is therefore: solve the forward ODE for $ x(t) $; solve the adjoint ODE backward in time from $ \lambda(T) = 0 $; then evaluate the expression above to get the gradient.
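For the common explicit case $ h(x,\dot{x},p,t) = \dot{x} - \bar{h}(x,p,t) $, we have $ h_{\dot{x}} = I $ and $ h_x = -\partial_x \bar{h} $, so the adjoint equation reduces to (a standard simplification, written out here for reference):

\[\dot{\lambda} = (\partial_x f)^T - (\partial_x \bar{h})^T \lambda, \qquad \lambda(T) = 0\]

Below is a minimal numerical sketch of the whole procedure on the toy example from Section 1, checked against a finite-difference gradient. It assumes NumPy/SciPy; the helper names (`grad_adjoint`, `grad_fd`) are mine, not from the source papers.

```python
import numpy as np
from scipy.integrate import solve_ivp, trapezoid

T, x0 = 2.0, 1.5  # horizon and initial condition of the toy problem

def grad_adjoint(p, n=2001):
    """d_p F via the adjoint method for F = int_0^T x^2 dt, xdot = -p*x, x(0) = x0."""
    ts = np.linspace(0.0, T, n)
    # Forward pass: solve xdot = -p*x from x(0) = x0.
    x = solve_ivp(lambda t, x: -p * x, (0.0, T), [x0],
                  t_eval=ts, rtol=1e-10, atol=1e-12).y[0]
    x_of = lambda t: np.interp(t, ts, x)  # dense lookup for the backward pass
    # Backward pass: here f_x = 2x, h_x = p, h_xdot = 1, so the adjoint equation
    # f_x + lam*(h_x - d_t h_xdot) - lamdot*h_xdot = 0 reads lamdot = 2x + p*lam.
    lam = solve_ivp(lambda t, lam: 2.0 * x_of(t) + p * lam, (T, 0.0), [0.0],
                    t_eval=ts[::-1], rtol=1e-10, atol=1e-12).y[0][::-1]
    # Gradient: d_p F = int_0^T (f_p + lam*h_p) dt with f_p = 0 and h_p = x;
    # the boundary term vanishes here because g_p = 0.
    return trapezoid(lam * x, ts)

def grad_fd(p, eps=1e-6):
    """Central finite-difference check using the closed-form objective."""
    F = lambda q: x0**2 * (1.0 - np.exp(-2.0 * q * T)) / (2.0 * q)
    return (F(p + eps) - F(p - eps)) / (2.0 * eps)

print(grad_adjoint(0.8), grad_fd(0.8))  # both are approximately -1.457
```

Note the cost structure: one forward solve plus one backward solve, independent of the dimension of $ p $; this is the payoff of the adjoint method.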
3. Adjoint Method with Neural ODE
Reviewing the approach of the Neural ODE paper (Chen et al., 2018):
\[\mathcal{L}(z(t_1)) = \mathcal{L} \left( z(t_0) + \int_{t_0}^{t_1} f(z(t), t, \theta) \, dt \right)\]

Defining the adjoint state $ a(t) = \partial \mathcal{L} / \partial z(t) $, which evolves backward according to $ \dot{a}(t) = -a(t)^T \, \partial f / \partial z $, the derivation gives:
\[\nabla_\theta \mathcal{L} = -\int_{t_1}^{t_0} a(t)^T \, \frac{\partial f(z(t), t, \theta)}{\partial \theta} \, dt\]

This is the continuous adjoint method of Section 2 in different notation: $ a(t) $ plays the role of $ \lambda(t) $ and is integrated backward from the terminal condition $ a(t_1) = \partial \mathcal{L} / \partial z(t_1) $.
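As a sanity check, here is a minimal numerical sketch of this backward pass (my own illustration, not the paper's implementation), using a scalar linear vector field $ f(z,t,\theta) = \theta z $ and terminal loss $ \mathcal{L} = \tfrac{1}{2} z(t_1)^2 $, for which the exact gradient is known:

```python
import numpy as np
from scipy.integrate import solve_ivp

t0, t1, z0, theta = 0.0, 1.0, 1.2, 0.7

def f(z, t, th):  # dynamics: f(z, t, theta) = theta * z
    return th * z

# Forward pass; for this linear f, z(t1) = z0 * exp(theta * (t1 - t0)).
z1 = solve_ivp(lambda t, z: f(z, t, theta), (t0, t1), [z0],
               rtol=1e-10, atol=1e-12).y[0, -1]

# Backward pass: integrate the augmented state [z, a, dL/dtheta] from t1 to t0:
#   zdot = f,  adot = -a * df/dz = -a*theta,  d(dL/dtheta)/dt = -a * df/dtheta = -a*z,
# starting from z(t1), a(t1) = dL/dz(t1) = z(t1) (since L = z(t1)^2 / 2), and 0.
def aug(t, s):
    z, a, _ = s
    return [f(z, t, theta), -a * theta, -a * z]

_, _, dL_dtheta = solve_ivp(aug, (t1, t0), [z1, z1, 0.0],
                            rtol=1e-10, atol=1e-12).y[:, -1]

# Exact check: L = 0.5 * z0^2 * exp(2*theta*(t1 - t0)),
# so dL/dtheta = (t1 - t0) * z0^2 * exp(2*theta*(t1 - t0)).
print(dL_dtheta, (t1 - t0) * z0**2 * np.exp(2.0 * theta * (t1 - t0)))
```

Reconstructing $ z(t) $ backward jointly with $ a(t) $, as above, is the memory trick of the Neural ODE paper: no intermediate states from the forward pass need to be stored.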