A Basic Exercise on the Differential Dynamic Programming (DDP) Algorithm in Nonlinear Optimal Control
Problem statement:
Consider the discrete-time nonlinear optimal control problem:
Minimize the cost function J = φ(x_N) + Σ_{k=0}^{N-1} L(x_k, u_k)
subject to the system dynamics x_{k+1} = f(x_k, u_k), where x_k ∈ R^n is the state vector and u_k ∈ R^m is the control vector.
Assume the concrete problem instance:
System dynamics: x_{k+1} = x_k + u_k
Cost function: J = 1/2 * x_N^2 + 1/2 * Σ_{k=0}^{N-1} (x_k^2 + u_k^2)
Boundary conditions: x_0 = 1, N = 3
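For reference, the problem data can also be written down in code. The sketch below assumes the scalar setup above (n = m = 1, N = 3); the names `f`, `running_cost`, and `terminal_cost` are illustrative and not part of the original problem statement.

```python
# Problem setup for the scalar example: x_{k+1} = x_k + u_k,
# J = 1/2 x_N^2 + 1/2 * sum_k (x_k^2 + u_k^2), x_0 = 1, N = 3.

N = 3          # horizon length
x0 = 1.0       # initial state

def f(x, u):
    """System dynamics x_{k+1} = f(x_k, u_k)."""
    return x + u

def running_cost(x, u):
    """Stage cost L(x_k, u_k) = 1/2 (x^2 + u^2)."""
    return 0.5 * (x**2 + u**2)

def terminal_cost(x):
    """Terminal cost phi(x_N) = 1/2 x^2."""
    return 0.5 * x**2
```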
Solution:
1. Algorithm overview
Differential Dynamic Programming is a local second-order approximation method built on Bellman's principle of dynamic programming. The core idea is to alternate backward and forward passes: the cost-to-go is expanded to second order around the current nominal trajectory, and the control policy is improved iteratively.
2. Initialize the nominal trajectory
First choose an initial control sequence; a common choice is all-zero controls:
u₀⁽⁰⁾ = 0, u₁⁽⁰⁾ = 0, u₂⁽⁰⁾ = 0
Forward simulation then gives the initial state trajectory:
x₀ = 1
x₁ = x₀ + u₀ = 1 + 0 = 1
x₂ = x₁ + u₁ = 1 + 0 = 1
x₃ = x₂ + u₂ = 1 + 0 = 1
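The nominal rollout can be reproduced with a few lines of Python. This is a sketch under the same scalar setup; `u_nom` and `x_nom` are illustrative names.

```python
# Roll out the nominal trajectory under the initial (all-zero) controls.
N = 3
x0 = 1.0
u_nom = [0.0, 0.0, 0.0]              # u_0, u_1, u_2

x_nom = [x0]
for k in range(N):
    x_nom.append(x_nom[k] + u_nom[k])   # x_{k+1} = x_k + u_k

print(x_nom)   # [1.0, 1.0, 1.0, 1.0]
```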
3. Backward Pass (反向传播)
Work backward from the final time step:
Step 3.1: Terminal condition (k = 3)
Value function: V₃(x) = φ(x₃) = 1/2 * x₃²
First derivative: V₃ₓ = x₃ = 1
Second derivative: V₃ₓₓ = 1
Step 3.2: Time step k = 2
Expand the dynamics and the stage cost:
- Dynamics: f₂ = x₂ + u₂, f₂ₓ = 1, f₂ᵤ = 1
- Stage cost: L₂ = 1/2(x₂² + u₂²), L₂ₓ = x₂ = 1, L₂ᵤ = u₂ = 0, L₂ₓₓ = 1, L₂ᵤᵤ = 1, L₂ₓᵤ = 0
Compute the Q-function derivatives:
Q₂ₓ = L₂ₓ + f₂ₓ·V₃ₓ = 1 + 1×1 = 2
Q₂ᵤ = L₂ᵤ + f₂ᵤ·V₃ₓ = 0 + 1×1 = 1
Q₂ₓₓ = L₂ₓₓ + f₂ₓ²·V₃ₓₓ = 1 + 1²×1 = 2
Q₂ᵤᵤ = L₂ᵤᵤ + f₂ᵤ²·V₃ₓₓ = 1 + 1²×1 = 2
Q₂ₓᵤ = L₂ₓᵤ + f₂ₓ·f₂ᵤ·V₃ₓₓ = 0 + 1×1×1 = 1
Compute the control update:
δu₂* = -Q₂ᵤᵤ⁻¹·Q₂ᵤ = -1/2 × 1 = -0.5
K₂ = -Q₂ᵤᵤ⁻¹·Q₂ₓᵤ = -1/2 × 1 = -0.5
Update the value function, using Vₓ = Qₓ + Qₓᵤ·δu* = Qₓ - Qₓᵤ·Qᵤᵤ⁻¹·Qᵤ and Vₓₓ = Qₓₓ + Qₓᵤ·K = Qₓₓ - Qₓᵤ·Qᵤᵤ⁻¹·Qᵤₓ:
V₂ₓ = Q₂ₓ + Q₂ₓᵤ·δu₂* = 2 + 1×(-0.5) = 1.5
V₂ₓₓ = Q₂ₓₓ + Q₂ₓᵤ·K₂ = 2 + 1×(-0.5) = 1.5
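The k = 2 calculations can be checked numerically with a short script. This is only a sketch of the scalar case; the variable names are chosen here for illustration.

```python
# Backward step at k = 2 on the nominal trajectory (x_2 = 1, u_2 = 0).
x2, u2 = 1.0, 0.0
V_x, V_xx = 1.0, 1.0          # derivatives of V_3 evaluated at x_3 = 1

fx, fu = 1.0, 1.0             # derivatives of f(x, u) = x + u
Lx, Lu = x2, u2               # derivatives of 1/2 (x^2 + u^2)
Lxx, Luu, Lxu = 1.0, 1.0, 0.0

Qx  = Lx  + fx * V_x          # 2.0
Qu  = Lu  + fu * V_x          # 1.0
Qxx = Lxx + fx * V_xx * fx    # 2.0
Quu = Luu + fu * V_xx * fu    # 2.0
Qxu = Lxu + fx * V_xx * fu    # 1.0

du2 = -Qu / Quu               # feedforward term: -0.5
K2  = -Qxu / Quu              # feedback gain:    -0.5

V2_x  = Qx  - Qxu * Qu  / Quu   # 1.5
V2_xx = Qxx - Qxu * Qxu / Quu   # 1.5
print(du2, K2, V2_x, V2_xx)
```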
Step 3.3: Time step k = 1 (analogous computation)
Q₁ₓ = x₁ + 1×V₂ₓ = 1 + 1×1.5 = 2.5
Q₁ᵤ = u₁ + 1×V₂ₓ = 0 + 1×1.5 = 1.5
Q₁ₓₓ = 1 + 1²×V₂ₓₓ = 1 + 1×1.5 = 2.5
Q₁ᵤᵤ = 1 + 1²×V₂ₓₓ = 1 + 1×1.5 = 2.5
Q₁ₓᵤ = 0 + 1×1×V₂ₓₓ = 1.5
δu₁* = -1/2.5 × 1.5 = -0.6
K₁ = -1/2.5 × 1.5 = -0.6
V₁ₓ = Q₁ₓ + Q₁ₓᵤ·δu₁* = 2.5 + 1.5×(-0.6) = 1.6
V₁ₓₓ = Q₁ₓₓ + Q₁ₓᵤ·K₁ = 2.5 + 1.5×(-0.6) = 1.6
Step 3.4: Time step k = 0
Q₀ₓ = x₀ + 1×V₁ₓ = 1 + 1×1.6 = 2.6
Q₀ᵤ = u₀ + 1×V₁ₓ = 0 + 1×1.6 = 1.6
Q₀ₓₓ = 1 + 1²×V₁ₓₓ = 1 + 1×1.6 = 2.6
Q₀ᵤᵤ = 1 + 1²×V₁ₓₓ = 1 + 1×1.6 = 2.6
Q₀ₓᵤ = 0 + 1×1×V₁ₓₓ = 1.6
δu₀* = -1/2.6 × 1.6 ≈ -0.615
K₀ = -1/2.6 × 1.6 ≈ -0.615
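The entire backward recursion (k = 2, 1, 0) condenses into one loop. The sketch below reproduces the feedforward terms and feedback gains derived above (scalar case; names are illustrative).

```python
# Backward pass along the nominal trajectory x_nom = [1, 1, 1, 1], u_nom = [0, 0, 0].
N = 3
x_nom = [1.0, 1.0, 1.0, 1.0]
u_nom = [0.0, 0.0, 0.0]

V_x, V_xx = x_nom[N], 1.0        # terminal: V_N = 1/2 x^2, so V_x = x_N, V_xx = 1
du = [0.0] * N                   # feedforward terms delta_u_k*
K  = [0.0] * N                   # feedback gains K_k

for k in reversed(range(N)):
    x, u = x_nom[k], u_nom[k]
    fx = fu = 1.0                                 # derivatives of f(x, u) = x + u
    Qx  = x   + fx * V_x
    Qu  = u   + fu * V_x
    Qxx = 1.0 + fx * V_xx * fx
    Quu = 1.0 + fu * V_xx * fu
    Qxu = 0.0 + fx * V_xx * fu
    du[k] = -Qu / Quu
    K[k]  = -Qxu / Quu
    V_x  = Qx  - Qxu * Qu  / Quu                  # value-function update
    V_xx = Qxx - Qxu * Qxu / Quu

print(du)   # approx [-0.615, -0.6, -0.5]
print(K)    # approx [-0.615, -0.6, -0.5]
```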
4. Forward Pass (正向传播)
Re-simulate the system with the updated policy u_k = u_k⁽⁰⁾ + δu_k* + K_k·(x_k - x_k⁽⁰⁾):
x₀ = 1
u₀ = u₀⁽⁰⁾ + δu₀* + K₀(x₀ - x₀⁽⁰⁾) = 0 - 0.615 + 0 = -0.615
x₁ = x₀ + u₀ = 1 - 0.615 = 0.385
u₁ = u₁⁽⁰⁾ + δu₁* + K₁(x₁ - x₁⁽⁰⁾) = 0 - 0.6 - 0.6×(0.385 - 1) = -0.6 + 0.369 = -0.231
x₂ = x₁ + u₁ = 0.385 - 0.231 = 0.154
u₂ = u₂⁽⁰⁾ + δu₂* + K₂(x₂ - x₂⁽⁰⁾) = 0 - 0.5 - 0.5×(0.154 - 1) = -0.5 + 0.423 = -0.077
x₃ = x₂ + u₂ = 0.154 - 0.077 = 0.077
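The forward pass with this affine policy can be sketched as follows, reusing the feedforward terms and gains from the backward pass (names are illustrative).

```python
# Forward pass: apply u_k = u_nom[k] + du[k] + K[k] * (x_k - x_nom[k]).
N = 3
x_nom = [1.0, 1.0, 1.0, 1.0]
u_nom = [0.0, 0.0, 0.0]
du = [-1.6/2.6, -0.6, -0.5]      # feedforward terms from the backward pass
K  = [-1.6/2.6, -0.6, -0.5]      # feedback gains from the backward pass

x_new = [1.0]
u_new = []
for k in range(N):
    u = u_nom[k] + du[k] + K[k] * (x_new[k] - x_nom[k])
    u_new.append(u)
    x_new.append(x_new[k] + u)   # x_{k+1} = x_k + u_k

print(u_new)   # approx [-0.615, -0.231, -0.077]
print(x_new)   # approx [1.0, 0.385, 0.154, 0.077]
```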
5. Convergence check
New cost: J = 1/2×0.077² + 1/2×[1² + 0.385² + 0.154² + 0.615² + 0.231² + 0.077²] ≈ 0.808
Initial cost: J⁽⁰⁾ = 1/2×1² + 1/2×[1² + 1² + 1² + 0 + 0 + 0] = 2.0
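The cost comparison can be reproduced with a small helper. This is a sketch; the stopping threshold 1e-6 is an arbitrary illustrative choice, not part of the original problem.

```python
# Evaluate J for a trajectory and compare the old and new costs.
def total_cost(xs, us):
    """J = 1/2 x_N^2 + 1/2 * sum_k (x_k^2 + u_k^2)."""
    stage = sum(0.5 * (x**2 + u**2) for x, u in zip(xs[:-1], us))
    return stage + 0.5 * xs[-1]**2

x_old, u_old = [1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
x_new = [1.0, 0.3846, 0.1538, 0.0769]
u_new = [-0.6154, -0.2308, -0.0769]

J_old, J_new = total_cost(x_old, u_old), total_cost(x_new, u_new)
print(J_old, J_new)              # 2.0 and approx 0.808

# A simple stopping rule: declare convergence when the improvement is negligible.
# Here the first iteration still improves the cost a lot, so the test fails;
# on the next iteration the update is zero and the test would pass.
converged = abs(J_old - J_new) < 1e-6
print(converged)                 # False after this first iteration
```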
The cost improves from 2.0 to about 0.808. Because the dynamics are linear and the cost is quadratic, the second-order expansions used by DDP are exact here, so this single iteration already yields the optimal control sequence; a further backward/forward pass produces δu = 0 and confirms convergence.
For a genuinely nonlinear problem, the backward and forward passes are repeated until the cost stops improving, converging to a control sequence that drives the terminal state toward zero while keeping the control effort small.