A Basic Problem on Differential Dynamic Programming (DDP) in Nonlinear Programming

Problem statement:
Consider a discrete-time nonlinear optimal control problem:
Minimize the cost function J = φ(x_N) + Σ_{k=0}^{N-1} L(x_k, u_k)
subject to the system dynamics x_{k+1} = f(x_k, u_k), where x_k ∈ R^n is the state vector and u_k ∈ R^m is the control vector.

The concrete instance is:
System dynamics: x_{k+1} = x_k + u_k
Cost function: J = 1/2 * x_N^2 + 1/2 * Σ_{k=0}^{N-1} (x_k^2 + u_k^2)
Boundary conditions: x_0 = 1, N = 3

Solution:

  1. Algorithm overview
    Differential dynamic programming is a local second-order approximation of Bellman's dynamic programming principle. The core idea is to iterate between a backward pass and a forward pass, expanding the problem to second order around a nominal trajectory and improving the control policy step by step.

  2. Initialize the nominal trajectory
    First choose an initial control sequence, typically all zeros:
    u₀⁽⁰⁾ = 0, u₁⁽⁰⁾ = 0, u₂⁽⁰⁾ = 0
    Forward simulation then gives the initial state trajectory (see the sketch after this list):
    x₀ = 1
    x₁ = x₀ + u₀ = 1 + 0 = 1
    x₂ = x₁ + u₁ = 1 + 0 = 1
    x₃ = x₂ + u₂ = 1 + 0 = 1
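A minimal Python sketch of this nominal rollout (the names x_nom and u_nom are my own, not from the problem statement):

```python
# Nominal rollout for x_{k+1} = x_k + u_k with all-zero controls.
x0, N = 1.0, 3
u_nom = [0.0] * N                        # initial control sequence u^(0)
x_nom = [x0]
for k in range(N):
    x_nom.append(x_nom[k] + u_nom[k])    # forward simulate the dynamics
print(x_nom)                             # [1.0, 1.0, 1.0, 1.0]
```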

  3. Backward pass
    Starting from the final stage, work backward in time:

Step 3.1: Terminal condition (k=3)
Value function: V₃(x) = φ(x₃) = 1/2 * x₃²
First derivative: V₃ₓ = x₃ = 1
Second derivative: V₃ₓₓ = 1

Step 3.2: Time step k=2
Expand the system dynamics and the cost function:

  • Dynamics: f₂ = x₂ + u₂, f₂ₓ = 1, f₂ᵤ = 1
  • Stage cost: L₂ = 1/2(x₂² + u₂²), L₂ₓ = x₂ = 1, L₂ᵤ = u₂ = 0, L₂ₓₓ = 1, L₂ᵤᵤ = 1, L₂ₓᵤ = 0

Compute the Q-function and its derivatives:
Q₂ₓ = L₂ₓ + f₂ₓ·V₃ₓ = 1 + 1×1 = 2
Q₂ᵤ = L₂ᵤ + f₂ᵤ·V₃ₓ = 0 + 1×1 = 1
Q₂ₓₓ = L₂ₓₓ + f₂ₓ²·V₃ₓₓ = 1 + 1²×1 = 2
Q₂ᵤᵤ = L₂ᵤᵤ + f₂ᵤ²·V₃ₓₓ = 1 + 1²×1 = 2
Q₂ₓᵤ = L₂ₓᵤ + f₂ₓ·f₂ᵤ·V₃ₓₓ = 0 + 1×1×1 = 1
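These scalar formulas are easy to check numerically; a minimal sketch, assuming the derivatives fx = fu = 1, Lxx = Luu = 1, Lxu = 0 computed above:

```python
# Q-expansion at k = 2 around the nominal point (x2, u2) = (1, 0).
x2, u2 = 1.0, 0.0
Vx, Vxx = 1.0, 1.0             # terminal values V3x, V3xx
fx = fu = 1.0
Qx  = x2 + fx * Vx             # Lx = x2
Qu  = u2 + fu * Vx             # Lu = u2
Qxx = 1.0 + fx * Vxx * fx
Quu = 1.0 + fu * Vxx * fu
Qxu = 0.0 + fx * Vxx * fu
print(Qx, Qu, Qxx, Quu, Qxu)   # 2.0 1.0 2.0 2.0 1.0
```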

Compute the control update:
δu₂* = -Q₂ᵤᵤ⁻¹·Q₂ᵤ = -1/2 × 1 = -0.5
K₂ = -Q₂ᵤᵤ⁻¹·Q₂ₓᵤ = -1/2 × 1 = -0.5

Update the value function (eliminating δu from the quadratic model gives Vₓ = Qₓ - Qₓᵤ·Qᵤᵤ⁻¹·Qᵤ and Vₓₓ = Qₓₓ - Qₓᵤ·Qᵤᵤ⁻¹·Qᵤₓ):
V₂ₓ = Q₂ₓ - Q₂ₓᵤ·Q₂ᵤᵤ⁻¹·Q₂ᵤ = 2 - 1×(1/2)×1 = 1.5
V₂ₓₓ = Q₂ₓₓ - Q₂ₓᵤ·Q₂ᵤᵤ⁻¹·Q₂ᵤₓ = 2 - 1×(1/2)×1 = 1.5
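In scalar form the gains and value updates reduce to a few divisions; a sketch that reproduces the k = 2 numbers:

```python
# Feedforward, feedback gain, and value update at k = 2 (scalar case).
Qx, Qu, Qxx, Quu, Qxu = 2.0, 1.0, 2.0, 2.0, 1.0
du = -Qu / Quu                 # feedforward term: -0.5
K  = -Qxu / Quu                # feedback gain:    -0.5
Vx  = Qx  - Qxu * Qu  / Quu    # 1.5
Vxx = Qxx - Qxu * Qxu / Quu    # 1.5
print(du, K, Vx, Vxx)
```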

Step 3.3: Time step k=1 (same computation)
Q₁ₓ = x₁ + 1×V₂ₓ = 1 + 1×1.5 = 2.5
Q₁ᵤ = u₁ + 1×V₂ₓ = 0 + 1×1.5 = 1.5
Q₁ₓₓ = 1 + 1²×V₂ₓₓ = 1 + 1×1.5 = 2.5
Q₁ᵤᵤ = 1 + 1²×V₂ₓₓ = 1 + 1×1.5 = 2.5
Q₁ₓᵤ = 0 + 1×1×V₂ₓₓ = 1.5

δu₁* = -1/2.5 × 1.5 = -0.6
K₁ = -1/2.5 × 1.5 = -0.6

V₁ₓ = Q₁ₓ - Q₁ₓᵤ·Q₁ᵤᵤ⁻¹·Q₁ᵤ = 2.5 - 1.5×(1/2.5)×1.5 = 1.6
V₁ₓₓ = Q₁ₓₓ - Q₁ₓᵤ·Q₁ᵤᵤ⁻¹·Q₁ᵤₓ = 2.5 - 1.5×(1/2.5)×1.5 = 1.6

Step 3.4: Time step k=0
Q₀ₓ = x₀ + 1×V₁ₓ = 1 + 1×1.6 = 2.6
Q₀ᵤ = u₀ + 1×V₁ₓ = 0 + 1×1.6 = 1.6
Q₀ₓₓ = 1 + 1²×V₁ₓₓ = 1 + 1×1.6 = 2.6
Q₀ᵤᵤ = 1 + 1²×V₁ₓₓ = 1 + 1×1.6 = 2.6
Q₀ₓᵤ = 0 + 1×1×V₁ₓₓ = 1.6

δu₀* = -1/2.6 × 1.6 ≈ -0.615
K₀ = -1/2.6 × 1.6 ≈ -0.615
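The whole backward pass fits in a short loop. The sketch below hard-codes the scalar derivatives of this specific problem (so it is not a general DDP implementation) and reproduces Steps 3.1-3.4:

```python
# Backward pass over k = 2, 1, 0 for the scalar problem.
x_nom = [1.0, 1.0, 1.0, 1.0]
u_nom = [0.0, 0.0, 0.0]
Vx, Vxx = x_nom[3], 1.0              # terminal condition (Step 3.1)
du, K = [0.0] * 3, [0.0] * 3
for k in reversed(range(3)):
    Qx  = x_nom[k] + Vx              # fx = 1
    Qu  = u_nom[k] + Vx              # fu = 1
    Qxx = 1.0 + Vxx
    Quu = 1.0 + Vxx
    Qxu = Vxx
    du[k] = -Qu / Quu                # feedforward
    K[k]  = -Qxu / Quu               # feedback gain
    Vx  = Qx  - Qxu * Qu  / Quu      # value-function update
    Vxx = Qxx - Qxu * Qxu / Quu
print(du)   # [-0.615..., -0.6, -0.5]
print(K)    # [-0.615..., -0.6, -0.5]
```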

  4. Forward pass
    Re-simulate the system with the updated control policy (see the sketch after this list):
    x₀ = 1
    u₀ = u₀⁽⁰⁾ + δu₀* = 0 - 0.615 = -0.615
    x₁ = x₀ + u₀ = 1 - 0.615 = 0.385
    u₁ = u₁⁽⁰⁾ + δu₁* + K₁(x₁ - x₁⁽⁰⁾) = 0 - 0.6 - 0.6×(0.385 - 1) = -0.6 + 0.369 = -0.231
    x₂ = x₁ + u₁ = 0.385 - 0.231 = 0.154
    u₂ = u₂⁽⁰⁾ + δu₂* + K₂(x₂ - x₂⁽⁰⁾) = 0 - 0.5 - 0.5×(0.154 - 1) = -0.5 + 0.423 = -0.077
    x₃ = x₂ + u₂ = 0.154 - 0.077 = 0.077
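A sketch of this forward pass, with the feedforward terms and gains taken from the backward pass above (exact fractions are used for k = 0):

```python
# Forward pass: apply u_k = u_nom[k] + du[k] + K[k] * (x_k - x_nom[k]).
x_nom = [1.0, 1.0, 1.0, 1.0]
u_nom = [0.0, 0.0, 0.0]
du = [-8/13, -0.6, -0.5]             # feedforward terms (-8/13 ≈ -0.615)
K  = [-8/13, -0.6, -0.5]             # feedback gains
x, u = [1.0], []
for k in range(3):
    u.append(u_nom[k] + du[k] + K[k] * (x[k] - x_nom[k]))
    x.append(x[k] + u[k])
print(x)   # [1.0, 0.3846..., 0.1538..., 0.0769...]
print(u)   # [-0.6153..., -0.2307..., -0.0769...]
```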

  5. Convergence check
    New cost: J = 1/2×0.077² + 1/2×[1² + 0.385² + 0.154² + 0.615² + 0.231² + 0.077²] ≈ 0.808
    Initial cost: J⁽⁰⁾ = 1/2×1² + 1/2×[1² + 1² + 1² + 0 + 0 + 0] = 2.0
    The cost drops from 2.0 to about 0.808. In general one keeps iterating until the improvement falls below a tolerance; the helper below compares both costs numerically.
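A small sketch of the cost evaluation (the helper name cost is my own):

```python
# Evaluate J = 1/2 x_N^2 + 1/2 sum_k (x_k^2 + u_k^2); zip pairs x_0..x_2 with u_0..u_2.
def cost(x, u):
    return 0.5 * x[-1] ** 2 + 0.5 * sum(xk ** 2 + uk ** 2 for xk, uk in zip(x, u))

print(cost([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0]))           # 2.0     (initial)
print(cost([1.0, 5/13, 2/13, 1/13], [-8/13, -3/13, -1/13]))  # ≈0.8077 (one DDP step)
```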

In fact, because this instance has linear dynamics and a quadratic cost, the second-order expansion is exact and DDP converges in a single iteration: the trajectory above is already optimal, driving the terminal state close to zero with minimal control effort. For a genuinely nonlinear problem, the backward and forward passes would be repeated until convergence.
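As an independent cross-check (a sketch; the identification A = B = Q = R = 1 with standard LQR notation is my own, not part of the problem statement), the discrete Riccati recursion for the equivalent LQR problem predicts the same optimal cost:

```python
# Discrete Riccati recursion P_k = Q + A P A - A P B (R + B P B)^{-1} B P A, scalar case.
P = 1.0                                # P_3 = 1 from the terminal cost
for _ in range(3):
    P = 1.0 + P - P * P / (1.0 + P)
print(0.5 * P)                         # optimal cost 1/2 P_0 x_0^2 ≈ 0.8077
```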
