Forward and Backward Propagation in a Multilayer Perceptron (MLP)

Problem Statement
We want to derive the forward and backward propagation of a multilayer perceptron (MLP) with a single hidden layer. The network structure is: an input layer with 2 neurons (\(x_1, x_2\)), a hidden layer with 3 neurons (Sigmoid activation), and an output layer with 1 neuron (Sigmoid activation). The loss function is the mean squared error (MSE). Compute the forward-pass output step by step, and derive the gradients of all weights and biases in the backward pass.


Solution
1. Network Structure and Notation

  • Input: \(\mathbf{x} = [x_1, x_2]\)
  • Hidden-layer weights: \(\mathbf{W}^{(1)} = \begin{bmatrix} w_{11}^{(1)} & w_{12}^{(1)} \\ w_{21}^{(1)} & w_{22}^{(1)} \\ w_{31}^{(1)} & w_{32}^{(1)} \end{bmatrix}\), biases: \(\mathbf{b}^{(1)} = [b_1^{(1)}, b_2^{(1)}, b_3^{(1)}]\)
  • Output-layer weights: \(\mathbf{W}^{(2)} = [w_{11}^{(2)}, w_{12}^{(2)}, w_{13}^{(2)}]\), bias: \(b^{(2)}\)
  • Ground-truth label: \(y\)
  • Sigmoid function: \(\sigma(z) = \frac{1}{1+e^{-z}}\), with derivative \(\sigma'(z) = \sigma(z)(1-\sigma(z))\)
  • Loss function: \(L = \frac{1}{2}(y - \hat{y})^2\) (MSE; the factor \(\frac{1}{2}\) cancels the constant produced by differentiating the square).
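To ground this notation, here is a minimal NumPy sketch that builds the same network. The variable names (`W1`, `b1`, `W2`, `b2`), the random initialization, and the example input/target values are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)            # arbitrary seed

# Shapes follow the notation above: W1 is 3x2, b1 has length 3,
# W2 is a 1x3 row vector, b2 is a scalar.
W1 = rng.normal(scale=0.5, size=(3, 2))   # hidden-layer weights W^(1)
b1 = np.zeros(3)                          # hidden-layer biases  b^(1)
W2 = rng.normal(scale=0.5, size=(1, 3))   # output-layer weights W^(2)
b2 = 0.0                                  # output-layer bias    b^(2)

x = np.array([0.5, -1.0])                 # example input  (assumed values)
y = 1.0                                   # example target (assumed value)
```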

2. Forward Propagation
Step 1: Input Layer to Hidden Layer

  • Hidden-layer weighted input:

\[ \mathbf{z}^{(1)} = \mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)} = \begin{bmatrix} w_{11}^{(1)}x_1 + w_{12}^{(1)}x_2 + b_1^{(1)} \\ w_{21}^{(1)}x_1 + w_{22}^{(1)}x_2 + b_2^{(1)} \\ w_{31}^{(1)}x_1 + w_{32}^{(1)}x_2 + b_3^{(1)} \end{bmatrix} \]

  • Hidden-layer output (after activation):

\[ \mathbf{a}^{(1)} = \sigma(\mathbf{z}^{(1)}) = \begin{bmatrix} \sigma(z_1^{(1)}) \\ \sigma(z_2^{(1)}) \\ \sigma(z_3^{(1)}) \end{bmatrix} \]
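In the sketch started above, this step amounts to two lines:

```python
z1 = W1 @ x + b1    # z^(1): weighted inputs, shape (3,)
a1 = sigmoid(z1)    # a^(1): hidden activations, shape (3,)
```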

Step 2: Hidden Layer to Output Layer

  • Output-layer weighted input:

\[ z^{(2)} = \mathbf{W}^{(2)} \mathbf{a}^{(1)} + b^{(2)} = w_{11}^{(2)}a_1^{(1)} + w_{12}^{(2)}a_2^{(1)} + w_{13}^{(2)}a_3^{(1)} + b^{(2)} \]

  • Final output:

\[ \hat{y} = a^{(2)} = \sigma(z^{(2)}) \]

  • Loss value:

\[ L = \frac{1}{2}(y - \hat{y})^2 \]
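The output layer and the loss complete the forward pass in the sketch; the `.item()` call just converts the 1-element result of `W2 @ a1` to a Python scalar.

```python
z2 = (W2 @ a1 + b2).item()     # z^(2): scalar weighted input
y_hat = sigmoid(z2)            # prediction y-hat = a^(2)
loss = 0.5 * (y - y_hat) ** 2  # MSE loss L
```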


3. Backward Propagation (Gradients via the Chain Rule)
Goal: compute the gradients of the loss with respect to all weights and biases, \(\frac{\partial L}{\partial \mathbf{W}^{(2)}}, \frac{\partial L}{\partial b^{(2)}}, \frac{\partial L}{\partial \mathbf{W}^{(1)}}, \frac{\partial L}{\partial \mathbf{b}^{(1)}}\).

Step 1: Output-Layer Gradients

  • Gradient of the loss with respect to the output-layer weighted input:

\[ \delta^{(2)} = \frac{\partial L}{\partial z^{(2)}} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z^{(2)}} = (\hat{y} - y) \cdot \sigma'(z^{(2)}) \]

Substituting the Sigmoid derivative \(\sigma'(z^{(2)}) = \hat{y}(1-\hat{y})\) gives

\[ \delta^{(2)} = (\hat{y} - y) \cdot \hat{y}(1-\hat{y}) \]

  • Weight gradients:

\[ \frac{\partial L}{\partial w_{1j}^{(2)}} = \delta^{(2)} \cdot a_j^{(1)} \quad (j=1,2,3), \quad \frac{\partial L}{\partial \mathbf{W}^{(2)}} = \delta^{(2)} \mathbf{a}^{(1)^T} \]

  • Bias gradient:

\[ \frac{\partial L}{\partial b^{(2)}} = \delta^{(2)} \]
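In code, these three quantities follow directly from the formulas above (continuing the same sketch; `grad_W2` and `grad_b2` are illustrative names):

```python
delta2 = (y_hat - y) * y_hat * (1.0 - y_hat)  # dL/dz^(2), a scalar
grad_W2 = delta2 * a1.reshape(1, 3)           # dL/dW^(2) = delta2 * a^(1)T, shape (1, 3)
grad_b2 = delta2                              # dL/db^(2)
```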

Step 2: Hidden-Layer Gradients

  • Gradient of the loss with respect to the hidden-layer weighted inputs (all 3 hidden neurons must be accounted for):

\[ \delta^{(1)} = \frac{\partial L}{\partial \mathbf{z}^{(1)}} = \left( \mathbf{W}^{(2)^T} \delta^{(2)} \right) \odot \sigma'(\mathbf{z}^{(1)}) \]

where \(\odot\) denotes element-wise multiplication. Substituting \(\sigma'(\mathbf{z}^{(1)}) = \mathbf{a}^{(1)} \odot (1 - \mathbf{a}^{(1)})\) gives

\[ \delta^{(1)} = \begin{bmatrix} w_{11}^{(2)} \delta^{(2)} \\ w_{12}^{(2)} \delta^{(2)} \\ w_{13}^{(2)} \delta^{(2)} \end{bmatrix} \odot \begin{bmatrix} a_1^{(1)}(1-a_1^{(1)}) \\ a_2^{(1)}(1-a_2^{(1)}) \\ a_3^{(1)}(1-a_3^{(1)}) \end{bmatrix} \]

  • Weight gradients:

\[ \frac{\partial L}{\partial w_{ij}^{(1)}} = \delta_i^{(1)} \cdot x_j \quad (i=1,2,3; j=1,2), \quad \frac{\partial L}{\partial \mathbf{W}^{(1)}} = \delta^{(1)} \mathbf{x}^T \]

  • Bias gradients:

\[ \frac{\partial L}{\partial b_i^{(1)}} = \delta_i^{(1)} \quad (i=1,2,3), \quad \frac{\partial L}{\partial \mathbf{b}^{(1)}} = \delta^{(1)} \]
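The hidden-layer gradients in the sketch, using `np.outer` for \(\delta^{(1)} \mathbf{x}^T\) and `ravel()` to treat the 1×3 row vector \(\mathbf{W}^{(2)}\) as a flat vector:

```python
# Propagate the output error back through W^(2), then apply the
# element-wise Sigmoid derivative a^(1) * (1 - a^(1)).
delta1 = (W2.ravel() * delta2) * a1 * (1.0 - a1)  # dL/dz^(1), shape (3,)
grad_W1 = np.outer(delta1, x)                     # dL/dW^(1) = delta1 x^T, shape (3, 2)
grad_b1 = delta1                                  # dL/db^(1), shape (3,)
```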


4. Parameter Update
Using gradient descent with learning rate \(\eta\):

  • Output layer:

\[ \mathbf{W}^{(2)} \leftarrow \mathbf{W}^{(2)} - \eta \frac{\partial L}{\partial \mathbf{W}^{(2)}}, \quad b^{(2)} \leftarrow b^{(2)} - \eta \delta^{(2)} \]

  • Hidden layer:

\[ \mathbf{W}^{(1)} \leftarrow \mathbf{W}^{(1)} - \eta \frac{\partial L}{\partial \mathbf{W}^{(1)}}, \quad \mathbf{b}^{(1)} \leftarrow \mathbf{b}^{(1)} - \eta \delta^{(1)} \]
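The update rule in the sketch; the learning rate value is an arbitrary assumption:

```python
eta = 0.1            # learning rate (assumed value)
W2 -= eta * grad_W2  # output-layer weight update
b2 -= eta * grad_b2  # output-layer bias update
W1 -= eta * grad_W1  # hidden-layer weight update
b1 -= eta * grad_b1  # hidden-layer bias update
```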

Summary: the forward pass computes the prediction; the backward pass computes the error layer by layer and propagates gradients backward; the parameters are then updated. The same procedure extends to deeper networks.
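As a sanity check on the whole derivation, the analytic gradients can be compared against central finite differences, a standard gradient check. Below is a sketch under the same assumed setup; run it before the Section 4 update, so that `grad_W1` still corresponds to the current `W1`.

```python
def forward_loss(W1, b1, W2, b2, x, y):
    """Recompute the loss for the given parameters (forward pass only)."""
    a1 = sigmoid(W1 @ x + b1)
    y_hat = sigmoid((W2 @ a1 + b2).item())
    return 0.5 * (y - y_hat) ** 2

# Central finite difference on a single weight, w_11^(1).
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
W1m = W1.copy(); W1m[0, 0] -= eps
numeric = (forward_loss(W1p, b1, W2, b2, x, y)
           - forward_loss(W1m, b1, W2, b2, x, y)) / (2.0 * eps)
print(numeric, grad_W1[0, 0])   # the two numbers should agree closely
```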
