Forward and Backward Propagation in a Multilayer Perceptron (MLP)
Problem Statement
We derive the forward and backward passes of a multilayer perceptron (MLP) with a single hidden layer. The architecture is: an input layer with 2 neurons (\(x_1, x_2\)), a hidden layer with 3 neurons (Sigmoid activation), and an output layer with 1 neuron (Sigmoid activation). The loss function is the mean squared error (MSE). Compute the forward-pass output step by step, and derive the gradients of all weights and biases in the backward pass.
Solution
1. Network Structure and Notation
- Input (column vector): \(\mathbf{x} = [x_1, x_2]^T\)
- Hidden-layer weights: \(\mathbf{W}^{(1)} = \begin{bmatrix} w_{11}^{(1)} & w_{12}^{(1)} \\ w_{21}^{(1)} & w_{22}^{(1)} \\ w_{31}^{(1)} & w_{32}^{(1)} \end{bmatrix}\) (shape 3×2), biases: \(\mathbf{b}^{(1)} = [b_1^{(1)}, b_2^{(1)}, b_3^{(1)}]^T\)
- Output-layer weights: \(\mathbf{W}^{(2)} = [w_{11}^{(2)}, w_{12}^{(2)}, w_{13}^{(2)}]\) (a 1×3 row vector), bias: \(b^{(2)}\)
- True label: \(y\)
- Sigmoid function: \(\sigma(z) = \frac{1}{1+e^{-z}}\), with derivative \(\sigma'(z) = \sigma(z)(1-\sigma(z))\)
- Loss function: \(L = \frac{1}{2}(y - \hat{y})^2\) (MSE; the factor \(\frac{1}{2}\) cancels the 2 from the exponent when differentiating). This setup is sketched in code after the list.
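To make the derivation easy to follow numerically, each step below is paired with a short numpy sketch. This first block sets up the notation; the specific numbers are hypothetical example values, not part of the problem:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical example values; any numbers of the right shapes work.
x  = np.array([0.5, -1.0])          # input [x1, x2]
W1 = np.array([[ 0.1, -0.2],        # hidden-layer weights, shape (3, 2)
               [ 0.4,  0.3],
               [-0.5,  0.2]])
b1 = np.array([0.1, -0.1, 0.2])     # hidden-layer biases, shape (3,)
W2 = np.array([0.3, -0.6, 0.9])     # output-layer weights, shape (3,)
b2 = 0.05                           # output-layer bias (scalar)
y  = 1.0                            # true label
```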
2. Forward Propagation
Step 1: Input layer to hidden layer
- Hidden-layer pre-activation:
\[ \mathbf{z}^{(1)} = \mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)} = \begin{bmatrix} w_{11}^{(1)}x_1 + w_{12}^{(1)}x_2 + b_1^{(1)} \\ w_{21}^{(1)}x_1 + w_{22}^{(1)}x_2 + b_2^{(1)} \\ w_{31}^{(1)}x_1 + w_{32}^{(1)}x_2 + b_3^{(1)} \end{bmatrix} \]
- Hidden-layer output (after activation; sketched in code below):
\[ \mathbf{a}^{(1)} = \sigma(\mathbf{z}^{(1)}) = \begin{bmatrix} \sigma(z_1^{(1)}) \\ \sigma(z_2^{(1)}) \\ \sigma(z_3^{(1)}) \end{bmatrix} \]
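Continuing the sketch above, the hidden-layer pass of the forward propagation is two lines:

```python
z1 = W1 @ x + b1   # pre-activation z^(1), shape (3,)
a1 = sigmoid(z1)   # activation a^(1), shape (3,)
```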
Step 2: Hidden layer to output layer
- Output-layer pre-activation:
\[ z^{(2)} = \mathbf{W}^{(2)} \mathbf{a}^{(1)} + b^{(2)} = w_{11}^{(2)}a_1^{(1)} + w_{12}^{(2)}a_2^{(1)} + w_{13}^{(2)}a_3^{(1)} + b^{(2)} \]
- Final output:
\[ \hat{y} = a^{(2)} = \sigma(z^{(2)}) \]
- Loss value (computed in the sketch below):
\[ L = \frac{1}{2}(y - \hat{y})^2 \]
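Continuing the sketch, the output layer and the loss value:

```python
z2 = W2 @ a1 + b2            # scalar pre-activation z^(2)
y_hat = sigmoid(z2)          # prediction ŷ
L = 0.5 * (y - y_hat) ** 2   # MSE loss with the 1/2 factor
```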
3. Backward Propagation (gradients via the chain rule)
Goal: compute the gradients of the loss with respect to all weights and biases, \(\frac{\partial L}{\partial \mathbf{W}^{(2)}}, \frac{\partial L}{\partial b^{(2)}}, \frac{\partial L}{\partial \mathbf{W}^{(1)}}, \frac{\partial L}{\partial \mathbf{b}^{(1)}}\).
Step 1: Output-layer gradients
- Gradient of the loss with respect to the output-layer pre-activation:
\[ \delta^{(2)} = \frac{\partial L}{\partial z^{(2)}} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z^{(2)}} = (\hat{y} - y) \cdot \sigma'(z^{(2)}) \]
Substituting the Sigmoid derivative \(\sigma'(z^{(2)}) = \hat{y}(1-\hat{y})\) gives
\[ \delta^{(2)} = (\hat{y} - y) \cdot \hat{y}(1-\hat{y}) \]
- Weight gradients:
\[ \frac{\partial L}{\partial w_{1j}^{(2)}} = \delta^{(2)} \cdot a_j^{(1)} \quad (j=1,2,3), \quad \frac{\partial L}{\partial \mathbf{W}^{(2)}} = \delta^{(2)} \mathbf{a}^{(1)^T} \]
- Bias gradient (see the sketch below):
\[ \frac{\partial L}{\partial b^{(2)}} = \delta^{(2)} \]
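Continuing the sketch, the three output-layer quantities above; since \(\delta^{(2)}\) is a scalar here, \(\delta^{(2)} \mathbf{a}^{(1)^T}\) reduces to a scalar-vector product:

```python
delta2 = (y_hat - y) * y_hat * (1.0 - y_hat)  # δ^(2)
dW2 = delta2 * a1                             # ∂L/∂W^(2), shape (3,)
db2 = delta2                                  # ∂L/∂b^(2)
```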
Step 2: Hidden-layer gradients
- Gradient of the loss with respect to the hidden-layer pre-activations (all three hidden neurons must be accounted for):
\[ \delta^{(1)} = \frac{\partial L}{\partial \mathbf{z}^{(1)}} = \left( \mathbf{W}^{(2)^T} \delta^{(2)} \right) \odot \sigma'(\mathbf{z}^{(1)}) \]
where \(\odot\) denotes element-wise multiplication. Substituting \(\sigma'(\mathbf{z}^{(1)}) = \mathbf{a}^{(1)} \odot (1 - \mathbf{a}^{(1)})\) gives
\[ \delta^{(1)} = \begin{bmatrix} w_{11}^{(2)} \delta^{(2)} \\ w_{12}^{(2)} \delta^{(2)} \\ w_{13}^{(2)} \delta^{(2)} \end{bmatrix} \odot \begin{bmatrix} a_1^{(1)}(1-a_1^{(1)}) \\ a_2^{(1)}(1-a_2^{(1)}) \\ a_3^{(1)}(1-a_3^{(1)}) \end{bmatrix} \]
- Weight gradients:
\[ \frac{\partial L}{\partial w_{ij}^{(1)}} = \delta_i^{(1)} \cdot x_j \quad (i=1,2,3; j=1,2), \quad \frac{\partial L}{\partial \mathbf{W}^{(1)}} = \delta^{(1)} \mathbf{x}^T \]
- Bias gradients (see the sketch below):
\[ \frac{\partial L}{\partial b_i^{(1)}} = \delta_i^{(1)} \quad (i=1,2,3), \quad \frac{\partial L}{\partial \mathbf{b}^{(1)}} = \delta^{(1)} \]
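Continuing the sketch, the hidden-layer gradients; `np.outer` realizes the outer product \(\delta^{(1)} \mathbf{x}^T\):

```python
delta1 = (W2 * delta2) * a1 * (1.0 - a1)  # δ^(1) = (W^(2)T δ^(2)) ⊙ σ'(z^(1))
dW1 = np.outer(delta1, x)                 # ∂L/∂W^(1), shape (3, 2)
db1 = delta1                              # ∂L/∂b^(1), shape (3,)
```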
4. Parameter Update
Gradient descent with learning rate \(\eta\):
- Output layer:
\[ \mathbf{W}^{(2)} \leftarrow \mathbf{W}^{(2)} - \eta \frac{\partial L}{\partial \mathbf{W}^{(2)}}, \quad b^{(2)} \leftarrow b^{(2)} - \eta \delta^{(2)} \]
- Hidden layer (one full update step is sketched in code below):
\[ \mathbf{W}^{(1)} \leftarrow \mathbf{W}^{(1)} - \eta \frac{\partial L}{\partial \mathbf{W}^{(1)}}, \quad \mathbf{b}^{(1)} \leftarrow \mathbf{b}^{(1)} - \eta \delta^{(1)} \]
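A single gradient-descent step, continuing the sketch (the learning-rate value 0.1 is an assumption for illustration):

```python
eta = 0.1              # assumed learning rate
W2 = W2 - eta * dW2
b2 = b2 - eta * db2
W1 = W1 - eta * dW1
b1 = b1 - eta * db1
```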
Summary: the forward pass computes the prediction, the backward pass propagates the error layer by layer via the chain rule to obtain every gradient, and gradient descent then updates the parameters. The same procedure extends directly to deeper networks.
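A standard way to sanity-check such a derivation is to compare the analytic gradient against a central finite-difference estimate. A self-contained sketch with random hypothetical values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(W1, b1, W2, b2, x, y):
    a1 = sigmoid(W1 @ x + b1)       # forward pass as derived above
    y_hat = sigmoid(W2 @ a1 + b2)
    return 0.5 * (y - y_hat) ** 2

rng = np.random.default_rng(0)
x, y = rng.normal(size=2), 1.0
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=3), rng.normal()

# Analytic gradient of one entry of W^(1) from the backprop formulas.
a1 = sigmoid(W1 @ x + b1)
y_hat = sigmoid(W2 @ a1 + b2)
delta2 = (y_hat - y) * y_hat * (1 - y_hat)
delta1 = (W2 * delta2) * a1 * (1 - a1)
analytic = np.outer(delta1, x)[0, 0]

# Central finite difference on the same entry.
eps = 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
numeric = (loss(Wp, b1, W2, b2, x, y) - loss(Wm, b1, W2, b2, x, y)) / (2 * eps)
print(analytic, numeric)  # the two values should agree to ~1e-9
```

If the two printed values agree to several significant digits, the backpropagation formulas above are consistent with the loss they were derived from.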