Linear regression: Hyperparameters
Hyperparameters are variables that control different aspects of training. Three common hyperparameters are the learning rate, the batch size, and the number of epochs.
In contrast, parameters are the variables, such as the weights and bias, that are part of the model itself. In other words, hyperparameters are values that you control; parameters are values that the model calculates during training.
Learning rate
The learning rate is a floating-point number you set that influences how quickly the model converges. If the learning rate is too low, the model can take a long time to converge. However, if the learning rate is too high, the model never converges; instead, it bounces around the weights and bias that minimize the loss. The goal is to pick a learning rate that is neither too high nor too low, so that the model converges quickly.
The learning rate determines the magnitude of the changes made to the weights and bias at each step of the gradient descent process. The model multiplies the gradient by the learning rate to determine the model's parameters (the weight and bias values) for the next iteration. In the third step of gradient descent, the "small amount" to move in the direction of the negative slope refers to the learning rate.
The difference between the old model parameters and the new model parameters is proportional to the slope of the loss function. For example, if the slope is large, the model takes a large step; if it is small, the model takes a small step. For instance, if the gradient's magnitude is 2.5 and the learning rate is 0.01, the model changes the parameter by 0.025.
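As a rough illustration of this update rule, the following Python sketch applies a single gradient descent step to one parameter. The starting parameter value and the variable names are hypothetical, chosen only to reproduce the arithmetic in the example above.

```python
learning_rate = 0.01
gradient = 2.5        # slope of the loss with respect to the parameter

param = 1.0           # hypothetical current value of a weight or bias
step = learning_rate * gradient
param = param - step  # move a "small amount" in the direction of the negative slope

print(step)   # 0.025, matching the example above
print(param)  # 0.975
```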
The ideal learning rate helps the model converge within a reasonable number of iterations. In Figure 20, the loss curve shows the model improving significantly during the first 20 iterations before beginning to converge:
Figure 20. Loss graph showing a model trained with a learning rate that converges quickly.
In contrast, a learning rate that is too small can take too many iterations to converge. In Figure 21, the loss curve shows the model making only minor improvements after each iteration:
Figure 21. Loss graph showing a model trained with a small learning rate.
A learning rate that is too large never converges, because each iteration either causes the loss to bounce around widely or to increase continually. In Figure 22, the loss curve shows the loss decreasing and then increasing after each iteration; in Figure 23, the loss increases at later iterations:
Figure 22. Loss graph showing a model trained with a learning rate that is too large, where the loss curve fluctuates wildly, going up and down as the iterations increase.
Figure 23. Loss graph showing a model trained with a learning rate that is too large, where the loss curve increases drastically in later iterations.
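To see these three regimes on a toy problem, the sketch below runs plain gradient descent on a made-up quadratic loss, (w - 3)^2, with a learning rate that is too small, roughly right, and too large. The loss function and the specific rates are illustrative assumptions, not values from the course.

```python
def gradient(w):
    """Derivative of the toy loss (w - 3)**2 with respect to w."""
    return 2 * (w - 3)

def final_loss(learning_rate, steps=30, w=0.0):
    """Run plain gradient descent and return the loss after the last step."""
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return (w - 3) ** 2

print(final_loss(0.001))  # too small: the loss barely improves after 30 steps
print(final_loss(0.1))    # reasonable: the loss is close to 0
print(final_loss(1.5))    # too large: each update overshoots and the loss grows
```

With the too-large rate, every step overshoots the minimum by more than it corrects, which is the bouncing and blow-up behavior shown in Figures 22 and 23.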
Exercise: Check your understanding
What is the ideal learning rate?
The ideal learning rate is problem-dependent.
Each model and dataset has its own ideal learning rate.
Batch size
Batch size is a hyperparameter that refers to the number of examples the model processes before updating its weights and bias. You might think that the model should calculate the loss for every example in the dataset before updating the weights and bias. However, when a dataset contains hundreds of thousands or even millions of examples, using the full batch isn't practical.
Two common techniques for getting the right gradient on average, without looking at every example in the dataset before updating the weights and bias, are stochastic gradient descent and mini-batch stochastic gradient descent.
Stochastic gradient descent (SGD): Stochastic gradient descent uses only a single example per iteration (a batch size of one). Given enough iterations, SGD works, but it is very noisy. "Noise" refers to variations during training that cause the loss to increase rather than decrease during an iteration. The term "stochastic" indicates that the single example making up each batch is chosen at random.
Notice in the following figure how the loss fluctuates slightly as the model updates its weights and bias using SGD, which can lead to noise in the loss graph:
Figure 24. Model trained with stochastic gradient descent (SGD), showing noise in the loss curve.
Note that using stochastic gradient descent can produce noise throughout the entire loss curve, not just near convergence.
Mini-batch stochastic gradient descent (mini-batch SGD): Mini-batch stochastic gradient descent is a compromise between full-batch gradient descent and SGD. For $ N $ data points, the batch size can be any number greater than 1 and less than $ N $. The model chooses the examples included in each batch at random, averages their gradients, and then updates the weights and bias once per iteration.
How many examples to put in each batch depends on the dataset and the available compute resources. In general, small batch sizes behave like SGD, and larger batch sizes behave like full-batch gradient descent; a minimal sketch of one such training loop follows Figure 25 below.
Figure 25. Model trained with mini-batch stochastic gradient descent.
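The following sketch shows what one possible mini-batch SGD training loop for a simple linear model could look like. The synthetic data, the learning rate, and the batch size are all made-up assumptions; the points to notice are the random batch selection, the gradient averaging, and the single update per iteration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic data following y = 2x + 1 plus noise (illustrative only).
features = rng.uniform(0, 10, size=1000)
labels = 2 * features + 1 + rng.normal(0, 1, size=1000)

weight, bias = 0.0, 0.0
learning_rate = 0.01
batch_size = 100  # greater than 1 and less than N = 1000

for iteration in range(1000):
    # "Stochastic": the examples in each batch are chosen at random.
    idx = rng.choice(len(features), size=batch_size, replace=False)
    x, y = features[idx], labels[idx]

    # Average the gradients of the squared-error loss over the batch.
    error = (weight * x + bias) - y
    grad_weight = np.mean(2 * error * x)
    grad_bias = np.mean(2 * error)

    # Update the weights and bias once per iteration.
    weight -= learning_rate * grad_weight
    bias -= learning_rate * grad_bias

print(weight, bias)  # roughly 2 and 1, the values used to generate the data
```

Setting batch_size to 1 turns this loop into plain SGD, while setting it to len(features) makes it full-batch gradient descent.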
When training a model, you might think of noise as an undesirable characteristic that should be eliminated. However, a certain amount of noise can be a good thing. In later modules, you'll learn how noise can help a model generalize better and find the optimal weights and bias in a neural network.
Epochs
During training, an epoch means that the model has processed every example in the training set once. For example, given a training set with 1,000 examples and a mini-batch size of 100 examples, it takes the model 10 iterations to complete one epoch.
Training typically requires many epochs. That is, the system needs to process every example in the training set multiple times.
The number of epochs is a hyperparameter you set before the model begins training. In many cases, you'll need to experiment to find out how many epochs it takes for the model to converge. In general, more epochs produce a better model, but training also takes more time.
Figure 26. Full batch versus mini-batch.
The following table describes how batch size and epochs relate to the number of times a model updates its parameters; the short sketch after the table works through the same arithmetic.
| Batch type | When weight and bias updates occur |
|---|---|
| Full batch | After the model has looked at all the examples in the dataset. For instance, if a dataset contains 1,000 examples and the model trains for 20 epochs, the model updates the weights and bias 20 times, once per epoch. |
| Stochastic gradient descent | After the model looks at a single example from the dataset. For instance, if a dataset contains 1,000 examples and trains for 20 epochs, the model updates the weights and bias 20,000 times. |
| Mini-batch stochastic gradient descent | After the model looks at the examples in each batch. For instance, if a dataset contains 1,000 examples, the batch size is 100, and the model trains for 20 epochs, the model updates the weights and bias 200 times. |
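As a quick check of the numbers in the table, this sketch computes the number of parameter updates for each batch type under the table's assumptions (1,000 examples, 20 epochs, mini-batch size of 100).

```python
num_examples = 1000
num_epochs = 20
mini_batch_size = 100

full_batch_updates = num_epochs                                       # one update per epoch
sgd_updates = num_examples * num_epochs                               # one update per example
mini_batch_updates = (num_examples // mini_batch_size) * num_epochs   # one update per batch

print(full_batch_updates)   # 20
print(sgd_updates)          # 20000
print(mini_batch_updates)   # 200
```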
Exercise: Check your understanding
1. What's the best batch size when using mini-batch SGD?
It depends.
The ideal batch size depends on the dataset and the available compute resources.
2. Which of the following statements is true?
Larger batches are unsuitable for data with many outliers.
This statement is false. By averaging more gradients together, larger batch sizes help reduce the negative effects of outliers in the data.
Doubling the learning rate can slow down training.
This statement is true. Doubling the learning rate can result in a learning rate that is too large, causing the weights to "bounce around" and increasing the time needed to converge. As always, the best hyperparameters depend on your dataset and the available compute resources.