Linear regression: Hyperparameters
Hyperparameters are variables that control different aspects of training. Three common hyperparameters are the learning rate, the batch size, and the number of epochs.
In contrast, parameters are variables that are part of the model itself, such as the weights and bias. In other words, hyperparameters are values that you control; parameters are values that the model calculates during training.
Learning rate
The learning rate is a floating-point number you set that influences how quickly the model converges. If the learning rate is too low, the model can take a long time to converge. If the learning rate is too high, the model never converges; instead it bounces around the weight and bias values that would minimize the loss. The goal is to pick a learning rate that is neither too high nor too low, so that the model converges quickly.
The learning rate determines the magnitude of the changes made to the weights and bias during each step of the gradient descent process. The model multiplies the gradient by the learning rate to determine the model's parameters (weight and bias values) for the next iteration. In the third step of gradient descent, the "small amount" to move in the direction of negative slope refers to the learning rate.
The difference between the old model parameters and the new model parameters is proportional to the slope of the loss function. If the slope is large, the model takes a large step; if it is small, the model takes a small step. For example, if the gradient's magnitude is 2.5 and the learning rate is 0.01, the model changes the parameter by 0.025.
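To make that arithmetic concrete, here is a minimal Python sketch of a single gradient descent update for one parameter. The variable names (param, gradient, learning_rate) are illustrative only and not tied to any particular library.

```python
# Minimal sketch of one gradient descent update for a single parameter.
# The parameter moves in the direction of the negative gradient,
# scaled by the learning rate.

learning_rate = 0.01   # hyperparameter you set
param = 1.0            # current weight (or bias) value
gradient = 2.5         # slope of the loss with respect to this parameter

step = learning_rate * gradient   # 0.01 * 2.5 = 0.025
param = param - step              # move against the slope

print(f"step size: {step}, updated parameter: {param}")
# step size: 0.025, updated parameter: 0.975
```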
The ideal learning rate helps the model converge within a reasonable number of iterations. In Figure 20, the loss curve shows the model improving significantly during the first 20 iterations before beginning to converge:
Figure 20. Loss graph showing a model trained with a learning rate that converges quickly.
In contrast, a learning rate that's too small can take too many iterations to converge. In Figure 21, the loss curve shows the model making only minor improvements after each iteration:
Figure 21. Loss graph showing a model trained with a small learning rate.
A learning rate that's too large never converges, because each iteration either causes the loss to bounce around or to increase continually. In Figure 22, the loss curve shows the loss decreasing and then increasing after each iteration, while in Figure 23 the loss increases at later iterations:
Figure 22. Loss graph showing a model trained with a learning rate that's too big, where the loss curve fluctuates wildly as the iterations increase.
Figure 23. Loss graph showing a model trained with a learning rate that's too big, where the loss curve drastically increases in later iterations.
Exercise: Check your understanding
What is the ideal learning rate?
The ideal learning rate is problem-dependent.
Each model and dataset has its own ideal learning rate.
Batch size
Batch size is a hyperparameter that refers to the number of examples the model processes before updating its weights and bias. You might think the model should calculate the loss for every example in the dataset before updating the weights and bias. However, when a dataset contains hundreds of thousands or even millions of examples, using the full batch isn't practical.
Two common techniques for getting the right gradient on average, without needing to look at every example in the dataset before updating the weights and bias, are stochastic gradient descent and mini-batch stochastic gradient descent:
Stochastic gradient descent (SGD): Stochastic gradient descent uses only a single example (a batch size of one) per iteration. Given enough iterations, SGD works, but it is very noisy. "Noise" refers to variations during training that cause the loss to increase rather than decrease during an iteration. The term "stochastic" indicates that the one example comprising each batch is chosen at random.
Notice in the following image how the loss fluctuates slightly as the model updates its weights and bias using SGD, which can lead to noise in the loss graph:
Figure 24. Model trained with stochastic gradient descent (SGD), showing noise in the loss curve.
Note that using stochastic gradient descent can produce noise throughout the entire loss curve, not just near convergence.
Mini-batch stochastic gradient descent (mini-batch SGD): Mini-batch stochastic gradient descent is a compromise between full-batch gradient descent and SGD. For $ N $ data points, the batch size can be any number greater than 1 and less than $ N $. The model chooses the examples included in each batch at random, averages their gradients, and then updates the weights and bias once per iteration.
Determining the number of examples for each batch depends on the dataset and the available compute resources. In general, small batch sizes behave like SGD, and larger batch sizes behave like full-batch gradient descent (see the sketch after Figure 25).
Figure 25. Model trained with mini-batch SGD.
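For illustration, here is a minimal NumPy sketch of a mini-batch SGD training loop for a one-feature linear model. The synthetic data, variable names, and hyperparameter values are assumptions chosen for this example, not part of the course material.

```python
import numpy as np

# Minimal sketch of mini-batch SGD for a one-feature linear model (y = w*x + b).
# Setting batch_size = 1 recovers SGD; batch_size = len(x) recovers full-batch
# gradient descent.

rng = np.random.default_rng(seed=0)
x = rng.uniform(0, 10, size=1000)
y = 3.0 * x + 5.0 + rng.normal(0, 1, size=1000)   # synthetic data

w, b = 0.0, 0.0
learning_rate = 0.02
batch_size = 100
epochs = 20

for epoch in range(epochs):
    indices = rng.permutation(len(x))              # shuffle once per epoch
    for start in range(0, len(x), batch_size):
        batch = indices[start:start + batch_size]  # randomly chosen examples
        x_b, y_b = x[batch], y[batch]
        error = (w * x_b + b) - y_b
        # Average the gradients of the squared-error loss over the batch.
        grad_w = 2 * np.mean(error * x_b)
        grad_b = 2 * np.mean(error)
        # One weight-and-bias update per iteration (per batch).
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

# w and b move toward the generating values (3 and 5); more epochs get closer.
print(f"learned w={w:.2f}, b={b:.2f}")
```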
When training a model, you might think of noise as an undesirable characteristic that should be eliminated. However, a certain amount of noise can be a good thing. In later modules, you'll learn how noise can help a model generalize better and find the optimal weights and bias in a neural network.
Epochs
During training, an epoch means that the model has processed every example in the training set once. For example, given a training set with 1,000 examples and a mini-batch size of 100 examples, it takes the model 10 iterations to complete one epoch.
Training typically requires many epochs. That is, the system needs to process every example in the training set multiple times.
The number of epochs is a hyperparameter you set before the model begins training. In many cases, you'll need to experiment with how many epochs it takes for the model to converge. In general, more epochs produce a better model, but training also takes more time.
Figure 26. Full batch versus mini-batch.
The following table describes how batch size and epochs relate to the number of times a model updates its parameters.
| Batch type | When weight and bias updates occur |
|---|---|
| Full batch | After the model looks at all the examples in the dataset. For instance, if a dataset contains 1,000 examples and the model trains for 20 epochs, the model updates the weights and bias 20 times, once per epoch. |
| Stochastic gradient descent | After the model looks at a single example from the dataset. For instance, if a dataset contains 1,000 examples and trains for 20 epochs, the model updates the weights and bias 20,000 times. |
| Mini-batch stochastic gradient descent | After the model looks at the examples in each batch. For instance, if a dataset contains 1,000 examples, the batch size is 100, and the model trains for 20 epochs, the model updates the weights and bias 200 times. |
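The counts in the table follow from simple arithmetic: total updates = epochs × (examples ÷ batch size). Here is a short, illustrative Python sketch; the names are placeholders, not part of any library.

```python
# Illustrative calculation of how many weight-and-bias updates occur,
# following the table above: updates = epochs * (examples // batch_size).

examples = 1_000
epochs = 20

for name, batch_size in [("Full batch", examples),
                         ("Stochastic gradient descent", 1),
                         ("Mini-batch SGD (batch size 100)", 100)]:
    iterations_per_epoch = examples // batch_size
    total_updates = epochs * iterations_per_epoch
    print(f"{name}: {total_updates} updates")

# Full batch: 20 updates
# Stochastic gradient descent: 20000 updates
# Mini-batch SGD (batch size 100): 200 updates
```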
Exercise: Check your understanding
1. What's the best batch size when using mini-batch SGD?
It depends.
The ideal batch size depends on the dataset and the available compute resources.
2. Which of the following statements is true?
Larger batches are unsuitable for data with many outliers.
This statement is false. By averaging more gradients together, larger batch sizes can help reduce the negative effects of outliers in the data.
Doubling the learning rate can slow down training.
This statement is true. Doubling the learning rate can result in a learning rate that is too large, causing the weights to "bounce around" and increasing the time needed to converge. As always, the best hyperparameters depend on your dataset and the available compute resources.