Learning rate
This appendix contains a few additional details about the learning rate.
Learning rate decay schedule
The best family of learning rate decay schedules is an open problem; it's not clear how to construct a set of rigorous experiments that would answer this question with confidence. Although we don't know the best schedule family, we're confident of the following:
- It's important to have some (non-constant) schedule.
- Tuning that schedule is important.
Different learning rates work best at different times during the optimization process. Having some sort of schedule makes it more likely that the model hits a good learning rate.
Best default learning rate decay
We recommend either of the following learning rate decay families as a default:
- Linear decay
- Cosine decay
Many other schedule families are probably good, too.
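For concreteness, here is a minimal sketch of what linear and cosine decay can look like when a schedule is written as a function of the training step. The function names, parameters (`peak_lr`, `end_lr`, `total_steps`), and example values are illustrative assumptions, not values prescribed by this appendix.

```python
import math

def linear_decay(step, total_steps, peak_lr, end_lr=0.0):
    """Linearly anneal the learning rate from peak_lr to end_lr."""
    frac = min(step, total_steps) / total_steps
    return peak_lr + frac * (end_lr - peak_lr)

def cosine_decay(step, total_steps, peak_lr, end_lr=0.0):
    """Anneal the learning rate along a half-cosine from peak_lr to end_lr."""
    frac = min(step, total_steps) / total_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * frac))
    return end_lr + cosine * (peak_lr - end_lr)

# Example: a hypothetical 10,000-step run with a peak learning rate of 1e-3.
for step in (0, 5_000, 10_000):
    print(step, linear_decay(step, 10_000, 1e-3), cosine_decay(step, 10_000, 1e-3))
```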
Why do some papers have complicated learning rate schedules?
Many academic papers use complicated piecewise learning rate (LR) decay schedules. Readers often wonder how the authors arrived at such a complicated schedule. Many complicated LR decay schedules are the result of tuning the schedule as a function of validation set performance in an ad hoc way. That is:
1. Start a single training run with some simple LR decay (or a constant learning rate).
2. Keep training until performance seems to stagnate. If this happens, pause training. Then resume it from this point with a perhaps steeper LR decay schedule (or a smaller constant learning rate). Repeat this process (until the conference or launch deadline).
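The result of this process is typically a piecewise-constant schedule. The sketch below shows one hypothetical example of what such a schedule can end up looking like; the breakpoints and learning rate values are invented for illustration and are not recommendations.

```python
# Hypothetical piecewise-constant schedule produced by repeatedly pausing
# training and resuming with a smaller learning rate. The breakpoints and
# values below are invented for illustration; they are not recommendations.
BREAKPOINTS = [(0, 1e-3), (40_000, 3e-4), (70_000, 1e-4), (85_000, 3e-5)]

def piecewise_constant_lr(step):
    """Return the learning rate of the interval containing `step`."""
    lr = BREAKPOINTS[0][1]
    for boundary, value in BREAKPOINTS:
        if step >= boundary:
            lr = value
    return lr
```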
Blithely copying the resulting schedule is generally not a good idea, since the best particular schedule is sensitive to a host of other hyperparameter choices. We recommend copying the algorithm that produced the schedule, although this is rarely possible when the schedule was the product of arbitrary human judgment. This type of validation-error-sensitive schedule is fine to use if it can be fully automated, but human-in-the-loop schedules that are a function of validation error are brittle and not easily reproducible, so we recommend avoiding them. Before publishing results that used such a schedule, please try to make it fully reproducible.
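As one possible way to fully automate a validation-error-sensitive schedule, the sketch below implements a simple "reduce on plateau" rule with fixed, documented settings. The class name and every default value are assumptions made for illustration, not part of this appendix.

```python
class ReduceOnPlateau:
    """Automated stand-in for the human-in-the-loop procedure above: cut the
    learning rate by `factor` whenever the validation loss has not improved
    for `patience` consecutive evaluations. All settings are illustrative
    assumptions, not recommendations from this appendix."""

    def __init__(self, initial_lr, factor=0.3, patience=3, min_delta=1e-4):
        self.lr = initial_lr
        self.factor = factor
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.num_bad_evals = 0

    def update(self, validation_loss):
        """Record one validation result and return the learning rate to use."""
        if validation_loss < self.best - self.min_delta:
            self.best = validation_loss
            self.num_bad_evals = 0
        else:
            self.num_bad_evals += 1
            if self.num_bad_evals >= self.patience:
                self.lr *= self.factor
                self.num_bad_evals = 0
        return self.lr
```

Because every decision is encoded in the rule and its settings, a schedule like this can be rerun exactly, which is the property the human-in-the-loop version lacks.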
How should Adam's hyperparameters be tuned?
Not all of Adam's hyperparameters are equally important. The following rules of thumb correspond to different "budgets" for the number of trials in a study.
- If a study has fewer than 10 trials, only tune the (base) learning rate.
- If a study has 10-25 trials, tune the learning rate and `beta_1`.
- If a study has more than 25 trials, tune the learning rate, `beta_1`, and `epsilon`.
- If a study has substantially more than 25 trials, additionally tune `beta_2`.
Given how difficult it is to provide general rules about search spaces and how many points to sample from them, treat the rules of thumb in this section as rough guidelines.
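One way to read the trial-budget rules of thumb is as a search space that grows with the budget. The sketch below encodes that reading; the sampling ranges and the threshold used for "substantially more than 25" are illustrative assumptions, and only the choice of which hyperparameters to tune comes from the list above.

```python
def adam_search_space(num_trials):
    """Pick which Adam hyperparameters to tune based on the trial budget,
    following the rules of thumb above. The ranges are illustrative
    assumptions; only the choice of which knobs to tune comes from the text."""
    space = {"learning_rate": ("log_uniform", 1e-5, 1e-1)}  # always tuned
    if num_trials >= 10:
        space["beta_1"] = ("uniform", 0.8, 0.99)
    if num_trials > 25:
        space["epsilon"] = ("log_uniform", 1e-10, 1e-3)
    if num_trials > 50:  # assumed threshold for "substantially more than 25"
        space["beta_2"] = ("uniform", 0.9, 0.9999)
    return space

# Example: with a budget of 30 trials, tune learning_rate, beta_1, and epsilon.
print(sorted(adam_search_space(30)))
```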