Productionization
To prepare your ML pipelines for production, you need to do the following:

- Provision compute resources for your pipelines
- Implement logging, monitoring, and alerting

Provisioning compute resources
Running ML pipelines requires compute resources, such as RAM, CPUs, and GPUs/TPUs. Without adequate compute, you can't run your pipelines. Therefore, make sure to get sufficient quota to provision the resources your pipelines need to run in production.

- Serving, training, and validation pipelines. These pipelines require TPUs, GPUs, or CPUs. Depending on your use case, you might train and serve on different hardware, or use the same hardware. In general, it's common to train on bigger hardware and then serve on smaller hardware. When picking hardware, consider whether you could train on less expensive hardware, whether switching hardware would boost performance, and which hardware best fits your model's size and architecture.
- Data pipelines. Data pipelines require quota for RAM and CPUs. You'll need to estimate how much quota your pipeline needs to generate the training and test datasets.
You might not allocate quota for each pipeline individually. Instead, you might allocate quota that the pipelines share. In that case, verify that you have enough quota to run all of your pipelines, and set up monitoring and alerting to prevent a single errant pipeline from consuming all of the quota.
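The alerting described above can start as a simple periodic check of how much of the shared pool each pipeline is using. The sketch below is a minimal illustration, not a prescribed setup: the quota limit, the 50% threshold, and the `send_alert` helper are assumptions you would replace with your own resource manager's and alerting system's real APIs.

```python
# Minimal sketch of shared-quota monitoring. The thresholds and the
# send_alert helper are placeholders, not real platform APIs.
from typing import Dict

QUOTA_LIMIT_GPU_HOURS = 10_000        # total quota shared by all pipelines (assumed)
PER_PIPELINE_ALERT_FRACTION = 0.5     # alert if one pipeline uses >50% of the pool

def check_shared_quota(usage_by_pipeline: Dict[str, float]) -> None:
    """Alert when total usage nears the limit or one pipeline dominates it."""
    total = sum(usage_by_pipeline.values())
    if total > 0.9 * QUOTA_LIMIT_GPU_HOURS:
        send_alert(f"Shared quota almost exhausted: {total:.0f} GPU-hours used")
    for pipeline, used in usage_by_pipeline.items():
        if used > PER_PIPELINE_ALERT_FRACTION * QUOTA_LIMIT_GPU_HOURS:
            send_alert(f"Pipeline '{pipeline}' is consuming most of the quota "
                       f"({used:.0f} GPU-hours)")

def send_alert(message: str) -> None:
    # Placeholder: wire this to paging, email, or chat in a real setup.
    print("ALERT:", message)
```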
Estimating quota
To estimate the quota you'll need for the data and training pipelines, find similar projects to base your estimates on. To estimate serving quota, try to predict the service's queries per second. These methods provide a baseline; as you begin prototyping a solution during the experimentation phase, you'll get more precise quota estimates.
When estimating quota, remember to factor in quota not only for your production pipelines, but also for ongoing experiments.
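As a rough illustration of the QPS-based approach, the back-of-the-envelope calculation below estimates how many serving replicas a predicted peak load would need. The throughput, headroom, and QPS numbers are made-up assumptions for the sketch, not recommendations.

```python
import math

# Assumed inputs -- replace with measurements from your own prototype.
predicted_peak_qps = 800          # expected queries per second at peak
qps_per_replica = 120             # measured throughput of one serving replica
headroom = 1.3                    # extra capacity for spikes and experiments

replicas_needed = math.ceil(predicted_peak_qps * headroom / qps_per_replica)
print(f"Request quota for at least {replicas_needed} serving replicas")
# -> Request quota for at least 9 serving replicas
```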
Check Your Understanding
When choosing hardware to serve predictions, you should always choose more powerful hardware than was used to train the model. (True or False?)
False
Correct. Typically, training requires bigger hardware than serving.
Logging, monitoring, and alerting
Logging and monitoring a production model's behavior is critical. Robust monitoring infrastructure confirms that your models are serving reliable, high-quality predictions.

Good logging and monitoring practices help proactively identify issues in ML pipelines and mitigate potential business impact. When issues do occur, alerts notify members of your team, and comprehensive logs help you diagnose the problem's root cause.
You should implement logging and monitoring to detect the following issues in your ML pipelines:
| Pipeline | Monitor |
|------------|---------|
| Serving | Skews or drifts in the serving data compared to the training data; skews or drifts in predictions; data type issues, such as missing or corrupted values; quota usage; model quality metrics |
| Data | Skews and drifts in feature values; skews and drifts in label values; data type issues, such as missing or corrupted values; quota usage rate; quota limit about to be reached |
| Training | Training time; training failures; quota usage |
| Validation | Skew or drift in the test datasets |

Note that measuring a production model's quality differs from measuring it during training: in production you often lack ground truth to compare predictions against, so you monitor proxy metrics instead (for example, the percentage of mail that users move to spam). Changes in a proxy metric are more informative than its raw value.
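One common way to detect the skews and drifts listed in the table is to compare the distribution of a feature (or of the model's predictions) in recent serving data against the training data. The sketch below uses a two-sample Kolmogorov-Smirnov test from SciPy as one possible check; the synthetic data and the p-value threshold are illustrative assumptions, not recommended settings.

```python
import numpy as np
from scipy.stats import ks_2samp

def check_feature_drift(training_values: np.ndarray,
                        serving_values: np.ndarray,
                        p_value_threshold: float = 0.01) -> bool:
    """Return True if the serving distribution looks different from training."""
    result = ks_2samp(training_values, serving_values)
    drifted = result.pvalue < p_value_threshold
    if drifted:
        print(f"Possible drift: KS statistic={result.statistic:.3f}, "
              f"p={result.pvalue:.4f}")
    return drifted

# Illustrative data: serving values shifted relative to training values.
rng = np.random.default_rng(0)
training = rng.normal(loc=0.0, scale=1.0, size=5_000)
serving = rng.normal(loc=0.4, scale=1.0, size=5_000)
check_feature_drift(training, serving)
```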
You'll also want to set up logging, monitoring, and alerting for the following (a minimal health check is sketched after the list):

- Latency. How long does it take to deliver a prediction?
- Outages. Has the model stopped delivering predictions?
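The sketch below illustrates both checks, assuming you already record a duration for every prediction and a timestamp for the most recent one. The p99 threshold, the staleness window, and the `send_alert` helper are assumptions, not fixed targets.

```python
import time
import numpy as np

P99_LATENCY_THRESHOLD_MS = 200      # illustrative latency objective
OUTAGE_WINDOW_SECONDS = 300         # alert if no predictions for 5 minutes

def check_serving_health(latencies_ms: list[float],
                         last_prediction_time: float) -> None:
    """Alert on slow predictions or on a model that has stopped serving."""
    if latencies_ms:
        p99 = float(np.percentile(latencies_ms, 99))
        if p99 > P99_LATENCY_THRESHOLD_MS:
            send_alert(f"p99 latency {p99:.0f} ms exceeds "
                       f"{P99_LATENCY_THRESHOLD_MS} ms")
    if time.time() - last_prediction_time > OUTAGE_WINDOW_SECONDS:
        send_alert("No predictions served in the last 5 minutes")

def send_alert(message: str) -> None:
    print("ALERT:", message)   # placeholder for paging/email/chat
```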
Check Your Understanding
Which of the following is the main reason for logging and monitoring your ML pipelines?
Proactively detect issues before they impact users
Track quota and resource usage
Identify potential security problems
All of the above
Correct. Logging and monitoring your ML pipelines helps prevent and diagnose problems before they become serious.
Deploying a model
For model deployment, you'll want to document the following:
- Approvals required to begin a deployment and to increase the rollout.
- How to put a model into production.
- Where the model gets deployed, for example, whether there are staging or canary environments.
- What to do if a deployment fails.
- How to roll back a model that's already in production.
After automating model training, you'll also want to automate validation and deployment. Automating deployments distributes responsibility and reduces the likelihood of a deployment being blocked by a single person. It also reduces potential mistakes, increases efficiency and reliability, and enables on-call rotations and SRE support.

Typically, you deploy a new model to a subset of users to check that it behaves as expected. If it does, continue with the deployment. If it doesn't, roll back the deployment and begin diagnosing and debugging the issues.
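The gradual rollout described above can be sketched as routing a small fraction of traffic to the candidate model and rolling back if its monitored quality drops. Everything in the sketch is an illustrative assumption: the traffic fraction, the quality threshold, and the model objects stand in for whatever deployment and monitoring machinery you actually use.

```python
import random

CANARY_FRACTION = 0.05          # start by sending 5% of traffic to the new model
MIN_ACCEPTABLE_QUALITY = 0.92   # illustrative proxy-metric threshold

def route_request(request, production_model, candidate_model):
    """Send a small, random slice of traffic to the candidate model."""
    if random.random() < CANARY_FRACTION:
        return candidate_model.predict(request), "candidate"
    return production_model.predict(request), "production"

def evaluate_canary(candidate_quality: float) -> str:
    """Decide whether to continue the rollout or roll back."""
    if candidate_quality >= MIN_ACCEPTABLE_QUALITY:
        return "continue rollout"      # e.g., gradually increase CANARY_FRACTION
    return "roll back and debug"
```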