Production ML systems: Deployment testing
You're ready to deploy the unicorn model that predicts unicorn appearances! When deploying, your machine learning (ML) pipeline should run, update, and serve without a problem. If only deploying a model were as easy as pressing a big Deploy button. Unfortunately, a full machine learning system requires tests for:
- Validating input data.
- Validating feature engineering.
- Validating the quality of new model versions.
- Validating serving infrastructure.
- Testing integration between pipeline components.
Many software engineers favor test-driven development (TDD). In TDD, software engineers write tests prior to writing the "real" source code. However, TDD can be tricky in machine learning. For example, before training your model, you can't write a test to validate the loss. Instead, you must first discover the achievable loss during model development and then test new model versions against that achievable loss.
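For example, a version test against the achievable loss might look like the following pytest-style sketch. The achievable-loss constant, the tolerance, and the train_and_evaluate stand-in are all hypothetical placeholders for your real training and evaluation code.

```python
# A minimal sketch of testing a new model version against the achievable loss.
import numpy as np

ACHIEVABLE_LOSS = 0.50   # hypothetical value discovered during model development
TOLERANCE = 0.05         # allow for small run-to-run noise

def train_and_evaluate(seed: int = 0) -> float:
    """Stand-in for your real training job: fits a tiny linear model
    on synthetic data and returns its mean squared error."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(1000, 3))
    y = x @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.7, size=1000)
    w, *_ = np.linalg.lstsq(x, y, rcond=None)
    return float(np.mean((x @ w - y) ** 2))

def test_new_version_meets_achievable_loss():
    loss = train_and_evaluate()
    assert loss <= ACHIEVABLE_LOSS + TOLERANCE, (
        f"validation loss {loss:.3f} regressed past the achievable "
        f"loss {ACHIEVABLE_LOSS:.3f} (+{TOLERANCE:.2f} tolerance)")
```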
About the unicorn model
This section refers to the unicorn model. Here's what you need to know:
You are using machine learning to build a classification model that predicts unicorn appearances. Your dataset details 10,000 unicorn appearances and 10,000 unicorn non-appearances. The dataset contains the location, time of day, elevation, temperature, humidity, tree cover, presence of a rainbow, and several other features.
Test model updates with reproducible training
Perhaps you want to continue improving your unicorn model. For example, suppose you do some additional feature engineering on a certain feature and then retrain the model, hoping to get better (or at least the same) results. Unfortunately, it is sometimes difficult to reproduce model training. To improve reproducibility, follow these recommendations:
- Deterministically seed the random number generator. For details, see randomization in data generation.
- Initialize model components in a fixed order to ensure that the components get the same random numbers from the random number generator on every run. ML libraries typically handle this requirement automatically.
- Take the average of several runs of the model.
- Use version control, even for preliminary iterations, so that you can pinpoint code and parameters when investigating your model or pipeline.
Even after following these guidelines, other sources of nondeterminism might still exist.
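As a concrete illustration of the seeding and averaging recommendations above, here is a minimal Python sketch. The seed value, the fake metric, and the run_training stand-in are illustrative only; in a real pipeline each run would build and train the actual model, drawing all of its randomness from the seeded generators.

```python
# A minimal sketch of deterministic seeding plus averaging over several runs.
import random
import numpy as np

SEED = 17  # arbitrary, but fixed and checked into version control

def run_training(seed: int) -> float:
    """Stand-in for one training run; returns a validation metric."""
    random.seed(seed)                    # Python-level randomness
    rng = np.random.default_rng(seed)    # NumPy-level randomness
    # ... build and train the model here, drawing all randomness
    # (weight init, shuffling, dropout) from `rng` in a fixed order ...
    return float(rng.normal(loc=0.9, scale=0.01))  # fake metric for the sketch

# Average several seeded runs to smooth out residual nondeterminism.
metrics = [run_training(SEED + i) for i in range(5)]
print(f"mean validation metric: {np.mean(metrics):.4f}")
```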
Test calls to the machine learning API
How do you test updates to API calls? You could retrain your model, but that's time intensive. Instead, write a unit test that generates random input data and runs a single step of gradient descent. If this step completes without errors, then any updates to the API probably haven't broken your model.
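Such a unit test might look like the following sketch, assuming a Keras model; the feature count, the architecture, and the batch size are placeholders for your own setup.

```python
# A minimal sketch of an API smoke test: one gradient step on random data.
import numpy as np
import tensorflow as tf

NUM_FEATURES = 8  # hypothetical feature count for the unicorn model

def build_model() -> tf.keras.Model:
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(NUM_FEATURES,)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="sgd", loss="binary_crossentropy")
    return model

def test_single_gradient_step_runs():
    # Random inputs and labels: we only care that the step completes.
    x = np.random.normal(size=(32, NUM_FEATURES)).astype("float32")
    y = np.random.randint(0, 2, size=(32, 1)).astype("float32")
    model = build_model()
    loss = model.train_on_batch(x, y)   # one gradient descent step
    assert np.isfinite(loss), "training step produced a non-finite loss"
```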
Write integration tests for pipeline components
In an ML pipeline, changes in one component can cause errors in other components. Check that components work together by writing an integration test that runs the entire pipeline end-to-end.
Besides running integration tests continuously, you should also run them when pushing new models and new software versions. The slowness of running the entire pipeline makes continuous integration testing harder. To run integration tests faster, train on a subset of the data or with a simpler model; the details depend on your model and data. To get continuous coverage, adjust your faster tests so that they run with every new version of the model or software, while the slow tests keep running continuously in the background.
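The following sketch shows one way to structure such a fast end-to-end test. Every stage is a deliberately simplified stand-in for the corresponding real pipeline component, and the data subset size, features, and accuracy bar are hypothetical.

```python
# A minimal sketch of a fast end-to-end pipeline test on a small data subset.
import numpy as np

def load_data(limit: int) -> tuple[np.ndarray, np.ndarray]:
    """Stand-in for the ingestion component; returns a small synthetic subset."""
    rng = np.random.default_rng(0)
    x = rng.normal(size=(limit, 4))
    y = (x[:, 0] + x[:, 1] > 0).astype(float)
    return x, y

def engineer_features(x: np.ndarray) -> np.ndarray:
    """Stand-in for feature engineering: one crossed feature plus a bias column."""
    return np.hstack([x, x[:, :1] * x[:, 1:2], np.ones((len(x), 1))])

def train(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Stand-in for the training component: a least-squares fit."""
    w, *_ = np.linalg.lstsq(x, y, rcond=None)
    return w

def evaluate(w: np.ndarray, x: np.ndarray, y: np.ndarray) -> float:
    """Fraction of examples classified correctly by thresholding at 0.5."""
    return float(np.mean(((x @ w) > 0.5) == y))

def test_pipeline_end_to_end_on_subset():
    x, y = load_data(limit=200)          # small subset keeps the test fast
    features = engineer_features(x)
    weights = train(features, y)
    accuracy = evaluate(weights, features, y)
    assert accuracy > 0.7, f"pipeline produced accuracy {accuracy:.2f}"
```

In a real pipeline, each stand-in would call the actual component with a reduced configuration, so that a change in any component still exercises the same end-to-end path.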
Validate model quality before serving
Before pushing a new model version to production, test for the following two types of quality degradation:
- Sudden degradation. A bug in the new version could cause significantly lower quality. Validate new versions by checking their quality against the previous version.
- Slow degradation. A test for sudden degradation might not detect a slow degradation in model quality over multiple versions. Instead, ensure that your model's predictions on a validation dataset meet a fixed threshold. If your validation dataset deviates from live data, update the validation dataset and confirm that your model still meets the same quality threshold.
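Both checks might be expressed as tests like the following sketch. The metric values, version names, and the evaluate_on_validation_set helper are placeholders for your real evaluation job.

```python
# A minimal sketch of the sudden- and slow-degradation checks.
QUALITY_THRESHOLD = 0.90        # hypothetical fixed floor on validation AUC
MAX_ALLOWED_DROP = 0.005        # hypothetical tolerance vs. the previous version

def evaluate_on_validation_set(model_version: str) -> float:
    """Stand-in: returns the validation AUC for a given model version."""
    return {"v41": 0.941, "v42": 0.938}[model_version]  # fake results

def test_no_sudden_degradation():
    previous = evaluate_on_validation_set("v41")
    candidate = evaluate_on_validation_set("v42")
    assert candidate >= previous - MAX_ALLOWED_DROP, (
        f"candidate AUC {candidate:.3f} dropped more than "
        f"{MAX_ALLOWED_DROP} below the previous version's {previous:.3f}")

def test_no_slow_degradation():
    candidate = evaluate_on_validation_set("v42")
    assert candidate >= QUALITY_THRESHOLD, (
        f"candidate AUC {candidate:.3f} fell below the fixed "
        f"threshold {QUALITY_THRESHOLD}")
```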
Validate model-infrastructure compatibility before serving
If your model is updated faster than your server, then your model could have different software dependencies from your server, potentially causing incompatibilities. Ensure that the operations used by the model are present in the server by staging the model in a sandboxed version of the server.
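One way to stage the model, assuming it is exported as a TensorFlow SavedModel and served with TensorFlow Serving, is to load it in the same serving container used in production and send a smoke-test request, as in the sketch below. The paths, model name, and example payload are hypothetical.

```python
# A minimal sketch of a sandboxed serving smoke test (hypothetical paths/names).
import subprocess
import time

import requests

MODEL_DIR = "/tmp/unicorn_model"        # exported SavedModel with versioned subdirs
MODEL_NAME = "unicorn"
PREDICT_URL = f"http://localhost:8501/v1/models/{MODEL_NAME}:predict"

def test_model_loads_in_sandboxed_server():
    # Start the same serving image used in production, but locally.
    server = subprocess.Popen([
        "docker", "run", "--rm", "-p", "8501:8501",
        "--mount", f"type=bind,source={MODEL_DIR},target=/models/{MODEL_NAME}",
        "-e", f"MODEL_NAME={MODEL_NAME}", "tensorflow/serving",
    ])
    try:
        time.sleep(10)  # crude wait; poll the model status endpoint in practice
        payload = {"instances": [[0.2, 0.5, 1800.0, 12.0, 0.4, 0.7, 1.0, 0.0]]}
        response = requests.post(PREDICT_URL, json=payload, timeout=5)
        # If an op used by the model is missing from the server, the model
        # fails to load and this request does not return a prediction.
        assert response.status_code == 200, response.text
        assert "predictions" in response.json()
    finally:
        server.terminate()
```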