Datasets: Labels
This section focuses on labels.
Direct versus proxy labels
Consider two different kinds of labels:
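- Direct labels, which are identical to the prediction your model is trying to make. That is, the value the model predicts is present as a column in your dataset. For example, a column named `bicycle owner` would be a direct label for a binary classification model that predicts whether a person owns a bicycle.
- Proxy labels, which are similar (but not identical) to the prediction your model is trying to make. For example, a person who subscribes to Bicycle Bizarre magazine probably (but not definitely) owns a bicycle.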
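Direct labels are generally better than proxy labels. If your dataset provides a possible direct label, you should probably use it. Often, though, direct labels aren't available.
Proxy labels are always a compromise: an imperfect approximation of a direct label. However, some proxy labels are close enough approximations to be useful. A model that uses a proxy label is only as useful as the connection between the proxy label and the prediction.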
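One rough way to gauge the connection between a proxy label and the prediction is to compare the two on a small sample where the direct label happens to be known. The sketch below assumes such an audited sample exists; the data, the `subscribes` proxy column, and the `proxy_quality` helper are all hypothetical and only illustrate the idea.

```python
# Hypothetical audited sample: 1.0 = yes, 0.0 = no.
owns_bicycle = [1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0]  # direct label
subscribes = [1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]    # proxy label


def proxy_quality(direct, proxy):
    """Return (agreement, precision, recall) of a binary proxy label."""
    pairs = list(zip(direct, proxy))
    true_pos = sum(1 for d, p in pairs if d == 1.0 and p == 1.0)
    false_pos = sum(1 for d, p in pairs if d == 0.0 and p == 1.0)
    false_neg = sum(1 for d, p in pairs if d == 1.0 and p == 0.0)
    agreement = sum(1 for d, p in pairs if d == p) / len(pairs)
    # Precision: of the people the proxy flags, how many are real owners?
    flagged = true_pos + false_pos
    precision = true_pos / flagged if flagged else 0.0
    # Recall: of the real owners, how many does the proxy flag?
    owners = true_pos + false_neg
    recall = true_pos / owners if owners else 0.0
    return agreement, precision, recall


agreement, precision, recall = proxy_quality(owns_bicycle, subscribes)
print(agreement, precision, recall)  # 0.75 0.75 0.75
```

Low precision or recall on such a sample suggests the proxy is too loosely connected to the prediction to be useful.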
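Recall that every label must be represented as a floating-point number in the feature vector (because machine learning is fundamentally just a huge amalgam of mathematical operations). Sometimes a direct label exists but can't easily be represented as a floating-point number in the feature vector. In this case, use a proxy label.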
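For the easy case, where a direct label is a simple yes/no value, the conversion to a floating-point number could look like the following sketch. The raw values and the `encode_label` helper are assumptions made for illustration. A label that raises an error here is one that can't be represented this simply.

```python
def encode_label(raw):
    """Map a raw yes/no value to 1.0 or 0.0; raise on anything else."""
    text = str(raw).strip().lower()
    if text in {"yes", "true", "1"}:
        return 1.0
    if text in {"no", "false", "0"}:
        return 0.0
    raise ValueError(f"Unrecognized label value: {raw!r}")


# Hypothetical raw values from a `bicycle owner` column.
raw_column = ["yes", "No", True, 0, " YES "]
labels = [encode_label(value) for value in raw_column]
print(labels)  # [1.0, 0.0, 1.0, 0.0, 1.0]
```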
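Exercise: Check your understanding
Your company wants to do the following:
Mail coupons ("Trade in your old bicycle for 15% off a new bicycle") to bicycle owners.
So, your model must do the following:
Predict which people own a bicycle.
Unfortunately, the dataset doesn't contain a column named `bike owner`. However, the dataset does contain a column named `recently bought a bicycle`.
Would `recently bought a bicycle` be a good proxy label or a poor proxy label for this model?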
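Good proxy label
Correct. The column `recently bought a bicycle` is a relatively good proxy label. After all, most people who recently bought a bicycle now own one. Nevertheless, like all proxy labels, even very good ones, `recently bought a bicycle` is imperfect: the person buying an item isn't always the person using (or owning) it. For example, people sometimes buy bicycles as gifts.
Poor proxy label
Incorrect. Like all proxy labels, `recently bought a bicycle` is imperfect (some bicycles are bought as gifts and given to others). However, `recently bought a bicycle` is still a relatively good indicator that someone owns a bicycle.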
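Human-generated data
Some data is human-generated; that is, one or more humans examine some information and provide a value, usually for the label. For example, one or more meteorologists could examine pictures of the sky and identify cloud types.
Alternatively, some data is automatically generated. That is, software (possibly another machine learning model) determines the value. For example, a machine learning model could examine sky pictures and automatically identify cloud types.
This section explores the advantages and disadvantages of human-generated data.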
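Advantages
- Human raters can perform a wide range of tasks that even sophisticated machine learning models may find difficult.
- The process forces the owner of the dataset to develop clear and consistent criteria.
Disadvantages
- You typically pay human raters, so human-generated data can be expensive.
- To err is human. Therefore, multiple human raters might have to evaluate the same data.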
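When several raters evaluate the same example, their answers have to be combined somehow. The text doesn't prescribe a method; a simple majority vote is one common choice, sketched below. The file names, cloud types, and the `majority_label` helper are hypothetical.

```python
from collections import Counter


def majority_label(votes):
    """Return the most common label, or None if the top labels tie."""
    if not votes:
        raise ValueError("At least one rating is required.")
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # Tie: send the example back for another rating.
    return counts[0][0]


# Hypothetical ratings: each sky picture was labeled by two or three raters.
ratings = {
    "sky_001.jpg": ["cumulus", "cumulus", "stratus"],
    "sky_002.jpg": ["cirrus", "stratus"],
    "sky_003.jpg": ["cirrus", "cirrus", "cirrus"],
}
resolved = {name: majority_label(votes) for name, votes in ratings.items()}
print(resolved)
# {'sky_001.jpg': 'cumulus', 'sky_002.jpg': None, 'sky_003.jpg': 'cirrus'}
```

Returning `None` on a tie, rather than picking arbitrarily, keeps unresolved examples visible so they can be rated again.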
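Think through these questions to determine your needs:
- How skilled must your raters be? (For example, must the raters know a specific language? Do you need linguists for dialogue or NLP applications?)
- How many labeled examples do you need? How soon do you need them?
- What's your budget?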
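Always double-check your human raters. For example, label 1,000 examples yourself, and see how well your results match the other raters' results. If discrepancies surface, don't assume your ratings are the correct ones, especially if a value judgment is involved. If human raters have introduced errors, consider adding instructions to help them and try again.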
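A comparison like this can be sketched in a few lines. The example below computes the raw agreement rate and also Cohen's kappa, a standard chance-corrected agreement statistic that the text itself doesn't mention. The labels and the `cohens_kappa` helper are hypothetical, and the sample is tiny for readability.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Chance-corrected agreement between two raters on the same examples."""
    if len(a) != len(b) or not a:
        raise ValueError("Both raters must label the same non-empty set.")
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Agreement expected if both raters labeled at random with these rates.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both raters used a single identical label throughout.
    return (observed - expected) / (1 - expected)


# Hypothetical labels for the same eight sky pictures.
mine = ["cumulus", "cumulus", "stratus", "cirrus",
        "stratus", "cumulus", "cirrus", "stratus"]
rater = ["cumulus", "stratus", "stratus", "cirrus",
         "stratus", "cumulus", "cumulus", "stratus"]

agreement = sum(1 for x, y in zip(mine, rater) if x == y) / len(mine)
kappa = cohens_kappa(mine, rater)
disagreements = [i for i, (x, y) in enumerate(zip(mine, rater)) if x != y]

print(agreement)         # 0.75
print(f"{kappa:.3f}")    # 0.610
print(disagreements)     # [1, 6]
```

The list of disagreeing indices is the useful part in practice: those are the examples to review together before deciding whose rating (if either) is correct.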
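Looking at your data by hand is a good exercise regardless of how you obtained it. Andrej Karpathy did this on ImageNet and wrote about the experience (http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet).
Models can train on a mix of automated and human-generated labels. However, for most models, an extra set of human-generated labels (which can become stale) is generally not worth the extra complexity and maintenance. That said, human-generated labels can sometimes provide extra information not available in the automated labels.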
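Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-02-26 UTC.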