Measuring similarity from embeddings
You now have embeddings for any pair of examples. A supervised similarity measure takes these embeddings and returns a number measuring their similarity. Remember that embeddings are vectors of numbers. To find the similarity between two vectors \(A = [a_1,a_2,...,a_n]\) and \(B = [b_1,b_2,...,b_n]\), choose one of these three similarity measures:
| Measure | Meaning | Formula | As similarity increases, this measure... |
|---|---|---|---|
| Euclidean distance | Distance between the ends of the vectors | \(\sqrt{(a_1-b_1)^2+(a_2-b_2)^2+...+(a_n-b_n)^2}\) | Decreases |
| Cosine | Cosine of the angle \(\theta\) between the vectors | \(\frac{a^T b}{\|a\| \cdot \|b\|}\) | Increases |
| Dot product | Cosine multiplied by the lengths of both vectors | \(a_1b_1+a_2b_2+...+a_nb_n\) \(=\|a\|\|b\|\cos(\theta)\) | Increases. Also increases with the lengths of the vectors. |
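To make the three measures concrete, here is a minimal Python sketch (not part of the original page; it assumes NumPy, and the example vectors are invented):

```python
import numpy as np

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Distance between the ends of the two vectors; decreases as similarity increases."""
    return float(np.linalg.norm(a - b))

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle theta between the vectors; increases with similarity."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dot_product(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine multiplied by both vector lengths; also grows with vector length."""
    return float(np.dot(a, b))

# Two hypothetical embedding vectors.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 2.0, 2.0])

print(euclidean_distance(a, b))  # ~1.414
print(cosine(a, b))              # ~0.926
print(dot_product(a, b))         # 12.0
```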
Choosing a similarity measure
In contrast to the cosine, the dot product is proportional to the vector length. This is important because examples that appear very frequently in the training set (for example, popular YouTube videos) tend to have embedding vectors with large lengths.

If you want to capture popularity, then choose the dot product. However, the risk is that popular examples may skew the similarity metric. To balance this skew, you can raise the lengths to an exponent \(\alpha < 1\) and calculate the dot product as \(|a|^{\alpha}|b|^{\alpha}\cos(\theta)\).
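As an illustration of this damping, here is a hedged NumPy sketch; the function name, the example vectors, and the choice of \(\alpha\) are all hypothetical:

```python
import numpy as np

def damped_dot_product(a: np.ndarray, b: np.ndarray, alpha: float = 0.5) -> float:
    """Compute |a|^alpha * |b|^alpha * cos(theta), with alpha < 1 damping
    the influence of large (popular) embedding vectors."""
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    cos_theta = np.dot(a, b) / (norm_a * norm_b)
    return float((norm_a ** alpha) * (norm_b ** alpha) * cos_theta)

# A "popular" embedding with a large norm (hypothetical values).
popular = np.array([8.0, 6.0])   # length 10
query = np.array([1.0, 0.0])     # length 1

print(damped_dot_product(popular, query, alpha=1.0))  # 8.0: plain dot product
print(damped_dot_product(popular, query, alpha=0.0))  # 0.8: pure cosine
print(damped_dot_product(popular, query, alpha=0.5))  # ~2.53: a compromise
```

Note that \(\alpha = 1\) recovers the plain dot product and \(\alpha = 0\) recovers the cosine, so \(\alpha\) controls how much popularity (vector length) influences the score.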
To better understand how vector length changes the similarity measure, normalize the vector lengths to 1 and notice that the three measures become proportional to each other.
After normalizing a and b such that \(||a||=1\) and \(||b||=1\), these three measures are related as follows:
- Euclidean distance = \(||a-b|| = \sqrt{||a||^2 + ||b||^2 - 2a^{T}b} = \sqrt{2-2\cos(\theta_{ab})}\).
- Dot product = \(|a||b| \cos(\theta_{ab}) = 1\cdot1\cdot \cos(\theta_{ab}) = \cos(\theta_{ab})\).
- Cosine = \(\cos(\theta_{ab})\).
Thus, all three similarity measures are equivalent because they are proportional to \(\cos(\theta_{ab})\).
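The identities above are easy to check numerically; a small sketch (assuming NumPy, with arbitrary random vectors chosen only for illustration) might look like:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
a = rng.normal(size=8)
b = rng.normal(size=8)

# Normalize both vectors so that ||a|| = ||b|| = 1.
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cos_theta = float(np.dot(a, b))            # after normalization, dot product == cosine
euclidean = float(np.linalg.norm(a - b))   # Euclidean distance

# Verify: Euclidean distance == sqrt(2 - 2*cos(theta)).
assert np.isclose(euclidean, np.sqrt(2.0 - 2.0 * cos_theta))
```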
Review of similarity measures
A similarity measure quantifies the similarity between a pair of examples, relative to other pairs of examples. The two types, manual and supervised, are compared below:
| Type | How to create | Best for | Implications |
|---|---|---|---|
| Manual | Manually combine feature data. | Small datasets with features that are straightforward to combine. | Gives insight into the results of similarity calculations. If feature data changes, you must manually update the similarity measure. |
| Supervised | Measure the distance between embeddings generated by a supervised DNN. | Large datasets with hard-to-combine features. | Gives no insight into the results. However, a DNN can automatically adapt to changing feature data. |