*Last updated 2025-02-25 (UTC).*

# What is clustering?

Suppose you are working with a dataset that includes patient information from a
healthcare system. The dataset is complex and includes both categorical and
numeric features. You want to find patterns and similarities in the dataset.
How might you approach this task?

[**Clustering**](/machine-learning/glossary#clustering) is an unsupervised
machine learning technique designed to group
[**unlabeled examples**](https://developers.google.com/machine-learning/glossary#unlabeled_example)
based on their similarity to each other. (If the examples are labeled, this
kind of grouping is called
[**classification**](https://developers.google.com/machine-learning/glossary#classification_model).)
Consider a hypothetical patient study designed to evaluate a new treatment
protocol. During the study, patients report how many times per week they
experience symptoms and the severity of the symptoms. Researchers can use
clustering analysis to group patients with similar treatment responses into
clusters.
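
To make the idea concrete, here is a minimal, self-contained sketch of one
common clustering algorithm, k-means, run on simulated two-feature patient
data (symptoms per week and symptom severity). The data, the choice of
Euclidean distance as the similarity measure, and `k = 3` are all invented for
illustration, not taken from the study described above:

```python
import numpy as np

def euclidean(a, b):
    # Similarity measure: smaller distance means more similar examples.
    return np.linalg.norm(a - b, axis=-1)

def k_means(points, k, n_iters=100, seed=0):
    """Group unlabeled examples into k clusters; return one cluster ID per example."""
    rng = np.random.default_rng(seed)
    # Initialize centroids at k randomly chosen examples.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign every example to its nearest centroid.
        distances = euclidean(points[:, None, :], centroids[None, :, :])
        cluster_ids = distances.argmin(axis=1)
        # Move each centroid to the mean of its assigned examples.
        new_centroids = np.array([
            points[cluster_ids == c].mean(axis=0) if np.any(cluster_ids == c)
            else centroids[c]
            for c in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged
        centroids = new_centroids
    return cluster_ids, centroids

# Simulated patient data: (symptoms per week, symptom severity),
# drawn around three loose groups.
rng = np.random.default_rng(42)
data = np.vstack([
    rng.normal([2.0, 1.0], 0.3, size=(20, 2)),
    rng.normal([5.0, 4.0], 0.3, size=(20, 2)),
    rng.normal([8.0, 2.0], 0.3, size=(20, 2)),
])
cluster_ids, _ = k_means(data, k=3)
# Each of the 60 examples now carries a single cluster ID in {0, 1, 2}.
```

Each example ends up labeled with one small integer, its cluster ID; that
single value is what the rest of this page builds on.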
Figure 1 demonstrates one possible grouping of simulated data into three
clusters.

**Figure 1: Unlabeled examples grouped into three clusters (simulated data).**

Looking at the unlabeled data on the left of Figure 1, you could guess that
the data forms three clusters, even without a formal definition of similarity
between data points. In real-world applications, however, you need to
explicitly define a **similarity measure**, or the metric used to compare
samples, in terms of the dataset's features. When examples have only a couple
of features, visualizing and measuring similarity is straightforward. But as
the number of features increases, combining and comparing features becomes
less intuitive and more complex. Different similarity measures may be more or
less appropriate for different clustering scenarios, and this course will
address choosing an appropriate similarity measure in later sections:
[Manual similarity measures](/machine-learning/clustering/kmeans/similarity-measure)
and
[Similarity measure from embeddings](/machine-learning/clustering/autoencoder/similarity-measure).

After clustering, each group is assigned a unique label called a **cluster ID**.
Clustering is powerful because it can simplify large, complex datasets with
many features to a single cluster ID.

Clustering use cases
--------------------

Clustering is useful in a variety of industries.
Some common applications for clustering:

- Market segmentation
- Social network analysis
- Search result grouping
- Medical imaging
- Image segmentation
- Anomaly detection

Some specific examples of clustering:

- The [Hertzsprung-Russell diagram](https://wikipedia.org/wiki/Hertzsprung%E2%80%93Russell_diagram)
  shows clusters of stars when plotted by luminosity and temperature.
- Gene sequencing that shows previously unknown genetic similarities and
  dissimilarities between species has led to the revision of taxonomies
  previously based on appearances.
- The [Big 5](https://wikipedia.org/wiki/Big_Five_personality_traits) model of
  personality traits was developed by clustering words that describe
  personality into 5 groups. The
  [HEXACO](https://wikipedia.org/wiki/HEXACO_model_of_personality_structure)
  model uses 6 clusters instead of 5.

### Imputation

When some examples in a cluster have missing feature data, you can infer the
missing data from other examples in the cluster. This is called
[imputation](https://developers.google.com/machine-learning/glossary/#value-imputation).
For example, less popular videos can be clustered with more popular videos to
improve video recommendations.

### Data compression

As discussed, the relevant cluster ID can replace other features for all
examples in that cluster. This substitution reduces the number of features and
therefore also reduces the resources needed to store, process, and train
models on that data.
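
As a toy sketch of that substitution (the video records, feature names, and
cluster assignments below are all invented for illustration):

```python
# Hypothetical per-video feature records (all names and values invented).
videos = {
    "vid_001": {"viewer_locations": ["US", "BR", "IN"],
                "comment_user_ids": [101, 102, 103],
                "tags": ["music", "live"]},
    "vid_002": {"viewer_locations": ["US", "CA"],
                "comment_user_ids": [104],
                "tags": ["music", "cover"]},
    "vid_003": {"viewer_locations": ["JP"],
                "comment_user_ids": [105, 106],
                "tags": ["gaming"]},
}

# Assume a clustering step has already assigned each video a cluster ID.
cluster_id_of = {"vid_001": 0, "vid_002": 0, "vid_003": 1}

# Compressed representation: one small integer per video replaces the
# whole feature set.
compressed = {vid: cluster_id_of[vid] for vid in videos}
# → {"vid_001": 0, "vid_002": 0, "vid_003": 1}
```
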
For very large datasets, these savings become significant.

To give an example, a single YouTube video can have feature data including:

- viewer location, time, and demographics
- comment timestamps, text, and user IDs
- video tags

Clustering YouTube videos replaces this set of features with a single cluster
ID, thus compressing the data.

### Privacy preservation

You can preserve privacy somewhat by clustering users and associating user
data with cluster IDs instead of user IDs. To give one possible example, say
you want to train a model on YouTube users' watch history. Instead of passing
user IDs to the model, you could cluster users and pass only the cluster ID.
This keeps individual watch histories from being attached to individual users.
Note that the cluster must contain a sufficiently large number of users in
order to preserve privacy.

| **Key terms:**
|
| - [clustering](/machine-learning/glossary#clustering)
| - [example](/machine-learning/glossary#example)
| - [unlabeled example](/machine-learning/glossary#unlabeled_example)
| - [classification](/machine-learning/glossary#classification_model)
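
The substitution described under Privacy preservation can be sketched as
follows; the user IDs, cluster assignments, and the minimum-cluster-size
threshold are all invented for illustration:

```python
from collections import Counter

# Hypothetical user-to-cluster assignments (invented for illustration).
user_to_cluster = {"u1": 0, "u2": 0, "u3": 0, "u4": 1,
                   "u5": 0, "u6": 1, "u7": 1, "u8": 2}

MIN_CLUSTER_SIZE = 3  # clusters smaller than this are withheld entirely
cluster_sizes = Counter(user_to_cluster.values())

def anonymized_id(user_id):
    """Return the user's cluster ID, or None when the cluster is too
    small to hide the user in a crowd."""
    cluster = user_to_cluster[user_id]
    return cluster if cluster_sizes[cluster] >= MIN_CLUSTER_SIZE else None

# Training examples carry cluster IDs, never raw user IDs; events whose
# cluster is too small are dropped (cluster 2 holds only "u8").
watch_events = [("u1", "video_a"), ("u4", "video_b"), ("u8", "video_c")]
training_rows = [(anonymized_id(u), v) for u, v in watch_events
                 if anonymized_id(u) is not None]
# → [(0, "video_a"), (1, "video_b")]
```

The size check is the point of the final caveat above: a cluster ID only
obscures an individual when enough other users share that ID.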