[null,null,["最后更新时间 (UTC):2024-08-13。"],[[["\u003cp\u003eBinning is a feature engineering technique used to group numerical data into categories (bins) to improve model performance when a linear relationship is weak or data is clustered.\u003c/p\u003e\n"],["\u003cp\u003eBinning can be beneficial when features exhibit a "clumpy" distribution rather than a linear one, allowing the model to learn separate weights for each bin.\u003c/p\u003e\n"],["\u003cp\u003eWhile creating multiple bins is possible, it's generally recommended to avoid an excessive number as it can lead to insufficient training examples per bin and increased feature dimensionality.\u003c/p\u003e\n"],["\u003cp\u003eQuantile bucketing is a specific binning technique that ensures each bin contains a roughly equal number of examples, which can be particularly useful for datasets with skewed distributions.\u003c/p\u003e\n"],["\u003cp\u003eBinning offers an alternative to scaling or clipping and is particularly useful for handling outliers and improving model performance on non-linear data.\u003c/p\u003e\n"]]],[],null,["**Binning** (also called **bucketing** ) is a\n[**feature engineering**](/machine-learning/glossary#feature_engineering)\ntechnique that groups different numerical subranges into *bins* or\n[***buckets***](/machine-learning/glossary#bucketing).\nIn many cases, binning turns numerical data into categorical data.\nFor example, consider a [**feature**](/machine-learning/glossary#feature)\nnamed `X` whose lowest value is 15 and\nhighest value is 425. Using binning, you could represent `X` with the\nfollowing five bins:\n\n- Bin 1: 15 to 34\n- Bin 2: 35 to 117\n- Bin 3: 118 to 279\n- Bin 4: 280 to 392\n- Bin 5: 393 to 425\n\nBin 1 spans the range 15 to 34, so every value of `X` between 15 and 34\nends up in Bin 1. A model trained on these bins will react no differently\nto `X` values of 17 and 29 since both values are in Bin 1.\n\nThe [**feature vector**](/machine-learning/glossary#feature_vector) represents\nthe five bins as follows:\n\n| Bin number | Range | Feature vector |\n|------------|---------|-----------------------------|\n| 1 | 15-34 | \\[1.0, 0.0, 0.0, 0.0, 0.0\\] |\n| 2 | 35-117 | \\[0.0, 1.0, 0.0, 0.0, 0.0\\] |\n| 3 | 118-279 | \\[0.0, 0.0, 1.0, 0.0, 0.0\\] |\n| 4 | 280-392 | \\[0.0, 0.0, 0.0, 1.0, 0.0\\] |\n| 5 | 393-425 | \\[0.0, 0.0, 0.0, 0.0, 1.0\\] |\n\nEven though `X` is a single column in the dataset, binning causes a model\nto treat `X` as *five* separate features. Therefore, the model learns\nseparate weights for each bin.\n\nBinning is a good alternative to [**scaling**](/machine-learning/glossary#scaling)\nor [**clipping**](/machine-learning/glossary#clipping) when either of the\nfollowing conditions is met:\n\n- The overall *linear* relationship between the feature and the [**label**](/machine-learning/glossary#label) is weak or nonexistent.\n- When the feature values are clustered.\n\nBinning can feel counterintuitive, given that the model in the\nprevious example treats the values 37 and 115 identically. But when\na feature appears more *clumpy* than linear, binning is a much better way to\nrepresent the data.\n\nBinning example: number of shoppers versus temperature\n\nSuppose you are creating a model that predicts the number of\nshoppers by the outside temperature for that day. 
## Binning example: number of shoppers versus temperature

Suppose you are creating a model that predicts the number of shoppers based
on the outside temperature for that day. Here's a plot of the temperature
versus the number of shoppers:

**Figure 9.** A scatter plot of 45 points.

The plot shows, not surprisingly, that the number of shoppers was highest when
the temperature was most comfortable.

You could represent the feature as raw values: a temperature of 35.0 in the
dataset would be 35.0 in the feature vector. Is that the best idea?

During training, a linear regression model learns a single weight for each
feature. Therefore, if temperature is represented as a single feature, then a
temperature of 35.0 would have five times the influence on a prediction as a
temperature of 7.0 (equivalently, 7.0 would have one-fifth the influence of
35.0). However, the plot doesn't really show any sort of linear relationship
between the label and the feature value.

The graph suggests three clusters in the following subranges:

- Bin 1 is the temperature range 4-11.
- Bin 2 is the temperature range 12-26.
- Bin 3 is the temperature range 27-36.

**Figure 10.** The scatter plot divided into three bins.

The model learns a separate weight for each bin.
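To see how a model learns a separate weight per bin, here's a minimal sketch (the bin edges follow the three subranges above, but the tiny dataset and variable names are invented for illustration) that one-hot encodes temperature and fits a least-squares linear model, yielding one weight per bin:

```python
import numpy as np

# Temperature bins from the example: 4-11, 12-26, 27-36.
TEMP_BIN_UPPER_EDGES = [11, 26, 36]

def temp_to_one_hot(temp: float) -> np.ndarray:
    """One-hot encodes a temperature into the three bins above."""
    bin_index = np.digitize(temp, TEMP_BIN_UPPER_EDGES, right=True)
    one_hot = np.zeros(len(TEMP_BIN_UPPER_EDGES))
    one_hot[bin_index] = 1.0
    return one_hot

# A tiny invented dataset: outside temperature -> number of shoppers.
temps    = [5, 8, 10, 14, 18, 22, 25, 28, 31, 35]
shoppers = [12, 15, 14, 40, 43, 47, 45, 22, 20, 18]

# Design matrix with one column per bin, so least squares fits one
# weight per bin (no intercept).
X = np.array([temp_to_one_hot(t) for t in temps])
y = np.array(shoppers, dtype=float)
weights, *_ = np.linalg.lstsq(X, y, rcond=None)

# With pure one-hot inputs, each weight is simply the mean shopper
# count within that bin.
print(weights)  # approximately [13.7, 43.8, 20.0]
```

Because every example activates exactly one bin, each learned weight captures the typical shopper count for that temperature cluster, which is how binning lets a linear model fit this clumpy, nonlinear pattern.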
While it's possible to create more than three bins, even a separate bin for
each temperature reading, this is often a bad idea for the following reasons:

- A model can only learn the association between a bin and a label if there
  are enough examples in that bin. In the given example, each of the three
  bins contains at least 10 examples, which *might* be sufficient for
  training. With 33 separate bins, none of the bins would contain enough
  examples for the model to train on.
- A separate bin for each temperature results in 33 separate temperature
  features. However, you typically should *minimize* the number of features
  in a model.

## Exercise: Check your understanding

The following plot shows the median home price for each 0.2 degrees of
latitude for the mythical country of Freedonia:

**Figure 11.** Median home value per 0.2 degrees of latitude.

The graphic shows a nonlinear pattern between home value and latitude,
so representing latitude as its floating-point value is unlikely to help
a model make good predictions. Perhaps bucketing latitudes would be a better
idea?

**What would be the best bucketing strategy?**

- **Don't bucket.** Given the randomness of most of the plot, this is
  probably the best strategy.
- **Create four buckets (41.0 to 41.8, 42.0 to 42.6, 42.8 to 43.4, and
  43.6 to 44.8).** It would be hard for a model to find a single predictive
  weight for all the homes in the second bin or the fourth bin, which
  contain few examples.
- **Make each data point its own bucket.** This would only be helpful if the
  training set contains enough examples for each 0.2 degrees of latitude. In
  general, homes tend to cluster near cities and be relatively scarce in
  other places.

## Quantile Bucketing

**Quantile bucketing** creates bucket boundaries such that the number of
examples in each bucket is exactly or nearly equal. Because extreme values
simply fall into the outermost buckets, quantile bucketing also mostly hides
the effect of outliers.

To illustrate the problem that quantile bucketing solves, consider the
equally spaced buckets of car prices shown in the following figure, where
each of the ten buckets represents a span of exactly 10,000 dollars.
Notice that the bucket from 0 to 10,000 contains dozens of examples
but the bucket from 50,000 to 60,000 contains only 5 examples.
Consequently, the model has enough examples to train on for the 0 to 10,000
bucket but not enough for the 50,000 to 60,000 bucket.

**Figure 13.** Some buckets contain a lot of cars; other buckets contain very few cars.

In contrast, the following figure uses quantile bucketing to divide car prices
into bins with approximately the same number of examples in each bucket.
Notice that some of the bins encompass a narrow price span while others
encompass a very wide price span.

**Figure 14.** Quantile bucketing gives each bucket about the same number of cars.

Bucketing with equal intervals works for many data distributions. For skewed
data, however, try quantile bucketing. Equal intervals give extra information
space to the long tail while compacting the large torso into a single bucket.
Quantile buckets give extra information space to the large torso while
compacting the long tail into a single bucket. A short code sketch
contrasting the two approaches appears after the key terms list below.

| **Key terms:**
|
| - [Binning](/machine-learning/glossary#binning)
| - [Bucketing](/machine-learning/glossary#bucketing)
| - [Clipping](/machine-learning/glossary#clipping)
| - [Feature](/machine-learning/glossary#feature)
| - [Feature engineering](/machine-learning/glossary#feature_engineering)
| - [Feature vector](/machine-learning/glossary#feature_vector)
| - [Label](/machine-learning/glossary#label)
| - [Scaling](/machine-learning/glossary#scaling)
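To contrast equal-interval bucketing with quantile bucketing concretely, here's a minimal sketch using pandas (the skewed car prices are randomly generated for illustration; `pd.cut` produces equal-width buckets and `pd.qcut` produces quantile buckets):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Invented, right-skewed car prices: most cars are inexpensive, with a
# long tail of expensive ones.
prices = pd.Series(rng.lognormal(mean=9.5, sigma=0.6, size=1_000))

# Equal-interval bucketing: 10 buckets, each spanning the same price range.
equal_width = pd.cut(prices, bins=10)

# Quantile bucketing: 10 buckets, each holding roughly the same number
# of examples.
quantile = pd.qcut(prices, q=10)

print(equal_width.value_counts().sort_index())  # counts vary wildly
print(quantile.value_counts().sort_index())     # about 100 examples each
```

With equal-width buckets, the long tail is spread across many nearly empty buckets, while quantile bucketing gives every bucket roughly 100 examples at the cost of uneven price spans.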