[null,null,["最后更新时间 (UTC):2024-11-14。"],[[["\u003cp\u003eOverfitting occurs when a model performs well on training data but poorly on new, unseen data.\u003c/p\u003e\n"],["\u003cp\u003eA model is considered to generalize well if it accurately predicts on new data, indicating it hasn't overfit.\u003c/p\u003e\n"],["\u003cp\u003eOverfitting can be detected by observing diverging loss curves for training and validation sets on a generalization curve.\u003c/p\u003e\n"],["\u003cp\u003eCommon causes of overfitting include unrepresentative training data and overly complex models.\u003c/p\u003e\n"],["\u003cp\u003eDataset conditions for good generalization include examples being independent, identically distributed, and stationary, with similar distributions across partitions.\u003c/p\u003e\n"]]],[],null,["[**Overfitting**](/machine-learning/glossary#overfitting) means creating a model\nthat matches (*memorizes* ) the\n[**training set**](/machine-learning/glossary#training-set) so\nclosely that the model fails to make correct predictions on new data.\nAn overfit model is analogous to an invention that performs well in the lab but\nis worthless in the real world.\n| **Tip:** Overfitting is a common problem in machine learning, not an academic hypothetical.\n\nIn Figure 11, imagine that each geometric shape represents a tree's position\nin a square forest. The blue diamonds mark the locations of healthy trees,\nwhile the orange circles mark the locations of sick trees.\n**Figure 11.** Training set: locations of healthy and sick trees in a square forest.\n\nMentally draw any shapes---lines, curves, ovals...anything---to separate the\nhealthy trees from the sick trees. Then, expand the next line to examine\none possible separation.\n\nExpand to see one possible solution (Figure 12). \n**Figure 12.** A complex model for distinguishing sick from healthy trees.\n\nThe complex shapes shown in Figure 12 successfully categorized all but two of\nthe trees. If we think of the shapes as a model, then this is a fantastic\nmodel.\n\nOr is it? A truly excellent model successfully categorizes *new* examples.\nFigure 13 shows what happens when that same model makes predictions on new\nexamples from the test set:\n**Figure 13.**Test set: a complex model for distinguishing sick from healthy trees.\n\nSo, the complex model shown in Figure 12 did a great job on the training set\nbut a pretty bad job on the test set. This is a classic case of a model\n*overfitting* to the training set data.\n\nFitting, overfitting, and underfitting\n\nA model must make good predictions on *new* data.\nThat is, you're aiming to create a model that \"fits\" new data.\n\nAs you've seen, an overfit model makes excellent predictions on the training\nset but poor predictions on new data. An\n[**underfit**](/machine-learning/glossary#underfitting) model\ndoesn't even make good predictions on the training data. If an overfit model is\nlike a product that performs well in the lab but poorly in the real world,\nthen an underfit model is like a product that doesn't even do well in\nthe lab.\n**Figure 14.** Underfit, fit, and overfit models.\n\n[**Generalization**](/machine-learning/glossary#generalization) is the\nopposite of overfitting. That is, a model that *generalizes well* makes good\npredictions on new data. 
## What causes overfitting?

Very broadly speaking, overfitting is caused by one or both of the following problems:

- The training set doesn't adequately represent real-life data (or the validation set or test set).
- The model is too complex.

## Generalization conditions

A model trains on a training set, but the real test of a model's worth is how well it makes predictions on new examples, particularly on real-world data. While developing a model, your test set serves as a proxy for real-world data. Training a model that generalizes well implies the following dataset conditions:

- Examples must be [**independently and identically distributed**](/machine-learning/glossary#independently-and-identically-distributed-i.i.d), which is a fancy way of saying that your examples can't influence each other.
- The dataset is [**stationary**](/machine-learning/glossary#stationarity), meaning the dataset doesn't change significantly over time.
- The dataset partitions have the same distribution. That is, the examples in the training set are statistically similar to the examples in the validation set, test set, and real-world data. (A common way to encourage this, shuffling before splitting, is sketched after this list.)
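The sketch below shows one common way to satisfy the third condition: shuffle the examples thoroughly, then slice the shuffled rows into partitions. The synthetic DataFrame and the 80/10/10 split are assumptions made only for illustration.

```python
# A sketch of shuffling a dataset before partitioning it, so that the
# training, validation, and test sets have similar distributions.
# The synthetic data and split ratios are illustrative only.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=17)
examples = pd.DataFrame({
    "feature": rng.normal(size=10_000),
    "label": rng.integers(0, 2, size=10_000),
})

# Shuffle every row, then take contiguous slices for each partition.
shuffled = examples.sample(frac=1.0, random_state=17).reset_index(drop=True)

n = len(shuffled)
train = shuffled.iloc[: int(0.8 * n)]                    # 80%
validation = shuffled.iloc[int(0.8 * n) : int(0.9 * n)]  # 10%
test = shuffled.iloc[int(0.9 * n) :]                     # 10%
```

Because the rows are shuffled first, each slice is effectively a random sample of the whole dataset, which is exactly the property the first exercise below asks about.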
Explore the preceding conditions through the following exercises.

## Exercises: Check your understanding

Consider the following dataset partitions. What should you do to ensure that the examples in the training set have a similar statistical distribution to the examples in the validation set and the test set?

- **Shuffle the examples in the dataset extensively before partitioning them.**
  Yes. Good shuffling of examples makes partitions much more likely to be statistically similar.
- **Sort the examples from earliest to most recent.**
  If the examples in the dataset are not stationary, then sorting makes the partitions *less* similar.
- **Do nothing. Given enough examples, the law of averages naturally ensures that the distributions will be statistically similar.**
  Unfortunately, this is not the case. The examples in certain sections of the dataset may differ from those in other sections.

A streaming service is developing a model to predict the popularity of potential new television shows for the next three years. The streaming service plans to train the model on a dataset containing hundreds of millions of examples, spanning the previous ten years. Will this model encounter a problem?

- **Probably. Viewers' tastes change in ways that past behavior can't predict.**
  Yes. Viewer tastes are not stationary. They constantly change.
- **Definitely not. The dataset is large enough to make good predictions.**
  Unfortunately, viewers' tastes are nonstationary.
- **Probably not. Viewers' tastes change in predictably cyclical ways. Ten years of data will enable the model to make good predictions on future trends.**
  Although certain aspects of entertainment are somewhat cyclical, a model trained from past entertainment history will almost certainly have trouble making predictions about the next few years.

A model aims to predict the time it takes for people to walk a mile based on weather data (temperature, dew point, and precipitation) collected over one year in a city whose weather varies significantly by season. Can you build and test a model from this dataset, even though the weather readings change dramatically by season?

- **Yes.**
  Yes, it is possible to build and test a model from this dataset. You just have to ensure that the data is partitioned equally, so that data from all four seasons is distributed equally into the different partitions.
- **No.**
  Actually, assuming this dataset contains enough examples of temperature, dew point, and precipitation, you can build and test a model from it. You just have to ensure that the data is partitioned equally, so that data from all four seasons is distributed equally into the different partitions.

## Challenge exercise

You are creating a model that predicts the ideal date for riders to buy a train ticket for a particular route. For example, the model might recommend that users buy their ticket on July 8 for a train that departs July 23. The train company updates prices hourly, basing those updates on a variety of factors but mainly on the current number of available seats. That is:

- If a lot of seats are available, ticket prices are typically low.
- If very few seats are available, ticket prices are typically high.

Your model exhibits low loss on the validation set and the test set but sometimes makes terrible predictions on real-world data. Why?

**Answer:** The deployed model is struggling with a [**feedback loop**](/machine-learning/glossary#feedback-loop).

For example, suppose the model recommends that users buy tickets on July 8. Some riders who use the model's recommendation buy their tickets at 8:30 in the morning on July 8. At 9:00, the train company raises prices because fewer seats are now available. Riders using the model's recommendation have themselves altered prices. By evening, ticket prices might be much higher than in the morning.

**Key terms:**

- [Feedback loop](/machine-learning/glossary#feedback-loop)
- [Generalization](/machine-learning/glossary#generalization)
- [Generalization curve](/machine-learning/glossary#generalization-curve)
- [Independently and identically distributed (i.i.d)](/machine-learning/glossary#independently-and-identically-distributed-i.i.d)
- [Loss curve](/machine-learning/glossary#loss-curve)
- [Overfitting](/machine-learning/glossary#overfitting)
- [Stationarity](/machine-learning/glossary#stationarity)
- [Training set](/machine-learning/glossary#training-set)
- [Underfitting](/machine-learning/glossary#underfitting)