Step 1: Gather Data
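Gathering data is the most important step in solving any supervised machine learning problem. Your text classifier can only be as good as the dataset it is built from.
If you don't have a specific problem you want to solve and are just interested in exploring text classification in general, there are plenty of open source datasets available. You can find links to some of them in our GitHub repo (https://github.com/google/eng-edu/blob/master/ml/guides/text_classification/load_data.py). On the other hand, if you are tackling a specific problem, you will need to collect the necessary data. Many organizations provide public APIs for accessing their data, for example the X API (https://developer.x.com/docs) or the NY Times API (http://developer.nytimes.com/). You may be able to leverage these APIs for the problem you are trying to solve.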
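Public APIs commonly cap how often you can query them, so a collection script usually paces its requests. The following is a minimal sketch of that idea, not any particular API's client: `fetch_page` is a hypothetical function you would write yourself to wrap the real API call, and `max_pages` and `min_interval_s` are illustrative parameters whose values depend on the API's documented limits.

```python
import time
from typing import Callable, Iterator, List


def collect_pages(
    fetch_page: Callable[[int], List[str]],
    max_pages: int,
    min_interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[str]:
    """Yield text records page by page, spacing calls at least min_interval_s apart.

    fetch_page is a hypothetical, caller-supplied function that wraps the real
    API request and returns the records on one page (an empty list when done).
    """
    last_call = None
    for page in range(max_pages):
        if last_call is not None:
            # Wait out whatever remains of the minimum interval since the last call.
            wait = min_interval_s - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        records = fetch_page(page)
        if not records:
            break  # An empty page is treated as the end of the data.
        yield from records
```

Passing `sleep` and `clock` as arguments is only a design convenience: it lets you check the pacing logic without real waiting or network access.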
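Here are some important things to remember when collecting data:
- If you are using a public API, understand its limitations before you use it. For example, some APIs set a limit on the rate at which you can make queries.
- The more training examples (referred to as "samples" in the rest of this guide) you have, the better. This will help your model generalize better.
- Make sure the number of samples for every class or topic is not overly imbalanced. That is, you should have a comparable number of samples in each class.
- Make sure that your samples adequately cover the space of possible inputs, not only the common cases.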
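One simple way to check the class-balance point above is to count the samples per class and compare the largest class to the smallest. The sketch below assumes the labels are available as a plain Python list; the function names `class_counts` and `imbalance_ratio` are illustrative, and what ratio counts as "overly imbalanced" is a judgment call for your problem rather than a fixed threshold.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def class_counts(labels: Iterable[Hashable]) -> Dict[Hashable, int]:
    """Return the number of samples observed for each class label."""
    return dict(Counter(labels))


def imbalance_ratio(labels: Iterable[Hashable]) -> float:
    """Return (size of largest class) / (size of smallest class).

    A value of 1.0 means every class has the same number of samples;
    larger values mean the dataset is more imbalanced.
    """
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels is empty")
    return max(counts.values()) / min(counts.values())


if __name__ == "__main__":
    labels = ["positive", "negative", "positive", "positive"]
    print(class_counts(labels))
    print(imbalance_ratio(labels))
```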
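Throughout this guide, we will use the Internet Movie Database (IMDb) movie reviews dataset (http://ai.stanford.edu/%7Eamaas/data/sentiment/) to illustrate the workflow. This dataset contains movie reviews posted by people on the IMDb website, as well as the corresponding labels ("positive" or "negative") indicating whether the reviewer liked the movie or not. This is a classic example of a sentiment analysis problem.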
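As a rough sketch of how such a dataset might be read into memory, the function below assumes the archive has already been downloaded and extracted into a directory containing `train/pos`, `train/neg`, `test/pos`, and `test/neg` folders of `.txt` files, one review per file. That layout, the function name `load_split`, and the 0/1 label encoding are assumptions made for this illustration; this is not the guide's own loading code.

```python
import os
import random
from typing import List, Tuple


def load_split(data_dir: str, split: str, seed: int = 123) -> Tuple[List[str], List[int]]:
    """Load one split ("train" or "test") as parallel lists of texts and labels.

    Assumes data_dir/<split>/neg and data_dir/<split>/pos each hold one
    review per .txt file. Labels: 0 = negative, 1 = positive.
    """
    pairs = []
    for label, category in enumerate(("neg", "pos")):
        folder = os.path.join(data_dir, split, category)
        for name in sorted(os.listdir(folder)):
            if name.endswith(".txt"):
                with open(os.path.join(folder, name), encoding="utf-8") as f:
                    pairs.append((f.read(), label))
    # Shuffle text/label pairs together so the classes are interleaved
    # while each text stays attached to its own label.
    random.Random(seed).shuffle(pairs)
    texts = [text for text, _ in pairs]
    labels = [label for _, label in pairs]
    return texts, labels
```

A call such as `load_split("aclImdb", "train")` would then return the training reviews and their labels, where `"aclImdb"` stands in for wherever you extracted the data.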
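Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-07-27 UTC.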