Last updated (UTC): 2025-07-27.

- This guide simplifies selecting a text classification model by identifying the best-performing algorithm for a given dataset based on accuracy and training time.
- A flowchart and algorithm are provided to guide model selection, primarily focusing on two options: using a multi-layer perceptron (MLP) with n-grams for datasets with a low samples-to-words-per-sample ratio or a sequence model (sepCNN) for datasets with a high ratio.
- Extensive experimentation across various text classification problems and datasets informed the recommendations, emphasizing the samples-to-words-per-sample ratio as a key factor in model selection.
- While the guide aims for optimal accuracy with minimal computation, it may not always yield the absolute best results due to potential variations in dataset characteristics, goals, or the emergence of newer algorithms.
- Users can utilize the flowchart as a starting point for model construction, iteratively refining the model based on their specific needs and dataset properties.

At this point, we have assembled our dataset and gained insights into the key
characteristics of our data. Next, based on the metrics we gathered in
[Step 2](/machine-learning/guides/text-classification/step-2), we should think
about which classification model we should use. This means asking questions
such as:

- How do you present the text data to an algorithm that expects numeric input? (This is called data preprocessing and vectorization.)
- What type of model should you use?
- What configuration parameters should you use for your model?

Thanks to decades of research, we have access to a large array of data
preprocessing and model configuration options. However, having so many viable
options to choose from can greatly increase the complexity and scope of a
particular problem. Given that the best options might not be obvious, a naive
solution would be to try every possible option exhaustively, pruning some
choices through intuition. However, that would be tremendously expensive.

In this guide, we attempt to significantly simplify the process of selecting a
text classification model. For a given dataset, our goal is to find the
algorithm that achieves close to maximum accuracy while minimizing the
computation time required for training. We ran a large number (~450K) of
experiments across problems of different types (especially sentiment analysis
and topic classification problems), using 12 datasets, alternating for each
dataset between different data preprocessing techniques and different model
architectures. This helped us identify dataset parameters that influence optimal
choices.

The model selection algorithm and flowchart below are a summary of our
experimentation. Don't worry if you don't understand all the terms used in them
yet; the following sections of this guide will explain them in depth.

Algorithm for Data Preparation and Model Building

1. Calculate the ratio of the number of samples to the number of words per sample.
2. If this ratio is less than 1500, tokenize the text as [n-grams](/machine-learning/glossary#n-gram) and use a simple multi-layer perceptron (MLP) model to classify them (left branch in the flowchart below):
    1. Split the samples into word n-grams; convert the n-grams into vectors.
    2. Score the importance of the vectors and then select the top 20K using the scores.
    3. Build an MLP model.
3. If the ratio is greater than or equal to 1500, tokenize the text as sequences and use a [sepCNN](/machine-learning/glossary?utm_source=DevSite&utm_campaign=Text-Class-Guide&utm_medium=referral&utm_content=glossary&utm_term=sepCNN#depthwise-separable-convolutional-neural-network-sepcnn) model to classify them (right branch in the flowchart below):
    1. Split the samples into words; select the top 20K words based on their frequency.
    2. Convert the samples into word sequence vectors.
    3. If the samples/words-per-sample ratio calculated in step 1 is less than 15K, using a fine-tuned pre-trained embedding with the sepCNN model will likely provide the best results.
4. Measure the model performance with different hyperparameter values to find the best model configuration for the dataset.

In the flowchart below, the yellow boxes indicate data and model preparation
processes. Grey boxes and green boxes indicate choices we considered for each
process. Green boxes indicate our recommended choice for each process.

You can use this flowchart as a starting point to construct your first
experiment, as it will give you good accuracy at low computation costs. You can
then continue to improve on your initial model over the subsequent iterations.

**Figure 5: Text classification flowchart**

This flowchart answers two key questions:

1. Which learning algorithm or model should you use?
2. How should you prepare the data to efficiently learn the relationship between text and label?

The answer to the second question depends on the answer to the first question;
the way we preprocess data to be fed into a model will depend on what model we
choose. Models can be broadly classified into two categories: those that use
word ordering information (sequence models), and those that just see text as
"bags" (sets) of words (n-gram models). Types of sequence models include
convolutional neural networks (CNNs), recurrent neural networks (RNNs), and
their variations. Types of n-gram models include:

- [logistic regression](/machine-learning/glossary#logistic-regression)
- [simple multi-layer perceptrons](https://wikipedia.org/wiki/Multilayer_perceptron) (MLPs, or fully-connected neural networks)
- [gradient boosted trees](/machine-learning/glossary#gradient-boosted-decision-trees-gbt)
- [support vector machines](/machine-learning/glossary#kernel-support-vector-machines-ksvms)

**From our experiments, we have observed that the ratio of "number of samples"
(S) to "number of words per sample" (W) correlates with which model performs
well.**

When the value for this ratio is small (< 1500), small multi-layer perceptrons
that take n-grams as input (which we'll call **Option A**) perform better than
or at least as well as sequence models. MLPs are simple to define and
understand, and they take much less compute time than sequence models. When the
value for this ratio is large (>= 1500), use a sequence model (**Option B**).
In the steps that follow, you can skip to the relevant subsections (labeled
**A** or **B**) for the model type you chose based on the
samples/words-per-sample ratio.

In the case of our IMDb review dataset, the samples/words-per-sample ratio is
~144. This means that we will create an MLP model.
| **Note**: When using the above flowchart, keep in mind that it may not necessarily lead you to the optimal results for your problem, for several reasons:
| - Your goal may be different. We optimized for the best accuracy that could be achieved in the shortest possible compute time. An alternate flow may produce a better result, say, when optimizing for [area under the curve (AUC)](https://developers.google.com/machine-learning/glossary#AUC).
| - We picked typical and common algorithm choices. As the field continues to evolve, new cutting-edge algorithms and enhancements may be relevant to your data and may perform better.
| - While we used several datasets to derive and validate the flowchart, there may be specific characteristics of your dataset that favor using an alternate flow.
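
As a rough illustration of the selection rule in this guide, the sketch below computes the samples/words-per-sample ratio and maps it to a recommended starting point. It is not the guide's own code. The function names and the returned labels are hypothetical. Counting words by splitting on whitespace and taking the median count per sample are assumptions made for this example. The 1500 and 15K thresholds come from the guide.

```python
from statistics import median

# Thresholds taken from the guide's model selection algorithm.
NGRAM_MLP_THRESHOLD = 1500
PRETRAINED_EMBEDDING_THRESHOLD = 15_000


def samples_to_words_ratio(texts):
    """Return (number of samples) / (words per sample).

    Assumption: "words per sample" is the median whitespace-separated
    word count across all samples.
    """
    if not texts:
        raise ValueError("texts must contain at least one sample")
    words_per_sample = median(len(text.split()) for text in texts)
    if words_per_sample == 0:
        raise ValueError("samples contain no words")
    return len(texts) / words_per_sample


def recommend_model(ratio):
    """Map the S/W ratio to a starting point (labels are hypothetical)."""
    if ratio < NGRAM_MLP_THRESHOLD:
        # Option A: n-gram vectors fed into a simple MLP.
        return "n-gram MLP"
    if ratio < PRETRAINED_EMBEDDING_THRESHOLD:
        # Option B, with a fine-tuned pre-trained embedding.
        return "sepCNN with fine-tuned pre-trained embedding"
    # Option B: sequence vectors fed into a sepCNN.
    return "sepCNN"
```

For the IMDb ratio quoted above, `recommend_model(144)` selects the n-gram MLP branch. Treat the result as a first experiment to iterate on, in line with the note above, rather than as a final choice.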