[[["容易理解","easyToUnderstand","thumb-up"],["確實解決了我的問題","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["缺少我需要的資訊","missingTheInformationINeed","thumb-down"],["過於複雜/步驟過多","tooComplicatedTooManySteps","thumb-down"],["過時","outOfDate","thumb-down"],["翻譯問題","translationIssue","thumb-down"],["示例/程式碼問題","samplesCodeIssue","thumb-down"],["其他","otherDown","thumb-down"]],["上次更新時間:2025-07-27 (世界標準時間)。"],[[["\u003cp\u003eThis module explores language models, which estimate the probability of a token or sequence of tokens occurring within a longer sequence, enabling tasks like text generation, translation, and summarization.\u003c/p\u003e\n"],["\u003cp\u003eLanguage models utilize context, the surrounding information of a target token, to enhance prediction accuracy, with recurrent neural networks offering more context than traditional N-grams.\u003c/p\u003e\n"],["\u003cp\u003eN-grams are ordered sequences of words used to build language models, with longer N-grams providing more context but potentially encountering sparsity issues.\u003c/p\u003e\n"],["\u003cp\u003eTokens, the atomic units of language modeling, represent words, subwords, or characters and are crucial for understanding and processing language.\u003c/p\u003e\n"],["\u003cp\u003eWhile recurrent neural networks improve context understanding compared to N-grams, they have limitations, paving the way for the emergence of large language models that evaluate the whole context simultaneously.\u003c/p\u003e\n"]]],[],null,["| **Estimated module length:** 45 minutes\n| **Learning objectives**\n|\n| - Define a few different types of language models and their components.\n| - Describe how large language models are created and the importance of context and parameters.\n| - Identify how large language models take advantage of self-attention.\n| - Reveal three key problems with large language models.\n| - Explain how fine-tuning and distillation can improve a model's predictions and efficiency.\n| **Prerequisites:**\n|\n| This module assumes you are familiar with the concepts covered in the\n| following modules:\n|\n| - [Introduction to Machine Learning](/machine-learning/intro-to-ml)\n| - [Linear regression](/machine-learning/crash-course/linear-regression)\n| - [Working with categorical data](/machine-learning/crash-course/categorical-data)\n| - [Datasets, generalization, and overfitting](/machine-learning/crash-course/overfitting)\n| - [Neural networks](/machine-learning/crash-course/neural-networks)\n| - [Embeddings](/machine-learning/crash-course/embeddings)\n\nWhat is a language model?\n\nA [**language model**](/machine-learning/glossary#language-model)\nestimates the probability of a [**token**](/machine-learning/glossary#token)\nor sequence of tokens occurring within a longer sequence of tokens. A token\ncould be a word, a subword (a subset of a word), or even a single character.\n\nClick the icon to learn more about tokens. \nMost modern language models tokenize by subwords, that is, by chunks of\ntext containing semantic meaning. 
Consider the following sentence and the token(s) that might complete it:

```
When I hear rain on my roof, I _______ in my kitchen.
```

A language model determines the probabilities of different tokens or sequences of tokens to complete that blank. For example, the following probability table identifies some possible tokens and their probabilities:

| Probability | Token(s)         |
|-------------|------------------|
| 9.4%        | cook soup        |
| 5.2%        | warm up a kettle |
| 3.6%        | cower            |
| 2.5%        | nap              |
| 2.2%        | relax            |

In some situations, the sequence of tokens could be an entire sentence, paragraph, or even an entire essay.

An application can use the probability table to make predictions. The prediction might be the token(s) with the highest probability (for example, "cook soup") or a random selection from the tokens whose probability exceeds a certain threshold.
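The following sketch shows both selection strategies applied to the probability table above; the threshold value is an arbitrary assumption chosen for illustration.

```python
import random

# Candidate completions and their probabilities, taken from the table above.
candidates = {
    "cook soup": 0.094,
    "warm up a kettle": 0.052,
    "cower": 0.036,
    "nap": 0.025,
    "relax": 0.022,
}

# Strategy 1: greedy selection, which always picks the highest-probability completion.
greedy_choice = max(candidates, key=candidates.get)
print(greedy_choice)  # cook soup

# Strategy 2: sample among completions above a threshold (an illustrative value),
# weighted by their probabilities.
THRESHOLD = 0.03
eligible = {token: p for token, p in candidates.items() if p > THRESHOLD}
sampled_choice = random.choices(list(eligible), weights=list(eligible.values()))[0]
print(sampled_choice)  # for example, "warm up a kettle"
```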
Estimating the probability of what fills in the blank in a text sequence can be extended to more complex tasks, including:

- Generating text.
- Translating text from one language to another.
- Summarizing documents.

By modeling the statistical patterns of tokens, modern language models develop extremely powerful internal representations of language and can generate plausible language.

## N-gram language models

[**N-grams**](/machine-learning/glossary#n-gram) are ordered sequences of words used to build language models, where N is the number of words in the sequence. For example, when N is 2, the N-gram is called a **2-gram** (or a [**bigram**](/machine-learning/glossary#bigram)); when N is 5, the N-gram is called a 5-gram. Given the following phrase in a training document:

```
you are very nice
```

The resulting 2-grams are as follows:

- you are
- are very
- very nice

When N is 3, the N-gram is called a **3-gram** (or a [**trigram**](/machine-learning/glossary#trigram)). Given that same phrase, the resulting 3-grams are:

- you are very
- are very nice

Given two words as input, a language model based on 3-grams can predict the likelihood of the third word. For example, given the following two words:

```
orange is
```

A language model examines all the different 3-grams derived from its training corpus that start with `orange is` to determine the most likely third word. Hundreds of 3-grams could start with the two words `orange is`, but you can focus solely on the following two possibilities:

```
orange is ripe
orange is cheerful
```

The first possibility (`orange is ripe`) is about orange the fruit, while the second possibility (`orange is cheerful`) is about the color orange.
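As a rough sketch of how such a count-based 3-gram model works, the following code counts 3-grams in a tiny invented corpus and ranks candidate third words for the prefix `orange is` by relative frequency. The corpus and the resulting counts are made up purely for illustration; a real model would be built from a far larger training corpus.

```python
from collections import Counter, defaultdict

# A tiny invented training corpus (illustrative only).
corpus = [
    "the orange is ripe and sweet",
    "this orange is ripe",
    "her favorite shade of orange is cheerful",
    "you are very nice",
]

# Count every 3-gram, keyed by its first two words.
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        next_word_counts[(w1, w2)][w3] += 1

# Rank candidate third words for the prefix "orange is".
counts = next_word_counts[("orange", "is")]
total = sum(counts.values())
for word, count in counts.most_common():
    print(f"{word}: {count / total:.2f}")
# ripe: 0.67
# cheerful: 0.33
```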
## Context

Humans can retain relatively long contexts. While watching Act 3 of a play, you retain knowledge of characters introduced in Act 1. Similarly, the punchline of a long joke makes you laugh because you can remember the context from the joke's setup.

In language models, **context** is helpful information before or after the target token. Context can help a language model determine whether "orange" refers to a citrus fruit or a color.

Context can help a language model make better predictions, but does a 3-gram provide sufficient context? Unfortunately, the only context a 3-gram provides is the first two words. For example, the two words `orange is` don't provide enough context for the language model to predict the third word. Due to lack of context, language models based on 3-grams make a lot of mistakes.

Longer N-grams certainly provide more context than shorter N-grams. However, as N grows, each specific sequence of N words appears less often in the training corpus. When N becomes very large, the language model typically sees only a single instance of each N-token sequence, which isn't very helpful for predicting the target token.

## Recurrent neural networks

[**Recurrent neural networks**](/machine-learning/glossary#recurrent-neural-network) provide more context than N-grams. A recurrent neural network is a type of [**neural network**](/machine-learning/glossary#neural-network) that trains on a sequence of tokens. For example, a recurrent neural network can *gradually* learn (and learn to ignore) selected context from each word in a sentence, somewhat as you do when listening to someone speak. A large recurrent neural network can gain context from a passage of several sentences.

Although recurrent neural networks learn more context than N-grams, the amount of useful context they can intuit is still relatively limited. Recurrent neural networks evaluate information "token by token." In contrast, large language models (the topic of the next section) can evaluate the whole context at once.

Note that training recurrent neural networks for long contexts is constrained by the [**vanishing gradient problem**](/machine-learning/glossary#vanishing-gradient-problem).

## Exercise: Check your understanding

Which language model makes better predictions for English text?

- **A language model based on 6-grams?** This model has more context, but unless it has trained on a lot of documents, most of the 6-grams will be rare.
- **A language model based on 5-grams?** This model has less context, so it is unlikely to outperform the model based on 6-grams.

**Answer:** It depends on the size and diversity of the training set. If the training set spans millions of diverse documents, then the model based on 6-grams will probably outperform the model based on 5-grams.

| **Key terms:**
|
| - [Bigram](/machine-learning/glossary#bigram)
| - [Language model](/machine-learning/glossary#language-model)
| - [N-gram](/machine-learning/glossary#n-gram)
| - [Neural network](/machine-learning/glossary#neural-network)
| - [Recurrent neural network](/machine-learning/glossary#recurrent-neural-network)
| - [Token](/machine-learning/glossary#token)
| - [Trigram](/machine-learning/glossary#trigram)
| - [Vanishing gradient problem](/machine-learning/glossary#vanishing-gradient-problem)