Last updated (UTC): 2025-07-27.

**Key points**

- Convolutional neural networks (CNNs) use convolutions to automatically learn and extract image features, eliminating the need for manual feature engineering.
- A CNN comprises convolutional, ReLU, and pooling layers for feature extraction, followed by fully connected layers for classification.
- During convolution, filters slide across the input image, extracting features such as edges and textures that are then used for image classification.
- ReLU introduces nonlinearity into the model, while pooling downsamples the feature maps to reduce computational cost while retaining the most important information.
- The final fully connected layers classify the image based on the extracted features, typically using a softmax activation function to output a probability for each candidate label.

Introducing Convolutional Neural Networks

A breakthrough in building models for image classification came with the discovery that a [convolutional neural network](https://wikipedia.org/wiki/Convolutional_neural_network) (CNN) could be used to progressively extract higher- and higher-level representations of the image content. Instead of preprocessing the data to derive features like textures and shapes, a CNN takes just the image's raw pixel data as input and "learns" how to extract these features, ultimately inferring what object they constitute.

To start, the CNN receives an input feature map: a three-dimensional matrix where the size of the first two dimensions corresponds to the length and width of the image in pixels. The size of the third dimension is 3, corresponding to the three channels of a color image: red, green, and blue. The CNN comprises a stack of modules, each of which performs three operations.
1. Convolution

A *convolution* extracts tiles of the input feature map and applies filters to them to compute new features, producing an output feature map, or *convolved feature* (which may have a different size and depth than the input feature map). Convolutions are defined by two parameters:

- **Size of the tiles that are extracted** (typically 3x3 or 5x5 pixels).
- **The depth of the output feature map**, which corresponds to the number of filters that are applied.

During a convolution, the filters (matrices the same size as the tiles) effectively slide over the input feature map's grid horizontally and vertically, one pixel at a time, extracting each corresponding tile (see Figure 3).

*Figure 3. A 3x3 convolution of depth 1 performed over a 5x5 input feature map, also of depth 1. There are nine possible 3x3 locations to extract tiles from the 5x5 feature map, so this convolution produces a 3x3 output feature map.*

> In Figure 3, the output feature map (3x3) is smaller than the input feature map (5x5). If you instead want the output feature map to have the same dimensions as the input feature map, you can add *padding* (blank rows/columns with all-zero values) to each side of the input feature map, producing a 7x7 matrix with 5x5 possible locations to extract a 3x3 tile.

For each filter-tile pair, the CNN performs element-wise multiplication of the filter matrix and the tile matrix, and then sums all the elements of the resulting matrix to get a single value. Each of these resulting values for every filter-tile pair is then output in the *convolved feature* matrix (see Figures 4a and 4b).

*Figure 4a. **Left**: A 5x5 input feature map (depth 1). **Right**: a 3x3 convolution (depth 1).*

*Figure 4b. **Left**: The 3x3 convolution is performed on the 5x5 input feature map. **Right**: the resulting convolved feature.*
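The slide-multiply-sum procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration of a single-filter, depth-1, unpadded ("valid") convolution; the input values and plus-shaped filter are made up for the example, not taken from the figures.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` one pixel at a time (no padding):
    for each tile, element-wise multiply with the kernel and sum."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            tile = image[r:r + kh, c:c + kw]       # extract one tile
            out[r, c] = np.sum(tile * kernel)      # multiply and sum
    return out

# Illustrative 5x5 input feature map (depth 1).
image = np.array([
    [3, 5, 2, 8, 1],
    [9, 7, 5, 4, 3],
    [2, 0, 6, 1, 6],
    [6, 3, 7, 9, 8],
    [1, 4, 9, 5, 1],
], dtype=float)

# Illustrative 3x3 filter (a simple "plus-shaped" pattern).
kernel = np.array([
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
], dtype=float)

print(conv2d_valid(image, kernel).shape)  # (3, 3): nine possible tile locations
```

Note that the output is 3x3, matching the nine possible tile locations counted in Figure 3; in a trained CNN, the kernel values would be learned rather than hand-written.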
During training, the CNN "learns" the optimal values for the filter matrices that enable it to extract meaningful features (textures, edges, shapes) from the input feature map. As the number of filters (the output feature map depth) applied to the input increases, so does the number of features the CNN can extract. However, the tradeoff is that filters account for the majority of the resources the CNN expends, so training time also increases as more filters are added. Additionally, each filter added to the network provides less incremental value than the previous one, so engineers aim to construct networks that use the minimum number of filters needed to extract the features necessary for accurate image classification.

2. ReLU

Following each convolution operation, the CNN applies a Rectified Linear Unit (ReLU) transformation to the convolved feature, in order to introduce nonlinearity into the model. The ReLU function, \(F(x) = \max(0, x)\), returns *x* for all values of *x* > 0, and returns 0 for all values of *x* ≤ 0.

> ReLU is used as an activation function in a variety of neural networks; for more background, see [Introduction to Neural Networks](https://developers.google.com/machine-learning/crash-course/introduction-to-neural-networks/) in [Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course/).

3. Pooling

After ReLU comes a pooling step, in which the CNN downsamples the convolved feature (to save on processing time), reducing the number of dimensions of the feature map while still preserving the most critical feature information. A common algorithm used for this process is called [max pooling](https://wikipedia.org/wiki/Convolutional_neural_network#Pooling_layer).

Max pooling operates in a similar fashion to convolution: the filter slides over the feature map and extracts tiles of a specified size.
For each tile, the maximum value is output to a new feature map, and all other values are discarded. Max pooling operations take two parameters:

- **Size** of the max-pooling filter (typically 2x2 pixels).
- **Stride**: the distance, in pixels, separating each extracted tile. Unlike with convolution, where filters slide over the feature map pixel by pixel, in max pooling the stride determines the locations where each tile is extracted. For a 2x2 filter, a stride of 2 specifies that the max pooling operation will extract all nonoverlapping 2x2 tiles from the feature map (see Figure 5).

*Figure 5. **Left**: Max pooling performed over a 4x4 feature map with a 2x2 filter and stride of 2. **Right**: the output of the max pooling operation. Note that the resulting feature map is now 2x2, preserving only the maximum value from each tile.*

Fully Connected Layers

At the end of a convolutional neural network are one or more fully connected layers (when two layers are "fully connected," every node in the first layer is connected to every node in the second layer). Their job is to perform classification based on the features extracted by the convolutions. Typically, the final fully connected layer contains a softmax activation function, which outputs a probability value from 0 to 1 for each of the classification labels the model is trying to predict.

> For more on softmax and multi-class classification, see [Multi-Class Neural Networks](https://developers.google.com/machine-learning/crash-course/neural-networks/multi-class) in [Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course/).

Figure 6 illustrates the end-to-end structure of a convolutional neural network.

*Figure 6. The CNN shown here contains two convolution modules (convolution + ReLU + pooling) for feature extraction, and two fully connected layers for classification.
Other CNNs may contain a larger or smaller number of convolutional modules, and more or fewer fully connected layers. Engineers often experiment to find the configuration that produces the best results for their model.*

**Key Terms**

- [convolutional filter](/machine-learning/glossary#convolutional_filter)
- [convolutional neural network](/machine-learning/glossary#convolutional_neural_network)
- [convolutional operation](/machine-learning/glossary#convolutional_operation)
- [pooling](/machine-learning/glossary#pooling)
- [ReLU](/machine-learning/glossary#ReLU)
- [stride](/machine-learning/glossary#stride)
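The remaining per-module operations can also be sketched in NumPy: ReLU as an element-wise maximum, max pooling with a 2x2 filter and stride 2, and a softmax turning the final layer's scores into label probabilities. This is a minimal illustration of the individual operations, not a full CNN; all values are made up for the example.

```python
import numpy as np

def relu(x):
    # F(x) = max(0, x), applied element-wise.
    return np.maximum(0, x)

def max_pool(fmap, size=2, stride=2):
    """Extract `size`x`size` tiles spaced `stride` pixels apart and keep
    only the maximum value of each tile (nonoverlapping when stride == size)."""
    h, w = fmap.shape
    out = np.zeros((h // stride, w // stride))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            tile = fmap[r * stride : r * stride + size,
                        c * stride : c * stride + size]
            out[r, c] = tile.max()
    return out

def softmax(logits):
    # Convert raw scores into probabilities that sum to 1.
    exps = np.exp(logits - logits.max())  # shift for numerical stability
    return exps / exps.sum()

# Illustrative 4x4 convolved feature, as in Figure 5.
fmap = np.array([
    [ 1.,  3., 2., 1.],
    [ 4.,  8., 6., 3.],
    [ 3., -1., 1., 0.],
    [-2.,  3., 4., 2.],
])
pooled = max_pool(relu(fmap))
print(pooled)       # 2x2 map of per-tile maxima: [[8. 6.] [3. 4.]]

probs = softmax(np.array([2.0, 1.0, 0.1]))  # e.g. scores for 3 labels
print(probs.sum())  # probabilities sum to 1.0
```

As in Figure 5, pooling shrinks the 4x4 map to 2x2 while keeping each tile's maximum; the softmax output is what the final fully connected layer would emit, one probability per classification label.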