Serve an LLM with multiple GPUs in GKE


This tutorial demonstrates how to deploy and serve a large language model (LLM) by using multiple GPUs on GKE for efficient, scalable inference. You create a GKE cluster that uses multiple L4 GPUs and prepare the infrastructure to serve any of the following models:

The number of required GPUs varies depending on the model's data format. In this tutorial, each model uses two L4 GPUs. To learn more, see Calculate the number of GPUs.

This tutorial is intended for machine learning (ML) engineers, platform administrators and operators, and data and AI specialists who are interested in using Kubernetes container orchestration capabilities to serve LLMs. To learn more about the common roles and example tasks that we reference in Google Cloud content, see Common GKE Enterprise user roles and tasks.

Before reading this page, ensure that you're familiar with the following:

Objectives

In this tutorial, you do the following:

  1. Create a cluster and a node pool.
  2. Prepare your workload.
  3. Deploy your workload.
  4. Interact with the LLM through a web interface.

Before you begin

Before you start, make sure that you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
  • Some models have additional requirements. Make sure that you meet the following requirements:

Prepare your environment

  1. In the Google Cloud console, start a Cloud Shell instance:
    Open Cloud Shell

  2. Set the default environment variables:

    gcloud config set project PROJECT_ID
    gcloud config set billing/quota_project PROJECT_ID
    export PROJECT_ID=$(gcloud config get project)
    export CONTROL_PLANE_LOCATION=us-central1

    Replace PROJECT_ID with your Google Cloud project ID.

Create a GKE cluster and node pool

You can serve LLMs on GPUs in a GKE Autopilot or Standard cluster. We recommend that you use an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that's the best fit for your workloads, see Choose a GKE mode of operation.

Autopilot

  1. In Cloud Shell, run the following command:

    gcloud container clusters create-auto l4-demo \
      --project=${PROJECT_ID} \
      --location=${CONTROL_PLANE_LOCATION} \
      --release-channel=rapid

    GKE creates an Autopilot cluster with CPU and GPU nodes as requested by the deployed workloads.

  2. Configure kubectl to communicate with your cluster:

    gcloud container clusters get-credentials l4-demo --location=${CONTROL_PLANE_LOCATION} 

Standard

  1. In Cloud Shell, run the following command to create a Standard cluster that uses Workload Identity Federation for GKE:

    gcloud container clusters create l4-demo \
      --location ${CONTROL_PLANE_LOCATION} \
      --workload-pool ${PROJECT_ID}.svc.id.goog \
      --enable-image-streaming \
      --node-locations=${CONTROL_PLANE_LOCATION}-a \
      --machine-type n2d-standard-4 \
      --num-nodes 1 --min-nodes 1 --max-nodes 5 \
      --release-channel=rapid

    Creating the cluster might take several minutes.

  2. Run the following command to create a node pool for your cluster:

    gcloud container node-pools create g2-standard-24 --cluster l4-demo \
      --location ${CONTROL_PLANE_LOCATION} \
      --accelerator type=nvidia-l4,count=2,gpu-driver-version=latest \
      --machine-type g2-standard-24 \
      --enable-autoscaling --enable-image-streaming \
      --num-nodes=0 --min-nodes=0 --max-nodes=3 \
      --node-locations ${CONTROL_PLANE_LOCATION}-a,${CONTROL_PLANE_LOCATION}-c \
      --spot

    GKE creates the following resources for the LLM:

    • A public Google Kubernetes Engine (GKE) Standard edition cluster.
    • A node pool with the g2-standard-24 machine type, scaled down to zero nodes. You aren't charged for any GPUs until you launch Pods that request GPUs. This node pool provisions Spot VMs, which are priced lower than the default standard Compute Engine VMs and provide no guarantee of availability. You can remove the --spot flag from this command, and the cloud.google.com/gke-spot node selector in the text-generation-inference.yaml config, to use on-demand VMs.
  3. Configure kubectl to communicate with your cluster:

    gcloud container clusters get-credentials l4-demo --location=${CONTROL_PLANE_LOCATION} 

Prepare your workload

This section shows you how to set up your workload depending on the model that you want to use. This tutorial uses a Kubernetes Deployment to deploy the model. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster.

Llama 3 70b

  1. Set the default environment variables:

    export HF_TOKEN=HUGGING_FACE_TOKEN 

    Replace HUGGING_FACE_TOKEN with your HuggingFace token.

  2. Create a Kubernetes Secret for the HuggingFace token:

    kubectl create secret generic l4-demo \
        --from-literal=HUGGING_FACE_TOKEN=${HF_TOKEN} \
        --dry-run=client -o yaml | kubectl apply -f -
  3. Create the following text-generation-inference.yaml Deployment manifest:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: llm
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: llm
      template:
        metadata:
          labels:
            app: llm
        spec:
          containers:
          - name: llm
            image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-1.ubuntu2204.py310
            resources:
              requests:
                cpu: "10"
                memory: "60Gi"
                nvidia.com/gpu: "2"
              limits:
                cpu: "10"
                memory: "60Gi"
                nvidia.com/gpu: "2"
            env:
            - name: MODEL_ID
              value: meta-llama/Meta-Llama-3-70B-Instruct
            - name: NUM_SHARD
              value: "2"
            - name: MAX_INPUT_TOKENS
              value: "2048"
            - name: PORT
              value: "8080"
            - name: QUANTIZE
              value: bitsandbytes-nf4
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: l4-demo
                  key: HUGGING_FACE_TOKEN
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              # mountPath is set to /tmp as it's the path where the HUGGINGFACE_HUB_CACHE environment
              # variable in the TGI DLCs is set to instead of the default /data set within the TGI default image.
              # i.e. where the downloaded model from the Hub will be stored
              - mountPath: /tmp
                name: ephemeral-volume
          volumes:
            - name: dshm
              emptyDir:
                  medium: Memory
            - name: ephemeral-volume
              ephemeral:
                volumeClaimTemplate:
                  metadata:
                    labels:
                      type: ephemeral
                  spec:
                    accessModes: ["ReadWriteOnce"]
                    storageClassName: "premium-rwo"
                    resources:
                      requests:
                        storage: 150Gi
          nodeSelector:
            cloud.google.com/gke-accelerator: "nvidia-l4"
            cloud.google.com/gke-spot: "true"

    In this manifest:

    • NUM_SHARD must be 2 because the model requires two NVIDIA L4 GPUs.
    • QUANTIZE is set to bitsandbytes-nf4, which means that the model is loaded in 4 bits instead of 32 bits. This lets GKE reduce the amount of GPU memory that's needed and improves the inference speed. However, the model accuracy can decrease. To learn how to calculate the number of GPUs to request, see Calculate the number of GPUs.
  4. Apply the manifest:

    kubectl apply -f text-generation-inference.yaml 

    The output is similar to the following:

    deployment.apps/llm created 
  5. Verify the status of the model:

    kubectl get deploy 

    The output is similar to the following:

    NAME          READY   UP-TO-DATE   AVAILABLE   AGE
    llm           1/1     1            1           20m
  6. View the logs from the running Deployment:

    kubectl logs -l app=llm 

    The output is similar to the following:

    {"timestamp":"2024-03-09T05:08:14.751646Z","level":"INFO","message":"Warming up model","target":"text_generation_router","filename":"router/src/main.rs","line_number":291}
    {"timestamp":"2024-03-09T05:08:19.961136Z","level":"INFO","message":"Setting max batch total tokens to 133696","target":"text_generation_router","filename":"router/src/main.rs","line_number":328}
    {"timestamp":"2024-03-09T05:08:19.961164Z","level":"INFO","message":"Connected","target":"text_generation_router","filename":"router/src/main.rs","line_number":329}
    {"timestamp":"2024-03-09T05:08:19.961171Z","level":"WARN","message":"Invalid hostname, defaulting to 0.0.0.0","target":"text_generation_router","filename":"router/src/main.rs","line_number":343}

Mixtral 8x7b

  1. Set the default environment variables:

    export HF_TOKEN=HUGGING_FACE_TOKEN 

    Replace HUGGING_FACE_TOKEN with your HuggingFace token.

  2. Create a Kubernetes Secret for the HuggingFace token:

    kubectl create secret generic l4-demo \
        --from-literal=HUGGING_FACE_TOKEN=${HF_TOKEN} \
        --dry-run=client -o yaml | kubectl apply -f -
  3. Create the following text-generation-inference.yaml Deployment manifest:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: llm
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: llm
      template:
        metadata:
          labels:
            app: llm
        spec:
          containers:
          - name: llm
            image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311
            resources:
              requests:
                cpu: "5"
                memory: "40Gi"
                nvidia.com/gpu: "2"
              limits:
                cpu: "5"
                memory: "40Gi"
                nvidia.com/gpu: "2"
            env:
            - name: MODEL_ID
              value: mistralai/Mixtral-8x7B-Instruct-v0.1
            - name: NUM_SHARD
              value: "2"
            - name: PORT
              value: "8080"
            - name: QUANTIZE
              value: bitsandbytes-nf4
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: l4-demo
                  key: HUGGING_FACE_TOKEN
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              # mountPath is set to /tmp as it's the path where the HF_HOME environment
              # variable in the TGI DLCs is set to instead of the default /data set within the TGI default image.
              # i.e. where the downloaded model from the Hub will be stored
              - mountPath: /tmp
                name: ephemeral-volume
          volumes:
            - name: dshm
              emptyDir:
                  medium: Memory
            - name: ephemeral-volume
              ephemeral:
                volumeClaimTemplate:
                  metadata:
                    labels:
                      type: ephemeral
                  spec:
                    accessModes: ["ReadWriteOnce"]
                    storageClassName: "premium-rwo"
                    resources:
                      requests:
                        storage: 100Gi
          nodeSelector:
            cloud.google.com/gke-accelerator: "nvidia-l4"
            cloud.google.com/gke-spot: "true"

    In this manifest:

    • NUM_SHARD must be 2 because the model requires two NVIDIA L4 GPUs.
    • QUANTIZE is set to bitsandbytes-nf4, which means that the model is loaded in 4 bits instead of 32 bits. This lets GKE reduce the amount of GPU memory that's needed and improves the inference speed. However, the model accuracy can decrease. To learn how to calculate the number of GPUs to request, see Calculate the number of GPUs.
  4. Apply the manifest:

    kubectl apply -f text-generation-inference.yaml 

    The output is similar to the following:

    deployment.apps/llm created 
  5. Verify the status of the model:

    watch kubectl get deploy 

    When the Deployment is ready, the output is similar to the following:

    NAME          READY   UP-TO-DATE   AVAILABLE   AGE
    llm           1/1     1            1           10m

    To exit the watch, press CTRL + C.

  6. View the logs from the running Deployment:

    kubectl logs -l app=llm 

    The output is similar to the following:

    {"timestamp":"2024-03-09T05:08:14.751646Z","level":"INFO","message":"Warming up model","target":"text_generation_router","filename":"router/src/main.rs","line_number":291}
    {"timestamp":"2024-03-09T05:08:19.961136Z","level":"INFO","message":"Setting max batch total tokens to 133696","target":"text_generation_router","filename":"router/src/main.rs","line_number":328}
    {"timestamp":"2024-03-09T05:08:19.961164Z","level":"INFO","message":"Connected","target":"text_generation_router","filename":"router/src/main.rs","line_number":329}
    {"timestamp":"2024-03-09T05:08:19.961171Z","level":"WARN","message":"Invalid hostname, defaulting to 0.0.0.0","target":"text_generation_router","filename":"router/src/main.rs","line_number":343}

Falcon 40b

  1. Create the following text-generation-inference.yaml Deployment manifest:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: llm
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: llm
      template:
        metadata:
          labels:
            app: llm
        spec:
          containers:
          - name: llm
            image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.1-4.ubuntu2204.py310
            resources:
              requests:
                cpu: "10"
                memory: "60Gi"
                nvidia.com/gpu: "2"
              limits:
                cpu: "10"
                memory: "60Gi"
                nvidia.com/gpu: "2"
            env:
            - name: MODEL_ID
              value: tiiuae/falcon-40b-instruct
            - name: NUM_SHARD
              value: "2"
            - name: PORT
              value: "8080"
            - name: QUANTIZE
              value: bitsandbytes-nf4
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              # mountPath is set to /data as it's the path where the HUGGINGFACE_HUB_CACHE environment
              # variable points to in the TGI container image i.e. where the downloaded model from the Hub will be
              # stored
              - mountPath: /data
                name: ephemeral-volume
          volumes:
            - name: dshm
              emptyDir:
                  medium: Memory
            - name: ephemeral-volume
              ephemeral:
                volumeClaimTemplate:
                  metadata:
                    labels:
                      type: ephemeral
                  spec:
                    accessModes: ["ReadWriteOnce"]
                    storageClassName: "premium-rwo"
                    resources:
                      requests:
                        storage: 175Gi
          nodeSelector:
            cloud.google.com/gke-accelerator: "nvidia-l4"
            cloud.google.com/gke-spot: "true"

    In this manifest:

    • NUM_SHARD must be 2 because the model requires two NVIDIA L4 GPUs.
    • QUANTIZE is set to bitsandbytes-nf4, which means that the model is loaded in 4 bits instead of 32 bits. This lets GKE reduce the amount of GPU memory that's needed and improves the inference speed. However, the model accuracy can decrease. To learn how to calculate the number of GPUs to request, see Calculate the number of GPUs.
  2. Apply the manifest:

    kubectl apply -f text-generation-inference.yaml 

    The output is similar to the following:

    deployment.apps/llm created 
  3. Verify the status of the model:

    watch kubectl get deploy 

    When the Deployment is ready, the output is similar to the following:

    NAME          READY   UP-TO-DATE   AVAILABLE   AGE
    llm           1/1     1            1           10m

    To exit the watch, press CTRL + C.

  4. View the logs from the running Deployment:

    kubectl logs -l app=llm 

    The output is similar to the following:

    {"timestamp":"2024-03-09T05:08:14.751646Z","level":"INFO","message":"Warming up model","target":"text_generation_router","filename":"router/src/main.rs","line_number":291}
    {"timestamp":"2024-03-09T05:08:19.961136Z","level":"INFO","message":"Setting max batch total tokens to 133696","target":"text_generation_router","filename":"router/src/main.rs","line_number":328}
    {"timestamp":"2024-03-09T05:08:19.961164Z","level":"INFO","message":"Connected","target":"text_generation_router","filename":"router/src/main.rs","line_number":329}
    {"timestamp":"2024-03-09T05:08:19.961171Z","level":"WARN","message":"Invalid hostname, defaulting to 0.0.0.0","target":"text_generation_router","filename":"router/src/main.rs","line_number":343}

Create a Service of type ClusterIP

Expose your Pods internally within the cluster so that other applications can discover and access them.

  1. Create the following llm-service.yaml manifest:

    apiVersion: v1
    kind: Service
    metadata:
      name: llm-service
    spec:
      selector:
        app: llm
      type: ClusterIP
      ports:
        - protocol: TCP
          port: 80
          targetPort: 8080
  2. Apply the manifest:

    kubectl apply -f llm-service.yaml 
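
Optionally, you can sanity-check the Service before deploying the chat interface. The following sketch assumes that you run kubectl port-forward service/llm-service 8080:80 in a separate terminal and that the server exposes the standard TGI /generate API (the same path that the Gradio app later uses through CONTEXT_PATH); the prompt text and parameters are illustrative only:

    # Optional sanity check against the TGI server behind llm-service (a sketch, not part of the tutorial).
    # Run "kubectl port-forward service/llm-service 8080:80" in another terminal first.
    import requests

    payload = {
        "inputs": "What is Kubernetes?",        # example prompt
        "parameters": {"max_new_tokens": 64},   # keep the response short for a quick test
    }
    response = requests.post("http://localhost:8080/generate", json=payload, timeout=300)
    response.raise_for_status()
    print(response.json()["generated_text"])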

Deploy the chat interface

Use Gradio to build a web application that lets you interact with your model. Gradio is a Python library that has a ChatInterface wrapper for creating chatbot user interfaces.
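
The tutorial deploys a prebuilt gradio-app container image, so you don't write any Gradio code yourself. For orientation only, the following is a minimal sketch of how such a ChatInterface wrapper could forward prompts to the llm-service endpoint; the prompt handling and generation parameters are assumptions, not the contents of the actual image:

    # Minimal Gradio sketch (illustrative only; the tutorial deploys the prebuilt gradio-app image).
    import os

    import gradio as gr
    import requests

    HOST = os.environ.get("HOST", "http://llm-service")
    CONTEXT_PATH = os.environ.get("CONTEXT_PATH", "/generate")
    # Model-specific prompt template; "prompt" is the placeholder replaced by the user message.
    USER_PROMPT = os.environ.get("USER_PROMPT", "prompt")

    def chat(message, history):
        # Wrap the user message in the prompt template and call the TGI /generate API.
        payload = {
            "inputs": USER_PROMPT.replace("prompt", message),
            "parameters": {"max_new_tokens": 256},
        }
        resp = requests.post(f"{HOST}{CONTEXT_PATH}", json=payload, timeout=300)
        resp.raise_for_status()
        return resp.json()["generated_text"]

    # gr.ChatInterface renders a chatbot UI around chat() on port 7860,
    # matching the containerPort in the gradio.yaml manifests.
    gr.ChatInterface(chat).launch(server_name="0.0.0.0", server_port=7860)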

Llama 3 70b

  1. Create a file named gradio.yaml:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: gradio
      labels:
        app: gradio
    spec:
      strategy:
        type: Recreate
      replicas: 1
      selector:
        matchLabels:
          app: gradio
      template:
        metadata:
          labels:
            app: gradio
        spec:
          containers:
          - name: gradio
            image: us-docker.pkg.dev/google-samples/containers/gke/gradio-app:v1.0.4
            resources:
              requests:
                cpu: "512m"
                memory: "512Mi"
              limits:
                cpu: "1"
                memory: "512Mi"
            env:
            - name: CONTEXT_PATH
              value: "/generate"
            - name: HOST
              value: "http://llm-service"
            - name: LLM_ENGINE
              value: "tgi"
            - name: MODEL_ID
              value: "meta-llama/Meta-Llama-3-70B-Instruct"
            - name: USER_PROMPT
              value: "<|begin_of_text|><|start_header_id|>user<|end_header_id|> prompt <|eot_id|><|start_header_id|>assistant<|end_header_id|>"
            - name: SYSTEM_PROMPT
              value: "prompt <|eot_id|>"
            ports:
            - containerPort: 7860
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: gradio-service
    spec:
      type: LoadBalancer
      selector:
        app: gradio
      ports:
      - port: 80
        targetPort: 7860
  2. Apply the manifest:

    kubectl apply -f gradio.yaml 
  3. Find the external IP address of the Service:

    kubectl get svc 

    The output is similar to the following:

    NAME             TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)        AGE
    gradio-service   LoadBalancer   10.24.29.197   34.172.115.35   80:30952/TCP   125m
  4. Copy the external IP address from the EXTERNAL-IP column.

  5. View the model interface from your web browser by using the external IP address with the exposed port:

    http://EXTERNAL_IP 

Mixtral 8x7b

  1. Create a file named gradio.yaml:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: gradio
      labels:
        app: gradio
    spec:
      strategy:
        type: Recreate
      replicas: 1
      selector:
        matchLabels:
          app: gradio
      template:
        metadata:
          labels:
            app: gradio
        spec:
          containers:
          - name: gradio
            image: us-docker.pkg.dev/google-samples/containers/gke/gradio-app:v1.0.4
            resources:
              requests:
                cpu: "512m"
                memory: "512Mi"
              limits:
                cpu: "1"
                memory: "512Mi"
            env:
            - name: CONTEXT_PATH
              value: "/generate"
            - name: HOST
              value: "http://llm-service"
            - name: LLM_ENGINE
              value: "tgi"
            - name: MODEL_ID
              value: "mixtral-8x7b"
            - name: USER_PROMPT
              value: "[INST] prompt [/INST]"
            - name: SYSTEM_PROMPT
              value: "prompt"
            ports:
            - containerPort: 7860
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: gradio-service
    spec:
      type: LoadBalancer
      selector:
        app: gradio
      ports:
      - port: 80
        targetPort: 7860
  2. Apply the manifest:

    kubectl apply -f gradio.yaml 
  3. Find the external IP address of the Service:

    kubectl get svc 

    The output is similar to the following:

    NAME             TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)        AGE
    gradio-service   LoadBalancer   10.24.29.197   34.172.115.35   80:30952/TCP   125m
  4. Copy the external IP address from the EXTERNAL-IP column.

  5. View the model interface from your web browser by using the external IP address with the exposed port:

    http://EXTERNAL_IP 

Falcon 40b

  1. Create a file named gradio.yaml:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: gradio
      labels:
        app: gradio
    spec:
      strategy:
        type: Recreate
      replicas: 1
      selector:
        matchLabels:
          app: gradio
      template:
        metadata:
          labels:
            app: gradio
        spec:
          containers:
          - name: gradio
            image: us-docker.pkg.dev/google-samples/containers/gke/gradio-app:v1.0.4
            resources:
              requests:
                cpu: "512m"
                memory: "512Mi"
              limits:
                cpu: "1"
                memory: "512Mi"
            env:
            - name: CONTEXT_PATH
              value: "/generate"
            - name: HOST
              value: "http://llm-service"
            - name: LLM_ENGINE
              value: "tgi"
            - name: MODEL_ID
              value: "falcon-40b-instruct"
            - name: USER_PROMPT
              value: "User: prompt"
            - name: SYSTEM_PROMPT
              value: "Assistant: prompt"
            ports:
            - containerPort: 7860
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: gradio-service
    spec:
      type: LoadBalancer
      selector:
        app: gradio
      ports:
      - port: 80
        targetPort: 7860
  2. Apply the manifest:

    kubectl apply -f gradio.yaml 
  3. Find the external IP address of the Service:

    kubectl get svc 

    The output is similar to the following:

    NAME             TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)        AGE
    gradio-service   LoadBalancer   10.24.29.197   34.172.115.35   80:30952/TCP   125m
  4. Copy the external IP address from the EXTERNAL-IP column.

  5. View the model interface from your web browser by using the external IP address with the exposed port:

    http://EXTERNAL_IP 

Calculate the number of GPUs

The number of GPUs depends on the value of the QUANTIZE flag. In this tutorial, QUANTIZE is set to bitsandbytes-nf4, which means that the model is loaded in 4 bits.

A 70-billion-parameter model would require a minimum of 40 GB of GPU memory, which equals 70 billion parameters times 4 bits (70 billion × 4 bits = 35 GB) plus 5 GB of overhead. In this case, a single L4 GPU wouldn't have enough memory. Therefore, the examples in this tutorial use two L4 GPUs of memory (2 × 24 GB = 48 GB). This configuration is sufficient for running Falcon 40b or Llama 3 70b on L4 GPUs.
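
The same arithmetic, written out as a quick back-of-the-envelope check (a rough sketch only; actual memory use also depends on the KV cache, batch size, and runtime overhead):

    # Back-of-the-envelope GPU memory estimate for a 4-bit (nf4) 70B model.
    params = 70e9            # 70 billion parameters
    bits_per_param = 4       # bitsandbytes-nf4 loads weights in 4 bits
    overhead_gb = 5          # approximate overhead assumed in this tutorial

    weights_gb = params * bits_per_param / 8 / 1e9   # 35.0 GB of weights
    required_gb = weights_gb + overhead_gb           # 40.0 GB total

    l4_memory_gb = 24                                # one NVIDIA L4 provides 24 GB
    gpus_needed = -(-required_gb // l4_memory_gb)    # ceiling division -> 2
    print(f"{weights_gb:.0f} GB weights + {overhead_gb} GB overhead = {required_gb:.0f} GB -> {int(gpus_needed)} x L4")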

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the cluster

To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, delete the GKE cluster:

gcloud container clusters delete l4-demo --location ${CONTROL_PLANE_LOCATION} 

What's next