Apache Beam RunInference for scikit-learn

This notebook demonstrates the use of the RunInference transform for scikit-learn, also called sklearn. Apache Beam RunInference has implementations of the ModelHandler class prebuilt for scikit-learn. For more information about using RunInference, see Get started with AI/ML pipelines in the Apache Beam documentation.

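You can choose the appropriate model handler based on your input data type:

NumPy model handler (SklearnModelHandlerNumpy)
Pandas DataFrame model handler (SklearnModelHandlerPandas)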

With RunInference, these model handlers manage batching, vectorization, and prediction optimization for your scikit-learn pipeline or model.
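As a rough illustration of what batching involves, the sketch below groups individual examples into batches, calls a predict function once per batch, and then pairs each example with its prediction again. It is a conceptual stand-in in plain Python, not Beam's implementation: `batched`, `fake_predict`, and `run_batched_inference` are hypothetical names, and the fixed batch size is an assumption made for illustration.

```python
from typing import Callable, Iterable, Iterator, List, Sequence, Tuple


def batched(items: Iterable[float], batch_size: int) -> Iterator[List[float]]:
    """Yield consecutive batches of at most batch_size items."""
    batch: List[float] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


def fake_predict(batch: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for a model's predict call: applies y = 5x."""
    return [5.0 * x for x in batch]


def run_batched_inference(
    examples: Iterable[float],
    predict: Callable[[Sequence[float]], List[float]],
    batch_size: int = 2,
) -> List[Tuple[float, float]]:
    """Call predict once per batch and pair each example with its prediction."""
    results: List[Tuple[float, float]] = []
    for batch in batched(examples, batch_size):
        predictions = predict(batch)  # One call per batch, not per element.
        results.extend(zip(batch, predictions))
    return results


pairs = run_batched_inference([105.0, 108.0, 1000.0], fake_predict)
print(pairs)
```

The point of the pattern is that the model is invoked once per batch rather than once per element, while downstream steps still see one result per input.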

This notebook demonstrates the following common RunInference patterns:

Generate predictions.
Postprocess results after RunInference.
Run inference with multiple models in the same pipeline.

The linear regression models used in these samples are trained on data that corresponds to the 5 and 10 times tables; that is, y = 5x and y = 10x, respectively.
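To see why a linear regression recovers these times tables, the following sketch fits a line by closed-form ordinary least squares in plain Python. It is not the scikit-learn implementation; `fit_line` is a hypothetical helper, and the input range 0 to 99 mirrors the training data used later in this notebook.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [float(i) for i in range(100)]
slope_5, intercept_5 = fit_line(xs, [5 * x for x in xs])
slope_10, intercept_10 = fit_line(xs, [10 * x for x in xs])
print(slope_5, intercept_5)
print(slope_10, intercept_10)
```

Because the training data lies exactly on a line through the origin, the fitted slope matches the multiplier and the intercept is zero up to floating-point error.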

Before you begin

Complete the following setup steps:

Install dependencies for Apache Beam.
Authenticate with Google Cloud.
Specify your project and bucket. You use the project and bucket to save and load models.

pip install google-api-core --quiet
pip install google-cloud-pubsub google-cloud-bigquery-storage --quiet
pip install apache-beam[gcp,dataframe] --quiet

About scikit-learn versions

scikit-learn is a build dependency of Apache Beam. If you need to install a different version of sklearn, use %pip install scikit-learn==<version>.

from google.colab import auth
auth.authenticate_user()

import pickle
from sklearn import linear_model
from typing import Tuple

import numpy as np
import apache_beam as beam

from apache_beam.ml.inference.sklearn_inference import ModelFileType
from apache_beam.ml.inference.sklearn_inference import SklearnModelHandlerNumpy
from apache_beam.ml.inference.base import KeyedModelHandler
from apache_beam.ml.inference.base import PredictionResult
from apache_beam.ml.inference.base import RunInference
from apache_beam.options.pipeline_options import PipelineOptions

# NOTE: If an error occurs, restart your runtime.

import os

# Constants
project = "<PROJECT_ID>" # @param {type:'string'}
bucket = "<BUCKET_NAME>" # @param {type:'string'}

# To avoid warnings, set the project.
os.environ['GOOGLE_CLOUD_PROJECT'] = project

Create the data and the scikit-learn model

This section demonstrates the following steps:

Create the data to train the scikit-learn linear regression model.
Train the linear regression model.
Save the scikit-learn model using pickle.

In this example, you create two models: one trained on the 5 times table and a second trained on the 10 times table.

# Input data to train the sklearn model for the 5 times table.
x = np.arange(0, 100, dtype=np.float32).reshape(-1, 1)
y = (x * 5).reshape(-1, 1)

def train_and_save_model(x, y, model_file_name):
  # Fit a linear regression and pickle the fitted model to a local file.
  regression = linear_model.LinearRegression()
  regression.fit(x, y)

  with open(model_file_name, 'wb') as f:
      pickle.dump(regression, f)

five_times_model_filename = 'sklearn_5x_model.pkl'
train_and_save_model(x, y, five_times_model_filename)

# Change y to be 10 times, and output a 10 times table.
ten_times_model_filename = 'sklearn_10x_model.pkl'
y = (x * 10).reshape(-1, 1)
train_and_save_model(x, y, ten_times_model_filename)
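Before handing a pickled file to a model handler, it can be worth loading it back and spot-checking one prediction. The sketch below shows that save-and-load round trip with a plain dictionary standing in for the fitted LinearRegression object, so it runs without scikit-learn. The names `model_stand_in`, `save_model`, and `load_model`, and the file name `stand_in_model.pkl`, are hypothetical.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model: only the learned coefficients.
model_stand_in = {"coef": 5.0, "intercept": 0.0}


def save_model(model, path):
    """Serialize the model object to path with pickle."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a pickled model object back from path."""
    with open(path, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "stand_in_model.pkl")
    save_model(model_stand_in, model_path)
    restored = load_model(model_path)

# Spot-check: the restored "model" should map 20 to 100 for the 5 times table.
prediction = restored["coef"] * 20.0 + restored["intercept"]
print(prediction)
```

With the real model, the equivalent check would unpickle the file and call predict on a small input array.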

Create a scikit-learn RunInference pipeline

This section demonstrates how to do the following:

Define a scikit-learn model handler that accepts an array_like object as input.
Read the data from BigQuery.
Use the scikit-learn trained model and the scikit-learn RunInference transform on unkeyed data.

%pip install --upgrade google-cloud-bigquery --quiet

!gcloud config set project $project

 Updated property [core/project].

# Populate the BigQuery table.

from google.cloud import bigquery

client = bigquery.Client(project=project)

# Make sure the dataset_id is unique in your project.
dataset_id = '{project}.maths'.format(project=project)
dataset = bigquery.Dataset(dataset_id)

# Modify the location based on your project configuration.
dataset.location = 'US'
dataset = client.create_dataset(dataset, exists_ok=True)

# Table name in the BigQuery dataset.
table_name = 'maths_problems_1'

query = """
    CREATE OR REPLACE TABLE
      {project}.maths.{table} (
        key STRING OPTIONS(description="A unique key for the maths problem"),
        value FLOAT64 OPTIONS(description="Our maths problem")
    );
    INSERT INTO maths.{table}
    VALUES
      ("first_example", 105.00),
      ("second_example", 108.00),
      ("third_example", 1000.00),
      ("fourth_example", 1013.00)
""".format(project=project, table=table_name)

create_job = client.query(query)
create_job.result()

 <google.cloud.bigquery.table._EmptyRowIterator at 0x7f97abb4e850>
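Note that this notebook refers to the same table in two notations: the SQL and the BigQuery client use the dotted form, while the Beam table specification uses a colon between project and dataset. The sketch below builds both strings side by side. The helper names and the `my-project` project ID are hypothetical placeholders.

```python
def sql_table_id(project: str, dataset: str, table: str) -> str:
    """Dotted form, as used in BigQuery SQL and by the BigQuery client."""
    return f"{project}.{dataset}.{table}"


def beam_table_spec(project: str, dataset: str, table: str) -> str:
    """Colon form, as used for a Beam BigQuery table specification."""
    return f"{project}:{dataset}.{table}"


# "my-project" is a placeholder; substitute your own project ID.
print(sql_table_id("my-project", "maths", "maths_problems_1"))
print(beam_table_spec("my-project", "maths", "maths_problems_1"))
```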

sklearn_model_handler = SklearnModelHandlerNumpy(model_uri=five_times_model_filename)

pipeline_options = PipelineOptions().from_dictionary(
    {'temp_location': f'gs://{bucket}/tmp'})

# Define the BigQuery table specification.
table_name = 'maths_problems_1'
table_spec = f'{project}:maths.{table_name}'

with beam.Pipeline(options=pipeline_options) as p:
  (
      p
      | "ReadFromBQ" >> beam.io.ReadFromBigQuery(table=table_spec)
      | "ExtractInputs" >> beam.Map(lambda x: [x['value']])
      | "RunInferenceSklearn" >> RunInference(model_handler=sklearn_model_handler)
      | beam.Map(print)
  )

 PredictionResult(example=[1000.0], inference=array([5000.]))
 PredictionResult(example=[1013.0], inference=array([5065.]))
 PredictionResult(example=[108.0], inference=array([540.]))
 PredictionResult(example=[105.0], inference=array([525.]))
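Each output element pairs the input example with the model's inference, so a typical postprocessing step unpacks the two fields into a plain value. The sketch below uses a hypothetical `FakePredictionResult` named tuple with the same two fields shown in the output above, and plain lists instead of NumPy arrays, so that it runs without Beam. `format_result` is likewise an illustrative name.

```python
from collections import namedtuple

# Hypothetical stand-in with the two fields visible in the printed output.
FakePredictionResult = namedtuple("FakePredictionResult", ["example", "inference"])


def format_result(result) -> str:
    """Unpack the single input value and single predicted value into text."""
    x = result.example[0]
    y = result.inference[0]
    return f"input={x:g}, output={y:g}"


results = [
    FakePredictionResult(example=[105.0], inference=[525.0]),
    FakePredictionResult(example=[1000.0], inference=[5000.0]),
]
formatted = [format_result(r) for r in results]
for line in formatted:
    print(line)
```

In a pipeline, a function of this shape could be applied with beam.Map after the RunInference step in place of printing the raw results.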

Use sklearn RunInference on keyed inputs