3 Ways for SQL Server R and Python Workloads (Machine Learning Services)

As you know, the latest SQL Server (and, coming soon, Azure SQL Database) includes Machine Learning Services, which lets you run machine learning workloads (R / Python) without transferring data across the network. That is, you can now run your ML workloads close to the data in SQL Server.

To take advantage of these in-database workloads, you can choose among several approaches (patterns). Here I summarize these approaches along with real scenarios, and point out the important things you should know about each one.

Here we use R with the well-known Iris classification sample.

Sample Table (Dataset)

In this post we use the following simple “IrisTbl” table in a SQL Server database. (You can download it from here.)
This dataset has 150 rows: 130 rows have a value in the “SpeciesPredicted” column (SpeciesPredicted is not NULL), and we use these 130 rows for training. The other 20 rows have no “SpeciesPredicted” value (SpeciesPredicted is NULL), and we use these 20 rows for scoring (prediction).
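
For reference, a table definition like the following would match this dataset. (This is my sketch, inferred from how the columns are used later in this post; the downloadable script may differ in exact types or sizes.)

-- Assumed schema for IrisTbl (a sketch; adjust to the actual download)
CREATE TABLE IrisTbl (
  ID INT,
  SepalLength FLOAT,
  SepalWidth FLOAT,
  PetalLength FLOAT,
  PetalWidth FLOAT,
  SpeciesLabeled VARCHAR(50),
  SpeciesPredicted VARCHAR(50)
);
GO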

Approach 1 : Scripting with Stored Procedures

The first approach is to use the sp_execute_external_script stored procedure, which enables you to invoke R or Python scripts running on SQL Server Machine Learning Services.
With this approach (and the 2nd approach), you can manage your ML workloads as database objects: for example, calling them from database triggers, wrapping them in database transactions, scheduling your stored procedures with SQL Server Agent, or combining them with SQL Server management utilities (Resource Governor, etc.).

Let's look at the following sample code.
Here we call an R script inside a SQL stored procedure, in which we train a model with the decision tree algorithm on the “IrisTbl” table data (the dataset above). Finally, this stored procedure returns the generated model binary (whose type must be VARBINARY(MAX)) as one of its output arguments.

CREATE PROCEDURE GenerateTreeModel (
  @model_name VARCHAR(50) OUTPUT,
  @model_data VARBINARY(MAX) OUTPUT)
AS
BEGIN
  EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
model <- rxDTree(
  SpeciesLabeled ~ SepalLength + SepalWidth + PetalLength + PetalWidth,
  data = IrisData,
  cp = 0.01)
modelname <- "Decision Tree"
modeldata <- as.raw(serialize(model, connection=NULL))
',
    @input_data_1 = N'SELECT * FROM IrisTbl WHERE SpeciesPredicted IS NOT NULL',
    @input_data_1_name = N'IrisData',
    @params = N'
      @r_rowsPerRead int,
      @modelname varchar(50) OUTPUT,
      @modeldata varbinary(max) OUTPUT',
    @r_rowsPerRead = 10000,
    @modelname = @model_name OUTPUT,
    @modeldata = @model_data OUTPUT;
END;
GO

As you can see, you can pass data between the SQL script and the R script using input and output parameters. (Here we use only a single input parameter.) When you don't define these parameter names, the defaults – “InputDataSet” for the input data frame and “OutputDataSet” for the output data frame – are used.
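
For example, a minimal call that relies on these default names looks like the following. (A sketch: it simply returns the first rows of the input query as the result set.)

EXEC sp_execute_external_script
  @language = N'R',
  @script = N'OutputDataSet <- head(InputDataSet)',
  @input_data_1 = N'SELECT * FROM IrisTbl';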

Now let's invoke this stored procedure (GenerateTreeModel).
The following code calls the stored procedure and saves the generated model into the “ModelTbl” table.

-- model table definition
CREATE TABLE ModelTbl ( 
  ModelName VARCHAR(50),
  ModelData VARBINARY(MAX) 
);
GO

DECLARE @model_name VARCHAR(50), @model_data VARBINARY(MAX);
-- Invoke GenerateTreeModel
EXEC GenerateTreeModel @model_name OUTPUT, @model_data OUTPUT;
-- Save model in table
IF EXISTS(SELECT * FROM ModelTbl where ModelName=@model_name)
  UPDATE ModelTbl SET ModelData=@model_data WHERE ModelName=@model_name
ELSE
  INSERT INTO ModelTbl(ModelName, ModelData) VALUES(@model_name, @model_data);

One important note: with SQL Server Enterprise Edition, data can be streamed and the workload scaled out with parallelization. (In contrast, this is not available in other editions, where the workload must fit in memory; if the data is too large, an allocation failure will occur.)

Note : The actual number of execution processes depends on several factors (server resources, etc.).

For this reason, it's better to use RevoScaleR or MicrosoftML functions (revoscalepy or microsoftml in Python), even though you can also use standard R functions in sp_execute_external_script. Even if your server has plenty of RAM, the default resource allocation for external scripts on SQL Server is limited (so that database workloads are not starved), and you may struggle to run large training tasks with standard R functions. (See here for how to increase the external resource pool size.) With RevoScaleR functions, data is read in chunks, the operation is distributed across multiple processes, and the results are aggregated. (See my earlier post “Benefits of Microsoft R and Machine Learning Server (R Server)” for the RevoScaleR package.)
In a later section, I'll show you another approach that uses the SQL Server compute context. If you use RevoScaleR functions, you can also use the SQL Server compute context instead of sp_execute_external_script.

Note : You can install additional packages using the RGui tool included in the SQL Server distribution. (See the official document “Install additional R packages on SQL Server” for details.)
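
For example, you can launch RGui.exe (run as administrator) from the instance's R_SERVICES folder and install packages as usual. (The path below is my assumption for a default SQL Server 2017 instance; adjust it for your environment.)

## Launch as administrator, e.g. :
##   C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\R_SERVICES\bin\x64\RGui.exe
## Then install into the instance library :
install.packages("forecast")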

Note : When you want to parallelize standard R functions in the sp_execute_external_script stored procedure, you must explicitly set the @parallel parameter (parallelize or not) yourself. (When you use RevoScaleR functions, this parameter is set automatically.)
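
For instance, the following sketch requests parallel execution for a standard R script by setting @parallel = 1. (Note that parallel execution has restrictions; it suits scripts whose results can be streamed.)

EXEC sp_execute_external_script
  @language = N'R',
  @script = N'OutputDataSet <- InputDataSet',
  @input_data_1 = N'SELECT * FROM IrisTbl',
  @parallel = 1;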

You can also use sp_execute_external_script for scoring (prediction).
The following code retrieves the model binary from the table and predicts the labels.

CREATE PROCEDURE GetPredictedResultsAsTable (
  @modelname VARCHAR(50)
)
AS
BEGIN
  DECLARE @model_data VARBINARY(MAX) = (SELECT ModelData FROM ModelTbl WHERE ModelName = @modelname);
  DROP TABLE IF EXISTS TempPredicted;
  CREATE TABLE TempPredicted (
    SpeciesPredicted VARCHAR(50),
    ID INT
  );
  INSERT INTO TempPredicted
  EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
model <- unserialize(modeldata)
OutputDataSet <- rxPredict(
  model,
  data = IrisScoreData,
  type = "class",
  predVarNames = "SpeciesPredicted",
  extraVarsToWrite = c("ID"))
',
    @input_data_1 = N'SELECT ID, SepalLength, SepalWidth, PetalLength, PetalWidth FROM IrisTbl WHERE SpeciesPredicted IS NULL',
    @input_data_1_name = N'IrisScoreData',
    @params = N'
      @r_rowsPerRead int,
      @modeldata varbinary(max)',
    @r_rowsPerRead = 10000,
    @modeldata = @model_data;
END;
GO

-- Save predicted labels into temp table
EXEC GetPredictedResultsAsTable 'Decision Tree';

-- Merge into original table
UPDATE IrisTbl
SET IrisTbl.SpeciesPredicted = TempPredicted.SpeciesPredicted
FROM IrisTbl INNER JOIN TempPredicted
  ON IrisTbl.ID = TempPredicted.ID;

As you can see, this code saves the predicted values into the TempPredicted table, and these results are then merged into the original table (IrisTbl).

Note : You can also use rxDataStep() to save the results into a SQL table.
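
For example, inside the R script you could write the predicted data frame directly to a SQL table as follows. (A sketch: it assumes a connection string conStr is passed into the script, and that pred holds the data frame returned by rxPredict().)

## A sketch, assuming conStr and pred are available in the script
predTbl <- RxSqlServerData(
  connectionString = conStr,
  table = "TempPredicted")
rxDataStep(
  inData = pred,
  outFile = predTbl,
  overwrite = TRUE)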

One of the pain points of this approach is debugging. You cannot use an IDE or help utilities while developing R or Python code embedded in SQL, so you must debug outside of the database, and that code can sometimes differ from the production code in the database.
A long external script can therefore become confusing to maintain.

Approach 2 : Native Scoring

If you focus on scoring tasks in the database, you can also use the built-in T-SQL function PREDICT() instead of sp_execute_external_script. This kind of implementation is called “native scoring” (realtime scoring), because it depends only on the SQL Server runtime and does not require an additional R or Python runtime. You also gain a performance advantage, because no external language runtime is invoked (especially when scoring is called repeatedly).

One important thing about native scoring: you must serialize the trained model with the rxSerializeModel() function instead of as.raw(serialize()).
See the changed lines in the following code. (I modified the previous sample code as follows.)

CREATE PROCEDURE GenerateTreeModel (
  @model_name VARCHAR(50) OUTPUT,
  @model_data VARBINARY(MAX) OUTPUT)
AS
BEGIN
  EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
model <- rxDTree(
  SpeciesLabeled ~ SepalLength + SepalWidth + PetalLength + PetalWidth,
  data = IrisData,
  cp = 0.01)
modelname <- "Decision Tree"
### Change raw format
# modeldata <- as.raw(serialize(model, connection=NULL))
modeldata <- rxSerializeModel(model, realtimeScoringOnly = TRUE)
',
    @input_data_1 = N'SELECT * FROM IrisTbl WHERE SpeciesPredicted IS NOT NULL',
    @input_data_1_name = N'IrisData',
    @params = N'
      @r_rowsPerRead int,
      @modelname varchar(50) OUTPUT,
      @modeldata varbinary(max) OUTPUT',
    @r_rowsPerRead = 10000,
    @modelname = @model_name OUTPUT,
    @modeldata = @model_data OUTPUT;
END;

The following code is the prediction (scoring) sample with the native scoring approach.
As you can see below, the predicted results are a list of per-class probabilities as float values. (This result shape differs from the result of rxPredict().)

DROP TABLE IF EXISTS TempPredicted;
GO

CREATE TABLE TempPredicted (
  ID INT,
  SpeciesLabeled VARCHAR(50),
  setosa_Pred FLOAT,
  versicolor_Pred FLOAT,
  virginica_Pred FLOAT
);
GO

DECLARE @model VARBINARY(MAX) = (SELECT ModelData from ModelTbl where ModelName = 'Decision Tree');
INSERT INTO TempPredicted
SELECT D.ID, D.SpeciesLabeled, P.*
FROM PREDICT(MODEL = @model,
  DATA = IrisTbl AS D) WITH (setosa_Pred FLOAT, versicolor_Pred FLOAT, virginica_Pred FLOAT) AS P
WHERE D.SpeciesPredicted IS NULL;
GO

You can also create a trained model outside of the SQL Server database and bring it inside without any extra runtime. For example, you can train on a Spark cluster (HDInsight) or another massive computing platform, and focus on scoring (prediction) in the SQL Server database.

The following sample code trains the model on your working client and saves the model into the remote database with Microsoft R Client.
Of course, you get full IDE debugging support (RStudio, etc.) for this training task.

### Read training data from local file
colInfo <- list(
  list(index = 1, newName="ID", type="integer"),
  list(index = 2, newName="SepalLength", type="numeric"),
  list(index = 3, newName="SepalWidth", type="numeric"),
  list(index = 4, newName="PetalLength", type="numeric"),
  list(index = 5, newName="PetalWidth", type="numeric"),
  list(index = 6, newName="SpeciesLabeled", type="factor",
    levels=c("versicolor", "setosa", "virginica")),
  list(index = 7, newName="SpeciesPredicted", type="factor",
    levels=c("versicolor", "setosa", "virginica"))
)
orgData <- RxTextData(
  fileSystem = "native",
  file = "C:\\tmp\\iris.csv",
  colInfo = colInfo,
  delimiter = ",",
  firstRowIsColNames = F,
  stringsAsFactors = T
)
trainData <- rxDataStep(
  inData = orgData,
  rowSelection = !(is.na(SpeciesPredicted)))

### Train model !
model <- rxDTree(
  SpeciesLabeled ~ SepalLength + SepalWidth + PetalLength + PetalWidth,
  data = trainData,
  cp = 0.01)

### Save model to remote database
serializedmodel <- rxSerializeModel(
  model,
  realtimeScoringOnly = TRUE)
modelrow <- data.frame(ModelName = 'Decision Tree', ModelData = I(list(serializedmodel)))
con <- "Driver=SQL Server;Server=52.170.118.1;Database=testdb;Uid=demouser01;Pwd=P@ssw0rd"
modeltbl <- RxSqlServerData(
  connectionString = con,
  table = "ModelTbl",
  rowsPerRead = 10000)
rxDataStep(
  inData = modelrow,
  outFile = modeltbl,
  overwrite = TRUE)

One important caveat for this approach is that the supported models are limited. Currently, models must be generated by rxLinMod(), rxLogit(), rxBTrees(), rxDTree(), or rxDForest() in RevoScaleR. (See the official document “Advanced analytics on SQL Server – Realtime scoring“.) Naturally, you also cannot bring in models generated by standard R functions.

Note : You can also use the built-in sp_rxPredict stored procedure (a wrapper of rxPredict() in RevoScaleR) instead, but the PREDICT() T-SQL function is recommended in SQL Server 2017 or later.
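
For reference, an sp_rxPredict call looks roughly like the following. (A sketch: sp_rxPredict must first be enabled on the instance, which involves enabling SQLCLR and registering the procedure with the RegisterRExt tool; see the official documentation.)

DECLARE @model VARBINARY(MAX) =
  (SELECT ModelData FROM ModelTbl WHERE ModelName = 'Decision Tree');
EXEC sp_rxPredict
  @model = @model,
  @inputData = N'SELECT SepalLength, SepalWidth, PetalLength, PetalWidth FROM IrisTbl WHERE SpeciesPredicted IS NULL';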

Approach 3 : Use SQL Server Compute Context

The last approach is different from the previous ones.
First, consider the following code, which accesses the data remotely with the ODBC module.

Note : Be sure to open TCP port 1433 in your firewall settings.

library(RODBC)
library(tree)

con <- "Driver=SQL Server;Server=52.170.118.1;Database=testdb;Uid=demouser01;Pwd=P@ssw0rd"
ch <- odbcDriverConnect(connection = con)

### Train
trainds <- sqlQuery(ch, "SELECT * FROM IrisTbl WHERE SpeciesPredicted IS NOT NULL")
model <- tree(
  SpeciesLabeled ~ SepalLength + SepalWidth + PetalLength + PetalWidth,
  data = trainds)

### Predict
testds <- sqlQuery(ch, "SELECT * FROM IrisTbl WHERE SpeciesPredicted IS NULL")
labels <- predict(model, testds, type='class')

With this code, all the data is transferred to the client and processed in a single thread on the client. If the data is large, it can hang for a very long time. (Try the air-flight dataset from my previous post: about 500 MB with 2 million rows.)

In such a case, you can still get the benefits of in-database workloads by using the SQL Server compute context.
Let's see the next example.

### Set compute context
con <- "Driver=SQL Server;Server=52.170.118.1;Database=testdb;Uid=demouser01;Pwd=P@ssw0rd"
sqlCompute <- RxInSqlServer(
  connectionString = con,
  wait = T,
  numTasks = 5,
  consoleOutput = F)
rxSetComputeContext(sqlCompute)
#rxSetComputeContext("local")  # if debugging in local !

### Train
trainds <- RxSqlServerData(
  connectionString = con,
  databaseName = "testdb",
  sqlQuery = "SELECT * FROM IrisTbl WHERE SpeciesPredicted IS NOT NULL",
  rowsPerRead = 10000,
  stringsAsFactors = T)
### Retrieve and show remote data
# test <- rxGetInfo(
#   data = trainds,
#   numRows = 150)
# test$data
model <- rxDTree(
  SpeciesLabeled ~ SepalLength + SepalWidth + PetalLength + PetalWidth,
  data = trainds,
  cp = 0.01,
  minSplit = sqrt(130),  # data size is 130
  maxNumBins = min(1001, max(101, sqrt(130))))  # data size is 130

### View result tree with browser
# library(RevoTreeView)
# plot(createTreeView(model))

### Predict
testds <- RxSqlServerData(
  connectionString = con,
  databaseName = "testdb",
  sqlQuery = "SELECT * FROM IrisTbl WHERE SpeciesPredicted IS NULL",
  rowsPerRead = 10000,
  stringsAsFactors = T)
sqlServerOutDS <- RxSqlServerData(
  connectionString = con,
  table = "PredTbl",
  rowsPerRead = 10000)
rxPredict(
  model,
  data = testds,
  type = "class",
  outData = sqlServerOutDS,
  writeModelVars = TRUE,
  overwrite = TRUE,
  predVarNames = "SpeciesPredicted")

The logic flow of this code is similar to the previous ODBC sample, but it behaves very differently.
As I explained in my earlier post about RevoScaleR functions (see “Benefits of Microsoft R and Machine Learning Server“), with the RxInSqlServer context the data is not transferred across the network; the machine learning tasks run, and are parallelized, on the server side (close to SQL Server).
That is, the RxInSqlServer context gives you reduced network-transfer overhead, streaming data reads, and multiple processes. (You must use SQL Server Enterprise for streaming and multiple processes, as I mentioned above.)

Note that I set the number of concurrent tasks (the numTasks attribute) to 5 here, but the database determines the actual concurrency depending on server resources. (Fewer processes might be used.)

One pain point of this approach is that you must take care of several tricky things, because the data is not on your local machine.
For example, you cannot use standard R functions (rpart(), tree(), etc.) for data manipulation in the RxInSqlServer context; you must use RevoScaleR or revoscalepy functions instead. You also cannot receive the predicted results as local objects, so you must set the outData attribute as above (which saves the results into a SQL table). Moreover, you must explicitly set the minSplit and maxNumBins attributes in rxDTree() training as above, because the number of rows is not known up front.

Note : SQL Server compute context running on Linux is supported starting with the ML Server 9.3 release.

When you want to run blocks of standard R code (without RevoScaleR functions) in this context, you can write them as follows using rxDataStep(). This reads the data in chunks and applies your R function to each chunk in turn on the server side.

processfunc <- function(df) {
  # df receives one chunk of input rows as a data frame
  ...  # write your standard R logic running on the server side
  return(...)
}
# IrisData : the input data source (e.g., an RxSqlServerData object)
# tempResult : the output destination (e.g., an RxSqlServerData table)
OutputDataSet <- rxDataStep(
  inData = IrisData,
  outFile = tempResult,
  transformFunc = processfunc,
  overwrite = T)


Here we used only 150 rows of data, but in-database workloads scale much further: you can handle tens of millions of rows with SQL Server Enterprise. (Give it a try!)
Next, I'll show you how to run deep learning (AI) workloads on SQL Server ML Services.
