Quick Look – Benefits of Microsoft R and ML Services (R Server)

Scale your machine learning workloads on R (series)

In my previous post, I described how ordinary business users (marketers, business analysts, etc.) can leverage their R skills using Microsoft technologies with Power BI.
In this post, I describe the benefits of Microsoft R technologies for professional developers (programmers, data scientists, etc.) with a few lines of code.

Designed for multithreading

R is the most popular statistical programming language, but it has some drawbacks for enterprise use. The biggest one is the lack of built-in parallelism.

Microsoft R Open (MRO) is the renamed successor of the famous Revolution R Open (RRO). By using MRO you can take advantage of multithreading and high performance, while MRO remains compatible with the base functions of open source R (CRAN R).

Note : You can also use other options (snow, parallel, etc.) for parallel computing in R.
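For example, base R's bundled parallel package lets you spread work across local cores explicitly. A minimal sketch (not MRO-specific):

```r
# Explicit parallelism with base R's bundled "parallel" package
library(parallel)

cl <- makeCluster(2)                        # start 2 local worker processes
res <- parLapply(cl, 1:4, function(x) x^2)  # apply the function in parallel
stopCluster(cl)

unlist(res)  # 1 4 9 16
```

The difference with MRO is that MRO's multithreaded math libraries accelerate matrix operations implicitly, with no changes to your code.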

Please see the following official document about the benchmarks.

The Benefits of Multithreaded Performance with Microsoft R Open

For example, this document reports that matrix manipulation is many times faster than in open source R. Let's look at the following simple example.
A is a 10000 x 5000 matrix whose elements are the repeated values 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, ... B is the cross-product matrix of A. Here we measure this cross-product operation using system.time().

A <- matrix (1:5,10000,5000)
system.time (B <- crossprod(A))

The following are the results.
Here I'm using a Lenovo X1 Carbon (with Intel Core vPro i7), and MRO is over 8 times faster than open source R.

R 3.3.2 [64 bit] (CRAN)

Microsoft R 3.3.2 [64 bit] (MRO)

Analysis functions over large amounts of data are also faster than on the other R runtime.

Note : If you're using RStudio and have installed both open source R and MRO, you can switch the R runtime for RStudio from the [Tools] – [Global Options] menu.

Note (Added on Feb 2018) : Microsoft R Open is now default R in Anaconda distribution. (See “anaconda.com blog – Introducing Microsoft R Open as Default R for Anaconda Distribution“.)


Distributed and Scaling

By using Microsoft Machine Learning Server (ML Server) or Microsoft Machine Learning Services (ML Services) (formerly Revolution R Enterprise, then R Server), you can also distribute and scale R computations across multiple computers. Moreover, ML Server on Microsoft R Open provides data chunking (streaming) from disk, so it can operate on massive volumes of data. (Without ML Server, data must fit in memory.)

The Machine Learning Server (ML Server) can run on Windows, Linux, Teradata, or Hadoop (Spark) clusters, and is also available inside SQL Server.
You can also use ML Services (R Server) as one of the workloads on Spark (see the illustration below), with which you can distribute R algorithms across the Spark cluster.
In this post, I show you how to use ML Server (or ML Services) on a Spark cluster.

Note : R Server was renamed to "Machine Learning Server (Services)" in Sep 2017, and it now includes Python support as well as R.

Note : On Windows, ML Server is licensed under SQL Server. You can easily get a standalone ML Server (with SQL Server 2016 Developer edition) by using the virtual machine called "Data Science Virtual Machine" (DSVM) in Microsoft Azure.

Note : Spark MLlib is the machine learning component widely used in the Spark community, but currently most of its functions are not supported in R. (The mainly supported languages are Python, Java, and Scala.)
You can also use SparkR, but keep in mind that SparkR currently focuses on data transformation for R computing (it is not mature for machine learning tasks).
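For a sense of what SparkR's data-transformation role looks like, here is a minimal sketch (assuming a Spark installation with the SparkR package on the library path; the session settings are illustrative):

```r
# Minimal SparkR sketch: distributed data transformation, results back in R
library(SparkR)

sparkR.session(appName = "sparkr-sketch")

df   <- as.DataFrame(faithful)          # copy a local data.frame to Spark
long <- filter(df, df$eruptions > 3)    # transformation runs on the cluster
head(collect(long))                     # bring the result back to local R

sparkR.session.stop()
```

Note how the heavy lifting (filter) is expressed on a distributed DataFrame, while modeling would still be done after collect() on the local side.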

Here I skip how to set up ML Server (R Server) on Spark; the easiest way is to use the Azure-managed Hadoop cluster service called "HDInsight". You can set up your own experimental environment in just a few steps, as follows. (You can also install ML Server on Spark without HDInsight.)

  1. Create an ML Services (R Server) workload on Azure HDInsight. You just enter several values in the HDInsight cluster creation wizard on the Azure Portal, and all of the nodes (head nodes, edge nodes, worker nodes, ZooKeeper nodes) are set up automatically. (See below.)
    Please see "Microsoft Azure – Get started using ML Services on HDInsight" for details.
  2. If needed, set up RStudio connected to the edge node of the Spark cluster above. Note that RStudio Server Community Edition is installed on the edge node automatically by HDInsight, so you only need to set up your client environment. (Currently you don't need to install RStudio Server yourself; it is included in ML Services.)
    Please see "Installing RStudio with ML Services on HDInsight" for the client setup.

Note : Here I used Azure Data Lake Store (not Azure Storage Blobs) as the primary storage for the Hadoop cluster. (And I set up a service principal and its access permissions for this Data Lake account.)
For more details, please refer to "Create an HDInsight cluster with Data Lake Store using Azure Portal".

Using RStudio Server on the edge node of the Spark cluster, you can run RStudio in a web browser through an SSH tunnel. (See the following screenshot.)
This is a very convenient way to run and debug your R scripts on ML Services (R Server).

Now I have prepared a dataset (a part of Japanese stock daily reports) with over 35,000,000 records, over 1 GB in size. (You can download it from here.)
When I run my R script against this huge data on my client PC, the script fails with an allocation error or a timeout. But you can run the same workload on ML Services (R Server) with a Spark cluster.

Below is the R script which I ran on ML Services (R Server) in Azure HDInsight. (Type the following code in RStudio in the web browser.)

##### The format of source data
##### (company-code, year, month, day, week, open-price, difference)

# Set Spark cluster context
spark <- RxSpark(
  consoleOutput = TRUE,
  extraSparkConfig = "--conf spark.speculation=true",
  nameNode = "adl://jpstockdata.azuredatalakestore.net",
  port = 0,
  idleTimeout = 90000)
rxSetComputeContext(spark)

# Import data
fs <- RxHdfsFileSystem(
  hostName = "adl://jpstockdata.azuredatalakestore.net",
  port = 0)
colInfo <- list(
  list(index = 1, newName="Code", type="character"),
  list(index = 2, newName="Year", type="integer"),
  list(index = 3, newName="Month", type="integer"),
  list(index = 4, newName="Day", type="integer"),
  list(index = 5, newName="DayOfWeek", type="factor",
       levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")),
  list(index = 6, newName="Open", type="integer"),
  list(index = 7, newName="Diff", type="integer"))
orgData <- RxTextData(
  fileSystem = fs,
  file = "/history/testCsv.txt",
  colInfo = colInfo,
  delimiter = ",",
  firstRowIsColNames = FALSE)

# execute : rxLinMod (lm)
system.time(lmData <- rxDataStep(
  inData = orgData,
  transforms = list(DiffRate = (Diff / Open) * 100),
  maxRowsByCols = 300000000))
system.time(lmObj <- rxLinMod(
  DiffRate ~ DayOfWeek,
  data = lmData,
  cube = TRUE))

# If needed, predict (rxPredict) with the trained model or save it.
# Here we just plot the means of DiffRate for each DayOfWeek.
lmResult <- rxResultsDF(lmObj)
rxLinePlot(DiffRate ~ DayOfWeek, data = lmResult)

# execute : rxCrossTabs (xtabs)
system.time(ctData <- rxDataStep(
  inData = orgData,
  transforms = list(Close = Open + Diff),
  maxRowsByCols = 300000000))
system.time(ctObj <- rxCrossTabs(
  formula = Close ~ F(Year):F(Month),
  data = ctData,
  means = TRUE))

When you set the compute context with rxSetComputeContext, the data is not transferred to this host over the network; instead, the R script is sent to each server and executed there. (You can use rxImport when you want to download the data to the host on which the script is running.)
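The switch between cluster and local execution can be sketched as follows (a RevoScaleR sketch; "spark" is the RxSpark context created in the script above):

```r
# Ship ScaleR computations (rxDataStep, rxLinMod, rxCrossTabs, ...) to the
# Spark cluster: the script travels to the nodes, not the data to this host
rxSetComputeContext(spark)

# ... run distributed ScaleR functions here ...

# Switch back to local execution (data must then fit on this host)
rxSetComputeContext("local")
```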

Before running this script, I uploaded the source data to Azure Data Lake Store, which is the same storage as the primary storage of the Hadoop cluster. The "adl://..." in my program is the URI of the Azure Data Lake Store account. (It is an HDFS extension.)

The functions above prefixed with "rx" are called RevoScaleR (ScaleR) functions. These functions are designed for distributing and scaling, and each RevoScaleR function is the scalable counterpart of a corresponding base R function. For example, RxTextData corresponds to read.table or read.csv, rxLinMod corresponds to lm (linear regression), and rxCrossTabs corresponds to xtabs (cross-tabulation).
You can use these R functions to leverage the computing power of Hadoop clusters. (See the following reference documents for details.)

Microsoft R – RevoScaleR Functions for Hadoop

Microsoft R – Comparison of Base R and ScaleR Functions
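For comparison, the same pipeline can be sketched with the base R counterparts on a small in-memory sample (the toy data frame below is illustrative, not the stock dataset):

```r
# Base-R counterparts of the ScaleR calls above, on a toy in-memory sample
df <- data.frame(
  DayOfWeek = factor(rep(c("Monday", "Tuesday"), each = 3),
                     levels = c("Monday", "Tuesday")),
  Open = c(100, 110, 120, 100, 105, 95),
  Diff = c(1, -2, 3, 2, 2, -1))

df$DiffRate <- (df$Diff / df$Open) * 100     # transform (cf. rxDataStep)
fit <- lm(DiffRate ~ DayOfWeek, data = df)   # linear model (cf. rxLinMod)

df$Close <- df$Open + df$Diff
xt <- xtabs(Close ~ DayOfWeek, data = df)    # cross-tab (cf. rxCrossTabs)
```

The ScaleR versions accept the same formulas but read chunked data from disk or HDFS, so they scale past the memory limits that constrain lm and xtabs.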

Note : You can also use the revoscalepy package for Python code in ML Services.

Note : For more details (descriptions, arguments, etc.) about each ScaleR function, type "?{function name}" (e.g. "?rxLinePlot") in the R console.

Note : You can also use faster modeling functions implemented by Microsoft Research, called MicrosoftML (MML). (This also includes functionality for anomaly detection and deep neural networks.) Currently these functions run only on Windows and not in Spark clusters, but this will be updated in the future.
See "Building a machine learning model with the MicrosoftML package".

The following illustrates the topology of the Spark cluster. The ML Services (R Server) workloads reside on the edge node and the worker nodes.

The edge node plays the role of the development front end, and you interact with ML Services (R Server) through this node. (Currently, ML Services on Azure HDInsight is the only cluster type that provides an edge node by default.) RStudio Server is installed on this edge node, and when you run your R scripts through RStudio in the web browser, this node starts the computations and distributes them to the worker nodes. If a computation cannot be distributed for some reason (whether intentionally or accidentally), the task runs locally on the edge node.

While the script is running, look at the YARN ResourceManager UI in Hadoop. You can find the RevoScaleR (ScaleR) application running in the scheduler. (See the following screenshot.)

When you monitor the worker nodes in the ResourceManager UI, you can see that all nodes are used for the computation.

The RevoScaleR functions are also provided in Microsoft R Client (built on top of Microsoft R Open), which can run on a standalone computer. (You don't need extra servers.)
Microsoft R Client is very useful: using it, you can send complete R commands to a remote ML Server for execution. (Use the mrsdeploy package.)
Moreover, you can run and test RevoScaleR functions on your local computer without any extra server setup, and later migrate to distributed clusters for production. That is, you can use the lightweight Microsoft R Client during development.
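As a hedged sketch of that remote-execution workflow with the mrsdeploy package (the server URL and credentials below are placeholders, and the exact fields of the returned object may vary by version):

```r
# Sketch: send R commands from Microsoft R Client to a remote ML Server
library(mrsdeploy)

# Log in to the remote ML Server (placeholder URL and credentials)
remoteLogin("https://my-mlserver.example.com:12800",
            username = "admin",
            password = "<password>")

# Execute a command remotely and show its console output
result <- remoteExecute("rxSummary(~ Sepal.Length, data = iris)")
cat(result$consoleOutput)

remoteLogout()
```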


You can take advantage of a robust computing platform with Microsoft R technologies!


[Change history]

2018/06/29  Name change for Azure HDInsight R Server : “R Server” -> “ML Services (R Server)”


