Walkthrough of Azure Cosmos DB Graph (Gremlin)

This post walks you through the general tasks for your first use of Azure Cosmos DB Gremlin (graph), and then dives a bit deeper into practical graph querying.

This post will also help beginners learn how to use a graph database itself, since Azure Cosmos DB is compliant with the popular graph framework "Apache TinkerPop".

Create your database and collection

First you must prepare your database.

With Azure Cosmos DB, you must provision an account, a database, and a collection, just as with the Azure Cosmos DB NoSQL database.
You can create these objects using the API (REST or SDK), but here we use the Azure Portal UI.

When you create your Azure Cosmos DB account in the Azure Portal, you must select "Gremlin (graph)" as the supported API, as in the following picture.

After you've created your account, next create your database and collection with Data Explorer as follows.
The "Database id" below is the identifier for a database, and the "Graph id" is for a collection.

Calling API

Before calling the APIs, copy your account key from the Azure Portal.

Also copy the Gremlin URI from the Azure Portal.

Now you can develop your application with APIs.

As I mentioned before, Azure Cosmos DB is one of the databases compliant with the TinkerPop open source framework, so you can use many existing TinkerPop-compatible tools. (See "Apache TinkerPop" for language drivers and other tools.)
For example, if you use PHP, you can easily download and install gremlin-php with the following commands. (You can also use the App Service Editor in Azure App Services, so there's no need to prepare a local machine just for a trial!)

curl https://getcomposer.org/installer | php
php composer.phar require brightzone/gremlin-php "2.*"

The following is a simple PHP program that retrieves all vertices (nodes) in the graph database. (I explain the Gremlin language later.)

Note that :

  • host is the host name of the previously copied Gremlin URI.
  • username is /dbs/{your database name}/colls/{your collection name}.
  • password is the previously copied account key.
<?php
require_once('vendor/autoload.php');
use Brightzone\GremlinDriver\Connection;

$db = new Connection([
  'host' => 'graph01.graphs.azure.com',
  'port' => '443',
  'graph' => 'graph',
  'username' => '/dbs/db01/colls/test01',
  'password' => 'In12qhzXYz...',
  'ssl' => TRUE
]);
$db->open();
$res = $db->send('g.V()');
$db->close();

// output all vertices in the db
var_dump($res);
?>

The retrieved result is in the so-called GraphSON format, as follows. In this PHP example, the result is deserialized into a PHP array with the same structure.

{
  "id": "u001",
  "label": "person",
  "type": "vertex",
  "properties": {
    "firstName": [
      {
        "value": "John"
      }
    ],
    "age": [
      {
        "value": 45
      }
    ]
  }
}

You can also use the SDK for Azure Cosmos DB graph (.NET, Java, Node.js), which is specific to Azure Cosmos DB.
In particular, if you need operations specific to Azure Cosmos DB (e.g., creating or managing databases, collections, etc.), it's better to use this SDK.

For example, the following is C# example code using the Azure Cosmos DB Graph SDK. (The Gremlin language is used here as well; I explain the details later.)
Note that the endpoint differs from the previous Gremlin URI.

using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;
using Microsoft.Azure.Documents.Linq;
using Microsoft.Azure.Graphs;
using Newtonsoft.Json;

static void Main(string[] args)
{
  using (DocumentClient client = new DocumentClient(
    new Uri("https://graph01.documents.azure.com:443/"),
    "In12qhzXYz..."))
  {
    DocumentCollection graph = client.CreateDocumentCollectionIfNotExistsAsync(
      UriFactory.CreateDatabaseUri("db01"),
      new DocumentCollection { Id = "test01" },
      new RequestOptions { OfferThroughput = 1000 }).Result;
    // drop all vertices
    IDocumentQuery<dynamic> query1 =
      client.CreateGremlinQuery<dynamic>(graph, "g.V().drop()");
    dynamic result1 = query1.ExecuteNextAsync().Result;
    Console.WriteLine($"{JsonConvert.SerializeObject(result1)}");
    // add vertex
    IDocumentQuery<dynamic> query2 =
      client.CreateGremlinQuery<dynamic>(graph, "g.addV('person').property('id', 'u001').property('firstName', 'John')");
    dynamic result2 = query2.ExecuteNextAsync().Result;
    Console.WriteLine($"{JsonConvert.SerializeObject(result2)}");
  }

  Console.WriteLine("Done !");
  Console.ReadLine();
}

Before building your application with the Azure Cosmos DB SDK, you must install the Microsoft.Azure.Graphs package with NuGet. (Other dependent libraries such as Microsoft.Azure.DocumentDB are also installed into your project.)
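For example, you can install it from the NuGet Package Manager Console roughly as follows. (This is just a sketch; the exact version, and whether the -Pre prerelease flag is needed, depend on the state of the package at the time.)

Install-Package Microsoft.Azure.Graphs -Pre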

Interactive Console and Visualization

As I described above, the TinkerPop framework has various open source utilities contributed by the community.

For example, if you want to run the Gremlin language (queries, etc.) in an interactive console, you can use the Gremlin Console.
Please see the official document "Azure Cosmos DB: Create, query, and traverse a graph in the Gremlin console" for details about using the Gremlin Console with Azure Cosmos DB.
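As a rough sketch (the configuration file name here is an assumption; set your host, port, username, and password in the yaml file as described in that document), a Gremlin Console session looks like the following: you connect to the remote server and submit Gremlin with the :> prefix.

gremlin> :remote connect tinkerpop.server conf/remote-secure.yaml
gremlin> :> g.V().count()
gremlin> :> g.V().hasLabel('person').values('firstName')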

There are also several libraries and tools for visualizing Gremlin-compatible graphs in the TinkerPop ecosystem.

If you're using Visual Studio and Azure Cosmos DB, the following GitHub sample (written as an ASP.NET web project) is very easy to use for visualizing an Azure Cosmos DB graph.

[GitHub] Azure-Samples / azure-cosmos-db-dotnet-graphexplorer
https://github.com/Azure-Samples/azure-cosmos-db-dotnet-graphexplorer

Gremlin Language

As you've seen in the programming examples above, it's important to understand the Gremlin language (queries, etc.) for practical use.
Let's dive into the Gremlin language, not deeply, but to a practical level of understanding.

First, we simply create vertices (nodes).
The following creates two vertices, "John" and "Mary".

g.addV('employee').property('id', 'u001').property('firstName', 'John').property('age', 44)
g.addV('employee').property('id', 'u002').property('firstName', 'Mary').property('age', 37)

The following creates an edge between the two vertices John and Mary. (This sample means that John is the manager of Mary.)
As you can see, you can identify the target vertex by the "id" property set previously.

g.V('u002').addE('manager').to(g.V('u001'))

In this post, we use the following simple structure (vertices and edges) for our subsequent examples.

g.addV('employee').property('id', 'u001').property('firstName', 'John').property('age', 44)
g.addV('employee').property('id', 'u002').property('firstName', 'Mary').property('age', 37)
g.addV('employee').property('id', 'u003').property('firstName', 'Christie').property('age', 30)  
g.addV('employee').property('id', 'u004').property('firstName', 'Bob').property('age', 35)
g.addV('employee').property('id', 'u005').property('firstName', 'Susan').property('age', 31)
g.addV('employee').property('id', 'u006').property('firstName', 'Emily').property('age', 29)
g.V('u002').addE('manager').to(g.V('u001'))
g.V('u005').addE('manager').to(g.V('u001'))
g.V('u004').addE('manager').to(g.V('u002'))
g.V('u005').addE('friend').to(g.V('u006'))
g.V('u005').addE('friend').to(g.V('u003'))
g.V('u006').addE('friend').to(g.V('u003'))
g.V('u006').addE('manager').to(g.V('u004'))

The following example retrieves vertices matching some query conditions; it retrieves the employees whose age is greater than 40. (If you want to query edges, use g.E() instead of g.V(), as shown below.)

g.V().hasLabel('employee').has('age', gt(40))
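For example, the following returns only the "manager" edges in this sample graph.

g.E().hasLabel('manager')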

As I described above, the result of this vertex query is returned in the GraphSON format, as follows.

{
  "id": "u001",
  "label": "employee",
  "type": "vertex",
  "properties": {
    "firstName": [
      {
        "id": "9a5c0e2a-1249-4e2c-ada2-c9a7f33e26d5",
        "value": "John"
      }
    ],
    "age": [
      {
        "id": "67d681b1-9a24-4090-bac5-be77337ec903",
        "value": 44
      }
    ]
  }
}

You can also use logical operations (and(), or()) in your graph queries; the following two examples show both.
For example, the and() query below returns only "Mary".

g.V().hasLabel('employee').and(has('age', gt(35)), has('age', lt(40)))
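The following uses or() instead and returns the employees who are either younger than 30 or older than 40 (i.e., "Emily" and "John").

g.V().hasLabel('employee').or(has('age', lt(30)), has('age', gt(40)))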

Next we handle traversals (you can traverse along the edges).
The following is a simple traversal example, which retrieves Mary's manager. (The result is "John".)

g.V('u002').out('manager').hasLabel('employee')

Note that the following returns the same result. The outE() step returns the edge element, and inV() then gets the incoming vertex. (This explicitly traverses the elements: vertex -> edge -> vertex.)

g.V('u002').outE('manager').inV().hasLabel('employee')

The following retrieves Mary's manager (i.e., "John") and then retrieves all employees who report directly to him ("John").
The result is "Mary" and "Susan".

g.V('u002').out('manager').hasLabel('employee').in('manager').hasLabel('employee')

If you want to omit repeated elements in the path, you can use simplePath() as follows. This returns only "Susan", because "Mary" is the repeated vertex.

g.V('u002').out('manager').hasLabel('employee').in('manager').hasLabel('employee').simplePath()

Now let's consider traversing the "friend" relation. (See the picture illustrated above.)
As you know, "manager" is a directional relation, but "friend" is a non-directional one. That is, if A is a friend of B, then B is also a friend of A.
In such a case, you can use the both() (or bothE()) step as follows. The following retrieves Emily's friends, and the result is both "Susan" and "Christie".

g.V('u006').both('friend').hasLabel('employee')

If you want to traverse until some condition matches, you can use repeat().until().
The following retrieves the reporting path (the chain of direct reports) from "John" to "Emily".

g.V('u001').repeat(in('manager')).until(has('id', 'u006')).path()

The result is "John" – "Mary" – "Bob" – "Emily", as in the following GraphSON.

{
  "labels": [
    ...
  ],
  "objects": [
    {
      "id": "u001",
      "label": "employee",
      "type": "vertex",
      "properties": {
        "firstName": [
          {
            "id": "9a5c0e2a-1249-4e2c-ada2-c9a7f33e26d5",
            "value": "John"
          }
        ],
        "age": [
          {
            "id": "67d681b1-9a24-4090-bac5-be77337ec903",
            "value": 44
          }
        ]
      }
    },
    {
      "id": "u002",
      "label": "employee",
      "type": "vertex",
      "properties": {
        "firstName": [
          {
            "id": "8d3b7a38-5b8e-4614-b2c4-a28306d3a534",
            "value": "Mary"
          }
        ],
        "age": [
          {
            "id": "2b0804e5-58cc-4061-a03d-5a296e7405d9",
            "value": 37
          }
        ]
      }
    },
    {
      "id": "u004",
      "label": "employee",
      "type": "vertex",
      "properties": {
        "firstName": [
          {
            "id": "3b804f2e-0428-402c-aad1-795f692f740b",
            "value": "Bob"
          }
        ],
        "age": [
          {
            "id": "040a1234-8646-4412-9488-47a5af75a7d7",
            "value": 35
          }
        ]
      }
    },
    {
      "id": "u006",
      "label": "employee",
      "type": "vertex",
      "properties": {
        "firstName": [
          {
            "id": "dfb2b624-e145-4a78-b357-5e147c1de7f6",
            "value": "Emily"
          }
        ],
        "age": [
          {
            "id": "f756c2e9-a16d-4959-b9a3-633cf08bcfd7",
            "value": 29
          }
        ]
      }
    }
  ]
}

Finally, let's consider the shortest path from "Emily" to "John". We assume that you can traverse either "manager" (directional) or "friend" (non-directional) edges.

Now the following returns the possible paths from "Emily" to "John" connected by either "manager" (directional) or "friend" (non-directional).

g.V('u006').repeat(union(both('friend').simplePath(), out('manager').simplePath())).until(has('id', 'u001')).path()

The result is 3 paths:
Emily – Susan – John
Emily – Christie – Susan – John
Emily – Bob – Mary – John

When you want to count the number of elements in each path (the current local collection), use the count(local) step.

g.V('u006').repeat(union(both('friend').simplePath(), out('manager').simplePath())).until(has('id', 'u001')).path().count(local)

The result is:
3
4
4

The following groups the paths by their length, returning both the count and the paths.

g.V('u006').repeat(union(both('friend').simplePath(), out('manager').simplePath())).until(has('id', 'u001')).path().group().by(count(local))

{
  "3": [
    {
      "labels": [...],
      "objects": [
        {
          "id": "u006",
          ...
        },
        {
          "id": "u005",
          ...
        },
        {
          "id": "u001",
          ...
        }
      ]
    }
  ],
  "4": [
    {
      "labels": [...],
      "objects": [
        {
          "id": "u006",
          ...
        },
        {
          "id": "u003",
          ...
        },
        {
          "id": "u005",
          ...
        },
        {
          "id": "u001",
          ...
        }
      ]
    },
    {
      "labels": [...],
      "objects": [
        {
          "id": "u006",
          ...
        },
        {
          "id": "u004",
          ...
        },
        {
          "id": "u002",
          ...
        },
        {
          "id": "u001",
          ...
        }
      ]
    }
  ]
}

 

[Reference]

TinkerPop Documentation (including language reference)
http://tinkerpop.apache.org/docs/current/reference/

 

Build your own Web API protected by Azure AD v2.0 endpoint with custom scopes

* This post is about the Azure AD v2.0 endpoint. If you're using v1, please see "Build your own api with Azure AD" (written in Japanese).

You can now build your own Web API protected by the OAuth flow and add your own scopes with the Azure AD v2.0 endpoint (also with Azure AD B2C).
Here I show you how to set it up, how to build it, and what to consider for custom scopes in the v2.0 endpoint. (You can also learn several OAuth scenarios and ideas through this post.)

Note that a Microsoft Account (consumer account) currently cannot be used for the following scenarios with custom (user-defined) scopes. Please use an organization account (Azure AD account).

Note : For Azure AD B2C, please refer to the post "Azure AD B2C Access Tokens now in public preview" on the team blog.

Register your own Web API

First we register our custom Web API with the v2.0 endpoint and consent to this app in the tenant.

Please go to the Application Registration Portal and start registering your own Web API by pressing the [Add an app] button. In the application settings, click [Add Platform] and select [Web API].

In the added platform pane, you can see the following generated scope (access_as_user) by default.

This scope is used as follows.
For example, when you create a client app that accesses this custom Web API via OAuth, the client requests permission to call the Web API by including this scope value in the following URI.

https://login.microsoftonline.com/common/oauth2/v2.0/authorize
  ?response_type=id_token+code
  &response_mode=form_post
  &client_id=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
  &scope=openid+api%3a%2f%2f8a9c6678-7194-43b0-9409-a3a10c3a9800%2faccess_as_user
  &redirect_uri=https%3A%2F%2Flocalhost%2Ftest
  &nonce=abcdef

Now let's change this default scope and define new read and write scopes as follows. (We assume the scopes are api://8a9c6678-7194-43b0-9409-a3a10c3a9800/read and api://8a9c6678-7194-43b0-9409-a3a10c3a9800/write.)

Note : If admin consent should be required for the custom scope (i.e., only an admin can grant it), please edit the manifest manually, as in the sketch below.
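As a sketch (all values here are placeholders), the scope definitions live in the oauth2Permissions collection of the manifest, and setting "type" to "Admin" instead of "User" makes the scope require admin consent.

"oauth2Permissions": [
  {
    "adminConsentDescription": "Allows the app to read test service data",
    "adminConsentDisplayName": "Read test service data",
    "id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "isEnabled": true,
    "type": "Admin",
    "userConsentDescription": null,
    "userConsentDisplayName": null,
    "value": "read"
  }
]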

Next we must also add the "Web" platform (not the "Web API" platform), because the user needs to consent to this API application before using these custom scopes.

For example, remember "Office 365". Organizations or users who don't purchase (subscribe to) Office 365 cannot use the Office 365 API permissions. (No Office 365 API permissions are displayed in their Azure AD settings.) After you purchase Office 365 at https://portal.office.com/, you can start to use these API permissions.
Your custom API is the same. Before using these custom scopes, the user has to bring this custom application into the tenant or their individual account.

When a user accesses the following URL in a web browser and logs in with their credentials, the following consent UI is displayed. Once the user approves this consent, this custom Web API application is registered in the user's individual permissions. (Note that the client_id is the application ID of this custom Web API application, and the redirect_uri is the redirect URL of the "Web" platform in your custom Web API application. Please change these values to match your application settings.)

https://login.microsoftonline.com/common/oauth2/v2.0/authorize
  ?response_type=id_token
  &response_mode=form_post
  &client_id=8a9c6678-7194-43b0-9409-a3a10c3a9800
  &scope=openid
  &redirect_uri=https%3A%2F%2Flocalhost%2Ftestapi
  &nonce=abcdef

Note : You can revoke the permission at https://account.activedirectory.windowsazure.com/ when you are using an organization account (Azure AD account), or at https://account.live.com/consent/Manage when you're using a consumer account (Microsoft Account).

Use the custom scope in your client application

After the user has consented to the custom Web API application, the user can now use the custom scopes (api://.../read and api://.../write in this example) in their client application. (In this post, we use the OAuth code grant flow with a web client application.)

First let's register the new client application in the Application Registration Portal with the user account that consented to your Web API application. In this post, we create this client application on the "Web" platform (i.e., a web client application).

The application password (client secret) must also be generated in the application settings as follows.

Now let's consume the custom scope (of the custom Web API) with this generated web client.
Access the following URL with your web browser. (As you can see, the requested scope is the previously registered custom scope api://8a9c6678-7194-43b0-9409-a3a10c3a9800/read.)
Here client_id is the application ID of the web client application (not the custom Web API application), and redirect_uri is the redirect URL of the web client application.

https://login.microsoftonline.com/common/oauth2/v2.0/authorize
  ?response_type=code
  &response_mode=query
  &client_id=b5b3a0e3-d85e-4b4f-98d6-e7483e49bffc
  &scope=api%3A%2F%2F8a9c6678-7194-43b0-9409-a3a10c3a9800%2Fread
  &redirect_uri=https%3a%2f%2flocalhost%2ftestwebclient

Note : In real production, it's better to also retrieve the ID token (i.e., response_type=id_token+code), since your client should validate the returned token and check that the user has logged in correctly.
This sample skips these complicated steps for simplicity.

When you access this url, the following login page will be displayed.

After the login succeeds with the user's credentials, the following consent is displayed.
As you can see, it shows that the client will use the "Read test service data" permission (a custom permission), which is the previously registered custom scope (api://8a9c6678-7194-43b0-9409-a3a10c3a9800/read).

After you approve this consent, the code is returned to your redirect URL as follows.

https://localhost/testwebclient?code=OAQABAAIAA...

Next, using the code value, you can request an access token for the requested resource (custom scope) with the following HTTP request.
Here client_id and client_secret are the application ID and application password of the web client application.

HTTP Request

POST https://login.microsoftonline.com/common/oauth2/v2.0/token
Content-Type: application/x-www-form-urlencoded

grant_type=authorization_code
&code=OAQABAAIAA...
&client_id=b5b3a0e3-d85e-4b4f-98d6-e7483e49bffc
&client_secret=pmC...
&scope=api%3A%2F%2F8a9c6678-7194-43b0-9409-a3a10c3a9800%2Fread
&redirect_uri=https%3A%2F%2Flocalhost%2Ftestwebclient

HTTP Response

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

{
  "token_type": "Bearer",
  "scope": "api://8a9c6678-7194-43b0-9409-a3a10c3a9800/read",
  "expires_in": 3599,
  "ext_expires_in": 0,
  "access_token": "eyJ0eXAiOi..."
}

Note : If you want to get a refresh token, you must add "offline_access" to the scopes, as shown below.
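For example, the scope parameter of the authorization request above would then look something like the following.

&scope=openid+offline_access+api%3A%2F%2F8a9c6678-7194-43b0-9409-a3a10c3a9800%2Fread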

Using the returned access token (the access_token property), you can call your custom Web API as follows, and the API can verify the passed token. (Later I show you how to verify this token in your custom Web API.)

GET https://localhost/testapi
Authorization: Bearer eyJ0eXAiOi...

Verify access token in your Web API

Now it's your custom Web API's turn.

How do you check whether the access token is valid? How do you get the logged-in user's claims?

First you must remember that the v2.0 endpoint returns the following token formats.

                                   id token    access token
  organization account (Azure AD)  JWT         JWT
  consumer account (MSA)           JWT         Compact Ticket

As you can see in the table above, the passed access token is in the IETF JWT (JSON Web Token) format if you are using an Azure AD account (organization account).

  • A JWT consists of 3 string tokens delimited by the dot (.) character.
  • Each delimited token is base64 url encoded (RFC 4648).
  • The 3 delimited tokens contain:
    certificate information (e.g., the type of key, key id, etc.), claim information (e.g., user name, tenant id, token expiration, etc.), and the digital signature (byte code).

For example, the following is a PHP example of decoding the access token. (The C# sample is here.)
This sample outputs the 2nd delimited token string (i.e., the claims information) as the result.

<?php
echo "The result is " . token_test("eyJ0eXAiOi...");

// return claims
function token_test($token) {
  $res = 0;

  // 1 create array from token separated by dot (.)
  $token_arr = explode('.', $token);
  $header_enc = $token_arr[0];
  $claim_enc = $token_arr[1];
  $sig_enc = $token_arr[2];

  // 2 base 64 url decoding
  $header = base64_url_decode($header_enc);
  $claim = base64_url_decode($claim_enc);
  $sig = base64_url_decode($sig_enc);

  return $claim;
}

function base64_url_decode($arg) {
  $res = $arg;
  $res = str_replace('-', '+', $res);
  $res = str_replace('_', '/', $res);
  switch (strlen($res) % 4) {
    case 0:
      break;
    case 2:
      $res .= "==";
      break;
    case 3:
      $res .= "=";
      break;
    default:
      break;
  }
  $res = base64_decode($res);
  return $res;
}
?>

The output result (the claims information) is a JSON string as follows.

{
  "aud": "8a9c6678-7194-43b0-9409-a3a10c3a9800",
  "iss": "https://login.microsoftonline.com/3bc5ea6c-9286-4ca9-8c1a-1b2c4f013f15/v2.0",
  "iat": 1498037743,
  "nbf": 1498037743,
  "exp": 1498041643,
  "aio": "ATQAy/8DAA...",
  "azp": "b5b3a0e3-d85e-4b4f-98d6-e7483e49bffc",
  "azpacr": "1",
  "name": "Christie Cline",
  "oid": "fb0d1227-1553-4d71-a04f-da6507ae0d85",
  "preferred_username": "ChristieC@MOD776816.onmicrosoft.com",
  "scp": "read",
  "sub": "Pcz_ssYLnD...",
  "tid": "3bc5ea6c-9286-4ca9-8c1a-1b2c4f013f15",
  "ver": "2.0"
}

Here aud is the application ID of the target Web API (here, the custom Web API), nbf (not before) is the time from which the token is valid, exp is the token expiration time, tid is the tenant ID of the logged-in user, and scp is the granted scopes.
With these claim values, your custom Web API can check whether the passed token is valid.

Here I show you the PHP sample code for checking these claims.

<?php
echo "The result is " . token_test("eyJ0eXAiOi...");

// return 1, if token is valid
// return 0, if token is invalid
function token_test($token) {
  // 1 create array from token separated by dot (.)
  $token_arr = explode('.', $token);
  $header_enc = $token_arr[0];
  $claim_enc = $token_arr[1];
  $sig_enc = $token_arr[2];

  // 2 base 64 url decoding
  $header = 
    json_decode(base64_url_decode($header_enc), TRUE);
  $claim =
    json_decode(base64_url_decode($claim_enc), TRUE);
  $sig = base64_url_decode($sig_enc);

  // 3 expiration check
  $dtnow = time();
  if($dtnow <= $claim['nbf'] or $dtnow >= $claim['exp'])
    return 0;

  // 4 audience check
  if (strcmp($claim['aud'], '8a9c6678-7194-43b0-9409-a3a10c3a9800') !== 0)
    return 0;

  // 5 scope check
  if (strcmp($claim['scp'], 'read') !== 0)
    return 0;
    
  // other checks if needed (licensed tenant, etc)
  // Here, we skip these steps ...

  return 1;
}

function base64_url_decode($arg) {
  $res = $arg;
  $res = str_replace('-', '+', $res);
  $res = str_replace('_', '/', $res);
  switch (strlen($res) % 4) {
    case 0:
      break;
    case 2:
      $res .= "==";
      break;
    case 3:
      $res .= "=";
      break;
    default:
      break;
  }
  $res = base64_decode($res);
  return $res;
}
?>

But this is not complete!

Now consider what happens if a malicious party changes this token. For example, if you are a developer, you can easily change the returned token string with Fiddler or other developer tools, and you might be able to log in to critical corporate applications under another user's identity.

The digital signature (the third token in the access token string) protects against this kind of attack.

The digital signature is generated using the private key of the Microsoft identity provider (Azure AD, etc.), and you can verify it using the public key, which everyone can access. Moreover, this digital signature is derived from the string {1st delimited token string}.{2nd delimited token string}.
That is, if you change the claims (the 2nd token string) in the access token, the digital signature must also be regenerated, and only the Microsoft identity provider can create this signature. (A malicious user cannot.)

So all you have to do is check whether this digital signature is valid using the public key. Let's see how to do that.

First you can get the public key information from https://{issuer url}/.well-known/openid-configuration. (The issuer URL is equal to the "iss" value in the claims.) In this case, you can get it from the following URL.

GET https://login.microsoftonline.com/3bc5ea6c-9286-4ca9-8c1a-1b2c4f013f15/v2.0/.well-known/openid-configuration
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

{
  "authorization_endpoint": "https://login.microsoftonline.com/3bc5ea6c-9286-4ca9-8c1a-1b2c4f013f15/oauth2/v2.0/authorize",
  "token_endpoint": "https://login.microsoftonline.com/3bc5ea6c-9286-4ca9-8c1a-1b2c4f013f15/oauth2/v2.0/token",
  "token_endpoint_auth_methods_supported": [
    "client_secret_post",
    "private_key_jwt"
  ],
  "jwks_uri": "https://login.microsoftonline.com/3bc5ea6c-9286-4ca9-8c1a-1b2c4f013f15/discovery/v2.0/keys",
  "response_modes_supported": [
    "query",
    "fragment",
    "form_post"
  ],
  "subject_types_supported": [
    "pairwise"
  ],
  "id_token_signing_alg_values_supported": [
    "RS256"
  ],
  "http_logout_supported": true,
  "frontchannel_logout_supported": true,
  "end_session_endpoint": "https://login.microsoftonline.com/3bc5ea6c-9286-4ca9-8c1a-1b2c4f013f15/oauth2/v2.0/logout",
  "response_types_supported": [
    "code",
    "id_token",
    "code id_token",
    "id_token token"
  ],
  "scopes_supported": [
    "openid",
    "profile",
    "email",
    "offline_access"
  ],
  "issuer": "https://login.microsoftonline.com/3bc5ea6c-9286-4ca9-8c1a-1b2c4f013f15/v2.0",
  "claims_supported": [
    "sub",
    "iss",
    "cloud_instance_name",
    "cloud_graph_host_name",
    "aud",
    "exp",
    "iat",
    "auth_time",
    "acr",
    "nonce",
    "preferred_username",
    "name",
    "tid",
    "ver",
    "at_hash",
    "c_hash",
    "email"
  ],
  "request_uri_parameter_supported": false,
  "tenant_region_scope": "NA",
  "cloud_instance_name": "microsoftonline.com",
  "cloud_graph_host_name": "graph.windows.net"
}

Next you access the location in the "jwks_uri" property (see above) to get the public key list from that location. Finally you find the appropriate key by matching the "kid" (key id).

Here I show you the complete PHP code.

<?php
echo "The result is " . token_test("eyJ0eXAiOi...");

// return 1, if token is valid
// return 0, if token is invalid
function token_test($token) {
  // 1 create array from token separated by dot (.)
  $token_arr = explode('.', $token);
  $header_enc = $token_arr[0];
  $claim_enc = $token_arr[1];
  $sig_enc = $token_arr[2];

  // 2 base 64 url decoding
  $header =
    json_decode(base64_url_decode($header_enc), TRUE);
  $claim =
    json_decode(base64_url_decode($claim_enc), TRUE);
  $sig = base64_url_decode($sig_enc);

  // 3 period check
  $dtnow = time();
  if($dtnow <= $claim['nbf'] or $dtnow >= $claim['exp'])
    return 0;

  // 4 audience check
  if (strcmp($claim['aud'], '8a9c6678-7194-43b0-9409-a3a10c3a9800') !== 0)
    return 0;

  // 5 scope check
  if (strcmp($claim['scp'], 'read') !== 0)
    return 0;

  // other checks if needed (licensed tenant, etc)
  // Here, we skip these steps ...

  //
  // 6 check signature
  //

  // 6-a get key list
  $keylist =
    file_get_contents('https://login.microsoftonline.com/3bc5ea6c-9286-4ca9-8c1a-1b2c4f013f15/discovery/v2.0/keys');
  $keylist_arr = json_decode($keylist, TRUE);
  foreach($keylist_arr['keys'] as $key => $value) {
    
    // 6-b select one key
    if($value['kid'] == $header['kid']) {
      
      // 6-c get public key from key info
      $cert_txt = '-----BEGIN CERTIFICATE-----' . "\n" . chunk_split($value['x5c'][0], 64) . '-----END CERTIFICATE-----';
      $cert_obj = openssl_x509_read($cert_txt);
      $pkey_obj = openssl_pkey_get_public($cert_obj);
      $pkey_arr = openssl_pkey_get_details($pkey_obj);
      $pkey_txt = $pkey_arr['key'];
      
      // 6-d validate signature
      $token_valid =
        openssl_verify($header_enc . '.' . $claim_enc, $sig, $pkey_txt, OPENSSL_ALGO_SHA256);
      if($token_valid == 1)
        return 1;
      else
        return 0;      
    }
  }
  
  return 0;
}

function base64_url_decode($arg) {
  $res = $arg;
  $res = str_replace('-', '+', $res);
  $res = str_replace('_', '/', $res);
  switch (strlen($res) % 4) {
    case 0:
      break;
    case 2:
      $res .= "==";
      break;
    case 3:
      $res .= "=";
      break;
    default:
      break;
  }
  $res = base64_decode($res);
  return $res;
}
?>

Calling another service in turn (OAuth On-Behalf-Of Flow)

As you saw above, the access token is issued for one specific API (the "aud"), and you cannot reuse the token for another API.
What if your custom Web API needs to call another API (for example, the Microsoft Graph API)?

In such a case, your API can exchange it for another token with the OAuth on-behalf-of flow as follows, with no need to display the login UI again.
In this example, our custom Web API connects to the Microsoft Graph API and gets the e-mail messages of the logged-in user.

Note : A long time ago I explained this on-behalf-of flow with the Azure AD v1 endpoint in another blog post, but here I explain it with the v2.0 endpoint, because it's a little tricky…

First, as the official document says (see here), you need to use the tenant-aware endpoint when you use the on-behalf-of flow with the v2.0 endpoint. That is, administrator consent (admin consent) is needed before requesting the on-behalf-of flow. (In this case, the user consent for the custom Web API done in the previous section of this post is not needed.)

Before proceeding with the admin consent, you must add the delegated permissions for your custom Web API in the Application Registration Portal. In this example, we add the Mail.Read permission as follows. (When you use admin consent, you cannot add scopes on the fly; you must set the permissions beforehand.)

Next, the administrator of the user's tenant must access the following URL in a web browser for administrator consent.
Note that xxxxx.onmicrosoft.com can also be the tenant ID (the GUID retrieved as "tid" in the previous claims). 8a9c6678-7194-43b0-9409-a3a10c3a9800 is the application ID of the custom Web API and https://localhost/testapi is the redirect URL of the custom Web API.

https://login.microsoftonline.com/xxxxx.onmicrosoft.com/adminconsent
  ?client_id=8a9c6678-7194-43b0-9409-a3a10c3a9800
  &state=12345
  &redirect_uri=https%3A%2F%2Flocalhost%2Ftestapi

After the tenant administrator logs in, the following consent is displayed. When the administrator approves this consent, your custom Web API is registered in the tenant. As a result, all users in this tenant can use this custom Web API and its custom scopes.

Note : You can revoke the admin-consented application in your tenant with the Azure Portal. (Of course, administrator privileges are needed for this operation.)

Now you are ready for the OAuth on-behalf-of flow in the v2.0 endpoint!

First the user (a non-administrator) gets the access token for the custom Web API and calls the custom Web API with this access token. This flow is the same as above, so I skip the steps here.

Then the custom Web API can send the following HTTP POST to the Azure AD v2.0 endpoint using the passed access token. (Note that eyJ0eXAiOi... is the access token passed to this custom Web API, 8a9c6678-7194-43b0-9409-a3a10c3a9800 is the application ID of your custom Web API, and itS... is the application password of your custom Web API.)
This POST requests a new access token for https://graph.microsoft.com/mail.read (a pre-defined scope).

POST https://login.microsoftonline.com/xxxxx.onmicrosoft.com/oauth2/v2.0/token
Content-Type: application/x-www-form-urlencoded

grant_type=urn%3Aietf%3Aparams%3Aoauth%3Agrant-type%3Ajwt-bearer
&assertion=eyJ0eXAiOi...
&requested_token_use=on_behalf_of
&scope=https%3A%2F%2Fgraph.microsoft.com%2Fmail.read
&client_id=8a9c6678-7194-43b0-9409-a3a10c3a9800
&client_secret=itS...

The following is the HTTP response to this on-behalf-of request. As you can see, your custom Web API obtains a new access token.

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

{
  "token_type": "Bearer",
  "scope": "https://graph.microsoft.com/Mail.Read https://graph.microsoft.com/User.Read",
  "expires_in": 3511,
  "ext_expires_in": 0,
  "access_token": "eyJ0eXAiOi..."
}

The returned access token has the scope for Mail.Read (https://graph.microsoft.com/mail.read), and it is not an application-only token but a user token for the logged-in user, even though it was obtained by the backend (server-to-server) without user interaction. (Please parse and decode this access token as I described above.)

Therefore, when your custom Web API connects to the Microsoft Graph endpoint with this access token, the logged-in user's e-mail messages are returned to your custom Web API.

GET https://graph.microsoft.com/v1.0/me/messages
  ?$orderby=receivedDateTime%20desc
  &$select=subject,receivedDateTime,from
  &$top=20
Accept: application/json
Authorization: Bearer eyJ0eXAiOi...

 

[Reference] App types for the Azure Active Directory v2.0 endpoint
https://docs.microsoft.com/en-us/azure/active-directory/develop/active-directory-v2-flows

 

Analyze your data in Azure Data Lake with R (R extension)

Azure Data Lake (ADL), which offers unlimited data storage, is a reasonable (cost-effective) choice for simple batch-based analysis.
Remember, the data is more critical than the program! When analyzing data in your Azure Data Lake Store, you don't need to move or download the data to a remote host. You can run your Python or R code on Azure Data Lake Analytics, hosted in the cloud.

Here I show you how to use the R extension with some brief examples based on real scenarios.

Note : In this post we consider a simple batch-based scenario. If you need more advanced scenarios with the data in your ADL store, please use the ADL store with Hadoop (HDInsight) together with R Server, Spark, Storm, etc.
See "Benefits of Microsoft R and R Server" in my previous post for more details.

Note : U-SQL development with Python and R is also supported in Visual Studio Code. See “ADL Tools for Visual Studio Code (VSCode) supports Python & R Programming” in team blog. (Added on Nov 2017)

Setting-up

Before starting, you must prepare your Azure Data Lake Store (ADLS) and Azure Data Lake Analytics (ADLA) with the Azure Portal. (Please see the Azure documentation.)

Next, on your Azure Data Lake Analytics blade, click [Sample Scripts] and select [Install U-SQL Extensions]. (See the following screenshot.)
This starts the installation of the extensions in your Data Lake Analytics (ADLA) account.

Let’s see what kind of installation was made.

After the installation is completed, please click [Success] and the [Duplicate Script] button. (The installation is executed as a Data Lake job.)

As you know, Data Lake Analytics is a .NET-based platform, and you can extend it using your own custom .NET classes.

The R extension is the same. As you can see in this script job (see below), the R extension classes (ExtR.dll) are installed in your Data Lake Analytics account. (Note that the Python extensions and the cognitive services extensions are also installed.)
As I show you later, you can use these installed classes in your U-SQL scripts.

Note : You can see these installed DLLs in the /usqlext/assembly folder of your ADLS (Data Lake Store).

Let’s get started !

Now it’s ready.

You can find a lot of examples in /usqlext/samples/R on your ADLS. (These are the famous iris classification examples.) You can run these U-SQL files (.usql files) right away with the Azure Portal, Visual Studio, or Visual Studio Code (if using a Mac), and see the results and how they work. (Here we use Visual Studio.)

For instance, the following retrieves the data in iris.csv and fits a linear regression for the prediction target "Species". (Sorry, this sample is not very useful by itself because it just returns the base64 encoded trained model; I show you a more complete example later…)

The R extension (ExtR.dll) includes a custom reducer (.NET class) named Extension.R.Reducer, so you can use this extension class in the U-SQL REDUCE expression as follows.

REFERENCE ASSEMBLY [ExtR]; // Load library

DECLARE @IrisData string = @"/usqlext/samples/R/iris.csv";
DECLARE @OutputFilePredictions string = @"/my/R/Output/test.txt";
DECLARE @myRScript = @"
inputFromUSQL$Species <- as.factor(inputFromUSQL$Species)
lm.fit <- lm(unclass(Species)~.-Par, data=inputFromUSQL)
library(base64enc)
outputToUSQL <-
  data.frame(
    Model=base64encode(serialize(lm.fit, NULL)),
    stringsAsFactors = FALSE)
";

@InputData =
  EXTRACT SepalLength double,
    SepalWidth double,
    PetalLength double,
    PetalWidth double,
    Species string
  FROM @IrisData
  USING Extractors.Csv();

@ExtendedData =
  SELECT 0 AS Par,
       *
  FROM @InputData;

@RScriptOutput = REDUCE @ExtendedData ON Par
  PRODUCE Par int, Model string
  READONLY Par
  USING new Extension.R.Reducer(
    command:@myRScript,
    rReturnType:"dataframe",
    stringsAsFactors:false);

OUTPUT @RScriptOutput TO @OutputFilePredictions
  USING Outputters.Tsv();

As you can see in this sample code, you can use inputFromUSQL to retrieve the input data in your R script, and outputToUSQL to return the result to U-SQL. That is, your R script communicates with the U-SQL script through these pre-defined variables.

Instead of using outputToUSQL, you can just let the last expression of the R script be the output. For instance, you can rewrite the above example as follows. (Compare the commented-out outputToUSQL assignment with the final data.frame expression.)

REFERENCE ASSEMBLY [ExtR]; // Load library

DECLARE @IrisData string = @"/usqlext/samples/R/iris.csv";
DECLARE @OutputFilePredictions string = @"/my/R/Output/test.txt";
DECLARE @myRScript = @"
inputFromUSQL$Species <- as.factor(inputFromUSQL$Species)
lm.fit <- lm(unclass(Species)~.-Par, data=inputFromUSQL)
library(base64enc)
#outputToUSQL <-
#  data.frame(
#    Model=base64encode(serialize(lm.fit, NULL)),
#    stringsAsFactors = FALSE)
data.frame(
  Model=base64encode(serialize(lm.fit, NULL)),
  stringsAsFactors = FALSE)
";

@InputData =
  EXTRACT SepalLength double,
    SepalWidth double,
    PetalLength double,
    PetalWidth double,
    Species string
  FROM @IrisData
  USING Extractors.Csv();

@ExtendedData =
  SELECT 0 AS Par,
       *
  FROM @InputData;

@RScriptOutput = REDUCE @ExtendedData ON Par
  PRODUCE Par int, Model string
  READONLY Par
  USING new Extension.R.Reducer(
    command:@myRScript,
    rReturnType:"dataframe",
    stringsAsFactors:false);

OUTPUT @RScriptOutput TO @OutputFilePredictions
  USING Outputters.Tsv();

We used an inline R script in the above examples, but you can also separate the R script from your U-SQL script as follows. (See the DEPLOY RESOURCE statement and the scriptFile parameter.)

REFERENCE ASSEMBLY [ExtR]; // Load library

DEPLOY RESOURCE @"/usqlext/samples/R/testscript01.R";

DECLARE @IrisData string = @"/usqlext/samples/R/iris.csv";
DECLARE @OutputFilePredictions string = @"/my/R/Output/test.txt";

@InputData =
  EXTRACT SepalLength double,
    SepalWidth double,
    PetalLength double,
    PetalWidth double,
    Species string
  FROM @IrisData
  USING Extractors.Csv();

@ExtendedData =
  SELECT 0 AS Par,
       *
  FROM @InputData;

@RScriptOutput = REDUCE @ExtendedData ON Par
  PRODUCE Par int, Model string
  READONLY Par
  USING new Extension.R.Reducer(
    scriptFile:"testscript01.R",
    rReturnType:"dataframe",
    stringsAsFactors:false);

OUTPUT @RScriptOutput TO @OutputFilePredictions
  USING Outputters.Tsv();

Partitioning

By using the REDUCE expression, you can split your analysis workload into partitions. Each partition can be executed in parallel, so you can efficiently process a massive amount of data with this partitioning capability.

To keep things simple, let's consider the following sample data. Here we use the first column as the partition key.

test01.csv

1,1
1,2
1,3
1,4
2,5
2,6
2,7
2,8
3,9
3,10
3,11
3,12

The following is a brief example that calculates the min, max, and mean for each partition.

REFERENCE ASSEMBLY [ExtR];

DECLARE @SrcFile string = @"/sampledat/test01.csv";
DECLARE @DstFile string = @"/sampledat/output01.txt";
DECLARE @myRScript = @"
outputToUSQL <- data.frame(
  CalcType = c(""min"", ""max"", ""mean""),
  CalcValue = c(
    min(inputFromUSQL$Value),
    max(inputFromUSQL$Value),
    mean(inputFromUSQL$Value)
  )
)
";

@ExtendedData =
  EXTRACT PartitionId int,
      Value int
  FROM @SrcFile
  USING Extractors.Csv();

@RScriptOutput = REDUCE @ExtendedData ON PartitionId
  PRODUCE PartitionId int, CalcType string, CalcValue double
  READONLY PartitionId
  USING new Extension.R.Reducer(
    command:@myRScript,
    rReturnType:"dataframe",
    stringsAsFactors:false);

OUTPUT @RScriptOutput TO @DstFile
  USING Outputters.Tsv();

The following screenshot shows the result of this U-SQL.
Each partition is executed independently in parallel, and all results are collected by the REDUCE operation.

Note that you have to specify ON {partition keys} or ALL when you're using the REDUCE clause. (You cannot skip ON / ALL.)
So if you don't need partitioning, specify a pseudo partition (the same partition for all rows) as in the following script.

REFERENCE ASSEMBLY [ExtR];

DECLARE @SrcFile string = @"/sampledat/test01.csv";
DECLARE @DstFile string = @"/sampledat/output01.txt";
DECLARE @myRScript = @"
outputToUSQL <- data.frame(
  CalcType = c(""min"", ""max"", ""mean""),
  CalcValue = c(
    min(inputFromUSQL$Value),
    max(inputFromUSQL$Value),
    mean(inputFromUSQL$Value)
  )
)
";

@ExtendedData =
  EXTRACT SomeId int,
      Value int
  FROM @SrcFile
  USING Extractors.Csv();

@ExtendedData2 =
  SELECT 0 AS Par, // pseudo partition
       *
  FROM @ExtendedData;

@RScriptOutput = REDUCE @ExtendedData2 ON Par
  PRODUCE Par int, CalcType string, CalcValue double
  READONLY Par
  USING new Extension.R.Reducer(
    command:@myRScript,
    rReturnType:"dataframe",
    stringsAsFactors:false);

OUTPUT @RScriptOutput TO @DstFile
  USING Outputters.Tsv();

Installing packages

There is a set of packages supported by default in the R extension, but you can install extra packages if needed. (See here for the default packages of the R extension; they also include the RevoScaleR package.)

First you download the package file (.zip, .tar.gz, etc.) using your local R console. Here we download the well-known SVM package "e1071". (We assume the file name is e1071_1.6-8.tar.gz.)

download.packages("e1071", destdir="C:\\tmp")

Next you upload this package file to a folder in your ADLS (Data Lake Store).
After that, you can reference this package file in your U-SQL and install the package in your R script as follows.

REFERENCE ASSEMBLY [ExtR];

DEPLOY RESOURCE @"/sampledat/e1071_1.6-8.tar.gz";

DECLARE @SrcFile string = @"/sampledat/iris.csv";
DECLARE @DstFile string = @"/sampledat/output03.txt";
DECLARE @myRScript = @"
install.packages('e1071_1.6-8.tar.gz', repos = NULL) # installing package
library(e1071) # loading package
# something to analyze !
# (Later we'll create the code here ...)
data.frame(Res = c(""result1"", ""result2""))
";

@InputData =
  EXTRACT SepalLength double,
    SepalWidth double,
    PetalLength double,
    PetalWidth double,
    Species string
  FROM @SrcFile
  USING Extractors.Csv();

@ExtendedData =
  SELECT 0 AS Par,
       *
  FROM @InputData;

@RScriptOutput = REDUCE @ExtendedData ON Par
  PRODUCE Par int, Res string
  READONLY Par
  USING new Extension.R.Reducer(
    command:@myRScript,
    rReturnType:"dataframe",
    stringsAsFactors:false);

OUTPUT @RScriptOutput TO @DstFile
  USING Outputters.Tsv();

Loading R data

In a real scenario, you might use a pre-trained model for predictions. In such a case, you can create the trained model (R objects) beforehand and load these R objects in your R script in U-SQL.

First you create the trained model using the following script in your local environment. The file "model.rda" will be saved in your local file system.
(Here we're using a script for saving, but you can also use the RStudio IDE.)

library(e1071)
inputCSV <- read.csv(
  file = "C:\\tmp\\iris_train.csv",
  col.names = c(
    "SepalLength",
    "SepalWidth",
    "PetalLength",
    "PetalWidth",
    "Species")
)
mymodel <- svm(
  Species~.,
  data=inputCSV,
  probability = T)
save(mymodel, file = "C:\\tmp\\model.rda")

Note that we assume our training data (iris data) is as follows. (It's the same as the U-SQL extension sample file…):

iris_train.csv

5.1,3.5,1.4,0.2,setosa
7,3.2,4.7,1.4,versicolor
6.3,3.3,6,2.5,virginica
4.9,3,1.4,0.2,setosa
...

Then you upload this generated model (the model.rda file) to a folder in your ADLS (Data Lake Store).

Now it's ready; let's jump into the U-SQL.

See the following R script in U-SQL.
This R script loads the previous pre-trained model (model.rda), so you can use the pre-trained R object "mymodel" in your R script.
All you have to do is predict your input data with this model object.

REFERENCE ASSEMBLY [ExtR];

DEPLOY RESOURCE @"/sampledat/e1071_1.6-8.tar.gz";
DEPLOY RESOURCE @"/sampledat/model.rda";

DECLARE @SrcFile string = @"/sampledat/iris.csv";
DECLARE @DstFile string = @"/sampledat/output03.txt";
DECLARE @myRScript = @"
install.packages('e1071_1.6-8.tar.gz', repos = NULL)
library(e1071)
load(""model.rda"")
pred <- predict(
  mymodel,
  inputFromUSQL,
  probability = T)
prob <- attr(pred, ""probabilities"")
result <- data.frame(prob, stringsAsFactors = FALSE)
result$answer <- inputFromUSQL$Species
outputToUSQL <- result
";

@InputData =
  EXTRACT SepalLength double,
    SepalWidth double,
    PetalLength double,
    PetalWidth double,
    Species string
  FROM @SrcFile
  USING Extractors.Csv();

@ExtendedData =
  SELECT 0 AS Par,
       *
  FROM @InputData;

@RScriptOutput = REDUCE @ExtendedData ON Par
  PRODUCE Par int, setosa double, versicolor double, virginica double, answer string
  READONLY Par
  USING new Extension.R.Reducer(
    command:@myRScript,
    rReturnType:"dataframe",
    stringsAsFactors:false);

OUTPUT @RScriptOutput TO @DstFile
  USING Outputters.Tsv();

[Reference] Tutorial: Get started with extending U-SQL with R
https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-r-extensions

 

How to use Azure REST API with Certificate programmatically (without interactive Login UI)

This post is just a tip for Azure programming. (It's one of the questions frequently asked by developers.)

A while ago, I introduced how to use the Azure REST API (in Resource Manager mode) and the role model called RBAC (Role Based Access Control). (See my old post "Manage your Azure resource with REST API (ARM)".)
In the usual case, the API is called with an HTTP header containing an Azure AD user token, which is retrieved through the OAuth flow and the user's interactive login.

But what if your application is some kind of backend process, like a daemon?

In this post I explain this scenario as follows.

Note : If you want to run some kind of automation job (scheduled or triggered) against Azure resources, you can register your job (scripts) in Azure Automation. Of course, you don't need to log in through an interactive UI to run an Automation job.
Here we assume that we access Azure resources from outside of Azure.

What are the concerns?

In terms of the AuthN/AuthZ flow in Azure Active Directory (Azure AD), you can use application permissions to access an API protected by Azure AD from a backend service like a daemon. For more details about application permissions, see "Develop your backend server-side application (Daemon, etc) protected by Azure AD without login UI" (post in Japanese).

For example, if you want to create an application that syncs all users' information (in some specific tenant) in the background periodically, you set the application permission as in the following screenshot.

If you want to sync all users’ calendar data in some tenant in the background, select the following application permission. (See the following screenshot.)

But unfortunately there's no application permission for the Azure Management API! Look at the following screenshot.
So, what should we do?

The answer is: use a service principal and RBAC (not application permissions).

What is role based access control (RBAC)? As we saw in my old post "Manage your Azure resource with REST API (ARM)", RBAC is used for Azure resource access permissions in the new Resource Manager (ARM) mode.
A "role" (e.g., Reader, Contributor, Backup Operator, etc.) is a set of Azure permissions, and you can assign a role to users or groups. That is, RBAC provides granular permissions for Azure resources.
A role assignment can be set not only on each resource (storage, virtual machine, network, etc.) but also at the subscription level or resource group level. If a role is assigned at the resource group level, the user can access all the resources in that resource group. (That is, the assignment is inherited through your subscription, resource groups, and individual resources.)
What you should remember is that you can assign a role to a service principal, not only to users or groups!

Let's see how to configure and use this combination (a service principal together with RBAC).

Step1 – Register service principal

Here I describe how to configure the service principal.

First, add a new service principal in your Azure AD tenant as follows.
Log in to the Azure Portal (https://portal.azure.com), click "Azure Active Directory" in the left navigation, and click the "App registrations" menu.

Press the "Add" button, and register the new app (i.e., the service principal) by pushing the "Create" button.

The service principal is now generated.
Finally, please copy the application ID and App ID URI of this service principal (app) in the Azure Portal. (We use these values later.)

Step2 – Set Azure permissions with service principal in RBAC

Next, please assign a role to this service principal in the Azure Portal.
The procedure is as follows. In this post, I grant read permission for all the resources in a specific resource group. (We assign the "Reader" role.)

  1. View the resource group in Azure Portal.
  2. Press “Access control (IAM)”
  3. Press “Add” button.
    Select "Reader" as the role and select the previous service principal as a member.
  4. Press “Save” button.
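If you prefer scripting to the portal, the same assignment can also be done with the AzureRM PowerShell module, roughly as follows. (This is just a sketch; the resource group name is a placeholder, and the application ID is the service principal's application ID used later in this post.)

Login-AzureRmAccount
New-AzureRmRoleAssignment `
  -ServicePrincipalName "1c13ff57-672e-4e76-a71d-2a7a75890609" `
  -RoleDefinitionName "Reader" `
  -ResourceGroupName "myResourceGroup"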

Step3A – Retrieve access token with app key

You can retrieve the access token (which is needed for calling the Azure REST API) with either an app key (app password, client secret) or a certificate.
First we explain the app key case.

To create the app key for your service principal, just create a key for the service principal in the Azure Portal. Select [Settings] and [Keys] in the Azure AD application management blade, and you can generate the key (password string) for your service principal.

To retrieve the access token, you just issue the following HTTP request. (The value of client_secret is the previously generated app key.)
Note that you must use the tenant-aware URL; you cannot use "https://login.microsoftonline.com/common/oauth2/token" here.

HTTP Request (Get access token)

POST https://login.microsoftonline.com/yourtenant.onmicrosoft.com/oauth2/token
Accept: application/json
Content-Type: application/x-www-form-urlencoded

resource=https%3A%2F%2Fmanagement.core.windows.net%2F
&client_id=1c13ff57-672e-4e76-a71d-2a7a75890609
&client_secret=9zVm9Li1...
&grant_type=client_credentials

HTTP Response (Get access token)

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

{
  "token_type": "Bearer",
  "expires_in": "3600",
  "ext_expires_in": "10800",
  "expires_on": "1488429872",
  "not_before": "1488425972",
  "resource": "https://management.core.windows.net/",
  "access_token": "eyJ0eXAiOi..."
}

The returned value of the access_token attribute is the access token for your Azure REST API calls; we use this token value below.
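For example, with this token you can call the Azure Resource Manager REST API roughly as follows. (This is only an illustrative request; replace {subscription id} with your own value, and the api-version may differ depending on when you read this.)

GET https://management.azure.com/subscriptions/{subscription id}/resourceGroups?api-version=2017-05-10
Accept: application/json
Authorization: Bearer eyJ0eXAiOi...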

Step3B – Retrieve access token with certificate

If you want to use a certificate (instead of an app key), you must first create and register the certificate with this service principal.
Here we create a self-signed certificate for demo purposes with the following commands. (We use the makecert utility in the Windows SDK.)
As a result, the files "server.pvk", "server.cer", and "server.pfx" are generated.

rem -- create self signed CA cert (CA.pvk, CA.cer)
makecert -r -pe -n "CN=My Root Authority" -ss CA -sr CurrentUser -a sha1 -sky signature -cy authority -sv CA.pvk CA.cer

rem -- create self signed server cert (server.pvk, server.cer, server.pfx)
makecert -pe -n "CN=DemoApp" -a sha1 -sky Exchange -eku 1.3.6.1.5.5.7.3.1 -ic CA.cer -iv CA.pvk -sp "Microsoft RSA SChannel Cryptographic Provider" -sy 12 -sv server.pvk server.cer
pvk2pfx -pvk server.pvk -spc server.cer -pfx server.pfx -pi {password}

The generated "server.cer" file includes the public key. You can get the encoded raw data and the base64 encoded thumbprint with the following PowerShell commands; please copy these strings. The results are written to raw.txt and thumbprint.txt.

Note that the base64 encoded thumbprint is not the familiar hexadecimal thumbprint string. If you have the hexadecimal thumbprint, you can get this string by converting it to binary and encoding it with base64. (You can use an online site like here for this encoding…)
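If you want to do this conversion locally instead, the following PowerShell sketch converts a hexadecimal thumbprint (a made-up value here) to the base64 encoded form.

# hypothetical hexadecimal thumbprint
$hex = "A909C3D46B12F3E8A1B2C3D4E5F60718293A4B5C"
# convert each pair of hex digits to a byte
$bytes = for ($i = 0; $i -lt $hex.Length; $i += 2) {
  [Convert]::ToByte($hex.Substring($i, 2), 16)
}
# base64 encode the binary thumbprint
[Convert]::ToBase64String([byte[]]$bytes)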

$cer =
  New-Object System.Security.Cryptography.X509Certificates.X509Certificate2
$cer.Import("C:\Demo\test\server.cer")

# Get encoded raw
$raw = $cer.GetRawCertData()
$rawtxt = [System.Convert]::ToBase64String($raw)
echo $rawtxt > raw.txt

# Get thumbprint
$hash = $cer.GetCertHash()
$thumbprint = [System.Convert]::ToBase64String($hash)
echo $thumbprint > thumbprint.txt

Note : You can also generate a key pair and .pfx with openssl as follows.
openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -days 365 -config "C:\Program Files (x86)\Git\ssl\openssl.cnf"
openssl pkcs12 -export -out server.pfx -inkey key.pem -in cert.pem
You can also get the hexadecimal thumbprint as follows, and convert it to the base64 encoded thumbprint.
openssl x509 -in cert.pem -fingerprint -noout

In the Azure Portal, view the previously registered app in Azure AD and press the "Manifest" button. (The application manifest is shown in the editor.)

In the manifest editor, please add the following keyCredentials entry to the manifest. Here customKeyIdentifier is the base64 encoded thumbprint of the certificate (retrieved by the previous command), and value is the raw certificate data. Moreover, you must create a unique GUID for the keyId below.

After that, please press the “Save” button.

{
  "appId": "1c13ff57-672e-4e76-a71d-2a7a75890609",
  "appRoles": [],
  "availableToOtherTenants": false,
  "displayName": "test01",
  "errorUrl": null,
  "groupMembershipClaims": null,
  "homepage": "https://localhost/test01",
  "identifierUris": [
    "https://microsoft.onmicrosoft.com/b71..."
  ],
  "keyCredentials": [
    {
      "customKeyIdentifier": "CTTz0wG...",
      "keyId": "9de40fd2-9559-4b52-b075-04ab17227411",
      "type": "AsymmetricX509Cert",
      "usage": "Verify",
      "value": "MIIDFjC..."
    }
  ],
  "knownClientApplications": [],
  "logoutUrl": null,
  ...

}

Now let's start building your backend (daemon) application.
Before connecting to Azure resources using the REST API, your program must obtain the access token for the REST API calls.

First you must create an RS256 digital signature using the previous private key (server.pfx). The input string (payload) for this signature must be the base64 url encoded strings of the following 2 tokens, delimited by a dot ( . ) character. That is, if the base64 url encoded string of {"alg":"RS256","x5t":"CTTz0..."} (1st token) is eyJhbGciOi*** (it will be a long string, which we omit here) and the base64 url encoded string of {"aud":"https:...","exp":1488426871,"iss":"1c13f...",...} (2nd token) is eyJhdWQiOi***, then the input string must be eyJhbGciOi***.eyJhdWQiOi***. You create the digital signature over this input string with the previously generated key.

  • x5t is the base64 encoded certificate thumbprint retrieved earlier with the PowerShell command.
  • nbf ("not before") is the start time (epoch time) of the token's validity, and exp is the expiration time.
  • iss and sub are your application id (client id).
  • jti is an arbitrary GUID used to protect against replay attacks.

1st token

{
  "alg": "RS256",
  "x5t": "CTTz0wGaBvl1qhHEmVdw04vExqw"
}

2nd token

{
  "aud": "https://login.microsoftonline.com/microsoft.onmicrosoft.com/oauth2/token",
  "exp": 1488426871,
  "iss": "1c13ff57-672e-4e76-a71d-2a7a75890609",
  "jti": "e86b1f2b-b001-4630-86f5-5f953aeec694",
  "nbf": 1488426271,
  "sub": "1c13ff57-672e-4e76-a71d-2a7a75890609"
}

To create the RS256 digital signature for that input string (payload), you can use existing libraries. For example, PHP programmers can use openssl_sign (which needs a PEM format private key), and JavaScript programmers might use jsjws. For C# (.NET), I'll show the complete code later in this chapter.
For more details about this format, please see "Build your API protected by Azure AD (How to verify access token)" (sorry, it's written in Japanese).

After you've created the signature, you can assemble the client assertion as follows. This is the standardized JWT format (see RFC 7519).

{base64 url encoded string of 1st token}.{base64 url encoded string of 2nd token}.{base64 url encoded string of generated signature}
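
The following is a rough R sketch of assembling this client assertion (my own example, not the C# code mentioned above), assuming the jose, openssl, and jsonlite packages and the private key exported from server.pfx to a hypothetical PEM file "server.key.pem".

library(jose)       # base64url_encode()
library(openssl)    # read_key(), signature_create(), sha256
library(jsonlite)   # toJSON()

# 1st token (header) and 2nd token (claims), as shown above
header <- toJSON(list(alg = "RS256",
                      x5t = "CTTz0wGaBvl1qhHEmVdw04vExqw"), auto_unbox = TRUE)
claims <- toJSON(list(
  aud = "https://login.microsoftonline.com/microsoft.onmicrosoft.com/oauth2/token",
  exp = 1488426871,
  iss = "1c13ff57-672e-4e76-a71d-2a7a75890609",
  jti = "e86b1f2b-b001-4630-86f5-5f953aeec694",
  nbf = 1488426271,
  sub = "1c13ff57-672e-4e76-a71d-2a7a75890609"), auto_unbox = TRUE)

# {base64 url encoded 1st token}.{base64 url encoded 2nd token}
payload <- paste(base64url_encode(charToRaw(as.character(header))),
                 base64url_encode(charToRaw(as.character(claims))), sep = ".")

# RS256 signature over the payload with the private key
key <- read_key("server.key.pem")
sig <- signature_create(charToRaw(payload), hash = sha256, key = key)

# client assertion = payload + "." + base64 url encoded signature
assertion <- paste(payload, base64url_encode(sig), sep = ".")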

Finally you can get the access token for the Azure REST API with the following HTTP request.
The attributes are:

  • The url fragment yourtenant.onmicrosoft.com is your tenant domain. In this application flow, you cannot use https://login.microsoftonline.com/common/oauth2/token instead.
  • https://management.core.windows.net/ is the resource id of the Azure REST API (a fixed value). The value of resource must be a url-encoded string.
  • 1c13ff57-672e-4e76-a71d-2a7a75890609 is the application id of your service principal.
  • eyJhbGciOi... is the client assertion which is previously created. (See above)

HTTP Request (Get access token)

POST https://login.microsoftonline.com/yourtenant.onmicrosoft.com/oauth2/token
Accept: application/json
Content-Type: application/x-www-form-urlencoded

resource=https%3A%2F%2Fmanagement.core.windows.net%2F
&client_id=1c13ff57-672e-4e76-a71d-2a7a75890609
&client_assertion_type=urn%3Aietf%3Aparams%3Aoauth%3Aclient-assertion-type%3Ajwt-bearer
&client_assertion=eyJhbGciOi...
&grant_type=client_credentials

The client assertion is verified against the stored raw key (public key), and if verification succeeds, the result is returned successfully as follows.

HTTP Response (Get access token)

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

{
  "token_type": "Bearer",
  "expires_in": "3600",
  "ext_expires_in": "10800",
  "expires_on": "1488429872",
  "not_before": "1488425972",
  "resource": "https://management.core.windows.net/",
  "access_token": "eyJ0eXAiOi..."
}

The returned value of access_token attribute is the access token for your Azure REST API calls.
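
The same request can be issued from R with httr as a quick check (a sketch; "assertion" is the client assertion built earlier).

library(httr)

res <- POST(
  "https://login.microsoftonline.com/yourtenant.onmicrosoft.com/oauth2/token",
  body = list(
    resource              = "https://management.core.windows.net/",
    client_id             = "1c13ff57-672e-4e76-a71d-2a7a75890609",
    client_assertion_type = "urn:ietf:params:oauth:client-assertion-type:jwt-bearer",
    client_assertion      = assertion,
    grant_type            = "client_credentials"),
  encode = "form")
token <- content(res)$access_token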

Above, we saw how to get the access token with raw HTTP requests.
But if you're using C# (.NET), you can get the access token with just a few lines of code using ADAL (the Microsoft.IdentityModel.Clients.ActiveDirectory library), as follows.

...
using Microsoft.IdentityModel.Clients.ActiveDirectory;
using System.Security.Cryptography.X509Certificates;
...

AuthenticationContext ctx = new AuthenticationContext(
  "https://login.microsoftonline.com/yourtenant.onmicrosoft.com/oauth2/authorize",
  false);
X509Certificate2 cert = new X509Certificate2(
  @"C:\tmp\server.pfx",
  "P@ssw0rd", // password of private key
  X509KeyStorageFlags.MachineKeySet);
ClientAssertionCertificate ast = new ClientAssertionCertificate(
  "1c13ff57-672e-4e76-a71d-2a7a75890609", // application id
  cert);
var res = await ctx.AcquireTokenAsync(
  "https://management.core.windows.net/",  // resource id
  ast);
MessageBox.Show(res.AccessToken);

Step4 – Connect Azure resources !

Now let’s connect using Azure REST API.

Here we get a Virtual Machine resource using the Azure REST API. The point is to set the access token value in the HTTP Authorization header as follows.

HTTP Request (Azure REST API)

GET https://management.azure.com/subscriptions/b3ae1c15-...
  /resourceGroups/TestGroup01
  /providers/Microsoft.Compute/virtualMachines
  /testmachine01?api-version=2017-03-30
Accept: application/json
Authorization: Bearer eyJ0eXAiOi...

HTTP Response (Azure REST API)

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

{
  "properties": {
    "vmId": "525f07f9-d4e9-40dc-921f-6adf4ebb8f21",
    "hardwareProfile": {
      "vmSize": "Standard_A2"
    },
    "storageProfile": {
      "imageReference": {
        "publisher": "MicrosoftWindowsServer",
        "offer": "WindowsServer",
        "sku": "2012-R2-Datacenter",
        "version": "latest"
      },
      "osDisk": {
        "osType": "Windows",
        "name": "testmachine01",
        "createOption": "FromImage",
        "vhd": {
          "uri": "https://test01.blob.core.windows.net/vhds/testvm.vhd"
        },
        "caching": "ReadWrite"
      },
      "dataDisks": []
    },
    "osProfile": {
      "computerName": "testmachine01",
      "adminUsername": "tsmatsuz",
      "windowsConfiguration": {
        "provisionVMAgent": true,
        "enableAutomaticUpdates": true
      },
      "secrets": []
    },
    "networkProfile": {
      "networkInterfaces":[
        {
          "id":"..."
        }
      ]
    },
    "diagnosticsProfile": {
      "bootDiagnostics": {
        "enabled": true,
        "storageUri": "https://test01.blob.core.windows.net/"
      }
    },
    "provisioningState": "Succeeded"
  },
  "resources": [
    {
      "properties": {
        "publisher": "Microsoft.Azure.Diagnostics",
        "type": "IaaSDiagnostics",
        "typeHandlerVersion": "1.5",
        "autoUpgradeMinorVersion": true,
        "settings": {
          "xmlCfg":"PFdhZENmZz...",
          "StorageAccount":"test01"
        },
        "provisioningState": "Succeeded"
      },
      "type": "Microsoft.Compute/virtualMachines/extensions",
      "location": "japaneast",
      "id": "...",
      "name": "Microsoft.Insights.VMDiagnosticsSettings"
    }
  ],
  "type": "Microsoft.Compute/virtualMachines",
  "location": "japaneast",
  "tags": {},
  "id": "...",
  "name": "testmachine01"
}

Of course, you can also execute create/update/delete and other management actions, depending on the role assignments in your Azure subscription.
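
If you are working from R, the same GET request is just a matter of adding the Authorization header (a minimal httr sketch; the subscription id below is a placeholder and "token" is the access token obtained above).

library(httr)

url <- paste0(
  "https://management.azure.com/subscriptions/{subscription-id}",
  "/resourceGroups/TestGroup01/providers/Microsoft.Compute",
  "/virtualMachines/testmachine01?api-version=2017-03-30")
res <- GET(url,
           add_headers(Authorization = paste("Bearer", token)),
           accept_json())
vm <- content(res)                    # parsed JSON as an R list
vm$properties$provisioningState       # "Succeeded" in the example above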

 

Accelerate MXNet R training (deep learning) by multiple machines and GPUs

Scale your machine learning workloads on R (series)

In my previous post, I showed how to scale out your deep learning scoring workloads using MXNet R.
Here I show how to accelerate (scale up and out) the training phase of deep learning.

Get the power of devices – GPUs

From the training perspective, far more complicated calculation and much heavier computing workloads are needed compared with scoring, and the GPU is a very important factor for handling and reducing these workloads. (See the previous post: all the gradients, weights, and biases must be determined with repeated matrix calculations.)
With MXNet you can easily take advantage of GPU-accelerated deep learning, so let's take a look at these useful capabilities.

First, you can easily get a GPU-enabled environment using Azure N-series (NC, NV) Virtual Machines. Set up (install) all components (drivers, packages, etc.) with the following installation script, and you get GPU-powered MXNet. (You must compile MXNet with the USE_CUDA=1 switch.) Because RStudio Server is also set up by this script, you can connect to this Linux machine with your familiar RStudio UI in the web browser.
For the details about this compilation, you can refer to "Machine Learning team blog : Building Deep Neural Networks in the Cloud with Azure GPU VMs, MXNet and Microsoft R Server".

Note that you should download the software referenced in the script (the CUDA, cuDNN, and MKL installers) before running it.

#!/usr/bin/env bash

#
# install R (MRAN)
#
wget https://mran.microsoft.com/install/mro/3.3.2/microsoft-r-open-3.3.2.tar.gz
tar -zxvpf microsoft-r-open-3.3.2.tar.gz
cd microsoft-r-open
sudo ./install.sh -a -u
cd ..
sudo rm -rf microsoft-r-open
sudo rm microsoft-r-open-3.3.2.tar.gz

#
# install gcc, python, etc
#
sudo apt-get install -y libatlas-base-dev libopencv-dev libprotoc-dev python-numpy python-scipy make unzip git gcc g++ libcurl4-openssl-dev libssl-dev
sudo update-alternatives --install "/usr/bin/cc" "cc" "/usr/bin/gcc" 50

#
# install CUDA (you can download cuda_8.0.44_linux.run)
#
chmod 755 cuda_8.0.44_linux.run
sudo ./cuda_8.0.44_linux.run -override
sudo update-alternatives --install /usr/bin/nvcc nvcc /usr/bin/gcc 50
export LIBRARY_PATH=/usr/local/cudnn/lib64/
echo -e "\nexport LIBRARY_PATH=/usr/local/cudnn/lib64/" >> .bashrc

#
# install cuDNN (you can download cudnn-8.0-linux-x64-v5.1.tgz)
#
tar xvzf cudnn-8.0-linux-x64-v5.1.tgz
sudo mv cuda /usr/local/cudnn
sudo ln -s /usr/local/cudnn/include/cudnn.h /usr/local/cuda/include/cudnn.h
export LD_LIBRARY_PATH=/usr/local/cuda/lib64/:/usr/local/cudnn/lib64/:$LD_LIBRARY_PATH
echo -e "\nexport LD_LIBRARY_PATH=/usr/local/cuda/lib64/:/usr/local/cudnn/lib64/:$LD_LIBRARY_PATH" >> ~/.bashrc

#
# install MKL (you can download l_mkl_2017.0.098.tgz)
#
tar xvzf l_mkl_2017.0.098.tgz
sudo ./l_mkl_2017.0.098/install.sh

# Additional setup for MRAN and CUDA
sudo touch /etc/ld.so.conf
echo "/usr/local/cuda/lib64/" | sudo tee --append /etc/ld.so.conf
echo "/usr/local/cudnn/lib64/" | sudo tee --append /etc/ld.so.conf
sudo ldconfig

#
# download MXNet source
#
MXNET_HOME="$HOME/mxnet/"
git clone https://github.com/dmlc/mxnet.git "$HOME/mxnet/" --recursive
cd "$MXNET_HOME"

#
# configure MXNet
#
cp make/config.mk .
# if use dist_sync or dist_async in kv_store (see later)
#echo "USE_DIST_KVSTORE = 1" >>config.mk
# if use Azure BLOB Storage
#echo "USE_AZURE = 1" >>config.mk
# For GPU
echo "USE_CUDA = 1" >>config.mk
echo "USE_CUDA_PATH = /usr/local/cuda" >>config.mk
echo "USE_CUDNN = 1" >>config.mk
# For MKL
#source /opt/intel/bin/compilervars.sh intel64 -platform linux
#echo "USE_BLAS = mkl" >>config.mk
#echo "USE_INTEL_PATH = /opt/intel/" >>config.mk

#
# compile and install MXNet
#
make -j$(nproc)
sudo apt-get install libxml2-dev
sudo Rscript -e "install.packages('devtools', repo = 'https://cran.rstudio.com')"
cd R-package
sudo Rscript -e "library(devtools); library(methods); options(repos=c(CRAN='https://cran.rstudio.com')); install_deps(dependencies = TRUE)"
sudo Rscript -e "install.packages(c('curl', 'httr'))"
sudo Rscript -e "install.packages(c('Rcpp', 'DiagrammeR', 'data.table', 'jsonlite', 'magrittr', 'stringr', 'roxygen2'), repos = 'https://cran.rstudio.com')"
cd ..
sudo make rpkg
sudo R CMD INSTALL mxnet_current_r.tar.gz
cd ..

#
# install RStudio server
#
sudo apt-get -y install gdebi-core
wget -O rstudio.deb https://download2.rstudio.org/rstudio-server-0.99.902-amd64.deb
sudo gdebi -n rstudio.deb

Note : I explain the USE_DIST_KVSTORE switch later.

"It's still so complicated, and compiling takes so much time!"

Don't worry. If you think it's hard, you can use a pre-configured VM called "Deep Learning Toolkit for the DSVM (Data Science Virtual Machine)" (currently Windows only) in Microsoft Azure (see below). Using this VM template, you get almost all components required for deep neural network computing, including an NC-series VM with GPUs (Tesla), the drivers (software and toolkit), Microsoft R, R Server, and GPU-accelerated MXNet (and other DNN libraries). No setup is needed!
Note that this deployment requires access to Azure NC instances, which depends on the chosen region and HDD/SSD type. In this post I selected the South Central US region.

Note : Below is the matrix of available instance or services by Azure regions.
https://azure.microsoft.com/en-us/regions/services/

Here I’m using 2 GPUs labeled 0 and 1. (see below)

nvidia-smi -pm 1
nvidia-smi

Now let’s see how it is used in our programming !

If you want to use GPUs for training your neural network, you can easily switch into GPU mode with MXNet as follows. In this code, each data batch is partitioned across the 2 GPUs, and the results are aggregated.
Note that I don't focus on the algorithms or the network itself here (please refer to the previous post); I'm using the LeNet network (a Convolutional Neural Network, CNN) for the MNIST example (the famous handwritten digit recognition dataset; see the official MXNet tutorial for details). Sorry, but this network is not so deep (rather shallow), and GPU utilization stays low (less than 10%) in this example. Please use CIFAR-10 or some other heavy workload for a real scenario.

#####
#
# train.csv (training data) is:
# (label, pixel0, pixel1, ..., pixel783)
# 1, 0, 0, ..., 0
# 4, 0, 0, ..., 0
# ...
#
#####
#
# test.csv (scoring data) is:
# (pixel0, pixel1, ..., pixel783)
# 0, 0, ..., 0
# 0, 0, ..., 0
# ...
#
#####

require(mxnet)

# read training data
train <- read.csv(
  "C:\\Users\\tsmatsuz\\Desktop\\training\\train.csv",
  header=TRUE)
train <- data.matrix(train)

# separate label and pixel
train.x <- train[,-1]
train.x <- t(train.x/255)
train.array <- train.x
dim(train.array) <- c(28, 28, 1, ncol(train.x))
train.y <- train[,1]

#
# configure network
#

# input
data <- mx.symbol.Variable('data')
# first conv
conv1 <- mx.symbol.Convolution(
  data=data,
  kernel=c(5,5),
  num_filter=20)
tanh1 <- mx.symbol.Activation(
  data=conv1,
  act_type="tanh")
pool1 <- mx.symbol.Pooling(
  data=tanh1,
  pool_type="max",
  kernel=c(2,2),
  stride=c(2,2))
# second conv
conv2 <- mx.symbol.Convolution(
  data=pool1,
  kernel=c(5,5),
  num_filter=50)
tanh2 <- mx.symbol.Activation(
  data=conv2,
  act_type="tanh")
pool2 <- mx.symbol.Pooling(
  data=tanh2,
  pool_type="max",
  kernel=c(2,2),
  stride=c(2,2))
# first fullc
flatten <- mx.symbol.Flatten(data=pool2)
fc1 <- mx.symbol.FullyConnected(
  data=flatten,
  num_hidden=500)
tanh3 <- mx.symbol.Activation(
  data=fc1,
  act_type="tanh")
# second fullc
fc2 <- mx.symbol.FullyConnected(data=tanh3, num_hidden=10)
# loss
lenet <- mx.symbol.SoftmaxOutput(data=fc2)

# train !
kv <- mx.kv.create(type = "local")
mx.set.seed(0)
tic <- proc.time()
model <- mx.model.FeedForward.create(
  lenet,
  X=train.array,
  y=train.y,
  ctx=list(mx.gpu(0),mx.gpu(1)),
  kvstore = kv,
  num.round=5,
  array.batch.size=100,
  learning.rate=0.05,
  momentum=0.9,
  wd=0.00001,
  eval.metric=mx.metric.accuracy,
  epoch.end.callback=mx.callback.log.train.metric(100))

# score (1st time)
test <- read.csv(
  "C:\\Users\\tsmatsuz\\Desktop\\training\\test.csv",
  header=TRUE)
test <- data.matrix(test)
test <- t(test/255)
test.array <- test
dim(test.array) <- c(28, 28, 1, ncol(test))
preds <- predict(model, test.array)
pred.label <- max.col(t(preds)) - 1
print(table(pred.label))

Distributed Training – Scale across multiple machines

The MXNet training workload can be distributed not only across devices (GPUs) but also across multiple machines.

Here we assume there are three machines (hosts) named "server01", "server02", and "server03". We launch the parallel job on server02 and server03 from the server01 console.

Before starting, you must compile MXNet with USE_DIST_KVSTORE=1 on all hosts. (See the bash script example above.)

Note : Currently this distributed kvstore setting (USE_DIST_KVSTORE=1) is not enabled by default in the existing Data Science Virtual Machines (DSVM) or the Deep Learning Toolkit for the DSVM. Hence you must set it up (compile) yourself.

For the distribution protocol, ssh, mpi (mpirun), or yarn can be used for remote execution and cluster management. Here we use ssh as the example.

First we set up trust between the host machines.
We create a key pair on server01 using the following command; the generated key pair (id_rsa and id_rsa.pub) is placed in the .ssh folder. During creation, leave the passphrase blank.

ssh-keygen -t rsa

ls -al .ssh

drwx------ 2 tsmatsuz tsmatsuz 4096 Feb 21 05:01 .
drwxr-xr-x 7 tsmatsuz tsmatsuz 4096 Feb 21 04:52 ..
-rw------- 1 tsmatsuz tsmatsuz 1766 Feb 21 05:01 id_rsa
-rw-r--r-- 1 tsmatsuz tsmatsuz  403 Feb 21 05:01 id_rsa.pub

Next, copy the generated public key (id_rsa.pub) into the {home of the same user id}/.ssh directory on server02 and server03. The file name must be "authorized_keys".

Now let's confirm that you can run a command (pwd) on the remote hosts (server02, server03) from server01 as follows. If it succeeds, the current working directory on the remote host is returned. (We assume 10.0.0.5 is the IP address of server02 or server03.)

ssh -o StrictHostKeyChecking=no 10.0.0.5 -p 22 pwd

/home/tsmatsuz

Next, create a file named "hosts" in your working directory on server01, and write the IP addresses of the remote hosts (server02 and server03), one per line.

10.0.0.5
10.0.0.6

In my example, I simply use the following training script test01.R (which is executed on the remote hosts). As you can see, here we set dist_sync as the kvstore value, which means synchronous parallel execution. (The weights and biases on the remote hosts are updated synchronously.)

test01.R

#####
#
# train.csv (training data) is:
# (label, pixel0, pixel1, ..., pixel783)
# 1, 0, 0, ..., 0
# 4, 0, 0, ..., 0
# ...
#
#####
#
# test.csv (scoring data) is:
# (pixel0, pixel1, ..., pixel783)
# 0, 0, ..., 0
# 0, 0, ..., 0
# ...
#
#####

require(mxnet)

# read training data
train_d <- read.csv(
  "train.csv",
  header=TRUE)
train_m <- data.matrix(train_d)

# separate label and pixel
train.x <- train_m[,-1]
train.y <- train_m[,1]

# transform image pixel [0, 255] into [0,1]
train.x <- t(train.x/255)

# configure network
data <- mx.symbol.Variable("data")
fc1 <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=128)
act1 <- mx.symbol.Activation(fc1, name="relu1", act_type="relu")
fc2 <- mx.symbol.FullyConnected(act1, name="fc2", num_hidden=64)
act2 <- mx.symbol.Activation(fc2, name="relu2", act_type="relu")
fc3 <- mx.symbol.FullyConnected(act2, name="fc3", num_hidden=10)
softmax <- mx.symbol.SoftmaxOutput(fc3, name="sm")

# train !
model <- mx.model.FeedForward.create(
  softmax,
  X=train.x,
  y=train.y,
  ctx=mx.cpu(),
  num.round=10,
  kvstore = "dist_sync",
  array.batch.size=100,
  learning.rate=0.07,
  momentum=0.9,
  eval.metric=mx.metric.accuracy,
  initializer=mx.init.uniform(0.07),
  epoch.end.callback=mx.callback.log.train.metric(100))

Now you can run the parallel execution.
On the server01 console, you call launch.py as follows with the executable command "Rscript test01.R". This command is then executed on server02 and server03, and the jobs are traced by the job monitor. (Here we assume /home/tsmatsuz/mxnet is the MXNet installation directory.)

Be sure to put this R script (test01.R) and the training data (train.csv) in the same directory on server02 and server03. If you want to see that the weights and biases are properly updated, it's better to use different training data on server02 and server03. (Or you could download the appropriate data files from a remote site on the fly, as sketched below.)
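
For example, each remote host could pick or download its own training file on the fly, keyed by its hostname (a small sketch of my own; the URL is hypothetical).

host <- Sys.info()[["nodename"]]               # e.g. "server02"
url  <- paste0("https://example.com/mnist/train-", host, ".csv")
download.file(url, destfile = "train.csv")     # fetch this host's shard
train_d <- read.csv("train.csv", header = TRUE)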

/home/tsmatsuz/mxnet/tools/launch.py -n 1 -H hosts \
  --launcher ssh Rscript test01.R

The output is as follows. The upper part is the result from a single node, and the lower part is the synchronized result from server02 and server03.

Note : In a real application, it's better to use an InfiniBand network. Please be patient until InfiniBand support is released for the Azure N-series VMs.

Active learning (Online learning) with MXNet

In the previous section, we looked at parallel and synchronous training workloads. Lastly, let's think about training over time (incremental training).

As you know, re-training from scratch wastes a lot of time. Using MXNet, you can keep the trained model and refine it incrementally with new data as follows.
As you can see, we pass the previously trained symbol and parameters into the 2nd training call. (See mx.model.load and the arg.params / aux.params arguments below.)

#####
#
# train.csv (training data) is:
# (label, pixel0, pixel1, ..., pixel783)
# 1, 0, 0, ..., 0
# 4, 0, 0, ..., 0
# ...
#
#####
#
# test.csv (scoring data) is:
# (pixel0, pixel1, ..., pixel783)
# 0, 0, ..., 0
# 0, 0, ..., 0
# ...
#
#####

require(mxnet)

# read first 500 training data
train_d <- read.csv(
  "C:\\Users\\tsmatsuz\\Desktop\\training\\train.csv",
  header=TRUE)
train_m <- data.matrix(train_d[1:500,])

# separate label and pixel
train.x <- train_m[,-1]
train.y <- train_m[,1]

# transform image pixel [0, 255] into [0,1]
train.x <- t(train.x/255)

# configure network
data <- mx.symbol.Variable("data")
fc1 <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=128)
act1 <- mx.symbol.Activation(fc1, name="relu1", act_type="relu")
fc2 <- mx.symbol.FullyConnected(act1, name="fc2", num_hidden=64)
act2 <- mx.symbol.Activation(fc2, name="relu2", act_type="relu")
fc3 <- mx.symbol.FullyConnected(act2, name="fc3", num_hidden=10)
softmax <- mx.symbol.SoftmaxOutput(fc3, name="sm")

# train !
model <- mx.model.FeedForward.create(
  softmax,
  X=train.x,
  y=train.y,
  ctx=mx.cpu(),
  num.round=10,
  array.batch.size=100,
  learning.rate=0.07,
  momentum=0.9,
  eval.metric=mx.metric.accuracy,
  initializer=mx.init.uniform(0.07),
  epoch.end.callback=mx.callback.log.train.metric(100))

# score (1st time)
test <- read.csv("C:\\Users\\tsmatsuz\\Desktop\\training\\test.csv", header=TRUE)
test <- data.matrix(test)
test <- t(test/255)
preds <- predict(model, test)
pred.label <- max.col(t(preds)) - 1
table(pred.label)

#
# save the current model if needed
#

# save model to file
mx.model.save(
  model = model,
  prefix = "mymodel",
  iteration = 500
)

# load model from file
model_loaded <- mx.model.load(
  prefix = "mymodel",
  iteration = 500
)

# re-train for the next 500 data !
train_d <- read.csv(
  "C:\\Users\\tsmatsuz\\Desktop\\training\\train.csv",
  header=TRUE)
train_m <- data.matrix(train_d[501:1000,])
train.x <- train_m[,-1]
train.y <- train_m[,1]
train.x <- t(train.x/255)
model = mx.model.FeedForward.create(
  model_loaded$symbol,
  arg.params = model_loaded$arg.params,
  aux.params = model_loaded$aux.params,
  X=train.x,
  y=train.y,
  ctx=mx.cpu(),
  num.round=10,
  #kvstore = kv,
  array.batch.size=100,
  learning.rate=0.07,
  momentum=0.9,
  eval.metric=mx.metric.accuracy,
  initializer=mx.init.uniform(0.07),
  epoch.end.callback=mx.callback.log.train.metric(100))

# score (2nd time)
test <- read.csv("C:\\Users\\tsmatsuz\\Desktop\\training\\test.csv", header=TRUE)
test <- data.matrix(test)
test <- t(test/255)
preds <- predict(model, test)
pred.label <- max.col(t(preds)) - 1
table(pred.label)

Note : "Active learning" here is machine learning terminology, not the educational term. Please see Wikipedia, "Active Learning (Machine Learning)".

 

[Reference] MXNet Docs : Run MXNet on Multiple CPU/GPUs with Data Parallel
http://mxnet.io/how_to/multi_devices.html

 

Scale your deep learning workloads on MXNet R (scoring phase)

Scale your machine learning workloads on R (series)

In my previous post, I described the basics of scaling statistical R computing using Azure Hadoop (HDInsight) and R Server.
Some folks asked me, "What about deep learning workloads?"

This post and the next one give the answer to that question.

Overall recap

The machine learning team provides some useful resources about these concerns, listed below. Please refer to these documents for the technical background and details.
Here MXNet is used for implementing deep neural networks (including shallow networks) with R.

Machine learning blog – Building Deep Neural Networks in the Cloud with Azure GPU VMs, MXNet and Microsoft R Server
https://blogs.technet.microsoft.com/machinelearning/2016/09/15/building-deep-neural-networks-in-the-cloud-with-azure-gpu-vms-mxnet-and-microsoft-r-server/

Channel9 – Deep Learning in Microsoft R Server Using MXNet on High-Performance GPUs in the Public Cloud
https://channel9.msdn.com/Events/Machine-Learning-and-Data-Sciences-Conference/Data-Science-Summit-2016/MSDSS21

Machine learning blog – Applying Deep Learning at Cloud Scale, with Microsoft R Server & Azure Data Lake
https://blogs.technet.microsoft.com/machinelearning/2016/10/31/applying-cloud-deep-learning-at-scale-with-microsoft-r-server-azure-data-lake/

 

In my posts here, I show you the programming code and the how-to steps that go along with these useful resources from the team.

From the training perspective, MXNet natively supports data parallelization across multiple devices, including utilization of the massive power of GPUs. (See "MXNet How-To – Run MXNet on Multiple CPU/GPUs with Data Parallel".) The MXNet key-value store handles synchronization across the devices.
You can easily run GPU-enabled Virtual Machines (Data Science Virtual Machines or N-series Virtual Machines) in Microsoft Azure and see how it works. I will show this training scenario in the next post.

From the scoring perspective, you can run the scoring tasks independently (unlike training) for each piece of data, so you can easily scale the workload across a set of devices or machines.
In this post I show you a step-by-step tutorial for that scoring scenario. (Here we also use Spark and R Server on Azure.) The sample code here is trivial, but in a real system the scoring might run against extremely large data.

Our sample

In this post we use the familiar MNIST example (handwritten digit recognition) for the deep neural network.
You can easily copy the MNIST script and download the sample data from the official tutorial "MXNet R – Handwritten Digits Classification Competition". It uses a large number of 28 x 28 = 784 pixel greyscale images, i.e., 28 x 28 = 784 input neurons.

Please see the following script.

Note : When you're on Windows or Mac and run install.packages() to install the MXNet package (from the DMLC repository), the latest package (mxnet 0.9.4) requires visNetwork 1.0.3.
Therefore you must currently install the latest visNetwork package as follows.
install.packages("visNetwork", repos="https://cran.revolutionanalytics.com/")

Note : When you want to install MXNet on Linux, please see the bash setup script sample here (Ubuntu sample).

R MNIST Complete Code – Standalone

require(mxnet)

#####
# read training data
#
# train.csv is:
# (label, pixel0, pixel1, ..., pixel783)
# 1, 0, 0, ..., 0
# 4, 0, 0, ..., 0
# ...
#####
train <- read.csv("C:\\tmp\\train.csv", header=TRUE)
train <- data.matrix(train)

# separate label and pixel
train.x <- train[,-1]
train.y <- train[,1]

# transform image pixel [0, 255] into [0,1]
train.x <- t(train.x/255)

# configure network
data <- mx.symbol.Variable("data")
fc1 <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=128)
act1 <- mx.symbol.Activation(fc1, name="relu1", act_type="relu")
fc2 <- mx.symbol.FullyConnected(act1, name="fc2", num_hidden=64)
act2 <- mx.symbol.Activation(fc2, name="relu2", act_type="relu")
fc3 <- mx.symbol.FullyConnected(act2, name="fc3", num_hidden=10)
softmax <- mx.symbol.SoftmaxOutput(fc3, name="sm")

# train !
# (If you want to use gpu, please set like ctx=list(mx.gpu(0),mx.gpu(1)) )
mx.set.seed(0)
model <- mx.model.FeedForward.create(
  softmax,
  X=train.x,
  y=train.y,
  ctx=mx.cpu(),
  num.round=10,
  array.batch.size=100,
  learning.rate=0.07,
  momentum=0.9,
  eval.metric=mx.metric.accuracy,
  initializer=mx.init.uniform(0.07),
  epoch.end.callback=mx.callback.log.train.metric(100))

#####
# read scoring data
#
# test.csv is:
# (pixel0, pixel1, ..., pixel783)
# 0, 0, ..., 0
# 0, 0, ..., 0
# ...
#####
test <- read.csv("C:\\tmp\\test.csv", header=TRUE)
test <- data.matrix(test)
test <- t(test/255)

# score !
# ("preds" is the matrix of predicted probabilities for each digit)
preds <- predict(model, test)

#####
# output result (occurrence count of each predicted digit)
#
# The result is:
#    0    1    2 ...     9
# 2814 3168 2711 ...  2778 
#####
pred.label <- max.col(t(preds)) - 1
table(pred.label)

Here I don't focus on the deep learning algorithms and the network itself; you can treat those details as a black box for the rest of this reading. But I have briefly illustrated the network of this sample code below, for reference.
This sample is a plain feed-forward network (no feedback or loops), trained with the back-propagation algorithm (which determines the gradients), a learning rate of 0.07 (the ratio of the delta applied in each update), and a batch size of 100 (the bunch of training data consumed in each step).

 

Setting up Spark clusters

As I described in my previous post, you can easily create your R Server and Spark cluster on Azure.
Here I skip the setup steps; please see my previous post for details (using Azure Data Lake Store, RStudio setup, etc.).

Moreover, there is one more thing to be done here.
Currently R Server on an Azure HDInsight (Hadoop) cluster does not include MXNet. For this reason, you must install MXNet on all worker nodes (Ubuntu 16) using an HDInsight script action. (You only need to create the installation script and register it in the Azure Portal. See the following screenshot.)

Note : If you want to apply the script action to the edge node, you must use the HDInsight Premium SKU. So it's better to run the MXNet workloads only on the worker nodes. (The edge node is just for orchestration.)

Below is my script action (.sh) for the MXNet installation. Here we download the MXNet source code and compile it with gcc.
As you can see, you don't need GPU utilization in the scoring phase, so the installation (compilation) is much simpler.

#!/usr/bin/env bash
##########
#
# HDInsight script action
# for installing MXNet
# without GPU utilized
#
##########

# install gcc, python libraries, etc
sudo apt-get install -y libatlas-base-dev libopencv-dev libprotoc-dev python-numpy python-scipy make unzip git gcc g++ libcurl4-openssl-dev libssl-dev
sudo update-alternatives --install "/usr/bin/cc" "cc" "/usr/bin/gcc" 50

MXNET_HOME="$HOME/mxnet/"

# download MXNet source code (incl. dmlc-core, etc)
git clone https://github.com/dmlc/mxnet.git "$HOME/mxnet/" --recursive

# make and install MXNet and related modules
cd "$MXNET_HOME"

make -j$(nproc)

sudo Rscript -e "install.packages('devtools', repo = 'https://cran.rstudio.com')"

cd R-package
sudo Rscript -e "library(devtools); library(methods); options(repos=c(CRAN='https://cran.rstudio.com')); install_deps(dependencies = TRUE)"
sudo Rscript -e "install.packages(c('curl', 'httr'))"
sudo Rscript -e "install.packages(c('Rcpp', 'DiagrammeR', 'data.table', 'jsonlite', 'magrittr', 'stringr', 'roxygen2'), repos = 'https://cran.rstudio.com')"

cd ..
sudo make rpkg

sudo R CMD INSTALL mxnet_current_r.tar.gz

Note : If your script action encounters errors, you can see the logs in the Ambari UI at https://{your cluster name}.azurehdinsight.net.

Prepare your trained model

Before running the scoring workload, we prepare the trained model and save it to local disk as follows. (See the following script.)
This script saves the trained MXNet model in C:\tmp. Two files, mymodel-symbol.json and mymodel-0100.params, are created.

R MNIST Train and Save model

require(mxnet)

#####
# train.csv is:
# (label, pixel0, pixel1, ..., pixel783)
# 1, 0, 0, ..., 0
# 4, 0, 0, ..., 0
# ...
#####

# read input data
train <- read.csv("C:\\tmp\\train.csv", header=TRUE)
train <- data.matrix(train)

# separate label and pixel
train.x <- train[,-1]
train.y <- train[,1]

# transform image pixel [0, 255] into [0,1]
train.x <- t(train.x/255)

# configure network
data <- mx.symbol.Variable("data")
fc1 <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=128)
act1 <- mx.symbol.Activation(fc1, name="relu1", act_type="relu")
fc2 <- mx.symbol.FullyConnected(act1, name="fc2", num_hidden=64)
act2 <- mx.symbol.Activation(fc2, name="relu2", act_type="relu")
fc3 <- mx.symbol.FullyConnected(act2, name="fc3", num_hidden=10)
softmax <- mx.symbol.SoftmaxOutput(fc3, name="sm")

# train !
# (If you want to use gpu, please set like ctx=list(mx.gpu(0),mx.gpu(1)) )
mx.set.seed(0)
model <- mx.model.FeedForward.create(
  softmax,
  X=train.x,
  y=train.y,
  ctx=mx.cpu(),
  num.round=10,
  array.batch.size=100,
  learning.rate=0.07,
  momentum=0.9,
  eval.metric=mx.metric.accuracy,
  initializer=mx.init.uniform(0.07),
  epoch.end.callback=mx.callback.log.train.metric(100))

# save model to file
# (created : mymodel-symbol.json, mymodel-0100.params)
if(file.exists("C:\\tmp\\mymodel-symbol.json"))
  file.remove("C:\\tmp\\mymodel-symbol.json")
if(file.exists("C:\\tmp\\mymodel-0100.params"))
  file.remove("C:\\tmp\\mymodel-0100.params")
mx.model.save(
  model = model,
  prefix = "C:\\tmp\\mymodel",
  iteration = 100
)

After running this script, upload the model files (mymodel-symbol.json, mymodel-0100.params) to a location the cluster can download from. (The scoring program will retrieve these files.)
In my example, I uploaded them to Azure Data Lake Store (the adl://mltest.azuredatalakestore.net/dnndata folder), which is the same storage as the primary storage of the Hadoop cluster.
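
If the model files are available on the edge node, one way to copy them into that folder is with RevoScaleR's Hadoop helper functions (a sketch on my side, assuming the files are in the current working directory).

# create the target folder and copy the two model files
rxHadoopMakeDir("adl://mltest.azuredatalakestore.net/dnndata")
rxHadoopCopyFromLocal(
  "mymodel-symbol.json",
  "adl://mltest.azuredatalakestore.net/dnndata/mymodel-symbol.json")
rxHadoopCopyFromLocal(
  "mymodel-0100.params",
  "adl://mltest.azuredatalakestore.net/dnndata/mymodel-0100.params")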

R Script for scaling

Everything is ready. Now let's start programming for scale.

As I mentioned in my previous post, we can use “rx” prefixed ScaleR functions for Spark cluster scaling.

“So… we must rewrite all the functions with ScaleR functions !?”

Don't worry, of course not.
In this case you can simply use the rxExec() function, which distributes an arbitrary block of code across the cluster in parallel.

Let's look at the following sample code. (As I described in my previous post, this code should be run on the edge node.)

R MNIST scoring on Spark cluster (R Server)

# Set Spark clusters context
spark <- RxSpark(
  consoleOutput = TRUE,
  extraSparkConfig = "--conf spark.speculation=true",
  nameNode = "adl://mltest.azuredatalakestore.net",
  port = 0,
  idleTimeout = 90000
)
rxSetComputeContext(spark)

image.score <- function(filename) {
  require(mxnet)

  storage.prefix <- "adl://mltest.azuredatalakestore.net/dnndata"
  
  #####
  # test.csv is:
  # (pixel0, pixel1, ..., pixel783)
  # 0, 0, ..., 0
  # 0, 0, ..., 0
  # ...
  #####

  # copy model to local
  if(file.exists("mymodel-symbol.json"))
    file.remove("mymodel-symbol.json")
  if(file.exists("mymodel-0100.params"))
    file.remove("mymodel-0100.params")
  rxHadoopCopyToLocal(
    file.path(storage.prefix, "mymodel-symbol.json"),
    "mymodel-symbol.json")
  rxHadoopCopyToLocal(
    file.path(storage.prefix, "mymodel-0100.params"),
    "mymodel-0100.params")
  
  # load model
  model_loaded <- mx.model.load(
    #prefix = "mymodel",
    prefix = "mymodel",
    iteration = 100
  )

  # copy scoring file to local
  if(file.exists(filename))
    file.remove(filename)
  srcfile <- file.path(storage.prefix, filename)
  rxHadoopCopyToLocal(srcfile, filename)
  
  # read scoring data
  #test <- read.csv(filename, header=TRUE)
  test <- read.csv(
    filename,
    header=TRUE)
  test <- data.matrix(test)
  test <- t(test/255)
  
  # Score !
  preds <- predict(model_loaded, test)
  pred.label <- max.col(t(preds)) - 1
  return(pred.label)
}

# FUN : function to execute in parallel
# elemArgs : the arguments to the function
# elemType : "nodes" or "cores"
# timesToRun : total number of instances to run
arglist <- list(
  list(filename="test01.csv"),
  list(filename="test02.csv"),
  list(filename="test03.csv"),
  list(filename="test04.csv"),
  list(filename="test05.csv"),
  list(filename="test06.csv"),
  list(filename="test07.csv"),
  list(filename="test08.csv"),
  list(filename="test09.csv"),
  list(filename="test10.csv"))
result <- rxExec(
  FUN = image.score,
  elemArgs = arglist,
  elemType = "nodes",
  timesToRun = length(arglist))

# output result
output <- table(unlist(result))
print(output)

As you can see, here we pass 10 "filename" arguments to the image.score function using rxExec(). Each image.score workload (10 in total) is distributed to the worker nodes of the Spark cluster.
The return value of rxExec() is the aggregated result of the 10 image.score executions.

If needed, you can easily scale your cluster (e.g. 4 nodes -> 16 nodes) in the Azure Portal and get massive computing resources for your workload.

 

In "Channel9 – Deep Learning in Microsoft R Server Using MXNet on High-Performance GPUs in the Public Cloud", they handle the real scenario of CIFAR-10 image classification applied to the Mona Lisa.

The Azure Hadoop (Data Lake, HDInsight) team has also written a recent post about Caffe on a Spark cluster; please take a look if you're interested.

Azure Data Lake & Azure HDInsight Blog : Distributed Deep Learning on HDInsight with Caffe on Spark
https://blogs.msdn.microsoft.com/azuredatalake/2017/02/02/distributed-deep-learning-on-hdinsight-with-caffe-on-spark/

 

Note : With the new rxExecBy(), you don't have to manually split and move the data, and it can be used not only on Spark clusters but also on other parallel platforms. (Added in Apr 2017)

Benefits of Microsoft R and R Server – Quick Look

Scale your machine learning workloads on R (series)

In my previous post, I described how ordinary business users (marketers, business analysts, etc.) can leverage their R skills with Microsoft technologies such as Power BI.
In this post, I describe the benefits of Microsoft R technologies for professional developers (programmers, data scientists, etc.), with a few lines of code.

Designed for multithreading

R is the most popular statistical programming language, but it has some concerns for enterprise use. The biggest one is the lack of parallelism.

Microsoft R Open (MRO) is the renamed version of the famous Revolution R Open (RRO). By using MRO you can take advantage of multithreading and high performance, while MRO remains compatible with the basic functions of open source R (CRAN).

Note : You can also use other options (snow, etc.) for parallel computing in R.

Please see the following official document about the benchmark.

The Benefits of Multithreaded Performance with Microsoft R Open
https://mran.microsoft.com/documents/rro/multithread/

For example, this document shows that matrix manipulation is many times faster than in open source R. Let's look at the following simple example.
A is a 10000 x 5000 matrix whose elements are the repeated values 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, … and B is the cross-product matrix of A. Here we measure this cross-product operation using system.time().

A <- matrix (1:5,10000,5000)
system.time (B <- crossprod(A))

The following are the results.
Here I'm using a Lenovo X1 Carbon (Intel Core vPro i7), and MRO is over 8 times faster than open source R.

R 3.3.2 [64 bit] (CRAN)

Microsoft R 3.3.2 [64 bit] (MRO)

Analysis functions over large amounts of data are also faster than in the other R runtime.
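
If you want to check or pin the number of threads that the MKL math library uses in MRO, the RevoUtilsMath package (installed together with MRO when the MKL libraries are present) can be used. This is a side note of mine, not part of the benchmark above.

library(RevoUtilsMath)
getMKLthreads()      # threads currently used by the math library
setMKLthreads(4)     # limit the thread count, e.g. on a shared machine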

Note : If you're using RStudio and have both open source R and MRO installed, you can change the R runtime for RStudio from the [Tools] – [Global Options] menu.

 

Distributed and Scaling

By using Microsoft R Server (formerly Revolution R Enterprise), you can also distribute and scale R computations across multiple computers.
R Server can run on Windows (SQL Server), Linux, Teradata, and Hadoop (Spark) clusters.
You can also run R Server as a workload on Spark (see the following illustration), and that is the configuration we use in this post. Using R Server on Spark, you can distribute your R algorithms across the Spark cluster.

Note : On Windows, R Server is licensed under SQL Server. You can easily get a standalone R Server (with SQL Server 2016 Developer edition) by using the "Data Science Virtual Machine" in Microsoft Azure.

Note : Spark MLlib is also a machine learning component that leverages the computing power of Spark, but Python, Java, and Scala are its main programming languages. (Currently most of its functions are not supported in R.)

Note : You can also use SparkR. But remember that SparkR currently provides mainly data transformation for R computing (i.e., it is not yet mature).

Here I skip how to set up R Server on Spark, but the easiest way is to use Azure Hadoop clusters (HDInsight). You can set up your own experimentation environment in just a few steps, as follows.

  1. Create an R Server on Spark cluster (Azure HDInsight). You just enter several values in the HDInsight cluster creation wizard in the Azure Portal, and all the compute nodes (head nodes, edge nodes, worker nodes, zookeeper nodes) are set up automatically.
    Please see "Microsoft Azure – Get started using R Server on HDInsight" for details.
  2. If needed, set up RStudio connected to the Spark cluster (edge node) above. Note that RStudio Server Community Edition is automatically installed on the edge node by HDInsight, so you only need to set up your client environment. (Currently you don't need to install RStudio Server yourself.)
    Please see "Installing RStudio with R Server on HDInsight" for the client setup.

Note : Here I used Azure Data Lake Store (not Azure Storage Blobs) as the primary storage for the Hadoop cluster. (And I set up the service principal and its access permissions for this Data Lake account.)
For more details, please refer to "Create an HDInsight cluster with Data Lake Store using Azure Portal".

Using RStudio Server on the edge node of the Spark cluster, you can use RStudio in the web browser through an SSH tunnel. (See the following screenshot.)
It's a very convenient way to run and debug your R scripts on R Server.

Here I prepared source data (Japanese stock daily reports) of over 35,000,000 records (over 1 GB).
When I run my R script against this huge data on my local computer, the script fails because of allocation errors or timeouts. In such a case, you can solve the problem using R Server on a Spark cluster.

Now here’s the R script which I run on R Server.

When you set the context with rxSetComputeContext as follows, the data is not transferred to this host over the network; instead, the R script is transferred to each server and executed there. (Use rxImport when you want to download the data to the host on which the script is running.)

##### The format of source data
##### (company-code, year, month, day, week, open-price, difference)
#
#3076,2017,1,30,Monday,2189,25
#3076,2017,1,27,Friday,2189,-1
#3076,2017,1,26,Thursday,2215,-29
#...
#...
#####

# Set Spark clusters context
spark <- RxSpark(
  consoleOutput = TRUE,
  extraSparkConfig = "--conf spark.speculation=true",
  nameNode = "adl://jpstockdata.azuredatalakestore.net",
  port = 0,
  idleTimeout = 90000
)
rxSetComputeContext(spark);

# Import data
fs <- RxHdfsFileSystem(
  hostName = "adl://jpstockdata.azuredatalakestore.net",
  port = 0)
colInfo <- list(
  list(index = 1, newName="Code", type="character"),
  list(index = 2, newName="Year", type="integer"),
  list(index = 3, newName="Month", type="integer"),
  list(index = 4, newName="Day", type="integer"),
  list(index = 5, newName="DayOfWeek", type="factor",
       levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")),
  list(index = 6, newName="Open", type="integer"),
  list(index = 7, newName="Diff", type="integer")
)
orgData <- RxTextData(
  fileSystem = fs,
  file = "/history/testCsv.txt",
  colInfo = colInfo,
  delimiter = ",",
  firstRowIsColNames = FALSE
)

# execute : rxLinMod (lm)
system.time(lmData <- rxDataStep(
  inData = orgData,
  transforms = list(DiffRate = (Diff / Open) * 100),
  maxRowsByCols = 300000000))
system.time(lmObj <- rxLinMod(
  DiffRate ~ DayOfWeek,
  data = lmData,
  cube = TRUE))

# If needed, predict (rxPredict) using trained model or save it.
# Here's just ploting the means of DiffRate for each DayOfWeek.
lmResult <- rxResultsDF(lmObj)
rxLinePlot(DiffRate ~ DayOfWeek, data = lmResult)

# execute : rxCrossTabs (xtabs)
system.time(ctData <- rxDataStep(
  inData = orgData,
  transforms = list(Close = Open + Diff),
  maxRowsByCols = 300000000))
system.time(ctObj <- rxCrossTabs(
  formula = Close ~ F(Year):F(Month),
  data = ctData,
  means = TRUE
))
print(ctObj)

Before running this script, I uploaded the source data to Azure Data Lake Store, which is the same storage as the primary storage of the Hadoop cluster. "adl://..." is the URI of the Azure Data Lake Store account.

The functions above prefixed with "rx" (or "Rx") are called ScaleR functions (functions in the RevoScaleR package). These functions are provided for distribution and scaling, and each ScaleR function is the scalable counterpart of a corresponding base R function. For example, RxTextData corresponds to read.table or read.csv, rxLinMod corresponds to lm (linear regression model), and rxCrossTabs corresponds to xtabs (cross-tabulation).
You can use these functions to leverage the computing power of Hadoop clusters. (See the following reference documents for details, and the small comparison sketch after them.)

Microsoft R – RevoScaleR Functions for Hadoop
https://msdn.microsoft.com/en-us/microsoft-r/scaler/scaler-hadoop-functions

Microsoft R – Comparison of Base R and ScaleR Functions
https://msdn.microsoft.com/en-us/microsoft-r/scaler/compare-base-r-scaler-functions
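
As a tiny illustration of this correspondence (a sketch only; "df" is a hypothetical in-memory data frame holding the same columns as the transformed data above):

# base R, in-memory
fit1 <- lm(DiffRate ~ DayOfWeek, data = df)
tab1 <- xtabs(Close ~ Year + Month, data = df)

# ScaleR counterparts, runnable against the distributed data above
fit2 <- rxLinMod(DiffRate ~ DayOfWeek, data = lmData, cube = TRUE)
tab2 <- rxCrossTabs(Close ~ F(Year):F(Month), data = ctData, means = TRUE)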

Note : For more details (descriptions, arguments, etc.) about each ScaleR function, type "?{function name}" (e.g. "?rxLinePlot") in the R console.

Note : You can also use faster modeling functions implemented by Microsoft Research, called MicrosoftML (MML). (These also include functionality for anomaly detection and deep neural networks.) Currently these functions are available only on Windows and not on Spark clusters, but this will be updated in the future.
See "Building a machine learning model with the MicrosoftML package".

The following illustrates the topology of the Spark cluster. The R Server workloads reside on the edge node and the worker nodes.

The edge node plays the role of the development front end, and you interact with R Server through this node. (Currently, R Server on HDInsight is the only cluster type that provides an edge node by default.)
For example, RStudio Server is installed on this edge node, and when you run your R scripts through RStudio in the web browser, this node starts the computations and distributes them to the worker nodes. If the computation cannot be distributed for some reason (whether intentionally or accidentally), the task runs locally on this edge node.

While the script is running, look at YARN in the Hadoop resource manager (RM). You will find the running ScaleR application on the scheduler. (See the following screenshot.)

When you monitor the worker nodes in the resource manager UI, you can see that all nodes are used for the computation.

The ScaleR functions are also provided in Microsoft R Client (on top of Microsoft R Open), which can run on a standalone computer. (You don't need extra servers.)
Using Microsoft R Client, you can send complete R commands to a remote R Server for execution (use the mrsdeploy package; a small sketch follows below). Or you can learn and test these ScaleR functions on your local computer with Microsoft R Client, and later migrate to the distributed clusters.
I think it's a better idea to use Microsoft R Client during development, because there are many overheads (fees, provisioning, etc.) to using Hadoop clusters while you are developing.
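
For example, sending a command from R Client to a remote R Server might look like the following rough sketch with mrsdeploy (my own example; the endpoint and credentials are placeholders, and the remote R Server must have operationalization / remote execution enabled).

library(mrsdeploy)

remoteLogin("https://myrserver.example.com:12800",
            username = "admin", password = "P@ssw0rd", session = FALSE)
result <- remoteExecute("sum(1:10)")   # run an R expression on the remote R Server
remoteLogout()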

 

You can take great advantage of this robust computing platform with Microsoft R technologies!