Azure AD v2 endpoint – How to use custom scopes for admin consent

In an earlier post I explained administrator consent (admin consent) in the Azure AD v2 endpoint. Admin consent is very useful and is required in various scenarios, such as application permissions (application-level privileges without an interactive sign-in UI), granting access for all employees without individual user consent, or the on-behalf-of flow in your web API.

To use admin consent, you must pre-define the permissions in the portal UI (App Registration Portal). But in this UI you can set scopes only for Microsoft Graph (see the following screenshot), and some ISV folks ask me "how can we use admin consent for legacy APIs (Outlook REST API, etc), 3rd-party apps, or our own custom apps?".

The answer is: edit the manifest yourself!

In this post I show you, step by step, how to configure these custom settings in the Azure AD v2 endpoint with the application manifest.

Permissions in Azure AD v2 endpoint – Under the hood

Now let's take a look at the predefined scopes in the v2 endpoint.

For scenarios such as application permissions (application-level privileges without an interactive sign-in UI) or granting access for all employees without individual user consent (delegated permissions), you can pre-define the needed scopes for the application in the App Registration Portal.

Note !  For a usual user consent, you don't need to pre-define scopes in the Azure AD v2 endpoint; you can set scopes dynamically (on the fly) when you authenticate.
Pre-defined permissions are needed for admin consent.

These settings are all written in the application manifest, and you can edit the manifest directly with the "Edit Application Manifest" button (see below) in the App Registration Portal.

For example, when you set the "offline_access" scope and the "openid" scope in your app, they are written in the manifest as follows.

{
  "id": "678806b7-adee-4f4e-b548-947e10512e00",
  "appId": "c471620b-73a3-4a88-a926-eda184e6fde9",
  "name": "testapp01",
  "oauth2AllowImplicitFlow": false,
  "requiredResourceAccess": [
    {
      "resourceAppId": "00000003-0000-0000-c000-000000000000",
      "resourceAccess": [
        {
          "id": "7427e0e9-2fba-42fe-b0c0-848c9e6a8182",
          "type": "Scope"
        },
        {
          "id": "37f7f235-527c-4136-accd-4a02d197296e",
          "type": "Scope"
        }
      ]
    }
  ],
  ...
  
}

As you can see, the app's ID and the scope's ID are used for these settings. For example, 00000003-0000-0000-c000-000000000000 is the "Microsoft Graph" application, and 7427e0e9-2fba-42fe-b0c0-848c9e6a8182 is the "offline_access" scope.
The following is a list of IDs for the familiar scopes. (All of these scopes belong to the "Microsoft Graph" application.)

openid : 37f7f235-527c-4136-accd-4a02d197296e
email : 64a6cdd6-aab1-4aaf-94b8-3cc8405e90d0
profile : 14dad69e-099b-42c9-810b-d002981feec1
offline_access : 7427e0e9-2fba-42fe-b0c0-848c9e6a8182

In short, you just edit the manifest yourself when you need to add scopes for applications other than Microsoft Graph.
The next question is: how do you find the app's ID and the scope's ID?

Use permissions for other APIs

When you want to use the Outlook REST API, OneDrive API, Azure AD Graph API, Power BI REST API, and so on, first go to the Azure Active Directory settings in the Azure Portal.
There you can add these APIs as required permissions, and you can see the app's ID and the scope's ID in the manifest text as follows. (The following is a sample for the https://outlook.office.com/mail.read scope.)

You can use the same IDs in the v2 endpoint and set these scopes using the manifest editor in the App Registration Portal as follows.

{
  "id": "678806b7-adee-4f4e-b548-947e10512e00",
  "appId": "c471620b-73a3-4a88-a926-eda184e6fde9",
  "name": "testapp01",
  "oauth2AllowImplicitFlow": false,
  "requiredResourceAccess": [
    {
      "resourceAppId": "00000003-0000-0000-c000-000000000000",
      "resourceAccess": [
        {
          "id": "7427e0e9-2fba-42fe-b0c0-848c9e6a8182",
          "type": "Scope"
        },
        {
          "id": "37f7f235-527c-4136-accd-4a02d197296e",
          "type": "Scope"
        }
      ]
    },
    {
      "resourceAppId": "00000002-0000-0ff1-ce00-000000000000",
      "resourceAccess": [
        {
          "id": "185758ba-798d-4b72-9e54-429a413a2510",
          "type": "Scope"
        }
      ]
    }
  ],
  ...
  
}

Now let's access the following admin consent URL in the v2 endpoint with your browser. (Replace testdirectory.onmicrosoft.com, c471620b-73a3-4a88-a926-eda184e6fde9, and https://localhost/testapp01 with your own values.)

https://login.microsoftonline.com/testdirectory.onmicrosoft.com/adminconsent?client_id=c471620b-73a3-4a88-a926-eda184e6fde9&state=12345&redirect_uri=https%3a%2f%2flocalhost%2ftestapp01

After you have logged in with an administrator account, the following consent screen is displayed. Note that this Mail.Read scope is not in Microsoft Graph but in the Outlook REST API. Therefore, when you retrieve the e-mails in your inbox, you must access https://outlook.office365.com/api/{version}/me/messages instead of https://graph.microsoft.com/{version}/me/messages. (You cannot use https://graph.microsoft.com/{version}/me/messages with this consent.)
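
For reference, once an access token for the Outlook resource has been acquired on behalf of a signed-in user, the call itself looks roughly like the following. (This is a minimal sketch with curl; the token value is only a placeholder, and token acquisition is omitted.)

# ACCESS_TOKEN is a placeholder for a real token issued by the v2 endpoint
# for the Outlook REST API resource
ACCESS_TOKEN="eyJ0eXAiOiJKV1Qi..."

# use outlook.office365.com (not graph.microsoft.com), because the
# consented Mail.Read scope belongs to the Outlook REST API
curl -s "https://outlook.office365.com/api/v2.0/me/messages?\$top=5" \
  -H "Authorization: Bearer $ACCESS_TOKEN"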

Use permissions for your own custom applications

As I mentioned in my earlier post "Build your own Web API protected by Azure AD v2.0 endpoint with custom scopes", you can define your own custom scopes for your API applications in the App Registration Portal.

For example, assume that two scopes are defined for our API application in the App Registration Portal as in the following screenshot.

These scopes are written in the manifest as follows.
In this case, the app's ID is c4894da7-0070-486e-af4c-e1c2ba5e24ae, and the scope IDs are 223e6396-1b01-4a16-bb2f-03eaed9f31a8 and 658e7fa5-bb32-4ed1-93eb-b040ebc6bfec.

{
  "id": "2cd5c8e0-6a6c-4e8d-870c-46c08f3be3d9",
  "appId": "c4894da7-0070-486e-af4c-e1c2ba5e24ae",
  "identifierUris": [
    "api://c4894da7-0070-486e-af4c-e1c2ba5e24ae"
  ],
  "oauth2Permissions": [
    {
      "adminConsentDescription": "Allow the client to write your data with myapi01 on behalf of the signed-in user.",
      "adminConsentDisplayName": "Write data with myapi",
      "id": "223e6396-1b01-4a16-bb2f-03eaed9f31a8",
      "isEnabled": true,
      "lang": null,
      "origin": "Application",
      "type": "User",
      "userConsentDescription": "Allow the client to write your data with myapi01 on your behalf.",
      "userConsentDisplayName": "Write data with myapi",
      "value": "write_as_user"
    },
    {
      "adminConsentDescription": "Allow the client to read your data with myapi01 on behalf of the signed-in user.",
      "adminConsentDisplayName": "Read data with myapi",
      "id": "658e7fa5-bb32-4ed1-93eb-b040ebc6bfec",
      "isEnabled": true,
      "lang": null,
      "origin": "Application",
      "type": "User",
      "userConsentDescription": "Allow the client to read your data with myapi01 on your behalf.",
      "userConsentDisplayName": "Read data with myapi",
      "value": "read_as_user"
    }
  ],
  ...
  
}

Therefore, on the client side, you can set these scopes of the API application using the manifest editor as follows.

{
  "id": "678806b7-adee-4f4e-b548-947e10512e00",
  "appId": "c471620b-73a3-4a88-a926-eda184e6fde9",
  "name": "testapp01",
  "oauth2AllowImplicitFlow": false,
  "requiredResourceAccess": [
    {
      "resourceAppId": "00000003-0000-0000-c000-000000000000",
      "resourceAccess": [
        {
          "id": "7427e0e9-2fba-42fe-b0c0-848c9e6a8182",
          "type": "Scope"
        },
        {
          "id": "37f7f235-527c-4136-accd-4a02d197296e",
          "type": "Scope"
        }
      ]
    },
    {
      "resourceAppId": "c4894da7-0070-486e-af4c-e1c2ba5e24ae",
      "resourceAccess": [
        {
          "id": "658e7fa5-bb32-4ed1-93eb-b040ebc6bfec",
          "type": "Scope"
        }
      ]
    }
  ],
  ...
  
}

First, the administrator must consent to (subscribe to) the API application before its permissions can be used in the client application.
Let's access the following URL, in which c4894da7-0070-486e-af4c-e1c2ba5e24ae is the app ID of the API application. (See my earlier post "Build your own Web API protected by Azure AD v2.0 endpoint with custom scopes" for details.)

https://login.microsoftonline.com/testdirectory.onmicrosoft.com/adminconsent?client_id=c4894da7-0070-486e-af4c-e1c2ba5e24ae&state=12345&redirect_uri=https%3a%2f%2flocalhost%2fmyapi01

After you've consented to the API application, let's access the following admin consent URL for the client application, in which c471620b-73a3-4a88-a926-eda184e6fde9 is the app ID of the client side.

https://login.microsoftonline.com/testdirectory.onmicrosoft.com/adminconsent?client_id=c471620b-73a3-4a88-a926-eda184e6fde9&state=12345&redirect_uri=https%3a%2f%2flocalhost%2ftestapp01

After you have logged in with the administrator account, you will see that the following consent screen is displayed.

Azure Batch AI – Walkthrough and How it works

In my previous post, I showed how to set up and configure the environment for training with multiple machines by hand.
Using Azure Batch AI, you'll find that this becomes very straightforward and that many tasks for AI workloads are simplified, especially distributed training workloads.

Azure Batch AI is built on top of Azure Batch (see my earlier post), so you can easily understand how it works if you're familiar with Azure Batch.
Now let's see how Batch AI works and how you can use it!

Here we use the Azure CLI (command-line interface), but Batch AI is also integrated with UI tools such as the Azure Portal, Visual Studio, and Visual Studio Code.

Simplifies the distributed training

First, we assume that you are familiar with the basic concepts of distributed training with the Cognitive Toolkit (CNTK). In this post we use the following distributed-training sample code (Train_MNIST.py) with the training data (Train-28x28_cntk_text.txt), which is exactly the same sample as in my previous post. (See "Walkthrough – Distributed training with CNTK" for details.)
This sample trains a convolutional neural network (LeNet) in a distributed manner and saves the trained model (ConvNet_MNIST.dnn) after the training is done.

Train_MNIST.py

import numpy as np
import sys
import os
import cntk
from cntk.train.training_session import *

# folder which includes "Train-28x28_cntk_text.txt"
data_path = sys.argv[1]
# folder for model output
model_path = sys.argv[2]

# Creates and trains a feedforward classification model for MNIST images
def train_mnist():
  # variables denoting the features and label data
  input_var = cntk.input_variable((1, 28, 28))
  label_var = cntk.input_variable(10)

  # instantiate the feedforward classification model
  with cntk.layers.default_options(
    init = cntk.layers.glorot_uniform(),
    activation=cntk.ops.relu):
    h = input_var/255.0
    h = cntk.layers.Convolution2D((5,5), 32, pad=True)(h)
    h = cntk.layers.MaxPooling((3,3), (2,2))(h)
    h = cntk.layers.Convolution2D((3,3), 48)(h)
    h = cntk.layers.MaxPooling((3,3), (2,2))(h)
    h = cntk.layers.Convolution2D((3,3), 64)(h)
    h = cntk.layers.Dense(96)(h)
    h = cntk.layers.Dropout(0.5)(h)
    z = cntk.layers.Dense(10, activation=None)(h)

  # create source for training data
  reader_train = cntk.io.MinibatchSource(
    cntk.io.CTFDeserializer(
      os.path.join(data_path, 'Train-28x28_cntk_text.txt'),
      cntk.io.StreamDefs(
        features = cntk.io.StreamDef(field='features', shape=28 * 28),
        labels = cntk.io.StreamDef(field='labels', shape=10)
      )),
      randomize=True,
      max_samples = 60000,
      multithreaded_deserializer=True)

  # create trainer
  epoch_size = 60000
  minibatch_size = 64
  max_epochs = 40

  lr_schedule = cntk.learning_rate_schedule(
    0.2,
    cntk.UnitType.minibatch)
  learner = cntk.sgd(
    z.parameters,
    lr_schedule)
  dist_learner = cntk.train.distributed.data_parallel_distributed_learner(
    learner = learner,
    num_quantization_bits=32, # non-quantized SGD
    distributed_after=0) # all data is parallelized

  progress_printer = cntk.logging.ProgressPrinter(
    tag='Training',
    num_epochs=max_epochs,
    freq=100,
    log_to_file=None,
    rank=cntk.train.distributed.Communicator.rank(),
    gen_heartbeat=True,
    distributed_freq=None)
  ce = cntk.losses.cross_entropy_with_softmax(
    z,
    label_var)
  pe = cntk.metrics.classification_error(
    z,
    label_var)
  trainer = cntk.Trainer(
    z,
    (ce, pe),
    dist_learner,
    progress_printer)

  # define mapping from reader streams to network inputs
  input_map = {
    input_var : reader_train.streams.features,
    label_var : reader_train.streams.labels
  }

  cntk.logging.log_number_of_parameters(z) ; print()

  session = training_session(
    trainer=trainer,
    mb_source = reader_train,
    model_inputs_to_streams = input_map,
    mb_size = minibatch_size,
    progress_frequency=epoch_size,
    checkpoint_config = CheckpointConfig(
      frequency = epoch_size,
      filename = os.path.join(
        model_path, "ConvNet_MNIST"),
      restore = False),
    test_config = None
  )
  session.train()

  # save model only by one process
  if 0 == cntk.Communicator.rank():
    z.save(
      os.path.join(
        model_path,
        "ConvNet_MNIST.dnn"))

  # clean-up distribution
  cntk.train.distributed.Communicator.finalize();

  return

if __name__=='__main__':
  train_mnist()

Train-28x28_cntk_text.txt

|labels 0 0 0 0 0 0 0 1 0 0 |features 0 3 18 18 126... (784 pixel data)
|labels 0 1 0 0 0 0 0 0 0 0 |features 0 0 139 0 253... (784 pixel data)
...

Azure Batch AI uses a shared script and shared data (both the input data and the model output) for distributed training. Therefore it uses shared storage such as Azure Blob storage, Azure Files, or NFS file servers, as illustrated below.
In this post we use Azure Blob storage, and we assume that the storage account name is "batchaistore01" and the container name is "share01".

Before starting, place the previous Train_MNIST.py (script) and Train-28x28_cntk_text.txt (input data) in the blob container (for example, with the CLI commands below).
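
For example, both files can be uploaded with the Azure CLI roughly as follows. (A sketch, reusing the storage account and key that appear in the cluster creation command below.)

# upload the training script and the training data to the blob container
az storage blob upload \
  --account-name batchaistore01 \
  --account-key ciUzonyM... \
  --container-name share01 \
  --file Train_MNIST.py \
  --name Train_MNIST.py
az storage blob upload \
  --account-name batchaistore01 \
  --account-key ciUzonyM... \
  --container-name share01 \
  --file Train-28x28_cntk_text.txt \
  --name Train-28x28_cntk_text.txt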

Now let's create the cluster (multiple computing nodes) with the following CLI command.
Here we're creating 2 nodes of the Data Science Virtual Machine (DSVM), on which the required software is all pre-configured, including GPU drivers, Open MPI, and GPU-enabled deep learning libraries (TensorFlow, Cognitive Toolkit, Caffe, Keras, etc).
Instead of the DSVM, you can also use plain Ubuntu 16.04 LTS (use "UbuntuLTS" instead of "UbuntuDSVM" in the following command) and create a job with a container image in which the required software is configured. (Later I'll show you how to use a container image in the job.)

Note : When you don't need distributed training (i.e., you use only one single node for the training), just specify "--min 1" and "--max 1" in the following command.

Note : Currently (in preview), the supported regions are East US, West US 2, and West Europe, and the available VM sizes (pricing tiers) are the D-series, E-series, F-series, NC-series, and ND-series.

# create cluster (computing nodes)
az batchai cluster create --name cluster01 \
  --resource-group testrg01 \
  --location eastus \
  --vm-size STANDARD_NC6 \
  --image UbuntuDSVM \
  --min 2 --max 2 \
  --storage-account-name batchaistore01 \
  --storage-account-key ciUzonyM... \
  --container-name share01 \
  --user-name tsmatsuz --password P@ssw0rd

Remember my previous post, in which we manually configured the computing nodes.
Using Batch AI, this one command runs all the configuration tasks, including MPI and inter-node communication settings. It significantly simplifies distributed training!

After you've created the cluster, please wait until "allocationState" becomes steady without any errors, which you can check with the "az batchai cluster show" command. (See the following result.)

# show status for cluster creation
az batchai cluster show \
  --name cluster01 \
  --resource-group testrg01


{
  "allocationState": "steady",
  "creationTime": "2017-12-18T07:01:00.424000+00:00",
  "currentNodeCount": 1,
  "errors": null,
  "location": "EastUS",
  "name": "cluster01",
  "nodeSetup": {
    "mountVolumes": {
      "azureBlobFileSystems": null,
      "azureFileShares": [
        {
          "accountName": "batchaistore01",
          "azureFileUrl": "https://batchaistore01.file.core.windows.net/share01",
          "credentials": {
            "accountKey": null,
            "accountKeySecretReference": null
          },
          "directoryMode": "0777",
          "fileMode": "0777",
          "relativeMountPath": "dir01"
        }
      ],
      "fileServers": null,
      "unmanagedFileSystems": null
    },
    "setupTask": null
  },
  ...
}

You can get the access information for the computing nodes with the following command.
As you can see, there are 2 computing nodes here: one is accessed over ssh on port 50000 and the other on port 50001. (Inbound NAT is used to reach the nodes through a public address.)

# Show endpoints
az batchai cluster list-nodes \
  --name cluster01 \
  --resource-group testrg01


[
  {
    "ipAddress": "52.168.68.25",
    "nodeId": "tvm-4283973576_1-20171218t070031z",
    "port": 50000.0
  },
  {
    "ipAddress": "52.168.68.25",
    "nodeId": "tvm-4283973576_2-20171218t070031z",
    "port": 50001.0
  }
]

Log in to a computing node with an ssh client, and you will find that the storage container "share01" is mounted at the path "$AZ_BATCHAI_MOUNT_ROOT/bfs" (/mnt/batch/tasks/shared/LS_root/mounts/bfs in my environment) on the node. (You can change the mount point "bfs" with a cluster creation option.)
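
For example, with the endpoint information above, the check looks roughly like this (a sketch using the first node's address and the user created with the cluster):

# connect to the first node (public IP and port from "az batchai cluster list-nodes")
ssh tsmatsuz@52.168.68.25 -p 50000

# on the node, the blob container "share01" is visible under the mount point
ls $AZ_BATCHAI_MOUNT_ROOT/bfs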

When you set the same value for the minimum (min) and maximum (max) node count in the cluster creation options, auto-scaling is disabled and the cluster uses a fixed number of nodes.
When you set different values, auto-scaling is enabled automatically. (See the following results.)
You can check whether auto-scaling is enabled with the "az batchai cluster show" command.

Note : When using the fixed nodes, you can change the node count with “az batchai cluster resize“. When using auto-scaling, you can change the auto-scale configuration with “az batchai cluster auto-scale“.

# creating fixed nodes...
az batchai cluster create --name cluster01 \
  --resource-group testrg01 \
  --location eastus \
  --min 2 --max 2 ...


{
  "allocationState": "steady",
  "scaleSettings": {
    "autoScale": null,
    "manual": {
      "nodeDeallocationOption": "requeue",
      "targetNodeCount": 2
    }
  },
  ...
}
# creating auto-scaling nodes...
az batchai cluster create --name cluster01 \
  --resource-group testrg01 \
  --location eastus \
  --min 1 --max 3 ...


{
  "allocationState": "steady",
  "scaleSettings": {
    "autoScale": {
      "initialNodeCount": 0,
      "maximumNodeCount": 3,
      "minimumNodeCount": 1
    },
    "manual": null
  },
  ...
}

When your cluster is ready, you can now submit the training job!

Before starting your job, you must first create the following JSON job definition.
With this definition, the command "python Train_MNIST.py {dir for training data} {dir for model output}" will be launched with MPI (via the "mpirun" command). Therefore the training workloads run on all nodes in parallel, and the results are exchanged with each other.

Note : When you set "nodeCount" to 2 as follows, 2 computing nodes are written to the MPI host file (located at $AZ_BATCHAI_MPI_HOST_FILE), and this file is used by the "mpirun" command.

Note : Here we use the settings for CNTK (Cognitive Toolkit), but you can also specify framework settings for TensorFlow, Caffe (incl. Caffe2), and Chainer with Batch AI.
I'll explain more about the job definition file later.

job.json

{
  "properties": {
    "nodeCount": 2,
    "cntkSettings": {
      "pythonScriptFilePath": "$AZ_BATCHAI_MOUNT_ROOT/bfs/Train_MNIST.py",
      "commandLineArgs": "$AZ_BATCHAI_MOUNT_ROOT/bfs $AZ_BATCHAI_MOUNT_ROOT/bfs/model"
    },
    "stdOutErrPathPrefix": "$AZ_BATCHAI_MOUNT_ROOT/bfs/output"
  }
}

You can run this job with the following command.
As I mentioned in my earlier post (see "Azure Batch – Walkthrough and How it works"), the job runs under an auto-user account named "_azbatch", not under the provisioned user account ("tsmatsuz" in my sample).

# run job !
az batchai job create --name job01 \
  --cluster-name cluster01 \
  --resource-group testrg01 \
  --location eastus \
  --config job.json

After you have submitted the job, you can see the job processing state (queued, running, succeeded, etc) with the following command.

# see the job status
az batchai job list -o table


Name    Resource Group    Cluster    Cluster RG    Tool      Nodes  State        Exit code
------  ----------------  ---------  ------------  ------  -------  ---------  -----------
job01   testrg01          cluster01  testrg01      cntk          2  succeeded            0

The results (the generated model and the console output) are located in the shared (mounted) folder, and you can download these files. (As I'll explain later, you can also get the download URLs with a CLI command.)

When you want to terminate a queued job, run the following command.

# terminate job
az batchai job terminate \
  --name job01 \
  --resource-group testrg01

 

More about job definitions

Here I show you more about job definitions (job.json).

When you want to use your own favorite framework, such as Keras or MXNet, rather than one of the Batch AI-supported frameworks, you can use customToolkitSettings in the job definition.
For example, you can define the previous CNTK distributed training using "customToolkitSettings" as follows. (The result is the same as before.)

Using custom settings (job.json)

{
  "properties": {
    "nodeCount": 2,
    "customToolkitSettings": {
      "commandLine": "mpirun --mca btl_tcp_if_include eth0 --hostfile $AZ_BATCHAI_MPI_HOST_FILE python $AZ_BATCHAI_MOUNT_ROOT/bfs/Train_MNIST.py $AZ_BATCHAI_MOUNT_ROOT/bfs $AZ_BATCHAI_MOUNT_ROOT/bfs/model"
    },    
    "stdOutErrPathPrefix": "$AZ_BATCHAI_MOUNT_ROOT/bfs/output"
  }
}

In the previous example, I configured the command using only "$AZ_BATCHAI_MOUNT_ROOT" in the job definition (job.json). But you can also write the definition as follows for better modularity. (Here we're defining the extra variables "AZ_BATCHAI_INPUT_MYSCRIPT", "AZ_BATCHAI_INPUT_MYDATA", and "AZ_BATCHAI_OUTPUT_MYMODEL".)

With inputDirectories and outputDirectories (job.json)

{
  "properties": {
    "nodeCount": 2,
    "inputDirectories": [
      {
        "id": "MYSCRIPT",
        "path": "$AZ_BATCHAI_MOUNT_ROOT/bfs"
      },
      {
        "id": "MYDATA",
        "path": "$AZ_BATCHAI_MOUNT_ROOT/bfs"
      }
    ],
    "outputDirectories": [
      {
        "id": "MYMODEL",
        "pathPrefix": "$AZ_BATCHAI_MOUNT_ROOT/bfs",
        "pathSuffix": "model"
      }
    ],
    "cntkSettings": {
      "pythonScriptFilePath": "$AZ_BATCHAI_INPUT_MYSCRIPT/Train_MNIST.py",
      "commandLineArgs": "$AZ_BATCHAI_INPUT_MYDATA $AZ_BATCHAI_OUTPUT_MYMODEL"
    },
    "stdOutErrPathPrefix": "$AZ_BATCHAI_MOUNT_ROOT/bfs/output"
  }
}

The benefit of writing it this way is not only modularity.
When you use "outputDirectories" as above, you can get the details of the output files by the id of the "outputDirectories" entry, as follows. Here there are 3 files (ConvNet_MNIST, ConvNet_MNIST.ckp, ConvNet_MNIST.dnn) in the directory AZ_BATCHAI_OUTPUT_MYMODEL.
(When you want to access the job's stdout and stderr, use "stdouterr" as the directory id, as shown after the listing below.)

# list files in "MYMODEL"
az batchai job list-files \
  --name job03 \
  --output-directory-id MYMODEL \
  --resource-group testrg01

[
  {
    "contentLength": 415115,
    "downloadUrl": "https://batchaistore01.blob.core.windows.net/share01/b3a...",
    "name": ".../outputs/model/ConvNet_MNIST"
  },
  {
    "contentLength": 1127,
    "downloadUrl": "https://batchaistore01.blob.core.windows.net/share01/b3a...",
    "name": ".../outputs/model/ConvNet_MNIST.ckp"
  },
  {
    "contentLength": 409549,
    "downloadUrl": "https://batchaistore01.blob.core.windows.net/share01/b3a...",
    "name": ".../outputs/model/ConvNet_MNIST.dnn"
  }
]
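
Similarly, the job's console output can be listed with the same command by switching the directory id to "stdouterr" (a sketch based on the command above):

# list stdout / stderr files of the job
az batchai job list-files \
  --name job03 \
  --output-directory-id stdouterr \
  --resource-group testrg01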

You can also view the results with integrated UI in Azure Portal or AI tools, etc.

Azure Portal

AI tools for Visual Studio

In the previous example, we used only built-in Python modules.
If you want to install additional modules, use a job preparation task as follows.

Invoking pre-task (job.json)

{
  "properties": {
    "nodeCount": 1,
    "jobPreparation": {
      "commandLine": "pip install mpi4py"
    },
    ...
  }
}

You can also run the commands from an .sh file as follows.

Invoking pre-task (job.json)

{
  "properties": {
    "nodeCount": 1,
    "jobPreparation": {
      "commandLine": "bash $AZ_BATCHAI_MOUNT_ROOT/myprepare.sh"
    },
    ...
  }
}

As I mentioned earlier, you can use a docker image instead of provisioning the DSVM.
With the following definition, you can use a plain Ubuntu VM for the cluster creation and use a docker image for provisioning the software.

With docker image (job.json)

{
  "properties": {
    "nodeCount": 2,
    "cntkSettings": {
      "pythonScriptFilePath": "$AZ_BATCHAI_MOUNT_ROOT/bfs/Train_MNIST.py",
      "commandLineArgs": "$AZ_BATCHAI_MOUNT_ROOT/bfs $AZ_BATCHAI_MOUNT_ROOT/bfs/model"
    },
    "stdOutErrPathPrefix": "$AZ_BATCHAI_MOUNT_ROOT/bfs/output",
    "containerSettings": {
      "imageSourceRegistry": {
        "image": "microsoft/cntk:2.1-gpu-python3.5-cuda8.0-cudnn6.0"
      }
    }
  }
}

 

Next I'll show you the Azure Distributed Data Engineering Toolkit (aztk), which is also built on top of Azure Batch and provides a Spark cluster with docker images.
Enjoy your AI work with Azure!

 

[Reference]

Github – Azure Batch AI recipes
https://github.com/Azure/BatchAI/tree/master/recipes

Walkthrough – Distributed training with CNTK (Cognitive Toolkit)

The advantage of CNTK is not only its performance: with its built-in capabilities it can also flexibly support multiple devices (GPUs) or multiple machines, and it scales rapidly along with the number of processes.

In an old post I showed the idea of distributed training with MXNet and R samples (see "Accelerate MXNet R training by GPUs and multiple machines"). Here I show you the same technique with the Cognitive Toolkit (CNTK).
In this post we use Python, but you can also use the CNTK R bindings.

Setting up your machines

Before starting, you must provision and configure your computing machines (nodes) as follows. If you run distributed training with multiple machines, all machines should be configured the same way.

  • Configure the network for inter-node communication.
  • When using GPUs, install (compile) and configure the drivers and related libraries (CUDA, cuDNN, etc).
  • Install and set up Open MPI on all machines.
  • Put the following program (.py) and data on all machines. It's better to use an NFS-shared (mounted) directory and place all files in this directory.

Moreover, you must set up trust between the host machines.
Here we create the key pair on the working machine with the following command. (The key pair "id_rsa" and "id_rsa.pub" is generated in the .ssh folder. During creation, we leave the passphrase blank.)
After that, copy the generated public key (id_rsa.pub) into the {home of the same user id}/.ssh directory on the other remote machines. The file name must be changed to "authorized_keys".

ssh-keygen -t rsa

ls -al .ssh

drwx------ 2 tsmatsuz tsmatsuz 4096 Feb 21 05:01 .
drwxr-xr-x 7 tsmatsuz tsmatsuz 4096 Feb 21 04:52 ..
-rw------- 1 tsmatsuz tsmatsuz 1766 Feb 21 05:01 id_rsa
-rw-r--r-- 1 tsmatsuz tsmatsuz  403 Feb 21 05:01 id_rsa.pub
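
For example, the public key can be registered on the remote host (10.0.0.5 in the following example) roughly as follows. (A sketch; it assumes the .ssh directory already exists on the remote side.)

# copy the public key to the remote host as authorized_keys
scp ~/.ssh/id_rsa.pub tsmatsuz@10.0.0.5:~/.ssh/authorized_keys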

Let's confirm that you can run a command (pwd) on the remote host without any prompt, using the following command. If it succeeds, the current working directory on the remote host is returned. (Here we assume that 10.0.0.5 is the IP address of the remote host.)

ssh -o StrictHostKeyChecking=no 10.0.0.5 -p 22 pwd

/home/tsmatsuz

The easiest way of provisioning is to use the Data Science Virtual Machine (DSVM) in Azure, or to use a docker image (microsoft/cntk:2.1-gpu-python3.5-cuda8.0-cudnn6.0, etc) on a docker cluster. Using the DSVM, you don't need to set up all the dependent libraries; everything is pre-configured, including TensorFlow, CNTK, and MXNet with GPU support. The only thing you have to do is configure the inter-node communication.
You can also use an InfiniBand network for inter-node communication on the Azure infrastructure.

Note : In my next post I'll show you an even easier way of provisioning with Azure Batch AI. Using Batch AI, you can provision your computing nodes, including the inter-node communication configuration, with only one command!

Our Sample – Single process

Now let's start programming. Here we use the brief MNIST (handwritten digit images) sample.

First, download the sample data (Train-28x28_cntk_text.txt and Test-28x28_cntk_text.txt) from the Batch AI sample (BatchAIQuickStart.zip). This data is in CNTK text format (CTF), with the probabilities (0 – 1) as "labels" and the 28 x 28 (=784) gray-scale pixel values (0-255) as "features", as follows.

Train-28x28_cntk_text.txt, Test-28x28_cntk_text.txt

|labels 0 0 0 0 0 0 0 1 0 0 |features 0 3 18 18 126...
|labels 0 1 0 0 0 0 0 0 0 0 |features 0 0 139 0 253...
...

Here we start with the following brief training code for a convolutional neural network (LeNet). As you can see, this code saves the trained model after the training is done.

Train_MNIST.py

import numpy as np
import sys
import os
import cntk

# folder which includes "Train-28x28_cntk_text.txt"
data_path = sys.argv[1]
# folder for model output
model_path = sys.argv[2]

def train_mnist():
  # variables denoting the features and label data
  input_var = cntk.input_variable((1, 28, 28))
  label_var = cntk.input_variable(10)

  # instantiate the feedforward classification model
  with cntk.layers.default_options(
    init = cntk.layers.glorot_uniform(),
    activation=cntk.ops.relu):
    h = input_var/255.0
    h = cntk.layers.Convolution2D((5,5), 32, pad=True)(h)
    h = cntk.layers.MaxPooling((3,3), (2,2))(h)
    h = cntk.layers.Convolution2D((3,3), 48)(h)
    h = cntk.layers.MaxPooling((3,3), (2,2))(h)
    h = cntk.layers.Convolution2D((3,3), 64)(h)
    h = cntk.layers.Dense(96)(h)
    h = cntk.layers.Dropout(0.5)(h)
    z = cntk.layers.Dense(10, activation=None)(h)

  # create source for training data
  reader_train = cntk.io.MinibatchSource(
    cntk.io.CTFDeserializer(
      os.path.join(data_path, 'Train-28x28_cntk_text.txt'),
      cntk.io.StreamDefs(
        features = cntk.io.StreamDef(field='features', shape=28 * 28),
        labels = cntk.io.StreamDef(field='labels', shape=10)
      )),
      randomize=True,
      max_sweeps = cntk.io.INFINITELY_REPEAT)  

  # create trainer
  epoch_size = 60000
  minibatch_size = 64
  max_epochs = 40

  lr_schedule = cntk.learning_rate_schedule(
    0.2,
    cntk.UnitType.minibatch)
  learner = cntk.sgd(
    z.parameters,
    lr_schedule)

  progress_printer = cntk.logging.ProgressPrinter(
    tag='Training',
    num_epochs=max_epochs)
  ce = cntk.losses.cross_entropy_with_softmax(
    z,
    label_var)
  pe = cntk.metrics.classification_error(
    z,
    label_var)
  trainer = cntk.Trainer(
    z,
    (ce, pe),
    learner,
    progress_printer)

  # define mapping from reader streams to network inputs
  input_map = {
    input_var : reader_train.streams.features,
    label_var : reader_train.streams.labels
  }

  cntk.logging.log_number_of_parameters(z) ; print()

  # Train with minibatches
  for epoch in range(max_epochs):
    sample_count = 0
    while sample_count < epoch_size:
      # fetch minibatch
      data = reader_train.next_minibatch(
        min(minibatch_size, epoch_size - sample_count),
        input_map=input_map)
      # update model with it
      trainer.train_minibatch(data)
      # count samples processed so far
      sample_count += data[label_var].num_samples

    # print result
    trainer.summarize_training_progress()

  # save model
  z.save(
    os.path.join(
      model_path,
      "ConvNet_MNIST.dnn".format(epoch)))

  return

if __name__=='__main__':
  train_mnist()

After the model is created, you can load the generated model and predict (evaluate) the values from the handwritten images as follows.

Test_MNIST.py

import numpy as np
import sys
import os
import cntk
from cntk.ops.functions import load_model

# folder which includes "Test-28x28_cntk_text.txt"
data_path = sys.argv[1]
# folder for model data
model_path = sys.argv[2]

def test_mnist():
  # variables denoting the features and label data
  input_var = cntk.input_variable((1, 28, 28))
  label_var = cntk.input_variable(10)

  # load model
  z = load_model(os.path.join(
    model_path,
    "ConvNet_MNIST.dnn"))
  out = cntk.softmax(z)

  # load test data
  reader_test = cntk.io.MinibatchSource(
    cntk.io.CTFDeserializer(
      os.path.join(data_path, 'Test-28x28_cntk_text.txt'),
      cntk.io.StreamDefs(
        features  = cntk.io.StreamDef(field='features', shape=28 * 28),
        labels = cntk.io.StreamDef(field='labels', shape= 10)
      )),
      randomize=False,
      max_sweeps = 1)

  # fetch first 25 rows as min batch.
  test_size = 25
  data = reader_test.next_minibatch(
    test_size,
    input_map={
      input_var : reader_test.streams.features,
      label_var : reader_test.streams.labels
    })
  data_label = data[label_var].asarray()
  data_input = data[input_var].asarray()
  data_input = np.reshape(
    data_input,
    (test_size, 1, 28, 28))

  # predict
  predicted_result = [out.eval(data_input[i]) for i in range(len(data_input))]

  # find the index with the maximum probability
  pred = [np.argmax(predicted_result[i]) for i in range(len(predicted_result))]

  # actual label
  actual = [np.argmax(data_label[i]) for i in range(len(data_label))]

  # print result !
  print("Label:", actual[:25])
  print("Predicted:", pred)

  return

if __name__=='__main__':
  test_mnist()

This code runs as one single process. When you run it, you get the arrays of predicted values and actual (true) values as follows. (See the output result.)

# train and save the model
python Train_MNIST.py /home/tsmatsuz/work /home/tsmatsuz/work/model
# evaluate (predict) with generated model
python Test_MNIST.py /home/tsmatsuz/work /home/tsmatsuz/work/model

Modify our code for distributed training

With CNTK's built-in functions, you can easily change your program for distributed training.

Here is our code for the distributed training version. (The highlighted lines are the modified code.)
As you can see, you use the built-in learner for distributed training (data_parallel_distributed_learner) instead of the local learner. Moreover, instead of using the train_minibatch() method, you define a training session (session = training_session(...)) and start this session with session.train().

Train_MNIST.py (Distributed)

# modified for the distributed training
# (see highlighted)

import numpy as np
import sys
import os
import cntk
from cntk.train.training_session import *

# folder which includes "Train-28x28_cntk_text.txt"
data_path = sys.argv[1]
# folder for model output
model_path = sys.argv[2]

# Creates and trains a feedforward classification model for MNIST images
def train_mnist():
  # variables denoting the features and label data
  input_var = cntk.input_variable((1, 28, 28))
  label_var = cntk.input_variable(10)

  # instantiate the feedforward classification model
  with cntk.layers.default_options(
    init = cntk.layers.glorot_uniform(),
    activation=cntk.ops.relu):
    h = input_var/255.0
    h = cntk.layers.Convolution2D((5,5), 32, pad=True)(h)
    h = cntk.layers.MaxPooling((3,3), (2,2))(h)
    h = cntk.layers.Convolution2D((3,3), 48)(h)
    h = cntk.layers.MaxPooling((3,3), (2,2))(h)
    h = cntk.layers.Convolution2D((3,3), 64)(h)
    h = cntk.layers.Dense(96)(h)
    h = cntk.layers.Dropout(0.5)(h)
    z = cntk.layers.Dense(10, activation=None)(h)

  # create source for training data
  reader_train = cntk.io.MinibatchSource(
    cntk.io.CTFDeserializer(
      os.path.join(data_path, 'Train-28x28_cntk_text.txt'),
      cntk.io.StreamDefs(
        features = cntk.io.StreamDef(field='features', shape=28 * 28),
        labels = cntk.io.StreamDef(field='labels', shape=10)
      )),
      randomize=True,
      max_samples = 60000,
      multithreaded_deserializer=True)

  # create trainer
  epoch_size = 60000
  minibatch_size = 64
  max_epochs = 40

  lr_schedule = cntk.learning_rate_schedule(
    0.2,
    cntk.UnitType.minibatch)
  learner = cntk.sgd(
    z.parameters,
    lr_schedule)
  dist_learner = cntk.train.distributed.data_parallel_distributed_learner(
    learner = learner,
    num_quantization_bits=32, # non-quantized SGD
    distributed_after=0) # all data is parallelized

  progress_printer = cntk.logging.ProgressPrinter(
    tag='Training',
    num_epochs=max_epochs,
    freq=100,
    log_to_file=None,
    rank=cntk.train.distributed.Communicator.rank(),
    gen_heartbeat=True,
    distributed_freq=None)
  ce = cntk.losses.cross_entropy_with_softmax(
    z,
    label_var)
  pe = cntk.metrics.classification_error(
    z,
    label_var)
  trainer = cntk.Trainer(
    z,
    (ce, pe),
    dist_learner,
    progress_printer)

  # define mapping from reader streams to network inputs
  input_map = {
    input_var : reader_train.streams.features,
    label_var : reader_train.streams.labels
  }

  cntk.logging.log_number_of_parameters(z) ; print()

  session = training_session(
    trainer=trainer,
    mb_source = reader_train,
    model_inputs_to_streams = input_map,
    mb_size = minibatch_size,
    progress_frequency=epoch_size,
    checkpoint_config = CheckpointConfig(
      frequency = epoch_size,
      filename = os.path.join(
        model_path, "ConvNet_MNIST"),
      restore = False),
    test_config = None
  )
  session.train()

  # save model only by one process
  if 0 == cntk.Communicator.rank():
    z.save(
      os.path.join(
        model_path,
        "ConvNet_MNIST.dnn"))

  # clean-up distribution
  cntk.train.distributed.Communicator.finalize();

  return

if __name__=='__main__':
  train_mnist()

CNTK supports several types of parallel (distributed) training algorithms, and here we're using data-parallel training (data_parallel_distributed_learner), which distributes the workload across the mini-batch data and aggregates (exchanges) the results from the multiple processes. (Here we only distribute the data and do not use the quantized-gradient methodology.)
See "CNTK – Multiple GPUs and Machines" for details about the parallel algorithms.

For example, when you run the following command on a single computing host, 2 worker processes run on that host and exchange their results with each other.
After the training is done, only one process (rank=0) saves the trained model (ConvNet_MNIST.dnn) in the model folder.
Then you can evaluate (predict) using the generated model in the same way as above. (See Test_MNIST.py for the prediction.)

mpiexec --npernode 2 python Train_MNIST.py /home/tsmatsuz/work /home/tsmatsuz/work/model

When you run on multiple computing machines (nodes), first create the following hosts file on your working node. This file lists the computing nodes on which the training workloads will be started by MPI. (Here we use only 2 nodes for distributed training.)

hosts

10.0.0.4 slots=1 # 1 worker on node0
10.0.0.5 slots=1 # 1 worker on node1

Next, place the program (Train_MNIST.py) and the data (Train-28x28_cntk_text.txt) in the shared (mounted) folder, and make sure that every node can access both the program and the data.
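
A quick way to check this is to list the shared folder from the remote node (a sketch, using the remote host from the earlier example):

# verify that the remote node also sees the shared program and data
ssh 10.0.0.5 ls /mnt/test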

Finally, run the following command on your working node. ("/mnt/test" is the shared folder and "eth0" is the network interface name on my working node.)
Before training, CNTK looks for an InfiniBand network, and it is used automatically if available. Then the training workloads run on all nodes in parallel and the results are exchanged.

Note: You can suppress the lookup for the InfiniBand network with the "-mca btl ^openib" option.

mpiexec --mca btl_tcp_if_include eth0 \
  --hostfile hosts \
  python /mnt/test/Train_MNIST.py /mnt/test /mnt/test/model

 

Next time I'll show you Azure Batch AI. You will see how easy it makes distributed training workloads! (This post is the preparation for that next lesson…)

 

[Reference]

Cognitive Toolkit – Multiple GPUs and Machines
https://docs.microsoft.com/en-us/cognitive-toolkit/Multiple-GPUs-and-machines

Azure Batch – Walkthrough and How it works

Batch processing is another area where massive cloud computing shines. In this post and the next, I show you how Azure Batch and Batch AI work and how to use them; here I focus on the Azure Batch fundamentals for your essential understanding.
(Note added 01/30/2018 : I added another post about Batch AI.)

When you use cloud computing resources for critical batch execution, you would otherwise have to handle everything yourself, such as constructing the infrastructure (virtual machines, virtual networks, etc with an ARM template), provisioning libraries and applications, providing a queue, scaling, security, monitoring, collecting results, clean-up, and so on. Azure Batch helps you run these batch provisioning and execution workloads at massive cloud-computing scale.

In this post, we mainly use the Azure CLI so that it's easy to trace and check what happens. But remember that you can also use the UI (Azure Portal), PowerShell, the REST API, or language SDKs such as .NET (C#), Java, Python, and R (the doAzureParallel and rAzureBatch packages).

Before starting …

Before starting, select "New" – "Batch Service" in the Azure Portal and create a Batch account. (Only for creating the batch account do we use the Azure Portal; you can also create the batch account with the Azure CLI, as sketched below.)
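
For reference, the CLI version looks roughly like this (a sketch; the location is just an example):

# create the batch account with the Azure CLI instead of the Portal
az batch account create \
  --name test01 \
  --resource-group testrg01 \
  --location westus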

First, log in to the Azure batch account with the Azure CLI as follows. In this post we're using the batch account named "test01" in the resource group "testrg01".
Here we're using the interactive login UI, but you can also log in silently with a service principal and password (see my old post "How to use Azure REST API with Certificate programmatically (without interactive UI)").

# you login Azure with interactive UI
az login
# you login batch account
az batch account login \
 --resource-group testrg01 \
 --name test01

Preparing batch pool (computing nodes)

Before executing a batch, you must prepare the application package and the batch pool.
The application package is the archived executables (and related files) in compressed zip format. Now compress your executable applications (.exe, .dll, etc) into a zip file, and upload this zipped file (myapp-exe.zip) as the batch application package with the following commands.
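
For example, the archive itself can be created like this. (A trivial sketch; the file names are only illustrative.)

# compress the executable and its dependencies into the application package
zip myapp-exe.zip myapp.exe myapp.dll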

# Upload and register your archive as application package
az batch application package create \
  --resource-group testrg01 \
  --name test01 \
  --application-id app01 \
  --package-file myapp-exe.zip \
  --version 1.0
# Set this version of package as default version
az batch application set \
  --resource-group testrg01 \
  --name test01 \
  --application-id app01 \
  --default-version 1.0

Next you must provision the batch pool with the application package as follows.
The batch pool is the set of computing nodes in Azure. Here we're creating 3 nodes with a Windows image (publisher "MicrosoftWindowsServer", offer "WindowsServer", sku "2016-Datacenter", node agent SKU ID "batch.node.windows amd64"), but of course you can also use Linux computing nodes for the batch pool.

As you can see below, here we are using Standard A1 virtual machines, but you can also use the NC-series (which has Tesla GPUs) for machine learning workloads. Moreover, you can use the Data Science Virtual Machine (DSVM) for Azure Batch, which is a pre-configured machine (including Python, R, CNTK, TensorFlow, MXNet, etc with GPU support) for machine learning and deep learning.
Please see "Azure Batch – List of virtual machine images" for the supported machine images.

az batch pool create \
 --id pool01 \
 --target-dedicated 3 \
 --image MicrosoftWindowsServer:WindowsServer:2016-Datacenter:latest \
 --node-agent-sku-id "batch.node.windows amd64" \
 --vm-size Standard_A1 \
 --application-package-references app01#1.0

On batch pool creation, the packaged application is extracted into the directory "D:\batch\tasks\apppackages\{folder name derived from your package name}" (/mnt/batch/tasks/apppackages/{…} on Linux), and this path is available through the environment variable %AZ_BATCH_APP_PACKAGE_app01#1.0% ("app01" is your app package name and "1.0" is the version) in your batch process.
For details about environment variables, please refer to "Azure Batch compute node environment variables".

When some setup is needed on each node, you can use the start-task setting.
For example, when your application needs an installation instead of just extracting a zip archive, you can specify the following installation task (see the --start-task-command-line option) on pool creation. This setting copies https://mystorage.blob.core.windows.net/fol01/myapp.msi as myapp.msi into the current working directory and runs msiexec as a silent installation.

Note : The errors and output are written to files (stderr.txt and stdout.txt) in the %AZ_BATCH_NODE_STARTUP_DIR% directory (D:\batch\tasks\startup\ on Windows, /mnt/batch/tasks/startup on Linux) on each computing node.

az batch pool create \
 --id pool08 \
 --target-dedicated 3 \
 --image MicrosoftWindowsServer:WindowsServer:2016-Datacenter:latest \
 --node-agent-sku-id "batch.node.windows amd64" \
 --vm-size Standard_A1 \
 --application-package-references app08#1.0 \
 --start-task-command-line "cmd /c msiexec /i myapp.msi /quiet" \
 --start-task-resource-files "myapp.msi=https://mystorage.blob.core.windows.net/fol01/myapp.msi" \
 --start-task-wait-for-success

When you want to set up a distributed training environment with some deep learning library (CNTK etc), you should set up MPI with an admin-elevated installation. But unfortunately there are no command options in the Azure Batch CLI for admin-elevated execution.
In such a case, you can specify the detailed parameters in JSON format as follows. This JSON uses the same format as the batch REST API or an ARM template. (See the reference for details.)

Note : See "Cognitive Toolkit : Multiple GPUs and Machines" for distributed training with CNTK. For distributed training with MXNetR, see my old post "Accelerate MXNet R training by multiple machines and GPUs".
For distributed training, some kind of mechanism for inter-node communication is needed.

command

# Install MPI in start-task with admin elevated privileges
# (see the following "pooldef.json")
az batch pool create --json-file pooldef.json

pooldef.json (parameter’s definition)

{
  "id": "pool01",
  "virtualMachineConfiguration": {
    "imageReference": {
      "publisher": "MicrosoftWindowsServer",
      "offer": "WindowsServer",
      "sku": "2016-Datacenter",
      "version": "latest"
    },
    "nodeAgentSKUId": "batch.node.windows amd64"
  },
  "vmSize": "Standard_A1",
  "targetDedicatedNodes": "3",
  "enableInterNodeCommunication": true,
  "maxTasksPerNode": 1,
  "applicationPackageReferences": [
    {
      "applicationId": "app01",
      "version": "1.0"
    }
  ],
  "startTask": {
    "commandLine": "cmd /c MSMpiSetup.exe -unattend -force",
    "resourceFiles": [
      {
        "blobSource": "https://mystorage.blob.core.windows.net/fol01/MSMpiSetup.exe",
        "filePath": "MSMpiSetup.exe"
      }
    ],
    "userIdentity": {
      "autoUser": {
        "elevationLevel": "admin"
      }
    },
    "waitForSuccess": true
  }
}

When you want to deploy pre-configured images, you can use custom VM images or docker images (containerized machine images) with the batch pool.

Note : There are two ways to use docker containers with Azure Batch.
The first approach is to use a start task to set up docker on the host machine (and run your batch task commands with the "docker run" command).
The second approach is to create the pool with a machine image running docker and a container image setting via the containerConfiguration parameter. With the second approach, your batch task commands run inside docker containers, not directly on the virtual machine. See "Run container applications on Azure Batch" for details.

When the pool's status becomes "steady" and each computing node's state becomes "idle" without any errors, it's ready for your job execution!

# show pool's status
az batch pool show --pool-id pool01
{
  "id": "pool01",
  "allocationState": "steady",
  "applicationPackageReferences": [
    {
      "applicationId": "app07",
      "version": "1.0"
    }
  ],
  "currentDedicatedNodes": 3,
  "currentLowPriorityNodes": 0,
  "enableAutoScale": false,
  "enableInterNodeCommunication": false,
  "state": "active",
  ...
}
# show node's state
az batch node list --pool-id pool01
[
  {
    "id": "tvm-1219235766_1-20171207t021610z",
    "affinityId": "TVM:tvm-1219235766_1-20171207t021610z",
    "state": "idle",
    "errors": null,
    "endpointConfiguration": {
      "inboundEndpoints": [
        {
          "backendPort": 3389,
          "frontendPort": 50000,
          "name": "SSHRule.0",
          "protocol": "tcp",
          "publicFqdn": "dnsazurebatch-7e6395de-5c8f-444a-9cb6-f7bd809e2d3c-c.westus.cloudapp.azure.com",
          "publicIpAddress": "104.42.129.75"
        }
      ]
    },
    "ipAddress": "10.0.0.4",
    ...
  },
  {
    "id": "tvm-1219235766_2-20171207t021610z",
    "affinityId": "TVM:tvm-1219235766_2-20171207t021610z",
    "state": "idle",
    "errors": null,
    "endpointConfiguration": {
      "inboundEndpoints": [
        {
          "backendPort": 3389,
          "frontendPort": 50002,
          "name": "SSHRule.2",
          "protocol": "tcp",
          "publicFqdn": "dnsazurebatch-7e6395de-5c8f-444a-9cb6-f7bd809e2d3c-c.westus.cloudapp.azure.com",
          "publicIpAddress": "104.42.129.75"
        }
      ]
    },
    "ipAddress": "10.0.0.6",
    ...
  },
  {
    "id": "tvm-1219235766_3-20171207t021610z",
    "affinityId": "TVM:tvm-1219235766_3-20171207t021610z",
    "state": "idle",
    "errors": null,
    "endpointConfiguration": {
      "inboundEndpoints": [
        {
          "backendPort": 3389,
          "frontendPort": 50001,
          "name": "SSHRule.1",
          "protocol": "tcp",
          "publicFqdn": "dnsazurebatch-7e6395de-5c8f-444a-9cb6-f7bd809e2d3c-c.westus.cloudapp.azure.com",
          "publicIpAddress": "104.42.129.75"
        }
      ]
    },
    "ipAddress": "10.0.0.5",
    ...
  }
]

Manage your pool (computing nodes)

The batch pool is based on Azure Virtual Machine Scale Sets (VMSS), and you can manage these nodes like a VMSS with the batch CLI.

For example, you can easily change the number of nodes (scale out and scale in) with the following command. Here we're changing the number of nodes to 5.

az batch pool resize \
  --pool-id pool01 \
  --target-dedicated-nodes 5

Only the nodes are charged in Azure Batch. So if you want cost-effective consumption, you can also enable automatic scaling (auto-scaling) for the pool, as sketched below. Moreover, you can use low-priority VMs for the batch pool, which are very low-priced virtual machines. (We used on-demand VMs in the examples above.)
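
For example, auto-scaling can be enabled on an existing pool with an auto-scale formula, roughly as follows. (A sketch; the formula here is a simple illustrative one that keeps between 0 and 5 dedicated nodes depending on the pending tasks.)

az batch pool autoscale enable \
  --pool-id pool01 \
  --auto-scale-formula '$TargetDedicatedNodes = max(0, min(5, avg($PendingTasks.GetSample(TimeInterval_Minute * 5))));'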

By default, tasks run under an auto-user account (named "_azbatch"). This account is a built-in user account created automatically by the Batch service. When you want to log in to the computing nodes, you can add a named user account instead of using the auto-user account.
The following adds a named user account to the node whose node id is "tvm-1219235766_1-20171207t021610z". (This operation can also be done in the Azure Portal UI.)

az batch node user create \
  --pool-id pool07 \
  --node-id tvm-1219235766_1-20171207t021610z \
  --name tsmatsuz \
  --password P@ssw0rd \
  --is-admin

Using the named user account, you can connect to each computing node with RDP (Windows) or SSH (Linux).
With the previous "az batch node list" command, you can get the public IP and the public port mapped by inbound NAT for RDP (port 3389). (See the previous command results.) Or use the "az batch node remote-login-settings show" command for the remote access addresses, as follows.
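
For example, for the first node in the previous listing:

az batch node remote-login-settings show \
  --pool-id pool01 \
  --node-id tvm-1219235766_1-20171207t021610z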

Job and Tasks

One batch execution is controlled by a logical unit called a "job", and each process (command execution) is managed by a "task". One job has many tasks, and the tasks are launched in parallel or with dependencies.

The typical flow of a batch execution is as follows:

  1. Create a job (which is associated with the pool)
  2. Add tasks to the job
  3. Each task is automatically queued and scheduled for execution on the computing nodes.

You can also assign a job manager task, which determines the completion of all other tasks in the job. In this post, we don't use a job manager task (jobManagerTask=null) for simplicity, and we only use tasks executed directly under the job.

How do you collect the results? That depends on your design.
Each task can write its results to a file, or the task's standard output can be captured in a text file. After all tasks complete, you collect (download) these outputs from the VMs and build one combined result.
Another approach is to pass connection information (connection string, credentials, etc) for a shared store (blob, db, etc) to the tasks, and all tasks write their results into this shared storage.
In this post, we use the first approach.

Now let's see the following example. (As I mentioned above, %AZ_BATCH_APP_PACKAGE_app01#1.0% is an environment variable available to the batch task.)
After you add a task to a job, the task is automatically scheduled for execution. Even when all tasks have completed, no action occurs by default. Therefore you must set "terminateJob" as the --on-all-tasks-complete option, so that the job is automatically completed after all tasks are completed.

Note : You can also run your task as an admin-elevated process, in the same way as the previous start-task. (Specify a json file and set the detailed parameters as in the previous start-task.)

# Create Job
az batch job create \
  --id job01 \
  --pool-id pool01
# Create Tasks
az batch task create \
  --job-id job01 \
  --task-id task01 \
  --application-package-references app01#1.0 \
  --command-line "cmd /c %AZ_BATCH_APP_PACKAGE_app01#1.0%\\myapp.exe"
az batch task create \
  --job-id job01 \
  --task-id task02 \
  --application-package-references app01#1.0 \
  --command-line "cmd /c %AZ_BATCH_APP_PACKAGE_app01#1.0%\\myapp.exe"
az batch task create \
  --job-id job01 \
  --task-id task03 \
  --application-package-references app01#1.0 \
  --command-line "cmd /c %AZ_BATCH_APP_PACKAGE_app01#1.0%\\myapp.exe"
az batch task create \
  --job-id job01 \
  --task-id task04 \
  --application-package-references app01#1.0 \
  --command-line "cmd /c %AZ_BATCH_APP_PACKAGE_app01#1.0%\\myapp.exe"
az batch task create \
  --job-id job01 \
  --task-id task05 \
  --application-package-references app01#1.0 \
  --command-line "cmd /c %AZ_BATCH_APP_PACKAGE_app01#1.0%\\myapp.exe"
az batch task create \
  --job-id job01 \
  --task-id task06 \
  --application-package-references app01#1.0 \
  --command-line "cmd /c %AZ_BATCH_APP_PACKAGE_app01#1.0%\\myapp.exe"
# When all tasks are completed, the job will complete
az batch job set \
  --job-id job01 \
  --on-all-tasks-complete terminateJob

You can monitor the state ("active", "completed", etc) of your job and tasks as follows.

# show job's state
az batch job show --job-id job01
{
  "creationTime": "2017-12-12T05:41:31.254257+00:00",
  "executionInfo": {
    "endTime": null,
    "poolId": "pool01",
    "schedulingError": null,
    "startTime": "2017-12-12T05:41:31.291277+00:00",
    "terminateReason": null
  },
  "id": "job01",
  "previousState": null,
  "previousStateTransitionTime": null,
  "state": "active",
  ...
}
# show task's state
az batch task show --job-id job01 --task-id task01
{
  "id": "task01",
  "commandLine": "cmd /c %AZ_BATCH_APP_PACKAGE_app01#1.0%\\\\myapp.exe",
  "constraints": {
    "maxTaskRetryCount": 0,
    "maxWallClockTime": "10675199 days, 2:48:05.477581",
    "retentionTime": "10675199 days, 2:48:05.477581"
  },
  "creationTime": "2017-12-12T05:42:08.762078+00:00",
  "executionInfo": {
    "containerInfo": null,
    "endTime": null,
    "exitCode": null,
    "failureInfo": null,
    "lastRequeueTime": null,
    "lastRetryTime": null,
    "requeueCount": 0,
    "result": null,
    "retryCount": 0,
    "startTime": null
  },
  "previousState": null,
  "previousStateTransitionTime": null,
  "state": "active",
  "stateTransitionTime": "2017-12-12T05:42:08.762078+00:00",
  ...
}

The task's standard output is written to a file on the compute node. You can download these files with the following command.
Here the output file (stdout.txt) is saved (downloaded) as C:\tmp\node01-stdout.txt on the local computer (the one running your Azure CLI). You can view all the files on the compute node with the “az batch node file list --recursive” command, as shown after the download example.

az batch node file download \
  --file-path "D:\\batch\\tasks\\workitems\\job01\\job-1\\task01\\stdout.txt" \
  --destination "C:\\tmp\\node01-stdout.txt" \
  --node-id tvm-1219235766_1-20171207t021610z \
  --pool-id pool01
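
For reference, you can list the files on the node before deciding which ones to download. The following uses the same pool id and node id as above; replace them with your own values.

# List all files on the compute node (including task outputs)
az batch node file list \
  --pool-id pool01 \
  --node-id tvm-1219235766_1-20171207t021610z \
  --recursive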

Each task is randomly assigned to a compute node, and the maximum number of concurrently executing tasks per node is 1 by default.
As a result, the tasks are scheduled and executed as follows.

If you set “2” for maxTasksPerNode at pool creation (see the previous pooldef.json above), the tasks will be executed as follows.

If the output of one task is the input of another, you can also run the tasks with dependencies. (Use the usesTaskDependencies option on job creation and the dependsOn parameter on task creation, as in the sketch below.)
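
The following is a minimal sketch of that flow with Azure CLI. The job id (job02), the file name (taskdef.json), and the task ids are placeholders of mine, and the task json follows the Batch REST API schema, so please check the exact option names with "az batch job create -h" in your CLI version.

# Create a job that allows task dependencies
az batch job create \
  --id job02 \
  --pool-id pool01 \
  --uses-task-dependencies
# Create the first task as usual
az batch task create \
  --job-id job02 \
  --task-id task01 \
  --application-package-references app01#1.0 \
  --command-line "cmd /c %AZ_BATCH_APP_PACKAGE_app01#1.0%\\myapp.exe"
# taskdef.json : task02 runs only after task01 has completed
cat <<'EOF' > taskdef.json
{
  "id": "task02",
  "commandLine": "cmd /c %AZ_BATCH_APP_PACKAGE_app01#1.0%\\myapp.exe",
  "applicationPackageReferences": [
    { "applicationId": "app01", "version": "1.0" }
  ],
  "dependsOn": { "taskIds": [ "task01" ] }
}
EOF
az batch task create \
  --job-id job02 \
  --json-file taskdef.json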

Walkthrough of Azure Cosmos DB Graph (Gremlin)

In this post I show you the common first steps with Azure Cosmos DB Gremlin (graph), and then dive a little into the practical usage of graph queries.

This post will also help beginners learn how to use a graph database in general, since Azure Cosmos DB is compliant with the popular graph framework “Apache TinkerPop”.

Create your database and collection

First you must prepare your database.

With Azure Cosmos DB, you must provision an account, a database, and a collection, just as with the Azure Cosmos DB NoSQL (document) database.
You can create these objects using the API (REST or SDK), but here we use the Azure Portal UI.
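
If you prefer the command line, the account can also be provisioned with Azure CLI, roughly as in the following sketch. The account name, resource group, and capability flag here are placeholders and assumptions of mine (please check them against your CLI version); the database and collection are then created in Data Explorer as described next.

# Create a Cosmos DB account with the Gremlin (graph) API enabled
az cosmosdb create \
  --name graph01 \
  --resource-group myRG01 \
  --capabilities EnableGremlin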

When you create your Azure Cosmos DB account in the Azure Portal, you must select “Gremlin (graph)” as the supported API, as in the following picture.

After you’ve created your account, next you create your database and collection with Data Explorer as follows.
The following “Database id” is the identifier for a database, and “Graph id” is for a collection.

Calling API

Before calling the APIs, please copy your account key from the Azure Portal.

Also copy the Gremlin URI from the Azure Portal.

Now you can develop your application with APIs.

As I mentioned before, Azure Cosmos DB is compliant with the TinkerPop open source framework, so you can use many existing TinkerPop-compatible tools. (See “Apache TinkerPop” for language drivers and other tools.)
For example, if you use PHP, you can easily download and install gremlin-php with the following commands. (You can also use the App Service Editor in Azure App Services, so there is no need to prepare a local machine just for a trial !)

curl http://getcomposer.org/installer | php
php composer.phar require brightzone/gremlin-php "2.*"

The following is a simple PHP program that retrieves all vertices (nodes) in the graph database. (I explain the gremlin language later.)

Note that :

  • host is the host name of the previously copied gremlin uri.
  • username is /dbs/{your database name}/colls/{your collection name}.
  • password is the previously copied account key.
<?php
require_once('vendor/autoload.php');
use Brightzone\GremlinDriver\Connection;

$db = new Connection([
  'host' => 'graph01.graphs.azure.com',
  'port' => '443',
  'graph' => 'graph',
  'username' => '/dbs/db01/colls/test01',
  'password' => 'In12qhzXYz...',
  'ssl' => TRUE
]);
$db->open();
$res = $db->send('g.V()');
$db->close();

// output all vertices in the db
var_dump($res);
?>

The retrieved result is in the so-called GraphSON format, as follows. In this PHP example, the result is deserialized into a PHP array with the same structure.

{
  "id": "u001",
  "label": "person",
  "type": "vertex",
  "properties": {
    "firstName": [
      {
        "value": "John"
      }
    ],
    "age": [
      {
        "value": 45
      }
    ]
  }
}

You can also use the SDK for Azure Cosmos DB graph (.NET, Java, Node.js), which is specific to Azure Cosmos DB.
Especially if you need Cosmos DB-specific operations (ex: creating or managing databases, collections, etc), it's better to use this SDK.

For example, the following is C# example code using the Azure Cosmos DB Graph SDK. (The gremlin language is used here as well; I explain the details later.)
Note that the endpoint differs from the previous gremlin URI.

using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;
using Microsoft.Azure.Documents.Linq;
using Microsoft.Azure.Graphs;
using Newtonsoft.Json;

static void Main(string[] args)
{
  using (DocumentClient client = new DocumentClient(
    new Uri("https://graph01.documents.azure.com:443/"),
    "In12qhzXYz..."))
  {
    DocumentCollection graph = client.CreateDocumentCollectionIfNotExistsAsync(
      UriFactory.CreateDatabaseUri("db01"),
      new DocumentCollection { Id = "test01" },
      new RequestOptions { OfferThroughput = 1000 }).Result;
    // drop all vertex
    IDocumentQuery<dynamic> query1 =
      client.CreateGremlinQuery<dynamic>(graph, "g.V().drop()");
    dynamic result1 = query1.ExecuteNextAsync().Result;
    Console.WriteLine($"{JsonConvert.SerializeObject(result1)}");
    // add vertex
    IDocumentQuery<dynamic> query2 =
      client.CreateGremlinQuery<dynamic>(graph, "g.addV('person').property('id', 'u001').property('firstName', 'John')");
    dynamic result2 = query2.ExecuteNextAsync().Result;
    Console.WriteLine($"{JsonConvert.SerializeObject(result2)}");
  }

  Console.WriteLine("Done !");
  Console.ReadLine();
}

Before building your application with Azure Cosmos DB SDK, you must install Microsoft.Azure.Graphs package with NuGet. (Other dependent libraries like Microsoft.Azure.DocumentDB, etc are also installed in your project.)

Interactive Console and Visualization

As I described above, the TinkerPop framework has various open source utilities contributed by the community.

For example, if you want to run the gremlin language (query, etc) with the interactive console, you can use Gremlin Console.
Please see the official document “Azure Cosmos DB: Create, query, and traverse a graph in the Gremlin console” for details about Gremlin Console with Azure Cosmos DB.

There are also several libraries and tools for visualizing gremlin-compatible graphs in the TinkerPop framework.

If you're using Visual Studio and Azure Cosmos DB, the following GitHub sample (written as an ASP.NET web project) is very easy to use for visualizing an Azure Cosmos DB graph.

[GitHub] Azure-Samples / azure-cosmos-db-dotnet-graphexplorer
https://github.com/Azure-Samples/azure-cosmos-db-dotnet-graphexplorer

Gremlin Language

As you've seen in the previous programming examples, it's very important to understand the gremlin language (queries, etc) for practical use.
Let's dive into the gremlin language, not deeply, but to a practical level of understanding.

First, we simply create vertices (nodes).
The following creates 2 vertices, “John” and “Mary”.

g.addV('employee').property('id', 'u001').property('firstName', 'John').property('age', 44)
g.addV('employee').property('id', 'u002').property('firstName', 'Mary').property('age', 37)

The following creates an edge between the 2 vertices John and Mary. (This sample means that John is Mary's manager.)
As you can see, you can identify the target vertex with the previously set “id” property.

g.V('u002').addE('manager').to(g.V('u001'))

In this post, we use the following simple structure (vertices and edges) for our subsequent examples.

g.addV('employee').property('id', 'u001').property('firstName', 'John').property('age', 44)
g.addV('employee').property('id', 'u002').property('firstName', 'Mary').property('age', 37)
g.addV('employee').property('id', 'u003').property('firstName', 'Christie').property('age', 30)  
g.addV('employee').property('id', 'u004').property('firstName', 'Bob').property('age', 35)
g.addV('employee').property('id', 'u005').property('firstName', 'Susan').property('age', 31)
g.addV('employee').property('id', 'u006').property('firstName', 'Emily').property('age', 29)
g.V('u002').addE('manager').to(g.V('u001'))
g.V('u005').addE('manager').to(g.V('u001'))
g.V('u004').addE('manager').to(g.V('u002'))
g.V('u005').addE('friend').to(g.V('u006'))
g.V('u005').addE('friend').to(g.V('u003'))
g.V('u006').addE('friend').to(g.V('u003'))
g.V('u006').addE('manager').to(g.V('u004'))

The following is an example that retrieves vertices with some query conditions. It retrieves the employees whose age is greater than 40. (If you query edges, use g.E() instead of g.V().)

g.V().hasLabel('employee').has('age', gt(40))

As I described above, the retrieved result is in the so-called GraphSON format, as follows.

{
  "id": "u001",
  "label": "employee",
  "type": "vertex",
  "properties": {
    "firstName": [
      {
        "id": "9a5c0e2a-1249-4e2c-ada2-c9a7f33e26d5",
        "value": "John"
      }
    ],
    "age": [
      {
        "id": "67d681b1-9a24-4090-bac5-be77337ec903",
        "value": 44
      }
    ]
  }
}

You can also use logical operations (and(), or()) in a graph query.
For example, the following returns only “Mary”.

g.V().hasLabel('employee').and(has('age', gt(35)), has('age', lt(40)))

Next we handle traversals. (You can traverse along the edges.)
The following is a simple traversal example, which just retrieves Mary's manager. (The result will be “John”.)

g.V('u002').out('manager').hasLabel('employee')

Note that the following returns the same result. The outE() step returns the edge element, and inV() then gets the incoming (head) vertex. (This explicitly traverses the elements: vertex -> edge -> vertex.)

g.V('u002').outE('manager').inV().hasLabel('employee')

The following retrieves Mary's manager (i.e., “John”) and then all employees who report directly to him (“John”).
The result will be “Mary” and “Susan”.

g.V('u002').out('manager').hasLabel('employee').in('manager').hasLabel('employee')

If you want to omit the repeated elements in path, you can use simplePath() as follows. This returns only “Susan”, because “Mary” is the repeated vertex.

g.V('u002').out('manager').hasLabel('employee').in('manager').hasLabel('employee').simplePath()

Now let's consider traversal of the “friend” relation. (See the picture illustrated above.)
As you know, “manager” is a directed relation, but “friend” is an undirected (bidirectional) relation. That is, if A is a friend of B, B is also a friend of A.
In such a case, you can use the both() (or bothE()) step as follows. The following retrieves Emily's friends, and the result is both “Susan” and “Christie”.

g.V('u006').both('friend').hasLabel('employee')

If you want to traverse until some condition matches, you can use repeat().until().
The following retrieves the reporting path (the relation of direct reports) from “John” to “Emily”.

g.V('u001').repeat(in('manager')).until(has('id', 'u006')).path()

The result is “John” – “Mary” – “Bob” – “Emily” as the following GraphSON.

{
  "labels": [
    ...
  ],
  "objects": [
    {
      "id": "u001",
      "label": "employee",
      "type": "vertex",
      "properties": {
        "firstName": [
          {
            "id": "9a5c0e2a-1249-4e2c-ada2-c9a7f33e26d5",
            "value": "John"
          }
        ],
        "age": [
          {
            "id": "67d681b1-9a24-4090-bac5-be77337ec903",
            "value": 44
          }
        ]
      }
    },
    {
      "id": "u002",
      "label": "employee",
      "type": "vertex",
      "properties": {
        "firstName": [
          {
            "id": "8d3b7a38-5b8e-4614-b2c4-a28306d3a534",
            "value": "Mary"
          }
        ],
        "age": [
          {
            "id": "2b0804e5-58cc-4061-a03d-5a296e7405d9",
            "value": 37
          }
        ]
      }
    },
    {
      "id": "u004",
      "label": "employee",
      "type": "vertex",
      "properties": {
        "firstName": [
          {
            "id": "3b804f2e-0428-402c-aad1-795f692f740b",
            "value": "Bob"
          }
        ],
        "age": [
          {
            "id": "040a1234-8646-4412-9488-47a5af75a7d7",
            "value": 35
          }
        ]
      }
    },
    {
      "id": "u006",
      "label": "employee",
      "type": "vertex",
      "properties": {
        "firstName": [
          {
            "id": "dfb2b624-e145-4a78-b357-5e147c1de7f6",
            "value": "Emily"
          }
        ],
        "age": [
          {
            "id": "f756c2e9-a16d-4959-b9a3-633cf08bcfd7",
            "value": 29
          }
        ]
      }
    }
  ]
}

Finally, let's consider the shortest path from “Emily” to “John”. We assume that you can traverse either “manager” (directed) or “friend” (undirected) edges.

The following returns the possible paths from “Emily” to “John” connected by either “manager” (directed) or “friend” (undirected) edges.

g.V('u006').repeat(union(both('friend').simplePath(), out('manager').simplePath())).until(has('id', 'u001')).path()

This result is 3 paths :
Emily – Susan – John
Emily – Christie – Susan – John
Emily – Bob – Mary – John

When you want to count the number of elements in each path (the current local elements), use the count(local) step.

g.V('u006').repeat(union(both('friend').simplePath(), out('manager').simplePath())).until(has('id', 'u001')).path().count(local)

This result is :
3
4
4

The following groups the paths by their element count, returning both the counts and the paths, as follows.

g.V('u006').repeat(union(both('friend').simplePath(), out('manager').simplePath())).until(has('id', 'u001')).path().group().by(count(local))
{
  "3": [
    {
      "labels": [...],
      "objects": [
        {
          "id": "u006",
          ...
        },
        {
          "id": "u005",
          ...
        },
        {
          "id": "u001",
          ...
        }
      ]
    }
  ],
  "4": [
    {
      "labels": [...],
      "objects": [
        {
          "id": "u006",
          ...
        },
        {
          "id": "u003",
          ...
        },
        {
          "id": "u005",
          ...
        },
        {
          "id": "u001",
          ...
        }
      ]
    },
    {
      "labels": [...],
      "objects": [
        {
          "id": "u006",
          ...
        },
        {
          "id": "u004",
          ...
        },
        {
          "id": "u002",
          ...
        },
        {
          "id": "u001",
          ...
        }
      ]
    }
  ]
}

 

[Reference]

TinkerPop Documentation (including language reference)
http://tinkerpop.apache.org/docs/current/reference/

 

Build your own Web API protected by Azure AD v2.0 endpoint with custom scopes

* This post is about the Azure AD v2.0 endpoint. If you're using v1, please see “Build your own api with Azure AD (written in Japanese)”.

With the Azure AD v2.0 endpoint (and also Azure AD B2C), you can now build your own Web API protected by the OAuth flow and add your own scopes.
Here I show you how to set up, build, and think about custom scopes in the v2.0 endpoint. (You can also learn several OAuth scenarios and ideas through this post.)

Note that, at the time of writing, a Microsoft Account (consumer account) cannot be used for the following scenarios with custom (user-defined) scopes. Please use your organization account (Azure AD account).

Note : For Azure AD B2C, please refer to the post “Azure AD B2C Access Tokens now in public preview” in the team blog.

Register your own Web API

First we register our custom Web API in the v2.0 endpoint, and consent to this app in the tenant.

Please go to Application Registration Portal, and start to register your own Web API by pressing [Add an app] button. In the application settings, click [Add Platform] and select [Web API].

In the added platform pane, you can see the following generated scope (access_as_user) by default.

This scope is used as follows.
For example, when you create a client app that accesses this custom Web API via OAuth, the client requests permission to call the Web API by including this scope value in the following uri.

https://login.microsoftonline.com/common/oauth2/v2.0/authorize
  ?response_type=id_token+code
  &response_mode=form_post
  &client_id=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
  &scope=openid+api%3a%2f%2f8a9c6678-7194-43b0-9409-a3a10c3a9800%2faccess_as_user
  &redirect_uri=https%3A%2F%2Flocalhost%2Ftest
  &nonce=abcdef

Now let's change this default scope and define new read and write scopes as follows. (We assume the scopes are api://8a9c6678-7194-43b0-9409-a3a10c3a9800/read and api://8a9c6678-7194-43b0-9409-a3a10c3a9800/write.)

Note : If admin consent is required for the custom scope (i.e., only an administrator can consent to your scope), please edit the manifest manually.

Next we must also add the “Web” platform (not the “Web API” platform), because the user needs to consent to this API application before using these custom scopes.

For example, remember “Office 365”. Organizations or users who don't purchase (subscribe to) Office 365 cannot use the Office 365 API permissions. (No Office 365 API permissions are displayed in their Azure AD settings.) After you purchase Office 365 at https://portal.office.com/, you can start to use these API permissions.
Your custom API is the same. Before using these custom scopes, the custom Web API application must be brought into the tenant or into the individual user's account through consent.

When a user accesses the following url in a web browser and logs in with their credentials, the following consent UI is displayed. Once the user approves this consent, this custom Web API application is registered in the user's individual permissions. (Note that client_id is the application id of this custom Web API application, and redirect_uri is the redirect url of the “Web” platform in your custom Web API application. Please change these values to match your application settings.)

https://login.microsoftonline.com/common/oauth2/v2.0/authorize
  ?response_type=id_token
  &response_mode=form_post
  &client_id=8a9c6678-7194-43b0-9409-a3a10c3a9800
  &scope=openid
  &redirect_uri=https%3A%2F%2Flocalhost%2Ftestapi
  &nonce=abcdef

Note : You can revoke the permission at https://account.activedirectory.windowsazure.com/ when you are using an organization account (Azure AD account), or at https://account.live.com/consent/Manage when you're using a consumer account (Microsoft Account).

Use the custom scope in your client application

After the user has consented to the custom Web API application, the user can use the custom scopes (api://.../read and api://.../write in this example) in their client application. (In this post, we use the OAuth code grant flow with a web client application.)

First let's register a new client application in the Application Registration Portal with the user account that consented to your Web API application. In this post, we use the “Web” platform for this client application (i.e., a web client application).

The application password (client secret) must also be generated as follows in the application settings.

Now let's consume the custom scope (of the custom Web API) with this generated web client.
Access the following url in your web browser. (As you can see, the requested scope is the previously registered custom scope api://8a9c6678-7194-43b0-9409-a3a10c3a9800/read.)
Here client_id is the application id of the web client application (not the custom Web API application), and redirect_uri is the redirect url of the web client application.

https://login.microsoftonline.com/common/oauth2/v2.0/authorize
  ?response_type=code
  &response_mode=query
  &client_id=b5b3a0e3-d85e-4b4f-98d6-e7483e49bffc
  &scope=api%3A%2F%2F8a9c6678-7194-43b0-9409-a3a10c3a9800%2Fread
  &redirect_uri=https%3a%2f%2flocalhost%2ftestwebclient

Note : In real production, it's better to also retrieve the id token (i.e., response_type=id_token+code), since your client should validate the returned token and check that the user has logged in correctly.
This sample skips these extra steps to keep things easy to understand.

When you access this url, the following login page will be displayed.

After the user logs in successfully, the following consent is displayed.
As you can see, it shows that the client will use the permission “Read test service data” (a custom permission), which is the previously registered custom scope (api://8a9c6678-7194-43b0-9409-a3a10c3a9800/read).

After you approve this consent, the code is returned to your redirect url as follows.

https://localhost/testwebclient?code=OAQABAAIAA...

Next, using the code value, you can request an access token for the requested resource (custom scope) with the following HTTP request.
Here client_id and client_secret are the application id and application password (client secret) of the web client application.

HTTP Request

POST https://login.microsoftonline.com/common/oauth2/v2.0/token
Content-Type: application/x-www-form-urlencoded

grant_type=authorization_code
&code=OAQABAAIAA...
&client_id=b5b3a0e3-d85e-4b4f-98d6-e7483e49bffc
&client_secret=pmC...
&scope=api%3A%2F%2F8a9c6678-7194-43b0-9409-a3a10c3a9800%2Fread
&redirect_uri=https%3A%2F%2Flocalhost%2Ftestwebclient

HTTP Response

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

{
  "token_type": "Bearer",
  "scope": "api://8a9c6678-7194-43b0-9409-a3a10c3a9800/read",
  "expires_in": 3599,
  "ext_expires_in": 0,
  "access_token": "eyJ0eXAiOi..."
}

Note : If you want to get a refresh token, you must add “offline_access” to the scopes.

Using the returned access token (access_token property), you can call your custom Web API as follows and the API can verify the passed token. (Later I show you how to verify this token in your custom Web API.)

GET https://localhost/testapi
Authorization: Bearer eyJ0eXAiOi...

Verify access token in your Web API

Now it's your custom Web API's turn.

How do you check whether the access token is valid ? How do you get the logged-in user's claims ?

First you must remember that v2.0 endpoint returns the following token format.

                                    id token    access token
organization account (Azure AD)     JWT         JWT
consumer account (MSA)              JWT         Compact Tickets

As you can see in the table above, the passed access token is in the IETF JWT (JSON Web Token) format, described below, if you are using an Azure AD account (organization account).

  • A JWT consists of 3 string segments delimited by the dot (.) character.
  • Each segment is base64 url encoded (RFC 4648).
  • The 3 segments contain, in order :
    header (certificate/key) information (ex: the type of key, key id, etc), claim information (ex: user name, tenant id, token expiration, etc), and the digital signature (byte code).

For example, the following is a PHP example of decoding the access token. (The sample in C# is here.)
This sample outputs the 2nd delimited segment (i.e., the claims information) as the result.

<?php
echo "The result is " . token_test("eyJ0eXAiOi...");

// return claims
function token_test($token) {
  $res = 0;

  // 1 create array from token separated by dot (.)
  $token_arr = explode('.', $token);
  $header_enc = $token_arr[0];
  $claim_enc = $token_arr[1];
  $sig_enc = $token_arr[2];

  // 2 base 64 url decoding
  $header = base64_url_decode($header_enc);
  $claim = base64_url_decode($claim_enc);
  $sig = base64_url_decode($sig_enc);

  return $claim;
}

function base64_url_decode($arg) {
  $res = $arg;
  $res = str_replace('-', '+', $res);
  $res = str_replace('_', '/', $res);
  switch (strlen($res) % 4) {
    case 0:
      break;
    case 2:
      $res .= "==";
      break;
    case 3:
      $res .= "=";
      break;
    default:
      break;
  }
  $res = base64_decode($res);
  return $res;
}
?>

The output result (claim information) is the json string as follows.

{
  "aud": "8a9c6678-7194-43b0-9409-a3a10c3a9800",
  "iss": "https://login.microsoftonline.com/3bc5ea6c-9286-4ca9-8c1a-1b2c4f013f15/v2.0",
  "iat": 1498037743,
  "nbf": 1498037743,
  "exp": 1498041643,
  "aio": "ATQAy/8DAA...",
  "azp": "b5b3a0e3-d85e-4b4f-98d6-e7483e49bffc",
  "azpacr": "1",
  "name": "Christie Cline",
  "oid": "fb0d1227-1553-4d71-a04f-da6507ae0d85",
  "preferred_username": "ChristieC@MOD776816.onmicrosoft.com",
  "scp": "read",
  "sub": "Pcz_ssYLnD...",
  "tid": "3bc5ea6c-9286-4ca9-8c1a-1b2c4f013f15",
  "ver": "2.0"
}

Here aud is the application id of the target web api (in this case, the custom Web API), nbf (= not before) is the time from which the token is valid, exp is the time at which the token expires, tid is the tenant id of the logged-in user, and scp is the granted scopes.
With these claim values, your custom Web API can check whether the passed token is valid.

Here I show you the PHP sample code for checking these claims.

<?php
echo "The result is " . token_test("eyJ0eXAiOi...");

// return 1, if token is valid
// return 0, if token is invalid
function token_test($token) {
  // 1 create array from token separated by dot (.)
  $token_arr = explode('.', $token);
  $header_enc = $token_arr[0];
  $claim_enc = $token_arr[1];
  $sig_enc = $token_arr[2];

  // 2 base 64 url decoding
  $header = 
    json_decode(base64_url_decode($header_enc), TRUE);
  $claim =
    json_decode(base64_url_decode($claim_enc), TRUE);
  $sig = base64_url_decode($sig_enc);

  // 3 expiration check
  $dtnow = time();
  if($dtnow <= $claim['nbf'] or $dtnow >= $claim['exp'])
    return 0;

  // 4 audience check
  if (strcmp($claim['aud'], '8a9c6678-7194-43b0-9409-a3a10c3a9800') !== 0)
    return 0;

  // 5 scope check
  if (strcmp($claim['scp'], 'read') !== 0)
    return 0;
    
  // other checks if needed (licensed tenant, etc)
  // Here, we skip these steps ...

  return 1;
}

function base64_url_decode($arg) {
  $res = $arg;
  $res = str_replace('-', '+', $res);
  $res = str_replace('_', '/', $res);
  switch (strlen($res) % 4) {
    case 0:
      break;
    case 2:
      $res .= "==";
      break;
    case 3:
      $res .= "=";
      break;
    default:
      break;
  }
  $res = base64_decode($res);
  return $res;
}
?>

But it’s not complete !

Now let's consider: what if someone malicious has tampered with this token ? For example, a developer can easily change the returned token string with Fiddler or other developer tools, and might then be able to log in to critical corporate applications with another user's identity.

Lastly, the digital signature (the third segment of the access token string) protects against this kind of attack.

The digital signature is generated with a private key held by the Microsoft identity provider (Azure AD, etc), and you can verify it with the public key, which everyone can access. This digital signature is computed over the string {1st delimited segment}.{2nd delimited segment}.
That is, if you change the claims (the 2nd segment) in the access token, the digital signature must also be regenerated, and only the Microsoft identity provider can create this signature. (A malicious user cannot.)

That is, all you have to do is check whether this digital signature is valid using the public key. Let's see how to do that.

First you can get the public key from https://{issuer url}/.well-known/openid-configuration. (The issuer url is equal to the “iss” value in the claims.) In this case, you can get it from the following url.

GET https://login.microsoftonline.com/3bc5ea6c-9286-4ca9-8c1a-1b2c4f013f15/v2.0/.well-known/openid-configuration
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

{
  "authorization_endpoint": "https://login.microsoftonline.com/3bc5ea6c-9286-4ca9-8c1a-1b2c4f013f15/oauth2/v2.0/authorize",
  "token_endpoint": "https://login.microsoftonline.com/3bc5ea6c-9286-4ca9-8c1a-1b2c4f013f15/oauth2/v2.0/token",
  "token_endpoint_auth_methods_supported": [
    "client_secret_post",
    "private_key_jwt"
  ],
  "jwks_uri": "https://login.microsoftonline.com/3bc5ea6c-9286-4ca9-8c1a-1b2c4f013f15/discovery/v2.0/keys",
  "response_modes_supported": [
    "query",
    "fragment",
    "form_post"
  ],
  "subject_types_supported": [
    "pairwise"
  ],
  "id_token_signing_alg_values_supported": [
    "RS256"
  ],
  "http_logout_supported": true,
  "frontchannel_logout_supported": true,
  "end_session_endpoint": "https://login.microsoftonline.com/3bc5ea6c-9286-4ca9-8c1a-1b2c4f013f15/oauth2/v2.0/logout",
  "response_types_supported": [
    "code",
    "id_token",
    "code id_token",
    "id_token token"
  ],
  "scopes_supported": [
    "openid",
    "profile",
    "email",
    "offline_access"
  ],
  "issuer": "https://login.microsoftonline.com/3bc5ea6c-9286-4ca9-8c1a-1b2c4f013f15/v2.0",
  "claims_supported": [
    "sub",
    "iss",
    "cloud_instance_name",
    "cloud_graph_host_name",
    "aud",
    "exp",
    "iat",
    "auth_time",
    "acr",
    "nonce",
    "preferred_username",
    "name",
    "tid",
    "ver",
    "at_hash",
    "c_hash",
    "email"
  ],
  "request_uri_parameter_supported": false,
  "tenant_region_scope": "NA",
  "cloud_instance_name": "microsoftonline.com",
  "cloud_graph_host_name": "graph.windows.net"
}

Next you access the location in the “jwks_uri” property (see above) and get the public key list from there. Finally you find the appropriate key by matching the “kid” (key id).

Here is the complete PHP code.

<?php
echo "The result is " . token_test("eyJ0eXAiOi...");

// return 1, if token is valid
// return 0, if token is invalid
function token_test($token) {
  // 1 create array from token separated by dot (.)
  $token_arr = explode('.', $token);
  $header_enc = $token_arr[0];
  $claim_enc = $token_arr[1];
  $sig_enc = $token_arr[2];

  // 2 base 64 url decoding
  $header =
    json_decode(base64_url_decode($header_enc), TRUE);
  $claim =
    json_decode(base64_url_decode($claim_enc), TRUE);
  $sig = base64_url_decode($sig_enc);

  // 3 period check
  $dtnow = time();
  if($dtnow <= $claim['nbf'] or $dtnow >= $claim['exp'])
    return 0;

  // 4 audience check
  if (strcmp($claim['aud'], '8a9c6678-7194-43b0-9409-a3a10c3a9800') !== 0)
    return 0;

  // 5 scope check
  if (strcmp($claim['scp'], 'read') !== 0)
    return 0;

  // other checks if needed (licensed tenant, etc)
  // Here, we skip these steps ...

  //
  // 6 check signature
  //

  // 6-a get key list
  $keylist =
    file_get_contents('https://login.microsoftonline.com/3bc5ea6c-9286-4ca9-8c1a-1b2c4f013f15/discovery/v2.0/keys');
  $keylist_arr = json_decode($keylist, TRUE);
  foreach($keylist_arr['keys'] as $key => $value) {
    
    // 6-b select one key
    if($value['kid'] == $header['kid']) {
      
      // 6-c get public key from key info
      $cert_txt = '-----BEGIN CERTIFICATE-----' . "\n" . chunk_split($value['x5c'][0], 64) . '-----END CERTIFICATE-----';
      $cert_obj = openssl_x509_read($cert_txt);
      $pkey_obj = openssl_pkey_get_public($cert_obj);
      $pkey_arr = openssl_pkey_get_details($pkey_obj);
      $pkey_txt = $pkey_arr['key'];
      
      // 6-d validate signature
      $token_valid =
        openssl_verify($header_enc . '.' . $claim_enc, $sig, $pkey_txt, OPENSSL_ALGO_SHA256);
      if($token_valid == 1)
        return 1;
      else
        return 0;      
    }
  }
  
  return 0;
}

function base64_url_decode($arg) {
  $res = $arg;
  $res = str_replace('-', '+', $res);
  $res = str_replace('_', '/', $res);
  switch (strlen($res) % 4) {
    case 0:
      break;
    case 2:
      $res .= "==";
      break;
    case 3:
      $res .= "=";
      break;
    default:
      break;
  }
  $res = base64_decode($res);
  return $res;
}
?>

Calling other services in turn (OAuth On-Behalf-Of Flow)

As you can see above, the access token is issued for one specific api (the “aud”), and you cannot reuse the token for another api.
What if your custom Web API needs to call another api (for example, the Microsoft Graph API, other custom APIs, etc) ?

In such a case, your api can exchange the token for another one with the OAuth on-behalf-of flow as follows, with no need to display the login UI again.
In this example, our custom Web API connects to the Microsoft Graph API and gets the e-mail messages of the logged-in user.

Note : For other APIs except for “Microsoft Graph”, please see “Azure AD v2 endpoint – How to use custom scopes for admin consent“. (Added 02/07/2018)

Note : For the on-behalf-of flow in the Azure AD v1 endpoint, please refer to my earlier post.

First, as the official document says (see here), you need to use the tenant-aware endpoint when you use the on-behalf-of flow with the v2.0 endpoint. That is, administrator consent (admin consent) is needed before requesting the on-behalf-of flow. (In this case, the user consent for the custom Web API done in the previous section of this post is not needed.)

Before proceeding with the admin consent, you must add the delegated permissions for your custom Web API in the Application Registration Portal. In this example, we add the Mail.Read permission as follows. (When you use admin consent, you cannot add scopes on the fly and must set the permissions beforehand.)

Next, the administrator of the user's tenant must access the following url in a web browser to grant administrator consent.
Note that xxxxx.onmicrosoft.com can also be the tenant id (the Guid retrieved as “tid” in the previous claims). 8a9c6678-7194-43b0-9409-a3a10c3a9800 is the application id of the custom Web API, and https://localhost/testapi is the redirect url of the custom Web API.

https://login.microsoftonline.com/xxxxx.onmicrosoft.com/adminconsent
  ?client_id=8a9c6678-7194-43b0-9409-a3a10c3a9800
  &state=12345
  &redirect_uri=https%3A%2F%2Flocalhost%2Ftestapi

After logging in as the tenant administrator, the following consent is displayed. When the administrator approves it, your custom Web API is registered in the tenant. As a result, all users in this tenant can use this custom Web API and its custom scopes.

Note : You can revoke the admin-consented application in your tenant with Azure Portal. (Of course, the administrator privilege is needed for this operation.)

Now you are ready for the OAuth on-behalf-of flow in the v2.0 endpoint !

First the user (non-administrator) gets the access token for the custom Web API and calls the custom Web API with this access token. This flow is the same as above, so I skip the steps here.

Then the custom Web API can send the following HTTP POST to the Azure AD v2.0 endpoint using the passed access token. (Note that eyJ0eXAiOi... is the access token passed to this custom Web API, 8a9c6678-7194-43b0-9409-a3a10c3a9800 is the application id of your custom Web API, and itS... is the application password of your custom Web API.)
This POST requests a new access token for https://graph.microsoft.com/mail.read (a pre-defined scope).

POST https://login.microsoftonline.com/xxxxx.onmicrosoft.com/oauth2/v2.0/token
Content-Type: application/x-www-form-urlencoded

grant_type=urn%3Aietf%3Aparams%3Aoauth%3Agrant-type%3Ajwt-bearer
&assertion=eyJ0eXAiOi...
&requested_token_use=on_behalf_of
&scope=https%3A%2F%2Fgraph.microsoft.com%2Fmail.read
&client_id=8a9c6678-7194-43b0-9409-a3a10c3a9800
&client_secret=itS...

The following is the HTTP response for this on-behalf-of request. As you can see, your custom Web API obtains a new access token.

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

{
  "token_type": "Bearer",
  "scope": "https://graph.microsoft.com/Mail.Read https://graph.microsoft.com/User.Read",
  "expires_in": 3511,
  "ext_expires_in": 0,
  "access_token": "eyJ0eXAiOi..."
}

The returned access token has the scope for Mail.Read (https://graph.microsoft.com/mail.read), and it is not an app-only token but a user token for the logged-in user, even though it was obtained in the backend (server-to-server) without user interaction. (You can parse and decode this access token as described above.)

Therefore, when your custom Web API calls the Microsoft Graph endpoint with this access token, the logged-in user's e-mail messages are returned to your custom Web API.

GET https://graph.microsoft.com/v1.0/me/messages
  ?$orderby=receivedDateTime%20desc
  &$select=subject,receivedDateTime,from
  &$top=20
Accept: application/json
Authorization: Bearer eyJ0eXAiOi...

 

[Reference] App types for the Azure Active Directory v2.0 endpoint
https://docs.microsoft.com/en-us/azure/active-directory/develop/active-directory-v2-flows

 

Analyze your data in Azure Data Lake with R (R extension)

Azure Data Lake (ADL), which offers virtually unlimited data storage, is a reasonable (cost effective) choice for simple batch-based analysis.
Remember, the data matters more than the program ! When analyzing data in your Azure Data Lake Store, you don't need to move or download the data to a remote host; you can run your Python or R code on Azure Data Lake Analytics, hosted in the cloud.

Here I show you how to use the R extensions, with some brief examples based on real scenarios.

Note : In this post we consider the simple batch-based scenario. If you need more advanced scenarios with the data in ADL store, please use ADL store with Hadoop (HDInsight) with R Server, Spark, Storm, etc.
See “Benefits of Microsoft R and R Server” in my previous post for more details.

Note : U-SQL development with Python and R is also supported in Visual Studio Code. See “ADL Tools for Visual Studio Code (VSCode) supports Python & R Programming” in team blog. (Added on Nov 2017)

Setting-up

Before starting, you must prepare your Azure Data Lake Store (ADLS) and Azure Data Lake Analytics (ADLA) with Azure Portal. (Please see the Azure document.)
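
If you prefer the command line, a rough Azure CLI sketch for provisioning both accounts is the following. The account names, resource group, and region are placeholders of mine; please check the current "az dls" / "az dla" syntax before running it.

# Create the Data Lake Store (ADLS) account
az dls account create \
  --account myadls01 \
  --resource-group myRG01 \
  --location eastus
# Create the Data Lake Analytics (ADLA) account with that store as default
az dla account create \
  --account myadla01 \
  --resource-group myRG01 \
  --location eastus \
  --default-data-lake-store myadls01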

Next, on your Azure Data Lake Analytics blade, click [Sample Scripts] and select [Install U-SQL Extensions]. (See the following screenshot.)
This starts the installation of the extensions in your Data Lake Analytics (ADLA).

Let’s see what kind of installation was made.

After the installation is completed, click [Success] and then the [Duplicate Script] button. (The installation is executed as a Data Lake job.)

As you know, Data Lake Analytics is a .NET-based platform, and you can extend it with your own custom .NET classes.

The R extension is the same. As you can see in this script job (see below), the R extension classes (ExtR.dll) are installed in your Data Lake Analytics. (Note that the Python extensions and the Cognitive Services extensions are also installed.)
As I show you later, you can use these installed classes in your U-SQL script.

Note : You can see these installed dlls in the /usqlext/assembly folder of your ADLS (Data Lake Store).

Let’s get started !

Now it’s ready.

You can find a lot of examples in /usqlext/samples/R on ADLS. (These are the famous iris classification examples.) You can run these U-SQL files (.usql files) right away with the Azure Portal, Visual Studio, or Visual Studio Code (if you're using a Mac), and see the results and how they work. (Here we use Visual Studio.)

For instance, the following retrieves the data in iris.csv and fits a linear regression for the prediction target “Species”. (Sorry, but this sample is not very meaningful by itself, because it just returns the base64 encoded trained model. I show a more complete example later…)

The R extension (ExtR.dll) includes a custom reducer (.NET class) named Extension.R.Reducer, so you can use this extension class in the U-SQL REDUCE expression as follows.

REFERENCE ASSEMBLY [ExtR]; // Load library

DECLARE @IrisData string = @"/usqlext/samples/R/iris.csv";
DECLARE @OutputFilePredictions string = @"/my/R/Output/test.txt";
DECLARE @myRScript = @"
inputFromUSQL$Species <- as.factor(inputFromUSQL$Species)
lm.fit <- lm(unclass(Species)~.-Par, data=inputFromUSQL)
library(base64enc)
outputToUSQL <-
  data.frame(
    Model=base64encode(serialize(lm.fit, NULL)),
    stringsAsFactors = FALSE)
";

@InputData =
  EXTRACT SepalLength double,
    SepalWidth double,
    PetalLength double,
    PetalWidth double,
    Species string
  FROM @IrisData
  USING Extractors.Csv();

@ExtendedData =
  SELECT 0 AS Par,
       *
  FROM @InputData;

@RScriptOutput = REDUCE @ExtendedData ON Par
  PRODUCE Par int, Model string
  READONLY Par
  USING new Extension.R.Reducer(
    command:@myRScript,
    rReturnType:"dataframe",
    stringsAsFactors:false);

OUTPUT @RScriptOutput TO @OutputFilePredictions
  USING Outputters.Tsv();

As you can see in this sample code, you use inputFromUSQL to retrieve the input data in your R script, and outputToUSQL to return the result to U-SQL. That is, your R script communicates with the U-SQL script through these pre-defined variables.

Instead of using outputToUSQL, you can simply make the result the last expression of the R script (the R output). For instance, you can rewrite the above example as follows. (The change is that the outputToUSQL assignment is commented out and the data.frame() expression is left as the final value.)

REFERENCE ASSEMBLY [ExtR]; // Load library

DECLARE @IrisData string = @"/usqlext/samples/R/iris.csv";
DECLARE @OutputFilePredictions string = @"/my/R/Output/test.txt";
DECLARE @myRScript = @"
inputFromUSQL$Species <- as.factor(inputFromUSQL$Species)
lm.fit <- lm(unclass(Species)~.-Par, data=inputFromUSQL)
library(base64enc)
#outputToUSQL <-
#  data.frame(
#    Model=base64encode(serialize(lm.fit, NULL)),
#    stringsAsFactors = FALSE)
data.frame(
  Model=base64encode(serialize(lm.fit, NULL)),
  stringsAsFactors = FALSE)
";

@InputData =
  EXTRACT SepalLength double,
    SepalWidth double,
    PetalLength double,
    PetalWidth double,
    Species string
  FROM @IrisData
  USING Extractors.Csv();

@ExtendedData =
  SELECT 0 AS Par,
       *
  FROM @InputData;

@RScriptOutput = REDUCE @ExtendedData ON Par
  PRODUCE Par int, Model string
  READONLY Par
  USING new Extension.R.Reducer(
    command:@myRScript,
    rReturnType:"dataframe",
    stringsAsFactors:false);

OUTPUT @RScriptOutput TO @OutputFilePredictions
  USING Outputters.Tsv();

We used an inline R script in the above examples, but you can also keep the R script in a file separate from your U-SQL script as follows. (See the DEPLOY RESOURCE line and the scriptFile parameter.)

REFERENCE ASSEMBLY [ExtR]; // Load library

DEPLOY RESOURCE @"/usqlext/samples/R/testscript01.R";

DECLARE @IrisData string = @"/usqlext/samples/R/iris.csv";
DECLARE @OutputFilePredictions string = @"/my/R/Output/test.txt";

@InputData =
  EXTRACT SepalLength double,
    SepalWidth double,
    PetalLength double,
    PetalWidth double,
    Species string
  FROM @IrisData
  USING Extractors.Csv();

@ExtendedData =
  SELECT 0 AS Par,
       *
  FROM @InputData;

@RScriptOutput = REDUCE @ExtendedData ON Par
  PRODUCE Par int, Model string
  READONLY Par
  USING new Extension.R.Reducer(
    scriptFile:"testscript01.R",
    rReturnType:"dataframe",
    stringsAsFactors:false);

OUTPUT @RScriptOutput TO @OutputFilePredictions
  USING Outputters.Tsv();

Partitioning

By using the REDUCE expression, you can split your analysis workload by partitions. Each partition can be executed in parallel, so you can efficiently process (predict over) a massive amount of data with this partitioning capability.

To make things simple, let’s consider the following sample data. Here we use the first column as partition key.

test01.csv

1,1
1,2
1,3
1,4
2,5
2,6
2,7
2,8
3,9
3,10
3,11
3,12

The following is a brief example that calculates the min, max, and mean for each partition.

REFERENCE ASSEMBLY [ExtR];

DECLARE @SrcFile string = @"/sampledat/test01.csv";
DECLARE @DstFile string = @"/sampledat/output01.txt";
DECLARE @myRScript = @"
outputToUSQL <- data.frame(
  CalcType = c(""min"", ""max"", ""mean""),
  CalcValue = c(
    min(inputFromUSQL$Value),
    max(inputFromUSQL$Value),
    mean(inputFromUSQL$Value)
  )
)
";

@ExtendedData =
  EXTRACT PartitionId int,
      Value int
  FROM @SrcFile
  USING Extractors.Csv();

@RScriptOutput = REDUCE @ExtendedData ON PartitionId
  PRODUCE PartitionId int, CalcType string, CalcValue double
  READONLY PartitionId
  USING new Extension.R.Reducer(
    command:@myRScript,
    rReturnType:"dataframe",
    stringsAsFactors:false);

OUTPUT @RScriptOutput TO @DstFile
  USING Outputters.Tsv();

The following screenshot is the result of this U-SQL.
Each partition is executed independently in parallel, and all results are collected by REDUCE operation.

Note that you have to specify ON {partition keys} or ALL when you're using the REDUCE clause. (You cannot omit ON / ALL.)
So if you don't need partitioning, specify a pseudo partition (the same partition value for all rows) as in the following script.

REFERENCE ASSEMBLY [ExtR];

DECLARE @SrcFile string = @"/sampledat/test01.csv";
DECLARE @DstFile string = @"/sampledat/output01.txt";
DECLARE @myRScript = @"
outputToUSQL <- data.frame(
  CalcType = c(""min"", ""max"", ""mean""),
  CalcValue = c(
    min(inputFromUSQL$Value),
    max(inputFromUSQL$Value),
    mean(inputFromUSQL$Value)
  )
)
";

@ExtendedData =
  EXTRACT SomeId int,
      Value int
  FROM @SrcFile
  USING Extractors.Csv();

@ExtendedData2 =
  SELECT 0 AS Par, // pseudo partition
       *
  FROM @ExtendedData;

@RScriptOutput = REDUCE @ExtendedData2 ON Par
  PRODUCE Par int, CalcType string, CalcValue double
  READONLY Par
  USING new Extension.R.Reducer(
    command:@myRScript,
    rReturnType:"dataframe",
    stringsAsFactors:false);

OUTPUT @RScriptOutput TO @DstFile
  USING Outputters.Tsv();

Installing packages

There is a set of packages supported by default in the R extension, but you can install extra packages if needed. (See here for the default packages of the R extension; they also include the RevoScaleR package.)

First you download the package file (.zip, .tar.gz, etc) using your local R console. Here we download the well-known SVM package “e1071”. (We assume the file name is e1071_1.6-8.tar.gz.)

download.packages("e1071", destdir="C:\tmp")

Next you upload this package file to the folder in your ADLS (Data Lake Store).
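
For example, the upload can be done with Azure CLI roughly as follows. This is a sketch assuming an ADLS account named myadls01 and the /sampledat folder used in the U-SQL below; adjust the paths for your environment.

# Upload the downloaded package file to ADLS
az dls fs upload \
  --account myadls01 \
  --source-path "C:\tmp\e1071_1.6-8.tar.gz" \
  --destination-path /sampledat/e1071_1.6-8.tar.gz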
After that, you can reference this package file in your U-SQL and install the package in your R script as follows.

REFERENCE ASSEMBLY [ExtR];

DEPLOY RESOURCE @"/sampledat/e1071_1.6-8.tar.gz";

DECLARE @SrcFile string = @"/sampledat/iris.csv";
DECLARE @DstFile string = @"/sampledat/output03.txt";
DECLARE @myRScript = @"
install.packages('e1071_1.6-8.tar.gz', repos = NULL) # installing package
library(e1071) # loading package
# something to analyze !
# (Later we'll create the code here ...)
data.frame(Res = c(""result1"", ""result2""))
";

@InputData =
  EXTRACT SepalLength double,
    SepalWidth double,
    PetalLength double,
    PetalWidth double,
    Species string
  FROM @SrcFile
  USING Extractors.Csv();

@ExtendedData =
  SELECT 0 AS Par,
       *
  FROM @InputData;

@RScriptOutput = REDUCE @ExtendedData ON Par
  PRODUCE Par int, Res string
  READONLY Par
  USING new Extension.R.Reducer(
    command:@myRScript,
    rReturnType:"dataframe",
    stringsAsFactors:false);

OUTPUT @RScriptOutput TO @DstFile
  USING Outputters.Tsv();

Loading R data

In a real scenario, you might use a pre-trained model for predictions. In such a case, you can create the trained model (R objects) beforehand and load these R objects in your R script in U-SQL.

First you create the trained model with the following script in your local environment. The file “model.rda” is saved to your local file system.
(Here we use a script for saving, but you can also use the RStudio IDE.)

library(e1071)
inputCSV <- read.csv(
  file = "C:\tmp\iris_train.csv",
  col.names = c(
    "SepalLength",
    "SepalWidth",
    "PetalLength",
    "PetalWidth",
    "Species")
)
mymodel <- svm(
  Species~.,
  data=inputCSV,
  probability = T)
save(mymodel, file = "C:\\tmp\\model.rda")

Note that we assume our training data (iris data) is as follows. (It's the same as the U-SQL extension sample file…) :

iris_train.csv

5.1,3.5,1.4,0.2,setosa
7,3.2,4.7,1.4,versicolor
6.3,3.3,6,2.5,virginica
4.9,3,1.4,0.2,setosa
...

Then you upload this generated model (the model.rda file) to a folder in your ADLS (Data Lake Store), in the same way as the package file above.

Now it's ready; let's jump into the U-SQL.

See the following R script in U-SQL.
This R script loads the previous pre-trained model (model.rda), so you can use the pre-trained R object “mymodel” in your R script.
All you have to do is run predictions on your input data with this model object.

REFERENCE ASSEMBLY [ExtR];

DEPLOY RESOURCE @"/sampledat/e1071_1.6-8.tar.gz";
DEPLOY RESOURCE @"/sampledat/model.rda";

DECLARE @SrcFile string = @"/sampledat/iris.csv";
DECLARE @DstFile string = @"/sampledat/output03.txt";
DECLARE @myRScript = @"
install.packages('e1071_1.6-8.tar.gz', repos = NULL)
library(e1071)
load(""model.rda"")
pred <- predict(
  mymodel,
  inputFromUSQL,
  probability = T)
prob <- attr(pred, ""probabilities"")
result <- data.frame(prob, stringsAsFactors = FALSE)
result$answer <- inputFromUSQL$Species
outputToUSQL <- result
";

@InputData =
  EXTRACT SepalLength double,
    SepalWidth double,
    PetalLength double,
    PetalWidth double,
    Species string
  FROM @SrcFile
  USING Extractors.Csv();

@ExtendedData =
  SELECT 0 AS Par,
       *
  FROM @InputData;

@RScriptOutput = REDUCE @ExtendedData ON Par
  PRODUCE Par int, setosa double, versicolor double, virginica double, answer string
  READONLY Par
  USING new Extension.R.Reducer(
    command:@myRScript,
    rReturnType:"dataframe",
    stringsAsFactors:false);

OUTPUT @RScriptOutput TO @DstFile
  USING Outputters.Tsv();

[Reference] Tutorial: Get started with extending U-SQL with R
https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-r-extensions