Azure Batch – Walkthrough and How it works

Batch processing is another way to benefit from massive cloud computing. In this post and the next, I show you how Azure Batch and Batch AI work and how to use them; here I focus on Azure Batch fundamentals for your essential understanding.

When you use cloud computing resources for critical batch execution, you normally have to handle all the work yourself: constructing the infrastructure (virtual machines, virtual networks, etc. with ARM templates), provisioning libraries and applications, providing queues, scaling, security, monitoring, collecting results, clean-up, and so on. Azure Batch handles these batch provisioning and execution workloads for you at massive cloud computing scale.

In this post we mainly use the Azure CLI so that you can easily trace and check each step. But remember that you can also use the UI (Azure Portal), PowerShell, the REST API, or language SDKs such as .NET (C#), Java, Python, and R (the doAzureParallel package).

Before starting …

Before starting, select “New” – “Batch Service” in the Azure Portal and create a Batch account. (We use the Azure Portal only for creating the Batch account; you can also create it with the Azure CLI.)

First, log in to the Azure Batch account with the Azure CLI as follows. In this post we’re using a Batch account named “test01” in the resource group “testrg01”.
Here we’re using the interactive login UI, but you can also log in silently with a service principal and password (see my old post “How to use Azure REST API with Certificate programmatically (without interactive UI)“).

# log in to Azure with the interactive UI
az login
# log in to the Batch account
az batch account login \
 --resource-group testrg01 \
 --name test01

Preparing batch pool (computing nodes)

Before executing a batch, you must prepare the application package and the batch pool.
The application package is an archive of your executables (and related files) in zip format. Compress your executable applications (.exe, .dll, etc.) into a zip file, and upload this zipped file (myapp-exe.zip) as the Batch application package with the following commands.

# Upload and register your archive as application package
az batch application package create \
  --resource-group testrg01 \
  --name test01 \
  --application-id app01 \
  --package-file myapp-exe.zip \
  --version 1.0
# Set this version of package as default version
az batch application set \
  --resource-group testrg01 \
  --name test01 \
  --application-id app01 \
  --default-version 1.0

Next, you must provision the batch pool with the application package as follows.
The batch pool is the set of computing nodes in Azure. Here we’re creating 3 nodes with a Windows image (publisher “MicrosoftWindowsServer”, offer “WindowsServer”, SKU “2016-Datacenter”, node agent SKU ID “batch.node.windows amd64”), but of course you can also use Linux computing nodes for the batch pool.

As you can see below, here we’re using the Standard A1 virtual machine size, but you can also use the NC-series (which has Tesla GPUs) for machine learning workloads. Moreover, you can use the Data Science Virtual Machine (DSVM) with Azure Batch, which is a pre-configured machine (including Python, R, CNTK, TensorFlow, MXNet, etc. with GPU support) for machine learning and deep learning.
Please see “Azure Batch – List of virtual machine images” for the supported machine images.

az batch pool create \
 --id pool01 \
 --target-dedicated 3 \
 --image MicrosoftWindowsServer:WindowsServer:2016-Datacenter:latest \
 --node-agent-sku-id "batch.node.windows amd64" \
 --vm-size Standard_A1 \
 --application-package-references app01#1.0

On pool creation, the application package is extracted into the directory “D:\batch\tasks\apppackages\{folder name derived from your package name}”, and this path is available to your batch process through the environment variable %AZ_BATCH_APP_PACKAGE_app01#1.0% (“app01” is your app package name and “1.0” is the version).
For details about environment variables, please refer to “Azure Batch compute node environment variables“.

When some set-up is needed on each node, you can use the start task setting.
For example, when your package needs to be installed rather than just extracted from a zip archive, you can specify the following installation task (see the --start-task-command-line option) on pool creation. This setting copies https://mystorage.blob.core.windows.net/fol01/myapp.msi to myapp.msi in the current working directory and runs msiexec as a silent installation.

Note : Errors and output are written to files (stderr.txt and stdout.txt) in the %AZ_BATCH_NODE_STARTUP_DIR% directory (D:\batch\tasks\startup\) on each computing node.

az batch pool create \
 --id pool08 \
 --target-dedicated 3 \
 --image MicrosoftWindowsServer:WindowsServer:2016-Datacenter:latest \
 --node-agent-sku-id "batch.node.windows amd64" \
 --vm-size Standard_A1 \
 --application-package-references app08#1.0 \
 --start-task-command-line "cmd /c msiexec /i myapp.msi /quiet" \
 --start-task-resource-files "myapp.msi=https://mystorage.blob.core.windows.net/fol01/myapp.msi" \
 --start-task-wait-for-success

When you want to set up distributed training environments with deep learning libraries (CNTK, etc.), you should set up MPI with an admin-elevated installation. Unfortunately, there is no command option in the Azure Batch CLI for admin-elevated execution.
In such a case, you can specify the detailed parameters in JSON format as follows. This JSON uses the same format as the Batch REST API or an ARM template. (See the reference for details.)

Note : See “Cognitive Toolkit : Multiple GPUs and Machines” for distributed training with CNTK. For distributed training with MXNetR, see my old post “Accelerate MXNet R training by multiple machines and GPUs“.
For distributed training, some mechanism for inter-node communication is needed.

command

# Install MPI in start-task with admin elevated privileges
# (see the following "pooldef.json")
az batch pool create --json-file pooldef.json

pooldef.json (parameter’s definition)

{
  "id": "pool01",
  "virtualMachineConfiguration": {
    "imageReference": {
      "publisher": "MicrosoftWindowsServer",
      "offer": "WindowsServer",
      "sku": "2016-Datacenter",
      "version": "latest"
    },
    "nodeAgentSKUId": "batch.node.windows amd64"
  },
  "vmSize": "Standard_A1",
  "targetDedicatedNodes": "3",
  "enableInterNodeCommunication": true,
  "maxTasksPerNode": 1,
  "applicationPackageReferences": [
    {
      "applicationId": "app01",
      "version": "1.0"
    }
  ],
  "startTask": {
    "commandLine": "cmd /c MSMpiSetup.exe -unattend -force",
    "resourceFiles": [
      {
        "blobSource": "https://mystorage.blob.core.windows.net/fol01/MSMpiSetup.exe",
        "filePath": "MSMpiSetup.exe"        
      }
    ],
    "userIdentity": {
      "autoUser": {
        "elevationLevel": "admin"
      }
    },
    "waitForSuccess": true
  }
}

When you want to deploy pre-configured images, you can use custom VM images or Docker images (containerized machine images) with the batch pool.

Note : There are two ways to use Docker containers with Azure Batch.
The first approach is to use the start task to set up Docker on the host machine (and run your batch task commands with “docker run”).
The second approach is to create the pool with a machine image that runs Docker and specify the container image with the containerConfiguration parameter. With the second approach, your batch task commands run inside Docker containers, not directly on the virtual machine. See “Run container applications on Azure Batch” for details.
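For reference, the following is only a rough sketch of a containerized pool definition passed with --json-file for the second approach. (The marketplace image below – publisher “microsoft-azure-batch”, offer “ubuntu-server-container” – and the container image name are placeholder assumptions; please check “Run container applications on Azure Batch” for the exact schema and supported images of your API version.)

pooldef-container.json (sketch)

{
  "id": "pool10",
  "vmSize": "Standard_A1",
  "targetDedicatedNodes": 3,
  "virtualMachineConfiguration": {
    "imageReference": {
      "publisher": "microsoft-azure-batch",
      "offer": "ubuntu-server-container",
      "sku": "16-04-lts",
      "version": "latest"
    },
    "nodeAgentSKUId": "batch.node.ubuntu 16.04",
    "containerConfiguration": {
      "type": "dockerCompatible",
      "containerImageNames": [ "myregistry.azurecr.io/myapp:1.0" ]
    }
  }
}

Each task then specifies its container image in its containerSettings, and its command runs inside that container.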

When the pool’s allocation state becomes “steady” and each computing node’s state becomes “idle” without any errors, it’s ready for your job execution !

# show pool's status
az batch pool show --pool-id pool01
{
  "id": "pool01",
  "allocationState": "steady",
  "applicationPackageReferences": [
    {
      "applicationId": "app07",
      "version": "1.0"
    }
  ],
  "currentDedicatedNodes": 3,
  "currentLowPriorityNodes": 0,
  "enableAutoScale": false,
  "enableInterNodeCommunication": false,
  "state": "active",
  ...
}
# show node's state
az batch node list --pool-id pool01
[
  {
    "id": "tvm-1219235766_1-20171207t021610z",
    "affinityId": "TVM:tvm-1219235766_1-20171207t021610z",
    "state": "idle",
    "errors": null,
    "endpointConfiguration": {
      "inboundEndpoints": [
        {
          "backendPort": 3389,
          "frontendPort": 50000,
          "name": "SSHRule.0",
          "protocol": "tcp",
          "publicFqdn": "dnsazurebatch-7e6395de-5c8f-444a-9cb6-f7bd809e2d3c-c.westus.cloudapp.azure.com",
          "publicIpAddress": "104.42.129.75"
        }
      ]
    },
    "ipAddress": "10.0.0.4",
    ...
  },
  {
    "id": "tvm-1219235766_2-20171207t021610z",
    "affinityId": "TVM:tvm-1219235766_2-20171207t021610z",
    "state": "idle",
    "errors": null,
    "endpointConfiguration": {
      "inboundEndpoints": [
        {
          "backendPort": 3389,
          "frontendPort": 50002,
          "name": "SSHRule.2",
          "protocol": "tcp",
          "publicFqdn": "dnsazurebatch-7e6395de-5c8f-444a-9cb6-f7bd809e2d3c-c.westus.cloudapp.azure.com",
          "publicIpAddress": "104.42.129.75"
        }
      ]
    },
    "ipAddress": "10.0.0.6",
    ...
  },
  {
    "id": "tvm-1219235766_3-20171207t021610z",
    "affinityId": "TVM:tvm-1219235766_3-20171207t021610z",
    "state": "idle",
    "errors": null,
    "endpointConfiguration": {
      "inboundEndpoints": [
        {
          "backendPort": 3389,
          "frontendPort": 50001,
          "name": "SSHRule.1",
          "protocol": "tcp",
          "publicFqdn": "dnsazurebatch-7e6395de-5c8f-444a-9cb6-f7bd809e2d3c-c.westus.cloudapp.azure.com",
          "publicIpAddress": "104.42.129.75"
        }
      ]
    },
    "ipAddress": "10.0.0.5",
    ...
  }
]

Manage your pool (computing nodes)

The batch pool is based on Azure Virtual Machine Scale Sets (VMSS), and you can manage these nodes much like VMSS with the Batch CLI.

For example, you can easily change the number of nodes (scale out and scale in) with the following command. Here we’re changing the number of nodes to 5.

az batch pool resize \
  --pool-id pool01 \
  --target-dedicated-nodes 5

Only the nodes are charged in Azure Batch. So if you want to configure for cost-effective consumption, you can enable automatic scaling (auto-scaling) for the pool (see the sketch below). Moreover, you can also use low-priority VMs, which are very low-priced virtual machines, for the batch pool. (We used on-demand dedicated VMs in the examples above.)
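For example, auto-scaling can be enabled with an auto-scale formula like the following. (This is only a minimal sketch; the trivial formula here keeps 5 dedicated nodes on weekdays and 1 node otherwise, and real formulas usually scale on metrics such as $PendingTasks. See the Azure Batch auto-scaling documentation for the formula language.)

# enable auto-scaling on the existing pool with a trivial formula
az batch pool autoscale enable \
  --pool-id pool01 \
  --auto-scale-formula '$curTime = time(); $isWeekday = $curTime.weekday >= 1 && $curTime.weekday <= 5; $TargetDedicatedNodes = $isWeekday ? 5 : 1;'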

By default, tasks run under an auto-user account (named “_azbatch”). This built-in user account is created automatically by the Batch service. When you want to log in to a computing node, you can add a named user account instead of using the auto-user account.
The following adds a named user account to the node whose ID is “tvm-1219235766_1-20171207t021610z”. (This operation can also be done in the Azure Portal UI.)

az batch node user create \
  --pool-id pool01 \
  --node-id tvm-1219235766_1-20171207t021610z \
  --name tsmatsuz \
  --password P@ssw0rd \
  --is-admin

Using the named user account, you can connect to each computing node with RDP (Windows) or SSH (Linux).
With the previous “az batch node list” command, you can get the public IP and the public port mapped by inbound NAT for RDP (port 3389). (See the previous command results.) Or use the “az batch node remote-login-settings show” command to get the remote access addresses, as in the following sketch.
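For example (a quick sketch reusing the node ID from the earlier listing) :

# show the remote access address (public IP and port) for one node
az batch node remote-login-settings show \
  --pool-id pool01 \
  --node-id tvm-1219235766_1-20171207t021610z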

Job and Tasks

One batch execution is controlled by a logical unit called a “job”, and each process (command execution) is managed as a “task”. One job has many tasks, and tasks are launched in parallel or with dependencies between them.

The typical flow of batch execution is as follows :

  1. Create a job (which is associated with the pool)
  2. Add tasks to the job
  3. Each task is automatically queued and scheduled for execution on the computing nodes.

You can also assign a job manager task, which determines the completion of all other tasks in the job. In this post we don’t use a job manager task (jobManagerTask = null) for simplicity, and we only use tasks executed directly under the job.

How do you pass back the results ? It depends on your design.
A task can write its result to a file, or the task’s standard output is written to a text file. After all tasks complete, you can collect (download) these outputs from the VMs and merge them into one result.
Another approach is to pass a connection (connection string, credentials, etc.) for some shared storage (blob, database, etc.) to the tasks, and have all tasks write their results into this shared storage.
In this post, we use the first approach.

Now let’s see the following example. (As I mentioned above, %AZ_BATCH_APP_PACKAGE_app01#1.0% is an environment variable available to the batch task.)
After you add a task to a job, the task is automatically scheduled for execution. Even when all tasks have completed, no action occurs by default. Therefore you must set “terminateJob” as the --on-all-tasks-complete option, so that the job completes automatically after all tasks complete.

Note : You can also run your task as an admin-elevated process, just like the previous start task. (Specify a JSON file and set the detailed parameters as with the start task.)

# Create Job
az batch job create \
  --id job01 \
  --pool-id pool01
# Create Tasks
az batch task create \
  --job-id job01 \
  --task-id task01 \
  --application-package-references app01#1.0 \
  --command-line "cmd /c %AZ_BATCH_APP_PACKAGE_app01#1.0%\\myapp.exe"
az batch task create \
  --job-id job01 \
  --task-id task02 \
  --application-package-references app01#1.0 \
  --command-line "cmd /c %AZ_BATCH_APP_PACKAGE_app01#1.0%\\myapp.exe"
az batch task create \
  --job-id job01 \
  --task-id task03 \
  --application-package-references app01#1.0 \
  --command-line "cmd /c %AZ_BATCH_APP_PACKAGE_app01#1.0%\\myapp.exe"
az batch task create \
  --job-id job01 \
  --task-id task04 \
  --application-package-references app01#1.0 \
  --command-line "cmd /c %AZ_BATCH_APP_PACKAGE_app01#1.0%\\myapp.exe"
az batch task create \
  --job-id job01 \
  --task-id task05 \
  --application-package-references app01#1.0 \
  --command-line "cmd /c %AZ_BATCH_APP_PACKAGE_app01#1.0%\\myapp.exe"
az batch task create \
  --job-id job01 \
  --task-id task06 \
  --application-package-references app01#1.0 \
  --command-line "cmd /c %AZ_BATCH_APP_PACKAGE_app01#1.0%\\myapp.exe"
# When all tasks are completed, the job will complete
az batch job set \
  --job-id job01 \
  --on-all-tasks-complete terminateJob

You can monitor the state (“active”, “completed”, etc.) of your job and tasks as follows.

# show job's state
az batch job show --job-id job01
{
  "creationTime": "2017-12-12T05:41:31.254257+00:00",
  "executionInfo": {
    "endTime": null,
    "poolId": "pool01",
    "schedulingError": null,
    "startTime": "2017-12-12T05:41:31.291277+00:00",
    "terminateReason": null
  },
  "id": "job01",
  "previousState": null,
  "previousStateTransitionTime": null,
  "state": "active",
  ...
}
# show task's state
az batch task show --job-id job01 --task-id task01
{
  "id": "task01",
  "commandLine": "cmd /c %AZ_BATCH_APP_PACKAGE_app01#1.0%\\\\myapp.exe",
  "constraints": {
    "maxTaskRetryCount": 0,
    "maxWallClockTime": "10675199 days, 2:48:05.477581",
    "retentionTime": "10675199 days, 2:48:05.477581"
  },
  "creationTime": "2017-12-12T05:42:08.762078+00:00",
  "executionInfo": {
    "containerInfo": null,
    "endTime": null,
    "exitCode": null,
    "failureInfo": null,
    "lastRequeueTime": null,
    "lastRetryTime": null,
    "requeueCount": 0,
    "result": null,
    "retryCount": 0,
    "startTime": null
  },
  "previousState": null,
  "previousStateTransitionTime": null,
  "state": "active",
  "stateTransitionTime": "2017-12-12T05:42:08.762078+00:00",
  ...
}

Each task’s standard output is written to a file on the computing node. You can download these files with the following command.
Here the output file (stdout.txt) is saved (downloaded) as C:\tmp\node01-stdout.txt on the local computer (which is running your Azure CLI). You can view all the files on the computing node using the “az batch node file list --recursive” command.

az batch node file download \
  --file-path "D:\\batch\\tasks\\workitems\\job01\\job-1\\task01\\stdout.txt" \
  --destination "C:\\tmp\\node01-stdout.txt" \
  --node-id tvm-1219235766_1-20171207t021610z \
  --pool-id pool01

Each task is randomly assigned to a computing node, and the maximum number of concurrently running tasks on each node is 1 by default.
As a result, the tasks are scheduled and executed as follows.

If you set “2” as maxTasksPerNode on pool creation (see the previous pooldef.json above), the tasks will be executed as follows.

If a task’s inputs depend on the results of other tasks, you can also run the tasks with dependencies. (Use the usesTaskDependencies option on job creation and the dependsOn parameter on task creation, as in the following sketch.)
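For reference, the following is only a rough sketch of a dependent task defined in JSON and passed with --json-file. (The job and task IDs here are hypothetical; the JSON properties follow the Batch REST API format shown earlier.)

command

# jobdef.json must set "usesTaskDependencies" : true
az batch job create --json-file jobdef.json
# taskB will not start until taskA has completed successfully
az batch task create --job-id job02 --json-file taskdef.json

jobdef.json (sketch)

{
  "id": "job02",
  "poolInfo": { "poolId": "pool01" },
  "usesTaskDependencies": true
}

taskdef.json (sketch)

{
  "id": "taskB",
  "commandLine": "cmd /c %AZ_BATCH_APP_PACKAGE_app01#1.0%\\myapp.exe",
  "applicationPackageReferences": [ { "applicationId": "app01", "version": "1.0" } ],
  "dependsOn": { "taskIds": [ "taskA" ] }
}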


Deliver Your Own Excel Custom Functions (JavaScript Add-in) for Users

As you know, you have long been able to build your own custom functions (UDFs) with VBA. But if you’re an ISV, you must be able to deliver your functions to all your customers with an easy installation, and VBA doesn’t work well in such a case.
With the new capability in Excel web add-ins, you can deliver your functions all over the world through the marketplace. You can deliver your own advanced Excel functions, such as AI-integrated functions.

For example, you can easily expose your AI-driven web service with the new Azure Machine Learning (v2) and Python, and then users can consume this service through the Excel functions you provide. (The user doesn’t need to know about the web service or go through a complicated installation process.)
This concept was shown at Microsoft Ignite 2017; see “Announcing tools for the AI-driven digital transformation” in the team blog. This will be a sample first-party implementation of this experience.

The following are the advantages of these new custom functions. (As I note below, this capability is currently in Preview and several limitations exist.)

  • It’s JavaScript. It can run on any platform: Mac, Online, Mobile, etc.
    (Note : For now you can use only the Excel Windows desktop client in the current Preview.)
  • Are you a power user or a professional developer ? For the latter, once you’ve created your custom functions, you can submit and deliver them through the marketplace (Office Store).
    (Note : For now you cannot submit your custom function add-in in the current Preview.)

It’s now in Preview and you need an Office Insider build (8711 or later) to run it. Therefore, please sign up for Office Insider before starting.

Let’s dive into a 5-minute example !

Logic and Registration

First you must create an HTML page (.html) and a script file (.js), and host (expose) them on the internet.
The following is our example. Here we are using the famous chance.js library, which can be downloaded (copied) from the chance.js site.

mypage.html

<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8" />
  <meta http-equiv="X-UA-Compatible" content="IE=Edge" />
  <meta http-equiv="Expires" content="0" />
  <title></title>
  <script src="https://appsforoffice.edog.officeapps.live.com/lib/beta/hosted/office.js" type="text/javascript"></script>
  <script src="https://example.com/chance.js" type="text/javascript"></script>
  <script src="https://example.com/myfunc.js" type="text/javascript"></script>
</head>
<body>
</body>
</html>

myfunc.js

Office.initialize = function(reason){
  function getrandom(x) {
    if (x == "phone")
      return chance.phone();
    else if (x == "zip")
      return chance.zip();
    else if (x == "address")
      return chance.address();
    else
      return "not supported";
  }
  
  Excel.Script.CustomFunctions = {};
  Excel.Script.CustomFunctions["TEST"] = {};
  Excel.Script.CustomFunctions["TEST"]["RANDOM"] = {
    call: getrandom,
    description: "Create various random data (phone, zip, etc)",
    result: {
      resultType: Excel.CustomFunctionValueType.string,
      resultDimensionality: Excel.CustomFunctionDimensionality.scalar,
    },
    parameters: [
      {
        name: "type of data",
        description: "Which type of data you need (phone, zip, etc)",
        valueType: Excel.CustomFunctionValueType.string,
        valueDimensionality: Excel.CustomFunctionDimensionality.scalar,
      }
    ]
  };
  
  Excel.run(function (context) {    
    context.workbook.customFunctions.addAll();
    return context.sync().then(function(){});
  }).catch(function(error){});
};

Let’s see what this code is doing.
First, getrandom() is the logic of our custom function. As you can see, here we’re just creating various random values using chance.js. If your custom function retrieves data from the web (i.e., async I/O is invoked), you need an asynchronous function. (See “Github : Create custom functions in Excel (Preview)” for details about asynchronous functions.)

Note : It seems that some libraries (such as dynamically loaded ones) cannot currently be used in custom functions …

Next we must define our custom function with Excel.Script.CustomFunctions (see the Script.CustomFunctions reference) before registration. Here we’re using scalar values for the input parameter and the output result, but if you need to look up Excel ranges, you can also use matrix dimensions for parameters, as in the sketch below.
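For example, a range-based variant could be declared roughly as follows. This is only a sketch : the TEST.SUMRANGE function and the number / matrix members of Excel.CustomFunctionValueType and Excel.CustomFunctionDimensionality are my assumptions here, so please check the preview reference for the exact enum names. (This definition would sit next to the TEST.RANDOM definition inside Office.initialize.)

  // Hypothetical sketch : TEST.SUMRANGE takes an Excel range (matrix) and returns a scalar sum
  Excel.Script.CustomFunctions["TEST"]["SUMRANGE"] = {
    call: function (values) {
      // "values" arrives as a two-dimensional array of cell values
      var total = 0;
      for (var i = 0; i < values.length; i++)
        for (var j = 0; j < values[i].length; j++)
          total += values[i][j];
      return total;
    },
    description: "Sum all cells in the given range",
    result: {
      resultType: Excel.CustomFunctionValueType.number,
      resultDimensionality: Excel.CustomFunctionDimensionality.scalar,
    },
    parameters: [
      {
        name: "range of data",
        description: "The range (matrix) of numbers to sum",
        valueType: Excel.CustomFunctionValueType.number,
        valueDimensionality: Excel.CustomFunctionDimensionality.matrix,
      }
    ]
  };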

When the user adds this add-in, the definition is registered by context.workbook.customFunctions.addAll(). Once the custom function is registered, the user can use this function in all workbooks in Excel.
If you want to delete the registered custom functions, please use deleteAll() instead in the same add-in application. (See the following.)

Note : Alternatively, you can delete registered custom functions by removing the %userprofile%\AppData\Local\Microsoft\Office\16.0\Wef\CustomFunctions folder on your local machine.

Office.initialize = function(reason){
  Excel.run(function (context) {
    // check if exists your custom function
    ...
    context.workbook.customFunctions.deleteAll();
    ...
  }).catch(function(error){});
};

Manifest

Now it’s ready to register your custom functions in Excel.
In the current Preview, only sideloading installation (the installation process for developers) is supported. Therefore, you must create a shared folder on your local computer and set this folder as a Trusted Add-in Catalog by the following steps.
In the future, the user will be able to install from the marketplace with a few clicks !

  1. Open Excel
  2. Select “File” – “Options” menu.
  3. Select “Trust Center” tab and push “Trust Center Settings” button.
  4. Add your shared folder in Trusted Catalogs Table as the following screenshot.

Please create the following manifest file (UTF-8 encoding) and place this file (.xml) in the shared folder.

<?xml version="1.0" encoding="utf-8"?>
<OfficeApp xmlns="http://schemas.microsoft.com/office/appforoffice/1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:bt="http://schemas.microsoft.com/office/officeappbasictypes/1.0" xmlns:ov="http://schemas.microsoft.com/office/taskpaneappversionoverrides" xsi:type="TaskPaneApp">
  <Id>a4b7679d-7758-48a7-8bf3-7fb37fc2f20b</Id>
  <Version>1.0.0.0</Version>
  <ProviderName>Tsuyoshi Matsuzaki</ProviderName>
  <DefaultLocale>en-US</DefaultLocale>
  <DisplayName DefaultValue="My test functions" />
  <Description DefaultValue="custom functions test" />
  <Hosts>
    <Host Name="Workbook" />
  </Hosts>
  <DefaultSettings>
    <SourceLocation DefaultValue="https://example.com/mypage.html"/>
  </DefaultSettings>
  <Permissions>ReadWriteDocument</Permissions>
  <VersionOverrides xmlns="http://schemas.microsoft.com/office/taskpaneappversionoverrides" xsi:type="VersionOverridesV1_0">
    <Hosts>
      <Host xsi:type="Workbook">
        <AllFormFactors>
          <ExtensionPoint xsi:type="CustomFunctions">
            <Script>
              <SourceLocation resid="functionsjs" />
              <SourceLocation resid="chancejs" />
            </Script>
            <Page>
              <SourceLocation resid="pagehtml"/>
            </Page>
          </ExtensionPoint>
        </AllFormFactors>
      </Host>
    </Hosts>
    <Resources>
      <bt:Urls>
        <bt:Url id="functionsjs" DefaultValue="https://example.com/myfunc.js" />
        <bt:Url id="chancejs" DefaultValue="https://example.com/chance.js" />
        <bt:Url id="pagehtml" DefaultValue="https://example.com/mypage.html" />
      </bt:Urls>
    </Resources>
  </VersionOverrides>
</OfficeApp>

You may wonder why the JavaScript reference is needed in this manifest (see the <Script> element above). In fact, this tag is not used in the current Preview, but in the future the script will be downloaded so that it can be used offline and packaged. Once the custom function is installed, it will eventually be able to run offline in Excel.

Run !

Now let’s see how it works.

Open Excel and select “My add-ins” in “Insert” tab. Then you can find your custom add-in as follows.

After you add this add-in to your client, you can use your custom functions with autocomplete suggestions, hints, help files, etc.

Of course, you can combine this with existing Excel capabilities like auto-fill. Users can accelerate their work by integrating a lot of Excel capabilities with your advanced functions.

All functions are invoked simultaneously but executed one at a time (because JavaScript is single-threaded). In the future, you will be able to use batch calling, which aggregates multiple calls into one function call and improves performance.

You can also use streamed functions, which can dynamically update a cell, for example with an IoT sensor’s inputs. (An associated graph is also updated in real time.)

You can download the following complete example from Microsoft. Please try it.

Github: Create custom functions in Excel
https://github.com/OfficeDev/office-js-docs/blob/master/docs/excel/custom-functions-overview.md

 

This is just the initial Preview. The capabilities and performance will improve in the future. (Currently, the add-ins rely on a hidden browser process, but they will run in the same process in the future.)
Let’s enjoy it and look forward to future updates !

 

Power BI Custom Authentication in ISV applications (Custom Data Connector)

Sorry for the long interval between posts because of private matters. This blog is still alive. (My wife is having trouble with her leg and I’m now changing my work hours…)

In this post I describe how ISVs can implement their own Power BI Data Connector with a custom authentication experience.
When users select “Get Data” in Power BI, your custom connector will appear. Your connector can provide a custom authentication UI and custom data.
Even when an ISV creates (and submits) a custom template content pack, this connector flow is needed when using custom OAuth authentication.

In this post I explain this flow in a straightforward way from a developer’s point of view.

Before starting …

Before starting, please enable the custom connector feature in your Power BI Desktop for development as follows.

  • Launch Power BI Desktop
  • Select “File” – “Options and settings” – “Options” menu
  • In the dialog window, select “Preview features” and enable “Custom data connectors”

Note : Currently, Data Connectors are only supported in Power BI Desktop. (You also cannot publish them to the Power BI service.)

Next, you must install the Power Query SDK in Visual Studio. After installation, you can see the Data Connector project template in Visual Studio.

Overall Flow

To develop custom authentication, you must understand how OAuth works with API applications.
You can use your own favorite OAuth provider (Google account, your own custom IdP, etc.) for building an OAuth-enabled Data Connector, but here we use Azure Active Directory (Azure AD) for the implementation. (Here we use the Azure AD v1 endpoint. See the note below.)

Step 1 ) Beforehand, you must register your api application (A) and your client application (= Power BI Data Connector) (B) in Azure AD. This operation is needed only once by the ISV. (The user doesn’t need to register anything.)

Step 2 ) When the user uses your Power BI Data Connector, your data connector (B) shows the following url in a web browser or browser component for login.

https://login.microsoftonline.com/common/oauth2/authorize
  ?response_type=code
  &client_id={app id of B}
  &resource={app uri of A}
  &redirect_uri={redirected url for B}

Step 3 ) With a Power BI Data Connector, the previous “redirected url for B” is the fixed url “https://preview.powerbi.com/views/oauthredirect.html”.
After the user has logged in, the code is returned in the query string as follows. Then your data connector (B) can retrieve the returned code value.

https://preview.powerbi.com/views/oauthredirect.html?code={returned code}

Step 4 ) Next, your data connector (B) posts the following HTTP request to Azure AD (without UI). Then the access token is returned in the HTTP response as follows.

POST https://login.microsoftonline.com/common/oauth2/token
Content-Type: application/x-www-form-urlencoded

grant_type=authorization_code
&code={returned code}
&client_id={app id of B}
&redirect_uri=https%3A%2F%2Fpreview.powerbi.com%2Fviews%2Foauthredirect.html
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

{
  "token_type": "Bearer",
  "scope": "user_impersonation",
  "expires_in": "3599",
  "ext_expires_in": "0",
  "not_before": "{start time of this token}",
  "expires_on": "{expired time of this token}",
  "resource": "{app uri of A}",
  "access_token": "{returned access token}",
  "refresh_token": "{returned refresh token}",
  "id_token": "{returned id token}"
}

Step 5 ) Your data connector (B) calls your api application (A) with the previous access token as follows.

GET https://contoso.com/testapi
Authorization: Bearer {access token}

Step 6 ) Your api application (A) verifies the passed token and returns the result dataset if it’s valid. (See “Develop your API with Azure AD” for verification.)
The result dataset is shown in your Power BI client.

Note : Currently, a native application registered with the Azure AD v2 endpoint cannot have redirect urls with “http” or “https” protocols, so you cannot use “https://preview.powerbi.com/views/oauthredirect.html” as the redirect url.
Therefore, we’re using the Azure AD v1 endpoint in this post.

Below I explain how to set up or implement each of the corresponding steps.

Register your app in Azure AD (Step 1)

First, you must register your api application (A) and your client application (= Power BI Data Connector) (B) in the Azure Portal.

When you register your api application (A), you must define the scope (permission or role) in the manifest. (See below.)
You can use appropriate scope values like “read” and “write” according to your API requirements. In this post we use the default settings (the scope value is “user_impersonation”), and your data connector (B) must use the scope “{app uri}/user_impersonation” (ex: https://test.onmicrosoft.com/235426d2-485d-46e0-ad0d-059501ab58a4/user_impersonation) for accessing your api application.
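For reference, the default scope entry in the api application’s manifest looks roughly like the following. (The GUID and description texts below are placeholders; only the “value” of “user_impersonation” matters for this post.)

"oauth2Permissions": [
  {
    "adminConsentDescription": "Allow the application to access the test api on behalf of the signed-in user.",
    "adminConsentDisplayName": "Access test api",
    "id": "00000000-0000-0000-0000-000000000000",
    "isEnabled": true,
    "type": "User",
    "userConsentDescription": "Allow the application to access the test api on your behalf.",
    "userConsentDisplayName": "Access test api",
    "value": "user_impersonation"
  }
]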

When you register your data connector (B), the app type must be “native application”, and you must include “https://preview.powerbi.com/views/oauthredirect.html” (common for all data connectors) in the redirect urls.

After that, you must open the required permissions setting of the data connector application (B) and add the scope (here the “user_impersonation” scope) for accessing the api application (A).

Note : For a production application, you must set these applications (A and B) as multi-tenant applications, and the user must consent to your applications in the user’s tenant before using your data connector. In this post we skip these settings.

Implement your API – Create OData Feed service compliant with Power BI (Step 6)

Before implementing your data connector, we shall first implement the api application (A).

You can create your api as a simple REST api (consumed with Web.Contents() in your data connector) or an OData-compliant REST api (consumed with OData.Feed() in your data connector) in your favorite programming language.
Here we implement it as an OData v4 feed service with ASP.NET (.NET Framework) as follows. (See “OData.org : How to Use Web API OData to Build an OData V4 Service without Entity Framework” for details.)

First, create an “ASP.NET Web Application” project and select [Empty] and [Web API] in the creation wizard as follows.

Add the Microsoft.AspNet.OData package with NuGet. (Not Microsoft.AspNet.WebApi.OData, because that one is not v4.)

Note : You cannot use .NET Framework 4.6.2 with the latest Microsoft.AspNet.OData 6.1.0 because of a dll conflict.

Add and implement your controller. It must inherit ODataController as follows.
Here I created the following simple controller, which returns 3 records of the SalesRevenue class.

...
using System.Web.OData;
...

public class TestController : ODataController
{
  [EnableQuery]
  public IQueryable<SalesRevenue> Get()
  {
    return new List<SalesRevenue>()
    {
      new SalesRevenue { Id = 1, Country = "US", Value = 200 },
      new SalesRevenue { Id = 2, Country = "Japan", Value = 80 },
      new SalesRevenue { Id = 3, Country = "Germany", Value = 120 }
    }.AsQueryable();
  }
}

public class SalesRevenue
{
  public int Id { get; set; }
  public string Country { get; set; }
  public int Value { get; set; }
}
...

Edit your WebApiConfig.cs as follows.

...
using System.Web.OData.Builder;
using System.Web.OData.Extensions;
using Microsoft.OData.Edm;
...

public static class WebApiConfig
{
  public static void Register(HttpConfiguration config)
  {
    // Web API configuration and services

    // Web API routes
    config.MapHttpAttributeRoutes();

    config.Routes.MapHttpRoute(
      name: "DefaultApi",
      routeTemplate: "api/{controller}/{id}",
      defaults: new { id = RouteParameter.Optional }
    );

    // Add here
    ODataModelBuilder builder = new ODataConventionModelBuilder();
    builder.Namespace = "Demos";
    builder.ContainerName = "DefaultContainer";
    builder.EntitySet<Controllers.SalesRevenue>("Test");
    IEdmModel model = builder.GetEdmModel();
    config.MapODataServiceRoute("ODataRoute", "osalesdat", model);
  }
}
...

Now your api can be consumed at https://{your api hosted url}/osalesdat from Power BI Desktop. (Please try it with the OData feed connector !)

Finally, you must add authentication to your api application.
As I mentioned earlier, your api will receive the following HTTP request from the Power BI Data Connector. Your api application must verify the access token (the following Authorization header value), retrieve several claims, and return data as you like.
For example, your api application can retrieve the tenant id from the token and return each tenant’s data in your code. (That is, you can easily implement multi-tenancy.)

GET https://contoso.com/api/test
Accept: */*
Accept-Encoding: gzip, deflate
Authorization: Bearer eyJ0eXAiOi...

For this implementation, please see my earlier post “Build your own Web API protected by Azure AD v2.0 endpoint with custom scopes“; with ASP.NET you can simply select “Configure Azure AD Authentication” from the right-click menu on your project, or configure it in the project creation wizard.
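For reference, the generated Startup.Auth.cs is roughly like the following sketch (OWIN bearer authentication for Azure AD; the tenant and audience values below are the sample values used in this post, and in the generated code they are read from web.config) :

using Owin;
using Microsoft.Owin.Security.ActiveDirectory;
using System.IdentityModel.Tokens;

public partial class Startup
{
  public void ConfigureAuth(IAppBuilder app)
  {
    // validate incoming bearer tokens issued by Azure AD
    app.UseWindowsAzureActiveDirectoryBearerAuthentication(
      new WindowsAzureActiveDirectoryBearerAuthenticationOptions
      {
        Tenant = "test.onmicrosoft.com",
        TokenValidationParameters = new TokenValidationParameters
        {
          // app uri of your api application (A)
          ValidAudience = "https://test.onmicrosoft.com/235426d2-485d-46e0-ad0d-059501ab58a4"
        }
      });
  }
}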

Note : If you select “new app” instead of “existing app” in the configuration dialog, your api application is automatically registered in Azure AD. In this case, you don’t need to register your api app manually in step 1.

When you want to protect your api, add the Authorize attribute as follows.

public class TestController : ODataController
{
  [Authorize]
  [EnableQuery]
  public IQueryable<SalesRevenue> Get()
  {
    return new List<SalesRevenue>()
    {
      new SalesRevenue { Id = 1, Country = "US", Value = 200 },
      new SalesRevenue { Id = 2, Country = "Japan", Value = 80 },
      new SalesRevenue { Id = 3, Country = "Germany", Value = 120 }
    }.AsQueryable();
  }
}

Implement your Data Connector (All)

Now let’s create your data connector (B).
First, create your project with the “Data Connector Project” template in Visual Studio.

A Power BI Data Connector is written in the Power Query M formula language. (See the M function reference for details.) All the logic is written in the “PQExtension1.pq” file. (We’re assuming our project is named “PQExtension1”.)
Here I show you the overall sample code (PQExtension1.pq) as follows. (Here I’m skipping the code for exception handling, etc.)
Now let’s see what this code is doing, step by step.

section PQExtension1;

[DataSource.Kind="PQExtension1", Publish="PQExtension1.Publish"]
shared PQExtension1.Contents = (optional message as text) =>
  let
    source = OData.Feed(
      "https://contoso.com/osalesdat",
      null,
      [ ODataVersion = 4, MoreColumns = true ])
  in
    source;
 
PQExtension1 = [
  Authentication = [
    OAuth = [
      StartLogin = StartLogin,
      FinishLogin = FinishLogin
    ]
  ],
  Label = "Test API Connector"
];

StartLogin = (resourceUrl, state, display) =>
  let
    authorizeUrl = "https://login.microsoftonline.com/common/oauth2/authorize"
      & "?response_type=code"
      & "&client_id=97f213a1-6c29-4235-a37b-a82dda14365c"
      & "&resource=https%3A%2F%2Ftest.onmicrosoft.com%2F235426d2-485d-46e0-ad0d-059501ab58a4"
      & "&redirect_uri=https%3A%2F%2Fpreview.powerbi.com%2Fviews%2Foauthredirect.html"
  in
    [
      LoginUri = authorizeUrl,
      CallbackUri = "https://preview.powerbi.com/views/oauthredirect.html",
      WindowHeight = 720,
      WindowWidth = 1024,
      Context = null
    ];

FinishLogin = (context, callbackUri, state) =>
  let
    query = Uri.Parts(callbackUri)[Query],
    tokenResponse = Web.Contents("https://login.microsoftonline.com/common/oauth2/token", [
      Content = Text.ToBinary("grant_type=authorization_code"
        & "&code=" & query
        & "&client_id=97f213a1-6c29-4235-a37b-a82dda14365c"
        & "&redirect_uri=https%3A%2F%2Fpreview.powerbi.com%2Fviews%2Foauthredirect.html"),
      Headers = [
        #"Content-type" = "application/x-www-form-urlencoded",
        #"Accept" = "application/json"
      ]
    ]),
    result = Json.Document(tokenResponse)
  in
    result;

...

Note : Use Fiddler for debugging HTTP flow.

Implement your Data Connector - Signing-in (Step 2)

To integrate the OAuth authentication flow with your Power BI Data Connector, first you define your connector as follows. The following "Label" text will be displayed in the wizard window in Power BI. (See the screenshot in the "Run !" section below.)

PQExtension1 = [
  Authentication = [
    OAuth = [
      ...
    ]
  ],
  Label = "Test API Connector"
];

When you want to navigate to the login UI with OAuth, you must add the following StartLogin.
Here 97f213a1-6c29-4235-a37b-a82dda14365c is the application id (client id) of your data connector (B), and https%3A%2F%2Ftest.onmicrosoft.com%2F235426d2-485d-46e0-ad0d-059501ab58a4 is the url-encoded app uri of your api application (A). Please change these to your own values.

PQExtension1 = [
  Authentication = [
    OAuth = [
      StartLogin = StartLogin,
      ...
    ]
  ],
  Label = "Test API Connector"
];

StartLogin = (resourceUrl, state, display) =>
  let
    authorizeUrl = "https://login.microsoftonline.com/common/oauth2/authorize"
      & "?response_type=code"
      & "&client_id=97f213a1-6c29-4235-a37b-a82dda14365c"
      & "&resource=https%3A%2F%2Ftest.onmicrosoft.com%2F235426d2-485d-46e0-ad0d-059501ab58a4"
      & "&redirect_uri=https%3A%2F%2Fpreview.powerbi.com%2Fviews%2Foauthredirect.html"
  in
    [
      LoginUri = authorizeUrl,
      CallbackUri = "https://preview.powerbi.com/views/oauthredirect.html",
      WindowHeight = 720,
      WindowWidth = 1024,
      Context = null
    ];

Implement your Data Connector - Get auth code (Step 3)

After the user has successfully logged in with the UI, the code is returned to the callback uri (https://preview.powerbi.com/views/oauthredirect.html) as part of the query string. To receive this code value in your data connector, you write the following.
The following FinishLogin is invoked by the connector framework after the login has succeeded.

PQExtension1 = [
  Authentication = [
    OAuth = [
      StartLogin = StartLogin,
      FinishLogin = FinishLogin
    ]
  ],
  Label = "Test API Connector"
];
...

FinishLogin = (context, callbackUri, state) =>
  let
    query = Uri.Parts(callbackUri)[Query],
    code = query[code]
    ...
  in
    result;

Implement your Data Connector - Get access token (Step 4)

With the code value, your data connector can retrieve the access token as follows.
Here the M function Web.Contents() posts the HTTP request without UI (silently).

PQExtension1 = [
  Authentication = [
    OAuth = [
      StartLogin = StartLogin,
      FinishLogin = FinishLogin
    ]
  ],
  Label = "Test API Connector"
];
...

FinishLogin = (context, callbackUri, state) =>
  let
    query = Uri.Parts(callbackUri)[Query],
    tokenResponse = Web.Contents("https://login.microsoftonline.com/common/oauth2/token", [
      Content = Text.ToBinary("grant_type=authorization_code"
        & "&code=" & query
        & "&client_id=97f213a1-6c29-4235-a37b-a82dda14365c"
        & "&redirect_uri=https%3A%2F%2Fpreview.powerbi.com%2Fviews%2Foauthredirect.html"),
      Headers = [
        #"Content-type" = "application/x-www-form-urlencoded",
        #"Accept" = "application/json"
      ]
    ]),
    result = Json.Document(tokenResponse)
  in
    result;

As you can see, FinishLogin returns the parsed OAuth token response to Power BI as a record. (The access token and refresh token are included in this response.)

Implement your Data Connector - Call your api application (Step 5)

Finally, you call your api application (the OData feed service generated previously) as follows.
As you can see, you don't need to specify the Authorization header in your M code here. The framework will automatically set the access token in the request.

[DataSource.Kind="PQExtension1", Publish="PQExtension1.Publish"]
shared PQExtension1.Contents = (optional message as text) =>
  let
    source = OData.Feed(
      "https://contoso.com/osalesdat",
      null,
      [ ODataVersion = 4, MoreColumns = true ])
  in
    source;

Run !

Now let's build, deploy, and run !

Build your project in Visual Studio and copy the generated .mez file (which is in zip file format) into the "%UserProfile%\Documents\Microsoft Power BI Desktop\Custom Connectors" folder.
Launch Power BI Desktop and select "Get Data". As you can see below, your data connector appears in the list.

When you select your custom data connector, the following dialog with "Sign In" button is displayed.

When you push "Sign In", the login-UI (in this case, Azure AD sign-in) is displayed.
Here we used Azure AD for our trivial example, but you can use your own provider (sign-in experience), as I mentioned earlier.

When you have successfully logged in and push the "Connect" button, the following navigator is displayed along with your OData feed metadata.
In this case we defined only the "Test" entity in the previous code, but you can also include functions (not only entities) if needed.

When you select entities and proceed, you can use the dataset in Power BI as follows.
As you can see, the end user doesn't need to care about your implementation (OAuth, OData, etc.). The user just uses the given connector and can securely get all the required data with custom authentication.

Note : Once you connect to the data source, the connection is cached in Power BI Desktop. You can clear the connection cache using the "File" - "Options and settings" - "Data source settings" menu.

The following is a sample that handles exceptions and other details (it accesses Microsoft Graph); please refer to it for a deeper understanding.

[Github] MyGraph Connector Sample
https://github.com/Microsoft/DataConnectors/tree/master/samples/MyGraph

 

Reference : [Github] Getting Started with Data Connectors
https://github.com/Microsoft/DataConnectors

 

Overfitting in Regression and Neural Nets – Visually Understanding (Overview)

As a starting point for machine learning, here I show you some primitive overfitting examples and explain what you should care about and how to avoid it. To build your intuition, I show several samples with many visual images.
First we see some simple overfitting examples for traditional statistical regression, and in the latter part we discuss the case of neural networks.

First look for Overfitting

Let me explain overfitting in machine learning with a brief example dataset as follows. (See the following plot of the sample data.)

sampledat

The following R script fits a linear model to the dataset above (sampledat).
To fit the given data precisely, here we use a degree-10 polynomial formula with the poly() function as follows. If a higher-order term were not needed, you might expect its estimated coefficient to come out (almost) zero, so you might think "the larger the degree, the better".

fit <- lm(
  y ~ poly(x1, 10, raw = TRUE),
  data = sampledat)

Here I show you the result of this regression analysis.

As you can see, we get a fitted equation, call it equation (1), for the given data.

Now we plot this equation with the given dataset. Equation (1) fits the given data best, as follows.

Is that really a good result ?

In fact, this given dataset (sampledat) is generated by the following R script. As you can see below, the data follow $y = 3 + x_1$ with some added noise. If new data points are generated by this script, they obviously won't fit equation (1).
That is, the result (equation (1)) is overfitting.

n_sample <- 20
b0 <- 3
b1 <- 1
x1 <- c(1:n_sample)
epsilon <- rnorm(
  n = n_sample,
  mean = 0,
  sd = 0.7)  # noise
y <- b0 + b1 * x1 + epsilon
sampledat <- data.frame(
  x1 = x1,
  y = y
)

Let's look at equation (1) from a bird's-eye view. (See the following plot of equation (1).)
Equation (1) fits only the 20 given data points (the "sampledat" above); it is not a generalized model. Equation (1) doesn't fit unknown data.

Here I showed you a trivial overfitting example for your first understanding, but in real practical cases it's difficult to distinguish whether a model is overfitting or not.
Our next questions are : How do we distinguish it ? How do we avoid it ?

Information Criterion

Now let's say you add an extra parameter to your regression formula, but the likelihood improves only slightly, or stays almost the same. If so, you may suspect that this new parameter is not needed in the regression formula.

In the statistical approach, there exist criteria (called "information criteria") to judge your model fitting, based on mathematical background.
The famous one is the Akaike Information Criterion (AIC):

$\mathrm{AIC} = -2 \ln L + 2k$

where $k$ is the number of estimated parameters and $L$ is the maximum likelihood. The smaller the value, the better the fit.

Note : Both AIC and BIC (Bayesian information criterion) have the form $-2 \ln L + C\,k$. In AIC, $C$ equals 2 (a constant value).
Here we have only 20 survey data (observations) and it's difficult to tell whether a term is just noise. In BIC, $C = \ln n$ includes the number of observations $n$ as a parameter.

The following is a plot of the log likelihood ($\ln L$) and AIC for the previous dataset (sampledat). The red line is the log likelihood and the blue line is the AIC. (You can easily get these values with the logLik() and AIC() functions in R, as in the sketch below.)
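For example, the following minimal sketch (reusing the sampledat above) prints these values while increasing the polynomial degree :

# compare log likelihood and AIC for increasing polynomial degree
for (d in 1:10) {
  fit_d <- lm(y ~ poly(x1, d, raw = TRUE), data = sampledat)
  cat("degree:", d,
      " logLik:", as.numeric(logLik(fit_d)),
      " AIC:", AIC(fit_d), "\n")
}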

As you can see, the appropriate number of estimated parameters is 3. That is, the linear formula $y = b_0 + b_1 x_1$ (plus the variance) fits well. (See the following note.)

Note : Hidden estimated parameters (like the variance for Gaussian, the shape for Gamma, etc.) must be counted as estimated parameters for AIC. In this case we are using Gaussian, so we must count the variance among the estimated parameters. (See my previous post "Understanding the basis of GLM Regression" for details.)
For instance, if the formula (equation) is $y = b_0 + b_1 x_1$, then the estimated parameters are $b_0$, $b_1$, and the variance (3 parameters).

For instance, if we use the following dataset, the number of parameters must be 4, and the equation (formula) must be $y = b_0 + b_1 x_1 + b_2 x_1^2$ (with the variance as the fourth parameter).

Here we use only a single input ($x_1$), but if you have several input variables, you must also consider the interactions between them.

Overfitting in neural networks

Let’s proceed to the neural nets for discussion.

First, you must remember that too many layers or neurons often cause overfitting. The number of layers especially affects the model complexity. (Of course, the network must still be deep enough to represent your target model.)
Here, too, it's not "the larger, the better".

To simplify our example, let's consider a brief feed-forward neural network with sigmoid activation, two input variables ($x_1$, $x_2$), and one binary output (an output between 0 and 1).
If we have 1 hidden layer, it can represent models like the following illustration. (The model can have several linear boundaries and combinations of them.)

If we have 2 hidden layers, it can represent more complex models, as illustrated below. (These are combinations of the 1-layer models.)

If we have some noisy data, the 2-hidden-layer network might cause overfitting as follows.
As you can see here, too many layers can cause overfitting.

Note : Unlike the statistical approach, there's no concrete criterion to decide how many layers or neurons are best, because there is no common evaluation property based on a mathematical model.
You must examine and evaluate the generated model with test data or validation data.

Model complexity is also caused by large coefficients. Let's see the next example.

As you know, the sigmoid has a (nearly) linear part and a (nearly) binary part. The linear part can fit smoothly, but the binary part cannot (it fits in a step-like, binary way).
As the weights increase, the binary part becomes stronger than the linear part.

For example, let's see the following illustrated network.

This network results in the following plot (wire frame). ($x_1$ and $x_2$ are the inputs, and $z$ is the output.) As you can see, the output transitions smoothly.

Now let's see the next example.
This network has exactly the same boundary as the previous one, but the coefficients (weights and bias) are much larger.

When we plot the inputs ($x_1$, $x_2$) and the output ($z$), the transition becomes much sharper than before.

As the weights increase and the network has enough layers and neurons, the model can easily produce more complex shapes. As a result, it causes overfitting and a lack of generalization.

Large coefficients are easily generated : just train with too many iterations (inputs, epochs, etc.). Train ! Train ! Train ! The coefficients grow through gradient descent.
For instance, the following is a simple feed-forward network for recognizing hand-written digits with MXNetR. This script outputs the variance of each layer's weights.

require(mxnet)
...

# configure network
data <- mx.symbol.Variable("data")
fc1 <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=128)
act1 <- mx.symbol.Activation(fc1, name="relu1", act_type="relu")
fc2 <- mx.symbol.FullyConnected(act1, name="fc2", num_hidden=64)
act2 <- mx.symbol.Activation(fc2, name="relu2", act_type="relu")
fc3 <- mx.symbol.FullyConnected(act2, name="fc3", num_hidden=10)
softmax <- mx.symbol.SoftmaxOutput(fc3, name="sm")

# train
model <- mx.model.FeedForward.create(
  softmax,
  X=traindata.x,
  y=traindata.y,
  ctx=mx.cpu(),
  num.round = 10,
  learning.rate=0.07)

# dump weights and biases
params <- model$arg.params

# check the variance of weights in each layers
var(c(as.array(params$fc1_weight)))
var(c(as.array(params$fc2_weight)))
var(c(as.array(params$fc3_weight)))

When we set num.round = 100 instead of 10 in this script, we get larger and more widely distributed coefficients, as follows.
In this example we're setting an appropriate learning rate and number of layers based on experimental results, but the weights will increase more rapidly with higher values of these parameters.

Note : The learning rate also affects the weights and accuracy significantly. The learning rate must be small enough to steadily decrease the error in gradient descent, but large enough to converge as rapidly as possible. (You must decide it from your experiments.)

epoch = 10

epoch = 100

There exist several regularization techniques to mitigate this overfitting in neural networks, as follows.

  • Early Stopping – A method to stop learning when some condition occurs (e.g., when the error is higher than at the last check).
  • Penalty (L1, L2) – A method to add a penalty term (weight decay) to the gradient descent objective to keep the weights from growing.
  • Dropout – A method to randomly drop neurons in each training phase. This avoids the overfitting caused by co-adaptation when the network has a complex structure with many layers and neurons. As a result, it achieves a kind of model combination (similar to ensemble learning such as random forests).

The supported regularization methods differ from framework to framework.
Among the Microsoft libraries, you can implement all of these regularization techniques (early stopping, L1/L2 penalty, dropout) with CNTK, but rxNeuralNet (in MicrosoftML) and NET# don't support all of them.

# Dropout with CNTK (Python)
...

with default_options(activation=relu, pad=True):
  model = Sequential([
    LayerStack(2, lambda : [
      Convolution((3,3), 64),
      Convolution((3,3), 64),
      MaxPooling((3,3), strides=2)
    ]),
    LayerStack(2, lambda i: [
      Dense([256,128][i]), 
      Dropout(0.5)
    ]),
    Dense(4, activation=None)
  ])
...

 

Understanding the basis of GLM Regression (Logistic, Gaussian, Gamma, etc)

Now I'm working with several ISVs on building their solutions with machine learning algorithms (using R, Python, Azure Machine Learning, etc.).
More advanced and realistic regression techniques (like GLMM, Bayesian approaches with MCMC simulation, etc.) are often needed in practical modeling, but it's very important to understand the basic modeling ideas of GLM (generalized linear models) first, because the advanced regression techniques are also based on these basic ones.

In this post I summarize the essence of these basic modeling ideas with simple R scripts. Please use them to build your intuition before you start.

Note : We don't discuss mixed or extended models, including quasi-binomial, quasi-Poisson, or negative binomial, in this post. (We discuss only the basic ones.)

Common Idea for Regression (GLM)

All GLM families (Gaussian, Poisson, etc.) are based on the following common idea.

  • We choose the appropriate mathematical model (Gamma, Gaussian, etc.) depending on what we need.
    For example, when you want to predict the human height (response variable) of some school children from weight and age (explanatory variables), it might be better to choose Gaussian, because human height roughly follows a normal (Gaussian) distribution.
  • Every model family has input parameters. Next, we represent these input parameters with the explanatory variables.
    For example, the Gaussian (normal) distribution has the parameters $\mu$ (the mean) and $\sigma$ (the standard deviation). If we suppose that $\sigma$ is some constant value for all observations, $\mu$ is the only parameter depending on weight and age (explanatory variables). We suppose $\mu = b_0 + b_1 x_1 + b_2 x_2$, where $x_1$ is weight and $x_2$ is age. (The expression will differ from case to case. Later I'll explain the link function, so please keep reading.)
  • Finally, we estimate and determine one $b_0$, $b_1$, and $b_2$ from a lot of given data (weight, age, and human height). (See the sketch after this list.)
    For example, we can get the approximate probability of each observed human height (because the distribution is determined by $b_0$, $b_1$, $b_2$, $x_1$, and $x_2$), and finally we can estimate the $b_0$, $b_1$, and $b_2$ which maximize the probability (likelihood) of the actually observed heights with mathematical techniques.
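In R, this common idea maps directly onto glm(). The following is only a sketch with a hypothetical data frame heightdat (columns height, weight, age) :

# Gaussian GLM with identity link : height explained by weight and age
fit <- glm(
  height ~ weight + age,
  data = heightdat,   # hypothetical survey data
  family = gaussian(link = "identity"))
summary(fit)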

Now let's see each regression family along with this common idea.

Logistic Regression (family = binomial)

Binomial regression with the logit link function is called "logistic regression". Here I show you a brief outline.

Suppose each trial succeeds with probability $q$. Then it is known that the probability of $y_1$ successes among $n$ independent trials (i.e., $y_2 = n - y_1$ failures) is the following. (This probability distribution is called the binomial distribution.)

$P(y_1) = \binom{n}{y_1} q^{y_1} (1-q)^{n-y_1}$ where $\binom{n}{y_1} = \frac{n!}{y_1!\,(n-y_1)!}$

$P$ is a function of $n$, $y_1$, and $q$. But when $n$ and $y_1$ are given values, it represents a function of the unknown parameter $q$ :

$L(q) = \binom{n}{y_1} q^{y_1} (1-q)^{n-y_1}$ where $n$ and $y_1$ are given values.

When a lot of sets $(n_i, y_{1,i})$ are given, we can define the overall probability as the product $L(q) = \prod_i \binom{n_i}{y_{1,i}} q^{y_{1,i}} (1-q)^{n_i - y_{1,i}}$, because the multiplication of probabilities means the logical product (all observations occurring together).

Let's consider the following case : we have some small fish in fish tanks (50 fish per observation), and we survey the survival rate depending on the amount of food and the amount of oxygen supplied, as follows.

Alive(y1) Dead(y2) Food(x1) Oxygen(x2)
39 11 300 2000
30 20 150 1200
12 38 100 1000
...

In the previous function we supposed that the parameter $q$ is always the same. But in this case the parameter $q_i$ (the probability of survival) depends on food and oxygen: $q_i = b_0 + b_1 x_{1,i} + b_2 x_{2,i}$. (This expression $b_0 + b_1 x_{1,i} + b_2 x_{2,i}$ is the linear predictor in the previous illustration.)

Then the model is written as follows. As you can see, this is a function of the parameters $b_0, b_1, b_2$ for the given data $(N_i, y_{1,i}, y_{2,i}, x_{1,i}, x_{2,i})$.

$L(b_0, b_1, b_2) = \prod_i \binom{N_i}{y_{1,i}} q_i^{\,y_{1,i}} (1 - q_i)^{\,y_{2,i}}$

That is, we should find the parameters $b_0, b_1, b_2$ which maximize $L(b_0, b_1, b_2)$.

In logistic regression, the following sigmoid function is often used as $q_i$ instead of the raw linear predictor. This is called the logistic function (strictly speaking, its inverse, the logit, is the link function), and I describe the link function later. (Please keep reading.)

$q_i = \frac{1}{1 + e^{-(b_0 + b_1 x_{1,i} + b_2 x_{2,i})}}$

As a result, the estimation function (likelihood) of logistic regression is written as follows:

$L(b_0, b_1, b_2) = \prod_i \binom{N_i}{y_{1,i}} \left(\frac{1}{1 + e^{-(b_0 + b_1 x_{1,i} + b_2 x_{2,i})}}\right)^{y_{1,i}} \left(1 - \frac{1}{1 + e^{-(b_0 + b_1 x_{1,i} + b_2 x_{2,i})}}\right)^{y_{2,i}}$

Let's see the following trivial example with R.
As you can see, we simulate with the true values $b_0 = -15$, $b_1 = 4$, and $b_2 = 2$, and the estimated results are -14.95783, 3.98255, and 1.99622. (Check the output of summary(fit) when you run the following code.)

library(e1071) # for sigmoid func

#####
# create sample data with rbinom
#####
# y1 : number of alive
# y2 : number of dead
# x1 : food
# x2 : oxygen
#####
n_sample <- 1000
total <- 50
b0 <- -15
b1 <- 4
b2 <- 2
x1 <- runif(
  n = n_sample,
  min = 0,
  max = 5)
x2 <- runif(
  n = n_sample,
  min = 0,
  max = 5)
probs <- sigmoid(b0 + b1 * x1 + b2 * x2)
y1 <- rbinom(
  n = n_sample,
  size = total,
  prob = probs)
y2 <- total - y1
sampledat <- data.frame(
  x1 = x1,
  x2 = x2,
  y1 = y1,
  y2 = y2
)

#####
# Estimate
#####
fit <- glm(
  cbind(y1, y2) ~ x1 + x2,
  data = sampledat,
  family = binomial(link = "logit"))
summary(fit)

Note : The default link function for binomial is "logit", so you do not need to specify link = "logit" explicitly. (Here I specify it explicitly for clarity.)

Let’s see another example.
The following dataset is similar to the previous one (sampledat), but each row has a z column which identifies "Dead" (0) or "Alive" (1) for each individual fish, instead of the aggregated counts y1 and y2.

sampledatA

Survival(z) Food(x1) Oxygen(x2)
1 300 2000
0 100 1000
1 300 2000
0 100 1000
0 300 2000
1 300 2000
1 100 1000
...

This dataset can be rewritten in the following aggregated format, which is the same shape as the previous one, so you could analyze it with the previous code. (See the aggregation sketch after the table below.)

sampledatB

Alive(y1) Dead(y2) Food(x1) Oxygen(x2)
3 1 300 2000
1 2 100 1000
...
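
For reference, here is a minimal R sketch of that aggregation (assuming a data frame sampledatA with the columns z, x1, x2 shown above; the name sampledatB is just for illustration):

# Aggregate the per-fish records (z = 1 alive, 0 dead) into counts per (x1, x2) group
alive <- aggregate(z ~ x1 + x2, data = sampledatA, FUN = sum)     # alive count per group
total <- aggregate(z ~ x1 + x2, data = sampledatA, FUN = length)  # total count per group
sampledatB <- data.frame(
  x1 = alive$x1,
  x2 = alive$x2,
  y1 = alive$z,             # number of alive
  y2 = total$z - alive$z)   # number of dead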

In R, however, you don't need to convert the dataset; you can just write the following using the z (0 or 1) column.

## with "Survival" column z (0 or 1)
fit2 <- glm(
  z ~ x1 + x2,
  data = sampledatA,
  family = binomial(link = "logit"))
summary(fit2)

If you're using the Microsoft R runtime, you can use rxLogit() in RevoScaleR or rxFastLinear() in MicrosoftML for scalable, high-performance execution as follows.
I don't describe the details here, but with rxLogit() you can also specify intermediate values between 0 and 1 (0.7, 0.25, etc.) for the response variable, and moreover you can take advantage of the Frisch–Waugh–Lovell theorem (see Wikipedia) for categorical data.

fit3 <- rxLogit(
  z ~ x1 + x2,
  data = sampledatA)
summary(fit3)

Poisson Regression (family = poisson)

Suppose some event occurs $\lambda$ times on average in some time interval. Then the probability $P(k)$ of the same event occurring $k$ times in that interval is known to be as follows. (This is called the Poisson distribution.)

$P(k) = \frac{\lambda^k e^{-\lambda}}{k!}$

Let's consider the following example: we're planting some seeds in a garden, and we survey how many seeds germinate depending on the amount of water and fertilizer.
In this model, we consider that $\lambda$ (the average number of germinating seeds) depends on the water ($x_1$) and the fertilizer ($x_2$). That is, the parameter $\lambda$ is modeled by the linear predictor $b_0 + b_1 x_1 + b_2 x_2$.

In Poisson regression, the following exponential function is often used for $\lambda_i$ instead of the raw linear predictor. The inverse of this function (i.e., the logarithm) is called the link function, and I describe the details of the link function later. (Please keep reading.)

$\lambda_i = e^{\,b_0 + b_1 x_{1,i} + b_2 x_{2,i}}$

Suppose a lot of data for $k$, $x_1$, and $x_2$ are given as follows (i.e., observations $(k_i, x_{1,i}, x_{2,i})$ for $i = 1, 2, \ldots$).

Germinating(k) Water(x1) Fertilizer(x2)
11 300 2000
8 150 1200
5 100 1000
...

Then we can define the total probability by multiplication as follows. (See above for the reason we multiply.)
We should find the parameters $b_0, b_1, b_2$ which maximize this likelihood; this is the maximum likelihood estimation.

$L(b_0, b_1, b_2) = \prod_i \frac{\lambda_i^{k_i} e^{-\lambda_i}}{k_i!}, \quad \text{where } \lambda_i = e^{\,b_0 + b_1 x_{1,i} + b_2 x_{2,i}}$

Note : In most cases the log-likelihood $\log L$ is used for the estimation, instead of using $L$ itself.

Let's see the following trivial example with R.
As you can see, we simulate with the true values $b_0 = 0.2$, $b_1 = 0.5$, and $b_2 = 0.4$, and the estimated results are 0.215383, 0.498815, and 0.402523.

#####
# create sample data with rpois
#####
# y : germinating
# x1 : water
# x2 : fertilizer
#####
n_sample <- 1000
b0 <- 0.2
b1 <- 0.5
b2 <- 0.4
x1 <- runif(
  n = n_sample,
  min = 0,
  max = 5)
x2 <- runif(
  n = n_sample,
  min = 0,
  max = 5)
lambdas <- exp(b0 + b1 * x1 + b2 * x2)
y <- rpois(
  n = n_sample,
  lambda = lambdas)
sampledat <- data.frame(
  x1 = x1,
  x2 = x2,
  y = y
)

#####
# Estimate
#####
fit <- glm(
  y ~ x1 + x2,
  data = sampledat,
  family = poisson(link = "log"))
summary(fit)

Note : The default link function for poisson is "log", so you do not need to specify link = "log" explicitly.

If you're using the Microsoft R runtime, you can use rxGlm() in RevoScaleR for scaled execution.

fit2 <- rxGlm(
  y ~ x1 + x2,
  data = sampledat,
  family = poisson(link = "log"))
summary(fit2)

Linear Regression (family = gaussian)

This regression is straightforward: it is represented by the following linear expression (with Gaussian noise $\varepsilon$).

$y = b_0 + b_1 x_1 + b_2 x_2 + \cdots + \varepsilon$

In the previous binomial and Poisson regressions, maximum likelihood estimation (MLE) is used for the parameter estimation.
But here we can use ordinary least squares (OLS) as follows. (The following assumes that only $x_1$ is used as the explanatory variable.) We measure the difference between the actual value and the predicted value; since this difference can be positive or negative, we square each difference and add them up.

We should find the parameters $b_0, b_1$ which minimize the following $E$ (the sum of squared errors):

$E = \sum_i \left( y_i - (b_0 + b_1 x_{1,i}) \right)^2$

Why is it called "Gaussian" ?
Let's consider the normal (Gaussian) distribution $N(\mu, \sigma^2)$. Suppose the standard deviation $\sigma$ (the square root of the variance) is some constant value for every observation; then the distribution depends only on the parameter $\mu$ (the mean).
Let's consider maximum likelihood estimation (MLE) for this Gaussian distribution. Suppose $\mu_i = b_0 + b_1 x_{1,i}$, and that each observed $y_i$ falls within some small range $\Delta y$ around its value ($\Delta y$ is a fixed value). If $\Delta y$ is very small, the product of the likelihoods is defined as follows. (See above for the reason we multiply.)

$L = \prod_i \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(y_i - \mu_i)^2}{2\sigma^2} \right) \Delta y$

Now we take the logarithm of this likelihood (the log-likelihood):

$\log L = -\frac{1}{2\sigma^2} \sum_i (y_i - \mu_i)^2 + n \log\frac{\Delta y}{\sqrt{2\pi\sigma^2}}$

We get the maximum of this log-likelihood exactly when we minimize $\sum_i (y_i - \mu_i)^2$ (because $\sigma$ and $\Delta y$ are constants).
That is, maximum likelihood estimation (MLE) for the Gaussian distribution results in ordinary least squares (OLS) for linear regression.

For example, it is known that human height approximately follows $N(\mu, \sigma^2)$ with some fixed $\mu$ and $\sigma$. Suppose the average height $\mu$ depends on the weight ($x_1$) and that this relation is linear: $\mu = b_0 + b_1 x_1$.
In this case you can get the regression result with the OLS approach, minimizing $\sum_i (y_i - b_0 - b_1 x_{1,i})^2$.

Let's see the following trivial example with R.
As you can see, we simulate with the true values $b_0 = 44$ and $b_1 = 0.14$, and the estimated results are 44.085248 and 0.139133.

#####
# create sample data
#####
# y : height (inch)
# x1 : weight (pound)
#####
n_sample <- 1000
b0 <- 44
b1 <- 0.14
x1 <- runif(
  n = n_sample,
  min = 45,
  max = 180)
epsilon <- rnorm(
  n = n_sample,
  mean = 0,
  sd = 1.57)  # noise (standard deviation)
y <- b0 + b1 * x1 + epsilon
sampledat <- data.frame(
  x1 = x1,
  y = y
)

#####
# Estimate
#####
fit <- glm(
  y ~ x1,
  data = sampledat,
  family = gaussian(link = "identity"))
summary(fit)

Note : link = "identity" means the link function $g(\mu) = \mu$; that is, no specific transformation is applied. (The default link function for gaussian is "identity", so you do not need to specify link = "identity" explicitly.)
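
As a quick cross-check (a sketch reusing the sampledat generated above), plain lm() returns the same coefficient estimates, since Gaussian MLE with the identity link coincides with OLS:

# OLS via lm() gives the same estimates as glm(..., family = gaussian)
fit_ols <- lm(y ~ x1, data = sampledat)
summary(fit_ols)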

If you're using the Microsoft R runtime, you can use rxLinMod() in RevoScaleR or rxFastLinear() in MicrosoftML for scaled, high-performance execution.

fit2 <- rxLinMod(
  y ~ x1,
  data = sampledat)
summary(fit2)

Gamma Regression (family = Gamma)

Suppose some event occurs $r$ times on average in a unit (1) interval. Then the probability density function $f(t)$ of the waiting time $t$ until the event has occurred $s$ times is known to be as follows:

$f(t) = \frac{r^s}{\Gamma(s)}\, t^{s-1} e^{-rt}$

where $\Gamma(s)$ is the Gamma function. ($\Gamma(s) = (s-1)!$ for any positive integer $s$, and it extends the factorial to real and complex numbers. See "Wikipedia : Gamma function".)

Note : Keep in mind that this is not a probability itself but a probability density function. When you estimate with MLE, you must integrate, or take the probability over some fixed small range as I described above (see the Gaussian regression section).

This regression is often used for analyzing waiting times, time between failures, accident rates, and so on.
There are two parameters, $s$ (called "shape") and $r$ (called "rate"), but we suppose the shape $s$ is the same (constant) for every observation in this regression.

For example, let's consider the time to failure (breakdown) of some equipment, and suppose this depends on the number of initial bugs ($x_1$) and the number of customer claims ($x_2$). Suppose one failure (breakdown) is caused by 10 errors. When the number of initial bugs increases, the errors in production also increase. That is, when $x_1$ (the number of initial bugs) increases, $s$ (the count of errors needed for one failure) stays constant, but $r$ (the count of errors per unit interval) increases. Consequently the average time to failure decreases.

The quantity predicted in Gamma regression is the average time $\mu = \frac{s}{r}$, and since $s$ is constant, $\mu$ is proportional to $\frac{1}{r}$ as a result ($\frac{1}{r}$ is called the "scale").
The following inverse link function is then often used for Gamma regression. (I describe the details of the link function later.)

$\frac{1}{\mu_i} = b_0 + b_1 x_{1,i} + b_2 x_{2,i}$

Note : I don't explain the details here, but $s$ (the shape) can also be estimated. (The variance of the Gamma distribution is known to be $s / r^2$.)

Let's see the following trivial example with R.
As you can see, we simulate with the true values $b_0 = 0.5$, $b_1 = 0.02$, and $b_2 = 0.01$, and the estimated results are 0.504847, 0.021330, and 0.008929.

#####
# create sample data with rgamma
#####
# y : time to failure
# x1 : number of initial bugs
# x2 : number of customer claims
#####
n_sample <- 1000
shape <- 10
b0 <- 0.5
b1 <- 0.02
b2 <- 0.01
x1 <- runif(
  n = n_sample,
  min = 1,
  max = 5)
x2 <- runif(
  n = n_sample,
  min = 11,
  max = 15)
rate <- (b0 + b1 * x1 + b2 * x2) * shape
y <- rgamma(
  n = n_sample,
  shape = shape,
  rate = rate)
sampledat <- data.frame(
  x1 = x1,
  x2 = x2,
  y = y
)

#####
# Estimate
#####
fit <- glm(
  y ~ x1 + x2,
  data = sampledat,
  family = Gamma(link = "inverse"))
summary(fit)

When you want to get the shape $s$, you can use gamma.shape() in the MASS package as follows. (With the estimated shape you can reproduce the original Gamma distribution model.)

library(MASS)
...

# you can get shape as follows.
test <- gamma.shape(fit, verbose = TRUE)
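
With the estimated shape you can also recover the per-observation rate, since the fitted mean is $\mu_i = s / r_i$. A minimal sketch (assuming the fit and test objects above; gamma.shape() stores the shape estimate in its alpha component):

# Recover the per-observation rate r_i from the estimated shape and fitted mean
s_hat <- test$alpha      # estimated shape s
mu_hat <- fitted(fit)    # estimated mean mu_i = s / r_i
r_hat <- s_hat / mu_hat  # estimated rate r_i for each observation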

If you're using the Microsoft R runtime, you can use rxGlm() in RevoScaleR for scaled execution.

fit2 <- rxGlm(
  y ~ x1 + x2,
  data = sampledat,
  family = Gamma(link = "inverse"))
summary(fit2)

About Link Functions

Now it's time to explain the idea of the link function.
In this post I focus on the functions which we used in the previous examples.

Log Link Function

Remember the previous Poisson regression example.
Suppose we model $\lambda$ (the average number of germinating seeds) directly with the linear predictor as follows:

$\lambda = b_0 + b_1 x_1 + b_2 x_2 \quad (1)$

Suppose the seeds germinate 1.5 times as much with enough water and 1.2 times as much with enough fertilizer. When you give both enough water and enough fertilizer, would the seeds germinate 1.5 + 1.2 = 2.7 times as much?
Of course not. The expected value would be 1.5 * 1.2 = 1.8 times as much.
Thus (1) does not represent the actual model.

Let's compare with the following model (2). This is called the log link, because the linear predictor $b_0 + b_1 x_1 + b_2 x_2$ equals $\log \lambda$.

$\lambda = e^{\,b_0 + b_1 x_1 + b_2 x_2} \quad (2)$

When we write $e^{b_0}$, $e^{b_1 x_1}$, and $e^{b_2 x_2}$ as separate factors, model (2) becomes their product: $\lambda = e^{b_0} \cdot e^{b_1 x_1} \cdot e^{b_2 x_2}$.
As you can see, (1) grows by addition, but (2) grows by multiplication. (With model (1), $\lambda$ could become negative when $x_1$ or $x_2$ decreases, but that cannot happen with model (2).)

There is no absolute answer (it depends on the case), but you must understand the difference in the properties of these two models and choose the link function that matches your actual modeling.

Logit Link Function

The logit link is used to represent values which are 0 or 1 (or anywhere in between). Unlike the log link, the predicted value never exceeds 1 for any explanatory variables. This idea is also used in neural networks and other recognition algorithms.

Remember the previous logistic regression example. Now we consider the following logical model:

  • Each input $x_1$ (food), $x_2$ (oxygen) has a weight $w_1$, $w_2$ respectively.
  • Get the total weight by $w_1 x_1 + w_2 x_2$.
  • Compare it with some fixed threshold b. If it exceeds b, the result is 1; otherwise the result is 0.

This model captures an aspect of the actual decision mechanism, but one important weakness is that it is not continuous (it is discrete).
If we fix $w_1$, $w_2$, and b, the result is a step function of the inputs. Because it is not continuous, we cannot apply an approximation (optimization) approach to it.

Therefore we introduce the following function, called the "sigmoid":

$\sigma(t) = \frac{1}{1 + e^{-t}}$

As you can see, the sigmoid is very similar to the previous step function, but it is continuous.

Now we suppose $t = b_0 + b_1 x_1 + b_2 x_2$, and we get the probability $q$ (the probability of being alive) as follows:

$q = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + b_2 x_2)}}$

The logit link function is the inverse of this sigmoid:

$\log\frac{q}{1 - q} = b_0 + b_1 x_1 + b_2 x_2$

Note : As you can see, $1 - q$ represents the probability of being dead. The ratio $\frac{q}{1 - q}$ is called the odds.
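
For completeness, here is the short algebra showing that the logit is indeed the inverse of the sigmoid (using the same symbols as above, with $t = b_0 + b_1 x_1 + b_2 x_2$):

q = \frac{1}{1 + e^{-t}}
\;\Longrightarrow\; e^{-t} = \frac{1 - q}{q}
\;\Longrightarrow\; t = \log\frac{q}{1 - q}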

Inverse Link Function

$g(p) = \frac{1}{p}$ is called the inverse link function, where p is the predicted parameter.
Remember the previous Gamma regression example. The predicted parameter is the mean $\mu = \frac{s}{r}$, where s is the shape and r is the rate, and s is constant. Therefore the model means the following:

$\frac{1}{\mu} = b_0 + b_1 x_1 + b_2 x_2$

where

$\mu = \frac{s}{r} = s \times \frac{1}{r}$

As I explained in the Gamma regression section, r (the rate) means "the occurrence count per unit (1) interval". Then $\frac{1}{r}$ means "the average interval between occurrences"; this is called the "scale".
If the linear predictor is proportional to the scale (i.e., inversely proportional to the rate), you can specify the identity link function (i.e., no specific link function is used) for Gamma regression as follows, instead of the inverse link function.

fit <- glm(
  y ~ x1 + x2,
  data = sampledat,
  family = Gamma(link = "identity"))

Offset

When you want to include some input variable whose coefficient is not estimated but fixed to 1, you can use an offset.

For instance, let's introduce a new input variable "observation count" ($x_3$) into the previous Poisson regression example; the expected count should simply scale with $x_3$. Then the following linear predictor is used:

$\log \lambda = b_0 + b_1 x_1 + b_2 x_2 + \log x_3$

In such a case, you can use offset as follows.

fit <- glm(
  y ~ x1 + x2,
  data = sampledat,
  family = poisson(link = "log"),
  offset = log(x3))

You might think you could just divide the response y by $x_3$ instead of using an offset, but with an offset you can analyze the data without distorting the variance or other statistical properties of the count model.

 

Understanding ARIMA time series analysis with R (part 2)

Part1 : Discuss under the stationarity – ARMA
Part2 : Non-stationarity and Seasonality – ARIMA and Seasonal ARIMA  (<- you are here)

Non-stationary time series

In my previous post I explained the ideas of AR, MA, and ARMA under the condition of stationarity, as a starting point for your understanding.

But as you know, almost all practical time series (stock prices, sales revenue, etc.) are non-stationary, or a mix of stationary and non-stationary components. In order to understand non-stationary models it's important to understand the ideas of stationarity first, because the non-stationary case is built on the stationary one.

The idea of the non-stationary model is: consider the series of differences $\Delta y_t = y_t - y_{t-1}$, and examine the stationarity of this series.
If the difference series follows ARMA(p, q) (i.e., a stationary model), we can recover the original $y_t$. This model is called the ARIMA (autoregressive integrated moving average) model, denoted ARIMA(p, 1, q).
In general, ARIMA(p, d, q) means: the series of d-th order differences follows ARMA(p, q).

As I described in my previous post, every MA model is stationary, so I'll focus on the AR part when considering non-stationary models.

As you can see below, the mathematical behavior of the non-stationary model is vastly different from the stationary one, even for a small difference in the parameter $\phi_1$. (As I described in my previous post, AR(1) is stationary when $|\phi_1| < 1$.)
Unlike a stationary process, a non-stationary process diverges in general, so several mathematical approaches (ideas) must change under non-stationarity.

(Plots omitted: a sample path of a stationary AR(1) series, and two sample paths of non-stationary AR(1) series with slightly different $\phi_1$.)
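
If you want to generate sample paths like these yourself, here is a minimal R sketch (the coefficients 0.9, 1.0, and 1.01 are arbitrary values chosen for illustration):

set.seed(1)
n <- 200
e <- rnorm(n)

# stationary AR(1): y_t = 0.9 * y_{t-1} + e_t
y_stat <- filter(e, filter = 0.9, method = "recursive")
# unit root (random walk): y_t = y_{t-1} + e_t
y_unit <- cumsum(e)
# explosive AR(1): y_t = 1.01 * y_{t-1} + e_t
y_expl <- filter(e, filter = 1.01, method = "recursive")

plot.ts(cbind(y_stat, y_unit, y_expl), main = "AR(1) sample paths")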

Identifying a model is also not easy.
Unlike the stationary case, we cannot use the t-distribution (which is often used in statistical testing) for estimation under non-stationarity.

First we consider the following AR(p) expression (1). (See my previous post for the notation; $\varepsilon_t$ is white noise.)

$y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t \quad (1)$

The following equation (2) is called the characteristic equation of (1) :

$1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p = 0 \quad (2)$

From the roots of equation (2) we get the following:

  • If the absolute values of all roots are larger than 1, (1) is stationary.
  • If z = 1 is a root of multiplicity d (i.e., the characteristic polynomial has the factor $(1 - z)^d$) and the absolute values of all other roots are larger than 1, the series of d-th order differences of (1) is stationary.
    In particular, if there is exactly one unit root (d = 1), the difference series $\Delta y_t = y_t - y_{t-1}$ of (1) is stationary. Such a process is called a unit root process.

Now let's consider the simple case of AR(1), which has the following equation. In this case it is stationary if $|\phi_1| < 1$, and it is a unit root process if $\phi_1 = 1$.

$y_t = \phi_1 y_{t-1} + \varepsilon_t$

If $y_t$ is a stationary process (i.e., $|\phi_1| < 1$), we can identify the model using the approach I described in the previous post. If $y_t$ is a unit root process (i.e., $\phi_1 = 1$), we can also identify the model, because the difference series $\Delta y_t$ is stationary. If it is neither, you can repeat the same procedure for the series $\Delta y_t$.
Therefore, when real sample data is given, it's important to know whether it is a unit root process or a stationary process.

Under a unit root process we cannot use the t-distribution for testing, because the process is not stationary. Instead, we use the Dickey-Fuller test to distinguish whether the model is a unit root process.

Here (in this post) I don't describe the details of this test, but the outline is :

  1. Consider the following continuous stochastic process $W_n(u)$ on $u \in [0, 1]$, where n is the size of the given data and $\sigma$ is the standard deviation of the noise $\varepsilon_t$. (Note that $y_t = \varepsilon_1 + \varepsilon_2 + \cdots + \varepsilon_t$ under the unit root hypothesis.)

    $W_n(u) = \frac{1}{\sigma \sqrt{n}}\, y_{\lfloor n u \rfloor}$

    Intuitively, this maps the series $y_1, \ldots, y_n$ as y onto $[0, 1]$ divided into spans of width $\frac{1}{n}$, rescaled by $\frac{1}{\sigma\sqrt{n}}$.
  2. As n gets larger, $W_n$ converges in distribution to standard Brownian motion (the standard Wiener process) on $[0, 1]$.
  3. Estimate and test the original unit root process based on this well-known distribution (standard Brownian motion) with sufficiently large n.

An extension of this test to AR(p) is called the Augmented Dickey–Fuller (ADF) test. (Here I don't explain how the technique is extended.) You can use adf.test() in the tseries package in R for the ADF test.
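
For example, a minimal sketch of the ADF test (assuming a numeric series like the y$REVENUE column read in the next code block; 0.05 is just the conventional significance level):

library(tseries)

# Null hypothesis of adf.test : the series has a unit root (non-stationary)
result <- adf.test(y$REVENUE)
print(result)

# If the p-value is not small (e.g. >= 0.05), difference the series and test again
result_diff <- adf.test(diff(y$REVENUE))
print(result_diff)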

When you identify a model manually, you must follow the previous steps. But in R you can get the result without any difficulty: just like the stationary case (see my previous post), you can simply use auto.arima() to identify the model.

library(forecast)

# read data
y <- read.table(
  "C:\\Demo\\revenue_sample.txt",
  col.names = c("DATETIME", "REVENUE"),
  header = TRUE,
  quote = "\"",
  fileEncoding = "UTF-8",
  stringsAsFactors = FALSE)

# analyze
test = auto.arima(y$REVENUE)
summary(test)

Here the "drift" (see the output above) is the trend of the differenced series $\Delta y_t$. Therefore the drift is close to the average gradient (slope) of $y_t$. (If you plot $y_t$, it clearly shows a linear trend.)

Hence the fitted model of $y_t$ can be read off from the summary output above: an ARMA model for the differenced series $\Delta y_t$ together with the estimated drift term.

As I described above, you must be careful about the difference between stationarity and non-stationarity, even when you're using R.
For example, when you predict (forecast) future values under non-stationarity, the predicted mean (conditional mean) does not converge to a fixed mean (it keeps moving by the drift), and the predicted variance keeps growing. (See below.)
Please compare the following forecast plot (non-stationary result) with the one in my previous post (stationary result).

...

# analyze
test = auto.arima(y$REVENUE)
summary(test)

# predict
pred <- forecast(
  test,
  level=c(0.85,0.95),
  h=50)
plot(pred)

I don't explain the details here, but you must also be careful when applying other machine learning algorithms (regression, anomaly detection, ...) to non-stationary time series. (For example, see "Spurious relationship" in Wikipedia.)

Backshift notation

Before discussing the subsequent topics, let me introduce a convenient notation called "backshift notation".

With this notation we denote $y_{t-1}$ by $B y_t$. Of course, this doesn't mean multiplying $y_t$ by some real number B; it is just notation!
Likewise, $B^2 y_t$ means $y_{t-2}$.

Using this notation we can handle time-series differences with simple arithmetic operations.
For example, $(1 - B) y_t$ means the difference of the sequence:

$(1 - B) y_t = y_t - y_{t-1} = \Delta y_t$

$(1 - B)^2 y_t$ means the 2nd order difference:

$(1 - B)^2 y_t = y_t - 2 y_{t-1} + y_{t-2} = \Delta y_t - \Delta y_{t-1}$

By repeating this operation, you get the d-th order difference as $(1 - B)^d y_t$.
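
As a small numeric check (a sketch with an arbitrary toy series), R's diff() computes exactly these backshift differences:

y_toy <- c(1, 4, 9, 16, 25)          # toy series
d1 <- diff(y_toy)                    # (1 - B) y : 3 5 7 9
d2 <- diff(y_toy, differences = 2)   # (1 - B)^2 y : 2 2 2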

With this notation, ARIMA(1,1,0) is denoted by the following equation, where c is a constant.

$(1 - \phi_1 B)(1 - B)\, y_t = c + \varepsilon_t$

This notation clarifies the structure of the model, because $(1 - \phi_1 B)$ represents the AR(1) part and $(1 - B)\, y_t$ represents the difference $\Delta y_t$.
Thus, when you want to denote ARIMA(1,d,0), you can easily write the following representation.

$(1 - \phi_1 B)(1 - B)^d\, y_t = c + \varepsilon_t$

Now let's rewrite the previous expression (1) with backshift notation.

$(1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p)\, y_t = \varepsilon_t$

As you can see, the LHS (left-hand side) of the last expression has the same form as expression (2), with z replaced by B.
If (2) can be represented by $Y(z)(1 - z)^d = 0$, where the absolute values of all roots of $Y(z)$ are larger than 1, the series of d-th order differences of (1) is stationary, because the process can then be written as follows ($Y(B)$ gives the stationary part and $(1 - B)^d$ is the d-th order differencing).

$Y(B)\,(1 - B)^d\, y_t = \varepsilon_t$

As you can see, you can easily get the previous consequence with backshift notation.

Seasonal ARIMA modeling

With backshift notation you can easily understand the background of seasonal ARIMA.
First let's see the basic concept with a simple AR(1) (without any differencing, d = 0).

Suppose the sales revenue is strongly affected by the revenue of the same month one year ago (12 months ago), and this relation is written by the following expression. (For simplicity, we omit the constant value c.)

$(1 - \Phi_1 B^{12})\, y_t = \varepsilon_t$

Note : $\Phi_1$ is the seasonal AR(1) coefficient. We often use an upper-case character for the seasonal part in order to distinguish it from the non-seasonal part $\phi_1$.

Suppose the revenue is also affected by the revenue of one month ago.
Then the mixed model is represented as follows. This is denoted by ARIMA(1,0,0)(1,0,0)[12].

$(1 - \phi_1 B)(1 - \Phi_1 B^{12})\, y_t = \varepsilon_t$

What does that model mean? (Why do we multiply the two factors?)

When we expand this representation, we get the following.

$(y_t - \Phi_1 y_{t-12}) = \phi_1 (y_{t-1} - \Phi_1 y_{t-13}) + \varepsilon_t$

As you can see, $(y_t - \Phi_1 y_{t-12})$ is the revenue at time t with the effect of 12 months ago removed. Likewise, $(y_{t-1} - \Phi_1 y_{t-13})$ is the revenue at time t-1 with the effect of 12 months ago removed.
Therefore this expression means "an AR(1) model on the one-month lag, after eliminating the effect of 12 months ago".

This can also be written in the following way, and you can see that the converse interpretation holds as well (an AR(1) model on the 12-month lag, after eliminating the one-month effect).

$(y_t - \phi_1 y_{t-1}) = \Phi_1 (y_{t-12} - \phi_1 y_{t-13}) + \varepsilon_t$

Here I showed the example with no differencing, but when the model has the difference d, you just multiply by (1 - B) in backshift notation.
For example, ARIMA(1,1,0)(1,1,0)[12] is represented as follows :

$(1 - \phi_1 B)(1 - \Phi_1 B^{12})(1 - B)(1 - B^{12})\, y_t = \varepsilon_t$

  • $(1 - \phi_1 B)$ is the AR(1) of the non-seasonal part.
  • $(1 - \Phi_1 B^{12})$ is the AR(1) of the seasonal part.
  • $(1 - B)$ is the difference of the non-seasonal part.
  • $(1 - B^{12})$ is the difference of the seasonal part.

Note : If you have MA(1), you write it on the RHS as a factor on the noise, for example $(1 + \theta_1 B)\,\varepsilon_t$. (In this post we don't discuss the MA part of seasonal ARIMA.)

When we expand the equation of ARIMA(1,0,0)(1,0,0)[12], you can see that it is the same as an ordinary AR model with lags 1, 12, and 13, as follows.

$y_t = \phi_1 y_{t-1} + \Phi_1 y_{t-12} - \phi_1 \Phi_1 y_{t-13} + \varepsilon_t$

Hence if we didn't know the seasonality, we would have to estimate a model with all possible coefficients up to that lag. (Imagine the case of seasonality with a seasonal lag of 365.)

The following shows the ACF and PACF of a series with this seasonal structure in R. (See my previous post about ACF and PACF; here I don't explain what they mean.)
As you can see in the result, there are spikes at both the non-seasonal and the seasonal lags. Thus ACF and PACF can also be used to identify a seasonal model.

(Plot omitted: autocorrelation (ACF) sample.)

(Plot omitted: partial autocorrelation (PACF) sample.)
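
For reference, a minimal sketch of how to draw these two plots in R (assuming the seasonal series y$REVENUE read in the following code block):

# ACF and PACF of the seasonal series (spikes expected around lags 12, 24, ...)
acf(y$REVENUE, lag.max = 48)
pacf(y$REVENUE, lag.max = 48)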

Let's see a brief example in R.
First we identify the model and forecast without telling R about the seasonality. As you can see in the result, we get an unexpected outcome with no seasonal component estimated. (The result is ARIMA(0,1,1) with drift.)

library(forecast)

# read data
y <- read.table(
  "C:\Demo\seasonal_sample.txt",
  col.names = c("DATETIME", "REVENUE"),
  header = TRUE,
  fileEncoding="UTF-8",
  stringsAsFactors = FALSE)

# analyze
test <- auto.arima(y$REVENUE)
summary(test)

# predict and plot
pred <- forecast(
  test,
  level=c(0.85,0.95),
  h=100)
plot(pred)

Suppose we know that this dataset has seasonality with lag 12 (for instance, 12 months).
Now we fix the previous R script to use this known seasonality as follows. (The only change is the auto.arima() call, which now wraps the data in ts() with freq = 12.)
As you can see, we get the desired result with the seasonal ARIMA analysis.

library(forecast)

# read data
y <- read.table(
  "C:\Demo\seasonal_sample.txt",
  col.names = c("DATETIME", "REVENUE"),
  header = TRUE,
  fileEncoding="UTF-8",
  stringsAsFactors = FALSE)

# analyze with seasonal s=12
test <- auto.arima(ts(data=y$REVENUE, freq=12))
summary(test)

# predict and plot
pred <- forecast(
  test,
  level=c(0.85,0.95),
  h=100)
plot(pred)
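
If you already know the model orders, you can also specify them explicitly with Arima() in the forecast package instead of relying on auto.arima(). A minimal sketch (the orders below just mirror the ARIMA(1,0,0)(1,0,0)[12] example discussed above; they are for illustration, not the identified model):

# Fit a seasonal model with explicitly specified orders
fit_manual <- Arima(
  ts(y$REVENUE, frequency = 12),
  order = c(1, 0, 0),      # non-seasonal AR(1)
  seasonal = c(1, 0, 0))   # seasonal AR(1) with period 12
summary(fit_manual)

# Forecast with the manually specified model
pred_manual <- forecast(fit_manual, h = 100)
plot(pred_manual)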