Power BI Custom Authentication in ISV applications (Custom Data Connector)

Sorry for the long interval between posts due to some personal matters; this blog is still alive. (My wife is having trouble with her leg, and I'm currently adjusting my working hours...)

In this post I describe how ISVs can implement their own Power BI Data Connector with a custom authentication experience.
When users select "Get Data" in Power BI, your custom connector will appear. Your connector can provide a custom authentication UI and custom data.
Even when an ISV creates (and submits) a custom template content pack, this connector flow is needed when custom OAuth authentication is used.

In this post I explain this flow in a straightforward way from the developer's point of view.

Before starting…

Before starting, please enable the custom connector feature in your Power BI Desktop for development as follows.

  • Launch Power BI Desktop
  • Select “File” – “Options and settings” – “Options” menu
  • In the dialog window, select “Preview features” and enable “Custom data connectors”

Note : Currently, Data Connectors are only supported in Power BI Desktop. (You also cannot publish them to the Power BI service.)

Next you must install the Power Query SDK in Visual Studio. After installation, you can see the Data Connector project template in Visual Studio.

Overall Flow

For developing custom authentication, you must understand how OAuth works with API applications.
You can use your own favorite OAuth provider (Google account, your own custom IdP, etc) for building OAuth supported Data Connector, but here we use Azure Active Directory (Azure AD) for implementation. (Here we use Azure AD v1 endpoint. See the note below.)

Step1 ) Beforehand you must register your API application (A) and your client application (= Power BI Data Connector) (B) in Azure AD. This registration is needed only once, by the ISV. (The user doesn't need to register anything.)

Step2 ) When the user uses your Power BI Data Connector, your data connector (B) opens the following URL in a web browser (or browser component) for sign-in.

https://login.microsoftonline.com/common/oauth2/authorize
  ?response_type=code
  &client_id={app id of B}
  &resource={app uri of A}
  &redirect_uri={redirected url for B}

Step3 ) With a Power BI Data Connector, the preceding "redirected url for B" is the fixed URL "https://preview.powerbi.com/views/oauthredirect.html".
After the user has logged in, the authorization code is returned in the query string as follows, and your data connector (B) can retrieve this code value.

https://preview.powerbi.com/views/oauthredirect.html?code={returned code}

Step4 ) Next your data connector (B) posts the following HTTP request to Azure AD (without UI). The access token is then returned in the HTTP response as follows.

POST https://login.microsoftonline.com/common/oauth2/token
Content-Type: application/x-www-form-urlencoded

grant_type=authorization_code
&code={returned code}
&client_id={app id of B}
&redirect_uri=https%3A%2F%2Fpreview.powerbi.com%2Fviews%2Foauthredirect.html

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

{
  "token_type": "Bearer",
  "scope": "user_impersonation",
  "expires_in": "3599",
  "ext_expires_in": "0",
  "not_before": "{start time of this token}",
  "expires_on": "{expired time of this token}",
  "resource": "{app uri of A}",
  "access_token": "{returned access token}",
  "refresh_token": "{returned refresh token}",
  "id_token": "{returned id token}"
}

Step5 ) Your data connector (B) calls your api application (A) with the previous access token as follows.

GET https://contoso.com/testapi
Authorization: Bearer {access token}

Step6 ) Your API application (A) verifies the passed token and returns the result dataset if it's valid. (See "Develop your API with Azure AD" for the verification.)
The result dataset is shown in your Power BI client.

Note : Currently, a native application registered in the Azure AD v2 endpoint cannot have redirect URLs with the "http" or "https" protocols, so you cannot use "https://preview.powerbi.com/views/oauthredirect.html" as the redirect URL.
Therefore we use the Azure AD v1 endpoint in this post.

Below I explain how to set up and implement each of the corresponding steps above.

Register your app in Azure AD (Step 1)

First you must register your API application (A) and your client application (= Power BI Data Connector) (B) in the Azure portal.

When you register your API application (A), you must define the scope (permission or role) in the manifest. (See below.)
You can use appropriate scope values like "read" and "write" according to your API requirements. In this post we use the default settings (the scope value is "user_impersonation"), and your data connector (B) must use the scope "{app uri}/user_impersonation" (ex: https://test.onmicrosoft.com/235426d2-485d-46e0-ad0d-059501ab58a4/user_impersonation) for accessing your API application.

When you register your data connector (B), the app type must be "native application", and you must include "https://preview.powerbi.com/views/oauthredirect.html" (common for all data connectors) in the redirect URLs.

After that, open the required permissions setting of the data connector application (B) and add the scope (here the "user_impersonation" scope) for accessing the API application (A).

Note : For a production application, you must configure both applications (A and B) as multi-tenant applications, and the user must consent to your applications in the user's tenant before using your data connector. In this post we skip these settings.

Implement your API – Create OData Feed service compliant with Power BI (Step 6)

Before implementing your data connector, we shall implement the API application (A) beforehand.

You can create your API as a simple REST API (consumed with Web.Contents() in your data connector) or as an OData-compliant REST API (consumed with OData.Feed() in your data connector) with your favorite programming language.
Here we implement an OData v4 feed service with ASP.NET (.NET Framework) as follows. (See "OData.org : How to Use Web API OData to Build an OData V4 Service without Entity Framework" for details.)

First, create an "ASP.NET Web Application" project and select [Empty] and [Web API] in the creation wizard.

Add the Microsoft.AspNet.OData package with NuGet. (Not Microsoft.AspNet.WebApi.OData, because that one is not v4.)

Note : You cannot use .NET Framework 4.6.2 with the latest Microsoft.AspNet.OData 6.1.0 because of a DLL conflict.

Add and implement your MVC controller. Your controller must inherit from ODataController as follows.
Here I created the following simple controller, which returns the data (3 records) of the SalesRevenue class.

...
using System.Web.OData;
...

public class TestController : ODataController
{
  [EnableQuery]
  public IQueryable<SalesRevenue> Get()
  {
    return new List<SalesRevenue>()
    {
      new SalesRevenue { Id = 1, Country = "US", Value = 200 },
      new SalesRevenue { Id = 2, Country = "Japan", Value = 80 },
      new SalesRevenue { Id = 3, Country = "Germany", Value = 120 }
    }.AsQueryable();
  }
}

public class SalesRevenue
{
  public int Id { get; set; }
  public string Country { get; set; }
  public int Value { get; set; }
}
...

Edit your WebApiConfig.cs as follows.

...
using System.Web.OData.Builder;
using System.Web.OData.Extensions;
using Microsoft.OData.Edm;
...

public static class WebApiConfig
{
  public static void Register(HttpConfiguration config)
  {
    // Web API configuration and services

    // Web API routes
    config.MapHttpAttributeRoutes();

    config.Routes.MapHttpRoute(
      name: "DefaultApi",
      routeTemplate: "api/{controller}/{id}",
      defaults: new { id = RouteParameter.Optional }
    );

    // Add here
    ODataModelBuilder builder = new ODataConventionModelBuilder();
    builder.Namespace = "Demos";
    builder.ContainerName = "DefaultContainer";
    builder.EntitySet<Controllers.SalesRevenue>("Test");
    IEdmModel model = builder.GetEdmModel();
    config.MapODataServiceRoute("ODataRoute", "osalesdat", model);
  }
}
...

Now your API can be consumed as https://{your api hosted url}/osalesdat with Power BI Desktop. (Please try it with the OData feed connector !)

Finally you must add authentication to your API application.
As I mentioned earlier, your API will receive the following HTTP request from the Power BI Data Connector. Your API application must verify the access token (the Authorization header value below), retrieve several claims, and return data as you like.
For example, your API application can retrieve the tenant id from the token and return each tenant's data in your code. (i.e., you can easily implement multi-tenancy.)

GET https://contoso.com/api/test
Accept: */*
Accept-Encoding: gzip, deflate
Authorization: Bearer eyJ0eXAiOi...

For this implementation, please see my earlier post "Build your own Web API protected by Azure AD v2.0 endpoint with custom scopes"; with ASP.NET you can simply select "Configure Azure AD Authentication" from the right-click menu on your project, or configure it in the project creation wizard.

Note : If you select "new app" instead of "existing app" in the configuration dialog, your API application is automatically registered in Azure AD. In that case you don't need to register your API app manually in step 1.

When you want to protect your API, add the Authorize attribute as follows.

public class TestController : ODataController
{
  [Authorize]
  [EnableQuery]
  public IQueryable<SalesRevenue> Get()
  {
    return new List<SalesRevenue>()
    {
      new SalesRevenue { Id = 1, Country = "US", Value = 200 },
      new SalesRevenue { Id = 2, Country = "Japan", Value = 80 },
      new SalesRevenue { Id = 3, Country = "Germany", Value = 120 }
    }.AsQueryable();
  }
}

Implement your Data Connector (All)

Now let's create your data connector (B).
First, create your project with the "Data Connector Project" template in Visual Studio.

A Power BI Data Connector is written in the Power Query M formula language. (See the M function reference for details.) All logic is written in the "PQExtension1.pq" file. (We're assuming our project is named "PQExtension1".)
Here I show the overall sample code (PQExtension1.pq). (I'm skipping the code for exception handling, etc.)
Now let's see what this code is doing, step by step.

section PQExtension1;

[DataSource.Kind="PQExtension1", Publish="PQExtension1.Publish"]
shared PQExtension1.Contents = (optional message as text) =>
  let
    source = OData.Feed(
      "https://contoso.com/osalesdat",
      null,
      [ ODataVersion = 4, MoreColumns = true ])
  in
    source;
 
PQExtension1 = [
  Authentication = [
    OAuth = [
      StartLogin = StartLogin,
      FinishLogin = FinishLogin
    ]
  ],
  Label = "Test API Connector"
];

StartLogin = (resourceUrl, state, display) =>
  let
    authorizeUrl = "https://login.microsoftonline.com/common/oauth2/authorize"
      & "?response_type=code"
      & "&client_id=97f213a1-6c29-4235-a37b-a82dda14365c"
      & "&resource=https%3A%2F%2Ftest.onmicrosoft.com%2F235426d2-485d-46e0-ad0d-059501ab58a4"
      & "&redirect_uri=https%3A%2F%2Fpreview.powerbi.com%2Fviews%2Foauthredirect.html"
  in
    [
      LoginUri = authorizeUrl,
      CallbackUri = "https://preview.powerbi.com/views/oauthredirect.html",
      WindowHeight = 720,
      WindowWidth = 1024,
      Context = null
    ];

FinishLogin = (context, callbackUri, state) =>
  let
    query = Uri.Parts(callbackUri)[Query],
    tokenResponse = Web.Contents("https://login.microsoftonline.com/common/oauth2/token", [
      Content = Text.ToBinary("grant_type=authorization_code"
        & "&code=" & query
        & "&client_id=97f213a1-6c29-4235-a37b-a82dda14365c"
        & "&redirect_uri=https%3A%2F%2Fpreview.powerbi.com%2Fviews%2Foauthredirect.html"),
      Headers = [
        #"Content-type" = "application/x-www-form-urlencoded",
        #"Accept" = "application/json"
      ]
    ]),
    result = Json.Document(tokenResponse)
  in
    result;

...

Note : Use Fiddler for debugging HTTP flow.

Implement your Data Connector - Signing-in (Step 2)

When you integrate the OAuth authentication flow with your Power BI Data Connector, first you define your connector as follows. The following "Label" text will be displayed in the wizard window in Power BI. (See the screenshot in the "Run !" section below.)

PQExtension1 = [
  Authentication = [
    OAuth = [
      ...
    ]
  ],
  Label = "Test API Connector"
];

When you want to navigate to the OAuth login UI, you must add the following StartLogin.
Here 97f213a1-6c29-4235-a37b-a82dda14365c is the application id (client id) of your data connector (B), and https%3A%2F%2Ftest.onmicrosoft.com%2F235426d2-485d-46e0-ad0d-059501ab58a4 is the URL-encoded app URI of your API application (A). Please change these to your appropriate values.

PQExtension1 = [
  Authentication = [
    OAuth = [
      StartLogin = StartLogin,
      ...
    ]
  ],
  Label = "Test API Connector"
];

StartLogin = (resourceUrl, state, display) =>
  let
    authorizeUrl = "https://login.microsoftonline.com/common/oauth2/authorize"
      & "?response_type=code"
      & "&client_id=97f213a1-6c29-4235-a37b-a82dda14365c"
      & "&resource=https%3A%2F%2Ftest.onmicrosoft.com%2F235426d2-485d-46e0-ad0d-059501ab58a4"
      & "&redirect_uri=https%3A%2F%2Fpreview.powerbi.com%2Fviews%2Foauthredirect.html"
  in
    [
      LoginUri = authorizeUrl,
      CallbackUri = "https://preview.powerbi.com/views/oauthredirect.html",
      WindowHeight = 720,
      WindowWidth = 1024,
      Context = null
    ];

Implement your Data Connector - Get auth code (Step 3)

After the user has successfully logged in through the UI, the code is returned to the callback URI (https://preview.powerbi.com/views/oauthredirect.html) as part of the query string. To receive this code value in your data connector, you write the following.
The following FinishLogin is invoked by the connector framework after the login succeeds.

PQExtension1 = [
  Authentication = [
    OAuth = [
      StartLogin = StartLogin,
      FinishLogin = FinishLogin
    ]
  ],
  Label = "Test API Connector"
];
...

FinishLogin = (context, callbackUri, state) =>
  let
    query = Uri.Parts(callbackUri)[Query],
    code = query[code]
    ...
  in
    result;

Implement your Data Connector - Get access token (Step 4)

With the code value, your data connector can retrieve the access token as follows.
Here the M function Web.Contents() posts the HTTP request without UI (silently).

PQExtension1 = [
  Authentication = [
    OAuth = [
      StartLogin = StartLogin,
      FinishLogin = FinishLogin
    ]
  ],
  Label = "Test API Connector"
];
...

FinishLogin = (context, callbackUri, state) =>
  let
    query = Uri.Parts(callbackUri)[Query],
    tokenResponse = Web.Contents("https://login.microsoftonline.com/common/oauth2/token", [
      Content = Text.ToBinary("grant_type=authorization_code"
        & "&code=" & query
        & "&client_id=97f213a1-6c29-4235-a37b-a82dda14365c"
        & "&redirect_uri=https%3A%2F%2Fpreview.powerbi.com%2Fviews%2Foauthredirect.html"),
      Headers = [
        #"Content-type" = "application/x-www-form-urlencoded",
        #"Accept" = "application/json"
      ]
    ]),
    result = Json.Document(tokenResponse)
  in
    result;

As you can see, FinishLogin returns the parsed token response (a JSON record) to Power BI. (The access token and refresh token are included in this record, and the access token itself is an OAuth JWT.)

Implement your Data Connector - Call your api application (Step 5)

Finally you call your API application (the OData feed service generated previously) as follows.
Note that you don't need to specify the Authorization header in your M code; the framework automatically sets the access token in the request.

[DataSource.Kind="PQExtension1", Publish="PQExtension1.Publish"]
shared PQExtension1.Contents = (optional message as text) =>
  let
    source = OData.Feed(
      "https://contoso.com/osalesdat",
      null,
      [ ODataVersion = 4, MoreColumns = true ])
  in
    source;

Run !

Now let's build, deploy, and run!

Build your project in Visual Studio and copy the generated .mez file (which is in zip file format) into the "%UserProfile%\Documents\Microsoft Power BI Desktop\Custom Connectors" folder.
Launch Power BI Desktop and select "Get Data". As you can see below, your data connector now appears in the list.

When you select your custom data connector, the following dialog with "Sign In" button is displayed.

When you push "Sign In", the login-UI (in this case, Azure AD sign-in) is displayed.
Here we used Azure AD for our trivial example, but you can use your own provider (sign-in experience), as I mentioned earlier.

When you have successfully logged in and push the "Connect" button, the following navigator is displayed along with your OData feed metadata.
In this case we defined only the "Test" entity in the previous code, but you can also include functions (not only entities) if needed.

When you select entities and proceed, you can use the dataset in Power BI as follows.
As you can see, the end user doesn't need to care about your implementation (OAuth, OData, etc.). The user just uses the given connector and can securely get all the required data with custom authentication.

Note : Once you connect to the data source, the connection is cached in Power BI Desktop. To re-test the sign-in, clear the connection cache using the "File" - "Options and settings" - "Data source settings" menu.

The following sample code (which accesses Microsoft Graph) also handles exceptions and other details; please refer to it for further understanding.

[Github] MyGraph Connector Sample
https://github.com/Microsoft/DataConnectors/tree/master/samples/MyGraph

 

Reference : [Github] Getting Started with Data Connectors
https://github.com/Microsoft/DataConnectors

 

Overfitting in Regression and Neural Nets – Visually Understanding (Overview)

As a starting point for machine learning, here I show some primitive overfitting examples and explain what you should care about and how to avoid it. To help build your intuition, I show several samples with many visual images.
First we look at some simple overfitting examples in traditional statistical regression, and in the latter part we discuss the case of neural networks.

First look for Overfitting

Let me explain overfitting in machine learning with a brief example dataset. (See the following plot of the sample data.)

sampledat

The following R script fits a linear model to the above dataset (sampledat).
To fit the given data as closely as possible, here we use a degree-10 polynomial formula with the poly() function. You might think "the larger (the degree), the better", expecting that any unnecessary higher-order coefficients would simply be estimated as (nearly) zero.

fit <- lm(
  y ~ poly(x1, 10, raw = TRUE),
  data = sampledat)

Here is the result of this regression analysis.

As you can see, we get the fitted equation (1) for the given data.

Now we plot this equation together with the given dataset. Equation (1) fits the given data very closely, as follows.

Is that really a good result?

In fact, this dataset (sampledat) is generated by the following R script. As you can see below, the data follows y = 3 + x1 with some noise added. If new data points are generated by this script, they obviously won't fit equation (1).
That is, the result (equation (1)) is overfitting.

n_sample <- 20
b0 <- 3
b1 <- 1
x1 <- c(1:n_sample)
epsilon <- rnorm(
  n = n_sample,
  mean = 0,
  sd = 0.7)  # noise
y <- b0 + b1 * x1 + epsilon
sampledat <- data.frame(
  x1 = x1,
  y = y
)

Let's look at equation (1) from a bird's-eye view. (See the following plot of equation (1).)
Equation (1) fits only the given 20 data points (the above "sampledat"); it is not a generalized model and doesn't fit unknown data.
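To check this numerically, the following is a minimal R sketch (assuming the sampledat generated above is still in scope; fit10, fit1, and newdat are hypothetical names) that draws fresh data from the same process and compares the prediction errors of the degree-10 fit and a simple linear fit.

# a rough sketch : compare the degree-10 fit with a simple linear fit on new data
# (assumes "sampledat" generated by the script below is in scope)
fit10 <- lm(y ~ poly(x1, 10, raw = TRUE), data = sampledat)
fit1  <- lm(y ~ x1, data = sampledat)

# generate new data from the same process (y = 3 + x1 + noise)
newdat <- data.frame(x1 = runif(100, min = 1, max = 20))
newdat$y <- 3 + newdat$x1 + rnorm(100, mean = 0, sd = 0.7)

# root mean squared prediction error on the unseen data
sqrt(mean((predict(fit10, newdat) - newdat$y)^2))  # degree-10 : typically worse
sqrt(mean((predict(fit1,  newdat) - newdat$y)^2))  # linear    : close to the noise sd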

We showed a trivial overfitting example here for a first understanding, but in real practical cases it's difficult to tell whether a model is overfitting or not.
Our next questions are: How do we distinguish it? How do we avoid it?

Information Criterion

Now let's say you add an extra parameter to your regression formula, but the resulting likelihood improves only slightly, or stays almost the same. If so, you may suspect that this new parameter is not needed in the regression formula.

In the statistical approach, there exist criteria (called "information criteria") for judging your model fit, based on a mathematical background.
The most famous one is the Akaike Information Criterion (AIC), defined as follows. The smaller the value, the better the fit.

AIC = -2 ln L + 2 k

where k is the number of estimated parameters and L is the maximum likelihood.

Note : Both AIC and BIC (Bayesian information criterion) have the form -2 ln L + C k; in AIC, C equals 2 (a constant value).
Here we have only 20 survey data points (observations), and it's difficult to tell whether a deviation is noise or not. With BIC, C = ln(n) includes the number of observations as a parameter.

The following is the plot of the log-likelihood (ln L) and AIC for the previous dataset (sampledat), over increasing polynomial degree. The red line is the log-likelihood and the blue line is AIC. (You can easily get these values with the logLik() and AIC() functions in R.)

As you can see, the appropriate number of estimated parameters is 3. That is, the simple linear formula y = b0 + b1 x1 is good for fitting. (See the following note.)
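As a rough illustration, the following R sketch (again assuming sampledat from the first example) computes the log-likelihood and AIC for polynomial formulas of increasing degree; for this data the AIC is usually smallest around degree 1.

# a rough sketch : log-likelihood and AIC for polynomial models of degree 1..10
degrees <- 1:10
result <- sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x1, d, raw = TRUE), data = sampledat)
  c(degree = d, logLik = as.numeric(logLik(fit)), AIC = AIC(fit))
})
t(result)  # AIC is usually smallest around degree = 1 for this data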

Note : The hidden estimated parameters (like the variance for Gaussian, the shape for Gamma, etc.) must be counted as estimated parameters for AIC. In this case we are using Gaussian, so we must add the variance to the estimated parameters. (See my previous post "Understanding the basis of GLM Regression" for details.)
For instance, if the formula (equation) is y = b0 + b1 x1, then the estimated parameters are b0, b1, and the variance (3 parameters).

For instance, if we use the following dataset instead, the number of parameters must be 4, and the equation (formula) must be the quadratic y = b0 + b1 x1 + b2 x1^2.

Here we use only a single input (x1), but if you have several input variables, you must also consider the interactions between them.

Overfitting in neural networks

Let’s proceed to the neural nets for discussion.

First, remember that too many layers or neurons often cause overfitting. The number of layers especially affects the model complexity. (Of course, the network must still be deep enough to represent your target model.)
Here, too, it's not "the larger, the better".

To simplify the example, consider a brief feed-forward neural network with sigmoid activations, two input variables (x1, x2), and one binary output (a value between 0 and 1).
If we have 1 hidden layer, it can represent models like those illustrated below. (The model can have several linear boundaries and their combinations.)

If we have 2 hidden layers, it can represent more complex models, as illustrated below. (These are combinations of the 1-layer models.)

If we have some noisy data, the 2-hidden-layer network may cause overfitting as follows.
As you can see, adding more layers can lead to overfitting.
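The following is a minimal R sketch of this behavior (using the nnet package as a simple stand-in for a feed-forward network; the data and variable names are hypothetical): a small network and a large network are trained on the same noisy, nearly linear boundary, and the larger one tends to score worse on held-out data.

library(nnet)

set.seed(1)
n  <- 200
x1 <- runif(n); x2 <- runif(n)
y  <- factor(ifelse(x1 + x2 + rnorm(n, sd = 0.3) > 1, 1, 0))  # noisy linear boundary
dat   <- data.frame(x1, x2, y)
train <- dat[1:100, ]; test <- dat[101:200, ]

small <- nnet(y ~ x1 + x2, data = train, size = 2,  maxit = 500, trace = FALSE)
large <- nnet(y ~ x1 + x2, data = train, size = 30, maxit = 500, trace = FALSE)

mean(predict(small, test, type = "class") == test$y)  # usually generalizes well
mean(predict(large, test, type = "class") == test$y)  # often lower : fits the noise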

Note : Unlike the statistical approach, there's no concrete criterion for deciding the best number of layers or neurons, because there is no common evaluation measure based on a mathematical model.
You must examine and evaluate the generated model with test data or validation data.

Model complexity is also driven by large coefficients. Let's see the next example.

As you know, the sigmoid has a (near-)linear part and a saturated, binary-like part. The linear part can fit smoothly, but the binary part cannot (it gives an almost step-like fit).
As the weights increase, the binary part becomes stronger than the linear part.

For example, consider the following illustrated network.

This network produces the following plot (wire frame), where x1 and x2 are the inputs and z is the output. As you can see, it transitions smoothly.

Now let's see the next example.
This network has exactly the same boundary as the previous one, but the coefficients (weights and bias) are much larger.

When we plot the inputs (x1, x2) and the output (z), the transition is much sharper than before.

As the weights increase and the network has enough layers and neurons, it can easily produce more complex models. As a result it causes overfitting and a lack of generalization.

Large coefficients are easily produced:
just train with too many iterations (epochs, passes over the inputs, etc.). Train! Train! Train! The coefficient growth is driven by gradient descent.
For instance, the following is a simple feed-forward network for recognizing handwritten digits with mxnet in R. This script outputs the variance of each layer's weights.

require(mxnet)
...

# configure network
data <- mx.symbol.Variable("data")
fc1 <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=128)
act1 <- mx.symbol.Activation(fc1, name="relu1", act_type="relu")
fc2 <- mx.symbol.FullyConnected(act1, name="fc2", num_hidden=64)
act2 <- mx.symbol.Activation(fc2, name="relu2", act_type="relu")
fc3 <- mx.symbol.FullyConnected(act2, name="fc3", num_hidden=10)
softmax <- mx.symbol.SoftmaxOutput(fc3, name="sm")

# train
model <- mx.model.FeedForward.create(
  softmax,
  X=traindata.x,
  y=traindata.y,
  ctx=mx.cpu(),
  num.round = 10,
  learning.rate=0.07)

# dump weights and biases
params <- model$arg.params

# check the variance of weights in each layers
var(c(as.array(params$fc1_weight)))
var(c(as.array(params$fc2_weight)))
var(c(as.array(params$fc3_weight)))

When we set num.round = 100 in this script (see num.round above), we get larger and more widely distributed coefficients, as follows.
In this example we set an appropriate learning rate and number of layers based on experimental results, but the weights would increase even faster with higher values of these parameters.

Note : The learning rate also strongly affects the weights and accuracy. The learning rate must be small enough to steadily decrease the error in gradient descent, but large enough to converge as quickly as possible. (You must decide it through experiments.)

epoch = 10

epoch = 100

There exist several regularization techniques to mitigate this kind of overfitting in neural networks, as follows.

  • Early Stopping – A method to stop learning when some condition occurs (e.g., when the error is higher than at the last check).
  • Penalty (L1, L2) – A method to add a penalty term (weight decay) to the gradient descent objective to discourage weight growth.
  • Dropout – A method to randomly drop neurons during each training phase. This avoids the co-adaptation overfitting that occurs in complex structures with many layers and neurons, and effectively acts as model combination (similar to ensemble learning such as random forests).

The supported regularization methods differ between frameworks.
Among the Microsoft libraries, you can implement all of these regularization techniques (early stopping, L1/L2 penalty, dropout) with CNTK, but rxNeuralNet (in MicrosoftML) and NET# do not support them.

# Dropout with CNTK (Python)
...

with default_options(activation=relu, pad=True):
  model = Sequential([
    LayerStack(2, lambda : [
      Convolution((3,3), 64),
      Convolution((3,3), 64),
      MaxPooling((3,3), strides=2)
    ]),
    LayerStack(2, lambda i: [
      Dense([256,128][i]), 
      Dropout(0.5)
    ]),
    Dense(4, activation=None)
  ])
...

 

Understanding the basis of GLM Regression (Logistic, Gaussian, Gamma, etc)

I'm now working with several ISVs building their solutions with machine learning algorithms (using R, Python, Azure Machine Learning, etc.).
More advanced and realistic regression techniques (like GLMM, Bayesian modeling with MCMC simulation, etc.) are often needed in practical modeling, but it's very important to first understand the basic modeling ideas of GLM (generalized linear models), because the advanced techniques are also based on these basics.

In this post I summarize the essence of these basic modeling ideas with simple R scripts. Please use them to build your intuition before getting started.

Note : We don't discuss extensions such as mixed models, quasi-binomial, quasi-Poisson, or negative binomial in this post. (We discuss only the basic families.)

Common Idea for Regression (GLM)

All the GLM families (Gaussian, Poisson, etc.) are based on the following common idea.

  • We choose an appropriate mathematical model (Gamma, Gaussian, etc.) depending on what you need.
    For example, when you want to predict the human height (response variable) of some school children from weight and age (explanatory variables), it might be better to choose Gaussian, because human height roughly follows a normal (Gaussian) distribution.
  • Every model family has parameters. Next we express these parameters with the explanatory variables.
    For example, the Gaussian (normal) distribution has the parameters μ (the mean) and σ (the standard deviation). If we suppose σ is some constant value for every observation, μ is the only parameter that depends on weight and age (the explanatory variables). We suppose μ = b0 + b1 x1 + b2 x2, where x1 is weight and x2 is age. (The expression differs from case to case; later I explain the link function, so please keep reading.)
  • Finally we estimate and determine the single set of b0, b1, and b2 from a lot of given data (weight, age, and human height).
    For example, we can get the approximated probability of each observed height (because μ is determined by b0, b1, b2, x1, and x2), and we estimate the b0, b1, and b2 which maximize the probability of the actual heights (observed responses) with mathematical techniques. (See the sketch just after this list.)

Now let's look at each regression family along with this common idea.

Logistic Regression (family = binomial)

Binomial regression with the logit link function is called "logistic regression". Here I show a brief outline.

Suppose each trial succeeds with probability q. Then it is known that y1 successes among N independent trials (i.e., y2 = N - y1 failures) occur with the following probability. (This distribution is called the binomial distribution.)

P(y1, y2) = C(N, y1) q^y1 (1 - q)^y2

where C(N, y1) is the binomial coefficient (the number of ways of choosing y1 out of N).

This is a function of y1 and y2. But when y1 and y2 are given (observed) values, it becomes a function of the unknown parameter q:

L(q) = C(N, y1) q^y1 (1 - q)^y2, where y1 and y2 are given values.

When a lot of sets of (y1, y2) are given, we can define the total probability by multiplication, L(q) = Π_i C(N_i, y1_i) q^y1_i (1 - q)^y2_i, because the multiplication of probabilities means the logical product (joint occurrence) of independent observations.

Let's consider the following case: we have N sample small fish in a fish tank, and we survey the survival rate depending on the amount of food and the amount of oxygen, as follows.

Alive(y1) Dead(y2) Food(x1) Oxygen(x2)
39 11 300 2000
30 20 150 1200
12 38 100 1000
...

In the previous function we supposed that the parameter q is always the same. But in this case the parameter q (the probability of survival) depends on the food (x1) and oxygen (x2). (This dependence on x1 and x2 is the linear predictor in the previous illustration.)

Then the model is written as follows. As you can see, this is a function of the parameters b0, b1, b2 for the given data (y1, y2, x1, x2).

That is, we should find the parameters b0, b1, b2 which maximize this likelihood L.

In logistic regression, the following sigmoid function is often used to express q with the linear predictor. This is called the logistic function (strictly speaking, its inverse is the link function), and I describe the link function later. (Please keep reading.)

q = 1 / (1 + exp(-(b0 + b1 x1 + b2 x2)))

As a result, the estimation function (likelihood) of logistic regression is obtained by plugging this q into the binomial likelihood above.

Let's see the following trivial example with R.
As you can see, we simulate with the values b0 = -15, b1 = 4, and b2 = 2, and the estimated results are -14.95783, 3.98255, and 1.99622. (See the following output.)

library(e1071) # for sigmoid func

#####
# create sample data with rbinom
#####
# y1 : number of alive
# y2 : number of dead
# x1 : food
# x2 : oxygen
#####
n_sample <- 1000
total <- 50
b0 <- -15
b1 <- 4
b2 <- 2
x1 <- runif(
  n = n_sample,
  min = 0,
  max = 5)
x2 <- runif(
  n = n_sample,
  min = 0,
  max = 5)
probs <- sigmoid(b0 + b1 * x1 + b2 * x2)
y1 <- rbinom(
  n = n_sample,
  size = total,
  prob = probs)
y2 <- total - y1
sampledat <- data.frame(
  x1 = x1,
  x2 = x2,
  y1 = y1,
  y2 = y2
)

#####
# Estimate
#####
fit <- glm(
  cbind(y1, y2) ~ x1 + x2,
  data = sampledat,
  family = binomial(link = "logit"))
summary(fit)

Note : The default link function for binomial is "logit", so you don't need to specify link = "logit" explicitly. (Here I specify it explicitly for clarity.)

Let's see another example.
The following dataset is similar to the previous one (sampledat), but each row has a z column which identifies "Dead" (0) or "Alive" (1) for each individual fish, instead of the aggregated counts y1 and y2.

sampledatA

Survival(z) Food(x1) Oxygen(x2)
1 300 2000
0 100 1000
1 300 2000
0 100 1000
0 300 2000
1 300 2000
1 100 1000
...

This dataset can be rewritten in the following aggregated format, which is the same as the previous one, so you could also analyze it using the previous code.

sampledatB

Alive(y1) Dead(y2) Food(x1) Oxygen(x2)
3 1 300 2000
1 2 100 1000
...

Using R, however, you don't need to convert the dataset; you can just write the following, using the z (0 or 1) column directly.

## with "Survival" column z (0 or 1)
fit2 <- glm(
  z ~ x1 + x2,
  data = sampledatA,
  family = binomial(link = "logit"))
summary(fit2)

If you're using the Microsoft R runtime, you can use rxLogit() in RevoScaleR or rxFastLinear() in MicrosoftML for scalable, high-performance execution as follows.
I don't describe the details here, but with rxLogit() you can also use intermediate response values between 0 and 1 (0.7, 0.25, etc.), and moreover you can use the Frisch–Waugh–Lovell theorem (see Wikipedia) for categorical data.

fit3 <- rxLogit(
  z ~ x1 + x2,
  data = sampledatA)
summary(fit3)

Poisson Regression (family = poisson)

Suppose some event occurs λ times on average in some time interval. Then the probability P(k) of the same event occurring k times in that interval is known to be P(k) = λ^k e^(-λ) / k!. (This is called the Poisson distribution.)

Let's consider the following example: we plant some seeds in a garden, and we survey how many seeds germinate depending on the amount of water and fertilizer.
In this model, we consider that λ (the average number of germinating seeds) depends on the water (x1) and fertilizer (x2). That is, the parameter λ is tied to the linear predictor.

In Poisson regression, the following function is often used for λ. The inverse of this function (i.e., log λ = b0 + b1 x1 + b2 x2) is called the log link function, and I describe the details later. (Please keep reading.)

λ = exp(b0 + b1 x1 + b2 x2)

Suppose a lot of data for k, x1, and x2 are given as follows.

Germinating(k) Water(x1) Fertilizer(x2)
11 300 2000
8 150 1200
5 100 1000
...

Then we can define the total probability by multiplication as follows. (See above for the reason for the multiplication.)
We should find the parameters b0, b1, b2 which maximize this likelihood, by maximum likelihood estimation.

Note : In most cases the log-likelihood (log L) is used for the estimation, instead of L itself.

Let's see the following trivial example with R.
As you can see, we simulate with the values b0 = 0.2, b1 = 0.5, and b2 = 0.4, and the estimated results are 0.215383, 0.498815, and 0.402523.

#####
# create sample data with rpois
#####
# y : germinating
# x1 : water
# x2 : fertilizer
#####
n_sample <- 1000
b0 <- 0.2
b1 <- 0.5
b2 <- 0.4
x1 <- runif(
  n = n_sample,
  min = 0,
  max = 5)
x2 <- runif(
  n = n_sample,
  min = 0,
  max = 5)
lamdas <- exp(b0 + b1 * x1 + b2 * x2)
y <- rpois(
  n = n_sample,
  lambda = lamdas)
sampledat <- data.frame(
  x1 = x1,
  x2 = x2,
  y = y
)

#####
# Estimate
#####
fit <- glm(
  y ~ x1 + x2,
  data = sampledat,
  family = poisson(link = "log"))
summary(fit)

Note : The default link function for poisson is "log", so you don't need to specify link = "log" explicitly.

If you're using the Microsoft R runtime, you can use rxGlm() in RevoScaleR for scalable execution.

fit2 <- rxGlm(
  y ~ x1 + x2,
  data = sampledat,
  family = poisson(link = "log"))
summary(fit2)

Linear Regression (family = gaussian)

This regression is straightforward. It is represented by the following linear expression:

y = b0 + b1 x1 + ε, where ε is Gaussian noise.

In the previous binomial and Poisson regressions, maximum likelihood estimation (MLE) was used for the parameter estimation.
But here we can use ordinary least squares (OLS) as follows. (The following assumes that only x1 is used as the explanatory variable.) We take the difference between each actual value and the predicted value; since the difference can be positive or negative, we square each difference.

We should find the parameters b0 and b1 which minimize the following E:

E = Σ_i ( y_i - (b0 + b1 x1_i) )^2

Why is it called "Gaussian"?
Consider the normal distribution (Gaussian distribution) N(μ, σ^2). Suppose the standard deviation σ (the square root of the variance) is some constant value for every observation; then the probability depends only on the parameter μ (the mean).
Now consider maximum likelihood estimation (MLE) for this Gaussian distribution. Suppose μ_i = b0 + b1 x1_i, and suppose each observation y_i lies within some small range of width Δ around its observed value (Δ is a fixed value). If Δ is very small, the product of the likelihoods is approximately the product of the Gaussian densities at y_i times Δ for each observation. (See above for the reason for the multiplication.)

Now we apply the logarithm to this likelihood (the log-likelihood); it becomes a sum of terms of the form -(y_i - μ_i)^2 / (2 σ^2) plus constants.

We get the maximum of this log-likelihood when we minimize Σ_i (y_i - μ_i)^2 (because Δ and σ are constants).
That is, maximum likelihood estimation (MLE) for the Gaussian distribution results in ordinary least squares (OLS) for linear regression.

For example, it is known that human height roughly follows N(μ, σ^2) with some fixed σ. Suppose the average height μ depends on human weight (x1) and that the relation is linear: μ = b0 + b1 x1.
In this case, you can get the regression result with the OLS approach.

Let's see the following trivial example with R.
As you can see, we simulate with the values b0 = 44 and b1 = 0.14, and the estimated results are 44.085248 and 0.139133.

#####
# create sample data
#####
# y : height (inch)
# x1 : weight (pound)
#####
n_sample <- 1000
b0 <- 44
b1 <- 0.14
x1 <- runif(
  n = n_sample,
  min = 45,
  max = 180)
epsilon <- rnorm(
  n = n_sample,
  mean = 0,
  sd = 1.57)  # noise (variance)
y <- b0 + b1 * x1 + epsilon
sampledat <- data.frame(
  x1 = x1,
  y = y
)

#####
# Estimate
#####
fit <- glm(
  y ~ x1,
  data = sampledat,
  family = gaussian(link = "identity"))
summary(fit)

Note : link = "identity" means the identity link function, i.e., μ itself equals the linear predictor and no transformation is applied. (The default link function for gaussian is "identity", so you don't need to specify link = "identity" explicitly.)

If you're using the Microsoft R runtime, you can use rxLinMod() in RevoScaleR or rxFastLinear() in MicrosoftML for scalable, high-performance execution.

fit2 <- rxLinMod(
  y ~ x1,
  data = sampledat)
summary(fit2)

Gamma Regression (family = Gamma)

Suppose some event occurs r times on average in a unit (1) interval. Then the probability density function f(t) of the time t until the same event has occurred s times is known to be the following:

f(t) = r^s t^(s-1) e^(-r t) / Γ(s)

where Γ is the Gamma function. (Γ(s) = (s-1)! for any positive integer s, and it is an extension to real and complex numbers. See "Wikipedia : Gamma function".)

Note : Keep in mind that this is not a probability itself but a probability density function. When you estimate with MLE, you must use an integral, or take the probability over some fixed small range, as I described above (see the Gaussian regression).

This regression is often used for analyzing waiting times, time between failures, accident rates, and so on.
There are two parameters, s (called the "shape") and r (called the "rate"), but we suppose s is the same (constant) for every observation in this regression.

For example, let's consider the time to failure (breakdown) of some equipment, and suppose this depends on the number of initial bugs (x1) and the number of claims by customers (x2). Suppose 1 failure (breakdown) is caused by s = 10 errors. When the number of initial bugs increases, the errors in production also increase. That is, when x1 (the number of initial bugs) increases, s (the count of errors per failure) stays constant, but r (the count of errors per unit interval) increases. Consequently the average time to failure decreases.

The quantity tied to the linear predictor in Gamma regression is the average time μ = s / r, which is proportional to 1/r (1/r is called the "scale"). Then the following inverse link function is often used for Gamma regression. (I describe the details about the link function later.)

1 / μ = b0 + b1 x1 + b2 x2

Note : I don't explain the details here, but s (the shape) can also be estimated. (The variance of the Gamma distribution is known to be s / r^2.)

Let's see the following trivial example with R.
As you can see, we simulate with the values b0 = 0.5, b1 = 0.02, and b2 = 0.01, and the estimated results are 0.504847, 0.021330, and 0.008929.

#####
# create sample data with rgamma
#####
# y : time to failure
# x1 : number of initial bugs
# x2 : number of customer claims
#####
n_sample <- 1000
shape <- 10
b0 <- 0.5
b1 <- 0.02
b2 <- 0.01
x1 <- runif(
  n = n_sample,
  min = 1,
  max = 5)
x2 <- runif(
  n = n_sample,
  min = 11,
  max = 15)
rate <- (b0 + b1 * x1 + b2 * x2) * shape
y <- rgamma(
  n = n_sample,
  shape = shape,
  rate = rate)
sampledat <- data.frame(
  x1 = x1,
  x2 = x2,
  y = y
)

#####
# Estimate
#####
fit <- glm(
  y ~ x1 + x2,
  data = sampledat,
  family = Gamma(link = "inverse"))
summary(fit)

When you want to get the shape (s), you can use gamma.shape() in the MASS package as follows. (Then you can reproduce the original model as a Gamma distribution.)

library(MASS)
...

# you can get shape as follows.
test <- gamma.shape(fit, verbose = TRUE)

If you're using the Microsoft R runtime, you can use rxGlm() in RevoScaleR for scalable execution.

fit2 <- rxGlm(
  y ~ x1 + x2,
  data = sampledat,
  family = Gamma(link = "inverse"))
summary(fit2)

About Link Functions

Now it's time to explain the idea of the link function.
In this post I focus on the link functions used in the previous examples.

Log Link Function

Remember the previous Poisson regression example.
Suppose the linear predictor directly gives λ (the average number of germinating seeds) as follows:

λ = b0 + b1 x1 + b2 x2    (1)

Suppose the seeds germinate 1.5 times as much with enough water and 1.2 times as much with enough fertilizer. When you give both enough water and enough fertilizer, would the seeds germinate 1.5 + 1.2 = 2.7 times as much?
Of course not; the natural estimate would be 1.5 * 1.2 = 1.8 times as much.
Thus (1) does not represent the actual model.

Let's compare with the following model (2). This is called the log link, because the linear predictor b0 + b1 x1 + b2 x2 equals log λ.

λ = exp(b0 + b1 x1 + b2 x2)    (2)

When we rewrite exp(b0 + b1 x1 + b2 x2) as exp(b0) * exp(b1 x1) * exp(b2 x2), you can see that (1) grows by addition, while (2) grows by multiplication. (With model (1), λ could even become negative when x1 or x2 decreases, but that cannot happen with model (2).)

There is no absolute answer (it is case by case), but you must understand the different properties of these two models and choose the link function that matches your actual modeling.
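As a small check, here is a sketch showing that with the log link each coefficient acts as a multiplicative factor on the prediction. It assumes fit is the Poisson model from the germination example earlier (re-run that block if needed); the p00 … p11 names are hypothetical.

# exp() of a coefficient = multiplicative factor per +1 of that input (log link)
exp(coef(fit))

# predictions multiply rather than add when both inputs increase
p00 <- predict(fit, newdata = data.frame(x1 = 0, x2 = 0), type = "response")
p10 <- predict(fit, newdata = data.frame(x1 = 1, x2 = 0), type = "response")
p01 <- predict(fit, newdata = data.frame(x1 = 0, x2 = 1), type = "response")
p11 <- predict(fit, newdata = data.frame(x1 = 1, x2 = 1), type = "response")
c(p11 * p00, p10 * p01)   # equal : the effects combine multiplicatively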

Logit Link Function

The logit link is used for representing values which are 0 or 1 (or in between 0 and 1). Unlike the log link, it doesn't exceed 1 for any explanatory variables. This idea is also used in neural networks and other recognition algorithms.

Remember the previous logistic regression example. Now consider the following logical model:

  • Each input x1 (food), x2 (oxygen) has a weight w1, w2 respectively.
  • Get the total weight a = w1 x1 + w2 x2.
  • Compare it with some fixed threshold b. If it exceeds b, the result is 1; otherwise the result is 0.

This model captures an aspect of the actual decision mechanism, but one important weakness is that it is not continuous (it's discrete).
If we plot the result for some fixed w1, w2, and b, we get a step function as follows. Because it's not continuous, we cannot apply the usual approximation (optimization) approaches to it.

Therefore we introduce the following function, called the "sigmoid":

σ(a) = 1 / (1 + exp(-a))

As you can see below, the sigmoid is very similar to the previous step function, but it's continuous.

Now we set a = b0 + b1 x1 + b2 x2, and we get the probability q (the probability of being alive) as follows:

q = 1 / (1 + exp(-(b0 + b1 x1 + b2 x2)))

The logit link function is the inverse of this sigmoid:

log( q / (1 - q) ) = b0 + b1 x1 + b2 x2

Note : As you can see, 1 - q represents the probability of being dead. The ratio q / (1 - q) is called the odds.
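Before moving on, here is a minimal R sketch (purely illustrative; the threshold and scale are arbitrary) that plots the step function against the sigmoid and shows that the logit is its inverse.

# step (discrete) versus sigmoid (continuous)
a <- seq(-6, 6, by = 0.1)
step_fn <- ifelse(a > 0, 1, 0)
sig     <- 1 / (1 + exp(-a))

plot(a, step_fn, type = "s", ylab = "output", main = "step vs sigmoid")
lines(a, sig, col = "blue")

# logit is the inverse of the sigmoid
q <- 0.8
log(q / (1 - q))                  # logit(q)
1 / (1 + exp(-log(q / (1 - q))))  # back to q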

Inverse Link Function

 The link that sets 1/p equal to the linear predictor is called the inverse link function, where p is the predicted parameter.
Remember the previous Gamma regression example. The predicted parameter is the mean μ = s / r, where s is the shape and r is the rate, and s is constant. Therefore the inverse link has the same meaning as:

r = s * (b0 + b1 x1 + b2 x2)

where s is a constant.

As I explained in the Gamma regression, r (the rate) means "the occurrence count in a unit (1) interval". Then 1/r means "the average interval between occurrences", which is called the "scale".
If the linear predictor is proportional to the scale itself (i.e., inversely proportional to the rate), you can specify the identity link function (i.e., no specific link function) for Gamma regression as follows, instead of the inverse link function.

fit <- glm(
  y ~ x1 + x2,
  data = sampledat,
  family = Gamma(link = "identity"))

Offset

When you want to include some input variable whose coefficient is fixed to 1 (not estimated), you can use an offset.

For instance, let's introduce a new input variable, the "observation count" (x3), into the previous Poisson regression example.
Then the following linear predictor will be used:

log λ = b0 + b1 x1 + b2 x2 + log x3

In such a case, you can use offset as follows.

fit <- glm(
  y ~ x1 + x2,
  data = sampledat,
  family = poisson(link = "log"),
  offset = log(x3))

You might think you could just divide the response y by x3 instead of using an offset, but by using an offset you can analyze the data without distorting the variance or other statistical properties.

 

Understanding ARIMA time series analysis with R (part 2)

Part1 : Discuss under the stationarity – ARMA
Part2 : Non-stationarity and Seasonality – ARIMA and Seasonal ARIMA  (<- you are here)

Non-stationary time series

In my previous post, I explained several ideas of AR, MA, ARMA under the condition of stationarity for the beginning of your understanding.

But as you know, almost all real practical time series (stock prices, sales revenue, etc.) are non-stationary, or a mixture of stationary and non-stationary behavior. In order to understand non-stationary models, it's important to understand the ideas of stationarity first, because the idea of the non-stationary model is based on the stationary one.

The idea of the non-stationary model is: consider the series of differences Δy_t = y_t - y_{t-1}, and examine the stationarity of this differenced series.
If the difference series is given as ARMA(p, q) (i.e., a stationary model), we can recover the original y_t. This model is called the ARIMA (autoregressive integrated moving average) model, denoted ARIMA(p, 1, q).
In general, ARIMA(p, d, q) means: the series of d-th order differences is ARMA(p, q).

As I described in my previous post, every MA model is stationary, and therefore I focus on the AR part when considering non-stationary models.

As you can see below, the mathematical behavior of a non-stationary model is vastly different from a stationary one, even when the difference in the parameter φ1 is small. (See below. As I described in my previous post, AR(1) is stationary when |φ1| < 1.)
Unlike a stationary process, a non-stationary process generally diverges. Thus several mathematical approaches (ideas) have to change under non-stationarity.

Sample plot (stationary AR(1))

Sample plot (non-stationary AR(1))

Sample plot (non-stationary AR(1))

Identifying a model is also not easy.
Unlike the stationary case, we cannot use the t-distribution (which is often used in statistical estimation) under non-stationarity.

First we consider the following AR(p) expression (1). (See my previous post for ε_t.)

y_t = c + φ1 y_{t-1} + φ2 y_{t-2} + ... + φp y_{t-p} + ε_t    (1)

The following equation (2) is called the characteristic equation of (1):

1 - φ1 z - φ2 z^2 - ... - φp z^p = 0    (2)

This equation (2) implies:

  • If all roots have absolute value larger than 1, (1) is stationary.
  • If z = 1 is a root of multiplicity d (i.e., the characteristic polynomial has the factor (1 - z)^d) and all other roots have absolute value larger than 1, the series of d-th order differences of (1) is stationary.
    In particular, if there is only one unit root (d = 1), the series of differences Δy_t = y_t - y_{t-1} of (1) is stationary. This is called a unit root process.

Now let's consider the simple case of AR(1), which has the equation y_t = c + φ1 y_{t-1} + ε_t. In this case it's stationary if |φ1| < 1, and it's a unit root process if φ1 = 1.

If y_t is a stationary process (i.e., |φ1| < 1), we can identify the model using the approach described in the previous post. If y_t is a unit root process (i.e., φ1 = 1), we can also identify the model, because the differenced series Δy_t is stationary. If it is neither, you can repeat the same check for the differenced series.
Therefore, when real sample data is given, it's important to know whether it is a unit root process or a stationary process.

Under a unit root process we cannot use the t-distribution for estimation, because the process is not stationary. Instead we use the Dickey-Fuller test to decide whether the model is a unit root process.

Here (in this post) I don't describe the details of this test technique, but the outline is:

  1. Consider a continuous stochastic process built from the partial sums of the series, rescaled by √n and the standard deviation σ of ε, where n is the size of the given data. Intuitively, this maps the cumulative series onto the interval [0, 1], divided into spans of width 1/n.
  2. As n gets larger, this process converges in distribution to standard Brownian motion (the standard Wiener process) on [0, 1].
  3. Estimate and test the original unit root process based on this well-known distribution (standard Brownian motion) with sufficiently large n.

An extension of this test to AR(p) is called the Augmented Dickey–Fuller (ADF) test. (I don't explain how to extend the technique here.) You can use adf.test() (in the tseries package) in R for the ADF test.
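For example, the following R sketch (with simulated data; the tseries package must be installed) applies adf.test() to a random walk and to a stationary AR(1):

library(tseries)

set.seed(1)
rw <- cumsum(rnorm(200))                            # random walk : unit root process
st <- arima.sim(model = list(ar = 0.5), n = 200)    # stationary AR(1)

adf.test(rw)  # large p-value : cannot reject the unit root
adf.test(st)  # small p-value : the unit root is rejected (stationary)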

When you identify a model manually, you must follow the previous steps. But if you just want the result in R, you can analyze it easily: same as in the stationary case (see my previous post), you can simply use auto.arima() to identify the model.

library(forecast)

# read data
y <- read.table(
  "C:\\Demo\\revenue_sample.txt",
  col.names = c("DATETIME", "REVENUE"),
  header = TRUE,
  quote = "\"",
  fileEncoding = "UTF-8",
  stringsAsFactors = FALSE)

# analyze
test = auto.arima(y$REVENUE)
summary(test)

Here the "drift" (see the output above) is the trend of the differenced series Δy_t. Therefore the drift will be close to the average gradient of y_t. (The following is the plot of y_t, which clearly has a linear trend.)

Hence the simulated model of y_t is given by the estimated coefficients and drift shown above.

As I described above, you must take care about the difference between stationarity and non-stationarity, even when you're using R.
For example, when you predict (forecast) future values under non-stationarity, the predicted mean (conditional mean) will not converge to a constant mean (it keeps moving by the drift), and the predicted variance will grow more and more. (See below.)
Please compare the following forecast plot (non-stationary result) with the one in my previous post (stationary result).

...

# analyze
test = auto.arima(y$REVENUE)
summary(test)

# predict
pred <- forecast(
  test,
  level=c(0.85,0.95),
  h=50)
plot(pred)

I don't explain the details here, but you must also be careful when applying other machine learning algorithms (regression, anomaly detection, …), especially to non-stationary time series. (For example, see "Spurious relationship" in Wikipedia.)

Backshift notation

Before discussing the subsequent topics, let me introduce a convenient notation called the "backshift notation".

With this notation we denote y_{t-1} by B y_t. Of course, this doesn't mean multiplying y_t by some real number B; it's just notation!
In the same way, B^2 y_t means y_{t-2}.

Using this notation, we can easily handle time-series differences with arithmetic operations.
For example, (1 - B) y_t means the first difference of the sequence: (1 - B) y_t = y_t - y_{t-1}.

(1 - B)^2 y_t means the 2nd order difference: (1 - B)^2 y_t = y_t - 2 y_{t-1} + y_{t-2}.

By repeating this operation, you can get the d-th order difference as (1 - B)^d y_t.
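In R, the diff() function computes exactly these (1 - B)^d differences; here is a small sketch with an arbitrary series:

y <- c(1, 3, 6, 10, 15, 21)

diff(y)                    # (1 - B) y : first order difference
diff(y, differences = 2)   # (1 - B)^2 y : second order difference
y - c(NA, head(y, -1))     # the same first difference written by hand (leading NA)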

With this notation, ARIMA(1,1,0) is denoted by the following equation, where c is a constant:

(1 - φ1 B)(1 - B) y_t = c + ε_t

This notation clarifies the structure of the model, because (1 - φ1 B) means the AR(1) part and (1 - B) y_t means the difference Δy_t.
Thus, when you want to denote ARIMA(1,d,0), you can easily write the following representation:

(1 - φ1 B)(1 - B)^d y_t = c + ε_t

Now we rewrite the previous expression (1) with backshift notation:

(1 - φ1 B - φ2 B^2 - ... - φp B^p) y_t = c + ε_t

As you can see, the operator on the LHS (left-hand side) of the last expression has the same form as expression (2).
If (2) can be factored as Y(z) (1 - z)^d with all roots of Y(z) = 0 larger than 1 in absolute value, the series of d-th order differences of (1) is stationary, because the process can be written as follows (Y(B) gives the stationary ARMA part and (1 - B)^d is the d-th order difference):

Y(B) (1 - B)^d y_t = c + ε_t

As you can see, you can easily reach the previous conclusion with backshift notation.

Seasonal ARIMA modeling

With backshift notation, you can easily understand the background of seasonal ARIMA.
First we look at the basic concept using a simple AR(1) (without any differencing d).

Suppose the sales revenue is strongly affected by the revenue of the same month a year ago (12 months ago), and this relation is written by the following expression. (For simplicity we omit the constant c.)

(1 - Φ1 B^12) y_t = ε_t

Note : Φ1 denotes the seasonal counterpart of φ1 in AR. We often use an upper-case character for the seasonal part to distinguish it from the non-seasonal part.

Suppose the revenue is also affected by the revenue of a month ago.
Then the mixed model is represented as follows. This is denoted ARIMA(1,0,0)(1,0,0)[12].

(1 - φ1 B)(1 - Φ1 B^12) y_t = ε_t

What does that model mean? (Why multiply the two factors?)

When we transform this representation, we get the following:

(1 - φ1 B)(y_t - Φ1 y_{t-12}) = ε_t

As you can see, (y_t - Φ1 y_{t-12}) means month t's revenue with the effect of 12 months ago removed. Similarly, (y_{t-1} - Φ1 y_{t-13}) means month (t - 1)'s revenue with the effect of 12 months ago removed.
Therefore this expression means "an AR(1) model on the previous month, after eliminating the effect of 12 months ago".

It can also be written in the following form, and you can see that the converse interpretation holds as well:

(1 - Φ1 B^12)(y_t - φ1 y_{t-1}) = ε_t

Here I showed the case with no differencing, but when the model has a difference d, you just multiply by (1 - B) in backshift notation.
For example, ARIMA(1,1,0)(1,1,0)[12] is represented as follows:

(1 - φ1 B)(1 - Φ1 B^12)(1 - B)(1 - B^12) y_t = ε_t

  • (1 - φ1 B) is the AR(1) of the non-seasonal part.
  • (1 - Φ1 B^12) is the AR(1) of the seasonal part.
  • (1 - B) is the difference of the non-seasonal part.
  • (1 - B^12) is the difference of the seasonal part.

Note : If you have MA(1), you can write (1 + θ1 B) ε_t on the RHS. (In this post we don't discuss the MA part in seasonal ARIMA.)

When we expand the equation of ARIMA(1,0,0)(1,0,0)[12], you can see that it is the same as an ordinary AR model with lags 1, 12, and 13:

y_t = φ1 y_{t-1} + Φ1 y_{t-12} - φ1 Φ1 y_{t-13} + ε_t

Hence, if we didn't know the seasonality, we would have to estimate all possible coefficients up to lag 13. (Imagine the case of a seasonal lag of 365.)

The following are the ACF and PACF of such a series in R. (See my previous post about ACF and PACF; I don't explain their meaning here.)
As you can see in the following result, there are spikes at both the non-seasonal and seasonal lags. Thus ACF and PACF can also be used to identify a seasonal model. (A minimal R sketch follows the plots below.)

Autocorrelation (ACF) sample

Partial Autocorrelation (PACF) sample
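If you want to check this yourself, a minimal R sketch is as follows (assuming y$REVENUE is a monthly series such as the seasonal dataset used in the next example):

# ACF / PACF with enough lags to see the seasonal spikes
acf(y$REVENUE, lag.max = 36)
pacf(y$REVENUE, lag.max = 36)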

Let's see a brief example in R.
We start by identifying the model and forecasting with it. As you can see in the following result, we get an unsatisfactory result without seasonal estimation. (The result is ARIMA(0,1,1) with drift.)

library(forecast)

# read data
y <- read.table(
  "C:\Demo\seasonal_sample.txt",
  col.names = c("DATETIME", "REVENUE"),
  header = TRUE,
  fileEncoding="UTF-8",
  stringsAsFactors = FALSE)

# analyze
test <- auto.arima(y$REVENUE)
summary(test)

# predict and plot
pred <- forecast(
  test,
  level=c(0.85,0.95),
  h=100)
plot(pred)

Suppose we know that this dataset has a seasonality with lag 12 (for instance, 12 months).
Now we modify the previous R script to use this known seasonality as follows. (The changed part is the ts(..., freq = 12) wrapper in the auto.arima() call.)
As you can see below, we get the desired result with the seasonal ARIMA analysis.

library(forecast)

# read data
y <- read.table(
  "C:\Demo\seasonal_sample.txt",
  col.names = c("DATETIME", "REVENUE"),
  header = TRUE,
  fileEncoding="UTF-8",
  stringsAsFactors = FALSE)

# analyze with seasonal s=12
test <- auto.arima(ts(data=y$REVENUE, freq=12))
summary(test)

# predict and plot
pred <- forecast(
  test,
  level=c(0.85,0.95),
  h=100)
plot(pred)

Understanding ARIMA time series analysis with R (part 1)

Part 1 : Discuss under the stationarity – ARMA  (<- you are here)
Part2 : Non-stationarity and Seasonality – ARIMA and Seasonal ARIMA

In the beginning …

The time-series analysis is frequently needed in the practical data analysis, and the skill could help you solve many real problems for all developers.
For example, almost all marketing or social analysis will have some kind of seasonality. Even if you analyze your application’s access logs, it might have some “peak time” (trends of access frequency) or something.

Time-series analysis is based on a mathematical background. However, throughout this post I focus on explaining why things are needed, to build your intuition quickly. (For many of the descriptions I will write only the consequence, without proofs.) I also avoid difficult terminology as much as possible, but note that some basic mathematical knowledge is needed for this reading.

Before starting, there are a few things to note.
First, time-series analysis has several approaches, such as spectrum analysis (Fourier transform), exponential smoothing, etc. Here we discuss ARIMA, which is based on a mathematical model that represents many aspects of real time series. (The famous GARCH is also based on this ARIMA.) We use the "forecast" package for ARIMA analysis in R.
Second, there are so many things (a lot of background) to say, and I cannot write them all at once. Therefore, in this post I first focus on the fundamentals and basic ideas of ARIMA under the condition of a stationary process, and in the next post we will discuss more practical topics including non-stationarity.

Note : In this post, we don't care about the time index, like "day of week", "quarter", etc. (We just use 1, 2, 3, … as the time axis.) If you want to handle these time signatures, the timekit package can help you analyze time series.

ARMA Modeling

As I mentioned above, here we assume the stationarity of the time-series data.
"Stationarity" means : the mean (expected value) does not depend on time and is constant, and the autocovariance depends only on the time difference, not on t. That is, a stationary process has the following restrictions : E[y_t] = μ (constant for all t) and Cov(y_t, y_{t-k}) = γ_k (depends only on the lag k). (One important notice is that real time-series data might be measured once and the values are then fixed. But here we consider a mathematical model to fit our time-series data, and therefore we suppose y_t is not a set of fixed values but a process on a statistical probability space.)

Note : Strictly speaking, the above definition is called a "weak-sense stationary process". This definition is used in practical statistical analysis more often than the "strongly stationary process". When we use the term "stationarity" through these posts, it means "weak-sense stationarity".
See "Stationary process (Wikipedia)".

Intuitively, it implies that the stochastic behavior of the values' motion is always the same at any time t. For a brief example, it is not a stationary process if the values increase more and more as time passes, because the mean is not constant and depends on t. (In this case, the differences might follow some steady model, and we discuss this case in the next post.)

Now we consider the mathematical model to fit our time-series data under the condition of stationarity. Let’s consider the stock price data. (The stock price would be non-stationary, but here we suppose it’s stationary.)

As you know, the stock price is influenced by yesterday's price. If the price (close price) of day t - 1 is y dollars, the price of day t starts from y dollars.
That is, we can assume that the price is determined by the following model, where c and φ1 are constants, and ε_t is randomly distributed noise called "white noise". (For this ε_t, the mean is 0 and the autocovariance is 0 for any non-zero lag. It's okay to think of ε_t as normally distributed noise, N(0, σ^2), to make things easy.)

y_t = c + φ1 y_{t-1} + ε_t

Note : If φ1 is a negative value, y_t and y_{t-1} have a negative correlation. (When y_{t-1} is large, y_t tends to be small.) In that case the graph will be very jagged.

This model is called AR (Autoregressive), and AR(p) is given by the following definition in general. (Therefore the above stock model is AR(1).)

y_t = c + φ1 y_{t-1} + φ2 y_{t-2} + ... + φp y_{t-p} + ε_t

The AR model can represent many aspects of cyclic behavior under stationarity.
For example, the following R program plots an AR(2) model with φ1 = 1.8 and φ2 = -0.9. As you can see, it draws a beautiful line with cycles.

data_count <- 300
cons <- 70
y1 <- 10
phi1 <- 1.8
phi2 <- -0.9
e <- rnorm(data_count) * 10
y2 <- cons + (phi1 * y1) + e[2]
y <- data.frame(
  Indexes=c(1, 2),
  Values=c(y1, y2))
for (i in 3:data_count) {
  yi <- cons + (phi1 * y$Values[i-1]) + (phi2 * y$Values[i-2]) + e[i]
  y <- rbind(
    y,
    data.frame(Indexes=c(i), Values=c(yi)))
}
plot(y)

Note : You can simply use arima.sim() (in the "stats" package, built into R) for ARIMA (including ARMA) simulation. But here we don't use this function, for your understanding.
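
For reference, a minimal sketch of the same AR(2) simulation with arima.sim() might look like this (arima.sim() generates a zero-mean series, so the model mean cons / (1 - phi1 - phi2) = 70 / 0.1 = 700 is added back) :

# equivalent simulation with arima.sim() from the "stats" package
set.seed(1)
y_sim <- 700 + arima.sim(model = list(ar = c(1.8, -0.9)), n = 300, sd = 10)
plot(y_sim)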

Remember that not every AR expression is a stationary process; an AR model is stationary only if certain conditions hold.
AR(1) is stationary if and only if |φ1| < 1. AR(2) is stationary if and only if φ1 + φ2 < 1, φ2 - φ1 < 1, and |φ2| < 1.
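
The following small sketch (my own example) shows what happens when the AR(1) condition is violated : with φ1 = 0.9 the series fluctuates around a constant level, while with φ1 = 1.05 it explodes.

# compare a stationary AR(1) (|phi1| < 1) with an explosive one (|phi1| > 1)
set.seed(1)
n <- 200
e <- rnorm(n)
y_stat <- numeric(n)
y_expl <- numeric(n)
y_stat[1] <- e[1]
y_expl[1] <- e[1]
for (i in 2:n) {
  y_stat[i] <- (0.9 * y_stat[i-1]) + e[i]    # stationary
  y_expl[i] <- (1.05 * y_expl[i-1]) + e[i]   # non-stationary (explodes)
}
par(mfrow = c(1, 2))
plot(y_stat, type = "l", main = "phi1 = 0.9")
plot(y_expl, type = "l", main = "phi1 = 1.05")
par(mfrow = c(1, 1))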

Now let’s consider some other model.
For stock price, it might also be affected by the difference between open price and close price on the day before. If there was a rapid decline in stock price on the day t, the price might rebound on the day t + 1.
In such a case, we can use the following model, called MA (moving average).

y_t = μ + ε_t + θ1 ε_{t-1}

The part μ + ε_t (μ is the mean) plays the same role as in AR, but MA depends on the previous noise ε_{t-1} instead of the previous value y_{t-1}.

As with AR, the general MA(q) expression is given as follows.

y_t = μ + ε_t + θ1 ε_{t-1} + θ2 ε_{t-2} + ... + θq ε_{t-q}

It is known that MA(q) is always a stationary process.

Let's see how the graph changes with the θ values.
As you can see below, the series transitions smoothly when the θ's are positive, and it becomes jagged when they are negative.

Normal Distribution (No MA, θ1 = θ2 = θ3 = 0)

MA(3) with θ1 = θ2 = θ3 = 1

MA(2) with θ1 = -1, θ2 = 1

data_count <- 300
mu <- 70
# sample1
#theta1 <- 0
#theta2 <- 0
#theta3 <- 0
# sample2
theta1 <- 1
theta2 <- 1
theta3 <- 1
# sample3
#theta1 <- -1
#theta2 <- 1
#theta3 <- 0
e <- rnorm(data_count) * 10
y1 <- mu + e[1]
y2 <- mu + e[2] + (theta1 * e[1])
y3 <- mu + e[3] + (theta1 * e[2]) + (theta2 * e[1])
y <- data.frame(
  Indexes=c(1, 2, 3),
  Values=c(y1, y2, y3))
for (i in 4:data_count) {
  yi <- mu + e[i] + (theta1 * e[i-1]) + (theta2 * e[i-2]) + (theta3 * e[i-3])
  y <- rbind(
    y,
    data.frame(Indexes=c(i), Values=c(yi)))
}
plot(y, type="l")

Finally, the combined model is called the ARMA model, and it's given as follows.
As you can see below, the former part is AR(p) and the latter is MA(q).

y_t = c + φ1 y_{t-1} + ... + φp y_{t-p} + ε_t + θ1 ε_{t-1} + ... + θq ε_{t-q}
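
For example, the following is a minimal sketch (my own toy parameters, not from the original post) of an ARMA(1, 1) simulation, y_t = c + φ1 y_{t-1} + ε_t + θ1 ε_{t-1} :

# simulate ARMA(1, 1) with phi1 = 0.7 and theta1 = 0.5
data_count <- 300
cons <- 70
phi1 <- 0.7
theta1 <- 0.5
e <- rnorm(data_count) * 10
y <- numeric(data_count)
y[1] <- cons + e[1]
for (i in 2:data_count) {
  y[i] <- cons + (phi1 * y[i-1]) + e[i] + (theta1 * e[i-1])
}
plot(y, type = "l")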

Estimate – Choose your model and parameters

Now we consider the prediction of time-series data. First we select (identify) the appropriate model and parameters (p, q, φ, θ, etc).

When you select the model (AR, MA, ARMA) and the orders (p, q), you must focus on the characteristics of each model. Each model has different characteristics for the autocorrelation (ACF) and partial autocorrelation (PACF).

The autocorrelation (Corr) is the autocovariance divided by the variance, as follows. By dividing by the variance, it is not affected by the scale (unit) of the values.

Corr(k) = Cov(y_t, y_{t-k}) / Var(y_t)

As you know, Corr depends only on the lag k (not on t), because Cov depends only on k. That is, the autocorrelation means the effect (correlation) of the lag k.

On the other hand, the partial autocorrelation is the effect of the lag k after eliminating the effects of lags 1, 2, …, k-1. It is given by solving linear equations for the covariances. (It needs matrix calculation, and we don't describe the definition of the partial autocorrelation in this post.)
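
As a small check (my own sketch), you can compute the lag-k autocorrelation by hand as the sample autocovariance divided by the sample variance, and compare it with acf() :

# manual lag-k autocorrelation vs. acf()
set.seed(1)
x <- arima.sim(model = list(ar = 0.7), n = 5000)
k <- 3
n <- length(x)
m <- mean(x)
cov_k <- sum((x[1:(n-k)] - m) * (x[(k+1):n] - m)) / n   # sample autocovariance at lag k
var_0 <- sum((x - m)^2) / n                             # sample variance (lag 0)
cov_k / var_0                                  # manual autocorrelation
acf(x, lag.max = k, plot = FALSE)$acf[k + 1]   # acf() result (index 1 is lag 0)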

AR(p), MA(q), and ARMA(p, q) each have the following characteristics for the autocorrelation and partial autocorrelation.
For example, if the partial autocorrelation of the samples equals 0 for lags >= 3, there is a high probability that the model is AR(2).

  • AR(p) : As the lag gets large, the autocorrelation decreases exponentially (but stays non-0). For lags larger than p, the partial autocorrelation equals 0.
  • MA(q) : For lags larger than q, the autocorrelation equals 0. As the lag gets large, the partial autocorrelation decreases exponentially (but stays non-0).
  • ARMA(p, q) : As the lag gets large, both the autocorrelation and the partial autocorrelation decrease exponentially (but stay non-0).
              autocorrelation (ACF)      partial autocorrelation (PACF)
AR(p)         decreases exponentially    0 for lags >= p + 1
MA(q)         0 for lags >= q + 1        decreases exponentially
ARMA(p, q)    decreases exponentially    decreases exponentially

Note : You may wonder… why is the partial autocorrelation of MA not 0, even though Corr is 0 for lags >= q + 1 ?
The reason lies in the difference between the definitions of the autocorrelation and the partial autocorrelation, and it's not so important here.

The following is the autocorrelation and partial autocorrelation plotting sample with R. (The x axis is the lag, and the y axis is the correlation value.)

library(forecast)

data_count <- 10000
cons <- 70
y1 <- 10
phi1 <- 1.8
phi2 <- -0.9
e <- rnorm(data_count) * 10
y2 <- cons + (phi1 * y1) + e[2]
y <- data.frame(
  Indexes=c(1, 2),
  Values=c(y1, y2))
for (i in 3:data_count) {
  yi <- cons + (phi1 * y$Values[i-1]) + (phi2 * y$Values[i-2]) + e[i]
  y <- rbind(
    y,
    data.frame(Indexes=c(i), Values=c(yi)))
}

# Autocorrelation plotting
acf(
  y$Values,
  lag.max = 80,
  type = c("correlation"),
  plot = T)
# Partial Autocorrelation plotting
pacf(
  y$Values,
  lag.max = 80,
  plot = T)

Autocorrelation sample of AR(2)

Partial Autocorrelation sample of AR(2)

data_count <- 10000
mu <- 70
theta1 <- 1
theta2 <- -1
e <- rnorm(data_count) * 10
y1 <- mu + e[1]
y2 <- mu + e[2] + (theta1 * e[1])
y <- data.frame(
  Indexes=c(1, 2),
  Values=c(y1, y2))
for (i in 3:data_count) {
  yi <- mu + e[i] + (theta1 * e[i-1]) + (theta2 * e[i-2])
  y <- rbind(
    y,
    data.frame(Indexes=c(i), Values=c(yi)))
}
# Autocorrelation plotting
acf(
  y$Values,
  lag.max = 80,
  type = c("correlation"),
  plot = T)
# Partial Autocorrelation plotting
pacf(
  y$Values,
  lag.max = 80,
  plot = T)

Autocorrelation sample of MA(2)

Partial Autocorrelation sample of MA(2)

Note : As you can see, it can be hard to distinguish whether the values equal 0 or decrease exponentially, because they are estimated from a finite sample and are only approximate.
In real estimation, AIC or BIC (information criteria) is often used to validate the choice (the number of parameters) of the model.

For the estimation of the parameter values, you can use estimation techniques like the method of moments, least squares, or maximum likelihood. For example, using the given samples y_1, …, y_n, you can determine the parameter values (c, φ, θ, σ^2, etc) which maximize the likelihood of observing y_1, …, y_n. (We assume ε_t follows a normal distribution.)

After you determine the model and parameters, you can check the fit against the sample (the given real data) with the usual techniques of statistical inference.
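
For example, once the order is chosen, a minimal sketch of maximum-likelihood estimation with a fixed order (using Arima() in the "forecast" package) could look like this :

library(forecast)

# fit AR(2) by maximum likelihood with a fixed order (2, 0, 0)
set.seed(1)
x <- arima.sim(model = list(ar = c(1.8, -0.9)), n = 1000)
fit <- Arima(x, order = c(2, 0, 0), method = "ML")
summary(fit)   # the estimated ar1 / ar2 should be close to 1.8 and -0.9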

With R, you can analyze this without any difficulty using auto.arima() in the "forecast" package. The following is a brief example, in which we create data with AR(2) (φ1 = 1.8, φ2 = -0.9) and analyze it with auto.arima().
As you can see below, we get a good result, an ARIMA(2,0,0) model with coefficients close to the true ones. (Here ARIMA(p, 0, q) means ARMA(p, q).)

library(forecast)

# create time-series data with ARMA(2, 0)
data_count <- 1000
cons <- 70
y1 <- 10
phi1 <- 1.8
phi2 <- -0.9
e <- rnorm(data_count) * 10
y2 <- cons + (phi1 * y1) + e[2]
y <- data.frame(
  Indexes=c(1, 2),
  Values=c(y1, y2))
for (i in 3:data_count) {
  yi <- cons + (phi1 * y$Values[i-1]) + (phi2 * y$Values[i-2]) + e[i]
  y <- rbind(
    y,
    data.frame(Indexes=c(i), Values=c(yi)))
}

# analyze
test <- auto.arima(y$Values, seasonal = F)
summary(test)

Let's consider another example with R.
The following data follows sin() (with some noise). It would be easy to analyze with a spectral technique, but here we approximate it with a similar stationary model using auto.arima().

library(forecast)

data_count <- 200
y <- data.frame(
  Indexes=c(1:data_count),
  Noise=rnorm(data_count))
y$Values <-  150 + (100 * sin(pi * y$Indexes / 10)) + (5 * y$Noise)

plot(x = y$Indexes, y = y$Values)
test <- auto.arima(y$Values, seasonal = F)
summary(test)

The following is the ARMA simulation with the estimated parameters. (As I described above, you can use arima.sim() for simulation, but here we don't use this function, for your understanding.)
As you can see below, this model seems to reproduce the cycle of the given (real) data closely.

library(forecast)

data_count <- 200
phi1 <- 1.1690
phi2 <- 0.3636
phi3 <- -0.6958
c <- 149.9833 * (1 - phi1 - phi2 - phi3)   # intercept = mean * (1 - phi1 - phi2 - phi3)
e <- rnorm(data_count, sd = sqrt(66.94))
y1 <- c + (phi1 * c) + e[1]
y2 <- c + (phi1 * y1) + e[2]
y3 <- c + (phi1 * y2) + (phi2 * y1) + e[3]
y <- data.frame(
  Indexes=c(1, 2, 3),
  Values=c(y1, y2, y3))
for (i in 4:data_count) {
  yi <- c + (phi1 * y$Values[i-1]) + (phi2 * y$Values[i-2]) + (phi3 * y$Values[i-3]) + e[i]
  y <- rbind(
    y,
    data.frame(Indexes=c(i), Values=c(yi)))
}
plot(y, type="l")

These are primitive examples, but now we change the original data as follows.
This data is affected much more by the white noise, compared with the sin() cycle.

y <- data.frame(
  Indexes=c(1:2000),
  Noise=rnorm(2000))
y$Values <-  20 + (10 * sin(y$Indexes / 30)) + (5 * y$Noise)

plot(x = y$Indexes, y = y$Values)
test <- auto.arima(y$Values, seasonal = F)
summary(test)

The following is the result of analyzing this data with auto.arima().

When you simulate with the estimated parameters, you can find differences (errors) between the original data and the simulated one.
The noise affects the accuracy of the estimation. (You must take the possibility of estimation errors into account in real practical analysis.)

library(forecast)

data_count <- 2000
phi1 <- 0.9865
theta1 <- -0.8565
theta2 <- -0.0053
theta3 <- 0.0187
theta4 <- -0.0188
theta5 <- 0.0864
c <- 20.4084 * abs(1 - phi1)
e <- rnorm(data_count, sd = sqrt(28.21))
y1 <- c + (phi1 * c) + e[1]
y2 <- c + (phi1 * y1) + e[2] + (theta1 * e[1])
y3 <- c + (phi1 * y2) + e[3] + (theta1 * e[2]) + (theta2 * e[1])
y4 <- c + (phi1 * y3) + e[4] + (theta1 * e[3]) + (theta2 * e[2]) + (theta3 * e[1])
y5 <- c + (phi1 * y4) + e[5] + (theta1 * e[4]) + (theta2 * e[3]) + (theta3 * e[2]) + (theta4 * e[1])
y <- data.frame(
  Indexes=c(1, 2, 3, 4, 5),
  Values=c(y1, y2, y3, y4, y5))
for (i in 6:data_count) {
  yi <- c + (phi1 * y$Values[i-1]) + e[i] + (theta1 * e[i-1]) + (theta2 * e[i-2]) + (theta3 * e[i-3]) + (theta4 * e[i-4]) + (theta5 * e[i-5])
  y <- rbind(
    y,
    data.frame(Indexes=c(i), Values=c(yi)))
}
plot(y, type="l")

In these examples I used data with trivial sin() cycles for your understanding, but remember that our goal is to find the most plausible (maximum likelihood) model given the explanatory variables, not to reproduce a similar-looking graph.
Even when the model is the same (and correct), the simulated graph can be vastly different, because it includes stochastic values ε_t and these cause subsequent differences. The important thing is "modeling", not "drawing".

When you want to take known periodic cycles, trend, or noise into consideration in your analysis, you can mix in an approach that extracts the moving trend (for example, with the ma() function) and then analyze only the trend or the remainder. For example, the following original data (see the "data" section) has pairs of cycles with a small peak and a large peak repeatedly, and also has a big trend wave rising and falling. (Here we have decomposed it with the stl() function as follows.)
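
A minimal sketch of this decomposition approach (with my own simulated data, not the original dataset) might look like the following; stl() splits the series into seasonal, trend, and remainder parts, and you can then run auto.arima() on the remainder only :

library(forecast)

# decompose a noisy series with a known 12-period cycle and a trend
set.seed(1)
data_count <- 240
noise <- rnorm(data_count)
vals <- 150 + (0.3 * (1:data_count)) + (100 * sin(2 * pi * (1:data_count) / 12)) + (10 * noise)
y <- ts(vals, frequency = 12)   # stl() needs a ts object with a frequency

dec <- stl(y, s.window = "periodic")
plot(dec)

# analyze only the remainder part
test <- auto.arima(dec$time.series[, "remainder"])
summary(test)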

 

Predict the future

When the model and parameters are determined, you can predict the future values with the generated model.

The predicted value is called the "conditional expected value" or "conditional mean", because we predict the future value y_{n+h} with its probability distribution under the condition of the given values (the values in the past) y_1, …, y_n. (We assume that ε_t follows the normal distribution.)

In contrast, the prediction approach that shows an interval of possibility is called "interval forecast" : for example, showing the 95% interval.
This possibility zone (whose variance is the mean squared error (MSE)) gets larger as the time goes further into the future. With an interval forecast, you can see these trends visually.

In the computing process, the probability distribution is often not calculated algebraically; instead, a sufficiently large number of values following the normal distribution are generated and the distribution is approximated numerically (because of the computing cost). This kind of approach is often used in real computer systems. (See gradient descent, Newton's method, Monte Carlo methods, etc.)

The following is an R example of this prediction (forecast) with a primitive AR(2).
The line is the conditional expected value, the inner band is the interval forecast at the 85% level, and the outer band is the one at the 95% level.

library("forecast")

# create data with AR(2)
data_count <- 1000
cons <- 70
y1 <- 10
phi1 <- 1.8
phi2 <- -0.9
e <- rnorm(data_count) * 10
y2 <- cons + (phi1 * y1) + e[2]
y <- data.frame(
  Indexes=c(1, 2),
  Values=c(y1, y2))
for (i in 3:data_count) {
  yi <- cons + (phi1 * y$Values[i-1]) + (phi2 * y$Values[i-2]) + e[i]
  y <- rbind(
    y,
    data.frame(Indexes=c(i), Values=c(yi)))
}

# estimate the model
test <- auto.arima(y$Values, seasonal = F)
summary(test)


# plot with prediction
pred <- forecast(
  test,
  level=c(0.85,0.95),
  h=100)
plot(pred)

As you can see in the graph, the conditional expected value of AR approaches (converges to) the expected value (the mean) of the model itself. (And the MSE converges to the variance.)
Therefore it's meaningless to forecast the far future, because the forecast will just be the same value as the expected value (the mean) of the model.
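
Continuing the script above, you can check this convergence numerically (a small sketch; pred and y are the objects created above) :

# the far-future point forecast is close to the long-run (sample) mean
tail(pred$mean, 1)   # forecast at h = 100
mean(y$Values)       # sample mean of the stationary series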

Next we will discuss more advanced topics about time series analysis including non-stationarity.

Embed Power BI in ISV application (Walkthrough of New Embed Model)

In my old post I described the fundamentals of the old Power BI Embedded, which enables ISV developers to integrate Power BI reports into their applications.
This meets several needs for ISVs, including : integrating the existing authentication method of the ISV application, not needing a Power BI license for application users, data separation for each tenant (same report, but different data), etc.

As you know, the capacity-based Power BI Premium and the new Power BI Embedded (in Microsoft Azure) are now released. If you're not familiar with them, please see the following team blog announcement.

Power BI blog “Microsoft accelerates modern BI adoption with Power BI Premium” (May 3, 2017)

… With Power BI Premium we’re also advancing how Power BI content is embedded in apps created by customers, partners and the broad developer community. As part of the new offering we are converging Power BI Embedded with the Power BI service to deliver one API surface, a consistent set of capabilities and access to the latest features. Moving forward we encourage those interested in embedding Power BI in their apps to start with Power BI Desktop and move to deployment with Power BI Premium. Existing apps built on Power BI Embedded will continue to be supported. …

This new consistent embed experience unlocks Power BI for all ISVs, and all Power BI capabilities, such as DirectQuery, streaming, report editing, and more data sources, can be supported in ISV applications.
In this post I show you this new embed flow for ISV developers and the broader developer community.

Note : You can keep your existing old Power BI Workspace Collection resources in Microsoft Azure, but remember that you cannot create new ones. (Use the new Power BI embed flow when you create a new one.)

Before starting … (for New Power BI Embedded)

If you use the new Power BI Embedded in Microsoft Azure, you must first add a new "Power BI Embedded" resource in your Microsoft Azure subscription with an organization account. (You cannot use a Microsoft Account for creating the Power BI Embedded resource.)

After you've created it, go to the Power BI service (https://app.powerbi.com/) and log in with your organization account. As you can see below, you can now assign your Power BI Embedded capacity to your workspaces in the Power BI service.

Create pbix file with Power BI Desktop

Now let’s start !
First you define the data source and create the reports to be embedded with your Power BI Desktop. (There's no need for programming to build reports.)
This is used as a template of artifacts (dataset, reports, etc) for each customer.

In this post I don't explain the details of how to use Power BI Desktop; see the Power BI Desktop tutorial document (official document) for usage.

When you have finished, please save it to your local disk as a Power BI file (.pbix file).

Note : You can also download the sample of “Retail Analysis PBIX“.

Create workspace and assign capacity

Next you create the Power BI workspace (app workspace). Each workspace will provide the reports, data source, and other artifacts for each customer.
You will need a user that has a Pro license in order to create an app workspace within Power BI, and this task is done through the UI (Power BI service) or the rest api. (A Power BI Pro trial can be used for trial use.)

Note : The app workspace is equivalent to a "group". When you create an app workspace with the rest api, you add a group (see the sketch below).
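
For reference, the following is a sketch of that rest api call (the workspace name "myws01" is just an example) : it creates a new group (app workspace).

POST https://api.powerbi.com/v1.0/myorg/groups
Accept: application/json
Authorization: Bearer eyJ0eXAiOi...
Content-Type: application/json; charset=utf-8

{
  "name": "myws01"
}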

Log in to the Power BI service (Web UI), select the "Workspaces" menu, and push "Create app workspace". (Then the following form is displayed. Please input the name of the workspace and press "Save".)

When you create the app workspace for your real production application, please turn on "premium" (see above) and assign your resource capacity. Even if you're using Power BI Embedded capacity in Microsoft Azure, you can select this Power BI Embedded capacity in this dialog. (You don't need to use the Azure Portal; it's completely integrated with the Power BI service.) By doing this, the report is not consumed by a personal license; instead, the resource-based license is used with the flexible Power BI Premium or Power BI Embedded capacity. (When it's for development or trial purposes, this setting is not needed. This is for production.)
Or you can also assign your capacity (Premium capacity or Embedded capacity) using the Power BI admin portal. See "Manage capacities within Power BI Premium and Power BI Embedded" for details about the admin portal settings.

After you've created the customer's app workspace, please get the workspace id (the so-called "group id"). You can copy this id from the url (address) of the workspace in the Power BI service (Web UI), or you can get it with the rest api as follows.

Note : The following authorization token (access token) in the HTTP header is the master account's user token, i.e., an Azure AD token. See my old post for getting the Power BI user token. (Later I explain the idea behind the tokens.)

GET https://api.powerbi.com/v1.0/myorg/groups
Accept: application/json
Authorization: Bearer eyJ0eXAiOi...
HTTP/1.1 200 OK
Content-Type: application/json; odata.metadata=minimal

{
  "@odata.context": "http://df-app-scus-redirect.analysis.windows.net/v1.0/myorg/$metadata#groups",
  "value": [
    {
      "id": "a4781858-f3ef-47c2-80a9-fa14845c833b",
      "isReadOnly": false,
      "name": "myws01"
    },
    ...
    
  ]
}

Import pbix

Next you import your Power BI file (.pbix file) into this generated workspace. This task can also be done by the rest api or the UI. (Just push "Publish" in Power BI Desktop, as in the following picture.)

After you’ve imported Power BI file, please get the imported dataset id, report id, and report embed url.
The following HTTP request retrieves the imported dataset id. (Please change a4781858-f3ef-47c2-80a9-fa14845c833b to your group id.)

GET https://api.powerbi.com/v1.0/myorg/groups/a4781858-f3ef-47c2-80a9-fa14845c833b/datasets
Accept: application/json
Authorization: Bearer eyJ0eXAiOi...
HTTP/1.1 200 OK
Content-Type: application/json; odata.metadata=minimal

{
  "@odata.context": "http://df-app-scus-redirect.analysis.windows.net/v1.0/myorg/groups/a4781858-f3ef-47c2-80a9-fa14845c833b/$metadata#datasets",
  "value": [
    {
      "id": "44a12ee1-8da7-4383-a2cf-89129ef6e1a7",
      "name": "Retail Analysis Sample",
      "addRowsAPIEnabled": false,
      "configuredBy": "demotaro@test.onmicrosoft.com"
    }
  ]
}

The following retrieves the report id, as well as report embed url (embedUrl) which is used for embedding report.

GET https://api.powerbi.com/v1.0/myorg/groups/a4781858-f3ef-47c2-80a9-fa14845c833b/reports
Accept: application/json
Authorization: Bearer eyJ0eXAiOi...
HTTP/1.1 200 OK
Content-Type: application/json; odata.metadata=minimal

{
  "@odata.context": "http://df-app-scus-redirect.analysis.windows.net/v1.0/myorg/groups/a4781858-f3ef-47c2-80a9-fa14845c833b/$metadata#reports",
  "value": [
    {
      "id": "b21f4f90-e364-4b4c-9281-c5db87cdf3a5",
      "modelId": 0,
      "name": "Retail Analysis Sample",
      "webUrl": "https://app.powerbi.com/groups/a4781858-f3ef-47c2-80a9-fa14845c833b/reports/b21f4f90-e364-4b4c-9281-c5db87cdf3a5",
      "embedUrl": "https://app.powerbi.com/reportEmbed?reportId=b21f4f90-e364-4b4c-9281-c5db87cdf3a5&groupId=a4781858-f3ef-47c2-80a9-fa14845c833b",
      "isOwnedByMe": true,
      "isOriginalPbixReport": false,
      "datasetId": "44a12ee1-8da7-4383-a2cf-89129ef6e1a7"
    }
  ]
}

Data source connectivity and multi-tenancy of data

Although almost all the artifacts in the pbix file are imported into your workspace, the credential for the data source is not imported, for security reasons. As a result, if you're using DirectQuery mode, the embedded report cannot be shown correctly. (On the other hand, if you're using Import mode, you can view the report because the data is already imported into your dataset.)

For ISV applications (SaaS applications, etc), the separation of data is also a concern. The data of company A will be different from that of company B.

In such a case, you can set and change the connection string or credentials using rest api.

First you can get the data source id and gateway id by the following HTTP request.  (In this example, we assume that the type of data source is SQL Server.)
Note that the following “id” in HTTP response is the data source id.

GET https://api.powerbi.com/v1.0/myorg/groups/a4781858-f3ef-47c2-80a9-fa14845c833b/datasets/44a12ee1-8da7-4383-a2cf-89129ef6e1a7/Default.GetBoundGatewayDataSources
Accept: application/json
Authorization: Bearer eyJ0eXAiOi...
HTTP/1.1 200 OK
Content-Type: application/json; odata.metadata=minimal

{
  "@odata.context": "http://df-app-scus-redirect.analysis.windows.net/v1.0/myorg/groups/a4781858-f3ef-47c2-80a9-fa14845c833b/$metadata#gatewayDatasources",
  "value": [
    {
      "id": "2a0bca27-a496-450c-80e0-05790ad8875f",
      "gatewayId": "d52ba684-afa8-484d-b5d5-790842b6ab9f",
      "datasourceType": "Sql",
      "connectionDetails": "{"server":"server01.database.windows.net","database":"db01"}"
    }
  ]
}

Using gateway id and data source id, you can set (or change) the credential of this data source as follows.

PATCH https://api.powerbi.com/v1.0/myorg/gateways/d52ba684-afa8-484d-b5d5-790842b6ab9f/datasources/2a0bca27-a496-450c-80e0-05790ad8875f
Accept: application/json
Authorization: Bearer eyJ0eXAiOi...
Content-Type: application/json; charset=utf-8

{
  "credentialType": "Basic",
  "basicCredentials": {
    "username": "demouser",
    "password": "pass@word1"
  }
}
HTTP/1.1 200 OK

The following changes the connection string of the data source via the rest api. (The data source id also changes when you change the connection string.)
That is, you can import the Power BI file and set a different connection string for each customer's tenant.

POST https://api.powerbi.com/v1.0/myorg/groups/a4781858-f3ef-47c2-80a9-fa14845c833b/datasets/44a12ee1-8da7-4383-a2cf-89129ef6e1a7/Default.SetAllConnections
Accept: application/json
Authorization: Bearer eyJ0eXAiOi...
Content-Type: application/json; charset=utf-8

{
  "connectionString": "data source=tsmatsuz-server2.database.windows.net;initial catalog=db02;persist security info=True;encrypt=True;trustservercertificate=False"
}
HTTP/1.1 200 OK

The idea of AuthN / AuthZ for new embed model

Before embedding your report in your application, you must learn about the idea of AuthN / AuthZ in the new embed model.

In the provisioning phase (creating the workspace, importing the pbix, setting credentials, etc), it's okay to use the Azure AD user's (master account's) access token for calling the api, because it's not exposed to the end users. (Only the admin does these management tasks.)

How about viewing the embedded report ? (Of course, AuthN / AuthZ is needed for viewing the report, too.)
The users of an ISV application are not necessarily Power BI users or Azure AD users, so it's not good to use the Azure AD access token directly. If you were to use the master account's user token for all end users' reports, the token could be abused for other reports that the user doesn't have permission to see.

In this case, you can get a Power BI embed token in the backend (on the server side) with the following HTTP request, and your application can use this token for embedding (user-side processing) securely. This signed token is only for some specific user operation (viewing some report, etc), and if the user needs other operations, another token must be issued on the server side. (The token expires in one hour.)

POST https://api.powerbi.com/v1.0/myorg/groups/{group id}/reports/{report id}/GenerateToken
Accept: application/json
Authorization: Bearer eyJ0eXAiOi...
Content-Type: application/json; charset=utf-8

{
  "accessLevel": "View",
  "allowSaveAs": "false"
}
HTTP/1.1 200 OK
Content-Type: application/json; odata.metadata=minimal

{
  "@odata.context": "http://df-app-scus-redirect.analysis.windows.net/v1.0/myorg/groups/{group id}/$metadata#Microsoft.PowerBI.ServiceContracts.Api.V1.GenerateTokenResponse",
  "token": "H4sIAAAAAA...",
  "tokenId": "63c8d0ea-800d-462c-9906-22a4567f276f",
  "expiration": "2017-07-15T08:29:29Z"
}

As you may notice, an Azure AD access token must be provided for the HTTP request above, and this Azure AD access token must be issued in your application's backend (without an interactive login UI). That is, this type of access token should be an app-only token. (See my old post for the OAuth flow of the app-only access token.)
However, unfortunately Power BI doesn't support app-only tokens currently. Therefore, for getting the embed token, you must for now use a user access token obtained with a non-interactive sign-in instead of an app-only token. (Please wait until the app-only access token is supported in Power BI.)
For example, the following is the OAuth password grant flow, which doesn't need an interactive sign-in. You can use this user token for getting the embed token on the server side.

Note : The OAuth password grant flow is not recommended in usual cases, because it is less secure and several advanced security features like multi-factor authentication are not supported.

POST https://login.microsoftonline.com/common/oauth2/token
Content-Type: application/x-www-form-urlencoded

grant_type=password
&client_id=dede99a5-ed89-4881-90a4-4564dae562f7
&client_secret=P4GmxWa...
&username=tsmatsuz%40test.onmicrosoft.com
&password=pass%40word1
&resource=https%3A%2F%2Fanalysis.windows.net%2Fpowerbi%2Fapi
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

{
  "token_type": "Bearer",
  "scope": "Content.Create Dashboard.Read.All Data.Alter_Any Dataset.Read.All Dataset.ReadWrite.All Group.Read Group.Read.All Metadata.View_Any Report.Read.All Report.ReadWrite.All",
  "expires_in": "3599",
  "ext_expires_in": "0",
  "expires_on": "1500032600",
  "not_before": "1500028700",
  "resource": "https://analysis.windows.net/powerbi/api",
  "access_token": "eyJ0eXAiOi...",
  "refresh_token": "AQABAAAAAA..."
}

If the user needs to edit the report, please send the following HTTP request on the server side with "Edit" as the "accessLevel". In the same way, you can set "Create" as the "accessLevel" to enable users to create new reports in your embedded Power BI.

POST https://api.powerbi.com/v1.0/myorg/groups/{group id}/reports/{report id}/GenerateToken
Accept: application/json
Authorization: Bearer eyJ0eXAiOi...
Content-Type: application/json; charset=utf-8

{
  "accessLevel": "Edit",
  "allowSaveAs": "false"
}

Note : The RLS (Row-Level Security) is not supported now. (It is on the roadmap.)

Hosting (embedding) reports in your web page

Now you can embed your Power BI reports in your application with Power BI JavaScript API.

Let’s see the following javascript example.
The access token (txtAccessToken) is the embed token for report viewing (not an Azure AD access token). Please set your embed url (txtEmbedUrl) and report id (txtEmbedReportId), which were previously retrieved by the rest api.

<html>
<head>
  <title>Test</title>
  <script src="/Scripts/powerbi.js"></script>
</head>
<body>
  <div id="captionArea">
    <h1>Power BI Embed test</h1>
  </div>
  <div id="embedContainer" style="height:500px">
  </div>
  <script>
    (function () {
      // Please change these values
      var txtAccessToken = 'H4sIAAAAAA...';
      var txtEmbedUrl =
        'https://app.powerbi.com/reportEmbed?reportId=b21f4f90-e364-4b4c-9281-c5db87cdf3a5&groupId=a4781858-f3ef-47c2-80a9-fa14845c833b';
      var txtEmbedReportId = 'b21f4f90-e364-4b4c-9281-c5db87cdf3a5';

      var models = window['powerbi-client'].models;
      var permissions = models.Permissions.All;
      var config = {
        type: 'report',
        tokenType: models.TokenType.Embed,
        accessToken: txtAccessToken,
        embedUrl: txtEmbedUrl,
        id: txtEmbedReportId,
        permissions: permissions,
        settings: {
          filterPaneEnabled: true,
          navContentPaneEnabled: true
        }
      };

      var embedContainer = document.getElementById('embedContainer');
      var report = powerbi.embed(embedContainer, config);
    }());
  </script>
</body>
</html>

This HTML will display the following embedded report (View Mode).

Behind the JavaScript API, an iframe is inserted into your web page, and some attributes (including the embed token) are passed by inter-frame communication. (See the following postMessage().)
For example, the following sample code displays the same result without the Power BI JavaScript API. (Please use the JavaScript API in your production code. This sample is just for your understanding.)

Note : The uid (uniqueId) is a random string.

<html>
<head>
  <meta name="viewport" content="width=device-width" />
  <title>Test without Power BI JavaScript API</title>
</head>
<body>
  <div id="captionArea">
    <h1>Power BI Embed test</h1>
  </div>
  <div id="embedContainer">
    <iframe id="ifrTile" width="100%" height="500px"></iframe>
  </div>
  <script>
    (function () {
      // Please change these values
      var txtAccessToken = 'H4sIAAAAAA...';
      var txtEmbedUrl =
        'https://app.powerbi.com/reportEmbed?reportId=b21f4f90-e364-4b4c-9281-c5db87cdf3a5&groupId=a4781858-f3ef-47c2-80a9-fa14845c833b';
      var txtEmbedReportId = 'b21f4f90-e364-4b4c-9281-c5db87cdf3a5';

      var iframe = document.getElementById('ifrTile');
      iframe.src = txtEmbedUrl;
      iframe.onload = function () {
        var msgJson = {
          "method": "POST",
          "url": "/report/load",
          "headers": {
            "x-sdk-type": "js",
            "x-sdk-version": "2.3.2",
            "uid": "87oes"
          },
          "body": {
            "settings": {
              "filterPaneEnabled": true,
              "navContentPaneEnabled": true
            },
            "type": "report",
            "tokenType": 1,
            "accessToken": txtAccessToken,
            "embedUrl": txtEmbedUrl,
            "id": txtEmbedReportId,
            "permissions": 7,
            "uniqueId": "87oes"
          }
        };
        iframe.contentWindow.postMessage(msgJson, "*");
      };
    }());
  </script>
</body>
</html>

With the new embed experience, you can easily enable application users to edit the embedded reports as follows. (You can also enable users to create new reports in your embed experience.)
Remember that this embed token must be issued for editing (i.e., "accessLevel": "Edit").

<html>
<head>
  <title>Test</title>
  <script src="/Scripts/powerbi.js"></script>
</head>
<body>
  <div id="captionArea">
    <h1>Power BI Embed test</h1>
  </div>
  <div id="embedContainer" style="height:500px">
  </div>
  <script>
    (function () {
      // Please change these values
      var txtAccessToken = 'H4sIAAAAAA...';
      var txtEmbedUrl =
        'https://app.powerbi.com/reportEmbed?reportId=b21f4f90-e364-4b4c-9281-c5db87cdf3a5&groupId=a4781858-f3ef-47c2-80a9-fa14845c833b';
      var txtEmbedReportId = 'b21f4f90-e364-4b4c-9281-c5db87cdf3a5';

      var models = window['powerbi-client'].models;
      var permissions = models.Permissions.All;
      var config = {
        type: 'report',
        tokenType: models.TokenType.Embed,
        accessToken: txtAccessToken,
        embedUrl: txtEmbedUrl,
        id: txtEmbedReportId,
        permissions: permissions,
        viewMode: models.ViewMode.Edit,
        settings: {
          filterPaneEnabled: true,
          navContentPaneEnabled: true
        }
      };

      var embedContainer = document.getElementById('embedContainer');
      var report = powerbi.embed(embedContainer, config);
    }());
  </script>
</body>
</html>

Using the JavaScript API, you can also interact with the embedded report : filtering, reloading, changing pages, visual settings, etc.
You can see the following GitHub example for these operations, with JavaScript sample code.

[Github] Microsoft Power BI – Report Embed Sample
https://microsoft.github.io/PowerBI-JavaScript/demo/v2-demo/index.html