
Azure Arc-enabled Machine Learning – Run AML anywhere

In practical machine learning workloads, a variety of infrastructures and devices (different GPUs, high-speed interconnects, etc.) are required. For instance, you might reuse your own in-house hardware for ordinary jobs, but run specific jobs that need more powerful (high-spec) hardware in the cloud. In other cases, you might operationalize model training in-house and host (serve) the model in the cloud.
With Azure Arc-enabled Machine Learning, you can bring the AML open architecture to existing external resources, such as on-premise GPU machines or 3rd party platforms (EKS, GKE, etc.).
With Arc-enabled Machine Learning, data scientists can work anywhere, on any devices, in a consistent way using Python scripts, commands (Azure CLI), or the user interface.

You can now try Arc-enabled Machine Learning, since it’s in public preview.
In this post, I’ll briefly show you what you can do with Arc-enabled Machine Learning and how to use it.

Run Azure Arc-enabled Kubernetes Cluster

First of all, you must run Azure Arc-enabled Kubernetes on-premise (running K3s, KIND, etc.) or on a 3rd party cloud.
For running Arc-enabled Machine Learning later, use machines with at least 4 CPUs, since Arc-enabled ML requires sufficient resources.

In this post, I assume that we run a KIND (Kubernetes in Docker) cluster on Ubuntu 18.04 on a single on-premise node. (For test purposes, I used a Standard D3 v2 virtual machine in Azure, which has 4 CPUs and 14 GB memory.)

# Install docker
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt-get -y update
sudo apt-get -y install docker-ce
# Download Kind
curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.11.1/kind-linux-amd64
chmod +x ./kind
# Install and run Kind (Kubernetes) cluster
sudo ./kind create cluster

After the cluster is created, a kubeconfig file (~/.kube/config) is generated automatically.
In this connected environment, register your local cluster as an Arc-enabled Kubernetes resource in Azure by running the following commands.

# Install Helm
curl https://baltocdn.com/helm/signing.asc | sudo apt-key add -
sudo apt-get install apt-transport-https --yes
echo "deb https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list
sudo apt-get update
sudo apt-get install helm
# Install Azure CLI
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
# Install connectedk8s extension in CLI
az extension add --name connectedk8s
# Login to your Azure subscription
az login
az account set --subscription {SubscriptionID}
# Register providers
# (Do once in your subscription. It will take a while...)
az provider register --namespace Microsoft.Kubernetes
az provider register --namespace Microsoft.KubernetesConfiguration
az provider register --namespace Microsoft.ExtendedLocation
# Create a resource group
az group create --name MLTest --location EastUS
# Connect local cluster (in kubeconfig) to Azure resource
sudo az connectedk8s connect --name MLTest1 --resource-group MLTest

Now your Arc-enabled Kubernetes cluster is ready.
If you are new to Azure Arc, this post will help you understand how Arc-enabled Kubernetes connects to Azure remotely.

Install Arc-Enabled ML on Your Cluster

Next, install the cluster extension for Arc-enabled Machine Learning (Microsoft.AzureML.Kubernetes) on this cluster.

az extension add --name k8s-extension
az k8s-extension create --name amlarc-compute \
  --extension-type Microsoft.AzureML.Kubernetes \
  --configuration-settings enableTraining=True  \
  --cluster-type connectedClusters \
  --cluster-name MLTest1 \
  --resource-group MLTest \
  --scope cluster

Note : As of this writing (July 2021), specify the following version option to run a training job on an on-premise KIND cluster. (This version will be released soon.)
Otherwise, the submitted run in AML will get stuck in the “Queued” status.

az k8s-extension create --name amlarc-compute \
  --extension-type Microsoft.AzureML.Kubernetes \
  --configuration-settings enableTraining=True  \
  --cluster-type connectedClusters \
  --cluster-name MLTest1 \
  --resource-group MLTest \
  --scope cluster \
  --release-train staging --version 1.0.48

After running the above command, run the following and wait until “installState” becomes “Installed“. (This will take a while.)

az k8s-extension show --name amlarc-compute \
  --cluster-type connectedClusters \
  --cluster-name MLTest1 \
  --resource-group MLTest

When the installation has completed, you can see the following pods running in your local cluster.
The above command (with the enableTraining=True setting) also generates Azure Service Bus and Azure Relay resources in Azure. Communication between the AML workspace (cloud) and your cluster (local) is relayed by these resources using outbound connections.

# Install kubectl command
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl

# List pods in azureml namespace
# (Run "kubectl get all" instead for listing all other resources)
sudo kubectl get pod --namespace azureml
NAME                                   READY   STATUS
aml-operator-6bcd789c49-lzmhw          2/2     Running
amlarc-compute-kube-state-metrics...   1/1     Running
cluster-status-reporter-7b7b6c65b...   1/1     Running
fluent-bit-xv7b7                       1/1     Running
frameworkcontroller-0                  1/1     Running
gateway-5667747ccb-5rxvd               1/1     Running
metrics-controller-manager-67688d...   2/2     Running
nfd-master-6f4b4b4554-vpsmg            1/1     Running
nfd-worker-mzz5j                       2/2     Running
prom-operator-7bc87d97dc-vxkxt         2/2     Running
prometheus-prom-prometheus-0           3/3     Running
relayserver-7bb77d9479-4rxhd           1/1     Running
relayserver-7bb77d9479-qw6bv           1/1     Running

If your pods are not running correctly, check the pod status or container logs and fix any errors.

# Show pod details
sudo kubectl describe pods {PodName} \
  --namespace azureml
# Show container logs
sudo kubectl logs {PodName} \
  --container {ContainerName} \
  --namespace azureml

Attach Cluster in ML Workspace

Now let’s create an Azure Machine Learning resource (ML workspace) in the Azure Portal.
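If you prefer scripting over the Portal, a minimal sketch with the Python SDK is shown below. (The workspace name, resource group, and region are just placeholders.)

# A sketch : create an AML workspace with the Python SDK
# (workspace name, resource group, and region below are placeholders)
from azureml.core import Workspace

ws = Workspace.create(
    name="MLWorkspace1",
    subscription_id="{SubscriptionID}",
    resource_group="MLTest",
    location="eastus",
    exist_ok=True)

# Save the connection info (config.json) for later use with Workspace.from_config()
ws.write_config()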

Next, we grant permissions for the Arc-enabled cluster’s actions to the generated ML workspace (ML resource).
Once you have created a Machine Learning workspace in Azure, a system-assigned managed identity is automatically generated for this resource.
You can search for and view this identity in the Azure Active Directory tenant of your subscription. (See below.) Copy the object ID of this managed identity.

Create a custom role for the Arc-enabled cluster’s actions as follows. In this example (below), I grant permissions for all cluster actions.

# Create a role for Arc-enabled kubernetes actions
az role definition create --role-definition '{
  "Name": "Custom Role for Arc Kubernetes Actions",
  "Description": "Auth for connected clusters",
  "Actions": ["Microsoft.Kubernetes/connectedClusters/*"],
  "DataActions": [],
  "NotDataActions": [],
  "AssignableScopes": ["/subscriptions/{SubscriptionID}"]
}'

When the output is displayed, copy the name of this custom role. In the example below, I assume that 394b877b-315c-477e-8b74-ff765b6590d0 is the name of the custom role.

{
  "assignableScopes": [
    "/subscriptions/b3ae1c15..."
  ],
  "description": "Auth for connected clusters",
  "id": "/subscriptions/b3ae1c15.../providers/Microsoft.Authorization/roleDefinitions/394b877b-315c-477e-8b74-ff765b6590d0",
  "name": "394b877b-315c-477e-8b74-ff765b6590d0",
  ...
}

Using this role name, assign the role to your ML workspace (its Azure AD managed identity) as follows.
Replace the following {ObjectId} with the object ID of the ML resource’s managed identity.

az role assignment create \
  --role {RoleName} \
  --assignee {ObjectId} \
  --scope /subscriptions/{SubscriptionID}

Open the Azure Machine Learning studio UI (https://ml.azure.com/).
Click the “Compute” tab in the left navigation and attach this Arc-enabled cluster on the “Attached computes” tab as follows.

Note : If it fails, wait a while until Arc-enabled ML finishes provisioning.
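Alternatively, you can attach the cluster with the Python SDK. The following is a rough sketch, assuming an azureml-core version that already includes the (preview) KubernetesCompute class; the attach name and resource ID below simply mirror the resources created above.

# A sketch : attach the Arc-enabled (connected) cluster with the Python SDK
from azureml.core import Workspace
from azureml.core.compute import ComputeTarget, KubernetesCompute

ws = Workspace.from_config()

# Resource ID of the connected cluster registered with "az connectedk8s connect"
resource_id = (
    "/subscriptions/{SubscriptionID}"
    "/resourceGroups/MLTest"
    "/providers/Microsoft.Kubernetes/connectedClusters/MLTest1")

attach_config = KubernetesCompute.attach_configuration(resource_id=resource_id)
amlarc_compute = ComputeTarget.attach(ws, "MLTest1", attach_config)
amlarc_compute.wait_for_completion(show_output=True)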

Run ML Workload on Arc-enabled Cluster

In a notebook, let’s run the following simple example with the AML Python SDK.
This example creates an iris prediction model, evaluates its accuracy, and saves (serializes) the generated model. (The code prints the accuracy score to standard output.)

The following MLTest1 is the name of my attached Kubernetes cluster. (Change it to your attached resource name.)

# Connect to Azure Machine Learning
import azureml.core
from azureml.core import Workspace
ws = Workspace.from_config()
# Create script folder
import os
script_folder = os.path.join(os.getcwd(), 'script')
os.makedirs(script_folder, exist_ok=True)

%%writefile ./script/train01.py
##
## Save this train script as train01.py in script folder
##
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import os
import joblib

# Split dataset for training and evaluation
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train using SVM
clf = SVC()
clf.fit(X_train, y_train)

# Evaluate
print("***** accuracy score *****")
print(accuracy_score(clf.predict(X_test), y_test))

# Save model
os.makedirs("outputs", exist_ok=True)
joblib.dump(value=clf, filename="outputs/sklearn_iris_model.pkl")

# Get the attached Arc-enabled Kubernetes compute
amlarc_compute = ws.compute_targets["MLTest1"]
from azureml.core import Experiment
from azureml.core.environment import Environment
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core import ScriptRunConfig

# Register environment to re-use later
env = Environment('test01-env')
conda_dep = CondaDependencies.create()
conda_dep.add_pip_package('scikit-learn')
env.python.conda_dependencies = conda_dep
env.register(workspace = ws)
## # Get existing environment
## env = Environment.get(ws, name="test01-env")

# Create Script Config
src = ScriptRunConfig(
  source_directory=script_folder,
  script='train01.py',
  compute_target=amlarc_compute,
  environment=env)

# Run script on Arc-enabled kubernetes
exp = Experiment(workspace=ws, name="experiment01")
run = exp.submit(config=src)
run.wait_for_completion(show_output=True)

As the following picture shows, the submitted run (train01.py) runs on the on-premise cluster.

The AML workspace in the cloud tracks all results and logs of the experiment running on-premise. You can view these artifacts (results) in the AML studio UI in the cloud.

Using the same familiar AML interfaces, such as Python scripts or CLI commands, you can consistently view and extract logs, metrics, or models from the run history.
You can automate all jobs in AML, even when they run on-premise.

For instance, the following script checks the printed result in the logs, which are collected in the ML workspace.
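This is a rough sketch of that check, using the run object from the submission above. (The exact log file layout can vary by compute type.)

# Download all logs collected for this run and look for the printed accuracy
import os

run.get_all_logs(destination="./logs")

for root, _, files in os.walk("./logs"):
    for name in files:
        path = os.path.join(root, name)
        with open(path, encoding="utf-8", errors="ignore") as f:
            text = f.read()
        if "accuracy score" in text:
            print(path)
            print(text)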

The following code extracts and uses the generated model (which is also collected in the ML workspace) with the Python SDK.
This model can also be used for serving in the cloud or on-premise.
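A minimal sketch of downloading and loading the serialized model might look like the following. (The local output path is arbitrary.)

# List the files collected for this run, then download the serialized model
import joblib

print(run.get_file_names())

run.download_file(
    name="outputs/sklearn_iris_model.pkl",
    output_file_path="./sklearn_iris_model.pkl")

model = joblib.load("./sklearn_iris_model.pkl")
print(model.predict([[5.1, 3.5, 1.4, 0.2]]))

# You can also register the model in the workspace for later deployment
# run.register_model(model_name="sklearn_iris", model_path="outputs/sklearn_iris_model.pkl")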

In combination with the AML Python SDK, you can also run a variety of AML value-added functions on-premise, such as hyper-parameter tuning, independent of any specific framework. (See here for details on tuning hyper-parameters in AML.)
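For instance, a rough sketch of a hyper-parameter sweep on the attached cluster is shown below. It assumes the training script accepts a --C argument for the SVM and logs a metric named "accuracy" with run.log() (the script above does not do this yet).

# A sketch : sweep the SVM regularization parameter C with HyperDrive
from azureml.core import Experiment
from azureml.train.hyperdrive import (
    HyperDriveConfig, RandomParameterSampling, PrimaryMetricGoal, uniform)

# Values for --C are sampled and passed to the training script as arguments
param_sampling = RandomParameterSampling({"--C": uniform(0.1, 10.0)})

hd_config = HyperDriveConfig(
    run_config=src,                   # the ScriptRunConfig targeting amlarc_compute
    hyperparameter_sampling=param_sampling,
    primary_metric_name="accuracy",   # must be logged by the script via run.log()
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=8,
    max_concurrent_runs=2)

hd_run = Experiment(workspace=ws, name="experiment01-hd").submit(hd_config)
hd_run.wait_for_completion(show_output=True)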

Work with Data

As with usual AML jobs, you can use datastores (or datasets) both in the cloud and on-premise. In Arc-enabled ML specifically, you can also mount an NFS volume on-premise and use data in that volume.

For details about NFS mount settings, see this document in official GitHub repository.

Once configured, you can run a job using data on a shared local NFS volume.
For instance, when your training data (e.g., data.csv) exists on the local NFS volume, you can mount this volume at the path /example-share in pods and pass this data via the Python script’s arguments. (See below.)

args = ["--data-folder", "/example-share", "--file-name", "data.csv"]
src = ScriptRunConfig(
  source_directory=script_folder,
  script='train02.py',
  arguments=args,
  compute_target=amlarc_compute,
  environment=env)
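
For reference, a minimal train02.py that consumes these arguments could look like the following sketch (pandas is assumed to be added to the environment's dependencies).

# train02.py : read the training data from the mounted NFS path
import argparse
import os
import pandas as pd

parser = argparse.ArgumentParser()
parser.add_argument("--data-folder", type=str, dest="data_folder")
parser.add_argument("--file-name", type=str, dest="file_name")
args = parser.parse_args()

# With the arguments above, this resolves to /example-share/data.csv
data_path = os.path.join(args.data_folder, args.file_name)
df = pd.read_csv(data_path)
print(df.head())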

Run Pipeline Generated in AML Designer

An ML pipeline in AML runs in docker containers. Thanks to this containerized technology, you can also run ML pipelines on Arc-enabled ML. In particular, you can run an ML pipeline built in the Azure Machine Learning Designer (i.e., the no-code ML experience).

For instance, select the built-in “Automobile Price Prediction” example in the AML Designer as follows. (This pipeline builds and evaluates a model for predicting the price of used cars.)

To run this pipeline on your own Kubernetes cluster (on-premise), just select and run it on the attached Arc-enabled compute resource.

Inferencing (Serving)

You can not only submit training jobs, but also serve models on Arc-enabled ML.
You can build MLOps (or operations) across multiple environments in a consistent manner with the Python SDK or command interface (Azure CLI). For instance, you can easily serve a model on-premise that was trained on cloud GPUs in Azure or a 3rd party cloud (EKS or GKE).

Currently this capability (inferencing on Arc-enabled ML) is in private preview; please sign up for the trial.
