
Speed up Inference with TensorRT (Step-by-Step on Azure)

In my previous post, I described the challenges of real-time inference and Microsoft's Project Brainwave (FPGA).
In this post I introduce NVIDIA TensorRT for developers and data scientists on Azure.

To set up the environment on Azure, the Azure Data Science Virtual Machine (DSVM) already includes TensorRT (see here), so you can get started without cumbersome setup or configuration.
However, here we want to use the TensorFlow-TensorRT integration (which requires TensorFlow 1.7 or later), so we start from an Azure VM (NC or NCv3 instance) with plain Ubuntu 16.04 LTS installed.
I don't describe the software installation and setup steps here; please refer to my GitHub repo for the setup procedure on the Azure NC (or NCv2, NCv3) series. (I used the latest TensorRT 4.0.)
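
Before running the sample below, it might help to quickly check that the environment is ready. The following is just a minimal sanity-check sketch (assuming TensorFlow 1.7/1.8 built with TensorRT support); it only verifies the TensorFlow version, GPU visibility, and that the TensorFlow-TensorRT integration module can be imported.

import tensorflow as tf

# TensorFlow 1.7 or later is required for the TensorFlow-TensorRT integration
print('TensorFlow version:', tf.__version__)

# Check that TensorFlow can see the GPU (CUDA device)
print('GPU available:', tf.test.is_gpu_available(cuda_only=True))

# In TensorFlow 1.x the TF-TRT integration lives under tf.contrib
from tensorflow.contrib import tensorrt as trt
print('TensorFlow-TensorRT integration imported successfully')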

Note (Added Sep 2018): You can now also use NVIDIA GPU Cloud on Microsoft Azure. (See the NVIDIA blog post "NVIDIA GPU Cloud Adds Support for Microsoft Azure" for the announcement.)

Once setup is done, you can try real-time inference on the cloud platform. Run the following source code with TensorRT optimizations. (See my repo for the required files such as images.)
As you can see below, it converts an ordinary TensorFlow graph (resnetV150_frozen.pb) into a TensorRT-optimized graph. The TensorRT graph is still a standard TensorFlow graph, so you can use the optimized graph in the usual manner.
See "NVIDIA Developer Blog: Deploying Deep Neural Networks with NVIDIA TensorRT" for details on the transformations and optimizations that TensorRT performs. (I won't go into what happens inside the TensorRT optimizations here.)

import time
import tensorflow as tf
from tensorflow.contrib import tensorrt as trt

#
# tuning parameters
# (please change these values along with your computing resources)
#
batch_size = 128
workspace_size_bytes = 1 << 30
precision_mode = 'FP16' # use 'FP32' for K80
trt_gpu_ops = tf.GPUOptions(per_process_gpu_memory_fraction = 0.50)

#
# Read images (images -> input vectors)
#
tf.reset_default_graph()
g1 = tf.Graph()
with g1.as_default():
  # Create graph
  in_images = tf.placeholder(tf.string, name='in_images')
  decoded_input = tf.image.decode_png(in_images, channels=3)
  float_input = tf.cast(decoded_input, dtype=tf.float32)
  # (224, 224, 3) -> (n, 224, 224, 3)
  rgb_input = tf.expand_dims(
    float_input,
    axis=0)
  # For VGG-style preprocessing, subtract the channel means and convert RGB to BGR
  slice_red = tf.slice(
    rgb_input,
    [0, 0, 0, 0],
    [1, 224, 224, 1])
  slice_green = tf.slice(
    rgb_input,
    [0, 0, 0, 1],
    [1, 224, 224, 1])
  slice_blue = tf.slice(
    rgb_input,
    [0, 0, 0, 2],
    [1, 224, 224, 1])
  sub_red = tf.subtract(slice_red, 123.68)
  sub_green = tf.subtract(slice_green, 116.779)
  sub_blue = tf.subtract(slice_blue, 103.939)
  transferred_input = tf.concat(
    [sub_blue, sub_green, sub_red],
    3)
  # Run transformation to vectors
  with tf.Session(config=tf.ConfigProto(gpu_options=trt_gpu_ops)) as s1:
    with open('./tiger224x224.jpg', 'rb') as f:
      data1 = f.read()
      feed_dict = {
        in_images: data1
      }
      imglist1 = s1.run([transferred_input], feed_dict=feed_dict)
      image1 = imglist1[0]
    with open('./lion224x224.jpg', 'rb') as f:
      data2 = f.read()
      feed_dict = {
        in_images: data2
      }
      imglist2 = s1.run([transferred_input], feed_dict=feed_dict)
      image2 = imglist2[0]
    with open('./orangutan224x224.jpg', 'rb') as f:
      data3 = f.read()
      feed_dict = {
        in_images: data3
      }
      imglist3 = s1.run([transferred_input], feed_dict=feed_dict)
      image3 = imglist3[0]
""" Uncomment if you test batch inference """
# import numpy as np
# image1 = np.tile(image1,(batch_size,1,1,1))
# image2 = np.tile(image2,(batch_size,1,1,1))
# image3 = np.tile(image3,(batch_size,1,1,1))
print('Loaded image vectors (tiger, lion, orangutan)')

#
# Load classification graph def (pre-trained)
#
classifier_model_file = './resnetV150_frozen.pb' # downloaded from NVIDIA sample
classifier_graph_def = tf.GraphDef()
with tf.gfile.Open(classifier_model_file, 'rb') as f:
  data = f.read()
  classifier_graph_def.ParseFromString(data)
print('Loaded classifier graph def')

#
# Convert to TensorRT graph def
#
trt_graph_def = trt.create_inference_graph(
  input_graph_def=classifier_graph_def,
  outputs=['resnet_v1_50/predictions/Reshape_1'],
  max_batch_size=batch_size,
  max_workspace_size_bytes=workspace_size_bytes,
  precision_mode=precision_mode)
#trt_graph_def=trt.calib_graph_to_infer_graph(trt_graph_def) # For only 'INT8'
print('Generated TensorRT graph def')

#
# Generate tensor with TensorRT graph def
#
tf.reset_default_graph()
g2 = tf.Graph()
with g2.as_default():
  trt_x, trt_y = tf.import_graph_def(
    trt_graph_def,
    return_elements=['input:0', 'resnet_v1_50/predictions/Reshape_1:0'])
print('Generated tensor for TensorRT optimized graph')

#
# Run classification with TensorRT graph
#
with open('./imagenet_classes.txt', 'rb') as f:
  labeltext = f.read()
  classes_entries = labeltext.splitlines()
with tf.Session(graph=g2, config=tf.ConfigProto(gpu_options=trt_gpu_ops)) as s2:
  #
  # predict image1 (tiger)
  #
  feed_dict = {
    trt_x: image1
  }
  start_time = time.process_time()
  result = s2.run([trt_y], feed_dict=feed_dict)
  stop_time = time.process_time()
  # list -> 1 x n ndarray : result format is [[1.16643378e-06 3.12126781e-06 3.39836406e-05 ... ]]
  nd_result = result[0]
  # remove the row dimension
  onedim_result = nd_result[0,]
  # pair each class index with its probability
  indexed_result = enumerate(onedim_result)
  # sort by probability (descending)
  sorted_result = sorted(indexed_result, key=lambda x: x[1], reverse=True)
  # print the names of the top 5 classes
  for top in sorted_result[:5]:
    print(classes_entries[top[0]], 'confidence:', top[1])
  print('{:.2f} milliseconds'.format((stop_time-start_time)*1000))
  #
  # predict image2 (lion)
  #
  feed_dict = {
    trt_x: image2
  }
  start_time = time.process_time()
  result = s2.run([trt_y], feed_dict=feed_dict)
  stop_time = time.process_time()
  # list -> 1 x n ndarray : result format is [[1.16643378e-06 3.12126781e-06 3.39836406e-05 ... ]]
  nd_result = result[0]
  # remove the row dimension
  onedim_result = nd_result[0,]
  # pair each class index with its probability
  indexed_result = enumerate(onedim_result)
  # sort by probability (descending)
  sorted_result = sorted(indexed_result, key=lambda x: x[1], reverse=True)
  # print the names of the top 5 classes
  for top in sorted_result[:5]:
    print(classes_entries[top[0]], 'confidence:', top[1])
  print('{:.2f} milliseconds'.format((stop_time-start_time)*1000))
  #
  # predict image3 (orangutan)
  #
  feed_dict = {
    trt_x: image3
  }
  start_time = time.process_time()
  result = s2.run([trt_y], feed_dict=feed_dict)
  stop_time = time.process_time()
  # list -> 1 x n ndarray : result format is [[1.16643378e-06 3.12126781e-06 3.39836406e-05 ... ]]
  nd_result = result[0]
  # remove the row dimension
  onedim_result = nd_result[0,]
  # pair each class index with its probability
  indexed_result = enumerate(onedim_result)
  # sort by probability (descending)
  sorted_result = sorted(indexed_result, key=lambda x: x[1], reverse=True)
  # print the names of the top 5 classes
  for top in sorted_result[:5]:
    print(classes_entries[top[0]], 'confidence:', top[1])
  print('{:.2f} milliseconds'.format((stop_time-start_time)*1000))

The following is the output from running this code on a V100 (Volta architecture) instance (NCv3).
As you can see, the first run is very slow because of one-time start-up costs. (Please exclude this first run from your benchmarks.)

b'tiger, Panthera tigris' confidence: 0.92484504
b'tiger cat' confidence: 0.07038423
b'zebra' confidence: 0.0017817853
b'tabby, tabby cat' confidence: 0.0015420463
b'jaguar, panther, Panthera onca, Felis onca' confidence: 0.0006905204
1654.38 milliseconds

b'lion, king of beasts, Panthera leo' confidence: 0.5410014
b'cougar, puma, catamount, mountain lion, painter, panther, Felis concolor' confidence: 0.44524524
b'wombat' confidence: 0.001763356
b'jaguar, panther, Panthera onca, Felis onca' confidence: 0.0015685557
b'tiger, Panthera tigris' confidence: 0.0011836683
3.63 milliseconds

b'orangutan, orang, orangutang, Pongo pygmaeus' confidence: 0.82915646
b'gorilla, Gorilla gorilla' confidence: 0.13315238
b'chimpanzee, chimp, Pan troglodytes' confidence: 0.035735503
b'macaque' confidence: 0.00070971606
b'patas, hussar monkey, Erythrocebus patas' confidence: 0.0002849982
3.91 milliseconds
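
The one-time start-up cost of the first run comes from things like building the TensorRT engines and initializing the CUDA context. If you want stable numbers, one simple approach is to run a few untimed warm-up inferences before you start measuring. A minimal sketch, placed inside the same session block (s2, trt_x, trt_y and image1 as in the code above):

  # Warm up: run a few untimed inferences so that one-time initialization
  # (TensorRT engine construction, CUDA context setup, etc.) is excluded
  for _ in range(5):
    s2.run([trt_y], feed_dict={trt_x: image1})
  # ... then run the timed measurements as shown above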

Try changing the parameters to find the right performance for your workload.
For your reference, the table below shows the inference performance (milliseconds per image) from my experiments.
As you can see, the Volta (V100) itself is a very powerful device not only for training but also for inference workloads. TensorRT seems to accelerate batch execution in particular, more than single-image inference. (Unlike Project Brainwave, it performs best when I run batch inference rather than single inference.)
I don't go into details here, but please also try other instance types (K80 or P100).

ResNet-50 classification (inference latency per image)

                                      single inference   batch inference
General Purpose CPU (Azure DS3 v2)    694 ms/image       649 ms/image
V100 (Azure NCv3)                     6.73 ms/image      1.34 ms/image
V100 (Azure NCv3) with TensorRT       3.42 ms/image      0.47 ms/image

(Running on: Ubuntu 16.04, CUDA 9.0, cuDNN 7.1.4, Python 3.5.2, TensorFlow 1.8, TensorRT 4.0.1)
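
If you want to measure the batch numbers yourself, a rough sketch looks like the following (inside the same session block as above, with image1 tiled to a full batch as in the commented-out lines of the sample); the whole batch is timed and divided by batch_size to get the per-image latency.

  import numpy as np

  # Tile one preprocessed image into a full batch: (batch_size, 224, 224, 3)
  batch_input = np.tile(image1, (batch_size, 1, 1, 1))

  start_time = time.process_time()
  result = s2.run([trt_y], feed_dict={trt_x: batch_input})
  stop_time = time.process_time()

  # Per-image latency for batch execution
  print('{:.2f} ms/image'.format((stop_time - start_time) * 1000 / batch_size))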

As we saw in my previous post, when you apply Project Brainwave (FPGA) inference to your models, you take a transfer learning approach with pre-built images. With NVIDIA TensorRT, you can quickly bring your own models into optimized real-time inference by transforming your own trained model into an optimized graph def.
A GPU might be a more expensive choice than FPGA devices, but this flexibility will meet your needs in many scenarios.
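
For your own model, the conversion follows the same pattern as the ResNet-50 sample above: freeze your trained model into a graph def, then pass it to create_inference_graph. The following is only a sketch; my_model_frozen.pb and the output node name 'my_output/Softmax' are hypothetical placeholders for your own graph.

import tensorflow as tf
from tensorflow.contrib import tensorrt as trt

# Load your own frozen graph def (hypothetical file name)
my_graph_def = tf.GraphDef()
with tf.gfile.Open('./my_model_frozen.pb', 'rb') as f:
  my_graph_def.ParseFromString(f.read())

# Convert to a TensorRT-optimized graph def
# ('my_output/Softmax' is a placeholder; use your model's real output node name)
my_trt_graph_def = trt.create_inference_graph(
  input_graph_def=my_graph_def,
  outputs=['my_output/Softmax'],
  max_batch_size=1,
  max_workspace_size_bytes=1 << 30,
  precision_mode='FP16')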

Note: Currently several types of nodes do not seem to be supported by TensorRT, so some advanced graphs might fail to convert. I hope this will be improved in the future.
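
One rough way to see how much of a graph was actually converted is to inspect the returned graph def: subgraphs that TensorRT handled are replaced with TRTEngineOp nodes, while unsupported parts remain as ordinary TensorFlow ops (this is a diagnostic sketch based on my understanding of the TF-TRT integration in TensorFlow 1.x).

# Count the subgraphs that were replaced by TensorRT engines
trt_engine_nodes = [n for n in trt_graph_def.node if n.op == 'TRTEngineOp']
print('TRTEngineOp nodes:', len(trt_engine_nodes))
print('Other TensorFlow nodes:', len(trt_graph_def.node) - len(trt_engine_nodes))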

Note: You can also use Google TPU for high-speed inference, but single inference currently seems to incur some overhead, resulting in latencies over 10 ms. (See "Cloud TPU – Troubleshooting and FAQ" for details.)
