In my previous post, I described the challenges of real-time inference and Microsoft's Project Brainwave (FPGA).

In this post I introduce NVIDIA TensorRT for developers and data scientists on Azure.

To set up an environment on Azure, the Azure Data Science Virtual Machine (DSVM) includes TensorRT (see here), so you can start quickly without cumbersome setup or configuration.

But here we want to use the TensorFlow-TensorRT integration (which requires TensorFlow 1.7 or later), so we start from an Azure VM (NC or NCv3 instance) with plain Ubuntu 16.04 LTS installed.

I don't describe the software installation and setup steps here; please refer to my GitHub repo for the setup procedure on the Azure NC (or NCv2, NCv3) series. (I used the latest TensorRT 4.0.)

After setup is done, you can examine real-time inference on the cloud platform. Please run the following source code with TensorRT optimizations. (See my repo for required files such as the images.)

As you can see below, the code converts a regular TensorFlow graph (resnetV150_frozen.pb) into a TensorRT-optimized graph. The TensorRT graph is still a standard TensorFlow graph, so you can use this optimized graph in the usual manner.

See "NVIDIA Developer Blog : Deploying Deep Neural Networks with NVIDIA TensorRT" for details about the transformations and optimizations performed by TensorRT. (I won't go into detail here about what happens inside the TensorRT optimizations.)

```
import time
import tensorflow as tf
from tensorflow.contrib import tensorrt as trt

#
# tuning parameters
# (please change these values along with your computing resources)
#
batch_size = 128
workspace_size_bytes = 1 << 30
precision_mode = 'FP16' # use 'FP32' for K80
trt_gpu_ops = tf.GPUOptions(per_process_gpu_memory_fraction=0.50)

#
# Read images (images -> input vectors)
#
tf.reset_default_graph()
g1 = tf.Graph()
with g1.as_default():
    # Create graph
    in_images = tf.placeholder(tf.string, name='in_images')
    decoded_input = tf.image.decode_png(in_images, channels=3)
    float_input = tf.cast(decoded_input, dtype=tf.float32)
    # (224, 224, 3) -> (n, 224, 224, 3)
    rgb_input = tf.expand_dims(float_input, axis=0)
    # For VGG preprocess, subtract means and convert to BGR
    slice_red = tf.slice(rgb_input, [0, 0, 0, 0], [1, 224, 224, 1])
    slice_green = tf.slice(rgb_input, [0, 0, 0, 1], [1, 224, 224, 1])
    slice_blue = tf.slice(rgb_input, [0, 0, 0, 2], [1, 224, 224, 1])
    sub_red = tf.subtract(slice_red, 123.68)
    sub_green = tf.subtract(slice_green, 116.779)
    sub_blue = tf.subtract(slice_blue, 103.939)
    transferred_input = tf.concat([sub_blue, sub_green, sub_red], 3)
    # Run transformation to vectors
    with tf.Session(config=tf.ConfigProto(gpu_options=trt_gpu_ops)) as s1:
        with open('./tiger224x224.jpg', 'rb') as f:
            data1 = f.read()
        feed_dict = {in_images: data1}
        imglist1 = s1.run([transferred_input], feed_dict=feed_dict)
        image1 = imglist1[0]
        with open('./lion224x224.jpg', 'rb') as f:
            data2 = f.read()
        feed_dict = {in_images: data2}
        imglist2 = s1.run([transferred_input], feed_dict=feed_dict)
        image2 = imglist2[0]
        with open('./orangutan224x224.jpg', 'rb') as f:
            data3 = f.read()
        feed_dict = {in_images: data3}
        imglist3 = s1.run([transferred_input], feed_dict=feed_dict)
        image3 = imglist3[0]
        # Uncomment if you test batch inference
        # import numpy as np
        # image1 = np.tile(image1, (batch_size, 1, 1, 1))
        # image2 = np.tile(image2, (batch_size, 1, 1, 1))
        # image3 = np.tile(image3, (batch_size, 1, 1, 1))
print('Loaded image vectors (tiger, lion, orangutan)')

#
# Load classification graph def (pre-trained)
#
classifier_model_file = './resnetV150_frozen.pb' # downloaded from NVIDIA sample
classifier_graph_def = tf.GraphDef()
with tf.gfile.Open(classifier_model_file, 'rb') as f:
    data = f.read()
    classifier_graph_def.ParseFromString(data)
print('Loaded classifier graph def')

#
# Convert to TensorRT graph def
#
trt_graph_def = trt.create_inference_graph(
    input_graph_def=classifier_graph_def,
    outputs=['resnet_v1_50/predictions/Reshape_1'],
    max_batch_size=batch_size,
    max_workspace_size_bytes=workspace_size_bytes,
    precision_mode=precision_mode)
#trt_graph_def = trt.calib_graph_to_infer_graph(trt_graph_def) # For only 'INT8'
print('Generated TensorRT graph def')

#
# Generate tensor with TensorRT graph def
#
tf.reset_default_graph()
g2 = tf.Graph()
with g2.as_default():
    trt_x, trt_y = tf.import_graph_def(
        trt_graph_def,
        return_elements=['input:0', 'resnet_v1_50/predictions/Reshape_1:0'])
print('Generated tensor for TensorRT optimized graph')

#
# Run classification with TensorRT graph
#
with open('./imagenet_classes.txt', 'rb') as f:
    labeltext = f.read()
classes_entries = labeltext.splitlines()
with tf.Session(graph=g2, config=tf.ConfigProto(gpu_options=trt_gpu_ops)) as s2:
    #
    # predict image1 (tiger)
    #
    feed_dict = {trt_x: image1}
    start_time = time.process_time()
    result = s2.run([trt_y], feed_dict=feed_dict)
    stop_time = time.process_time()
    # list -> 1 x n ndarray : feature's format is
    # [[1.16643378e-06 3.12126781e-06 3.39836406e-05 ... ]]
    nd_result = result[0]
    # remove row's dimension
    onedim_result = nd_result[0,]
    # set column index to array of possibilities
    indexed_result = enumerate(onedim_result)
    # sort with possibilities
    sorted_result = sorted(indexed_result, key=lambda x: x[1], reverse=True)
    # get the names of top 5 possibilities
    for top in sorted_result[:5]:
        print(classes_entries[top[0]], 'confidence:', top[1])
    print('{:.2f} milliseconds'.format((stop_time - start_time) * 1000))
    #
    # predict image2 (lion) - same post-processing as above
    #
    feed_dict = {trt_x: image2}
    start_time = time.process_time()
    result = s2.run([trt_y], feed_dict=feed_dict)
    stop_time = time.process_time()
    nd_result = result[0]
    onedim_result = nd_result[0,]
    indexed_result = enumerate(onedim_result)
    sorted_result = sorted(indexed_result, key=lambda x: x[1], reverse=True)
    for top in sorted_result[:5]:
        print(classes_entries[top[0]], 'confidence:', top[1])
    print('{:.2f} milliseconds'.format((stop_time - start_time) * 1000))
    #
    # predict image3 (orangutan) - same post-processing as above
    #
    feed_dict = {trt_x: image3}
    start_time = time.process_time()
    result = s2.run([trt_y], feed_dict=feed_dict)
    stop_time = time.process_time()
    nd_result = result[0]
    onedim_result = nd_result[0,]
    indexed_result = enumerate(onedim_result)
    sorted_result = sorted(indexed_result, key=lambda x: x[1], reverse=True)
    for top in sorted_result[:5]:
        print(classes_entries[top[0]], 'confidence:', top[1])
    print('{:.2f} milliseconds'.format((stop_time - start_time) * 1000))
```
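The top-5 selection in the code above is plain Python, independent of TensorFlow. On a toy probability vector it behaves like this (class names and scores here are hypothetical, not the ResNet-50 output):

```python
# hypothetical class labels and softmax scores
classes = ['tiger', 'lion', 'zebra', 'tabby', 'jaguar']
probs = [0.92, 0.03, 0.02, 0.02, 0.01]

# pair each probability with its class index
indexed = enumerate(probs)
# sort the (index, probability) pairs by probability, highest first
ranked = sorted(indexed, key=lambda x: x[1], reverse=True)
# map the indices back to class names
top5 = [(classes[i], p) for i, p in ranked[:5]]
print(top5[0])  # ('tiger', 0.92)
```

This is exactly what the `enumerate` / `sorted` / `classes_entries[top[0]]` sequence does on the 1000-class ImageNet output.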

The following is the output from running the code on a V100 (Volta architecture) instance (NCv3).

As you can see, the first run is very slow because of initial start-up overhead. (Please exclude this first run from your benchmarks.)

```
b'tiger, Panthera tigris' confidence: 0.92484504
b'tiger cat' confidence: 0.07038423
b'zebra' confidence: 0.0017817853
b'tabby, tabby cat' confidence: 0.0015420463
b'jaguar, panther, Panthera onca, Felis onca' confidence: 0.0006905204
1654.38 milliseconds
```

```
b'lion, king of beasts, Panthera leo' confidence: 0.5410014
b'cougar, puma, catamount, mountain lion, painter, panther, Felis concolor' confidence: 0.44524524
b'wombat' confidence: 0.001763356
b'jaguar, panther, Panthera onca, Felis onca' confidence: 0.0015685557
b'tiger, Panthera tigris' confidence: 0.0011836683
3.63 milliseconds
```

```
b'orangutan, orang, orangutang, Pongo pygmaeus' confidence: 0.82915646
b'gorilla, Gorilla gorilla' confidence: 0.13315238
b'chimpanzee, chimp, Pan troglodytes' confidence: 0.035735503
b'macaque' confidence: 0.00070971606
b'patas, hussar monkey, Erythrocebus patas' confidence: 0.0002849982
3.91 milliseconds
```

Try changing the parameters to find the performance that suits your workload!
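Since the first run carries that start-up cost, a small helper like the following can make parameter comparisons fairer. This is just a sketch (not from my repo): it discards warm-up calls and averages the rest; `fn` would be something like `lambda: s2.run([trt_y], feed_dict=feed_dict)`.

```python
import time

def mean_latency_ms(fn, warmup=1, runs=10):
    """Call fn repeatedly, discard warm-up calls, return mean latency in ms."""
    for _ in range(warmup):
        fn()  # warm-up runs are not measured
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        total += time.perf_counter() - start
    return total / runs * 1000.0
```

`time.perf_counter()` is used here instead of `time.process_time()` because it measures wall-clock time, which also covers time spent waiting on the GPU.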

For your reference, here is the inference performance (milliseconds per image) from my experiment.

As you can see below, Volta (V100) itself is a very powerful device not only for training but also for inference workloads. TensorRT seems to accelerate batch execution in particular, rather than single inference. (Unlike Project Brainwave, it performs best when I run batch inference rather than single inference.)
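The batch-inference numbers come from feeding a whole batch at once, as in the commented-out `np.tile` lines in the code above: one preprocessed image is replicated along the batch axis, and per-image latency is then the total batch latency divided by the batch size. A minimal illustration of the tiling itself:

```python
import numpy as np

# one preprocessed image: shape (1, 224, 224, 3)
image = np.zeros((1, 224, 224, 3), dtype=np.float32)

# replicate it along the batch axis to form a batch of 128
batch = np.tile(image, (128, 1, 1, 1))
print(batch.shape)  # (128, 224, 224, 3)

# per-image latency = measured batch latency / batch_size
```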

I don't describe the details here, but please also try the other (K80 or P100) instances.

ResNet-50 classification

| | single inference | batch inference |
|---|---|---|
| General Purpose CPU (Azure DS3 v2) | 694 ms/image | 649 ms/image |
| V100 (Azure NCv3) | 6.73 ms/image | 1.34 ms/image |
| V100 (Azure NCv3) with TensorRT | 3.42 ms/image | 0.47 ms/image |

(Running on : Ubuntu 16.04, CUDA 9.0, cuDNN 7.1.4, Python 3.5.2, TensorFlow 1.8, TensorRT 4.0.1)

As we saw in my previous post, you must take a transfer-learning approach with pre-built images when you apply Project Brainwave (FPGA) inference to your models. With NVIDIA TensorRT, you can quickly bring your own models into optimized real-time inference by transforming your own trained model into an optimized graph def.

A GPU might be a more expensive choice compared with FPGA devices, but this flexibility will meet your needs in many scenarios.

Note : Currently several types of nodes are not supported in TensorRT, so some advanced graphs might fail in the transformation. I hope this will be improved in the future.

Note : You can also use Google TPU for high-speed inference, but single inference seems to incur some overhead and currently results in relatively high latency (over 10 ms). (See "Cloud TPU – Troubleshooting and FAQ" for details.)
