Profiling using Tensorboard-Profiler

Dec 7, 2020

This blog post shows how to install TensorFlow 2.2 on POWER, how to use the TensorBoard profiler, and how the profiling results compare across different architectures (x86, POWER8 and POWER9).

Prerequisites

In this part I’ll show how to set up your virtual machine (VM) and install TensorFlow 2.2 on POWER. My pip version is 20.3 and my Python version is 3.8.

First we need to install some libraries that TensorFlow 2.2 depends on.

Installing the dependencies of scipy:

    sudo apt-get install libblas-dev liblapack-dev libatlas-base-dev gfortran

Installing h5py:

    sudo apt install python3-h5py

Installing the Keras helper packages using pip:

    pip3 install -U --user keras_applications --no-deps
    pip3 install -U --user keras_preprocessing --no-deps

Now we are able to install TensorFlow 2.2. Go to the TensorFlow repository (https://github.com/tensorflow/tensorflow) to download the .whl file, which pip uses to install TensorFlow. In the Community Supported Builds section, find the Linux ppc64le CPU Stable Release row and click its release 2.x Artifacts link.

Figure 1: TensorFlow 2.2 CPU-only installation.

After clicking, we are directed to Jenkins, where we click tensorflow_cpu-2.2.0-cp38-cp38-linux_ppc64le.whl. Note that “cp38” indicates that this build targets Python 3.8. If you are using a different version of Python, download the build that matches it. This tutorial uses Python 3.8.
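
As a quick check before downloading, the tags in the wheel file name can be compared with the interpreter you are running. The sketch below is only an illustration: the helper names `parse_wheel_filename` and `interpreter_tag` are hypothetical, and it assumes a CPython interpreter and a wheel name without the optional build tag.

```python
import sys


def parse_wheel_filename(filename):
    """Split a wheel file name into its dash-separated parts.

    Assumes the common five-part form:
    name-version-pythontag-abitag-platformtag.whl
    """
    stem = filename[:-len(".whl")] if filename.endswith(".whl") else filename
    name, version, python_tag, abi_tag, platform_tag = stem.split("-")
    return {
        "name": name,
        "version": version,
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }


def interpreter_tag():
    """Return the CPython tag of the running interpreter, e.g. cp38 for Python 3.8."""
    return f"cp{sys.version_info.major}{sys.version_info.minor}"


wheel = "tensorflow_cpu-2.2.0-cp38-cp38-linux_ppc64le.whl"
info = parse_wheel_filename(wheel)
print("wheel targets:", info["python_tag"], "| this interpreter:", interpreter_tag())
```

If the two tags differ, pip will normally refuse to install the wheel, so it is worth picking the matching build first.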

Figure 2: Downloading the TensorFlow 2.2 .whl build.

To download the file in the VM, copy the link to tensorflow_cpu-2.2.0-cp38-cp38-linux_ppc64le.whl and use the command below:

    wget https://powerci.osuosl.org/job/TensorFlow2_PPC64LE_CPU_Release_Build/lastSuccessfulBuild/artifact/tensorflow_pkg/tensorflow_cpu-2.2.0-cp38-cp38-linux_ppc64le.whl

Now install TensorFlow using pip:

    pip3 install tensorflow_cpu-2.2.0-cp38-cp38-linux_ppc64le.whl

For more information you can visit https://www.tensorflow.org/install/source.

Now we need to install tensorboard, tensorboard-plugin-profile and tensorflow-datasets.

    pip3 install --upgrade tensorboard
    pip3 install tensorflow-datasets
    pip3 install -U tensorboard_plugin_profile
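
To confirm that the packages above ended up in the environment, their installed versions can be listed from Python. This is a minimal sketch using the standard `importlib.metadata` module (available from Python 3.8). The helper `installed_version` is hypothetical, and the distribution names in the loop are assumptions based on the pip commands and the wheel name used in this post.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names assumed from the install commands in this post.
for dist in (
    "tensorflow-cpu",
    "tensorboard",
    "tensorflow-datasets",
    "tensorboard-plugin-profile",
):
    print(f"{dist}: {installed_version(dist) or 'not installed'}")
```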

Get access to POWER 8 VM in minicloud

Here is a brief guide to accessing a POWER8 virtual machine in minicloud. First, go to https://openpower.ic.unicamp.br/minicloud/, click Request Access and fill in the Google form. The minicloud wiki at https://github.com/Unicamp-OpenPower/minicloud/wiki may also help you get access to an instance. In the next section I’ll show how to access TensorBoard from the terminal.

SSH connection

To be able to use TensorBoard from the terminal, you’ll need to connect to the VM via ssh using the -L 6006:localhost:6006 flag, which forwards port 6006 on your local machine to port 6006 on the VM. Your command should look like this:

    ssh ubuntu@minicloud.parqtec.unicamp.br -i ~/.ssh/your-key.pem -p <vm-port> -L 6006:localhost:6006

To start TensorBoard from the terminal, use this command:

    tensorboard --logdir=<name_of_log_directory>

Now we are able to open the link that TensorBoard prints (http://localhost:6006 by default) in your favorite browser.
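
If the page does not load, it can help to check from your local machine whether anything is answering on the forwarded port. The sketch below uses only the standard library. The helper name `port_is_open` is hypothetical, and port 6006 is assumed because it matches the ssh flag above.

```python
import socket


def port_is_open(host="localhost", port=6006, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


print("TensorBoard reachable on localhost:6006:", port_is_open())
```

A result of False usually means that either the ssh tunnel is not active or TensorBoard is not running on the VM.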

Compare Tensorboard-Profiler in different architectures

In this section, we will profile the same script with the TensorBoard profiler on different architectures and show the results. First, we standardize the test file. Download the script available at https://www.tensorflow.org/tensorboard/tensorboard_profiling_keras and change the lines:

    model.fit(ds_train,
              epochs=2,
              validation_data=ds_test,
              callbacks=[tboard_callback])

to:

    model.fit(ds_train,
              epochs=5,
              validation_data=ds_test,
              callbacks=[tboard_callback])

Now we are ready to execute the script and debug performance bottlenecks using Tensorboard-Profiler.
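
For reference, the `tboard_callback` passed to `model.fit` is a Keras TensorBoard callback with profiling enabled. The sketch below shows roughly how such a callback can be built. The helper names `make_log_dir` and `make_profiler_callback` are hypothetical, and the `profile_batch="500,520"` range is an assumption based on my recollection of the TensorFlow tutorial, so check it against the script you downloaded.

```python
from datetime import datetime


def make_log_dir(root="logs"):
    """Return a timestamped log directory path, e.g. logs/20201207-101500."""
    return f"{root}/{datetime.now().strftime('%Y%m%d-%H%M%S')}"


def make_profiler_callback(log_dir, profile_batch="500,520"):
    """Build a TensorBoard callback that profiles the given range of batches."""
    # Imported lazily so make_log_dir works even without TensorFlow installed.
    import tensorflow as tf

    return tf.keras.callbacks.TensorBoard(
        log_dir=log_dir,
        histogram_freq=1,
        profile_batch=profile_batch,  # assumed range of batches to profile
    )


log_dir = make_log_dir()
print("pass this directory to tensorboard --logdir:", log_dir)
```

With TensorFlow installed, `tboard_callback = make_profiler_callback(log_dir)` gives an object that can be passed in the `callbacks` list shown above.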

Sometimes when running TensorFlow we get errors like: AttributeError: partially initialized module 'tensorflow' has no attribute 'version' (most likely due to a circular import). To fix this error you can run the script with the -m flag, which takes a module name, that is, the file name without the .py extension:

    python3 -m <your-python-module>
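
In general Python usage, one common cause of a "partially initialized module" error is a file or directory in the working directory that has the same name as the library being imported. I have not confirmed that this was the cause here, so treat the following as a hedged diagnostic sketch only. The helper name `shadowing_candidates` is hypothetical.

```python
from pathlib import Path


def shadowing_candidates(module_name, directory="."):
    """List local files or directories that could shadow an installed module."""
    base = Path(directory)
    candidates = [base / f"{module_name}.py", base / module_name]
    return [str(path) for path in candidates if path.exists()]


# An empty list means nothing in the current directory is named like the module.
print(shadowing_candidates("tensorflow"))
```

If the list is not empty, renaming the local file or directory is usually worth trying before anything else.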

After running the script in different architectures we obtain the following results:

Input pipeline analyzer:

Data preprocessing (ms)

Table 1: Data preprocessing time (ms) on different architectures

x86	POWER8	POWER9
390	180	164

Reading data from files in advance (including caching, prefetching, interleaving) (ms):

Table 2: Time spent reading data from files in advance (ms) on different architectures

x86	POWER8	POWER9
6.7	~0	~0

TensorFlow Stats:

Most time-consuming operations:

Table 3: Most time-consuming operations on x86

Type	Operation	Occurrences	Total time (us)
Dataset	Iterator::Model::MapAndBatch	21	112,868
Dataset	Iterator::Model::MapAndBatch::Prefetch::ParallelMap	2,907	106,162
Dataset	Iterator::Model::MapAndBatch::Prefetch::ParallelMap::ParallelMap	2,907	104,103
Decode Png	decode_image/cond_jpeg/else/_1/cond_png/then/_0/DecodePng	2,910	79,162

Table 4: Most time-consuming operations on POWER8

Type	Operation	Occurrences	Total time (us)
Dataset	Iterator::Model::MapAndBatch	21	131,659
Dataset	Iterator::Model::MapAndBatch::Prefetch::ParallelMap	2,673	23,513
MatMul	gradient_tape/sequential/dense/MatMul	21	16,335
Dataset	Iterator::Model::MapAndBatch::Prefetch	2,673	15,375

Table 5: Most time-consuming operations on POWER9

Type	Operation	Occurrences	Total time (us)
Dataset	Iterator::Model::MapAndBatch	21	114,562
_FusedMatMul	sequential/dense/Relu	21	30,306
Dataset	Iterator::Model::MapAndBatch::Prefetch::ParallelMap	2,676	20,957
Dataset	Iterator::Model::MapAndBatch::Prefetch	2,675	16,722

Now we can analyze the data and compare the different architectures. First, we note that x86 spends more time on data preprocessing and on reading data from files in advance (Tables 1 and 2). TensorFlow Stats breaks the entire program down into operations, but I show only the top 4 most time-consuming operations in Tables 3, 4 and 5; you can see all operations in the TensorFlow Stats section of the TensorBoard profiler. Tables 3, 4 and 5 show that the types of operation differ slightly: on x86 (Table 3) Decode Png is in the top 4, whereas on the POWER architectures (Tables 4 and 5) a matrix multiplication (MatMul or _FusedMatMul) appears instead. In all three architectures, however, Dataset operations are highly time-consuming.
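
To put the Table 1 numbers in proportion, the preprocessing times can be expressed relative to x86. The sketch below only redoes the arithmetic on the values reported above. The helper name `relative_to` is hypothetical, and a single run per machine is not a rigorous benchmark.

```python
# Data preprocessing times in ms, copied from Table 1.
preprocessing_ms = {"x86": 390, "POWER8": 180, "POWER9": 164}


def relative_to(baseline, timings):
    """Return how many times longer the baseline takes than each entry."""
    base = timings[baseline]
    return {arch: base / value for arch, value in timings.items()}


ratios = relative_to("x86", preprocessing_ms)
for arch, ratio in ratios.items():
    print(f"{arch}: x86 preprocessing time is {ratio:.2f}x this architecture's time")
```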

An interesting feature of the TensorBoard profiler is Recommendation for Next Step. It highlights optimizations that could improve your program. For example, when I run my program on POWER8, I get recommendations like:

Your program is HIGHLY input-bound because 68.8% of the total step time sampled is waiting for input. Therefore, you should first focus on reducing the input time
7.3 % of the total step time sampled is spent on All Others time.

Next tools to use for reducing the input time

input_pipeline_analyzer (especially Section 3 for the breakdown of input operations on the Host)
trace_viewer (look at the activities on the timeline of each Host Thread near the bottom of the trace view)

Julio Kiyoshi

Computer Engineering undergrad student

I’m an undergrad student in Computer Engineering at UNICAMP.