Profiling using Tensorboard-Profiler

This blog post will show how to install tensorflow 2.2 in POWER, how to use profiler and make a comparison between different architectures ( x86, POWER 8 and 9).


In this part I’ll show how to setup your Virtual Machine (VM) and install tensorflow 2.2 in POWER. My PIP version is 20.3 and my version of python is 3.8.

First we need to install some libraries to install tensorflow 2.2.

Installing dependecies of scipy:

    sudo apt-get install libblas-dev liblapack-dev libatlas-base-dev gfortran

Installing h5py:

    sudo apt install python3-h5py  

Installing keras using pip:

    pip3 install -U --user keras_applications --no-deps
    pip3 install -U --user keras_preprocessing --no-deps

Now we are able to install tensorflow 2.2. For this, access the site ( to download .whl file (this file is used to install tensorflow using pip comand). First go in Community Supported Builds Section, and click in Artifacts release 2.x of Linux ppc64le CPU Stable Release.

Tensorflow installation

Figure 1: Tensorflow 2.2 cpu- only installation.

After clicking we are directed to jenkins, where we click in tensorflow_cpu-2.2.0-cp38-cp38-linux_ppc64le.whl. Note that “cp38” indicates that the tensorflow should be installed in python 3.8. However, if you are using different versions of python you can download the version corresponding to your python version. But in this tutorial I’ll show how to setup using python 3.8.

Tensorflow installationpt2

Figure 2: Download .whl tensorflow 2.2 build.

For download in VM, copy the link (tensorflow_cpu-2.2.0-cp38-cp38-linux_ppc64le.whl) and use the command below:


Now for installation of the tensorflow using pip command:

    pip3 install tensorflow_cpu-2.2.0-cp38-cp38-linux_ppc64le.whl  

For more information you can visit

Now we need to install tensorboard, tensorboard-plugin-profiler and tensorflow-datasets.

    pip3 install --upgrade tensorboard
    pip3 install tensorflow-datasets
    pip3 install -U tensorboard_plugin_profile

Get access to POWER 8 VM in minicloud

Here is a brief tutorial on how to access POWER 8 virtual machine in minicloud, first access and click in Request Access and answer the google forms to get access. Here is a link that may help you to get access to an instance on minicloud In the next section I’ll show to access tensorboard by terminal.

SSH connection

You’ll need to connect to VM via ssh using the -L 6006:localhost:6006 flag. To be able to use tensorboard in the terminal, your command should be like this:

ssh -i ~/.ssh/your-key.pem -p <vm-port> -L 6006:localhost:6006

For using tensorboard in the terminal we use this command:

    tensorboard --logdir=<name_of_log_directory>

Now we are able to open the link in your favorite browser.

Compare Tensorboard-Profiler in different architectures

In this section, we will be profiling using Tensorboard-Profiler in different architectures and showing the results. First, we will standardize the test file. For this, download the file available in and modify the line:,
          callbacks = [tboard_callback])

          callbacks = [tboard_callback])

Now we are ready to execute the script and debug performance bottlenecks using Tensorboard-Profiler.

Sometimes when running Tensorflow we get some errors like: AttributeError: partially initialized module ‘tensorflow’ has no attribute ‘version’ (most likely due to a circular import). To fix this error you can use the flag -m.

    python3 -m <your-python-file>

After running the script in different architectures we obtain the following results:

Input pipeline analyzer:

  • Data preprocessing (ms)

Table 1: Data preprocessing in different architectures

390 180 164
  • Reading data from files in advance (including caching, prefetching, interleaving) (in ms):

Table 2: Reading data from files in advance in different architectures

6.7 ~ 0 ~0

Tensorflow stats:

  • Operations which consume more time:

Table 3: Operations which consume more time in x86

Type Operation Occurrences total time (us)
Dataset Iterator::Model::MapAndBatch 21 112,868
Dataset Iterator::Model::MapAndBatch::Prefetch::ParallelMap 2.907 106,162
Dataset Iterator::Model::MapAndBatch::Prefetch::ParallelMap::ParallelMap 2.907 104,103
Decode Png decode_image/cond_jpeg/else/_1/cond_png/then/_0/DecodePng 2.910 79,162

Table 4: Operations which consume more time in POWER8

Type Operation Occurrences total time (us)
Dataset Iterator::Model::MapAndBatch 21 131,659
Dataset Iterator::Model::MapAndBatch::Prefetch::ParallelMap 2.673 23,513
MatMul gradient_tape/sequential/dense/MatMul 21 16,335
Dataset Iterator::Model::MapAndBatch::Prefetch 2.673 15,375

Table 5: Operations which consume more time in POWER9

Type Operation Occurrences total time (us)
Dataset Iterator::Model::MapAndBatch 21 114,562
_FusedMatMul sequential/dense/Relu 21 30,306
Dataset Iterator::Model::MapAndBatch::Prefetch::ParallelMap 2.676 20,957
Dataset Iterator::Model::MapAndBatch::Prefetch 2.675 16,722

Now we can analyze the dada and compare beteween different architectures. First we note that x86 consumes more time for data preprocessing and reading data from files in advance (Tables 1 and 2). In Tensorflow stats we can crack the entire code in operations but I’ll show only the top 4 time-consuming operations in tables 3, 4 and 5. However, you can get all operations in tensorboard-profiler in section Tensorflow Stats. From tables 3, 4 and 5 we obtain that the type of operation differs a little, for example in table 3 we have Decode Png in top 4, whereas in power architectures (Tables 4 and 5) we have matmul. But in all 3 architectures Dataset is highly time-consuming.

An interesting function in tensorboard-profiler is Recommendation for Next Step. This function highlights some otimizations that could improve your program, for exemple, when I execute my program in POWER 8 we have some recommendations like:

  • Your program is HIGHLY input-bound because 68.8% of the total step time sampled is waiting for input. Therefore, you should first focus on reducing the input time
  • 7.3 % of the total step time sampled is spent on All Others time.

Next tools to use for reducing the input time

  • input_pipeline_analyzer (especially Section 3 for the breakdown of input operations on the Host)
  • trace_viewer (look at the activities on the timeline of each Host Thread near the bottom of the trace view)
Julio Kiyoshi
Computer Engineering undergrad student

I’m an undergrad student in Computer Engineering at UNICAMP.