CUDA
Profiler

Published by
   NVIDIA Corporation
   2701 San Tomas Expressway
   Santa Clara, CA 95050

Notice

ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, 
DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, 
"MATERIALS") ARE BEING PROVIDED "AS IS". NVIDIA MAKES NO WARRANTIES, 
EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, 
AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, 
MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.

Information furnished is believed to be accurate and reliable. However, 
NVIDIA Corporation assumes no responsibility for the consequences of use 
of such information or for any infringement of patents or other rights 
of third parties that may result from its use. No license is granted by 
implication or otherwise under any patent or patent rights of NVIDIA 
Corporation. Specifications mentioned in this publication are subject 
to change without notice. This publication supersedes and replaces all 
information previously supplied. NVIDIA Corporation products are not 
authorized for use as critical components in life support devices or 
systems without express written approval of NVIDIA Corporation. 

Trademarks

NVIDIA, CUDA, and the NVIDIA logo are trademarks or registered trademarks 
of NVIDIA Corporation in the United States and other countries. Other 
company and product names may be trademarks of the respective companies 
with which they are associated.

Copyright

(C) 2005-2009 by NVIDIA Corporation. All rights reserved.


The CUDA Profiler
=================

This release of CUDA includes a simple profiler that allows users
to gather timing information about kernel execution and memory 
transfer operations. The profiler can be used to identify performance
bottlenecks in multi-kernel applications or to quantify the benefit of
optimizing a single kernel.

Profiler Control
----------------

The CUDA profiler is controlled using the following environment 
variables:

CUDA_PROFILE

is set to either 1 or 0 (or unset) to enable or disable profiling.

CUDA_PROFILE_LOG

is set to the desired file path for profiling output. If no log path is
specified, the profiler logs data to ./cuda_profile.log. When profiling
multiple devices, include '%d' in the CUDA_PROFILE_LOG name. The profiler
then generates a separate output file for each device, with '%d'
replaced by the device number.


CUDA_PROFILE_CSV

is set to either 1 or 0 (or unset) to enable or disable a
comma-separated version of the log output.

CUDA_PROFILE_CONFIG

is used to specify a config file for enabling performance counters
in the GPU. See the next section for configuration details.
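
As a minimal sketch of how these variables might be set from a launcher
script, the following Python helper builds the environment for a profiled
run. The helper name 'profiler_env' and its parameters are hypothetical;
only the four environment variable names come from this document.

```python
import os


def profiler_env(log_path=None, csv=False, config_path=None, base_env=None):
    """Return a copy of the environment with the CUDA profiler enabled.

    Hypothetical helper: only the CUDA_PROFILE* variable names are taken
    from the profiler documentation.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_PROFILE"] = "1"  # 1 enables profiling, 0 or unset disables it
    if log_path is not None:
        # A '%d' in the path is replaced by the device number by the profiler.
        env["CUDA_PROFILE_LOG"] = log_path
    if csv:
        env["CUDA_PROFILE_CSV"] = "1"
    if config_path is not None:
        env["CUDA_PROFILE_CONFIG"] = config_path
    return env
```

The result can be passed as the 'env' argument of subprocess.run when
starting the application, for example with
profiler_env(log_path="profile_%d.log", csv=True) to request one
comma-separated log per device.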


Profiler Configuration
----------------------

This version of the CUDA profiler supports configuration options that
allow users to gather statistics about various events occurring in the
GPU during execution. These events are tracked with hardware counters on
signals in the chip.

The profiler supports the following options:

timestamp        : Time stamps for kernel launches and memory transfers.
                   This can be used for timeline analysis.

gridsize         : Number of blocks in a grid along the X and Y dimensions for 
                   a kernel launch          

threadblocksize  : Number of threads in a block along the X, Y and Z dimensions 
                   for a kernel launch

dynsmemperblock  : Size of dynamically allocated shared memory per block in bytes
                   for a kernel launch

stasmemperblock  : Size of statically allocated shared memory per block in bytes
                   for a kernel launch

regperthread     : Number of registers used per thread for a kernel launch.

memtransferdir   : Memory transfer direction. A value of 0 indicates a host->device
                   memory copy and a value of 1 indicates a device->host memory copy.

memtransfersize  : Memory copy size in bytes

streamid         : Stream Id for a kernel launch


The profiler supports logging of the following counters during kernel execution:

gld_incoherent   : Non-coalesced (incoherent) global memory loads

gld_coherent     : Coalesced (coherent) global memory loads

gld_32b          : 32-byte global memory load transactions

gld_64b          : 64-byte global memory load transactions

gld_128b         : 128-byte global memory load transactions

gld_request      : Global memory loads

gst_incoherent   : Non-coalesced (incoherent) global memory stores

gst_coherent     : Coalesced (coherent) global memory stores

gst_32b          : 32-byte global memory store transactions

gst_64b          : 64-byte global memory store transactions

gst_128b         : 128-byte global memory store transactions

gst_request      : Global memory stores

local_load       : Local memory loads

local_store      : Local memory stores

branch           : Branches taken by threads executing a kernel

divergent_branch : Divergent branches taken by threads executing a kernel

instructions     : Instructions executed

warp_serialize   : Number of thread warps that serialize on address conflicts 
                   to either shared or constant memory

cta_launched     : Number of thread blocks executed


At most 4 profiler counters can be enabled at a time.

Options can be commented out using the '#' character at the start of a line. 
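
The following sketch generates the text of a configuration file and
enforces the 4-counter limit. It assumes that the file lists one option
or counter name per line, which is inferred from the '#' comment rule
rather than stated here. The function name 'make_profiler_config' is
hypothetical.

```python
MAX_COUNTERS = 4  # counter limit stated in this document


def make_profiler_config(options, counters, disabled=()):
    """Return config-file text with one name per line (assumed layout).

    Names in 'disabled' are kept in the file but commented out with '#'.
    """
    if len(counters) > MAX_COUNTERS:
        raise ValueError(
            f"at most {MAX_COUNTERS} counters allowed, got {len(counters)}"
        )
    lines = list(options) + list(counters)
    lines += ["#" + name for name in disabled]
    return "\n".join(lines) + "\n"
```

Writing the returned text to a file and pointing CUDA_PROFILE_CONFIG at
that file would then select the listed options for the next run.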


Profiler Output
---------------

While CUDA_PROFILE is set, the profiler log records timing information for
every kernel launch and memory operation performed by the driver.  The
default log syntax follows a simple form:

id=[ value ]

For example, here is part of the log from a test of the haar1dwt application 
(without any counters enabled):

# CUDA_PROFILE_LOG_VERSION 1.4
# CUDA_DEVICE_NAME 0 GeForce GTX 280			
timestamp,method,gputime,cputime,occupancy
timestamp=[ 2155.302 ] method=[ _Z10fhaar1dwtdiPf ] gputime=[ 7.808 ] cputime=[ 74.730 ] occupancy=[ 1.000 ] 
timestamp=[ 2421.886 ] method=[ memcopy ] gputime=[ 4.864 ] cputime=[ 238.159 ] 
timestamp=[ 2706.140 ] method=[ _Z10ihaar1dwtdiPf ] gputime=[ 7.296 ] cputime=[ 59.295 ] occupancy=[ 1.000 ] 
timestamp=[ 2876.413 ] method=[ memcopy ] gputime=[ 4.608 ] cputime=[ 224.679 ] 

This log shows data for memory copies and a few different kernel launches.
The 'method' label specifies which GPU function was executed by the
driver. The 'gputime' and 'cputime' labels specify the actual chip
execution time and the driver execution time (including gputime),
respectively. Note that all times are in microseconds. The 'occupancy'
label gives the warp occupancy, expressed as a fraction of the maximum
warp count in the GPU, for a particular method launch. An occupancy of
1.000 means the chip is completely full.
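
The 'id=[ value ]' syntax can be read with a short regular expression.
The sketch below parses log lines into dictionaries and sums 'gputime'
per method. The function names are hypothetical, and the pattern assumes
that values never contain a ']' character.

```python
import re
from collections import defaultdict

# Matches one 'id=[ value ]' pair; the value is captured without padding.
FIELD_RE = re.compile(r"(\w+)=\[\s*(.*?)\s*\]")


def parse_log_line(line):
    """Return a dict mapping each label on the line to its string value."""
    return dict(FIELD_RE.findall(line))


def gputime_by_method(lines):
    """Sum gputime (microseconds) per method, skipping header lines."""
    totals = defaultdict(float)
    for line in lines:
        # '#' lines and the column-name line contain no 'id=[ value ]' pairs.
        if line.startswith("#") or "=[" not in line:
            continue
        record = parse_log_line(line)
        totals[record["method"]] += float(record["gputime"])
    return dict(totals)
```

Applied to the haar1dwt log above, this would report the two memcopy
entries under a single 'memcopy' key and each kernel under its mangled
name.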

Another example shows the profiler log of a matrix multiplication
application with a few counters enabled. It also includes the new options
for the memcopy method:

# CUDA_PROFILE_LOG_VERSION 1.4
# CUDA_DEVICE_NAME 0 GeForce GTX 280			
timestamp,method,gputime,cputime,gridSizeX,gridSizeY,blockSizeX,blockSizeY,blockSizeZ,occupancy,instructions,branch,
cta_launched,memTransferSize,memTransferDir
timestamp=[ 6492.515 ] method=[ _Z10dmatrixmulPfiiS_iiS_ ] gputime=[ 25.472 ] cputime=[ 203.797 ] gridSize=[ 2, 1 ] threadCountPerBlock=[ 32, 8, 8 ] occupancy=[ 0.333 ] instructions=[ 2261 ] branch=[ 312 ] cta_launched=[ 2 ] 
timestamp=[ 7031.061 ] method=[ memcopy ] gputime=[ 8.896 ] cputime=[ 230.686 ] memtransfersize=[ 8192 ] memtransferdir=[ 1 ]

This log shows some of the new fields: gridSize shows the number of blocks
in the x and y directions, and threadCountPerBlock shows the number of
threads in a block in the x, y and z directions. The profiler now also
reports the cputime, transfer size, and direction of pageable memory copies.
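
The memcopy fields can be combined into an effective transfer rate.
Since sizes are in bytes and times in microseconds, bytes divided by
microseconds gives units of 10^6 bytes per second. The sketch below is
illustrative only: the function name is hypothetical, and it assumes that
'gputime' covers the whole transfer.

```python
# Direction codes as described for the memtransferdir option.
DIRECTIONS = {0: "host->device", 1: "device->host"}


def describe_memcopy(record):
    """Summarize a parsed memcopy record (string values, as in the log)."""
    size_bytes = int(record["memtransfersize"])
    gputime_us = float(record["gputime"])
    direction = DIRECTIONS[int(record["memtransferdir"])]
    # bytes per microsecond == 10^6 bytes per second
    rate_mb_s = size_bytes / gputime_us
    return f"{direction}: {size_bytes} bytes at {rate_mb_s:.1f} MB/s"
```

For the memcopy entry in the log above (8192 bytes in 8.896 microseconds,
direction 1), this yields a device->host rate of about 920.9 MB/s.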


The default log syntax is easy to parse with a script, but for spreadsheet
analysis it might be easier to use the comma-separated version. When
CUDA_PROFILE_CSV is set to 1, this same test produces the following
output:

# CUDA_PROFILE_LOG_VERSION 1.4
# CUDA_PROFILE_CSV 1
# CUDA_DEVICE_NAME 0 GeForce GTX 280			
timestamp,method,gputime,cputime,gridSizeX,gridSizeY,blockSizeX,blockSizeY,blockSizeZ,occupancy,cta_launched,branch,
instructions,memTransferSize,memTransferDir
6390.687,_Z10dmatrixmulPfiiS_iiS_,25.184,203.168,2,1,32,8,8,0.333,312,312,2261
6946.483,memcopy,8.928,240.673,,,,,,,,,,8192,1
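
The comma-separated log can also be loaded with Python's standard csv
module. This sketch assumes that the column-name header occupies a single
line in the actual file (it is wrapped in the listing above) and that '#'
lines are metadata to skip. The function name is hypothetical.

```python
import csv


def read_profile_csv(lines):
    """Return one dict per log row, keyed by the header column names.

    Rows shorter than the header (for example kernel rows without the
    memory transfer columns) get empty strings for the missing fields.
    """
    data = [ln for ln in lines if ln.strip() and not ln.startswith("#")]
    return list(csv.DictReader(data, restval=""))
```

Passing an open file object, or a list of its lines, returns the rows in
log order, with all values kept as strings.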

Interpreting Profiler Counters
------------------------------

The performance counter values do not correspond to individual thread activity.
Instead, these values represent events within a thread warp. For example, a
divergent branch within a thread warp will increment the divergent_branch
counter by one. So the final counter value stores information for all divergent
branches in all warps. 

In addition, the profiler can only target one of the multiprocessors in the
GPU, so the counter values will not correspond to the total number of warps
launched for a particular kernel. For this reason, when using the performance
counter options in the profiler, the user should always launch enough thread
blocks to ensure that the target multiprocessor is given a consistent
percentage of the total work. In practice, it is best to launch at least around
100 blocks for consistent results.
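
The block-count guideline can be checked against the grid dimensions
reported in the log. The helper below is a hypothetical sketch that
treats the figure of 100 blocks as a rule of thumb, not a hard limit.

```python
MIN_BLOCKS = 100  # rule of thumb from this document


def enough_blocks(grid_x, grid_y, minimum=MIN_BLOCKS):
    """Return True if the grid launches at least 'minimum' thread blocks."""
    return grid_x * grid_y >= minimum
```

For example, a launch logged with gridSize=[ 2, 1 ] has only 2 blocks, so
its counter values may not be consistent from run to run.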

For the reasons listed above, users should not expect the counter values to
match the numbers one would get by inspecting kernel code. The values are
best used to identify relative performance differences between unoptimized and
optimized code. For example, if for the initial version of the program the
profiler reports N non-coalesced global loads, it is easy to see if the optimized
code produces fewer than N non-coalesced loads. In most cases, the goal is to make 
N go to 0, so the counter value is useful for tracking progress toward this goal. 
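
One way to track such progress is to compare the counter values of two
runs. The following sketch takes counters already parsed into
dictionaries of integers. The function name is hypothetical.

```python
def counter_changes(before, after):
    """Compare counters present in both runs.

    Returns {name: (before, after, fractional_change)}; the fractional
    change is None when the 'before' value is 0.
    """
    changes = {}
    for name in before.keys() & after.keys():
        b, a = before[name], after[name]
        rel = (a - b) / b if b else None
        changes[name] = (b, a, rel)
    return changes
```

A fractional change of -1.0 for gld_incoherent would mean that the
optimized code eliminated all non-coalesced loads seen by the profiled
multiprocessor.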

 
Known Issues
------------
1. Due to improved memory coalescing hardware, the gld_incoherent 
and gst_incoherent signals will always be zero on GTX 280 and GTX 260.

2. Certain memory copy procedures are not included in profiler output.

