--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
NVIDIA CUDA
Linux Release Notes
Version 2.2
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

On some Linux releases, a GRUB bug in the handling of upper memory and a
default vmalloc size that is too small on 32-bit systems may make it
necessary to pass the following settings to the bootloader:

vmalloc=256MB, uppermem=524288

Example grub.conf entry:

title Red Hat Desktop (2.6.9-42.ELsmp)
root (hd0,0)
uppermem 524288
kernel /vmlinuz-2.6.9-42.ELsmp ro root=LABEL=/1 rhgb quiet vmalloc=256MB pci=nommconf
initrd /initrd-2.6.9-42.ELsmp.img

--------------------------------------------------------------------------------
New Features
--------------------------------------------------------------------------------

  Hardware Support
  o  See http://www.nvidia.com/object/cuda_learn_products.html

  Platform Support
  o  Additional OS support
     - Red Hat Enterprise Linux 5.3
     - SUSE Linux 11.1
     - Fedora 10
     - Ubuntu 8.10
  o  Eliminated OS support
     - SUSE Linux 10.3
     - Fedora 8
     - Ubuntu 7.10

  API Features
  o Pinned Memory Support
     - These new memory management functions (cuMemHostAlloc() and
       cudaHostAlloc()) enable pinned memory to be made "portable" (available
       to all CUDA contexts), "mapped" (mapped into the CUDA address space),
       and/or "write combined" (not cached and faster for the GPU to access).
     - cuMemHostAlloc
     - cuMemHostGetDevicePointer
     - cudaHostAlloc
     - cudaHostGetDevicePointer
  o Function attribute query
     - cuFuncGetAttribute() allows applications to query properties of a
       kernel function.
     - cuFuncGetAttribute
  o 2D Texture reads from pitch linear memory
     - Linear memory obtained from cuMemAlloc() or cudaMalloc() can now be
       bound directly to a 2D texture. In previous releases, only arrays
       created with cuArrayCreate() or cudaMallocArray() could be bound to
       2D textures.
     - cuTexRefSetAddress2D
     - cudaBindTexture2D
  o Flags for event creation
     - Applications can now create events that use blocking synchronization.
     - cudaEventCreateWithFlags
  o New device management and context creation flags
     - The function cudaSetDeviceFlags() allows the application to specify
       attributes such as mapping host memory and support for blocking
       synchronization.
     - cudaSetDeviceFlags
  o Improved runtime device management
     - The runtime now defaults to attempting context creation on other
       devices in the system before returning any failure messages. The new
       call cudaSetValidDevices() allows the application to specify a list of
       acceptable devices for use.
     - cudaSetValidDevices
  o Driver/runtime version query functions
     - Applications can now directly query version information about the
       underlying driver/runtime.
     - cuDriverGetVersion
     - cudaDriverGetVersion
     - cudaRuntimeGetVersion
  o New device attribute queries
     - CU_DEVICE_ATTRIBUTE_INTEGRATED
     - CU_DEVICE_ATTRIBUTE_CAN_MAP_HOST_MEMORY
     - CU_DEVICE_ATTRIBUTE_COMPUTE_MODE
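
  The version query functions listed above each return a single integer.
  As a minimal sketch, assuming the value uses the same encoding as the
  CUDA_VERSION macro in cuda.h (1000 * major + 10 * minor, so this release
  would report 2020), it can be split into its parts as follows. The type
  and function names are illustrative and not part of the CUDA API.

```cpp
// Illustrative helper type (not part of CUDA): a decoded version number.
struct CudaVersion {
    int major_part;  // e.g. 2 for CUDA 2.2
    int minor_part;  // e.g. 2 for CUDA 2.2
};

// Splits an integer such as 2020 into {2, 2}, assuming the encoding
// 1000 * major + 10 * minor (an assumption based on CUDA_VERSION in cuda.h).
inline CudaVersion decode_cuda_version(int encoded) {
    CudaVersion v;
    v.major_part = encoded / 1000;
    v.minor_part = (encoded % 1000) / 10;
    return v;
}
```

  In a real program the argument would come from cudaDriverGetVersion() or
  cudaRuntimeGetVersion().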

  Documentation
  o Doxygen-generated and cross-referenced HTML, PDF, and man pages.
     - Runtime API
     - Driver API

--------------------------------------------------------------------------------
Major Bug Fixes
--------------------------------------------------------------------------------


--------------------------------------------------------------------------------
Known Issues
--------------------------------------------------------------------------------

o GPU enumeration order on multi-GPU systems is non-deterministic and
  may change with this or future releases. Users should make sure to
  enumerate all CUDA-capable GPUs in the system and select the most
  appropriate one(s) to use.
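
  One way to avoid depending on enumeration order is to rank every device
  by its properties and pick the best match. The following is a minimal
  host-side sketch of such a policy, not a recommended or official one.
  DeviceInfo, better() and pick_device() are hypothetical names; in a real
  program the records would be filled from cudaGetDeviceCount() and
  cudaGetDeviceProperties(). The policy itself (prefer a GPU without a
  kernel run time limit, then higher compute capability, then more
  multiprocessors) is an assumption chosen for illustration.

```cpp
#include <vector>

// Stand-in for the fields of interest from cudaDeviceProp (hypothetical
// type; a real program would copy these from cudaGetDeviceProperties()).
struct DeviceInfo {
    int  ordinal;                   // index that would be passed to cudaSetDevice()
    int  cc_major;                  // compute capability, major part
    int  cc_minor;                  // compute capability, minor part
    int  multiProcessorCount;       // number of multiprocessors
    bool kernelExecTimeoutEnabled;  // true when a run time limit applies
};

// Returns true when `a` is a better choice than `b` under this sketch's policy.
inline bool better(const DeviceInfo& a, const DeviceInfo& b) {
    if (a.kernelExecTimeoutEnabled != b.kernelExecTimeoutEnabled)
        return !a.kernelExecTimeoutEnabled;
    if (a.cc_major != b.cc_major) return a.cc_major > b.cc_major;
    if (a.cc_minor != b.cc_minor) return a.cc_minor > b.cc_minor;
    return a.multiProcessorCount > b.multiProcessorCount;
}

// Scans every device and returns the ordinal of the preferred one,
// or -1 when the list is empty.
inline int pick_device(const std::vector<DeviceInfo>& devices) {
    const DeviceInfo* best = nullptr;
    for (const DeviceInfo& d : devices) {
        if (best == nullptr || better(d, *best)) best = &d;
    }
    return best ? best->ordinal : -1;
}
```

  Because the choice is made from the properties rather than the index, the
  result stays the same if the devices are reported in a different order.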

o Individual GPU program launches are limited to a run time
  of less than 5 seconds on a GPU with a display attached.
  Exceeding this time limit causes a launch failure reported
  through the CUDA driver or the CUDA runtime. GPUs without
  a display attached are not subject to the 5-second run time
  restriction. For this reason, it is recommended that CUDA be
  run on a GPU that is NOT attached to an X display.

o In order to run CUDA applications, the CUDA module must be
  loaded and the entries in /dev created.  This may be achieved
  by initializing X Windows, or by creating a script to load the
  kernel module and create the entries.

  An example script (to be run at boot time):

  #!/bin/bash

  # Load the NVIDIA kernel module; give up if that fails.
  modprobe nvidia || exit 1

  # Count the number of NVIDIA controllers found.
  N3D=$(/sbin/lspci | grep -i NVIDIA | grep -c "3D controller")
  NVGA=$(/sbin/lspci | grep -i NVIDIA | grep -c "VGA compatible controller")

  # Create one device node per controller, numbered from 0.
  N=$((N3D + NVGA - 1))
  for i in $(seq 0 $N); do
    mknod -m 666 /dev/nvidia$i c 195 $i
  done

  # Create the control device node.
  mknod -m 666 /dev/nvidiactl c 195 255

o When compiling with GCC, special care must be taken for structs that
  contain 64-bit integers.  This is because GCC aligns long longs
  to a 4-byte boundary by default (on 32-bit x86 targets), while NVCC
  aligns long longs to an 8-byte boundary by default.  Thus, when using
  GCC to compile a file that has a struct or union containing 64-bit
  integers, users must pass the -malign-double option to GCC.
  When using NVCC, this option is automatically passed to GCC.
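
  The layout difference described above can be checked on the host side.
  The following is a minimal sketch; the struct Sample and both helper
  functions are illustrative names, not part of CUDA. It reports where a
  64-bit member lands inside a struct as seen by the host compiler.

```cpp
#include <cstddef>  // offsetof, std::size_t

// Illustrative struct (hypothetical, not from the CUDA headers): one byte
// followed by a 64-bit integer, so the padding depends on how the
// compiler aligns long long.
struct Sample {
    char      tag;
    long long value;
};

// Byte offset of the 64-bit member as laid out by the host compiler.
inline std::size_t value_offset() { return offsetof(Sample, value); }

// True when the host layout uses the 8-byte alignment that NVCC uses.
inline bool matches_nvcc_layout() { return value_offset() == 8; }
```

  With 32-bit GCC and no extra options the offset is expected to be 4, so
  matches_nvcc_layout() would return false. With -malign-double, or on a
  64-bit target, the offset is expected to be 8.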

o "#pragma unroll" sometimes does not unroll loops because of limits in the
  compiler on loop bodies, which may cause a decrease in performance versus
  CUDA 2.0. A user can override this limit on the command line with the
  following nvcc compiler flag:

  nvcc -Xopencc -OPT:unroll_size=200000

  In most cases, this should override the built-in loop unrolling limits.
  Unless a kernel uses #pragma unroll and shows a significant performance drop
  from CUDA 2.0, this flag should not be used.

o cudaThreadExit() may not be called implicitly on host thread exit.
  Until this issue is resolved, developers should call cudaThreadExit()
  explicitly before a host thread exits.

o Cross-compilation with the --machine option is not supported.

o The default compilation mode for host code is now C++. To restore the old
  behavior, use the option --host-compilation=c

o For maximum performance when using multiple byte sizes to access the
  same data, coalesce adjacent loads and stores when possible rather
  than using a union or individual byte accesses. Accessing the data via
  a union may result in the compiler reserving extra memory for the object,
  and accessing the data as individual bytes may result in non-coalesced
  accesses. This will be improved in a future compiler release.
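
  As an illustration of the recommendation above, the following host-side
  sketch reads four bytes with a single 32-bit load, works on them with
  shifts and masks, and writes them back with a single 32-bit store,
  instead of going through a union or four separate byte accesses. The
  function names are illustrative. Inside a kernel the same helpers would
  presumably be declared __device__, with each thread handling its own
  word.

```cpp
#include <cstddef>
#include <cstdint>

// Extracts byte `i` (0 = least significant) from a word that was loaded once.
inline std::uint8_t byte_of(std::uint32_t word, unsigned i) {
    return static_cast<std::uint8_t>((word >> (8u * i)) & 0xFFu);
}

// Packs four bytes into one word so they can be written with a single store.
inline std::uint32_t pack_bytes(std::uint8_t b0, std::uint8_t b1,
                                std::uint8_t b2, std::uint8_t b3) {
    return  static_cast<std::uint32_t>(b0)
         | (static_cast<std::uint32_t>(b1) << 8)
         | (static_cast<std::uint32_t>(b2) << 16)
         | (static_cast<std::uint32_t>(b3) << 24);
}

// Adds 1 to every byte (wrapping at 255) using one load and one store per word.
inline void increment_bytes(std::uint32_t* data, std::size_t count) {
    for (std::size_t k = 0; k < count; ++k) {
        const std::uint32_t w = data[k];  // one 32-bit load
        data[k] = pack_bytes(             // one 32-bit store
            static_cast<std::uint8_t>(byte_of(w, 0) + 1),
            static_cast<std::uint8_t>(byte_of(w, 1) + 1),
            static_cast<std::uint8_t>(byte_of(w, 2) + 1),
            static_cast<std::uint8_t>(byte_of(w, 3) + 1));
    }
}
```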

o OpenGL interoperability
  - OpenGL cannot access a buffer that is currently
    *mapped*. If the buffer is registered but not mapped, OpenGL can do any
    requested operations on the buffer.
  - Deleting a buffer while it is mapped for CUDA results in undefined behavior.
  - Attempting to map or unmap a buffer while the bound context differs
    from the one that was current when the buffer was registered will
    generally result in a program error and should be avoided.
  - Interoperability will use a software path on SLI.
  - Interoperability will use a software path if monitors are attached to
    multiple GPUs and a single desktop spans more than one GPU
    (i.e. X11 Xinerama).

o Sending SIGINT (Ctrl-C) to an application that is currently running a
  kernel on the GPU may not result in a clean shutdown of the process as the
  kernel may continue running for a long time afterwards on the GPU. In such
  cases, a system restart may be necessary before running further CUDA or
  graphics applications.

o Some MPI implementations silently add the current working directory to
  $PATH, which can trigger a segmentation fault in the CUDA driver if you
  do not normally have "." in your $PATH. The executable must be in
  your path to avoid this error. The best solution is to specify the
  executable using an absolute path, or a relative path that begins
  with ./

  Examples:  mpirun -np 2 $PWD/a.out
             mpirun -np 2 ./a.out
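
  A launcher wrapper could apply the same rule automatically. The
  following is a minimal sketch; qualify_executable() is a hypothetical
  helper, not part of CUDA or of any MPI implementation. It leaves
  absolute paths and paths that already start with ./ or ../ unchanged
  and prepends ./ to everything else.

```cpp
#include <string>

// Returns a form of `exe` that is not looked up through $PATH:
// absolute paths and paths already starting with ./ or ../ are kept,
// anything else gets ./ prepended. An empty string is returned unchanged.
inline std::string qualify_executable(const std::string& exe) {
    if (exe.empty()) return exe;
    if (exe[0] == '/') return exe;
    if (exe.compare(0, 2, "./") == 0 || exe.compare(0, 3, "../") == 0) return exe;
    return "./" + exe;
}
```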


--------------------------------------------------------------------------------
Open64 Sources
--------------------------------------------------------------------------------

The Open64 source files are licensed under the terms of the GPL.
Current and previously released versions are available via anonymous FTP at
download.nvidia.com in the CUDAOpen64 directory.


--------------------------------------------------------------------------------
Revision History
--------------------------------------------------------------------------------

  03/2009 - Version 2.2 Beta
  11/2008 - Version 2.1 Beta
  06/2008 - Version 2.0
  11/2007 - Version 1.1
  06/2007 - Version 1.0
  06/2007 - Version 0.9
  02/2007 - Version 0.8 - Initial public Beta


--------------------------------------------------------------------------------
More Information
--------------------------------------------------------------------------------

  For more information and help with CUDA, please visit
  http://www.nvidia.com/cuda
