
How to get infinite loops to work in CUDA

Category: CUDA
Published: Tuesday, 12 June 2018

Surprisingly, the CUDA compiler does not handle infinite loops properly:

while (1)
{
    ...
}

For instance, the loop above will be completely eliminated from the resulting assembly, along with its contents. This behavior has been known at least since 2012.

There are certain situations where an infinite loop on the GPU is required, such as running a background service that receives configuration updates from the host over host-mapped memory.

One way to enforce an infinite loop is to make the loop condition volatile:

volatile int infinity = 1;
while (infinity)
{
    ...
}
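As a sketch of the background-service scenario mentioned above (all names and the kernel body are illustrative, not from the original post), a persistent kernel polling a host-mapped flag might look like:

```cuda
// Sketch of a persistent "service" kernel polling host-mapped memory.
// Names and logic are illustrative, assuming a zero-copy setup.
__global__ void service_kernel(volatile int* command, volatile int* result)
{
    // The volatile loop condition keeps the compiler from
    // eliminating the infinite loop.
    volatile int infinity = 1;
    while (infinity)
    {
        if (*command != 0)          // host writes a command via mapped memory
        {
            *result = *command * 2; // do some work
            *command = 0;           // acknowledge back to the host
        }
    }
}

// Host side (sketch): allocate host-mapped memory, get device pointers,
// launch the kernel, then communicate through the mapped buffers:
//
//   int *command, *result, *d_command, *d_result;
//   cudaHostAlloc(&command, sizeof(int), cudaHostAllocMapped);
//   cudaHostAlloc(&result, sizeof(int), cudaHostAllocMapped);
//   cudaHostGetDevicePointer(&d_command, command, 0);
//   cudaHostGetDevicePointer(&d_result, result, 0);
//   service_kernel<<<1, 1>>>(d_command, d_result);
```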

How to fix CUDA and avx512vlintrin.h incompatibility issue

Category: CUDA
Published: Tuesday, 08 May 2018

Recent GCC 5.x and 6.x compilers cause NVCC to produce weird compile errors like the following:

 /usr/lib/gcc/x86_64-linux-gnu/5/include/avx512vlintrin.h(10919): error: argument of type "const void *" is incompatible with parameter of type "const long long *" 

To track down the issue, take the failing NVCC command line and add the -M flag (or replace the -c flag with -M) to print the tree of included header files:

 $ /opt/cuda/bin/nvcc ... -M -o outfile 

While walking through the output, note which system header sits on top of the problematic avx512... header. In our case, it was <algorithm>:

$ cat outfile
    /usr/include/c++/5/algorithm \
    /usr/include/c++/5/bits/stl_algo.h \
    /usr/include/c++/5/bits/algorithmfwd.h \
    /usr/include/c++/5/bits/stl_heap.h \
    /usr/include/c++/5/random \
    /usr/include/c++/5/limits \
    /usr/include/c++/5/bits/random.h \
    /usr/include/c++/5/bits/uniform_int_dist.h \
    /usr/include/x86_64-linux-gnu/c++/5/bits/opt_random.h \
    /usr/lib/gcc/x86_64-linux-gnu/5/include/x86intrin.h \
    /usr/lib/gcc/x86_64-linux-gnu/5/include/ia32intrin.h \
    /usr/lib/gcc/x86_64-linux-gnu/5/include/pmmintrin.h \
    /usr/lib/gcc/x86_64-linux-gnu/5/include/tmmintrin.h \
    /usr/lib/gcc/x86_64-linux-gnu/5/include/ammintrin.h \
    /usr/lib/gcc/x86_64-linux-gnu/5/include/smmintrin.h \
    /usr/lib/gcc/x86_64-linux-gnu/5/include/popcntintrin.h \
    /usr/lib/gcc/x86_64-linux-gnu/5/include/wmmintrin.h \
    /usr/lib/gcc/x86_64-linux-gnu/5/include/immintrin.h \
    /usr/lib/gcc/x86_64-linux-gnu/5/include/avxintrin.h \

After removing #include <algorithm>, the compile error is gone. If <algorithm> cannot be removed right away due to complex code dependencies, you can work around the issue by splitting your source file into a CUDA source and an <algorithm>-dependent C++ source compiled solely by GCC.
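As a sketch of such a split (file and function names are hypothetical), the CUDA side exposes a plain C++ launcher that the GCC-compiled side calls, so <algorithm> never reaches NVCC:

```cuda
// kernels.cu -- compiled by NVCC; contains no <algorithm> include.
__global__ void my_kernel(float* data)
{
    data[threadIdx.x] *= 2.0f; // placeholder device work
}

// Plain C++ entry point; its declaration is all the GCC side needs.
void launch_my_kernel(float* data)
{
    my_kernel<<<128, 256>>>(data);
}

// host_logic.cpp -- compiled solely by GCC; free to use <algorithm>:
//
//   #include <algorithm>
//   void launch_my_kernel(float* data); // defined in kernels.cu
//
//   void process(float* data, float* host, size_t n)
//   {
//       std::sort(host, host + n); // <algorithm> stays on the GCC side
//       launch_my_kernel(data);
//   }
```

Both objects are then linked together in the usual way, so only the .cpp translation unit ever pulls in the problematic intrinsics headers.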

New 4-day Training Course on GPU-enabled Neural Networks

Category: CUDA
Published: Thursday, 17 November 2016


Deep Learning on GPUs is currently boosting AI's massive leap into real-life applications: autopilots, intelligent automated assistants, real-time translation, image recognition, data sequencing and clustering. With the unprecedented computing power of NVIDIA GPUs, many automotive, robotics and big data companies are creating products and services based on a new class of intelligent machines.

Applied Parallel Computing LLC offers a specialized 4-day course on GPU-enabled Neural Networks. The course is intended for developers who want to rapidly bring NVIDIA-based AI technology into new and existing software solutions. In 4 days we walk from the necessary theory of Recurrent and Convolutional Neural Networks to practical recipes for using the TensorFlow and Caffe frameworks. Finally, course attendees will be guided through building an AI software stack for a real robot.

Training Course Program


Remote profiling with NVIDIA Visual Profiler on a SLURM-based cluster

Category: CUDA
Published: Wednesday, 16 November 2016

GPU-equipped clusters are often managed by the SLURM job control system. Essentially, the developer logs into the frontend node via SSH, builds the application, and then requests a compute node allocation from SLURM. Once compute nodes are granted, the application is executed on them. To debug or profile an application, the developer is allowed to SSH from the frontend node to the individual compute nodes granted for execution. Now, how can this pipeline play together with the NVIDIA Visual Profiler?

Read more: Remote profiling with NVIDIA Visual Profiler on a SLURM-based cluster

CUDA-like runtime interface for Xeon Phi

Category: CUDA
Published: Tuesday, 12 April 2016

The computing power of GPUs can be exposed to applications through two principal kinds of programming interfaces: manual parallel programming (CUDA or OpenCL), or directive-based extensions relying on the compiler's semi-automatic parallelization (OpenACC and OpenMP 4). Unlike for GPUs, Intel has never offered an explicit CUDA-like interface for its Xeon Phi accelerators to the general public, leaving OpenMP offloading directives as the only programming option.

Based on liboffloadmic, we have prototyped "micrt", a programming interface for executing memory transfers and kernels similarly to the CUDA runtime. Find the code example and build instructions here.



OpenMP 4.0 on NVIDIA CUDA GPUs

Category: CUDA
Published: Wednesday, 14 October 2015

Multiple presentations about OpenMP 4.0 support on NVIDIA GPUs date back to 2012. However, production-ready OpenMP 4.0 tooling for NVIDIA devices is still very limited: Intel's compilers target Xeon Phi only, PGI and Cray offer only OpenACC, and GCC support is only planned. Fortunately, a usable compiler can be built on top of LLVM and Clang. In this blog post we provide complete instructions for building an OpenMP 4.0 compiler with NVIDIA support on Linux. Note that these instructions may be especially useful on platforms where OpenACC is not available, e.g. the Jetson TK1.

Read more: OpenMP 4.0 on NVIDIA CUDA GPUs

Use CUDA 7.0 NVRTC with Thrust

Category: CUDA
Published: Wednesday, 29 April 2015

Runtime Compilation (NVRTC), introduced in CUDA 7.0, makes it possible to compile CUDA kernels dynamically during program execution (see example). This functionality allows additional GPU code optimization/specialization using the runtime context, e.g. substituting constant loop bounds to unroll loops, or eliminating divergent branches known never to be taken. However, NVRTC is not fully equivalent to offline nvcc: it only compiles CUDA device code into PTX assembly. Thus, NVRTC is not directly usable with GPU-enabled frameworks that combine both host and device code, e.g. Thrust. In this demo we show how to use NVRTC to replace a certain device function in Thrust code.
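For orientation, a minimal sketch of the NVRTC workflow looks like this (this is not the post's demo code; the kernel string and option are illustrative, and error checking is elided):

```cuda
#include <nvrtc.h>
#include <cstdio>

int main()
{
    // Device code supplied as a string at runtime.
    const char* src =
        "__global__ void scale(float* x, float a)\n"
        "{ x[threadIdx.x] *= a; }\n";

    // Compile the string into PTX.
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "scale.cu", 0, NULL, NULL);
    const char* opts[] = { "--gpu-architecture=compute_35" };
    nvrtcCompileProgram(prog, 1, opts);

    // Retrieve the resulting PTX assembly.
    size_t ptxSize;
    nvrtcGetPTXSize(prog, &ptxSize);
    char* ptx = new char[ptxSize];
    nvrtcGetPTX(prog, ptx);

    // The PTX can now be loaded with the CUDA Driver API
    // (cuModuleLoadData / cuModuleGetFunction / cuLaunchKernel).
    printf("%s\n", ptx);

    nvrtcDestroyProgram(&prog);
    delete[] ptx;
    return 0;
}
```

Note that NVRTC stops at PTX: the load-and-launch step above has to go through the Driver API, which is exactly why mixing NVRTC with host-side template frameworks like Thrust takes extra work.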

Read more: Use CUDA 7.0 NVRTC with Thrust

Get extra 8% perf in bilinear interpolation on GPU using __restrict__ keyword

Category: CUDA
Published: Thursday, 26 March 2015

Starting with GK110 (Tesla Kepler), the "const __restrict__" annotation on a kernel argument has an extra GPU-specific meaning: accesses to that argument should go through the texture cache. As an example, we use GPU bilinear interpolation, which is a compute-bound problem. Replacing "RGBApixel* pixels" with "const RGBApixel* __restrict__ pixels" in the __global__ and __device__ functions instructs the compiler to emit LDG instructions for pixel loads instead of generic LD:

$ cuobjdump -sass no_restrict | grep LD
        /*01d8*/                   LD.E R13, [R8+0x4];
        /*0210*/                   LD.E R14, [R8];
        /*0248*/                   LD.E R8, [R4+0x4];
        /*0290*/                   LD.E R11, [R4];
$ cuobjdump -sass restrict | grep LD
        /*01a0*/                   LDG.E R7, [R16];
        /*01b0*/                   LDG.E R9, [R8];
        /*01e0*/                   LDG.E R8, [R16];
        /*01f0*/                   LDG.E R5, [R14];

If a __restrict__ pointer does not produce LDG instructions, then the keyword is either not applicable or used incorrectly.
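For illustration, the annotation change amounts to the following (the struct layout and kernel names are placeholders, not the demo's actual code):

```cuda
struct RGBApixel { unsigned char r, g, b, a; };

// Generic loads (LD): the compiler cannot prove that reads through
// pixels do not alias writes through out.
__global__ void interp_no_restrict(RGBApixel* pixels,
                                   RGBApixel* out, int width)
{
    /* ... bilinear interpolation body ... */
}

// Read-only and non-aliased: loads through pixels can be emitted as
// LDG instructions, going through the texture (read-only) cache.
__global__ void interp_restrict(const RGBApixel* __restrict__ pixels,
                                RGBApixel* __restrict__ out, int width)
{
    /* ... same body ... */
}
```

Note that both the const and the __restrict__ qualifiers are needed: together they guarantee the data is read-only for the entire kernel lifetime, which is what licenses the texture-cache path.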

The code version with "const __restrict__" demonstrates an extra speedup of 8%:

$ ./no_restrict hst_lagoon_detail.bmp
Image: BMP 4778 x 4856
GPU kernel time = 0.013719 sec
$ ./restrict hst_lagoon_detail.bmp
Image: BMP 4778 x 4856
GPU kernel time = 0.012686 sec

Download our demo code here.

Thrust/CUDA tip: reuse temporary buffer across multiple transforms

Category: CUDA
Published: Thursday, 09 October 2014

Thrust is a very handy STL-like template library for rapid data processing on GPUs.

In real applications it is often necessary to perform the same data processing (in Thrust's terminology, a transform) multiple times on different datasets. Transforms containing a reduction (reduce, sort, transform-reduce, etc.) require temporary array allocations. By default these allocations are performed for each individual transform, adding cudaMalloc/cudaFree operations that can be quite expensive:
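As a sketch of the reuse idea (simplified from Thrust's custom_temporary_allocation example; this is not the post's demo code, and the allocator below keeps at most one free block per size):

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <map>
#include <cstddef>

// A caching allocator: deallocate() parks blocks instead of freeing
// them, and allocate() hands parked blocks back out, so repeated
// transforms stop issuing cudaMalloc/cudaFree every call.
struct cached_allocator
{
    typedef char value_type;
    std::map<std::ptrdiff_t, char*> free_blocks;

    char* allocate(std::ptrdiff_t num_bytes)
    {
        std::map<std::ptrdiff_t, char*>::iterator it =
            free_blocks.find(num_bytes);
        if (it != free_blocks.end())
        {
            char* p = it->second;   // reuse a previously parked block
            free_blocks.erase(it);
            return p;
        }
        char* p = 0;
        cudaMalloc(&p, num_bytes);  // only on a cache miss
        return p;
    }

    void deallocate(char* p, size_t num_bytes)
    {
        // Park the block for reuse instead of cudaFree.
        free_blocks.insert(
            std::make_pair((std::ptrdiff_t)num_bytes, p));
    }
};

// Usage: pass the same allocator instance to every transform via the
// execution policy, e.g.:
//
//   cached_allocator alloc;
//   thrust::sort(thrust::cuda::par(alloc), d_vec.begin(), d_vec.end());
//   thrust::sort(thrust::cuda::par(alloc), d_vec2.begin(), d_vec2.end());
```

A production version would also free the parked blocks on shutdown and handle near-size matches; the point here is only that the allocator's lifetime spans many transforms.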

Read more: Thrust/CUDA tip: reuse temporary buffer across multiple transforms

On-the-fly modification of LLVM IR code of CUDA sources

Category: CUDA
Published: Tuesday, 23 September 2014

Largely thanks to LLVM, recent years have seen a significant increase of interest in research & development of domain-specific compilation tools. With the release of PTX backends by NVIDIA (the open-source NVPTX and the proprietary libNVVM), constructing custom LLVM-driven compilers that generate GPU binaries has also become possible. However, two questions remain:

  1. How to customize the CUDA source compilation?
  2. What is NVIDIA's best set of GPU-specific LLVM optimizations, and how can the IR be modified further after applying them?

To answer these two questions, we have created a special dynamic library. Attached to the NVIDIA CUDA compiler, this library exposes the unoptimized and optimized LLVM IR code to the user and allows its on-the-fly modification. As a result, a domain-specific compiler developer gains the flexibility, e.g., to re-target CUDA-generated LLVM IR to different architectures, or to make additional modifications to the IR after NVIDIA's optimizations have executed.

Source code, sample modification and description are available on our GitHub page. Tested with CUDA 6.0.