Wednesday, August 28, 2013

Low performance kernel

Low performance kernel

I have a CUDA kernel where there are many operations and few branches. It
looks like
__global__
void kernel(Real *randomValues, Real mu, Real sigma)
{
int row = blockDim.y * blockIdx.y + threadIdx.y;
int col = blockDim.x * blockIdx.x + threadIdx.x;
if ( row >= cnTimeSteps || col >= cnPaths ) return;
Real alphaLevel = randomValues[row*cnPaths+col];
Real q = 0.0;
Real x = 0.0;
if ( alphaLevel < p_low)
{
q = sqrt( -2*log( alphaLevel ) );
x = (((((c1*q+c2)*q+c3)*q+c4)*q+c5)*q+c6) /
((((d1*q+d2)*q+d3)*q+d4)*q+1);
}
else if ( alphaLevel < p_high )
{
q = alphaLevel-0.5;
Real r = q*q;
x= (((((a1*r+a2)*r+a3)*r+a4)*r+a5)*r+a6)*q /
(((((b1*r+b2)*r+b3)*r+b4)*r+b5)*r+1);
}
else
{
q = sqrt( -2*log( 1.0-alphaLevel ) );
x = -(((((c1*q+c2)*q+c3)*q+c4)*q+c5)*q+c6) /
((((d1*q+d2)*q+d3)*q+d4)*q+1);
}
randomValues[row*cnPaths+col] = sigma * x + mu;
}
where all the a's, b's, c's and d's are constant values (in the device
constant memory)
static __device__ __constant__ float a1 = 1.73687;
static __device__ __constant__ float a2 = 1.12321100;
and so on.
After profiling the kernel I found that the theoretical occupancy is 100%
but I am getting no more than 60%.
I went through this and this GTC talks to try to optimize my kernel.
On one side I have that the IPC reports an average of 1.32 issued
instructions and 0.62 executed. The instruction serialization is about 50%
but the SM activity is almost 100%. On the other hand, there are around 38
active warps but 8 are eligible to execute the next instruction but on
warp issue efficiency I get that around 70% of the cycles there is no
eligible warp. The stall reasons are reported as "Other" which I think has
to do with the computation of the log and sqrt.
How can the SM activity be 99.82% if most of the cycles there is no
eligible warp?
How can I reduce stall?
As threads in a warp may not go into the same branch, requests to constant
memory are probably seralized, is this true? Should I put those constants
in global memory (maybe use shared memory also)?
Is the first time I use Nsight Visual Studio so I'm trying to figure out
the meaning of all the performance analysis. BTW my card is a Quadro
K4000.

No comments:

Post a Comment