Welcome to managedCuda
managedCuda combines CUDA's GPU computing power with the comfort of managed .NET code. It gives access to the entire feature set of the CUDA driver API and provides type-safe wrapper classes for every handle the API defines. managedCuda also includes wrappers for all CUDA-based libraries, namely CUFFT, CURAND, CUSPARSE, CUBLAS, CUSOLVER, NPP and NVRTC.
What managedCuda is not
managedCuda is not a code converter: no C# code is translated to CUDA. Every CUDA kernel that you want to use has to be written in CUDA-C and compiled to PTX or CUBIN format using the NVCC toolchain.
What managedCuda is
managedCuda is the right library if you want to accelerate your .NET application with CUDA without any restrictions. Because every kernel is written in plain CUDA-C, all CUDA-specific features remain available. Even future improvements to CUDA by NVIDIA can be integrated without any changes to your application host code.
Where to get it
Previously, managedCuda was hosted on CodePlex. Older releases (pre-CUDA 7.5) are available there.
managedCuda is also available as NuGet packages: search for managedCuda in the NuGet package manager.
Sample code
VectorAdd.cu as given by the CUDA SDK samples:
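//Kernel code:
extern "C" {
    // Device code
    __global__ void VecAdd(const float* A, const float* B, float* C, int N)
    {
        // Global index of this thread across all blocks
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        // Guard: the last block may contain threads beyond the end of the vectors
        if (i < N)
            C[i] = A[i] + B[i];
    }
}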
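The kernel above has to be compiled to PTX before the host code can load it. One way to do that from .NET is to start nvcc as an external process. The following is only a sketch: it assumes nvcc is on the PATH (and, on Windows, that the host C++ compiler nvcc relies on can be found). The `PtxBuild` class and its method names are hypothetical helpers and not part of managedCuda; the file names follow the sample.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper, not part of managedCuda.
static class PtxBuild
{
    // Builds the nvcc argument string: -ptx makes nvcc emit PTX, -o names the output file.
    public static string NvccArguments(string cuFile, string ptxFile)
        => $"-ptx \"{cuFile}\" -o \"{ptxFile}\"";

    // Runs nvcc and returns its exit code (0 means the PTX file was written).
    public static int Compile(string cuFile, string ptxFile)
    {
        var psi = new ProcessStartInfo("nvcc", NvccArguments(cuFile, ptxFile))
        {
            UseShellExecute = false,
            RedirectStandardError = true
        };
        using (var process = Process.Start(psi))
        {
            // Read compiler diagnostics before waiting, so the pipe cannot fill up.
            string errors = process.StandardError.ReadToEnd();
            process.WaitForExit();
            if (process.ExitCode != 0)
                Console.Error.WriteLine(errors);
            return process.ExitCode;
        }
    }
}
```

Calling `PtxBuild.Compile("VectorAdd.cu", "vectorAdd.ptx")` corresponds to running `nvcc -ptx VectorAdd.cu -o vectorAdd.ptx` from a command prompt.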
Corresponding C# code to call the kernel:
// Requires a reference to the managedCuda package
using ManagedCuda;
int N = 50000;
int deviceID = 0;
CudaContext ctx = new CudaContext(deviceID);
CudaKernel kernel = ctx.LoadKernel("vectorAdd.ptx", "VecAdd");
// Launch enough 256-thread blocks to cover all N elements (integer ceiling division)
kernel.GridDimensions = (N + 255) / 256;
kernel.BlockDimensions = 256;
// Allocate input vectors h_A and h_B in host memory
float[] h_A = new float[N];
float[] h_B = new float[N];
// TODO: Initialize input vectors h_A, h_B
// Allocate vectors in device memory and copy vectors from host memory to device memory
CudaDeviceVariable<float> d_A = h_A;
CudaDeviceVariable<float> d_B = h_B;
CudaDeviceVariable<float> d_C = new CudaDeviceVariable<float>(N);
// Invoke kernel
kernel.Run(d_A.DevicePointer, d_B.DevicePointer, d_C.DevicePointer, N);
// Copy result from device memory to host memory;
// afterwards h_C contains the result in host memory
float[] h_C = d_C;
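The grid size `(N + 255) / 256` used above is an integer ceiling division, and the result in `h_C` can be checked against a plain CPU addition. The following sketch illustrates both in C# without any managedCuda calls; `LaunchMath`, `BlocksFor` and `MaxAbsError` are hypothetical names introduced only for this illustration.

```csharp
using System;

// Hypothetical helper, not part of managedCuda.
static class LaunchMath
{
    // Ceiling division: the smallest number of blocks whose threads cover n elements.
    public static int BlocksFor(int n, int threadsPerBlock)
        => (n + threadsPerBlock - 1) / threadsPerBlock;

    // CPU reference for VecAdd: largest absolute difference between a[i] + b[i] and c[i].
    public static float MaxAbsError(float[] a, float[] b, float[] c)
    {
        float maxError = 0f;
        for (int i = 0; i < c.Length; i++)
            maxError = Math.Max(maxError, Math.Abs(a[i] + b[i] - c[i]));
        return maxError;
    }
}
```

For the sample's values, `LaunchMath.BlocksFor(50000, 256)` gives 196 blocks, which is 50176 threads. The 176 surplus threads are the ones that the kernel's `if (i < N)` check keeps from writing past the end of the vectors. After the kernel has run, `LaunchMath.MaxAbsError(h_A, h_B, h_C)` should be zero or very close to it.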
Sample showing the simple and elegant integration of NPP
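//Requires references to System.Drawing and the managedCuda NPP package
using System.Drawing;
using ManagedCuda.NPP;
//Load an image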
Bitmap bmp = new Bitmap("niceImage.png");
//Alloc device memory using NPP images
NPPImage_8uC3 bmp_d = new NPPImage_8uC3(bmp.Width, bmp.Height);
NPPImage_8uC3 bmpDest_d = new NPPImage_8uC3(bmp.Width, bmp.Height);
//Copy image to GPU
bmp_d.CopyToDevice(bmp);
//Run a NPP function
bmp_d.FilterGaussBorder(bmpDest_d, MaskSize.Size_5_X_5, NppiBorderType.Replicate);
//Copy result back to host
bmpDest_d.CopyToHost(bmp);
//Use the result
bmp.Save("niceImageFiltered.png");