How to load CUDA code from a .PTX file using CuPy in Python?

Since most researchers aim to analyze data quickly, Python is typically their first choice. However, GPU acceleration in Python can pose some challenges. In this article, I'll demonstrate how to use CuPy to load CUDA code from a .PTX file. This approach lets you write custom CUDA kernels and run them seamlessly on the GPU from Python.

My System:
  • OS: Windows 11
  • GPU: NVIDIA RTX 3050 Ti
  • Compute Capability: 8.6

Approaches to loading CUDA code using CuPy:

In this article, I'll outline two methods for loading CUDA code with the CuPy library. The first loads your kernel source directly, while the second converts your CUDA code into a .PTX file (a pseudo-assembly format) before loading it.

Approach 1: Directly Load CUDA Code

The first approach is to load your CUDA source directly into CuPy using the RawModule class. Here's how you can do it:


import cupy as cp

# Define the CUDA kernel code as a string
cuda_code = """
extern "C" __global__ void vector_add(const int* a, const int* b, int* result, int size) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < size) {
        result[tid] = a[tid] + b[tid];
    }
}

extern "C" __global__ void vector_subtract(const int* a, const int* b, int* result, int size) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < size) {
        result[tid] = a[tid] - b[tid];
    }
}
"""

# Compile the CUDA code into a module
module = cp.RawModule(code=cuda_code)

# Get the compiled kernel functions
vector_add_kernel = module.get_function('vector_add')
vector_subtract_kernel = module.get_function('vector_subtract')

# Define the input vectors (int32 to match the kernel's int* parameters)
vector_a = cp.array([1, 2, 3, 4, 5], dtype=cp.int32)
vector_b = cp.array([6, 7, 8, 9, 10], dtype=cp.int32)

# Create output vectors to store the results
vector_sum = cp.empty_like(vector_a)
vector_diff = cp.empty_like(vector_a)

# Specify grid and block dimensions for kernel execution
block_dim = (128,)  # Adjust as needed based on your GPU capabilities
grid_dim = ((vector_a.size - 1) // block_dim[0] + 1,)

# Execute the kernel for vector addition
# (the size is cast to int32 so it matches the kernel's 4-byte int parameter)
vector_add_kernel(grid_dim, block_dim, (vector_a, vector_b, vector_sum, cp.int32(vector_a.size)))

# Execute the kernel for vector subtraction
vector_subtract_kernel(grid_dim, block_dim, (vector_a, vector_b, vector_diff, cp.int32(vector_a.size)))

# Print the results
print("Result of vector addition:", vector_sum)
print("Result of vector subtraction:", vector_diff)

                            

Approach 2: Load CUDA Code from .PTX File

The second approach is to compile your CUDA code into a .PTX file and then load it with CuPy. PTX, short for Parallel Thread Execution, is a human-readable intermediate representation of CUDA code: a pseudo-assembly stage in the compilation process, before the code is translated into machine code for execution on the GPU.

PTX instructions are independent of any specific GPU model, and are compiled to machine code for the actual GPU architecture when the module is loaded. To generate a .PTX file, you need to compile your CUDA code with the nvcc compiler. Follow these steps:

  • First, write your CUDA code in a .cu file. For example, matching the sample above, we want two CUDA kernels, one for addition and one for subtraction. So, create a kernel.cu file and put your CUDA code in it:
    
    // kernel.cu file
    extern "C"
    {
        __global__ void vector_add(const int* a, const int* b, int* result, int size) 
        {
            int tid = threadIdx.x + blockIdx.x * blockDim.x;
            if (tid < size) 
            {
                result[tid] = a[tid] + b[tid];
            }
        }
    
        __global__ void vector_subtract(const int* a, const int* b, int* result, int size) 
        {
            int tid = threadIdx.x + blockIdx.x * blockDim.x;
            if (tid < size) {
                result[tid] = a[tid] - b[tid];
            }
        }
    
    }                                        
    
  • Compile the kernel.cu file with the nvcc compiler. Optionally, pass -arch (for example, -arch=compute_86 for the GPU listed above) to target a specific virtual architecture:
    
    nvcc -ptx kernel.cu -o kernel.ptx
                                        
  • Now load the generated .PTX file in your Python code. The only change is in the RawModule() call, which now takes a path to the .PTX file instead of a CUDA source string.
    
    import cupy as cp
    
    # Create a raw module
    ptx_code_path = "D:/your_path/kernel.ptx"
    raw_module = cp.RawModule(path=ptx_code_path)
    
    # Get the kernel functions from the raw module
    vector_add_kernel = raw_module.get_function('vector_add')
    vector_subtract_kernel = raw_module.get_function('vector_subtract')
    
    def vector_add(a, b):
        size = a.size
        result = cp.empty(size, dtype=cp.int32)
        # Launch just enough blocks of 256 threads to cover all elements,
        # and cast size to int32 to match the kernel's int parameter
        grid = ((size + 255) // 256,)
        vector_add_kernel(grid=grid, block=(256,), args=(a, b, result, cp.int32(size)))
        return result
    
    def vector_subtract(a, b):
        size = a.size
        result = cp.empty(size, dtype=cp.int32)
        grid = ((size + 255) // 256,)
        vector_subtract_kernel(grid=grid, block=(256,), args=(a, b, result, cp.int32(size)))
        return result
    
    # Example usage (int32 to match the kernel's int* parameters)
    a = cp.array([1, 2, 3, 4], dtype=cp.int32)
    b = cp.array([5, 6, 7, 8], dtype=cp.int32)
    
    # Perform vector addition and subtraction
    result_add = vector_add(a, b)
    result_subtract = vector_subtract(a, b)
    
    print("Vector Addition Result:", result_add)
    print("Vector Subtraction Result:", result_subtract)                                        
                                        

Disclaimer: The opinions expressed in this blog are my own. If you have any discrepancies or suggestions for improvements, please feel free to contact me. Your feedback is highly valued.

Siddharth Mittal

Siddharth is a Signal & Information Processing graduate with an undergraduate degree in Electrical & Electronics Engineering. He enjoys programming and has a passion for travel, photography, and writing.