# OpenCL-Wrapper

OpenCL is the most powerful programming language ever created. Yet the OpenCL C++ bindings are cumbersome and the code overhead prevents many people from getting started. I created this lightweight OpenCL-Wrapper to greatly simplify OpenCL software development with C++ while keeping functionality and performance. Works on Windows, Linux and Android with C++17.

Use-case example: [FluidX3D](https://github.com/ProjectPhysX/FluidX3D) builds entirely on top of this OpenCL-Wrapper.

## Key simplifications:
1. select a `Device` with 1 line
    - automatically select fastest device / device with most memory / device with specified ID from a list of all devices
    - easily get device information (performance in TFLOPs/s, amount of memory and cache, FP64/FP16 capabilities, etc.)
    - automatic OpenCL C code compilation when creating the Device object
        - automatically enable FP64/FP16 capabilities in OpenCL C code
        - automatically print log to console if there are compile errors
        - easy option to generate PTX assembly and save that in a `.ptx` file
2. create a `Memory` object with 1 line
    - one object for both host and device memory
    - easy host <-> device memory transfer (also for 1D/2D/3D grid domains)
    - easy handling of multi-dimensional vectors
    - can also be used to only allocate memory on host or only allocate memory on device
    - automatically tracks total global memory usage of device when allocating/deleting memory
3. create a `Kernel` with 1 line
    - Memory objects and constants are linked to OpenCL C kernel parameters during Kernel creation
    - a list of Memory objects and constants can be added to Kernel parameters in one line (`add_parameters(...)`)
    - Kernel parameters can be edited (`set_parameters(...)`)
    - easy Kernel execution: `kernel.run();`
    - Kernel function calls can be daisy chained, for example: `kernel.set_parameters(3u, time).run();`
4. OpenCL C code is embedded into C++
    - syntax highlighting in the code editor is retained
    - notes / peculiarities of this workaround (a short sketch follows below):
        - the `#define R(...) string(" "#__VA_ARGS__" ")` stringification macro converts its arguments to string literals; `'\n'` is converted to `' '` in the process
        - these string literals cannot be arbitrarily long, so interrupt them periodically with `)+R(`
        - to use unbalanced round brackets `'('`/`')'`, exit the `R(...)` macro and insert a string literal manually: `)+"void function("+R(` and `)+") {"+R(`
        - to use preprocessor switch macros, exit the `R(...)` macro and insert a string literal manually: `)+"#define TEST"+R(` and `)+"#endif"+R( // TEST`
        - preprocessor replacement macros (for example `#define VARIABLE 42`) don't work; hand these to the `Device` constructor directly instead
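To make these peculiarities concrete, here is a minimal sketch of the embedding pattern. It is not taken from the wrapper itself: `example_kernel` and the `TEST` switch are made up for illustration, and only the `R(...)` macro from the notes above is assumed.

```c
#include <string>
using std::string;

#define R(...) string(" "#__VA_ARGS__" ") // the stringification macro from the notes above

// illustrative container function that returns embedded OpenCL C code as a string
string opencl_c_container() { return R(

	kernel void example_kernel(global float* A) { // syntax highlighting works normally in here
		const uint n = get_global_id(0);

)+"#ifdef TEST"+R( // preprocessor switch spliced in as a manual string literal

		A[n] = 1.0f;

)+"#endif"+R( // TEST

	}

);} // unbalanced round brackets would likewise require exiting R(...) with )+"..."+R(
```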
## No need to:
- have code overhead for selecting a platform/device, passing the OpenCL C code, etc.
- keep track of length and data type for buffers
- have duplicate code for host and device buffers
- keep track of total global memory usage
- keep track of global/local range for kernels
- bother with Queue, Context, Source, Program
- load a `.cl` file at runtime

## Example (OpenCL vector addition)
### main.cpp
```c
#include "opencl.hpp"

int main() {
	Device device(select_device_with_most_flops()); // compile OpenCL C code for the fastest available device
	const uint N = 1024u; // size of vectors
	Memory<float> A(device, N); // allocate memory on both host and device
	Memory<float> B(device, N);
	Memory<float> C(device, N);
	Kernel add_kernel(device, N, "add_kernel", A, B, C); // kernel that runs on the device
	for(uint n=0u; n<N; n++) { // initialize memory
		A[n] = 3.0f;
		B[n] = 2.0f;
		C[n] = 1.0f;
	}
	print_info("Value before kernel execution: C[0] = "+to_string(C[0]));
	A.write_to_device(); // copy data from host memory to device memory
	B.write_to_device();
	add_kernel.run(); // run add_kernel on the device
	C.read_from_device(); // copy data from device memory to host memory
	print_info("Value after kernel execution: C[0] = "+to_string(C[0]));
	wait();
	return 0;
}
```
### kernel.cpp
```c
#include "kernel.hpp" // note: unbalanced round brackets () are not allowed and string literals can't be arbitrarily long, so periodically interrupt with )+R(
string opencl_c_container() { return R( // ######################## begin of OpenCL C code ########################

kernel void add_kernel(global float* A, global float* B, global float* C) { // equivalent to "for(uint n=0u; n<N; n++) {", but executed in parallel
	const uint n = get_global_id(0);
	C[n] = A[n]+B[n];
}

);} // ######################## end of OpenCL C code ########################
```
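The example above links `A`, `B` and `C` to the kernel at construction time; the parameter calls named under "Key simplifications" can also be used separately. A minimal sketch, under the assumption that the variadic `Kernel` constructor also works without parameters (only `add_parameters(...)`, `set_parameters(...)` and `run()` are named in this README):

```c
Kernel add_kernel(device, N, "add_kernel"); // assumed: construction without linking parameters yet
add_kernel.add_parameters(A, B, C); // link Memory objects (and constants) in one line
add_kernel.set_parameters(0u, A); // overwrite kernel parameters starting at index 0
add_kernel.run(); // execute; calls can also be daisy chained: add_kernel.set_parameters(0u, A).run();
```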
## For comparison, the very same OpenCL vector addition example looks like this when directly using the OpenCL C++ bindings:
### main.cpp
```c
#include <CL/cl.hpp> // OpenCL C++ bindings
#include "utilities.hpp"

#define WORKGROUP_SIZE 64

int main() {

	// 1. select device
	vector<cl::Device> cl_devices; // get all devices of all platforms
	{
		vector<cl::Platform> cl_platforms; // get all platforms (drivers)
		cl::Platform::get(&cl_platforms);
		for(uint i=0u; i<(uint)cl_platforms.size(); i++) {
			vector<cl::Device> cl_devices_available;
			cl_platforms[i].getDevices(CL_DEVICE_TYPE_ALL, &cl_devices_available);
			for(uint j=0u; j<(uint)cl_devices_available.size(); j++) {
				cl_devices.push_back(cl_devices_available[j]);
			}
		}
	}
	cl::Device cl_device; // select fastest available device
	{
		float best_value = 0.0f;
		uint best_i = 0u; // index of fastest device
		for(uint i=0u; i<(uint)cl_devices.size(); i++) { // find device with highest (estimated) floating point performance
			const string name = trim(cl_devices[i].getInfo<CL_DEVICE_NAME>()); // device name
			const string vendor = trim(cl_devices[i].getInfo<CL_DEVICE_VENDOR>()); // device vendor
			const uint compute_units = (uint)cl_devices[i].getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>(); // compute units (CUs) can contain multiple cores depending on the microarchitecture
			const uint clock_frequency = (uint)cl_devices[i].getInfo<CL_DEVICE_MAX_CLOCK_FREQUENCY>(); // in MHz
			const bool is_gpu = cl_devices[i].getInfo<CL_DEVICE_TYPE>()==CL_DEVICE_TYPE_GPU;
			const uint ipc = is_gpu?2u:32u; // IPC (instructions per cycle) is 2 for GPUs and 32 for most modern CPUs
			const bool nvidia_192_cores_per_cu = contains_any(to_lower(name), {"gt 6", "gt 7", "gtx 6", "gtx 7", "quadro k", "tesla k"}) || (clock_frequency<1000u&&contains(to_lower(name), "titan")); // identify Kepler GPUs
			const bool nvidia_64_cores_per_cu = contains_any(to_lower(name), {"p100", "v100", "a100", "a30", " 16", " 20", "titan v", "titan rtx", "quadro t", "tesla t", "quadro rtx"}) && !contains(to_lower(name), "rtx a"); // identify P100, Volta, Turing, A100, A30
			const bool amd_128_cores_per_dualcu = contains(to_lower(name), "gfx10"); // identify RDNA/RDNA2 GPUs where dual CUs are reported
			const bool amd_256_cores_per_dualcu = contains(to_lower(name), "gfx11"); // identify RDNA3 GPUs where dual CUs are reported
			const bool intel_16_cores_per_cu = contains(to_lower(name), "gpu max"); // identify PVC GPUs
			const float nvidia = (float)(contains(to_lower(vendor), "nvidia"))*(nvidia_64_cores_per_cu?64.0f:nvidia_192_cores_per_cu?192.0f:128.0f); // Nvidia GPUs have 192 cores/CU (Kepler), 128 cores/CU (Maxwell, Pascal, Ampere, Hopper, Ada) or 64 cores/CU (P100, Volta, Turing, A100, A30)
			const float amd = (float)(contains_any(to_lower(vendor), {"amd", "advanced"}))*(is_gpu?(amd_256_cores_per_dualcu?256.0f:amd_128_cores_per_dualcu?128.0f:64.0f):0.5f); // AMD GPUs have 64 cores/CU (GCN, CDNA), 128 cores/dualCU (RDNA, RDNA2) or 256 cores/dualCU (RDNA3), AMD CPUs (with SMT) have 1/2 core/CU
			const float intel = (float)(contains(to_lower(vendor), "intel"))*(is_gpu?(intel_16_cores_per_cu?16.0f:8.0f):0.5f); // Intel GPUs have 16 cores/CU (PVC) or 8 cores/CU (integrated/Arc), Intel CPUs (with HT) have 1/2 core/CU
			const float apple = (float)(contains(to_lower(vendor), "apple"))*(128.0f); // Apple ARM GPUs usually have 128 cores/CU
			const float arm = (float)(contains(to_lower(vendor), "arm"))*(is_gpu?8.0f:1.0f); // ARM GPUs usually have 8 cores/CU, ARM CPUs have 1 core/CU
			const uint cores = to_uint((float)compute_units*(nvidia+amd+intel+apple+arm)); // for CPUs, compute_units is the number of threads (twice the number of cores with hyperthreading)
			const float tflops = 1E-6f*(float)cores*(float)ipc*(float)clock_frequency; // estimated device floating point performance in TeraFLOPs/s
			if(tflops>best_value) {
				best_value = tflops;
				best_i = i;
			}
		}
		const string name = trim(cl_devices[best_i].getInfo<CL_DEVICE_NAME>()); // device name
		cl_device = cl_devices[best_i];
		print_info(name); // print device name
	}

	// 2. embed OpenCL C code (raw string literal breaks syntax highlighting)
	string opencl_c_code = R"(
		kernel void add_kernel(global float* A, global float* B, global float* C) { // equivalent to "for(uint n=0u; n<N; n++) {", but executed in parallel
			const uint n = get_global_id(0);
			C[n] = A[n]+B[n];
		}
	)";

	// 3. compile OpenCL C code
	cl::Context cl_context(cl_device);
	cl::Program::Sources cl_source;
	cl_source.push_back({ opencl_c_code.c_str(), opencl_c_code.length() });
	cl::Program cl_program(cl_context, cl_source);
	{
		int error = cl_program.build("-cl-fast-relaxed-math -w"); // compile OpenCL C code, disable warnings
		if(error) print_warning(cl_program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(cl_device)); // print build log
		if(error) print_error("OpenCL C code compilation failed.");
		else print_info("OpenCL C code successfully compiled.");
	}

	// 4. allocate memory on host and device
	const uint N = 1024u; // size of vectors
	float* host_A;
	float* host_B;
	float* host_C;
	cl::Buffer device_A;
	cl::Buffer device_B;
	cl::Buffer device_C;
	{
		host_A = new float[N]; // allocate host memory
		host_B = new float[N];
		host_C = new float[N];
		for(uint i=0u; i<N; i++) { // initialize memory
			host_A[i] = 3.0f;
			host_B[i] = 2.0f;
			host_C[i] = 1.0f;
		}
	}
	// ...
```
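The mirrored README breaks off inside step 4. Under the assumption that the example continues like a typical OpenCL C++ bindings program (reusing `cl_context`, `cl_program` and the host arrays from above), the remaining steps would look roughly like this; it is a reconstruction sketch, not the author's verbatim code:

```c
	// continuation sketch: allocate device memory (step 4, continued)
	device_A = cl::Buffer(cl_context, CL_MEM_READ_WRITE, N*sizeof(float));
	device_B = cl::Buffer(cl_context, CL_MEM_READ_WRITE, N*sizeof(float));
	device_C = cl::Buffer(cl_context, CL_MEM_READ_WRITE, N*sizeof(float));

	// 5. create Kernel object and link parameters
	cl::Kernel cl_kernel(cl_program, "add_kernel");
	cl_kernel.setArg(0, device_A);
	cl_kernel.setArg(1, device_B);
	cl_kernel.setArg(2, device_C);

	// 6. run the kernel and read the result back to the host
	cl::CommandQueue cl_queue(cl_context, cl_device);
	cl_queue.enqueueWriteBuffer(device_A, CL_TRUE, 0u, N*sizeof(float), host_A); // copy host -> device
	cl_queue.enqueueWriteBuffer(device_B, CL_TRUE, 0u, N*sizeof(float), host_B);
	cl_queue.enqueueNDRangeKernel(cl_kernel, cl::NullRange, cl::NDRange(N), cl::NDRange(WORKGROUP_SIZE)); // run kernel
	cl_queue.enqueueReadBuffer(device_C, CL_TRUE, 0u, N*sizeof(float), host_C); // copy device -> host
	cl_queue.finish(); // wait for completion
	print_info("Value after kernel execution: C[0] = "+to_string(host_C[0]));
	wait();
	return 0;
}
```

Even in sketch form, the point of the comparison stands: what the wrapper condenses into a handful of lines takes well over a hundred lines of boilerplate with the raw OpenCL C++ bindings.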