# Study_CUDA_Programming

**Repository Path**: bear4zcx/Study_CUDA_Programming

## Basic Information

- **Project Name**: Study_CUDA_Programming
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: AGPL-3.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-05-29
- **Last Updated**: 2025-05-29

## Categories & Tags

- **Categories**: Uncategorized
- **Tags**: None

## README

# Study_CUDA_Programming (based on C++11)

- All materials in this repository are based on lectures and code from Inflearn's CUDA course.
- The course link is as follows:
  - https://www.inflearn.com/roadmaps/654#community
- All lecture materials and code follow the instructor's license and are for educational purposes only, limited to the course attended.
- Commercial use is prohibited!

### You must prepare the following:

- NVIDIA GPU
- OS: Ubuntu 20.04 (used here), Windows 10 or later, or macOS
- CUDA installed (v12.1 used here)
- A plain Python environment (not a conda environment)
- If you want to make python3 the default `python`, run:

```shell
sudo apt update
sudo apt install python3.8
sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.8 10
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 10  # not needed if you don't use python3
sudo update-alternatives --config python3                                          # not needed if you don't use python3
```

### Install glfw3 packages

```shell
sudo apt-get install libglfw3-dev libglfw3
```

### Install CMake 3.30

```shell
sudo apt purge cmake
sudo apt install wget build-essential
wget https://github.com/Kitware/CMake/releases/download/v3.30.0/cmake-3.30.0.tar.gz
tar -xvzf cmake-3.30.0.tar.gz
cd cmake-3.30.0
./bootstrap --prefix=/usr/local
make
sudo make install
cmake --version
```

- If the new cmake version is not found, add `/usr/local/bin` to your `PATH`:

```shell
vi ~/.bashrc
```

Add the line `PATH=/usr/local/bin:$PATH:$HOME/bin`, then reload the shell configuration:

```shell
source ~/.bashrc
```

**Download [CUDA Samples](https://github.com/NVIDIA/cuda-samples) (v12.1 used here)**

Either the tarball:

```shell
wget https://github.com/NVIDIA/cuda-samples/archive/refs/tags/v12.1.tar.gz
tar -zxvf v12.1.tar.gz
```

or the zip archive (run `make` from inside the extracted samples directory):

```shell
wget https://github.com/NVIDIA/cuda-samples/archive/refs/tags/v12.1.zip
unzip v12.1.zip
make
sudo make install
```

### CUDA for Ubuntu

- $ ubuntu-drivers devices
- $ sudo apt install nvidia-driver-xx
- Reboot!
- $ nvidia-smi (only for checking your NVIDIA driver)
- Visit the CUDA Zone to get the CUDA Toolkit
- $ sudo apt-get install build-essential (to get the GCC compilers)
- $ nvcc -V (you should now see the NVIDIA CUDA Compiler message)

### CUDA Tutorial

- In each section, build the project as shown below and run the generated executable.

```shell
mkdir build
cd build
cmake ..
make
./generated_execution_file
```

### This tutorial is structured as follows:

#### 1. `part1_cuda_kernel`: [Start CUDA programming](./part1_cuda_kernel/README.md) | [Certificate](https://www.inflearn.com/certificate/962884-329543-12987047)

- print hello cuda (on Ubuntu)
- memory copy
- add vectors using the CPU or CUDA
- error checking

#### 2. `part2_vector_addition`: [Study CUDA kernel launch](./part2_vector_addition/README.md) | [Certificate](https://www.inflearn.com/certificate/962884-329544-12987046)

- elapsed time
- CUDA kernel launch
- 1D vector addition (a minimal sketch combining parts 1 and 2 follows this list)
- giga-scale vector addition
- AXPY and FMA
- single precision
- linear interpolation
- threads and the GPU
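Since parts 1 and 2 revolve around memory copies, kernel launches, error checking, and elapsed-time measurement, here is a minimal sketch of those pieces put together. It is not taken from the course code; the kernel name `addKernel`, the `CUDA_CHECK` macro, and the launch configuration are illustrative choices of mine.

```cuda
// A minimal sketch (not the instructor's code) combining the part 1 / part 2
// topics: host/device memory copy, kernel launch, error checking, elapsed time.
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Simple error-check macro; the name CUDA_CHECK is an illustrative choice.
#define CUDA_CHECK(call)                                                      \
    do {                                                                      \
        cudaError_t err_ = (call);                                            \
        if (err_ != cudaSuccess) {                                            \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",                  \
                         cudaGetErrorString(err_), __FILE__, __LINE__);       \
            std::exit(EXIT_FAILURE);                                          \
        }                                                                     \
    } while (0)

// 1D vector addition: one thread per element.
__global__ void addKernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    CUDA_CHECK(cudaMalloc(&dA, n * sizeof(float)));
    CUDA_CHECK(cudaMalloc(&dB, n * sizeof(float)));
    CUDA_CHECK(cudaMalloc(&dC, n * sizeof(float)));
    CUDA_CHECK(cudaMemcpy(dA, a.data(), n * sizeof(float), cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(dB, b.data(), n * sizeof(float), cudaMemcpyHostToDevice));

    // Measure elapsed kernel time with CUDA events.
    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));
    CUDA_CHECK(cudaEventRecord(start));

    const int block = 256;
    const int grid = (n + block - 1) / block;
    addKernel<<<grid, block>>>(dA, dB, dC, n);
    CUDA_CHECK(cudaGetLastError());  // catch launch errors

    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaEventSynchronize(stop));
    float ms = 0.0f;
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));

    CUDA_CHECK(cudaMemcpy(c.data(), dC, n * sizeof(float), cudaMemcpyDeviceToHost));
    std::printf("c[0] = %.1f, kernel time = %.3f ms\n", c[0], ms);

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    CUDA_CHECK(cudaFree(dA));
    CUDA_CHECK(cudaFree(dB));
    CUDA_CHECK(cudaFree(dC));
    return 0;
}
```

Assuming the sketch is saved as `vector_add.cu`, it can be built and run with `nvcc vector_add.cu -o vector_add && ./vector_add`.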
#### 3. `part3_memory_structure`: [Memory Structure](./part3_memory_structure/README.md) | [Certificate](https://www.inflearn.com/certificate/962884-329600-12987048)

- the memory hierarchy
- CUDA's dedicated 2D memory allocation function and how to use pitched pointers
- using 3D matrices with pitched pointers
- the CUDA memory hierarchy
- computing differences between adjacent elements: using shared memory

#### 4. `part4_matrix_multiply`: [Matrix Multiply](./part4_matrix_multiply/README.md) | [Certificate](https://www.inflearn.com/certificate/962884-329601-12987045)

- matrix copy
- matrix transpose
- matrix multiplication
- GEMM: general matrix-to-matrix multiplication
- measuring CUDA variable access speed for each memory type
- improving precision and speed

#### 5. `part5_atomic_operation`: [Atomic Operation](./part5_atomic_operation/README.md) | [Certificate](https://www.inflearn.com/certificate/962884-329721-12987044)

- control flow
- how to optimize `if` statements and `for` loops
- when shared memory is used, `half-by-half` is slightly faster than `even-odd`!
- using atomic operations to solve race conditions
- computing a histogram with atomic operations (a minimal sketch appears at the end of this README)
- solutions to the reduction problem
- the GEMV operation

#### 6. `part6_search_sort`: [Search & Sort](./part6_search_sort/README.md) | [Certificate](https://www.inflearn.com/certificate/962884-329723-12987043)

- linear search
- search all: finding every matching position
  - in CUDA, the stride-based approach is the fastest
- binary search
  - binary search is not effective on CUDA
  - just use the CPU! The STL in particular is extremely fast.
- how to sort in CUDA, in earnest:
  - block-level parallel sorting
  - CUDA even-odd sort: much faster
  - when doing a parallel sort on global memory, even CUDA (even-odd) sort is quite slow
- bitonic sort
  - a sorting method designed for parallel processing
- counting merge sort
  - the large-scale parallel counting merge sort method, best suited to parallel processing

## Additional Comments

- All descriptions in these materials have been modified by me, Hyunkoo Kim.
- (c) 2024. hyunkookim.me@gmail.com. All rights reserved.
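### Appendix: histogram with atomic operations (sketch)

As referenced in part 5 above, the sketch below (not the instructor's code) shows how `atomicAdd` resolves the race condition when many threads increment the same histogram bin. The kernel name `histogramKernel`, the bin count, and the launch configuration are my own illustrative choices.

```cuda
// A minimal sketch (not the course code): a histogram built with atomicAdd so
// concurrent threads do not race on the same bin counter.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

constexpr int NUM_BINS = 256;

__global__ void histogramKernel(const unsigned char* data, int n, unsigned int* bins) {
    // Grid-stride loop so any grid size covers the whole input.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x) {
        atomicAdd(&bins[data[i]], 1u);  // atomic increment: avoids the race condition
    }
}

int main() {
    const int n = 1 << 22;
    unsigned char* hData = (unsigned char*)std::malloc(n);
    for (int i = 0; i < n; ++i) hData[i] = (unsigned char)(std::rand() % NUM_BINS);

    unsigned char* dData = nullptr;
    unsigned int* dBins = nullptr;
    cudaMalloc(&dData, n);
    cudaMalloc(&dBins, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(dData, hData, n, cudaMemcpyHostToDevice);
    cudaMemset(dBins, 0, NUM_BINS * sizeof(unsigned int));

    histogramKernel<<<256, 256>>>(dData, n, dBins);

    unsigned int hBins[NUM_BINS];
    cudaMemcpy(hBins, dBins, sizeof(hBins), cudaMemcpyDeviceToHost);
    std::printf("bin[0] = %u, bin[%d] = %u\n", hBins[0], NUM_BINS - 1, hBins[NUM_BINS - 1]);

    cudaFree(dData);
    cudaFree(dBins);
    std::free(hData);
    return 0;
}
```

Without the `atomicAdd`, a plain `bins[data[i]]++` would lose updates whenever two threads hit the same bin at the same time, which is exactly the race condition discussed in part 5.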