# bert.cpp

**Repository Path**: amikey/bert.cpp

## Basic Information

- **Project Name**: bert.cpp
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2023-11-29
- **Last Updated**: 2023-11-29

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# bert.cpp

[ggml](https://github.com/ggerganov/ggml) inference of BERT neural net architecture with pooling and normalization from [SentenceTransformers (sbert.net)](https://sbert.net/).
High quality sentence embeddings in pure C++ (with C API).

## Description
The main goal of `bert.cpp` is to run the BERT model using 4-bit integer quantization on CPU

* Plain C/C++ implementation without dependencies
* Inherit support for various architectures from ggml (x86 with AVX2, ARM, etc.)
* Choose your model size from 32/16/4 bits per model weigth
* all-MiniLM-L6-v2 with 4bit quantization is only 14MB. Inference RAM usage depends on the length of the input
* Sample cpp server over tcp socket and a python test client
* Benchmarks to validate correctness and speed of inference

## Limitations & TODO
* Tokenizer doesn't correctly handle asian writing (CJK, maybe others)
* bert.cpp doesn't respect tokenizer, pooling or normalization settings from the model card:
    * All inputs are lowercased and trimmed
    * All outputs are mean pooled and normalized
* Batching support is WIP. Lack of real batching means that this library is slower than it could be in usecases where you have multiple sentences

## Usage

### Checkout the ggml submodule
```sh
git submodule update --init --recursive
```
### Download models
```sh
pip3 install -r requirements.txt
# python3 models/download-ggml.py list_models
python3 models/download-ggml.py download all-MiniLM-L6-v2 q4_0
```
### Build
To build the dynamic library for usage from e.g. Python:
```sh
mkdir build
cd build
cmake .. -DBUILD_SHARED_LIBS=ON -DCMAKE_BUILD_TYPE=Release
make
cd ..
```

To build the native binaries, like the example server, with static libraries, run:
```sh
mkdir build
cd build
cmake .. -DBUILD_SHARED_LIBS=OFF -DCMAKE_BUILD_TYPE=Release
make
cd ..
```
### Run the python dynamic library example
```sh
python3 examples/sample_dylib.py models/all-MiniLM-L6-v2/ggml-model-f16.bin

# bert_load_from_file: loading model from '../models/all-MiniLM-L6-v2/ggml-model-f16.bin' - please wait ...
# bert_load_from_file: n_vocab = 30522
# bert_load_from_file: n_max_tokens   = 512
# bert_load_from_file: n_embd  = 384
# bert_load_from_file: n_intermediate  = 1536
# bert_load_from_file: n_head  = 12
# bert_load_from_file: n_layer = 6
# bert_load_from_file: f16     = 1
# bert_load_from_file: ggml ctx size =  43.12 MB
# bert_load_from_file: ............ done
# bert_load_from_file: model size =    43.10 MB / num tensors = 101
# bert_load_from_file: mem_per_token 450 KB
# Loading texts from sample_client_texts.txt...
# Loaded 1738 lines.
# Starting with a test query "Should I get health insurance?"
# Closest texts:
# 1. Can I sign up for Medicare Part B if I am working and have health insurance through an employer?
#  (similarity score: 0.4790)
# 2. Will my Medicare premiums be higher because of my higher income?
#  (similarity score: 0.4633)
# 3. Should I sign up for Medicare Part B if I have Veterans' Benefits?
#  (similarity score: 0.4208)
# Enter a text to find similar texts (enter 'q' to quit): poaching
# Closest texts:
# 1. The exotic animal trade is enormous , and it continues to spiral out of control .
#  (similarity score: 0.2825)
# 2. " PeopleSoft management entrenchment tactics continue to destroy the value of the company for its shareholders , " said Deborah Lilienthal , an Oracle spokeswoman .
#  (similarity score: 0.2709)
# 3. " I 've stopped looters , run political parties out of abandoned buildings , caught people with large amounts of cash and weapons , " Williams said .
#  (similarity score: 0.2672)
```

### Start sample server
```sh
./build/bin/server -m models/all-MiniLM-L6-v2/ggml-model-q4_0.bin --port 8085

# bert_model_load: loading model from 'models/all-MiniLM-L6-v2/ggml-model-q4_0.bin' - please wait ...
# bert_model_load: n_vocab = 30522
# bert_model_load: n_ctx   = 512
# bert_model_load: n_embd  = 384
# bert_model_load: n_intermediate  = 1536
# bert_model_load: n_head  = 12
# bert_model_load: n_layer = 6
# bert_model_load: f16     = 2
# bert_model_load: ggml ctx size =  13.57 MB
# bert_model_load: ............ done
# bert_model_load: model size =    13.55 MB / num tensors = 101
# Server running on port 8085 with 4 threads
# Waiting for a client
```
### Run sample client
```sh
python3 examples/sample_client.py 8085
# Loading texts from sample_client_texts.txt...
# Loaded 1738 lines.
# Starting with a test query "Should I get health insurance?"
# Closest texts:
# 1. Will my Medicare premiums be higher because of my higher income?
#  (similarity score: 0.4844)
# 2. Can I sign up for Medicare Part B if I am working and have health insurance through an employer?
#  (similarity score: 0.4575)
# 3. Should I sign up for Medicare Part B if I have Veterans' Benefits?
#  (similarity score: 0.4052)
# Enter a text to find similar texts (enter 'q' to quit): expensive
# Closest texts:
# 1. It is priced at $ 5,995 for an unlimited number of users tapping into the single processor , or $ 195 per user with a minimum of five users .
#  (similarity score: 0.4597)
# 2. The new system costs between $ 1.1 million and $ 22 million , depending on configuration .
#  (similarity score: 0.4547)
# 3. Each hull will cost about $ 1.4 billion , with each fully outfitted submarine costing about $ 2.2 billion , Young said .
#  (similarity score: 0.4078)
```

### Converting models to ggml format
Converting models is similar to llama.cpp. Use models/convert-to-ggml.py to make hf models into either f32 or f16 ggml models. Then use ./build/bin/quantize to turn those into Q4_0, 4bit per weight models.

There is also models/run_conversions.sh which creates all 4 versions (f32, f16, Q4_0, Q4_1) at once.
```sh
cd models
# Clone a model from hf
git clone https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1
# Run conversions to 4 ggml formats (f32, f16, Q4_0, Q4_1)
sh run_conversions.sh multi-qa-MiniLM-L6-cos-v1
```

## Benchmarks
Running MTEB (Massive Text Embedding Benchmark) with bert.cpp vs. [sbert](https://sbert.net/)(cpu mode) gives comparable results between the two, with quantization having minimal effect on accuracy and eval time being similar or better than sbert with batch_size=1 (bert.cpp doesn't support batching).

See [benchmarks](benchmarks) more info.
### all-MiniLM-L6-v2
| Data Type | STSBenchmark | eval time | EmotionClassification | eval time | 
|-----------|-----------|------------|-----------|------------|
| f32 | 0.8201 | 6.83 | 0.4082 | 11.34 | 
| f16 | 0.8201 | 6.17 | 0.4085 | 10.28 | 
| q4_0 | 0.8175 | 5.45 | 0.3911 | 10.63 | 
| q4_1 | 0.8223 | 6.79 | 0.4027 | 11.41 | 
| sbert | 0.8203 | 2.74 | 0.4085 | 5.56 | 
| sbert-batchless | 0.8203 | 13.10 | 0.4085 | 15.52 | 

### all-MiniLM-L12-v2
| Data Type | STSBenchmark | eval time | EmotionClassification | eval time | 
|-----------|-----------|------------|-----------|------------|
| f32 | 0.8306 | 13.36 | 0.4117 | 21.23 | 
| f16 | 0.8306 | 11.51 | 0.4119 | 20.08 | 
| q4_0 | 0.8310 | 11.27 | 0.4183 | 20.81 | 
| q4_1 | 0.8325 | 12.37 | 0.4093 | 19.38 | 
| sbert | 0.8309 | 5.11 | 0.4117 | 8.93 | 
| sbert-batchless | 0.8309 | 22.81 | 0.4117 | 28.04 | 

### bert-base-uncased
bert-base-uncased is not a very good sentence embeddings model, but it's here to show that bert.cpp correctly runs models that are not from SentenceTransformers. Technically any hf model with architecture `BertModel` or `BertForMaskedLM` should work.

| Data Type | STSBenchmark | eval time | EmotionClassification | eval time | 
|-----------|-----------|------------|-----------|------------|
| f32 | 0.4738 | 52.38 | 0.3361 | 88.56 | 
| f16 | 0.4739 | 33.24 | 0.3361 | 55.86 | 
| q4_0 | 0.4940 | 33.93 | 0.3375 | 57.82 | 
| q4_1 | 0.4612 | 36.86 | 0.3318 | 59.63 | 
| sbert | 0.4729 | 16.97 | 0.3527 | 28.77 | 
| sbert-batchless | 0.4729 | 69.97 | 0.3526 | 79.02 |