# textmatch

**Repository Path**: redauzhang/textmatch

## Basic Information

- **Project Name**: textmatch
- **Description**: textmatch
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2023-03-31
- **Last Updated**: 2023-04-02

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# sentence-similarity
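Experiments with, and a comparison of, four methods for computing sentence/text similarity.<br>
The four methods are cosine, cosine+idf, bm25, and jaccard.<br>
The experiments again use the medical corpus that was crawled previously.<br>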

## 1 Environment

python3<br>
gensim<br>
jieba<br>
scipy<br>
numpy<br>

**2023/3/26 Update: planning to replace jieba with thulac**
- https://github.com/thunlp/THULAC-Python
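
Since the segmenter may change, one option is to hide it behind a single function so the similarity code never imports jieba or thulac directly. The following is a minimal sketch, not code from this repository: `make_tokenizer` and its `prefer` parameter are hypothetical names, and the whitespace fallback is only there so the snippet runs without either library installed.

```python
def make_tokenizer(prefer="thulac"):
    """Return a function mapping a sentence to a list of tokens.

    Hypothetical helper: tries the preferred segmenter first, then jieba,
    then plain whitespace splitting as a last resort.
    """
    if prefer == "thulac":
        try:
            import thulac
            seg = thulac.thulac(seg_only=True)  # segmentation only, no POS tags
            # cut(..., text=True) returns one space-separated string
            return lambda s: seg.cut(s, text=True).split()
        except ImportError:
            pass
    try:
        import jieba
        # jieba.lcut keeps whitespace tokens, so drop them
        return lambda s: [t for t in jieba.lcut(s) if t.strip()]
    except ImportError:
        return lambda s: s.split()


tokenize = make_tokenizer(prefer="jieba")
tokens = tokenize("感冒 发烧")
```

With this shape, switching segmenters is a one-argument change, and the rest of the pipeline only ever sees lists of tokens.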

## 2 Algorithm Principles

![image](./docs/images/cosine.png)<br>
![image](./docs/images/idf.png)<br>
![image](./docs/images/bm25.png)<br>
![image](./docs/images/jaccard.png)<br>
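
For readers who cannot view the images, the four measures can be sketched in plain Python over pre-tokenized sentences. This is an illustrative sketch using common textbook formulations, which may differ in detail from the formulas in the images and from the repository's own code. The function names, the smoothed IDF, the BM25 defaults `k1=1.5` and `b=0.75`, and the toy corpus are all assumptions.

```python
import math
from collections import Counter


def idf_table(corpus):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(corpus)
    df = Counter(t for doc in corpus for t in set(doc))
    return {t: math.log((n + 1) / (d + 1)) + 1.0 for t, d in df.items()}


def cosine(a, b, weights=None):
    """Cosine similarity of term-count vectors.

    If `weights` (e.g. an IDF table) is given, each count is scaled by the
    term's weight; terms missing from the table fall back to 1.0, which is
    an arbitrary choice made for this sketch.
    """
    ca, cb = Counter(a), Counter(b)
    w = (lambda t: weights.get(t, 1.0)) if weights else (lambda t: 1.0)
    dot = sum(ca[t] * cb[t] * w(t) ** 2 for t in ca.keys() & cb.keys())
    na = math.sqrt(sum((c * w(t)) ** 2 for t, c in ca.items()))
    nb = math.sqrt(sum((c * w(t)) ** 2 for t, c in cb.items()))
    return dot / (na * nb) if na and nb else 0.0


def jaccard(a, b):
    """Size of the token-set intersection divided by the size of the union."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def bm25(query, doc, corpus, k1=1.5, b=0.75):
    """BM25 score of `doc` for `query`, using a non-negative IDF variant."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    tf = Counter(doc)
    score = 0.0
    for t in set(query):  # each distinct query term counted once
        df = sum(1 for d in corpus if t in d)
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)
        norm = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[t] * (k1 + 1) / norm
    return score


# Toy, already-tokenized corpus (made up for illustration).
corpus = [
    ["fever", "cough", "headache"],
    ["cough", "sore", "throat"],
    ["skin", "rash"],
]
query = ["fever", "cough"]
idf = idf_table(corpus)

for doc in corpus:
    print(
        round(cosine(query, doc), 3),
        round(cosine(query, doc, weights=idf), 3),
        round(bm25(query, doc, corpus), 3),
        round(jaccard(query, doc), 3),
    )
```

Note that cosine, cosine+idf, and jaccard are symmetric and bounded by 1, while BM25 is an unbounded, asymmetric relevance score of a document for a query, so its values are only comparable across documents for the same query.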

## 3 Running Steps
- For the document retrieval process, see [./textmatch/train_model/train_model.py](./textmatch/train_model/train_model.py)
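
As a rough picture of what a retrieval step looks like in general, the sketch below scores every corpus entry against a query and keeps the best matches. It does not reproduce `train_model.py`: the `retrieve` function, its parameters, the choice of Jaccard as the default scorer, and the sample data are all assumptions made for illustration.

```python
import heapq


def jaccard(a, b):
    """Token-set overlap: |A ∩ B| / |A ∪ B|."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def retrieve(query_tokens, corpus_tokens, score=jaccard, top_k=3):
    """Return up to `top_k` (score, index) pairs, highest score first."""
    scored = ((score(query_tokens, doc), i) for i, doc in enumerate(corpus_tokens))
    return heapq.nlargest(top_k, scored)


# Made-up, already-tokenized corpus.
corpus = [
    ["fever", "cough", "headache"],
    ["cough", "sore", "throat"],
    ["skin", "rash"],
]
results = retrieve(["fever", "cough"], corpus, top_k=2)
for s, i in results:
    print(i, round(s, 3), corpus[i])
```

Any function taking two token lists and returning a number can be passed as `score`, so the same loop works for comparing the four methods on one query.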


## Postscript
- Reference: [textmatch](https://github.com/MachineLP/TextMatch)
- Word segmentation reference: [THULAC-Python](https://github.com/thunlp/THULAC-Python)
- [chatglm-6b-int4](https://huggingface.co/THUDM/chatglm-6b-int4/tree/main)


### Error Log
- [protobuf failure caused by a duplicate package name](https://stackoverflow.com/questions/50839667/protofile-proto-a-file-with-this-name-is-already-in-the-pool)