# textmatch

**Repository Path**: redauzhang/textmatch

## Basic Information

- **Project Name**: textmatch
- **Description**: textmatch
- **License**: MIT
- **Default Branch**: master
- **Created**: 2023-03-31
- **Last Updated**: 2023-04-02

## README

# sentence-similarity

This project experiments with and compares four methods for computing sentence/text similarity: cosine, cosine + IDF, BM25, and Jaccard.

The experiments reuse the previously crawled medical corpus.
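To make the comparison more concrete, here is a minimal sketch of the two simplest measures, Jaccard and plain cosine. It assumes the sentences have already been segmented into tokens (in this project that would be done by a segmenter such as jieba) and that cosine is computed over raw term-frequency vectors. The function names and the sample sentences are hypothetical and are not taken from the repository or the medical corpus.

```python
import math
from collections import Counter


def jaccard(tokens_a, tokens_b):
    """Jaccard similarity: |A & B| / |A | B| over the sets of tokens."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def cosine(tokens_a, tokens_b):
    """Cosine similarity between raw term-frequency vectors."""
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)
    # Only terms present in both sentences contribute to the dot product.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical, already-segmented sentences (illustration only).
query = ["headache", "fever", "treatment"]
candidate = ["fever", "headache", "which", "medicine"]

print(jaccard(query, candidate))
print(cosine(query, candidate))
```

Jaccard ignores how often a term occurs, while cosine takes term counts into account, so the two can rank the same candidates differently.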
## 1 Environment

- python3
- gensim
- jieba
- scipy
- numpy

**2023/3/26 Update: planning to replace jieba with thulac**

- https://github.com/thunlp/THULAC-Python

## 2 Algorithm Principles

![cosine](./docs/images/cosine.png)
![idf](./docs/images/idf.png)
![bm25](./docs/images/bm25.png)
![jaccard](./docs/images/jaccard.png)
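The formulas themselves live in the images above. As a rough illustration of the two corpus-dependent measures, the sketch below assumes common textbook forms: a smoothed IDF of `log((N + 1) / (df + 1)) + 1`, and Okapi BM25 with `k1 = 1.5` and `b = 0.75`. These variants and parameter values are assumptions and may differ from what the images and the repository code use. The function names and the tiny corpus are hypothetical.

```python
import math
from collections import Counter


def build_idf(corpus):
    """Smoothed IDF per term: log((N + 1) / (df + 1)) + 1 (assumed variant)."""
    n_docs = len(corpus)
    df = Counter(term for doc in corpus for term in set(doc))
    return {t: math.log((n_docs + 1) / (f + 1)) + 1.0 for t, f in df.items()}


def idf_cosine(tokens_a, tokens_b, idf):
    """Cosine similarity between TF * IDF vectors.

    Assumption: terms missing from the IDF table get weight 0 and are ignored.
    """
    vec_a = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokens_a).items()}
    vec_b = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokens_b).items()}
    dot = sum(w * vec_b[t] for t, w in vec_a.items() if t in vec_b)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Okapi BM25 score of `query` against each document of a non-empty corpus."""
    n_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n_docs
    df = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            denom = tf[term] + k1 * (1.0 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1.0) / denom
        scores.append(score)
    return scores


# Hypothetical, already-segmented documents (illustration only).
corpus = [
    ["fever", "headache", "medicine"],
    ["cough", "medicine", "dosage"],
    ["knee", "pain", "exercise"],
]
query = ["fever", "medicine"]

idf = build_idf(corpus)
print([idf_cosine(query, doc, idf) for doc in corpus])
print(bm25_scores(query, corpus))
```

In this toy corpus both measures rank the first document highest, because it contains both query terms and "fever" is rarer than "medicine", so it carries a larger IDF weight.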
## 3 Running Steps

- For the document retrieval process, see [./textmatch/train_model/train_model.py](./textmatch/train_model/train_model.py)

## Afterword

- Reference: [textmatch](https://github.com/MachineLP/TextMatch)
- Word segmentation reference: [THULAC-Python](https://github.com/thunlp/THULAC-Python)
- [chatglm-6b-int4](https://huggingface.co/THUDM/chatglm-6b-int4/tree/main)

### Error Log

- [protobuf failure caused by a duplicate package name](https://stackoverflow.com/questions/50839667/protofile-proto-a-file-with-this-name-is-already-in-the-pool)