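diff --git a/DPDLDA/README.md b/DPDLDA/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..9e5964f7d7fee26064974c1aada8fd45f65babd5
--- /dev/null
+++ b/DPDLDA/README.md
@@ -0,0 +1,98 @@
+# Dataset Overview
+
+This project uses version 1.0 of the [Chinese Biomedical Language Understanding Evaluation (CBLUE)](https://github.com/CBLUEbenchmark/CBLUE) dataset, the first multi-task benchmark in China for Chinese medical text processing. It covers 8 subtasks in 5 categories: medical information extraction (entity recognition, relation extraction), medical term normalization, medical text classification, medical sentence relation judgment, and medical question answering. The data come from a wide range of sources, including medical textbooks, electronic medical records, clinical trial registrations, and real queries from internet users. The benchmark has drawn broad attention from academia and industry since its release and has gradually become a "gold standard" for evaluating how well AI systems process Chinese medical information.
+
+* CMeEE: Chinese medical named entity recognition
+* CMeIE: Chinese medical entity relation extraction
+* CHIP-CDN: clinical terminology normalization
+* CHIP-CTC: short-text classification of clinical trial screening criteria
+* CHIP-STS: Ping An Healthcare Technology disease question-answering transfer learning
+* KUAKE-QIC: intent classification of medical search queries
+* KUAKE-QTR: medical search query-page title relevance
+* KUAKE-QQR: medical search query-query relevance
+
+For more information, see the CBLUE [GitHub page](https://github.com/CBLUEbenchmark/CBLUE/blob/main/README_ZH.md).
+
+## Model Overview
+
+The overall architecture is similar to ELECTRA and consists of a generator and a discriminator. Fine-tuning uses only the discriminator, which is made up of 12 Transformer layers.
+
+## Quick Start
+
+### Code Structure
+
+The main code structure of this project is as follows:
+
+```text
+├── train_classification.py # Training and evaluation for text classification tasks
+├── model.py # Model structure definitions
+├── utils.py # Data processing pipeline
+├── export_model.py # Export a dynamic graph model to static graph parameters
+└── README.md
+```
+
+Code for using the trained model is in the deploy/predictor folder.
+
+### Installing Dependencies
+
+```shell
+pip install xlrd==1.2.0
+```
+
+### Model Training
+
+Tasks are grouped by category, and parameter settings are provided for all 8 tasks. Run the commands below to train on the training set and evaluate on the **validation set**.
+
+**Training setup and results**
+
+| Task | epochs | batch_size | learning_rate | max_seq_length | metric | results | results (fp16) |
+| --------- | :----: | :--------: | :-----------: | :------------: | :------: | :-----: | :------------: |
+| CHIP-STS | 4 | 16 | 3e-5 | 96 | Macro-F1 | 0.88749 | 0.88555 |
+| CHIP-CTC | 4 | 32 | 6e-5 | 160 | Macro-F1 | 0.84136 | 0.83514 |
+| CHIP-CDN | 16 | 256 | 3e-5 | 32 | F1 | 0.76979 | 0.76489 |
+| KUAKE-QQR | 2 | 32 | 6e-5 | 64 | Accuracy | 0.83865 | 0.84053 |
+| KUAKE-QTR | 4 | 32 | 6e-5 | 64 | Accuracy | 0.69722 | 0.69722 |
+| KUAKE-QIC | 4 | 32 | 6e-5 | 128 | Accuracy | 0.81483 | 0.82046 |
+| CMeEE | 2 | 32 | 6e-5 | 128 | Micro-F1 | 0.66120 | 0.66026 |
+| CMeIE | 100 | 12 | 6e-5 | 300 | Micro-F1 | 0.61385 | 0.60076 |
+
+Configurable parameters:
+
+* `save_dir`: Optional. Directory for saving trained models; defaults to `./checkpoint`.
+* `max_seq_length`: Optional. Maximum sequence length used by the ELECTRA model, at most 512. If you run out of GPU memory, lower this value; defaults to 128.
+* `batch_size`: Optional. Batch size; adjust it to the available GPU memory and lower it if you run out of memory; defaults to 32.
+* `learning_rate`: Optional. Peak learning rate for fine-tuning; defaults to 6e-5.
+* `weight_decay`: Optional. Controls the strength of the regularization term to prevent overfitting; defaults to 0.01.
+* `epochs`: Number of training epochs; defaults to 3.
+* `max_steps`: Maximum number of training steps. If training for `epochs` epochs would take more steps than this value, training stops early once `max_steps` is reached.
+* `valid_steps`: Number of steps between evaluations; defaults to 20.
+* `save_steps`: Number of steps between checkpoint saves; defaults to 20.
+* `logging_steps`: Number of steps between log messages; defaults to 10.
+* `warmup_proportion`: Optional. Proportion of training used for learning rate warmup. With 0.1, the learning rate rises gradually from 0 to learning_rate over the first 10% of training steps and then decays slowly; defaults to 0.1.
+* `init_from_ckpt`: Optional. Path to model parameters for resuming training; defaults to None.
+* `seed`: Optional. Random seed; defaults to 1000.
+* `device`: Device used for training; one of cpu, gpu, xpu, or npu. When training on gpu, the `gpus` argument specifies the GPU card IDs.
+* `use_amp`: Whether to use mixed precision training; defaults to False.
+
+
+#### Medical Text Classification Tasks
+
+```shell
+$ unset CUDA_VISIBLE_DEVICES
+$ python -m paddle.distributed.launch --gpus '0,1,2,3' train_classification.py --dataset CHIP-CDN-2C --batch_size 256 --max_seq_length 32 --learning_rate 3e-5 --epochs 16
+```
+
+Other configurable parameters:
+
+* `dataset`: Optional. One of CHIP-CDN-2C, CHIP-CTC, CHIP-STS, KUAKE-QIC, KUAKE-QTR, KUAKE-QQR; defaults to CHIP-STS.
+
+### Exporting a Static Graph Model
+
+After training with the dynamic graph, you can export the dynamic graph parameters to static graph parameters for deployment and inference; see export_model.py for the code. The static graph parameters are saved to the path given by `output_path`.
+
+How to run:
+```shell
+python export_model.py --train_dataset CHIP-CDN-2C --params_path=./checkpoint/model_900/ --output_path=./export
+```
+
+**NOTICE**: For classification tasks, set train_dataset to the name of the training dataset, and set params_path to the path of the best-performing model checkpoint.
diff --git a/DPDLDA/deploy/predictor/README.md b/DPDLDA/deploy/predictor/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..a24e5a71711f23a7fc4d6edb30c5f6824e6f0dc5
--- /dev/null
+++ b/DPDLDA/deploy/predictor/README.md
@@ -0,0 +1,100 @@
+# ONNXRuntime Inference Deployment Guide
+
+This example uses a model fine-tuned on the CBLUE dataset and provides deployment code for text classification tasks; it can serve as a reference for custom datasets.
+Before deployment, the fine-tuned dynamic graph model must be converted and exported as a static graph; see the static graph model export section for detailed steps.
+
+The main code structure of this part is as follows:
+
+```text
+├── infer_classification.py # Parameter settings for model inference
+├── predictor.py # 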
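Inference pipeline
+└── README.md
+```
+
+## Environment Setup
+
+ONNX model conversion and inference deployment depend on Paddle2ONNX and ONNXRuntime. Paddle2ONNX converts Paddle static graph models to the ONNX model format.
+
+#### GPU
+First make sure that the NVIDIA drivers and base software are installed correctly, with CUDA >= 11.2 and CuDNN >= 8.2, then install the required dependencies with the following command:
+```
+python -m pip install -r requirements_gpu.txt
+```
+\* To deploy with half precision (FP16), make sure the GPU's CUDA Compute Capability is greater than 7.0.
+
+#### CPU
+Install the required dependencies with the following command:
+```
+python -m pip install -r requirements_cpu.txt
+```
+## GPU Deployment and Inference Example
+
+Use the following command to deploy on GPU. Use `use_fp16` to enable **half-precision inference acceleration** and `device_id` to **specify the GPU card ID**.
+
+- Text classification task
+
+```
+python infer_classification.py --device gpu --device_id 0 --dataset KUAKE-QIC --model_path_prefix ../../export/inference
+```
+
+Configurable parameters:
+
+* `model_path_prefix`: Required. Path prefix of the model to run inference with.
+* `model_name_or_path`: Pretrained model to use; defaults to "ernie-health-chinese".
+* `dataset`: The CBLUE training dataset.
+    * `Text classification tasks`: KUAKE-QIC, KUAKE-QQR, KUAKE-QTR, CHIP-CTC, CHIP-STS, CHIP-CDN-2C; defaults to CHIP-STS.
+* `max_seq_length`: Maximum sequence length used by the model, at most 512; defaults to 300 for `relation extraction tasks` and 128 otherwise.
+* `use_fp16`: Whether to enable FP16 acceleration; only takes effect when `device=gpu`; disabled by default.
+* `batch_size`: Batch size; adjust it to the available GPU memory and lower it if you run out of memory; defaults to 32.
+* `device`: Device used for inference; cpu or gpu; defaults to cpu.
+* `device_id`: GPU card ID; defaults to 0.
+* `data_file`: Local data file to predict on; defaults to None.
+
+#### Loading a Local Dataset
+To use a local dataset, specify the local file to predict on with `data_file`, one example per line. For single-text input, put one sentence per line; for text-pair input, separate the two texts with a `\t` delimiter. For example:
+
+**ctc-data.txt**
+```
+在过去的6个月曾服用偏头痛预防性药物或长期服用镇痛药物者,以及有酒精依赖或药物滥用习惯者;
+患有严重的冠心病、脑卒中,以及传染性疾病、精神疾病者;
+活动性乙肝(包括大三阳或小三阳)或血清学指标(HBsAg或/和HBeAg或/和HBcAb)阳性者,丙肝、肺结核、巨细胞病毒、严重真菌感染或HIV感染;
+...
+```
+
+## CPU Deployment and Inference Example
+
+Use the following command to deploy on CPU. Use `num_threads` to **adjust the number of prediction threads**.
+
+- Text classification task
+
+```
+python infer_classification.py --device cpu --dataset KUAKE-QIC --model_path_prefix ../../export/inference
+```
+
+Configurable parameters:
+
+* `model_path_prefix`: Required. Path prefix of the model to run inference with.
+* `model_name_or_path`: Pretrained model to use; defaults to "ernie-health-chinese".
+* `dataset`: The CBLUE training dataset.
+    * `Text classification tasks`: KUAKE-QIC, KUAKE-QQR, KUAKE-QTR, CHIP-CTC, CHIP-STS, CHIP-CDN-2C; defaults to CHIP-STS.
+* `max_seq_length`: Maximum sequence length used by the model, at most 512; defaults to 300 for `relation extraction tasks` and 128 otherwise.
+* `batch_size`: Batch size; adjust it to the available memory and lower it if you run out of memory; defaults to 32.
+* `device`: Device used for inference; cpu or gpu; defaults to cpu.
+* `num_threads`: Number of CPU threads; has little effect when `device=gpu`; defaults to the number of physical CPU cores.
+* `data_file`: Local data file to predict on; see [GPU Deployment and Inference Example](#loading-a-local-dataset) for the format; defaults to None.
+
+## Performance and Accuracy Benchmarks
+
+This section reports prediction performance and accuracy on the CBLUE dataset for reference.
+The results below were measured on CPU.
+
+| Dataset | Max text length | Metric | FP32 score | FP32 latency (ms) |
+| ---------- | ------------ | ------------ | ---------- | ---------------- |
+| KUAKE-QIC | 128 | Accuracy | 0.8046 | 37.72 |
+| KUAKE-QTR | 64 | Accuracy | 0.6886 | 18.40 |
+| KUAKE-QQR | 64 | Accuracy | 0.7755 | 10.34 |
+| CHIP-CTC | 160 | Macro F1 | 0.8445 | 47.43 |
+| CHIP-STS | 96 | Macro F1 | 0.8892 | 27.67 |
+| CHIP-CDN-2C | 256 | Micro F1 | 0.8921 | 26.86 |
+| CMeEE | 128 | Micro F1 | 0.6469 | 37.59 |
+| CMeIE | 300 | Micro F1 | 0.5902 | 213.04 |
diff --git a/DPDLDA/deploy/predictor/infer_classification.py b/DPDLDA/deploy/predictor/infer_classification.py
new file mode 100644
index 0000000000000000000000000000000000000000..b2ac865b261cf3d23693fb45ed0db1d62bc0b9aa
--- /dev/null
+++ b/DPDLDA/deploy/predictor/infer_classification.py
@@ -0,0 +1,153 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.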
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+
+import psutil
+from predictor import CLSPredictor
+
+from paddlenlp.utils.log import logger
+
+
+def parse_args():
+    parser = argparse.ArgumentParser()
+    # Previous definition, kept for reference (disabled):
+    # parser.add_argument(
+    #     "--model_path_prefix", default="D:\PyCharmFile\PaddleNLP-develop\model_zoo\ernie-health\cblue\checkpoint\model_1", type=str, required=True, help="The path prefix of inference model to be used."
+    # )
+    # The default below is a machine-specific absolute path; pass --model_path_prefix explicitly elsewhere.
+    parser.add_argument(
+        "--model_path_prefix",
+        default="D:/PyCharmFile/PaddleNLP-develop/model_zoo/ernie-health/cblue/checkpoint/model_train/", type=str,
+        help="The path prefix of inference model to be used."
+    )
+    parser.add_argument(
+        "--model_name_or_path", default="ernie-health-chinese", type=str, help="The directory or name of model."
+    )
+    parser.add_argument("--dataset", default="CHIP-STS", type=str, help="Dataset for text classification.")
+    parser.add_argument("--data_file", default=None, type=str, help="The data to predict with one sample per line.")
+    parser.add_argument(
+        "--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization."
+    )
+    parser.add_argument(
+        "--use_fp16",
+        action="store_true",
+        help="Whether to use fp16 inference, only takes effect when deploying on gpu.",
+    )
+    parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for predicting.")
+    parser.add_argument(
+        "--num_threads", default=psutil.cpu_count(logical=False), type=int, help="num_threads for cpu."
+    )
+    parser.add_argument(
+        "--device", choices=["cpu", "gpu"], default="cpu", help="Select which device to run inference on, defaults to cpu."
+    )
+    parser.add_argument("--device_id", default=0, type=int, help="Select which gpu device to run inference on.")
+    args = parser.parse_args()
+    return args
+
+
+LABEL_LIST = {
+    "kuake-qic": ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议", "疾病表述", "后果表述", "注意事项", "功效作用", "医疗费用", "其他"],
+    "kuake-qtr": ["完全不匹配", "很少匹配,有一些参考价值", "部分匹配", "完全匹配"],
+    "kuake-qqr": ["B为A的语义父集,B指代范围大于A; 或者A与B语义毫无关联。", "B为A的语义子集,B指代范围小于A。", "表示A与B等价,表述完全一致。"],
+    "chip-ctc": [
+        "成瘾行为",
+        "居住情况",
+        "年龄",
+        "酒精使用",
+        "过敏耐受",
+        "睡眠",
+        "献血",
+        "能力",
+        "依存性",
+        "知情同意",
+        "数据可及性",
+        "设备",
+        "诊断",
+        "饮食",
+        "残疾群体",
+        "疾病",
+        "教育情况",
+        "病例来源",
+        "参与其它试验",
+        "伦理审查",
+        "种族",
+        "锻炼",
+        "性别",
+        "健康群体",
+        "实验室检查",
+        "预期寿命",
+        "读写能力",
+        "含有多类别的语句",
+        "肿瘤进展",
+        "疾病分期",
+        "护理",
+        "口腔相关",
+        "器官组织状态",
+        "药物",
+        "怀孕相关",
+        "受体状态",
+        "研究者决定",
+        "风险评估",
+        "性取向",
+        "体征(医生检测)",
+        " 吸烟状况",
+        "特殊病人特征",
+        "症状(患者感受)",
+        "治疗或手术",
+    ],
+    "chip-sts": ["语义不同", "语义相同"],
+    "chip-cdn-2c": ["否", "是"],
+}
+
+TEXT = {
+    "kuake-qic": ["心肌缺血如何治疗与调养呢?", "什么叫痔核脱出?什么叫外痔?"],
+    "kuake-qtr": [["儿童远视眼怎么恢复视力", "近视眼该如何保养才能恢复一些视力"], ["维生素的药有哪些", "抗生素类的药物都有哪些?"],["儿童远视眼怎么恢复视力", "抗生素类的药物都有哪些?"]],
+    "kuake-qqr": [["茴香是发物吗", "茴香怎么吃?"], ["气的胃疼是怎么回事", "气到胃痛是什么原因"]],
+    "chip-ctc": ["(1)前牙结构发育不良:釉质发育不全、氟斑牙、四环素牙等;", "怀疑或确有酒精或药物滥用史;"],
+    "chip-sts": [["糖尿病能吃减肥药吗?能治愈吗?", "糖尿病为什么不能吃减肥药"], ["H型高血压的定义", "WHO对高血压的最新分类定义标准数值"],["糖尿病能吃减肥药吗?能治愈吗?", "WHO对高血压的最新分类定义标准数值"]],
+    "chip-cdn-2c": [["1型糖尿病性植物神经病变", " 1型糖尿病肾病IV期"], ["髂腰肌囊性占位", "髂肌囊肿"],["1型糖尿病性植物神经病变", "髂肌囊肿"]],
+}
+
+METRIC = {
+    "kuake-qic": "acc",
+    "kuake-qtr": "acc",
+    "kuake-qqr": "acc",
+    "chip-ctc": "macro",
+    "chip-sts": "macro",
+    "chip-cdn-2c": "macro",
+}
+
+
+def main():
+    args = parse_args()
+
+    for arg_name, arg_value in vars(args).items():
+        logger.info("{:20}: {}".format(arg_name, arg_value))
+
+    args.dataset = args.dataset.lower()
+    label_list = LABEL_LIST[args.dataset]
+    if args.data_file is not None:
+        with open(args.data_file, "r", encoding="utf-8") as fp:
+            input_data = [x.strip().split("\t") for x in fp.readlines()]
+            input_data = [x[0] if len(x) == 1 else x for x in input_data]
+    else:
+        input_data = TEXT[args.dataset]
+
+    predictor = CLSPredictor(args, label_list)
+    predictor.predict(input_data)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/DPDLDA/deploy/predictor/predictor.py b/DPDLDA/deploy/predictor/predictor.py
new file mode 100644
index 0000000000000000000000000000000000000000..75c8f778731b028b86d0cdc8af495eaa4b2af8be
--- /dev/null
+++ b/DPDLDA/deploy/predictor/predictor.py
@@ -0,0 +1,366 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import time
+
+import numpy as np
+import onnxruntime as ort
+import paddle2onnx
+import six
+
+from paddlenlp.transformers import (
+    AutoTokenizer,
+    normalize_chars,
+    tokenize_special_chars,
+)
+from paddlenlp.utils.log import logger
+
+
+class InferBackend(object):
+    def __init__(self, model_path_prefix, device="cpu", device_id=0, use_fp16=False, num_threads=10):
+
+        if not isinstance(device, six.string_types):
+            logger.error(
+                ">>> [InferBackend] The type of device must be string, but the type you set is: {}".format(type(device))
+            )
+            exit(1)
+        if device not in ["cpu", "gpu"]:
+            logger.error(">>> [InferBackend] The device must be cpu or gpu, but your device is set to: {}".format(device))
+            exit(1)
+
+        logger.info(">>> [InferBackend] Creating Engine ...")
+
+        onnx_model = paddle2onnx.command.c_paddle_to_onnx(
+            model_file=model_path_prefix + "model.pdmodel",
+            params_file=model_path_prefix + "model.pdiparams",
+            opset_version=13,
+            enable_onnx_checker=True,
+        )
+
+        infer_model_dir = model_path_prefix.rsplit("/", 1)[0]
+        float_onnx_file = os.path.join(infer_model_dir, "model.onnx")
+        '''
+        #infer_model_dir = model_path_prefix.rsplit("/", 1)[0]
+        float_onnx_file = os.path.join(model_path_prefix, ".onnx")
+        '''
+        with open(float_onnx_file, "wb") as f:
+            f.write(onnx_model)
+
+        if device == "gpu":
+            logger.info(">>> [InferBackend] Use GPU to inference ...")
+            providers = ["CUDAExecutionProvider"]
+            if use_fp16:
+                logger.info(">>> [InferBackend] Use FP16 to inference ...")
+                import onnx
+                from onnxconverter_common import float16
+
+                fp16_model_file = os.path.join(infer_model_dir, "fp16_model.onnx")
+                onnx_model = onnx.load_model(float_onnx_file)
+                trans_model = float16.convert_float_to_float16(onnx_model, keep_io_types=True)
+                onnx.save_model(trans_model, fp16_model_file)
+                onnx_model = fp16_model_file
+        else:
+            logger.info(">>> [InferBackend] Use CPU to inference ...")
+            providers = ["CPUExecutionProvider"]
+            if use_fp16:
+                logger.warning(
+                    ">>> [InferBackend] Ignore use_fp16 as it only " + "takes effect when deploying on gpu..."
+                )
+
+        sess_options = ort.SessionOptions()
+        sess_options.intra_op_num_threads = num_threads
+        self.predictor = ort.InferenceSession(
+            onnx_model, sess_options=sess_options, providers=providers, provider_options=[{"device_id": device_id}]
+        )
+
+        self.input_handles = [
+            self.predictor.get_inputs()[0].name,
+            self.predictor.get_inputs()[1].name,
+        ]
+
+        if device == "gpu":
+            try:
+                assert "CUDAExecutionProvider" in self.predictor.get_providers()
+            except AssertionError:
+                raise AssertionError(
+                    """The environment for GPU inference is not set properly. \nA possible cause is that you had installed both onnxruntime and onnxruntime-gpu. \nPlease run the following commands to reinstall: \n1) pip uninstall -y onnxruntime onnxruntime-gpu \n2) pip install onnxruntime-gpu"""
+                )
+        logger.info(">>> [InferBackend] Engine Created ...")
+
+    def infer(self, input_dict: dict):
+        input_dict = {k: v for k, v in input_dict.items() if k in self.input_handles}
+        result = self.predictor.run(None, input_dict)
+        return result
+
+
+class EHealthPredictor(object):
+    def __init__(self, args, label_list):
+        self.label_list = label_list
+        self._tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, use_fast=True)
+        self._max_seq_length = args.max_seq_length
+        self._batch_size = args.batch_size
+        self.inference_backend = InferBackend(
+            args.model_path_prefix, args.device, args.device_id, args.use_fp16, args.num_threads
+        )
+
+    def predict(self, input_data: list):
+        encoded_inputs = self.preprocess(input_data)
+        infer_result = self.infer_batch(encoded_inputs)
+        result = self.postprocess(infer_result)
+        self.printer(result, input_data)
+        return result
+
+    def _infer(self, input_dict):
+        infer_data = self.inference_backend.infer(input_dict)
+        return infer_data
+
+    def infer_batch(self, encoded_inputs):
+        num_sample = len(encoded_inputs["input_ids"])
+        infer_data = None
+        num_infer_data = None
+        for idx in range(0, num_sample, self._batch_size):
+            l, r = idx, idx + self._batch_size
+            keys = encoded_inputs.keys()
+            input_dict = {k: encoded_inputs[k][l:r] for k in keys}
+            results = self._infer(input_dict)
+            if infer_data is None:
+                infer_data = [[x] for x in results]
+                num_infer_data = len(results)
+            else:
+                for i in range(num_infer_data):
+                    infer_data[i].append(results[i])
+        for i in range(num_infer_data):
+            infer_data[i] = np.concatenate(infer_data[i], axis=0)
+        return infer_data
+
+    def performance(self, encoded_inputs):
+        nums = len(encoded_inputs["input_ids"])
+        start_time = time.time()
+        infer_result = self.infer_batch(encoded_inputs)  # noqa
+        total_time = time.time() - start_time
+        logger.info("sample nums: %d, time: %.2f, latency: %.2f ms" % (nums, total_time, 1000 * total_time / nums))
+
+    def get_text_and_label(self, dataset):
+        raise NotImplementedError
+
+    def preprocess(self, input_data: list):
+        raise NotImplementedError
+
+    def postprocess(self, infer_data):
+        raise NotImplementedError
+
+    def printer(self, result, input_data):
+        raise NotImplementedError
+
+
+class CLSPredictor(EHealthPredictor):
+    def preprocess(self, input_data: list):
+        norm_text = lambda x: tokenize_special_chars(normalize_chars(x))
+        # To deal with a pair of input text.
+        if isinstance(input_data[0], list):
+            text = [norm_text(sample[0]) for sample in input_data]
+            text_pair = [norm_text(sample[1]) for sample in input_data]
+        else:
+            text = [norm_text(x) for x in input_data]
+            text_pair = None
+
+        data = self._tokenizer(
+            text=text, text_pair=text_pair, max_length=self._max_seq_length, padding=True, truncation=True
+        )
+
+        encoded_inputs = {
+            "input_ids": np.array(data["input_ids"], dtype="int64"),
+            "token_type_ids": np.array(data["token_type_ids"], dtype="int64"),
+        }
+        return encoded_inputs
+
+    def postprocess(self, infer_data):
+        infer_data = infer_data[0]
+        max_value = np.max(infer_data, axis=1, keepdims=True)
+        exp_data = np.exp(infer_data - max_value)
+        probs = exp_data / np.sum(exp_data, axis=1, keepdims=True)
+        label = probs.argmax(axis=-1)
+        confidence = probs.max(axis=-1)
+        return {"label": label, "confidence": confidence}
+
+    def printer(self, result, input_data):
+        label, confidence = result["label"], result["confidence"]
+        for i in range(len(label)):
+            logger.info("input data: {}".format(input_data[i]))
+            logger.info("labels: {}, confidence: {}".format(self.label_list[label[i]], confidence[i]))
+            logger.info("-----------------------------")
+
+
+class NERPredictor(EHealthPredictor):
+    """The predictor for CMeEE dataset."""
+
+    en_to_cn = {
+        "bod": "身体",
+        "mic": "微生物类",
+        "dis": "疾病",
+        "sym": "临床表现",
+        "pro": "医疗程序",
+        "equ": "医疗设备",
+        "dru": "药物",
+        "dep": "科室",
+        "ite": "医学检验项目",
+    }
+
+    def _extract_chunk(self, tokens):
+        chunks = set()
+        start_idx, cur_idx = 0, 0
+        while cur_idx < len(tokens):
+            if tokens[cur_idx][0] == "B":
+                start_idx = cur_idx
+                cur_idx += 1
+                while cur_idx < len(tokens) and tokens[cur_idx][0] == "I":
+                    if tokens[cur_idx][2:] == tokens[start_idx][2:]:
+                        cur_idx += 1
+                    else:
+                        break
+                if cur_idx < len(tokens) and tokens[cur_idx][0] == "E":
+                    if tokens[cur_idx][2:] == tokens[start_idx][2:]:
+                        chunks.add((tokens[cur_idx][2:], start_idx - 1, cur_idx))
+                        cur_idx += 1
+            elif tokens[cur_idx][0] == "S":
+                chunks.add((tokens[cur_idx][2:], cur_idx - 1, cur_idx))
+                cur_idx += 1
+            else:
+                cur_idx += 1
+        return list(chunks)
+
+    def preprocess(self, infer_data):
+        infer_data = [[x.lower() for x in text] for text in infer_data]
+        data = self._tokenizer(
+            infer_data, max_length=self._max_seq_length, padding=True, is_split_into_words=True, truncation=True
+        )
+
+        encoded_inputs = {
+            "input_ids": np.array(data["input_ids"], dtype="int64"),
+            "token_type_ids": np.array(data["token_type_ids"], dtype="int64"),
+        }
+        return encoded_inputs
+
+    def postprocess(self, infer_data):
+        tokens_oth = np.argmax(infer_data[0], axis=-1)
+        tokens_sym = np.argmax(infer_data[1], axis=-1)
+        entity = []
+        for oth_ids, sym_ids in zip(tokens_oth, tokens_sym):
+            token_oth = [self.label_list[0][x] for x in oth_ids]
+            token_sym = [self.label_list[1][x] for x in sym_ids]
+            chunks = self._extract_chunk(token_oth) + self._extract_chunk(token_sym)
+            sub_entity = []
+            for etype, sid, eid in chunks:
+                sub_entity.append({"type": self.en_to_cn[etype], "start_id": sid, "end_id": eid})
+            entity.append(sub_entity)
+        return {"entity": entity}
+
+    def printer(self, result, input_data):
+        result = result["entity"]
+        for i, preds in enumerate(result):
+            logger.info("input data: {}".format(input_data[i]))
+            logger.info("detected entities:")
+            for item in preds:
+                logger.info(
+                    "* entity: {}, type: {}, position: ({}, {})".format(
+                        input_data[i][item["start_id"] : item["end_id"]],
+                        item["type"],
+                        item["start_id"],
+                        item["end_id"],
+                    )
+                )
+            logger.info("-----------------------------")
+
+
+class SPOPredictor(EHealthPredictor):
+    """The predictor for the CMeIE dataset."""
+
+    def predict(self, input_data: list):
+        encoded_inputs = self.preprocess(input_data)
+        lengths = encoded_inputs["attention_mask"].sum(axis=-1)
+        infer_result = self.infer_batch(encoded_inputs)
+        result = self.postprocess(infer_result, lengths)
+        self.printer(result, input_data)
+        return result
+
+    def preprocess(self, infer_data):
+        infer_data = [[x.lower() for x in text] for text in infer_data]
+        data = self._tokenizer(
+            infer_data,
+            max_length=self._max_seq_length,
+            padding=True,
+            is_split_into_words=True,
+            truncation=True,
+            return_attention_mask=True,
+        )
+        encoded_inputs = {
+            "input_ids": np.array(data["input_ids"], dtype="int64"),
+            "token_type_ids": np.array(data["token_type_ids"], dtype="int64"),
+            "attention_mask": np.array(data["attention_mask"], dtype="float32"),
+        }
+        return encoded_inputs
+
+    def postprocess(self, infer_data, lengths):
+        ent_logits = np.array(infer_data[0])
+        spo_logits = np.array(infer_data[1])
+        ent_pred_list = []
+        ent_idxs_list = []
+        for idx, ent_pred in enumerate(ent_logits):
+            seq_len = lengths[idx] - 2
+            start = np.where(ent_pred[:, 0] > 0.5)[0]
+            end = np.where(ent_pred[:, 1] > 0.5)[0]
+            ent_pred = []
+            ent_idxs = {}
+            for x in start:
+                y = end[end >= x]
+                if (x == 0) or (x > seq_len):
+                    continue
+                if len(y) > 0:
+                    y = y[0]
+                    if y > seq_len:
+                        continue
+                    ent_idxs[x] = (x - 1, y - 1)
+                    ent_pred.append((x - 1, y - 1))
+            ent_pred_list.append(ent_pred)
+            ent_idxs_list.append(ent_idxs)
+
+        spo_preds = spo_logits > 0
+        spo_pred_list = [[] for _ in range(len(spo_preds))]
+        idxs, preds, subs, objs = np.nonzero(spo_preds)
+        for idx, p_id, s_id, o_id in zip(idxs, preds, subs, objs):
+            obj = ent_idxs_list[idx].get(o_id, None)
+            if obj is None:
+                continue
+            sub = ent_idxs_list[idx].get(s_id, None)
+            if sub is None:
+                continue
+            spo_pred_list[idx].append((tuple(sub), p_id, tuple(obj)))
+
+        return {"entity": ent_pred_list, "spo": spo_pred_list}
+
+    def printer(self, result, input_data):
+        ent_pred_list, spo_pred_list = result["entity"], result["spo"]
+        for i, (ent, rel) in enumerate(zip(ent_pred_list, spo_pred_list)):
+            logger.info("input data: {}".format(input_data[i]))
+            logger.info("detected entities and relations:")
+            for sid, eid in ent:
+                logger.info("* entity: {}, position: ({}, {})".format(input_data[i][sid : eid + 1], sid, eid))
+            for s, p, o in rel:
+                logger.info(
+                    "+ spo: ({}, {}, {})".format(
+                        input_data[i][s[0] : s[1] + 1], self.label_list[p], input_data[i][o[0] : o[1] + 1]
+                    )
+                )
+            logger.info("-----------------------------")
diff --git a/DPDLDA/deploy/predictor/requirements_cpu.txt b/DPDLDA/deploy/predictor/requirements_cpu.txt
new file mode 100644
index 0000000000000000000000000000000000000000..645682ec79c6c8694ee9ea288af3dc3c416a4dfb
--- /dev/null
+++ b/DPDLDA/deploy/predictor/requirements_cpu.txt
@@ -0,0 +1,2 @@
+onnxruntime==1.10.0
+psutil
diff --git a/DPDLDA/deploy/predictor/requirements_gpu.txt b/DPDLDA/deploy/predictor/requirements_gpu.txt
new file mode 100644
index 0000000000000000000000000000000000000000..2ca8b172eb7993140d6f5e2c3692a200195dd1ee
--- /dev/null
+++ b/DPDLDA/deploy/predictor/requirements_gpu.txt
@@ -0,0 +1,4 @@
+onnxruntime-gpu==1.11.1
+onnx==1.12.0
+onnxconverter-common==1.9.0
+psutil
diff --git a/DPDLDA/export_model.py b/DPDLDA/export_model.py
new file mode 100644
index 0000000000000000000000000000000000000000..2ab246ef2955bcb3271114f4b1941725b2fca34d
--- /dev/null
+++ b/DPDLDA/export_model.py
@@ -0,0 +1,97 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+import os
+
+import paddle
+from model import ElectraForBinaryTokenClassification, ElectraForSPO
+
+from paddlenlp.transformers import ElectraForSequenceClassification
+
+NUM_CLASSES = {
+    "CHIP-CDN-2C": 2,
+    "CHIP-STS": 2,
+    "CHIP-CTC": 44,
+    "KUAKE-QQR": 3,
+    "KUAKE-QTR": 4,
+    "KUAKE-QIC": 11,
+    "CMeEE": [33, 5],
+    "CMeIE": 44,
+}
+
+
+def parse_args():
+    parser = argparse.ArgumentParser()
+    #parser.add_argument("--train_dataset",default="CHIP-CDN-2C", required=True, type=str, help="The name of dataset used for training.")
+    parser.add_argument("--train_dataset", default="CHIP-STS", type=str, help="The name of dataset used for training.")
+    '''
+    parser.add_argument(
+        "--params_path",
+        type=str,
+        required=True,
+        default="./checkpoint/model_1/",
+        help="The path to model parameters to be loaded.",
+    )
+    '''
+    parser.add_argument(
+        "--params_path",
+        type=str,
+        default="./checkpoint/model_1500/",
+        help="The path to model parameters to be loaded.",
+    )
+    parser.add_argument(
+        "--output_path", type=str, default="./export2", help="The path of model parameter in static graph to be saved."
+    )
+    args = parser.parse_args()
+    return args
+
+
+def main():
+    args = parse_args()
+
+    # Load the model parameters.
+    if args.train_dataset not in NUM_CLASSES:
+        raise ValueError(f"Please modify the code to fit {args.train_dataset}")
+
+    if args.train_dataset == "CMeEE":
+        model = ElectraForBinaryTokenClassification.from_pretrained(
+            args.params_path,
+            num_classes_oth=NUM_CLASSES[args.train_dataset][0],
+            num_classes_sym=NUM_CLASSES[args.train_dataset][1],
+        )
+    elif args.train_dataset == "CMeIE":
+        model = ElectraForSPO.from_pretrained(args.params_path, num_labels=NUM_CLASSES[args.train_dataset])
+    else:
+        model = ElectraForSequenceClassification.from_pretrained(
+            args.params_path, num_labels=NUM_CLASSES[args.train_dataset]
+        )
+
+    model.eval()
+
+    # Convert to static graph with specific input description:
+    # input_ids, token_type_ids
+    input_spec = [
+        paddle.static.InputSpec(shape=[None, None], dtype="int64"),
+        paddle.static.InputSpec(shape=[None, None], dtype="int64"),
+    ]
+    model = paddle.jit.to_static(model, input_spec=input_spec)
+
+    # Save in static graph model.
+    save_path = os.path.join(args.output_path, "inference")
+    paddle.jit.save(model, save_path)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/DPDLDA/model.py b/DPDLDA/model.py
new file mode 100644
index 0000000000000000000000000000000000000000..f93da173e2bbc1becb43cd7f803d200daa49c5b8
--- /dev/null
+++ b/DPDLDA/model.py
@@ -0,0 +1,123 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import paddle
+import paddle.nn as nn
+
+#from paddlenlp.transformers import ElectraConfig, ElectraModel, ElectraPretrainedModel
+from transformers import ElectraConfig
+from paddlenlp.transformers import ElectraModel, ElectraPretrainedModel
+
+class ElectraForBinaryTokenClassification(ElectraPretrainedModel):
+    """
+    Electra Model with two linear layers on top of the hidden-states output layers,
+    designed for token classification tasks with nesting.
+
+    Args:
+        config (:class:`ElectraConfig`):
+            The model configuration. Its `hidden_dropout_prob` sets the dropout
+            probability applied to the output of Electra.
+        num_classes_oth (int):
+            The number of classes for the `classifier_oth` head.
+        num_classes_sym (int):
+            The number of classes for the `classifier_sym` head.
+    """
+
+    def __init__(self, config: ElectraConfig, num_classes_oth, num_classes_sym):
+        super(ElectraForBinaryTokenClassification, self).__init__(config)
+        self.num_classes_oth = num_classes_oth
+        self.num_classes_sym = num_classes_sym
+        self.electra = ElectraModel(config)
+        self.dropout = nn.Dropout(config.hidden_dropout_prob)
+        self.classifier_oth = nn.Linear(config.hidden_size, self.num_classes_oth)
+        self.classifier_sym = nn.Linear(config.hidden_size, self.num_classes_sym)
+
+    def forward(self, input_ids=None, token_type_ids=None, position_ids=None, attention_mask=None):
+        sequence_output = self.electra(input_ids, token_type_ids, position_ids, attention_mask)
+        sequence_output = self.dropout(sequence_output)
+
+        logits_sym = self.classifier_sym(sequence_output)
+        logits_oth = self.classifier_oth(sequence_output)
+
+        return logits_oth, logits_sym
+
+
+class MultiHeadAttentionForSPO(nn.Layer):
+    """
+    Multi-head attention layer for SPO task.
+    """
+
+    def __init__(self, embed_dim, num_heads, scale_value=768):
+        super(MultiHeadAttentionForSPO, self).__init__()
+        self.embed_dim = embed_dim
+        self.num_heads = num_heads
+        self.scale_value = scale_value**-0.5
+        self.q_proj = nn.Linear(embed_dim, embed_dim * num_heads)
+        self.k_proj = nn.Linear(embed_dim, embed_dim * num_heads)
+
+    def forward(self, query, key):
+        q = self.q_proj(query)
+        k = self.k_proj(key)
+        q = paddle.reshape(q, shape=[0, 0, self.num_heads, self.embed_dim])
+        k = paddle.reshape(k, shape=[0, 0, self.num_heads, self.embed_dim])
+        q = paddle.transpose(q, perm=[0, 2, 1, 3])
+        k = paddle.transpose(k, perm=[0, 2, 1, 3])
+        scores = paddle.matmul(q, k, transpose_y=True)
+        scores = paddle.scale(scores, scale=self.scale_value)
+        return scores
+
+
+class ElectraForSPO(ElectraPretrainedModel):
+    """
+    Electra Model with a linear layer on top of the hidden-states output
+    layers for entity recognition, and a multi-head attention layer for
+    relation classification.
+
+    Args:
+        config (:class:`ElectraConfig`):
+            The model configuration. The following fields are used here:
+            `hidden_size` is the width of the encoder output;
+            `hidden_dropout_prob` is the dropout probability applied to the
+            output of Electra;
+            `num_labels` is the number of relation classes, with one attention
+            head per class.
+    """
+
+    def __init__(self, config: ElectraConfig):
+        super(ElectraForSPO, self).__init__(config)
+        self.num_classes = config.num_labels
+        self.electra = ElectraModel(config)
+        self.dropout = nn.Dropout(config.hidden_dropout_prob)
+        self.classifier = nn.Linear(config.hidden_size, 2)
+        self.span_attention = MultiHeadAttentionForSPO(config.hidden_size, config.num_labels)
+
+    def forward(self, input_ids=None, token_type_ids=None, position_ids=None, attention_mask=None):
+        outputs = self.electra(
+            input_ids, token_type_ids, position_ids, attention_mask, output_hidden_states=True, return_dict=True
+        )
+        sequence_outputs = outputs.last_hidden_state
+        all_hidden_states = outputs.hidden_states
+        sequence_outputs = self.dropout(sequence_outputs)
+        ent_logits = self.classifier(sequence_outputs)
+
+        subject_output = all_hidden_states[-2]
+        cls_output = paddle.unsqueeze(sequence_outputs[:, 0, :], axis=1)
+        subject_output = subject_output + cls_output
+
+        output_size = self.num_classes + self.electra.config["hidden_size"]  # noqa:F841
+        rel_logits = self.span_attention(sequence_outputs, subject_output)
+
+        return ent_logits, rel_logits
diff --git a/DPDLDA/train_classification.py b/DPDLDA/train_classification.py
new file mode 100644
index 0000000000000000000000000000000000000000..12c259ddbc1401975728928177f5a595e29718b6
--- /dev/null
+++ b/DPDLDA/train_classification.py
@@ -0,0 +1,265 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import distutils.util +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn.functional as F +from paddle.metric import Accuracy +from utils import LinearDecayWithWarmup, convert_example, create_dataloader + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import AccuracyAndF1, MultiLabelsMetric +from paddlenlp.transformers import ElectraForSequenceClassification, ElectraTokenizer + +METRIC_CLASSES = { + "KUAKE-QIC": Accuracy, + "KUAKE-QQR": Accuracy, + "KUAKE-QTR": Accuracy, + "CHIP-CTC": MultiLabelsMetric, + "CHIP-STS": MultiLabelsMetric, + "CHIP-CDN-2C": AccuracyAndF1, +} + +parser = argparse.ArgumentParser() +parser.add_argument( + "--dataset", + choices=["KUAKE-QIC", "KUAKE-QQR", "KUAKE-QTR", "CHIP-STS", "CHIP-CTC", "CHIP-CDN-2C"], + default="CHIP-STS", + type=str, + help="Dataset for sequence classification tasks.", +) +parser.add_argument("--seed", default=1000, type=int, help="Random seed for initialization.") +parser.add_argument( + "--device", + choices=["cpu", "gpu", "xpu", "npu"], + default="cpu", + help="Select which device to train the model on, defaults to cpu.", +) +parser.add_argument("--epochs", default=3, type=int, help="Total number of training epochs.") +parser.add_argument( + "--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. Override epochs." +) +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument( + "--learning_rate", default=6e-5, type=float, help="Learning rate for fine-tuning sequence classification task."
+) +parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay of optimizer if we apply some.") +parser.add_argument( + "--warmup_proportion", + default=0.1, + type=float, + help="Linear warmup proportion of learning rate over the training process.", +) +parser.add_argument( + "--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization." +) +parser.add_argument("--init_from_ckpt", default=None, type=str, help="The path of checkpoint to be loaded.") +parser.add_argument("--logging_steps", default=10, type=int, help="The interval steps to logging.") +parser.add_argument( + "--save_dir", + default="./checkpoint", + type=str, + help="The output directory where the model checkpoints will be written.", +) +parser.add_argument("--save_steps", default=20, type=int, help="The interval steps to save checkpoints.") +parser.add_argument("--valid_steps", default=20, type=int, help="The interval steps to evaluate model performance.") +parser.add_argument("--use_amp", default=False, type=distutils.util.strtobool, help="Enable mixed precision training.") +parser.add_argument("--scale_loss", default=128, type=float, help="The value of scale_loss for fp16.") + +args = parser.parse_args() + + +def set_seed(seed): + """set random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, criterion, metric, data_loader): + """ + Evaluates the model on a dataset and computes the metric. + + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + criterion(obj:`paddle.nn.Layer`): It can compute the loss. + metric(obj:`paddle.metric.Metric`): The evaluation metric. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches.
+ """ + model.eval() + metric.reset() + losses = [] + for batch in data_loader: + input_ids, token_type_ids, position_ids, labels = batch + logits = model(input_ids, token_type_ids, position_ids) + loss = criterion(logits, labels) + losses.append(loss.numpy()) + correct = metric.compute(logits, labels) + metric.update(correct) + if isinstance(metric, Accuracy): + metric_name = "accuracy" + result = metric.accumulate() + elif isinstance(metric, MultiLabelsMetric): + metric_name = "macro f1" + _, _, result = metric.accumulate("macro") + else: + metric_name = "micro f1" + _, _, _, result, _ = metric.accumulate() + + print("eval loss: %.5f, %s: %.5f" % (np.mean(losses), metric_name, result)) + model.train() + metric.reset() + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + train_ds, dev_ds = load_dataset("cblue", args.dataset, splits=["train", "dev"]) + + model = ElectraForSequenceClassification.from_pretrained( + "ernie-health-chinese", num_labels=len(train_ds.label_list) + ) + tokenizer = ElectraTokenizer.from_pretrained("ernie-health-chinese") + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + batchify_fn = lambda samples, fn=Tuple( # noqa: E731 + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # segment + Pad(axis=0, pad_val=args.max_seq_length - 1, dtype="int64"), # position + Stack(dtype="int64"), + ): [data for data in fn(samples)] + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + dev_data_loader = create_dataloader( + dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + if args.init_from_ckpt and 
os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + state_keys = {x: x.replace("discriminator.", "") for x in state_dict.keys() if "discriminator." in x} + if len(state_keys) > 0: + state_dict = {state_keys[k]: state_dict[k] for k in state_keys.keys()} + model.set_dict(state_dict) + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + num_training_steps = args.max_steps if args.max_steps > 0 else len(train_data_loader) * args.epochs + args.epochs = (num_training_steps - 1) // len(train_data_loader) + 1 + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + criterion = paddle.nn.loss.CrossEntropyLoss() + if METRIC_CLASSES[args.dataset] is Accuracy: + metric = METRIC_CLASSES[args.dataset]() + metric_name = "accuracy" + elif METRIC_CLASSES[args.dataset] is MultiLabelsMetric: + metric = METRIC_CLASSES[args.dataset](num_labels=len(train_ds.label_list)) + metric_name = "macro f1" + else: + metric = METRIC_CLASSES[args.dataset]() + metric_name = "micro f1" + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + global_step = 0 + tic_train = time.time() + total_train_time = 0 + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + input_ids, token_type_ids, position_ids, labels = batch + with paddle.amp.auto_cast( + args.use_amp, + custom_white_list=["layer_norm", "softmax", "gelu", "tanh"], + ): + logits = model(input_ids, token_type_ids, position_ids) + loss = 
criterion(logits, labels) + probs = F.softmax(logits, axis=1) + correct = metric.compute(probs, labels) + metric.update(correct) + + if isinstance(metric, Accuracy): + result = metric.accumulate() + elif isinstance(metric, MultiLabelsMetric): + _, _, result = metric.accumulate("macro") + else: + _, _, _, result, _ = metric.accumulate() + + if args.use_amp: + scaler.scale(loss).backward() + scaler.minimize(optimizer, loss) + else: + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + global_step += 1 + if global_step % args.logging_steps == 0 and rank == 0: + time_diff = time.time() - tic_train + total_train_time += time_diff + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, %s: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, metric_name, result, args.logging_steps / time_diff) + ) + + if global_step % args.valid_steps == 0 and rank == 0: + print("before evaluate") + evaluate(model, criterion, metric, dev_data_loader) + print("after evaluate") + + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + if paddle.distributed.get_world_size() > 1: + model._layers.save_pretrained(save_dir) + else: + model.save_pretrained(save_dir) + tokenizer.save_pretrained(save_dir) + + if global_step >= num_training_steps: + return + tic_train = time.time() + + if rank == 0 and total_train_time > 0: + print("Speed: %.2f steps/s" % (global_step / total_train_time)) + + +if __name__ == "__main__": + do_train() diff --git a/DPDLDA/utils.py b/DPDLDA/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..4c0bda0c63ebbd5eb07a27ef1b1aef45a055671a --- /dev/null +++ b/DPDLDA/utils.py @@ -0,0 +1,503 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import math + +import numpy as np +import paddle +from paddle.optimizer.lr import LambdaDecay + +from paddlenlp.transformers import normalize_chars, tokenize_special_chars + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +class LinearDecayWithWarmup(LambdaDecay): + def __init__(self, learning_rate, total_steps, warmup, last_epoch=-1, verbose=False): + """ + Creates a learning rate scheduler, which increases learning rate linearly + from 0 to given `learning_rate`, after this warmup period learning rate + would be decreased linearly from the base learning rate to 0. + + Args: + learning_rate (float): + The base learning rate. It is a python float number. + total_steps (int): + The number of training steps. + warmup (int or float): + If int, it means the number of steps for warmup. If float, it means + the proportion of warmup in total training steps. + last_epoch (int, optional): + The index of last epoch. 
It can be set to restart training. If + None, it means initial learning rate. + Defaults to -1. + verbose (bool, optional): + If True, prints a message to stdout for each update. + Defaults to False. + """ + + warmup_steps = warmup if isinstance(warmup, int) else int(math.floor(warmup * total_steps)) + + def lr_lambda(current_step): + if current_step < warmup_steps: + return float(current_step) / float(max(1, warmup_steps)) + return max(0.0, 1.0 - current_step / total_steps) + + super(LinearDecayWithWarmup, self).__init__(learning_rate, lr_lambda, last_epoch, verbose) + + +def convert_example(example, tokenizer, max_seq_length=512, is_test=False): + """ + Builds model inputs from a sequence or a pair of sequences for sequence + classification tasks by concatenating and adding special tokens. And + creates a mask from the two sequences for sequence-pair classification + tasks. + + The convention in Electra/EHealth is: + + - single sequence: + input_ids: ``[CLS] X [SEP]`` + token_type_ids: `` 0 0 0`` + position_ids: `` 0 1 2`` + + - a sequence pair: + input_ids: ``[CLS] X [SEP] Y [SEP]`` + token_type_ids: `` 0 0 0 1 1`` + position_ids: `` 0 1 2 3 4`` + + Args: + example (obj:`dict`): + A dictionary of input data, containing text and label if it has. + tokenizer (obj:`PretrainedTokenizer`): + A tokenizer inherits from :class:`paddlenlp.transformers.PretrainedTokenizer`. + Users can refer to the superclass for more information. + max_seq_length (obj:`int`): + The maximum total input sequence length after tokenization. + Sequences longer will be truncated, and the shorter will be padded. + is_test (obj:`bool`, default to `False`): + Whether the example contains label or not. + + Returns: + input_ids (obj:`list[int]`): + The list of token ids. + token_type_ids (obj:`list[int]`): + List of sequence pair mask. + position_ids (obj:`list[int]`): + List of position ids. + label(obj:`numpy.array`, data type of int64, optional): + The input label if not is_test.
+ """ + text_a = example["text_a"] + text_b = example.get("text_b", None) + + text_a = tokenize_special_chars(normalize_chars(text_a)) + if text_b is not None: + text_b = tokenize_special_chars(normalize_chars(text_b)) + + encoded_inputs = tokenizer(text=text_a, text_pair=text_b, max_seq_len=max_seq_length, return_position_ids=True) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + position_ids = encoded_inputs["position_ids"] + + if is_test: + return input_ids, token_type_ids, position_ids + label = np.array([example["label"]], dtype="int64") + return input_ids, token_type_ids, position_ids, label + + +def convert_example_ner(example, tokenizer, max_seq_length=512, pad_label_id=-100, is_test=False): + """ + Builds model inputs from a sequence and creates labels for named- + entity recognition task CMeEE. + + For example, a sample should be: + + - input_ids: ``[CLS] x1 x2 [SEP] [PAD]`` + - token_type_ids: `` 0 0 0 0 0`` + - position_ids: `` 0 1 2 3 0`` + - attention_mask: `` 1 1 1 1 0`` + - label_oth: `` 32 3 32 32 32`` (optional, label ids of others) + - label_sym: `` 4 4 4 4 4`` (optional, label ids of symptom) + + Args: + example (obj:`dict`): + A dictionary of input data, containing text and label if it has. + tokenizer (obj:`PretrainedTokenizer`): + A tokenizer inherits from :class:`paddlenlp.transformers.PretrainedTokenizer`. + Users can refer to the superclass for more information. + max_seq_length (obj:`int`): + The maximum total input sequence length after tokenization. + Sequences longer will be truncated, and the shorter will be padded. + is_test (obj:`bool`, default to `False`): + Whether the example contains label or not. 
+ + Returns: + encoded_output (obj: `dict[str, list|np.array]`): + The sample dictionary including `input_ids`, `token_type_ids`, + `position_ids`, `attention_mask`, `label_oth` (optional), + `label_sym` (optional) + """ + + encoded_inputs = {} + text = example["text"] + if len(text) > max_seq_length - 2: + text = text[: max_seq_length - 2] + text = ["[CLS]"] + [x.lower() for x in text] + ["[SEP]"] + input_len = len(text) + encoded_inputs["input_ids"] = tokenizer.convert_tokens_to_ids(text) + encoded_inputs["token_type_ids"] = np.zeros(input_len) + encoded_inputs["position_ids"] = list(range(input_len)) + encoded_inputs["attention_mask"] = np.ones(input_len) + + if not is_test: + labels = example["labels"] + if input_len - 2 < len(labels[0]): + labels[0] = labels[0][: input_len - 2] + if input_len - 2 < len(labels[1]): + labels[1] = labels[1][: input_len - 2] + encoded_inputs["label_oth"] = [pad_label_id[0]] + labels[0] + [pad_label_id[0]] + encoded_inputs["label_sym"] = [pad_label_id[1]] + labels[1] + [pad_label_id[1]] + + return encoded_inputs + + +def convert_example_spo(example, tokenizer, num_classes, max_seq_length=512, is_test=False): + """ + Builds model inputs from a sequence and creates labels for SPO prediction + task CMeIE. + + For example, a sample should be: + + - input_ids: ``[CLS] x1 x2 [SEP] [PAD]`` + - token_type_ids: `` 0 0 0 0 0`` + - position_ids: `` 0 1 2 3 0`` + - attention_mask: `` 1 1 1 1 0`` + - ent_label: ``[[0 1 0 0 0], # start ids are set as 1 + [0 0 1 0 0]] # end ids are set as 1 + - spo_label: a tensor of shape [num_classes, max_batch_len, max_batch_len]. + Set [predicate_id, subject_start_id, object_start_id] as 1 + when (subject, predicate, object) exists. + + Args: + example (obj:`dict`): + A dictionary of input data, containing text and label if it has. + tokenizer (obj:`PretrainedTokenizer`): + A tokenizer inherits from :class:`paddlenlp.transformers.PretrainedTokenizer`. + Users can refer to the superclass for more information. 
+ num_classes (obj:`int`): + The number of predicates. + max_seq_length (obj:`int`): + The maximum total input sequence length after tokenization. + Sequences longer will be truncated, and the shorter will be padded. + is_test (obj:`bool`, default to `False`): + Whether the example contains label or not. + + Returns: + encoded_output (obj: `dict[str, list|np.array]`): + The sample dictionary including `input_ids`, `token_type_ids`, + `position_ids`, `attention_mask`, `ent_label` (optional), + `spo_label` (optional) + """ + encoded_inputs = {} + text = example["text"] + if len(text) > max_seq_length - 2: + text = text[: max_seq_length - 2] + text = ["[CLS]"] + [x.lower() for x in text] + ["[SEP]"] + input_len = len(text) + encoded_inputs["input_ids"] = tokenizer.convert_tokens_to_ids(text) + encoded_inputs["token_type_ids"] = np.zeros(input_len) + encoded_inputs["position_ids"] = list(range(input_len)) + encoded_inputs["attention_mask"] = np.ones(input_len) + if not is_test: + encoded_inputs["ent_label"] = example["ent_label"] + encoded_inputs["spo_label"] = example["spo_label"] + return encoded_inputs + + +class NERChunkEvaluator(paddle.metric.Metric): + """ + NERChunkEvaluator computes the precision, recall and F1-score for chunk detection. + It is often used in sequence tagging tasks, such as Named Entity Recognition (NER). + + Args: + label_list (list): + The label list. + + Note: + Difference from `paddlenlp.metric.ChunkEvaluator`: + + - `paddlenlp.metric.ChunkEvaluator` + All sequences with non-'O' labels are taken as chunks when computing num_infer. + - `NERChunkEvaluator` + Only complete sequences are taken as chunks, namely `B- I- E-` or `S-`. 
+ """ + + def __init__(self, label_list): + super(NERChunkEvaluator, self).__init__() + self.id2label = [dict(enumerate(x)) for x in label_list] + self.num_classes = [len(x) for x in label_list] + self.num_infer = 0 + self.num_label = 0 + self.num_correct = 0 + + def compute(self, lengths, predictions, labels): + """ + Computes the precision, recall and F1-score for chunk detection. + + Args: + lengths (Tensor): + The valid length of every sequence, a tensor with shape `[batch_size]`. + predictions (Tensor): + The predictions index, a tensor with shape `[batch_size, sequence_length]`. + labels (Tensor): + The labels index, a tensor with shape `[batch_size, sequence_length]`. + + Returns: + tuple: Returns tuple (`num_infer_chunks, num_label_chunks, num_correct_chunks`). + + With the fields: + + - `num_infer_chunks` (Tensor): The number of the inference chunks. + - `num_label_chunks` (Tensor): The number of the label chunks. + - `num_correct_chunks` (Tensor): The number of the correct chunks.
+ """ + assert len(predictions) == len(labels) + assert len(predictions) == len(self.id2label) + preds = [x.numpy() for x in predictions] + labels = [x.numpy() for x in labels] + + preds_chunk = set() + label_chunk = set() + for idx, (pred, label) in enumerate(zip(preds, labels)): + for i, case in enumerate(pred): + case = [self.id2label[idx][x] for x in case[: lengths[i]]] + preds_chunk |= self.extract_chunk(case, i) + for i, case in enumerate(label): + case = [self.id2label[idx][x] for x in case[: lengths[i]]] + label_chunk |= self.extract_chunk(case, i) + + num_infer = len(preds_chunk) + num_label = len(label_chunk) + num_correct = len(preds_chunk & label_chunk) + return num_infer, num_label, num_correct + + def update(self, correct): + num_infer, num_label, num_correct = correct + self.num_infer += num_infer + self.num_label += num_label + self.num_correct += num_correct + + def accumulate(self): + precision = self.num_correct / (self.num_infer + 1e-6) + recall = self.num_correct / (self.num_label + 1e-6) + f1 = 2 * precision * recall / (precision + recall + 1e-6) + return precision, recall, f1 + + def reset(self): + self.num_infer = 0 + self.num_label = 0 + self.num_correct = 0 + + def name(self): + return "precision", "recall", "f1" + + def extract_chunk(self, sequence, cid=0): + chunks = set() + + start_idx, cur_idx = 0, 0 + while cur_idx < len(sequence): + if sequence[cur_idx][0] == "B": + start_idx = cur_idx + cur_idx += 1 + while cur_idx < len(sequence) and sequence[cur_idx][0] == "I": + if sequence[cur_idx][2:] == sequence[start_idx][2:]: + cur_idx += 1 + else: + break + if cur_idx < len(sequence) and sequence[cur_idx][0] == "E": + if sequence[cur_idx][2:] == sequence[start_idx][2:]: + chunks.add((cid, sequence[cur_idx][2:], start_idx, cur_idx)) + cur_idx += 1 + elif sequence[cur_idx][0] == "S": + chunks.add((cid, sequence[cur_idx][2:], cur_idx, cur_idx)) + cur_idx += 1 + else: + cur_idx += 1 + + return chunks + + +class 
SPOChunkEvaluator(paddle.metric.Metric): + """ + SPOChunkEvaluator computes the precision, recall and F1-score for multiple + chunk detections, including Named Entity Recognition (NER) and SPO Prediction. + + Args: + num_classes (int): + The number of predicates. + """ + + def __init__(self, num_classes=None): + super(SPOChunkEvaluator, self).__init__() + self.num_classes = num_classes + self.num_infer_ent = 0 + self.num_infer_spo = 1e-10 + self.num_label_ent = 0 + self.num_label_spo = 1e-10 + self.num_correct_ent = 0 + self.num_correct_spo = 0 + + def compute(self, lengths, ent_preds, spo_preds, ent_labels, spo_labels): + """ + Computes the precision, recall and F1-score for NER and SPO prediction. + + Args: + lengths (Tensor): + The valid length of every sequence, a tensor with shape `[batch_size]`. + ent_preds (Tensor): + The predictions of entities. + A tensor with shape `[batch_size, sequence_length, 2]`. + `ent_preds[:, :, 0]` denotes the start indexes of entities. + `ent_preds[:, :, 1]` denotes the end indexes of entities. + spo_preds (Tensor): + The predictions of predicates between all possible entities. + A tensor with shape `[batch_size, num_classes, sequence_length, sequence_length]`. + ent_labels (list[list|tuple]): + The entity labels' indexes. A list of pair `[start_index, end_index]`. + spo_labels (list[list|tuple]): + The SPO labels' indexes. A list of triple `[[subject_start_index, subject_end_index], + predicate_id, [object_start_index, object_end_index]]`. + + Returns: + tuple: + Returns tuple (`num_infer_chunks, num_label_chunks, num_correct_chunks`). + The `ent` denotes results of NER and the `spo` denotes results of SPO prediction. + + With the fields: + + - `num_infer_chunks` (dict): The number of the inference chunks. + - `num_label_chunks` (dict): The number of the label chunks. + - `num_correct_chunks` (dict): The number of the correct chunks.
+ """ + ent_preds = ent_preds.numpy() + spo_preds = spo_preds.numpy() + + ent_pred_list = [] + ent_idxs_list = [] + for idx, ent_pred in enumerate(ent_preds): + seq_len = lengths[idx] - 2 + start = np.where(ent_pred[:, 0] > 0.5)[0] + end = np.where(ent_pred[:, 1] > 0.5)[0] + ent_pred = [] + ent_idxs = {} + for x in start: + y = end[end >= x] + if (x == 0) or (x > seq_len): + continue + if len(y) > 0: + y = y[0] + if y > seq_len: + continue + ent_idxs[x] = (x - 1, y - 1) + ent_pred.append((x - 1, y - 1)) + ent_pred_list.append(ent_pred) + ent_idxs_list.append(ent_idxs) + + spo_preds = spo_preds > 0 + spo_pred_list = [[] for _ in range(len(spo_preds))] + idxs, preds, subs, objs = np.nonzero(spo_preds) + for idx, p_id, s_id, o_id in zip(idxs, preds, subs, objs): + obj = ent_idxs_list[idx].get(o_id, None) + if obj is None: + continue + sub = ent_idxs_list[idx].get(s_id, None) + if sub is None: + continue + spo_pred_list[idx].append((sub, p_id, obj)) + + correct = {"ent": 0, "spo": 0} + infer = {"ent": 0, "spo": 0} + label = {"ent": 0, "spo": 0} + for ent_pred, ent_true in zip(ent_pred_list, ent_labels): + ent_true = [tuple(x) for x in ent_true] + infer["ent"] += len(set(ent_pred)) + label["ent"] += len(set(ent_true)) + correct["ent"] += len(set(ent_pred) & set(ent_true)) + + for spo_pred, spo_true in zip(spo_pred_list, spo_labels): + spo_true = [(tuple(s), p, tuple(o)) for s, p, o in spo_true] + infer["spo"] += len(set(spo_pred)) + label["spo"] += len(set(spo_true)) + correct["spo"] += len(set(spo_pred) & set(spo_true)) + + return infer, label, correct + + def update(self, corrects): + assert len(corrects) == 3 + for item in corrects: + assert isinstance(item, dict) + for value in item.values(): + if not self._is_number_or_matrix(value): + raise ValueError("The numbers must be a number(int) or a numpy ndarray.") + num_infer, num_label, num_correct = corrects + self.num_infer_ent += num_infer["ent"] + self.num_infer_spo += num_infer["spo"] + self.num_label_ent += 
num_label["ent"] + self.num_label_spo += num_label["spo"] + self.num_correct_ent += num_correct["ent"] + self.num_correct_spo += num_correct["spo"] + + def accumulate(self): + spo_precision = self.num_correct_spo / self.num_infer_spo + spo_recall = self.num_correct_spo / self.num_label_spo + spo_f1 = 2 * self.num_correct_spo / (self.num_infer_spo + self.num_label_spo) + ent_precision = self.num_correct_ent / self.num_infer_ent if self.num_infer_ent > 0 else 0.0 + ent_recall = self.num_correct_ent / self.num_label_ent if self.num_label_ent > 0 else 0.0 + ent_f1 = ( + 2 * ent_precision * ent_recall / (ent_precision + ent_recall) if (ent_precision + ent_recall) != 0 else 0.0 + ) + return {"entity": (ent_precision, ent_recall, ent_f1), "spo": (spo_precision, spo_recall, spo_f1)} + + def _is_number_or_matrix(self, var): + def _is_number_(var): + return ( + isinstance(var, int) + or isinstance(var, np.int64) + or isinstance(var, float) + or (isinstance(var, np.ndarray) and var.shape == (1,)) + ) + + return _is_number_(var) or isinstance(var, np.ndarray) + + def reset(self): + self.num_infer_ent = 0 + self.num_infer_spo = 1e-10 + self.num_label_ent = 0 + self.num_label_spo = 1e-10 + self.num_correct_ent = 0 + self.num_correct_spo = 0 + + def name(self): + return {"entity": ("precision", "recall", "f1"), "spo": ("precision", "recall", "f1")}
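
The learning-rate schedule defined by `LinearDecayWithWarmup` in `utils.py` can be checked without Paddle. The sketch below copies the `lr_lambda` arithmetic from that class into plain Python; the function names `warmup_steps_from` and `lr_factor` are illustrative and do not exist in the repository.

```python
import math


def warmup_steps_from(warmup, total_steps):
    """Resolve `warmup` to a step count (hypothetical helper).

    Mirrors the rule in LinearDecayWithWarmup: an int is a step count,
    a float is a proportion of `total_steps`.
    """
    if isinstance(warmup, int):
        return warmup
    return int(math.floor(warmup * total_steps))


def lr_factor(current_step, total_steps, warmup):
    """Multiplier applied to the base learning rate at `current_step`.

    Same arithmetic as `lr_lambda` in utils.py: linear ramp from 0 during
    warmup, then linear decay toward 0 measured from step 0.
    """
    warmup_steps = warmup_steps_from(warmup, total_steps)
    if current_step < warmup_steps:
        return float(current_step) / float(max(1, warmup_steps))
    return max(0.0, 1.0 - current_step / total_steps)


if __name__ == "__main__":
    # Script defaults: learning_rate=6e-5, warmup_proportion=0.1.
    base_lr, total_steps = 6e-5, 100
    for step in (0, 5, 9, 10, 55, 100):
        print(step, base_lr * lr_factor(step, total_steps, 0.1))
```

Because the decay branch is computed as `1 - current_step / total_steps`, the multiplier at the end of warmup is `1 - warmup_steps / total_steps` rather than 1. With 100 total steps and a warmup proportion of 0.1, it tops out at 0.9, so the configured `learning_rate` is an upper bound that this schedule approaches but does not reach.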