diff --git a/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/LICENSE.txt b/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/LICENSE.txt new file mode 100644 index 0000000000000000000000000000000000000000..eca7123693d7333d8d391b74a469fee175244ac7 --- /dev/null +++ b/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/LICENSE.txt @@ -0,0 +1,190 @@ + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. + + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." 
+ + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative Works, in at least one + of the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. 
+ + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. Limitation of Liability. In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. + + END OF TERMS AND CONDITIONS + + Copyright 2017 Guillaume Genthial + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. 
+ You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. \ No newline at end of file diff --git a/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/README.md b/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/README.md new file mode 100644 index 0000000000000000000000000000000000000000..059f8c2888764cb05ccc1c9098f1a28d29c394f1 --- /dev/null +++ b/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/README.md @@ -0,0 +1,184 @@ +# SEQUENCE_TAGGING离线推理 + +## 环境要求 + +| 环境 | 版本 | +| --- | --- | +| CANN | >=5.0.3 | +| 处理器| Ascend310/Ascend910 | +| 其他| 见 'requirements.txt' | + +## 数据准备 +1. 模型训练使用CoNLL2003数据集,数据集请用户自行获取。 + +2. 数据集训练前需要做预处理操作,请参考默认配置中的数据集预处理小结。 + +3. 数据集处理后,放入SEQUENCE_TAGGING_ID2097_for_ACL/data目录下。 + +## 数据集预处理(CoNLL2003语料库): + +- 去除数据集中与命名实体识别无关的属性(第2、3列)和DOCSTART行 + 处理前: + + ```text + -DOCSTART- -X- -X- O + + EU NNP B-NP B-ORG + rejects VBZ B-VP O + German JJ B-NP B-MISC + call NN I-NP O + to TO B-VP O + boycott VB I-VP O + British JJ B-NP B-MISC + lamb NN I-NP O + . . O O + ``` + 处理后: + ```text + EU B-ORG + rejects O + German B-MISC + call O + to O + boycott O + British B-MISC + lamb O + . O + ``` + +- 数据集文件路径 + + 训练集:./data/coNLL/eng/eng.train.iob + + 测试集:./data/coNLL/eng/eng.testb.iob + + 验证集:./data/coNLL/eng/eng.testa.iob + +- 词向量库预处理: + + - glove.6B下载 + + - 词向量库文件路径 + + ./data/glove.6B + +- 注意,推理所用的单词表必须与训练所用的一致,训练所需的单词表构建见SEQUENCE_TAGGING_ID2097_for_Tensorflow/build_data.py。 + + 数据集链接:[OBS](obs://cann-id2097/dataset/) +## 脚本和示例代码 + +```text +├── build_data.py //创建单词表 +├── README.md //代码说明文档 +├── eval_ckpt.py //评估ckpt +├── eval_pb.py //评估pb +├── eval_om.py //评估om +├── convert_bin.py //数据转换 +├── freeze_graph.py //模型固化 +├── requirements.txt //环境依赖 +├── LICENSE.txt //证书 +├── scripts +│ ├──pb_to_om.sh //pb转om +│ ├──run_msame.sh //msame离线推理 +├── model +│ ├──__init__.py +│ ├──base_model.py //基础模型 +│ ├──ner_model.py //网络结构 +│ ├──config.py //参数设置 +│ ├──data_utils.py //数据集处理 +├── om_model //存放om模型 +├── pb_model //存放pb模型 +├── bin_data //存放bin文件 +``` + +## 模型文件 + +包括ckpt、pb、om模型文件 + +下载链接:[OBS](obs://cann-id2097/npu/Inference/) + +## STEP1: ckpt文件转pb模型 + +```bash +# CKPT_PATH为ckpt文件的路径 +python3 freeze_graph.py --dir_ckpt CKPT_PATH +# 示例 +python3 freeze_graph.py --dir_ckpt ./ckpt/model.weights/ +``` + +## STEP2: pb模型转om模型 + +检查环境中ATC工具环境变量,设置完成后,修改PB和OM文件路径PB_PATH和OM_PATH,运行pb_to_om.sh + +```bash +export PATH=/usr/local/python3.7.5/bin:$PATH +export PYTHONPATH=/usr/local/Ascend/ascend-toolkit/latest/atc/python/site-packages/te:$PYTHONPATH +export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/atc/lib64:${LD_LIBRARY_PATH} + +PB_PATH=/root/infer +OM_PATH=/root/infer + +/usr/local/Ascend/ascend-toolkit/latest/atc/bin/atc --model=$PB_PATH/SEQUENCE_TAGGING.pb --framework=3 \ + --output=$OM_PATH/SEQUENCE_TAGGING --soc_version=Ascend310 \ + --input_shape="word_ids:1,128;sequence_lengths:1;char_ids:1,128,64" \ + --out_nodes="dense/BiasAdd:0;transitions:0" + +``` + +## STEP3: 数据集转为bin文件 + +运行convert_bin.py + +```bash +# DATA_PATH为数据集路径 +python3 convert_bin.py --data_path DATA_PATH +# 示例 +python3 convert_bin.py --data_path ./data +``` + +## STEP4: 使用msame工具推理 + +安装好[msame]([tools: Ascend tools - 
Gitee.com](https://gitee.com/ascend/tools/tree/master/msame)),运行run_msame.sh + +```bash +OM_PATH=/root/infer +BIN_PATH=/root/infer/bin_data + +# /root/msame/out/改成自己的msame安装路径 +/root/msame/out/./msame --model $OM_PATH/SEQUENCE_TAGGING.om --input $BIN_PATH/word_ids,$BIN_PATH/sequence_lengths,$BIN_PATH/char_ids --output $OM_PATH/ +``` + +注意,msame生成的推理文件夹是根据时间命名的,类似于20220323_170719这样的格式,需要自己检查路径,在后续精度验证的步骤中修改。SEQUENCE_TAGGING模型的输出有两个,需要将这两个输出分别存放到两个文件夹dir_om_output/output_0 和 dir_om_output/output_1(dir_om_output需要改为自己创建的文件夹) + +## 验证om模型精度 + +运行eval_om.py。 + +```bash +# DATA_PATH为数据集路径,OM_OUTPUT为om模型推理的输出(bin格式) +python3 eval_om.py --data_path DATA_PATH --dir_om_output OM_OUTPUT +# 示例 +python3 eval_om.py --data_path ./data --dir_om_output ./bin_data +``` + +## 验证pb模型精度 + +运行eval_pb.py。 + +```bash +# DATA_PATH为数据集路径,CKPT_PATH为ckpt文件路径 +python3 eval_pb.py --data_path DATA_PATH +# 示例 +python3 eval_pb.py --data_path ./data +``` + +## 验证ckpt精度 + +运行eval_ckpt.py。 + +```bash +# DATA_PATH为数据集路径,CKPT_PATH为ckpt文件路径 +python3 eval_ckpt.py --data_path DATA_PATH --dir_ckpt CKPT_PATH +# 示例 +python3 eval_ckpt.py --data_path ./data --dir_ckpt ./ckpt/model.weights +``` \ No newline at end of file diff --git a/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/build_data.py b/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/build_data.py new file mode 100644 index 0000000000000000000000000000000000000000..b1e819ea7da072c5e3d753e6f2da0eb58bd27f15 --- /dev/null +++ b/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/build_data.py @@ -0,0 +1,83 @@ +# Copyright 2017 The TensorFlow Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +# Copyright 2021 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from model.config import Config +from model.data_utils import CoNLLDataset, get_vocabs, UNK, NUM, \ + get_glove_vocab, write_vocab, load_vocab, get_char_vocab, \ + export_trimmed_glove_vectors, get_processing_word + + +def main(): + """Procedure to build data + + You MUST RUN this procedure. It iterates over the whole dataset (train, + dev and test) and extract the vocabularies in terms of words, tags, and + characters. Having built the vocabularies it writes them in a file. The + writing of vocabulary in a file assigns an id (the line #) to each word. 
+ It then extract the relevant GloVe vectors and stores them in a np array + such that the i-th entry corresponds to the i-th word in the vocabulary. + + + Args: + config: (instance of Config) has attributes like hyper-params... + + """ + # get config and processing of words + config = Config(load=False) + processing_word = get_processing_word(lowercase=True) + + # Generators + dev = CoNLLDataset(config.filename_dev, processing_word) + test = CoNLLDataset(config.filename_test, processing_word) + train = CoNLLDataset(config.filename_train, processing_word) + + # Build Word and Tag vocab + vocab_words, vocab_tags = get_vocabs([train, dev, test]) + vocab_glove = get_glove_vocab(config.filename_glove) + + vocab = vocab_words & vocab_glove + vocab.add(UNK) + vocab.add(NUM) + + # Save vocab + write_vocab(vocab, config.filename_words) + write_vocab(vocab_tags, config.filename_tags) + + # Trim GloVe Vectors + vocab = load_vocab(config.filename_words) + export_trimmed_glove_vectors(vocab, config.filename_glove, + config.filename_trimmed, config.dim_word) + + # Build and save char vocab + train = CoNLLDataset(config.filename_train) + vocab_chars = get_char_vocab(train) + write_vocab(vocab_chars, config.filename_chars) + + +if __name__ == "__main__": + main() diff --git a/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/convert_bin.py b/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/convert_bin.py new file mode 100644 index 0000000000000000000000000000000000000000..65418359cd7d0ef55f5623db095ecb63a214bc21 --- /dev/null +++ b/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/convert_bin.py @@ -0,0 +1,83 @@ +# Copyright 2017 The TensorFlow Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +# Copyright 2021 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
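`build_data.py` writes one token per line, so a word's id is simply its line number in `words.txt`, and row *i* of the trimmed GloVe matrix is the vector for word *i*. The README stresses that inference must reuse the training vocabulary; below is a minimal sanity-check sketch of that contract, assuming the default paths from `model/config.py` (`./data/words.txt`, `./data/glove.6B.100d.trimmed.npz`, `dim_word=100`).

```python
# Sketch only: confirm the vocab ids and the trimmed GloVe rows line up.
# Paths assume build_data.py was run with the default config.
from model.data_utils import load_vocab, get_trimmed_glove_vectors, UNK

vocab = load_vocab("./data/words.txt")                    # dict: word -> line index
embeddings = get_trimmed_glove_vectors("./data/glove.6B.100d.trimmed.npz")

assert embeddings.shape[0] == len(vocab)                  # one embedding row per word
word_id = vocab.get("paris", vocab[UNK])                  # unseen words fall back to $UNK$
print(word_id, embeddings[word_id][:5])                   # first 5 dimensions of its vector
```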
+ +import os +import numpy as np + +from model.config import Config +from model.data_utils import CoNLLDataset, minibatches, pad_sequences + + +def data_to_bin(config): + """Read data and convert it to bin""" + + # load test set + test = CoNLLDataset(config.filename_test, config.processing_word, + config.processing_tag, config.max_iter) + for idx, (words, labels) in enumerate(minibatches(test, config.batch_size)): + char_ids, word_ids = zip(*words) + word_ids, sequence_lengths = pad_sequences(word_ids, 0, + max_sequence_length=config.max_sequence_length, + max_word_length=config.max_word_length) + char_ids, word_lengths = pad_sequences(char_ids, pad_tok=0, nlevels=2, + max_sequence_length=config.max_sequence_length, + max_word_length=config.max_word_length) + word_ids = np.array(word_ids) + sequence_lengths = np.array(sequence_lengths) + char_ids = np.array(char_ids) + labels = np.array(labels) + + dir_bins = "./bin_data/" + dir_word_ids = os.path.join(dir_bins, "word_ids/") + dir_sequence_lengths = os.path.join(dir_bins, "sequence_lengths/") + dir_char_ids = os.path.join(dir_bins, "char_ids/") + dir_labels = os.path.join(dir_bins, "labels/") + + # create directories + if not os.path.exists(dir_bins): + os.mkdir(dir_bins) + if not os.path.exists(dir_word_ids): + os.mkdir(dir_word_ids) + if not os.path.exists(dir_sequence_lengths): + os.mkdir(dir_sequence_lengths) + if not os.path.exists(dir_char_ids): + os.mkdir(dir_char_ids) + if not os.path.exists(dir_labels): + os.mkdir(dir_labels) + + # store data as bin + word_ids.tofile("{0}/{1:04d}.bin".format(dir_word_ids, idx)) + sequence_lengths.tofile("{0}/{1:04d}.bin".format(dir_sequence_lengths, idx)) + char_ids.tofile("{0}/{1:04d}.bin".format(dir_char_ids, idx)) + labels.tofile("{0}/{1:04d}.bin".format(dir_labels, idx)) + + +if __name__ == '__main__': + config = Config() + config.batch_size = 1 + data_to_bin(config) diff --git a/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/eval_ckpt.py b/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/eval_ckpt.py new file mode 100644 index 0000000000000000000000000000000000000000..d25f4aa24bb28b6107fa7865eb9920e4ac9b1aab --- /dev/null +++ b/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/eval_ckpt.py @@ -0,0 +1,59 @@ +# Copyright 2017 The TensorFlow Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +# Copyright 2021 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
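Before handing the exported samples to msame (STEP4), it can help to read one back and confirm it matches the input shapes later declared to ATC (`word_ids:1,128`, `sequence_lengths:1`, `char_ids:1,128,64`). A small check, assuming `convert_bin.py` was run with its defaults (batch size 1, output under `./bin_data/`) and that the ids were written as int64, which is what `np.array` produces for Python ints on a typical 64-bit Linux host:

```python
# Sketch: reload the first exported sample and print its shapes.
# File names follow convert_bin.py's "{idx:04d}.bin" pattern; the int64 dtype is an assumption.
import numpy as np

word_ids = np.fromfile("./bin_data/word_ids/0000.bin", dtype=np.int64).reshape(1, 128)
seq_len = np.fromfile("./bin_data/sequence_lengths/0000.bin", dtype=np.int64)
char_ids = np.fromfile("./bin_data/char_ids/0000.bin", dtype=np.int64).reshape(1, 128, 64)
labels = np.fromfile("./bin_data/labels/0000.bin", dtype=np.int64)

print(word_ids.shape, seq_len, char_ids.shape, labels[:10])
```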
+# See the License for the specific language governing permissions and +# limitations under the License. + +from model.data_utils import CoNLLDataset +from model.ner_model import NERModel +from model.config import Config + + +def eval_ckpt(config): + """Evaluate the performance of ckpt on test set. + + Args: + config: configuration of eval. + + Returns: + metrics: (dict) metrics["acc"] = 98.4, ... + """ + # build model + model = NERModel(config) + model.build() + model.restore_session(config.dir_ckpt) + # create dataset + test = CoNLLDataset(config.filename_test, config.processing_word, + config.processing_tag, config.max_iter) + + # evaluate and interact + model.evaluate(test) + + +if __name__ == "__main__": + # create instance of config + config = Config() + config.batch_size = 1 + eval_ckpt(config) diff --git a/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/eval_om.py b/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/eval_om.py new file mode 100644 index 0000000000000000000000000000000000000000..5e7b7128df115f40f7ffe6ba94a860562bbab2eb --- /dev/null +++ b/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/eval_om.py @@ -0,0 +1,82 @@ +# Copyright 2017 The TensorFlow Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +# Copyright 2021 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np +import tensorflow as tf + +if type(tf.contrib) != type(tf): tf.contrib._warning = None +tf.get_logger().setLevel('ERROR') + +from model.data_utils import get_chunks, load_bin +from model.config import Config + + +def eval_om(config): + """Evaluate the performance of om on test set. + + Args: + config: configuration of eval. + + Returns: + metrics: (dict) metrics["acc"] = 98.4, ... + """ + + dir_bin_label = './bin_data/labels' + + crf_params = load_bin(config.dir_om_output + '/output_1') + logits = load_bin(config.dir_om_output + '/output_0') + labels = load_bin(dir_bin_label, data_type=np.int) + + correct_preds, total_correct, total_preds = 0., 0., 0. 
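+    # Decode each om sample the same way the ckpt/pb paths do: reshape the flat
+    # logits back to (max_sequence_length, 9) (the hard-coded tag count), truncate
+    # to the gold label length, reshape the CRF transition output to (9, 9),
+    # Viterbi-decode, then score precision/recall/F1 over whole entity chunks
+    # rather than individual tags.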
+ for label, logit, crf_param in zip(labels, logits, crf_params): + length = len(label) + logit = logit.reshape((config.max_sequence_length, 9))[:length] + crf_param = crf_param.reshape((9, 9)) + lab_pred, viterbi_score = tf.contrib.crf.viterbi_decode(logit, crf_param) + + lab_chunks = set(get_chunks(label, config.vocab_tags)) + lab_pred_chunks = set(get_chunks(lab_pred, config.vocab_tags)) + + correct_preds += len(lab_chunks & lab_pred_chunks) + total_preds += len(lab_pred_chunks) + total_correct += len(lab_chunks) + + p = correct_preds / total_preds if correct_preds > 0 else 0 + r = correct_preds / total_correct if correct_preds > 0 else 0 + f1 = 2 * p * r / (p + r) if correct_preds > 0 else 0 + + return {"precision": 100 * p, "recall": 100 * r, "f1": 100 * f1} + + +if __name__ == "__main__": + # create instance of config + config = Config() + config.batch_size = 1 + metrics = eval_om(config) + msg = " - ".join(["{} {:04.2f}".format(k, v) for k, v in metrics.items()]) + print(msg) diff --git a/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/eval_pb.py b/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/eval_pb.py new file mode 100644 index 0000000000000000000000000000000000000000..4923e7ac60380a004ebe7ab04053781d971d3161 --- /dev/null +++ b/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/eval_pb.py @@ -0,0 +1,102 @@ +# Copyright 2017 The TensorFlow Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +# Copyright 2021 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np +import tensorflow as tf + +if type(tf.contrib) != type(tf): tf.contrib._warning = None +tf.get_logger().setLevel('ERROR') + +from tensorflow_core.python.platform import gfile +from model.data_utils import get_chunks, load_bin +from model.config import Config + + +def eval_pb(config): + """Evaluate the performance of pb on test set. + + Args: + config: configuration of eval. + + Returns: + metrics: (dict) metrics["acc"] = 98.4, ... 
+ """ + dir_bin_input = './bin_data' + dir_bin_label = './bin_data/labels' + + word_ids = load_bin(dir_bin_input + '/word_ids', data_type=np.int) + char_ids = load_bin(dir_bin_input + '/char_ids', data_type=np.int) + sequence_lengths = load_bin(dir_bin_input + '/sequence_lengths', data_type=np.int) + labels = load_bin(dir_bin_label, data_type=np.int) + + # import graph + sess = tf.Session() + with gfile.FastGFile('./pb_model/SEQUENCE_TAGGING.pb', 'rb') as f: + graph_def = tf.GraphDef() + graph_def.ParseFromString(f.read()) + sess.graph.as_default() + tf.import_graph_def(graph_def, name='') + + correct_preds, total_correct, total_preds = 0., 0., 0. + for word_id, char_id, sequence_length, label in zip(word_ids, char_ids, sequence_lengths, labels): + word_id = word_id.reshape((config.batch_size, config.max_sequence_length)) + char_id = char_id.reshape((config.batch_size, config.max_sequence_length, config.max_word_length)) + + input_word_ids = sess.graph.get_tensor_by_name('word_ids:0') + input_sequence_length = sess.graph.get_tensor_by_name('sequence_lengths:0') + input_char_ids = sess.graph.get_tensor_by_name('char_ids:0') + + fd = {input_word_ids: word_id, input_sequence_length: sequence_length, input_char_ids: char_id} + logits = sess.graph.get_tensor_by_name('dense/BiasAdd:0') + trans_params = sess.graph.get_tensor_by_name('transitions:0') + logit, trans_param = sess.run([logits, trans_params], feed_dict=fd) + logit = logit.squeeze()[:len(label)] + + lab_pred, viterbi_score = tf.contrib.crf.viterbi_decode(logit, trans_param) + + lab_chunks = set(get_chunks(label, config.vocab_tags)) + lab_pred_chunks = set(get_chunks(lab_pred, config.vocab_tags)) + + correct_preds += len(lab_chunks & lab_pred_chunks) + total_preds += len(lab_pred_chunks) + total_correct += len(lab_chunks) + + p = correct_preds / total_preds if correct_preds > 0 else 0 + r = correct_preds / total_correct if correct_preds > 0 else 0 + f1 = 2 * p * r / (p + r) if correct_preds > 0 else 0 + + return {"precision": 100 * p, "recall": 100 * r, "f1": 100 * f1} + + +if __name__ == "__main__": + # create instance of config + config = Config() + config.batch_size = 1 + metrics = eval_pb(config) + msg = " - ".join(["{} {:04.2f}".format(k, v) for k, v in metrics.items()]) + print(msg) diff --git a/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/freeze_graph.py b/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/freeze_graph.py new file mode 100644 index 0000000000000000000000000000000000000000..a2d166a65af1e0d479a7219066faf35641ec9c09 --- /dev/null +++ b/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/freeze_graph.py @@ -0,0 +1,103 @@ +# Copyright 2017 The TensorFlow Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +# Copyright 2021 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
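Both `eval_pb.py` and `eval_om.py` report chunk-level precision, recall and F1: a prediction only counts when the entire entity span and its type match the gold annotation. A toy illustration with `get_chunks` from `model/data_utils.py`; the tag-to-id mapping below is made up for the example (the real one comes from `data/tags.txt`):

```python
# Toy illustration of the chunk-level metric used by the eval scripts.
from model.data_utils import get_chunks

tags = {"O": 0, "B-PER": 1, "I-PER": 2, "B-LOC": 3}   # hypothetical mapping
gold = [1, 2, 0, 3]   # PER PER O LOC -> chunks ("PER", 0, 2) and ("LOC", 3, 4)
pred = [1, 2, 0, 0]   # the model missed the LOC entity

gold_chunks = set(get_chunks(gold, tags))
pred_chunks = set(get_chunks(pred, tags))

correct = len(gold_chunks & pred_chunks)   # 1
p = correct / len(pred_chunks)             # 1.0
r = correct / len(gold_chunks)             # 0.5
f1 = 2 * p * r / (p + r)                   # ~0.67
print(p, r, f1)
```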
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import tensorflow as tf +from tensorflow_core.python.platform import gfile + +from model.data_utils import pad_sequences +from model.ner_model import NERModel +from model.config import Config + + +def infer_pb(): + """Inference test of pb.""" + + sess = tf.Session() + with gfile.FastGFile('./pb_model/SEQUENCE_TAGGING.pb', 'rb') as f: + graph_def = tf.GraphDef() + graph_def.ParseFromString(f.read()) + sess.graph.as_default() + tf.import_graph_def(graph_def, name='') + + # 输入 + words_raw = "I love Paris".strip().split(" ") + words = [config.processing_word(w) for w in words_raw] + if type(words[0]) == tuple: + words = zip(*words) + char_ids, word_ids = zip(*[words]) + word_ids, sequence_lengths = pad_sequences(word_ids, 0, + max_sequence_length=config.max_sequence_length, + max_word_length=config.max_word_length) + char_ids, word_lengths = pad_sequences(char_ids, pad_tok=0, nlevels=2, + max_sequence_length=config.max_sequence_length, + max_word_length=config.max_word_length) + input_word_ids = sess.graph.get_tensor_by_name('word_ids:0') + input_sequence_length = sess.graph.get_tensor_by_name('sequence_lengths:0') + input_char_ids = sess.graph.get_tensor_by_name('char_ids:0') + + fd = {input_word_ids: word_ids, input_sequence_length: sequence_lengths, + input_char_ids: char_ids} + logits = sess.graph.get_tensor_by_name('dense/BiasAdd:0') + trans_params = sess.graph.get_tensor_by_name('transitions:0') + logits, trans_params = sess.run([logits, trans_params], feed_dict=fd) + print(logits) + print(trans_params) + + +def ckpt_to_pb(config): + """Convert ckpt to pb. + + Args: + config: configuration of pb. + """ + model = NERModel(config) + model.build() + model.restore_session(config.dir_ckpt) + + sess = model.sess + output_nodes = 'dense/BiasAdd,transitions' + + graph = tf.get_default_graph() + input_graph_def = graph.as_graph_def() + output_graph_def = tf.graph_util.convert_variables_to_constants( + sess=sess, + input_graph_def=input_graph_def, + output_node_names=output_nodes.split(",")) + + with tf.gfile.GFile("./pb_model/SEQUENCE_TAGGING.pb", "wb") as f: + f.write(output_graph_def.SerializeToString()) + print("%d ops in the final graph." % len(output_graph_def.node)) + + print("done") + + +if __name__ == '__main__': + # create instance of config + config = Config() + config.batch_size = 1 + ckpt_to_pb(config) diff --git a/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/model/__init__.py b/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/model/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/model/base_model.py b/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/model/base_model.py new file mode 100644 index 0000000000000000000000000000000000000000..b22d99a2ecae81697305ae00d034b8bb4480c835 --- /dev/null +++ b/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/model/base_model.py @@ -0,0 +1,192 @@ +# Copyright 2017 The TensorFlow Authors. All Rights Reserved. 
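After STEP1 it is worth confirming that the frozen graph actually contains the node names STEP2 passes to ATC via `--input_shape` and `--out_nodes`. A minimal check, assuming the default `./pb_model/SEQUENCE_TAGGING.pb` written by `freeze_graph.py`:

```python
# Sketch: list the graph nodes relevant to the ATC conversion.
import tensorflow as tf
from tensorflow_core.python.platform import gfile

with gfile.FastGFile("./pb_model/SEQUENCE_TAGGING.pb", "rb") as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

names = {node.name for node in graph_def.node}
for expected in ("word_ids", "sequence_lengths", "char_ids",
                 "dense/BiasAdd", "transitions"):
    print(expected, "found" if expected in names else "MISSING")
```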
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +# Copyright 2021 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import tensorflow as tf + +if type(tf.contrib) != type(tf): tf.contrib._warning = None +tf.get_logger().setLevel('ERROR') + +from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig + +# from npu_bridge.npu_init import * + + +class BaseModel(object): + """Generic class for general methods that are not specific to NER""" + + def __init__(self, config): + """Defines self.config + + Args: + config: (Config instance) class with hyper parameters, + vocab and embeddings + + """ + self.config = config + self.sess = None + self.saver = None + + def reinitialize_weights(self, scope_name): + """Reinitializes the weights of a given layer""" + variables = tf.contrib.framework.get_variables(scope_name) + init = tf.variables_initializer(variables) + self.sess.run(init) + + def add_train_op(self, lr_method, lr, loss, clip=-1): + """Defines self.train_op that performs an update on a batch + + Args: + lr_method: (string) sgd method, for example "adam" + lr: (tf.placeholder) tf.float32, learning rate + loss: (tensor) tf.float32 loss to minimize + clip: (python float) clipping of gradient. 
If < 0, no clipping + + """ + _lr_m = lr_method.lower() # lower to make sure + + with tf.compat.v1.variable_scope("train_step"): + if _lr_m == 'adam': + optimizer = tf.compat.v1.train.AdamOptimizer(lr) + elif _lr_m == 'adagrad': + optimizer = tf.compat.v1.train.AdagradOptimizer(lr) + elif _lr_m == 'sgd': + optimizer = tf.compat.v1.train.GradientDescentOptimizer(lr) + elif _lr_m == 'momentum': + optimizer = tf.compat.v1.train.MomentumOptimizer(lr, 0.9) + elif _lr_m == 'rmsprop': + optimizer = tf.compat.v1.train.RMSPropOptimizer(lr) + else: + raise NotImplementedError("Unknown method {}".format(_lr_m)) + + if clip > 0: # gradient clipping if clip is positive + grads, vs = zip(*optimizer.compute_gradients(loss)) + grads, gnorm = tf.clip_by_global_norm(grads, clip) + grads_and_vars = [(grad, var) for grad, var in zip(grads, vs) if grad is not None] + self.train_op = optimizer.apply_gradients(grads_and_vars) + else: + self.train_op = optimizer.minimize(loss) + + def initialize_session(self): + """Defines self.sess and initialize the variables""" + print("Initializing tf session") + + config = tf.compat.v1.ConfigProto() + custom_op = config.graph_options.rewrite_options.custom_optimizers.add() + custom_op.name = "NpuOptimizer" + + custom_op.parameter_map["dynamic_input"].b = True + custom_op.parameter_map["dynamic_graph_execute_mode"].s = tf.compat.as_bytes("lazy_recompile") + + config.graph_options.rewrite_options.remapping = RewriterConfig.OFF + config.graph_options.rewrite_options.memory_optimization = RewriterConfig.OFF + self.sess = tf.compat.v1.Session(config=config) + self.sess.run(tf.compat.v1.global_variables_initializer()) + self.saver = tf.compat.v1.train.Saver() + + def restore_session(self, dir_model): + """Reload weights into session + + Args: + sess: tf.compat.v1.Session() + dir_model: dir with weights + + """ + print("Reloading the latest trained model...") + self.saver.restore(self.sess, dir_model) + + def save_session(self): + """Saves session = weights""" + if not os.path.exists(self.config.dir_model): + os.makedirs(self.config.dir_model) + self.saver.save(self.sess, self.config.dir_model) + + def close_session(self): + """Closes the session""" + self.sess.close() + + def add_summary(self): + """Defines variables for Tensorboard + + Args: + dir_output: (string) where the results are written + + """ + self.merged = tf.compat.v1.summary.merge_all() + self.file_writer = tf.compat.v1.summary.FileWriter(self.config.output_path, + self.sess.graph) + + def train(self, train, test): + """Performs training with early stopping and lr exponential decay + + Args: + train: dataset that yields tuple of (sentences, tags) + test: dataset + + """ + best_score = 0 + # for early stopping + nepoch_no_imprv = 0 + # tensorboard + self.add_summary() + init_lr = self.config.lr + start_epoch = 0 if self.config.resume == 0 else self.config.resume + # resume training + if start_epoch > 0: + self.restore_session(self.config.dir_model) + for epoch in range(start_epoch, self.config.nepochs): + print("Epoch {:} out of {:}".format(epoch + 1, self.config.nepochs)) + self.config.lr = init_lr / (1 + self.config.lr_decay * epoch) # decay learning rate + + score = self.run_epoch(train, test, epoch) + # early stopping and saving best parameters + if score >= best_score: + nepoch_no_imprv = 0 + self.save_session() + best_score = score + print("- new best score!") + else: + nepoch_no_imprv += 1 + if nepoch_no_imprv >= self.config.nepoch_no_imprv: + print("Early stopping {} epochs without " \ + 
"improvement".format(nepoch_no_imprv)) + break + print("Final best score {}".format(best_score)) + + def evaluate(self, test): + """Evaluate model on test set + + Args: + test: instance of class Dataset + + """ + print("Testing model over test set") + metrics = self.run_evaluate(test) + msg = " - ".join(["{} {:04.2f}".format(k, v) + for k, v in metrics.items()]) + print(msg) diff --git a/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/model/config.py b/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/model/config.py new file mode 100644 index 0000000000000000000000000000000000000000..2f0fbc6bfdb866abda5e7c0f001c031d6dff3620 --- /dev/null +++ b/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/model/config.py @@ -0,0 +1,162 @@ +# Copyright 2017 The TensorFlow Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +# Copyright 2021 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import argparse + +from .data_utils import get_trimmed_glove_vectors, load_vocab, \ + get_processing_word + + +class Config(): + def __init__(self, load=True): + """Initialize hyperparameters and load vocabs + + Args: + load_embeddings: (bool) if True, load embeddings into + np array, else None + + """ + if load: + self.load() + + def load(self): + """Loads vocabulary, processing functions and embeddings + + Supposes that build_data.py has been run successfully and that + the corresponding files have been created (vocab and trimmed GloVe + vectors) + + """ + # 1. vocabulary + self.vocab_words = load_vocab(self.filename_words) + self.vocab_tags = load_vocab(self.filename_tags) + self.vocab_chars = load_vocab(self.filename_chars) + + self.nwords = len(self.vocab_words) + self.nchars = len(self.vocab_chars) + self.ntags = len(self.vocab_tags) + + # 2. get processing functions that map str -> id + self.processing_word = get_processing_word(self.vocab_words, + self.vocab_chars, lowercase=True, chars=self.use_chars) + self.processing_tag = get_processing_word(self.vocab_tags, + lowercase=False, allow_unk=False) + + # 3. 
get pre-trained embeddings + self.embeddings = (get_trimmed_glove_vectors(self.filename_trimmed) + if self.use_pretrained else None) + + parser = argparse.ArgumentParser() + parser.add_argument("--data_path", type=str, + default="./data", help="Dataset path.") + parser.add_argument("--resume", type=int, default=0, help="Resume training.") + parser.add_argument("--output_path", type=str, default="./results/", help="Output path.") + parser.add_argument("--dir_ckpt", type=str, default="./results/model.weights/", + help="Checkpoint path for evaluation.") + parser.add_argument("--dir_om_output", type=str, default="./bin_data/", + help="The output path of om for evaluation.") + parser.add_argument("--dim_word", type=int, default=100, help="The dimension of word embeddings.") + parser.add_argument("--dim_char", type=int, default=30, help="The dimension of char embeddings.") + parser.add_argument("--max_sequence_length", type=int, default=128, help="Max length of sequence.") + parser.add_argument("--max_word_length", type=int, default=64, help="Max length of word.") + parser.add_argument("--train_embeddings", type=bool, default=False, help="Whether to train embeddings.") + parser.add_argument("--epochs", type=int, default=100, help="Number of epochs.") + parser.add_argument("--dropout", type=float, default=0.5, help="Dropout.") + parser.add_argument("--batchsize", type=int, default=10, help="Batch size.") + parser.add_argument("--optimizer", type=str, default="adam", + choices=('adam', 'sgd', 'momentum', 'adagrad', 'rmsprop'), help="Optimizer.") + parser.add_argument("--lr", type=float, default=0.002, help="Learning rate.") + parser.add_argument("--lr_decay", type=float, default=0.05, help="The decay rate of learning rate.") + parser.add_argument("--grad_clip", type=float, default=10, help="Gradient clip.") + parser.add_argument("--early_stop", type=int, default=10, help="Early stop.") + parser.add_argument("--mix_precision", type=bool, default=False, help="Whether to enable mix precision.") + parser.add_argument("--hidden_size_lstm", type=int, default=200, help="Hidden size of lstm.") + parser.add_argument("--use_crf", type=bool, default=True, help="Whether to use crf.") + parser.add_argument("--use_chars", type=bool, default=True, help="Whether to use char embeddings.") + parser.add_argument("--conv_kernel_size", type=int, default=3, help="Kernel size of cnn block.") + parser.add_argument("--conv_filter_num", type=int, default=30, help="Filter number of cnn block.") + args = parser.parse_args() + + data_path = args.data_path + output_path = args.output_path + dir_om_output = args.dir_om_output + + # general config + dir_ckpt = args.dir_ckpt + if dir_ckpt[-1] != '/': + dir_ckpt += '/' + dir_model = os.path.join(output_path, "model.weights/") + + # embeddings + dim_word = args.dim_word + dim_char = args.dim_char + + max_sequence_length = args.max_sequence_length + max_word_length = args.max_word_length + + # glove files + filename_glove = os.path.join(data_path, "glove.6B/glove.6B.{}d.txt".format(dim_word)) + # trimmed embeddings (created from glove_filename with build_data.py) + filename_trimmed = os.path.join(data_path, "glove.6B.{}d.trimmed.npz".format(dim_word)) + use_pretrained = True + + # dataset + filename_dev = os.path.join(data_path, "coNLL/eng/eng.testa.iob") + filename_test = os.path.join(data_path, "coNLL/eng/eng.testb.iob") + filename_train = os.path.join(data_path, "coNLL/eng/eng.train.iob") + + max_iter = None # if not None, max number of examples in Dataset + + # vocab (created 
from dataset with build_data.py) + filename_words = os.path.join(data_path, "words.txt") + filename_tags = os.path.join(data_path, "tags.txt") + filename_chars = os.path.join(data_path, "chars.txt") + + # resume + resume = args.resume + + # training + train_embeddings = args.train_embeddings + nepochs = args.epochs + dropout = args.dropout + batch_size = args.batchsize + lr_method = args.optimizer + lr = args.lr + lr_decay = args.lr_decay + clip = args.grad_clip + nepoch_no_imprv = args.early_stop + + # model hyperparameters + hidden_size_lstm = args.hidden_size_lstm # lstm on word embeddings + + # NOTE: if both chars and crf, only 1.6x slower on GPU + use_crf = args.use_crf # if crf, training is 1.7x slower on CPU + use_chars = args.use_chars # if char embedding, training is 3.5x slower on CPU + conv_kernel_size = args.conv_kernel_size + conv_filter_num = args.conv_filter_num diff --git a/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/model/data_utils.py b/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/model/data_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..c783291822552484dc2ac7d0387e9682a955c11c --- /dev/null +++ b/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/model/data_utils.py @@ -0,0 +1,479 @@ +# Copyright 2017 The TensorFlow Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +# Copyright 2021 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import os + +import numpy as np + +# shared global variables to be imported from model also +UNK = "$UNK$" +NUM = "$NUM$" +NONE = "O" + + +# special error message +class MyIOError(Exception): + def __init__(self, filename): + # custom error message + message = """ +ERROR: Unable to locate file {}. + +FIX: Have you tried running python build_data.py first? +This will build vocab file from your train, test and dev sets and +trimm your word vectors. 
+""".format(filename) + super(MyIOError, self).__init__(message) + + +class CoNLLDataset(object): + """Class that iterates over CoNLL Dataset + + __iter__ method yields a tuple (words, tags) + words: list of raw words + tags: list of raw tags + + If processing_word and processing_tag are not None, + optional preprocessing is appplied + + Example: + ```python + data = CoNLLDataset(filename) + for sentence, tags in data: + pass + ``` + + """ + + def __init__(self, filename, processing_word=None, processing_tag=None, + max_iter=None): + """ + Args: + filename: path to the file + processing_words: (optional) function that takes a word as input + processing_tags: (optional) function that takes a tag as input + max_iter: (optional) max number of sentences to yield + + """ + self.filename = filename + self.processing_word = processing_word + self.processing_tag = processing_tag + self.max_iter = max_iter + self.length = None + + def __iter__(self): + niter = 0 + with open(self.filename) as f: + words, tags = [], [] + for line in f: + line = line.strip() + if len(line) == 0 or line.startswith("-DOCSTART-"): + if len(words) != 0: + niter += 1 + if self.max_iter is not None and niter > self.max_iter: + break + yield words, tags + words, tags = [], [] + else: + ls = line.split(' ') + word, tag = ls[0], ls[1] + if self.processing_word is not None: + word = self.processing_word(word) + if self.processing_tag is not None: + tag = self.processing_tag(tag) + words += [word] + tags += [tag] + + def __len__(self): + """Iterates once over the corpus to set and store length""" + if self.length is None: + self.length = 0 + for _ in self: + self.length += 1 + + return self.length + + +def get_vocabs(datasets): + """Build vocabulary from an iterable of datasets objects + + Args: + datasets: a list of dataset objects + + Returns: + a set of all the words in the dataset + + """ + print("Building vocab...") + vocab_words = set() + vocab_tags = set() + for dataset in datasets: + for words, tags in dataset: + vocab_words.update(words) + vocab_tags.update(tags) + print("- done. {} tokens".format(len(vocab_words))) + return vocab_words, vocab_tags + + +def get_char_vocab(dataset): + """Build char vocabulary from an iterable of datasets objects + + Args: + dataset: a iterator yielding tuples (sentence, tags) + + Returns: + a set of all the characters in the dataset + + """ + vocab_char = set() + for words, _ in dataset: + for word in words: + vocab_char.update(word) + + return vocab_char + + +def get_glove_vocab(filename): + """Load vocab from file + + Args: + filename: path to the glove vectors + + Returns: + vocab: set() of strings + """ + print("Building vocab...") + vocab = set() + with open(filename, encoding="utf-8") as f: + for line in f: + word = line.strip().split(' ')[0] + vocab.add(word) + print("- done. {} tokens".format(len(vocab))) + return vocab + + +def write_vocab(vocab, filename): + """Writes a vocab to a file + + Writes one word per line. + + Args: + vocab: iterable that yields word + filename: path to vocab file + + Returns: + write a word per line + + """ + print("Writing vocab...") + with open(filename, "w") as f: + for i, word in enumerate(vocab): + if i != len(vocab) - 1: + f.write("{}\n".format(word)) + else: + f.write(word) + print("- done. {} tokens".format(len(vocab))) + + +def load_vocab(filename): + """Loads vocab from a file + + Args: + filename: (string) the format of the file must be one word per line. 
+ + Returns: + d: dict[word] = index + + """ + try: + d = dict() + with open(filename) as f: + for idx, word in enumerate(f): + word = word.strip() + d[word] = idx + + except IOError: + raise MyIOError(filename) + return d + + +def export_trimmed_glove_vectors(vocab, glove_filename, trimmed_filename, dim): + """Saves glove vectors in numpy array + + Args: + vocab: dictionary vocab[word] = index + glove_filename: a path to a glove file + trimmed_filename: a path where to store a matrix in npy + dim: (int) dimension of embeddings + + """ + embeddings = np.zeros([len(vocab), dim]) + with open(glove_filename, encoding="utf-8") as f: + for line in f: + line = line.strip().split(' ') + word = line[0] + embedding = [float(x) for x in line[1:]] + if word in vocab: + word_idx = vocab[word] + embeddings[word_idx] = np.asarray(embedding) + + np.savez_compressed(trimmed_filename, embeddings=embeddings) + + +def get_trimmed_glove_vectors(filename): + """ + Args: + filename: path to the npz file + + Returns: + matrix of embeddings (np array) + + """ + try: + with np.load(filename) as data: + return data["embeddings"] + + except IOError: + raise MyIOError(filename) + + +def get_processing_word(vocab_words=None, vocab_chars=None, + lowercase=False, chars=False, allow_unk=True): + """Return lambda function that transform a word (string) into list, + or tuple of (list, id) of int corresponding to the ids of the word and + its corresponding characters. + + Args: + vocab_words: vocab of words + vocab_chars: vocab of chars + lowercase: whether to use lowercase + chars: whether to use chars + allow_unk: whether to allow unk + vocab: dict[word] = idx + + Returns: + f("cat") = ([12, 4, 32], 12345) + = (list of char ids, word id) + + """ + + def f(word): + # 0. get chars of words + if vocab_chars is not None and chars == True: + char_ids = [] + for char in word: + # ignore chars out of vocabulary + if char in vocab_chars: + char_ids += [vocab_chars[char]] + + # 1. preprocess word + if lowercase: + word = word.lower() + if word.isdigit(): + word = NUM + + # 2. get id of word + if vocab_words is not None: + if word in vocab_words: + word = vocab_words[word] + else: + if allow_unk: + word = vocab_words[UNK] + else: + raise Exception("Unknow key is not allowed. Check that " \ + "your vocab (tags?) is correct") + + # 3. 
return tuple char ids, word id + if vocab_chars is not None and chars == True: + return char_ids, word + else: + return word + + return f + + +def _pad_sequences(sequences, pad_tok, max_length): + """ + Args: + sequences: a generator of list or tuple + pad_tok: the char to pad with + + Returns: + a list of list where each sublist has same length + """ + sequence_padded, sequence_length = [], [] + + for seq in sequences: + seq = list(seq) + seq_ = seq[:max_length] + [pad_tok] * max(max_length - len(seq), 0) + sequence_padded += [seq_] + sequence_length += [min(len(seq), max_length)] + + return sequence_padded, sequence_length + + +def pad_sequences(sequences, pad_tok, nlevels=1, max_sequence_length=0, max_word_length=0): + """ + Args: + sequences: a generator of list or tuple + pad_tok: the char to pad with + nlevels: "depth" of padding, for the case where we have characters ids + + Returns: + a list of list where each sublist has same length + + """ + if nlevels == 1: + max_length = max(map(lambda x: len(x), sequences)) if max_sequence_length == 0 else max_sequence_length + sequence_padded, sequence_length = _pad_sequences(sequences, + pad_tok, max_length) + + elif nlevels == 2: + max_length_word = max([max(map(lambda x: len(x), seq)) + for seq in sequences]) if max_word_length == 0 else max_word_length + sequence_padded, sequence_length = [], [] + for seq in sequences: + # all words are same length now + sp, sl = _pad_sequences(seq, pad_tok, max_length_word) + sequence_padded += [sp] + sequence_length += [sl] + + max_length_sentence = max(map(lambda x: len(x), sequences)) if max_sequence_length == 0 else max_sequence_length + sequence_padded, _ = _pad_sequences(sequence_padded, + [pad_tok] * max_length_word, max_length_sentence) + sequence_length, _ = _pad_sequences(sequence_length, 0, + max_length_sentence) + + return sequence_padded, sequence_length + + +def minibatches(data, minibatch_size): + """ + Args: + data: generator of (sentence, tags) tuples + minibatch_size: (int) + + Yields: + list of tuples + + """ + x_batch, y_batch = [], [] + for (x, y) in data: + if len(x_batch) == minibatch_size: + yield x_batch, y_batch + x_batch, y_batch = [], [] + + if type(x[0]) == tuple: + x = zip(*x) + x_batch += [x] + y_batch += [y] + + # if len(x_batch) != 0: + # yield x_batch, y_batch + + +def get_chunk_type(tok, idx_to_tag): + """ + Args: + tok: id of token, ex 4 + idx_to_tag: dictionary {4: "B-PER", ...} + + Returns: + tuple: "B", "PER" + + """ + tag_name = idx_to_tag[tok] + tag_class = tag_name.split('-')[0] + tag_type = tag_name.split('-')[-1] + return tag_class, tag_type + + +def get_chunks(seq, tags): + """Given a sequence of tags, group entities and their position + + Args: + seq: [4, 4, 0, 0, ...] sequence of labels + tags: dict["O"] = 4 + + Returns: + list of (chunk_type, chunk_start, chunk_end) + + Example: + seq = [4, 5, 0, 3] + tags = {"B-PER": 4, "I-PER": 5, "B-LOC": 3} + result = [("PER", 0, 2), ("LOC", 3, 4)] + + """ + default = tags[NONE] + idx_to_tag = {idx: tag for tag, idx in tags.items()} + chunks = [] + chunk_type, chunk_start = None, None + for i, tok in enumerate(seq): + # End of a chunk 1 + if tok == default and chunk_type is not None: + # Add a chunk. + chunk = (chunk_type, chunk_start, i) + chunks.append(chunk) + chunk_type, chunk_start = None, None + + # End of a chunk + start of a chunk! 
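+        # e.g. for "B-PER I-PER B-LOC", reaching B-LOC both closes the PER chunk and starts a LOC chunk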
+ elif tok != default: + tok_chunk_class, tok_chunk_type = get_chunk_type(tok, idx_to_tag) + if chunk_type is None: + chunk_type, chunk_start = tok_chunk_type, i + elif tok_chunk_type != chunk_type or tok_chunk_class == "B": + chunk = (chunk_type, chunk_start, i) + chunks.append(chunk) + chunk_type, chunk_start = tok_chunk_type, i + else: + pass + + # end condition + if chunk_type is not None: + chunk = (chunk_type, chunk_start, len(seq)) + chunks.append(chunk) + + return chunks + + +def load_bin(dir_bin, data_type=np.float32): + """Load all bin data from dir_bin and build a data list. + + Args: + dir_bin: directory of bin data. + data_type: type of data loaded from bin. + + Returns: + list{data} + """ + outputs = [] + files = os.listdir(dir_bin) + files.sort() + for file in files: + if file.endswith('.bin'): + data = np.fromfile(dir_bin + '/' + file, dtype=data_type) + outputs.append(data) + return outputs diff --git a/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/model/ner_model.py b/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/model/ner_model.py new file mode 100644 index 0000000000000000000000000000000000000000..fcd7a69bf149f6bcd079b4560e753f9490ee58e1 --- /dev/null +++ b/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/model/ner_model.py @@ -0,0 +1,368 @@ +# Copyright 2017 The TensorFlow Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +# Copyright 2021 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
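+
+# NER model: word embeddings, optionally concatenated with char-CNN features,
+# feed a bidirectional LSTM; a dense layer produces per-tag logits that are
+# decoded either with a CRF or a per-token argmax depending on config.use_crf.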
+ +import time +import tensorflow as tf + +if type(tf.contrib) != type(tf): tf.contrib._warning = None +tf.get_logger().setLevel('ERROR') + +from .data_utils import minibatches, pad_sequences, get_chunks +from .base_model import BaseModel + +# from npu_bridge.npu_init import * +# from npu_bridge.estimator import npu_ops + + +class NERModel(BaseModel): + """Specialized class of Model for NER""" + + def __init__(self, config): + super(NERModel, self).__init__(config) + self.idx_to_tag = {idx: tag for tag, idx in + self.config.vocab_tags.items()} + + def add_placeholders(self): + """Define placeholders = entries to computational graph""" + # shape = (batch size, max length of sentence in batch) + self.word_ids = tf.compat.v1.placeholder(tf.int32, + shape=[self.config.batch_size, self.config.max_sequence_length], + name="word_ids") + + # shape = (batch size) + self.sequence_lengths = tf.compat.v1.placeholder(tf.int32, shape=[self.config.batch_size], + name="sequence_lengths") + + # shape = (batch size, max length of sentence, max length of word) + self.char_ids = tf.compat.v1.placeholder(tf.int32, shape=[self.config.batch_size, + self.config.max_sequence_length, + self.config.max_word_length], + name="char_ids") + + # shape = (batch_size, max_length of sentence) + self.word_lengths = tf.compat.v1.placeholder(tf.int32, + shape=[self.config.batch_size, self.config.max_sequence_length], + name="word_lengths") + + # shape = (batch size, max length of sentence in batch) + self.labels = tf.compat.v1.placeholder(tf.int32, + shape=[self.config.batch_size, self.config.max_sequence_length], + name="labels") + + # hyper parameters + self.dropout = tf.compat.v1.placeholder_with_default(1.0, shape=[], + name="dropout") + self.lr = tf.compat.v1.placeholder(dtype=tf.float32, shape=[], + name="lr") + + def get_feed_dict(self, words, labels=None, lr=None, dropout=None): + """Given some data, pad it and build a feed dictionary + + Args: + words: list of sentences. A sentence is a list of ids of a list of + words. 
A word is a list of ids + labels: list of ids + lr: (float) learning rate + dropout: (float) keep prob + + Returns: + dict {placeholder: value} + + """ + # perform padding of the given data + if self.config.use_chars: + char_ids, word_ids = zip(*words) + word_ids, sequence_lengths = pad_sequences(word_ids, 0, + max_sequence_length=self.config.max_sequence_length, + max_word_length=self.config.max_word_length) + char_ids, word_lengths = pad_sequences(char_ids, pad_tok=0, nlevels=2, + max_sequence_length=self.config.max_sequence_length, + max_word_length=self.config.max_word_length) + else: + word_ids, sequence_lengths = pad_sequences(words, 0, + max_sequence_length=self.config.max_sequence_length, + max_word_length=self.config.max_word_length) + + # build feed dictionary + feed = { + self.word_ids: word_ids, + self.sequence_lengths: sequence_lengths + } + + if self.config.use_chars: + feed[self.char_ids] = char_ids + feed[self.word_lengths] = word_lengths + + if labels is not None: + labels, _ = pad_sequences(labels, 0, + max_sequence_length=self.config.max_sequence_length, + max_word_length=self.config.max_word_length) + feed[self.labels] = labels + + if lr is not None: + feed[self.lr] = lr + + if dropout is not None: + feed[self.dropout] = dropout + + return feed, sequence_lengths + + def add_embeddings_op(self): + """Defines self.embeddings + + If self.config.embeddings is not None and is a np array initialized + with pre-trained word vectors, the word embeddings is just a look-up + and we don't train the vectors. Otherwise, a random matrix with + the correct shape is initialized. + """ + with tf.compat.v1.variable_scope("words"): + if self.config.embeddings is None: + print("WARNING: randomly initializing word vectors") + _word_embeddings = tf.compat.v1.get_variable( + name="_word_embeddings", + dtype=tf.float32, + shape=[self.config.nwords, self.config.dim_word]) + else: + _word_embeddings = tf.Variable( + self.config.embeddings, + name="_word_embeddings", + dtype=tf.float32, + trainable=self.config.train_embeddings) + + word_embeddings = tf.nn.embedding_lookup(_word_embeddings, + self.word_ids, name="word_embeddings") + + with tf.compat.v1.variable_scope("chars"): + if self.config.use_chars: + # get char embeddings matrix + _char_embeddings = tf.compat.v1.get_variable( + name="_char_embeddings", + dtype=tf.float32, + shape=[self.config.nchars, self.config.dim_char]) + char_embeddings = tf.nn.embedding_lookup(_char_embeddings, + self.char_ids, name="char_embeddings") + + char_embeddings = tf.keras.layers.TimeDistributed( + tf.keras.layers.Conv1D(kernel_size=self.config.conv_kernel_size, + filters=self.config.conv_filter_num, + padding='same', activation='tanh', strides=1))(char_embeddings) + char_embeddings = tf.keras.layers.TimeDistributed(tf.keras.layers.MaxPooling1D + (self.config.max_word_length))(char_embeddings) + char_embeddings = tf.keras.layers.TimeDistributed(tf.keras.layers.Flatten())(char_embeddings) + + word_embeddings = tf.concat([word_embeddings, char_embeddings], axis=-1) + + self.word_embeddings = tf.nn.dropout(word_embeddings, rate=1-self.dropout) + + def add_logits_op(self): + """Defines self.logits + + For each word in each sentence of the batch, it corresponds to a vector + of scores, of dimension equal to the number of tags. 
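+        The resulting logits tensor has shape
+        (batch size, max length of sentence, ntags).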
+ """ + with tf.compat.v1.variable_scope("bi-lstm"): + cell_fw = tf.contrib.rnn.LSTMCell(self.config.hidden_size_lstm) + cell_bw = tf.contrib.rnn.LSTMCell(self.config.hidden_size_lstm) + (output_fw, output_bw), _ = tf.nn.bidirectional_dynamic_rnn( + cell_fw, cell_bw, self.word_embeddings, + sequence_length=self.sequence_lengths, dtype=tf.float32) + output = tf.concat([output_fw, output_bw], axis=-1) + + self.word_embeddings = tf.nn.dropout(output, rate=1-self.dropout) + + self.logits = tf.layers.dense(output, self.config.ntags) + + def add_pred_op(self): + """Defines self.labels_pred + + This op is defined only in the case where we don't use a CRF since in + that case we can make the prediction "in the graph" (thanks to tf + functions in other words). With the CRF, as the inference is coded + in python and not in pure tensroflow, we have to make the prediciton + outside the graph. + """ + if not self.config.use_crf: + self.labels_pred = tf.cast(tf.argmax(self.logits, axis=-1), tf.int32) + + def add_loss_op(self): + """Defines the loss""" + if self.config.use_crf: + log_likelihood, trans_params = tf.contrib.crf.crf_log_likelihood( + self.logits, self.labels, self.sequence_lengths) + self.trans_params = trans_params # need to evaluate it for decoding + self.loss = tf.reduce_mean(-log_likelihood) + else: + losses = tf.nn.sparse_softmax_cross_entropy_with_logits( + logits=self.logits, labels=self.labels) + mask = tf.sequence_mask(self.sequence_lengths) + losses = tf.boolean_mask(losses, mask) + self.loss = tf.reduce_mean(losses) + + # for tensorboard + tf.compat.v1.summary.scalar("loss", self.loss) + + def build(self): + # NER specific functions + self.add_placeholders() + self.add_embeddings_op() + self.add_logits_op() + self.add_pred_op() + self.add_loss_op() + + # Generic functions that add training op and initialize session + self.add_train_op(self.config.lr_method, self.lr, self.loss, + self.config.clip) + self.initialize_session() # now self.sess is defined and vars are init + + def predict_batch(self, words): + """ + Args: + words: list of sentences + + Returns: + labels_pred: list of labels for each sentence + sequence_length + + """ + fd, sequence_lengths = self.get_feed_dict(words) + if self.config.use_crf: + # get tag scores and transition params of CRF + viterbi_sequences = [] + logits, trans_params = self.sess.run( + [self.logits, self.trans_params], feed_dict=fd) + + # iterate over the sentences because no batching in vitervi_decode + for logit, sequence_length in zip(logits, sequence_lengths): + logit = logit[:sequence_length] # keep only the valid steps + viterbi_seq, viterbi_score = tf.contrib.crf.viterbi_decode( + logit, trans_params) + viterbi_sequences += [viterbi_seq] + return viterbi_sequences, sequence_lengths + else: + labels_pred = self.sess.run(self.labels_pred, feed_dict=fd) + return labels_pred, sequence_lengths + + def run_epoch(self, train, test, epoch): + """Performs one complete pass over the train set and evaluate on dev + + Args: + train: dataset that yields tuple of sentences, tags + test: dataset + epoch: (int) index of the current epoch + + Returns: + f1: (python float), score to select model on, higher is better + + """ + epoch_start = time.time() + + # progbar stuff for logging + batch_size = self.config.batch_size + nbatches = len(train) // batch_size + + # iterate over dataset + start_time = time.time() + step = 0 + for i, (words, labels) in enumerate(minibatches(train, batch_size)): + fd, _ = self.get_feed_dict(words, labels, self.config.lr, 
self.config.dropout) + + _, train_loss, summary = self.sess.run( + [self.train_op, self.loss, self.merged], feed_dict=fd) + step += 1 + + # tensorboard + if i % 100 == 0: + self.file_writer.add_summary(summary, epoch * nbatches + i) + cost_time = time.time() - start_time + print("epoch : {}----step : {}----loss : {}----sec/step : {}" + .format(epoch + 1, i, train_loss, cost_time / step)) + start_time = time.time() + step = 0 + + epoch_end = time.time() + print("epoch time: %.8s s" % (epoch_end - epoch_start)) + + metrics = self.run_evaluate(test) + msg = " - ".join(["{} {:04.2f}".format(k, v) + for k, v in metrics.items()]) + print(msg) + + return metrics["f1"] + + def run_evaluate(self, test): + """Evaluates performance on test set + + Args: + test: dataset that yields tuple of (sentences, tags) + + Returns: + metrics: (dict) metrics["acc"] = 98.4, ... + + """ + accs = [] + correct_preds, total_correct, total_preds = 0., 0., 0. + for words, labels in minibatches(test, self.config.batch_size): + labels_pred, sequence_lengths = self.predict_batch(words) + + for lab, lab_pred, length in zip(labels, labels_pred, + sequence_lengths): + lab = lab[:length] + lab_pred = lab_pred[:length] + accs += [a == b for (a, b) in zip(lab, lab_pred)] + + lab_chunks = set(get_chunks(lab, self.config.vocab_tags)) + lab_pred_chunks = set(get_chunks(lab_pred, + self.config.vocab_tags)) + + correct_preds += len(lab_chunks & lab_pred_chunks) + total_preds += len(lab_pred_chunks) + total_correct += len(lab_chunks) + + p = correct_preds / total_preds if correct_preds > 0 else 0 + r = correct_preds / total_correct if correct_preds > 0 else 0 + f1 = 2 * p * r / (p + r) if correct_preds > 0 else 0 + + return {"precision": 100 * p, "recall": 100 * r, "f1": 100 * f1} + + def predict(self, words_raw): + """Returns list of tags + + Args: + words_raw: list of words (string), just one sentence (no batch) + + Returns: + preds: list of tags (string), one for each word in the sentence + + """ + words = [self.config.processing_word(w) for w in words_raw] + if type(words[0]) == tuple: + words = zip(*words) + pred_ids, _ = self.predict_batch([words]) + preds = [self.idx_to_tag[idx] for idx in list(pred_ids[0])] + + return preds diff --git a/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/requirements.txt b/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..77a51f9b31ea8ebdf695d936edb7a4c3c00364dc --- /dev/null +++ b/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/requirements.txt @@ -0,0 +1,2 @@ +tensorflow==1.15.0 +numpy diff --git a/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/scripts/pb_to_om.sh b/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/scripts/pb_to_om.sh new file mode 100644 index 0000000000000000000000000000000000000000..074578f7aca25b7ff2dfd72068c787e4559d4cc2 --- /dev/null +++ b/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/scripts/pb_to_om.sh @@ -0,0 +1,11 @@ +export PATH=/usr/local/python3.7.5/bin:$PATH +export PYTHONPATH=/usr/local/Ascend/ascend-toolkit/latest/atc/python/site-packages/te:$PYTHONPATH +export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/atc/lib64:${LD_LIBRARY_PATH} + +PB_PATH=/root/infer +OM_PATH=/root/infer + +/usr/local/Ascend/ascend-toolkit/latest/atc/bin/atc --model=$PB_PATH/SEQUENCE_TAGGING.pb --framework=3 \ + --output=$OM_PATH/SEQUENCE_TAGGING --soc_version=Ascend310 \ + 
--input_shape="word_ids:1,128;sequence_lengths:1;char_ids:1,128,64" \ + --out_nodes="dense/BiasAdd:0;transitions:0" \ No newline at end of file diff --git a/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/scripts/run_msame.sh b/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/scripts/run_msame.sh new file mode 100644 index 0000000000000000000000000000000000000000..6642a39a489a05470038d26a81cc8aaacb84bd53 --- /dev/null +++ b/ACL_TensorFlow/contrib/nlp/SEQUENCE_TAGGING_ID2097_for_ACL/scripts/run_msame.sh @@ -0,0 +1,4 @@ +OM_PATH=/root/infer +BIN_PATH=/root/infer/bin_data + +/root/msame/out/./msame --model $OM_PATH/SEQUENCE_TAGGING.om --input $BIN_PATH/word_ids,$BIN_PATH/sequence_lengths,$BIN_PATH/char_ids --output $OM_PATH/ \ No newline at end of file
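For reference, the two scripts above can be tied back to the Python helpers with a small offline decoding sketch. The sketch below is illustrative only: the msame output folder names, the tag-vocabulary path, the `model.data_utils` import path, and the per-sample pairing of logits (`dense/BiasAdd:0`) and transition-matrix (`transitions:0`) bins are assumptions, while the sequence length of 128 and the use of `load_bin`, `load_vocab`, `get_chunks` and `tf.contrib.crf.viterbi_decode` follow the code in this patch.

```python
# Hypothetical offline decoding of msame outputs -- a sketch, not repo code.
# Assumed: msame dumps one float32 .bin per sample for each output node
# (dense/BiasAdd:0 -> logits, transitions:0 -> CRF transition matrix), plus the
# folder and vocab paths below; MAX_SEQ_LEN (128) comes from pb_to_om.sh.
import tensorflow as tf

from model.data_utils import load_bin, load_vocab, get_chunks

MAX_SEQ_LEN = 128                              # --input_shape word_ids:1,128

vocab_tags = load_vocab("data/tags.txt")       # assumed tag-vocab location
ntags = len(vocab_tags)
idx_to_tag = {idx: tag for tag, idx in vocab_tags.items()}

logits_bins = load_bin("output/logits")        # assumed msame output folders
trans_bins = load_bin("output/transitions")

for logits_flat, trans_flat in zip(logits_bins, trans_bins):
    logits = logits_flat.reshape(MAX_SEQ_LEN, ntags)
    trans_params = trans_flat.reshape(ntags, ntags)
    # The true sequence length would come from the matching sequence_lengths
    # bin; the full padded length is used here purely for illustration.
    viterbi_seq, _ = tf.contrib.crf.viterbi_decode(logits, trans_params)
    tags = [idx_to_tag[idx] for idx in viterbi_seq]
    print(tags, get_chunks(viterbi_seq, vocab_tags))
```

In practice the real (unpadded) length from the sequence_lengths bins should be used to truncate the logits before decoding, mirroring what predict_batch does in ner_model.py.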