# UDParse

**Repository Path**: mirrors_Orange-OpenSource/UDParse

## Basic Information

- **Project Name**: UDParse
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MPL-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-07-22
- **Last Updated**: 2026-02-07

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# UDParse

UDParse is a fork of UDPipe-Future, which in turn is a prototype for UDPipe 2.0. The prototype performs tagging and parsing and is written entirely in Python. It participated in the CoNLL 2018 UD Shared Task and was one of the three winners. The original code is available at https://github.com/CoNLL-UD-2018/UDPipe-Future.

UDParse is, like UDPipe-Future, distributed under the [Mozilla Public License Version 2.0](LICENSE).

UDParse has integrated input sentence vectorisation (with BERT, XLM-RoBERTa-large, etc.) in order to improve the quality of the tagged and parsed output. Compared to the CoNLL 2018 Shared Task, UDParse [improves considerably on nearly all treebanks](doc/results.md):

![Comparison XLM-R](doc/conll18_LAS_XLM-R.png)

# Installation

We used Anaconda to set up a virtual environment:

```
conda create -n udparse python=3.8
conda activate udparse
conda install cudatoolkit==11.3.1
conda install cudnn==8.2.1
pip --no-cache-dir install tensorflow-gpu==2.5.0
pip --no-cache-dir install tensorflow_addons==0.13.0
pip --no-cache-dir install transformers==4.6.1
pip --no-cache-dir install sentencepiece==0.1.95
pip --no-cache-dir install cython
pip --no-cache-dir install git+https://github.com/andersjo/dependency_decoding
pip --no-cache-dir install pyyaml
pip --no-cache-dir install matplotlib psutil
pip --no-cache-dir install flask==2.0.1
pip --no-cache-dir install flask-cors==3.0.10
pip --no-cache-dir install flask-restful-swagger-3==0.1
pip --no-cache-dir install flask-restful==0.3.9
pip --no-cache-dir install regex==2021.11.10
pip --no-cache-dir install conllu==4.4.1
pip --no-cache-dir install svgwrite==1.4.2

export LD_LIBRARY_PATH=$HOME/anaconda3/envs/udparse/lib/
```
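Before continuing, a quick sanity check (a minimal sketch, not part of UDParse) can confirm that the pinned libraries import cleanly and that TensorFlow detects the GPU:

```python
# Environment sanity check (sketch, not part of UDParse): verify that the
# pinned libraries import and that TensorFlow sees at least one GPU.
import tensorflow as tf
import transformers

print("TensorFlow:", tf.__version__)              # expected: 2.5.0
print("Transformers:", transformers.__version__)  # expected: 4.6.1
print("GPUs:", tf.config.list_physical_devices("GPU"))
```

If the GPU list is empty, check the `cudatoolkit`/`cudnn` versions and the `LD_LIBRARY_PATH` export above.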
Some transformer models do not exist for TensorFlow. In order to use them, you have to install PyTorch as well:

```
pip --no-cache-dir install torch==1.8.1
```

or, on more recent GPUs:

```
pip3 install torch==1.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
```

In order to be able to process raw text, you need to install the tokenizer from our [UDPipe 1.2 clone](https://github.com/ioan2/udpipe) (replace 3.12 with 3.10 on Ubuntu 22.04):

```
sudo apt install libpython3.12-dev
git clone https://github.com/ioan2/udpipe
pushd udpipe/src
make
popd
pushd udpipe/bindings/python
make PYTHON_INCLUDE=/usr/include/python3.12
popd
```

* In `udpipe/bindings/python/ufal/udpipe.py`, replace `from .ufal_udpipe import *` by `from ufal_udpipe import *`.
* Copy UDPipe's `bindings/python/ufal/*` and `bindings/python/ufal_udpipe.so` to UDParse's `UDParse` directory, or use `PYTHONPATH=...` to specify the location of the UDPipe bindings so that Python can find them.

# Train a model (on [Universal Dependencies](https://universaldependencies.org) data)

First prepare a configuration file `data.yml`:

```
configs:
  fr-bert:
    calculate_embeddings: bert
    dev: /universal-dependencies/ud-treebanks-v2.8/UD_French-GSD/fr_gsd-ud-dev.conllu
    embeddings: /universal-dependencies/models/2.8/udpf/tmp/data/fr-bert
    gpu: 0
    out: /universal-dependencies/models/2.8/udpf/fr-bert
    test: /universal-dependencies/ud-treebanks-v2.8/UD_French-GSD/fr_gsd-ud-test.conllu
    tokmodel: /universal-dependencies/models/2.8/tok/fr_gsd.tok.model
    train:
      - /universal-dependencies/ud-treebanks-v2.8/UD_French-GSD/fr_gsd-ud-train.conllu
```

* `gpu:` indicates the GPU device, starting with 0. If the number given does not correspond to any device, or if it is negative, training will be done on the CPU.
* `calculate_embeddings` indicates the transformer model to use. The following values are recognized:
  * `bert`: bert-base-multilingual-cased
  * `distilbert`: distilbert-base-multilingual-cased
  * `itBERT`: dbmdz/bert-base-italian-xxl-cased
  * `arBERT`: asafaya/bert-base-arabic
  * `fiBERT`: TurkuNLP/bert-base-finnish-cased-v1
  * `slavicBERT`: DeepPavlov/bert-base-bg-cs-pl-ru-cased (needs PyTorch)
  * `plBERT`: dkleczek/bert-base-polish-uncased-v1 (needs PyTorch)
  * `svBERT`: KB/bert-base-swedish-cased
  * `nlBERT`: wietsedv/bert-base-dutch-cased
  * `flaubert`: flaubert/flaubert_base_cased (needs PyTorch)
  * `camembert`: camembert-base
  * `roberta`: roberta-large
  * `xml-roberta`: jplu/tf-xlm-roberta-large (multilingual)

The tokenizer model (`tokmodel`) must be created using UDPipe 1.2:

```
udpipe --train \
    --tagger=none \
    --parser=none \
    --heldout /universal-dependencies/ud-treebanks-v2.8/UD_French-GSD/fr_gsd-ud-dev.conllu \
    /universal-dependencies/models/2.8/tok/fr_gsd.tok.model \
    /universal-dependencies/ud-treebanks-v2.8/UD_French-GSD/fr_gsd-ud-train.conllu
```

Run the training and the test:

```
./run.py --action train --yml data.yml fr-bert
./run.py --action test --yml data.yml fr-bert
```

or in one step:

```
./run.py --action tt --yml data.yml fr-bert
```

The training process puts some debugging output into the given `out` directory. To copy only the files which are actually needed, run:

```
./bin/clonedata.py data.yml new_directory
```
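To verify that the UDPipe 1.2 bindings built above and the trained tokenizer model work together, a short check along these lines can help (a sketch using the standard `ufal.udpipe` API; adjust the paths to your setup):

```python
# Sketch: load the tokenizer model trained above through the UDPipe 1.2
# Python bindings and tokenise one sentence to CoNLL-U.
import sys
sys.path.insert(0, "/path/to/udpipe/bindings/python")  # where the bindings were built

from ufal.udpipe import Model, Pipeline, ProcessingError

model = Model.load("/universal-dependencies/models/2.8/tok/fr_gsd.tok.model")
assert model is not None, "tokenizer model failed to load"

# "none" disables tagging and parsing; the output format is CoNLL-U.
pipeline = Pipeline(model, "tokenize", "none", "none", "conllu")
error = ProcessingError()
print(pipeline.process("Ceci est une phrase.", error))
```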
# Use a model

## Predict a file

Tokenise, tag and parse a raw text (if your input file is already in CoNLL-U format, omit the `--istext` option):

```
./run.py --action predict \
    --yml data.yml \
    --infile inputtext.txt \
    --istext \
    --outfile output.conllu
```

Use the option `--presegmented` if the input file contains one sentence per line. Without this option, the input text will be segmented into sentences before tokenisation.

For tokenised (CoNLL-U) files, run the following:

```
./run.py --action predict \
    --yml data.yml \
    --infile inputtext.conllu \
    --outfile output.conllu
```

For a quick check whether the model works, you can use:

```
./run.py --action predict \
    --yml data.yml \
    --example "A sentence to be tokenized, tagged and parsed"
```

which prints the output (in CoNLL-U format) to stdout.

## Server launching

You can launch a server in the following way:

```
./run.py --action server \
    --yml data.yml --port 8844 \
    --forcegpu \
    --gpu 1
```

The Swagger API specification can be obtained at `http://localhost:<PORT>/api/v1/doc`.

## Requesting the server

The server can be queried with:

```
curl -X POST "http://localhost:8844/api/v1/parse" \
    -H "accept: text/tab-separated-values" \
    -H "Content-Type: multipart/form-data" \
    -F "text=my sentence will be parsed" \
    -F "presegmented=false"
```

or with Python's `requests` package:

```python
import requests

r = requests.post("http://localhost:8844/api/v1/parse",
                  data={"text": "my sentence will be parsed",
                        "presegmented": False,
                        # put tokenised input (CoNLL-U) here instead of in "text";
                        # if "conllu" is not empty, "text" will be ignored
                        "conllu": "",
                        "parse": True},
                  # set this header to get pure CoNLL-U, or omit it to get JSON
                  headers={"Accept": "text/tab-separated-values"})
print(r.text)
```

## Swagger API documentation

Use `http://localhost:8844/api/v1/doc`.

## Use programmatically (in Python)

Load a parser instance:

```python
import sys
import time

import UDParse

udp = UDParse.UDParse(lg=None,
                      action=UDParse.Action.SERVER,
                      yml="data.yml",
                      gpu=0,
                      forcegpu=True,
                      usepytorch=True,
                      forceoutdir=False)

# wait until the model has been loaded
while not udp.submitresult.done():
    print("Still loading model ...", file=sys.stderr)
    time.sleep(2)
```

and use it:

```python
# raw text input
output = udp.api_process(in_text="my input sentence", is_text=True)

# tokenised input: pass a CoNLL-U string instead of raw text
output = udp.api_process(in_text=conllu_string, is_text=False)
```

## Use it as an external dependency library

For the CPU version:

```bash
pip install . --find-links=https://download.pytorch.org/whl/torch_stable.html
```

For the GPU version:

```bash
pip install .[gpu] --find-links=https://download.pytorch.org/whl/cu113/torch_stable.html
```
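After installing, a minimal smoke test (a sketch; the import follows the programmatic example above) confirms that the package and its main dependencies resolve:

```python
# Post-install smoke test (sketch): check that the package and its heavy
# dependencies import without errors.
import tensorflow as tf
import UDParse

print("UDParse imported from:", UDParse.__file__)
print("TensorFlow:", tf.__version__)
```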