# BabelNet-Sememe-Prediction

**Repository Path**: thunlp/BabelNet-Sememe-Prediction

## Basic Information

- **Project Name**: BabelNet-Sememe-Prediction
- **Description**: Code and data of the AAAI-20 paper "Towards Building a Multilingual Sememe Knowledge Base: Predicting Sememes for BabelNet Synsets"
- **Primary Language**: Python
- **License**: MIT
- **Default Branch**: master

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-05-29
- **Last Updated**: 2020-12-19

## README

Code and data of the AAAI-20 paper "**Towards Building a Multilingual Sememe Knowledge Base: Predicting Sememes for BabelNet Synsets**" [[pdf]](https://arxiv.org/pdf/1912.01795.pdf)

## Requirements

- `tensorflow-gpu` >= 1.13.0
- Python 3.x

## Data

This repo contains two types of data: the annotated BabelSememe dataset and the experimental datasets.

#### Annotated BabelSememe Dataset

- *BabelSememe* Dataset `./BabelSememe/synset_sememes.txt`

#### Experimental Dataset

- Dataset of all POS tags (Noun, Verb, Adj, Adv)
  
  `./data-all/entitiy2id.txt`: All entities and corresponding IDs, one per line.

  `./data-all/relation2id.txt`: All relations and corresponding IDs, one per line.

  `./data-all/train2id.txt`: Training set. Each line has the format ***(e1, e2, rel)***, indicating that relation ***rel*** holds between ***e1*** and ***e2***. Entity and relation IDs come from `entitiy2id.txt` and `relation2id.txt`.

  `./data-all/valid2id.txt`: Validation set, in the same format as the training set.

  `./data-all/test2id.txt`: Test set, in the same format as the training set.

- Dataset of Nouns
  
  The noun dataset uses the same format as the all-POS dataset.

  `./data-noun/entitiy2id.txt`

  `./data-noun/relation2id.txt`

  `./data-noun/train2id.txt`

  `./data-noun/valid2id.txt`

  `./data-noun/test2id.txt`

- Synset embeddings from [NASARI](http://lcl.uniroma1.it/nasari/)

  `./SPBS-SR/synset_vec.txt`
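
As a rough illustration of the file formats described above, the following sketch reads an ID mapping file and a triple file. It assumes whitespace-separated fields and that any line with fewer fields than expected (for example a leading count line) can be skipped; neither detail is stated in this README, so check the actual files first. The function names `load_id_map` and `load_triples` are hypothetical and not part of this repo.

```python
from pathlib import Path


def load_id_map(path):
    """Read a name-to-ID file such as relation2id.txt into a dict.

    Assumption: each data line is "<name> <id>" separated by whitespace,
    and a line with a single field is a count header that can be skipped.
    """
    mapping = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        fields = line.split()
        if len(fields) < 2:  # blank line or assumed count header
            continue
        # Take the last field as the ID; everything before it is the name.
        mapping[" ".join(fields[:-1])] = int(fields[-1])
    return mapping


def load_triples(path):
    """Read a triple file such as train2id.txt into (e1, e2, rel) tuples.

    Assumption: each data line holds three whitespace-separated integer IDs
    in the order e1, e2, rel, matching the format described above.
    """
    triples = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        fields = line.split()
        if len(fields) != 3:  # blank line or assumed count header
            continue
        e1, e2, rel = (int(f) for f in fields)
        triples.append((e1, e2, rel))
    return triples
```

For example, `load_triples("./data-all/train2id.txt")` would return the training triples as a list of integer tuples, provided the file matches the assumptions above.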

## Models

#### SPBS-SR

##### Usage

Command for training and testing the model:

```bash
python ./SPBS-SR/EvalSememePre_SPWE.py 1
```

#### SPBS-RR

##### Usage

Command for training and testing the model:

```bash
bash ./SPBS-RR/src/train.sh
```

Note: Test results are recorded in the training log.

#### Ensemble

##### Usage

After training the two models above, copy their output files `./SPBS-RR/sememePre_TransE.txt` and `./SPBS-SR/sememePre_SPWE.txt` to the `./Ensemble/` directory, then run the Ensemble model with the following command:

```bash
python ./Ensemble/Ensemble.py
```
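
The copy-then-run steps above can also be scripted. The sketch below is a convenience wrapper and not part of this repo: the function name `stage_and_run_ensemble` and its `repo_root` and `run` parameters are hypothetical, and it assumes the command is meant to be run from the repository root with the `./Ensemble/` directory already present.

```python
import shutil
import subprocess
import sys
from pathlib import Path

# Output files named in the usage instructions above.
OUTPUT_FILES = (
    Path("SPBS-RR") / "sememePre_TransE.txt",
    Path("SPBS-SR") / "sememePre_SPWE.txt",
)


def stage_and_run_ensemble(repo_root=".", run=True):
    """Copy both prediction files into ./Ensemble and optionally run Ensemble.py."""
    root = Path(repo_root)
    ensemble_dir = root / "Ensemble"
    staged = []
    for rel_path in OUTPUT_FILES:
        src = root / rel_path
        if not src.is_file():
            raise FileNotFoundError(f"{src} not found; train the model first")
        # Keep the original file name inside the Ensemble directory.
        staged.append(Path(shutil.copy(src, ensemble_dir / src.name)))
    if run:
        # Same command as above, executed from the repository root.
        subprocess.run(
            [sys.executable, "./Ensemble/Ensemble.py"], cwd=root, check=True
        )
    return staged
```

Calling `stage_and_run_ensemble(run=False)` only stages the files, which can be useful for checking that both models have produced their outputs before running the ensemble.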

## Cite

If you use any of the code or data, please cite this paper:

```bibtex
@article{qi2019towards,
  title={Towards Building a Multilingual Sememe Knowledge Base: Predicting Sememes for BabelNet Synsets},
  author={Qi, Fanchao and Chang, Liang and Sun, Maosong and Ouyang, Sicong and Liu, Zhiyuan},
  journal={arXiv preprint arXiv:1912.01795},
  year={2019}
}
```