# deep-siamese-text-similarity

**Repository Path**: deeplearningrepos/deep-siamese-text-similarity

## Basic Information

- **Project Name**: deep-siamese-text-similarity
- **Description**: TensorFlow-based implementation of a deep siamese LSTM network to capture phrase/sentence similarity using character/word embeddings
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-03-30
- **Last Updated**: 2021-08-31

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

**This project is a prototype for experimental purposes only; production-grade code is not released here.**

# Deep LSTM siamese network for text similarity

This is a TensorFlow-based implementation of a deep siamese LSTM network that captures phrase/sentence similarity using character/word embeddings. The code provides architectures for learning two kinds of tasks:

- Phrase similarity using character-level embeddings [1]

  ![siamese lstm phrase similarity](https://cloud.githubusercontent.com/assets/9861437/20479454/405a1aea-b004-11e6-8a27-7bb05cf0a002.png)

- Sentence similarity using word-level embeddings [2]

  ![siamese lstm sentence similarity](https://cloud.githubusercontent.com/assets/9861437/20479493/6ea8ad12-b004-11e6-89e4-53d4d354d32e.png)

For both tasks, the model uses a multilayer siamese LSTM network and a Euclidean-distance-based contrastive loss to learn input-pair similarity.
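As a rough illustration, here is a minimal sketch of such a Euclidean-distance contrastive loss, written in TensorFlow 1.x style to match the environment listed below. The function and tensor names are hypothetical and not taken from the repository's code:

```python
import tensorflow as tf

def contrastive_loss(out1, out2, y, margin=1.0):
    """Euclidean-distance contrastive loss for a siamese pair (sketch).

    out1, out2: outputs of the two weight-sharing LSTM branches,
    shape [batch, hidden]; y: 1.0 for similar pairs, 0.0 for
    dissimilar ones; margin: how far apart dissimilar pairs are pushed.
    """
    # Euclidean distance between the two branch outputs (the epsilon
    # keeps the sqrt gradient finite when the distance is exactly zero).
    d = tf.sqrt(tf.reduce_sum(tf.square(out1 - out2), axis=1) + 1e-6)
    similar = y * tf.square(d)                                        # pull similar pairs together
    dissimilar = (1.0 - y) * tf.square(tf.maximum(margin - d, 0.0))   # push dissimilar pairs apart
    return tf.reduce_mean(similar + dissimilar) / 2.0
```

The first term pulls similar pairs (y = 1) together, while the margin term pushes dissimilar pairs (y = 0) apart until their distance exceeds `margin`.

# Capabilities

Given adequate training pairs, this model can learn semantic as well as structural similarity. For example:

**Phrases:**

- International Business Machines = I.B.M
- Synergy Telecom = SynTel
- Beam inc = Beam Incorporate
- Sir J J Smith = Johnson Smith
- Alex, Julia = J Alex
- James B. D. Joshi = James Joshi
- James Beaty, Jr. = Beaty

For phrases, the model learns **character-based embeddings** to identify structural/syntactic similarities.

**Sentences:**

- He is smart = He is a wise man.
- Someone is travelling countryside = He is travelling to a village.
- She is cooking a dessert = Pudding is being cooked.
- Microsoft to acquire Linkedin ≠ Linkedin to acquire Microsoft

(For more examples, see the SemEval dataset.)

For sentences, the model uses **pre-trained word embeddings** to identify semantic similarities.

Categories of pairs it can learn as similar:

- Annotations
- Abbreviations
- Extra words
- Similar semantics
- Typos
- Compositions
- Summaries

# Training Data

- **Phrases:**
  - A sample set of person-name paraphrases is attached to this repository. To generate the full person-name disambiguation data, follow the steps described at:
    > https://github.com/dhwajraj/dataset-person-name-disambiguation

    "person_match.train": https://drive.google.com/open?id=1HnMv7ulfh8yuq9yIrt_IComGEpDrNyo-
- **Sentences:**
  - A sample set for learning sentence semantic similarity can be downloaded from:

    "train_snli.txt": https://drive.google.com/open?id=1itu7IreU_SyUSdmTWydniGxW-JEGTjrv

    This data is generated using the SNLI project:
    > https://nlp.stanford.edu/projects/snli/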
- **Word embeddings:** any set of pre-trained word embeddings can be used with this project. For our testing we used the fastText simple English embeddings from https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md. An alternate download location for "wiki.simple.vec" is https://drive.google.com/open?id=1u79f3d2PkmePzyKgubkbxOjeaZCJgCrt (a loading sketch appears at the end of this README).

# Environment

- numpy 1.11.0
- tensorflow 1.2.1
- gensim 1.0.1
- nltk 3.2.2

# How to run

### Training

```
$ python train.py [options/defaults]

options:
  -h, --help            show this help message and exit
  --is_char_based IS_CHAR_BASED
                        is character-based syntactic similarity to be used
                        for phrases; if False, word-embedding-based semantic
                        similarity is used (default: True)
  --word2vec_model WORD2VEC_MODEL
                        used only if IS_CHAR_BASED is False; word2vec
                        pre-trained embeddings file (default: wiki.simple.vec)
  --word2vec_format WORD2VEC_FORMAT
                        used only if IS_CHAR_BASED is False; word2vec
                        pre-trained embeddings file format (bin/text/textgz)
                        (default: text)
  --embedding_dim EMBEDDING_DIM
                        Dimensionality of character embedding (default: 100)
  --dropout_keep_prob DROPOUT_KEEP_PROB
                        Dropout keep probability (default: 0.5)
  --l2_reg_lambda L2_REG_LAMBDA
                        L2 regularization lambda (default: 0.0)
  --max_document_words MAX_DOCUMENT_WORDS
                        Max length (left-to-right max words to consider) in
                        every doc, else pad 0 (default: 100)
  --training_files TRAINING_FILES
                        Comma-separated list of training files (each file in
                        tab-separated format) (default: None)
  --hidden_units HIDDEN_UNITS
                        Number of hidden units (default: 50)
  --batch_size BATCH_SIZE
                        Batch Size (default: 128)
  --num_epochs NUM_EPOCHS
                        Number of training epochs (default: 200)
  --evaluate_every EVALUATE_EVERY
                        Evaluate model on dev set after this many steps
                        (default: 2000)
  --checkpoint_every CHECKPOINT_EVERY
                        Save model after this many steps (default: 2000)
  --allow_soft_placement [ALLOW_SOFT_PLACEMENT]
                        Allow soft device placement
  --noallow_soft_placement
  --log_device_placement [LOG_DEVICE_PLACEMENT]
                        Log placement of ops on devices
  --nolog_device_placement
```

### Evaluation

```
$ python eval.py --model graph#.pb
```

# Performance

**Phrases:**

- Training time (8-core CPU): 1 complete epoch in 6 min 48 s (training requires at least 30 epochs)
- Contrastive loss: 0.0248
- Evaluation performance: similarity measure for 100,000 pairs (8-core CPU) in 1 min 40 s
- Accuracy: 91%

**Sentences:**

- Training time (8-core CPU): 1 complete epoch in 8 min 10 s (training requires at least 50 epochs)
- Contrastive loss: 0.0477
- Evaluation performance: similarity measure for 100,000 pairs (8-core CPU) in 2 min 10 s
- Accuracy: 81%

# References

1. [Learning Text Similarity with Siamese Recurrent Networks](http://www.aclweb.org/anthology/W16-16#page=162)
2. [Siamese Recurrent Architectures for Learning Sentence Similarity](http://www.mit.edu/~jonasm/info/MuellerThyagarajan_AAAI16.pdf)
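# Appendix: loading pre-trained word embeddings

For reference, here is a minimal sketch of how the "wiki.simple.vec" vectors mentioned under Training Data might be loaded with gensim and turned into an embedding matrix for the word-level model. The `build_embedding_matrix` helper and the `vocab` mapping are hypothetical illustrations, not part of the repository:

```python
import numpy as np
from gensim.models import KeyedVectors

# Load the pre-trained fastText vectors (plain-text word2vec format).
vectors = KeyedVectors.load_word2vec_format("wiki.simple.vec", binary=False)

def build_embedding_matrix(vocab, dim=300):
    """Build a [vocab_size, dim] matrix; vocab maps word -> row index.

    dim=300 matches the published fastText vectors; words missing from
    the pre-trained file keep a small random initialization.
    """
    matrix = np.random.uniform(-0.25, 0.25, (len(vocab), dim)).astype(np.float32)
    for word, idx in vocab.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix
```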