# ditto

**Repository Path**: XZHongAN/ditto

## Basic Information

- **Project Name**: ditto
- **Description**: Entity Matching SOTA
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2022-07-26
- **Last Updated**: 2022-09-22

## Categories & Tags

**Categories**: Uncategorized

**Tags**: 实体匹配 (entity matching), SHOT

## README

# Ditto: Deep Entity Matching with Pre-Trained Language Models

*Update: a new light-weight version based on new versions of Transformers*

Ditto is an entity matching (EM) solution based on pre-trained language models such as BERT. Given a pair of data entries, EM checks whether the two entries refer to the same real-world entity (a product, business, publication, person, etc.). Ditto leverages the language understanding capability of pre-trained language models (LMs) via fine-tuning. It serializes each data entry into a text sequence and casts EM as a sequence-pair classification problem solvable by LM fine-tuning. We also employ a set of novel optimizations, including summarization, injection of domain-specific knowledge, and data augmentation, to further boost the performance of the matching models.

For more technical details, see the [Deep Entity Matching with Pre-Trained Language Models](https://arxiv.org/abs/2004.00584) paper.

## Requirements

* Python 3.7.7
* PyTorch 1.9
* HuggingFace Transformers 4.9.2
* spaCy with the ``en_core_web_lg`` model
* NVIDIA Apex (fp16 training)

Install the required packages:

```
conda install -c conda-forge nvidia-apex
pip install -r requirements.txt
python -m spacy download en_core_web_lg
```

## The EM pipeline

A typical EM pipeline consists of two phases: blocking and matching.

![The EM pipeline of Ditto.](ditto.jpg)

The blocking phase typically consists of simple heuristics that reduce the number of candidate pairs on which pairwise comparisons must be performed.
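
As an illustration of what such a blocking heuristic can look like, here is a minimal sketch that keeps only the record pairs sharing a minimum number of tokens. It is not taken from the Ditto code base: the function name `token_blocking`, the `min_shared` parameter, and the sample records are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(records, min_shared=1):
    """Return candidate index pairs (i, j), i < j, whose records share
    at least `min_shared` distinct lowercase tokens.

    Hypothetical helper for illustration only; not part of Ditto.
    """
    # Inverted index: token -> indices of the records that contain it.
    index = defaultdict(set)
    for i, text in enumerate(records):
        for token in set(text.lower().split()):
            index[token].add(i)

    # Count shared tokens for every pair that co-occurs under some token.
    shared = defaultdict(int)
    for ids in index.values():
        for pair in combinations(sorted(ids), 2):
            shared[pair] += 1

    return sorted(pair for pair, n in shared.items() if n >= min_shared)


# Made-up sample records: three entries give three possible pairs.
records = [
    "microsoft visio standard 2007 version upgrade",
    "microsoft visio standard 2007 upgrade",
    "adobe photoshop elements 6",
]
candidates = token_blocking(records, min_shared=2)
print(candidates)
```

Only the pairs that survive this filter would be passed on to the matching phase. Real blockers are usually tuned for high recall, since a true match discarded here can never be recovered by the matcher.
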
Ditto optimizes the matching phase which performs the actual pairwise comparisons.

The input to Ditto consists of a set of labeled candidate data entry pairs. Each data entry is pre-serialized into the following format:

```
COL title VAL microsoft visio standard 2007 version upgrade COL manufacturer VAL microsoft COL price VAL 129.95
```

where ``COL`` and ``VAL`` are special tokens to indicate the starts of attribute names and attribute values. A complete example pair is of the format

```
\t \t