# DistFactAssessLM

**Repository Path**: mirrors_Orange-OpenSource/DistFactAssessLM

## Basic Information

- **Project Name**: DistFactAssessLM
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: GPL-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-02-16
- **Last Updated**: 2026-02-07

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# Factual Knowledge Assessment of Language Models Using Distractors

![image](https://github.com/user-attachments/assets/785cf87f-5608-4a2c-94d0-1dd18f162be1)

[Paper](https://aclanthology.org/2025.coling-main.537/)

Language models encode extensive factual knowledge within their parameters. The accurate assessment of this knowledge is crucial for understanding and improving these models. In the literature, factual knowledge assessment often relies on cloze sentences (e.g., "The capital of France ____"), which can lead to erroneous conclusions due to the complexity of natural language (out-of-subject continuations, the existence of many correct answers, and the several ways of expressing them). We introduce a new interpretable knowledge assessment method that mitigates these issues by leveraging distractors: incorrect but plausible alternatives to the correct answer. Our method is evaluated against existing approaches, demonstrating solid alignment with human judgment and stronger robustness to verbalization artifacts.

This repository aims to enable the reproduction of the results of our [paper](https://aclanthology.org/2025.coling-main.537/).

### Setup environment

This section describes all the steps required to set up the environment needed to launch the experiments described in our paper.

#### Software

Make sure you have the following software installed:

- OS: Ubuntu 22.04 -- *may work on Windows (not tested)*
- conda (version used: 24.9.2)
- MongoDB (version used: 7.0.3) -- *Don't forget to start it*

#### Environment variables

- STORAGE_FOLDER: Set this variable to the path of the folder where intermediate files will be stored.
- MONGO_URL: If MongoDB does not run locally, specify its URL in this variable. Otherwise, leave it unset.

Plan for **150GB** of disk storage in this folder and **150GB** for MongoDB.
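The following snippet is a minimal sanity check, not part of the repository scripts: it assumes the `pymongo` client is installed and simply verifies that `STORAGE_FOLDER` points to an existing folder and that MongoDB is reachable before launching the longer steps below.

```python
import os
from pymongo import MongoClient

# Hypothetical sanity check (not part of the repository).
storage = os.environ.get("STORAGE_FOLDER")
assert storage is not None and os.path.isdir(storage), \
    "STORAGE_FOLDER must point to an existing folder"

# Fall back to a local MongoDB instance when MONGO_URL is not set.
mongo_url = os.environ.get("MONGO_URL", "mongodb://localhost:27017")
MongoClient(mongo_url, serverSelectionTimeoutMS=5000).admin.command("ping")

print("STORAGE_FOLDER=%s, MongoDB reachable at %s" % (storage, mongo_url))
```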
#### Install Virtual Environment

Create and activate the conda environment *wfd_build*:

```
bash setup_env/wfd_build.sh
conda activate wfd_build
```

#### Collect data for experiments

Launch all the following commands from the root of the project.

1. List the Wikidata dumps that are available for download:
   ```
   PYTHONPATH=src python list_available_wikidata_dumps.py
   ```
2. Choose one date (format: YYYYMMDD) and pass it as an argument to this script:
   ```
   PYTHONPATH=src python download_wikidata.py --date DATE
   ```
   This will first download the Wikidata dump, push it to MongoDB, and then preprocess it. *This will take a while (>10h)*.

**Reproducibility note:** If you want to reproduce the results of our paper, choose the date **20210104**.

#### Initialize distractor retrievers

Run this command:

```
PYTHONPATH=src python init_distractor_retrievers.py --date DATE
```

where DATE is the same date used in the previous step.

Congratulations! You are ready to proceed to the experiments.

## Experiments

### Sampling facts

1. Sample random facts (noted $S$ in the paper) using the following command:
   ```
   PYTHONPATH=src python sample_facts.py --type random --date DATE
   ```
2. Sample facts that have at least one temporal distractor (noted $S'$ in the paper) using the following command:
   ```
   PYTHONPATH=src python sample_facts.py --type tempdist --date DATE
   ```

**Reproducibility note:** The exact fact samples used in our experiments are provided in the folder `reproducibility`. To use them, simply copy the two files inside this folder to $STORAGE_FOLDER.

### Compare distractor retrieval strategies

To compare the different distractor retrieval strategies proposed in the paper, launch the following commands:

```
PYTHONPATH=src python run_experiment.py --experiment compare_retrieval_strategies --date DATE
PYTHONPATH=src python run_experiment.py --experiment compare_retrieval_strategies_dist_temp --date DATE
```

The result analysis can be generated by executing the Jupyter notebook `scripts/general_eval_know_measure/compare_retrieval_strategies_analysis.ipynb`. To choose which results to analyze, set the variable TEMPORAL_DISTRACTOR_RESULT_ANALYSIS at the beginning of the notebook. If it is set to `True`, the results on the set $S'$ are analyzed; otherwise, $S$ is analyzed (see the paper for more information).

### Compare our knowledge measure with other baselines

The other studied baselines are: [KaRR](https://arxiv.org/pdf/2305.10519) (our implementation), Probability, BERT-score, ROUGE-L, and [LLM-as-a-judge](https://arxiv.org/pdf/2308.10168).

First, launch the following command:

```
PYTHONPATH=src python run_experiment.py --experiment compare_kms --date DATE
```

Then analyze the results in the notebook `scripts/optimize_know_evals/optimize.ipynb`, which reports the alignment of each knowledge measure with human judgment and its robustness to verbalization artifacts.

**Note:** The annotation dataset of verbalization errors is in `scripts/taxonomy/taxonomy_hichem.csv` and its statistical analysis in `scripts/taxonomy/analyze.ipynb`.

## How to measure factual knowledge within language models with our method?

Here is an example of how to measure the knowledge of the fact **"The capital of France is Paris"** by **GPT2**:

```python
import numpy as np

from kb.core import Entity, Relation, Triple, TripleComp
from kb.wikidata import TempWikidata, WikidataPrepStage
from know.core import DistKnowMeasure, StrictMetric
from know.distractor_find import ApproxIdealDistractorFinder
from lm.core import LanguageModel, LogProbability
from verb.core import VerbalizeConfig
from verb.wikifactdiff import WikiFactDiffVerbalizer

wd = TempWikidata('20210104', WikidataPrepStage.PREPROCESSED)
lm = LanguageModel.from_pretrained_name("gpt2", 'cuda')
strict_agg = StrictMetric([20])
dist_finder = ApproxIdealDistractorFinder(wd, lm)
assert dist_finder.built(), "Distractor finder %s must be built first before executing this script!" % dist_finder
dist_finder.load()
cred_func = LogProbability()
know_strict = DistKnowMeasure(strict_agg, dist_finder, cred_func, compute_cred_on_object=True)

# In this example we are testing the knowledge of GPT2 on the fact (France, capital, Paris)
subject, relation, object = Entity('Q142'), Relation('P36'), Entity('Q90')
fact = Triple(subject, relation, object)

# Inject label information in the triple
wd.inject_info(fact)

verbalizer = WikiFactDiffVerbalizer()
config = VerbalizeConfig(
    max_num_verbs=1,
    verb_tense=None,
    ends_with=TripleComp.OBJECT
)
temp = verbalizer.verbalize(fact, config, skip_failed_conjugation=True)
measure = know_strict.measure_temp(lm, temp)
print('The score of GPT2 on the fact %s is: %.2f (value between 0 and 1)' % (fact, measure.result[0]))
```

**Note:** Modify this script to use our method on other facts and/or other language models.
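As a minimal sketch of such a modification, the snippet below reuses the objects created in the example above and only swaps in a different fact; Q183, P36, and Q64 are the Wikidata identifiers for Germany, capital, and Berlin. Measuring a different language model would additionally require passing another model name to `LanguageModel.from_pretrained_name`.

```python
# Reuses wd, lm, know_strict, verbalizer, and config from the example above.
# Q183 = Germany, P36 = capital, Q64 = Berlin (Wikidata identifiers).
other_fact = Triple(Entity('Q183'), Relation('P36'), Entity('Q64'))
wd.inject_info(other_fact)  # fetch the labels needed for verbalization
other_temp = verbalizer.verbalize(other_fact, config, skip_failed_conjugation=True)
other_measure = know_strict.measure_temp(lm, other_temp)
print('The score of GPT2 on the fact %s is: %.2f' % (other_fact, other_measure.result[0]))
```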
## Having an issue with our code?

If you have a problem running our code, please let us know by opening an issue ;)

## How to cite our work?

```
@inproceedings{ammar-khodja-etal-2025-factual,
    title = "Factual Knowledge Assessment of Language Models Using Distractors",
    author = "Ammar Khodja, Hichem and
      Ait gueni ssaid, Abderrahmane and
      Bechet, Frederic and
      Brabant, Quentin and
      Nasr, Alexis and
      Lecorv{\'e}, Gw{\'e}nol{\'e}",
    editor = "Rambow, Owen and
      Wanner, Leo and
      Apidianaki, Marianna and
      Al-Khalifa, Hend and
      Eugenio, Barbara Di and
      Schockaert, Steven",
    booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
    month = jan,
    year = "2025",
    address = "Abu Dhabi, UAE",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.coling-main.537/",
    pages = "8043--8056",
    abstract = "Language models encode extensive factual knowledge within their parameters. The accurate assessment of this knowledge is crucial for understanding and improving these models. In the literature, factual knowledge assessment often relies on cloze sentences, which can lead to erroneous conclusions due to the complexity of natural language (out-of-subject continuations, the existence of many correct answers and the several ways of expressing them). In this paper, we introduce a new interpretable knowledge assessment method that mitigates these issues by leveraging distractors{---}incorrect but plausible alternatives to the correct answer. We propose several strategies for retrieving distractors and determine the most effective one through experimentation. Our method is evaluated against existing approaches, demonstrating solid alignment with human judgment and stronger robustness to verbalization artifacts. The code and data to reproduce our experiments are available on GitHub."
}
```

## License

Look for the LICENSE.txt file at the root of this project.