

# Sui Generis

📃 [Paper](https://arxiv.org/abs/2501.00273)

Repo for "[Echoes in AI: Quantifying Lack of Plot Diversity in LLM Outputs](https://arxiv.org/abs/2501.00273)". Authors: Weijia Xu, Nebojsa Jojic, Sudha Rao, Chris Brockett, Bill Dolan

## Contents

- [Introduction](#introduction)
- [Prerequisite](#prerequisite)
- [Usage](#usage)
- [Limitations](#limitations)
- [Best Practices](#best-practices)
- [License](#license)
- [Citation](#citation)

## Introduction

With rapid advances in large language models (LLMs), there has been an increasing application of LLMs in creative content ideation and generation. A critical question emerges: can current LLMs provide ideas that are diverse enough to truly bolster collective creativity? We examine two state-of-the-art LLMs, GPT-4 and LLaMA-3, on story generation and discover that LLM-generated stories often consist of plot elements that are echoed across a number of generations. To quantify this phenomenon, we introduce the Sui Generis score, an automatic metric that measures the uniqueness of a plot element among alternative storylines generated using the same prompt under an LLM.

Evaluating on 100 short stories, we find that LLM-generated stories often contain combinations of idiosyncratic plot elements echoed frequently across generations and across different LLMs, while plots from the original human-written stories are rarely recreated or even echoed in pieces. Moreover, our human evaluation shows that the ranking of Sui Generis scores among story segments correlates moderately with human judgment of surprise level, even though score computation is completely automatic without relying on human judgment.

## Prerequisite

* Install the Python package by running `pip install .`
* Access to an LLM (e.g., GPT-4, LLaMA-3)

## Usage

* Import the scoring function: `from sgscore.metric import sui_generis_score`
* Call the `sui_generis_score` function with the following parameters:
  * `text_segments` (list of str): list of input text segments.
  * `k_continuations` (int): number of alternative continuations to generate given a text prefix.
  * `invoke_llm` (func): function that returns a response string from an LLM for alternative continuation generation (given a prompt and optional arguments including temperature, top_p, max_tokens).
  * `invoke_llm_entailment` (func): function that returns a response string from an LLM for entailment judgment (given a prompt and optional arguments including temperature, top_p, max_tokens).
  * `text_type` (optional str): type or genre of the text in `text_segments` (e.g., story).
  * `entailment_prompt` (optional str): prompt used for the entailment judgment by the LLM.
  * `theta` (optional float): exponential weight decay coefficient.
  * `key_segment_ids` (optional list of int): compute the Sui Generis scores only on the key segments with the given ids.
* The `sui_generis_score` function returns:
  * `sg_matrices` (list of list of float): echo rates of the i-th segment given the text prefix from 0 to j-1.
  * `sg_scores` (list of float): Sui Generis score for each segment.

## Limitations

Sui Generis was developed for research and experimental purposes. Further testing and validation are needed before considering its application in commercial or real-world scenarios. Sui Generis was designed and tested using the English language. Performance in other languages may vary and should be assessed by someone who is both an expert in the expected outputs and a native speaker of that language. Outputs generated by AI may include factual errors, fabrication, or speculation. Users are responsible for assessing the accuracy of generated content.
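The Usage section's API might be wired up as sketched below. The two `invoke_llm` callables here are illustrative offline stubs standing in for real LLM calls, the story segments are made up, and the tuple unpacking of the return value is an assumption based on the two documented outputs:

```python
# Sketch of preparing inputs for sui_generis_score.
# Both callables accept a prompt plus the optional sampling arguments
# the function may pass through; a real implementation would forward
# them to an LLM API and return the model's text.

def invoke_llm(prompt, temperature=1.0, top_p=1.0, max_tokens=256):
    """Stub continuation generator; replace with a real LLM call."""
    return "A stand-in alternative continuation."

def invoke_llm_entailment(prompt, temperature=0.0, top_p=1.0, max_tokens=8):
    """Stub entailment judge; replace with a real LLM call."""
    return "Yes"

# Illustrative input: the text split into segments to be scored.
text_segments = [
    "A lighthouse keeper finds a sealed bottle on the shore.",
    "Inside is a map drawn in her own handwriting.",
]

# With the package installed (`pip install .`), the call would look like:
# from sgscore.metric import sui_generis_score
# sg_matrices, sg_scores = sui_generis_score(
#     text_segments,
#     k_continuations=10,
#     invoke_llm=invoke_llm,
#     invoke_llm_entailment=invoke_llm_entailment,
#     text_type="story",
# )
# sg_scores[i] would then be the Sui Generis score of text_segments[i].
```

In practice the stubs would wrap an API client, forwarding `temperature`, `top_p`, and `max_tokens`, and the continuation generator should use a nonzero temperature so the k alternative continuations actually differ.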
All decisions leveraging outputs of the system should be made with human oversight and not be based solely on system outputs. The results of the Sui Generis technique inherit any biases, errors, or omissions produced by the model you choose for generating the metric. There has not been a systematic effort to ensure that systems using Sui Generis are protected from security vulnerabilities such as indirect prompt injection attacks. Any systems using it should take proactive measures to harden their systems as appropriate.

## Best Practices

When selecting an entailment judgment model as part of the score computation pipeline, better performance can be achieved by using a model that is at least as capable as GPT-4 on natural language understanding tasks. We strongly encourage users to use LLMs that support robust Responsible AI mitigations, such as Azure OpenAI (AOAI) services. Such services continually update their safety and RAI mitigations to the latest industry standards for responsible use. For more on AOAI's best practices when employing foundation models for scripts and applications:

- [Blog post on responsible AI features in AOAI that were presented at Ignite 2023](https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/announcing-new-ai-safety-amp-responsible-ai-features-in-azure/ba-p/3983686)
- [Overview of Responsible AI practices for Azure OpenAI models](https://learn.microsoft.com/en-us/legal/cognitive-services/openai/overview)
- [Azure OpenAI Transparency Note](https://learn.microsoft.com/en-us/legal/cognitive-services/openai/transparency-note)
- [OpenAI's Usage policies](https://openai.com/policies/usage-policies)
- [Azure OpenAI's Code of Conduct](https://learn.microsoft.com/en-us/legal/cognitive-services/openai/code-of-conduct)

## License

MIT License

Nothing disclosed here, including the Out of Scope Uses section, should be interpreted as or deemed a restriction or modification to the license the code is released under.
## Citation

If you find this repo useful for your research, please consider citing the paper:

```
@misc{xu2024echoes,
  title={Echoes in AI: Quantifying Lack of Plot Diversity in LLM Outputs},
  author={Weijia Xu and Nebojsa Jojic and Sudha Rao and Chris Brockett and Bill Dolan},
  year={2024},
  eprint={2501.00273},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2501.00273},
}
```