# fLSA



📃 [Paper](https://arxiv.org/abs/2410.05481)

Repo for "[fLSA: Learning Semantic Structures in Document Collections Using Foundation Models](https://arxiv.org/abs/2410.05481)". Authors: Weijia Xu, Nebojsa Jojic, Nicolas Le Roux

## Contents

- [Introduction](#introduction)
- [Setup](#prerequisite)
- [Usage](#usage)
- [Citation](#citation)

## Introduction

Humans can learn to solve new tasks by inducing high-level strategies from example solutions to similar problems and then adapting these strategies to solve unseen problems. Can we use large language models to induce such high-level structure from example documents or solutions? We introduce fLSA, a foundation-model-based Latent Semantic Analysis method that iteratively clusters and tags document segments based on document-level contexts. These tags can be used to model the latent structure of given documents and for hierarchical sampling of new texts.

Our experiments on story writing, math, and multi-step reasoning datasets demonstrate that fLSA tags are more informative in reconstructing the original texts than existing tagging methods. Moreover, when used for hierarchical sampling, fLSA tags help expand the output space in the right directions, leading to correct solutions more often than direct sampling and hierarchical sampling with existing tagging methods.

## Prerequisite

* Access to an LLM (e.g., GPT-4, ChatGPT, Qwen2.5, Claude 3.7)

## Usage

* ``code/run_fLSA.py`` is the Python script for running the fLSA algorithm on a set of documents or problem solutions. Before running it, you need to edit the script to:
  * change the hyperparameters (e.g., ``N_tags``)
  * specify the paths to your input data file(s) and output log files
  * define the ``loadData`` function to load the input data
  * define the ``sampleResponse`` function to sample a response text from your choice of LLM

## Limitations

fLSA was developed for research and experimental purposes. Further testing and validation are needed before considering its application in commercial or real-world scenarios.

fLSA was designed and tested using the English language. Performance in other languages may vary and should be assessed by someone who is both an expert in the expected outputs and a native speaker of that language.

fLSA inherits any biases, errors, or omissions produced by the LLM used for inference. Developers are advised to choose an appropriate base LLM carefully, depending on the intended use case.

There has not been a systematic effort to ensure that systems using fLSA are protected from security vulnerabilities such as indirect prompt injection attacks. Any systems using it should take proactive measures to harden themselves as appropriate.

## Best Practices

Better performance can be achieved by running fLSA with an LLM that has strong text-understanding capabilities. We strongly encourage users to use LLMs/MLLMs that support robust Responsible AI mitigations, such as Azure OpenAI (AOAI) services. Such services continually update their safety and RAI mitigations to the latest industry standards for responsible use.
For more on AOAI's best practices when employing foundation models for scripts and applications, see:

- [Blog post on responsible AI features in AOAI that were presented at Ignite 2023](https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/announcing-new-ai-safety-amp-responsible-ai-features-in-azure/ba-p/3983686)
- [Overview of Responsible AI practices for Azure OpenAI models](https://learn.microsoft.com/en-us/legal/cognitive-services/openai/overview)
- [Azure OpenAI Transparency Note](https://learn.microsoft.com/en-us/legal/cognitive-services/openai/transparency-note)
- [OpenAI's Usage policies](https://openai.com/policies/usage-policies)
- [Azure OpenAI's Code of Conduct](https://learn.microsoft.com/en-us/legal/cognitive-services/openai/code-of-conduct)

Users are responsible for sourcing their datasets legally and ethically. This could include securing appropriate copyrights, ensuring consent for the use of audio/images, and/or anonymizing data prior to use in research.

Users are reminded to be mindful of data privacy concerns and are encouraged to review the privacy policies associated with any models and data storage solutions interfacing with fLSA. It is the user's responsibility to ensure that the use of fLSA complies with relevant data protection regulations and organizational guidelines.

## Citation

If you find this repo useful for your research, please consider citing the paper:

```
@misc{xu2024fplsa,
  title={fPLSA: Learning Semantic Structures in Document Collections Using Foundation Models},
  author={Weijia Xu and Nebojsa Jojic and Nicolas Le Roux},
  year={2024},
  eprint={2410.05481},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2410.05481},
}
```
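As an illustration of the ``loadData`` and ``sampleResponse`` hooks described in the Usage section above, here is a minimal sketch. The JSONL input layout with a ``"text"`` field and the ``llm_call`` parameter are assumptions made for this sketch; the actual script in ``code/run_fLSA.py`` fixes the real signatures, so adapt these to your copy.

```python
import json

# Hypothetical sketches of the two hooks that code/run_fLSA.py asks you to
# define; treat them as illustrations, not as the script's actual API.

def loadData(path):
    """Load one document (or solution) per JSONL line with a "text" field.

    The JSONL layout is an assumption for this sketch; adapt the parsing
    to your own data format.
    """
    docs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                docs.append(json.loads(line)["text"])
    return docs


def sampleResponse(prompt, llm_call):
    """Sample one response text for `prompt` from your choice of LLM.

    `llm_call` stands in for whatever client function you use (e.g. an
    Azure OpenAI chat-completion wrapper); injecting it keeps this sketch
    provider-agnostic.
    """
    return llm_call(prompt)
```

Wiring ``sampleResponse`` to a concrete client then amounts to passing in your own wrapper, e.g. ``sampleResponse(prompt, my_aoai_wrapper)``.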