# synthetic-questions-generation

**Repository Path**: data_factory/synthetic-questions-generation

## Basic Information

- **Project Name**: synthetic-questions-generation
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-08-20
- **Last Updated**: 2025-08-30

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Synthetic Questions Generation

Generate diverse, engaging questions from text using multiple LLM providers (OpenAI-compatible, Anthropic, Gemini, OpenRouter, Groq, Together, Cerebras, Qwen/DeepInfra, Kimi, Z.ai, Ollama, Chutes, Hugging Face).

To diversify outputs, the generator randomly selects a writing style for each item (e.g., formal and academic; casual and conversational; funny and humorous; thought-provoking and philosophical; practical and application-focused; analytical and critical; creative and imaginative; simple and straightforward; detailed and comprehensive; or concise and direct).

## Quick start

```bash
# 1) Create a venv (optional)
python3 -m venv .venv && source .venv/bin/activate

# 2) Install dependencies
pip install -r requirements.txt

# 3) Set an API key for your chosen provider (example: OpenRouter)
export OPENROUTER_API_KEY=your_api_key_here

# 4) Run
python3 src/main.py mkurman/hindawi-journals-2007-2023 \
  --provider openrouter \
  --model qwen/qwen3-235b-a22b-2507 \
  --output-dir ./data/questions_openrouter \
  --start-index 0 \
  --end-index 10 \
  --num-questions 5 \
  --text-column text \
  --verbose
```

See `example.sh` for a ready-to-run snippet.

## New Features

### YAML Configuration Support

You can use YAML configuration files for easier management:

```bash
# Using YAML configuration
python3 src/main.py --config configs/example.yaml

# Override specific settings
python3 src/main.py --config configs/example.yaml --provider anthropic --model claude-3-sonnet
```

### Custom System Prompts

Customize the system prompts used for question and answer generation:

```bash
# Using custom prompts
python3 src/main.py --config configs/example.yaml --custom-prompts ./my_prompts
```

### Multiple-Choice Questions

Generate multiple-choice questions with options A, B, C, D, E:

```bash
# Generate multiple-choice questions
python3 src/main.py --config configs/example.yaml --with-options
```

See [CONFIGURATION.md](CONFIGURATION.md) for detailed documentation on these features.

## Requirements

- Python 3.9+ (uses modern typing like `list[str]`)
- Install Python packages: `pip install -r requirements.txt`

Contents of `requirements.txt`:

- aiohttp
- datasets
- tqdm
- PyYAML

## Usage

The tool accepts either a Hugging Face dataset name (e.g., `mkurman/hindawi-journals-2007-2023`) or a path to a local `.jsonl`, `.json`, or `.parquet` file. It reads a text field (default `text`), asks an LLM to generate N questions per text, and writes each question as a separate JSONL record.

Basic CLI:

```bash
python3 src/main.py <dataset_or_path> \
  --provider <provider> \
  --model <model> \
  --output-dir <output_dir>
```

Key options:

- `--text-column TEXT`: Column containing text to prompt from (default: `text`)
- `--num-questions INT`: Questions per text (default: 3)
- `--max-tokens INT`: Max tokens per response (default: 4096)
- `--provider-url URL`: Base URL for the `other` provider (required when using `--provider other`)
- `--num-workers INT`: Concurrency (default: 1)
- `--shuffle`: Shuffle dataset items
- `--max-items INT`: Limit number of items
- `--start-index INT`: Start index (0-based)
- `--end-index INT`: End index (exclusive, 0-based)
- `--dataset-split SPLIT`: HF split for remote datasets (default: `train`)
- `--sleep-between-requests S`: Rate-limit delay between API calls
- `--sleep-between-items S`: Rate-limit delay between items
- `--style STYLE`: Optional. Single style or comma-separated list; one is chosen randomly per item
- `--no-style`: Generate questions without any style instructions (neutral tone)
- `--styles-file FILE`: Load styles from a file (one per line, `#` for comments)
- `--with-answer`: Generate answers for each question using the model
- `--answer-provider PROVIDER`: API provider to use for answering questions (if not set, uses the same provider as `--provider`)
- `--answer-model MODEL`: Model to use for answering questions (if not set, uses the same model as `--model`)
- `--answer-single-request`: Answer all questions in a single request instead of one question per request
- `--verbose`: Verbose logging
- `--debug`: Debug logging

Supported providers for `--provider`:

featherless, openai, anthropic, qwen, qwen-deepinfra, kimi, z.ai, openrouter, cerebras, together, groq, gemini, ollama, chutes, huggingface, other

The `other` provider allows you to use any OpenAI-compatible API endpoint by specifying `--provider-url`.

### Styles

You can control question styles in several ways:

1. **Default behavior** (no style flags): Randomly selects from 35+ built-in styles per item (see `default_styles.txt`)
2. **Custom styles** (`--style`): Single style or comma-separated list; one chosen randomly per item
3. **No styling** (`--no-style`): Generates neutral, straightforward questions without style instructions
4. **Styles from file** (`--styles-file`): Load styles from a text file (one per line, `#` for comments)

**Note**: Only one style option can be used at a time.

Built-in default styles include academic, creative, informal, analytical, practical, philosophical, and more. The complete list is in `default_styles.txt` and includes styles like:

- formal and academic, professional and business-focused
- creative and imaginative, artistic and expressive, humorous and entertaining
- casual and conversational, friendly and approachable, informal and relaxed
- analytical and critical thinking, investigative and probing
- practical and application-focused, hands-on and actionable
- thought-provoking and philosophical, reflective and contemplative
- simple and straightforward, clear and concise
- detailed and comprehensive, thorough and exhaustive

Examples:

```bash
# Single custom style
python3 src/main.py \
  --provider openrouter \
  --model qwen/qwen3-235b-a22b-2507 \
  --output-dir ./data/out \
  --num-questions 5 \
  --style "formal and academic"

# Multiple custom styles (random per item)
python3 src/main.py \
  --provider openrouter \
  --model qwen/qwen3-235b-a22b-2507 \
  --output-dir ./data/out \
  --num-questions 5 \
  --style "casual and conversational,funny and humorous,concise and direct"

# No styling (neutral questions)
python3 src/main.py \
  --provider openrouter \
  --model qwen/qwen3-235b-a22b-2507 \
  --output-dir ./data/out \
  --num-questions 5 \
  --no-style

# Load styles from file
python3 src/main.py \
  --provider openrouter \
  --model qwen/qwen3-235b-a22b-2507 \
  --output-dir ./data/out \
  --num-questions 5 \
  --styles-file ./styles_sample.txt
```

See `styles_sample.txt` for an example styles file format and `default_styles.txt` for the complete list of built-in styles.

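The style selection described in this section can be sketched in a few lines of Python. This is an illustration only, not the tool's actual code: the helper names `load_styles`, `parse_style_arg`, and `pick_style` are hypothetical, and the sketch assumes that only lines starting with `#` count as comments in a styles file.

```python
import random
from pathlib import Path
from typing import Optional


def load_styles(path: str) -> list[str]:
    """Read one style per line, skipping blank lines and # comment lines."""
    styles = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Assumption: a comment is a line that starts with '#'.
        if line and not line.startswith("#"):
            styles.append(line)
    return styles


def parse_style_arg(value: str) -> list[str]:
    """Split a comma-separated --style value into individual styles."""
    return [part.strip() for part in value.split(",") if part.strip()]


def pick_style(styles: list[str], rng: Optional[random.Random] = None) -> str:
    """Choose one style at random for a single dataset item."""
    return (rng or random).choice(styles)
```

Passing a seeded `random.Random` to `pick_style` makes the per-item choice reproducible, which can help when comparing runs.
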
### Answer Generation

The system can optionally generate answers for each question using the `--with-answer` flag. This creates question-answer pairs where each question is answered based on the original source text.

Key features:

- **Answer generation**: Use `--with-answer` to enable answer generation for each question
- **Custom answer provider**: Use `--answer-provider` to specify a different API provider for answering (defaults to the same provider used for questions)
- **Custom answer model**: Use `--answer-model` to specify a different model for answering (defaults to the same model used for questions)
- **Batch vs individual**: Use `--answer-single-request` to generate all answers in one request; by default, questions are answered one per request
- **Error handling**: If answer generation fails, the `output` field is set to "error" and an `answer_error` field carries the error message

Examples:

```bash
# Generate questions with answers using the same model
python3 src/main.py \
  --provider openrouter \
  --model qwen/qwen3-235b-a22b-2507 \
  --output-dir ./data/qa_output \
  --num-questions 3 \
  --with-answer

# Use a different model for answers
python3 src/main.py \
  --provider openrouter \
  --model qwen/qwen3-235b-a22b-2507 \
  --answer-model qwen/qwen3-4b-instruct \
  --output-dir ./data/qa_output \
  --num-questions 3 \
  --with-answer

# Use a different provider and model for answers
python3 src/main.py \
  --provider openrouter \
  --model openai/gpt-oss-120b \
  --answer-provider anthropic \
  --answer-model claude-3-haiku-20240307 \
  --output-dir ./data/qa_output \
  --num-questions 3 \
  --with-answer

# Generate all answers in a single request (more efficient but less granular error handling)
python3 src/main.py \
  --provider openrouter \
  --model qwen/qwen3-235b-a22b-2507 \
  --output-dir ./data/qa_output \
  --num-questions 3 \
  --with-answer \
  --answer-single-request

# Use custom provider for questions and standard provider for answers
export OTHER_API_KEY=your_custom_api_key
export ANTHROPIC_API_KEY=your_anthropic_key
python3 src/main.py \
  --provider other \
  --provider-url https://your-custom-api.com/v1 \
  --model your-custom-model \
  --answer-provider anthropic \
  --answer-model claude-3-haiku-20240307 \
  --output-dir ./data/qa_output \
  --num-questions 3 \
  --with-answer
```

When `--with-answer` is used, the output format includes an `output` field containing the generated answer, or "error" if answer generation failed.

## Authentication (API keys)

Provide API keys via environment variables. General rule: `<PROVIDER>_API_KEY`, with the provider name in uppercase and `.` or `-` replaced by `_`. Special cases are handled automatically.

- openai → `OPENAI_API_KEY`
- anthropic → `ANTHROPIC_API_KEY`
- openrouter → `OPENROUTER_API_KEY`
- groq → `GROQ_API_KEY`
- together → `TOGETHER_API_KEY`
- cerebras → `CEREBRAS_API_KEY`
- qwen → `QWEN_API_KEY`
- qwen-deepinfra → `QWEN_DEEPINFRA_API_KEY`
- kimi (Moonshot) → `KIMI_API_KEY`
- z.ai → `Z_AI_API_KEY`
- featherless → `FEATHERLESS_API_KEY`
- chutes → `CHUTES_API_KEY`
- huggingface → `HUGGINGFACE_API_KEY`
- gemini → `GEMINI_API_KEY` (note: Gemini uses a query param; still export as shown)
- ollama → no API key required (assumes local Ollama at http://localhost:11434)
- other → `OTHER_API_KEY` (for custom OpenAI-compatible endpoints)

Example:

```bash
export OPENROUTER_API_KEY=your_api_key_here
```

### Using Custom Providers ("other")

The `other` provider allows you to use any OpenAI-compatible API endpoint.
This is useful for:

- Custom or self-hosted models
- New providers not yet directly supported
- Local inference servers that implement OpenAI-compatible APIs

Requirements:

- Set `--provider other`
- Provide `--provider-url` with the base URL of your API endpoint
- Set the `OTHER_API_KEY` environment variable to your API key

Example:

```bash
export OTHER_API_KEY=your_custom_api_key
python3 src/main.py dataset.jsonl \
  --provider other \
  --provider-url https://your-custom-api.com/v1 \
  --model your-custom-model \
  --output-dir ./output \
  --num-questions 3
```

The system sends requests in the OpenAI-compatible format to your custom endpoint.

## Datasets

You can pass any of the following:

- Hugging Face dataset name: `org/dataset` (uses `datasets.load_dataset(..., split=...)`)
- Local JSONL/JSON file: path ending with `.jsonl` or `.json`
- Local Parquet file: path ending with `.parquet`

The default text column is `text`. Change it with `--text-column` if your data uses another key.

Local JSONL example (one JSON object per line):

```json
{"text": "Large Language Models excel at generating diverse questions from text."}
{"text": "Neural networks can learn complex patterns from large datasets."}
```

Local Parquet example:

```bash
python3 src/main.py /path/to/data.parquet \
  --provider openrouter \
  --model qwen/qwen3-235b-a22b-2507 \
  --output-dir ./data/questions_parquet \
  --text-column text \
  --num-questions 3
```

## Output

Writes to `<output-dir>/questions_{YYYY-MM-DD_HH-MM-SS}_{dataset}_{provider}_{model}[optional_range].jsonl`.

Each line is a JSON record.

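Since every run writes a new timestamped file, a downstream script usually needs the most recent one. A minimal sketch, assuming output files keep the `questions_` prefix and zero-padded timestamp from the naming pattern above; `latest_questions_file` is a hypothetical helper, not part of the tool.

```python
from pathlib import Path
from typing import Optional


def latest_questions_file(output_dir: str) -> Optional[Path]:
    """Return the newest questions_*.jsonl file in output_dir, or None."""
    # The zero-padded timestamp directly follows the fixed "questions_"
    # prefix, so sorting by file name also sorts chronologically.
    candidates = sorted(
        Path(output_dir).glob("questions_*.jsonl"), key=lambda p: p.name
    )
    return candidates[-1] if candidates else None
```
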
For successful question generations:

```json
{
  "input": "What practical applications benefit most from question generation using LLMs?",
  "source_text": "...original text...",
  "question_index": 1,
  "total_questions": 5,
  "metadata": {
    "original_item_index": 0,
    "text_column": "text"
  },
  "generation_settings": {
    "provider": "openrouter",
    "model": "qwen/qwen3-235b-a22b-2507",
    "style": "formal and academic",
    "num_questions_requested": 5,
    "num_questions_generated": 5,
    "max_tokens": 4096
  },
  "timestamp": "2025-08-17T12:34:56.789012"
}
```

When using `--with-answer`, each record also includes an `output` field with the generated answer:

```json
{
  "input": "What practical applications benefit most from question generation using LLMs?",
  "output": "Question generation using LLMs has several practical applications including educational content creation, chatbot training data, assessment generation for online courses, and synthetic dataset augmentation for machine learning models...",
  "source_text": "...original text...",
  "question_index": 1,
  "total_questions": 5,
  "metadata": {
    "original_item_index": 0,
    "text_column": "text"
  },
  "generation_settings": {
    "provider": "openrouter",
    "model": "qwen/qwen3-235b-a22b-2507",
    "style": "formal and academic",
    "answer_provider": "anthropic",
    "answer_model": "claude-3-haiku-20240307",
    "answer_single_request": false,
    "num_questions_requested": 5,
    "num_questions_generated": 5,
    "max_tokens": 4096
  },
  "timestamp": "2025-08-17T12:34:56.789012"
}
```

When using `--with-options`, each record includes an `options` field with multiple-choice options:

```json
{
  "input": "What is the primary purpose of machine learning?",
  "options": {
    "A": "To replace human intelligence completely",
    "B": "To enable computers to learn and make decisions from data",
    "C": "To create robots that look like humans",
    "D": "To store large amounts of data efficiently",
    "E": "To generate synthetic questions from text"
  },
  "source_text": "...original text...",
  "question_index": 1,
  "total_questions": 3,
  "metadata": {
    "original_item_index": 0,
    "text_column": "text"
  },
  "generation_settings": {
    "provider": "openrouter",
    "model": "qwen/qwen3-235b-a22b-2507",
    "style": "formal and academic",
    "with_options": true,
    "num_questions_requested": 3,
    "num_questions_generated": 3,
    "max_tokens": 4096
  },
  "timestamp": "2025-08-17T12:34:56.789012"
}
```

When using both `--with-options` and `--with-answer`, the answer includes the correct letter and explanation in separate fields:

```json
{
  "input": "What is the primary purpose of machine learning?",
  "options": {
    "A": "To replace human intelligence completely",
    "B": "To enable computers to learn and make decisions from data",
    "C": "To create robots that look like humans",
    "D": "To store large amounts of data efficiently",
    "E": "To generate synthetic questions from text"
  },
  "output": "Answer: B | Explanation: This is the correct answer because it enables computers to learn from data and make intelligent decisions, which is the fundamental purpose of machine learning.",
  "correct_answer": "B",
  "explanation": "This is the correct answer because it enables computers to learn from data and make intelligent decisions, which is the fundamental purpose of machine learning.",
  "source_text": "...original text...",
  "generation_settings": {
    "with_options": true,
    "with_answer": true,
    ...
  }
}
```

The system automatically extracts:

- **`correct_answer`**: The letter (A, B, C, D, or E) for programmatic use
- **`explanation`**: The detailed explanation text
- **`output`**: The full formatted answer (for backward compatibility)

If answer generation fails for a question, the `output` field is set to "error" and an `answer_error` field provides details.

If generation fails for an item, an error record is emitted with an `error` field in place of the question fields.

## Examples

OpenRouter (Qwen):

```bash
export OPENROUTER_API_KEY=your_api_key_here
python3 src/main.py mkurman/hindawi-journals-2007-2023 \
  --provider openrouter \
  --model qwen/qwen3-235b-a22b-2507 \
  --output-dir ./data/questions_openrouter \
  --start-index 0 \
  --end-index 10 \
  --num-questions 5 \
  --text-column text \
  --verbose
```

OpenRouter with Answer Generation:

```bash
export OPENROUTER_API_KEY=your_api_key_here
python3 src/main.py mkurman/hindawi-journals-2007-2023 \
  --provider openrouter \
  --model qwen/qwen3-235b-a22b-2507 \
  --output-dir ./data/qa_openrouter \
  --start-index 0 \
  --end-index 10 \
  --num-questions 3 \
  --with-answer \
  --answer-single-request \
  --verbose
```

Multi-Provider Q&A Generation (Questions from OpenRouter, Answers from Anthropic):

```bash
export OPENROUTER_API_KEY=your_openrouter_key
export ANTHROPIC_API_KEY=your_anthropic_key
python3 src/main.py mkurman/hindawi-journals-2007-2023 \
  --provider openrouter \
  --model qwen/qwen3-235b-a22b-2507 \
  --answer-provider anthropic \
  --answer-model claude-3-haiku-20240307 \
  --output-dir ./data/qa_multi_provider \
  --start-index 0 \
  --end-index 5 \
  --num-questions 2 \
  --with-answer \
  --verbose
```

Ollama (local):

```bash
# Ensure Ollama is running and the model is pulled locally
python3 src/main.py ./data/articles.jsonl \
  --provider ollama \
  --model hf.co/lmstudio-community/Qwen3-4B-Instruct-2507-GGUF:Q4_K_M \
  --output-dir ./data/questions_ollama \
  --num-questions 3
```

## Tips

- Increase `--num-workers` for concurrency, and use `--sleep-between-requests` for rate limits.
- Use `--shuffle` to randomize items, and `--start-index/--end-index` to slice large datasets.
- Ensure `*_API_KEY` is set (where `*` is the provider name).

## License

Apache 2.0. See [LICENSE](LICENSE).