# ml-mcp-repo-level-coding

**Repository Path**: mirrors_apple/ml-mcp-repo-level-coding

## Basic Information

- **Project Name**: ml-mcp-repo-level-coding
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-10-30
- **Last Updated**: 2026-03-07

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

This repository contains the public code for the paper *“Agentic Tooling with Model Context Protocol Outperforms RAG and Long-Context Windows for Repository-Level Coding”*. To support reproducibility, we provide a public codebase that mirrors our data preparation and tool exposure pipeline, which includes Abstract Syntax Tree (AST) parsing, structured JSON references, and MCP tool definitions. The ASK-compliant MCP tools in our public codebase enable a language model to answer questions about an open-source repository that is readily available online, rather than an internally developed repository.

## Repository Description

The [scikit-learn repository](https://github.com/scikit-learn/scikit-learn) is required as input. For convenience, the `scikit-learn/` repo is included as a git submodule. To obtain the repository contents, please run `git submodule add https://github.com/scikit-learn/scikit-learn.git scikit-learn` after cloning this repository. Note that the repository is referred to as `sklearn` throughout our code.

Below is a breakdown of the other files found in our repository:

### Reference Database Creation

- **`reference_json_scraping_config.yaml`**: Configuration used to parse the sklearn repository. The file must adhere to the defined structure. Configuration options include:
  1. `verbose` (whether to print updates while running)
  2. `skip_private_functions` (whether to skip private functions or include them in the generated database)
  3. `include_classes` (whether to include class information in the generated database)
  4. sections in docstrings to skip, etc.
- **`generate_reference_json.py`**: Processes the raw sklearn repository and outputs the reference database used for MCP tooling. Run via:

  ```bash
  python generate_reference_json.py --config path_to_config_file
  ```

- **`sklearn_function_reference.json`**: Output of `generate_reference_json.py` using the default configuration in `reference_json_scraping_config.yaml`. This JSON file serves as the reference database that an LLM can query through tool calls.

### Querying a Model with MCP Tools

- **`ask_AI_config.yaml`**: Configuration used to send a query to an AI model with the MCP tools.
- **`ask_AI.py`**: Script that sends a query to an OpenAI model while providing MCP tool descriptions. The default tools are defined in `mcp_tools_sklearn.py`, while the parsed tool options are listed in `ask_AI_config.yaml`. The tools enable the LLM to quickly search through and retrieve documentation for the sklearn repository (without having to parse source code). **To use this script, an OpenAI API key must be obtained and listed in the configuration file.** Customizable options include the user query, the name of the config file, and an optional argument to save the AI model's response to a file. Run via:

  ```bash
  python ask_AI.py --config_name ask_AI_config.yaml --query "User's sklearn query to ask model"
  ```

  To set the optional arguments, adapt the following template:

  ```bash
  python ask_AI.py --config_name config_file_name --query "User's sklearn query to ask model" --output_file name_of_output_file
  ```

- **`mcp_tools_sklearn.py`**: Implements two ASK-compliant tools similar to those exposed to the LLMs in our study.
- **`sklearn_meta_prompt.py`**: Example of a system prompt used when running the experiments described in the paper. This prompt is prepended to each user query when running `ask_AI.py`.
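To illustrate the reference database step described under Reference Database Creation, here is a minimal sketch of AST-based docstring extraction using only the Python standard library. It is not the code in `generate_reference_json.py`: the function name `build_reference`, the output fields, and the example source are assumptions made for illustration. Only the two option names `skip_private_functions` and `include_classes` come from the configuration description above.

```python
import ast
import json


def build_reference(source, module_name,
                    skip_private_functions=True, include_classes=True):
    """Collect names and docstrings from one module's source text.

    Hypothetical helper; the output schema here is an assumption and may
    differ from the real sklearn_function_reference.json.
    """
    tree = ast.parse(source)
    entries = []
    # ast.walk visits every node, so methods nested in classes are included.
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Treat a leading underscore as "private" (an assumption).
            if skip_private_functions and node.name.startswith("_"):
                continue
            kind = "function"
            args = [a.arg for a in node.args.args]
        elif isinstance(node, ast.ClassDef):
            if not include_classes:
                continue
            kind = "class"
            args = []
        else:
            continue
        entries.append({
            "module": module_name,
            "name": node.name,
            "kind": kind,
            "args": args,
            "docstring": ast.get_docstring(node) or "",
            "lineno": node.lineno,
        })
    # Sort by source position so the output order is deterministic.
    return sorted(entries, key=lambda e: e["lineno"])


EXAMPLE_SOURCE = '''
def fit(X, y):
    """Fit the model."""

def _validate(X):
    """Internal check."""

class Scaler:
    """Scale features."""
'''

reference = build_reference(EXAMPLE_SOURCE, "example_module")
print(json.dumps(reference, indent=2))
```

With the default options, the sketch keeps `fit` and `Scaler` and drops `_validate`. Working from the parsed tree rather than from regular expressions means docstrings and argument names are read from the syntax itself, which is presumably why the pipeline is described as AST-based.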
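The querying step described under Querying a Model with MCP Tools can be sketched in a similar spirit: a model emits a tool name plus JSON arguments, and a small dispatcher looks the answer up in the reference database. Everything named below is hypothetical, including the tool names `search_functions` and `get_function_doc`, the `dispatch_tool_call` helper, and the two-entry `REFERENCE` list standing in for `sklearn_function_reference.json`. The actual tools in `mcp_tools_sklearn.py` may look quite different.

```python
import json

# Hypothetical in-memory stand-in for the reference database.
# The real JSON file's schema may differ.
REFERENCE = [
    {"name": "train_test_split", "module": "sklearn.model_selection",
     "docstring": "Split arrays or matrices into random train and test subsets."},
    {"name": "accuracy_score", "module": "sklearn.metrics",
     "docstring": "Accuracy classification score."},
]


def search_functions(keyword, limit=5):
    """Return entries whose name or docstring contains the keyword."""
    keyword = keyword.lower()
    hits = [e for e in REFERENCE
            if keyword in e["name"].lower() or keyword in e["docstring"].lower()]
    return hits[:limit]


def get_function_doc(name):
    """Return the entry with an exact name match, or None if absent."""
    for entry in REFERENCE:
        if entry["name"] == name:
            return entry
    return None


# Registry mapping tool names (as a model would emit them) to callables.
TOOLS = {
    "search_functions": search_functions,
    "get_function_doc": get_function_doc,
}


def dispatch_tool_call(tool_name, arguments_json):
    """Run one tool call and return its result as a JSON string."""
    if tool_name not in TOOLS:
        return json.dumps({"error": f"unknown tool: {tool_name}"})
    arguments = json.loads(arguments_json)
    return json.dumps(TOOLS[tool_name](**arguments))


print(dispatch_tool_call("search_functions", '{"keyword": "split"}'))
```

Returning JSON strings keeps the tool output easy to pass back to a model as a message, and the lookup never touches the sklearn source code, which matches the stated goal of retrieving documentation without parsing source at query time.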