# HNSW-Flash **Repository Path**: windforce1981/HNSW-Flash ## Basic Information - **Project Name**: HNSW-Flash - **Description**: hnsw flash - **Primary Language**: C++ - **License**: Not specified - **Default Branch**: master - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2025-11-07 - **Last Updated**: 2025-11-16 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # HNSW-Flash This repository contains the code for the paper **Accelerating Graph Indexing for ANNS on Modern CPUs**. This code builds upon [hnswlib](https://github.com/nmslib/hnswlib) and [Eigen](https://gitlab.com/libeigen/eigen). ## Prerequisite ### Datasets **File format:** The data used in this project is formatted in `.fvecs` and `.ivecs`. Each file contains multiple lines of vectors; each line begins with an integer $d$, representing the dimension of the vector, followed by $d$ numbers as vector data. The number are of type `float` in `.fvecs` files and `integer` in `.ivecs` files. For more information, visit [this site](http://corpus-texmex.irisa.fr/). Below are the datasets mentioned in the paper: | Datasets | Data Volume | Size (GB) | Dim. | Query Volume | | ----------------------------------------------------------------------------------------------------------- | ------------- | --------- | ----- | ------------ | | cohere10k (for test) [(link)](data/cohere10k) | 10,000 | 0.029 | 768 | 100 | | ARGILLA [(link)](https://huggingface.co/datasets/argilla-warehouse/personahub-fineweb-edu-4-embeddings) | 21,071,228 | 81 | 1,024 | 100,000 | | ANTON [(link)](https://huggingface.co/datasets/anton-l/wiki-embed-mxbai-embed-large-v1) | 19,399,177 | 75 | 1,024 | 100,000 | | LAION [(link)](https://sisap-challenges.github.io/2024/datasets/) | 100,000,000 | 293 | 768 | 100,000 | | IMAGENET [(link)]() | 13,158,856 | 38 | 768 | 100,000 | | COHERE [(link)](https://huggingface.co/datasets/Cohere/wikipedia-22-12-es-embeddings) | 10,124,929 | 30 | 768 | 100,000 | | DATACOMP [(link)](https://huggingface.co/datasets/nielsr/datacomp-small-with-embeddings-and-cluster-labels) | 12,800,000 | 37 | 768 | 100,000 | | BIGCODE [(link)](https://huggingface.co/datasets/bigcode/stack-exchange-embeddings-20230914) | 10,404,628 | 30 | 768 | 100,000 | | SSNPP [(link)](https://big-ann-benchmarks.com/neurips21.html) | 1,000,000,000 | 960 | 256 | 100,000 | The following is an example of the file organization for the `cohere10k` dataset: HNSW_Flash ├─data │ ├─cohere10k │ │ ├─cohere10k_base.fvecs │ │ ├─cohere10k_query.fvecs │ │ └─cohere10k_groundtruth.ivecs │ ├─... ### Strategies Below are the strategies implemented in this repository. The strategy names are used to select different approaches when running the code. | **Strategy** | **Description** | |--------------|-----------------| | **flash** | A high-performance implementation of HNSW, optimized for fast search in large-scale datasets. Utilizes techniques specific to **Flash** for efficient distance calculations. | | **hnsw** | A standard implementation of **Hierarchical Navigable Small World (HNSW)** graph, based on the version in [nmslib/hnswlib]((https://github.com/nmslib/hnswlib)). This is the core algorithm for approximate nearest neighbor search. | | **nsg** | An implementation of **Navigating Spreading-out Graph (NSG)**. | | **nsg-flash** | Base on NSG algorithm, utilizes the same techniques as **Flash** for efficient distance calculations. | | **pca-sdc** | Extends HNSW by applying **Principal Component Analysis (PCA)** for dimensionality reduction. Uses **SDC** (Simultaneous Distance Calculation) for efficient query processing, improving both speed and accuracy. | | **pq-adc** | Combines HNSW with **Product Quantization (PQ)**. Uses **ADC** (Asymmetric Distance Calculation) to perform queries efficiently by comparing encoded vectors with raw data. Suitable for scenarios where memory usage is a concern. | | **pq-sdc** | Similar to `pq-adc`, but uses **SDC** for query processing, balancing speed and accuracy. Ideal for large datasets where encoding and fast computation are critical. | | **sq-adc** | Integrates HNSW with **Scalar Quantization (SQ)**. Uses **ADC** for distance calculations, enabling fast search with compressed data. This method is useful for low-memory environments. | | **sq-sdc** | A variation of `sq-adc`, utilizing **SDC** for querying. Offers a trade-off between accuracy and efficiency, particularly useful for datasets with high dimensionality. | | **taumg** | An implementation of **τ-Monotonic Graph (τ-MG)**. | | **taumg-flash** | Base on τ-MG algorithm, utilizes the same techniques as **Flash** for efficient distance calculations. | ### Environment - cmake >= 3.15 - g++ >= 11.4 with C++17 support - CPU with SSE/AVX/AVX2/AVX512 support ## Reproduction The algorithms automatically check if the index has been generated and saved in the file system. if not, they will generate and save the index before running the query; otherwise, they will load the existing index file and run the query. The `cohere10k` dataset has been pre-configured to run with all strategies: make build make run flash cohere10k For other datasets, you can run the code using the following format: cd ./bin && ./main [dataset_name] [strategy_name] For more details on running the code, please refer to `./main.cc` and `./Makefile`. All parameters for the strategies can be modified in `./include/core.h`. Please recompile the program after making any changes.