# datafusion-ray

**Repository Path**: mirrors_apache/datafusion-ray

## Basic Information

- **Project Name**: datafusion-ray
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-09-21
- **Last Updated**: 2024-12-07

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# DataFusion for Ray

[![Apache licensed][license-badge]][license-url]
[![Python Tests][actions-badge]][actions-url]
[![Discord chat][discord-badge]][discord-url]

[license-badge]: https://img.shields.io/badge/license-Apache%20v2-blue.svg
[license-url]: https://github.com/apache/datafusion-ray/blob/main/LICENSE.txt
[actions-badge]: https://github.com/apache/datafusion-ray/actions/workflows/main.yml/badge.svg
[actions-url]: https://github.com/apache/datafusion-ray/actions?query=branch%3Amain
[discord-badge]: https://img.shields.io/badge/Chat-Discord-purple
[discord-url]: https://discord.com/invite/Qw5gKqHxUM

## Overview

DataFusion for Ray is a distributed execution framework that enables DataFusion DataFrame and SQL queries to run on a Ray cluster. This integration allows users to leverage Ray's dynamic scheduling capabilities while executing queries in a distributed fashion.

## Execution Modes

DataFusion for Ray supports two execution modes:

### Streaming Execution

This mode mimics the default execution strategy of DataFusion. Each operator in the query plan starts executing as soon as its inputs are available, leading to a more pipelined execution model.

### Batch Execution

_Note: Batch Execution is not implemented yet. Tracking issue:_

In this mode, execution follows a staged model similar to Apache Spark. Each query stage runs to completion, producing intermediate shuffle files that are persisted and used as input for the next stage.
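The difference between the two execution modes can be illustrated with a toy sketch in plain Python. It is only a conceptual model, not how DataFusion for Ray is implemented: the operator names `scan` and `double` and the helpers `run_streaming` and `run_batch` are hypothetical, and an in-memory list stands in for the persisted shuffle files. The log records the order in which operators touch each row.

```python
from typing import Iterable, Iterator, List, Tuple


def scan(rows: Iterable[int], log: List[str]) -> Iterator[int]:
    """Hypothetical source operator: yields rows one at a time."""
    for row in rows:
        log.append(f"scan {row}")
        yield row


def double(rows: Iterable[int], log: List[str]) -> Iterator[int]:
    """Hypothetical downstream operator: doubles each incoming row."""
    for row in rows:
        log.append(f"double {row}")
        yield row * 2


def run_streaming(rows: List[int]) -> Tuple[List[int], List[str]]:
    """Pipelined model: each row flows through both operators
    before the next row is scanned."""
    log: List[str] = []
    result = list(double(scan(rows, log), log))
    return result, log


def run_batch(rows: List[int]) -> Tuple[List[int], List[str]]:
    """Staged model: the first stage runs to completion and its output
    is materialized (a list here, standing in for shuffle files)
    before the second stage starts."""
    log: List[str] = []
    stage_one_output = list(scan(rows, log))
    result = list(double(stage_one_output, log))
    return result, log


if __name__ == "__main__":
    print(run_streaming([1, 2]))
    print(run_batch([1, 2]))
```

Both helpers return the same result; only the operator interleaving recorded in the log differs, which is the essential distinction between the pipelined and staged models.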
## Getting Started

See the [contributor guide] for instructions on building DataFusion for Ray.

Once installed, you can run queries using DataFusion's familiar API while leveraging the distributed execution capabilities of Ray.

```python
# from example in ./examples/http_csv.py
import ray
from datafusion_ray import DFRayContext, df_ray_runtime_env

ray.init(runtime_env=df_ray_runtime_env)

ctx = DFRayContext()
ctx.register_csv(
    "aggregate_test_100",
    "https://github.com/apache/arrow-testing/raw/master/data/csv/aggregate_test_100.csv",
)

df = ctx.sql("SELECT c1,c2,c3 FROM aggregate_test_100 LIMIT 5")

df.show()
```

## Contributing

Contributions are welcome! Please open an issue or submit a pull request if you would like to contribute. See the [contributor guide] for more information.

## License

DataFusion for Ray is licensed under Apache 2.0.

[contributor guide]: docs/contributing.md