# pg_parquet

> Copy from/to Parquet files in PostgreSQL!

[![CI lints and tests](https://github.com/CrunchyData/pg_parquet/actions/workflows/ci.yml/badge.svg)](https://github.com/CrunchyData/pg_parquet/actions/workflows/ci.yml)
[![codecov](https://codecov.io/gh/CrunchyData/pg_parquet/graph/badge.svg?token=6BPS0DSKJ2)](https://codecov.io/gh/CrunchyData/pg_parquet)

`pg_parquet` is a PostgreSQL extension that allows you to read and write [Parquet files](https://parquet.apache.org), located in `S3` or the local `file system`, from PostgreSQL via the `COPY TO/FROM` commands. It depends on the [Apache Arrow](https://arrow.apache.org/rust/arrow/) project to read and write Parquet files and on the [pgrx](https://github.com/pgcentralfoundation/pgrx) project to extend PostgreSQL's `COPY` command.

```sql
-- Copy a query result into Parquet in S3
COPY (SELECT * FROM table) TO 's3://mybucket/data.parquet' WITH (format 'parquet');

-- Load data from Parquet in S3
COPY table FROM 's3://mybucket/data.parquet' WITH (format 'parquet');
```

## Quick Reference
- [Installation From Source](#installation-from-source)
- [Usage](#usage)
  - [Copy FROM/TO Parquet files TO/FROM Postgres tables](#copy-tofrom-parquet-files-fromto-postgres-tables)
  - [Inspect Parquet schema](#inspect-parquet-schema)
  - [Inspect Parquet metadata](#inspect-parquet-metadata)
- [Object Store Support](#object-store-support)
- [Copy Options](#copy-options)
- [Misc Configuration](#misc-configuration)
- [Supported Types](#supported-types)
  - [Nested Types](#nested-types)
- [Postgres Support Matrix](#postgres-support-matrix)

## Installation From Source

After installing `Postgres`, you need to set up `rustup` and `cargo-pgrx` to build the extension.

```bash
# install rustup
> curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# install cargo-pgrx
> cargo install cargo-pgrx

# configure pgrx
> cargo pgrx init --pg17 $(which pg_config)

# append the extension to shared_preload_libraries in ~/.pgrx/data-17/postgresql.conf
> echo "shared_preload_libraries = 'pg_parquet'" >> ~/.pgrx/data-17/postgresql.conf

# run cargo-pgrx to build and install the extension
> cargo pgrx run

# create the extension in the database
psql> "CREATE EXTENSION pg_parquet;"
```
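As a quick sanity check (a minimal sketch using standard PostgreSQL catalogs, not an official installation step), you can verify that the library is preloaded and the extension is created:

```sql
-- the library should appear in shared_preload_libraries
SHOW shared_preload_libraries;

-- the extension should be created in the current database
SELECT extname, extversion FROM pg_extension WHERE extname = 'pg_parquet';
```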
## Usage

There are mainly three things that you can do with `pg_parquet`:
1. You can export Postgres tables/queries to Parquet files,
2. You can ingest data from Parquet files to Postgres tables,
3. You can inspect the schema and metadata of Parquet files.

### COPY to/from Parquet files from/to Postgres tables

You can use PostgreSQL's `COPY` command to read and write Parquet files. Below is an example of how to write a PostgreSQL table, with complex types, into a Parquet file and then read the Parquet file content back into the same table.

```sql
-- create composite types
CREATE TYPE product_item AS (id INT, name TEXT, price float4);
CREATE TYPE product AS (id INT, name TEXT, items product_item[]);

-- create a table with complex types
CREATE TABLE product_example (
    id int,
    product product,
    products product[],
    created_at TIMESTAMP,
    updated_at TIMESTAMPTZ
);

-- insert some rows into the table
insert into product_example values (
    1,
    ROW(1, 'product 1', ARRAY[ROW(1, 'item 1', 1.0), ROW(2, 'item 2', 2.0), NULL]::product_item[])::product,
    ARRAY[ROW(1, NULL, NULL)::product, NULL],
    now(),
    '2022-05-01 12:00:00-04'
);

-- copy the table to a parquet file
COPY product_example TO '/tmp/product_example.parquet' (format 'parquet', compression 'gzip');

-- show table
SELECT * FROM product_example;

-- copy the parquet file to the table
COPY product_example FROM '/tmp/product_example.parquet';

-- show table
SELECT * FROM product_example;
```
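`COPY TO` also accepts a query instead of a table name, and because the destination below ends with `.parquet`, the `format 'parquet'` option can be omitted. A small sketch (the path and filter are illustrative):

```sql
-- export a filtered projection; format is inferred from the .parquet extension
COPY (SELECT id, created_at FROM product_example WHERE id = 1)
TO '/tmp/product_example_subset.parquet';
```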
### Inspect Parquet schema

You can call `SELECT * FROM parquet.schema()` to discover the schema of the Parquet file at the given uri.

```sql
SELECT * FROM parquet.schema('/tmp/product_example.parquet') LIMIT 10;

             uri              |     name     | type_name  | type_length | repetition_type | num_children | converted_type | scale | precision | field_id | logical_type
------------------------------+--------------+------------+-------------+-----------------+--------------+----------------+-------+-----------+----------+--------------
 /tmp/product_example.parquet | arrow_schema |            |             |                 |            5 |                |       |           |          |
 /tmp/product_example.parquet | id           | INT32      |             | OPTIONAL        |              |                |       |           |        0 |
 /tmp/product_example.parquet | product      |            |             | OPTIONAL        |            3 |                |       |           |        1 |
 /tmp/product_example.parquet | id           | INT32      |             | OPTIONAL        |              |                |       |           |        2 |
 /tmp/product_example.parquet | name         | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |        3 | STRING
 /tmp/product_example.parquet | items        |            |             | OPTIONAL        |            1 | LIST           |       |           |        4 | LIST
 /tmp/product_example.parquet | list         |            |             | REPEATED        |            1 |                |       |           |          |
 /tmp/product_example.parquet | element      |            |             | OPTIONAL        |            3 |                |       |           |        5 |
 /tmp/product_example.parquet | id           | INT32      |             | OPTIONAL        |              |                |       |           |        6 |
 /tmp/product_example.parquet | name         | BYTE_ARRAY |             | OPTIONAL        |              | UTF8           |       |           |        7 | STRING
(10 rows)
```

### Inspect Parquet metadata

You can call `SELECT * FROM parquet.metadata()` to discover the detailed metadata of the Parquet file, such as column statistics, at the given uri.

```sql
SELECT uri, row_group_id, row_group_num_rows, row_group_num_columns, row_group_bytes, column_id, file_offset, num_values, path_in_schema, type_name
FROM parquet.metadata('/tmp/product_example.parquet') LIMIT 1;

             uri              | row_group_id | row_group_num_rows | row_group_num_columns | row_group_bytes | column_id | file_offset | num_values | path_in_schema | type_name
------------------------------+--------------+--------------------+-----------------------+-----------------+-----------+-------------+------------+----------------+-----------
 /tmp/product_example.parquet |            0 |                  1 |                    13 |             842 |         0 |           0 |          1 | id             | INT32
(1 row)
```

```sql
SELECT stats_null_count, stats_distinct_count, stats_min, stats_max, compression, encodings, index_page_offset, dictionary_page_offset, data_page_offset, total_compressed_size, total_uncompressed_size
FROM parquet.metadata('/tmp/product_example.parquet') LIMIT 1;

 stats_null_count | stats_distinct_count | stats_min | stats_max |    compression     |        encodings         | index_page_offset | dictionary_page_offset | data_page_offset | total_compressed_size | total_uncompressed_size
------------------+----------------------+-----------+-----------+--------------------+--------------------------+-------------------+------------------------+------------------+-----------------------+-------------------------
                0 |                      | 1         | 1         | GZIP(GzipLevel(6)) | PLAIN,RLE,RLE_DICTIONARY |                   |                      4 |               42 |                   101 |                      61
(1 row)
```

You can call `SELECT * FROM parquet.file_metadata()` to discover file level metadata of the Parquet file, such as format version, at the given uri.

```sql
SELECT * FROM parquet.file_metadata('/tmp/product_example.parquet')

             uri              | created_by | num_rows | num_row_groups | format_version
------------------------------+------------+----------+----------------+----------------
 /tmp/product_example.parquet | pg_parquet |        1 |              1 | 1
(1 row)
```

You can call `SELECT * FROM parquet.kv_metadata()` to query custom key-value metadata of the Parquet file at the given uri.

```sql
SELECT uri, encode(key, 'escape') as key, encode(value, 'escape') as value
FROM parquet.kv_metadata('/tmp/product_example.parquet');

             uri              |     key      |        value
------------------------------+--------------+----------------------
 /tmp/product_example.parquet | ARROW:schema | /////5gIAAAQAAAA ...
(1 row)
```

## Object Store Support

`pg_parquet` supports reading and writing Parquet files from/to the `S3` object store. Only uris with the `s3://` scheme are supported.

The simplest way to configure object storage is by creating the standard `~/.aws/credentials` and `~/.aws/config` files:

```bash
$ cat ~/.aws/credentials
[default]
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
aws_secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

$ cat ~/.aws/config
[default]
region = eu-central-1
```

Alternatively, you can use the following environment variables when starting postgres to configure the S3 client:
- `AWS_ACCESS_KEY_ID`: the access key ID of the AWS account
- `AWS_SECRET_ACCESS_KEY`: the secret access key of the AWS account
- `AWS_REGION`: the default region of the AWS account
- `AWS_SHARED_CREDENTIALS_FILE`: an alternative location for the credentials file
- `AWS_CONFIG_FILE`: an alternative location for the config file
- `AWS_PROFILE`: the name of the profile from the credentials and config file (the default profile name is `default`)

As a more Postgres-native alternative, `pg_parquet` also defines GUCs to configure the S3 client. These GUCs can only be set by a superuser, and you can set them to use different configurations per session. You do not need to restart your session to change the config path or profile name.
- `pg_parquet.aws_config_file`: an absolute path to the configuration file used by the S3 client. When set, the GUC overrides the `AWS_CONFIG_FILE` environment variable. By default the GUC is unset.
- `pg_parquet.aws_shared_credentials_file`: an absolute path to the shared credentials file used by the S3 client. When set, the GUC overrides the `AWS_SHARED_CREDENTIALS_FILE` environment variable. By default the GUC is unset.
- `pg_parquet.aws_profile`: the profile name used by the S3 client. When set, the GUC overrides the `AWS_PROFILE` environment variable. By default the GUC is unset.

> [!NOTE]
> GUCs override environment variables, and environment variables override configuration files. You can use any of them, or a combination, to configure the S3 client as you wish.

> [!NOTE]
> To be able to write into an object store location, you need to grant the `parquet_object_store_write` role to your current postgres user.
> Similarly, to read from an object store location, you need to grant the `parquet_object_store_read` role to your current postgres user.
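For example, a single session could be pointed at a different profile before copying to S3 (a sketch; the profile and bucket names are illustrative):

```sql
-- requires superuser; affects only the current session
SET pg_parquet.aws_profile TO 'analytics';

COPY product_example TO 's3://mybucket/product_example.parquet' (format 'parquet');
```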
## Copy Options

`pg_parquet` supports the following options in the `COPY TO` command:
- `format parquet`: you need to specify this option to read or write Parquet files whose names do not end with a `.parquet[.]` extension,
- `row_group_size`: the number of rows in each row group while writing Parquet files. The default row group size is `122880`,
- `row_group_size_bytes`: the total byte size of rows in each row group while writing Parquet files. The default row group size in bytes is `row_group_size * 1024`,
- `compression`: the compression format to use while writing Parquet files. The supported compression formats are `uncompressed`, `snappy`, `gzip`, `brotli`, `lz4`, `lz4raw` and `zstd`. The default compression format is `snappy`. If not specified, the compression format is determined by the file extension,
- `compression_level`: the compression level to use while writing Parquet files. Compression levels are only supported for the `gzip`, `zstd` and `brotli` compression formats. The default compression level is `6` for `gzip (0-10)`, `1` for `zstd (1-22)` and `1` for `brotli (0-11)`.

`pg_parquet` supports the following options in the `COPY FROM` command:
- `format parquet`: you need to specify this option to read or write Parquet files whose names do not end with a `.parquet[.]` extension,
- `match_by_name`: matches Parquet file fields to PostgreSQL table columns by their name rather than by their position in the schema (the default). By default, the option is `false`. The option is useful when field order differs between the Parquet file and the table, but their names match.
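Combining a few of these options, a round-trip might look like the following sketch (the file path and option values are illustrative, not defaults):

```sql
-- write with smaller row groups and zstd compression
COPY product_example TO '/tmp/products.parquet'
  (format 'parquet', row_group_size 10000, compression 'zstd', compression_level 3);

-- read back, matching Parquet fields to table columns by name
COPY product_example FROM '/tmp/products.parquet'
  (format 'parquet', match_by_name true);
```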
## Misc Configuration

Some of the general purpose configuration parameters (GUCs) defined by `pg_parquet` are shown below:
- `pg_parquet.enable_copy_hooks`: you can set this parameter to `on` or `off` to enable or disable the `pg_parquet` extension. The default value is `on`.

## Supported Types

`pg_parquet` has rich type support, including PostgreSQL's primitive, array, and composite types. Below is the table of the supported types in PostgreSQL and their corresponding Parquet types.

| PostgreSQL Type   | Parquet Physical Type     | Logical Type     |
|-------------------|---------------------------|------------------|
| `bool`            | BOOLEAN                   |                  |
| `smallint`        | INT16                     |                  |
| `integer`         | INT32                     |                  |
| `bigint`          | INT64                     |                  |
| `real`            | FLOAT                     |                  |
| `oid`             | INT32                     |                  |
| `double`          | DOUBLE                    |                  |
| `numeric`(1)      | FIXED_LEN_BYTE_ARRAY(16)  | DECIMAL(128)     |
| `text`            | BYTE_ARRAY                | STRING           |
| `json`            | BYTE_ARRAY                | STRING           |
| `bytea`           | BYTE_ARRAY                |                  |
| `date` (2)        | INT32                     | DATE             |
| `timestamp`       | INT64                     | TIMESTAMP_MICROS |
| `timestamptz` (3) | INT64                     | TIMESTAMP_MICROS |
| `time`            | INT64                     | TIME_MICROS      |
| `timetz`(3)       | INT64                     | TIME_MICROS      |
| `geometry`(4)     | BYTE_ARRAY                |                  |

### Nested Types

| PostgreSQL Type   | Parquet Physical Type     | Logical Type     |
|-------------------|---------------------------|------------------|
| `composite`       | GROUP                     | STRUCT           |
| `array`           | element's physical type   | LIST             |
| `crunchy_map`(5)  | GROUP                     | MAP              |

> [!WARNING]
> - (1) The `numeric` type is written with the smallest possible memory width to the Parquet file, as follows:
>   * `numeric(P <= 9, S)` is represented as `INT32` with `DECIMAL` logical type,
>   * `numeric(9 < P <= 18, S)` is represented as `INT64` with `DECIMAL` logical type,
>   * `numeric(18 < P <= 38, S)` is represented as `FIXED_LEN_BYTE_ARRAY(9-16)` with `DECIMAL` logical type,
>   * `numeric(38 < P, S)` is represented as `BYTE_ARRAY` with `STRING` logical type,
>   * `numeric` without precision and scale is allowed by Postgres. Such values are represented with a default precision (38) and scale (16) instead of being written as strings. You get a runtime error if your table tries to read or write a numeric value that is not allowed by the default precision and scale (22 integral digits before the decimal point, 16 digits after the decimal point).
> - (2) The `date` type is represented according to the `Unix epoch` when writing to Parquet files. It is converted back according to the `PostgreSQL epoch` when reading from Parquet files.
> - (3) The `timestamptz` and `timetz` types are adjusted to `UTC` when writing to Parquet files. They are converted back with the `UTC` timezone when reading from Parquet files.
> - (4) The `geometry` type is represented as `BYTE_ARRAY` encoded as `WKB` when the `postgis` extension is created. Otherwise, it is represented as `BYTE_ARRAY` with `STRING` logical type.
> - (5) `crunchy_map` depends on functionality provided by [Crunchy Bridge](https://www.crunchydata.com/products/crunchy-bridge). The `crunchy_map` type is represented as `GROUP` with `MAP` logical type when the `crunchy_map` extension is created. Otherwise, it is represented as `BYTE_ARRAY` with `STRING` logical type.

> [!WARNING]
> Any type that does not have a corresponding Parquet type will be represented, as a fallback mechanism, as `BYTE_ARRAY` with `STRING` logical type, e.g. `enum`.

## Postgres Support Matrix

`pg_parquet` supports the following PostgreSQL versions:

| PostgreSQL Major Version | Supported |
|--------------------------|-----------|
| 14                       | ✅        |
| 15                       | ✅        |
| 16                       | ✅        |
| 17                       | ✅        |