# data-mesh-demo **Repository Path**: mirrors_GoogleCloudPlatform/data-mesh-demo ## Basic Information - **Project Name**: data-mesh-demo - **Description**: No description available - **Primary Language**: Unknown - **License**: Apache-2.0 - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2022-11-02 - **Last Updated**: 2026-05-16 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # Data Mesh on Google Cloud ## Overview This repo is a reference implementation of Data Mesh on Google Cloud. See [this architecture guide](https://cloud.google.com/architecture/architecture-functions-data-mesh) for details. [infrastructure](/infrastructure) folder contains a number of Terraform scripts to create several Google Cloud projects which represent typical data mesh participants - data producers, data consumers, central services. The scripts are stored in folders and are meant to be executed sequentially to illustrate the typical progression of building a data mesh. ## Setup ### Prerequisites In order to run the scripts in this repo make sure you have installed: * gcloud command ([instructions](https://cloud.google.com/sdk/docs/install)) * Terraform ([instructions](https://learn.hashicorp.com/tutorials/terraform/install-cli)) * jq utility ([instructions](https://stedolan.github.io/jq/download/)) If you are using Cloud Shell, all these utilities are preinstalled in your environment. You would need to create Google Cloud's Application Default Credentials using ```shell gcloud auth application-default login ``` #### Permissions of the principal running the Terraform scripts Typicially a person with `Owner` or `Editor` basic roles will execute the Terraform scripts to set up the infrastructure of the Data Mesh. These scripts can also be run using a service account with sufficent privileges. It is important that the principal running the scripts has `Service Account Token Creator` role at the `Domain/Org level` policy rather than the project level. Without it the scripts will fail with a `403` error while creating tag templates. To start the data mesh infrastructure build process: ```shell cd infrastructure ``` ### Providing Terraform variables Several step below mention that you will need to provide Terraform variables. There are [multiple ways](https://www.terraform.io/language/values/variables) to do this. The simplest approach is to create a short script that sets the variables you need before you start setting the infrastructure, which you can re-run if you need to restart your session. Typically this script will contain several statements like: ```shell export TF_VAR_project_prefix='my-data-mesh-demo' ``` and can be set using ```shell source set-my-variables.sh ``` Notice, `TF_VAR_` indicates to Terraform that the rest of the environment variable name is Terraform's variable name. ### Projects There are 4 projects that are used in this demo. They can be existing projects, or there are scripts to create new projects. * Domain base data project. It contains base BigQuery tables and materialized views. * Domain product project. It contains BigQuery authorized views and a Pub/Sub topic that constitute consumption interfaces exposed to data consumers (see the Consumer project, described below) as data products by this domain. * Central catalog project. It is the project where the data mesh-specific Data Catalog tag templates are defined. * Consumer project. Data from the data product is consumed from this project. #### To create new projects All new projects will be created in the same organization folder. In a real organization most likely there will be multiple folders mapping to different domains in the organization, but for simplicity in managing the newly created projects we are creating them in the same folder. Set up the following Terraform variables: * project_prefix - used if creating the projects as part of the setup * billing_id - all projects created will be using this billing id * folder_id - all projects will be created in this folder * org_domain - all users within this domain will be given access to search data products in Data Catalog Create a new folder and capture the folder id Ensure that the account used to run Terraform scripts has the following roles assigned at the folder level: * Folder Admin * Service Account Token Creator Run these commands: ```shell cd initial-project-setup terraform init terraform apply cd .. ``` Now that the projects are created their project ids can be stored in environment variables by running ```shell source get-project-ids.sh ``` #### Using existing projects If you pre-created the four projects we mentioned earlier and would like to use them, make sure the following Terraform variables are set: * consumer_project_id * domain_data_project_id * domain_product_project_id * central_catalog_project_id Additionally, you would need to create a service account in the Central Catalog project, give it at least Data Catalog Admin role and assign the Terraform variable `central_catalog_admin_sa` its email address. The account that will be used to run the Terraform scripts in the central catalog project will need to have Service Account Token Creator role on this service account. ## Creating base project infrastructure Each of these steps can be done in any sequence. It simulates typical steps done by participants in the data mesh. ### Domain data (created by data producers) ```shell cd domain terraform init terraform apply cd .. ``` ### Central catalog (created by the central services team) ```shell cd central-catalog terraform init terraform apply cd .. ``` ### Consumer project (created by a data mesh consumer) ```shell cd consumer terraform init terraform apply cd .. ``` Once all the projects are populated, store additional resource ids in environment variables: ```shell source get-base-project-details.sh ``` ## Enabling the data mesh Next steps show how to enable the data mesh in the organization. We included these steps in [incremental changes folder](/infrastructure/incremental-changes) ```shell cd incremental-changes ``` ### Allow producers to tag data products and underlying data resources The central team [defined a set of tag templates](infrastructure/central-catalog/datacatalog.tf) needed to tag data product-related resources ("product interfaces") by data producers. Allowing data producers to use the tag templates requires [granting the Tag Template User role](infrastructure/incremental-changes/allow-producer-to-tag/main.tf) to the service account(s) used to tag these resources. ```shell cd allow-producer-to-tag terraform init terraform apply cd .. ``` ### Make product searchable To make data interfaces searchable in Data Catalog two things need to happen: * Resource metadata needs to become visible to anybody in the organization * Resources need to be tagged with a set of predefined tag templates `get-catalog-ids.sh` script sets several Terraform variables with the Data Catalog ids of the resources to be tagged using the data mesh-related tag templates. You will need to set up this Terraform variable: * org_domain - all users within this domain will be given access to search data products in Data Catalog, e.g., `example.com`. ```shell cd make-products-searchable source get-catalog-ids.sh terraform init terraform apply cd .. ``` ### Give the consumer access to a product interface Once a product is discovered and its suitability is assessed by the consumer the product owner needs to grant access to the resources which represent the interface. ```shell cd give-access-to-event-v1 terraform init terraform apply cd .. ``` ## Single script to create all infrastructure [build-all.sh](infrastructure/build-all.sh) script executes all the steps above. It assumes that you will be creating projects from scratch. It assumes that all the Terraform variables will be provided in `bootstrap.sh` script. Here's an example of such a script: ```shell export TF_VAR_project_prefix='my-org-mesh' export TF_VAR_org_id='' export TF_VAR_billing_id='' export TF_VAR_folder_id='' export TF_VAR_org_domain='example.com' ``` ## Using Data Mesh concepts ### Searching for products in Data Catalog Data Catalog searches are performed within scopes - either organization, folder or project. Several scripts mentioned below use organization as the scope. You would need to define `TF_VAR_org_id` environment variable before you can run them. You can find the organization id you belong to by running ```shell gcloud organizations list ``` To search for data products using Data Catalog APIs, execute this script from the root of this repo: ```shell bin/search-products.sh ``` Search terms must conform to [the Data Catalog search syntax](https://cloud.google.com/data-catalog/docs/how-to/search-reference) . A simple search term which works for this repo is `events`. It will find several data products defined in the domain project that contain the term " events" in their names or descriptions. You "undo" the data product registration in the data mesh catalog by running this script: ```shell terraform -chdir=infrastructure/incremental-changes/make-products-searchable destroy ``` A subsequent search of the catalog will return no results. To make the products searchable again use: ```shell terraform -chdir=infrastructure/incremental-changes/make-products-searchable apply ``` Also, compare this to searching the data catalog across the whole organization: ```shell bin/search-all-org-assets.sh ``` Depending on your organization and permissions granted to you to access the data you will see a large number of Data Catalog entries, most of which are not ready to be shared by the product owners with possible consumers. ### Consuming product data Once the data product is discovered and access to it is granted to a consumer (a service account or a user group), the product can be consumed. #### Authorized views In this case, access to the authorized BigQuery dataset has been granted to a Service Account in the consumer project. [bin](bin) folder has several scripts to test how access works: * `bin/populate-base-table.sh` creates several records in the base table * `bin/query-product.sh` uses a service account in the consumer project to query the table. You should see results similar to these: ``` +---------------+ | src_ip | +---------------+ | 670.345.234.1 | | 671.345.234.1 | | 672.345.234.1 | | 673.345.234.1 | | 674.345.234.1 | +---------------+ ``` * `bin/query-product-with-column-level-ac.sh` is similar to the previous script. It shows that column level access control can be enforced in the data mesh at the base table level. This query will fail with the `Access Denied...` error. When you run any of the `query-product*.sh` scripts, your gcloud authorization will be set to use the service account in the consumer project. This validates that the customer's service account had the intended access to query the product data. Use `bin/remove-consumer-sa-key.sh` script to delete the service account key created to run these scripts and make the originally active account active again. #### Consuming data streams [streaming-product](streaming-product) folder contains the instructions on how to create, discover and consume data which is exposed as a data stream. # Cleanup You can run `terraform destroy` in all the subfolders of [infrastructure](/infrastructure) to revert the changed to the projects. There is a convenience [destroy-all.sh](/infrastructure/destroy-all.sh) script that will do it for you. If you have created new projects for this demo you can just delete these projects.