# ursprung **Repository Path**: mirrors_ibm/ursprung ## Basic Information - **Project Name**: ursprung - **Description**: Repository for the Ursprung Provenance Collection System - **Primary Language**: Unknown - **License**: Apache-2.0 - **Default Branch**: master - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2020-11-23 - **Last Updated**: 2025-08-23 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # The Ursprung Provenance Collection System The Ursprung provenance collection system is a flexible provenance collection framework + a GUI for tracking machine learning and data science experiments and pipelines in a cluster. The collection framework combines low-level provenance information from system sources (operating and file system) with application-specific provenance that can be collected through rules in Ursprung's rule language. The GUI allows users to navigate the provenance graph and has additional features to view and compare past pipeline executions. Ursprung is currently only a research prototype and in pre-alpha. ## Architecture overview Ursprung consists of six main components: 1. The provenance consumers 2. The provenance GUI 3. The provenance database 4. The `provd` provenance daemons 5. An [auditd](https://man7.org/linux/man-pages/man8/auditd.8.html) pluging to collect operating system events through Linux's auditing subsystem 6. A Kafka message queue The consumers, GUI, and database run on the master node while the `provd` daemons and the `auditd` plugin run on the cluster worker nodes from which provenance should be collected. Below is an overview of how the different components interact with each other. ![Ursprung Architecture](doc/architecture.svg) ## Prerequisites To run Ursprung in your cluster, you need: - Linux nodes (tested with CentOS and RHEL 8) - A Kafka deployment - A Spectrum Scale file system with support for Watch Folder (version 5.0.1 or later) - Docker (or podman) on the master node ## Building the System To build Ursprung, clone this repository to a directory on your master node and `cd` into the cloned directory. It is recommended to run the master node on a separate (virtual) machine where provenance collection is not required as otherwise, provenance of the Ursprung system itself will be collected. **Building the master node components** Ursprung's main components are containerized and can be built through Docker (also tested with podman). All Dockerfiles are located in `deployment`. Before building the actual components the base image needs to be created through ``` cd deployment docker build -f Dockerfile.ursprung.build-base -t ursprung-base ../ ``` After building the base image, you can build the database image and the collection-system image through ``` docker build -f Dockerfile.ursprung.db -t ursprung-db . docker build -f Dockerfile.ursprung.collection-system -t ursprung-collection-system ../ ``` Currently, the collection-system image contains both the consumer binaries and the `provd` binary. Before builing the GUI image, you need to create a `.env` file under `gui/backend` with the following default content ``` PORT=3100 DSN=ursprung-db HG_REPO=/opt/ursprung/contenttracking TIME_ERR=1000 ``` The default content can be copied as is unless you are using your own database/want the repository for content tracking under a different location. Once the `.env` file has been created, build the GUI container image through ``` docker build -f Dockerfile.ursprung.gui -t ursprung-gui ../ ``` **Building the auditd plugin** The `auditd` plugin needs to be available on all cluster node where provenance should be tracked. You can either build the plugin on one node and copy the binary to the other nodes (given that the necessary dependencies are installed on these nodes) or build it manually on each cluster node. To build the plugin, install the Develoment Tools and the auditd and unixodbc dependencies (the instructions are for CentOS 8). Note that building the plugin also requires `cmake` version 3.13 or higher. The default cmake version in CentOS 8 is 3.11. You can download later versions manually from [here](https://cmake.org/download/). ``` yum groupinstall 'Development Tools' yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm yum install audit-libs-devel unixODBC-devel rapidjson-devel ``` The plugin also depends on `librdkafka`. To install `librdkafka`, run the following steps in a directory of your choice (note that for running `make install` you need to be root). ``` wget https://github.com/edenhill/librdkafka/archive/v1.4.2.tar.gz tar xzvf v1.4.2.tar.gz cd librdkafka-1.4.2 ./configure make make install ``` Once the dependencies are installed, run the following commands from the cloned repository ``` mkdir -p collection-system/build mkdir -p collection-system/lib cd collection-system/lib git clone https://github.com/google/googletest.git cd ../build cmake3 -DCMAKE_BUILD_TYPE=Debug -DBUILD_TESTS=1 -DINFO=1 ../ cd auditd-plugin make ``` ## Deploying Ursprung To deploy and run Ursprung, you first need to prepare the master node and then setup the auditd plugin on the rest of the cluster. **Preparing the master node** First, create the following directories on your master node ``` mkdir -p /opt/ursprung mkdir -p /opt/ursprung/config mkdir -p /opt/ursprung/contenttracking mkdir -p /opt/ursprung/data mkdir -p /opt/ursprung/rules ``` Next, create a configuration file for both the `auditd` and `scale` consumer from the templates available in the repository under `deployment/config`. You do not need to change the database configuration if you're using the database as created in this instruction. If you're using your own database, you need to set up an ODBC DSN for it and specify the corresponding details in the consumer configuration. If you're using the default database, you just need to specify your Kafka brokers and set any authentication details (if required). If your Kafka deployment does not have authentication set, remove these options from the config template. You can leave the default topics but make sure to create the topics in your Kafka. A quick way of deploying a single node Kafka instance is through Docker ``` docker run -p 2181:2181 -p 9092:9092 --name kafka --env ADVERTISED_HOST=your-master-node --env ADVERTISED_PORT=9092 spotify/kafka ``` This starts a Kafka broker (port 9092) and a Zookeeper instance (port 2181) on your master node. You can create topics by logging in to the container and running the following commands ``` docker exec -it kafka /bin/bash cd /opt/kafka_2.11-0.10.1.0/bin ./kafka-topics.sh --create --topic gpfs --partitions 1 --zookeeper localhost:2181 --replication-factor 1 ./kafka-topics.sh --create --topic auditd --partitions 1 --zookeeper localhost:2181 --replication-factor 1 ``` Copy the consumer configuration template files to `/opt/ursprung/config` and adapt them. Then start the individual components through the following commands ``` docker run --name ursprung-db -v /opt/ursprung/data:/var/lib/postgresql/data:z -p 5432:5432 -it ursprung-db docker run --name ursprung-scale-consumer -v /opt/ursprung/:/opt/ursprung/ --network host -it ursprung-collection-system /opt/collection-system/build/consumer/prov-consumer -c /opt/ursprung/config/scale-consumer.cfg docker run --name ursprung-auditd-consumer -v /opt/ursprung/:/opt/ursprung/ --network host -it ursprung-collection-system /opt/collection-system/build/consumer/prov-consumer -c /opt/ursprung/config/auditd-consumer.cfg docker run --name ursprung-gui-backend -v /opt/ursprung/:/opt/ursprung/ --network host -it ursprung-gui node /opt/gui/backend/app.js docker run --name ursprung-gui-frontend -p 3000:3000 -it ursprung-gui /bin/bash -c "cd /opt/gui/frontend; npm start" ``` The GUI should now be available on your master node through a browser under at `http://localhost:3000`. Note that Ursprung will generate the provenance database (Postgres) under `/opt/ursprung/data` and automatically create the necessary schema. The data is hence persisted across restarts of the container. **Preparing the worker nodes** To set up the `auditd` plugin, update the plugin configuration template under `deployment/config/auditd-plugin.cfg.template` with your cluster's Kafka information and then run the following commands to copy the plugin and the necessary configurations to `auditd`'s plugin folder. Note that the following instructions are for auditd version 3.0 and later in which the plugin system has been restructured. Previously, plugins were managed under `/etc/audisp`. If you're running an older version of auditd, make sure to copy the files to the right locations. As root, run the following commands ``` mkdir -p /etc/audit/plugins.d/plugins cp collection-system/build/auditd-plugin/auditd-plugin /etc/audit/plugins.d/plugins cp deployment/config/auditd-plugin.conf.template /etc/audit/plugins.d/auditd-plugin.conf cp deployment/config/auditd-plugin.cfg.template /etc/audit/plugins.d/plugins/auditd-plugin.cfg ``` You should also update the `auditd` configuration for a more robust event delivery. ``` mv /etc/audit/auditd.conf /etc/audit/auditd.conf.bak cp deployment/auditd/auditd3.0.conf /etc/audit/ ``` When you start `auditd` through `service auditd start`, you should see the following log output in `syslog`, which indicates that the plugin has been successfully loaded. ``` audit dispatcher initialized with q_depth=99999 and 1 active plugins ``` Note that on CentOS, you might see 2 active plugins as the `sedispatch` auditd plugin might be enabled by default. You may also have to set SELinux into permissive mode (or disable it) if you get errors that prevent auditd from accessing the plugin executable. To temporarily set SELinux to permissive mode, run ``` setenforce 0 ``` **TODO: add instructions for older `auditd` versions** ## Collecting Provenance **Collecting basic provenance on process/file interactions** To collect provenance, you need to set up collection rules for `auditd` and Spectrum Scale. For `auditd` just copy `deployment/auditd/ursprung.rules` to `/etc/audit/rules.d`. For Spectrum Scale, edit `deployment/scale/ursprung-watch.cfg.template` and add your broker information. Then run `mmwatch fs0 enable -F ursprung-watch.cfg ` to set up Watch Folders. You should now be able to collect basic provenance from your system and see interactions of processes with files on the Spectrum Scale file system. **Collecting application-specific provenance through rules** **TODO** ## Exploring the Provenance **TODO**