# mozaggregator2bq

**Repository Path**: mirrors_mozilla/mozaggregator2bq

## Basic Information

- **Project Name**: mozaggregator2bq
- **Description**: DEPRECATED - Scripts to dump mozaggregator to BigQuery
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-01-22
- **Last Updated**: 2026-03-07

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# mozaggregator2bq

A set of scripts for loading Firefox Telemetry aggregates into BigQuery. These aggregates power the Telemetry Dashboard and Evolution Viewer.

## Overview

Build the container and launch it:

```bash
docker-compose build
docker-compose run --rm app bash
```

### Interacting with the database

To start a psql session against the read-only replica of the production Postgres instance, run the following commands. Ensure that you have the appropriate AWS credentials.

```bash
source bin/export_postgres_credentials_s3

PGPASSWORD=$POSTGRES_PASS psql \
    --host="$POSTGRES_HOST" \
    --username="$POSTGRES_USER" \
    --dbname="$POSTGRES_DB"
```

An example query:

```sql
-- list all aggregates by build_id
select tablename from pg_catalog.pg_tables
where schemaname='public' and tablename like 'build_id%';
-- build_id_aurora_0_20130414
-- build_id_aurora_0_20150128
-- build_id_aurora_0_20150329
-- build_id_aurora_1_20130203
-- build_id_aurora_1_20150604
-- ...

-- list all aggregates by submission_date
select tablename from pg_catalog.pg_tables
where schemaname='public' and tablename like 'submission_date%';
-- submission_date_beta_1_20151027
-- submission_date_nightly_40_20151029
-- submission_date_beta_39_20151027
-- submission_date_nightly_1_20151025
-- submission_date_nightly_39_20151031
-- ...
```

### Database dumps by aggregate type and date

To start dumping data, run the following commands.

```bash
source bin/export_postgres_credentials_s3

time DATA_DIR=data AGGREGATE_TYPE=submission DS_NODASH=20191201 bin/pg_dump_by_day
# 23.92s user 1.97s system 39% cpu 1:05.48 total

time DATA_DIR=data AGGREGATE_TYPE=build_id DS_NODASH=20191201 bin/pg_dump_by_day
# 3.47s user 0.49s system 24% cpu 16.188 total
```

This should result in gzipped files in the following hierarchy.

```bash
data
├── [  96]  build_id
│   └── [ 128]  20191201
│       ├── [8.4M]  474306.dat.gz
│       └── [1.6K]  toc.dat
└── [  96]  submission
    └── [3.2K]  20191201
        ├── [ 74K]  474405.dat.gz
        ├── [ 48K]  474406.dat.gz
        ....
        ├── [1.8M]  474504.dat.gz
        └── [ 93K]  toc.dat

4 directories, 103 files
```

See the [`pg_dump` documentation](https://www.postgresql.org/docs/9.1/app-pgdump.html) for details on the file format.

```bash
$ gzip -cd data/submission/20191201/474405.dat.gz | head -n3
{"os": "Windows_NT", "child": "false", "label": "", "metric": "A11Y_INSTANTIATED_FLAG", "osVersion": "6.3", "application": "Firefox", "architecture": "x86"} {0,2,0,2,2}
{"os": "Windows_NT", "child": "false", "label": "", "metric": "A11Y_CONSUMERS", "osVersion": "6.3", "application": "Firefox", "architecture": "x86"} {0,0,0,0,0,0,0,0,0,0,2,0,20,2}
{"os": "Windows_NT", "child": "false", "label": "", "metric": "A11Y_ISIMPLEDOM_USAGE_FLAG", "osVersion": "6.3", "application": "Firefox", "architecture": "x86"} {2,0,0,0,2}
```
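Each `.dat.gz` file holds the raw `COPY` output for one aggregate table: a JSON dimensions column followed by the aggregate array. The sketch below is one minimal way to peek at a dump file from Python; it assumes the aggregate array is the last whitespace-separated token on each line, which matches the sample rows above, and the `parse_dump_line` helper is purely illustrative.

```python
import gzip
import itertools
import json

# Hypothetical helper: split one COPY line into its dimensions dict and the
# raw aggregate array. Assumes the aggregate array is the final
# whitespace-separated token, as in the sample rows above.
def parse_dump_line(line):
    dimensions_json, aggregate = line.rstrip("\n").rsplit(maxsplit=1)
    return json.loads(dimensions_json), aggregate

# Print the metric and aggregate for the first three rows of one dump file.
with gzip.open("data/submission/20191201/474405.dat.gz", "rt") as fh:
    for line in itertools.islice(fh, 3):
        dimensions, aggregate = parse_dump_line(line)
        print(dimensions["metric"], aggregate)
```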
### Running a notebook

Ensure that the data directory in the top-level directory matches the one used in the notebook, then run the following script.

```bash
bin/start-jupyter
```

This script can be modified to include various configuration parameters for Spark, including the default parallelism and the amount of executor memory.

### Processing `pg_dump` into Parquet

Run the following script to transform the data dumps into Parquet, where the JSON fields have been transformed into appropriate columns and arrays.

```bash
bin/submit-local bin/pg_dump_to_parquet.py \
    --input-dir data/submission_date/20191201 \
    --output-dir data/parquet/submission_date/20191201
```

### Running backfill

The `bin/backfill` script will dump data from the Postgres database, transform it into Parquet, and load it into a BigQuery table. The current schema for the table is as follows:

Field name | Type | Mode
-|-|-
ingest_date | DATE | REQUIRED
aggregate_type | STRING | NULLABLE
ds_nodash | STRING | NULLABLE
channel | STRING | NULLABLE
version | STRING | NULLABLE
os | STRING | NULLABLE
child | STRING | NULLABLE
label | STRING | NULLABLE
metric | STRING | NULLABLE
osVersion | STRING | NULLABLE
application | STRING | NULLABLE
architecture | STRING | NULLABLE
aggregate | STRING | NULLABLE

There is one table for the build id aggregates and one for the submission date aggregates. The build ids are truncated to the nearest date.

It may be useful to use a small entry-point script that backfills one month at a time:

```bash
#!/bin/bash
set -x

start=2015-06-01
end=2020-04-01
while ! [[ $start > $end ]]; do
    rm -r data
    up_to=$(date -d "$start + 1 month" +%F)
    START_DS=$start END_DS=$up_to bash -x bin/backfill
    start=$up_to
done
```
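Once a backfill finishes, the loaded rows can be spot-checked directly in BigQuery. The sketch below uses the `google-cloud-bigquery` Python client; the project, dataset, and table names are hypothetical placeholders, since the actual destination is whatever `bin/backfill` is configured to write to.

```python
from google.cloud import bigquery

# Hypothetical destination table; substitute the project, dataset, and table
# that bin/backfill writes to in your environment.
TABLE = "my-project.mozaggregator.submission_date_aggregates"

client = bigquery.Client()
query = f"""
    SELECT metric, os, COUNT(*) AS n_rows
    FROM `{TABLE}`
    WHERE ds_nodash = '20191201'
      AND channel = 'nightly'
    GROUP BY metric, os
    ORDER BY n_rows DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.metric, row.os, row.n_rows)
```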