# protoscribe

**Repository Path**: mirrors_google-research/protoscribe

## Basic Information

- **Project Name**: protoscribe
- **License**: Apache-2.0
- **Default Branch**: main

## Statistics

- **Created**: 2024-12-05
- **Last Updated**: 2026-03-29

## README

# ProtoScribe: Modeling the Evolution of Writing

![CI status](https://github.com/google-research/protoscribe/actions/workflows/tests.yaml/badge.svg?branch=main)
𓏞

𒁾𒊬

This repository contains the supporting code for experimenting with machine learning approaches to the evolution of writing.

NOTE: This is a work in progress.

## Installation and testing

The instructions in this section apply to Debian Linux, where this project was developed. Depending on your operating system, some of the installation steps in `setup.sh` may need to be amended. Ideally, the installation should happen in a Python virtual environment.

The installation is handled by the `setup.sh` script. Run

```shell
./setup.sh
```

from the root directory of the project. If all the dependencies are installed correctly, run the tests using `pytest`:

```shell
./test.sh
```

To see *all* the failing tests even if some of them cannot be loaded correctly, add the `--continue-on-collection-errors` flag to the `pytest` calls inside `test.sh`.

## The codebase

The important directories are:

* [corpus](protoscribe/corpus): Utilities for building and parsing the corpus.
* [data](protoscribe/data): Most of the components from which we generate our simulated dataset. For example, the various concept inventories are in [`concepts`](protoscribe/data/concepts/) and their corresponding numeric embeddings are in [`semantics`](protoscribe/data/semantics/). Other types of data include the various sets of SVGs for our glyphs.
* [evolution](protoscribe/evolution): Pipeline utilities and supporting scripts for simulating writing system evolution.
* [glyphs](protoscribe/glyphs): Libraries for dealing with SVGs and with discrete glyph vocabularies (see [`glyph_vocab.py`](protoscribe/glyphs/glyph_vocab.py)).
* [language](protoscribe/language): Linguistic modeling APIs:
  * [embeddings](protoscribe/language/embeddings): Various interfaces for (semantic) embeddings. The most relevant one is [`embedder.py`](protoscribe/language/embeddings/embedder.py). Our configuration defaults to BNC.
    In addition, we previously experimented with representing concepts by glosses (short snippets of Wikipedia text explaining what a thing is), which can then be encoded using a pretrained language model.
  * [phonology](protoscribe/language/phonology): Core phonological configurations for the synthetic languages, which use [PHOIBLE](https://github.com/phoible/) segment inventories, an implementation of phonetic word similarity, and a library for computing phonetic embeddings.
  * [morphology](protoscribe/language/morphology)/[syntax](protoscribe/language/syntax): Finite-state definitions of the morphology and syntax of the generated language, including the inflection paradigms and number grammars, as well as the core functionality for determining morpheme shape.
* [models](protoscribe/models): Discrete and stroke-based glyph synthesis models and their respective configurations.
* [semantics](protoscribe/semantics): Basic helper packages for representing knowledge about categories.
* [scoring](protoscribe/scoring): Tools for evaluating and scoring the resulting models.
* [sketches](protoscribe/sketches/utils): Libraries for manipulating and modeling glyphs as sketches. Includes core libraries for representing glyphs as sequences of (possibly quantized) strokes in our models.
* [speech](protoscribe/speech): Acoustic front-end components.
* [texts](protoscribe/texts): The libraries for constructing the actual "accounting documents".
* [vision](protoscribe/vision): Components for building and representing image features corresponding to semantic concepts.

## Disclaimers

This is not an official Google product.
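
## Appendix: checking for a virtual environment

The installation section recommends running `setup.sh` inside a Python virtual environment. As a minimal sketch, the snippet below checks whether the current interpreter is running inside one, using the standard `sys.prefix` / `sys.base_prefix` comparison. The helper name `in_virtualenv` is hypothetical and is not part of this repository.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment.

    Hypothetical helper for illustration; not part of protoscribe.
    """
    # Inside a virtual environment, sys.prefix points at the environment,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    if in_virtualenv():
        print(f"Virtual environment active: {sys.prefix}")
    else:
        print("No virtual environment detected; consider creating one "
              "before running ./setup.sh")
```

Running this with the same `python3` that will be used for the installation shows whether a virtual environment is active before `./setup.sh` is invoked.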