pyannote-database

Reproducible experimental protocols for multimedia (audio, video, text) databases.

$ pip install pyannote.database

Definitions

In pyannote.database jargon, a resource can be any multimedia entity (e.g. an image, an audio file, a video file, or a webpage). In its simplest form, it is modeled as a pyannote.database.ProtocolFile instance (basically a dict on steroids) with a uri key (URI stands for unique resource identifier) that identifies the entity.

Metadata may be associated with a resource by adding keys to its ProtocolFile. For instance, one could add a label key to an image resource describing whether it depicts a chihuahua or a muffin.

A database is a collection of resources of the same nature (e.g. a collection of audio files). It is modeled as a pyannote.database.Database instance.

An experimental protocol (pyannote.database.Protocol) usually defines three subsets:

a train subset (e.g. used to train a neural network),
a development subset (e.g. used to tune hyper-parameters),
a test subset (e.g. used for evaluation).
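
To make these definitions concrete, here is a minimal sketch that uses plain Python objects as stand-ins. ToyProtocol and the example URIs are hypothetical: they only mimic the shape of a protocol (three subsets of dict-like resources) and are not the library's classes.

```python
from typing import Dict, Iterator, List


class ToyProtocol:
    """Hypothetical stand-in for a protocol: three named subsets of resources."""

    def __init__(self, subsets: Dict[str, List[dict]]):
        self._subsets = subsets

    def train(self) -> Iterator[dict]:
        yield from self._subsets.get("train", [])

    def development(self) -> Iterator[dict]:
        yield from self._subsets.get("development", [])

    def test(self) -> Iterator[dict]:
        yield from self._subsets.get("test", [])


# A resource is a dict-like object with at least a "uri" key;
# metadata is attached simply by adding more keys (here, "label").
toy_protocol = ToyProtocol(
    {
        "train": [{"uri": "image1", "label": "chihuahua"}],
        "development": [{"uri": "image2", "label": "muffin"}],
        "test": [{"uri": "image3"}],
    }
)

train_uris = [resource["uri"] for resource in toy_protocol.train()]
```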

Configuration file

Experimental protocols are defined via YAML configuration files:

Protocols:
  MyDatabase:
    Protocol:
      MyProtocol:
        train:
            uri: /path/to/train.lst
        development:
            uri: /path/to/development.lst
        test:
            uri: /path/to/test.lst

where /path/to/train.lst contains the list of unique resource identifiers (URIs) of the files in the train subset:

# /path/to/train.lst
filename1
filename2
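
As an illustration of how such a list file maps to resources, the sketch below reads one URI per line and wraps each in a dict. The helper name iter_uris is hypothetical, and skipping blank lines is an assumption of this sketch rather than documented library behavior.

```python
import tempfile
from pathlib import Path
from typing import Iterator


def iter_uris(list_path: Path) -> Iterator[str]:
    """Yield one URI per line of a list file (hypothetical helper)."""
    with open(list_path, encoding="utf-8") as lines:
        for line in lines:
            uri = line.strip()
            # assumption of this sketch: blank lines carry no URI and are skipped
            if uri:
                yield uri


with tempfile.TemporaryDirectory() as tmp_dir:
    train_lst = Path(tmp_dir) / "train.lst"
    train_lst.write_text("filename1\nfilename2\n", encoding="utf-8")
    # each URI becomes a minimal resource with a single "uri" key
    resources = [{"uri": uri} for uri in iter_uris(train_lst)]
```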

Since version 5.0, configuration files must be loaded into the registry as follows:

from pyannote.database import registry
registry.load_database("/path/to/database.yml")

registry.load_database takes an optional mode keyword argument that controls what to do when loading a protocol whose name (e.g. MyDatabase.Protocol.MyProtocol) is already used by another protocol:

LoadingMode.OVERRIDE to replace the existing protocol with the new one (default behavior);
LoadingMode.KEEP to keep the existing protocol;
LoadingMode.ERROR to raise a RuntimeError when such a conflict occurs.
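
The three conflict policies can be sketched with a toy registry. ToyLoadingMode, register, and toy_registry are hypothetical names used for illustration only; the built-in RuntimeError stands in for the exception raised on conflict, and none of this is the library's actual implementation.

```python
from enum import Enum
from typing import Dict


class ToyLoadingMode(Enum):
    OVERRIDE = 0
    KEEP = 1
    ERROR = 2


def register(
    protocols: Dict[str, str],
    name: str,
    source: str,
    mode: ToyLoadingMode = ToyLoadingMode.OVERRIDE,
) -> None:
    """Register `name` as coming from `source`, applying `mode` on name conflicts."""
    if name in protocols:
        if mode is ToyLoadingMode.KEEP:
            return  # keep the existing protocol untouched
        if mode is ToyLoadingMode.ERROR:
            raise RuntimeError(f"Protocol {name} is already registered.")
    # new name, or OVERRIDE: the latest definition wins
    protocols[name] = source


toy_registry: Dict[str, str] = {}
name = "MyDatabase.Protocol.MyProtocol"
register(toy_registry, name, "first.yml")
register(toy_registry, name, "second.yml", mode=ToyLoadingMode.KEEP)
kept = toy_registry[name]
register(toy_registry, name, "third.yml")  # default mode is OVERRIDE
overridden = toy_registry[name]
```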

For backward compatibility with the 4.x branch, the following configuration files are loaded automatically when importing pyannote.database, in that order:

~/.pyannote/database.yml
database.yml in current working directory
list of ;-separated path(s) in the PYANNOTE_DATABASE_CONFIG environment variable (e.g. /absolute/path.yml;relative/path.yml)
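
The lookup order above can be sketched as a function that builds the list of candidate files. The function name candidate_config_files is hypothetical, and details such as expanding ~ in environment paths are assumptions of this sketch; it only illustrates the ordering, not how the library loads the files.

```python
import os
from pathlib import Path
from typing import List, Mapping


def candidate_config_files(environ: Mapping[str, str] = os.environ) -> List[Path]:
    """Return configuration file candidates in the order they would be loaded."""
    candidates = [
        Path.home() / ".pyannote" / "database.yml",  # 1. user-level file
        Path.cwd() / "database.yml",                 # 2. current working directory
    ]
    # 3. `;`-separated paths from the environment variable, in the given order
    for path in environ.get("PYANNOTE_DATABASE_CONFIG", "").split(";"):
        if path:
            candidates.append(Path(path).expanduser())
    return candidates


example = candidate_config_files(
    {"PYANNOTE_DATABASE_CONFIG": "/absolute/path.yml;relative/path.yml"}
)
```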

Once loaded in the registry, protocols can be used in Python like this:

from pyannote.database import registry
registry.load_database("/path/to/database.yml")

protocol = registry.get_protocol('MyDatabase.Protocol.MyProtocol')
for resource in protocol.train():
    print(resource["uri"])
filename1
filename2

Paths defined in the configuration file can be absolute or relative to the directory containing the configuration file. For instance, the following file organization should work just fine:

.
├── database.yml
└── lists
    └── train.lst

with the content of database.yml as follows:

Protocols:
  MyDatabase:
    Protocol:
      MyProtocol:
        train:
            uri: lists/train.lst
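
The resolution rule (absolute paths are kept, relative paths are interpreted from the directory containing the configuration file) might look like the following sketch. resolve_path and the /data/corpus location are hypothetical, and POSIX-style paths are assumed.

```python
from pathlib import Path


def resolve_path(config_file: Path, value: str) -> Path:
    """Resolve a path found in a configuration file (hypothetical helper)."""
    path = Path(value).expanduser()
    if path.is_absolute():
        return path
    # relative paths are anchored at the directory containing the config file
    return config_file.parent / path


config_file = Path("/data/corpus/database.yml")
relative = resolve_path(config_file, "lists/train.lst")
absolute = resolve_path(config_file, "/path/to/train.lst")
```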

Data loaders

The above MyDatabase.Protocol.MyProtocol protocol is not very useful as it only allows iterating over a list of resources with a single 'uri' key. Metadata can be added to each resource with the following syntax:

Protocols:
  MyDatabase:
    Protocol:
      MyProtocol:
        train:
            uri: lists/train.lst
            speaker: rttms/train.rttm
            transcription: ctms/{uri}.ctm

and the following directory structure:

.
├── database.yml
├── lists
|   └── train.lst
├── rttms
|   └── train.rttm
└── ctms
    ├── filename1.ctm
    └── filename2.ctm

Now, resources have both 'speaker' and 'transcription' keys:

import pyannote.core
import spacy
from pyannote.database import registry
protocol = registry.get_protocol('MyDatabase.Protocol.MyProtocol')

for resource in protocol.train():
    assert "speaker" in resource
    assert isinstance(resource["speaker"], pyannote.core.Annotation)
    assert "transcription" in resource
    assert isinstance(resource["transcription"], spacy.tokens.Doc)

What happened exactly? Data loaders were automatically selected based on metadata file suffix:

pyannote.database.loader.RTTMLoader for the speaker entry with .rttm suffix,
pyannote.database.loader.CTMLoader for the transcription entry with .ctm suffix,

and used to populate speaker and transcription keys. In pseudo-code:

# instantiate loader registered with `.rttm` suffix
speaker = RTTMLoader('rttms/train.rttm')

# entries with {placeholders} serve as path templates
transcription_template = 'ctms/{uri}.ctm'

for resource in protocol.train():
    # unique resource identifier
    uri = resource['uri']

    # only select parts of `rttms/train.rttm` that are relevant to current resource,
    # convert it into a convenient data structure (here pyannote.core.Annotation), 
    # and assign it to `'speaker'` resource key 
    resource['speaker'] = speaker[uri]

    # replace placeholders in `transcription` path template
    ctm = transcription_template.format(uri=uri)

    # instantiate loader registered with `.ctm` suffix
    transcription = CTMLoader(ctm)

    # only select parts of the `ctms/{uri}.ctm` that are relevant to current resource
    # (here, most likely the whole file), convert it into a convenient data structure
    # (here spacy.tokens.Doc), and assign it to `'transcription'` resource key 
    resource['transcription'] = transcription[uri]

pyannote.database provides built-in data loaders for a limited set of file formats: RTTMLoader for .rttm files, UEMLoader for .uem files, and CTMLoader for .ctm files. See Custom data loaders section to learn how to add your own.
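
To illustrate suffix-based loader selection and per-URI filtering without depending on the library, the sketch below parses RTTM text with the standard library only. parse_rttm, select_loader, and TOY_LOADERS are hypothetical names, and the plain (start, end, speaker) tuples are a simplification of the pyannote.core.Annotation that the real loader returns.

```python
from collections import defaultdict
from pathlib import Path
from typing import Callable, Dict, List, Tuple

Segment = Tuple[float, float, str]  # (start, end, speaker)


def parse_rttm(text: str) -> Dict[str, List[Segment]]:
    """Group RTTM speaker turns by URI (the second field of each line)."""
    turns: Dict[str, List[Segment]] = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] != "SPEAKER":
            continue
        # RTTM fields: type, file id, channel, onset, duration, ..., speaker name
        uri, start, duration = fields[1], float(fields[3]), float(fields[4])
        turns[uri].append((start, start + duration, fields[7]))
    return dict(turns)


# hypothetical suffix -> parser table, mimicking suffix-based loader selection
TOY_LOADERS: Dict[str, Callable[[str], Dict[str, List[Segment]]]] = {
    ".rttm": parse_rttm,
}


def select_loader(path: str) -> Callable[[str], Dict[str, List[Segment]]]:
    return TOY_LOADERS[Path(path).suffix]


rttm_text = (
    "SPEAKER filename1 1 0.0 2.5 <NA> <NA> alice <NA> <NA>\n"
    "SPEAKER filename2 1 1.0 3.0 <NA> <NA> bob <NA> <NA>\n"
    "SPEAKER filename1 1 2.5 1.5 <NA> <NA> bob <NA> <NA>\n"
)
loader = select_loader("rttms/train.rttm")
speaker_turns = loader(rttm_text)
# only the turns relevant to a given resource are kept for that resource
filename1_turns = speaker_turns["filename1"]
```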

Preprocessors

When iterating over a protocol subset (e.g. using for resource in protocol.train()), resources are provided as instances of pyannote.database.ProtocolFile, which are basically dict instances whose values are computed lazily.

For instance, in the code above, the value returned by resource['speaker'] is only computed the first time it is accessed and then cached for all subsequent calls. See Custom data loaders section for more details.

Similarly, resources can be augmented (or modified) on-the-fly with the preprocessors option of get_protocol. In the example below, a dummy key is added that simply returns the length of the uri string:

from pyannote.database import ProtocolFile, registry

def compute_dummy(resource: ProtocolFile) -> int:
    print("Computing 'dummy' key")
    return len(resource["uri"])

protocol = registry.get_protocol('Etape.SpeakerDiarization.TV', 
                                 preprocessors={"dummy": compute_dummy})
resource = next(protocol.train())
resource["dummy"]
Computing 'dummy' key
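
The compute-once-then-cache behavior can be imitated with a small dict subclass. LazyFile is a hypothetical class written for this illustration, not the actual ProtocolFile implementation; it only shows that a preprocessor runs on first access and that its result is reused afterwards.

```python
from typing import Any, Callable, Dict, List


class LazyFile(dict):
    """Hypothetical dict whose extra keys are computed on first access, then cached."""

    def __init__(
        self,
        precomputed: Dict[str, Any],
        lazy: Dict[str, Callable[["LazyFile"], Any]],
    ):
        super().__init__(precomputed)
        self._lazy = dict(lazy)

    def __missing__(self, key: str) -> Any:
        # dict.__getitem__ calls this only when `key` is not stored yet
        if key not in self._lazy:
            raise KeyError(key)
        value = self._lazy[key](self)
        self[key] = value  # cache so later accesses skip the computation
        return value


calls: List[str] = []


def compute_dummy(resource: LazyFile) -> int:
    calls.append(resource["uri"])  # record each actual computation
    return len(resource["uri"])


resource = LazyFile({"uri": "filename1"}, lazy={"dummy": compute_dummy})
first = resource["dummy"]   # triggers compute_dummy
second = resource["dummy"]  # served from the cache
```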

FileFinder

A special case of preprocessor is pyannote.database.FileFinder, which is meant to automatically locate the media file associated with the uri.

Say audio files are available at the following paths:

.
└── /path/to
    └── audio
        ├── filename1.wav
        ├── filename2.mp3
        ├── filename3.wav
        ├── filename4.wav
        └── filename5.mp3

The FileFinder preprocessor relies on a Databases: section that should be added to the database.yml configuration file and indicates where to look for media files (using resource key placeholders):

Databases:
  MyDatabase: 
    - /path/to/audio/{uri}.wav
    - /path/to/audio/{uri}.mp3

Protocols:
  MyDatabase:
    Protocol:
      MyProtocol:
        train:
            uri: lists/train.lst

Note that any pattern supported by pathlib.Path.glob is supported (but avoid ** as much as possible). Paths can also be relative to the location of database.yml. FileFinder will then do its best to locate the file at runtime:

from pyannote.database import registry
from pyannote.database import FileFinder
protocol = registry.get_protocol('MyDatabase.Protocol.MyProtocol',
                                 preprocessors={"audio": FileFinder()})
for resource in protocol.train():
    print(resource["audio"])
/path/to/audio/filename1.wav
/path/to/audio/filename2.mp3
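
A rough idea of the lookup is sketched below: each path template is filled in with the resource keys and then globbed. find_file is a hypothetical helper, not the FileFinder implementation, and the error handling (failing when zero or several files match) is an assumption of this sketch.

```python
import tempfile
from pathlib import Path
from typing import List


def find_file(templates: List[str], resource: dict) -> Path:
    """Locate the single file matching any of the path templates (hypothetical)."""
    found: List[Path] = []
    for template in templates:
        pattern = Path(template.format(**resource))
        # simplification: wildcards are only honored in the file name part
        found.extend(sorted(pattern.parent.glob(pattern.name)))
    if not found:
        raise FileNotFoundError(f"No file found for {resource['uri']}.")
    if len(found) > 1:
        raise ValueError(f"Several files found for {resource['uri']}: {found}")
    return found[0]


with tempfile.TemporaryDirectory() as tmp_dir:
    audio_dir = Path(tmp_dir) / "audio"
    audio_dir.mkdir()
    (audio_dir / "filename1.wav").touch()
    (audio_dir / "filename2.mp3").touch()
    # doubled braces keep the literal `{uri}` placeholder in the f-string
    templates = [f"{audio_dir}/{{uri}}.wav", f"{audio_dir}/{{uri}}.mp3"]
    found_names = [
        find_file(templates, {"uri": uri}).name
        for uri in ("filename1", "filename2")
    ]
```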

Tasks

Collections

A raw collection of files (i.e. without any train/development/test split) can be defined using the Collection task:

# ~
