pyannote-database
Interface to multimedia databases and experimental protocols
Reproducible experimental protocols for multimedia (audio, video, text) databases.
$ pip install pyannote.database
Definitions
In pyannote.database jargon, a resource can be any multimedia entity (e.g. an image, an audio file, a video file, or a webpage). In its simplest form, it is modeled as a pyannote.database.ProtocolFile instance (basically a dict on steroids) with a uri key (URI stands for unique resource identifier) that identifies the entity.
Metadata may be associated with a resource by adding keys to its ProtocolFile. For instance, one could add a label key to an image resource describing whether it depicts a chihuahua or a muffin.
A database is a collection of resources of the same nature (e.g. a collection of audio files). It is modeled as a pyannote.database.Database instance.
An experimental protocol (pyannote.database.Protocol) usually defines three subsets:
- a train subset (e.g. used to train a neural network),
- a development subset (e.g. used to tune hyper-parameters),
- a test subset (e.g. used for evaluation).
Configuration file
Experimental protocols are defined via YAML configuration files:
Protocols:
  MyDatabase:
    Protocol:
      MyProtocol:
        train:
          uri: /path/to/train.lst
        development:
          uri: /path/to/development.lst
        test:
          uri: /path/to/test.lst
where /path/to/train.lst contains the list of unique resource identifiers (URIs) of the
files in the train subset:
# /path/to/train.lst
filename1
filename2
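To make the list file format concrete, the sketch below writes a two-line train.lst to a temporary directory and reads it back. The read_uris helper is hypothetical (it is not part of pyannote.database), and skipping blank lines is an assumption of this sketch.

```python
import tempfile
from pathlib import Path
from typing import List


def read_uris(list_file: Path) -> List[str]:
    """Return one URI per non-empty line (hypothetical helper, not the library's API)."""
    with open(list_file) as f:
        return [line.strip() for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    lst = Path(tmp) / "train.lst"
    lst.write_text("filename1\nfilename2\n")
    uris = read_uris(lst)

print(uris)
```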
Since version 5.0, configuration files must be loaded into the registry as follows:
from pyannote.database import registry
registry.load_database("/path/to/database.yml")
registry.load_database takes an optional mode keyword argument that controls what to do when loading a protocol whose name (e.g. MyDatabase.Protocol.MyProtocol) is already used by another protocol:
- LoadingMode.OVERRIDE to override existing protocol by the new one (default behavior);
- LoadingMode.KEEP to keep existing protocol;
- LoadingMode.ERROR to raise a RuntimeException when such a conflict occurs.
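The three conflict-resolution behaviors can be illustrated with a toy registry. This is only a sketch of the semantics listed above, not the library's implementation: Mode and register are hypothetical names, the registry is a plain dict, and the exception type raised here (RuntimeError) is an assumption.

```python
from enum import Enum


class Mode(Enum):
    # Hypothetical stand-in for pyannote.database's LoadingMode
    OVERRIDE = "override"
    KEEP = "keep"
    ERROR = "error"


def register(protocols: dict, name: str, protocol: object, mode: Mode = Mode.OVERRIDE) -> None:
    """Add `protocol` under `name`, resolving name conflicts according to `mode`."""
    if name in protocols:
        if mode is Mode.KEEP:
            return  # keep the existing protocol untouched
        if mode is Mode.ERROR:
            # the exact exception type raised by the library is an assumption here
            raise RuntimeError(f"protocol {name!r} is already registered")
    protocols[name] = protocol  # no conflict, or OVERRIDE


protocols = {}
name = "MyDatabase.Protocol.MyProtocol"
register(protocols, name, "first")
register(protocols, name, "second", mode=Mode.KEEP)
kept = protocols[name]  # the existing entry was kept
register(protocols, name, "third")  # default mode is OVERRIDE
overridden = protocols[name]  # the new entry replaced the old one
```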
For backward compatibility with 4.x branch, the following configuration files are loaded automatically when importing pyannote.database, in that order:
- ~/.pyannote/database.yml
- database.yml in current working directory
- list of ;-separated path(s) in the PYANNOTE_DATABASE_CONFIG environment variable (e.g. /absolute/path.yml;relative/path.yml)
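The search order can be sketched as follows. The candidate_config_files helper is hypothetical and only builds the ordered list of candidate paths; whether and how the library checks that each file exists is not modeled here.

```python
import os
from pathlib import Path
from typing import List, Optional


def candidate_config_files(env_value: Optional[str] = None) -> List[Path]:
    """List candidate configuration files in loading order (hypothetical helper)."""
    if env_value is None:
        env_value = os.environ.get("PYANNOTE_DATABASE_CONFIG", "")
    candidates = [
        Path.home() / ".pyannote" / "database.yml",
        Path.cwd() / "database.yml",
    ]
    # the environment variable holds a ;-separated list of paths
    candidates += [Path(p) for p in env_value.split(";") if p]
    return candidates


paths = candidate_config_files("/absolute/path.yml;relative/path.yml")
```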
Once loaded in the registry, protocols can be used in Python like this:
from pyannote.database import registry
registry.load_database("/path/to/database.yml")
protocol = registry.get_protocol('MyDatabase.Protocol.MyProtocol')
for resource in protocol.train():
    print(resource["uri"])
filename1
filename2
Paths defined in the configuration file can be absolute or relative to the directory containing the configuration file. For instance, the following file organization should work just fine:
.
├── database.yml
└── lists
    └── train.lst
with the content of database.yml as follows:
Protocols:
  MyDatabase:
    Protocol:
      MyProtocol:
        train:
          uri: lists/train.lst
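The path resolution rule can be sketched with pathlib. The resolve helper and the /data/corpus location are hypothetical, and the example assumes POSIX-style paths; the library's own resolution logic may differ in its details.

```python
from pathlib import Path


def resolve(config_file: Path, entry: str) -> Path:
    """Resolve `entry` relative to the directory containing `config_file`.

    Hypothetical helper: absolute entries are returned unchanged.
    """
    path = Path(entry)
    return path if path.is_absolute() else config_file.parent / path


config = Path("/data/corpus/database.yml")
relative = resolve(config, "lists/train.lst")
absolute = resolve(config, "/path/to/train.lst")
```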
Data loaders
The above MyDatabase.Protocol.MyProtocol protocol is not very useful as it only allows iterating over a list of resources with a single 'uri' key. Metadata can be added to each resource with the following syntax:
Protocols:
  MyDatabase:
    Protocol:
      MyProtocol:
        train:
          uri: lists/train.lst
          speaker: rttms/train.rttm
          transcription: ctms/{uri}.ctm
and the following directory structure:
.
├── database.yml
├── lists
|   └── train.lst
├── rttms
|   └── train.rttm
└── ctms
    ├── filename1.ctm
    └── filename2.ctm
Now, resources have both 'speaker' and 'transcription' keys:
import pyannote.core
import spacy.tokens
from pyannote.database import registry
protocol = registry.get_protocol('MyDatabase.Protocol.MyProtocol')
for resource in protocol.train():
    assert "speaker" in resource
    assert isinstance(resource["speaker"], pyannote.core.Annotation)
    assert "transcription" in resource
    assert isinstance(resource["transcription"], spacy.tokens.Doc)
What happened exactly? Data loaders were automatically selected based on metadata file suffix:
- pyannote.database.loader.RTTMLoader for speaker entry with .rttm suffix
- pyannote.database.loader.CTMLoader for transcription entry with .ctm suffix
and used to populate speaker and transcription keys. In pseudo-code:
# instantiate loader registered with `.rttm` suffix
speaker = RTTMLoader('rttms/train.rttm')
# entries with {placeholders} serve as path templates
transcription_template = 'ctms/{uri}.ctm'
for resource in protocol.train():
    # unique resource identifier
    uri = resource['uri']
    # only select parts of `rttms/train.rttm` that are relevant to current resource,
    # convert it into a convenient data structure (here pyannote.core.Annotation),
    # and assign it to `'speaker'` resource key
    resource['speaker'] = speaker[uri]
    # replace placeholders in `transcription` path template
    ctm = transcription_template.format(uri=uri)
    # instantiate loader registered with `.ctm` suffix
    transcription = CTMLoader(ctm)
    # only select parts of the `ctms/{uri}.ctm` that are relevant to current resource
    # (here, most likely the whole file), convert it into a convenient data structure
    # (here spacy.tokens.Doc), and assign it to `'transcription'` resource key
    resource['transcription'] = transcription[uri]
pyannote.database provides built-in data loaders for a limited set of file formats: RTTMLoader for .rttm files, UEMLoader for .uem files, and CTMLoader for .ctm files. See Custom data loaders section to learn how to add your own.
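The two mechanisms at play, loader selection by file suffix and per-resource path templates, can be isolated in a toy sketch. LOADERS maps suffixes to loader names as plain strings, and describe_entry is a hypothetical helper; no file is read and none of this is the library's actual code.

```python
from pathlib import Path

# Hypothetical stand-ins for loader classes, keyed by file suffix
LOADERS = {".rttm": "RTTMLoader", ".uem": "UEMLoader", ".ctm": "CTMLoader"}


def describe_entry(value: str, uri: str) -> tuple:
    """Return (loader name, resolved path) for one metadata entry and one resource."""
    loader = LOADERS[Path(value).suffix]
    # entries containing {placeholders} are path templates resolved per resource
    path = value.format(uri=uri) if "{" in value else value
    return loader, path


speaker = describe_entry("rttms/train.rttm", uri="filename1")
transcription = describe_entry("ctms/{uri}.ctm", uri="filename1")
```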
Preprocessors
When iterating over a protocol subset (e.g. using for resource in protocol.train()), resources are provided as instances of pyannote.database.ProtocolFile, which are basically dict instances whose values are computed lazily.
For instance, in the code above, the value returned by resource['speaker'] is only computed the first time it is accessed and then cached for all subsequent calls. See Custom data loaders section for more details.
Similarly, resources can be augmented (or modified) on-the-fly with the preprocessors option of get_protocol. In the example below, a dummy key is added that simply returns the length of the uri string:
from pyannote.database import ProtocolFile, registry

def compute_dummy(resource: ProtocolFile):
    # called lazily, the first time resource["dummy"] is accessed
    print("Computing 'dummy' key")
    return len(resource["uri"])

protocol = registry.get_protocol('Etape.SpeakerDiarization.TV',
                                 preprocessors={"dummy": compute_dummy})
resource = next(protocol.train())
resource["dummy"]
Computing 'dummy' key
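The lazy evaluation and caching behavior described in this section can be mimicked with a small dict subclass. LazyFile is a hypothetical toy class written for illustration; it is not how ProtocolFile is implemented and it only assumes that callable values take the resource as their single argument.

```python
class LazyFile(dict):
    """Toy dict whose callable values are computed on first access, then cached.

    Hypothetical illustration only, not the ProtocolFile implementation.
    """

    def __getitem__(self, key):
        value = super().__getitem__(key)
        if callable(value):
            value = value(self)  # compute the value from the resource itself
            self[key] = value    # cache it for subsequent accesses
        return value


calls = []


def compute_dummy(resource):
    calls.append("dummy")  # record each actual computation
    return len(resource["uri"])


resource = LazyFile(uri="filename1", dummy=compute_dummy)
first = resource["dummy"]   # triggers the computation
second = resource["dummy"]  # served from the cache
```

Accessing the key twice runs compute_dummy only once, which is the property that makes expensive preprocessors affordable.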
FileFinder
A special case of preprocessors is pyannote.database.FileFinder, which is meant to automatically locate the media file associated with the uri.
Say audio files are available at the following paths:
.
└── /path/to
    └── audio
        ├── filename1.wav
        ├── filename2.mp3
        ├── filename3.wav
        ├── filename4.wav
        └── filename5.mp3
The FileFinder preprocessor relies on a Databases: section that should be added to the database.yml configuration file and indicates where to look for media files (using resource key placeholders):
Databases:
  MyDatabase:
    - /path/to/audio/{uri}.wav
    - /path/to/audio/{uri}.mp3
Protocols:
  MyDatabase:
    Protocol:
      MyProtocol:
        train:
          uri: lists/train.lst
Note that any pattern supported by pathlib.Path.glob is supported (but avoid ** as much as possible). Paths can also be relative to the location of database.yml. FileFinder will then do its best to locate the file at runtime:
from pyannote.database import registry
from pyannote.database import FileFinder
protocol = registry.get_protocol('MyDatabase.Protocol.MyProtocol',
                                 preprocessors={"audio": FileFinder()})
for resource in protocol.train():
    print(resource["audio"])
/path/to/audio/filename1.wav
/path/to/audio/filename2.mp3
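The template lookup can be approximated with pathlib alone. In this sketch, find_file is a hypothetical helper that fills in the placeholders of each template and returns the first match; returning None when nothing matches is a simplification and may not reflect what FileFinder does in that case.

```python
import tempfile
from pathlib import Path
from typing import List, Optional


def find_file(templates: List[str], resource: dict) -> Optional[Path]:
    """Return the first existing file matching any template (hypothetical helper)."""
    for template in templates:
        pattern = Path(template.format(**resource))
        # glob from the parent directory so wildcards in the file name are honored
        matches = sorted(pattern.parent.glob(pattern.name))
        if matches:
            return matches[0]
    return None


with tempfile.TemporaryDirectory() as tmp:
    audio = Path(tmp) / "audio"
    audio.mkdir()
    (audio / "filename1.wav").touch()
    (audio / "filename2.mp3").touch()
    # doubled braces keep {uri} as a literal placeholder inside the f-string
    templates = [f"{audio}/{{uri}}.wav", f"{audio}/{{uri}}.mp3"]
    found = [find_file(templates, {"uri": uri})
             for uri in ("filename1", "filename2", "filename3")]
    names = [path.name if path else None for path in found]
```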
Tasks
Collections
A raw collection of files (i.e. without any train/development/test split) can be defined using the Collection task:
# ~