
dbt-duckdb

DuckDB is an embedded database, similar to SQLite, but designed for OLAP-style analytics. It is crazy fast and allows you to read and write data stored in CSV, JSON, and Parquet files directly, without requiring you to load them into the database first.

dbt is the best way to manage a collection of data transformations written in SQL or Python for analytics and data science. dbt-duckdb is the project that ties DuckDB and dbt together, allowing you to create a Modern Data Stack In A Box or a simple and powerful data lakehouse with Python.

Installation

This project is hosted on PyPI, so you should be able to install it and the necessary dependencies via:

pip3 install dbt-duckdb

The latest supported version targets dbt-core versions >= 1.8.x and DuckDB version 1.1.x, but we work hard to ensure that newer versions of DuckDB will continue to work with the adapter as they are released.

Configuring Your Profile

A super-minimal dbt-duckdb profile only needs one setting:

default:
  outputs:
    dev:
      type: duckdb
  target: dev

This will run your dbt-duckdb pipeline against an in-memory DuckDB database that will not be persisted after your run completes. This may not seem very useful at first, but it turns out to be a powerful tool for a) testing out data pipelines, either locally or in CI jobs and b) running data pipelines that operate purely on external CSV, Parquet, or JSON files. More details on how to work with external data files in dbt-duckdb are provided in the docs on reading and writing external files.

To have your dbt pipeline persist relations in a DuckDB file, set the path field in your profile to the path of the DuckDB file that you would like to read and write on your local filesystem. (For in-memory pipelines, the path is automatically set to the special value :memory:). By default, the path is relative to your profiles.yml file location. If the database doesn't exist at the specified path, DuckDB will automatically create it.
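For example, a file-backed profile only adds the path field to the minimal profile above (the filename here is illustrative):

```yaml
default:
  outputs:
    dev:
      type: duckdb
      path: my_project.duckdb  # created next to profiles.yml if it doesn't exist
  target: dev
```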

dbt-duckdb also supports common profile fields like schema and threads, but the database property is special: its value is automatically set to the basename of the file in the path argument with the suffix removed. For example, if the path is /tmp/a/dbfile.duckdb, the database field will be set to dbfile. If you are running in in-memory mode, then the database property will be automatically set to memory.
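The naming rule can be sketched in a few lines of Python. This is an illustration of the rule described above, not the adapter's actual code:

```python
from pathlib import Path


def derive_database_name(path: str) -> str:
    """Sketch of how the `database` property is derived:
    the basename of the path with its suffix removed,
    or 'memory' for in-memory runs."""
    if path == ":memory:":
        return "memory"
    return Path(path).stem


print(derive_database_name("/tmp/a/dbfile.duckdb"))  # dbfile
print(derive_database_name(":memory:"))              # memory
```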

Using MotherDuck

As of dbt-duckdb 1.5.2, you can connect to a DuckDB instance running on MotherDuck by setting your path to use an md:<database> connection string, just as you would with the DuckDB CLI or the Python API.
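For example, a MotherDuck-backed profile might look like this (the database name my_db is illustrative):

```yaml
default:
  outputs:
    dev:
      type: duckdb
      path: "md:my_db"  # connects to the MotherDuck database named my_db
  target: dev
```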

MotherDuck databases generally work the same way as local DuckDB databases from the perspective of dbt, but there are a few differences to be aware of:

  1. MotherDuck is compatible with client DuckDB versions 0.10.2 and newer.
  2. MotherDuck preloads a set of the most common DuckDB extensions for you, but does not support loading custom extensions or user-defined functions.

As of dbt-duckdb 1.9.6, you can also connect to a DuckDB instance running hosted DuckLake on MotherDuck by creating a DuckLake on MotherDuck and then setting is_ducklake: true in your profiles.yml.

-- To use this, first create your own DuckLake database in MotherDuck:
CREATE DATABASE my_ducklake
  (TYPE ducklake, DATA_PATH 's3://...');

An example profile is shown below under "Attaching Additional Databases". The DuckLake database must be identified so that dbt applies DDL operations that are safe for DuckLake.

DuckDB Extensions, Settings, and Filesystems

You can install and load any core DuckDB extensions by listing them as strings in the extensions field of your profile. You can also set any additional DuckDB configuration options via the settings field, including options that are supported in the loaded extensions. You can also configure extensions from outside of the core extension repository (e.g., a community extension) by configuring the extension as a name/repo pair:

default:
  outputs:
    dev:
      type: duckdb
      path: /tmp/dbt.duckdb
      extensions:
        - httpfs
        - parquet
        - name: h3
          repo: community
        - name: uc_catalog
          repo: core_nightly
  target: dev

To use the DuckDB Secrets Manager, you can use the secrets field. For example, to be able to connect to S3 and read/write Parquet files using an AWS access key and secret, your profile would look something like this:

default:
  outputs:
    dev:
      type: duckdb
      path: /tmp/dbt.duckdb
      extensions:
        - httpfs
        - parquet
      secrets:
        - type: s3
          region: my-aws-region
          key_id: "{{ env_var('S3_ACCESS_KEY_ID') }}"
          secret: "{{ env_var('S3_SECRET_ACCESS_KEY') }}"
  target: dev

As of version 1.4.1, we have added (experimental!) support for DuckDB's (experimental!) filesystems implemented via fsspec. The fsspec library provides support for reading and writing files from a variety of cloud data storage systems including S3, GCS, and Azure Blob Storage. You can configure a list of fsspec-compatible implementations for use with your dbt-duckdb project by installing the relevant Python modules and configuring your profile like so:

default:
  outputs:
    dev:
      type: duckdb
      path: /tmp/dbt.duckdb
      filesystems:
        - fs: s3
          anon: false
          key: "{{ env_var('S3_ACCESS_KEY_ID') }}"
          secret: "{{ env_var('S3_SECRET_ACCESS_KEY') }}"
          client_kwargs:
            endpoint_url: "http://localhost:4566"
  target: dev

Here, the filesystems property takes a list of configurations, where each entry must have a property named fs that indicates which fsspec protocol to load (so s3, gcs, abfs, etc.) and then an arbitrary set of other key-value pairs that are used to configure the fsspec implementation. You can see a simple example project that illustrates the usage of this feature to connect to a Localstack instance running S3 from dbt-duckdb here.

Fetching credentials from context

Instead of specifying the credentials through the settings block, you can also use the CREDENTIAL_CHAIN secret provider. This means that you can use any supported mechanism from AWS to obtain credentials (e.g., web identity tokens). You can read more about the secret providers here. To use the CREDENTIAL_CHAIN provider and automatically fetch credentials from AWS, specify the provider in the secrets key:

default:
  outputs:
    dev:
      type: duckdb
      path: /tmp/dbt.duckdb
      extensions:
        - httpfs
        - parquet
      secrets:
        - type: s3
          provider: credential_chain
  target: dev

Scoped credentials by storage prefix

Secrets can be scoped, so that different storage paths can use different credentials.

default:
  outputs:
    dev:
      type: duckdb
      path: /tmp/dbt.duckdb
      extensions:
        - httpfs
        - parquet
      secrets:
        - type: s3
          provider: credential_chain
          scope: [ "s3://bucket-in-eu-region", "s3://bucket-2-in-eu-region" ]
          region: "eu-central-1"
        - type: s3
          region: us-west-2
          scope: "s3://bucket-in-us-region"

When fetching a secret for a path, each secret's scopes are compared against the path, and the matching secret is returned. If multiple secrets match, the one whose scope is the longest matching prefix is chosen.
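The longest-prefix rule can be sketched as follows. This is an illustrative Python sketch of the matching behavior, not DuckDB's or dbt-duckdb's actual implementation:

```python
def pick_secret(path, secrets):
    """Return the secret whose scope is the longest prefix of `path`,
    or None if no scope matches (illustrative sketch only)."""
    best, best_len = None, -1
    for secret in secrets:
        scopes = secret.get("scope", "")
        if isinstance(scopes, str):
            scopes = [scopes]
        for scope in scopes:
            if path.startswith(scope) and len(scope) > best_len:
                best, best_len = secret, len(scope)
    return best


secrets = [
    {"region": "eu-central-1",
     "scope": ["s3://bucket-in-eu-region", "s3://bucket-2-in-eu-region"]},
    {"region": "us-west-2",
     "scope": "s3://bucket-in-us-region"},
]
print(pick_secret("s3://bucket-in-us-region/data.parquet", secrets)["region"])  # us-west-2
```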

Attaching Additional Databases

DuckDB supports attaching additional databases to your dbt-duckdb run so that you can read and write from multiple databases. Additional databases may be configured via the attach argument in your profile, which was added in dbt-duckdb 1.4.0:

default:
  outputs:
    dev:
      type: duckdb
      path: /tmp/dbt.duckdb
      attach:
        - path: /tmp/other.duckdb
        - path: ./yet/another.duckdb
          alias: yet_another
        - path: s3://yep/even/this/works.duckdb
          read_only: true
        - path: sqlite.db
          type: sqlite
        - path: postgresql://username@hostname/dbname
          type: postgres
        # Using the options dict for arbitrary ATTACH options
        - path: /tmp/special.duckdb
          options:
            cache_size: 1GB
            threads: 4
            enable_fsst: true

For a local DuckLake, use a ducklake: path; for a MotherDuck-managed DuckLake, use an md: path together with is_ducklake: true.

attach:
  - path: "ducklake:my_ducklake.ddb"
  - path: "md:my_other_ducklake"
    is_ducklake: true

The attached databases may be referred to in your dbt sources and models by either the basename of the database file minus its suffix (e.g., /tmp/other.duckdb is the other database and s3://yep/even/this/works.duckdb is the works database) or by an alias that you specify (so the ./yet/another.duckdb database in the above configuration is referred to as yet_another instead of another.) Note that these additional databases do not necessarily have