Prefect: Orchestrating DBT

George Coyne
SlateCo
Dec 22, 2020


A few months ago, a line was added to the DBT docs recommending Prefect for advanced orchestration. At Slate, we heavily leverage both Prefect and DBT, and we are receiving an increasing number of requests to help customers bring the two together.

A little background: Prefect is a lot of things. We think it is the best possible tool for dataflow automation. We use Prefect for data modeling, ETL, ML, and basically anywhere data automation is needed. The only places we don't use Prefect are ingestion from well-supported APIs (the territory of Fivetran, Stitch, etc.) and CDC-style data ingestion (Attunity, etc.). However, data from these sources generally needs further transformation once it lands in the warehouse. Here we generally recommend, and internally use, DBT. DBT does a tremendous job of using SQL pumped up with Jinja to automate data transformation within a data warehouse.

A few examples where both DBT and Prefect can be used together:

  • An organization wants a higher degree of control over scheduling than DBT Cloud supports.
  • Existing DBT models need to be incorporated into a Prefect flow.

Implementation

DBT and Prefect usually have different users within an organization. DBT is more suited to Analytics Engineers and Data Analysts, while Data Engineers and Data Scientists more commonly work with Prefect. The division is a byproduct of different skill focuses: Analytics Engineers tend to be more comfortable with SQL, while Data Engineers and Scientists tend to be more comfortable with Python.

DBT projects are most often structured as a monolith, with large and complex DAGs of interdependencies and sophisticated tags that dictate the distinct operations the project will execute. To let organizations continue using DBT as they are used to, we recommend separating the Prefect orchestration from the DBT project.

Let’s get into some code examples.

Our pygit2 task

In order to keep Python code out of the main DBT repo, and to avoid DBT authors needing to approve PRs outside of their project scope, we start our flow by cloning the DBT repository. We opted for pygit2 because it interacts directly with the core git methods without requiring a local git installation or configuration. A single access token is the only secret required, which we can pass in using Prefect's Secrets API.
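A minimal sketch of such a clone task is below. The task name, clone path, and credential format are assumptions (GitHub over HTTPS is assumed as the host); adapt them to your repository and git provider.

```python
import pygit2
from prefect import task


@task(name="Clone DBT repo")
def pygit_clone(repo_url: str, access_token: str, clone_path: str = "dbt_project") -> str:
    """Clone the DBT repository with pygit2, authenticating with an access token."""
    # For GitHub over HTTPS, the token can be supplied as the username with
    # "x-oauth-basic" as the password; other hosts may expect a different format.
    callbacks = pygit2.RemoteCallbacks(
        credentials=pygit2.UserPass(access_token, "x-oauth-basic")
    )
    pygit2.clone_repository(repo_url, clone_path, callbacks=callbacks)
    return clone_path
```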

After we clone the repo, all we need to do is execute our DBT commands. In Prefect 0.11.0, community contributor Mark McDonald added DbtShellTask: an extension of Prefect's ShellTask built specifically for DBT.

There are a few different options for passing credentials to DBT. The most common is a profiles.yml file, but for our implementation we opted to pass all secrets in as kwargs. The commented-out profiles_dir option can be used instead to point at a YAML file containing configuration options. We added another task to print outputs to the logs since, otherwise (at least as of 0.13.11), we would not receive the entirety of the logs. We also give it an "all finished" trigger so that we get logs regardless of upstream command failure.
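Here is a sketch of how that configuration might look, assuming a Snowflake target. The profile name, target, and connection values are placeholders, and in practice the credentials would be pulled from Prefect Secrets rather than written as literals.

```python
import prefect
from prefect import task
from prefect.tasks.dbt import DbtShellTask
from prefect.triggers import all_finished

# With overwrite_profiles=True, DbtShellTask writes a profiles.yml from dbt_kwargs;
# the profile/target names and Snowflake settings below are illustrative placeholders.
dbt_run = DbtShellTask(
    name="dbt run",
    profile_name="slate",
    environment="prod",
    overwrite_profiles=True,
    dbt_kwargs={
        "type": "snowflake",
        "account": "my_account",
        "threads": 4,
        # user, password, role, database, warehouse, schema would normally
        # come from Prefect Secrets rather than being hard-coded here
    },
    # profiles_dir="~/.dbt",  # alternatively, point at an existing profiles.yml
    helper_script="cd dbt_project",  # run dbt from the cloned repo
    return_all=True,                 # return every line of output, not just the last
)


@task(trigger=all_finished)  # run even if the dbt command fails, so logs are captured
def print_dbt_output(output):
    logger = prefect.context.get("logger")
    for line in output:
        logger.info(line)
```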

Let’s tie it all together in our flow.
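Below is a sketch of the assembled flow, reusing the hypothetical tasks from the snippets above; the flow name, repository URL, and secret name are placeholders.

```python
from prefect import Flow
from prefect.tasks.secrets import PrefectSecret

with Flow("dbt-orchestration") as flow:
    # the token is stored as a Prefect Secret (name is a placeholder)
    access_token = PrefectSecret("GITHUB_ACCESS_TOKEN")
    repo_path = pygit_clone("https://github.com/our-org/dbt-project.git", access_token)

    # run dbt only after the repo has been cloned, then surface its logs
    run_output = dbt_run(command="dbt run", upstream_tasks=[repo_path])
    print_dbt_output(run_output)

flow.run()  # or register the flow with a Prefect backend for scheduled runs
```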

Here’s a visual representation of our flow

If you are unfamiliar with Prefect schematics, this illustrates task interdependencies both with and without input dependencies.

This is a simple example, but Prefect gives us nearly unlimited flexibility. Don't hesitate to reach out to SLATE with any data, software, or infrastructure questions!

slateco.io
