How to get started with dbt and Gitpod
Jul 12, 2023
dbt helps data teams by enabling faster data transformation, providing organized and transparent workflows, and ensuring reliable datasets. As datasets grow larger and more complex, teams that adopt tools like dbt run into many of the same development challenges that software engineers have traditionally faced.
One of these challenges is managing the analytics engineers’ development environments across different use cases, workload configurations, versions, and so on. By using a cloud development environment like Gitpod for your dbt projects, you can ensure your data and analytics engineers are always working in a secure and reproducible context.
With this guide you will learn how to:
- Create a Gitpod cloud development environment
- Configure, establish, and test connections between Gitpod, dbt, and your data warehouse
- Customize your workspace IDE
Requirements
To follow along, you will need:
- A Gitpod account
- A dbt account if using dbt Cloud
- Your data warehouse access credentials
This guide will use the dbt + BigQuery and dbt + Snowflake templates as examples, but any cloud setup that is supported by dbt Core in a local dev environment can also be run in a Gitpod workspace.
Turn your dbt project into a Gitpod workspace
The fastest way to open your dbt project in a Gitpod workspace is to prefix the GitHub/GitLab/Bitbucket URL in the browser with “gitpod.io/#”.
You will be prompted to confirm the Context URL (your Git repo), IDE of choice, and the workspace class. We recommend selecting VS Code (browser or desktop) for dbt projects because of the capabilities of the vscode-dbt-power-user extension. As for provisioning, dbt pushes heavy workloads down to the cloud warehouse, so the Standard workspace class is sufficient for most use cases.
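The URL prefixing described above can be sketched as a small shell helper. The function name gitpod_url and the repository URL are hypothetical placeholders, not part of Gitpod's tooling; the only thing taken from the guide is the “gitpod.io/#” prefix.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a Gitpod launch URL from a repository URL
# by prepending the "gitpod.io/#" prefix.
gitpod_url() {
  local repo_url="$1"
  printf 'https://gitpod.io/#%s\n' "$repo_url"
}

# Placeholder repository URL; substitute your own dbt project.
gitpod_url "https://github.com/your-org/your-dbt-project"
```

Opening the printed URL in a browser should lead to the same confirmation prompt described above.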
Automate and standardize your dbt development environments
The first step is to add a .gitpod.yml file to the root of the repository. This file describes workspace configurations, including:
- The installation of languages and dependencies
- The configuration of the terminal(s) and opened ports
- The installation of extensions in the IDE
A .gitpod.yml file can be added manually, or a boilerplate version can be generated by running gp init.
Gitpod uses Docker images as the foundation for instances of development environments, or what we refer to as workspaces. The default workspace image for Gitpod contains support for multiple languages, such as Go, Java, Python, and JavaScript, but you can also use slimmer images or specify your own.
While the specifics will change depending on the data platform, the .gitpod.Dockerfile file is where you will pull your Gitpod workspace image, set the path of your dbt profiles directory, and install your requirements.
Like the .gitpod.yml file, .gitpod.Dockerfile needs to be added to the root of the repository. Here is an example .gitpod.Dockerfile, consistent across both the dbt + BigQuery and dbt + Snowflake templates:
# Use Gitpod's latest Python image.
FROM gitpod/workspace-python:latest
# Set the path of dbt's profiles file.
ENV DBT_PROFILES_DIR=./profiles/
# Copy requirements file from host into Container.
COPY requirements.txt /tmp
# Install the requirements.
RUN cd /tmp && pip install -r requirements.txt
With the standard Python image pulled, the dbt profiles path set, and the requirements installed, dbt is ready to be configured. For these examples, the only requirement is the dbt adapter that matches your warehouse.
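As a rough sketch of what that single requirement looks like, the helper below maps a warehouse name to its adapter package. The function name adapter_for is hypothetical; dbt-bigquery and dbt-snowflake are the adapter packages published on PyPI. The templates may pin specific versions, which this sketch does not attempt to reproduce.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the dbt adapter package for a given warehouse.
adapter_for() {
  case "$1" in
    bigquery)  echo "dbt-bigquery" ;;
    snowflake) echo "dbt-snowflake" ;;
    *) echo "unsupported warehouse: $1" >&2; return 1 ;;
  esac
}

# Print the package name that would go into requirements.txt.
adapter_for bigquery
```

Redirecting the output (for example, adapter_for snowflake > requirements.txt) would produce the file that the Dockerfile copies into the image.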
In the following .gitpod.yml examples, the .gitpod.Dockerfile configured above will be called first, installing languages and dependencies.
Each object in the tasks section creates a new terminal in the development environment. In our examples, a terminal named connect executes three commands to complete and test the dbt configuration:
# BigQuery
image:
  file: .gitpod.Dockerfile
ports:
  - port: 8080
    onOpen: open-preview
tasks:
  - name: connect
    command: |
      echo $DBT_SERVICE_ACCOUNT > $GITPOD_REPO_ROOT/profiles/service_account.json
      dbt debug
      dbt deps
    openMode: split-left
  - name: generate docs
    command: |
      dbt docs generate
      dbt docs serve --no-browser --port 8080
    openMode: split-right
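The first command in the connect task writes the service account JSON to disk. A slightly more defensive variant is sketched below, assuming the same DBT_SERVICE_ACCOUNT variable; the function name write_service_account is hypothetical. Quoting the variable and using printf avoids word splitting and glob expansion of the JSON, and chmod restricts the key file to the current user.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write the JSON held in DBT_SERVICE_ACCOUNT to
# <dir>/service_account.json and make it readable by the owner only.
write_service_account() {
  local dir="$1"
  mkdir -p "$dir"
  printf '%s\n' "$DBT_SERVICE_ACCOUNT" > "$dir/service_account.json"
  chmod 600 "$dir/service_account.json"
}

# In a Gitpod workspace the call would be:
# write_service_account "$GITPOD_REPO_ROOT/profiles"
```

This is an optional hardening step, not something the templates require.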
# Snowflake
image:
  file: .gitpod.Dockerfile
ports:
  - port: 8080
    onOpen: open-preview
tasks:
  - name: connect
    # The private key is stored in a single line as DBT_SNOWFLAKE_PRIVATE_KEY.
    # Unfortunately, Snowflake will only accept the key if it is multi-line.
    # The sed command splits the key into multiple lines
    # and then stores it as a file, which can be processed by Snowflake.
    command: |
      echo "${DBT_SNOWFLAKE_PRIVATE_KEY}" | sed -e "s/-----BEGIN PRIVATE KEY-----/&\n/" -e "s/-----END PRIVATE KEY-----/\n&/" -e "s/\S\{64\}/&\n/g" > $GITPOD_REPO_ROOT/profiles/private_key.p8
      dbt debug
      dbt deps
    openMode: split-left
  - name: generate docs
    command: |
      dbt docs generate
      dbt docs serve --no-browser --port 8080
    openMode: split-right
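To see what the sed expressions in the Snowflake task do, the sketch below runs them on a fake key. The function name reflow_private_key and the 100-character body of repeated “A” characters are made up for illustration; this is not a real key. The \S and \n escapes assume GNU sed, which is what a Linux workspace image typically provides.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reflow a single-line PEM private key into multiple
# lines using the same three sed expressions as the connect task.
reflow_private_key() {
  sed -e "s/-----BEGIN PRIVATE KEY-----/&\n/" \
      -e "s/-----END PRIVATE KEY-----/\n&/" \
      -e "s/\S\{64\}/&\n/g"
}

# Fake 100-character key body, used only to show the line wrapping.
fake_body=$(printf '%100s' '' | tr ' ' 'A')
printf '%s\n' "-----BEGIN PRIVATE KEY-----${fake_body}-----END PRIVATE KEY-----" \
  | reflow_private_key
```

The result is a header line, the body wrapped at 64 characters, and a footer line. If a key body happens to be an exact multiple of 64 characters long, these expressions leave an empty line before the footer.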
Following the reference to the custom Docker image, your dbt credentials need to be passed into Gitpod so that the workspace can connect to your data platform. The most convenient way of making auth credentials accessible inside of the workspace is using Gitpod’s user-specific environment variables.
The dbt debug command tests the connection with the database. When it runs, dbt looks for the credentials to connect with the database in the profiles.yml file, shown here:
# BigQuery
default:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: "{{ env_var('DBT_PROJECT') }}"
      dataset: "{{ env_var('DBT_DEV_DATASET') }}"
      threads: 4
      keyfile: "{{ env_var('GITPOD_REPO_ROOT') }}/profiles/service_account.json"
      location: "{{ env_var('DBT_LOCATION') }}"
# Snowflake
default:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: "{{ env_var('DBT_SNOWFLAKE_ACCOUNT') }}"
      user: "{{ env_var('DBT_SNOWFLAKE_USER') }}"
      private_key_path: "{{ env_var('GITPOD_REPO_ROOT') }}/profiles/private_key.p8"
      database: "{{ env_var('DBT_SNOWFLAKE_DB') }}"
      warehouse: "{{ env_var('DBT_SNOWFLAKE_WH') }}"
      schema: "{{ env_var('DBT_SNOWFLAKE_SCHEMA') }}"
This file references environment variables that each user has to set. Once the configuration has been added to the repository, setting these variables is the only manual step required to launch a functional dbt dev environment, and it only needs to be done once.
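Because a missing variable only surfaces when dbt tries to render the profile, a small preflight check can make the failure easier to read. The sketch below is an assumption-laden illustration: the function name missing_vars is hypothetical, and the variable list simply mirrors the names used in the BigQuery examples in this guide.

```shell
#!/usr/bin/env bash
# Hypothetical preflight check: print each named environment variable that
# is unset or empty, and return non-zero if any are missing.
missing_vars() {
  local rc=0 name
  for name in "$@"; do
    if [ -z "$(printenv "$name")" ]; then
      echo "missing: $name"
      rc=1
    fi
  done
  return "$rc"
}

# Variables used by the BigQuery examples in this guide.
missing_vars DBT_PROJECT DBT_DEV_DATASET DBT_LOCATION DBT_SERVICE_ACCOUNT || true
```

Running this before dbt debug lists anything still to be set. Variables can be added in Gitpod's user settings, or, assuming the gp CLI is available in the workspace, with a command of the form gp env NAME=value.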
After the connection has been tested successfully, the workspace is ready to be used.
Customize VS Code and Git for dbt + Gitpod
The .gitpod.yml file also allows you to describe IDE extensions and configurations.
We recommend using VS Code for dbt projects in Gitpod workspaces. While VS Code is not ideal for these environments out of the box, there are several extensions that offer a greatly improved development experience, most notably the vscode-dbt-power-user extension. Some of this extension’s best features are:
- Autocompletion of dbt models
- The ability to preview model results in VS Code
- The ability to display model lineage
- The ability to run and test dbt models from VS Code’s UI
For syntax highlighting, we recommend jinjahtml.
Beyond your IDE, the .gitpod.yml file also gives you the opportunity to configure prebuilds for GitHub repositories. Prebuilds can install dependencies and run builds before a workspace opens, which is especially helpful for codebases that are large or can’t be compiled directly. Check the documentation for a more detailed look at these options.
For a basic set of recommended extensions and GitHub prebuild configurations, you can add the following to your .gitpod.yml file:
# Same for both BigQuery and Snowflake projects
vscode:
  extensions:
    - ms-python.python
    - mechatroner.rainbow-csv
    - innoverio.vscode-dbt-power-user
    - ms-toolsai.jupyter
    - ms-toolsai.jupyter-keymap
    - ms-toolsai.jupyter-renderers
    - ms-toolsai.vscode-jupyter-cell-tags
    - ms-toolsai.vscode-jupyter-slideshow
    - samuelcolvin.jinjahtml
github:
  prebuilds:
    master: true
    branches: true
    pullRequests: true
    pullRequestsFromForks: false
    addCheck: true
    addComment: false
    addBadge: false
You can preview your configs by running gp validate. For any workspace configuration options to persist, you must commit the .gitpod.yml and .gitpod.Dockerfile to the root of the repository and start a new workspace (a workspace restart is not sufficient). Once committed, configs become available to other users launching the workspace.