Dataform Railway Design

Staying on Track with Dataform Railway Design – Streamlining Dataform Development with Local Setup and CI/CD

Recently, I had the exciting opportunity to host my first session at Measurecamp Malmö 2025, where I presented strategies to streamline Dataform development using local environments and CI/CD integration. Many data teams struggle with inefficient local testing, accidentally run test scripts against entire BigQuery datasets, and find it hard to maintain schema integrity across environments. Developing in Dataform without a structured workflow is like running a railway network without signals, track switches, or station coordination. Without proper structure, we risk delays, collisions, or even derailment.

This blog outlines a Dataform local setup in the form of a well-maintained railway system: integrating CI/CD (signal lights), automated schema testing (station maps), environment switching (track switches), and sample data transfers (cargo management), making Dataform workflows more efficient and scalable. We will serve as railway engineers, laying down the tracks for a structured Dataform local setup. We will look into the challenges of local Dataform development, go over the key features of a robust local setup, and provide a step-by-step guide on how to get started, along with a more in-depth look at some of the key features. All aboard! 🚂

A more extensive setup guide can be found in the GitHub repository: Dataform Local Setup with CI/CD


🚧 Railway Obstacles in Local Dataform Development

Working with Google Cloud Dataform presents several hurdles:

  • 🔐 Security concerns around downloading data to a local environment.
  • 🛑 Manual schema validation, leading to errors in production.
  • ☁️ Limited local development support due to cloud dependencies.
  • 🔄 Inefficient data transfer between development and production environments.

To prevent derailments, we built a local-first development approach, ensuring a seamless experience for testing and deploying Dataform workflows.


⚡ Laying Down the Tracks – Key Features of the Local Setup

 
🚦 Signal Lights: Automating Schema Validation with CI/CD

One of the key improvements in this setup is the integration of GitHub Actions for automated schema testing. Using these signal lights, we can avoid major disruptions from trains entering the wrong track: every time a pull request (PR) is created, the CI pipeline validates the schema, detecting any added or removed columns, type changes, or new tables before merging. This prevents unintended modifications from disrupting production environments. Developers receive instant feedback on potential schema-breaking changes, reducing deployment risks and ensuring smooth data transitions.
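The exact CI implementation ships with the repository; as a rough sketch of the idea, a workflow step could diff the live BigQuery schemas against JSON snapshots committed to the repo. In this minimal sketch, the tables.txt file and schemas/ folder are hypothetical names used for illustration:

# Minimal sketch: flag schema drift by diffing live BigQuery schemas
# against snapshots committed to the repo (hypothetical layout).
while read -r table; do
  bq show --schema --format=prettyjson "$table" > /tmp/live_schema.json
  snapshot="schemas/${table//[:.]/_}.json"
  if ! diff -u "$snapshot" /tmp/live_schema.json; then
    echo "::warning::Schema change detected for $table"
  fi
done < tables.txt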

🔄 Track Switching – Simplified Environment Management

Managing multiple environments is essential for structured development, and this setup enables effortless switching between DEV and PROD using dedicated scripts. Instead of manually reconfiguring settings, developers can toggle environments with a single command, ensuring that production data remains untouched during development. The separation of environments using prefixed datasets ensures that testing occurs in isolation, maintaining data integrity across projects.
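The actual switch_env script lives in the repository; conceptually it boils down to swapping configuration files, something like this minimal sketch (which assumes hypothetical per-environment configs dataform.dev.json and dataform.prod.json):

#!/usr/bin/env bash
# Sketch of the idea behind switch_env (the real script ships with the repo).
# Assumes per-environment configs dataform.dev.json / dataform.prod.json.
set -euo pipefail
ENV="${1:?usage: switch_env <dev|prod>}"
case "$ENV" in
  dev|prod) cp "dataform.${ENV}.json" dataform.json
            echo "Now on the ${ENV^^} track." ;;
  *)        echo "Unknown environment: $ENV" >&2; exit 1 ;;
esac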

🚂 Standardized Train Engines: Docker for Local Development

A Dockerized development environment ensures that every developer has the same engine (setup), without the need for manual installation. By using a preconfigured devcontainer in VS Code, all necessary dependencies and configurations are pre-installed, creating a fully reproducible workspace. This guarantees consistency across team members and eliminates the common “it works on my machine” problem, streamlining collaboration and keeping all trains on schedule.
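If you prefer the terminal over the VS Code prompt, the same container can also be started with the Dev Containers CLI (assuming Node.js is installed):

# Start the preconfigured dev container from the terminal
npm install -g @devcontainers/cli
devcontainer up --workspace-folder .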

📦 Cargo Management: Sample Data Transfers for Testing

A train doesn’t need to carry every possible container on every journey – only those that are required. Similarly, testing doesn’t require full production datasets. Hence, our local setup includes automated sample data transfers to a BigQuery test project. A dedicated script simplifies the movement of partitioned and incremental data, allowing developers to test transformations with realistic datasets without unnecessary load. By handling data loads programmatically, manual setup time is reduced, making the development cycle more efficient. This approach ensures that tests are conducted with relevant data while maintaining a clear separation from production environments.

 

🛠️ Step-by-Step Setup Guide

 
1. Fork and Clone the Repository

# Fork the repository on GitHub, then clone it
git clone https://github.com/<user>/dataform_local_setup_with_ci.git
cd dataform_local_setup_with_ci

2. Prerequisites
  • 🛠️ Git for version control.
  • 🐳 Docker or OrbStack for containerized development.
  • 🖥️ VS Code with Dev Containers extension installed.
  • ☁️ Google Cloud Account with two projects: dev-project and prod-project.
  • 🔑 Service Account with BigQuery permissions.

3. Configure Google Cloud Credentials
  • Download the JSON key for the service account.
  • Save it as GCPkey.json in the root directory.
  • Add GCPkey.json to .gitignore.
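
For example, from the repository root:

# Keep the service-account key out of version control
echo "GCPkey.json" >> .gitignore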

4. Start the Dev Container

  code .

  • VS Code will detect the setup and reopen in a fully configured dev environment.

5. Initialize Dataform Credentials

  dataform init-creds

  • Select EU as the region.
  • Choose JSON service account and provide GCPkey.json.
  • The command writes its credentials to .df-credentials.json – keep this file out of version control as well.


🔍 In-Depth Look into Some of the Key Features

 

🚆 Running Dataform Commands with Scripts

✅ dataform_exec Script

Run Dataform commands with environment validation:

dataform_exec compile                # Validate Dataform code
dataform_exec test                   # Run unit tests
dataform_exec run --dry-run          # Simulate execution
dataform_exec run --full-refresh     # Perform a full refresh
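
The value of the wrapper is the validation step before anything runs. Conceptually it looks something like the sketch below (not the repo's actual script; the .current_env marker file is a hypothetical mechanism):

#!/usr/bin/env bash
# Conceptual sketch of dataform_exec: confirm before touching PROD.
set -euo pipefail
CURRENT_ENV="$(cat .current_env 2>/dev/null || echo unknown)"  # hypothetical marker file
if [[ "$CURRENT_ENV" == "prod" ]]; then
  read -rp "Environment is PROD. Really run 'dataform $*'? [y/N] " answer
  [[ "$answer" == "y" ]] || { echo "Aborted."; exit 1; }
fi
dataform "$@"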

🔄 switch_env Script

Toggle between development and production:

switch_env dev   # Switch to DEV
switch_env prod  # Switch to PROD


🛤️ Automated CI/CD Schema Tests – Keeping Trains on Track

How It Works:
  1. Compares schema changes between PRs and existing definitions.
  2. Detects modifications (new columns, type changes, new tables).
  3. Logs warnings/errors for quick debugging.
  4. Prevents schema-breaking changes from merging into production.

GitHub Workflow Configuration

- name: Set up Google Cloud authentication
  run: echo "${{ secrets.GCPKEY }}" > /tmp/gcpkey.json
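
A follow-up step typically points the client libraries at that key via Application Default Credentials (an illustrative addition, not necessarily the repo's exact workflow):

# Subsequent steps can then authenticate via Application Default Credentials
export GOOGLE_APPLICATION_CREDENTIALS=/tmp/gcpkey.json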

Example CI/CD Test Output

Schema Changes:

Processing table: project.dataform__report__analytics.paid_campaigns
  + testfield (STRING) added
  - cost (FLOAT) removed
  ~ SOURCE (STRING → INTEGER) changed


📦 Optimized Sample Data Transfer for Testing

The export_and_load.py script simplifies data migration between GCP projects.

How It Works:
  1. Reads config.json for source/target details.
  2. Creates a temporary partitioned table.
  3. Loads data incrementally into the target project.
  4. Deletes temporary resources after transfer.

Run the script inside the dev container:

/usr/bin/python3 /dataform/src/exampleData/export_and_load.py

Example Configuration:
{
  "tables": [
    {
      "source_project": "prod-project",
      "source_dataset": "source_dataset1",
      "source_table": "event_20240201",
      "target_project": "dev-project",
      "location": "EU",
      "partition_size": 10000,
      "max_rows": 39990000
    }
  ]
}
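
The script itself uses the Python BigQuery client, but conceptually each transfer boils down to three steps that could also be expressed with the bq CLI. This is a simplified illustration using the example values above, not the script itself:

# Simplified illustration of one transfer using the bq CLI
SRC="prod-project.source_dataset1.event_20240201"
TMP="prod-project:source_dataset1.tmp_event_20240201"
DST="dev-project:source_dataset1.event_20240201"

# 1. Materialize a bounded sample into a temporary table
bq query --use_legacy_sql=false --destination_table="$TMP" \
  "SELECT * FROM \`${SRC}\` LIMIT 10000"

# 2. Copy the sample into the dev project
bq cp -f "$TMP" "$DST"

# 3. Clean up the temporary table
bq rm -f -t "$TMP"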

🗺️ Final Thoughts – A Smooth, Efficient Dataform Journey

By implementing a structured Dataform local setup, your development process transforms into a well-managed railway system:

  • ✅ Schema integrity prevents collisions (breaking changes) and the delays they cause
  • ✅ Environment consistency gives us smooth transitions between tracks
  • ✅ Automated validation keeps every departure on schedule

By integrating GitHub Actions, environment switching, and sample data transfers into their Dataform projects, teams can:

  • 🚆 Reduce deployment risks
  • 🛠️ Automate schema validation
  • 🔄 Streamline data migration between environments
  • 🤝 Enhance collaboration with PR-based workflows

With the right infrastructure in place, Dataform development becomes a smoother, more predictable journey – where every train arrives on time, every deployment is robust, and every pipeline runs like a well-coordinated railway system! Want some help with your railway structure? Feel free to reach out!

This approach was presented at Measurecamp Malmö 2025, and we’re excited to share the GitHub repository: Dataform Local Setup with CI/CD 🚂


Andrei Fekete

“I am passionate about learning new technologies, simplifying processes, automation, and creating really usable solutions.”
