Dataform Railway Design

Staying on Track with Dataform Railway Design – Streamlining Dataform Development with Local Setup and CI/CD

Recently, I had the exciting opportunity to host my first session at Measurecamp Malmö 2025, where I presented strategies to streamline Dataform development using local environments and CI/CD integration. Many data teams struggle with inefficient local testing, accidentally run test scripts against entire BigQuery datasets, and find it hard to maintain schema integrity across environments. Developing in Dataform without a structured workflow is like running a railway network without signals, track switches, or station coordination. Without proper structure, we risk delays, collisions, or even derailment.

This blog outlines a Dataform local setup in the form of a well-maintained railway system: integrating CI/CD (signal lights), automated schema testing (station maps), environment switching (track switches), and sample data transfers (cargo management), making Dataform workflows more efficient and scalable. We will serve as railway engineers, laying down the tracks for a structured Dataform local setup. We will look into the challenges of local Dataform development, go over the key features of a robust local setup, and provide a step-by-step guide on how to get started, along with a more in-depth look at some of the key features. All aboard! 🚂

A more extensive setup guide can be found in the GitHub repository: Dataform Local Setup with CI/CD


🚧 Railway Obstacles in Local Dataform Development

Working with Google Cloud Dataform presents several hurdles:

  • 🔐 Security concerns around downloading data to a local environment.
  • 🛑 Manual schema validation, leading to errors in production.
  • ☁️ Limited local development support due to cloud dependencies.
  • 🔄 Inefficient data transfer between development and production environments.

To prevent derailments, we built a local-first development approach, ensuring a seamless experience for testing and deploying Dataform workflows.


⚡ Laying Down the Tracks – Key Features of the Local Setup

 
🚦 Signal Lights: Automating Schema Validation with CI/CD

One of the key improvements in this setup is the integration of GitHub Actions for automated schema testing. Using these signal lights, we can avoid major disruptions from trains entering the wrong track: every time a pull request (PR) is created, the CI pipeline validates the schema, detecting any added or removed columns, type changes, or new tables before merging. This prevents unintended modifications from disrupting production environments. Developers receive instant feedback on potential schema-breaking changes, reducing deployment risks and ensuring smooth data transitions.
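The exact CI implementation ships with the repository; as a rough sketch of the idea, a workflow step could diff the live BigQuery schemas against JSON snapshots committed to the repo. In this minimal sketch, the tables.txt file and schemas/ folder are hypothetical names used for illustration:

# Minimal sketch: flag schema drift by diffing live BigQuery schemas
# against snapshots committed to the repo (hypothetical layout).
while read -r table; do
  bq show --schema --format=prettyjson "$table" > /tmp/live_schema.json
  snapshot="schemas/${table//[:.]/_}.json"
  if ! diff -u "$snapshot" /tmp/live_schema.json; then
    echo "::warning::Schema change detected for $table"
  fi
done < tables.txt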

🔄 Track Switching – Simplified Environment Management

Managing multiple environments is essential for structured development, and this setup enables effortless switching between DEV and PROD using dedicated scripts. Instead of manually reconfiguring settings, developers can toggle environments with a single command, ensuring that production data remains untouched during development. The separation of environments using prefixed datasets ensures that testing occurs in isolation, maintaining data integrity across projects.
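The actual switch_env script lives in the repository; conceptually it boils down to swapping configuration files, something like this minimal sketch (which assumes hypothetical per-environment configs dataform.dev.json and dataform.prod.json):

#!/usr/bin/env bash
# Sketch of the idea behind switch_env (the real script ships with the repo).
# Assumes per-environment configs dataform.dev.json / dataform.prod.json.
set -euo pipefail
ENV="${1:?usage: switch_env <dev|prod>}"
case "$ENV" in
  dev|prod) cp "dataform.${ENV}.json" dataform.json
            echo "Now on the ${ENV^^} track." ;;
  *)        echo "Unknown environment: $ENV" >&2; exit 1 ;;
esac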

🚂 Standardized Train Engines: Docker for Local Development

A Dockerized development environment ensures that every developer has the same engine (setup), without the need for manual installation. By using a preconfigured devcontainer in VS Code, all necessary dependencies and configurations are pre-installed, creating a fully reproducible workspace. This guarantees consistency across team members and eliminates the common “it works on my machine” problem, streamlining collaboration and keeping all trains on schedule.
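If you prefer the terminal over the VS Code prompt, the same container can also be started with the Dev Containers CLI (assuming Node.js is installed):

# Start the preconfigured dev container from the terminal
npm install -g @devcontainers/cli
devcontainer up --workspace-folder .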

📦 Cargo Management: Sample Data Transfers for Testing

A train doesn’t need to carry every possible container on every journey – only those that are required. Similarly, testing doesn’t require full production datasets. Hence, our local setup includes automated sample data transfers to a BigQuery test project. A dedicated script simplifies the movement of partitioned and incremental data, allowing developers to test transformations with realistic datasets without unnecessary load. By handling data loads programmatically, manual setup time is reduced, making the development cycle more efficient. This approach ensures that tests are conducted with relevant data while maintaining a clear separation from production environments.

 

🛠️ Step-by-Step Setup Guide

 
1. Fork and Clone the Repository

# Fork the repository on GitHub, then clone it
git clone https://github.com/<user>/dataform_local_setup_with_ci.git
cd dataform_local_setup_with_ci

2. Prerequisites
  • 🛠️ Git for version control.
  • 🐳 Docker or OrbStack for containerized development.
  • 🖥️ VS Code with Dev Containers extension installed.
  • ☁️ Google Cloud Account with two projects: dev-project and prod-project.
  • 🔑 Service Account with BigQuery permissions.

3. Configure Google Cloud Credentials
  • Download the JSON key for the service account.
  • Save it as GCPkey.json in the root directory.
  • Add GCPkey.json to .gitignore.
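
For example, from the repository root:

# Keep the service-account key out of version control
echo "GCPkey.json" >> .gitignore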

4. Start the Dev Container

  code .

  • VS Code will detect the setup and reopen in a fully configured dev environment.

5. Initialize Dataform Credentials

  dataform init-creds

  • Select EU as the region.
  • Choose JSON service account and provide GCPkey.json.
  • The command writes its credentials to .df-credentials.json – keep this file out of version control as well.


🔍 In-Depth Look into Some of the Key Features

 

🚆 Running Dataform Commands with Scripts

✅ dataform_exec Script

Run Dataform commands with environment validation:

dataform_exec compile                # Validate Dataform code
dataform_exec test                   # Run unit tests
dataform_exec run --dry-run          # Simulate execution
dataform_exec run --full-refresh     # Perform a full refresh
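
The value of the wrapper is the validation step before anything runs. Conceptually it looks something like the sketch below (not the repo's actual script; the .current_env marker file is a hypothetical mechanism):

#!/usr/bin/env bash
# Conceptual sketch of dataform_exec: confirm before touching PROD.
set -euo pipefail
CURRENT_ENV="$(cat .current_env 2>/dev/null || echo unknown)"  # hypothetical marker file
if [[ "$CURRENT_ENV" == "prod" ]]; then
  read -rp "Environment is PROD. Really run 'dataform $*'? [y/N] " answer
  [[ "$answer" == "y" ]] || { echo "Aborted."; exit 1; }
fi
dataform "$@"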

🔄 switch_env Script

Toggle between development and production:

switch_env dev   # Switch to DEV
switch_env prod  # Switch to PROD


🛤️ Automated CI/CD Schema Tests – Keeping Trains on Track

How It Works:
  1. Compares schema changes between PRs and existing definitions.
  2. Detects modifications (new columns, type changes, new tables).
  3. Logs warnings/errors for quick debugging.
  4. Prevents schema-breaking changes from merging into production.

GitHub Workflow Configuration

- name: Set up Google Cloud authentication
  run: echo "${{ secrets.GCPKEY }}" > /tmp/gcpkey.json
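
A follow-up step typically points the client libraries at that key via Application Default Credentials (an illustrative addition, not necessarily the repo's exact workflow):

# Subsequent steps can then authenticate via Application Default Credentials
export GOOGLE_APPLICATION_CREDENTIALS=/tmp/gcpkey.json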

Example CI/CD Test Output

Schema Changes:

Processing table: project.dataform__report__analytics.paid_campaigns
  + testfield (STRING) added
  - cost (FLOAT) removed
  ~ SOURCE (STRING → INTEGER) changed


📦 Optimized Sample Data Transfer for Testing

The export_and_load.py script simplifies data migration between GCP projects.

How It Works:
  1. Reads config.json for source/target details.
  2. Creates a temporary partitioned table.
  3. Loads data incrementally into the target project.
  4. Deletes temporary resources after transfer.

Run the script inside the dev container:

/usr/bin/python3 /dataform/src/exampleData/export_and_load.py

Example Configuration:
{
  "tables": [
    {
      "source_project": "prod-project",
      "source_dataset": "source_dataset1",
      "source_table": "event_20240201",
      "target_project": "dev-project",
      "location": "EU",
      "partition_size": 10000,
      "max_rows": 39990000
    }
  ]
}
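
The script itself uses the Python BigQuery client, but conceptually each transfer boils down to three steps that could also be expressed with the bq CLI. This is a simplified illustration using the example values above, not the script itself:

# Simplified illustration of one transfer using the bq CLI
SRC="prod-project.source_dataset1.event_20240201"
TMP="prod-project:source_dataset1.tmp_event_20240201"
DST="dev-project:source_dataset1.event_20240201"

# 1. Materialize a bounded sample into a temporary table
bq query --use_legacy_sql=false --destination_table="$TMP" \
  "SELECT * FROM \`${SRC}\` LIMIT 10000"

# 2. Copy the sample into the dev project
bq cp -f "$TMP" "$DST"

# 3. Clean up the temporary table
bq rm -f -t "$TMP"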

🗺️ Final Thoughts – A Smooth, Efficient Dataform Journey

By implementing a structured Dataform local setup, your development process transforms into a well-managed railway system:

  • ✅ Schema integrity prevents collisions (breaking changes) and the delays they cause
  • ✅ Environment consistency gives us smooth transitions between tracks
  • ✅ Automated validation keeps every departure on schedule

By integrating GitHub Actions, environment switching, and sample data transfers into their Dataform projects, teams can:

  • 🚆 Reduce deployment risks
  • 🛠️ Automate schema validation
  • 🔄 Streamline data migration between environments
  • 🤝 Enhance collaboration with PR-based workflows

With the right infrastructure in place, Dataform development becomes a smoother, more predictable journey – where every train arrives on time, every deployment is robust, and every pipeline runs like a well-coordinated railway system! Want some help with your railway structure? Feel free to reach out!

This approach was presented at Measurecamp Malmö 2025, and we’re excited to share the GitHub repository: Dataform Local Setup with CI/CD 🚂


Andrei Fekete

“I am passionate about learning new technologies, simplifying processes, automation, and creating really usable solutions.”
