Recently, I had the exciting opportunity to host my first session at Measurecamp Malmö 2025, where I presented strategies to streamline Dataform development using local environments and CI/CD integration. Many data teams struggle with inefficient local testing, accidentally run test scripts against entire BigQuery datasets, and find it hard to maintain schema integrity across environments. Developing in Dataform without a structured workflow is like running a railway network without signals, track switches, or station coordination. Without a proper structure, we risk delays, collisions, or even derailment.
This blog outlines a Dataform local setup in the form of a well-maintained railway system: integrating CI/CD (signal lights), automated schema testing (station maps), environment switching (track switches), and sample data transfers (cargo management) to make Dataform workflows more efficient and scalable. We will serve as railway engineers, laying down the tracks for a structured Dataform local setup. We will look into the challenges of local Dataform development, go over the key features of a robust local setup, and provide a step-by-step guide on how to get started, followed by a more in-depth look at some of those features. All aboard! 🚂
A more extensive setup guide can be found in the GitHub repository: Dataform Local Setup with CI/CD

🚧 Railway Obstacles in Local Dataform Development
Working with Google Cloud Dataform presents several hurdles:
- 🔐 Security concerns when downloading data to a local environment.
- 🛑 Manual schema validation, leading to errors in production.
- ☁️ Limited local development support due to cloud dependencies.
- 🔄 Inefficient data transfer between development and production environments.
To prevent derailments, we built a local-first development approach, ensuring a seamless experience for testing and deploying Dataform workflows.
⚡ Laying Down the Tracks – Key Features of the Local Setup
🚦 Signal Lights: Automating Schema Validation with CI/CD
One of the key improvements in this setup is the integration of GitHub Actions for automated schema testing. Using these signal lights, we can avoid major disruptions from trains entering the wrong track: every time a pull request (PR) is created, the CI pipeline validates the schema, detecting any added or removed columns, type changes, or new tables before merging. This prevents unintended modifications from disrupting production environments. Developers receive instant feedback on potential schema-breaking changes, reducing deployment risks and ensuring smooth data transitions.
🔄 Track Switching – Simplified Environment Management
Managing multiple environments is essential for structured development, and this setup enables effortless switching between DEV and PROD using dedicated scripts. Instead of manually reconfiguring settings, developers can toggle environments with a single command, ensuring that production data remains untouched during development. The separation of environments using prefixed datasets ensures that testing occurs in isolation, maintaining data integrity across projects.


🚂 Standardized Train Engines: Docker for Local Development
A Dockerized development environment ensures that every developer has the same engine (setup), without the need for manual installation. By using a preconfigured devcontainer in VS Code, all necessary dependencies and configurations are pre-installed, creating a fully reproducible workspace. This guarantees consistency across team members and eliminates the common “it works on my machine” problem, streamlining collaboration and keeping all trains on schedule.
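For reference, the core of such a setup can be surprisingly small. A minimal sketch of a devcontainer definition (the repository's .devcontainer configuration is more complete; the image and install step here are illustrative):
{
  "name": "dataform-local",
  "image": "mcr.microsoft.com/devcontainers/javascript-node:18",
  "postCreateCommand": "npm install -g @dataform/cli"
}
With a file like this in place, VS Code offers to reopen the folder in the container, and every team member gets the same Node.js and Dataform CLI versions.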
📦 Cargo Management: Sample Data Transfers for Testing
A train doesn’t need to carry every possible container on every journey – only those that are required. Similarly, testing doesn’t require full production datasets. Hence, our local setup includes automated sample data transfers to a BigQuery test project. A dedicated script simplifies the movement of partitioned and incremental data, allowing developers to test transformations with realistic datasets without unnecessary load. By handling data loads programmatically, manual setup time is reduced, making the development cycle more efficient. This approach ensures that tests are conducted with relevant data while maintaining a clear separation from production environments.
🛠️ Step-by-Step Setup Guide
1. Fork and Clone the Repository
# Fork the repository on GitHub, then clone it
git clone https://github.com/<user>/dataform_local_setup_with_ci.git
cd dataform_local_setup_with_ci
2. Prerequisites
- 🛠️ Git for version control.
- 🐳 Docker or OrbStack for containerized development.
- 🖥️ VS Code with Dev Containers extension installed.
- ☁️ Google Cloud Account with two projects: dev-project and prod-project.
- 🔑 Service Account with BigQuery permissions.
3. Configure Google Cloud Credentials
- Download the JSON key for the service account.
- Save it as GCPkey.json in the root directory.
- Add GCPkey.json to .gitignore.
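If you prefer the command line, the key can be created and ignored in one step (the service-account name below is illustrative; use your own):
gcloud iam service-accounts keys create GCPkey.json \
  --iam-account=dataform-sa@dev-project.iam.gserviceaccount.com
echo "GCPkey.json" >> .gitignore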
4. Start the Dev Container
code .
- VS Code will detect the setup and reopen in a fully configured dev environment.
5. Initialize Dataform Credentials
dataform init-creds
- Select EU as the region.
- Choose JSON service account and provide GCPkey.json.
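Note that init-creds writes a .df-credentials.json file to the project root; like the service-account key, it should stay out of version control:
echo ".df-credentials.json" >> .gitignore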
🔍 In-Depth Look into Some of the Key Features
🚆 Running Dataform Commands with Scripts
✅ dataform_exec Script
Run Dataform commands with environment validation:
dataform_exec compile            # Validate Dataform code
dataform_exec test               # Run unit tests
dataform_exec run --dry-run      # Simulate execution
dataform_exec run --full-refresh # Perform a full refresh
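Under the hood, the environment validation could be as simple as checking which project the workspace currently points at. A minimal sketch, assuming the active project can be read from dataform.json with jq (the repository's actual script may differ):
#!/usr/bin/env bash
# dataform_exec (sketch): validate the active environment before
# forwarding the command to the Dataform CLI
set -euo pipefail
PROJECT="$(jq -r '.defaultDatabase' dataform.json)"
if [ "$PROJECT" = "prod-project" ]; then
  read -r -p "⚠️ Targeting PROD ($PROJECT). Continue? [y/N] " reply
  [ "$reply" = "y" ] || { echo "Aborted."; exit 1; }
fi
dataform "$@"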
🔄 switch_env Script
Toggle between development and production:
switch_env dev # Switch to DEV
switch_env prod # Switch to PROD
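A minimal sketch of what this toggle could look like, assuming the environment is defined by the defaultDatabase field in dataform.json (again, the repository's actual script may differ):
#!/usr/bin/env bash
# switch_env (sketch): point Dataform at the dev or prod GCP project
set -euo pipefail
case "${1:-}" in
  dev)  PROJECT="dev-project" ;;
  prod) PROJECT="prod-project" ;;
  *)    echo "usage: switch_env dev|prod" >&2; exit 1 ;;
esac
# Rewrite the default database (GCP project) in dataform.json
jq --arg p "$PROJECT" '.defaultDatabase = $p' dataform.json > dataform.json.tmp \
  && mv dataform.json.tmp dataform.json
echo "🔄 Switched to ${1} ($PROJECT)"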
🛤️ Automated CI/CD Schema Tests – Keeping Trains on Track
How It Works:
- Compares schema changes between PRs and existing definitions.
- Detects modifications (new columns, type changes, new tables).
- Logs warnings/errors for quick debugging.
- Prevents schema-breaking changes from merging into production.
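Conceptually, the check boils down to comparing the schema on the main branch with the one the PR produces. A minimal sketch of that idea using the bq CLI (the repository's test is more thorough; the table names are taken from the example output below):
# Dump the schema currently in PROD and the one produced by the PR branch
bq show --schema --format=prettyjson prod-project:dataform__report__analytics.paid_campaigns > schema_prod.json
bq show --schema --format=prettyjson dev-project:dataform__report__analytics.paid_campaigns > schema_pr.json
# Any difference (added/removed columns, type changes) fails the check
diff -u schema_prod.json schema_pr.json && echo "✅ Schemas match" || { echo "⚠️ Schema drift detected"; exit 1; }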
GitHub Workflow Configuration
- name: Set up Google Cloud authentication
  run: echo "${{ secrets.GCPKEY }}" > /tmp/gcpkey.json
Example CI/CD Test Output
Schema Changes:
Processing table: project.dataform__report__analytics.paid_campaigns
+ tesfield (STRING) added
- cost (FLOAT) removed
~ SOURCE (STRING → INTEGER) changed
📦 Optimized Sample Data Transfer for Testing
The export_and_load.py script simplifies data migration between GCP projects.
How It Works:
- Reads config.json for source/target details.
- Creates a temporary partitioned table.
- Loads data incrementally into the target project.
- Deletes temporary resources after transfer.
/usr/bin/python3 /dataform/src/exampleData/export_and_load.py
Example Configuration:
{
  "tables": [
    {
      "source_project": "prod-project",
      "source_dataset": "source_dataset1",
      "source_table": "event_20240201",
      "target_project": "dev-project",
      "location": "EU",
      "partition_size": 10000,
      "max_rows": 39990000
    }
  ]
}
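For a rough sense of what the script automates, a single sample could also be copied by hand with the bq CLI – a simplified one-off equivalent without the partitioned and incremental handling the script provides:
# Copy a capped sample of one table from PROD into the dev project
bq query --use_legacy_sql=false \
  --destination_table 'dev-project:source_dataset1.event_20240201' \
  --replace \
  'SELECT * FROM `prod-project.source_dataset1.event_20240201` LIMIT 10000'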

🗺️ Final Thoughts – A Smooth, Efficient Dataform Journey
By implementing a structured Dataform local setup, your development process transforms into a well-managed railway system:
- ✅ Schema integrity prevents crashes and the unexpected delays they cause
- ✅ Environment consistency provides a smooth transition between tracks
- ✅ Automated validation catches breaking changes before they reach production
By integrating GitHub Actions, environment switching, and sample data transfers into their Dataform projects, teams can:
- 🚆 Reduce deployment risks
- 🛠️ Automate schema validation
- 🔄 Streamline data migration between environments
- 🤝 Enhance collaboration with PR-based workflows
With the right infrastructure in place, Dataform development becomes a smoother, more predictable journey – where every train arrives on time, every deployment is robust, and every pipeline runs like a well-coordinated railway system! Want some help with your railway structure? Feel free to reach out!
This approach was presented at Measurecamp Malmö 2025, and we’re excited to share the GitHub repository: Dataform Local Setup with CI/CD 🚂