End-to-End GCP Data Solution for a European Transport Company
Building a Data Warehouse for a Startup That Didn't Know What Data It Had
How we built end-to-end analytics for an on-demand transit company running blind
The Startup Running on Spreadsheets and Hope
"We report to municipalities every month, but honestly, we're not entirely sure the numbers are accurate."
That's not something you want to hear from a transportation startup's CEO, especially when those municipalities are paying for the service based on those reports. But that was the reality for a European on-demand transit company trying to solve last-mile transportation in smaller cities.
The concept was clever. In towns where public buses don't run evenings and weekends, or where routes don't reach certain neighborhoods, residents could book rides through an app. You could pay per trip or use your existing monthly transit pass. The service filled a real gap, municipalities liked it, and the business was growing.
But behind the scenes? Data chaos.
What "No Data Infrastructure" Actually Looks Like
When we started working with them, the company was in that dangerous early startup phase where you have real customers, real revenue, and real reporting obligations—but no real systems to support any of it.
Their production database ran on Google Cloud SQL. That sounds fine until you realize they were running analytics queries directly against it. Every report, every dashboard, every "quick check" hit the same database that was serving their live mobile app. Performance was suffering, but nobody was sure why.
All their analysis happened in Deepnote notebooks. Someone would write a query, export to CSV, do some calculations in a spreadsheet, format it into a report, and email it to the municipality. Every month. Manually. For every city they operated in.
They were losing track of their own business. The CEO couldn't confidently answer basic questions like "What's our average cost per ride?" or "Which routes are profitable?" without someone spending hours compiling data. Monthly reporting to municipalities was stressful because they'd discover data issues only when preparing the reports, with deadlines looming.
The geographic data was particularly messy. They had GeoJSON files defining service areas and routes, Excel files with pricing and schedules, and database tables with ride data, but nothing connected them properly. Matching rides to service zones, calculating which municipality to bill, understanding route efficiency—it all required manual work.
They needed an actual data warehouse. Not eventually. Not when they raised their next round. Now.
Finding Data in All the Wrong Places
Before we could build anything, we needed to understand what data actually existed and where it lived.
This took longer than expected because the answer was "everywhere, in inconsistent formats."
The primary source was the Google Cloud SQL database containing rides, users, bookings, and payments. But the geographic boundaries that defined service areas lived in GeoJSON files stored who-knows-where. Pricing schedules existed in Excel spreadsheets that different people updated. Municipality contracts were in Google Docs. Operating schedules were in another spreadsheet. Driver assignment logic was partially in the database and partially in the app code.
Nobody had a complete picture of what data existed or how it all fit together. The previous developer who set up the original system had left. Documentation was minimal. Tribal knowledge was scattered across the CEO, CTO, and a couple of developers.
We spent the first few weeks just mapping the data landscape. Where does each piece of information live? How frequently does it change? What's the source of truth for each data type? Which reports are actually critical versus which are nice-to-have?
This discovery phase was essential but frustrating for the client. They wanted dashboards immediately. We had to explain that building on top of undefined data sources is how you get garbage analytics that nobody trusts.
Building the Foundation in Google Cloud
Once we understood the data sources, we designed an end-to-end pipeline entirely in Google Cloud to keep infrastructure simple and costs manageable for an early-stage startup.
The architecture started with daily extraction from the production Cloud SQL database. We couldn't query it directly for analytics anymore—that was killing performance for actual users. Instead, we set up automated exports to Cloud Storage every night. Fresh data landed in storage buckets where it couldn't impact the production system.
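The nightly export step can be sketched as follows. This is a minimal illustration, not the project's actual code: the instance, database, and bucket names are hypothetical, and the function only assembles a `gcloud sql export csv` invocation (a Cloud Scheduler job or Cloud Function would actually run it, or call the Cloud SQL Admin API equivalent).

```python
from datetime import date

def build_export_command(instance: str, database: str, table: str, bucket: str) -> list[str]:
    """Assemble a `gcloud sql export csv` invocation that dumps one table
    to a dated path in Cloud Storage. Nothing is executed here; a
    scheduled job would run the returned command each night."""
    destination = f"gs://{bucket}/exports/{date.today().isoformat()}/{table}.csv"
    return [
        "gcloud", "sql", "export", "csv", instance, destination,
        f"--database={database}",
        f"--query=SELECT * FROM {table}",
    ]

# Hypothetical names for illustration only.
cmd = build_export_command("prod-sql", "transit", "rides", "analytics-landing")
print(" ".join(cmd))
```

Landing each day's extract under a dated prefix keeps exports idempotent: re-running a night's job overwrites that day's files without touching earlier history.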
The challenge wasn't just moving data. The real complexity was in the GeoJSON files defining service zones. These files contained geographic boundaries, but the actual zone names and identifiers lived in the database. Matching rides to zones required combining geographic coordinates from ride data with polygon definitions from GeoJSON and looking up zone metadata from the database.
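The core of the zone-matching step is a point-in-polygon test. A minimal, self-contained sketch using the standard ray-casting algorithm (the zone coordinates below are made up; holes and multipolygons are omitted for brevity):

```python
def point_in_polygon(lon: float, lat: float, ring: list[tuple[float, float]]) -> bool:
    """Ray-casting test: count how many polygon edges a horizontal ray
    from the point crosses; an odd count means the point is inside.
    `ring` is one GeoJSON polygon ring as (lon, lat) pairs."""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        if (yi > lat) != (yj > lat):  # edge straddles the ray's latitude
            x_cross = (xj - xi) * (lat - yi) / (yj - yi) + xi
            if lon < x_cross:
                inside = not inside
        j = i
    return inside

# Hypothetical square service zone and two ride pickups.
zone = [(10.0, 50.0), (10.1, 50.0), (10.1, 50.1), (10.0, 50.1)]
print(point_in_polygon(10.05, 50.05, zone))  # pickup inside the zone
print(point_in_polygon(10.20, 50.05, zone))  # pickup outside the zone
```

In production this ran inside BigQuery and the pipeline code rather than as hand-rolled Python, but the underlying operation is the same.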
We built this integration using Cloud Functions for lightweight processing and Dataflow for heavier transformations. Cloud Scheduler orchestrated everything, triggering the daily data pulls, the GeoJSON processing, and the BigQuery loads in the right sequence.
BigQuery became the foundation. Raw data landed there first, then moved through transformation layers. We built what we called "theoretical ETL"—structured transformation logic that would eventually migrate to a proper tool like dbt but started as well-organized SQL for speed of implementation. The goal was clean, reliable data models that could support reporting without requiring a data engineer every time someone had a question.
The BI Layer That Actually Got Used
We evaluated several BI tools and chose Holistics, which turned out to be an excellent decision for this use case.
Holistics let us build reusable data models directly on top of BigQuery. We could define metrics once—cost per ride, rides per service area, utilization rates, revenue by municipality—and use them consistently across all reports. No more conflicting numbers because someone used a slightly different calculation in their spreadsheet.
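The "define once, use everywhere" idea can be illustrated with a tiny metric registry. In the real project these definitions lived in Holistics' modeling layer on top of BigQuery, not in Python; the metric names and SQL expressions here are hypothetical:

```python
# Hypothetical central metric definitions (one SQL expression per metric).
METRICS = {
    "cost_per_ride": "SUM(total_cost) / COUNT(ride_id)",
    "rides_per_zone": "COUNT(ride_id)",
    "utilization_rate": "SUM(occupied_minutes) / SUM(shift_minutes)",
}

def build_report_query(metric: str, dimension: str, table: str) -> str:
    """Assemble an aggregate query from the shared metric definition,
    so every report uses exactly the same formula."""
    expr = METRICS[metric]  # raises KeyError for undefined metrics
    return (
        f"SELECT {dimension}, {expr} AS {metric}\n"
        f"FROM {table}\n"
        f"GROUP BY {dimension}"
    )

print(build_report_query("cost_per_ride", "municipality", "analytics.rides"))
```

The point is the single source of truth: if the definition of cost per ride changes, it changes in one place and every report picks it up.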
The semantic layer was crucial. Business users could explore data using business terms like "service zone" and "peak hours" without needing to understand the underlying table structures or write SQL. This made the system sustainable. The startup's team could answer their own questions instead of waiting for us.
The automated reporting was the game-changer for their operations. We built monthly reports for each municipality showing rides served, costs incurred, peak usage times, and service area coverage. These reports went out automatically via email and Slack channels. No more scrambling at month-end to compile numbers.
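The Slack side of the delivery step amounts to posting a JSON payload to an incoming-webhook URL. A minimal sketch, with made-up city and numbers, that only builds the payload (the scheduled job would POST it to the secret webhook URL):

```python
import json

def monthly_report_message(city: str, rides: int, cost_eur: float) -> bytes:
    """Serialize a Slack incoming-webhook payload summarizing one
    municipality's month. Sending is omitted; a scheduler job would
    POST this body to the webhook URL."""
    text = (
        f"Monthly report for {city}: {rides} rides served, "
        f"total cost EUR {cost_eur:,.2f}."
    )
    return json.dumps({"text": text}).encode("utf-8")

body = monthly_report_message("Musterstadt", 1842, 23517.5)
print(body.decode())
```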
The transparency improved relationships with municipalities. Instead of rough estimates delivered late, they got detailed breakdowns delivered on schedule. Cities could see exactly what they were paying for and how the service was being used. This built trust and made contract renewals smoother.
The Communication Problem
Here's the part that doesn't show up in technical architecture diagrams: this project was harder than it should have been because key stakeholders barely communicated.
The CTO and CEO were stretched thin running a startup. We understood that. But they'd go weeks without responding to questions about business requirements or data validation. We'd build something based on our best understanding, only to discover later that wasn't actually how they calculated certain metrics.
Requirements would change mid-project without notification. We'd build a report showing rides per municipality, then find out they also needed to split it by vehicle type and time of day, but nobody mentioned that until we showed them the first version.
Access to systems was a constant struggle. Getting credentials for data sources, permissions for cloud resources, or context about how specific features worked often required multiple follow-ups over days. The pace of development was limited more by communication delays than technical challenges.
We adapted by over-documenting everything, sending regular updates even when we didn't get responses, and building flexibility into the data models so we could adjust when requirements inevitably changed. It worked, but it added time and friction to a project that should have been straightforward.
What We Delivered
Despite the communication challenges, we delivered a complete data infrastructure that transformed how the company operated.
The daily automated pipeline replaced manual data exports and spreadsheet wrangling. Fresh data landed in BigQuery every morning without human intervention. The production database was no longer getting hammered by analytics queries, so app performance improved.
The geographic data integration finally worked correctly. Rides were accurately matched to service zones, billing was automated, and route analysis became possible. They could finally answer questions like "Which zones have the highest demand but lowest service?" or "Are we efficiently covering our contracted service areas?"
The Holistics semantic layer gave business users self-service analytics. The marketing team could analyze user acquisition by municipality. Operations could monitor driver utilization and route efficiency. Finance could track costs and revenue by service area. All without writing SQL or bothering developers.
Monthly reporting went from a stressful manual process taking days to an automated system that ran reliably. Municipalities got detailed breakdowns automatically. The operations team could focus on improving service instead of compiling reports.
Most importantly, the company finally had visibility into their own business. The CEO could answer investor questions with actual data. The CTO could make infrastructure decisions based on usage patterns. The operations team could identify and address service gaps proactively.
The Startup Reality
Working with early-stage startups teaches you things that enterprise projects don't.
First, you can't always get perfect requirements upfront. Startups are figuring out their business model while you're building their systems. You need to build for flexibility because what they think they need today will evolve as they learn more about their market.
Second, the constraint is rarely technical—it's organizational. The technology to build a data warehouse is well-understood. The challenge is getting information from busy founders, aligning stakeholders who have different priorities, and making decisions when the "right" answer depends on business context that's still forming.
Third, perfect is the enemy of done. We could have built a more sophisticated data platform with dbt, Airflow, data quality monitoring, and ML-based forecasting. But they needed reliable reporting first. We built the simplest thing that would actually work and left room to enhance it later as the business matured.
Fourth, stakeholder communication can make or break a project. The technical work was straightforward. The hard part was aligning with stakeholders who were overwhelmed running a startup and didn't always have time to engage with the data project. We succeeded despite this, but it would have been much faster with better communication.
Why Geographic Data Is Always Harder Than You Think
A brief technical aside about something that consumed more time than expected: integrating the GeoJSON data.
On paper, it's simple. You have points (ride pickups and dropoffs) and polygons (service zones). Just check which points fall inside which polygons. Standard GIS operation.
In practice, it's complicated. The GeoJSON files weren't always well-formed. Some polygons had inconsistent coordinate systems. Zone identifiers in the GeoJSON didn't always match the database. The app sometimes recorded coordinates slightly outside official boundaries due to GPS drift, but the ride clearly happened in that zone.
We ended up building sophisticated logic to handle edge cases. Buffer zones around boundaries to catch GPS drift. Fuzzy matching to connect GeoJSON zone identifiers to database zone names. Validation to catch malformed geographic data before it broke the pipeline.
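Two of those edge-case handlers can be sketched briefly. The buffer check treats a point within a small distance of any polygon edge as belonging to the zone (absorbing GPS drift), and the fuzzy matcher connects GeoJSON identifiers to database zone names via standard-library string similarity. Zone names and the buffer size below are illustrative, not the production values:

```python
import math
from difflib import get_close_matches

def _dist_to_segment(px, py, ax, ay, bx, by):
    """Distance from point P to segment AB (planar approximation,
    adequate at city scale for small buffers)."""
    dx, dy = bx - ax, by - ay
    if dx == dy == 0.0:
        return math.hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def near_boundary(lon, lat, ring, buffer_deg):
    """True if the point lies within `buffer_deg` of any polygon edge --
    the buffer used to absorb GPS drift just outside a zone."""
    n = len(ring)
    return any(
        _dist_to_segment(lon, lat, *ring[i], *ring[(i + 1) % n]) <= buffer_deg
        for i in range(n)
    )

def resolve_zone_name(geojson_id, db_zone_names):
    """Fuzzy-match a GeoJSON identifier (e.g. 'zone_altstadt') against
    database zone names; returns None when nothing is close enough."""
    key = geojson_id.lower().replace("zone_", "")
    matches = get_close_matches(key, [n.lower() for n in db_zone_names],
                                n=1, cutoff=0.6)
    if not matches:
        return None
    return next(n for n in db_zone_names if n.lower() == matches[0])

zone = [(10.0, 50.0), (10.1, 50.0), (10.1, 50.1), (10.0, 50.1)]
print(near_boundary(10.1004, 50.05, zone, buffer_deg=0.001))  # drifted just outside
print(resolve_zone_name("zone_altstadt", ["Altstadt", "Neustadt"]))
```

Rides failing both the point-in-polygon test and the buffer check were flagged for review rather than silently assigned, since zone assignment fed directly into billing.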
This work wasn't glamorous, but it was essential. Accurate zone assignment drove billing, which drove revenue. Getting this wrong meant either undercharging municipalities or incorrectly allocating costs. The business logic had to be bulletproof.
The Long-Term Impact
We handed over the system after several months of development and testing. The startup could finally operate based on data rather than intuition and spreadsheets.
The operational improvements were immediate. Monthly reporting went from consuming significant staff time to running automatically. Municipalities were happier with detailed, timely reports. The team could make data-driven decisions about route planning, pricing, and expansion.
The strategic impact took longer to manifest but was more significant. With reliable data, they could confidently approach new municipalities showing concrete metrics about service quality and costs. They could identify underperforming routes and adjust operations. They could forecast demand and plan driver scheduling more efficiently.
The system scaled as they grew. When they expanded to new cities, adding them to the reporting was straightforward because the infrastructure was designed for it. When they needed new analyses, the semantic layer made development fast because the hard work of data modeling was already done.
Perhaps most importantly, they professionalized their operations in a way that made them more investable. Investors care about unit economics, retention rates, and operational efficiency. Having actual data infrastructure to answer those questions credibly made fundraising conversations much easier.
What We'd Do Differently
Looking back, we'd push harder on communication structure upfront. Instead of ad-hoc Slack messages and sporadic meetings, we should have insisted on weekly structured check-ins with clear agendas and decision points. The flexibility felt polite at the time but created inefficiency.
We'd also build more extensive data validation earlier. We caught data quality issues during development, but systematic validation from day one would have surfaced problems faster. Automated alerts when data looked wrong would have saved debugging time.
The BI tool choice worked out well, but we considered building lighter-weight dashboards in Looker Studio or Metabase. For an early-stage startup with limited budget, that might have been sufficient. Holistics was more powerful but also more expensive. The ROI worked out, but it was a closer call than we initially thought.
If This Sounds Familiar
If you're running a startup and thinking "that sounds like us," you're not alone.
Most early-stage companies go through this phase. You've got customers, you're generating revenue, you have reporting obligations—but you're still running on spreadsheets and manual processes. You know you need better infrastructure, but you're not sure where to start or whether you can afford it.
The good news is that modern cloud platforms and tools make this solvable without massive investment. You don't need a data team of ten people. You don't need to spend six months building infrastructure before seeing value. You can build incrementally, starting with the most critical pieces and expanding as you grow.
The key is starting before it becomes a crisis. Before you lose a municipality contract because you can't report accurately. Before you make a major strategic decision based on wrong data. Before manual reporting consumes so much staff time that it limits your ability to grow.
Data infrastructure should enable your business, not slow it down. When done right, it amplifies what small teams can accomplish and provides the visibility founders need to make confident decisions.
Project Details:
Duration: 10 months of development + handover
Technologies: Google Cloud (BigQuery, Cloud Storage, Cloud Functions, Dataflow, Cloud Scheduler), Holistics
Industry: Transportation / Mobility
Team: 1 data engineer
Engagement: Fixed project with ongoing support option
Key Outcomes:
Automated daily data pipeline from production DB to analytics warehouse
Integrated geographic data (GeoJSON) with transactional data
Self-service BI layer for business users
Automated monthly reporting to municipalities
Real-time operational dashboards
Building something new and need data infrastructure that keeps up? We've helped enough startups navigate this phase to know what works when resources are tight and requirements are evolving.