There are these rare no-nonsense people. I meet them sometimes in my projects. Their power is focus. Their superpower is execution: "This is what we need to fix in the next three months. Everything we do must contribute to this."
They make things simple by limiting the options. It's convention over configuration (hello, Rails or Django).
The modern data stack is all about configuration.
It's super flexible for any future scenario. It gives you the freedom to configure everything to fit your business model and operations 100% (assuming you know them 100%). Add an Avengers-like team of analytics engineers, and the sky is the limit.
But when my focus is to learn where we lose leads or customers, is it OK that I only get these results in 12 weeks because everything first needs to be configured?
When I want to see if the new feature idea gets me more subscriptions, I don't want to wait four weeks for a new data model and dashboard.
Please don't get me wrong: an effective, well-organized data team with data ops can get these results much quicker at scale. But then we are talking about Airbnb, Netflix & Uber scale (the companies that write a lot about their setups).
But what about the others?
The modern data stack is an excellent option for bigger data teams with data ops; they can scale and monitor all that configuration. Everyone else needs a different answer.
Introducing the simple data stack
Going back to our Rails example: what are the conventions that can speed things up?
- no (or low) data transformation and modeling
- managed data operations
- simple (no-code) funnel, cohort, and segmentation analysis
When you get this out of the box, one person can deliver the insights that create quick returns:
- Understanding your business model and scaling mechanics
- How many potential customers do you need to win long-term customers, and where can you find them?
- Instant feedback on whether business/marketing/product experiments are worth further investment
- Where you can save costs (find the clutter, aka unprofitable operations)
OK, so am I suggesting we move into a closed, proprietary platform? No, not at all - we keep the system open and flexible; we just add some shortcuts.
The simple architecture
Decoupling
When systems get too complex, they slow down. With the modern data stack, this can happen pretty quickly. One strategy against this is decoupling. In our case, we introduce a fast and a slow track:
Fast track - delivering the metrics we need for growth decisions: data for experimentation and for moving forward quickly. The single source of growth truth.
Slow track - persisting the company's data brain: a well-thought-out data warehouse infrastructure. The single source of all truth.
Focus
Yes, it is partly about tools, but issues in data stacks mainly appear because of too much and too many - and not because of too much data.
Too many events. Too many special business rules. Too many compromises. Too many lines of SQL. Too many people working on parts of the solution.
Focus is essential. Do less, but do it great. OK, that sounds like cheesy pub philosophy, but it's pretty accurate in this case.
Data schema
We introduce a hierarchy for the data we collect. Business core events are our backbone.
These events are designed carefully, tracked from reliable sources, monitored, and match the operational data 100% (no more missing transactions). They don't need to be questioned because everyone would immediately know if there is an issue with one of them.
Product and UX-related events are essential for feature development. But they are not watched all the time - usually only while a team works on the feature. Important features should get a schema and monitoring. Anything else can be tracked more loosely (yes, even with auto-tracking).
For teams with an existing event setup - see this as an event diet. Any system bigger than 20-30 events becomes hard to use and maintain.
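To make the hierarchy concrete, here is a minimal sketch of what the contract for one business core event could look like. The event name, the properties, and the small validation helper are illustrative assumptions, not a fixed standard.

```python
# Illustrative contract for one business core event.
# Event name and properties are assumptions for this example.
SUBSCRIPTION_STARTED = {
    "name": "Subscription Started",
    "source": "backend",          # core events come from reliable sources
    "required_properties": {
        "plan": str,              # e.g. "starter", "pro"
        "billing_interval": str,  # "monthly" or "yearly"
        "mrr_amount": float,      # must match the billing/operational data
        "currency": str,
    },
}

def validate_event(contract: dict, properties: dict) -> list:
    """Return a list of problems so a broken core event is noticed immediately."""
    problems = []
    for key, expected_type in contract["required_properties"].items():
        if key not in properties:
            problems.append("missing property: " + key)
        elif not isinstance(properties[key], expected_type):
            problems.append("wrong type for " + key)
    return problems
```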
Data collection
We use one layer for all event data, whether we receive it from the frontend, the backend, or SaaS tools.
From there, the data is passed on to different systems. And yes, we also load it into a data warehouse. Why? Because it's easy and cheap, and there are use cases where it comes in handy (see later).
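As a rough sketch of what "one layer" means in practice (using Segment's Python library here; Rudderstack's SDK looks almost identical), every producer just sends a track call, and routing the event to the analytics tool and the warehouse is pure configuration in the collection tool:

```python
import analytics  # Segment's analytics-python package

analytics.write_key = "YOUR_WRITE_KEY"  # placeholder

# One track call from the backend; the collection layer fans the event out
# to the analytics tool and the data warehouse - no extra pipeline code here.
analytics.track(
    user_id="user_123",  # example id
    event="Subscription Started",
    properties={
        "plan": "pro",
        "billing_interval": "monthly",
        "mrr_amount": 49.0,
        "currency": "EUR",
    },
)
analytics.flush()  # make sure the queued event is sent before the process exits
```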
The unusual thing here is the SaaS tools. More and more business data is generated outside of your own system - in a CRM, customer success, customer support, or subscription tool. You can integrate these using webhooks: the tools send relevant event data to your endpoints, and you collect it there (a sketch follows below).
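Here is a minimal sketch of such a webhook endpoint, assuming Flask as the web framework and a hypothetical payload shape from the CRM; the tool posts an event, and we forward it into the same collection layer:

```python
import analytics
from flask import Flask, request

app = Flask(__name__)
analytics.write_key = "YOUR_WRITE_KEY"  # placeholder

@app.route("/webhooks/crm", methods=["POST"])
def crm_webhook():
    # Hypothetical payload shape - adjust to what your CRM actually sends.
    payload = request.get_json(force=True)
    analytics.track(
        user_id=payload["customer_id"],
        event=payload["event_name"],  # e.g. "Deal Won"
        properties=payload.get("properties", {}),
    )
    return "", 204
```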
Data activation
For most companies, there are these core functions where data can immediately help:
- Visualize the customer journey funnel (and in cohorts to see improvements over time) - this tells you where you need to focus
- Show if growth experiments (the work on marketing, sales, or product features) change the funnel (aka business outcome)
- Segmentation, segmentation, segmentation to find over- and under-performers within these reports (this helps you with optimization)
Use a tool that lets you create these reports easily. Beyond the tool, everything else is just setting up a process.
Stack examples
Pretty easy.
Data schema:
Avo or Segment Protocols (I recommend Avo since it is vendor-agnostic and has more collaboration and testing features)
Data collection:
Segment or Rudderstack (Jitsu is another option but relatively new; MParticle is interesting but targeted more at enterprises). Rudderstack and Jitsu both offer an open-source version. What about Snowplow? We will talk about Snowplow later.
Data activation:
The classics:
Amplitude or Mixpanel (Amplitude offers more for experimentation as an add-on).
The challengers:
Heap or Posthog: both offer auto-tracking, which can be interesting for product feature analysis. Posthog is also open source.
Extending the stack
Tool-wise, this is nothing spectacularly new. We have used this stack for years.
What has changed for me is the clear focus on tracking the customer lifecycle touchpoints, wherever they happen - if they happen in a CRM, we use webhooks to track them.
The tools themselves are old news. But now, let's talk about the extensions. We all love add-ons, don't we?
Enrich with backend data
Your application database usually holds information that can be valuable for segmentation. Imagine you offer a data integration product like Fivetran. I would pull out:
- the number of rows loaded
- the number of sources
- what destinations
- ...
With Reverse ETL, I can get these into my analytics tool (see the sketch below).
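As a sketch - the table and column names are made up for this example - the enrichment can be a single warehouse query whose result a Reverse ETL tool syncs to the user or account properties in the analytics tool:

```python
# Illustrative warehouse query; table and column names are assumptions.
# A Reverse ETL tool would run this on a schedule and map each column
# to a user/account property in the analytics tool.
ENRICHMENT_QUERY = """
SELECT
    account_id,
    COUNT(DISTINCT source_id)      AS num_sources,
    COUNT(DISTINCT destination_id) AS num_destinations,
    SUM(rows_loaded)               AS rows_loaded_30d
FROM sync_runs
WHERE run_at >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY account_id
"""
```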
Control and enrich the event data before it enters
In one use case I am implementing, we are using Snowplow for the event pipeline. It enforces a schema and gives us some enrichment out of the box. The data ends up in the database, and we send it from there into our analytics tool via Reverse ETL.
Marketing cost attribution
Marketing cost attribution is a complex topic by itself, but you can start with some simple implementations. The calculation can happen in your database.
Ideas:
- Calculate the number of signups and the campaign costs for each campaign over a week, divide the two, and push the result back to the user properties with Reverse ETL (see the sketch after this list)
- Map campaign data to channel data in your database and push it back
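A minimal sketch for the first idea - again with assumed table and column names - is one query per week whose result goes back to the campaign or user properties via Reverse ETL:

```python
# Illustrative cost-per-signup calculation; table and column names are assumptions.
COST_PER_SIGNUP_QUERY = """
WITH signups AS (
    SELECT campaign_id, COUNT(*) AS signups
    FROM events
    WHERE event_name = 'Signed Up'
      AND event_time >= DATE_TRUNC('week', CURRENT_DATE)
    GROUP BY campaign_id
),
costs AS (
    SELECT campaign_id, SUM(cost) AS cost
    FROM ad_spend
    WHERE spend_date >= DATE_TRUNC('week', CURRENT_DATE)
    GROUP BY campaign_id
)
SELECT
    s.campaign_id,
    c.cost / NULLIF(s.signups, 0) AS cost_per_signup
FROM signups s
JOIN costs c USING (campaign_id)
"""
```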
What’s next
I am trying out a lot to see how far a simple data stack can be extended. But the core works really well. Especially for companies with no dedicated data people, it's better to start simple and move forward. You can hire someone to set up a modern data stack for you, but you can't extend and maintain it afterward.
The modern data stack setup is "easy" to implement and hard to maintain.
The simple data stack setup is straightforward to maintain. That's why I like it.
Do you work with a similar approach, or at least something similar to some extent? Or do you think the modern data stack is far easier to maintain than I described? Let me know - just hit the reply button.