Data Contracts in Behavioural Web Data Collection
Note: The views expressed in this blog are my own and do not necessarily reflect the opinions or positions of my employer.
Introduction
In today's data-driven world, businesses increasingly rely on the behavioural data of their customers to make informed decisions and gain insights. Different teams within an organization, including product, sales, marketing and data teams (such as data analysts and scientists), depend on this data for various reporting and activation purposes. This has led to a need for more and more data to be tracked, and to an eternal struggle to ensure not only that there is “enough” data, but also that the data is reliable, robust and collected in a privacy-compliant manner.
When data starts to be collected at scale, a company will often encounter three common problems, as described by Andrew Jones in his “Driving Data Culture Change With Data Contracts” presentation:
- Lack of expectations: Data can change without warning and without proper documentation. Without knowledge of where data is collected upstream, data consumers struggle to understand and utilize the collected data effectively.
- Lack of reliability: Upstream changes frequently break downstream applications, with the knock-on effect that trust in the data is slowly eroded over time.
- Lack of quality: The data collected upstream may not be fit for purpose for consumers, thereby increasing complexity and reducing the ability to activate on the collected data.
What is a Data Contract?
I was first introduced to the concept of the data contract, and to how it solves the above-mentioned issues within CDC microservices, by Andrew Jones at the London Analytics Meetup (highly recommend attending if you’re ever in London). Data contracts are a movement that is quickly gaining traction in the data engineering space.
A data contract, in its simplest form, is a set of predefined rules that define the structure, format and requirements for the data being exchanged or collected. The purpose of a data contract is twofold:
- To establish guidelines and specifications for consistent and accurate data collection.
- To standardize the data attributes, types, and values for easier analysis and insights derivation.
It is worth noting that a data contract goes far beyond rules for defining a schema and semantics. As noted in this Confluent documentation, data contracts can evolve and become more complex over time to cover the following elements:
- Structure: Defines the fields and their types, ensuring uniformity and consistency in the collected data.
- Integrity Constraints: Specifies validation rules to ensure data accuracy and validity (e.g., field must be an integer, greater than a value).
- Metadata: Includes additional information about the data such as identification of sensitive fields that require special handling.
- Rules or Policies: Defines any specific rules on how to treat a field such as hashing for added privacy.
- Change or Evolution: Provides mechanisms for accommodating future changes and updates to the data contract.
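To make the five elements above a little more tangible, here is a minimal sketch of what a contract for a single event could look like in plain Python. Everything in it is an assumption made for illustration: the `FieldSpec` class, the `PURCHASE_CONTRACT` definition, the field names and the `apply_contract` helper are hypothetical and are not taken from Confluent or any other tool mentioned in this post.

```python
import hashlib
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class FieldSpec:
    """One field in the contract (hypothetical, for illustration only)."""
    type_: type                                     # Structure: expected type
    required: bool = True
    check: Optional[Callable[[Any], bool]] = None   # Integrity constraint
    sensitive: bool = False                         # Metadata: needs special handling
    policy: Optional[str] = None                    # Rule or policy, e.g. "hash"


# Hypothetical contract for a "purchase" event.
PURCHASE_CONTRACT = {
    "version": 1,  # Change or evolution: bump whenever the contract changes
    "fields": {
        "event_name": FieldSpec(str, check=lambda v: v == "purchase"),
        "value": FieldSpec(float, check=lambda v: v > 0),
        "currency": FieldSpec(str, check=lambda v: len(v) == 3),
        "email": FieldSpec(str, required=False, sensitive=True, policy="hash"),
    },
}


def apply_contract(event: dict, contract: dict) -> tuple[dict, list[str]]:
    """Return (cleaned_event, errors); the event is usable only if errors is empty."""
    errors: list[str] = []
    cleaned: dict = {}
    fields = contract["fields"]

    # Anything the contract does not know about is reported, not silently kept.
    for name in event:
        if name not in fields:
            errors.append(f"unexpected field: {name}")

    for name, spec in fields.items():
        if name not in event:
            if spec.required:
                errors.append(f"missing required field: {name}")
            continue
        value = event[name]
        if not isinstance(value, spec.type_):
            errors.append(f"{name}: expected {spec.type_.__name__}")
            continue
        if spec.check is not None and not spec.check(value):
            errors.append(f"{name}: failed integrity check")
            continue
        if spec.policy == "hash":
            # Normalise before hashing so the same address always hashes the same way.
            value = hashlib.sha256(value.strip().lower().encode("utf-8")).hexdigest()
        cleaned[name] = value

    return cleaned, errors
```

Calling `apply_contract({"event_name": "purchase", "value": 49.99, "currency": "GBP", "email": "Jane@Example.com"}, PURCHASE_CONTRACT)` would return an empty error list and a cleaned event in which `email` has been replaced by its SHA-256 hash. An event with a negative `value` or an unknown field would come back with a non-empty error list instead.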
Data Contracts in Behavioural Web Data Collection
This led to some research and exploration into whether data contracts could be used to solve the same issues in behavioural web data collection. Theoretically this is possible because, in many instances, web behavioural data collection is the earliest upstream location and the rawest form of data, as behavioural data is often collected directly on the client side in the browser. And it turns out data contracts aren’t as new in web data collection as I thought. Many web analytics tools already offer some variation of data contract functionality to ensure that the collected data is in a fixed format:
- Snowplow Schemas
- Segment Protocols
- Adobe Experience Platform Schemas
Data Contracts in Google Tag Manager (GTM)
Google Tag Manager (GTM) does not currently have a native notion of data contracts, most likely for three reasons:
- Flexible data collection model
- Focus on ease of installation and quick setup process
- Less requirement for technical expertise to implement tracking (i.e. out-of-the-box tracking for basic use cases)
With the need for more tailored measurement and the gradual adoption of Server-side GTM, however, it is now possible to implement customized data contracts in GTM by leveraging new features such as transformations and the built-in ability to integrate directly with Google Cloud components such as Firestore and BigQuery.
The adoption of data contracts in GTM for your GA4 data collection offers several benefits:
- Foster a more data-driven organisation: Teams are encouraged to collaborate more closely and have open conversations about exactly what they’d like to measure and collect. This helps make data a first-class citizen in your organisation, as data is now created by intent as opposed to being a by-product of a new feature. This intent is key for a digitally mature organisation; less mature organisations tend to think less strategically about how to use their data effectively and typically adopt a “track everything” mentality, which causes them to place more importance on data quantity than data quality.
- Improved Data Quality: Data contracts can enhance the quality of your behavioural data by standardizing the structure, attributes and values of the collected data. By enforcing integrity constraints and validation rules, data contracts ensure accuracy, validity and the right format for data consumption downstream.
- Enhanced Privacy Compliance: Data contracts provide a framework for incorporating privacy requirements into the data collection process. By defining rules and policies for handling sensitive fields, such as anonymization or hashing, data contracts ensure compliance with privacy regulations, protecting customer privacy and building trust.
- Accommodating Change and Evolution: Data contracts address the challenge of evolving data sources by providing predefined mechanisms for managing change. This enables the integration of new data sources and modifications while maintaining data consistency and integrity.
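The last benefit, managing change, is easier to picture with a small example. The sketch below assumes a contract is stored as a plain dictionary with a `version` and a map of `fields`, and treats three kinds of change as breaking: removing a field, changing a field's type, and adding a new required field. The `breaking_changes` function and the contract layout are hypothetical, and the rules are just one possible policy, not how any particular tool decides compatibility.

```python
def breaking_changes(old: dict, new: dict) -> list[str]:
    """List changes in `new` that would break producers or consumers of `old`.

    Assumed contract shape (illustrative only):
    {"version": int, "fields": {name: {"type": str, "required": bool}}}
    """
    problems: list[str] = []

    if new["version"] <= old["version"]:
        problems.append("version must increase")

    # Consumers rely on existing fields keeping their name and type.
    for name, spec in old["fields"].items():
        if name not in new["fields"]:
            problems.append(f"removed field: {name}")
        elif new["fields"][name]["type"] != spec["type"]:
            problems.append(f"type changed: {name}")

    # Producers that still send the old shape would fail a new required field.
    for name, spec in new["fields"].items():
        if name not in old["fields"] and spec["required"]:
            problems.append(f"new required field: {name}")

    return problems


v1 = {"version": 1, "fields": {"value": {"type": "number", "required": True}}}

# Adding an optional field is treated as a safe change.
v2 = {
    "version": 2,
    "fields": {
        "value": {"type": "number", "required": True},
        "coupon": {"type": "string", "required": False},
    },
}

# Changing a type and adding a required field are both flagged.
v3 = {
    "version": 3,
    "fields": {
        "value": {"type": "string", "required": True},
        "currency": {"type": "string", "required": True},
    },
}

print(breaking_changes(v1, v2))  # []
print(breaking_changes(v1, v3))  # ['type changed: value', 'new required field: currency']
```

A check like this could run before a new contract version is published, so that breaking changes become a deliberate conversation between teams rather than a surprise downstream.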
You can find an example implementation of how data contracts can be built for Google Analytics data collection using a combination of some Server-side Google Tag Manager and Firestore magic in this GitHub repo.
Check out Andrew Jones’ new book if you’re interested in learning more about data contracts in the data engineering realm!