On Data Democratization & Language: A Proposal for a Modern Data Experience
We founded Veezoo with a core belief: “Accessing information should be as easy as just asking for it.” And while this has always been our north star, it is more than just a belief – it’s a mission, a mission to fundamentally change the way organizations and individuals use, think about, and manage data.
When we seek information, we first formulate it as a question. Language is the medium of thought[1]. Throughout the data analytics industry, we use the metaphor of “asking questions” and “getting answers.” But in reality, the mainstream data interfaces we provide to our business users require them to first translate their question into a series of actions. Actions that require “know-how” – often acquired through various certifications – that add friction, distract from the process of finding the answer and end up hiding the true intent of the user.
And we convinced ourselves that that’s how it is supposed to be. That data exploration requires precision, which drag-and-drop interfaces provide. That the needs of business users can be fulfilled with the combination of dashboards, filters and next-generation pivot tables. That there is a fundamental trade-off between the depth of information and the ease-of-use of a data product. And therefore, we should err on the side of minimizing depth of information. That way, if business users still need more information, they can contact us, the data team – gatekeepers of the truth.
This is what we call “self-service”.
Aren’t we setting the bar a bit too low? Is this really data democratization in its ultimate form? In 100 years, is this how we will be accessing information? Is there really nothing else we can automate from our English-to-SQL work as data analysts?
This here is a manifesto of sorts, a set of guiding principles, a proposal for a modern data experience that puts language at the center of the data-driven organization with data democratization as the ultimate goal.
Data Democratization should be the ultimate goal of every data team
Absorbed in our routine, we may have forgotten, but the job of the data analyst in building data-driven organizations isn’t about creating the “perfect” dashboard or spoon-feeding data insights to the business.
It is about enabling rather than penalizing curiosity and initiative by individual users. It is about empowering employees to find the information they need to do their job better.
Unfortunately, this is not what we experience in our daily work. Instead, we experience a Curiosity Tax. When someone has a question, they either need to find their way through a frustrating drag-and-drop interface, ask the data team or, worse, fill out a long form justifying why they need this information. This means that only questions that support a “one-way door” decision get asked, and many potentially impactful questions never get asked (or answered) at all. This is problematic because business insights often happen through serendipity – one question leading to another and another, until something unexpected emerges. Or maybe it doesn’t. But either way, the business user becomes more familiar with the underlying data and more skilled at using it.
That is the power of true data democratization – when individual users can find insights for themselves that go beyond what the jungle of dashboards tells them to ask. Isn’t that what data teams should be aiming for?
Data Democratization is about bridging the gap between the language of the data and the language of the business user
Many arguments against self-service tend to confuse the ability to ask “great” questions and rigorously interpret data with the actual ability to find information. To be able to fulfill the need for more data insights, businesses look to hire great data analysts who can translate business needs into questions, questions into SQL queries, and results into meaningful visualizations. Finding these people is strictly harder than finding people who ask great questions and can interpret data.
The ability to understand data and its pitfalls – to be “data literate” – and to ask the right questions is an important skill set for decision makers, but it has nothing to do with knowing SQL, R, Python, or how to build reports in Tableau. In fact, data literacy is much more important than any Tableau certification or SQL Coursera diploma and should be actively promoted and encouraged. But we are afraid to give business users access to data before they have acquired such skills. What better way to motivate business users to develop these skills further than by providing them with easy, understandable and flexible access to information?
True self-service means allowing business people with these skills – and only these skills – to easily get answers to their questions by themselves – which is a must for any organization looking to succeed and scale.
But this can only happen when we are able to bridge the gap between the language of the user and the language of the data. The success of dbt is in part explained by this need to bring the data closer to its real meaning in a business sense. And, as Tristan Handy has asked, this may indeed bring us closer to a future where our data products literally speak our language. That is what we set out to build when we started Veezoo with the core belief that “accessing information should be as easy as just asking for it.”
A shared language is an absolute requirement for a data-driven organization
There are innumerable ways to describe every aspect of a business, from what a “trial user”, “signup” or “warm lead” means to which numbers are used to calculate LTV (lifetime value of a customer), CAC (customer acquisition cost), MRR (monthly recurring revenue), or any other three-letter MBA term.
In a truly data-driven company, if you ask anyone “how many active trial users do we have?”, they will all have the same understanding of what an “active trial user” is. And it is this alignment on language and meaning that separates data-driven organizations from the rest.
Data is meant to be an objective representation of reality that we can use to measure our efforts and analyze the results. But as individuals and teams begin creating their own terminology, the value of the data degrades. Soon, as Crystal Widjaja, the former SVP of Gojek, brilliantly puts it, “the lack of shared language renders the data useless.”
But a shared language is about a lot more than just consistent definitions of metrics. It’s more than documenting tables and columns in a data dictionary – it extends to all areas of the business and customer lifecycle, including all the “events” and “actions” that make up the customer journey. What exactly is a “successful sign up”, a “new account”, or an “abandoned cart”?
All of these things need to be agreed upon and documented directly where the business users access the data, not merely noted in some siloed company data glossary. To truly further data literacy and democratization, we need to know what the data we are working with means without constantly flipping back and forth between data tools and dictionaries.
Data Teams should model the meaning of data and the language of the company, not just build dashboards
The progress of technology (especially in the Information Age) is based on our ability to introduce abstractions that allow people from different domains of expertise to build upon each other’s work. We went from hardware and microchips to operating systems to software, like your browser, which somehow led to you reading this article… All without you needing to know how to solder a circuit.
And the path to data democratization is no different. It is the process of abstracting away “physical” data (in data warehouses, tables, columns, and SQL queries) and bringing things down to a conceptual level of what data represents semantically and how questions and our use of language map to the data.
Whenever we fail to simplify this abstraction and make it more accessible to others in different domains, we fail to make our organization less dependent on unnecessary technical know-how.
“If all you have is a hammer, everything looks like a dashboard.”
Instead, data teams in most companies out there are busy working on dashboards. Dashboards that are created to answer a specific question, made out of one-off raw SQL queries. Dashboards that rot, once the assumptions used to create them silently deprecate. Dashboards that proliferate like rats[2], since we don’t trust the old ones anymore or we can’t find them anymore or “this time I want something slightly different…” Until at last, in a well-meaning attempt to build on top of existing dashboards, they become even more complicated with filters upon filters. Then it’s not long before someone says: “can’t you just give me a simpler version?”
When did the data analyst become a glorified dashboard builder doing English-to-SQL translation on the side?
The solution is not to just take the dashboard and show it in portrait mode instead of landscape (like a Python Notebook – which we agree is at least better suited for data exploration). It is not to get rid of all dashboards either. Instead, the solution is for data teams to focus on modelling the true meaning of the data. To aspire to “100% of code written to be business logic” is not only to spend less time on infrastructure, but also to spend less time writing one-off SQL queries.
To “model the meaning of the data” is to model how language (as spoken by the business) maps to the “language of the data.” That means when we say “customers,” we mean those users with an active subscription. And when we say “active subscription”, we mean one that was not cancelled, which in turn means that ‘cancelled_at’ is NULL. But if we’re being honest, most people usually just say ‘subscription’ instead of ‘active subscription,’ so we need to account for that as well.
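To make this concrete, here is a minimal Python sketch of such a mapping. It is only an illustration, not how Veezoo or any particular tool implements it. The `users` and `subscriptions` tables, the `user_id` and `id` columns, and the names `Concept`, `VOCABULARY` and `count_query` are all assumptions made up for this example; only `cancelled_at` comes from the paragraph above.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Concept:
    """A business concept mapped onto physical data (hypothetical model)."""
    name: str
    table: str                        # assumed physical table name
    predicate: Optional[str] = None   # SQL condition that defines the concept


# "Active subscription" means one that was not cancelled.
ACTIVE_SUBSCRIPTION = Concept(
    name="active subscription",
    table="subscriptions",
    predicate="subscriptions.cancelled_at IS NULL",
)

# "Customer" reuses the definition of "active subscription" instead of
# repeating it, so a change to that definition propagates automatically.
CUSTOMER = Concept(
    name="customer",
    table="users",
    predicate=(
        "EXISTS (SELECT 1 FROM subscriptions "
        "WHERE subscriptions.user_id = users.id "  # assumed join columns
        f"AND {ACTIVE_SUBSCRIPTION.predicate})"
    ),
)

# The vocabulary records how people actually talk: most people just say
# "subscription" when they mean "active subscription".
VOCABULARY: Dict[str, Concept] = {
    "customer": CUSTOMER,
    "customers": CUSTOMER,
    "active subscription": ACTIVE_SUBSCRIPTION,
    "active subscriptions": ACTIVE_SUBSCRIPTION,
    "subscription": ACTIVE_SUBSCRIPTION,
    "subscriptions": ACTIVE_SUBSCRIPTION,
}


def count_query(phrase: str) -> str:
    """Translate 'how many <phrase>?' into SQL using the shared vocabulary."""
    concept = VOCABULARY[phrase.strip().lower()]
    sql = f"SELECT COUNT(*) FROM {concept.table}"
    if concept.predicate:
        sql += f" WHERE {concept.predicate}"
    return sql
```

With this in place, `count_query("customers")` and `count_query("subscriptions")` both resolve to the same `cancelled_at IS NULL` rule, defined exactly once. A real product would need far more than a dictionary lookup, but the point stands: the definitions live in one reviewable place rather than being scattered across one-off queries.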
Language is implicit – we don’t say everything we mean. It is compositional – we define terms so we can reuse them again in other definitions. Language is ambiguous too – it often requires multiple iterations to get to the true meaning of a question. A data product that aims to speak your language needs to not only acknowledge that, but embrace it.
Data Teams need powerful, opinionated tools that enforce best practices
If you are paying attention to what has been happening in our field, you will have noticed that data teams are adopting the best practices of Software Engineering: configuration-as-code, the DRY principle, version control, tests and documentation (and who knows, maybe one day the idea of “Ubiquitous Language”?). This is not a fad. Software development has grown more productive over the years, not only due to its inherent focus on reusability, but also due to the maturing of battle-tested processes and practices.
Configuration-as-code allows us to have complete transparency on what we build, while allowing us to better express complex logic. It enables us to version-control it, so we can easily roll back when we screw up and document why something changed and who changed it. It allows us to better review changes, make sure nothing gets broken, and collaborate the same way software engineers do. Finally, it gives us ownership over what we have built.
These best practices are here to stay, and they show that the data space is becoming more mature. We all have the folks at Looker and dbt to thank for their many contributions here, which inspired us when building Veezoo. Data products need to be both truly intuitive for business users asking questions and powerful enough for data teams to reliably build complex business logic.
Data Products need to be managed like, well, Products, not Services
Great product teams obsess about how users experience their creation. They have dozens of tracking tools looking at everything from clicks, time on site, and monthly active usage to failures in the signup and checkout processes, past orders and search history, and more… to allow them to perfect the UX.
The early product managers at Twitter, for instance, could say with almost 100% certainty that if a new user followed at least 10 people during their first sessions, they’d be hooked. That is the science of user behavior and the power of using data to optimize onboarding and product features.
Of course, we all know this. That’s what we have in our data warehouse. It’s the lifeblood of our craft in the data team. So, isn’t it ironic that when it comes to our own internal data products, we are practically blind? Most data products don’t provide reliable insights on how users actually use the product. Yeah, you might be able to see clicks and applied filters… but sussing out a user’s actual intent is near impossible without having to interview every department. So nobody bothers to check.
Without consistent, unbiased data on which information people are looking for, we’re forced to prioritize based on support tickets, self-determined urgency, and of course, rank. Sure, half the sales team might want to filter customers by last contact date, but unless they all submit support tickets at the same time (which often doesn’t happen because of the Curiosity Tax of asking for help), there is no way data teams can know what to prioritize.
The beauty of a data product that literally speaks the user’s language is that there is complete transparency on what the user is looking for. Not clicks or clueless hovering, but actual questions. Data teams are able to investigate exactly how users interact with their data products – like which questions business users are asking, which data attribute most people ask for but is not yet available, how well the tool is being adopted company-wide, who flagged which data as being wrong, and even which questions take the longest to answer, so you can tweak those database indices.
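As a rough sketch of what that investigation could look like, assume the data product keeps a log of the questions users ask. The `QuestionLog` record, its fields and both helper functions below are hypothetical, invented for this illustration rather than taken from any real product’s API.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class QuestionLog:
    """One question asked by a business user (hypothetical log record)."""
    user: str
    question: str
    answered: bool
    missing_attribute: Optional[str] = None  # asked for, but not modelled yet
    seconds_to_answer: float = 0.0


def most_requested_missing_attributes(
    logs: List[QuestionLog], top_n: int = 3
) -> List[Tuple[str, int]]:
    """Rank attributes that users asked about but that are not yet available."""
    counts = Counter(
        log.missing_attribute for log in logs if log.missing_attribute
    )
    return counts.most_common(top_n)


def slowest_questions(logs: List[QuestionLog], top_n: int = 3) -> List[QuestionLog]:
    """Return the answered questions that took the longest to compute."""
    answered = [log for log in logs if log.answered]
    return sorted(answered, key=lambda log: log.seconds_to_answer, reverse=True)[:top_n]
```

If half the sales team asks about “last contact date” and nobody files a ticket, `most_requested_missing_attributes` would still surface it at the top of the list, which is exactly the kind of signal a ticket queue never provides.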
Instead of operating like a service center reacting to JIRA tickets, data teams need to become missionary product teams proactively uncovering and solving the information needs of their organization, based on real data. To build on the “restaurant” analogy from Benn Stancil and Erik Bernhardsson: to achieve a modern data experience, we cannot be waiters who just wait for the orders and a tip at the end. We need to be that British butler who is always anticipating your needs and seems to read your mind.
Closing Thoughts
While the modern data stack is bringing more mature and reliable data processes, data democratization still seems to be just a mirage for many. Luckily, people are starting to realize that the solution to answering our most pressing business questions may not be yet-another-dashboard, despite what Gartner BI leaders tell you.
We don’t need yet another dashboard – we need to abolish the Curiosity Tax and put an end to endless data breadlines.
We need to define and promote a shared language within our organization.
We need a better way to understand and address the data needs of our organization.
That is true data democratization. And we will only achieve it if we give language the proper importance in a data-driven organization.
—
[1] That’s why we are so impressed with AI models like GPT-3 – we struggle to distinguish NLP from general AI. “And is there a difference?”, Turing would ask.
[2] Not very fair to rats, I actually like them. But “multiply like bunnies” just sounds too positive and… fluffy?