Semantic Layer: To build or not to build, that is the question!
“A semantic layer is a governed business representation of corporate data that helps end-users access data autonomously using common business terms. A semantic layer maps complex data into familiar business terms such as product, customer, or revenue to offer a unified, consolidated view of data across the organization.” (Wikipedia)
In the past few conversations at conferences and at work, I have repeatedly been asked why a semantic layer is needed when a structured data access layer is already available. My questions back were “How much of your organization depends on IT to answer its reporting/analytics needs?” and “Are you truly self-service if your dependence on IT to do everything for you is so high?” Don’t mistake my questions for a knock against IT. I believe in the power of truly democratizing data (see my Data Democratization article) and putting data into the hands of end-users so that they can ask questions, pivot the data easily, and get answers themselves. That is why a semantic layer is needed: it is the answer to making your organization truly data-driven.
Let’s now walk through the core components of building a successful semantic layer. My Data Democratization article covers a lot of this, but here we can go deeper into some of the points it mentions.
Where would the layer reside?
The semantic layer is a business view on top of the underlying data, so I would put it on the database where your data lake/data warehouse resides. Some organizations rely on their reporting tools to host the semantic layer. However, most enterprises have multiple reporting tools, and if only one of them has the semantic layer, you run into inconsistent metrics and definitions across reports. A semantic layer built on the database is reporting-tool agnostic, as it becomes the single reference point for all reporting, analytics, and data science. Modern data platforms also provide a lot of data science and ML capabilities (no-code to full code) for analysts and data scientists, so having the semantic layer on one of these platforms is a win-win and fits right into your strategy for a modern data architecture.
What must the layer contain?
- Data fields/columns that are easily readable as business names.
- Granular as well as aggregate data, with pivots that make it easy to get to answers.
- More denormalized data structures that contain the pre-defined KPIs and metrics, named and defined as in the business glossary. The data must tie back to the single version of truth for each metric.
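To make the list above concrete, here is a minimal sketch of a semantic layer built as database views, using Python's built-in sqlite3 module as a stand-in for a real warehouse. The table names (cust_mstr, ord_fct), the view names, and the sample rows are all hypothetical and only illustrate the idea of business-friendly names, a granular view, and an aggregate view carrying a pre-defined KPI.

```python
import sqlite3

# In-memory database standing in for the data warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical source tables with technical column names.
    CREATE TABLE cust_mstr (cust_id INTEGER PRIMARY KEY, cust_nm TEXT);
    CREATE TABLE ord_fct (ord_id INTEGER PRIMARY KEY, cust_id INTEGER,
                          ord_dt TEXT, ord_amt REAL);

    INSERT INTO cust_mstr VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO ord_fct VALUES
        (10, 1, '2024-01-05', 100.0),
        (11, 1, '2024-02-10', 50.0),
        (12, 2, '2024-02-11', 75.0);

    -- Granular, denormalized view with business-friendly names.
    CREATE VIEW customer_orders AS
    SELECT c.cust_nm AS customer_name,
           o.ord_id  AS order_id,
           o.ord_dt  AS order_date,
           o.ord_amt AS revenue
    FROM ord_fct o
    JOIN cust_mstr c ON c.cust_id = o.cust_id;

    -- Aggregate view carrying pre-defined KPIs per customer.
    CREATE VIEW customer_revenue AS
    SELECT customer_name,
           COUNT(*)     AS order_count,
           SUM(revenue) AS total_revenue
    FROM customer_orders
    GROUP BY customer_name;
""")

# Any reporting tool would query the views, never the raw tables.
rows = conn.execute(
    "SELECT customer_name, order_count, total_revenue "
    "FROM customer_revenue ORDER BY customer_name"
).fetchall()
print(rows)
```

Because the KPI logic lives in the view, every reporting tool that reads customer_revenue gets the same numbers.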
Data Quality, Governance, and Privacy
Data Quality is a core requirement of the semantic layer. This layer becomes the front end for all data-driven business decision making, so quality needs to be maintained to ensure that the data and analytics products coming out of it can be trusted. Data Profiling and Data Validation scripts need to run periodically against a baseline to ensure that the quality stays consistent. Alerts and Thresholds need to be set on key data elements and metrics to flag anomalies caused by either a change in the data profile or a data issue.
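As a rough illustration of the thresholds and alerts described above, the sketch below compares a current data profile against a baseline and reports any metric that drifts too far. The Threshold class, the check_profile function, the metric names, and the numbers are all made up for this example; a real setup would pull the profile from the warehouse and route alerts to a monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    metric: str            # name of the profiled measure
    baseline: float        # expected value (assumed non-zero here)
    max_pct_change: float  # allowed deviation from the baseline, in percent


def check_profile(profile, thresholds):
    """Return one alert message per metric that is missing or out of range."""
    alerts = []
    for t in thresholds:
        if t.metric not in profile:
            alerts.append(f"{t.metric}: missing from current profile")
            continue
        pct_change = abs(profile[t.metric] - t.baseline) / abs(t.baseline) * 100
        if pct_change > t.max_pct_change:
            alerts.append(
                f"{t.metric}: {pct_change:.0f}% off baseline "
                f"(limit {t.max_pct_change:.0f}%)"
            )
    return alerts


# Hypothetical baseline: about 1000 rows per load, 2% of emails missing.
thresholds = [
    Threshold("row_count", baseline=1000, max_pct_change=10),
    Threshold("email_null_rate", baseline=0.02, max_pct_change=50),
]
current_profile = {"row_count": 1020, "email_null_rate": 0.08}

alerts = check_profile(current_profile, thresholds)
for alert in alerts:
    print(alert)
```

Here the row count is within tolerance, while the jump in missing emails raises an alert.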
Data Governance requires that the semantic layer have documented data lineage, a data dictionary, and a business glossary. Enterprise KPIs and metrics are pre-defined on the semantic layer, so it is very important to have them defined in a business glossary with the appropriate data stewards. Lineage allows you to always point back to the source of the data and confirm that the correct source is being used as the version of truth. The data dictionary documents the artifacts (tables, columns, etc.) on the layer and becomes an easy reference guide for end-users navigating the platform. The business glossary contains the business definitions of the enterprise KPIs and metrics used within the organization. Please note that each metric must have exactly one definition for the enterprise to avoid inconsistency in reporting.
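The "one definition per metric" rule can be enforced mechanically. The sketch below is a toy business glossary that rejects a second definition for a metric that already exists; the BusinessGlossary class, its fields, and the sample entries are hypothetical, and a real glossary would normally live in a data catalog or governance tool rather than in application code.

```python
class BusinessGlossary:
    """Toy glossary that allows exactly one definition per metric."""

    def __init__(self):
        self._entries = {}

    def define(self, metric, definition, steward, source):
        # Normalize the name so "Active Customers" and "active customers" collide.
        key = metric.strip().lower()
        if key in self._entries:
            raise ValueError(f"'{metric}' already has a definition")
        self._entries[key] = {
            "definition": definition,
            "steward": steward,  # who owns the definition
            "source": source,    # lineage: where the metric is computed from
        }

    def lookup(self, metric):
        return self._entries[metric.strip().lower()]


glossary = BusinessGlossary()
glossary.define(
    "Active Customers",
    "Distinct customers with at least one purchase in the last 12 months",
    steward="Marketing Analytics",
    source="customer_orders",
)

# A second team trying to redefine the same metric is rejected.
try:
    glossary.define(
        "active customers",
        "Customers who logged in this month",
        steward="Product",
        source="web_sessions",
    )
    duplicate_rejected = False
except ValueError as err:
    duplicate_rejected = True
    print(err)
```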
Data Privacy ensures that data access is controlled and managed according to the level of access each user should have. For example, customer names and email addresses are important for the team responsible for email marketing campaigns, but they are not necessary for someone who just needs the count of customers who made a purchase in the last 12 months. Access to the layer needs to be based on roles within the organization. CCPA and GDPR have made this kind of access control a priority within organizations, specifically for customer data. Please ensure that this is managed with the utmost priority.
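The email marketing example above can be sketched as a simple role-to-column mapping. The role names, column names, and the visible_rows helper below are assumptions made for illustration; in practice this would typically be enforced in the database or data platform itself (grants, secured views, column-level security) rather than in application code.

```python
# Hypothetical mapping of roles to the columns they are allowed to see.
ROLE_COLUMNS = {
    "email_marketing": {"customer_id", "customer_name", "email"},
    "analyst": {"customer_id", "last_purchase_date"},
}


def visible_rows(rows, role):
    """Return the rows with only the columns the given role may access."""
    allowed = ROLE_COLUMNS.get(role, set())  # unknown roles see nothing
    return [{k: v for k, v in row.items() if k in allowed} for row in rows]


customers = [
    {"customer_id": 1, "customer_name": "Ada", "email": "ada@example.com",
     "last_purchase_date": "2024-03-01"},
    {"customer_id": 2, "customer_name": "Bob", "email": "bob@example.com",
     "last_purchase_date": "2022-11-15"},
]

# The analyst can still count customers without ever seeing names or emails.
analyst_view = visible_rows(customers, "analyst")
print(len(analyst_view), sorted(analyst_view[0].keys()))
```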
Data Literacy
I won’t go into too much detail on this topic as I intend to write a full post about it, but suffice it to say that to increase user adoption of the semantic layer and reduce data and reporting proliferation, you need a data literacy plan in place. At a high level, this means educating the organization on where the data exists, how it can be used and accessed, and which tools can be used to access it. This can be achieved through regularly scheduled meet-ups with data and non-data users that walk them through the platform, showcase its capabilities and features, explain governance and privacy and why they matter, and of course train people in how to use and access the data. This training will include SQL training, reporting tools training, and training on the different data subject areas and data pivots, if any. Data Literacy is key in shifting the culture toward becoming a data-driven organization.
Option to Buy
There are now tools/platforms available that provide Semantic Layer functionality. In some cases the Semantic Layer has to be built out on the vendor platform, while in others it sits on top of your data layer. Notable vendors include ThoughtSpot, AtScale, MachEye, and Kyvos Insights. Also check out dbt and Transform, which provide a metrics layer on the transform layer itself (I am still reading about this and may do a follow-up blog on it). Do ensure that you have a thorough understanding of each vendor's benefits and limitations relative to your business needs.
If you want to talk more or share input/feedback about enabling data-driven organizations or any of the topics above, contact me at thedatawall@gmail.com.
If you’re interested in connecting, follow me on LinkedIn and Medium.