Tobias Macey & Ahmed Elsamadisi, Data Engineering Podcast

Removing The Barrier To Exploratory Analytics with Activity Schema and Narrator

29 Oct 2021 • 68 min • EN

Summary

The perennial question of data warehousing is how to model the information that you are storing. This has given rise to methods as varied as star and snowflake schemas, data vault modeling, and wide tables. The challenge with many of those approaches is that they are optimized for answering known questions but brittle and cumbersome when exploring unknowns. In this episode Ahmed Elsamadisi shares his journey to find a more flexible and universal data model in the form of the "activity schema" that is powering the Narrator platform, and how it has allowed his customers to perform self-service exploration of their business domains without being blocked by schema evolution in the data warehouse. This is a fascinating exploration of what can be done when you challenge your assumptions about what is possible.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.

Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.

Your host is Tobias Macey and today I’m interviewing Ahmed Elsamadisi about Narrator, a platform to enable anyone to go from question to data-driven decision in minutes.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Narrator is and the story behind it?
What are the challenges that you have seen organizations encounter when attempting to make analytics a self-serve capability?
What are the use cases that you are focused on?
How does Narrator fit within the data workflows of an organization?
How is the Narrator platform implemented?
How has the design and focus of the technology evolved since you first started working on Narrator?
The core element of the analyses that you are building is the "activity schema". Can you describe the design process that led you to that format?
What are the challenges that are posed by more widely used modeling techniques such as star/snowflake or data vault?
How does the activity schema address those challenges?
What are the performance characteristics of deriving models from an activity schema/timeseries table?
For someone who wants to use Narrator, what is involved in transforming their data to map into the activity schema?
Can you talk through the domain modeling that needs to happen when determining what entities and actions to capture?
What are the most interesting, innovative, or unexpected ways that you have seen Narrator used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Narrator?
When is Narrator the wrong choice?
What do you have planned for the future of Narrator?

Contact Info

LinkedIn
@ae4ai on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Narrator
DARPA Challenge
Fivetran
Luigi
Chartio
Airflow
Domain Driven Design
Data Vault
Snowflake Schema
Event Sourcing
Census (Podcast Episode)
Hightouch (Podcast Episode)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

From "Data Engineering Podcast"