Great Expectations For Your Data Pipelines with Abe Gong and James Campbell - Episode 161

Summary

Testing is a critical activity in all software projects, but one that is often neglected in data pipelines. The complexities introduced by the inherent statefulness of the problem domain and the interdependencies between systems contribute to make pipeline testing difficult to manage. To make this endeavor more manageable Abe Gong and James Campbell have created Great Expectations. In this episode they discuss how you can use the project to create tests in the exploratory phase of building a pipeline and leverage those to monitor your systems in production. They also discussed how Great Expectations works, the difficulties associated with pipeline testing and managing associated technical debt, and their future plans for the project.

linode-banner-sponsor-largeDo you want to try out some of the tools and applications that you heard about on Podcast.__init__? Do you have a side project that you want to share with the world? Check out Linode at linode.com/podcastinit or use the code podcastinit2018 and get a $20 credit to try out their fast and reliable Linux virtual servers. They’ve got lightning fast networking and SSD servers with plenty of power and storage to run whatever you want to experiment on.


GoCD is the on-premise open source continuous delivery server created by ThoughtWorks and modeled after the ideas in the Continuous Delivery book by Jez Humble and David Farley.

With GoCD’s comprehensive pipeline modeling, you can model complex workflows for multiple teams with ease. And GoCD’s Value Stream Map lets you track a change from commit to deploy at a glance.

GoCD’s real power is in the visibility it provides over your end-to-end workflow. So you get complete control of and visibility into your deployments, across multiple teams.

Say goodbye to deployment panic and hello to consistent, predictable deliveries.

To learn more about GoCD, visit gocd.org for a free download. Professional Support and enterprise add-ons, including disaster recovery, are available.


When your website experiences an error, Airbrake alerts you in real-time, and gives you all the details you need to fix the bug fast. Some of the most useful tools that Airbrake provides for faster resolution are:

  • Exception aggregation to understand the how many users are being affected, so that you can prioritize the work you are doing to have the biggest impact
  • Contextual information to understand how certain states of the application contribute to the exception being raised, and which environments are affected
  • Deployment tracking so that you can easily see whether a new feature is the source of an error
  • Integration with all of the other tools that you use, such as automatically creating issues in GitHub and linking to the line of code where the error came from

Right now, Podcast.__init__ listeners can try Airbrake free for 30 days, plus get 50% off the first 3 months on the Startup plan. To get started, visit airbrake.com/podcastinit today.


Preface

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 200Gbit network, all controlled by a brand new API you’ve got everything you need to scale up. Go to podcastinit.com/linode to get a $20 credit and launch a new server in under a minute.
  • Finding a bug in production is never a fun experience, especially when your users find it first. Airbrake error monitoring ensures that you will always be the first to know so you can deploy a fix before anyone is impacted. With open source agents for Python 2 and 3 it’s easy to get started, and the automatic aggregations, contextual information, and deployment tracking ensure that you don’t waste time pinpointing what went wrong. Go to podcastinit.com/airbrake today to sign up and get your first 30 days free, and 50% off 3 months of the Startup plan.
  • To get worry-free releases download GoCD, the open source continous delivery server built by Thoughworks. You can use their pipeline modeling and value stream map to build, control and monitor every step from commit to deployment in one place. And with their new Kubernetes integration it’s even easier to deploy and scale your build agents. Go to podcastinit.com/gocd to learn more about their professional support services and enterprise add-ons.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email [email protected]
  • Your host as usual is Tobias Macey and today I’m interviewing James Campbell and Abe Gong about Great Expectations, a tool for testing the data in your analytics pipelines

Interview

  • Introduction
  • How did you first get introduced to Python?
  • What is Great Expectations and what was your motivation for starting it?
  • What are some of the complexities associated with testing analytics pipelines?
    • What types of tests can be executed to ensure data integrity and accuracy?
  • What are some examples of the potential impact of pipeline debt?
  • What is Great Expectations and how does it simplify the process of building and executing pipeline tests?
  • What are some examples of the types of tests that can be built with Great Expectations?
  • For someone getting started with Great Expectations what does the workflow look like?
  • What was your reason for using Python for building it?
    • How does the choice of language benefit or hinder the contexts in which Great Expectations can be used?
  • What are some cases where Great Expectations would not be usable or useful?
  • What have been some of the most challenging aspects of building and using Great Expectations?
  • What are your hopes for Great Expectations going forward?

Contact Info

Picks

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Start the discussion at https://discourse.pythonpodcast.com