Airflow with Maxime Beauchemin - Episode 44

Visit our site to listen to past episodes, support the show, join our community, and sign up for our mailing list.

Summary

Are you struggling with trying to manage a series of related, interdependent batch jobs? Then you should check out Airflow. In this episode we spoke with the project’s creator Maxime Beauchemin about what inspired him to create it, how it works, and why you might want to use it. Airflow is a data pipeline management tool that will simplify how you build, deploy, and monitor your complex data processing tasks so that you can focus on getting the insights you need from your data.

linode-banner-sponsor-largeDo you want to try out some of the tools and applications that you heard about on Podcast.__init__? Do you have a side project that you want to share with the world? Check out Linode at linode.com/podcastinit or use the code podcastinit2018 and get a $20 credit to try out their fast and reliable Linux virtual servers. They’ve got lightning fast networking and SSD servers with plenty of power and storage to run whatever you want to experiment on.


On Hired software engineers & designers can get 5+ interview requests in a week and each offer has salary and equity upfront. With full time and contract opportunities available, users can view the offers and accept or reject them before talking to any company. Work with over 2,500 companies from startups to large public companies hailing from 12 major tech hubs in North America and Europe. Hired is totally free for users and If you get a job you’ll get a $2,000 “thank you” bonus. If you use our special link to signup, then that bonus will double to $4,000 when you accept a job. If you’re not looking for a job but know someone who is, you can refer them to Hired and get a $1,337 bonus when they accept a job.


Brief Introduction

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • Subscribe on iTunes, Stitcher, TuneIn or RSS
  • Follow us on Twitter or Google+
  • Give us feedback! Leave a review on iTunes, Tweet to us, send us an email or leave us a message on Google+
  • Join our community! Visit discourse.pythonpodcast.com for your opportunity to find out about upcoming guests, suggest questions, and propose show ideas.
  • I would like to thank everyone who has donated to the show. Your contributions help us make the show sustainable. For details on how to support the show you can visit our site at pythonpodcast.com
  • Linode is sponsoring us this week. Check them out at linode.com/podcastinit and get a $20 credit to try out their fast and reliable Linux virtual servers for your next project
  • I would also like to thank Hired, a job marketplace for developers and designers, for sponsoring this episode of Podcast.__init__. Use the link hired.com/podcastinit to double your signing bonus.
  • Your hosts as usual are Tobias Macey and Chris Patti
  • Today we are interviewing Maxime Beauchemin about his work on the Airflow project.

Interview with Maxime Beauchemin

  • Introductions
  • How did you get introduced to Python? – Chris
  • What is Airflow and what are some of the kinds of problems it can be used to solve? – Chris
  • What are some of the biggest challenges that you have seen when implementing a data pipeline with a workflow engine? – Tobias
  • What are some of the signs that a workflow engine is needed? – Tobias
  • Can you share some of the design and architecture of Airflow and how you arrived at those decisions? – Tobias
  • How does Airflow compare to other workflow management solutions, and why did you choose to write your own? – Chris
  • One of the features of Airflow that is emphasized in the documentation is the ability to dynamically generate pipelines. Can you describe how that works and why it is useful? – Tobias
  • For anyone who wants to get started with using Airflow, what are the infrastructure requirements? – Tobias
  • Airflow, like a number of the other tools in the space, support interoperability with Hadoop and its ecosystem. Can you elaborate on why JVM technologies have become so prevalent in the big data space and how Python fits into that overall problem domain? – Tobias
  • Airflow comes with a web UI for visualizing workflows, as do a few of the other Python workflow engines. Why is that an important feature for this kind of tool and what are some of the tasks and use cases that are supported in the Airflow web portal? – Tobias
  • One problem with data management is tracking the provenance of data as it is manipulated and shuttled between different systems. Does Airflow have any support for maintaining that kind of information and if not do you have recommendations for how practitioners can approach the issue? – Tobias
  • What other kinds of metadata can Airflow track as it executes tasks and what are some of the interesting uses you have seen or created for that information? – Tobias
  • With all the other languages competing for mindshare, what made you choose Python when you built Airflow? – Chris
  • I notice that Airflow supports Kerberos. It’s an incredibly capable security model but that comes at a high price in terms of complexity. What were the challenges and was it worth the additional implementation effort? – Chris
  • When does the data pipeline/workflow management paradigm break down and what other approaches or tools can be used in those cases? – Tobias
  • So, you wrote another tool recently called Panoramix. Can you describe what it is and maybe explain how it fits in the data management domain in relation to Airflow? – Tobias

Keep In Touch

Picks

The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA

Start the discussion at https://discourse.pythonpodcast.com