As data science becomes more widespread and has a greater impact on people's lives, it is important that those projects and products are built with conscious consideration of ethics. Keeping ethical principles in mind throughout the lifecycle of a data project reduces the overall effort required to prevent negative outcomes from the use of the final product. Emily Miller and Peter Bull of DrivenData have created Deon to improve the communication and conversation around ethics within and between data teams. It is a Python project that generates a checklist of common concerns for data-oriented projects, organized around the stages of the lifecycle where each should be considered. In this episode they discuss their motivation for creating the project, the challenges and benefits of maintaining such a checklist, and how you can start using it today.
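For reference, generating the checklist is a one-liner on the command line, following the project's documented usage; the output file name is just an example:

```bash
pip install deon
deon -o ETHICS.md   # write the default ethics checklist to a markdown file
```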
The need to process unbounded, continually streaming sources of data has become increasingly common. One of the popular platforms for implementing this is Kafka, along with its Streams API. Unfortunately, that requires all of your processing or microservice logic to be implemented in Java, so what’s a poor Python developer to do? If that developer is Ask Solem, of Celery fame, then the answer is to help re-implement the Streams API in Python. In this episode Ask describes how Faust got started, how it works under the covers, and how you can start using it today to process your fast-moving data in easy-to-understand Python code. He also discusses ways in which Faust might be able to replace your Celery workers, and all of the pieces that you can replace with your own plugins.
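For a sense of what that looks like in practice, here is a minimal sketch of a Faust agent modeled on the project's documented API; the app name, broker address, topic, and record fields are illustrative:

```python
import faust

app = faust.App('orders-app', broker='kafka://localhost:9092')

# a typed record describing the messages on the topic
class Order(faust.Record):
    account_id: str
    amount: float

orders_topic = app.topic('orders', value_type=Order)

# an agent is an async stream processor, analogous to a Kafka Streams processor
@app.agent(orders_topic)
async def process_order(orders):
    async for order in orders:
        print(f'{order.account_id} ordered {order.amount}')
```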
One of the challenges of machine learning is obtaining large enough volumes of well-labelled data. An approach to mitigate the effort required for labelling data sets is active learning, in which the samples the model is most uncertain about are identified and labelled by domain experts. In this episode Tivadar Danka describes how he built modAL to bring active learning to bioinformatics. He is using it for human-in-the-loop training of models that detect cell phenotypes in massive unlabelled datasets. He explains how the library works, how he designed it to be modular for a broad set of use cases, and how you can use it for training models of your own.
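As a rough illustration of that workflow, the sketch below follows modAL's documented ActiveLearner interface, with synthetic data standing in for a real unlabelled pool and an expert's labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from modAL.models import ActiveLearner

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(1000, 5))      # large unlabelled pool (illustrative)
X_initial = rng.normal(size=(10, 5))     # small labelled seed set
y_initial = rng.integers(0, 2, size=10)

learner = ActiveLearner(
    estimator=RandomForestClassifier(),
    X_training=X_initial,
    y_training=y_initial,
)

# query the pool sample the model is least certain about
# (uncertainty sampling is the default strategy)
query_idx, query_sample = learner.query(X_pool)

# in practice a domain expert supplies this label
expert_label = rng.integers(0, 2, size=len(query_idx))
learner.teach(X_pool[query_idx], expert_label)
```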
Testing is a critical activity in all software projects, but one that is often neglected in data pipelines. The complexities introduced by the inherent statefulness of the problem domain and the interdependencies between systems combine to make pipeline testing difficult to manage. To make this endeavor more manageable Abe Gong and James Campbell have created Great Expectations. In this episode they discuss how you can use the project to create tests in the exploratory phase of building a pipeline and leverage those to monitor your systems in production. They also discuss how Great Expectations works, the difficulties of pipeline testing and managing the associated technical debt, and their future plans for the project.
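A minimal sketch of that exploratory style, using expectation methods from the library's classic pandas-based interface; the file and column names are illustrative:

```python
import great_expectations as ge

# load a batch of data as a Great Expectations dataset
df = ge.read_csv('events.csv')

# declare expectations discovered while exploring the data
df.expect_column_values_to_not_be_null('user_id')
df.expect_column_values_to_be_between('duration_seconds', min_value=0, max_value=86400)

# validate the batch; the same expectations can later run against production data
results = df.validate()
print(results)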
The majority of the work that we do as programmers involves data manipulation of some kind. This can range from large-scale collection, aggregation, and statistical analysis across distributed systems to something as simple as making a graph in a spreadsheet. In the middle of that range is the general task of ETL (Extract, Transform, and Load), which has its own range of scale. In this episode Romain Dorgueil discusses his experiences building ETL systems and the problems that he routinely encountered that led him to create Bonobo, a lightweight, easy to use toolkit for data processing in Python 3. He also explains how the system works under the hood, how you can use it for your projects, and what he has planned for the future.
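A minimal sketch of a Bonobo pipeline, following the project's quickstart-style graph API; the extract, transform, and load functions are placeholders for real sources and sinks:

```python
import bonobo

def extract():
    # yield rows into the pipeline; replace with a real source (file, API, database)
    yield from ['alpha', 'beta', 'gamma']

def transform(row):
    yield row.upper()

def load(row):
    print(row)

def get_graph():
    graph = bonobo.Graph()
    graph.add_chain(extract, transform, load)
    return graph

if __name__ == '__main__':
    bonobo.run(get_graph())
```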
The way that your application handles data and the way that it is represented in your database don’t always match, leading to a lot of brittle abstractions to reconcile the two. To reduce that friction, instead of overwriting the state of your application on every change, you can log all of the events that take place and then render the current state from that sequence of events. John Bywater joins me this week to discuss his work on the Event Sourcing library, why you might want to use it in your applications, and how it can change the way that you think about your data.
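As a library-agnostic sketch of the idea (this is not the Event Sourcing library's API, just the underlying pattern), the example below appends events to a log and derives the current state by replaying them:

```python
from dataclasses import dataclass

@dataclass
class BalanceChanged:
    account_id: str
    delta: int

# the append-only event log is the source of truth
event_log: list[BalanceChanged] = []

def deposit(account_id: str, amount: int) -> None:
    event_log.append(BalanceChanged(account_id, amount))

def withdraw(account_id: str, amount: int) -> None:
    event_log.append(BalanceChanged(account_id, -amount))

def current_balance(account_id: str) -> int:
    # current state is a projection over the stored events, not a mutated row
    return sum(e.delta for e in event_log if e.account_id == account_id)

deposit('acct-1', 100)
withdraw('acct-1', 30)
print(current_balance('acct-1'))  # 70
```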
Analyzing and interpreting data is a large portion of the work involved in scientific research. Getting to that point can be a lot of work on its own because of all of the steps required to download, clean, and organize the data prior to analysis. This week Henry Senyondo talks about the work he is doing with Data Retriever to make data preparation as easy as “retriever install”.
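For example, a typical command-line session might look like the following; the dataset name is only illustrative:

```bash
pip install retriever
retriever ls                  # list the datasets the project knows how to fetch
retriever install csv iris    # download, clean, and load a dataset as CSV files
```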
The notebook format that has been exemplified by the IPython/Jupyter project has gained popularity among data scientists. While the existing formats have proven their value, they still suffer from difficulties with collaboration and maintainability. Scott Ernst created the Cauldron notebook to be testable, production ready, and friendly to version control. This week we explore the capabilities, use cases, and architecture of Cauldron and how you can start using it today!
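As a very rough sketch of what a Cauldron step file can look like, assuming the cd.display and cd.shared interfaces described in the project's documentation (details may differ between versions):

```python
import cauldron as cd
import pandas as pd

# each step is a plain Python file, which keeps it friendly to version control
df = pd.DataFrame({'sample': ['a', 'b'], 'value': [1, 2]})

cd.shared.df = df                     # share the frame with later steps
cd.display.markdown('# Results')      # render markdown into the notebook display
cd.display.table(df)                  # render the frame as a table
```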