One of the challenges of machine learning is obtaining large enough volumes of well labelled data. An approach to mitigate the effort required for labelling data sets is active learning, in which outliers are identified and labelled by domain experts. In this episode Tivadar Danka describes how he built modAL to bring active learning to bioinformatics. He is using it for doing human in the loop training of models to detect cell phenotypes with massive unlabelled datasets. He explains how the library works, how he designed it to be modular for a broad set of use cases, and how you can use it for training models of your own.
With libraries such as Tensorflow, PyTorch, scikit-learn, and MXNet being released it is easier than ever to start a deep learning project. Unfortunately, it is still difficult to manage scaling and reproduction of training for these projects. Mourad Mourafiq built Polyaxon on top of Kubernetes to address this shortcoming. In this episode he shares his reasons for starting the project, how it works, and how you can start using it today.
Most applications require data to operate on in order to function, but sometimes that data is hard to come by, so why not just make it up? Mimesis is a library for randomly generating data of different types, such as names, addresses, and credit card numbers, so that you can use it for testing, anonymizing real data, or for placeholders. This week Nikita Sobolev discusses how the project got started, the challenges that it has posed, and how you can use it in your applications.
Learning to program is a rewarding pursuit, but is often challenging. One of the roadblocks on the way to proficiency is getting a development environment installed and configured. In order to simplify that process Aivar Annamaa built Thonny, a Python IDE designed for beginning programmers. In this episode he discusses his initial motivations for starting Thonny and how it helps newcomers to Python learn and understand how to write software.
Maintaining a consistent taxonomy for your music library is a challenging and time consuming endeavor. Eventually you end up with a mess of folders and files with inconsistent names and missing metadata. Beets is built to solve this problem by programmatically managing the tags and directory structure for all of your music files and providing a fast lookup when you are trying to find that perfect song to play. Adrian Sampson began the project because he was trying to clean up his own music collection and in this episode he discusses how the project was built, how streaming media is affecting our relationship to digital music, and how he envisions Beets position in the ecosystem in the future.
Determining the best way to manage the capacity and flow of goods through a system is a complicated issue and can be exceedingly expensive to get wrong. Rather than experimenting with the physical objects to determine the optimal algorithm for managing the logistics of everything from global shipping lanes to your local bank, it is better to do that analysis in a simulation. Ruud van der Ham has been working in this area for the majority of his professional life at the Dutch port of Rotterdam. Using his acquired domain knowledge he wrote Salabim as a library to assist others in writing detailed simulations of their own and make logistical analysis of real world systems accessible to anyone with a Python interpreter.
Every piece of software that has been around long enough ends up with some piece of it that needs to be redesigned and refactored. Often the code that needs to be updated is part of the critical path through the system, increasing the risks associated with any change. One way around this problem is to compare the results of the new code against the existing logic to ensure that you aren’t introducing regressions. This week Joe Alcorn shares his work on Laboratory, how the engineers at GitHub inspired him to create it as an analog to the Scientist gem, and how he is using it for his day job.
One of the draws of Python is how dynamic and flexible the language can be. Sometimes, that flexibility can be problematic if the format of variables at various parts of your program is unclear or the descriptions are inaccurate. The growing middle ground is to use type annotations as a way of providing some verification of the format of data as it flows through your application and enforcing gradual typing. To make it simpler to get started with type hinting, Carl Meyer and Matt Page, along with other engineers at Instagram, created MonkeyType to analyze your code as it runs and generate the type annotations. In this episode they explain how that process works, how it has helped them reduce bugs in their code, and how you can start using it today.
A majority of the work that we do as programmers involves data manipulation in some manner. This can range from large scale collection, aggregation, and statistical analysis across distrbuted systems, or it can be as simple as making a graph in a spreadsheet. In the middle of that range is the general task of ETL (Extract, Transform, and Load) which has its own range of scale. In this episode Romain Dorgueil discusses his experiences building ETL systems and the problems that he routinely encountered that led him to creating Bonobo, a lightweight, easy to use toolkit for data processing in Python 3. He also explains how the system works under the hood, how you can use it for your projects, and what he has planned for the future.
Data mining and visualization are important skills to have in the modern era, regardless of your job responsibilities. In order to make it easier to learn and use these techniques and technologies Blaž Zupan and Janez Demšar, along with many others, have created Orange. In this episode they explain how they built a visual programming interface for creating data analysis and machine learning workflows to simplify the work of gaining insights from the myriad data sources that are available. They discuss the history of the project, how it is built, the challenges that they have faced, and how they plan on growing and improving it in the future.