Dwarfs standing on the shoulders of giants:
Since the 2000s, some companies have grown massively with the rise of the Internet. They have become Digital Giants and often share with the rest of society the difficult technology challenges they face. Sometimes, solutions are contributed back to the public through publications or open-source activities, giving birth to strong communities built on worldwide collaboration.
At AIscalate, we are not proposing an original idea or just another technology stack; we are addressing a concrete pain point in the industry, and we are actually standing on the shoulders of such giants and communities. We are taking pieces of their visions, architecture designs, lessons learned and software in the Open-Source world.
Our mission is to stitch all of those together to craft and distribute a platform that enables any dwarf to see further than the giants.
"Notebooks have rapidly grown in popularity among data scientists to become the de facto standard for quick prototyping and exploratory analysis."
The following text is extracted from Netflix Technology Blog:
Project Jupyter began in 2014 with a goal of creating a consistent set of open-source tools for scientific research, reproducible workflows, computational narratives, and data analytics. Those tools translated well to industry, and today Jupyter notebooks have become an essential part of the data scientist toolkit.
Some of its core functionalities come from its composable architecture that separates where content is written (the UI) and where code is executed (the kernel). By isolating the runtime from the interface, notebooks can span multiple languages while maintaining flexibility in how the execution environment is configured.
Backing all this is a file format that stores both code and results together. This means results can be accessed later without needing to rerun the code. In addition, the notebook stores rich prose to give context to what’s happening within the notebook. This makes it an ideal format for communicating business context, documenting assumptions, annotating code, describing conclusions, and more.
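To make this concrete, here is a minimal sketch of that file format using only the standard library. The structure follows the public nbformat 4 JSON schema; the cell contents are purely illustrative.

```python
import json

# A minimal sketch of the .ipynb format: a JSON document in which each
# code cell stores its source *and* its outputs side by side, plus
# markdown cells carrying the rich prose. Field names follow nbformat 4.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {"kernelspec": {"name": "python3", "language": "python"}},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["## Revenue check\n", "Assumes the table was loaded upstream."],
        },
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["total = 40 + 2\n", "total"],
            # Results are persisted next to the code, so they can be
            # read later without re-running anything.
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 1,
                    "data": {"text/plain": ["42"]},
                    "metadata": {},
                }
            ],
        },
    ],
}

serialized = json.dumps(notebook, indent=1)
print(len(json.loads(serialized)["cells"]))  # 2
```

Because code, outputs and prose live in one JSON document, a consumer can read the stored result ("42") without ever executing the code.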
Therefore, Jupyter Notebooks are an obvious choice for AIscalate to power prototyping and exploratory analysis. Our challenge now is to provide a way to easily turn them into robust software solutions to deploy in production.
As its data-related applications mature, every organization eventually hits the need to manage workflows and schedule automated tasks. Many resort to ad-hoc scripting (hard to maintain in the long term) or implement their own system; from there, some open their work to the public, while others reuse what is already available on the market.
Airflow was one of those tools, originally developed in 2014 by Airbnb (Airbnb Engineering) to manage their data-based operations with a fast-growing data set. It has been undergoing incubation at the Apache Software Foundation since 2016, after Airbnb decided to open-source it under the Apache license. Its relevance in the modern data ecosystem is growing steadily: it is now one of the workflow management systems attracting the most attention from the community and is used in numerous companies, including several of the biggest unicorns (Spotify, Lyft, Airbnb, Stripe, and others), to power data engineering at massive scale.
Apache Airflow makes your workflow simpler and more organized by allowing you to divide it into small independent task units: easy to organize and easy to schedule. Your entire workflow can be expressed as a DAG (Directed Acyclic Graph). Once your workflows are defined as code, they become more maintainable (and Python makes them simple to write). With the feature-rich user interface, your workflow pipelines can be easily visualized, monitored and fixed. Finally, they are more easily scheduled and managed than cron jobs.
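The core idea can be sketched with the standard library alone. This is deliberately not Airflow's API (which uses operators inside a DAG object): it only models the principle of small task units whose dependencies form a Directed Acyclic Graph and which run in dependency order.

```python
from graphlib import TopologicalSorter

# A stdlib-only sketch of the concept (not Airflow's actual API):
# a workflow split into small, independent task units whose
# dependencies form a DAG, executed in topological order.
log = []

tasks = {
    "extract":   lambda: log.append("extract"),
    "transform": lambda: log.append("transform"),
    "load":      lambda: log.append("load"),
}

# dependencies: task -> set of upstream tasks it waits for
dag = {"transform": {"extract"}, "load": {"transform"}}

for name in TopologicalSorter(dag).static_order():
    tasks[name]()

print(log)  # ['extract', 'transform', 'load']
```

Each unit stays small and independently testable; the scheduler only needs the dependency graph to know what can run and when.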
Advanced meta data engineering framework:
Maxime Beauchemin, creator of Airflow explains: "in Airflow, you instantiate operators to create tasks. This means it's more natural to create tasks dynamically. This becomes really important if you want to build workflows dynamically from code (which you should sometimes!). A very simple example of that would be an Airflow script that reads a YAML config file with a list of table names, and creates a little workflow for each table, that may do things like loading the table into a target database, perhaps apply rules from the config file around sampling, data retention, anonymization, ... Now you have this abstraction where you can add entries to the config file to create new chunks of workflows without doing much work. It turns out there are tons of use cases for this type of approach. At Airbnb, the most complex use case for this is around experimentation and A/B testing."
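Beauchemin's pattern can be sketched in plain Python. He describes a YAML config file; to keep this self-contained it uses an inline dict instead, and the table names, sampling and anonymization rules are purely illustrative. In real Airflow, each generated step would instantiate an operator inside a DAG.

```python
# Config-driven workflow generation: adding an entry to the config
# creates a new chunk of workflow with no new code.
config = {
    "tables": [
        {"name": "users",  "sample_pct": 10,  "anonymize": True},
        {"name": "orders", "sample_pct": 100, "anonymize": False},
    ]
}

def build_workflows(cfg):
    """Generate a little per-table workflow (here, a list of step names)
    from the config, applying its sampling and anonymization rules."""
    workflows = {}
    for table in cfg["tables"]:
        steps = [f"load_{table['name']}_into_target"]
        if table["sample_pct"] < 100:
            steps.append(f"sample_{table['name']}_{table['sample_pct']}pct")
        if table["anonymize"]:
            steps.append(f"anonymize_{table['name']}")
        workflows[table["name"]] = steps
    return workflows

workflows = build_workflows(config)
print(workflows["users"])
# ['load_users_into_target', 'sample_users_10pct', 'anonymize_users']
```

The abstraction lives entirely in `build_workflows`; everything table-specific lives in data, which is what makes this approach scale to tons of use cases.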
Airflow is leading the data engineering space and blends perfectly into the modern data ecosystem, catapulting other disciplines such as Data Science, Analytics and Business Intelligence up the data maturity curve. So it is a perfectly logical next step to AIscalate data prototypes into production.
Docker is a software technology providing containers, promoted by the company Docker, Inc.; it was released as open source in March 2013.
Docker is made for running software packages called "containers". Containers are isolated from each other and bundle their own tools, libraries and configuration files; they can communicate with each other through well-defined channels. All containers are run by a single operating system kernel and are thus more lightweight than virtual machines. Containers are created from "images" that specify their precise contents. Images are often created by combining and modifying standard images downloaded from public repositories.
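As an illustration of the image/container relationship, here is a minimal Dockerfile sketch. The base image and package names are illustrative assumptions, not AIscalate's actual setup: it pulls a standard image from a public repository and modifies it into a precisely specified environment.

```dockerfile
# Start from a standard image downloaded from a public repository (Docker Hub)
FROM jupyter/scipy-notebook:latest

# Modify it: bundle the project's own libraries into the image
# (package list is illustrative)
RUN pip install --no-cache-dir papermill

# The image now specifies its precise contents; every container
# created from it starts from the exact same environment
COPY . /home/jovyan/work
```

Every `docker run` on this image yields an isolated container with identical tools, libraries and configuration, which is exactly the reproducibility property the next section relies on.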
Docker for Data Science
If you have tried to install and set up a deep learning framework (e.g., CNTK, TensorFlow) on your machine, you will agree that it is challenging, to say the least. The proverbial stars need to align to make sure the dependencies and requirements are satisfied for all the different frameworks that you want to explore and experiment with. Getting the right Anaconda distribution, the correct version of Python, setting up the paths, installing the correct versions of different packages, and ensuring the installation does not interfere with other Python-based installations on your system is not a trivial exercise. Using a Docker image saves us this trouble as it provides a pre-configured environment ready to start work in. Even if you manage to get the framework installed and running on your machine, every time there's a new release, something could inadvertently break. Making Docker your development environment shields your project from these version changes until you are ready to upgrade your code to make it compatible with the newer version.
Using Docker also makes sharing projects with others a painless process. You don't have to worry about environments not being compatible, missing dependencies or even platform conflicts. When sharing a project via a container you are not only sharing your code but your development environment as well, ensuring that your script can be reliably executed and your work faithfully reproduced. Furthermore, since your work is already containerized, you can easily deploy it using services such as Kubernetes or Swarm to scale it up.
Reproducible experimentation is key to AIscalate work from a proof of concept in a local environment on a laptop to a shared development server (for continuous integration) or into a production system serving an API endpoint.
The following text is extracted from Netflix Technology Blog:
Papermill is an nteract library built for configurable and reliable execution of notebooks with production ecosystems in mind. It takes a notebook and some parameter inputs, then executes the requested notebook with the rendered input. As each cell executes, it saves the resulting artifact to an isolated output notebook.
Papermill enables a paradigm change in how you work with notebook documents. Since Papermill doesn’t modify the source notebook, we get a functional property added to our definition of work — something which is normally missing in the notebook space. Our inputs, a notebook JSON document, and our input parameters are treated as immutable records for execution that produce an immutable output document. That single output document provides the executed code, the outputs, and logs from each code cell, and a repeatable template which can be easily rerun at any point in the future.
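This functional property can be modeled with the standard library. Papermill's real entry point is `papermill.execute_notebook(input_path, output_path, parameters=...)`; the sketch below is not that API, only a simplified model of its behavior: immutable notebook plus immutable parameters in, new executed document out, source untouched.

```python
import copy

# Simplified model of Papermill's functional property: the source
# notebook and the parameters are immutable inputs; execution
# produces a brand-new output document.
def execute(notebook, parameters):
    out = copy.deepcopy(notebook)            # never modify the source
    out["cells"].insert(0, {                 # injected parameters cell
        "cell_type": "code",
        "source": [f"{k} = {v!r}" for k, v in parameters.items()],
        "outputs": [],
    })
    for cell in out["cells"]:
        if cell["cell_type"] == "code":
            cell["outputs"] = ["<executed>"]  # stand-in for real execution
    return out

source = {"cells": [{"cell_type": "code", "source": ["run()"], "outputs": []}]}
result = execute(source, {"run_date": "2018-11-01"})

print(source["cells"][0]["outputs"])  # [] -- the input is unchanged
print(len(result["cells"]))           # 2
```

Because the source notebook is never mutated, the same template can be rerun at any point in the future with different parameters, each run yielding its own immutable output document.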
Papermill enables Software Engineering best practices for notebooks
In software engineering, continuous integration (CI) is the practice of merging all developer working copies to a single, shared mainline several times a day. CI was intended to be used in combination with automated unit tests written through the practices of test-driven development. Initially, this was conceived of as running and passing all unit tests in the developer's local environment before committing to the mainline. This helps avoid one developer's work-in-progress breaking another developer's copy. Later elaborations of the concept introduced build servers, which automatically ran the unit tests periodically or even after every commit and reported the results to the developers.

In addition to automated unit tests, organizations using CI typically use a build server to implement continuous processes of applying quality control in general: small pieces of effort, applied frequently. Beyond running the unit and integration tests, such processes run additional static and dynamic tests, measure and profile performance, extract and format documentation from the source code and facilitate manual QA processes.

This continuous application of quality control aims to improve the quality of software and to reduce the time taken to deliver it, by replacing the traditional practice of applying quality control after completing all development. This is very similar to the original idea of integrating more frequently to make integration easier, only applied to QA processes.
In the same vein, the practice of continuous delivery further extends CI by making sure the software checked in on the mainline is always in a state that can be deployed to users and makes the actual deployment process very rapid.
In data applications, changes in both code and data can break models. Papermill makes it possible to check for these automatically and continuously, and to document any errors for further debugging, which is invaluable for production systems.