AIscalate

Dwarfs standing on the shoulders of giants:

Since the 2000s, some companies have grown massively with the rise of the Internet. They have become digital giants, and they often share the hard technology challenges they face with the rest of society. Sometimes their solutions are contributed back to the public through publications or open-source projects, giving birth to strong communities that collaborate worldwide.

At AIscalate, we are not proposing an original idea or just a technology stack. We are addressing a concrete pain in the industry, and we are standing on the shoulders of these giants and communities: we borrow pieces of their visions, architecture designs, lessons learned, and open-source software.

Our mission is to stitch these pieces together into a platform that helps any dwarf see further than the giants.

Powered By:

"Notebooks have rapidly grown in popularity among data scientists to become the de facto standard for quick prototyping and exploratory analysis."

The following text is extracted from Netflix Technology Blog:

Project Jupyter began in 2014 with a goal of creating a consistent set of open-source tools for scientific research, reproducible workflows, computational narratives, and data analytics. Those tools translated well to industry, and today Jupyter notebooks have become an essential part of the data scientist toolkit.

Some of its core functionality comes from its composable architecture, which separates where content is written (the UI) from where code is executed (the kernel). By isolating the runtime from the interface, notebooks can span multiple languages while maintaining flexibility in how the execution environment is configured.

Backing all this is a file format that stores both code and results together. This means results can be accessed later without needing to rerun the code. In addition, the notebook stores rich prose to give context to what’s happening within the notebook. This makes it an ideal format for communicating business context, documenting assumptions, annotating code, describing conclusions, and more.
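To make the "code and results together" point concrete, here is a minimal sketch using only the Python standard library. The notebook below is hand-written and simplified; it follows the general layout of Jupyter's version 4 notebook format as we understand it, and the cell contents are made up for illustration.

```python
import json

# A hand-written, simplified notebook document (illustrative content only).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Weekly revenue check\n", "Assumes amounts are in USD."],
        },
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["sum([120, 80, 45])"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 1,
                    "metadata": {},
                    "data": {"text/plain": ["245"]},
                }
            ],
        },
    ],
}


def stored_results(nb):
    """Return the saved text outputs of every code cell, without running any code."""
    results = []
    for cell in nb["cells"]:
        if cell["cell_type"] != "code":
            continue
        for output in cell.get("outputs", []):
            text = output.get("data", {}).get("text/plain")
            if text is not None:
                results.append("".join(text))
    return results


# Round-trip through JSON text, as if the .ipynb file were saved and reopened later.
reloaded = json.loads(json.dumps(notebook))
print(stored_results(reloaded))
```

The result of the code cell is read back from the document itself; no kernel is started and nothing is re-executed.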

Therefore, Jupyter Notebooks are an obvious choice in AIscalate to power prototyping and exploratory analysis. Our challenge now is to provide an easy way to turn them into robust software solutions that can be deployed in production.

After maturing several data-related applications, every organization eventually needs to manage workflows and schedule its automated tasks. Many turn to scripting (hard to manage in the long term) or implement their own system. From there, some open their work to the public, while others reuse what is already available on the market.

Airflow was one of those tools, originally developed in 2014 by Airbnb (Airbnb Engineering) to manage its data-based operations on a fast-growing data set. It entered incubation at the Apache Software Foundation in 2016, after Airbnb open-sourced it under the Apache License. Its relevance in the modern data ecosystem is growing steadily: it is one of the workflow management systems attracting the most attention from the community, and it powers data engineering at massive scale in numerous companies, including several of the biggest unicorns such as Spotify, Lyft, Airbnb, and Stripe.

Apache Airflow makes your workflow simpler and better organized by letting you divide it into small, independent task units that are easy to organize and easy to schedule. Your entire workflow can be expressed as a DAG (directed acyclic graph). Once your workflows are defined in code, they become more maintainable (and Python makes them even simpler to write down). With the feature-rich user interface, your workflow pipelines can be easily visualized, monitored, and fixed. Finally, they are easier to schedule and manage than cron jobs.
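As a rough illustration of the DAG idea only, the sketch below uses Python's standard library (graphlib, Python 3.9+) rather than Airflow itself. The task names are hypothetical; in Airflow, each of them would be a task attached to a DAG object and run by the scheduler.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task names; each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "train_model": {"clean"},
    "build_report": {"clean"},
    "publish": {"train_model", "build_report"},
}


def execution_order(deps):
    """Return one valid run order; graphlib raises CycleError if the graph has a cycle."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Because the graph is acyclic, every task can be placed after all of the tasks it depends on. "train_model" and "build_report" do not depend on each other, so a scheduler is free to run them in either order or in parallel.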

Advanced meta data engineering framework:

Maxime Beauchemin, creator of Airflow, explains: "In Airflow, you instantiate operators to create tasks. This means it's more natural to create tasks dynamically. This becomes really important if you want to build workflows dynamically from code (which you should sometimes!). A very simple example of that would be an Airflow script that reads a YAML config file with a list of table names, and creates a little workflow for each table, that may do things like loading the table into a target database, perhaps apply rules from the config file around sampling, data retention, anonymization, ... Now you have this abstraction where you can add entries to the config file to create new chunks of workflows without doing much work. It turns out there are tons of use cases for this type of approach. At Airbnb, the most complex use case for this is around experimentation and A/B testing."
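The pattern described in this quote can be sketched in plain Python. This is a simplified illustration, not Airflow code: the dictionary stands in for the parsed YAML config file, and the table names, option keys, and task names are all hypothetical. In Airflow, each generated task name would instead be an operator instance.

```python
# Stand-in for the parsed YAML config file; table names and options are hypothetical.
config = {
    "tables": [
        {"name": "bookings", "sample_rate": 0.1},
        {"name": "users", "anonymize": ["email"], "retention_days": 90},
    ]
}


def build_workflow(table):
    """Return the ordered list of task names generated for one config entry."""
    name = table["name"]
    tasks = [f"load_{name}"]
    if "sample_rate" in table:
        tasks.append(f"sample_{name}")
    if "anonymize" in table:
        tasks.append(f"anonymize_{name}")
    if "retention_days" in table:
        tasks.append(f"apply_retention_{name}")
    return tasks


# One small workflow per table: adding a config entry adds a workflow, with no new code.
workflows = {table["name"]: build_workflow(table) for table in config["tables"]}
for name, tasks in workflows.items():
    print(name, "->", " >> ".join(tasks))
```

The design choice is that the code describes how to build a workflow, while the config file decides which workflows exist.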

Docker for Data Science

If you have tried to install and set up a deep learning framework (e.g. CNTK, TensorFlow, etc.) on your machine, you will agree that it is challenging, to say the least. The proverbial stars need to align to make sure the dependencies and requirements are satisfied for all the different frameworks that you want to explore and experiment with. Getting the right Anaconda distribution and the correct version of Python, setting up the paths, installing the correct versions of different packages, and ensuring the installation does not interfere with other Python-based installations on your system is not a trivial exercise. Using a Docker image saves us this trouble, as it provides a pre-configured environment that is ready to work in. Even if you manage to get the framework installed and running on your machine, every new release could inadvertently break something. Making Docker your development environment shields your project from these version changes until you are ready to upgrade your code to make it compatible with the newer version.

Using Docker also makes sharing projects with others a painless process. You don’t have to worry about incompatible environments, missing dependencies, or even platform conflicts. When sharing a project via a container, you are sharing not only your code but your development environment as well, ensuring that your script can be reliably executed and your work faithfully reproduced. Furthermore, since your work is already containerized, you can easily deploy it using services such as Kubernetes, Swarm, etc. to scale it up.

Reproducible experimentation is key to AIscalating work from a proof of concept on a laptop to a shared development server (for continuous integration, for example) or to a production system serving an API endpoint.

The following text is extracted from Netflix Technology Blog:

Papermill is an nteract library built for configurable and reliable execution of notebooks with production ecosystems in mind. It takes a notebook and some parameter inputs. Then, it executes the requested notebook with the rendered input. As each cell executes, it saves the resulting artifact to an isolated output notebook.

Papermill enables a paradigm change in how you work with notebook documents. Since Papermill doesn’t modify the source notebook, we get a functional property added to our definition of work — something which is normally missing in the notebook space. Our inputs, a notebook JSON document, and our input parameters are treated as immutable records for execution that produce an immutable output document. That single output document provides the executed code, the outputs, and logs from each code cell, and a repeatable template which can be easily rerun at any point in the future.
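The immutability property can be sketched with the standard library alone. The code below is not Papermill's implementation and does not execute anything: it only mimics, under our own simplifying assumptions, the step where parameters are injected into a copy of the notebook while the source stays untouched. The function name inject_parameters and the sample cells are hypothetical.

```python
import copy


def inject_parameters(notebook, parameters):
    """Return a copy of `notebook` with an extra code cell that assigns `parameters`.

    The source notebook dict is never modified.
    """
    output = copy.deepcopy(notebook)
    injected = {
        "cell_type": "code",
        "metadata": {"tags": ["injected-parameters"]},
        "source": "".join(f"{name} = {value!r}\n" for name, value in parameters.items()),
        "outputs": [],
        "execution_count": None,
    }
    # Place the new cell right after the cell tagged "parameters", or first if none exists.
    position = 0
    for index, cell in enumerate(output["cells"]):
        if "parameters" in cell.get("metadata", {}).get("tags", []):
            position = index + 1
            break
    output["cells"].insert(position, injected)
    return output


# A tiny template notebook: default parameters, then a cell that uses them.
template = {
    "cells": [
        {"cell_type": "code", "metadata": {"tags": ["parameters"]},
         "source": "region = 'us'\n", "outputs": [], "execution_count": None},
        {"cell_type": "code", "metadata": {},
         "source": "print(region)\n", "outputs": [], "execution_count": None},
    ]
}

run = inject_parameters(template, {"region": "eu", "days": 7})
print(run["cells"][1]["source"])
print(len(template["cells"]), len(run["cells"]))
```

With the library itself, the equivalent step is, to our knowledge, a call such as papermill.execute_notebook(input_path, output_path, parameters={...}), which also runs the cells and writes the results to the output notebook.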

Papermill enables Software Engineering best practices for notebooks

In software engineering, continuous integration (CI) is the practice of merging all developer working copies to a single, shared mainline several times a day. CI was intended to be used in combination with automated unit tests written through the practices of test-driven development. Initially, this was conceived of as running and passing all unit tests in the developer's local environment before committing to the mainline. This helps avoid one developer's work-in-progress breaking another developer's copy. Later elaborations of the concept introduced build servers, which automatically ran the unit tests periodically or even after every commit and reported the results to the developers. In addition to automated unit tests, organizations using CI typically use a build server to implement continuous processes of applying quality control in general — small pieces of effort, applied frequently. In addition to running the unit and integration tests, such processes run additional static and dynamic tests, measure and profile performance, extract and format documentation from the source code and facilitate manual QA processes. This continuous application of quality control aims to improve the quality of software and to reduce the time taken to deliver it, by replacing the traditional practice of applying quality control after completing all development. This is very similar to the original idea of integrating more frequently to make integration easier, only applied to QA processes.

In the same vein, the practice of continuous delivery further extends CI by making sure the software checked in on the mainline is always in a state that can be deployed to users and makes the actual deployment process very rapid.

In data applications, changes in either code or data can break models. Papermill makes it possible to check both automatically and continuously, and to document any errors for further debugging, which is invaluable for production systems.

Register and start AIscalating your data applications.

Inspired by:

Netflix's vision of future data tools for all kinds of users:

Michelle Ufford, Engineering Manager at Netflix, leads the Big Data tools team. At JupyterCon 2018, she shared how Netflix leverages notebooks today and described a brief vision for the future:

Bet #1: Jupyter Notebooks are the future
Bet #2: Jupyter Notebooks can do more than interactive
Bet #3: Jupyter Notebooks can cross the technical chasm

From that keynote, it is pretty clear that Netflix is pouring a lot of resources into notebooks and that they have become a core technology for its teams. You can find many more details on their motivations, architecture, and approaches on the Netflix Technology Blog: Beyond Interactive: Notebook Innovation, and especially on how to schedule notebooks into production.

Our mission:

At AIscalate, we are trying to reproduce a similar ecosystem, but outside of Netflix's infrastructure. We will build tools and manage services that help anyone get access to such technologies in their own environment.

In the end, we are pursuing similar objectives and sharing some of the same open-source tools as Netflix. If you would like to join this exciting journey and push the boundaries with your own teams of Data Engineers, Data Scientists, Data Analysts, Project Managers, etc., then don't forget to sign up!

Lessons learned on production workflows with Jupyter Notebooks

At JupyterCon 2017, Andrew Therriault (former Chief Data Officer of the City of Boston) walked through several production workflows, covering their pros and cons with real-world examples from his own experience: Just because you can stay in Jupyter, should you?

Our takeaways:

Notebooks are indeed very powerful and can handle a lot of use cases. You can prototype a full data application inside a notebook and even deploy it directly to production.

This is a very desirable property: it means you can turn experimental work around quickly and try many things fast, which is key to developing better models. In the long term, notebooks are not good at building libraries, so they might not be the best approach; at that point, you might want to start thinking about building real software instead. Notebooks then end up being integration tools.

Nevertheless, isn't it exciting to have the opportunity to choose? You can fail fast and avoid optimizing prematurely. Then, when the need arises, you can start thinking about scaling, following a better-defined and more rigorous process. This is why we believe notebooks have such a bright future ahead, supporting the full development cycle.

To witness this exciting transition, join us by signing up. And if you disagree, please do let us know.

The Business Intelligence team rebranded as Data Science Department and their hot potatoes

In a blog post from Stitch Fix's technology team titled Engineers Shouldn’t Write ETL: A Guide to Building a High Functioning Data Science Department, Jeff Magnusson, VP of Data Platform, describes how typical Data Science departments spiral down into vicious cycles of frustration.

He shares his opinion, along with some of his own observations, that this is probably due to organizations following a "Thinker and Doer" model, in which Doers are solely accountable for the implementation, failure, and support of other people’s ideas, while the Thinker is rewarded for their success.

Data scientists iterate quickly, experimenting with numerous algorithms and methods that may or may not work, and leave many dead exploration branches in their deliverables. Once they have produced impressive one-off graphics to promote their machine learning results in meetings with business teams or in a scientific paper, their work is passed on to the "software development" teams, along with specification documents and unwieldy prototypes or proofs of concept that are hardly reproducible. The assumption is that, just like in an assembly-line hand-off, engineers should eagerly implement the ideas and vision of the data scientists.

The end result is a team of data scientists who are empowered to be little more than report developers producing PowerPoint decks and dashboards, because they lack the support of a solid, innovative data platform. In the end, prototypes are rarely implemented successfully in production systems, as illustrated by countless surveys and articles reporting alarmingly high failure rates for Big Data and Data Science projects.

Values of a high functioning department

Jeff Magnusson goes on to explain the values he established in his own organization, and details both the motivations and the objectives of data scientists and data engineers at Stitch Fix:

autonomy in roles
true ownership all the way into production
accountability for output
everybody leads and innovates

This is why Stitch Fix's blueprint for a Data Science department is an inspiration for AIscalate: we think Jupyter Notebooks, one of data scientists' favorite tools, should be capable of going all the way into production too. That capability supports all of the values listed above.

Airflow to build Lego Blocks

The choice of Airflow in AIscalate is also tied to the contribution of Data Engineers that Jeff Magnusson describes in his post: building Lego blocks for Data Scientists and other users. As Maxime Beauchemin explains in his talk on Advanced Data Engineering Patterns, Airflow's configuration-as-code design permits meta data engineering of such pipelines and ETL.

Nobody enjoys writing and maintaining data pipelines or ETL (it’s the industry’s ultimate hot potato). It is possible, however, to extract horizontal patterns that are broadly applicable across multiple data science problems. This illustrates how data scientists can be empowered and supported in their autonomy all the way into production.

"Hot Potato" workflow contrasting our AIscalated vision

AIscalated Workflow

To conclude, AIscalate sees the ideal workflow as the one in the diagram at the top of this page, where steps 1 to 5 and steps A to D can mix back and forth. Each step contributes knowledge (Lego blocks), whether it is produced during prototype iterations or product life cycles, and regardless of whether it is authored and owned by a data scientist or a data engineer. In the hot potato diagram, by contrast, the two cycles are completely independent and disconnected.

Once again, to participate in this exciting vision, join us by signing up. And if you disagree, please do let us know.

Streamline Jupyter Notebook Prototypes into robust Data Products

One collaborative platform to support Data Mining Iterations and smoothly transition into Software Engineering development cycles.
Discussions, Innovations and Implementations simply unified for your whole organization.
