When to use Jupyter/Colab notebooks vs. dedicated Python or other-language projects
Finding the correct balance between pragmatic data exploration and data engineering efforts
As companies and products rely on data more extensively, it is becoming ever more important to understand which data processing tools best fit a given work item.
This post explains the tradeoffs between tools that give a machine-learning or data engineer easy access to runnable scripts and outputs and tools that allow an engineer to build complex systems. It also explains which tools you are likely to use more heavily in certain engineering roles. First, let’s dive into Jupyter and Colab notebooks:
A Jupyter/Colab notebook lets a developer run Python code in a web browser. With Jupyter, you run a server that hosts the notebook application, then open the server’s address in a browser and develop Python code there in an easy-to-run format. Google’s Colab simplifies this further: you go to your Google Drive and simply create and open a new Colaboratory file. These are the tradeoffs of developing in this manner:
Strengths of notebooks
Simplified Development Environment:
Notebooks let you develop and execute Python code with very little setup, making it easy to hit the ground running. Here is an example of a hello world cell:
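A minimal sketch of what such a cell might contain is shown here. The `greet` helper is a hypothetical name used only for illustration; any short Python snippet would do.

```python
# A single notebook code cell: running it prints the result directly below the cell.
def greet(name: str) -> str:
    """Build a greeting string (hypothetical helper, for illustration only)."""
    return f"Hello, {name}!"

message = greet("world")
print(message)
```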
With this example, we have a code cell, which is a place where we write code. We press shift+enter to run it, and the output appears right below. All I needed to do was create a Colab notebook file in my Google Drive, and I was up and running quickly.
This is the main benefit of notebooks. You can easily create them, open them, start coding immediately, and get results quickly.
Smooth Output And Visualizations:
In addition to showing the text output of the cell, we can easily visualize data:
This data visualization is very powerful, because we see it immediately after running the code. We don’t have to run a Python script from the command line that saves a plot to an image file and then open the image file to see the data. It is all just there in the notebook UI.
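As a rough sketch of such a plotting cell, assuming `matplotlib` is installed in the runtime (Colab ships with it; a self-hosted Jupyter server may need it installed first), the following draws a made-up sine wave. The data and labels are illustrative only.

```python
import math

import matplotlib.pyplot as plt

# Made-up sample data: one period of a sine wave.
xs = [i * 2 * math.pi / 100 for i in range(101)]
ys = [math.sin(x) for x in xs]

fig, ax = plt.subplots()
ax.plot(xs, ys, label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.set_title("Rendered inline below the cell")
ax.legend()
plt.show()  # in a notebook, the figure appears directly under the cell
```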
Reproducibility:
If we want to get results from the notebook again after closing or clearing it, we can just restart the runtime and rerun the cells in order to rebuild the environment. Restarting the runtime means restarting the Python backend process, so the variables that were created and modified in the previous runtime no longer exist.
Weaknesses of notebooks
Low Version Control:
While notebooks let you get up and running quickly, sharing the results with other engineers can be quite difficult, because you need to find a way to get the notebook to them. If you run your own server, you need to commit the notebook to a Git repo or FTP it to any engineer who needs to run it, which complicates the development process. Colab is easier because you can just share the Drive file and your peers can copy the notebook and run it themselves, but any changes then have to be merged by hand, which can be quite painful.
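Part of what makes notebooks awkward under Git is that an `.ipynb` file is JSON that stores cell outputs and execution counters alongside the code, so simply rerunning a cell changes the file. One common mitigation is to clear that state before committing. The sketch below shows the idea on a hand-written notebook dict; `strip_outputs` is a hypothetical helper, not part of Jupyter or Colab.

```python
import json

def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with code-cell outputs and counters cleared."""
    cleaned = json.loads(json.dumps(notebook))  # deep copy via a JSON round trip
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned

# A tiny hand-written notebook in the nbformat 4 layout (illustrative only).
example = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "source": ["print('hi')"],
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["hi\n"]}],
        }
    ],
}

cleaned = strip_outputs(example)
print(json.dumps(cleaned["cells"][0], indent=1))
```

With outputs and counters removed, a diff between two commits shows only the code and text that actually changed.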
Code Execution Order:
Code execution order refers to the order in which a notebook’s individual cells are run. When we are exploring and working with variables and data, if we don’t run the notebook cells from top to bottom, the variables in the runtime can end up in a conflicting state. This means that at the end of development, users need to clear the runtime and run the entire notebook from beginning to end to make sure everything is correct. It is easy to screw up the runtime, so this takes some care.
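To make the problem concrete, here is a small simulation, not real notebook machinery: each string stands in for one cell, and a dict stands in for the runtime’s variables. The cell contents and the `run_cells` helper are made up for illustration.

```python
# Each string stands in for one notebook cell.
cells = {
    1: "prices = [10, 20, 30]",
    2: "prices = [p * 2 for p in prices]",
    3: "total = sum(prices)",
}

def run_cells(order):
    """Run the given cell numbers, in that order, against a fresh runtime."""
    runtime = {}  # stands in for the variables held by the Python backend
    for number in order:
        exec(cells[number], runtime)
    return runtime["total"]

top_to_bottom = run_cells([1, 2, 3])      # the intended order
reran_cell_two = run_cells([1, 2, 2, 3])  # cell 2 accidentally run twice
print(top_to_bottom, reran_cell_two)      # prints: 120 240
```

The same three cells give two different totals depending only on how they were run, which is why a clean top-to-bottom run is the final check.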
Not Suitable for Production:
When developing code and scripts in a notebook, it becomes quite difficult to push that code to a production environment that needs to provide services to users. The reason is that the notebook format is meant for ease of use, not power and functionality. You could theoretically run a production environment from a notebook, but it would be quite fragile and painful.
We have just explored the tradeoffs of developing in a notebook. Now let’s explore the strengths and weaknesses of the dedicated project:
Strengths of dedicated projects
Reliable and Scalable Execution:
With a dedicated Python, C++, or other-language project, we can write compile and run scripts that launch the project properly and in a more scalable manner. This is incredibly important when building systems that people need to access frequently.
High Version Control:
Version control is a cornerstone of any mature software product. A project-based data application makes it easier for teams and engineers to check in their progress. This becomes incredibly important as projects involve more engineers who need to work together as seamlessly as possible.
Increased Modularity:
A project can be built so that other projects or services can import it and integrate with it easily. This is great for a development environment where many services are being developed at the same time and depend on one another.
Weaknesses of dedicated projects
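As a small sketch of what that modularity can look like in Python, here is a module that works both as an importable library and as a runnable script. The file name `metrics.py` and the `mean` function are hypothetical.

```python
"""metrics.py: a hypothetical module that other projects or services could import."""
from typing import Iterable

def mean(values: Iterable[float]) -> float:
    """Return the arithmetic mean of a non-empty collection of numbers."""
    items = list(values)
    if not items:
        raise ValueError("mean() requires at least one value")
    return sum(items) / len(items)

def main() -> None:
    """Entry point used only when the file is run directly."""
    print(mean([1.0, 2.0, 3.0]))

if __name__ == "__main__":
    main()
```

Another service could then write `from metrics import mean` and reuse the function without triggering `main`, because the guard at the bottom only fires when the file is executed as a script.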
More Complex Management:
With a dedicated project, it takes quite a bit of effort to build out the qualities we would expect from a well-designed project, such as reliability, scalability, and elasticity. It is possible to run a Python script or the runtime from the command line, but as engineers build features into the project, its complexity grows, and eventually it becomes necessary to spend time on the build, run, and deployment systems for the project.
Less Interactive:
We saw before that a notebook is quite easy to work with: just type the code and run a given cell. With a dedicated project, there is more friction to execute the program. We might have build/run scripts and output directories. We won’t be able to see easy visualizations without writing to a file on disk and opening the file.
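For a sense of that extra friction, here is a sketch of a small command-line script that writes its results to disk instead of showing them inline. The `--output` flag, the placeholder rows, and the file names are all assumptions made for illustration. The final lines call `main` with an explicit argument list in a temporary directory so the sketch runs anywhere; a real script would typically call `main()` under an `if __name__ == "__main__":` guard instead.

```python
import argparse
import csv
import tempfile
from pathlib import Path

def write_summary(rows, output_path: Path) -> Path:
    """Write (name, value) rows to a CSV file, creating the directory if needed."""
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["name", "value"])
        writer.writerows(rows)
    return output_path

def main(argv=None) -> Path:
    """Parse command-line arguments and write the summary file."""
    parser = argparse.ArgumentParser(description="Write a small summary table to disk.")
    parser.add_argument("--output", type=Path, default=Path("output/summary.csv"))
    args = parser.parse_args(argv)
    rows = [("alpha", 1), ("beta", 2)]  # placeholder data
    path = write_summary(rows, args.output)
    print(f"Wrote {path}")
    return path

# Demonstration: run once into a temporary directory, then open the file to see the result.
with tempfile.TemporaryDirectory() as tmp:
    written = main(["--output", str(Path(tmp) / "out" / "summary.csv")])
    print(written.read_text())
```

Compared with a notebook cell, seeing the result takes three steps: run the script, find the output file, and open it.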
More Complicated To Develop:
One issue with projects is that in order to code for them, you need to know enough about the project’s source code to know what changes should be made to achieve a desirable result. With the notebook, you just follow the code in the file until you need to make a change, but with the project, you need to dig through the source code references to find what you are looking for.
Now that we have explored the tradeoffs for both approaches to development, I would like to describe the ideal users and roles that would naturally gravitate towards each tool:
Notebook Developers: The Data Explorers
Data Scientist: This role is the major user of notebooks. Data scientists analyze data for feature engineering and for model development, validation, and optimization.
Data Analyst: This role is another major user of notebooks. Data analysts take existing data to derive insights about business metrics in order to understand optimal business strategies and to visualize these details.
Educator and Student: These roles make effective use of notebooks in order to better teach and learn course material. Notebooks allow people to learn and teach in a more interactive manner.
Project Developers: The Data System Builders
Software Engineer: This role primarily develops using projects. The reason is that software engineers need to build systems that scale properly with a standardized development flow. These systems could be data-processing systems, but they could also be general-purpose systems outside of data.
Data Engineer: This role uses projects to develop data-processing pipelines. They primarily use projects because they are essentially software engineers with a focus on data processing. They focus more on the ETL pipelines and infrastructure.
DevOps and MLOps Engineer: This role is essentially a combination of the software/data engineer and the operations engineer. These engineers also work on data-processing systems, but in addition they make sure systems meet their SLOs and SLAs and maintain the pipelines that run.
Educator and Student: Besides using notebooks, these roles also use projects for learning and growth. They might implement highly technical assignments as projects when the work is not conducive to being developed in a notebook.
This concludes the more specific users for both notebooks and projects, but there are a couple of roles I want to mention that I think will use both notebooks and projects:
Machine Learning Engineer: This role is primarily focused on building data-intensive systems. This engineer should be prepared to bridge the roles of data scientists and software/MLOps engineers.
Tech Lead: This role serves the purpose of leading teams and organizations. Usually this contributor starts with one of the previously mentioned roles, but has gained enough competency, influence, and trust to be able to lead the building of entire services and products.
In conclusion, both notebooks and development projects have their roles and purposes. If you need fast prototype development, then notebooks are your best bet. If you need reliable, scalable systems, a project will be your go-to.