By Mauricio Cordeiro

Why Data Scientists Should Use Jupyter Notebooks in Moderation

Updated: Jan 25

Jupyter notebooks were game changers for data scientists across the globe. But should they be used indiscriminately?

Introduction

There's no doubt that the launch of Project Jupyter and its notebooks back in 2015 changed the relationship between scientific programmers and their code. The first reason is the simplicity of connecting to different programming languages (kernels) and of combining text, code snippets, and outputs such as tables, graphs, and maps on a single page. This made it possible, and simple, to implement the literate programming paradigm first proposed by Donald Knuth in 1984.

Literate programming was introduced by Knuth at Stanford University with the objective of bringing program logic closer to human language; it combines code with natural-language text.

The second reason is the interactive nature of Jupyter notebooks. Being able to experiment with data and see the results of each typed command makes them ideal for data scientists and researchers, whose focus is on data analysis, not software development.

With interactive notebooks, it is no longer necessary to write a long script with dozens (or hundreds) of error-prone lines of code just to see the results at the very end of the processing. Depending on the objective, you don't need to bother declaring functions or designing classes; you can just declare variables on demand and focus on the results.

Bottom line: Python and Jupyter have become a standard for data scientists. This is confirmed by the increasing number of courses and job postings that require these skills.

But now you may be asking yourself: if it is so good (and a game-changer), why should I be careful about how I use it?

To answer this question, I will tell a little story.

Old-fashioned programming

When I started my research at the university, I had been away from coding for at least 10 years, and I barely knew that Python existed. I used to code in Pascal, C, and a little Fortran, the main scientific languages used in universities when I graduated (I know, it was a long time ago). I also didn't know about Jupyter or the thousands of different Python packages out there, which can be overwhelming.

So, I started the way I was used to. I bought two Python books (yes, yes… I still use books) and installed the basic Python interpreter and a good, free IDE. A quick web search pointed me to the PyCharm Community Edition.

As I didn't have the quick visualization provided by Jupyter, I created a pipeline to preprocess all the input data and test different processing combinations. In the end, it generated all the graphs and outputs I needed for my research. I was forced to write good, easily reproducible code; otherwise, I would not have been able to analyze everything. As I was working with high-resolution satellite images, the amount of data was huge.

It took me some time to develop everything, but once the work was done, I could focus on experimenting with the algorithm in different areas of the globe, with different coverage, etc. Ultimately, I was happy with the results, and my first scientific paper and public Python package (water detection software for satellite images) were published. You can check the WaterDetect package in this GeoCorner article and the GitHub repository.

Notebook "Programming"

Having passed my first research "checkpoint", I was open to learning new tools to improve my skills, so I installed JupyterLab (the more recent successor to the classic Notebook).

My life changed. I remember thinking at the time, "Why didn't I try this before?"

I was astonished by the endless possibilities of testing, documenting, and quickly visualizing everything I was doing. I even tested some recent packages that turn the notebook into a (kind of) development environment. One such tool, called nbdev, makes it easier to export modules, create packages, and even document everything. The best of both worlds, I thought.
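Just to illustrate (this is not the code from the article, and the function and module names are made up), an nbdev notebook marks cells with special directives and then exports them to a regular Python module. The syntax below is the nbdev v2 style; earlier versions used #export comments and the nbdev_build_lib command instead:

# notebook cell 1: choose the module that exported code will go to
#| default_exp core

# notebook cell 2: everything in a cell marked "export" ends up in <package>/core.py
#| export
def cloud_fraction(mask):
    """Toy example: fraction of pixels flagged as cloud in a boolean mask."""
    return mask.sum() / mask.size

# running `nbdev_export` in a terminal then writes the function to the module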

However, after months of work on another topic, and having achieved pretty good results with my machine learning research, my supervisor said the words I feared: "Great results! Let's try it on different sites to validate the results." Different sites? Validate results? By tomorrow???

I was not prepared for that. To achieve the initial results, I had run a bunch of different machine learning tests with other algorithms, preprocessing normalizations, etc. But the focus was on the results, not on developing a complete processing chain; I was still "experimenting". So the code was not modularized, it wasn't easy to reproduce an old experiment, I could never find the notebook with the correct version of the implementation that worked, and so on.

So, just reproducing the results for a new location was a real pain. It took a lot of time, and it made me very inefficient. For the supervisor making the request, that's not easy to understand. He only thinks, "… but you have already developed it; you've shown me the results; all I am asking now is to push the same button." Well… kind of.

The truth is that, after some time, I had done a lot of different tests, experiments, and coding. However, I had no modularized code ready for publication or for sharing with other researchers. I had just that… a bunch of disconnected notebooks with duplicated functions, weird names, and so on.

In the end, it seemed that I was not as efficient as before. I hadn't built anything; I had no software to deliver. And that feeling was awful.

I have already written about the reasons why scientific software is not well designed in this story: 7 Reasons Why Scientific Software are Not Well Designed. And I believe that the indiscriminate use of Jupyter notebooks by scientist "programmers" will make this problem even worse.

The insight from Kaggle

During the time I was an avid notebook user, I also participated in some Kaggle competitions to improve my deep learning skills (in my opinion, it is the best way to learn from other DL practitioners). One nice thing they always do after a competition finishes is interview the winners.

In one of these interviews, a Russian competitor (I don't remember from which competition) was asked about the development environment he used, and he answered: "I don't use Jupyter notebooks. All I do is through plain old IDEs." That changed my mind. I heard that from the winner of a competition with thousands and thousands of participants, most of whom were probably glued to their Jupyter notebooks.

That story made me rethink some misconceptions I had. The truth is that I was less efficient with notebooks than I had been at the start, using PyCharm (or Spyder, VS Code, or any other IDE).

What I want to point out is this: because of the freedom notebooks give us, we need to double our commitment to keeping the code clean, reproducible, and organized. And sometimes that is just not feasible.

The solution?

What works best for me now in my data science journey is to develop with the IDE and Jupyter simultaneously, but for different purposes. I write the functions and classes in the IDE, inside a new package I create, and then I use the notebook just to call the package and visualize the results. This way, in the end, I have a "ready to go" package that can be shared with other researchers.
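To make the split concrete, here is a minimal sketch of what it looks like; the package, module, function, and data below are hypothetical, not the code from my project:

# maskprocessor/core.py -- developed and debugged in the IDE
import numpy as np

def water_mask(ndwi: np.ndarray, threshold: float = 0.0) -> np.ndarray:
    """Return a boolean water mask from an NDWI array (illustration only)."""
    return ndwi > threshold

# In a notebook cell: only imports, calls to the package, and visualization
import numpy as np
import matplotlib.pyplot as plt
from maskprocessor.core import water_mask

ndwi = np.random.uniform(-1, 1, (256, 256))  # stand-in for a real band index
plt.imshow(water_mask(ndwi), cmap="gray")
plt.title("Water mask (illustrative)")
plt.show()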

For this setup to work, we need to pay attention to the following points:

  • Create a new (empty) package and install it in edit mode with pip (the -e option). This way, the source code stays in its original folder structure, and you can keep developing it. A minimal package skeleton is sketched after this list.

cd project_folder
# editable install: the package is used directly from this source folder
pip install -e .
  • Use the %autoreload extension in the Jupyter notebook. This lets you update the package in the IDE and check the result in the notebook without restarting the kernel.

# on the very first cell of the notebook
%load_ext autoreload
# mode 2: reload all (non-excluded) modules before executing each cell
%autoreload 2
  • Optionally, you can attach your IDE's debugger to the Jupyter kernel. In PyCharm, this is done from the Run menu.
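For reference, the "empty package" from the first point can be as small as the sketch below before you run pip install -e . (the name maskprocessor is hypothetical, and a pyproject.toml-based layout works just as well):

# project_folder/
#     setup.py
#     maskprocessor/
#         __init__.py        (can be empty)

# project_folder/setup.py -- minimal setuptools configuration
from setuptools import setup, find_packages

setup(
    name="maskprocessor",    # hypothetical package name
    version="0.1.0",
    packages=find_packages(),
)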

I am currently working on a new mask processor, and that is what my setup looks like now. I have all the benefits of the IDE (code completion, argument checking, etc.), the debugger runs normally, and, in the end, the package is "ready" for deployment. Another advantage of using the Jupyter notebook just to display the results is that it can serve as the user manual for the new package.

Conclusion

Just to be clear: I am not advocating against Jupyter notebooks or any other interactive environment, such as R or MATLAB. I understand their advantages, especially for research and data science work, where the focus is on data analysis and experimentation rather than code production.

However, we must keep in mind what the expectations are. Typically, even our simplest data analyses should be reproducible and easy to share with colleagues.

If we use notebooks just to take advantage of the many packages that already exist in the community and to display the results, that's fine. But for a new piece of code, a new kind of processing, or even a simple automation of an existing process, they can be counterproductive.

And you, what is the best environment setup for you as a data scientist? Leave your comments and insights.
