7 Reasons Why Scientific Software Is Not Well Designed

Let’s discuss some of the main reasons why software developed in academia and research centers is so often poorly designed, and look for solutions

Introduction

I’ve wanted to write about this topic for some time, but I must confess I’ve been putting it off. Although I don’t have a formal Computer Science background myself (I have a civil engineering degree), I’ve been working in the IT industry for more than 15 years. During this time, I’ve seen some of the major trends and best practices used in software development to improve software quality.

Some years ago, I took a leave from my former position as IT manager in a Brazilian governmental agency to start my PhD in data science applied to Remote Sensing in France. Upon arrival, I was asked to study a specific question related to water detection from satellite images. Then, to help me with my initial analysis and to avoid starting from scratch, they shared with me the code of a previous student, which gathered some of the main ideas but needed, as they said, some “improvements”. That piece of code was part of a bigger processing chain that was meant to produce water quality maps for a European space agency.

Considering I was a complete newbie in the remote sensing field, I first thought, GREAT! I will not be starting from scratch, so it’s maybe half of the work done… but I was wrong.

Well, the code they shared with me was written in Python, like most of the code written in academia nowadays, and had no documentation, as “expected”. It took me a lot of time to even understand what the input parameters should be. When I could finally make it work, the processing time for a really small sample area was… 10 minutes. I thought to myself: Wow! That’s rocket science! I was imagining the amount and complexity of all the mathematical operations that were being done “under the hood” to solve that problem.

Expectation vs Reality — Left: Photo by WikiImages on Pixabay; Right: Photo by Hello I’m Nik on Unsplash.

After this first glimpse, I started my reverse engineering (thanks again to the lack of documentation) to understand all of the details. I took a paper notebook and started studying it… line by line, loop by loop, operator by operator. When I got to the main function (why use classes if we can do it procedurally?), the one that I expected would unveil all the answers to my questions, I got a big surprise! It had exactly 2148 lines of code. A single giant function. Sometimes the same code block was repeated again and again, which is prone to bugs. No rules, no guidelines, just that… a single giant function.

I am not giving this example (a very real one) to say that the code written in academia is always crap. No, it is not. But it needs to improve, and fast; otherwise, tech companies will take the lead on research over traditional universities in the near future. Research needs data. And the amount of data that has to be analyzed nowadays cannot be manipulated manually in spreadsheets, nor by software that struggles to produce reliable results. But, above all, what surprised me the most was imagining code like that installed and running on a computational cluster of a European space agency.

And that brings us back to the main topic of this post. Why is it so difficult to have scientific software that is well-designed, documented, well-coded, etc.? I did some research and will share my main thoughts with you.

1- Researchers are not software engineers

The first and most obvious reason is that researchers are very skilled in their own fields, but they are not software engineers. Most of the software developed in academia is not written by programmers but by the researchers themselves, research groups or students. So, important concepts such as object orientation, testing, design patterns and many others are left behind because they are not taught in science courses.
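To make the "testing" part concrete, here is a minimal sketch in Python. It is not the code from my story; the function names, the band values and the threshold are all hypothetical, and the index is just a generic normalized difference of the kind often used for water detection. The point is only to show a small function with one job, reused instead of copy-pasted, with a few checks next to it.

```python
def normalized_difference(band_a, band_b):
    """Return (a - b) / (a + b) for each pair of pixels.

    Pixels whose sum is zero get 0.0 instead of raising ZeroDivisionError.
    """
    result = []
    for a, b in zip(band_a, band_b):
        total = a + b
        result.append((a - b) / total if total != 0 else 0.0)
    return result


def water_mask(green, nir, threshold=0.0):
    """Flag pixels whose index is above the threshold (hypothetical value)."""
    return [value > threshold for value in normalized_difference(green, nir)]


# Small checks that document the expected behaviour and catch regressions.
def test_index_values():
    assert normalized_difference([3.0, 1.0], [1.0, 3.0]) == [0.5, -0.5]


def test_zero_sum_does_not_crash():
    assert normalized_difference([0.0], [0.0]) == [0.0]


def test_mask():
    assert water_mask([3.0, 1.0], [1.0, 3.0]) == [True, False]
```

Functions named test_* like these can be collected by a test runner such as pytest, or simply called by hand. Either way, the next student who touches the code gets a quick signal when something breaks.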

2- Money

Lack of money is always an important constraint everywhere. In this case, researchers don’t have a specific budget set aside for software development. That’s why most scientific software is developed by students receiving scholarships or grants. Even when there are specific positions for IT professionals, academic salaries are normally lower than those in the private sector, especially compared to the growing IT market, so it is difficult to keep talented professionals.

Still on the subject of money, outsourcing the job to tech companies would also be financially prohibitive. And there are other issues, as discussed in reason #6.

3- Focus on papers, not code

Software is not the end goal. Scientists develop software for their own use and to get the results for their articles, period. Usability and design are not even considered in the project. In most cases, there is not even a “software”, but just a prototype that did the job… once. The problem arises when this prototype is published as software without proper testing, documentation, etc.

Normally, nobody plans “a priori” to reproduce the results in another area or with different inputs using the same prototype. Even when the software is intended to be used by other students or research groups, these fundamental topics are left behind, and the prototype is used “as is”.

4- Limited duration contracts

As software development is not the final goal, there are no permanent IT teams to take care of the software lifecycle. Software is usually funded from grants for specific research projects and maintained on a collaborative basis. But technology evolves continuously. I’ve come across several scientific software packages that don’t work anymore because they were never updated, and the environment they need is no longer reproducible.
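One cheap habit that helps with the reproducibility problem is to record the environment next to the results. The sketch below is only an illustration using Python's standard library; the function names and the output file name are my own assumptions, and it does not replace proper tools such as lock files or containers.

```python
import platform
from importlib import metadata


def snapshot_environment():
    """Return the Python version and sorted 'name==version' lines
    for every package installed in the current environment."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version")
        if name and version:  # skip broken installs with missing metadata
            pins.add(f"{name}=={version}")
    return platform.python_version(), sorted(pins, key=str.lower)


def write_snapshot(path="environment-snapshot.txt"):
    """Write the snapshot to a text file (hypothetical file name)."""
    python_version, pins = snapshot_environment()
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(f"# Python {python_version}\n")
        handle.write("\n".join(pins) + "\n")
    return path
```

Calling write_snapshot() at the end of a processing run leaves a small text file that tells whoever inherits the code which versions it last worked with.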

5- Lack of proper project management

Sometimes, the idea of developing a new piece of scientific software is more ambitious and not confined to a single study, so a bigger project is prepared. That would solve some of the aforementioned issues. The problem is that, in such situations, team leaders are usually the researchers themselves, not experienced project managers, so they are not qualified to manage the time, quality or scope of the project. Parts of the software are then written by different students with different backgrounds without any associated software development process. Without guidelines, it isn't easy to integrate the pieces afterwards.

6- No commercial interest

The “money problem” would be solved if there were some commercial interest in the software being developed. That’s the case in very specific situations where an idea can impact an industry. However, as a rule of thumb, scientific software is ultra-specific to certain academic areas, and the user base is really limited. That leaves little commercial incentive for tech companies to enter the market and compete on quality.

7- Difficulty for the programmers to understand the problem

Most of the previous reasons focus on the lack of computer science specialists involved in the development process of scientific software. However, there is another point that I would like to highlight. Scientific software generally solves new problems in specific knowledge fields, and it is not that easy to find programmers who are able to understand the science behind it and, at the same time, master all the necessary IT concepts. I have already seen scientists disappointed with IT professionals hired to bridge the gap between science and programming because they could not keep up with the pace of the research.

Conclusion

Well, in conclusion, I would like to say that this is indeed a big topic, and I don’t mean to answer all the philosophical questions in this quick article. I want it to serve as a starting point for the discussion and the pursuit of solutions. Will we ever have a category of professionals that can cope with the challenge of developing good scientific software? Should data scientists fill this gap?

Let me know your thoughts on that, and let’s start a good and productive discussion on the subject.