This happens sooner or later when you’ve worked as a software engineer long enough: You come across a class file that is thousands of lines long. During my career as a software engineer, I have seen with my own eyes classes as long as 6,000 lines. I have heard rumors of classes that are tens of thousands of lines long.
These classes are pretty much incomprehensible to most software engineers. You cannot work on, maintain, or add features to code that you cannot understand. If you have to deal with such code, you are going to be in a world of pain.
Now, I have to give the tech companies their due: Some of them understand the importance of clean code to a certain degree. They have many processes in place to ensure such monstrosities do not happen: Code reviews, code repositories with rollback mechanisms, static analysis, test-driven development, CI/CD, and so forth.
However, bad code can still occasionally slip through and end up in the code repositories. These organizations certainly still have some room to improve.
In the following chapters, I am going to examine the several root causes of this phenomenon, and how bad code comes into existence despite the many barriers in place. But first, in this chapter, I am going to explain why clean code is important, and why we should not have any classes that are thousands of lines long.
What I mean by “clean” code is code that is understandable. It needs to be neat, organized, and well structured, so that you can improve it, modify it, and add new features to it without too much headache. Some people have different terms for such well designed and well written code. You can call it whatever you like: Good, great, wonderful, beautiful, adaptive, solid, complete, or simply clean code. For the rest of the book, I will be using the term clean code to get my point across.
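To make this concrete, here is a tiny illustrative sketch in Java (the names and the tax calculation are made up purely for this example). Both versions compile and behave identically; only one of them is kind to the human who has to read it later:

    // Hypothetical example: the same calculation written two ways.

    // Hard to understand: what are a, b, and x supposed to mean?
    class Calc {
        static double x(double a, double b) {
            return a + a * b;
        }
    }

    // Clean: the intent is obvious at a glance.
    class InvoiceCalculator {
        static double totalWithTax(double price, double taxRate) {
            return price + price * taxRate;
        }
    }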
This point needs to be emphasized, because even in the most prominent tech companies, there could be people who do not understand the importance of clean code. Even people in very influential positions. For example, a director of a software engineering team might one day claim, “A class could be 30,000 lines long. That’s totally normal. What’s wrong with it?” It is shocking that someone with that kind of mentality could climb up that high, but falling upwards is a real phenomenon, as I will also explain in the upcoming chapters.
This is not right. This should not happen.
What Is Code Really For?
Most people are under the assumption that code is written for computers. It is true that code is ultimately executed by computers, in order to make them run the algorithms that we want them to run. However, here is one overlooked fact: Computer CPUs do not really understand or directly deal with Java, C++, Python, JavaScript, or any other high-level language you may have heard of. CPUs operate only on low-level machine code. Binary code. Just ones and zeros. Those high-level languages that I mentioned are stored in text files on disk or in memory, and translated to binary machine code by other programs (compilers, interpreters, etc.). A computer CPU cannot make sense of a Java file if it were presented with one directly.
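As a small sketch of what that translation looks like in practice, consider an ordinary Java source file. The file name and program below are made up for illustration, but the pipeline is the standard Java one: the javac compiler turns the human-readable text into JVM bytecode, and the JVM then translates that bytecode into the machine instructions the CPU actually executes.

    // Hello.java: plain text, written for human readers.
    // The CPU never sees it in this form. A typical build runs
    //   javac Hello.java   (produces Hello.class, i.e. JVM bytecode)
    //   java Hello         (the JVM translates the bytecode into machine code and runs it)
    public class Hello {
        public static void main(String[] args) {
            // Meaningful to us; the CPU only ever sees the translated ones and zeros.
            System.out.println("Hello, world");
        }
    }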
Then what is code really written for? Or more accurately, who is code really written for?
Code is written for us humans to understand.
Dear reader, if you do not remember any other thing from this book, please try to remember this one thing: Code is actually written for us humans to make sense of. It is not written for the computers. Because the computers don’t care. They will execute any binary machine code given to them. Whether they actually do the task we want them to do, or fail with an error, the computers do not care one way or another.
And if we wish a computer to succeed in running the task that we want it to run, then we had better understand the high-level code itself very well.
A Brief History of High-Level Languages
When computers were first invented, there were no high-level languages. Those early computers were programmed by people directly in binary machine code. “Programming” them involved connecting wires to the right place, or punching holes in punch cards in the more “advanced” models of those early computers. Those wire connections or the holes in the punch cards represented the ones and zeros of the binary machine code. Back in those days, there were no integrated circuits or even transistors. The very early computers worked with vacuum tubes that were connected to each other with wires, capacitors, and resistors. The early “bugs” in those computers were literal bugs: They were insects that got caught between the wires and caused short circuits and burnouts in the computer’s circuitry, making the running program fail in mysterious ways. This is often cited as the reason why we still call programming errors “bugs” to this very day.
The people who programmed those early computers were usually mathematicians or scientists. There were no “software engineers” back then, because the field didn’t exist.
Those early programmers, however, would quickly realize the primary rule of software engineering that I’m trying to convey in this book: It is paramount for humans to understand the software they are writing. They found programming computers in binary machine code very tedious. Human brains were not made to comprehend a list of ones and zeros. They had to come up with better ways to program their computers, and so they started coming up with high-level languages. These developments happened early on, pretty much right after computers were invented.
The first “high-level” language they came up with was assembly language. Instead of writing the commands sent to the CPU for execution in ones and zeros, they wrote those individual CPU commands in a way that humans could understand. For instance: ADD R1, R2, R3. This would mean add the values in registers 1 and 2, and write the output to register 3. (Depending on how that particular assembly language was designed, such an instruction could take a different format, such as R2 and R3 being added into R1, but the main principle remained the same.) Each assembly instruction corresponded to an actual binary machine language instruction that the CPU would execute: Add, subtract, multiply, divide, save to a register, load from a register, jump to a different address in memory and continue executing from there, etc. It was pretty easy to map between the assembly language and the binary machine language: it was a straightforward line-to-line mapping. Each line of assembly code in a program would be mapped to a machine language instruction to be executed by the CPU. Programs were developed to do this mapping automatically, called assemblers.
Assembly languages are still used among certain circles of software engineers, for very specific tasks. However, it is the actual high-level languages that reign supreme today, not assembly languages. That is because the early programmers realized that once the software being developed grew beyond a certain size and complexity, even the assembly languages were not enough. Human brains still could not comprehend a really long and complex assembly program very well. It was still easy to make mistakes while developing the assembly code, and cause bugs (the non-biological kind). There needed to be something better.
The first programmable, electronic, general-purpose digital computer, called ENIAC, came into existence in 1945, built with vacuum tubes and wires as I mentioned.[1] In just over a decade following that, the first high-level languages started to come into existence as well: Fortran in 1957,[2] Algol in 1958,[3] COBOL in 1959,[4] Lisp in 1960,[5] etc. Now engineers could specify branches (if-then-else statements) and loops in a better way. Now a long program could be decomposed into clearly defined functions. Now variable types could be defined, preventing certain kinds of errors from taking place.
Algol inspired many other languages that were developed later on, one of which was C in 1972.[6] C turned out to be an incredibly influential language itself, inspiring other languages like C++, Java, and C#. Thus, one can say that many of the modern languages in existence today have their roots in Algol.
Lisp was another highly influential language, in the functional programming paradigm, and it inspired the development of other functional programming languages later on, such as Scheme and Clojure. Functional programming is a very influential paradigm that is in use today, just like object oriented programming. Actually, a lot of languages that started out as purely object oriented languages have been upgraded in the past decade or so to include functional programming features as well. For example, Java these days has lambda expressions, which can be used to pass functions to other functions as arguments in order to construct more elaborate algorithms. Java also has “streams” with “map & reduce” features, which can be used to apply the same function to all the elements of a list, and then reduce the list to a single result if necessary by using another function. All of these particular language features came from the functional programming paradigm, as the short sketch below shows.
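Here is a minimal sketch of those Java features. The list of numbers and the little tax calculation are made up purely for illustration; the point is the shape of the code: a lambda expression passed to map, and a reduce step that collapses the whole list into a single value.

    import java.util.List;

    public class StreamSketch {
        public static void main(String[] args) {
            List<Integer> prices = List.of(10, 20, 30);

            int total = prices.stream()
                    .map(price -> price + price / 10) // lambda: add 10% tax to each element
                    .reduce(0, Integer::sum);         // reduce: combine all elements into one sum

            System.out.println(total); // prints 66
        }
    }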
Fortran and COBOL have the reputation of being ancient, archaic languages today. However, they are still in use. Some scientific research institutions still use Fortran, while some financial and banking institutions still use COBOL. These languages have undergone a lot of revisions over the past decades to try to adapt to modern times. For example, object oriented support was added to COBOL in 2002 and to Fortran in 2003. Still, these languages have been mostly replaced by more modern languages today in terms of widespread availability and use.
Speaking of object oriented programming, the object oriented paradigm started making its appearance as early as the 1960s. The Simula language (first developed in 1962) introduced the concepts of objects, classes, inheritance, virtual procedures, and polymorphism in 1967.[7] These are all concepts associated with the modern object oriented languages in existence today, like C++, Java, Python, and JavaScript. Few software engineers even today realize that these object oriented programming concepts were already being practiced in the 1960s.
The trend in software engineering has been to find ways of programming that our human brains can comprehend better. We have been trying to come up with better programming languages, capable of better software abstractions, so that it is easier to make sense of our code. Within just 20 years of the very first computer coming into existence, pretty much all the major programming paradigms we use today were invented: procedural programming, functional programming, and object oriented programming. That says something.
These efforts have paid off immensely. In today’s world it is much easier to make sense of a given piece of code, or to develop very complex and sophisticated programs, compared to the world of the early 1950s.
Yet, we still fail. No matter how modern a programming language we are using, we still find ways to develop confusing classes that are tens of thousands of lines long and confusing functions that are trying to do too many things at once.
No programming language that has ever existed can prevent us from shooting ourselves in the foot, so to speak.
Part of the reason we are so easily confused by bad code has to do with who we really are, our very nature.
A Brief History of Humanity
Our species, Homo sapiens, evolved on the continent of Africa around 300,000 years ago.[8] We evolved from other hominid species, all of which have gone extinct. Our closest living relatives are chimpanzees, with whom we shared a last common ancestor around 5 to 13 million years ago.[9] As it turns out, chimpanzees and humans are closer genetically than chimpanzees and gorillas are. We are firmly located in the great apes branch of the tree of life.
Just like our hominid ancestors, we Homo sapiens evolved to be hunter-gatherers. And that is our true nature. For the vast majority of our 300,000-year existence, our species has been hunter-gatherers.
Agriculture was invented around 12,000 years ago. We have only been tilling fields and harvesting crops since then. The earliest known writing was invented around 5,400 years ago in Sumer. Everything we know from written history, all the nations, leaders, and wars that are taught in history classes, happened within the last 5,400 years. Galileo Galilei built his own telescope and pointed it at the Moon in 1609. Isaac Newton published his theory of gravity and the universal laws of motion in 1687. The steam engine was invented in the 1700s, kicking off the Industrial Revolution. And as you all know by now, the first electronic programmable computer was invented in 1945.
We have been programming for only the last seven decades, which is within the lifetime of a single human being. Yet we have existed for 300,000 years as hunter-gatherers. Biological evolution happens over the span of hundreds of thousands of years. Our genes, our biological nature, and our brain structures haven’t had time to adapt to understanding complex code.
We are hunter-gatherers in our very nature, not programmers. Our hunter-gatherer brains cannot make sense of badly written code. The code must be clean, well architected, well documented, and well written for us to make sense of it.
Clean Code Is Paramount
Software engineers usually work in collaborative environments, within teams of other software engineers in their working groups, departments, and companies. Each software engineer needs to be able to understand the code written by another software engineer, in order to get any meaningful work done.
But let’s say you are a software engineer who is working in a team of one. Let’s say you formed your own software startup, and you are the first and the only employee. This means you can just go ahead and start coding up any sort of mess, right?
Wrong. Even in this situation, you are still in fact collaborating with someone else: Your future self. Six months from now, you will need to understand the code that you yourself wrote today.
I cannot speak for anyone else, but I myself cannot even remember what I had for lunch a couple of days ago. If I write an unintelligible mess, although I can make sense of it today, it is pretty much guaranteed that I won’t be able to make any sense of it 6 months from now. I probably won’t be able to make sense of it even a couple of weeks from now. The code I write needs to be clean enough that when I take a look at it some time later, I still understand it. And I am someone with decades of software development experience.
Clean code is paramount. Code cleanliness is the most important quality. It trumps everything else.
Yet, today in many tech companies, when candidates are being interviewed for software engineering positions, they are rarely asked interview questions about clean code or clean architecture practices. The interview questions are pretty much all about complex data structures and algorithms, and sometimes about large-scale system design. These are important topics for sure, and every good software engineering candidate should have some basic knowledge of fundamental data structures. But let’s get real, not every candidate needs to know about complex algorithms. If I need to design or implement some code that requires a dynamic programming algorithm or the Bellman-Ford algorithm, for example, I can google these things and look them up.
To be perfectly honest, if anybody asked me a typical software engineering interview question related to complex algorithms right this second, I would probably fail spectacularly. If I needed to look for a software engineering job in today’s world, I would need to sit down and practice hundreds of these algorithmic puzzle questions, even with decades of development experience. And to be perfectly honest again, I can count on one hand how many times I’ve had to implement such a complex algorithm in my career. Each of those times, I could simply look up the solution on the internet.
Yet, there have been countless times when I’ve had to design a clean and reusable API for a software service. Solutions to such challenges were not always straightforward, since most of the time I had to come up with an answer from scratch.
This is what a lot of job searching candidates do these days: They sit down, practice, and memorize hundreds of algorithmic puzzle questions. This impresses a lot of interviewers for sure. However, when these candidates are hired, a considerable number of them start writing unintelligible code. They need to be taught how to write clean code when they first start working, usually by their already experienced peers who review their code.
I have an entire chapter dedicated to my criticism of the algorithmic puzzle interview questions, and my suggestions of what to replace them with. Just to be clear, I have nothing against asking programming interview questions to candidates. Candidates do lie on their resumes, and programming interviews are a good way to detect that. Also, testing the fundamental software knowledge of candidates is all fair game, especially if this knowledge can apply to the day-to-day work of an engineer.
I also believe there should be a difference in the interviewing process when you’re hiring a new grad versus a senior software engineer. You cannot evaluate both these types of people by asking them the same algorithmic puzzle questions. You hire senior software engineers for their software development experience, for their abilities to lead a team, mentor others, design projects, and so on. How are you going to decide whether to hire them if you ask them algorithmic puzzle questions instead of technical questions related to their area of expertise? A senior software engineer knowing about Floyd-Warshall or any other two-dude-name algorithm doesn’t assure me of their API design or backend development skills. One has to be memorized; the other can only be earned from actual experience. That’s the line we must draw.
Most tech companies pride themselves on a rigorous interviewing process and on asking these types of algorithmic puzzle questions. Yet, they still manage to make bad hires. Therefore, this kind of interviewing process is not up to par.
As I indicated before, tech organizations could do a lot more to truly value the importance of clean code by improving their practices and methodologies. And they should.
1. Weik, Martin H. “The ENIAC Story.” 1961, https://web.archive.org/web/20110814181522/http://ftp.arl.mil/~mike/comphist/eniac-story.html. Accessed 11 September 2023.
2. Backus, John. “The history of Fortran I, II, and III.” IEEE Annals of the History of Computing, vol. 20, no. 4, 1998, pp. 68-78. 10.1109/85.728232.
3. Backus, John. “The Syntax and Semantics of the Proposed International Algebraic Language of the Zürich ACM-GAMM Conference.” Proceedings of the International Conference on Information Processing, 1959, pp. 125-132.
4. Sammet, Jean E. “The real creators of Cobol.” IEEE Software, vol. 17, no. 2, 2000, pp. 30-32. 10.1109/52.841602.
5. McCarthy, John. “Recursive functions of symbolic expressions and their computation by machine, Part I.” Communications of the ACM, vol. 3, no. 4, 1960, pp. 184-195. 10.1145/367177.367199.
6. Ritchie, Dennis M. “The development of the C language.” ACM SIGPLAN Notices, vol. 28, no. 3, 1993, pp. 201-208. 10.1145/155360.155580.
7. Dahl, Ole-Johan, et al. Common Base Language. Norsk Regnesentral, 1970.
8. Hublin, Jean-Jacques, et al. “New fossils from Jebel Irhoud, Morocco and the pan-African origin of Homo sapiens.” Nature, vol. 546, 2017, pp. 289–292. 10.1038/nature22336.
9. “Chimpanzee–human last common ancestor.” Wikipedia, https://en.wikipedia.org/wiki/Chimpanzee%E2%80%93human_last_common_ancestor. Accessed 15 January 2024.