As the number of data scientists at Ynformed grows, it becomes increasingly important to collaborate efficiently on our projects. As a result, we have put more focus on sharing our code in a structured way, which also improves its readability. As part of this effort, we have standardized several of our development and coding practices. One of them is code style. Code style covers everything that has no effect on the outcome of the code but does affect how it looks: naming, spacing, indentation, line length, and so on. A consistent style has several advantages:
- It makes code easier to ‘grok’, which means that it is easier to immerse yourself in the actual meaning of the code, as opposed to spending energy on understanding the different code styles used across the team.
- It makes the code more of a common property of the team. When part of the code is written in one person’s unique style, it will keep clashing with the rest of the code, and mending that clash takes time and effort.
- Sometimes, the code just looks nicer with a bit of styling done. Compare it to using hair gel: it’s easy and quick, but it can have a huge effect on how nice you look.
These advantages are general and independent of the chosen programming language. At Ynformed, both R and Python are commonly used. Although some data scientists have a personal preference for one of them, we usually pick a single language when multiple colleagues work on the same project. In this blog, I will focus on the code style practices we use for R.
For R, we have decided to use two tools to maintain a consistent style: the Tidyverse style guide (https://style.tidyverse.org/) and a custom cookiecutter template (https://github.com/audreyr/cookiecutter).
The Tidyverse style guide is a set of rules about coding style. Instead of reinventing the wheel, we adopt a style that is already widely used in the R ecosystem. This has the advantage that our code stays consistent with the many other packages that follow the same guide. Another advantage is that the Tidyverse style guide is supported by default by the ‘styler’ and ‘lintr’ packages, which I will talk about later in this blog.
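To give an idea of what such rules look like in practice, here is a small sketch of the same function before and after applying a few of the guide's conventions. The function and variable names are made up for this example.

```r
# Before: valid R, but it ignores several style guide conventions
# (camelCase names, `=` for assignment, no spaces, everything on one line).
meanTemp=function(x,naRm=TRUE){mean(x,na.rm=naRm)}

# After: snake_case names, `<-` for assignment, spaces after commas and
# around operators, and the function body on its own indented line.
mean_temp <- function(x, na_rm = TRUE) {
  mean(x, na.rm = na_rm)
}

# Style has no effect on the outcome: both versions return the same value.
readings <- c(18.5, 21.0, NA, 19.5)
identical(meanTemp(readings), mean_temp(readings))
```

The second version takes a few more lines, but the arguments and the body are much easier to pick out at a glance.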
Cookiecutter is a tool that applies a project template, such as the one seen to the right. Fundamentally, this project template is not much more than a bunch of directories and a few placeholder files. However, this basic structure goes a long way in making a project more ‘grokkable’. Because the directories already exist at the start of a project, you no longer have to think deeply about a logical place to put new code files. Even better, other people looking at your project no longer have to go on a hunt to find the one file they are looking for.
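To show how little a template really is, the sketch below builds a comparable skeleton by hand in R. This is not how cookiecutter itself works, and the directory and file names are hypothetical: they are not taken from our actual template.

```r
# Hypothetical layout: these names are illustrative only.
project <- file.path(tempdir(), "example-project")
dirs <- c("data", "R", "reports", "tests")

# Create every directory of the skeleton up front.
for (d in dirs) {
  dir.create(file.path(project, d), recursive = TRUE, showWarnings = FALSE)
}

# A placeholder file, similar to the ones a template ships with.
writeLines("# example-project", file.path(project, "README.md"))

# Show the top level of the new project.
list.files(project)
```

The value of a real template is that everyone starts from the same skeleton, so the same kind of file ends up in the same place in every project.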
Whether you actually take the time to make a change depends heavily on the effort it takes and the reward you expect from it. Therefore, to maximize the probability that everyone abides by the Tidyverse style guide, we need to make it as easy as possible. This is done by ‘automatic styling’. In R, this is achieved by a combination of the ‘styler’ package and RStudio add-ins. After installing the styler package, automatically styling your code is only two clicks away!
A picture speaks a thousand words (or rather, fixes a thousand styling errors). Even RStudio itself sighs in relief as all its warnings disappear with just two clicks.
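If you prefer the console over clicking, styler can also be called directly. The snippet below is a minimal sketch that assumes the styler package is installed; the messy one-liner is made up for this example and "filename.R" is a placeholder.

```r
# Assumes styler is installed: install.packages("styler")
library(styler)

# A deliberately messy one-liner (hypothetical example code).
messy <- "meanTemp=function(x,naRm=TRUE){mean(x,na.rm=naRm)}"

# style_text() returns the restyled code as a character vector.
# The default style is the Tidyverse style guide.
styled <- style_text(messy)
print(styled)

# style_file() restyles a file in place ("filename.R" is a placeholder):
# style_file("filename.R")
```

Note that styler changes layout such as spacing and indentation. It does not rename variables, so ‘meanTemp’ stays as it is.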
Even though automatic styling solves a lot of issues, it cannot solve everything. For example, we want to enforce that all variables are lowercase, and that lines are not too long. These changes would be slightly too dangerous to make automatically: renaming a variable in one place, for instance, can break code that uses it elsewhere. For these issues, we have a linter. Linting can be done by the excellent ‘lintr’ package, and is as simple as running the command ‘lint("filename.R")’. But can we make it even easier using RStudio add-ins? Yes!
A linter only reports issues and does not fix them, but lintr integrates with RStudio to create a list you can click through to solve the issues one by one. Lintr does not create an RStudio add-in by default, so we created one. If you’re interested, it’s available at https://dev.ynformed.nl/source/YnfoRmed/
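The lint call mentioned above can be sketched as follows. This assumes the lintr package is installed. The file contents are made up for this example, and the two linters are simply the ones that match the naming and line-length rules discussed here, not necessarily our exact configuration.

```r
# Assumes lintr is installed: install.packages("lintr")
library(lintr)

# Write a small file with two deliberate problems: a variable name that
# is not snake_case and a line that is far longer than 80 characters.
path <- tempfile(fileext = ".R")
writeLines(c(
  "MyTotal <- 1 + 2",
  paste0("long_sum <- ", paste(rep("1", 45), collapse = " + "))
), path)

# Check only the two rules discussed here. Calling lint(path) without
# the `linters` argument applies lintr's default set instead.
issues <- lint(path, linters = list(
  object_name = object_name_linter(styles = "snake_case"),
  line_length = line_length_linter(80)
))
print(issues)
```

Each reported issue comes with a file name, line number and message, which is what makes the click-through list in RStudio possible.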
Even after automatically styling and linting, not every issue is fixed. For example, I wrote a piece of code for this blog that passes all our automatic tests. Unfortunately, this code is far from ‘high-quality’. Even after looking at it for a while, I’m not sure what it actually does, and using ‘T’ as a variable name certainly doesn’t help. For these scenarios, the fastest solution is a second pair of eyes. The reviewer’s job is not to fix the code, but simply to point out potential issues. A comment such as “this function is a breeding ground for bugs” can prevent a lot of problems.
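To illustrate why a name like ‘T’ is risky, here is a small sketch. It is a made-up example, not the piece of code described above.

```r
# `T` is an ordinary variable that happens to hold TRUE, so it can be
# reassigned. `TRUE` is a reserved word and cannot.
T <- 0 # e.g. a temperature or a time step

values <- c(4, NA, 8)

# With the shorthand, na.rm is now 0, which R treats as FALSE,
# so the missing value is no longer dropped.
mean(values, na.rm = T) # NA
mean(values, na.rm = TRUE) # 6
```

Nothing here crashes or throws a warning, which is exactly why this kind of problem is easier for a reviewer to spot than for a tool.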
Another advantage of code review is that most people will unconsciously write better code if they know it’s going to be reviewed. After all, why waste time writing a lot of bad code if you know you will have to rewrite it all after the reviewer tears it apart? And of course, getting your code reviewed by an experienced programmer can be a learning opportunity for data scientists with less coding experience.
As you can see, there are several safeguards in place at Ynformed to enforce cleaner code, ranging from the automatic fixing of minor styling errors to the manual inspection of code by a reviewer. Each serves a purpose, and each helps with not only writing better code in the present, but also making the code easier to use in the future. After all, even if you know your own code like the back of your hand today, what about two months from now? It’s always a good idea to write code for the maintainer (http://wiki.c2.com/?CodeForTheMaintainer). If that maintainer is you, then all the more reason to write clean code.
For those worrying about Python code: we are doing many, if not all, of these things in Python as well. Luckily, this is often even easier in Python, as there are many tools to support the widely used PEP 8 standard.