Every data scientist who uses neural networks recognizes the following issue: your computer isn’t powerful enough, and complex algorithms take hours and hours to finish. At Ynformed we felt we had to tackle this recurring issue once and for all, and we found a solution that helps our colleagues run their models faster. How? We built our own deep learning machine! How did we do this? Read more about our considerations and the process below.
When we have lots of data, especially machine-generated data, we prefer to use neural networks. Not only because they can handle vague input data, but also because they do not require a lot of feature generation. Neural networks have made incredible advances in image recognition, speech synthesis and translation, but there are some downsides. One of these is the enormous training time these models can take. A single model can easily train for hours, days or even weeks. This is a direct result of how they work, and while methods exist to mitigate this (such as transfer learning), it cannot be avoided entirely.
CPU vs GPU
It turns out the computation time is mainly the result of a lot of relatively simple mathematics. What if we could speed up these operations? Normal computers have CPUs that can do quite complex computations, but those are not the kind of computations we need to speed up. To solve our problem we need Graphics Processing Units (GPUs): hardware specifically designed for rendering images, for example when running games or editing video. GPUs are very good at data-parallel computations. More specifically, GPUs can do matrix and vector operations much faster than a CPU. To iterate quickly on training neural networks, it is beneficial to use a machine with one or more modern GPUs.
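To make the "lots of simple mathematics" point concrete, here is a minimal pure-Python sketch of the forward pass of a single fully connected layer. The function names and the layer sizes are our own illustrative assumptions, not something taken from a real model or library. The point is only that every output value is an independent sum of products, which is exactly the kind of work that can be spread over many GPU cores.

```python
def dense_forward(inputs, weights):
    """Multiply a (batch x n_in) matrix by an (n_in x n_out) matrix.

    Hypothetical helper for illustration: no bias, no activation function.
    """
    n_in = len(weights)
    n_out = len(weights[0])
    outputs = []
    for row in inputs:
        # Each output cell is its own dot product and does not depend on
        # any other cell, so all cells could be computed at the same time.
        outputs.append([
            sum(row[k] * weights[k][j] for k in range(n_in))
            for j in range(n_out)
        ])
    return outputs


def multiply_add_count(batch, n_in, n_out):
    """Number of multiply-add operations in the forward pass above."""
    return batch * n_in * n_out


# Tiny example: one input row with two features, two output units.
small = dense_forward([[1.0, 2.0]], [[3.0, 4.0], [5.0, 6.0]])

# Assumed sizes: a batch of 64 rows through one 1024 x 1024 layer.
operations = multiply_add_count(64, 1024, 1024)
```

With the assumed sizes, a single layer already needs tens of millions of multiply-adds per batch, and training repeats this for every layer and every batch. None of the individual operations is hard; there are simply a lot of them.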
Conclusion: we need GPU(s). The question now is how we are going to get these GPUs. We explored two options: using a cloud provider and building our own deep learning machine.
We first looked into GPU computing offered as a service on cloud platforms. At Ynformed we mainly use Microsoft Azure and sometimes AWS. However, it turns out that GPU computing as a service is quite a new development, as are other machine learning workspaces. This also means that things are likely to change in the near future. Since cloud-based APIs for training neural networks on GPUs are prone to frequent changes, we would have to invest quite some time in keeping up to date with this approach. Right now we prefer a faster solution that colleagues can start using immediately.
Another important fact is that these cloud solutions are not cheap. Running a single instance for a month can cost almost 3,000 euros. Of course, there are ways to minimize the time you pay for. You could pay only for the time you actually use the hardware, but this also means extra work. Besides, we considered the psychological aspect: if you want to try something new and immediately have to start paying, it will likely scare some colleagues away.
Building our own deep learning machine
In terms of costs, building your own machine was clearly the better option for us. Cloud solutions generally use a pay-per-use model, whereas building your own machine mainly leads to up-front costs (OpEx vs. CapEx), simply because you buy the hardware yourself instead of renting it from someone else. Besides, we found that the hardware we needed is quite affordable. Once the machine is built, using it is much cheaper than pay-per-use, since the only direct running cost is electricity.
The downside of building your own computer is the maintenance. With more hardware comes more maintenance. With a cloud solution the provider is responsible for maintenance, but with our own machine we need to do this ourselves. We had to consider whether it was worth our time, and we decided it was, partly because we are used to maintaining other applications within the company and partly because we expect that maintaining a new machine will be a rather small weekly task.
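The trade-off between up-front and pay-per-use costs can be sketched as a simple break-even calculation. This is a rough illustration only: the cloud price of 3,000 euros per month is the upper figure mentioned above, while the hardware price, the electricity cost and the usage fraction are hypothetical placeholders rather than our actual numbers. The function name is also our own, and maintenance time is left out.

```python
def break_even_months(hardware_cost, cloud_cost_full_month,
                      electricity_cost_per_month, usage_fraction=1.0):
    """Months after which an own machine becomes cheaper than the cloud.

    usage_fraction is the share of the month the cloud instance is
    actually running (and billed). Electricity is treated as a flat
    monthly cost to keep the sketch simple.
    """
    cloud_cost = cloud_cost_full_month * usage_fraction
    monthly_saving = cloud_cost - electricity_cost_per_month
    if monthly_saving <= 0:
        raise ValueError("The cloud is cheaper per month; no break-even point.")
    return hardware_cost / monthly_saving


# Hypothetical numbers: 2,500 euros of hardware, 50 euros of electricity.
always_on = break_even_months(2500, 3000, 50)

# Same machine, but the cloud instance only runs 10% of the time.
occasional = break_even_months(2500, 3000, 50, usage_fraction=0.1)
```

Under these assumptions an always-on cloud instance costs more than the hardware within the first month, while occasional use pushes the break-even point out to several months. How the numbers work out for you depends entirely on your own prices and usage.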
Time to build
The decision was made: we would build our own machine! First, we searched the internet for people who had done this before. As it turned out, there were quite a lot of them. We combined their experience into a machine that suits our needs. We started with a single GPU, leaving room to add more, and picked the other hardware to support it. The GPU we chose is an all-rounder: the RTX 2070. We based this decision on an article by Tim Dettmers.
Next, we ordered the parts and picked an evening to build it. Quite a few Ynformers were interested in our new professional piece of hardware (or ‘toy’), so it did not take long to gather some volunteers to help with the building and installation process. We have a secure server room in our office where we could put the machine, so we decided to get a rack case. The building itself is similar to building a regular PC: not that complicated and a lot of fun.
After comparing cloud solutions with building a machine and running it on-premises, we decided to build our own deep learning machine. Nevertheless, we would not advise anyone to rule out cloud solutions for the problem discussed. It all depends on your situation, budget and goals. Although we have chosen an on-premises solution, time will tell whether it remains feasible in the long run. Depending on the number of requests, we might upgrade the machine or even consider using a machine learning solution from a cloud provider. But until then, we will test our machine to the max and hope that our colleagues will find many uses for it.
Curious about this project? About the speed-ups we saw? A blog post about installing the software, monitoring everything and keeping up with the rapidly changing tools will follow soon.