Most data scientists works with Python. Well known machine learning libraries allows them to solve many real world problems. Our company is no exception.
As a computer scientist, I work with many languages, one of them is Go (https://tour.golang.org/). This language has good qualities : it is easy to write or read (not as easy as Python but much easier than C or C++ for instance), it compiles to native libraries and executables with high performance and offers a good parallel programming paradigm.
This article is an introduction to go programming for data scientists. We will use some mathematical libraries, one of them is Gonum (https://github.com/gonum/gonum). This excellent library provides linear algebra, statistics and a lot more. In Go, the central repository of all libraries is Github.
Go is a statically and strongly typed language, which can seem burdensome, especially to data scientists used to code with Python. But this allows the go compiler to produce highly optimized native code.
This is a well known dataset that describes a graph which nodes are members and edges indicate that members are friends. We will try to find communities of members using spectral clustering. I like this example because we will see how to manipulate matrices in Go as well as a Kmeans implementation.
Refer to http://konect.cc/networks/ucidata-zachary/ and https://en.wikipedia.org/wiki/Zachary's_karate_club for more information on this dataset and its history.
We are really attached to produce good quality software. One way to achieve this is to do TDD : Test Driven Development. It's a very simple concept : each feature of a program is written in short cycles composed of 3 steps :
Using TDD allows to produce fully tested software as well as readable and reusable code. We will use this method to build our program. If you don't already use it, I hope I will convince you to give it a try for your own developments.
More information on TDD here : http://butunclebob.com/ArticleS.UncleBob.TheThreeRulesOfTdd
All the code is available on Github : https://github.com/wearelumenai/getswgonum
The dataset is given in the
karate.csv file. We want to load it in a matrix type. Let us begin by writing our first test (1st step of TDD), in the file
We first load the data in a matrix and then we test if it is correct by testing well known facts :
* Actually TDD requires that the test must be the smallest that fails. This is not the case here but I had to deal with the readability of the text.
Of course, this test will fail because we have not provided an implementation for the function
We will write the production code in the file
spectral.go (2nd step of TDD cycle) :
In this code, we use the Gonum library to build a matrix of type Symmetric. Now we run the test using the
go test command and see that it passes :
Testing with Go is really easy, it has been included since the early conception of the language.
We would want our program to be more readable. Let us split
LoadAdjacency in two separate functions. This is called refactoring (the 3rd and last step of the TDD cycle) :
We also want our code to be generic and to be able to load any adjacency matrix : we need to get rid of the magic number 34 and allow the matrix to grow as necessary. Thus we refactor a bit more :
Let's verify that the refactoring did not introduce any regression, that is our test still passes :
Yes ! We have completed our first TDD cycle !
The Laplacian is given by L = D - A where A is the adjacency matrix and D is a diagonal matrix which values are the node degrees. We will proceed with two TDD cycle :
First we write a test in
Here we reuse the LoadAdjacency function and introduce the GetDegrees function (which is not yet implemented). Then we test that member 0 has a degree of 16.
Then we write the implementation code in the file
We have used the Gonum library to build a matrix of type Diagonal. Our tests now pass :
First write a test :
Then the production code :
GetLaplacian function reuses the
Our tests still pass :
We want to run a clustering algorithm ; for this we need a vector space. Thus we will not work on the laplacian matrix but on its eigen vectors. We need the
k eigen vectors corresponding to the smallest eigen values except the first (which value is equal to zero and vector has all components equal).
Let's write the test :
In this test we first check that the result has 34 rows (the number of members) and 4 columns (the number of dimension we want for clustering). Then we verify that eigen vectors have norm 1.
Now we write the implementation :
This is really easy thanks to Gonum !
Tests pass :
It's time for the 3rd TDD step : refactoring. Now I want to refactor my test to make it more clear to the reader :
Now we write our last test :
We have introduced the last function to be implemented :
KMeans, it takes the eigen decomposition result matrix. Then we test the following facts (that are well known - see the documentations linked above) :
Here is the production code :
The go tool is able to give us the coverage of our tests (i.e.) the percentage of code which is effectively tested :
We cover 97% of our code. We can see by using
go tool cover that the missing statements are in the
LoadAdjacendy function :
Missing statements are actually error cases. Let's fix that ! We add a test that tries to load a file that does not exist :
And now :
That's it ! 100% of our code is now tested and ready for production.
Go tool allows to easily write a benchmark to test and profile runtime performances. We add the following function to the file
Let's run and profile the benchmark :
The algorithm was executed 10000 times in 2 seconds. Now we can analyze which calls take the most CPU time :
We see that most of the time is used to compute the eigen vectors decomposition (Gonum uses the lapack library under the hood). This tool is very useful to optimize our programs, you can also use the
web command in the
pprof prompt to display a graphical tree of CPU usage : it's amazing.
Finally we want to be able to execute our program. We will add a
main function in the file
main.go that split the members in two communities and display the result :
It is not tested but you can see that this function is exactly the same as the first lines of our last test function
We build the executable and run our program :
This result is pretty good, you can verify that by referring to the literature (links above).
I hope you enjoyed our approach to build a simple machine learning program in Go language. The key points are :
go tool pprofcommand
Try it yourself !
We just sent you an email. Please click the link in the email to confirm your subscription!