Return to site

Machine Learning, Test Driven Development and Go language

Most data scientists works with Python. Well known machine learning libraries allows them to solve many real world problems. Our company is no exception.

As a computer scientist, I work with many languages, one of them is Go (https://tour.golang.org/). This language has good qualities : it is easy to write or read (not as easy as Python but much easier than C or C++ for instance), it compiles to native libraries and executables with high performance and offers a good parallel programming paradigm.

This article is an introduction to go programming for data scientists. We will use some mathematical libraries, one of them is Gonum (https://github.com/gonum/gonum). This excellent library provides linear algebra, statistics and a lot more. In Go, the central repository of all libraries is Github.

Go is a statically and strongly typed language, which can seem burdensome, especially to data scientists used to code with Python. But this allows the go compiler to produce highly optimized native code.

Zachary karate club

This is a well known dataset that describes a graph which nodes are members and edges indicate that members are friends. We will try to find communities of members using spectral clustering. I like this example because we will see how to manipulate matrices in Go as well as a Kmeans implementation.

Refer to http://konect.cc/networks/ucidata-zachary/ and https://en.wikipedia.org/wiki/Zachary's_karate_club for more information on this dataset and its history.

Test Driven Development

We are really attached to produce good quality software. One way to achieve this is to do TDD : Test Driven Development. It's a very simple concept : each feature of a program is written in short cycles composed of 3 steps :

  1. Write a test
  2. Write production code that fulfill the test
  3. Refactor the code

Using TDD allows to produce fully tested software as well as readable and reusable code. We will use this method to build our program. If you don't already use it, I hope I will convince you to give it a try for your own developments.

More information on TDD here : http://butunclebob.com/ArticleS.UncleBob.TheThreeRulesOfTdd

All the code is available on Github : https://github.com/wearelumenai/getswgonum

Load the dataset in a matrix

The dataset is given in the karate.csv file. We want to load it in a matrix type. Let us begin by writing our first test (1st step of TDD), in the file spectral_test.go*

We first load the data in a matrix and then we test if it is correct by testing well known facts :

  1. there must be 34 members
  2. member 0 is a friend of member 11
  3. member 0 is not a friend of member 29

* Actually TDD requires that the test must be the smallest that fails. This is not the case here but I had to deal with the readability of the text.

Of course, this test will fail because we have not provided an implementation for the function LoadAdjacency.

We will write the production code in the file spectral.go (2nd step of TDD cycle) :

In this code, we use the Gonum library to build a matrix of type Symmetric. Now we run the test using the go test command and see that it passes :

Testing with Go is really easy, it has been included since the early conception of the language.

We would want our program to be more readable. Let us split LoadAdjacency in two separate functions. This is called refactoring (the 3rd and last step of the TDD cycle) :

We also want our code to be generic and to be able to load any adjacency matrix : we need to get rid of the magic number 34 and allow the matrix to grow as necessary. Thus we refactor a bit more :

Let's verify that the refactoring did not introduce any regression, that is our test still passes :

Yes ! We have completed our first TDD cycle !

Compute the Laplacian Matrix

The Laplacian is given by L = D - A where A is the adjacency matrix and D is a diagonal matrix which values are the node degrees. We will proceed with two TDD cycle :

  1. Compute the D matrix
  2. Compute the L matrix
The D matrix

First we write a test in spectral_test.go :

Here we reuse the LoadAdjacency function and introduce the GetDegrees function (which is not yet implemented). Then we test that member 0 has a degree of 16.

Then we write the implementation code in the file spectral.go :

We have used the Gonum library to build a matrix of type Diagonal. Our tests now pass :

The L matrix

First write a test :

Then the production code :

The GetLaplacian function reuses the GetDegrees function.

Our tests still pass :

Eigen vectors decomposition

We want to run a clustering algorithm ; for this we need a vector space. Thus we will not work on the laplacian matrix but on its eigen vectors. We need the k eigen vectors corresponding to the smallest eigen values except the first (which value is equal to zero and vector has all components equal).

Let's write the test :

In this test we first check that the result has 34 rows (the number of members) and 4 columns (the number of dimension we want for clustering). Then we verify that eigen vectors have norm 1.

Now we write the implementation :

This is really easy thanks to Gonum !

Tests pass :

It's time for the 3rd TDD step : refactoring. Now I want to refactor my test to make it more clear to the reader :

Now the Kmeans

Now we write our last test :

We have introduced the last function to be implemented : KMeans, it takes the eigen decomposition result matrix. Then we test the following facts (that are well known - see the documentations linked above) :

  1. member 0 and member 33 are not in the same community
  2. member 2 is in the same community as member 1
  3. member 32 is in the same community as member 33

Here is the production code :

Here we have used this library : https://github.com/bugra/kmeans. We have chosen it because it is well tested and we love that : it is a guaranty of quality.

All tests pass :

Test coverage

The go tool is able to give us the coverage of our tests (i.e.) the percentage of code which is effectively tested :

We cover 97% of our code. We can see by using go tool cover that the missing statements are in the LoadAdjacendy function :

Missing statements are actually error cases. Let's fix that ! We add a test that tries to load a file that does not exist :

And now :

That's it ! 100% of our code is now tested and ready for production.

Benchmark the code

Go tool allows to easily write a benchmark to test and profile runtime performances. We add the following function to the file spectral-test.go :

Let's run and profile the benchmark :

The algorithm was executed 10000 times in 2 seconds. Now we can analyze which calls take the most CPU time :

We see that most of the time is used to compute the eigen vectors decomposition (Gonum uses the lapack library under the hood). This tool is very useful to optimize our programs, you can also use the web command in the pprof prompt to display a graphical tree of CPU usage : it's amazing.

The main program

Finally we want to be able to execute our program. We will add a main function in the file main.go that split the members in two communities and display the result :

It is not tested but you can see that this function is exactly the same as the first lines of our last test function Test_KMeans.

We build the executable and run our program :

This result is pretty good, you can verify that by referring to the literature (links above).

Conclusion

I hope you enjoyed our approach to build a simple machine learning program in Go language. The key points are :

  1. we have used the Gonum library which constitute an amazing work that provides a lot of powerful mathematical tools
  2. we have used Test Driven Development to build a well tested program using intensively the go test command
  3. we were able to analyze the runtime performance of our program thanks to the go tool pprof command

Try it yourself !

All Posts
×

Almost done…

We just sent you an email. Please click the link in the email to confirm your subscription!

OKSubscriptions powered by Strikingly