R vs Python for cluster analysis

By Wajdi Farhani

· ToolsAndTech

While Python and R are known as the preferred languages of data scientists, the question of which one is "better" for data analysis was, and still is, frequently asked. In this article, I will try to give a detailed comparison based on our particular experience with both languages in cluster analysis.

To solve data analysis problems, we mainly need a technology with:

  • Trusted statistical tools.
  • Good reliability and strong performance.

As a short answer, both Python and R provide libraries that cover the majority of statistical techniques and methods, while the performance question is relative: the answer depends on the use case. Before we begin, let's look at some history:

Python:

Python is a high-level language created in the late 1980s and early 1990s by the Dutch computer programmer Guido van Rossum, who was inspired by the ABC language.

Guido van Rossum wrote:

"Over six years ago, in December 1989, I was looking for a "hobby" programming project that would keep me occupied during the week around Christmas. My office ... would be closed, but I had a home computer, and not much else on my hands. I decided to write an interpreter for the new scripting language I had been thinking about lately: a descendant of ABC that would appeal to Unix/C hackers. I chose Python as a working title for the project, being in a slightly irreverent mood (and a big fan of Monty Python's Flying Circus)."

Since then, Python has become a widely used high-level, general-purpose, interpreted, dynamic programming language, and it is often compared favorably to Lisp, Tcl, Perl, Ruby, C#, Visual Basic, Visual FoxPro, Scheme and Java.

Although Python was not specifically designed for scientific computing, its adoption for that purpose in both industrial applications and academic research has increased significantly since the early 2000s, thanks to its ecosystem of open source libraries.

R:

R is a programming language and software environment for statistical computing. It was created by Ross Ihaka and Robert Gentleman in the early 1990s, released as free software in 1995, and is supported by the R Foundation for Statistical Computing.

The goal behind creating R was to implement the S programming language in a clear and user-friendly way that facilitates statistical data analysis.

At first, R was created exclusively for academic use, but more recently it has also been used in industry by large organizations such as Bank of America, Benetech, Bing and Facebook (here is a list of some companies using R).

So, R or Python?

There are a lot of good comparisons on the web based on numbers: the adoption of R and Python, their popularity, the average salary of each language's practitioners and other general criteria. Take a look at this numbers-based comparison, for example.

I will try to talk about our particular experience with both languages because we had to make this choice for implementing one of our algorithms (Online Clustering Algorithm) in an industrial environment.

Online Clustering Algorithm (OCA) is a complex algorithm that can classify a huge stream of data in real time with no prior knowledge about the behavior of the generator (number of clusters, distribution, etc.). It was motivated by recent theoretical advances in online learning but raises many computational challenges.

Naturally, the choice came down to Python or R, especially since we needed a lot of statistical tools and a user-friendly language given the complexity of the algorithm.

Because OCA is a real-time clustering algorithm, it must react instantly to each incoming data point of the stream. The reaction time absolutely has to be less than the inverse of the generator's average frequency; otherwise some of the incoming data will be missed by the clustering system.
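To make this constraint concrete: if the generator emits on average f points per second, each point must be handled in less than 1/f seconds. Below is a minimal sketch of such a check. The names `within_budget` and `process_point`, as well as the frequency used in the example, are hypothetical and chosen for illustration only; they are not part of OCA.

```python
import time

def within_budget(process_point, points, avg_frequency_hz):
    """Return True if every point is processed in less than 1 / avg_frequency_hz seconds."""
    budget = 1.0 / avg_frequency_hz  # maximum allowed reaction time per point, in seconds
    for point in points:
        start = time.perf_counter()
        process_point(point)
        elapsed = time.perf_counter() - start
        if elapsed >= budget:
            # The next point would arrive before this one has been handled.
            return False
    return True

# Hypothetical usage: a trivial handler and a stream of 100 points per second on average.
print(within_budget(lambda x: x * 2, range(1000), avg_frequency_hz=100.0))
```

In a real system the handler would be the clustering update itself, and the check would run on the target hardware rather than on a development laptop.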

We benchmarked R and Python to compare their execution times (on a Dell Latitude E7450 with an Intel(R) Core(TM) i5-5300U CPU @ 2.30GHz). In the following, I will present two examples:

  • A very simple data-simulation example.
  • The k-means algorithm.

Example 1 (random sampling)

Let's take a look at R and Python performance for a simple loop that simulates N variables following a uniform distribution on [0,1]:
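A minimal Python sketch of such a loop is given here. The function name `simulate_with_loop` and the values of N are illustrative assumptions, not the exact benchmarked code.

```python
import random
import time

def simulate_with_loop(n):
    """Draw n uniform values in [0, 1) one at a time; return them with the elapsed time."""
    start = time.perf_counter()
    values = []
    for _ in range(n):
        values.append(random.random())  # one draw per iteration, on purpose
    elapsed = time.perf_counter() - start
    return values, elapsed

# Assumed values of N, for illustration only.
for n in (10_000, 100_000, 1_000_000):
    _, elapsed = simulate_with_loop(n)
    print(f"N={n}: {elapsed:.4f} s")
```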

The evolution of the elapsed time as a function of N on my Dell Latitude E7450 looks like this:

Figure: elapsed time versus N for the loop-based simulation (R and Python)

Wow! R is very slow! OK, we are done here... Python is by far our winner!

Let's not judge so quickly :-) The first rule to respect with R is to avoid for() loops! Here we are running a huge loop, which may well be the reason behind the huge difference in execution time. Let's try to generate the N values in one batch:

Here we are using R's runif(N, inf, sup) function, which generates a batch of N values; its equivalent in Python is np.random.uniform(inf, sup, N).
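The batch version in Python can be sketched as follows, assuming NumPy is installed. The function name `simulate_in_batch` is only illustrative.

```python
import time
import numpy as np

def simulate_in_batch(n, inf=0.0, sup=1.0):
    """Draw n uniform values between inf and sup in a single vectorized call."""
    start = time.perf_counter()
    values = np.random.uniform(inf, sup, n)  # one call, no Python-level loop
    elapsed = time.perf_counter() - start
    return values, elapsed

values, elapsed = simulate_in_batch(1_000_000)
print(f"N={values.size}: {elapsed:.4f} s")
```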

By eliminating the for() loop, R has become much faster, but the difference in execution time compared to Python is still large:

R/Python's execution time

Example 2 (k-means)

k-means is a very popular algorithm in cluster analysis that aims to partition n observations into k clusters. In the following, I will ask both Python and R to cluster a series of datasets with k = 2 and k = 10, and with n (the dataset length) between 1,000 and 100,000, using the same model with the same parameters.

To generate the datasets, I used the code below:
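The original generator is not shown here, so the following is only one possible way to build such datasets: Gaussian blobs around k random centers. Everything in it (the function name `make_dataset`, the dimension, the spread, the range of the centers and the seed) is an assumption made for illustration.

```python
import numpy as np

def make_dataset(n, k, dim=2, spread=0.5, seed=0):
    """Return n points drawn around k random centers, with their true cluster labels."""
    rng = np.random.default_rng(seed)
    centers = rng.uniform(-10.0, 10.0, size=(k, dim))  # assumed range for the centers
    labels = rng.integers(0, k, size=n)                # true cluster of each point
    points = centers[labels] + rng.normal(0.0, spread, size=(n, dim))
    return points, labels

# Dataset lengths in the range used in this article.
for n in (1_000, 10_000, 100_000):
    X, _ = make_dataset(n, k=2)
    print(n, X.shape)
```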

Within its stats package, R has an implementation of this algorithm, the kmeans() function, which I will use in the following:

For Python, I will use the well-known sklearn.cluster.KMeans class from scikit-learn:
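A minimal sketch of how the Python side can be timed, assuming scikit-learn and NumPy are installed. The helper `time_kmeans`, the placeholder data and the choice of `n_init=10` are assumptions for illustration, not the exact benchmarked settings.

```python
import time
import numpy as np
from sklearn.cluster import KMeans

def time_kmeans(X, k, seed=0):
    """Fit k-means on X and return (elapsed seconds, inertia)."""
    model = KMeans(n_clusters=k, n_init=10, random_state=seed)
    start = time.perf_counter()
    model.fit(X)
    elapsed = time.perf_counter() - start
    return elapsed, model.inertia_

# Placeholder data: 1,000 two-dimensional Gaussian points.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
for k in (2, 10):
    elapsed, inertia = time_kmeans(X, k)
    print(f"k={k}: {elapsed:.4f} s, inertia={inertia:.2f}")
```

For a fair comparison with R, the number of restarts and the maximum number of iterations should match on both sides, since the defaults of the two implementations are not necessarily the same.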

Once the code above had been executed, I drew some charts to summarize the execution times and loss:

Case 1: k = 2:

Elapsed time (k = 2)
Inertia (k = 2)

Case 2: k = 10:

Inertia (k = 10)

As we can see, R is much faster than Python in the first case, where there are only two centers, but it is less accurate: its inertia (the sum of squared distances between the points and their nearest centers) is higher, and its slope is steeper than Python's.
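For reference, scikit-learn's inertia_ is the sum of squared distances from each point to its nearest center. It can be recomputed by hand as in the sketch below, where the function name `inertia` and the tiny example data are mine:

```python
import numpy as np

def inertia(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    # Pairwise differences, shape (n_points, n_centers, dim).
    diffs = points[:, None, :] - centers[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)         # squared distances, shape (n_points, n_centers)
    return float(sq_dists.min(axis=1).sum())    # keep the nearest center for each point

points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0]])
centers = np.array([[0.0, 1.0], [10.0, 0.0]])
print(inertia(points, centers))  # 1 + 1 + 0 = 2.0
```

A lower value means the points sit closer to their assigned centers, which is the sense in which one result is called more accurate than another here.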

This means that for few clusters, k-means in R runs quickly but gives a less accurate result, while Python takes more time and converges to a more accurate solution.

In the second case (k = 10), R's performance deteriorates in terms of execution time, while Python remains efficient with excellent accuracy.

Conclusion

What I presented above doesn't mean that Python is "better" than R, but it gives an idea to data scientists and hackers who have a similar use case in unsupervised learning.

Of course, if you are an engineer or work in an engineering environment, you might prefer Python because of its object-oriented paradigm, its clarity and its huge open source community.

On the other hand, R is still by far the preferred language of statisticians and academic researchers, and I believe it's very important for data junkies to know and have some experience with this excellent language.
