Performance Improvements for the Graph Module of Sagemath

Sunday, August 16, 2015

Conclusion of the Main Part of the Project

Hi!
In this post, I will summarize the results obtained with the inclusion in Sage of Boost and igraph libraries. This was the main part of my Google Summer of Code project, and it was completed yesterday, when ticket 19003 was closed.

We have increased the number of graph algorithms available in Sage from 66 to 98 (according to the list used in the initial comparison of the graph libraries [1]). Furthermore, we decreased the running-time of several Sage algorithms: in some cases, we have been able to improve the asymptotic running-time, obtaining up to 10000x improvements in our tests. Finally, during the inclusion of external algorithms, we have refactored and cleaned some of Sage source code, like the shortest path routines: we have standardized the input and the output of 15 routines related to shortest paths, and we have removed duplicate code as much as possible.

More specifically, the first part of the project was the inclusion of Boost graph library: since the library is only available in C++, we had to develop an interface. This interface lets us convert easily a Sage graph into a Boost graph, and run algorithms on the converted graph. Then, we have written routines to re-translate the output into a Sage-readable format: this way, the complicated Boost library is "hidden", and users can interact with it as they do with Sage. In particular, we have interfaced the following algorithms:

Edge connectivity (trac.sagemath.org/ticket/18564);
Clustering coefficient (trac.sagemath.org/ticket/18811);
Cuthill-McKee and King vertex orderings (trac.sagemath.org/ticket/18876);
Minimum spanning tree (trac.sagemath.org/ticket/18910);
Dijkstra, Bellman-Ford, Johnson shortest paths (trac.sagemath.org/ticket/18931);

All these algorithms were either not available in Sage, or quite slow, compared to the Boost routines. As far as we know, Boost does not offer other algorithms that improve Sage algorithms: however, if such algorithms are developed in the future, it will be very easy to include them, using the new interface.

In the second part of the project, we included igraph: since this library already offers a Python interface, we decided to include it as an optional package (before it becomes a standard package, at least an year should pass [2]). To install the package, it is enough to type the following instruction from the Sage root folder:

sage -i igraph # To install the igraph C core
sage -i python_igraph # To install the Python interface

Then, we can easily interact with igraph: for a list of available routines, it is enough to type "igraph." and click tab twice. To convert a Sage graph g_sage into an igraph graph it is enough to type g_igraph = g_sage.igraph_graph(), while a Sage graph can be instantiated from an igraph graph through g_sage=Graph(g_igraph) or g_sage=DiGraph(g_igraph). This way, all igraph algorithms are now available in Sage.

Furthermore, we have included the igraph maximum flow algoritm inside the Sage corresponding function, obtaining significant improvements (for more information and benchmarks, we refer to ticket 19003 [3]).

In conclusion, I think the project reached its main goal, the original plan was followed very closely, and we have been able to overcome all problems.

Before closing this post, I would like to thank many people that helped me with great advices, and who provided great solutions to all the problems I faced. First of all, my mentor David Coudert: he always answered very fast to all my queries, and he gave me great suggestions to improve the quality of the code I wrote. Then, a very big help came from Nathann Cohen, who often cooperated with David in reviewing my code and proposing new solutions. Moreover, I have to thank Martin Cross, who gave me good suggestions with Boost graph library, and Volker Braun, who closed all my ticket. Finally, I have to thank the whole Sage community for giving me this great opportunity!

[1] https://docs.google.com/spreadsheets/d/1Iu1hkQtRn9J-sgfZbQTu2RoXzyjoMEWP5-cm3nAwnWE/edit?usp=sharing
[2] http://doc.sagemath.org/html/en/developer/coding_in_other.html
[3] http://trac.sagemath.org/ticket/19003

Monday, July 27, 2015

Including igraph Library

Hello!
In this new blog post, I would like to discuss the inclusion of igraph library inside Sage.
Up to now, I have interfaced Sagemath with Boost graph library, in order to run Boost algorithms inside Sage. Now, I want to do the same with igraph, the other major C++ graph library, which stands out because it contains 62 routines, 29 of which are not available in Sage. Moreover, igraph library is very efficient, as shown in [1] and in the previous post on library comparison.

This inclusion of igraph in Sage is quite complicated, because we have to include a new external library [2] (while in the Boost case we already had the sources). We started this procedure through ticket 18929: unfortunately, after this ticket is closed, igraph will only be an optional package, and we will have to wait one year before it becomes standard. The disadvantage of optional packages is that they must be installed before being able to use them; however, the installation is quite easy: it is enough to run Sage with option -i python_igraph.

After the installation, the usage of igraph library is very simple, because igraph already provides a Python interface, that can be used in Sage. To transform the Sagemath network g_sage into an igraph network g_igraph, it is enough to type g_igraph=g_sage.igraph_graph(), while to create a Sagemath network from an igraph network it is enough to type g_sage = Graph(g_igraph) or g_sage=DiGraph(g_igraph). After this conversion, we can use all routines offered by igraph!
For instance, if we want to create a graph through the preferential attachment model, we can do it with the Sagemath routine, or with the igraph routine:

sage: G = graphs.RandomBarabasiAlbert(100, 2)
sage: G.num_verts()
100
sage: G = Graph(igraph.Graph.Barabasi(100, int(2)))
sage: G.num_verts()
100

The result is the same (apart from randomness), but the time is very different:

sage: import igraph
sage: %timeit G = Graph(igraph.Graph.Barabasi(10000000, int(2)))
1 loops, best of 3: 46.2 s per loop
sage: G = graphs.RandomBarabasiAlbert(10000000, 2)
Stopped after 3 hours.

Otherwise, we may use igraph to generate graphs with Forest-Fire algorithm, which is not available in Sagemath:

sage: G = Graph(igraph.Graph.Forest_Fire(10, 0.1))
sage: G.edges()
[(0, 1, None), (0, 2, None), (1, 7, None), (2, 3, None), (2, 4, None), (3, 5, None), (3, 8, None), (4, 6, None), (8, 9, None)]

We may also do the converse: transform a Sage network into an igraph network and apply an igraph algorithm. For instance, we can use label propagation to find communities (a task which is not implemented in Sage):

sage: G = graphs.CompleteGraph(5)+graphs.CompleteGraph(5)
sage: G.add_edge(0,5)
sage: com = G.igraph_graph().community_label_propagation()
sage: len(com)
2
sage: com[0]
[0, 1, 2, 3, 4]
sage: com[1]
[5, 6, 7, 8, 9]

The algorithm found the two initial cliques as communities.

I hope that these examples are enough to show the excellent possibilities offered by igraph library, and that these features will soon be available in Sagemath!

[1] https://sites.google.com/a/imtlucca.it/borassi/unpublished-works/google-summer-of-code/library-comparison
[2] http://doc.sagemath.org/html/en/developer/packaging.html

Thursday, July 9, 2015

New Boost Algorithms

Hello!
My Google Summer of Code project is continuing, and I am currently trying to include more Boost algorithms in Sage. In this post, I will make a list of the main algorithms I'm working on.

Clustering Coefficient

If two different people have a friend in common, there is a high chance that they will become friends: this is the property that the clustering coefficient tries to capture. For instance, if I pick two random people, very probably they will not know each other, but if I pick two of my acquaintances, very probably they will know each other. In this setting, the clustering coefficient of a person is the probability that two random acquaintances of this person know each other. In order to quantify this phenomenon, we can formalize everything in terms of graphs: people are nodes and two people are connected if they are acquaintances. Hence, we define the clustering coefficient of a vertex $v$ in a graph $G=(V,E)$ as:
$$\frac{2|\{(x,y) \in E:x,y \in N_v\}|}{\deg(v)(\deg(v)-1)}$$ where $N_v$ is the set of neighbors of $v$ and $\deg(v)$ is the number of neighbors of $v$. This is exactly the probability that two random neighbors of $v$ are linked with an edge.
My work has included in Sagemath the Boost algorithm to compute the clustering coefficient, which is more efficient that the previous algorithm, which was based on NetworkX:

sage: g = graphs.RandomGNM(20000,100000)
sage: %timeit g.clustering_coeff(implementation='boost')
10 loops, best of 3: 258 ms per loop
sage: %timeit g.clustering_coeff(implementation='networkx')
1 loops, best of 3: 3.99 s per loop

But Nathann did better: he implemented a clustering coefficient algorithm from scratch, using Cython, and he managed to outperform the Boost algorithm, at least when the graph is dense. Congratulations, Nathann! However, when the graph is sparse, Boost algorithm still seems to be faster.

Dominator tree

Let us consider a road network, that is, a graph where vertices are street intersections, and edges are streets. The question is: if I close an intersection, where am I still able to go, assuming I am at home?
The answer to this question can be summarized in a dominator tree. Assume that, in order to go from my home to my workplace, I can choose many different paths, but all these paths pass through the café, then they pass through the square (that is, if either the café or the square is closed, then there is no way I can go to work). In this case, in the dominator tree, the father of my workplace is the square, the father of the square is the café, and the father of the café is my home, that is also the root of the tree. More formally, given a graph $G$, the dominator tree of $G$ rooted at a vertex $v$ is defined by connecting each vertex $x$ with the last vertex $y \neq x$ that belongs to each path from $v$ to $x$ (note that this vertex always exists, because $v$ belongs to each path from $v$ to $x$).
Until now, Sagemath did not have a routine to compute the dominator tree: I have been able to include the Boost algorithm. Unfortunately, due to several suggestions and improvements in the code, the ticket is not closed, yet. Hopefully, it will be closed very soon!

Cuthill-McKee ordering / King ordering

Let us consider a graph $G=(V,E)$: a matrix $M$ of size $|V|$ can be associated to this graph, where $M_{i,j}=1$ if and only if there is an edge between vertices $i$ and $j$.
In some cases, this matrix can have specific properties, that can be exploited for many purposes, like speeding-up algorithms. One of this properties is bandwidth, which measures how far the matrix is from a diagonal matrix: it is defined as $\max_{M_{i,j} \neq 0}|i-j|$. A small bandwidth might help in computing several properties of the graph, like eigenvalues and eigenvectors.
Since the bandwidth depends on the order of vertices, we can try to permute them in order to obtain a smaller value: in Sage, we have a routine that performs this task. However, this routine is very slow, and it is prohibitive even for very small graphs (in any case, finding an optimal ordering is NP-hard).
Hence, researchers have developed heuristics to compute good orderings: the most important ones are Cuthill-McKee ordering and King ordering. Boost contains both routines, but Sage does not: for this reason, I would like to insert these two functions. The code is almost ready, but part of it depends on the code of the dominator tree: as soon as the dominator tree is reviewed, I will open a ticket on these two routines!

Dijkstra/Bellman-Ford/Johnson shortest paths

Let us consider again a road network. In this case, we are building a GPS software, which has to compute the shortest path between the place where we are and the destination. The textbook algorithm that performed this task is Dijkstra algorithm, which computes the distance between the starting point and any other reachable point (of course, there are more efficient algorithms involving a preprocessing, but Dijkstra is the most simple, and its running-time is asymptotically optimal). This algorithm is already implemented in Sagemath.
Let's spice things up: what if that there are some streets with negative length? For instance, we like a street so much that we are willing to drive 100km more just to pass from that street, which is 50km long. It is like that street is -50km long!
First of all, under these assumptions, a shortest path might not exist: if there is a cycle with negative length, we may drive along that cycle all the times we want, decreasing more and more the distance to the destination. At least, we have to assume that no negative cycle exists.
Even with this assumption, Dijkstra algorithm does not work, and we have to perform Bellman-Ford algorithm, which is less efficient, but more general. Now, assume that we want something more: we are trying to compute the distance between all possible pairs of vertices. The first possibility is to run Bellman-Ford algorithm $n$ times, where $n$ is the number of nodes in the graph. But there is a better alternative: it is possible to perform Bellman-Ford algorithm only once, and then to modify the lengths of edges, so that all lengths are positive, and shortest paths are not changed. This way, we run Dijkstra algorithm $n$ times on this modified graph, obtaining a better running time. This is Johnson algorithm.
Both Bellman-Ford and Johnson algorithms are implemented in Boost and not in Sagemath. As soon as I manage to create weighted Boost graphs (that is, graphs where edges have a length), I will include also these two algorithm!

Thursday, June 25, 2015

Edge Connectivity through Boost Graph Library

After two weeks, we have managed to interface Boost and Sagemath!

However, the interface was not as simple as it seemed. The main problem we found is the genericity of Boost: almost all Boost algorithms work with several graph implementations, which differ in the data structures used to store edges and vertices. For instance, the code that implements breadth-first search works if the adjacency list of a vertex v is a vector, a list, a set, etc. This result is accomplished by using templates [1]. Unfortunately, the only way to interface Sagemath with C++ code is Cython, which is not template-friendly, yet. In particular, Cython provides genericity through fused types [2], whose support is still experimental, and which do not offer full integration with templates [3-5].

After a thorough discussion with David, Nathann, and Martin (thank you very much!), we have found a solution: for the input, we have defined a fused type "BoostGenGraph", including all Boost graph implementations, and all functions that interface Boost and Sagemath use this fused type. This way, for each algorithm, we may choose the most suitable graph implementation. For the output, whose type might be dependent on the input type, we use C++ to transform it into a "standard" type (vector, or struct).

We like this solution because it is very clean, and it allows us to exploit Boost genericity without any copy-paste. Still, there are some drawbacks:
1) Cython fused types do not allow nested calls of generic functions;
2) Boost graphs cannot be converted to Python objects: they must be defined and deleted in the same Cython function;
3) No variable can have a generic type, apart from the arguments of generic functions.

These drawbacks will be overcome as soon as Cython makes templates and generic types interact: this way, we will be able create a much stronger interface, by writing a graph backend based on Boost, so that the user might create, convert, and modify Boost graphs directly from Python. However, for the moment, we will implement all algorithms using the current interface, which already provides genericity, and which has no drawback if the only goal is to "steal" algorithms from Boost.

As a test, we have computed the edge connectivity of a graph through Boost: the code is available in ticket 18564 [6]. Since the algorithm provided by Sagemath is not optimal (it is based on linear programming), the difference in the running time is impressive, as shown by the following tests:

sage: G = graphs.RandomGNM(100,1000)
sage: %timeit G.edge_connectivity()
100 loops, best of 3: 1.42 ms per loop
sage: %timeit G.edge_connectivity(implementation="sage")
1 loops, best of 3: 11.3 s per loop

sage: G = graphs.RandomBarabasiAlbert(300,3)
sage: %timeit G.edge_connectivity(implementation="sage")
1 loops, best of 3: 9.96 s per loop
sage: %timeit G.edge_connectivity()
100 loops, best of 3: 3.33 ms per loop

Basically, on a random Erdos-Renyi graph with 100 vertices and 1000 edges, the new algorithm is 8,000 times faster, and on a random Barabasi-Albert graph with 300 nodes and average degree 3, the new algorithm is 3,000 times faster! This way, we can compute the edge connectivity of much bigger graphs, like a random Erdos-Renyi graph with 5,000 vertices and 50,000 edges:

sage: G = graphs.RandomGNM(5,000, 50,000)
sage: %timeit G.edge_connectivity()
1 loops, best of 3: 16.2 s per loop

The results obtained with this first algorithm are very promising: in the next days, we plan to interface several other algorithms, in order to improve both the number of available routines and the speed of Sagemath graph library!

[1] https://en.wikipedia.org/wiki/Template_%28C%2B%2B%29
[2] http://docs.cython.org/src/userguide/fusedtypes.html
[3] https://groups.google.com/forum/#!topic/cython-users/qQpMo3hGQqI
[4] https://groups.google.com/forum/#!searchin/cython-users/fused/cython-users/-7cHr6Iz00Y/Z8rS03P7-_4J
[5] https://groups.google.com/forum/#!searchin/cython-users/fused$20template/cython-users/-7cHr6Iz00Y/Z8rS03P7-_4J
[6] http://trac.sagemath.org/ticket/18564

Tuesday, June 9, 2015

Performance Comparison of Different Graph Libraries

As promised in the last post, I have compared the performances of several graph libraries, in order to choose which ones should be deployed with Sagemath. Here, I provide the main results of this analysis, while more details are available on my website (see also the links below).
The libraries chosen are the most famous graph libraries written in Python, C, or C++ (I have chosen these languages because they are easier to integrate in Sagemath, using Cython). Furthermore, I have excluded NetworkX, which is already deployed with Sagemath.
First of all, I have to enforce that no graph library comparison can be completely fair, and also this comparison can be criticized, due to the large amount of available routines, to the constant evolution of libraries, and to many small differences in the outputs (for instance, one library might compute the value of a maximum s-t flow, another library might actually compute the flow, and a third one might compute all maximum flows). Despite this, I have tried to be as fair as possible, through a deeper and more detailed analysis than previous comparisons (https://graph-tool.skewed.de/performance, http://www.programmershare.com/3210372/, http://arxiv.org/pdf/1403.3005.pdf).
The first comparison deals with the number of algorithms implemented. I have chosen a set of 107 possible algorithms, trying to cover all possible tasks that a graph library should perform (avoiding easy tasks that are common to all libraries, like outputting the number of nodes, the number of edges, the neighbors of a node, etc). In some cases, two tasks were collapsed in one, if the algorithms solving these tasks are very similar (for instance, computing a maximum flow and computing a minimum cut, computing vertex betweenness and edge betweenness, etc).
The number of routines available for each library is plotted in the following chart, and a table containing all features is available in HTML or as a Google Sheet.

The results show that Sagemath has more routines than all competitors (66), closely followed by igraph (62). All other libraries are very close to each other, having about 30 routines each. Furthermore, Sagemath could be improved in the fields of neighbor similarity measures (assortativity, bibcoupling, cocitation, etc), community detection, and random graph generators. For instance, igraph contains 29 routines that are not available in Sagemath.

The second comparison analyzes the running-time of some of the algorithms implemented in the libraries. In particular, I have chosen 8 of the most common tasks in graph analysis: computing the diameter, computing the maximum flow between two vertices, finding connected components and strongly connected components, computing betweenness centrality, computing the clustering coefficient, computing the clique number, and generating a graph with the preferential attachment model. I have run each of these algorithms on 3 inputs, and I have considered the total execution time (excluding the time needed to load the graph). More details on this experiment are available here, and the results are also available in a Google Sheet.
In order to make the results more readable, I have plotted the ratio between the time needed by a given library and the minimum time needed by any library. If an algorithm was not implemented, or it needed more than 3 hours to complete, the corresponding bar is not shown.

Overall, the results show that NetworKit is the fastest library, or one of the fastest, in all routines that are implemented (apart from the generation of preferential attachment graphs, where it is very slow). Boost graph library is very close to NetworKit, and it also contains more routines. Also Sagemath is quite efficient in all tasks, apart from the computation of strongly connected components and the generation of a preferential attachment graph, where it needed more than 3 hours. However, in the latter case, the main problem was not speed but memory consumption.

In conclusion, Sagemath can highly benefit from the possibility of using algorithms from other libraries. First of all, it might improve the number of algorithms offered, especially by including igraph, and it also might improve its performance, by including Boost, NetworKit, or other fast graph libraries.

Thursday, June 4, 2015

Comparison of Graph Libraries

Many times, people asked me "Which is the best available graph library?", or "Which graph library should I use to compute this, or that?".
Well, personally I love to use Sage, but there are also several good alternatives. Then, the question becomes "How could we improve Sage, so that people will choose it?".

In my opinion, graph libraries are compared according to the following parameters:

simplicity and documentation: people have little time, and the faster they learn how to use the library, the better;
number of routines available;
speed: sometimes, the input is very big, and the algorithms take much time to finish, so that a fast implementation is fundamental.

While it is very difficult to measure the first point, the others can be compared and improved. For this reason, in order to outperform other libraries, we should implement new features, and improve existing ones. You don't say!

However, this answer is not satisfactory: in principle, we could add all features available in other libraries, but this is a huge translational work, and while we are doing this work the other libraries will change, making this effort a never-ending story.

My project proposes an alternative: cooperating instead of competing. I will try to interface Sage with other libraries, and to use their algorithms when the Sage counterpart is not available, or less efficient. This way, with an affordable amount of work, we will be able to run all algorithms available in the best graph libraries!

As a first step, I have compared all the most famous C, C++, and Python graph libraries according to points 2 and 3, in order to choose which libraries should be included. The next posts will analyze the results of this comparison.

Google Summer of Code: let's start!

This blog will follow my Google Summer of Code project, entitled Performance Improvements for the Graph Module of Sagemath. The complete project is available here, and related documents with partial results will be available on the same website.
In this first post, I would like to thank my mentor David Coudert and Nathann Cohen, who helped me a lot in writing this project and understanding how the graph module of Sagemath works.
With their help, and with the help of the Sage community, I hope it will be a useful and funny work! Let's start!