Hello!

In this new blog post, I would like to discuss the inclusion of igraph library inside Sage.

Up to now, I have interfaced Sagemath with Boost graph library, in order to run Boost algorithms inside Sage. Now, I want to do the same with igraph, the other major C++ graph library, which stands out because it contains 62 routines, 29 of which are not available in Sage. Moreover, igraph library is very efficient, as shown in [1] and in the previous post on library comparison.

This inclusion of igraph in Sage is quite complicated, because we have to include a new external library [2] (while in the Boost case we already had the sources). We started this procedure through ticket 18929: unfortunately, after this ticket is closed, igraph will only be an optional package, and we will have to wait one year before it becomes standard. The disadvantage of optional packages is that they must be installed before being able to use them; however, the installation is quite easy: it is enough to run Sage with option -i python_igraph.

After the installation, the usage of igraph library is very simple, because igraph already provides a Python interface, that can be used in Sage. To transform the Sagemath network g_sage into an igraph network g_igraph, it is enough to type g_igraph=g_sage.igraph_graph(), while to create a Sagemath network from an igraph network it is enough to type g_sage = Graph(g_igraph) or g_sage=DiGraph(g_igraph). After this conversion, we can use all routines offered by igraph!

For instance, if we want to create a graph through the preferential attachment model, we can do it with the Sagemath routine, or with the igraph routine:

sage: G = graphs.RandomBarabasiAlbert(100, 2)

sage: G.num_verts()

100

sage: G = Graph(igraph.Graph.Barabasi(100, int(2)))

sage: G.num_verts()

100

The result is the same (apart from randomness), but the time is very different:

sage: import igraph

sage: %timeit G = Graph(igraph.Graph.Barabasi(10000000, int(2)))

1 loops, best of 3: 46.2 s per loop

sage: G = graphs.RandomBarabasiAlbert(10000000, 2)

Stopped after 3 hours.

Otherwise, we may use igraph to generate graphs with Forest-Fire algorithm, which is not available in Sagemath:

sage: G = Graph(igraph.Graph.Forest_Fire(10, 0.1))

sage: G.edges()

[(0, 1, None), (0, 2, None), (1, 7, None), (2, 3, None), (2, 4, None), (3, 5, None), (3, 8, None), (4, 6, None), (8, 9, None)]

We may also do the converse: transform a Sage network into an igraph network and apply an igraph algorithm. For instance, we can use label propagation to find communities (a task which is not implemented in Sage):

sage: G = graphs.CompleteGraph(5)+graphs.CompleteGraph(5)

sage: G.add_edge(0,5)

sage: com = G.igraph_graph().community_label_propagation()

sage: len(com)

2

sage: com[0]

[0, 1, 2, 3, 4]

sage: com[1]

[5, 6, 7, 8, 9]

The algorithm found the two initial cliques as communities.

I hope that these examples are enough to show the excellent possibilities offered by igraph library, and that these features will soon be available in Sagemath!

[1] https://sites.google.com/a/imtlucca.it/borassi/unpublished-works/google-summer-of-code/library-comparison

[2] http://doc.sagemath.org/html/en/developer/packaging.html

## Monday, July 27, 2015

## Thursday, July 9, 2015

### New Boost Algorithms

Hello!

My Google Summer of Code project is continuing, and I am currently trying to include more Boost algorithms in Sage. In this post, I will make a list of the main algorithms I'm working on.

If two different people have a friend in common, there is a high chance that they will become friends: this is the property that the clustering coefficient tries to capture. For instance, if I pick two random people, very probably they will not know each other, but if I pick two of my acquaintances, very probably they will know each other. In this setting, the clustering coefficient of a person is the probability that two random acquaintances of this person know each other. In order to quantify this phenomenon, we can formalize everything in terms of graphs: people are nodes and two people are connected if they are acquaintances. Hence, we define the clustering coefficient of a vertex \(v\) in a graph \(G=(V,E)\) as:

$$\frac{2|\{(x,y) \in E:x,y \in N_v\}|}{\deg(v)(\deg(v)-1)}$$ where \(N_v\) is the set of neighbors of \(v\) and \(\deg(v)\) is the number of neighbors of \(v\). This is exactly the probability that two random neighbors of \(v\) are linked with an edge.

My work has included in Sagemath the Boost algorithm to compute the clustering coefficient, which is more efficient that the previous algorithm, which was based on NetworkX:

sage: g = graphs.RandomGNM(20000,100000)

sage: %timeit g.clustering_coeff(implementation='boost')

10 loops, best of 3: 258 ms per loop

sage: %timeit g.clustering_coeff(implementation='networkx')

1 loops, best of 3: 3.99 s per loop

But Nathann did better: he implemented a clustering coefficient algorithm from scratch, using Cython, and he managed to outperform the Boost algorithm, at least when the graph is dense. Congratulations, Nathann! However, when the graph is sparse, Boost algorithm still seems to be faster.

Let us consider a road network, that is, a graph where vertices are street intersections, and edges are streets. The question is: if I close an intersection, where am I still able to go, assuming I am at home?

The answer to this question can be summarized in a dominator tree. Assume that, in order to go from my home to my workplace, I can choose many different paths, but all these paths pass through the café, then they pass through the square (that is, if either the café or the square is closed, then there is no way I can go to work). In this case, in the dominator tree, the father of my workplace is the square, the father of the square is the café, and the father of the café is my home, that is also the root of the tree. More formally, given a graph \(G\), the dominator tree of \(G\) rooted at a vertex \(v\) is defined by connecting each vertex \(x\) with the last vertex \(y \neq x\) that belongs to each path from \(v\) to \(x\) (note that this vertex always exists, because \(v\) belongs to each path from \(v\) to \(x\)).

Until now, Sagemath did not have a routine to compute the dominator tree: I have been able to include the Boost algorithm. Unfortunately, due to several suggestions and improvements in the code, the ticket is not closed, yet. Hopefully, it will be closed very soon!

Let us consider a graph \(G=(V,E)\): a matrix \(M\) of size \(|V|\) can be associated to this graph, where \(M_{i,j}=1\) if and only if there is an edge between vertices \(i\) and \(j\).

In some cases, this matrix can have specific properties, that can be exploited for many purposes, like speeding-up algorithms. One of this properties is bandwidth, which measures how far the matrix is from a diagonal matrix: it is defined as \(\max_{M_{i,j} \neq 0}|i-j|\). A small bandwidth might help in computing several properties of the graph, like eigenvalues and eigenvectors.

Since the bandwidth depends on the order of vertices, we can try to permute them in order to obtain a smaller value: in Sage, we have a routine that performs this task. However, this routine is very slow, and it is prohibitive even for very small graphs (in any case, finding an optimal ordering is NP-hard).

Hence, researchers have developed heuristics to compute good orderings: the most important ones are Cuthill-McKee ordering and King ordering. Boost contains both routines, but Sage does not: for this reason, I would like to insert these two functions. The code is almost ready, but part of it depends on the code of the dominator tree: as soon as the dominator tree is reviewed, I will open a ticket on these two routines!

Let us consider again a road network. In this case, we are building a GPS software, which has to compute the shortest path between the place where we are and the destination. The textbook algorithm that performed this task is Dijkstra algorithm, which computes the distance between the starting point and any other reachable point (of course, there are more efficient algorithms involving a preprocessing, but Dijkstra is the most simple, and its running-time is asymptotically optimal). This algorithm is already implemented in Sagemath.

Let's spice things up: what if that there are some streets with negative length? For instance, we like a street so much that we are willing to drive 100km more just to pass from that street, which is 50km long. It is like that street is -50km long!

First of all, under these assumptions, a shortest path might not exist: if there is a cycle with negative length, we may drive along that cycle all the times we want, decreasing more and more the distance to the destination. At least, we have to assume that no negative cycle exists.

Even with this assumption, Dijkstra algorithm does not work, and we have to perform Bellman-Ford algorithm, which is less efficient, but more general. Now, assume that we want something more: we are trying to compute the distance between all possible pairs of vertices. The first possibility is to run Bellman-Ford algorithm \(n\) times, where \(n\) is the number of nodes in the graph. But there is a better alternative: it is possible to perform Bellman-Ford algorithm only once, and then to modify the lengths of edges, so that all lengths are positive, and shortest paths are not changed. This way, we run Dijkstra algorithm \(n\) times on this modified graph, obtaining a better running time. This is Johnson algorithm.

Both Bellman-Ford and Johnson algorithms are implemented in Boost and not in Sagemath. As soon as I manage to create weighted Boost graphs (that is, graphs where edges have a length), I will include also these two algorithm!

My Google Summer of Code project is continuing, and I am currently trying to include more Boost algorithms in Sage. In this post, I will make a list of the main algorithms I'm working on.

#### Clustering Coefficient

If two different people have a friend in common, there is a high chance that they will become friends: this is the property that the clustering coefficient tries to capture. For instance, if I pick two random people, very probably they will not know each other, but if I pick two of my acquaintances, very probably they will know each other. In this setting, the clustering coefficient of a person is the probability that two random acquaintances of this person know each other. In order to quantify this phenomenon, we can formalize everything in terms of graphs: people are nodes and two people are connected if they are acquaintances. Hence, we define the clustering coefficient of a vertex \(v\) in a graph \(G=(V,E)\) as:

$$\frac{2|\{(x,y) \in E:x,y \in N_v\}|}{\deg(v)(\deg(v)-1)}$$ where \(N_v\) is the set of neighbors of \(v\) and \(\deg(v)\) is the number of neighbors of \(v\). This is exactly the probability that two random neighbors of \(v\) are linked with an edge.

My work has included in Sagemath the Boost algorithm to compute the clustering coefficient, which is more efficient that the previous algorithm, which was based on NetworkX:

sage: g = graphs.RandomGNM(20000,100000)

sage: %timeit g.clustering_coeff(implementation='boost')

10 loops, best of 3: 258 ms per loop

sage: %timeit g.clustering_coeff(implementation='networkx')

1 loops, best of 3: 3.99 s per loop

But Nathann did better: he implemented a clustering coefficient algorithm from scratch, using Cython, and he managed to outperform the Boost algorithm, at least when the graph is dense. Congratulations, Nathann! However, when the graph is sparse, Boost algorithm still seems to be faster.

#### Dominator tree

Let us consider a road network, that is, a graph where vertices are street intersections, and edges are streets. The question is: if I close an intersection, where am I still able to go, assuming I am at home?

The answer to this question can be summarized in a dominator tree. Assume that, in order to go from my home to my workplace, I can choose many different paths, but all these paths pass through the café, then they pass through the square (that is, if either the café or the square is closed, then there is no way I can go to work). In this case, in the dominator tree, the father of my workplace is the square, the father of the square is the café, and the father of the café is my home, that is also the root of the tree. More formally, given a graph \(G\), the dominator tree of \(G\) rooted at a vertex \(v\) is defined by connecting each vertex \(x\) with the last vertex \(y \neq x\) that belongs to each path from \(v\) to \(x\) (note that this vertex always exists, because \(v\) belongs to each path from \(v\) to \(x\)).

Until now, Sagemath did not have a routine to compute the dominator tree: I have been able to include the Boost algorithm. Unfortunately, due to several suggestions and improvements in the code, the ticket is not closed, yet. Hopefully, it will be closed very soon!

#### Cuthill-McKee ordering / King ordering

Let us consider a graph \(G=(V,E)\): a matrix \(M\) of size \(|V|\) can be associated to this graph, where \(M_{i,j}=1\) if and only if there is an edge between vertices \(i\) and \(j\).

In some cases, this matrix can have specific properties, that can be exploited for many purposes, like speeding-up algorithms. One of this properties is bandwidth, which measures how far the matrix is from a diagonal matrix: it is defined as \(\max_{M_{i,j} \neq 0}|i-j|\). A small bandwidth might help in computing several properties of the graph, like eigenvalues and eigenvectors.

Since the bandwidth depends on the order of vertices, we can try to permute them in order to obtain a smaller value: in Sage, we have a routine that performs this task. However, this routine is very slow, and it is prohibitive even for very small graphs (in any case, finding an optimal ordering is NP-hard).

Hence, researchers have developed heuristics to compute good orderings: the most important ones are Cuthill-McKee ordering and King ordering. Boost contains both routines, but Sage does not: for this reason, I would like to insert these two functions. The code is almost ready, but part of it depends on the code of the dominator tree: as soon as the dominator tree is reviewed, I will open a ticket on these two routines!

#### Dijkstra/Bellman-Ford/Johnson shortest paths

Let us consider again a road network. In this case, we are building a GPS software, which has to compute the shortest path between the place where we are and the destination. The textbook algorithm that performed this task is Dijkstra algorithm, which computes the distance between the starting point and any other reachable point (of course, there are more efficient algorithms involving a preprocessing, but Dijkstra is the most simple, and its running-time is asymptotically optimal). This algorithm is already implemented in Sagemath.

Let's spice things up: what if that there are some streets with negative length? For instance, we like a street so much that we are willing to drive 100km more just to pass from that street, which is 50km long. It is like that street is -50km long!

First of all, under these assumptions, a shortest path might not exist: if there is a cycle with negative length, we may drive along that cycle all the times we want, decreasing more and more the distance to the destination. At least, we have to assume that no negative cycle exists.

Even with this assumption, Dijkstra algorithm does not work, and we have to perform Bellman-Ford algorithm, which is less efficient, but more general. Now, assume that we want something more: we are trying to compute the distance between all possible pairs of vertices. The first possibility is to run Bellman-Ford algorithm \(n\) times, where \(n\) is the number of nodes in the graph. But there is a better alternative: it is possible to perform Bellman-Ford algorithm only once, and then to modify the lengths of edges, so that all lengths are positive, and shortest paths are not changed. This way, we run Dijkstra algorithm \(n\) times on this modified graph, obtaining a better running time. This is Johnson algorithm.

Both Bellman-Ford and Johnson algorithms are implemented in Boost and not in Sagemath. As soon as I manage to create weighted Boost graphs (that is, graphs where edges have a length), I will include also these two algorithm!

Subscribe to:
Posts (Atom)