Difference between revisions of "Kruskal's algorithm"

From PEGWiki
m (categorize)
Line 2: Line 2:
  
 
=Theory of the algorithm=
 
=Theory of the algorithm=
Kruskal's may be characterized as a [[greedy algorithm]], which builds the MST one edge at a time. As befits an MST algorithm, the greedy strategy is to continually add the remaining edge of lowest weight. Unlike Prim's, however, Kruskal's adds edges without regard to the connectivity of the partially built MST; that is, it does not necessarily add an edge emanating from a vertex that is in the partially built MST. Indeed, it may be said that Kruskal's starts with a forest of <math>V</math> trees of one vertex each, and adds edges one by one, each one causing two trees in the forest to coalesce into one, until all vertices have been placed in the same connected component and the MST is complete. In doing so, one must be careful not to add an edge between two vertices that are ''already'' in the same component, for doing so would create a cycle. We shall assume that a spanning tree exists for the following sections. (If you find them too difficult, skip them.)
+
Kruskal's may be characterized as a [[greedy algorithm]], which builds the MST one edge at a time. As befits an MST algorithm, the greedy strategy is to continually add the remaining edge of lowest weight. Unlike Prim's, however, Kruskal's adds edges without regard to the connectivity of the partially built MST. We shall assume that a spanning tree exists for the following sections. (If you find them too difficult, skip them.)
  
 
==Lemma==
 
==Lemma==
<p>Suppose that a subset <math>S</math> of the edges of a graph <math>G</math> is known to be a subset of the edges of some spanning tree of <math>G</math>. Consider the set of edges <math>T</math> containing exactly those edges <math>\in E(G) \setminus S</math> which, when added to <math>S</math>, do not induce a cycle. If a minimal-weight edge <math>\in T</math> is added to <math>S</math>, the resulting set is also guaranteed to be a subset of the edges of some spanning tree of <math>G</math>.
+
<p>Suppose that a spanning tree <math>T</math> is given of some graph <math>G</math>. Then, the addition of any edge <math>\notin E(T)</math> to <math>E(T)</math>, followed by the removal of any edge from the resulting cycle, yields a spanning tree of <math>G</math>.</p>
  
''Proof'': Consider the graph <math>G'</math> with vertex set <math>V(G)</math> and edge set <math>S</math>. Now, any edge in <math>E(G) \setminus S</math> connects either two vertices in different connected components in <math>G'</math> or two vertices in the same component. If it connects two vertices in the same component, adding it generates a cycle (since there is already a path from one endpoint to the other that does not use that edge, and adding the edge generates a return path) and it is not in <math>T</math>. Otherwise, it does not generate a cycle because it is a bridge in the resulting single connected component, and it is in <math>T</math>. That is, <math>T</math> consists of exactly those edges linking different connected components in <math>G'</math>. It is clear that <!--
+
<p>''Proof'': Before the operation, the number of vertices is one more than the number of edges. After the operation, this is again true. As the addition of the new edge generates exactly one simple cycle, there are no longer any cycles after an edge on this cycle is removed. So the new <math>T</math> has a vertex count which exceeds its edge count by one and contains no cycles; it must therefore be a tree.</p>
  
==Lemma 2==
+
==The algorithm==
<p>Given some tree <math>T</math> which is a subgraph of graph <math>G</math> that is known to be a subtree of some MST of <math>G</math>, we shall call an edge a ''crossing'' edge if it joins a vertex <math>\in V(T)</math> and a vertex <math>\notin V(T)</math>. Then, any minimal crossing edge <math>e</math> may be added to <math>T</math> to give a new tree which is also a subtree of some MST of <math>G</math>.</p>
+
<p>We are now ready to present the algorithm. We begin with no knowledge of the edges of the MST, and add them one by one until the MST is complete. To do so we consider edges in increasing order of weight. When considering an edge, if adding it would create a cycle, we skip it; otherwise we add it. Once <math>V-1</math> edges have been added, we have constructed an MST. Kruskal's is both <i>correct</i> (Theorem 2) and <i>complete</i> (Theorem 1).</p>
  
<p>''Proof'': Given some MST of <math>G</math> containing <math>T</math> as a subtree, we may add our minimal crossing <math>e</math> to the MST. Evidently, the resulting cycle must contain another crossing edge. If this other crossing edge is of higher weight than <math>e</math>, we can delete it to yield a new spanning tree, ''per'' Lemma 1, whose weight is lower than the weight of the original spanning tree, contradicting our assumption that the spanning tree with which we started was minimal. Otherwise, the other crossing edge is of the same weight as <math>e</math> (it cannot be of lower weight, since <math>e</math> is minimal) and the addition of <math>e</math> and deletion of this edge yield, ''per'' Lemma 2, another spanning tree, this time of the same weight as the original, which is therefore another MST of <math>G</math>. That is, given any MST of <math>G</math> that does not contain <math>e</math>, we are able to generate another that does.</p>
+
===Remark===
 +
If adding an edge to the partially built MST generates a cycle, it will also do so if the edge is added to the partially built MST later on in the algorithm, since we only add edges and never remove them. The contrapositive is also true: if adding an edge does not generate a cycle, it wouldn't have generated a cycle if it were added earlier, either.
  
==The algorithm==
+
===Theorem 1===
<p>The algorithm then operates as follows: we start with a trivial tree containing only one vertex (and no edges). It does not matter which vertex is chosen. Then, we choose a minimal-weight edge emanating from that vertex, adding it to our tree. We repeatedly choose a minimal-weight edge that joins any vertex in the tree to one not in the tree, adding the new edge and vertex to our tree. When there are no more vertices to add, the tree we have built is an MST.</p>
+
<p>Kruskal's algorithm will never fail to find a spanning tree in a connected graph.</p>
 +
<p>''Proof'': By contradiction. Assume that all edges have been considered and the partially built tree is still not complete. Then, there exists some edge that connects two vertices in different connected components of the partially built tree. But this edge must have been considered at some point and discarded as adding it would have created a cycle. This is a contradiction, as adding it ''now'' would not create a cycle, ''per'' the Remark above.</p>
  
<p>''Proof'': By induction.
+
===Theorem 2===
* Base case: We begin with a tree which, as it contains only one vertex and no edges, is certainly a subtree of some MST of the graph. (Of course, it is a subtree of ''all'' MSTs of the graph, but that fact is not important here.)
+
<p>Kruskal's algorithm will never produce a non-minimal spanning tree.</p>
* Inductive step: By Lemma 2, if the current tree is the subtree of some MST of the graph, then the tree resulting from the addition of any minimal crossing edge to our tree yields a new tree which is also the subtree of some MST of the graph.
+
<p>''Proof'':<br/>
In this way we proceed through a sequence of trees, each of which contains one more vertex than its predecessor, until we obtain a spanning tree. Since this spanning tree is the subtree of some MST of the graph, it must be an MST itself.</p>
+
We first proceed by induction on the number of edges so far added.
 +
* When no edges have yet been added, the set of edges so far added, being empty, is obviously a subset of the edges of some MST.
 +
* Suppose that <math>k</math> edges have been added in this manner (<math> 0 \le k < V-1 </math>), and we know that they form a subset <math>S</math> of the edges of some MST. All further edges to be considered are of greater or equal weight. If adding an edge generates a cycle at this point, it can be discarded, as it can never be added, ''per'' the Remark above. According to our algorithm, then, the next edge to be added is the edge <math>e</math> of minimal weight which does not generate a cycle. Consider now any MST <math>T</math> containing the edges so far added as a subset of its edge set. If it contains <math>e</math>, we are done; we know that when <math>e</math> is added our edge set is still a subset of the edge set of some MST. Otherwise, add the edge <math>e</math> to <math>E(T)</math>. A cycle is now generated. This cycle must contain some edge not in <math>S</math>, because we know that <math>e</math>, in conjunction with the edges previously added, does not form a cycle. That edge cannot already have been considered: it lies in <math>T</math> together with the edges added before it, so it would not have generated a cycle and would have been added. It is therefore an edge not yet considered, and has a weight greater than or equal to that of <math>e</math>. Then, deleting this edge yields a new spanning tree (''per'' the Lemma above). If the edge deleted was of greater weight than <math>e</math>, the resulting spanning tree has lower weight than <math>T</math>, a contradiction as <math>T</math> was assumed to be minimal. Otherwise, it must have equal weight to <math>e</math>, and deleting it and adding <math>e</math> gives a new spanning tree of equal weight, that is, another MST. So we have proven that given that there is any MST at all containing <math>S</math>, we can add <math>e</math> and find some MST containing this new set as a subset of its own edge set.<br/></p>
 +
Thus, if a spanning tree is built, then its edges form a subset of the edge set of some MST of the graph, and therefore the spanning tree must be that MST itself.
  
 +
<!--
 
=Implementation=
 
=Implementation=
 
As the previous sections are a bit heavy, here is some pseudocode for Prim's algorithm:
 
As the previous sections are a bit heavy, here is some pseudocode for Prim's algorithm:
Line 35: Line 41:
 
-->
 
-->
  
[[Category:Algorithms]]
+
=Implementation=
 +
We have so far glossed over a crucial detail: we must have a means of efficiently deciding whether an edge can be added without generating a cycle, and of adding that edge if it can. To do so, we note that a cycle is created if and only if the two endpoints of the edge are in the same connected component. We could, of course, answer this query through any [[graph search]] algorithm such as [[Depth-first search|DFS]] or [[Breadth-first search|BFS]]. However, that would make the algorithm quadratic time overall, which is undesirable. Instead, we notice that a data structure [[Disjoint sets data structure|already exists]] which can efficiently identify the component containing a vertex and add new edges (joining together components). Assuming that we know how to implement it, then, here is Kruskal's algorithm:
 +
<pre>
 +
input G
 +
let T = (V(G),∅)
 +
for each {u,v} ∈ E(G) in increasing order of wt(u,v)
 +
    if find(u) ≠ find(v)
 +
          unite(u,v)
 +
          add {u,v} to E(T)
 +
</pre>
 +
It would make sense to stop the loop after <math>V-1</math> edges have been added (instead of processing every edge, which is often unnecessary). Nevertheless, doing so does not affect the asymptotic complexity (see Analysis).
 +
 
 +
=Analysis=
 +
The usual implementation of Kruskal's sorts the edges by weight first, which takes <math>\mathcal{O}(E\lg E)</math> time. Following this, <math>\mathcal{O}(E)</math> union-find operations are performed. As each can be done in almost-constant time (see [[Disjoint set data structure|the article itself]]), this step requires approximately <math>\mathcal{O}(E)</math> time to complete. The cost of the sort then dominates the running time. As typical implementations of [[quicksort]], usually used here, are faster than those of [[heapsort]], which may be said to operate implicitly in [[Prim's algorithm]], a well-coded Kruskal's will outperform Prim's in sparse graphs. In dense graphs, Prim's non-heap implementation, taking <math>\mathcal{O}(E+V^2)</math> time, is likely to outperform Kruskal's, whose runtime is then close to <math>\mathcal{O}(V^2 \lg V)</math>.
 +
 
 +
== References ==
 +
* Joseph. B. Kruskal: ''[http://links.jstor.org/sici?sici=0002-9939(195602)7%3A1%3C48%3AOTSSSO%3E2.0.CO%3B2-M On the Shortest Spanning Subtree of a Graph and the Traveling Salesman Problem]''. In: ''Proceedings of the American Mathematical Society'', Vol 7, No. 1 (Feb, 1956), pp. 48–50

Revision as of 18:15, 23 December 2009

Kruskal's algorithm is a general-purpose algorithm for the minimum spanning tree problem, based on the disjoint sets data structure. The existence of very simple algorithms to maintain disjoint sets in almost constant time gives rise to simple implementations of Kruskal's algorithm whose running times are close to linear, usually outperforming Prim's algorithm in sparse graphs.

Theory of the algorithm

Kruskal's may be characterized as a greedy algorithm, which builds the MST one edge at a time. As befits an MST algorithm, the greedy strategy is to continually add the remaining edge of lowest weight. Unlike Prim's, however, Kruskal's adds edges without regard to the connectivity of the partially built MST. We shall assume that a spanning tree exists for the following sections. (If you find them too difficult, skip them.)

Lemma

Suppose that a spanning tree T is given of some graph G. Then, the addition of any edge of G not in E(T) to E(T), followed by the removal of any edge from the resulting cycle, yields a spanning tree of G.

Proof: Before the operation, the number of vertices is one more than the number of edges. After the operation, this is again true. As the addition of the new edge generates exactly one simple cycle, there are no longer any cycles after an edge on this cycle is removed. So the new T has a vertex count which exceeds its edge count by one and contains no cycles; it must therefore be a tree.
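The Lemma can be checked on a small case. The following is a minimal sketch, not part of the original article: the helper names is_spanning_tree and tree_path, the vertex numbering 0..V-1, and the example tree are all assumptions made for illustration. It adds one non-tree edge to a spanning tree and confirms that removing any single edge of the resulting cycle leaves a spanning tree.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Adjacency lists for an undirected graph given as (u, v) pairs."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj


def is_spanning_tree(num_vertices, edges):
    """True if `edges` form a spanning tree on vertices 0..num_vertices-1."""
    if len(edges) != num_vertices - 1:
        return False
    adj = build_adjacency(edges)
    # With exactly V-1 edges, being connected is equivalent to being a tree.
    seen = {0}
    stack = [0]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == num_vertices


def tree_path(edges, src, dst):
    """Edges on the unique path from src to dst in a tree."""
    adj = build_adjacency(edges)
    parent = {src: None}
    stack = [src]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w not in parent:
                parent[w] = u
                stack.append(w)
    path = []
    node = dst
    while parent[node] is not None:
        path.append((parent[node], node))
        node = parent[node]
    return path


# Hypothetical example: the path 0-1-2-3 is a spanning tree of a 4-vertex graph.
tree = [(0, 1), (1, 2), (2, 3)]
new_edge = (0, 3)

# Adding new_edge closes exactly one cycle: the tree path between its
# endpoints plus new_edge itself.
cycle = tree_path(tree, *new_edge) + [new_edge]

# Removing any one edge of that cycle yields a spanning tree again.
for removed in cycle:
    swapped = [e for e in tree + [new_edge] if set(e) != set(removed)]
    assert is_spanning_tree(4, swapped)
```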

The algorithm

We are now ready to present the algorithm. We begin with no knowledge of the edges of the MST, and add them one by one until the MST is complete. To do so we consider edges in increasing order of weight. When considering an edge, if adding it would create a cycle, we skip it; otherwise we add it. Once V-1 edges have been added, we have constructed an MST. Kruskal's is both correct (Theorem 2) and complete (Theorem 1).
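The description above translates almost word for word into code. The following is a deliberately naive sketch, assuming vertices numbered 0..V-1 and edges given as (u, v, weight) triples; the names creates_cycle and kruskal_naive and the example graph are hypothetical. The cycle test here is a plain depth-first search over the edges chosen so far, which is simple but slow.

```python
def creates_cycle(chosen, u, v):
    """True if u and v are already connected by the chosen edges,
    i.e. adding the edge {u, v} would close a cycle."""
    adj = {}
    for a, b, _ in chosen:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    seen = {u}
    stack = [u]
    while stack:
        x = stack.pop()
        if x == v:
            return True
        for y in adj.get(x, []):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return False


def kruskal_naive(num_vertices, edges):
    """Consider edges in increasing order of weight, skipping any edge
    that would create a cycle; stop once V-1 edges have been added."""
    chosen = []
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        if not creates_cycle(chosen, u, v):
            chosen.append((u, v, w))
            if len(chosen) == num_vertices - 1:
                break
    return chosen


# Hypothetical 4-vertex example graph.
example = [(0, 1, 1), (1, 2, 2), (0, 2, 3), (2, 3, 4), (1, 3, 5)]
mst = kruskal_naive(4, example)
print(mst)  # [(0, 1, 1), (1, 2, 2), (2, 3, 4)]
```

The edge (0, 2, 3) is skipped because vertices 0 and 2 are already joined through vertex 1, and the loop stops before (1, 3, 5) is examined.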

Remark

If adding an edge to the partially built MST generates a cycle, it will also do so if the edge is added to the partially built MST later on in the algorithm, since we only add edges and never remove them. Equivalently, by the contrapositive: if adding an edge does not generate a cycle now, it would not have generated a cycle had it been added earlier, either.

Theorem 1

Kruskal's algorithm will never fail to find a spanning tree in a connected graph.

Proof: By contradiction. Assume that all edges have been considered and the partially built tree is still not complete. Then, since the graph is connected, there exists some edge that connects two vertices in different connected components of the partially built tree. But this edge must have been considered at some point and discarded because adding it would have created a cycle. This is a contradiction, as adding it now would not create a cycle, per the Remark above.

Theorem 2

Kruskal's algorithm will never produce a non-minimal spanning tree.

Proof:
We first proceed by induction on the number of edges so far added.

  • When no edges have yet been added, the set of edges so far added, being empty, is obviously a subset of the edges of some MST.
  • Suppose that k edges have been added in this manner (0 ≤ k < V-1), and we know that they form a subset S of the edges of some MST. All further edges to be considered are of greater or equal weight. If adding an edge generates a cycle at this point, it can be discarded, as it can never be added, per the Remark above. According to our algorithm, then, the next edge to be added is the edge e of minimal weight which does not generate a cycle. Consider now any MST T containing the edges so far added as a subset of its edge set. If it contains e, we are done; we know that when e is added our edge set is still a subset of the edge set of some MST. Otherwise, add the edge e to E(T). A cycle is now generated. This cycle must contain some edge not in S, because we know that e, in conjunction with the edges previously added, does not form a cycle. That edge cannot already have been considered: it lies in T together with the edges added before it, so it would not have generated a cycle and would have been added. It is therefore an edge not yet considered, and has a weight greater than or equal to that of e. Then, deleting this edge yields a new spanning tree (per the Lemma above). If the edge deleted was of greater weight than e, the resulting spanning tree has lower weight than T, a contradiction as T was assumed to be minimal. Otherwise, it must have equal weight to e, and deleting it and adding e gives a new spanning tree of equal weight, that is, another MST. So we have proven that given that there is any MST at all containing S, we can add e and find some MST containing this new set as a subset of its own edge set.

Thus, if a spanning tree is built, then its edges form a subset of the edge set of some MST of the graph, and therefore the spanning tree must be that MST itself.
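Theorem 2 can be sanity-checked on a small graph by comparing the greedy result against exhaustive search. The sketch below is an illustration only and assumes a connected graph with vertices 0..V-1 and (u, v, weight) edges; the function names and the example graph are hypothetical. Components are tracked with a simple label per vertex, which is slow but easy to verify by hand.

```python
from itertools import combinations


def kruskal_weight(num_vertices, edges):
    """Total weight of the tree built by the greedy rule."""
    label = list(range(num_vertices))  # label[v] identifies v's component
    total = 0
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        if label[u] != label[v]:
            old, new = label[u], label[v]
            label = [new if c == old else c for c in label]
            total += w
    return total


def is_acyclic(num_vertices, subset):
    """True if the given edges contain no cycle."""
    label = list(range(num_vertices))
    for u, v, _ in subset:
        if label[u] == label[v]:
            return False
        old, new = label[u], label[v]
        label = [new if c == old else c for c in label]
    return True


def brute_force_weight(num_vertices, edges):
    """Minimum weight over all spanning trees: every acyclic set of
    exactly V-1 edges is a spanning tree, so enumerate those."""
    return min(
        sum(w for _, _, w in subset)
        for subset in combinations(edges, num_vertices - 1)
        if is_acyclic(num_vertices, subset)
    )


# Hypothetical connected example graph on 5 vertices.
graph = [(0, 1, 4), (0, 2, 1), (1, 2, 2), (1, 3, 5),
         (2, 3, 8), (3, 4, 3), (2, 4, 9)]
assert kruskal_weight(5, graph) == brute_force_weight(5, graph)
```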

Implementation

We have so far glossed over a crucial detail: we must have a means of efficiently deciding whether an edge can be added without generating a cycle, and of adding that edge if it can. To do so, we note that a cycle is created if and only if the two endpoints of the edge are in the same connected component. We could, of course, answer this query through any graph search algorithm such as DFS or BFS. However, each such search may visit every vertex, so running one per edge would take O(EV) time, making the algorithm quadratic time overall, which is undesirable. Instead, we notice that a data structure already exists which can efficiently identify the component containing a vertex and add new edges (joining together components). Assuming that we know how to implement it, then, here is Kruskal's algorithm:

input G
let T = (V(G),∅)
for each {u,v} ∈ E(G) in increasing order of wt(u,v)
     if find(u) ≠ find(v)
          unite(u,v)
          add {u,v} to E(T)

It would make sense to stop the loop after V-1 edges have been added (instead of processing every edge, which is often unnecessary). Nevertheless, doing so does not affect the asymptotic complexity (see Analysis).
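The pseudocode can be made runnable as follows. This is one possible sketch rather than a reference implementation: it assumes vertices numbered 0..V-1 and edges given as (u, v, weight) triples, and the class DisjointSets (with path compression and union by size) is a stand-in written for this example. It also includes the early exit after V-1 edges.

```python
class DisjointSets:
    """Minimal union-find with path compression and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        # Locate the root of x's component.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every vertex on the path at the root.
        while self.parent[x] != root:
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def unite(self, a, b):
        """Merge the components of a and b; return False if already merged."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True


def kruskal(num_vertices, edges):
    """Return the edges of a minimum spanning tree (or of a minimum
    spanning forest if the graph is not connected)."""
    sets = DisjointSets(num_vertices)
    tree = []
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        # unite() succeeds exactly when find(u) != find(v).
        if sets.unite(u, v):
            tree.append((u, v, w))
            if len(tree) == num_vertices - 1:
                break  # the tree is complete; later edges are not needed
    return tree


# Hypothetical 7-vertex example graph.
edges = [(0, 1, 7), (0, 3, 5), (1, 2, 8), (1, 3, 9), (1, 4, 7), (2, 4, 5),
         (3, 4, 15), (3, 5, 6), (4, 5, 8), (4, 6, 9), (5, 6, 11)]
tree = kruskal(7, edges)
print(sum(w for _, _, w in tree))  # 39
```

Folding the find(u) ≠ find(v) test into unite saves a second pair of find calls per accepted edge; the behaviour is otherwise the same as in the pseudocode.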

Analysis

The usual implementation of Kruskal's sorts the edges by weight first, which takes O(E lg E) time. Following this, O(E) union-find operations are performed. As each can be done in almost-constant time (see the article on the disjoint sets data structure), this step requires approximately O(E) time to complete. The cost of the sort then dominates the running time. As typical implementations of quicksort, usually used here, are faster than those of heapsort, which may be said to operate implicitly in Prim's algorithm, a well-coded Kruskal's will outperform Prim's in sparse graphs. In dense graphs, Prim's non-heap implementation, taking O(E+V^2) time, is likely to outperform Kruskal's, whose runtime is then close to O(V^2 lg V).
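The dense-graph figure follows from the sorting bound by a short calculation, sketched here. The symbol α (the inverse Ackermann function) is notation not used in the article; it stands for the "almost-constant" cost per union-find operation and assumes a disjoint-sets implementation with path compression and union by rank or size. The graph is assumed to be simple, so that E is at most V^2.

```latex
\begin{align*}
T_{\text{Kruskal}}(V, E)
  &= \underbrace{O(E \lg E)}_{\text{sort}}
   + \underbrace{O(E\,\alpha(V))}_{\text{union-find}}
   = O(E \lg E), \\
E \le V^2
  &\implies \lg E \le 2 \lg V
   \implies T_{\text{Kruskal}}(V, E) = O(E \lg V), \\
E = \Theta(V^2)
  &\implies T_{\text{Kruskal}}(V, E) = O(V^2 \lg V).
\end{align*}
```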

References