Implementation details of the UnionFind data structure, and its utility for implementing Minimum Spanning Tree algorithms
My desire to efficiently solve problems of increasing complexity is, I believe, the origin of my interest in graph-based algorithms. So, on the day my university shut down under a heavy snowfall, it was no surprise that I found myself implementing and visualizing not one but two minimum spanning tree algorithms.
Robert Prim and Joseph Kruskal both developed Minimum Spanning Tree algorithms. This article implements Kruskal’s algorithm and makes use of the UnionFind data structure. If you want to see UnionFind in action, check out my visualization. View the code for this article here.
In this section, I provide some intuition for UnionFind. I discuss a suboptimal implementation that achieves the same results as the one we’ll develop further down but, for now, is easier to understand.
Below, I create a list with five sublists. For the purpose of this example, let’s say each sublist is a distinct group. There are two operations I’d like to perform: `find` and `union`.
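The original snippet isn’t preserved in this copy; based on the description, it would look something like this:

```python
# Five distinct groups, one letter per sublist.
groups = [["a"], ["b"], ["c"], ["d"], ["e"]]
print(groups)
```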


For simplicity, I’ll say that the name of a group is the index of that group in the list. So originally, each letter is in a different group because each sublist is at a different index in the `groups` list.
Pretending that the above functions `find` and `union` are implemented, let’s make a few function calls and examine the results.
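The calls and their output aren’t preserved here; as an illustration of what `find` might return, here is a throwaway linear version (the real implementations come next):

```python
groups = [["a"], ["b"], ["c"], ["d"], ["e"]]

def find(x):
    # A group's name is its index in the groups list.
    for i, group in enumerate(groups):
        if x in group:
            return i

print(find("a"))  # 0
print(find("e"))  # 4
```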


We see above that `find` always tells us the name of a given letter’s group. `find` allows us to determine whether two different letters are in the same group or not. We can create larger groups by calling `union`. `union` determines which groups its two input letters belong to and creates the union of those groups.
Below, I write a purposely inefficient implementation of the `find` and `union` methods, so the fast version we make later will seem more impressive :)
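The original code isn’t reproduced here; a sketch consistent with the description might look like the following. One assumption of mine: I empty the merged-away sublist rather than deleting it, so group names (list indices) stay stable.

```python
groups = [["a"], ["b"], ["c"], ["d"], ["e"]]

def find(x):
    # Linear scan: return the index of the sublist containing x.
    for i, group in enumerate(groups):
        if x in group:
            return i

def union(x, y):
    gx, gy = find(x), find(y)
    if gx == gy:
        return
    # Build the union group, then empty the old one so indices stay stable.
    groups[gx].extend(groups[gy])
    groups[gy] = []

union("a", "b")
print(find("b"))  # 0 -- "b" now lives in "a"'s group
```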


After writing the code above, we’ve now implemented the UnionFind data structure; yep, it’s that easy! But, as mentioned, the above implementation is inefficient. Let’s take a look at the runtime; the table below shows a slightly worse than linear increase as elements are added.
| n | time |
| --- | --- |
| 100 | 0.03 sec |
| 1000 | 0.3 sec |
| 10000 | 2.3 sec |
| 100000 | 46 sec |
| 1000000 | 696 sec (~11 min) |
The test we perform each iteration is the following:
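The original test isn’t preserved; a plausible reconstruction (the name `run_test` is my own) times `n` random unions over the list-based implementation:

```python
import random
import time

def run_test(n):
    # Start with n singleton groups, then time n random unions.
    groups = [[i] for i in range(n)]

    def find(x):
        for i, group in enumerate(groups):
            if x in group:
                return i

    def union(x, y):
        gx, gy = find(x), find(y)
        if gx == gy:
            return
        groups[gx].extend(groups[gy])
        groups[gy] = []

    start = time.time()
    for _ in range(n):
        union(random.randrange(n), random.randrange(n))
    return time.time() - start

print(f"n=100: {run_test(100):.3f} sec")
```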


The test reveals that each operation has a roughly linear runtime. This is what we should expect: every time we call `find`, we have to look at each element in order until we find the one we want. And calling `union` is just two calls to `find`, the construction of the union group, and the removal of the two old groups: all linear-time operations!
If we want to make this faster, we can start by thinking about the two operations we’re doing: `find` and `union`. We could get `find` to have a runtime of O(1) or O(log n) (depending on implementation) by using a dictionary, but, if we are using a dictionary, `union` might still be O(n), because each time we merge two groups we’d have to update every element individually to point to its new group.
Interestingly, we can make `union` run in O(1) on average (I’ll discuss why we say “on average” in the conclusion below). We can use a tree-like structure to represent what group an element is currently in, which avoids the need to update each element individually when building the union of two groups. The code below shows an example of how we use a dictionary to store our elements.
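A minimal sketch of that starting dictionary (reconstructed, since the original block isn’t shown):

```python
# Each element maps to the name of its group; initially, itself.
groups = {letter: letter for letter in "abcde"}
print(groups)  # {'a': 'a', 'b': 'b', 'c': 'c', 'd': 'd', 'e': 'e'}
```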


The keys in the dictionary are the elements we want to keep track of, and the values are the group each element is currently in. At the start, every element is in its own group. Even though we haven’t implemented `find` and `union` for this new structure yet, let’s pretend we have and call some functions.
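The original calls and output aren’t preserved. To set up the chain described next, imagine the dictionary has been mutated into a state like this (a hypothetical state, chosen to match the a → c → e example below):

```python
# Hypothetical state after a few unions: "a" points to "c", "c" points to "e".
groups = {"a": "c", "b": "b", "c": "e", "d": "d", "e": "e"}
print(groups["a"])  # 'c' -- but find("a") should ultimately report group 'e'
```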


Above, we see that `find` appears to be returning the value for a given key, but, actually, `find` does a bit more than that. `find` looks at the value for a given key, then treats that value as a key and makes another call to `find`. This is confusing, but we saw an example of it above. When `find` is asked to locate `a`, it sees that `a` belongs to group `c`, but `c` belongs to group `e`! We could seemingly follow this chain forever, but we stop when we reach an element that belongs to a group whose name matches the element itself. We find that `e` is in group `e`, and so we return that `a` is in group `e`.
The code below implements `find`. We observe that `find` recursively climbs the tree until it reaches a root element. One could say that every element belonging to the same group is a child of the same root element!
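The original listing isn’t shown; a faithful sketch of the recursive `find` is:

```python
groups = {"a": "c", "b": "b", "c": "e", "d": "d", "e": "e"}

def find(x):
    # A root element names itself; otherwise, keep climbing toward the root.
    if groups[x] == x:
        return x
    return find(groups[x])

print(find("a"))  # 'e'
```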


That’s a super short definition, but it actually works! However, it’s missing one optimization that will pay off in the long run. Let’s say I ask `find` what group `a` is in and it has to traverse a tree that’s one million nodes in height. Then I call `find` with `a` again; the algorithm would have to make the same traversal of one million nodes. To avoid repeated expensive calls, I set each element to point directly to the root element every time I complete a call to `find` (a trick known as path compression).
Here’s the optimized `find` function.
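Again reconstructed, since the original block isn’t preserved; the only change is rewriting each entry on the way back out of the recursion:

```python
groups = {"a": "c", "b": "b", "c": "e", "d": "d", "e": "e"}

def find(x):
    if groups[x] != x:
        # Path compression: after the recursion, x points straight at its root.
        groups[x] = find(groups[x])
    return groups[x]

print(find("a"))    # 'e'
print(groups["a"])  # 'e' -- the chain through 'c' is gone
```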


Oh, it’s actually not that much longer. Anyways, now even if it costs a lot to call `find("a")` once, it will only cost O(1) when we call it a second time! So, we can hope that the average cost of a call will be something closer to O(1).
Now that we have the `find` function, we can use it to implement an efficient `union` function. As discussed, all `union` has to do is find the groups that the two elements belong to and make one point to the other. Here’s what that looks like:
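A sketch of that one-liner, with the compressed `find` from above included so the snippet stands alone:

```python
groups = {letter: letter for letter in "abcde"}

def find(x):
    if groups[x] != x:
        groups[x] = find(groups[x])  # path compression
    return groups[x]

def union(x, y):
    # Point x's root at y's root -- a single dictionary write.
    groups[find(x)] = find(y)

union("a", "b")
union("b", "c")
print(find("a") == find("c"))  # True
```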


Awesome! It’s one line. It seems too good to be true, but let’s run the same test as before. If you recall, one million elements took 11 minutes with the linear implementation. The table below shows the times for this more efficient implementation.
| n | time |
| --- | --- |
| 100 | 1 ms |
| 1000 | 1 ms |
| 10000 | 9 ms |
| 100000 | 2 ms |
| 1000000 | 0 ms |
Ya, it actually said 0 ms :) I thought it was broken, but it wasn’t. It’s just that fast… for one million elements. So, now the question is: how many elements can it handle?
Let’s do an experiment. Here’s my test code:
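The original test code isn’t preserved; a plausible reconstruction (the name `run_test` is my own) scales `n` and times `n` random unions against the dictionary implementation. Exact timings will, of course, differ from the table above on different machines:

```python
import random
import time

def run_test(n):
    groups = {i: i for i in range(n)}

    def find(x):
        if groups[x] != x:
            groups[x] = find(groups[x])  # path compression
        return groups[x]

    def union(x, y):
        groups[find(x)] = find(y)

    start = time.time()
    for _ in range(n):
        union(random.randrange(n), random.randrange(n))
    return time.time() - start

for n in (10_000, 100_000, 1_000_000):
    print(f"n={n}: {run_test(n) * 1000:.0f} ms")
```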


I get this:


So, ya, O(1) on average is fast! It runs out of memory before it takes longer than 2 ms to complete. Here’s a final version of the code we just created, all wrapped up nicely in a class for you.
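The final class isn’t preserved in this copy; here’s a reconstruction consistent with everything above (auto-registering unseen elements inside `find` is my own assumption):

```python
class UnionFind:
    def __init__(self):
        self.groups = {}

    def find(self, x):
        # Unseen elements start in their own group (an assumption).
        self.groups.setdefault(x, x)
        if self.groups[x] != x:
            # Path compression.
            self.groups[x] = self.find(self.groups[x])
        return self.groups[x]

    def union(self, x, y):
        self.groups[self.find(x)] = self.find(y)

uf = UnionFind()
uf.union("a", "b")
uf.union("b", "c")
print(uf.find("a") == uf.find("c"))  # True
```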


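As a closing aside, the Kruskal connection mentioned throughout can be sketched in a few lines. The edge list here is hypothetical, and this compact UnionFind mirrors the one built above: sort the edges by weight and keep an edge only if it joins two different groups.

```python
class UnionFind:
    def __init__(self):
        self.groups = {}

    def find(self, x):
        self.groups.setdefault(x, x)
        if self.groups[x] != x:
            self.groups[x] = self.find(self.groups[x])  # path compression
        return self.groups[x]

    def union(self, x, y):
        self.groups[self.find(x)] = self.find(y)

def kruskal(edges):
    # edges: (weight, u, v) tuples. Keep an edge only if it joins two groups.
    uf = UnionFind()
    mst = []
    for w, u, v in sorted(edges):
        if uf.find(u) != uf.find(v):
            uf.union(u, v)
            mst.append((w, u, v))
    return mst

edges = [(1, "a", "b"), (4, "a", "c"), (2, "b", "c"), (7, "c", "d"), (3, "b", "d")]
print(kruskal(edges))  # [(1, 'a', 'b'), (2, 'b', 'c'), (3, 'b', 'd')]
```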
In this article, we found that UnionFind is an easy-to-implement, fast, and intuitive data structure. One use case for UnionFind is the implementation of Kruskal’s Minimum Spanning Tree algorithm. I used the term “O(1) on average,” and by this I meant that a single call might sometimes be slow, but every repeat of that call afterward will be O(1), so on average a series of calls will have a runtime of O(1). I hope you enjoyed this article. If you want to get in contact, my info’s in the footer below.