Implementation details of the UnionFind data-structure, and its utility for implementing Minimum Spanning Tree algorithms
My desire to efficiently solve problems of increasing complexity is, I believe, the origin of my interest in graph-based algorithms. So, on the day my university shut down due to heavy snow, it was no surprise that I found myself implementing and visualizing not one, but two, minimum spanning tree algorithms.
Robert Prim and Joseph Kruskal both developed Minimum Spanning Tree algorithms. This article implements Kruskal’s algorithm and makes use of the UnionFind data-structure. If you want to see UnionFind in action, check out my visualization.
View the code for this article here.
In this section, I provide some intuition for UnionFind. I discuss a sub-optimal implementation that achieves the same results as the one we’ll develop further down but, for now, is easier to understand.
Below, I create a list with five sub-lists. For the purpose of this example, let’s say each sub-list is a distinct group. There are two operations I’d like to perform: find, which tells us the name of the group a given letter is in, and union, which merges two groups into one. For simplicity, I’ll say that the name of a group is the index of that group in the list. So, originally, each letter is in a different group because each sub-list is at a different index in the list.
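The original listing doesn’t survive in this copy of the article, so here’s a minimal sketch of that starting structure:

```python
# Five sub-lists; for this example, each sub-list is a distinct
# group, and a group's name is its index in the outer list.
groups = [["a"], ["b"], ["c"], ["d"], ["e"]]
```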
Pretending that the above functions find and union are implemented, let’s make a few function calls and examine the results.
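Those calls aren’t shown in this copy of the article, so the sketch below adds a minimal stand-in find and union over the list so the calls actually run (the real implementations come in the next section):

```python
groups = [["a"], ["b"], ["c"], ["d"], ["e"]]

def find(letter):
    # A group's name is its index in the outer list.
    for name, group in enumerate(groups):
        if letter in group:
            return name

def union(x, y):
    # Merge y's group into x's group, leaving an empty slot
    # behind so the other groups keep their names.
    gx, gy = find(x), find(y)
    if gx != gy:
        groups[gx].extend(groups[gy])
        groups[gy] = []

print(find("a"))               # 0
print(find("e"))               # 4
union("a", "e")
print(find("e"))               # 0 -- "e" joined "a"'s group
print(find("b") == find("c"))  # False -- still separate groups
```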
We see above that find always tells us the name of a given letter’s group. In other words, find allows us to determine whether or not two different letters are in the same group. We can create larger groups by calling union: union determines which groups the two input letters belong to and creates the union of those groups.
Below, I write a purposely inefficient implementation of the find and union methods, so the fast version we make later will seem more impressive :)
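Here’s one way that deliberately slow version might look (a reconstruction, since the original listing isn’t shown). Both operations scan the whole structure:

```python
groups = [[letter] for letter in "abcde"]

def find(letter):
    # Linear scan: inspect every element until we hit our letter.
    for name, group in enumerate(groups):
        if letter in group:
            return name

def union(x, y):
    # Two linear finds, a linear merge, and the clearing of the
    # absorbed group -- all linear-time work.
    gx, gy = find(x), find(y)
    if gx == gy:
        return
    groups[gx] = groups[gx] + groups[gy]  # build the union group
    groups[gy] = []                       # remove the old elements
```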
After writing the code above, we’ve now implemented the UnionFind data-structure; yep, it’s that easy! But, as mentioned, the above implementation is inefficient. Let’s take a look at the runtime; the table below shows a slightly worse than linear increase as elements are added.
| Elements | Time |
| --- | --- |
| 1,000,000 | 696 sec (~11 min) |
The test we perform each iteration is the following:
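The original test listing is missing from this copy; a plausible reconstruction times n random union calls against the list-based version:

```python
import random
import time

def timed_test(n):
    # Build n singleton groups, union n random pairs, and report
    # the elapsed wall-clock time in seconds.
    groups = [[i] for i in range(n)]

    def find(x):
        # Linear scan over every group.
        for name, group in enumerate(groups):
            if x in group:
                return name

    def union(x, y):
        gx, gy = find(x), find(y)
        if gx != gy:
            groups[gx].extend(groups[gy])
            groups[gy] = []

    start = time.perf_counter()
    for _ in range(n):
        union(random.randrange(n), random.randrange(n))
    return time.perf_counter() - start
```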
The test reveals that the implementation has a roughly linear runtime. This runtime is what we should expect. Every time we call find, we have to look at each element, in linear order, until we find the one we want. And calling union is just two calls to find, the construction of the union group, and the removal of the two old groups: all linear-time operations!
If we want to make this faster, we can start by thinking about the two operations we’re doing: find and union. We could get find to have a runtime of O(1) or O(log n) (depending on implementation) by using a dictionary but, if we are using a dictionary, union might still be O(n), because each time we merge two groups we’d have to linearly update each element to point to its new group.
Interestingly, we can make union run in O(1) on average (I’ll discuss why we say “on average” in the conclusion below). We can use a tree-like structure to represent what group an element is currently in, and this will avoid the need to update each element individually when building the union of two groups.
The code below shows an example of how we use the dictionary data-structure to store our elements.
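A sketch of that starting dictionary (the original listing isn’t reproduced here):

```python
# Key: an element we're tracking. Value: the name of the group it
# belongs to. Initially, every element names its own group.
groups = {"a": "a", "b": "b", "c": "c", "d": "d", "e": "e"}
```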
The keys in the dictionary are the elements we want to keep track of, and the values are the groups those elements are currently in. At the start, every element is in its own group. Even though we haven’t implemented find and union with this new structure yet, let’s pretend we have and call some functions.
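The original calls aren’t shown here, so the sketch below fabricates a state in which a points to c and c points to e, plus a minimal chain-following find so the calls run:

```python
# Hypothetical state after some unions: "a" -> "c" -> "e".
groups = {"a": "c", "b": "b", "c": "e", "d": "d", "e": "e"}

def find(x):
    # Follow the chain of values until an element names itself.
    while groups[x] != x:
        x = groups[x]
    return x

print(find("a"))  # "e" -- reached by following a -> c -> e
print(find("b"))  # "b" -- already a root
```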
Above, we see that find appears to be returning the value for a given key but, actually, find does a bit more than that. find looks at the value for a given key, then pretends that value is a key and makes another call to find. This is confusing, but we can trace an example. When find is asked to locate a, it sees that a belongs to group c. It then looks up c and sees that c belongs to group e! We could seemingly follow this chain forever, but we stop when we find an element that belongs to a group whose name matches itself. We find that e is in group e, and so we return that a is in group e.
The code below implements find. We observe that find recursively climbs the tree until it reaches a root element. One could say that every element belonging to the same group is a child of the same root element!
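A recursive version consistent with that description (the article’s exact listing isn’t shown):

```python
groups = {"a": "c", "b": "b", "c": "e", "d": "d", "e": "e"}

def find(x):
    # An element that names itself is a root, and the root names
    # the group; otherwise, recurse on the element x points to.
    if groups[x] == x:
        return x
    return find(groups[x])

print(find("a"))  # "e"
```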
That’s a super short definition, but it actually works! However, it’s missing one optimization that will pay off in the long run. Let’s say I ask find what group a is in, and it has to traverse a tree that’s one million nodes in height. Then I call find on a again. The algorithm would have to make the same traversal of one million nodes. So, to improve things, every time I find the group that an element belongs to, I set that element to point directly to the root element.
To avoid repetitive expensive calls, I set each element to point directly to the root element every time I complete a call to find.
Here’s the optimized find:
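A sketch of find with that optimization (commonly called path compression) added:

```python
groups = {"a": "c", "b": "b", "c": "e", "d": "d", "e": "e"}

def find(x):
    # Climb to the root as before, then point x directly at the
    # root so the next call for x is O(1).
    if groups[x] == x:
        return x
    groups[x] = find(groups[x])
    return groups[x]

find("a")
print(groups["a"])  # "e" -- "a" now points straight at its root
```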
Oh, it’s actually not that much longer. Anyways, now even if the first call to find("a") costs a lot, it will only cost O(1) when we call it a second time! So, we can hope that the average cost of a call will be something closer to O(1).
Now that we have the find function, we can use it to implement an efficient union function. As discussed, all union has to do is find the groups that the two elements belong to and make one element point to the other. Here’s what that looks like:
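With the path-compressed find from above carried along so the snippet runs on its own, union itself really is one line:

```python
groups = {letter: letter for letter in "abcde"}

def find(x):
    if groups[x] != x:
        groups[x] = find(groups[x])  # path compression
    return groups[x]

def union(x, y):
    # One line: make x's root point at y's root.
    groups[find(x)] = find(y)

union("a", "b")
union("b", "c")
print(find("a") == find("c"))  # True -- all three share a root
```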
Awesome! It’s one line. Seems too good to be true, but let’s run the same test as before. If you recall, one million elements took ~11 minutes to run with the linear implementation. The table below shows the times for this more efficient implementation.
Ya, it actually said 0 ms :) I thought it was broken, but it wasn’t. It’s just that fast… for one million elements. So, now the question is: how many elements can it handle? Let’s do an experiment. Here’s my test code:
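The original test code is missing from this copy; one plausible shape builds ever-larger structures and times a full pass of unions at each size:

```python
import time

def timed_run(n):
    # Build n singleton elements, chain them together with union,
    # and return the total elapsed time in milliseconds.
    groups = {i: i for i in range(n)}

    def find(x):
        if groups[x] != x:
            groups[x] = find(groups[x])  # path compression
        return groups[x]

    def union(x, y):
        groups[find(x)] = find(y)

    start = time.perf_counter()
    for i in range(n - 1):
        union(i, i + 1)
    return (time.perf_counter() - start) * 1000

# Keep doubling the size until the machine complains.
for n in (100_000, 200_000, 400_000):
    print(n, "elements:", round(timed_run(n), 1), "ms")
```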
I get this:
So, ya, O(1) on average is fast! It runs out of memory before it takes longer than 2 ms to complete. Here’s a final version of the code we just created, all wrapped up nicely in a class for you.
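Since the final listing isn’t included in this copy, here’s a sketch of such a class assembled from the pieces above (the add method name is my own choice, not necessarily the article’s):

```python
class UnionFind:
    """Dictionary-backed UnionFind with path compression."""

    def __init__(self):
        self.groups = {}

    def add(self, x):
        # A new element starts in its own group.
        self.groups.setdefault(x, x)

    def find(self, x):
        # Climb to the root, pointing x straight at it on the way
        # back so repeat lookups are O(1).
        if self.groups[x] != x:
            self.groups[x] = self.find(self.groups[x])
        return self.groups[x]

    def union(self, x, y):
        # Merge the two groups by linking their roots.
        self.groups[self.find(x)] = self.find(y)


uf = UnionFind()
for letter in "abcde":
    uf.add(letter)
uf.union("a", "b")
uf.union("c", "d")
print(uf.find("a") == uf.find("b"))  # True
print(uf.find("a") == uf.find("c"))  # False
```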
In this article, we found that UnionFind is an easy-to-implement, fast, and intuitive data-structure. One use case for UnionFind is the implementation of Kruskal’s Minimum Spanning Tree algorithm. I used the term O(1) on average; by this I meant that sometimes a single call might be slow but, after an expensive first call, the second call will be O(1), and so, on average, a series of calls will have a runtime of O(1). I hope you enjoyed this article. If you want to get in contact, my info’s in the footer below.