Understanding UnionFind

Implementation details of the UnionFind data-structure, and its utility for implementing Minimum Spanning Tree algorithms

My desire to efficiently solve problems of increasing complexity, I believe, is the origin of my interest in graph-based algorithms. Thus, on the day my university shut down due to a large amount of snow, it was no surprise I found myself implementing and visualizing not one, but two, minimum spanning tree algorithms.

Overview

Robert Prim and Joseph Kruskal both developed Minimum Spanning Tree algorithms. This article implements Kruskal’s algorithm, and makes use of the UnionFind data-structure. If you want to see UnionFind in action check out my visualization.

View the code for this article here.

Intuition

In this section, I provide some intuition for UnionFind. I discuss a sub-optimal implementation that achieves the same results as the one we’ll develop farther down, but, for now, is easier to understand.

Below, I create a list with five sub-list. For the purpose of this example, let’s say each sub-list is a distinct group. There are two operations I’d like to perform: find and union.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
let groups = [['a'],['b'],['c'],['d'],['e']]

function find(p) {
    // return the group element "p" belongs too
}

function union(p, k) {
    // replace the groups that contain "p" and "k" are
    // currently in with the union of both groups
}

For simplicity, I’ll say that the name of a group is the index of that group in the list. So originally, each letter is in a different group because each sub-list is at a different index in the groups list.

Pretending that the above functions find and union are implemented, let’s make a few function calls and examine the results.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
find('c')       // returns 2
union('b', 'c') // new groups = [['a'],['b', 'c'],['d'],['e']]
find('c')       // returns 1

union('d', 'e') // new groups = [['a'],['b', 'c'],['d', 'e']]
union('a', 'b') // new groups = [['a', 'b', 'c'],['d', 'e']]

find('a')       // returns 1
find('e')       // returns 2

We see above that find always tells us the name given letter’s group. find allows us to determine if two different letters are in the same group or not. We can create larger groups by calling union. union determines what group two input letter belong too and creates the union of those groups.

An Inefficient Implementation

Below, I write a purposely inefficient implementation of the find and union methods, so the fast version, we make later, will seem more impressive :)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
function find(p, groups) {
    for (let i in groups) { // for each group...

        let group = groups[i]

        for (let j in group) { // ...for each letter...

            let letter = group[j]

            // ...if the letter is found in this group
            // return the group's first letter
            if (letter == p) return i;
        }
    }
}

function union(p, k, groups) {
    let g1 = find(p, groups) // name of group one
    let g2 = find(k, groups) // name of group two

    // create the union of the two groups
    let union_group = groups[g1] + groups[g2]

    // remove old groups from the groups list and
    // add the new union group
    groups.splice(g1, 1);
    groups.splice(g2, 1);
    groups.push(union_group);
}

After writing the code above, we’ve now implemented the UnionFind data-structure, yep it’s that easy! But, as mentioned, the above implementation is inefficient. Let’s take a look at the runtime; the table below shown a slightly worse than linear increase as elements are added.

n time
100 0.03 sec
1000 0.3 sec
10000 2.3 sec
100000 46 sec
1000000 696 sec (~11 min)

Testing the Inefficient Implementation

The test we perform each iteration is the following:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
function makeGroups(n) {
    // makes n groups of size one each with a unique number
    let groups = []
    for (let i=0; i<n; i++) groups.push([i])
    return groups
}

function randomMember(n) {
    // return a random number less than n
    return Math.floor(Math.random() * n);
}

function test() {
    for (let n of [100, 1000, 10000, 100000, 1000000]) {
        let groups = makeGroups(n)
        let start = Date.now();
        for (let i=0; i<1000; i++) find(randomMember(n), groups);
        for (let i=0; i<100; i++) union(randomMember(n), randomMember(n), groups);
        for (let i=0; i<1000; i++) find(randomMember(n), groups);
        let time = Date.now() - start;
        console.log("n: " + n + " time: " + time);
    }
}

The test reveals that the implementation has a roughly linear runtime. This runtime is what we should expect. Every time we call find, we have to look at each element in linear order until we find the one we want. And, calling union is just two calls to find, the construction of the union group, and the removal of two elements: all linear time operations!

How could it be Faster?

If we want to make this faster, we can start by thinking about the two operations we’re doing find and union. We could get find to have a runtime of O(1) or O(logn) (depending on implementation) by using a dictionary, but, if we are using a dictionary, union might still be O(n) because each time we merge two groups we’d have to linearly update each element to point to its new group.

Interestingly, we can make union run in O(1) on average (I’ll discuss why we say on average in the conclusion below). We can use a tree-like structure to represent what group an element is currently in and this will avoid the need to update each element individually when building the union of two groups.

The code below shows an example of how we use the dictionary data-structure to store our elements.

1
2
3
4
5
6
7
8
let groups = {
    "a": "a",
    "b": "b",
    "c": "c",
    "d": "d",
    "e": "e"
}

The keys in the dictionary are the elements we want to keep track of and the values are the group that a given element is currently in. At the start, every element is in its own group. Even though we haven’t implemented find and union yet with this new structure, let’s pretend we have and call some functions.

1
2
3
4
5
6
7
8
find("a", groups)       // returns "a"
union("a", "c", groups) // new groups {"a": "c", "b": "b", "c": "c", "d": "d", "e": "e"}
union("d", "e", groups) // new groups {"a": "c", "b": "b", "c": "c", "d": "e", "e": "e"}
find("a", groups)       // returns "c"
union("a", "e", groups) // new groups {"a": "c", "b": "b", "c": "e", "d": "e", "e": "e"}
find("a", groups)       // returns "e"
find("d", groups)       // returns "e"

Above, we see that find appears to be returning the value for a given key, but, actually, find does a bit more than that. find actually looks at the value for a given key and then pretends the value is a key and makes another call to find. This is confusing, but we see an example of this in line 6 above. When find is asked to locate a it sees that a belongs to group c, but c belongs to group e! We seemingly could follow this chain forever, but we end when we find an element that belongs to a group with a name that matches itself. We find that e is in group e, and so we return that a is in group e.

I’m still Confused Show Me the Code

The code below implements find. We observe that find recursively climbs the tree until it reaches a root element. One could say that every element belonging to the same group is a child of the same root element!

1
2
3
4
5
function find(p, groups) {
    if (p == groups[p]) return p
    return find(groups[p], groups)
}

That’s a super short definition, but it actually works! However, it’s missing one optimization that will pay-off in the long run. Let’s say I ask find what group a is in and it has to traverse a tree that’s one million in height. Then I call find with a again. The algorithm would have to make the same traversal of one million nodes. So, to improve things, every time I find the group that an element belongs to I set that element to point directly to the root element

To avoid repetitive expensive calls, I set each element to point directly to the root element every time I complete a call to find

Here’s the optimized find function.

1
2
3
4
5
6
7
function find(p, groups) {
    let start = p
    while (p != groups[p]) p = groups[p];
    groups[start] = p
    return p
}

Oh, it’s actually not that much longer. Anyways, now even if it does cost a lot to call find("a") it will only cost O(1) when we call it a second time! So, we can hope that the average cost of a call will be something closer to O(1).

Efficient Implementation of Union

Now that we have the find function we can use it to implement an efficient union function. As discussed, all union has to do is find the groups that two elements belong to and make one element point to the other. Here’s what that looks like:

1
2
3
4
5
6
function union(p, k, groups) {
    // the group that p belongs too will point
    // at the group that k belongs too
    groups[find(p, groups)] = find(k, groups)
}

Awesome! It’s one line. Seems too good to be true, but let’s run the same test as before. If you recall, one million elements took 11min to run with the linear implementation. The table below shows the times for this more efficient implementation.

n time
100 1 ms
1000 1 ms
10000 9 ms
100000 2 ms
1000000 0 ms

Ya, it actually said 0 ms :) I though it was broken, but it wasn’t. It’s just that fast… for one million elements. So, now the question is how many elements can it handle? Let’s do an experiment. Here’s my test code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
function test2() {
    let n = 1000000
    while (true) {
        let groups = makeGroups(n)
        let start = Date.now();
        for (let i=0; i<1000; i++) find(randomMember(n), groups);
        for (let i=0; i<100; i++) union(randomMember(n), randomMember(n), groups);
        for (let i=0; i<1000; i++) find(randomMember(n), groups);
        let time = Date.now() - start;
        console.log("n: " + n + " time: " + time);
        n = n * 10;
    }
}

I get this:

1
2
3
4
n: 1000000 time: 1
n: 10000000 time: 1
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory

So, ya O(1) on average is fast! It runs out of memory before it takes longer than 2ms to complete. Here’s a final version of the code we just created all wrapped-up nicely in a class for you.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
class UnionFind {
    constructor(elements) { // elements is a flat list of numbers or strings
        this.groups = {}
        for (let item of elements) this.groups[item] = item
    }
    find(p) {
        let start = p
        while (p != this.groups[p]) p = this.groups[p];
        this.groups[start] = p
        return p
    }
    union(p, k) {
        this.groups[this.find(p)] = this.find(k)
    }
}

Conclusion

In this article, we found that UnionFind is an easy to implement, fast, and intuitive data-structure. A use cases for UnionFind is the implementation of Kruskal’s Minimum Spanning Tree Algorithm. I used the term O(1) on average and by this I meant that sometimes a single call might be slow, but after the first repetitive call the second call will be O(1), and so on average a series of calls will have a runtime of O(1). I hope you enjoyed this article. If you want to get in contact, my info’s in the footer below.