Hey, that's me!

Phil Mayer

Full-stack software engineer currently learning the Rust programming language. All opinions are my own.

Sherlock and Anagrams

Published Updated

Note: I recommend reading this article in PDF form since it includes typeset formulas and proofs.

HackerRank provides an Interview Preparation Kit containing a number of problems spanning programming topics like data structures, sorting, and searching. After taking some time away from HackerRank, I decided to pick up a question on dictionaries and hash maps. The problem is called "Sherlock and Anagrams" on the platform.

The Problem

Two strings are said to be anagrams if and only if their letters can be rearranged to form each other. The challenge as defined by HackerRank is to implement a function which accepts an input string and returns the number of a pairs of substrings which are anagrams of each other. For this challenge, the original string is not considered a substring of itself.

Approach

After some thinking about the problem, I decided that my strategy would be to:

  • Compute all substrings of the input ss, except for ss itself.
  • For each substring, sort its characters alphabetically. For example, the substring "baa" would be reordered to "aab".
  • Count the frequency f(i)f(i) of each sorted substring ii within ss. For example, if ss contains substrings "baa" and "aba", both would be reordered to "aab" by alphabetical character sorting. The sorted substring "aab" would have frequency 2.
  • Take all sorted substrings with frequency f(i)2f(i) \geq 2; these are the substrings for which at least one pair can be assembled. The number of possible pairs of the underlying, unsorted substrings is given by the combination (f(i)2)f(i) \choose 2. Aggregate this value for all sorted substrings.

As a sanity check, my first concern was to count the number of expected substrings of a given string. Counting the number of substrings required a bit of algebra. As it turns out, a string of length nn has nn substrings of length 1, n1n - 1 substrings of length 2, and so on: this is a summation of ii for 1in1 \leq i \leq n. The value for this sum is given by the following formula (see "Proofs" section below):

i=1ni=(n)(n+1)2\begin{aligned} \sum_{i = 1}^{n} i = \frac{(n)(n + 1)}{2} \end{aligned}

Since the challenge does not consider ss to be a substring of itself, the number of substrings is actually one less than this value.

My justification for the fourth point is given by combinatorics. Since I need to count the number of possible ways to select k=2k = 2 substrings (disregarding order) from n=f(i)n = f(i) possible substrings of ss, I needed to essentially calculate nn choose kk, given by (nk)n \choose k. More on this later.

Solution

I chose the C# programming language because I'm a little rusty with it. My solution begins by computing all substrings of ss except for ss itself. In the loops below, ii represents the starting position and jj represents the substring length. I guard against possibly adding ss to the list of substrings by adding a condition in the inner loop.

var numSubstrings = (s.Length * (s.Length + 1)) / 2 - 1;
var substrings = new List<string>(numSubstrings);
for (var i = 0; i < s.Length; i++) {
    for (var j = 1; i + j <= s.Length && !(i == 0 && j == s.Length); j++) {
        substrings.Add(s.Substring(i, j));
    }
}

Since the challenge is intended to test the user's ability to use dictionaries and hash maps, I then used a C# SortedDictionary to count the frequency of each substring within ss. I used LINQ here to sort the characters in each substring.

var repeatFrequency = new SortedDictionary<string, int>();
foreach (var substring in substrings) {
    var sortedCharacters = new string(substring.OrderBy(c => c).ToArray());
    if (repeatFrequency.ContainsKey(sortedCharacters)) {
        repeatFrequency[sortedCharacters]++;
    } else {
        repeatFrequency[sortedCharacters] = 1;
    }
}

Finally, to count the number of anagram pairs within ss, I wrote the following. Note that possible refactors are discussed in the next section.

var numAnagramPairs = 0;
foreach(var frequency in repeatFrequency.Values) {
    if (frequency > 1) {
        numAnagramPairs += getNumPairs(frequency);
    }
}

The function to count the number of pairs (i.e. nn choose 2) is defined below.

private static int getNumPairs(int n)
{
    if (n < 2) {
        return 0;
    }

    return n * (n - 1) / 2;
}

For my first submission, I implemented a choosechoose function which calculated (nk)n \choose k by the well-known formula (nk)=n!k!(nk)!{n \choose k} = \frac{n!}{k!(n - k)!}. After encountering runtime errors for a couple of the HackerRank test cases, I figured that a brute-force approach (implementing some factorialfactorial function) would not be suitable to pass all test cases. Fortunately, it turns out that for n>1n > 1:

(nk)={1k=0n(n1k1)kk>0\begin{aligned} {n \choose k} = \begin{cases} 1 & k = 0 \\ \frac{n {n - 1 \choose k - 1}}{k} & k > 0 \end{cases} \end{aligned}

Now since we start with k=2k = 2, we arrive at the simplified form used in the code above, since (n11)=n1{n - 1 \choose 1} = n - 1.

(n2)=n(n11)2=n(n1)2\begin{aligned} {n \choose 2} = \frac{n {n - 1 \choose 1}}{2} = \frac{n (n - 1)}{2} \end{aligned}

Areas for Improvement

As mentioned before, I chose to implement my solution in C# because I haven't worked with the language regularly in several years. After my first iteration, I realized I could use LINQ to write my calculation for numAnagramPairs more expressively. In fact, the calculation for numAnagramPairs can be simplified to the expression:

repeatFrequency.Values
    .Where(n => n > 1)
    .Aggregate(0, (acc, n) => acc + (n * (n - 1) / 2))

Coming from mostly writing JavaScript for the past four years, this more functional solution captures my original intent in a less imperative style. I'm sure there are other refactors I could make to modernize my C# code. Perhaps more importantly, I'm also curious to learn about other clever mathematics/combinatorics to apply here for a more efficient solution.

Proofs

Since I'm also rusty on my math, let's do a few quick proofs.

Theorem: Let nNn \in \mathbb{N} and n>0n > 0. Then i=1ni=(n)(n+1)2\sum_{i = 1}^{n} i = \frac{(n)(n + 1)}{2}.

Proof: For the base-case n=1n = 1, i=11i=(1)(1+1)2=22=1\sum_{i = 1}^{1} i = \frac{(1)(1 + 1)}{2} = \frac{2}{2} = 1.

Now assume the theorem is true for n=kn = k.

If we sum up to k+1k + 1, we see that:

i=1k+1i=i=1ki+(k+1)=k(k+1)2+(k+1)=k(k+1)2+2(k+1)2i=1k+1i=(k+1)(k+2)2\begin{aligned} \sum_{i = 1}^{k+1} i &= {\sum_{i = 1}^{k} i} + (k + 1) \\ & = \frac{k(k+1)}{2} + (k + 1) \\ & = \frac{k(k+1)}{2} + \frac{2(k + 1)}{2} \\ \sum_{i = 1}^{k+1} i &= \frac{(k+1)(k+2)}{2} \end{aligned}

Q.E.D.

Theorem: Let n,kNn, k \in \mathbb{N} where 1<kn1 < k \leq n. Then (nk)=nk(n1k1){n \choose k} = \frac{n}{k}{n - 1 \choose k - 1}.

Proof: We already know that (nk)=n!k!(nk)!{n \choose k} = \frac{n!}{k!(n - k)!}. By the definition of factorial:

(nk)=n!k!(nk)!=n(n1)!k(k1)!(nk)!=nk(n1)!(k1)!(n1k+1)!=nk(n1)!(k1)!([n1][k1])!(nk)=nk(n1k1)\begin{aligned} {n \choose k} &= \frac{n!}{k!(n - k)!} \\ & = \frac{n(n-1)!}{k(k-1)!(n - k)!} \\ & = \frac{n}{k} \cdot \frac{(n-1)!}{(k-1)!(n - 1 - k + 1)!} \\ & = \frac{n}{k} \cdot \frac{(n-1)!}{(k-1)!([n-1] - [k-1])!} \\ {n \choose k} & = \frac{n}{k} \cdot {n-1 \choose k-1} \end{aligned}

Q.E.D.