Sherlock and Anagrams
Note: I recommend reading this article in PDF form since it includes typeset formulas and proofs.
HackerRank provides an Interview Preparation Kit containing a number of problems spanning programming topics like data structures, sorting, and searching. After taking some time away from HackerRank, I decided to pick up a question on dictionaries and hash maps. The problem is called "Sherlock and Anagrams" on the platform.
The Problem
Two strings are said to be anagrams if and only if their letters can be rearranged to form each other. The challenge as defined by HackerRank is to implement a function which accepts an input string and returns the number of a pairs of substrings which are anagrams of each other. For this challenge, the original string is not considered a substring of itself.
Approach
After some thinking about the problem, I decided that my strategy would be to:
- Compute all substrings of the input , except for itself.
- For each substring, sort its characters alphabetically. For example, the substring "baa" would be reordered to "aab".
- Count the frequency of each sorted substring within . For example, if contains substrings "baa" and "aba", both would be reordered to "aab" by alphabetical character sorting. The sorted substring "aab" would have frequency 2.
- Take all sorted substrings with frequency ; these are the substrings for which at least one pair can be assembled. The number of possible pairs of the underlying, unsorted substrings is given by the combination . Aggregate this value for all sorted substrings.
As a sanity check, my first concern was to count the number of expected substrings of a given string. Counting the number of substrings required a bit of algebra. As it turns out, a string of length has substrings of length 1, substrings of length 2, and so on: this is a summation of for . The value for this sum is given by the following formula (see "Proofs" section below):
Since the challenge does not consider to be a substring of itself, the number of substrings is actually one less than this value.
My justification for the fourth point is given by combinatorics. Since I need to count the number of possible ways to select substrings (disregarding order) from possible substrings of , I needed to essentially calculate choose , given by . More on this later.
Solution
I chose the C# programming language because I'm a little rusty with it. My solution begins by computing all substrings of except for itself. In the loops below, represents the starting position and represents the substring length. I guard against possibly adding to the list of substrings by adding a condition in the inner loop.
var numSubstrings = (s.Length * (s.Length + 1)) / 2 - 1;
var substrings = new List<string>(numSubstrings);
for (var i = 0; i < s.Length; i++) {
for (var j = 1; i + j <= s.Length && !(i == 0 && j == s.Length); j++) {
substrings.Add(s.Substring(i, j));
}
}
Since the challenge is intended to test the user's ability to use dictionaries and hash maps, I then used a C# SortedDictionary to count the frequency of each substring within . I used LINQ here to sort the characters in each substring.
var repeatFrequency = new SortedDictionary<string, int>();
foreach (var substring in substrings) {
var sortedCharacters = new string(substring.OrderBy(c => c).ToArray());
if (repeatFrequency.ContainsKey(sortedCharacters)) {
repeatFrequency[sortedCharacters]++;
} else {
repeatFrequency[sortedCharacters] = 1;
}
}
Finally, to count the number of anagram pairs within , I wrote the following. Note that possible refactors are discussed in the next section.
var numAnagramPairs = 0;
foreach(var frequency in repeatFrequency.Values) {
if (frequency > 1) {
numAnagramPairs += getNumPairs(frequency);
}
}
The function to count the number of pairs (i.e. choose 2) is defined below.
private static int getNumPairs(int n)
{
if (n < 2) {
return 0;
}
return n * (n - 1) / 2;
}
For my first submission, I implemented a function which calculated by the well-known formula . After encountering runtime errors for a couple of the HackerRank test cases, I figured that a brute-force approach (implementing some function) would not be suitable to pass all test cases. Fortunately, it turns out that for :
Now since we start with , we arrive at the simplified form used in the code above, since .
Areas for Improvement
As mentioned before, I chose to implement my solution in C# because I haven't worked with the language regularly in several years. After my first iteration, I realized I could use LINQ to write my calculation for numAnagramPairs
more expressively. In fact, the calculation for numAnagramPairs
can be simplified to the expression:
repeatFrequency.Values
.Where(n => n > 1)
.Aggregate(0, (acc, n) => acc + (n * (n - 1) / 2))
Coming from mostly writing JavaScript for the past four years, this more functional solution captures my original intent in a less imperative style. I'm sure there are other refactors I could make to modernize my C# code. Perhaps more importantly, I'm also curious to learn about other clever mathematics/combinatorics to apply here for a more efficient solution.
Proofs
Since I'm also rusty on my math, let's do a few quick proofs.
Theorem: Let and . Then .
Proof: For the base-case , .
Now assume the theorem is true for .
If we sum up to , we see that:
Q.E.D.
Theorem: Let where . Then .
Proof: We already know that . By the definition of factorial:
Q.E.D.