I have been playing Word Challenge on facebook, it is a very fun game. It made me start playing with anagrams, and I decided to write an anagram-solving program. However, I read about Donald Knuth’s insight that if you sort a word and its anagram alphabetically, it will give the same word.
sergio -> egiors
orgies -> egiors
So, by using a hash table it is pretty easy to find anagrams in constant time. You just have to associate a word made up of ordered letters to a list of all the possible anagrams. You can do it in O(n) by iterating through a dictionary file. There’s no problem to solve when there is already an optimal solution.
I then decided to write a program that finds every possible word that can be formed by a group of letters (Which would be perfect to use if I ever wanted to cheat on Word Challenge. Which of course I never would! =P) .
I wrote an algorithm on top of Knuth’s idea that is stupidly slow.
It takes n letters, and sorts them. It then finds all the anagrams associated to that sorted word and adds them to a set. Then it recurses to each possible word created by removing one letter, until it reaches the minimum number of letters. Which is, by default, 3.
I haven’t proven it, but if my math is right, my algorithm is O( (n^2)(n-1)! ).
It works instantly on my machine when n<9.
It reads any dictionary file that is composed of single lines containing words on some language.
It is written in English, but there is some Spanish output. (You can probably figure out what it says.)
I was thinking of not uploading it, but there are probably already a lot of web applets that do the exact same thing, and I guess your average cheater wouldn’t go through the trouble of running a Python script if he/she can’t go through the trouble of practicing to get better at the game.
There are some obvious optimizations that could be made, like taking into account repeated letters.
There is also the option to change the data structure: First, load the dictionary to a list. Then order that list by the size of the words (from short to long).
For each word, create a key and append that word to the hash map with that key.
Create n sub-keys, where n is the number of letters in that key. Each sub-key has a letter removed. Since the key is sorted, each sub-key is also sorted. That means it is a valid key. Since it is shorter, it has already been added to the hash map (unless it doesn’t have any anagrams or it is below our limit. In those cases, we ignore it). We add the values associated to these sub-keys to the value of the hash-map for the current key.
We are using dynamic programming, since we are eliminating the need for recursion by saving state. I don’t know if I explained myself right..
There is no implementation for this idea since I came up with it as I was writing it =D. The new data structure would allow us to get all possible words for n letters in constant time!. And that structure can probably be filled pretty fast.
I’ll edit this post when I have a working implementation =D
Update: New implementation.
I implemented the idea I wrote about, and it worked! (Sort of..)
First off, it takes a lot more memory. I have a Spanish dictionary with around 80,000 words, and my program is consuming about 177 MB of memory. This can be hugely improved by using indexes instead of storing strings.
The good news is that if you input any word to the program, and that word or any of its anagrams are in my dictionary, the program outputs all the possible words that can be formed by those letters in constant time.
The bad news is that if you input any word whose key is not in the hash table, the program can’t tell you anything.
The solution to this was to check for this problem, and in those cases, iterate recursively through the possible sub-keys until we find the ones that are in the hash map.
This solution will cause the program to be extremely slow when you input large strings whose keys are not in the hash table.
Link to the new version.
Again, there are a lot of things that can still be improved. But this version works better than the last one.