Anagrammer

in C

Monday, Feb-27-2006

boo ya. Got it working last night. Took a run this morning at "wellpunchmeintheface". Stats:

Start  = 1141032457 
Finish = 1141032637 
Total = between 970000 and 980000 results in 180 seconds.

!

Monday, Mar-13-2006

C_anagrammer.tar.gz, contains the source for the anagrammer. You'll have to compile it on your own.

words.h

Contains the functions that manipulate individual words including a letter counter, a word comparison function (wordCheck - can word 'a' be spelled with the letters from word 'b') and the function that actually grabs all the words out of original dictionary file.

lexicon.h

Simple data type for creating our temporary lexicons. There isn't much going on here, simple queue type to hold variable length lists.

main.c

"Where the action is." This file prepares all our data and starts the MainLoop. MainLoop is really the entire program. It's a recursive function that goes 'down' one level for each new word in the anagram.

Notes

This is a very basic attempt at translating anagrammer from python to C and so far it's worked. There is a lot of stuff hardcoded (location of dictionary.txt, number of results possible, length of strings in the original lexicon, etc.) where it was easier and quicker than making everything dynamic. One desire is for a growable results datatype, but that can wait, the first goal was just to make it work. So far it runs about 2 to 3 times faster than optimized python tested against the same test strings with the same initial dictionary.

Method

Broadly, the goal of the program is to find out what anagrams can be created using an input string and a given dictionary. The program follows these steps--primarily in the MainLoop function--to find anagrams:

  1. Create a "lexicon" of words that can be spelled using only the letters from the input string. The first word in the lexicon is added to a list of temporary results and its letters are removed from the input string.
  2. If the input string is empty it means all letters have been used and are now in the temporary results list. An anagram has been found. The temporary results are added to the final results list and the word we just added to the temp results is removed from the temp results and added back to the input string. We also remove the word from the lexicon and return to step 1.
  3. If the input string is full we generate a new lexicon using only the words in the current lexicon compared against the remaining letters of the input string.
  4. If the new lexicon is empty, we have a bad input string so we remove the word we just added to the temp results and add it back to the input string. We also remove the word from the lexicon and return to step 1.
  5. If the new lexicon is full, we call MainLoop again (return to step 1, except one level "down"), passing it the new lexicon, new input string, and current temporary results. Processing at this level is held until the new lexicon is exhausted.
  6. Once returned from MainLoop we remove the word we most recently added to the temp results and add it back to the input string. We also remove the word from the lexicon. If the lexicon is empty at this point, we break out of the loop. If we are in a lower level of recursive calls this means taking a step back up the stack. If we are at the original level, we return to the calling program with the final results.

Data Types

The types used were intended to be as simple as possible. Other than the original word list, no actual strings (char *) are stored. At the beginning of the program, the words are converted to a special "letter count" format and stored in a fixed length array of arrays. Each word is a 27 cell array containing short ints representing the quantities of letters a-z respectively in the word with the 27th cell containing a reference to the word's location in the original word list.

Having all words represented as arrays of shorts simplifies common operations. When comparing words (can word 'a' be spelled with the letters from word 'b'?) I can simply step through the lists. If a[n] > b[n] (if there are of the nth letter required by a than can be provided by 'b') return false right away. If we get to 26 successfully, return true.

The lexicons that are created at each level of MainLoop are stored as a queue (singly linked list) of short ints which are the indecies of the letter-count versions of the original words (that's a mouthful). Temporary results and final results are pre-allocated arrays of short ints containing only indecies. I am lazy so right now I've limited This means that the only place letter-count arrays are stored is in the initial list and the only place the strings are stored is in their initial list. Space is only allocated for list headers (head/tail reference container) and lexicon items.

Finale

I'm not sure right now if I'm properly free()ing memory, but as I dig into C I'll figure that out. It runs wicked fast (compared to the Python implementation) on my normal 1GHz Athalon ("canteloupe") and is at least functional on my totally sweet Pentium I, 90MHz monster laptop ("orange"). Both are running linux, Kubuntu on canteloupe and Core Distro on orange.

Anyone is welcome to take it and do what they want, all code is public domain and doesn't really do anything anyone else couldn't do in a weekend if they got the anagram bug in their ear. I originally wanted my own anagram generator because I'd been pleased with the freeware versions I had used but I wanted features they didn't offer and knew it couldn't be too tough to write my own.

in Python

get the code. All the above descriptions apply. If you want it to go fast, you'll need psyco, the python optimizing compiler. Other than that, it's a plain ol' anagram generator, no frills, just fight. If you want to put it in a harness to measure results, the easiest way would be:

For example:

from anagrammer import main
from time import time

start = time()
results = main("puresoapunion") #bonus points if you find the funny anagram!
finish = time()

print "%f seconds used, %i results found" % (finish-start, len(results))

There you have it, an anagram generator.

Limits

Repeated lexicon generation, specifically word checking, is the bottle neck. Check it out:

input = puresoapunionst
results = 153617
lexicon generations = 2218360
word checks = 19736516
running time = 23.690000

res / sec = 6484.466004
lexgen / res = 14
wdchk / res = 128

You can get a general picture from these results of the ratio of word checks per result. For even a short input phrase we have almost 20 million "word checks" and more than 2 million calls to the lexicon generator. Any speed gain made in those areas will be largely reflected in the running time of the anagram generator.

For the Future

I hope to eventually use the anagram generator to dig into anagram prediction. I ran about 1000 or so tests on the python version a while back gathering data to see if there is any traceable correlation between input phrase and final results length. I haven't done any strenuous analysis of the numbers, but if you stay tuned I'll be posting them soon (if I can find them). Really all I did was plug them into gnuplot to see what it looks like, but you've got to start somewhere.

Comments

I've got a pile of anagram programs -- really the same program, written in lots of different languages --, using what sounds like the same algorithm. I've been considering putting them on Google code for the heck of it, but wouldn't want to duplicate what you've done ... care to see 'em?

The languages I've got are c++, perl5, perl6, scheme (lots of variants), c-sharp (brand new), emacs lisp, haskell, python, common lisp (fastest), parrot, and ruby. I didn't seriously consider doing C because it seemed like it'd be too painful :-)

I considered doing d, lua, factor, ocaml, squeak, and io, but never got around to finishing those.

You can contact me at offby1@blarg.net