2016-11-28

One of my early programming projects in Python was a word game solver (example: word jumble solver) – the early version was a simple script, which grew into a web application. Since then, I’ve always enjoyed using dictionary search problems to test out a new language.

Today’s article will look at building a searchable anagram dictionary using R. We will take an English word list and create a simple hash-map by sorting the letters into alphabetical order. We will store this into a hash-map data structure (um, Houston, we may have a wee problem here….does R even have such a beast?) and write a simple function to check if your letters can be unscrambled into a word.

Let’s start with the dictionary. We’re going to use the Enable open source dictionary. A copy can be downloaded from here. Please download to your local machine and access it from there.

wordlist <- readLines(“c:\\enable1.txt”)

Now, to sort the letters in a string into alphabetical order. Basically, we’re splitting the word up into letters. Then using the unlist function to convert this into a vector. Then sorting the vector into alphabetical order. Then pasting the whole mess together. I give you… the hashword.

hashword = paste(sort(unlist(strsplit(word, “”))), collapse = “”)

This is helpful for two reasons:

Identifies anagram families – words with the same letters

Can be used to unscramble letters into words

To build the hash map, we will loop through the word list and load them into the hash map. Now… to find options to use for the hash map. Unlike many other languages, R doesn’t have an obvious choice for a fast key-value lookup. My first attempt at this used lists and the name property. While this worked for small examples, it wasn’t efficient for large sets.

And then I stumbled across R environments (longer discussion here). These incorporate a hashmap structure by default, so they are a good candidate for our project. (This functionality has also been encapsulated in the “hash” package available on R-Cran – documentation here.)

We’re ready to write our dictionary builder:

map <- new.env(hash=T, parent=emptyenv())

for (word in wordlist){

hashword = paste(sort(unlist(strsplit(word, “”))), collapse = “”)

if (!is.null(map[[hashword]])){

map[[hashword]] <- c(map[[hashword]], c=word)

} else {

map[[hashword]] <- list(word)

}

}

This iterates across the original list of words, converting each word into a sorted string of letters. It checks the hashmap (the environment variable) to see if we’ve already seen that set of letters before. If we’ve seen it before, we append the word to the list of letters stored for that hashword value. If not, we initialize a list with that word.

And now we need a retrieval function.  This is a simple lookup. Remember to sort the word before looking it up.

getAnagrams <- function(x) {

return(map[[paste(sort(unlist(strsplit(x, “”))), collapse = “”)]])

}

And to see the results in action. First, looking at a known “anagram family”:

> getAnagrams(“goat”)

[[1]]

[1] “goat”

$c

[1] “toga”

And next, trying to unscramble letters into a word (check here to confirm results):

> getAnagrams(“terref”)

[[1]]

[1] “ferret”

And there you have it. A simple anagram solver compressed into a couple of lines of R.

This is a decent starting point for those interested in automated puzzle solvers. A couple of useful extension of this code include solving for partial anagrams (uses some but not all letters) – output would look like this or this. Adding a letters-to-points score mapping would enable you to calculate scores for scrabble or words with friends. This works fairly well for basic searches. The next stage up from there generally will require a rewrite, with a faster language and more efficient data structure to address the complexity of more advanced word problems such as solving boggle.

The post Simple Anagram Finder Using R appeared first on ProgrammingR.

Show more