Sunday, July 13, 2014

Nomenclature

I've had to take a break from project work recently due to other commitments, but hopefully I'll be able to get back to at least tinkering with things soon.  As a preamble to that I've been looking at a bit of a fringe task for procedurally generated worlds, namely the procedural generation of names.

My plan for the near future is to focus more on generating a plausible political and logistical infrastructure for a planet, instead of working purely on the visual terrain as I have done many times before, so I'm going to need to be able to name things, be they continents, countries, regions, cities, airports, premiers or whatever.  This is a completely new area to me and so sounded like an interesting diversion.

There are many ways to procedurally construct names, ranging from simply producing random sequences of letters to utilising complex linguistic rules, but the one I chose to experiment with was Markov Chains (http://en.wikipedia.org/wiki/Markov_chain): a technique that decides what comes next in a sequence by looking only at the preceding X elements and using probability tables to pick a plausible successor.  In my case the sequence is the letters of the name being generated, so by looking at the last few characters of the name so far I can choose a plausible next character, and so on.

So where do these probability tables come from?  Well, to make names that look sensible I thought it best to reference the real world as a starting point, so to generate names for countries I use a set of 275 real world country names as my exemplars - something that's not too hard to come by these days thanks to the internet.  These real names are analysed to store the probability that any given sequence of characters is followed by a particular character; for example it may be that 23% of the time the letter 'a' is followed by the letter 't', or that 15% of the time the sequence 'tr' is followed by the letter 's' in the real world names.
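As a rough illustration of that analysis step (a minimal sketch, not the code actually used for this post), the table could be built in Python along these lines. The function names, the `^` start marker and the `$` end marker are all assumptions made for the example, and the three exemplars are just a toy list rather than the full 275-name set.

```python
from collections import Counter, defaultdict

START, END = "^", "$"  # assumed markers; they must not appear in the exemplars


def build_table(exemplars, order):
    """Map each `order`-character context to a Counter of following characters."""
    table = defaultdict(Counter)
    for name in exemplars:
        # Pad the front so the first letters have a context, and mark the end
        # so the table also records where names tend to stop.
        padded = START * order + name.lower() + END
        for i in range(order, len(padded)):
            table[padded[i - order:i]][padded[i]] += 1
    return table


def probabilities(table, context):
    """Turn the raw counts for one context into probabilities."""
    counts = table.get(context, Counter())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


# Toy exemplar list for illustration only
table = build_table(["malta", "mali", "malawi"], order=2)
print(probabilities(table, "al"))
```

Storing raw counts rather than percentages keeps the table easy to update if more exemplars are added later; the probabilities can be derived whenever they are needed.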

The resultant behaviour of the algorithm is heavily dependent upon the size of the exemplar pool, but even more significantly upon the number of characters you look at when deciding what might come next.  For example, running the same algorithm with the same exemplars but different sequence lengths produced the following sets of names:


It's immediately obvious that the longer the character sequence examined, the more rigidly the generated names stick to the exemplars, while the shorter the character sequence, the more variety the generated names exhibit.  The extreme example here is table (1d), where only the last character emitted is used to decide what comes next, resulting in a virtually random sequence of characters that bears little resemblance to the exemplars or even anything vaguely pronounceable!
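To make the effect of the sequence length concrete, here is a hedged sketch of the generation step in Python. It is not the original implementation: `build_table`, `generate`, the `^`/`$` markers and the `max_len` cap are assumptions for illustration, and the exemplar list is a small hand-picked sample.

```python
import random
from collections import Counter, defaultdict

START, END = "^", "$"  # assumed markers; they must not appear in the exemplars


def build_table(exemplars, order):
    """Count which character follows each `order`-character context."""
    table = defaultdict(Counter)
    for name in exemplars:
        padded = START * order + name.lower() + END
        for i in range(order, len(padded)):
            table[padded[i - order:i]][padded[i]] += 1
    return table


def generate(table, order, rng, max_len=20):
    """Emit one name by repeatedly sampling the next character."""
    name = START * order
    while len(name) - order < max_len:
        counts = table[name[-order:]]  # look only at the last `order` characters
        letters = list(counts)
        # Weighted random choice: frequent successors are picked more often
        nxt = rng.choices(letters, weights=[counts[c] for c in letters])[0]
        if nxt == END:
            break
        name += nxt
    return name[order:]


exemplars = ["chad", "chile", "china", "ghana", "guinea", "guyana"]
rng = random.Random(2014)  # seeded so runs are repeatable
for order in (4, 3, 2, 1):
    table = build_table(exemplars, order)
    print(order, [generate(table, order, rng) for _ in range(5)])
```

With so few exemplars the longer orders will mostly reproduce the inputs verbatim, which is the same rigidity described above, just exaggerated by the tiny pool.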

Of the variations I think the three character sequence version (1b) is the most promising.  The longer sequence length prevents too much randomness in the output, but at times there simply isn't enough randomness, meaning that the generated names are quite similar or even identical to the real world exemplars.  There is also nothing preventing the same name being generated more than once, such as "falklands uk".

To address the first of these issues I decided to run a check on each generated name to see if it was different enough to the exemplars to be accepted.  This is done by comparing the generated name to each exemplar and calculating the Levenshtein Distance between the two strings (http://en.wikipedia.org/wiki/Levenshtein_distance), essentially the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
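For reference, the distance itself can be computed with the standard row-by-row dynamic programming approach. This is a generic textbook sketch in Python rather than the code behind this post, and the function name `levenshtein` is just a label chosen for the example.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("mali", "malta"))  # one substitution plus one insertion
```

Only two rows of the full distance matrix are kept at any time, which is plenty for short strings like place names.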

By rejecting any generated name whose Levenshtein Distance to any exemplar is too low we can help ensure that we don’t accidentally end up with something too similar to a real world place name.  Running the same test again, but now rejecting any generated name with an exemplar distance of less than two, gives a better selection of plausible but not actually real world names:


Note here that the distance check is performed on each word of the generated and exemplar names separately, to avoid generating names where one word is unique but the other is lifted straight from an exemplar.  Certain key words such as "the", "of" or "island" are exempted from this check, however, as they are perfectly valid in the output.
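One way to read that word-by-word check is sketched below in Python. The details are my interpretation rather than the original code: `is_acceptable`, the `EXEMPT` set (taken from the three words quoted above) and the choice to compare every generated word against every exemplar word are all assumptions for the example.

```python
def levenshtein(a, b):
    """Standard edit distance, keeping two rows of the matrix at a time."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


# Assumed exemption list: the three words named in the post
EXEMPT = {"the", "of", "island"}


def is_acceptable(candidate, exemplars, min_distance=2):
    """Reject a candidate if any non-exempt word in it is within
    `min_distance` edits of any non-exempt word from the exemplars."""
    exemplar_words = {w for name in exemplars for w in name.lower().split()}
    exemplar_words -= EXEMPT
    for word in candidate.lower().split():
        if word in EXEMPT:
            continue  # common words are always allowed
        if any(levenshtein(word, ew) < min_distance for ew in exemplar_words):
            return False
    return True


exemplars = ["falkland islands", "isle of man"]
print(is_acceptable("falkland of brenor", exemplars))    # reuses a real word
print(is_acceptable("brenor of the torvak", exemplars))  # all words are new
```

Raising `min_distance` makes the filter stricter, at the cost of rejecting more candidates.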

As an experiment, running the test one final time but this time with the distance rejection limit raised from less than two to less than five produced these country names:


Comparing these to the previous test's output the biggest change here is with the longer character sequence generated names (3b and to a lesser extent 3c) as the distance check has rejected anything even remotely similar to one of the exemplars causing far more random and less recognisable names.  The shorter sequence names at the bottom have been less affected as they were pretty random to start with.  Note there is no four character sequence output here as the system was unable to generate any names using that most restrictive of criteria that also passed the distance check - the names were by definition most similar to the exemplars therefore least likely to have an acceptably high distance score.
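Both the repeated names mentioned earlier and the empty four character result suggest wrapping generation in a bounded retry loop. The post does not describe how this was actually handled, so the Python below is only a sketch under that assumption: `unique_names`, its parameters and the `max_attempts` cap are hypothetical, and the generator and acceptance check are passed in as plain callables so any implementation can be plugged in.

```python
import itertools


def unique_names(generate, accept, count, max_attempts=10_000):
    """Collect up to `count` distinct names that pass `accept`.

    Gives up after `max_attempts` tries, so a combination of settings that
    can never pass the check returns a short (possibly empty) list instead
    of looping forever.
    """
    seen = set()
    names = []
    for _ in range(max_attempts):
        if len(names) == count:
            break
        name = generate()
        if name and name not in seen and accept(name):
            seen.add(name)
            names.append(name)
    return names


# Stand-in generator and check, purely to show the control flow
stream = itertools.cycle(["avelon", "avelon", "chad", "tormeria"])
result = unique_names(lambda: next(stream),
                      lambda name: name != "chad",
                      count=2)
print(result)
```

A caller that gets back fewer names than requested can then decide whether to relax the distance threshold or shorten the sequence length and try again.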

So far these have all been country names but how does the system work with a different set of exemplars?  To find out I used it to try to generate plausible city names instead - first using a full set of 26914 Earth city names with a three character sequence length and a distance threshold of three:


As you can see there is a lot of variety in the names, reflecting the huge real world variety of city names across cultures and languages.  However, as one of my main goals at the moment is to produce a procedural world that exhibits more heterogeneity than is typically found with the global application of procedural systems, this one-size-fits-all name style is not really what I'm after.

By limiting the set of exemplars to those real world cities from a particular country, however, more localised results can be achieved.  Out of interest I ran the system over just those real world city names from the United Kingdom, United States, France and Japan in turn to see how closely the output would reflect their dramatically contrasting morphology:


I'm no language expert, but to my layman's eyes the output from each country, while varied amongst its peers, is noticeably different from that in the other tables, which I think is good enough for my needs - I just want each of my virtual countries to have an identity reflected in its nomenclature.

These tables have been generated from just a few runs and with specific parameters, but it's important to bear in mind that when used for real the exemplar pool, sequence length and the distance threshold can all be varied each time a name is needed providing a steerable but near limitless potential set of generated names.

Anyway, I’m happy enough with it for now so I’m going to move on to something else.  Comments, as always, are welcome.



4 comments:

  1. You're alive! Woot woot!

    Seriously man, glad to see you back at it. Hope all is well.

    Interesting post. I prefer all of the examples generated with the preceding 2 characters used. I think they produced the most alien names that were still able to be pronounced. Going up to 3 and the names started to retain too many recognizable elements from Earth.

    While this was in fact a very interesting diversion, I can't wait till you get back to work on the planet(s)!

    Take care and again, good to see this blog active!

  2. Just a cautionary tale : http://thedailywtf.com/Articles/The-Automated-Curse-Generator.aspx

  3. I like this approach, but I think that if the cities of a given country were to be used in a source set, the "flavor" of the resulting names would impose a cultural bias where the virtual country is aligned with the real country. For example, if the country "Calleory" had cities with names generated from your French cities list, an observer would just think of Calleory as being a thinly veiled France. (Not unlike how Romulans are thinly veiled Romans, Klingons are thinly veiled Japanese Samurai, and so forth.)

    Perhaps a way around this is to group the cities of two or three unlike countries together as a source set. This may provide a result set that still contains plausible city names, is still distinctive from the other result sets, but doesn't present the cultural bias.

