Any good surname databases?

devze.com (https://www.devze.com), 2023-03-12 07:33. Source: web
I'm looking to generate some database test data, specifically table columns containing people's names. To get a good indication of how well indexing works for name-based searches, I want to get as close as possible to real-world names and their true frequency distribution, i.e. lots of different names whose frequencies follow something like a power law.

Ideally I'm looking for a freely available data file with names followed by a single frequency value (or equivalently a probability) per name.

Anglo-Saxon names would be fine, although names from other cultures would also be useful.


I found some US census data which fits the requirement. The only caveat is that it lists only names that occur at least 100 times...

  • Genealogy Data: Frequently Occurring Surnames from Census 2000
  • names.zip

Found via this blog entry that also shows the power law distribution curve

  • Power law curve in surnames (blog entry)
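A data file of this kind needs to be turned into (name, count) pairs before it can be sampled. Below is a minimal sketch in Java, assuming a comma-separated file with one header line. The class name NameFrequencyParser and the column indexes are hypothetical; check the actual header of the downloaded file before choosing them.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: reads (name, count) pairs from comma-separated lines. */
final class NameFrequencyParser {

    /** One parsed row: a name and its occurrence count. */
    static final class Entry {
        final String name;
        final int count;

        Entry(String name, int count) {
            this.name = name;
            this.count = count;
        }
    }

    /**
     * Parses the given lines, skipping the first one as a header.
     * nameCol and countCol are zero-based column indexes (an assumed layout).
     * Blank lines, short rows and rows with a non-integer count are skipped.
     */
    static List<Entry> parse(List<String> lines, int nameCol, int countCol) {
        List<Entry> entries = new ArrayList<>();
        for (int i = 1; i < lines.size(); i++) {
            String line = lines.get(i).trim();
            if (line.isEmpty()) {
                continue;
            }
            String[] cols = line.split(",");
            if (cols.length <= Math.max(nameCol, countCol)) {
                continue; // malformed row
            }
            try {
                int count = Integer.parseInt(cols[countCol].trim());
                entries.add(new Entry(cols[nameCol].trim(), count));
            } catch (NumberFormatException e) {
                // Skip rows whose count is not a plain integer.
            }
        }
        return entries;
    }
}
```

The lines can come from java.nio.file.Files.readAllLines on the extracted file. Skipping malformed rows silently is a convenience for test data generation; stricter handling may be preferable elsewhere.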

Further to this, you can sample from the list using roulette wheel selection, e.g. (not tested):

using System;

struct NameEntry
{
    public string _name;
    public int _frequency;
}

// Sum of all _frequency values. Precalculate this once.
int _frequencyTotal;

public string SampleName(NameEntry[] nameEntryArr, Random rng)
{
    // Throw the roulette ball: a uniform integer in [0, _frequencyTotal).
    int throwValue = rng.Next(_frequencyTotal);
    int accumulator = 0;

    for (int i = 0; i < nameEntryArr.Length; i++)
    {
        accumulator += nameEntryArr[i]._frequency;
        // Strict comparison, so an entry with zero frequency is never selected.
        if (throwValue < accumulator)
        {
            return nameEntryArr[i]._name;
        }
    }

    // If we get here then we have an array of zero frequencies.
    throw new ApplicationException("Invalid operation. No non-zero frequencies to select.");
}
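The sampler above walks the array on every call, which can get slow when generating millions of rows from a long name list. One common variation, sketched here in Java, is to precompute cumulative totals once and binary-search them per sample. The class CumulativeNameSampler and its method names are hypothetical and not part of the original answer.

```java
import java.util.Random;

/** Hypothetical helper: weighted name sampling via cumulative totals and binary search. */
final class CumulativeNameSampler {
    private final String[] names;
    private final long[] cumulative; // cumulative[i] = sum of frequencies[0..i]
    private final long total;

    CumulativeNameSampler(String[] names, int[] frequencies) {
        if (names.length != frequencies.length) {
            throw new IllegalArgumentException("names and frequencies differ in length");
        }
        this.names = names.clone();
        this.cumulative = new long[frequencies.length];
        long running = 0;
        for (int i = 0; i < frequencies.length; i++) {
            if (frequencies[i] < 0) {
                throw new IllegalArgumentException("negative frequency");
            }
            running += frequencies[i];
            cumulative[i] = running;
        }
        if (running == 0) {
            throw new IllegalArgumentException("no non-zero frequencies to select");
        }
        this.total = running;
    }

    /** Returns the name whose cumulative range contains ball, for ball in [0, total). */
    public String pick(long ball) {
        // Find the first index whose cumulative total is strictly greater than ball.
        int lo = 0;
        int hi = cumulative.length - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (cumulative[mid] > ball) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        return names[lo];
    }

    /** Draws one name with probability proportional to its frequency. */
    public String sample(Random rng) {
        long ball = (long) (rng.nextDouble() * total); // uniform in [0, total)
        return pick(ball);
    }
}
```

Each draw then costs O(log n) instead of O(n), at the price of one extra array. Entries with zero frequency are never returned, because their cumulative total equals that of the preceding entry.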


Oxford University provides word lists on their public FTP site as compressed .gz files at ftp://ftp.ox.ac.uk/pub/wordlists/names/.


You can also check out the jFairy project. It is written in Java and produces fake data, such as names. http://codearte.github.io/jfairy/

Fairy fairy = Fairy.create(); 
Person person = fairy.person();
System.out.println(person.firstName());           // Chloe
System.out.println(person.lastName());            // Barker
System.out.println(person.fullName());            // Chloe Barker