[okfn-labs] Find country names in blobs of unknown text

Adi Eyal adi at code4sa.org
Sat Jun 14 14:53:35 UTC 2014


Mine is slightly better and easier to maintain (although they're quite
similar). Also, that one can't cope with multiple occurrences,
e.g. for "XXXX Pakistan Algeria XXXX" it will only pick up the second
one.
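
To make the multiple-occurrence case concrete, here is a minimal sketch
that uses re.finditer, which yields every non-overlapping match rather
than a single one. The country list and the helper name find_countries
are assumptions made for illustration; this is not the code from either
implementation discussed in this thread.

```python
import re

# Illustrative list only; a real one would come from a country-name dataset.
COUNTRIES = ["Pakistan", "Algeria", "New Zealand", "India"]

# Try longer names first so a long name wins over any shorter name it
# contains; re.escape guards against regex metacharacters in a name.
_NAMES = sorted(COUNTRIES, key=len, reverse=True)
_PATTERN = re.compile(r"\b(%s)\b" % "|".join(re.escape(n) for n in _NAMES))


def find_countries(text):
    """Return every country name found in text, in order of appearance."""
    return [m.group(1) for m in _PATTERN.finditer(text)]


print(find_countries("XXXX Pakistan Algeria XXXX"))
```

With this approach both "Pakistan" and "Algeria" are returned for the
example string, and a string with no country name gives an empty list.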

Adi

On 14 June 2014 15:50, Thomas Levine <_ at thomaslevine.com> wrote:
> Here's what I came up with.
> http://dada.pink/dada/finding-country-names/
>
> On 13 Jun 20:03, Adi Eyal wrote:
>> Not sure why you need anything smarter than a regex. You'll only need
>> bigger guns if you need to do some fuzzy matching or something
>> similar. You might come unstuck in marginal cases such as "US" but an
>> entity extractor wouldn't do any better on short strings like the
>> examples that you've given. I've pasted a quick example below; it could
>> be prettier. One thing that you may want to do is create a mapping
>> table for synonyms so that you know that they refer to the same
>> country, i.e. USA => United States of America
>>
>>
>> import re
>> sentences = [
>>     "sfdfdsfd New Zealand sdfdsf",
>>     "yhj53434 India sdfdsf",
>>     "45654 The United States of America sdfdsf",
>>     "no match here"
>> ]
>>
>>
>> countries = [
>>     "New Zealand", "India",
>>     "The United States of America", "USA", # Be careful with "US"
>> ]
>>
>> # Escape each name so any regex metacharacters in it match literally.
>> re_countries = re.compile(r"\b(%s)\b" % "|".join(map(re.escape, countries)))
>>
>> for sentence in sentences:
>>     match = re_countries.search(sentence)
>>     if match:
>>         print(match.groups())
>>
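
The synonym mapping table suggested above could look roughly like the
sketch below. The SYNONYMS table, its canonical names and the helper
name canonical_countries are hypothetical and chosen only to illustrate
the idea; a real table would need to be built from a country-name
dataset.

```python
import re

# Hypothetical synonym table: every spelling maps to one canonical name.
SYNONYMS = {
    "The United States of America": "United States of America",
    "United States of America": "United States of America",
    "USA": "United States of America",
    "New Zealand": "New Zealand",
    "India": "India",
}

# Longest spellings first so "The United States of America" is tried
# before the shorter "United States of America".
_names = sorted(SYNONYMS, key=len, reverse=True)
_pattern = re.compile(r"\b(%s)\b" % "|".join(re.escape(n) for n in _names))


def canonical_countries(text):
    """Return the canonical name for each country mention in text."""
    return [SYNONYMS[m.group(1)] for m in _pattern.finditer(text)]


print(canonical_countries("45654 The United States of America sdfdsf"))
print(canonical_countries("Acme USA Ltd"))
```

Both calls report the same canonical name, so downstream code can group
records by country without caring which spelling appeared in the source.
Matching here is case-sensitive, which is an assumption worth revisiting
for messy input.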
>> On 13 June 2014 17:54, Thomas Levine <_ at thomaslevine.com> wrote:
>> > I'm looking for a function or regular expression that finds country names in blobs of text.
>> > This can just be something that does a bunch of exact string matches so that it doesn't matter
>> > whether the source blob (company names in my case) is spelled "Aecom New Zealand Limited",
>> > "Aecom (New Zealand)", "Aecom, New Zealand", or "New Zealand". Has someone released something
>> > like this?
>> >
>> > If I don't see an answer soon, I'm going to write a regular expression that matches with a
>> > bunch of country names from some country name dataset.
>>
>>
>>
>> --
>> Adi Eyal
>> Director
>> Code for South Africa
>> Promoting informed decision-making
>>
>> phone: +27 78 014 2469
>> skype: adieyalcas
>> linkedin: http://za.linkedin.com/pub/dir/Adi/Eyal
>> web: http://www.code4sa.org
>> twitter: @soapsudtycoon
>>
>> For more information on how to participate in the open data community
>> in South Africa, go to: http://www.code4sa.org/#community





