[okfn-labs] Find country names in blobs of unknown text

Adi Eyal adi at code4sa.org
Fri Jun 13 18:03:17 UTC 2014


Not sure why you need anything smarter than a regex. You'll only need
bigger guns if you need to do fuzzy matching or something similar. You
might come unstuck in marginal cases such as "US", but an entity
extractor wouldn't do any better on short strings like the examples
that you've given. I've pasted a quick example below; it could be
prettier. One thing that you may want to do is create a mapping table
for synonyms so that you know that they refer to the same country,
e.g. USA => United States of America


import re

sentences = [
    "sfdfdsfd New Zealand sdfdsf",
    "yhj53434 India sdfdsf",
    "45654 The United States of America sdfdsf",
    "no match here"
]

countries = [
    "New Zealand", "India",
    "The United States of America", "USA", # Be careful with "US"
]

# Escape each name so that characters such as "." are matched literally
re_countries = re.compile(r"\b(%s)\b" % "|".join(re.escape(c) for c in countries))

for sentence in sentences:
    match = re_countries.search(sentence)
    if match:
        # Parenthesised so that it runs on both Python 2 and Python 3
        print(match.groups())
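
The synonym mapping table mentioned above could look roughly like the
sketch below. The SYNONYMS table, its entries, and the find_countries
helper are all made up for illustration; in practice you would fill
the table from whatever country name dataset you settle on. Sorting
the names longest-first is just a precaution so that a longer spelling
is tried before a shorter one that it contains.

```python
import re

# Hypothetical synonym table: every spelling maps to one canonical name.
SYNONYMS = {
    "New Zealand": "New Zealand",
    "India": "India",
    "The United States of America": "United States of America",
    "United States of America": "United States of America",
    "USA": "United States of America",
}

# Longest spellings first, so the alternation tries the fullest name first.
_names = sorted(SYNONYMS, key=len, reverse=True)
RE_COUNTRIES = re.compile(r"\b(%s)\b" % "|".join(re.escape(n) for n in _names))


def find_countries(text):
    """Return the canonical name of every country found in text, in order."""
    return [SYNONYMS[m.group(1)] for m in RE_COUNTRIES.finditer(text)]


if __name__ == "__main__":
    for blob in ["Aecom (New Zealand)", "Acme USA Inc", "no match here"]:
        print("%s -> %s" % (blob, find_countries(blob)))
```

Using finditer rather than search means that a blob which mentions more
than one country returns all of them, and an empty list signals that
nothing matched.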

On 13 June 2014 17:54, Thomas Levine <_ at thomaslevine.com> wrote:
> I'm looking for a function or regular expression that finds country names in blobs of text.
> This can just be something that does a bunch of exact string matches so that it doesn't matter
> whether the source blob (company names in my case) is spelled "Aecom New Zealand Limited",
> "Aecom (New Zealand)", "Aecom, New Zealand", or "New Zealand". Has someone released something
> like this?
>
> If I don't see an answer soon, I'm going to write a regular expression that matches with a
> bunch of country names from some country name dataset.



-- 
Adi Eyal
Director
Code for South Africa
Promoting informed decision-making

phone: +27 78 014 2469
skype: adieyalcas
linkedin: http://za.linkedin.com/pub/dir/Adi/Eyal
web: http://www.code4sa.org
twitter: @soapsudtycoon

For more information on how to participate in the open data community
in South Africa, go to: http://www.code4sa.org/#community


