[okfn-labs] Find country names in blobs of unknown text

Thomas Levine _ at thomaslevine.com
Sat Jun 14 13:50:04 UTC 2014


Here's what I came up with.
http://dada.pink/dada/finding-country-names/
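
For reference, here is a minimal sketch of the regex approach combined with the synonym table suggested in the quoted message. It is not necessarily what the linked post does. The names SYNONYMS and find_countries and the table entries are assumptions made up for illustration; a real table would come from a country name dataset.

```python
import re

# Hypothetical synonym table: each spelling maps to one canonical name.
# "US" is deliberately left out because it is too ambiguous in free text.
SYNONYMS = {
    "New Zealand": "New Zealand",
    "India": "India",
    "United States of America": "United States of America",
    "United States": "United States of America",
    "USA": "United States of America",
}

# Try the longest spellings first, so that "United States of America"
# is preferred over its prefix "United States".
_spellings = sorted(SYNONYMS, key=len, reverse=True)

# re.escape keeps any punctuation in a name from being read as regex syntax.
_pattern = re.compile(r"\b(%s)\b" % "|".join(re.escape(s) for s in _spellings))


def find_countries(text):
    """Return the canonical country names found in text, in order of appearance."""
    return [SYNONYMS[m.group(1)] for m in _pattern.finditer(text)]


for blob in ["Aecom New Zealand Limited", "Aecom (New Zealand)",
             "Aecom, New Zealand", "Acme USA Inc", "no match here"]:
    print(blob, "->", find_countries(blob))
```

The word boundaries keep "USA" from matching inside a longer word, and matching stays case-sensitive, which is one reason a short code like "US" would still need special handling.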

On 13 Jun 20:03, Adi Eyal wrote:
> Not sure why you need anything smarter than a regex. You'll only need
> bigger guns if you need to do some fuzzy matching or something
> similar. You might come unstuck in marginal cases such as "US" but an
> entity extractor wouldn't do any better on short strings like the
> examples that you've given. I've pasted a quick example below; it could
> be prettier. One thing that you may want to do is create a mapping
> table for synonyms so that you know that they refer to the same
> country, e.g. USA => United States of America
> 
> 
> import re
> sentences = [
>     "sfdfdsfd New Zealand sdfdsf",
>     "yhj53434 India sdfdsf",
>     "45654 The United States of America sdfdsf",
>     "no match here"
> ]
> 
> 
> countries = [
>     "New Zealand", "India",
>     "The United States of America", "USA", # Be careful with "US"
> ]
> 
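> # Escape each name so that any punctuation in it is matched literally.
> re_countries = re.compile(r"\b(%s)\b" % "|".join(re.escape(c) for c in countries))
> 
> for sentence in sentences:
>     # search() returns the first country name found, or None.
>     match = re_countries.search(sentence)
>     if match:
>         print(match.groups())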
> 
> On 13 June 2014 17:54, Thomas Levine <_ at thomaslevine.com> wrote:
> > I'm looking for a function or regular expression that finds country names in blobs of text.
> > This can just be something that does a bunch of exact string matches so that it doesn't matter
> > whether the source blob (company names in my case) is spelled "Aecom New Zealand Limited",
> > "Aecom (New Zealand)", "Aecom, New Zealand", or "New Zealand". Has someone released something
> > like this?
> >
> > If I don't see an answer soon, I'm going to write a regular expression that matches with a
> > bunch of country names from some country name dataset.
> 
> 
> 
> -- 
> Adi Eyal
> Director
> Code for South Africa
> Promoting informed decision-making
> 
> phone: +27 78 014 2469
> skype: adieyalcas
> linkedin: http://za.linkedin.com/pub/dir/Adi/Eyal
> web: http://www.code4sa.org
> twitter: @soapsudtycoon
> 
> For more information on how to participate in the open data community
> in South Africa, go to: http://www.code4sa.org/#community


