[okfn-labs] Find country names in blobs of unknown text
Thomas Levine
_ at thomaslevine.com
Sat Jun 14 13:50:04 UTC 2014
Here's what I came up with.
http://dada.pink/dada/finding-country-names/
On 13 Jun 20:03, Adi Eyal wrote:
> Not sure why you need anything smarter than a regex. You'll only need
> bigger guns if you need to do some fuzzy matching or something
> similar. You might come unstuck in marginal cases such as "US" but an
> entity extractor wouldn't do any better on short strings like the
> examples that you've given. I've pasted a quick example below, it could
> be prettier. One thing that you may want to do is create a mapping
> table for synonyms so that you know that they refer to the same
> country, e.g. USA => United States of America
>
>
> import re
>
> sentences = [
>     "sfdfdsfd New Zealand sdfdsf",
>     "yhj53434 India sdfdsf",
>     "45654 The United States of America sdfdsf",
>     "no match here",
> ]
>
> countries = [
>     "New Zealand", "India",
>     "The United States of America", "USA",  # Be careful with "US"
> ]
>
> # Escape each name so regex metacharacters in a name are matched literally.
> re_countries = re.compile(r"\b(%s)\b" % "|".join(map(re.escape, countries)))
>
> for sentence in sentences:
>     match = re_countries.search(sentence)
>     if match:
>         print(match.groups())
>
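The synonym mapping suggested above could be combined with the same word-boundary regex. What follows is only a minimal sketch under stated assumptions: the `SYNONYMS` table, its entries, and the `find_countries` helper are made up for illustration and do not come from any released package or country dataset. Matching stays case-sensitive, as in the quoted example.

```python
import re

# Hypothetical synonym table: each spelling maps to one canonical name.
# The entries are illustrative only, not a complete country dataset.
SYNONYMS = {
    "New Zealand": "New Zealand",
    "India": "India",
    "The United States of America": "United States of America",
    "United States of America": "United States of America",
    "USA": "United States of America",
}

# Put longer spellings first so they are tried before shorter ones at the
# same position; re.escape keeps any regex metacharacters literal.
_variants = sorted(SYNONYMS, key=len, reverse=True)
RE_COUNTRIES = re.compile(
    r"\b(%s)\b" % "|".join(re.escape(v) for v in _variants)
)


def find_countries(text):
    """Return the canonical name for every country spelling found in text."""
    return [SYNONYMS[m.group(1)] for m in RE_COUNTRIES.finditer(text)]


if __name__ == "__main__":
    for blob in ["Aecom (New Zealand)", "Acme USA and India", "no match here"]:
        print(blob, "->", find_countries(blob))
```

Because of the `\b` anchors, "USA" is not picked up inside a longer word such as "BUSAN", but a bare "US" would still need special handling, as noted above.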
> On 13 June 2014 17:54, Thomas Levine <_ at thomaslevine.com> wrote:
> > I'm looking for a function or regular expression that finds country names in blobs of text.
> > This can just be something that does a bunch of exact string matches so that it doesn't matter
> > whether the source blob (company names in my case) is spelled "Aecom New Zealand Limited",
> > "Aecom (New Zealand)", "Aecom, New Zealand", or "New Zealand". Has someone released something
> > like this?
> >
> > If I don't see an answer soon, I'm going to write a regular expression that matches with a
> > bunch of country names from some country name dataset.
>
>
>
> --
> Adi Eyal
> Director
> Code for South Africa
> Promoting informed decision-making
>
> phone: +27 78 014 2469
> skype: adieyalcas
> linkedin: http://za.linkedin.com/pub/dir/Adi/Eyal
> web: http://www.code4sa.org
> twitter: @soapsudtycoon
>
> For more information on how to participate in the open data community
> in South Africa, go to: http://www.code4sa.org/#community