[openbiblio-dev] updates from JISC open bib project
Ben O'Steen
bosteen at gmail.com
Fri May 13 12:20:13 UTC 2011
Some updates and a life lesson:
Lesson:
Don't try to upload a batch of files to archive.org all at the same
time, as any failure, timeout or hiccup will cause them to discard the
whole upload, regardless of how close to 99% you are.
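
One way to act on that lesson is to send the files one at a time and
retry only the file that failed, so a single hiccup never throws away
the rest. A minimal sketch, assuming a hypothetical upload_one(path)
callable that raises an exception on failure (it stands in for whatever
actually talks to archive.org; it is not a real archive.org API):

```python
import time


def upload_sequentially(paths, upload_one, max_attempts=3, delay_seconds=0.0):
    """Upload files one by one, retrying each failed file on its own.

    upload_one is a hypothetical callable supplied by the caller; it is
    expected to raise an exception if the upload of that file fails.
    Returns the list of paths that still failed after max_attempts.
    """
    failed = []
    for path in paths:
        for attempt in range(1, max_attempts + 1):
            try:
                upload_one(path)
                break  # this file is done, move on to the next one
            except Exception:
                if attempt == max_attempts:
                    failed.append(path)  # give up on this file only
                else:
                    time.sleep(delay_seconds)  # brief pause before retrying
    return failed
```

Anything returned in the failed list can be fed straight back into the
same function later, without touching the files that already went up.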
- Re-uploading the files that failed to upload to archive.org while I
was asleep; currently adding file 11 of 17 (6 to go).
- Pulled out affiliations from medline files 630 -> 653 and am geocoding
the 2010 set from this. There are actually a terrifying number of
affiliations, so I am working on a rough clustering of them to cut down
on the numbers.
- Due to time constraints, I am skipping the clustering for now and
filtering down to just the affiliations that are repeated verbatim. This
still leaves us with 339593 unique affiliations (a number that
clustering may well reduce, as it's not unlikely that the same typo/form
of address has been entered more than once).
- Geocoding these as fast as I can... :)
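
For anyone wanting to reproduce the "repeated verbatim" filter, it
amounts to counting exact string matches and keeping those seen more
than once. A rough sketch, not the script actually used; the function
name, the min_count parameter and the sample strings are made up for
illustration:

```python
from collections import Counter


def repeated_verbatim(affiliations, min_count=2):
    """Return {affiliation: count} for strings seen at least min_count times.

    Matching is exact: no lowercasing, trimming or clustering is done,
    so near-duplicates (typos, variant address forms) stay separate.
    """
    counts = Counter(affiliations)
    return {text: n for text, n in counts.items() if n >= min_count}


# Made-up sample data, only to show the shape of the input and output.
sample = [
    "Dept of Chemistry, Example University",
    "Dept of Chemistry, Example University",
    "Dept. of Chemistry, Example University",
]
print(repeated_verbatim(sample))
```

Note that the third sample string differs only by a full stop and is
dropped, which is exactly the sort of case clustering would catch.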
Ben