[okfn-labs] SEC EDGAR scraping experience?
Friedrich Lindenberg
friedrich at pudo.org
Mon Oct 6 10:20:24 UTC 2014
Hey all,
so I’ve done a bit of hacking on this over the weekend, but unfortunately the overlap between what I need (metadata and full-text indexing) and what the existing data package focuses on (XBRL key figures) isn’t very large. In any case, I’ve made the code snippets available here: https://github.com/pudo/edgar
By the way, the resource.org snippets look really, really useful - thanks!
- Friedrich
On 01 Oct 2014, at 11:35, Rufus Pollock <rufus.pollock at okfn.org> wrote:
> On 30 September 2014 21:50, Friedrich Lindenberg <friedrich at pudo.org> wrote:
> Hey Rufus (and labs :),
>
> I was browsing around for info about scraping the SEC’s EDGAR database and delighted to see that some of the first results were your work on it [1], [2]. I’m thinking about looking into that data casually, and I was wondering whether you might have some help for me on a few questions:
>
> 1) Do you have any sense how large a full scrape of the data (the XML portion at least) might be?
>
> I think it is pretty large, but I'm not absolutely sure. I think that public.resource.org might already have quite a bit done here for the old stuff (pre-2001, IIRC) - https://bulk.resource.org/edgar/
>
> 2) Did you ever play with any of the available parsers for the actual SGML filings? [3] makes it look like this might be quite traumatic for the untrained explorer.
>
> I'm not totally clear on the SGML vs XBRL stuff - I was focused on getting more of the "data", so I concentrated on XBRL (however, the mention of pysec in the SO comments suggests that most libraries may do both). I wrote a very short library review here:
>
> https://github.com/datasets/edgar/tree/master/scripts#library-review
>
> I also note that the lady behind RankAndFiled.com must have done some pretty good stuff (however, AFAICT none of that is open source).
>
> 3) Similarly, did you ever try out any of the Python tooling for XBRL?
>
> Yes, and I actually managed to get one working. See https://github.com/datasets/edgar/tree/master/scripts (pull requests welcome ;-) ...)
>
> Having asked (3), I will now drink to forget and wish you all a pleasant evening.
>
> Building a decent archive of EDGAR filings data would be really useful and something I think lots of folks including myself would be excited to contribute to :-) (it was partly why I started that work ...)
>
> Rufus
>
> - Friedrich
>
>
> [1] http://okfnlabs.org/blog/2014/03/04/sec-edgar-database.html
> [2] https://github.com/datasets/edgar
> [3] https://stackoverflow.com/questions/13504278/parsing-edgar-filings