[okfn-labs] SEC EDGAR scraping experience?

Rufus Pollock rufus.pollock at okfn.org
Wed Oct 1 09:35:10 UTC 2014


On 30 September 2014 21:50, Friedrich Lindenberg <friedrich at pudo.org> wrote:

> Hey Rufus (and labs :),
>
> I was browsing around for info about scraping the SEC’s EDGAR database and
> delighted to see that some of the first results were your work on it [1],
> [2]. I’m thinking about looking into that data casually, and I was
> wondering whether you might have some help for me on a few questions:
>
> 1) Do you have any sense how large a full scrape of the data (the XML
> portion at least) might be?
>

I think it is pretty large, but I'm not absolutely sure. I think
public.resource.org may already have done quite a bit here for older filings
(pre-2001, IIRC): https://bulk.resource.org/edgar/
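
One rough way to get a handle on the size question, as a sketch only: EDGAR
publishes index files listing its filings, so counting rows per form type
gives a filing count that can be multiplied by an average file size. The
pipe-delimited layout below is an assumption about that index format, the
sample rows are made up for illustration (not real companies or accession
numbers), and `count_forms` is a hypothetical helper name.

```python
from collections import Counter

# Made-up sample in the assumed "CIK|Company Name|Form Type|Date Filed|Filename"
# layout; a real run would read a downloaded index file instead.
SAMPLE = """\
CIK|Company Name|Form Type|Date Filed|Filename
--------------------------------------------------
1000001|EXAMPLE CORP|10-K|2014-03-04|edgar/data/1000001/0000000000-14-000001.txt
1000002|SAMPLE INC|8-K|2014-03-05|edgar/data/1000002/0000000000-14-000002.txt
1000001|EXAMPLE CORP|8-K|2014-03-06|edgar/data/1000001/0000000000-14-000003.txt
"""


def count_forms(index_text):
    """Count filings per form type in a pipe-delimited index listing."""
    counts = Counter()
    for line in index_text.splitlines():
        parts = line.split("|")
        # Data rows have exactly five fields and a numeric CIK;
        # this skips the column header and the dashed separator line.
        if len(parts) != 5 or not parts[0].strip().isdigit():
            continue
        counts[parts[2].strip()] += 1
    return counts


counts = count_forms(SAMPLE)
for form_type, n in counts.most_common():
    print(form_type, n)
```

Multiplying the counts for the form types of interest by a sampled average
file size would give only a ballpark figure, but it avoids downloading
anything beyond the index files themselves.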


> 2) Did you ever play with any of the available parsers for the actual SGML
> filings? [3] looks like this might be quite traumatic to the untrained
> explorer.
>

I'm not totally clear on the SGML vs XBRL distinction. I was focused on
getting more of the "data", so I concentrated on XBRL (however, the mention
of pysec in the SO comments suggests that most libraries may do both). I
wrote a very short library review here:

https://github.com/datasets/edgar/tree/master/scripts#library-review

I also note that the lady behind RankAndFiled.com must have done some
pretty good stuff (though AFAICT none of it is open source).

> 3) Similarly, did you ever try out any of the Python tooling for XBRL?
>

Yes, and I actually managed to get one working. See
https://github.com/datasets/edgar/tree/master/scripts (pull requests
welcome ;-) ...)
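
For a feel of what XBRL data looks like without any dedicated tooling, here
is a minimal sketch using only Python's standard library. It is not the
script in the repo above and not any XBRL library's API: it just relies on
the fact that an XBRL instance is XML in which reported facts carry a
`contextRef` attribute. The sample document, the `ex:` taxonomy namespace
and the `extract_facts` helper are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Tiny made-up instance document; real filings are far larger and use
# real taxonomies (e.g. US GAAP) rather than this example namespace.
SAMPLE = """<xbrl xmlns="http://www.xbrl.org/2003/instance"
      xmlns:ex="http://example.com/taxonomy">
  <context id="FY2013">
    <entity>
      <identifier scheme="http://www.sec.gov/CIK">0001000001</identifier>
    </entity>
    <period><instant>2013-12-31</instant></period>
  </context>
  <ex:Revenues contextRef="FY2013" unitRef="USD" decimals="0">1000</ex:Revenues>
  <ex:Assets contextRef="FY2013" unitRef="USD" decimals="0">2500</ex:Assets>
</xbrl>"""


def extract_facts(xml_text):
    """Return (name, context id, value) for every element with a contextRef."""
    root = ET.fromstring(xml_text)
    facts = []
    for el in root.iter():
        ctx = el.get("contextRef")
        if ctx is None:
            continue  # contexts, units and other non-fact elements
        # ElementTree tags look like "{namespace}LocalName"; keep the local part.
        name = el.tag.rsplit("}", 1)[-1]
        facts.append((name, ctx, (el.text or "").strip()))
    return facts


facts = extract_facts(SAMPLE)
for name, ctx, value in facts:
    print(name, ctx, value)
```

This ignores units, periods and taxonomy lookups entirely, which is where
most of the real work (and the value of a proper XBRL library) lies.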


> Having asked (3), I will now drink to forget and wish you all a pleasant
> evening.
>

Building a decent archive of EDGAR filings data would be really useful and
something I think lots of folks including myself would be excited to
contribute to :-) (it was partly why I started that work ...)

Rufus


> - Friedrich
>
>
> [1] http://okfnlabs.org/blog/2014/03/04/sec-edgar-database.html
> [2] https://github.com/datasets/edgar
> [3] https://stackoverflow.com/questions/13504278/parsing-edgar-filings