[okfn-labs] SEC EDGAR scraping experience?
rufus.pollock at okfn.org
Wed Oct 1 09:35:10 UTC 2014
On 30 September 2014 21:50, Friedrich Lindenberg <friedrich at pudo.org> wrote:
> Hey Rufus (and labs :),
> I was browsing around for info about scraping the SEC’s EDGAR database and
> delighted to see that some of the first results were your work on it.
> I’m thinking about looking into that data casually, and I was
> wondering whether you might have some help for me on a few questions:
> 1) Do you have any sense how large a full scrape of the data (the XML
> portion at least) might be?
I think it is pretty large, but I'm not absolutely sure. I think that
public.resource.org might have quite a bit already done here for old stuff
(pre 2001 IIRC) - https://bulk.resource.org/edgar/
> 2) Did you ever play with any of the available parsers for the actual SGML
> filings?  looks like this might be quite traumatic to the untrained
I'm not totally clear on the SGML vs XBRL stuff - I was focused on getting
more of the "data", so I concentrated on XBRL (however, the mention of pysec
in the SO comments suggests that most libraries may do both). I had a very short
library review here:
I also note that the lady behind RankAndFiled.com must have done some
pretty good stuff (though AFAICT none of it is open source).
> 3) Similarly, did you ever try out any of the Python tooling for XBRL?
Yes, and I actually managed to get one working. See
https://github.com/datasets/edgar/tree/master/scripts (pull requests
welcome ;-) ...)
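
For anyone wanting a feel for the XBRL side, here is a minimal sketch using
only the Python standard library. It is not the code in the repository above:
the sample document, the placeholder CIK and the function name extract_facts
are all made up for illustration, and real filings are far larger and messier
(multiple contexts, dimensions, nil facts, footnotes).

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed XBRL instance document (assumption: real
# filings follow the same basic shape but carry thousands of facts).
SAMPLE = """<?xml version="1.0"?>
<xbrl xmlns="http://www.xbrl.org/2003/instance"
      xmlns:us-gaap="http://fasb.org/us-gaap/2013-01-31">
  <context id="FY2013">
    <entity>
      <identifier scheme="http://www.sec.gov/CIK">0000000000</identifier>
    </entity>
    <period>
      <startDate>2013-01-01</startDate>
      <endDate>2013-12-31</endDate>
    </period>
  </context>
  <unit id="USD"><measure>iso4217:USD</measure></unit>
  <us-gaap:Revenues contextRef="FY2013" unitRef="USD"
                    decimals="0">1000</us-gaap:Revenues>
  <us-gaap:NetIncomeLoss contextRef="FY2013" unitRef="USD"
                         decimals="0">250</us-gaap:NetIncomeLoss>
</xbrl>
"""


def extract_facts(xml_text):
    """Return one dict per top-level fact in an XBRL instance document."""
    root = ET.fromstring(xml_text)
    facts = []
    for el in root:
        context = el.get("contextRef")
        if context is None:
            # Not a fact: contexts, units and schema references land here.
            continue
        # ElementTree spells qualified tags as "{namespace}localname".
        if el.tag.startswith("{"):
            namespace, _, concept = el.tag[1:].partition("}")
        else:
            namespace, concept = "", el.tag
        facts.append({
            "concept": concept,
            "namespace": namespace,
            "context": context,
            "unit": el.get("unitRef"),
            "value": (el.text or "").strip(),
        })
    return facts


if __name__ == "__main__":
    for fact in extract_facts(SAMPLE):
        print(fact["concept"], fact["value"], fact["unit"])
```

This only reads the facts themselves; resolving each contextRef to its
reporting period and entity would be the obvious next step.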
> Having asked (3), I will now drink to forget and wish you all a pleasant
Building a decent archive of EDGAR filings data would be really useful and
something I think lots of folks including myself would be excited to
contribute to :-) (it was partly why I started that work ...)
> - Friedrich
>  http://okfnlabs.org/blog/2014/03/04/sec-edgar-database.html
>  https://github.com/datasets/edgar
>  https://stackoverflow.com/questions/13504278/parsing-edgar-filings