[okfn-discuss] Fwd: Twitter Scrape
Jonathan Gray
jonathan.gray at okfn.org
Mon Dec 22 17:36:03 UTC 2008
I thought people might be interested to see this. (From flip at
infochimps to theinfo list...)
J.
---------- Forwarded message ----------
Hey y'all,
I've gathered a massive scrape of the Twitter friend graph: about 2.7M
users (and slowing, meaning I'm starting to find the edge), 10M
tweets, 58M edges, with pretty-near complete edge data for users with
more than a dozen followers.
Big huge thanks to twitter.com who have given permission to share this
freely. Please go build tools with this data that make both
twitter.com and yourself rich and famous. This will convince more
corporations to free their data.
I'm just pushing what I have up there and have _not_ done all the
double-checking I'd like. I'm getting out of town for a week, though,
and I thought people might like to play with it even rough as it is.
I'll do a better release after new year's.
The scrape has full metadata on users and relationships, and I've
calculated pagerank for the users. I'll post data on the 2-cliques and
local density when I get back.
Happy festivus,
flip
PS If anyone can point me towards map-reduce algorithms to efficiently
* calculate all-pairs distances (or ballpark it)
* assign clusters
at this scale please advise.
=========================================================
NOTES
=========================================================
I've posted it at
http://infochimp.info/ics/data/arch/social/network/twitter_friends/
username: 'theinfo.org'
... the password is the ramanujan taxicab number followed by the word
'kennedy', all one word. If that doesn't work or doesn't make sense
email me.
Approx. cardinality:
   8 068 820  twitter_user_partial.tsv   # partial users as found in other users' tweets
   2 675 458    ...                      # ... giving info on 2675458 unique users
   2 173 417  twitter_user.tsv           # full user records
   2 168 569  twitter_user_profile.tsv
   2 168 739  twitter_user_style.tsv
     219 388  hashtag.tsv                # hashtags collected from tweets
  58 010 471  a_follows_b.tsv            # following relationships
  10 168 919  tweet.tsv                  # unique tweets
   2 071 290  tweet_url.tsv              # urls collected from tweets
   2 997 735  a_atsigns_b.tsv            # all @atsigns collected from tweets (anywhere in tweet, but screen_name only and not threaded)
   2 494 807  a_replied_b.tsv            # @replies collected from tweets (only those that appear first, but threaded)
  90 542 155  total
(the user_partial thing: when you ask for a user's following / friends
list, or in the public timeline tweets, you get a partial listing of
each user. I've kept, for these partial users, each unique state
observed. So if <mrflip> was seen on the 10th, the 15th, and the 16th
and had (everything else the same) 80, 80 and 82 followers resp.
you'll get the user_partial records of the 10th and the 16th.)
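A minimal sketch of that "keep each unique state" rule, in Python. The record shape here (three fields, a sortable timestamp string, user id 1) is hypothetical and only stands in for the real user_partial columns; this is not the code that produced the files.

```python
from typing import Iterable, Iterator, NamedTuple


class PartialObservation(NamedTuple):
    # Hypothetical, cut-down record: the real files carry more attributes.
    scraped_at: str       # sortable timestamp string, e.g. "20081210"
    user_id: int
    followers_count: int


def unique_states(observations: Iterable[PartialObservation]) -> Iterator[PartialObservation]:
    """Yield the earliest observation of each distinct state."""
    seen = set()
    for obs in sorted(observations, key=lambda o: o.scraped_at):
        state = obs[1:]   # every field except scraped_at
        if state not in seen:
            seen.add(state)
            yield obs


# The example from the text: seen on the 10th, 15th and 16th
# with 80, 80 and 82 followers.
observations = [
    PartialObservation("20081210", 1, 80),
    PartialObservation("20081215", 1, 80),
    PartialObservation("20081216", 1, 82),
]
kept = list(unique_states(observations))
print([o.scraped_at for o in kept])
```

The 15th is dropped because nothing but the timestamp differs from the 10th.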
===========================================================
Layout of each file (all tab-delimited):
# class_name [key_field] [scraped_at] ... attributes ...
TwitterUserPartial [:id], :id, :screen_name, :followers_count,
    :protected, :name, :url, :location, :description, :profile_image_url
TwitterUser [:id], :id, :screen_name, :created_at,
    :statuses_count, :followers_count, :friends_count, :favourites_count,
    :protected
TwitterUserProfile [:id], :id, :name, :url, :location, :description,
    :time_zone, :utc_offset
TwitterUserStyle [:id], :id, :profile_background_color,
    :profile_text_color, :profile_link_color,
    :profile_sidebar_border_color, :profile_sidebar_fill_color,
    :profile_background_image_url, :profile_image_url,
    :profile_background_tile
Tweet [:id], :id, :created_at, :twitter_user_id, :text,
    :favorited, :truncated, :tweet_len, :in_reply_to_user_id,
    :in_reply_to_status_id, :fromsource, :fromsource_url
AFollowsB [:user_a_id, :user_b_id], :user_a_id, :user_b_id
ARepliedB [:user_a_id, :user_b_id, :status_id], :user_a_id,
    :user_b_id, :status_id, :in_reply_to_status_id
AAtsignsB [:user_a_id, :user_b_name, :status_id], :user_a_id,
    :user_b_name, :status_id   # note we have no user_b_id for @foo
Hashtag [:user_a_id, :hashtag], :user_a_id, :hashtag, :status_id
TweetUrl [:user_a_id, :tweet_url], :user_a_id, :tweet_url, :status_id
Pagerank files are
user_id pagerank ids_that_user_follows
(and 'dummy' if we haven't gotten their follower list, or if that list
is empty.)
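For anyone reading the pagerank files, here is a rough Python sketch. It assumes the three parts are whitespace-separated and that the followed ids are themselves whitespace-separated; check that against the actual files, since the layout above doesn't say. The function names are made up for illustration. The normalization follows the note below (divide by N and take the log); natural log is an assumption.

```python
import math


def parse_pagerank_line(line):
    """Split one line into (user_id, pagerank, followed_ids).

    Assumes whitespace-separated fields; 'dummy' or nothing in the
    third position means no follow list is available.
    """
    fields = line.split()
    user_id = int(fields[0], 10)      # base 10, so zero-padding is harmless
    pagerank = float(fields[1])
    rest = fields[2:]
    if rest in ([], ["dummy"]):
        followed = []
    else:
        followed = [int(x, 10) for x in rest]
    return user_id, pagerank, followed


def log_normalized(pagerank, n_users):
    """Divide the raw pagerank by N and take the (natural) log."""
    return math.log(pagerank / n_users)


print(parse_pagerank_line("000000000072\t1.5\tdummy"))
print(parse_pagerank_line("12\t3.0\t5\t7"))
```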
===========================================================
Notes:
* If you use the .tsv form make sure your language doesn't interpret the
zero-padded twitter_user.id of '000000000072' as octal (which gives 58, not 72).
* I think that there may be inconsistent user data for all-numeric
screen_names: see
http://code.google.com/p/twitter-api/issues/detail?id=162
That is, I think that the data in this scrape may commingle information about
the user having screen_name '415' with that of the user having id
#415. Not much I can do about it, but I plan to scrub that data later.
* Watch out for some ill-formed screen_names: see
http://code.google.com/p/twitter-api/issues/detail?id=209
* For the parsed data: act as if I've double-checked none of this. If you
have questions please ask, though.
* The scraped files (ripd-_xxxxxxxx.tar.bz2) *are* in their final form, and are
exactly as they came off the server.
* pagerank is non-normalized -- divide by N and take the log.
* The files are huge, and the ripd-_xxxx directories will make your
filesystem cry; I recommend Hadoop.
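To illustrate the octal warning in the first note, a small Python sketch. The helper names are hypothetical, and the 12-digit width is only inferred from the single example id above.

```python
def parse_user_id(field: str) -> int:
    """Parse a zero-padded id explicitly as base 10."""
    return int(field, 10)


def format_user_id(user_id: int, width: int = 12) -> str:
    """Re-pad an id to the width seen in the example (assumed to be 12)."""
    return format(user_id, f"0{width}d")


raw = "000000000072"
print(parse_user_id(raw))    # base 10: the intended value
print(int(raw, 8))           # base 8: what an octal-happy parser would give
print(format_user_id(72))
```

In Python, int() with an explicit base of 10 is safe; the trap is in languages or tools that guess the base from the leading zero.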