[okfn-labs] what we are working on...
Matthias Schlögl
m.schloegl at bath.ac.uk
Thu Feb 13 10:21:47 UTC 2014
Dear Friedrich,
thanks very much for your reply.
@SNA tools: Your list already seems pretty complete to me. I would maybe add d3.js for online visualizations. There are two desktop analysis programs I am aware of that are not on your list. The first is Visone, which is developed at a German university and, though not very widely used, is my favorite for analyzing networks. It's written in Java and therefore platform-independent (pretty important for me, as I work on a Mac ;)). It's also much more intuitive than, for example, UCINET (which is, at least in my opinion, one of the worst pieces of software ever), and it can be linked to R (igraph) and Siena. Siena is also software you could add, although it's maybe a bit special (it's for building models of, e.g., longitudinal data). What you should definitely add is Pajek. It's developed at a Slovenian university, and most of the researchers I know who do SNA use it. It's not as easy to use as Visone or Gephi, but it is very powerful; unfortunately it's Windows-only. And you could maybe add Neo4j: since v2 it's easier to use, and there is a Gephi plugin that lets you import and analyze the whole database (or only a part of it). And there are of course a lot of R libraries (e.g. the already mentioned igraph; a tiny example is sketched below). If you are interested in NLP you should take a look at NLTK (a Python library), though you probably already know that ;).
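Just to illustrate the igraph point, here is a minimal sketch using the python-igraph port (the edge list is made up for the example): load a few co-membership edges and compute degree and betweenness for each person.

    import igraph as ig  # python-igraph

    # Made-up edge list: pairs of people who sit on the same board.
    edges = [
        ("Person A", "Person B"),
        ("Person B", "Person C"),
        ("Person A", "Person C"),
        ("Person C", "Person D"),
    ]

    g = ig.Graph.TupleList(edges, directed=False)
    for name, degree, betweenness in zip(g.vs["name"], g.degree(), g.betweenness()):
        print(name, degree, betweenness)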
SNA and text analysis: Your mock-up looks exactly like what I am after. I would like to have a tool that is able to locate basic entities (people, institutions, URLs etc.) in text and map the connections between them: A is the author of text B, which mentions institution C; or texts A and B both mention URL C. The idea behind it is that, when the tool is applied to sources from two different fields (e.g. think tanks and media), you get a kind of "map of influence". This map is of course not going to be a final answer, but it will give you hints about where to look a bit closer. Does that make sense?
The thing is, finding entities in texts is not easy at all. First of all you have the problem of stemming; also, a lot of institutions can be written in several different forms, etc. I am trying to overcome those difficulties with three techniques/tricks. First, the tool only gives suggestions about where an entity might be and what it could be; every suggestion has to be confirmed manually. Second (and this might sound a bit crazy at first ;)), I am putting whole texts directly into the graph. That means that every word (in its stemmed version) is a node in the graph. That way it is not only easy to find similar sentences throughout the whole corpus, but so-called bigrams (pairs of adjacent words) are easily observable as well; words that occur together very often are either good key-phrases or entities (see the small sketch below). Third, I am also using AlchemyAPI to get suggestions for entities. It has a Python SDK and is really easy to use (for non-profit projects you get 30,000 API calls per day for free). However, the documentation is unfortunately not very extensive, so it is a kind of black-box system: you don't really know how they are processing your requests.
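To make the bigram part more concrete, here is a minimal sketch of the general technique (not the actual tool) using NLTK: stem the tokens of a text and count adjacent pairs, so that frequently co-occurring pairs surface as candidate key-phrases or entity names.

    from collections import Counter

    import nltk
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize

    nltk.download("punkt")  # tokenizer models, only needed once

    # Toy text standing in for a real document from the corpus.
    text = "The Science Media Centre published an article. The Science Media Centre quoted an expert."

    stemmer = PorterStemmer()
    # Tokenize, lower-case and stem every word, dropping punctuation tokens.
    tokens = [stemmer.stem(t.lower()) for t in word_tokenize(text) if t.isalpha()]

    # Count bigrams; the most frequent ones hint at key-phrases or entities.
    bigram_counts = Counter(nltk.bigrams(tokens))
    print(bigram_counts.most_common(5))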
In a former project we used a list of Wikipedia page titles to find entities in texts, but again that needed a lot of manual tweaking in the end.
@OpenInterests: Do you really hard-copy the data into your system, or do you just connect to external APIs? And if you copy the data, how do you deal with it changing? I ask because our data on think tanks, for example, could be interesting (http://thinktanknetworkresearch.net/wiki_ttni_en/index.php?title=Main_Page), but it changes a lot (at least it should). E.g. we are going to add/edit the data on Austrian think tanks within the next few months, and we are actually working on South American think tanks at the moment. I guess our think tank data is the only dataset we can publish; we would get copyright problems with the other datasets.
BTW: I just realized that you are also behind dataset (the Python library). For someone like me who writes a lot of small data-mining/data-cleansing scripts in Python, that library is very helpful!
Kind regards,
Matthias
On 12 Feb 2014, at 17:19, Friedrich Lindenberg <friedrich at pudo.org> wrote:
> Hey Matthias,
>
> it’s fascinating to hear about the work you’re doing - thanks very much for the comprehensive write-up! As a first ask, you seem to know a lot of approaches and tools in this space - I’m wondering whether you would be interested in contributing to the SNA tool survey that I’ve started (http://untangled.knightlab.com/readings/charting-social-network-analysis-tools.html). Also, it needs a better set of classification criteria :)
>
> I’m really interested to learn more about your work on text analysis, it would be cool to know whether you’ve actually used entity extraction techniques to generate graphs? What other connections do you see between NLP and SNA? I’ve been very interested in combining narrative and structured elements in graphs, e.g. this mock-up: http://opendatalabs.org/misc/demo/grano/_mockup/.
>
> It would also be interesting to see whether your work could relate to OpenInterests directly: do you think it might make sense to import any of the datasets you have collected into the site? I’m quite eager to combine different sources of information; as long as they are clearly attributable, it should make for interesting overlaps.
>
> I’m also delighted to see that you’ve already come across grano ;) The project is really new (and, to be honest, not that professional), and therefore also still quite malleable in its design - so if there is any way we could make it useful to your project, I’d be delighted to explore those options! It doesn’t actually use a graph database, mostly because I wanted a data schema that fully traces each fact’s source and attribution - in a way it’s more about collecting “evidence” than just making a graph.
>
> My greatest challenge with the project is not so much about data mining at all. OpenInterests, for example, already has a significant amount of well-structured information available. The issue then is: what can I let users like investigative reporters or researchers do with this, so that it is actually an everyday tool rather than just a big bucket of stuff that one stumbles across occasionally?
>
> The answer will probably technically boil down to graph algos, list-making and aggregate reporting of some sort - but that isn’t the layer on which we can expect these groups of users to work. So there has to be an intermediate language that actually defines human activities, which is probably also going to be fairly domain-specific. Not sure this makes a lot of sense, but if you had any pointers to work in that direction, I would really appreciate it!
>
> For the combination of technologies that you’re using (Neo4j/Django), I’d also consider having a look at detective.io, Journalism++’s brainchild - an open-source platform aimed at journalists with no data-modelling skills.
>
> All the best,
>
> - Friedrich
>
>
> On 12 Feb 2014, at 16:17, Matthias Schlögl <m.schloegl at bath.ac.uk> wrote:
>
>> Hi,
>>
>> my name is Matthias Schlögl. I did my master's degree at the University of Vienna, and I am currently working as a research assistant and PhD student at the University of Bath (Social & Policy Sciences Department, Prof. David Miller), as well as at the Commission for Development Research in Vienna. Additionally, I am involved in some smaller projects (e.g. for the Arbeiterkammer, the official representation of employees in Austria).
>>
>> Jonathan Gray asked me to post something on what we are working on at the moment to this list. If you are interested in something in greater detail (e.g. the technical side) please don't hesitate to contact me.
>>
>> Basically speaking, we/I are trying to map networks (power relations) and find patterns in them. Recently, however, we have also started to include the texts themselves in our research efforts.
>> I have to say right at the beginning that I am a social scientist by training and never really learned programming in school or at university. However, while still at university I started to recognize the power of these new technologies and taught myself to program a bit (HTML, JavaScript, Python).
>>
>> Technically speaking we have several different projects:
>> We have two semantic wikis, of which one is publicly available: http://thinktanknetworkresearch.net/wiki_ttni_en/index.php?title=Main_Page It's about think tank networks and holds information on about 400 think tanks. Dieter Plehwe, Werner Krämer and I started the project at the Social Science Research Centre Berlin some years ago; unfortunately, the information is still not equally good for all think tanks. As it is a semantic wiki you can query it like a database, and it is easy to export data in various formats (e.g. JSON, CSV, etc.). I wrote some Python scripts that can, for example, automatically generate interlocking-directorate data and export files suitable for social network analysis programs like Visone, Pajek or Gephi (a rough sketch of this is below).
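>> Very roughly, the export works along these lines; the property name in the "ask" query is invented for the example, and the real scripts do more cleaning and also project the two-mode data into actual interlocks:
>>
>>     import requests
>>
>>     API = "http://thinktanknetworkresearch.net/wiki_ttni_en/api.php"
>>
>>     # Semantic MediaWiki exposes an "ask" API module; the property name
>>     # "Board member of" is just an example here.
>>     params = {"action": "ask", "format": "json",
>>               "query": "[[Category:Person]]|?Board member of"}
>>     results = requests.get(API, params=params).json()["query"]["results"]
>>
>>     # Collect person -> think tank edges from the query printouts
>>     # (page-type printout values carry the page title under "fulltext").
>>     edges = []
>>     for person, data in results.items():
>>         for tank in data["printouts"].get("Board member of", []):
>>             edges.append((person, tank["fulltext"]))
>>
>>     # Write a minimal Pajek .net file that Visone, Pajek or Gephi can open.
>>     names = sorted({n for edge in edges for n in edge})
>>     index = {n: i + 1 for i, n in enumerate(names)}  # Pajek vertex IDs are 1-based
>>     with open("interlocks.net", "w", encoding="utf-8") as f:
>>         f.write("*Vertices %d\n" % len(names))
>>         for n in names:
>>             f.write('%d "%s"\n' % (index[n], n))
>>         f.write("*Edges\n")
>>         for a, b in edges:
>>             f.write("%d %d\n" % (index[a], index[b]))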
>> We use forms to enter the data into the wiki, which makes it possible for researchers who have no clue about wiki markup to edit and add information very easily.
>>
>> Additionally, we have conducted several data-mining/scraping projects over recent years. We scraped, for example, the Science Media Centre articles and the WEF contributors database up to 2009 (about 14,000 people who have spoken at the WEF). We also scanned a printed version of the Trade Associations Directory (2007) and used OCR to generate a database of trade associations and their members (only in the field of addiction industries).
>>
>> We mainly use Social Network Analysis and Natural Language Processing techniques (e.g. TF-IDF to compute keywords for the SMC articles) to analyze our data; a small sketch of the keyword step is below.
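>> Here is a minimal sketch of that TF-IDF keyword step, using scikit-learn's TfidfVectorizer on a made-up mini-corpus rather than the real SMC articles:
>>
>>     from sklearn.feature_extraction.text import TfidfVectorizer
>>
>>     # Made-up mini-corpus standing in for the scraped SMC articles.
>>     articles = [
>>         "expert reaction to a study on alcohol marketing",
>>         "expert reaction to new data on tobacco industry lobbying",
>>         "scientists comment on sugar consumption and health",
>>     ]
>>
>>     vectorizer = TfidfVectorizer(stop_words="english")
>>     tfidf = vectorizer.fit_transform(articles).toarray()
>>     terms = vectorizer.get_feature_names_out()
>>
>>     # For each article, print the three terms with the highest TF-IDF weight.
>>     for i, row in enumerate(tfidf):
>>         top = sorted(zip(row, terms), reverse=True)[:3]
>>         print(i, [term for score, term in top if score > 0])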
>>
>> At the moment we have (in addition to the data we collected ourselves) data from several different sources, and a system for collaboration (the wiki) that, though it works fine for what we are doing right now, is likely to run into performance issues in the future. The semantic wiki is just not built for the complex networks we are trying to map. So I am working (also for my PhD) on a system that should be capable of combining the data we have right now and supporting future projects. Funnily enough, like grano (the system used for openinterests.eu) it uses a graph database, though it is of course not as professional as grano.
>> It's built on Neo4j (a graph database) and Django (a web framework). It holds three different kinds of entities (nodes): people, institutions and texts. I try to automatically process texts added to the database and find connections within them using regular expressions, e.g. URLs, names, figures, etc. (a rough sketch of this step is below). Once running, it should automatically check RSS feeds for new entries and add them to the database.
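>> The extraction step is roughly along these lines; the patterns are simplified examples, the feed URL is a placeholder, and feedparser is just one option for polling RSS:
>>
>>     import re
>>
>>     import feedparser  # one way to poll RSS feeds
>>
>>     URL_RE = re.compile(r"https?://\S+")
>>     # Very naive name candidate: two to four capitalised words in a row.
>>     NAME_RE = re.compile(r"\b(?:[A-Z][a-z]+ ){1,3}[A-Z][a-z]+\b")
>>     FIGURE_RE = re.compile(r"\d[\d.,]*(?:\s*(?:%|million|billion))?")
>>
>>     def extract_candidates(text):
>>         # Candidates only; they still have to be confirmed manually.
>>         return {
>>             "urls": URL_RE.findall(text),
>>             "names": NAME_RE.findall(text),
>>             "figures": FIGURE_RE.findall(text),
>>         }
>>
>>     feed = feedparser.parse("http://example.org/feed.rss")  # placeholder feed URL
>>     for entry in feed.entries:
>>         print(entry.title, extract_candidates(entry.get("summary", "")))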
>>
>> I hope this shows a bit of what I/we are doing. I plan to go to the Dataharvest+ conference in Brussels, so if someone is interested we could meet there and talk about the projects in greater detail.
>>
>> Kind regards,
>>
>> Matthias
>> _______________________________________________
>> okfn-labs mailing list
>> okfn-labs at lists.okfn.org
>> https://lists.okfn.org/mailman/listinfo/okfn-labs
>> Unsubscribe: https://lists.okfn.org/mailman/options/okfn-labs
>