[ddj] Open data murder tracker in Trinidad and Tobago
dansonguyen at gmail.com
Mon Feb 3 21:00:37 UTC 2014
Hi Gerard, cool project and thank you for sharing it. Counting homicides
isn't easy and the data structure problems you ran into are common to most
such projects (Slate had a great writeup on tracking general gun deaths:
To solve your immediate problem, I would *do away with the "Toll"* column.
That column, as you've realized, will always fluctuate as murders become
classified/discovered/re-classified. Your spreadsheet should not have any
set order...because with a spreadsheet, you can reorder by column as you
wish, and you want to maintain that flexibility.
Now, in terms of creating a unique ID, I think your idea could work fine.
The key thing is that once you have set that unique ID *it must never
change*. It should be a value that has no real meaning, i.e. you'll never
show it outside of this spreadsheet. But ti's the key that you can use for
your own internal reference's sake.
An easy thing to do is to use this formula:
=CONCATENATE(TEXT(NOW(), "yymmddHHmmss"), "000", RANDBETWEEN(10000,99999))
This will take the current time and turn it into a string, which will be
unique enough if you enter in entries at a rate *slower* than one a second.
But just to make sure, we add a random 5 digit number to the end.
Now, this *unique id* should *never change*. So once you've generated this
number, *copy it, then Paste Special -> Paste Values only*
This will replace the *formula* that was in the cell with just the literal
value, i.e. that random number.
That's one way to do it. It doesn't even have to be that complicated, you
just want a reasonably good way of making sure all the numbers are
different, and the timestamp is reliable since your computer time (should)
never repeat itself.
- *You should maintain a field that simply records when you first entered
the data*. This is of course different than when the murder actually
happened. For even better record-keeping, you can have a second column that
you update with the current date anytime you make a change to an existing
* Tip*: If you click on a cell and then hit *Cmd-; *the current day
will be entered into the field
*- You should avoid keeping multiple tabs: *Instead of having separate tabs
for "unclassified", "classified", "police killings", etc., you should
into one sheet* and add a *new column* in which the value is either
"Classified", "Unclassified", etc. The most important thing about this is
that it keeps you from having to alter 3-5 different tables every time you
make a change to the headers. When you just need a certain kind of death,
you filter by that category column.
- *You should avoid removing rows - *If you take something out because of a
technicality...again, it's better to make a separate column...maybe
*that it is either *true/false. *
It's always easier to filter columns, and *extremely hard* to reconstruct
data. You want to reduce the amount of clicking you have to do to organize
data in order to minimize the mistakes you make.
Adrian Holovaty, of Django and ChicagoCrime.org, has a great series of
articles on this topic of data integrity, though he's talking about a much
larger data set with different constraints. But the insights and advice he
has may still be useful:
Hope that helps!
On Fri, Jan 31, 2014 at 1:09 PM, Gerard Best <gerardbest at gmail.com> wrote:
> Hello all,
> I'm on a team working on a open data journalism project called Bullet
> Points and I'm hoping you can lend your collective experience/expertise to
> the project.
> Bullet Points tracks murders in Trinidad and Tobago. The murders are
> tracked on a Google doc here: http://tinyurl.com/bulletpoints2014
> The fatal incidents are listed chronologically and are numbered by victim
> (murder toll). The victims are announced on Twitter (
> twitter.com/bulletpoints_) by their toll number. This system has some
> serious limitations.
> Firstly, it is geared toward tracking the murder victims but is an
> inadequate way of tracking the actual killings themselves. For example,
> when there is a double murder, the toll goes up by two but the number of
> incidents goes up by only one.
> Secondly, killings do not always occur in the same chronological order
> that the victims' bodies are discovered. Sometimes bodies of murder victims
> are discovered after several days. When this happens, the relevant killing
> must be inserted somewhere in the middle of the existing chronological
> list, which obviously changes the toll count of all subsequent victims.
> Unfortunately, because our current system relies on the toll count to
> identify the victim, it has now run into this serious limitation.
> PROPOSED SOLUTION
> I think that what I need to do is add another column that contains a
> unique reference number for each incident. The unique reference number can
> follow a standard protocol such as [DATE][INTEGER]. For example, the most
> recent murder, which was discovered today, would have a reference number
> 2014013001. I think this will deal with the problem but I am not sure it
> is the best solution.
> Thanks for taking the time to read this. If you think you can help, I
> would be grateful for your perspective on this problem and proposed
> Gerard Best
> data-driven-journalism mailing list
> data-driven-journalism at lists.okfn.org
> Unsubscribe: https://lists.okfn.org/mailman/options/data-driven-journalism
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the data-driven-journalism