Data in the Wild

In the Boston BarCamp session called Public Data, one of the opening statements stood out to me: “Data in the wild is dirty.”

There is an overwhelming amount of public data that could and should be processed and analyzed. However, there is as yet no real set of standards for organizing that data into a machine-readable, parsable format.

Wikipedia is, of course, everybody’s favorite example of an enormous amount of data that is meticulously maintained. Who would have ever expected Wikipedia to work? Yet, in general, truth is up-modded on the internet. This means that Wikipedia is self-healing. Literally.

But does our society have the capacity, as a whole, to participate in this self-healing process?

Better yet, is our society willing to participate?

At dinner on Saturday night, @mattknox told us that if Americans took all of the time they spend watching television in one year and put those hours towards editing Wikipedia instead, we would be able to create 2,000 Wikipedias. In one year.

Alas. So much public data is freely available, yet we (yes, the big, collective, internet “we”) still need to figure out ways to harness its power and acquire the knowledge contained within it.