My wonderful friend Per has been asking me my opinion about ‘big data’.
I have two responses. One response is to the strident talk about its novelty. As computers get faster and handle more data, the data sets can get larger. Nevertheless, I have only seen one statistical technique used to examine the data. It is to observe the ‘variance’ in the numbers. When two or more sets of data vary together, the statistics allow one to examine the parallels in the numerical structure. That last phrase can be called correlation.
Variance is arithmetically simple. Take the numbers 2,3,4 and 3,3,3. They both have the same average, but when squared you get 4,9,16 for a total of 29 from the first set and 9,9,9 for a total of 27 from the second. That gap of 2 comes entirely from the spread in the first set. Now take the paired results: 2x3=6, 3x3=9 and 4x3=12 for a total of 27, which squared is 729. While 2x3 squared is 36, 3x3 squared is 81 and 4x3 squared is 144 for a total of 261. The difference between the sum of the squared pairs (261) and the square of the summed pairs (729), once each is scaled by the number of items (261/3 = 87 against 729/9 = 81), is variance.
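For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It assumes the population definition of variance (the average of the squares minus the square of the average). The function names are my own for illustration and do not come from any library.

```python
from fractions import Fraction

def mean(values):
    # Exact average, kept as a fraction to avoid rounding noise.
    return Fraction(sum(values), len(values))

def variance(values):
    # Population variance: average of the squares minus square of the average.
    return Fraction(sum(v * v for v in values), len(values)) - mean(values) ** 2

def covariance(xs, ys):
    # Population co-variance: average of the products minus product of the averages.
    products = [x * y for x, y in zip(xs, ys)]
    return Fraction(sum(products), len(products)) - mean(xs) * mean(ys)

first = [2, 3, 4]
second = [3, 3, 3]
paired = [x * y for x, y in zip(first, second)]  # 6, 9, 12

print(sum(v * v for v in first), sum(v * v for v in second))
print(variance(first), variance(second))
print(variance(paired), covariance(first, second))
```

The second set never moves off its average, so its variance is zero and it cannot co-vary with anything.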
Big data is a matter of looking through all the pairs of variables in a data set for the ones that show a high co-variance, then seeing if that high co-variance has a meaning.
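That search over pairs can be sketched roughly as follows. This is an illustration under my own assumptions, not a description of what any particular big data system does. I scale the co-variance into a correlation between -1 and 1 so that columns measured in different units can be compared. The 0.8 cutoff is arbitrary, and the toy columns are made up purely to show the mechanics.

```python
from itertools import combinations
from math import sqrt

def covariance(xs, ys):
    # Population co-variance: average product of deviations from the means.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    # Co-variance scaled by both spreads, giving a value between -1 and 1.
    sx, sy = sqrt(covariance(xs, xs)), sqrt(covariance(ys, ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column cannot co-vary with anything
    return covariance(xs, ys) / (sx * sy)

def strongest_pairs(columns, threshold=0.8):
    # Try every pair of columns and keep those above the (arbitrary) cutoff.
    hits = []
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        r = correlation(col_a, col_b)
        if abs(r) >= threshold:
            hits.append((name_a, name_b, r))
    return sorted(hits, key=lambda hit: abs(hit[2]), reverse=True)

# Made-up toy data, for illustration only.
toy = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [2, 5, 1, 4, 3],
    "d": [3, 3, 3, 3, 3],
}
pairs = strongest_pairs(toy)
for name_a, name_b, r in pairs:
    print(name_a, name_b, round(r, 3))
```

The code only finds the pairs. Whether a pair means anything is still a question for a person with a theory.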
I once did this on the data for a few hundred people for whom I had their astrological sign and about 50 other pieces of miscellaneous information. The only connection, the only high co-variance I found, was that some astrological signs were associated with people who listened to one radio station in San Francisco. Nothing of meaning. No astrological sign correlated with prison time, eating patterns or voting practices.
Big data doesn’t change anything. It still has to be related to some theory of human behavior. Like listening to rap connects with prison time and connects with dropping out of high school, but makes no sense when connected to preferring boxer shorts over low tennis socks.
The second issue is whether big data can be useful to ‘big brother’. I call this the issue of technology and people. The big brother countries, North Korea, China, Iran and Cuba, don’t need technology to increase big brother. Their systems work fine now and have for a century. France has always had a big brother postage system and still does, and it doesn’t protect the French from regular and violent terrorism. The same is true for the U.S.
Big data, and the NSA is as big as data can get, hasn’t prevented the killing of 155 Americans since 2003 in 65 incidents of terrorism. Go ahead and count them yourself. Also count the 15,000 Americans murdered annually, about whom big brother obviously doesn’t care since half are blacks killed by other blacks.
The real issue is that big data is a big worry for Lefties and other paranoid people. If it works, it is trivial in its effectiveness. That is partly because most English-speaking countries lack a civilian population that supports big brother. Most importantly, it is because nobody has a decent model of what type of data indicates a terrorist or any other kind of socially dangerous person.