So what is all the outsized hoo-ha about ‘big data’? The media present it as a boogeyman. They seem to use the phrase ‘big data’ to suggest that the corporate world has some new magical tool that understands human motivation and intrudes on individual decision-making. The same way Lefties used to talk about 'advertising'.
As my readers know, I've taught statistics, and I ran marketing research departments in the early 1960s. I used statistics on data within the Bank of America and other institutions to understand business. Bank of America had over 50 million accounts which I could examine freely. I did. And I had computers (very slow, with limited memory, but able to process hundreds of thousands of data points overnight). That was more than 40 years ago, and it was god damn big data. So what did I use it for?
I did mail surveys to find out how to improve customer service at branches. Individual branches had thousands of accounts. I used that data to improve the quality of services offered at individual branches, and to inform branch design.
The only other person in the bank who understood statistics was an economist. Jointly we analyzed the bad credit at Bankamericard. We took the applications of good users and compared them to the applications of users who ended up not paying their card balances. The bad users.
We ran correlation equations on large bodies of data. The two of us found significant differences between the applications of future good creditors and future bad creditors for Bankamericard. Out of this we developed the credit scoring system that is still used in consumer lending worldwide, eliminating millions of man-hours of ineffective credit screening by humans. We were far more accurate at separating good and bad potential customers than humans examining the same data, and we tested our results.
You are welcome. It was so good it lasted forty years. Based on big data.
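A minimal sketch of that idea in modern Python, assuming nothing about the actual Bankamericard model: weight each application field by its correlation with repayment, then rank new applicants by the weighted score. Every field name and number here is invented purely for illustration.

```python
# Hypothetical correlation-based credit scoring -- an illustration of the
# general technique, not the real Bankamericard system.

def mean(xs):
    return sum(xs) / len(xs)

def correlation(xs, ys):
    # Pearson correlation between one application field and the outcome.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Toy applications: (years_at_job, has_phone, prior_accounts).
# Outcome: 1 = paid the balance (good user), 0 = did not (bad user).
applicants = [
    (10, 1, 3), (8, 1, 2), (1, 0, 0), (12, 1, 4), (0, 0, 1), (2, 0, 0),
]
outcomes = [1, 1, 0, 1, 0, 0]

# Weight each field by how strongly it correlates with repayment.
n_fields = len(applicants[0])
weights = [
    correlation([a[i] for a in applicants], outcomes) for i in range(n_fields)
]

def score(applicant):
    # Higher score = application looks more like past good users.
    return sum(w * x for w, x in zip(weights, applicant))

# New applications are scored and ranked instead of screened by hand.
print(score((9, 1, 3)), score((1, 0, 0)))
```

In practice a cutoff score replaces the human screener: applications above it are approved automatically, which is where the saved man-hours come from.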
So what is the math of the data analysis?
It is always a reduction in variance. Computer analysis of data, then and today, comes down to comparing the total of the squared numbers against the square of their total. That one comparison underlies the computation of correlations and the selection of the most informative branch points on a decision tree.
The math is simple: 2+3+4 = 9, and 9 squared is 81. The same numbers squared before totaling give 4+9+16 = 29. If we did this for numbers with no variance, 3+3+3, the total would again be 9 and its square again 81, but the squared 3's would total only 27. Notice that 81 divided by the count of numbers, 3, is exactly 27. The gap between the total of squares and the squared total divided by the count, 29 − 27 = 2 for the first set and 27 − 27 = 0 for the set with no spread, is the measure of variance and is the core of all ‘big data’ analysis. All data analysis using correlations and branching is based on this simple comparison.
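That arithmetic can be checked in a few lines of Python. The function name `sum_sq_gap` is mine, used only for illustration; it computes the gap just described, which is zero exactly when the numbers have no spread.

```python
# Variance via the sum-of-squares identity:
# gap = (total of the squares) - (square of the total) / count

def sum_sq_gap(xs):
    n = len(xs)
    square_of_total = sum(xs) ** 2          # e.g. (2+3+4)^2 = 81
    total_of_squares = sum(x * x for x in xs)  # e.g. 4+9+16 = 29
    return total_of_squares - square_of_total / n

print(sum_sq_gap([2, 3, 4]))  # spread present: gap is positive
print(sum_sq_gap([3, 3, 3]))  # no spread: gap is zero
```

Dividing the gap by the count gives the textbook population variance; correlation formulas and decision-tree split criteria are assembled from exactly these sums.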
The special algorithms used by Netflix to pick your favorite films and by Google to guess what you are searching for are based on that simple difference in squared numbers.
Is that worth being scared about?