Peter Norvig gave a fascinating keynote presentation, “The Unreasonable Effectiveness of Data,” at the SDForum conference on “The Analytics Revolution” on April 9th, 2010. He focused on a major lesson learned at Google and elsewhere in recent years: data can be surprisingly effective, and using more of it often yields bigger performance gains than improving the algorithms themselves.
In contrast to Wigner’s “The Unreasonable Effectiveness of Mathematics in the Natural Sciences,” Norvig’s presentation pointed out that in biology, natural language, and other complex domains, it often does not pay to strive for elegant mathematical formulas or compact, simple models or theories. And it never pays to waste time trying for perfect models, because as George Box said, “…all models are wrong, but some are useful.” Relatively simple methods can often take advantage of ample data to build useful models. The resulting models may be relatively complex, but sometimes the data seems to demand this, and even more laborious methods for constructing models “by hand” may produce results that are at least as complex and more brittle. Peter showed an example of a rule base for spelling correction taken from HTDig, and it was strikingly complex. He pointed out that it would be difficult to extend that rule base to another language, whereas it would be relatively easy in a more data-driven approach: you would just need a lot of examples in the new language. Peter remarked that data-driven programming is the ultimate agile method.
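To make the contrast concrete, here is a minimal sketch in Python, in the spirit of the data-driven style Peter described rather than the HTDig rules or any actual Google code. The only “model” is a table of word frequencies counted from a corpus, and the corpus file name below is just a placeholder; porting the corrector to another language mostly means swapping in a large text in that language.

```python
# A minimal data-driven spelling corrector: word counts from a corpus
# stand in for hand-written correction rules. (Illustrative sketch only;
# "big.txt" is a placeholder for any large text in the target language.)
from collections import Counter
import re

def train(corpus_text):
    """Build a word-frequency model from raw text."""
    return Counter(re.findall(r"[a-z]+", corpus_text.lower()))

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word, counts):
    """Pick the known candidate with the highest corpus frequency."""
    candidates = ({word} & counts.keys()) or (edits1(word) & counts.keys()) or {word}
    return max(candidates, key=lambda w: counts[w])

counts = train(open("big.txt").read())  # any large corpus in the target language
print(correct("speling", counts))       # -> "spelling", if the corpus contains it
```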
In many cases three steps need to be taken: choosing a representation language, encoding a model in that language, and performing inference on the model. Peter summarized his recommended approach with the acronym DINO: Data In, Non-parametric model Out. Google’s Seti system, which uses machine learning to acquire models from massive data sets, is described by Simon Tong in the Google research blog post “Lessons Learned Developing a Practical Large Scale Machine Learning System.”
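As a toy illustration of what “Data In, Non-parametric model Out” can mean, here is a small k-nearest-neighbor sketch in which the model is nothing more than the stored examples themselves. The data and labels are made up for illustration, and this is not meant to depict the Seti system.

```python
# "Data In, Non-parametric model Out": a k-nearest-neighbor classifier whose
# model is just the training examples, queried directly at prediction time.
from collections import Counter
import math

def knn_predict(examples, query, k=3):
    """examples: list of (feature_vector, label); query: feature_vector."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = sorted(examples, key=lambda ex: dist(ex[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Data in ... (made-up examples)
examples = [((1.0, 1.1), "spam"), ((0.9, 1.0), "spam"),
            ((5.0, 4.8), "ham"), ((5.2, 5.1), "ham")]
# ... non-parametric model out: the prediction comes straight from the data.
print(knn_predict(examples, (1.2, 0.9)))  # -> "spam"
```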
Jaap Suermondt gave a counterexample later in the day in his closing keynote. In that example, unmanageable amounts of data needed to be processed to solve an optimization problem. It was a linear programming problem, but it turned out to be a special case with a more efficient solution. Even so, it was necessary to improve on that solution further for the particular problem at hand in order to get an answer in a reasonable time. In this case they had tons of data, but it was just clutter until an improved algorithm made it possible to get what was wanted out of the data.
Peter’s response to this counterexample was that Google also invests time in improving its algorithms. They have many nearest-neighbor problems and need to avoid brute-force searches for nearest neighbors, so they invest effort in locality-sensitive hashing, which results in a simple algorithm. So they are not dogmatic. Even so, the point is that surprisingly often the data matters more than the programs.
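For readers unfamiliar with the technique, here is a minimal sketch of one common form of locality-sensitive hashing, using random hyperplanes for cosine similarity. It shows the general idea of finding candidate nearest neighbors without scanning the whole collection; it is not Google’s implementation, and the vectors are made up.

```python
# Locality-sensitive hashing with random hyperplanes (cosine similarity).
# Vectors that hash to the same bit signature land in the same bucket, so
# candidate nearest neighbors are found without searching everything.
import random

def make_hyperplanes(num_planes, dim, seed=0):
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]

def signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(int(sum(p * v for p, v in zip(plane, vec)) >= 0)
                 for plane in planes)

def build_index(vectors, planes):
    buckets = {}
    for i, v in enumerate(vectors):
        buckets.setdefault(signature(v, planes), []).append(i)
    return buckets

planes = make_hyperplanes(num_planes=6, dim=3)
vectors = [[0.9, 0.1, 0.0], [1.0, 0.0, 0.1], [0.0, 1.0, 0.9]]
index = build_index(vectors, planes)
query = [0.95, 0.05, 0.05]
# Only vectors sharing the query's bucket are candidate neighbors; real systems
# use several hash tables to boost the chance that true neighbors collide.
print(index.get(signature(query, planes), []))  # likely includes 0 and 1
```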
In trying to capture the gist of Peter’s presentation, I have skipped over a lot of great examples and interesting points. A complete video recording of Peter’s presentation, provided by Dyyno, is available as “Analytics Conference – Keynote – Peter Norvig.” “The Unreasonable Effectiveness of Data” also appears as an “expert opinion” article by Alon Halevy, Peter Norvig, and Fernando Pereira in IEEE Intelligent Systems, pp. 8-12, March/April 2009. Seeds of the notion that more data can matter more than better algorithms on smaller datasets appeared in an earlier presentation Peter gave at the PARC Forum in 2006, “Web Search as a Product of and Catalyst for AI.”