The futility of gigantic training sets with simple models
It is believed that a simple, usually linear, model with an extra-huge training set and a gigantic feature representation is superior to a more powerful model with less data, a claim often made by Google. And indeed, large companies are able to get better results by using ever larger training sets with simple models: each order of magnitude increase in the training set results in a reasonable increase in performance.
This approach is sensible in the sense that increasing the size of the training data is essentially guaranteed to improve performance. And if I have a fast learning algorithm, thousands of cores, and lots of data, then it is conceptually trivial to use more training data whenever the learning algorithm can be parallelized without too much engineering effort. And if I needed better performance very soon, I’d do precisely that.
However, the problem with this approach is that it runs out of steam in the sense that it will not reach human level performance. The following figure illustrates the point:
In this figure, the simpler model eventually outperforms the more sophisticated one, mainly because it is easy and relatively cheap to make the model larger. However, simple models will necessarily fail to reach human level performance, and the more powerful models will eventually but certainly outperform them. It must be so, hence QED. More seriously, a model could not successfully solve tasks that involve any kind of text comprehension, for example, without first extracting a really good representation of text’s meaning with miraculous-looking properties. And that’s something simple models don’t even try to do. By not using a good representation, the simple model falls back on its more primitive feature representation, which does explicitly describe the higher-level concepts that are ultimately needed to solve our task.
Nonetheless, simple (ie linear) models have serious advantages over complex models. They are faster to train and are easier to extend and understand, and their behaviour and performance is more predictable. However, it is finally becoming recognized that neural networks have the potential to be vastly more expressive than linear models without using too many parameters. And now that we are becoming better at trainign deep neural networks, we will see the proliferation of the more powerful multilayered perceptrons. Of course, naive multilayered perceptrons will also probably run out of steam, in which case we’ll have to design more exotic and ambitious architectures. But for now they are the simplest and the most powerful model class.