The art of data science via Mondrian Forests

Probability and Statistics

Seminar room M3-324
Erwan Scornet
Wednesday, 16 January 2019, 10:30 - 11:30

The recent and ongoing expansion of the digital world now gives anyone access to a tremendous amount of information. However, collecting data is not an end in itself, so techniques must be designed to extract in-depth knowledge from these large databases.

This has led to a growing interest in statistics as a tool for finding patterns in complex data structures, and particularly in turnkey algorithms that do not require specific skills from the user.

Such algorithms are quite often designed on a hunch, without any theoretical guarantee. Indeed, the layering of several simple steps (as in random forests or neural networks) makes the analysis more arduous. Nonetheless, theory is vital to provide assurance about how algorithms operate, thus preventing their outputs from being misunderstood.

In this talk, we analyze a stylized version of random forests called Mondrian Forests and prove that it reaches minimax rates of consistency for Lipschitz and twice differentiable regression functions. This is the first result showing the optimality of a particular random forest algorithm in arbitrary dimension. We will also elaborate on the importance of aggregation in the forest.
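
For readers unfamiliar with the term, the minimax rates referred to in the abstract are usually stated for the quadratic risk as in the sketch below. The notation is an assumption made here for illustration and does not come from the abstract: $n$ is the sample size, $d$ the dimension of the covariates, $\hat f_n$ any estimator of the regression function $f$, and $\mathcal{F}_s$ the class of functions with smoothness $s$ (with $s = 1$ for Lipschitz functions and $s = 2$ for twice differentiable functions with bounded second derivatives).

```latex
\inf_{\hat f_n} \, \sup_{f \in \mathcal{F}_s}
  \mathbb{E}\, \bigl\| \hat f_n - f \bigr\|_{L^2}^2
  \;\asymp\; n^{-2s/(2s+d)},
\qquad
s = 1:\; n^{-2/(d+2)},
\qquad
s = 2:\; n^{-4/(d+4)}.
```

Both exponents shrink as $d$ grows, which is why a guarantee valid in arbitrary dimension is the notable part of the claim.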