Comments on 'Maximum Likelihood Estimation of Intrinsic Dimension' by E. Levina and P. Bickel (2004)

David J.C. MacKay+* and Zoubin Ghahramani+

+ Gatsby Computational Neuroscience Unit, University College London
* Department of Physics, University of Cambridge

We like Levina and Bickel's paper, because it confirms our longstanding view that the way to solve most inference problems is to write down an appropriate likelihood function.

[Levina and Bickel's figure 1a]

However, we feel that the authors missed the chance to apply this `likelihood principle' all the way to the final hurdle. We came to this conclusion when we saw their figure 1a, which shows the estimated dimension of a five-dimensional manifold as a function of the number of nearest neighbours k used to obtain the estimate. We were surprised that for small k, there is a strong bias in the estimated dimension, even when there are large quantities of data. We would not expect a maximum likelihood estimator to misbehave in this way. For large k, yes, we would expect biases because of the breakdown of the 'locally-uniform density' assumption. But for small k, we expected that the estimator would be unbiased, albeit with increasing variance as k decreased.

Fortunately there is a simple explanation of the bias, and the problem is easily fixed.

At equations (7) and (8), the authors write down two nice likelihood-derived estimators of the dimension m, based on the distances T_j(x) from a single point x on the manifold to its nearest neighbours.

\[ \hat{m}_R(x) = \left[ \frac{1}{N(R,x)} \sum_{j=1}^{N(R,x)} \log \frac{R}{T_j(x)} \right]^{-1} \qquad (7) \]

\[ \hat{m}_k(x) = \left[ \frac{1}{k-1} \sum_{j=1}^{k-1} \log \frac{T_k(x)}{T_j(x)} \right]^{-1} \qquad (8) \]

(Here T_j(x) is the distance from x to its jth nearest neighbour, and N(R,x) is the number of neighbours within distance R of x.)
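In Octave (the language of the scripts mentioned at the end of this note), these local estimators take one line each. The sketch below is ours, for illustration only; the names mhat_R and mhat_k are not taken from the authors' code, and T is assumed to be the sorted vector of distances from one point to its nearest neighbours.

    % Local estimates of m from one point's sorted neighbour distances
    % T(1) <= T(2) <= ... (a sketch in our own notation, not the authors' code).
    mhat_R = @(T, R) 1 / mean(log(R ./ T));            % (7): T = distances within radius R
    mhat_k = @(T, k) 1 / mean(log(T(k) ./ T(1:k-1)));  % (8): T = distances to the k nearest neighbours

Applying mhat_k (or mhat_R) around each of the n data points gives n per-point estimates of m.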

The question then is how to combine the results obtained in the neighbourhood of all n datapoints to give a single inference of the dimension m.

Surely the correct way to proceed is to write down a likelihood function? Sadly, the authors decide to average the n estimators

\[ \hat{m}_k = \frac{1}{n} \sum_{i=1}^{n} \hat{m}_k(x_i) \qquad (9) \]

It's not hard to come up with a better-motivated way of combining the statistics. Treating the data surrounding each of the n points as if they were independent (which, of course, they are not), we can write down a likelihood of exactly the same form we already encountered. This approach motivates an estimator just like (7),

\[ \hat{m}_R = \left[ \frac{1}{\sum_{i=1}^{n} N(R,x_i)} \sum_{i=1}^{n} \sum_{j=1}^{N(R,x_i)} \log \frac{R}{T_j(x_i)} \right]^{-1} \]
Rather than averaging the estimators, we think you should average their inverses.
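In code, the difference between the two ways of pooling is a single line. A minimal sketch (our own variable names, not the accompanying scripts): if invm is the n-vector of local inverse estimates, with invm(i) = 1/mhat(x_i), then

    m_average    = mean(1 ./ invm);    % their equation (9): average the n local estimators
    m_likelihood = 1 / mean(invm);     % our proposal: average the inverses, then invert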

Figures 2 and 3 below show the results of experiments with 2000 uniformly-distributed points on a manifold of dimension 5. We fixed R to a sequence of values and, at each data point, used estimator (7) to obtain an individual estimate of m; these individual estimates are illustrated in figure 3b (for one particular choice of R), and their inverses are shown in figure 3a. Figure 2 shows, for each value of R (which controls position on the horizontal axis), the average of the m-hats, i.e. the authors' estimator (yellow); the inverse of the average of the inverse m-hats (orange); and our preferred maximum likelihood estimator, which is just equation (7) again. (In figure 2a the horizontal axis, on a log scale, is the average number of enclosed points.)

Of course, where the data density varies, we cannot expect to use a single value of R, so figure 4 shows similar results for the case where we fix k instead. One way of writing the estimator is

\[ \hat{m}_k = \left[ \frac{1}{n} \sum_{i=1}^{n} \hat{m}_k(x_i)^{-1} \right]^{-1} = \left[ \frac{1}{n(k-1)} \sum_{i=1}^{n} \sum_{j=1}^{k-1} \log \frac{T_k(x_i)}{T_j(x_i)} \right]^{-1} \]
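For readers who prefer code to figures, here is a minimal Octave sketch of the fixed-k experiment; it is a stand-in written for this note, not a copy of RUNME3.m. It draws 2000 points uniformly from a 5-dimensional hypercube and compares the two pooling rules at one value of k.

    % Stand-in for the fixed-k experiment (our own sketch, not RUNME3.m).
    n = 2000;  d = 5;  k = 4;
    X = rand(n, d);                                  % uniform points on a d-dimensional manifold
    G = X * X';
    D = sqrt(max(diag(G) + diag(G)' - 2*G, 0));      % pairwise Euclidean distances
    D(1:n+1:end) = Inf;                              % exclude each point's distance to itself
    Ts = sort(D, 2);                                 % row i: sorted neighbour distances of x_i
    invm = mean(log(Ts(:,k) ./ Ts(:,1:k-1)), 2);     % 1/mhat_k(x_i) for each point, from (8)
    m_average    = mean(1 ./ invm)                   % the authors' (9): average the estimators
    m_likelihood = 1 / mean(invm)                    % ours: average the inverses, then invert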

[Figure 2a: the pooled estimators, with a log scale on the horizontal axis]
[Figure 2b: the pooled estimators, with a linear horizontal axis]
[Figure 3a: scatterplot of the individual estimates of 1/m]
[Figure 3b: scatterplot of the individual estimates of m]
[Figure 4, left panel]
Figure 4
Left: Yellow - their estimator, found by averaging the individual estimators.
Red - our maximum likelihood estimator, found by averaging the inverses.
Below: The scatterplots show individual estimators of 1/m (left) and m (right) based on the neighbours of each of the 2000 datapoints, with colour coding the value of k used, starting from red (k=2), orange (k=3), yellow (k=4), green (k=6), cyan (k=8), etc.
Notice the modest asymmetry of the distribution of estimators of 1/m, and the extreme kurtosis of the distribution of estimators of m.
[Figure 4, lower panels]

Fixing the poor behaviour of the estimator at small k simplifies matters arising later in the paper: for example, the fine-tuning of the range of k values (k1, k2) can probably be done away with; instead, there could be a single controlling hyperparameter describing the lengthscale at which the assumptions of the model break down. This hyperparameter could itself be inferred automatically with a Bayesian approach.

Take-home message: never average anything without consulting a likelihood function.

P.S. Octave code used to generate these figures is supplied in RUNME2.m, doplots.m, and RUNME3.m.


David MacKay
Last modified: Wed Jan 26 09:48:24 2005