
Dimensionality reduction and model interpretability

I would say that the main purpose of communication is to give a universal understanding of abstract ideas. An abstraction, for my purposes, is a lower dimensional encoding of a higher dimensional phenomenon. Take language: the noises we make with our mouths carry very specific and important meanings, giving form to things that would otherwise be incommunicable. Language embeds our understanding of the world within itself; it is a latent formulation of our environment, circumstances, and thoughts. One important note about language is that it is of lower dimensionality than the thing it attempts to describe. In fact, that is exactly what makes language as useful as it is. Before I dive into the subject of this blog, understand that the purpose of language is to provide an interpretable, abstract framework for problem solving: a lower dimensional representation of a high dimensional phenomenon that, while maintaining its abstract and low dimensional form, remains interpretable and understandable across a domain of individuals. We already know of models that let us infer latent representations of observable data. These are called latent variable models (LVMs), and their primary purpose is dimensionality reduction. They are not necessarily deep learning models; in fact, some of the most useful ones are not. In this blog, I will discuss principal component analysis (PCA), the Fourier transform, symbolic regression, program synthesis, and deep learning dimensionality reduction techniques.


We can effectively call all of the models mentioned above latent variable models - although we might argue about how useful the latent variables they produce are, since they vary in precision, interpretability, and ability to generalize. These models do not necessarily belong in the same general conversation, but it will become clear why I am grouping them: all of them can be looked at as ways to fit an unknown function and as forms of dimensionality reduction. Principal component analysis (PCA) is a technique used across many domains of application, with its primary purpose being dimensionality reduction. This classic technique of extracting the individual directions of highest variance allows us to keep what is important and discard what is not. Developed in 1901, this tool of linear algebra extracts statistical features and stores them in a low dimensional form. PCA is a relatively limited form of dimensionality reduction because it only extracts features (components) that are linear in nature, and it works better for smaller dimensions. The variance captured along each direction is given by the corresponding eigenvalue of the covariance matrix, and these directions, together with the variances they explain, are called the principal components of the data. PCA is very powerful when the data is well described by a linear combination of a few such components - that is, for data that is relatively low dimensional and linear in nature. PCA is not the best dimensionality reduction technique for classification because of the way it treats outliers in a statistical distribution. Although relatively limited, PCA is a building block for other techniques.
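
To make this concrete, here is a minimal PCA sketch in NumPy: center the data, eigendecompose the covariance matrix, and project onto the top-k eigenvectors. The toy data and the choice of k are arbitrary, just to show the shape of the computation.

import numpy as np

# Minimal PCA sketch: center the data, eigendecompose the covariance matrix,
# and keep the top-k eigenvectors (the principal components).
def pca(X, k):
    X_centered = X - X.mean(axis=0)                 # center each feature
    cov = np.cov(X_centered, rowvar=False)          # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigh: covariance is symmetric
    order = np.argsort(eigvals)[::-1]               # sort directions by variance, descending
    components = eigvecs[:, order[:k]]              # top-k principal directions
    explained = eigvals[order[:k]] / eigvals.sum()  # fraction of total variance each keeps
    return X_centered @ components, components, explained

# Toy usage: 500 points in 10 dimensions compressed to 2.
X = np.random.randn(500, 10)
Z, W, var_ratio = pca(X, k=2)
print(Z.shape, var_ratio)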

Joseph Fourier, a French mathematician, created the Fourier transform to decompose signals into their fundamental sinusoidal components. Created in 1807, this formula would be used for many different purposes ranging across all of science. Later generalized to higher dimensions, the Fourier transform is able to approximate drawings, functions, and other things as summations of different sinusoidal functions. It is similar to a Taylor series, but somewhat reversed: the Taylor series approximates functions with polynomials, while the Fourier series approximates functions with sinusoids. Things such as sketches are more easily encoded with a Fourier transform than rigid structures such as polygons. This is typical of dimensionality reduction techniques in general: a model will perform well if the data to be compressed is structured in a particular way. The Fourier approximation works very well with data that is smooth, continuous, and differentiable, especially when that data is close to a sinusoidal function. Built into each of these models is an a priori assumption - also called a prior distribution - that describes the model's inherent bias toward data structured in a particular way. These prior assumptions increase the tractability, efficiency, and interpretability of these latent variable models, but at the cost of generality and multi-modal processing.
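
As a rough illustration of the Fourier transform used for dimensionality reduction, the sketch below (using NumPy's FFT) keeps only the few strongest frequency components of a sampled signal and reconstructs an approximation from them. The signal and the number of kept components are made-up choices for illustration.

import numpy as np

# Fourier transform as compression: keep only the k largest-magnitude frequency
# components and reconstruct an approximation from that reduced representation.
def fourier_compress(signal, k):
    coeffs = np.fft.rfft(signal)                    # real FFT: N samples -> N/2 + 1 complex coeffs
    keep = np.argsort(np.abs(coeffs))[-k:]          # indices of the k strongest frequencies
    compressed = np.zeros_like(coeffs)
    compressed[keep] = coeffs[keep]                 # discard everything else
    return np.fft.irfft(compressed, n=len(signal))  # back to the time domain

# Toy usage: a smooth signal plus noise is well captured by a handful of frequencies.
t = np.linspace(0, 1, 1024, endpoint=False)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t) + 0.1 * np.random.randn(1024)
approx = fourier_compress(signal, k=4)
print(np.mean((signal - approx) ** 2))              # reconstruction error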

Symbolic regression is a function fitting technique that searches a previously defined function space, using a tree search, to fit a function to data. The benefit of symbolic regression is that its results are highly interpretable, as they are combinations of previously defined functions (sin, cos, exponential, quadratic). This method of modeling dynamic behavior is especially helpful in fields like physics, where it is extremely important to understand what the model itself is doing. Symbolic regression techniques are also good for extrapolation and for exploring behavior beyond the initial training data set. Because the function space that symbolic regression searches through is predefined, we are able to say with some certainty how far these approximations can be trusted. Symbolic regression is looked at as almost cheating to an extent, because so much of the knowledge it gains comes merely from what we, as humans, define to be a good answer. I do not share this viewpoint. This method of function approximation is extremely powerful because of its high degree of interpretability and low degree of complexity. Increasing interpretability from a human's perspective is equivalent to decreasing the entropy of a system, which takes work. In our usual models, a lot of this work is allocated to the machine learning process itself. However, if we can define a set of functions that have a high expected value of lowering the relative entropy of the problem, why would we not want to use them? I agree to some extent that this is not the optimal way to fit models, but I think it is definitely an interesting start. The primary reason I don't brush symbolic regression off as a cop-out method is that it is very similar to what we do as humans, and of course this is our primary reference frame for intelligence. We did not come up with the function spaces that we search every day to solve problems ourselves - that would be far too inefficient. We understand the work of previous scientists to be of low entropy, high interpretability, and of intelligence, and we use this work to build our knowledge of dynamical systems and to come up with better ways to define their most fundamental behavior. One limitation of symbolic regression techniques is that the computation becomes intractable as the dimension of the input increases. Symbolic regression works best with low dimensional mappings, but unfortunately most of the problems we face today in deep learning are extremely high dimensional.
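
As a toy sketch of the idea, the code below randomly generates expression trees from a small predefined primitive set and keeps whichever candidate fits the data best. Real symbolic regression systems use genetic programming or smarter guided tree search; the primitives, depth limit, and random-search strategy here are my own illustrative assumptions, not a reference implementation.

import random
import numpy as np

# Primitive set: the predefined building blocks the search is allowed to use.
UNARY = {'sin': np.sin, 'cos': np.cos, 'exp': np.exp, 'square': np.square}
BINARY = {'add': np.add, 'mul': np.multiply, 'sub': np.subtract}

def random_expr(depth=3):
    # Build a random expression tree, returning a callable and a readable string.
    if depth == 0 or random.random() < 0.3:
        c = random.uniform(-2, 2)
        return (lambda x, c=c: c * x), f"{c:.2f}*x"
    if random.random() < 0.5:
        name, fn = random.choice(list(UNARY.items()))
        child, s = random_expr(depth - 1)
        return (lambda x: fn(child(x))), f"{name}({s})"
    name, fn = random.choice(list(BINARY.items()))
    left, ls = random_expr(depth - 1)
    right, rs = random_expr(depth - 1)
    return (lambda x: fn(left(x), right(x))), f"{name}({ls}, {rs})"

def symbolic_regression(x, y, n_candidates=5000):
    # Keep the candidate expression with the lowest mean squared error.
    best_fn, best_str, best_err = None, None, np.inf
    for _ in range(n_candidates):
        fn, s = random_expr()
        with np.errstate(all='ignore'):             # exp can overflow; skip such candidates
            err = np.mean((fn(x) - y) ** 2)
        if np.isfinite(err) and err < best_err:
            best_fn, best_str, best_err = fn, s, err
    return best_fn, best_str, best_err

# Toy usage: search for an interpretable expression fitting y = sin(x) + 0.5*x.
x = np.linspace(-3, 3, 200)
y = np.sin(x) + 0.5 * x
_, expr, err = symbolic_regression(x, y)
print(expr, err)                                    # a human-readable formula and its fit error

The result is a formula a human can read, which is the whole point: the interpretability comes from the predefined primitive set, not from the search itself.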


In the realm of dimensionality reduction and latent inference for complex, high dimensional problems, we turn to deep learning because of its ability to learn a low dimensional representation of high dimensional data efficiently. This kind of learning generalizes across all instances of neural networks: they learn smooth manifolds that represent the latent structure of a higher dimensional dataset. Of all these techniques of function approximation and representation, deep learning is the one I know the most about, yet it is still by far the most complex. The implications of using such a model fitting technique are beyond our current understanding - research continues every day aimed at figuring out what these neural networks are actually doing and how reliable they really are. The more we understand the internal processing logic at multiple levels of abstraction inside our deep learning models, the more we can do with them. The more we understand how a neural network does what it does, the better we can define the cases in which its approximation will be accurate. Interpretability is the backbone of extrapolation. In order to generalize to data outside of the training set, a coherent, interpretable, and logical thought process is needed so that we know when the model is wrong. For all of the techniques mentioned in previous paragraphs, we as humans know when they will fail to approximate and to what degree they will succeed. In the case of deep learning, because the model works to its own ends, creating every representation it needs to manipulate, the representations it creates will be single purposed and uninterpretable to an outside observer.
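
A common concrete instance of this is the autoencoder. The sketch below (assuming PyTorch) compresses a high dimensional input into a small latent vector and trains the network to reconstruct the input from that code; the layer sizes, latent dimension, and random stand-in data are illustrative assumptions, not a prescription.

import torch
import torch.nn as nn

# Minimal autoencoder sketch: the encoder learns a low dimensional latent code,
# and the decoder is trained to reconstruct the original input from it.
class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),             # the low dimensional representation
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

# Toy training loop on random data, standing in for a real dataset.
model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
data = torch.randn(256, 784)
for step in range(100):
    recon, z = model(data)
    loss = nn.functional.mse_loss(recon, data)      # reconstruction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print(z.shape)                                      # (256, 8): coordinates on the learned manifold

The latent code z is exactly the kind of representation the paragraph above worries about: useful to the network for reconstruction, but with no guaranteed meaning to an outside observer.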


Giving an understanding to abstraction is incredibly important, and that is exactly what is meant by 'interpretability'. As humans, if we are able to give a coherent and logical explanation of how something works, we are able to generalize that phenomenon to a wider domain of behavior and instances. Developing a 'method to the madness' allows for extrapolation. Deep neural networks are extremely powerful models, but they are primarily limited by their uninterpretable abstractions, which greatly limit our understanding of them and our ability to manipulate them. Building off of the initial conversation about communication, in future blogs I will share my opinions about how to make these models more interpretable.


