
Probabilistic Inference Tasks for Latent Variable Models


I think it is important to realize that, before anything else, deep learning and computer science are truly mathematical fields of study. The computer is a mechanism, a medium, to speed up computation, allowing individuals to perform extremely complex calculations in a matter of nanoseconds. It is important to note that mathematical improvements in computer science theory can speed up computation in computers just as much as engineering developments in transistor research. As computer scientists we deal with abstract data structures, and the study of transfiguring that data into meaningful representations to accomplish meaningful tasks is the study of algorithms. The study of algorithms is more or less the study of communication between math and computers. There lies much intricacy in this communication, some of which involves discovering optimal ways to formulate mathematical equations in a computationally efficient fashion.

One of my favorite examples of this kind of algorithm is the fast Fourier transform. The discrete Fourier transform (DFT) is a mathematical tool for decomposing data into its generative frequencies. Taking the DFT of a piano recording will output the frequencies that make up that recording. This tool is incredibly useful in both lossy and lossless compression algorithms, because storing the frequencies that exist within data is much more space efficient than storing the data itself. This also works in two dimensions, which is unfortunately less intuitive, but just as useful for compressing images. The fast Fourier transform (FFT) is a way to compute the DFT in a much more computationally efficient way. The computational complexity of the naive DFT is on the order of the square of the number of data points in the file we want to compress. The FFT’s time complexity is n log(n), which, when scaled up to very large values of n, is amazingly more efficient. This is why computer science is a science discipline and not an engineering discipline. As computer scientists we look down into the structures of mathematics and study what makes an algorithm efficient.
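To make that complexity gap concrete, here is a minimal sketch in Python with NumPy - my own illustration, not code from any particular compression library - comparing a naive O(n²) DFT against the O(n log n) FFT. The signal and sizes are made-up example values.

```python
import numpy as np

def naive_dft(x):
    """Direct O(n^2) evaluation of the discrete Fourier transform."""
    n = len(x)
    k = np.arange(n)
    # Each output frequency is a dot product with a complex sinusoid,
    # so building and applying the full basis matrix costs n^2 work.
    basis = np.exp(-2j * np.pi * np.outer(k, k) / n)
    return basis @ x

# Example signal: two sine waves (50 Hz and 120 Hz) sampled for one second.
t = np.linspace(0, 1, 512, endpoint=False)
signal = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

# NumPy's FFT computes the same transform, just far faster for large n.
assert np.allclose(naive_dft(signal), np.fft.fft(signal))
```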

A similar perspective should be taken for the study of deep learning. I think that fundamentally deep learning is a different field of study, separate from computer science. It is separate in its mode of problem solving: there is a fundamentally different approach when it comes to solving and conceptualizing problems. However complex, the field of deep learning is heavily rooted in mathematics - these models operate on the basis of mathematical formulas.

Some people can get caught up with looking at deep learning models as this mystical thing, and while that is a perfectly viable perspective to have, there are others. While I love looking into the intricacies of specific models, I think it is important to zoom out occasionally and try to find greater applications of these models. Deep learning models are first and foremost function approximators for high dimensional functions between an input and an output - at least in the supervised scenario. Even in unsupervised learning, the model often learns a relationship between high and low dimensional data. One such model that I have gone deep into is the probabilistic autoencoder, which learns low dimensional representations of high dimensional data.
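As a rough picture of what I mean, here is a minimal (deterministic, not yet probabilistic) autoencoder sketch in PyTorch. The layer sizes are hypothetical - 784-dimensional inputs squeezed down to an 8-dimensional latent code - and the point is just the shape of the mapping, not a tuned architecture.

```python
import torch.nn as nn

class Autoencoder(nn.Module):
    """Maps high dimensional data to a low dimensional code and back."""

    def __init__(self, input_dim=784, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)           # low dimensional representation
        return self.decoder(z), z     # reconstruction and latent code
```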

A probabilistic inference task in latent variable models asks the question, “What makes up this observable data?” This is a field of mathematics that has been improved, and made far more applicable, through the use of deep learning models as probabilistic function approximators. For the remainder of the blog post I will be using math jargon… this is code for saying I’m not smart enough to make it understandable and intuitive yet. There are two main things that this field explores. The first is the computation of the posterior distribution (the probability of some latent variables occurring given observable data), and the second is the optimal parameterization of the posterior and likelihood distributions. The parameterization of a distribution alludes to the use of deep neural networks, because they are universal parameterized models used for the approximation of functions.

One type of probabilistic inference model comes from the paper “Auto-Encoding Variational Bayes,” from which the infamous variational autoencoder (VAE) is derived. Papers like this are far from the norm of computer science literature, often being rooted in probability theory. To understand the basis of the VAE, we must understand the evidence lower bound and the basis for variational inference. Variational inference is used for the task of inferring the generative variables of observable data. The posterior distribution p(z|x) is just mathematical language for describing the relationship between the observable data, x, and the latent description of that data, z. The task of variational inference, in my mind, is a very beautiful thing that partially captures the essence of learning… but that’s for another blog. This is just half of the story if we wish to form a true objective function; the other half is called expectation minimization. With variational inference, we can compress the observable data, but the problem of expectation minimization asks us, “to what end do we compress the data?” The traditional answer here is that we compress the information such that the observable data can be reconstructed from the latent expression of the original observable data. Expectation minimization usually takes the form of likelihood maximization. The likelihood function can be conceptualized as the expansion of the latent variables back into the original observable data. Maybe you’d like to think about it as decoding information as if it were some secret code.
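For reference, the objective derived in “Auto-Encoding Variational Bayes” ties these two halves together in the evidence lower bound (ELBO): a reconstruction term (the likelihood maximization half) and a KL term that keeps the approximate posterior q(z|x) close to the prior over z (the compression half):

```latex
\log p_\theta(x) \;\ge\;
\underbrace{\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]}_{\text{reconstruction}}
\;-\;
\underbrace{D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)}_{\text{compression toward the prior}}
```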

Evidence in probability theory refers to the marginal likelihood of the observed data, evaluated at a fixed set of model parameters. Let’s take a look at that likelihood function for a second. As we said earlier, the likelihood has been defined to be the log probability of reconstructing the data from its latent representation back to its observable form. We should ask, “What is the true human objective of encoding information on computational machines?” Let’s say we are aiming for a lossy compression algorithm, as is needed for compressing very high dimensional data into relatively low dimensional data. Maybe the objective of variational inference would work with a different kind of expectation minimization other than likelihood maximization. The question to ask is, “What do we want the latent information to convey?” The answer is whatever we would like to reconstruct the data to, such that we preserve that meaningful information specifically. Take the denoising autoencoder: it encodes the information essential for decoding the input back into a less noisy image. In that specific scenario, we did not care about the exact reconstruction of the observable data; we cared about maintaining the meaningful information inside of that observable data - what we deemed to be meaningful. I think that a much more powerful model can be brought into existence by using a variable expectation minimization function.
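As a sketch of what that variable objective could look like, here is a hypothetical training step in PyTorch (reusing the Autoencoder sketched above - the names and shapes are mine, not from any particular paper). The only change between a plain autoencoder and a denoising one is which target we ask the reconstruction to match.

```python
import torch
import torch.nn.functional as F

def reconstruction_step(model, x_in, target):
    """One training step where *we* choose what the latent code must preserve.

    target == x_in            -> plain reconstruction (likelihood maximization)
    target == clean image,    -> denoising autoencoder: the latent code only
    x_in == corrupted image      needs to keep what the clean image demands
    """
    recon, _ = model(x_in)
    return F.mse_loss(recon, target)

# Illustrative usage with random stand-in data:
# model = Autoencoder()
# x_clean = torch.rand(32, 784)
# x_noisy = x_clean + 0.1 * torch.randn_like(x_clean)
# loss = reconstruction_step(model, x_noisy, x_clean)
```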


