
Unsupervised learning part 2

One of the most popular uses for a standard autoencoder is data compression. If I have a picture of a whale on my phone and I want to send that picture to you, instead of sending all of the information in that image, I could send you the encoded state of that image. Once you receive that encoded state, it is passed through the decoder and used to regenerate the image of the whale. Another use for the autoencoder, as talked about in the previous blog, is denoising images, where the autoencoder is trained to ignore noise and to reconstruct a clean image from the encoded state. This is a type of unsupervised learning, because the way the input data is abstracted is not guided by labels. The objective function in a standard autoencoder is a reconstruction loss, but it’s important to understand that the abstraction of the data as it passes through the architecture is not being fit to labeled targets as it would be in a supervised learning setting.
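To make that concrete, here is a minimal sketch of an encoder/decoder pair in PyTorch, assuming flattened images and a mean-squared reconstruction loss. The layer sizes and names are my own illustrative choices, not taken from any particular implementation.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Minimal fully-connected autoencoder: compress to a small latent code, then reconstruct."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder squeezes the input down to `latent_dim` numbers (the encoded state).
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder tries to rebuild the original input from that code.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)      # the compressed "message" you could send instead of the image
        return self.decoder(z)   # the reconstruction on the other end

model = Autoencoder()
loss_fn = nn.MSELoss()           # reconstruction loss: how far the output is from the input
x = torch.rand(16, 784)          # fake batch standing in for flattened images
loss = loss_fn(model(x), x)
```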

Generation of new, unseen images is not a feature of standard autoencoders. Let’s take human faces as an example: pass these pictures through an autoencoder and iteratively compute the reconstruction loss to train the network. Once the network is fully trained, we might want to generate a new image. How would we go about doing this? The first thing you might try is to give the decoder a new set of randomly generated latent variables, but that image will likely look like gibberish because the latent variables are random. It’s actually quite helpful, at least for me, to visualize these latent variables as existing in some kind of latent space with n dimensions, n being however many latent variables there are. Any given point in this n-dimensional space then represents a unique combination of latent variables. We might try passing two faces through the autoencoder and recording the latent variables of both images. After this, take the mean of the corresponding latent variables; now you have a new set of latent variables that is an interpolation between the two original sets.
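As a rough sketch, continuing from the hypothetical trained `model` above (with `face_a` and `face_b` standing in for two flattened face images), the interpolation experiment might look like this:

```python
# Latent-space interpolation with a plain autoencoder (illustrative only).
face_a = torch.rand(1, 784)              # stand-in for a real flattened face image
face_b = torch.rand(1, 784)              # stand-in for a second face image

with torch.no_grad():
    z_a = model.encoder(face_a)          # latent variables for the first face
    z_b = model.encoder(face_b)          # latent variables for the second face
    z_mid = 0.5 * (z_a + z_b)            # point halfway between them in latent space
    blended = model.decoder(z_mid)       # with a standard autoencoder, often not a sensible face

    z_random = torch.randn_like(z_a)     # a random point in latent space
    gibberish = model.decoder(z_random)  # usually decodes to nothing recognizable
```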

We as humans might like to be able to control the generation of images to some extent. You might expect the decoded interpolation of the two sets of latent variables to be some kind of mixture of the two input faces, but that expectation would be wrong most of the time. We need to make some change to the standard autoencoder algorithm to make this kind of interpolation possible. A variational autoencoder allows variations in latent space to correspond to variations in image space: small changes in latent space should correspond to small changes in image space. The change we need concerns the distance between different distributions in latent space; this will make more sense later on. The prior probability distribution, in the case of an autoencoder, is the probability of seeing any given set of latent variables. In the variational autoencoder, we can choose the prior probability distribution however we want. Let’s define the prior as a Gaussian distribution over the latent variables with a mean of 0, a variance of 1, and an uncorrelated (diagonal) covariance matrix. The mean and variance are arbitrary, but the uncorrelated covariance is a deliberate choice, made so that the latent dimensions are uncorrelated.
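Written out (in my own notation), that choice of prior is just a standard Gaussian with an identity covariance, which factorizes across the n latent dimensions:

```latex
p(\mathbf{z}) = \mathcal{N}\!\left(\mathbf{z};\, \mathbf{0},\, I_n\right) = \prod_{i=1}^{n} \mathcal{N}\!\left(z_i;\, 0,\, 1\right)
```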

The prior is important because it describes how plausible any given latent instance is. In order to have this correspondence between changes in latent space and changes in image space, different latent representations almost need to bleed into each other, so that interpolating between representations in latent space has a mixing effect in image space. One main difference between autoencoders and variational autoencoders is that in a VAE the latent variables are sampled from a learned Gaussian distribution. The encoder learns a mean and a variance for each latent variable; the mean and variance completely define that dimension’s distribution, the distribution is sampled from, and the sampled point becomes the latent variable for that dimension. The effect is that for a given image there is not one single encoded point in latent space, but a whole distribution of points that corresponds to it.
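A rough sketch of that sampling step in PyTorch, assuming an encoder head that outputs a mean and a log-variance per latent dimension (the names `VAEEncoder` and `sample_latent`, and the layer sizes, are my own illustrative assumptions):

```python
import torch
import torch.nn as nn

class VAEEncoder(nn.Module):
    """Encoder that outputs a mean and a log-variance for each latent dimension."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)       # mean of each latent Gaussian
        self.logvar = nn.Linear(256, latent_dim)   # log-variance of each latent Gaussian

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), self.logvar(h)

def sample_latent(mu, logvar):
    """Sample z from N(mu, sigma^2) using the reparameterization trick."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)   # noise drawn from a standard normal
    return mu + eps * std         # a different latent point each time, centred on mu
```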

The reconstruction term will always be a part of the objective function of any autoencoder. Specific to variational autoencoders, there is a new term in the objective function that discourages the distribution learned for a given sample from straying too far from the prior we defined above. We want the learned distributions of the input data to be different enough that the reconstructed, decoded image makes sense, but also similar enough that moving around in latent space has meaning in the context of image space. This second term in the objective function is the KL divergence between the learned distribution and the prior distribution. It helps me to look at the terms in the objective function as forces. The reconstruction term, which measures the difference between the input and output data, is essential because it ensures that the structure in the data is maintained through compression. The KL divergence term is specific to VAEs: it is the term that ensures variation of the latent variables corresponds to meaningful variation in the high-dimensional data space, by ensuring the encoded representations of different inputs are not too different. In a way, these two terms are dueling forces. The first term, the reconstruction loss, wants the network to learn distinct representations in latent space, likely far away from each other. Meanwhile, the second term directly penalizes how the data is distributed in latent space; it is a force toward conformity. We can actually weight these terms differently: if we weight the reconstruction loss low and the KL divergence high, the network is not incentivized to reconstruct the input data. As the weight shifts completely toward the second term, all learned distributions tend toward the prior, and reconstruction is ruined.
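As a sketch of how those two forces show up in code, here is an illustrative VAE loss in PyTorch with a weighting factor `beta` on the KL term. The closed-form expression is the standard KL between a diagonal Gaussian and N(0, I); the function name and the choice of mean-squared error for reconstruction are my own assumptions.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar, beta=1.0):
    """Reconstruction term plus a weighted KL term between the learned Gaussians and N(0, I)."""
    # Force 1: make the decoded output match the input.
    recon = F.mse_loss(x_recon, x, reduction="sum")
    # Force 2: keep each learned Gaussian close to the standard-normal prior.
    # Closed form for KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    # beta > 1 pushes harder toward the prior (more conformity, worse reconstruction);
    # beta near 0 leaves something close to an ordinary autoencoder.
    return recon + beta * kl
```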

Now comes the goal of disentanglement, the idea that a change in one latent variable should correspond to a change in an individual feature in data space. I think one of the reasons this is so hard is that features in data space are seemingly subjective. Take again the idea of faces: some disentangled features might be hair, face color, face shape, eye shape, nose shape, and so on. These features may seem disentangled to us… but it raises the question, would an unsupervised network come to the same conclusions about what the disentangled elements of a face are? Earlier there was the assumption of an identity covariance matrix for the prior; its zero off-diagonal entries have the effect of making the latent dimensions uncorrelated. So with that, there is already some disentanglement cooked into the VAE. We can actually get more disentanglement by weighting the second term more heavily, which further encourages the latent dimensions to be uncorrelated. But again, if we weight the KL term too much, we lose reconstruction ability, and therefore the meaning of the data in latent space. The KL term means a lot in the context of the VAE, and the objective should not necessarily be to minimize that term outright. The objective of a VAE is to maximize the evidence lower bound, sometimes called the ELBO. In a paper called ELBO surgery, the authors go over how you can break the KL term up into the KL between the aggregate learned posterior and the chosen prior, plus an index-code mutual information term. We want that first piece, KL[q(z) || p(z)], to be minimized. This has led to more success in effective disentanglement. You can actually go even deeper in decomposing the evidence lower bound and find different ways of formulating the objective function to encourage better disentanglement.
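Roughly, in my notation, the ELBO and the ELBO-surgery style split of its averaged KL term look like this, where x_1, ..., x_N are the training points and q(z) is the aggregate of the learned per-sample posteriors:

```latex
\mathrm{ELBO} = \mathbb{E}_{q(z \mid x)}\!\big[\log p(x \mid z)\big] \;-\; \mathrm{KL}\big[q(z \mid x)\,\|\,p(z)\big]

\frac{1}{N}\sum_{n=1}^{N} \mathrm{KL}\big[q(z \mid x_n)\,\|\,p(z)\big]
  \;=\; \underbrace{I_q(x;\,z)}_{\text{index-code mutual information}}
  \;+\; \underbrace{\mathrm{KL}\big[q(z)\,\|\,p(z)\big]}_{\text{aggregate KL to the prior}}
```

The second line is the decomposition referred to above: only the rightmost piece, KL[q(z) || p(z)], is the part we really want driven down.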

Yoshua Bengio’s claim is that decisions, and our interpretation of observations, occur at a much lower dimensionality than the one in which those observations exist. I agree. For me, this all has to do with the idea of capacity. We all strive for the simplest way to do things; we want a tool set small enough to hold in our tool belt, but diverse enough to complete any task. Disentanglement is the idea that each tool is unique and serves a different purpose in the context of completing tasks. Picture some robot, with eyes, ears, nose, the whole deal. It has control of an object, a box. Let’s say the robot has 8 variables of movement it can control. These variables of movement are not hard-coded into the robot; they must be learned. An intelligent robot might decide to learn to control some disentangled elements of motion: up, down, left, right, and rotation in each direction. There is a punishment for an excessive number of combinations of these movement variables. The robot should use as few latent variables as possible; this would incentivize disentanglement. It’s almost like I want the robot to conserve its mental energy when making moves in the world. I want the robot to learn the simplest way to do complex things. Again, the robot’s actions are more like a specially crafted toolbox. For whatever task is at hand, the same tools, the same movements, are used. This does not encourage the robot to have excessive or complex actions in its toolbox. In fact, it encourages the opposite: given a broad enough array of tasks, and a small enough limit on how many tools it can have, it would be no good to learn weird tools. The complexity or difficulty of a task might be defined as how many of these action variables have to be combined to complete it. Take those 8 disentangled variables of movement from above; I’d take the stance that these actions are in fact optimally disentangled. The robot might be able to reason that walking in a straight line is a simpler task than walking in a figure eight, because a figure eight requires a more complex combination of action variables. Not only is this whole idea for figuring out actions good for finding the best low-energy movement, it also gives the machine the ability to rank-order complexity somewhat objectively.
