Objective functions and action distributions in MARL

One thing that kind of goes unquestioned in the deep learning and reinforcement learning community is an agent having a fixed distribution of actions. A policy parameterized by some deep neural network maps an observation of an environment to an action or set of actions that the agent would take in that environment. The agent learns this policy function, which captures the relationship between an observation and an optimal action.
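
For concreteness, here is a minimal sketch of that standard setup in PyTorch. The dimensions, layer sizes, and the choice of PyTorch are placeholders of mine for illustration, not anything prescribed above:

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """Maps an observation to a distribution over a fixed, pre-defined action set."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one logit per pre-defined action
        )

    def forward(self, obs: torch.Tensor) -> torch.distributions.Categorical:
        # the action set itself is fixed up front; only the probabilities
        # assigned to those actions are learned
        return torch.distributions.Categorical(logits=self.net(obs))

policy = Policy(obs_dim=8, n_actions=4)
action = policy(torch.randn(8)).sample()  # an index into the fixed action set
```

Notice that the network can only reweight the actions someone already chose; it can never invent a new one, which is exactly the assumption I want to question.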

This type of learning still feels a little bit too supervised to me. The prior action distribution comes from a human intelligence: someone decided that these specific actions were the right ones for the system to perform well. As we all know, human intelligence and decision making are far from perfect. This kind of prior assumption makes the computation more tractable; attempting to learn an optimal prior on top of the policy function itself is a meta-heuristic problem and thereby intractable. In order to make computation tractable, we must make some simplifying assumptions about the nature of the optimal action distribution, but we might also want to give the agent enough flexibility to learn actions or action patterns that were not explicitly hard-coded. I think this concern is especially relevant in multi-agent systems in which multiple agents work together to accomplish a larger objective - larger physically and conceptually than the objectives of any single agent. In this type of system with hierarchical objectives, an action distribution for accomplishing the higher order objective is not explicitly stated and must be learned, which makes the problem computationally difficult. The problem is that this second order, larger objective cannot explicitly communicate itself through a centralized agent if the underlying system is decentralized; the means by which this objective is enforced and optimized become more abstract and indirect.

This is when I think it might be more useful to look at an agent as a means to an end for an objective. We typically look at the agent as having an objective - and learning a policy to optimize that objective - which is meaningful for our usual intents and purposes. Instead, I encourage the reader to think about the objective as having the agent, and using that agent to optimize itself. The agent becomes an intermediary that bends to the will of an objective. This way of thinking is a bit harder to conceptualize because we think of ourselves as being agents who possess objectives. Looking at the objective-agent relationship in this new way expands our understanding of what an agent is; it goes from something that possesses an objective to something that is possessed by an objective. In the latter case, the agent becomes something less clear or familiar; the form it takes can become more abstract and more variable in structure, with that structure becoming more submissive to the objective. The idea that was holding us back is something like, “Objectives don’t exist without an agent to carry them out.” My theory is that agents would not exist if it were not for an objective’s will to actualize itself and manipulate an environment. I don’t necessarily care about proving this theory; I care more about experimenting with the thought process to see if I can get anywhere with it.

Back to the idea of multi-agent systems and multi-layered objectives. Take traffic light control as an example of a multi-agent problem. Each individual agent (an intersection, say) has a fixed action distribution describing what it can and cannot do. The individual agents have a localized objective function which they try to maximize by learning an optimal policy (an observation-to-action mapping). But now we ask the question: what do we really want out of a traffic signal control system? We want these independent, decentralized agents to work together to create an intelligent system that works on a large scale, adapting to changes and conforming to some higher objective. The problem is that this larger objective does not have an obvious agent through which it is able to actualize its will. The larger scale objective has no means to manipulate its environment, primarily because it has no set action distribution with which it can do so. This is the same meta-heuristic problem I discussed earlier: the intractability of learning the prior distribution of possible actions while learning the optimal policy. Some kind of action distribution for satisfying the higher order objective could be formed by influencing the objectives of the individual agents in some way - like a puppeteer. I like to think about these objectives as having sort of a consciousness and will to manifest themselves - the real difficulty is figuring out how. The higher level objective must work through a faculty that has agency and the ability to manipulate the environment. Thus, the objective that had existed separate from the agent must use the agent or system of agents to accomplish its will.
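
To make the setup concrete, here is a toy sketch of that decomposition. The agent structure, phase names, and queue-length rewards are invented for illustration; this is not a real traffic simulator:

```python
from dataclasses import dataclass

PHASES = ["NS_green", "EW_green"]  # fixed, hand-chosen action set for every intersection

@dataclass
class IntersectionAgent:
    name: str

    def act(self, local_obs) -> str:
        # placeholder greedy policy: serve whichever approach is more congested
        ns_queue, ew_queue = local_obs
        return "NS_green" if ns_queue >= ew_queue else "EW_green"

    def local_reward(self, local_obs) -> float:
        # localized objective: minimize queue length at this intersection only
        return -float(sum(local_obs))

def system_reward(all_local_obs) -> float:
    # the higher order objective: network-wide flow, measured across every agent,
    # but with no agent of its own through which to act
    return -float(sum(sum(obs) for obs in all_local_obs))
```

The asymmetry in the sketch is the whole point: `local_reward` has an obvious owner, while `system_reward` is just a number computed over everyone, with no hands of its own.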

I think that the problem of traffic signal control can be optimized with two objectives: one that inspires the individual agent to take greedy actions, and another that ensures that agents cooperate with each other to create a smooth flowing system. The former exists within each agent separately; each agent is incentivized to act only in self interest to optimize its own traffic flow. Understand that this objective exists at the level of the agent because that is where the reward exists; the agent receives feedback from its environment as a direct result of its actions. The latter objective exists at the level of the entire system, because that is where the reward and the metric exist. This is to say that the problem of ‘smooth flow’ across a network is not assigned to a single agent, but rather to the entire system. The difficulty with the latter is the disconnect between the level at which the reward is received and the level at which manipulation must occur. It is easy for the first objective, because the individual agent both manipulates (takes actions) and receives rewards from the environment. It is a more difficult problem for the higher order objective because the level on which the reward exists is not the level on which agency exists, so it is not directly evident how the individual should change its behavior to influence an objective of a higher order. Just like the objective that motivates the agent to be competitive and selfish, the objective that motivates the agents to cooperate with each other must express itself through the actions of the individual agent.
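
One naive way to let the higher order objective express itself through the individual agent is to blend the local reward with the system-level metric when each agent's policy is updated. This is only a sketch of the idea, and `alpha` is a made-up knob, not something derived above:

```python
def shaped_reward(local_reward: float, system_reward: float, alpha: float = 0.5) -> float:
    """Reward actually used to update a single agent's policy.

    alpha = 0 -> a purely selfish agent
    alpha = 1 -> an agent trained only on the system-level metric
    """
    return (1.0 - alpha) * local_reward + alpha * system_reward
```

Even this crude mixing makes the puppeteer picture literal: the system-level objective reaches the environment only by leaning on the reward signal of agents that can actually act.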

What concerns me particularly in the discussion of multi-agent systems and multi-layered objectives is the formulation of an optimal action distribution. This meta-heuristic problem is especially interesting to me because there is already a formulated action distribution (the actions that are available); the real problem comes when we try to make sense of it and formulate a usable one. What one might refer to as the ‘centralized’ action distribution is a combinatorially large action space that covers the combinations of all the agents’ actions. This centralized action space, although mitigating the initial problem of finding a suitable action distribution, creates another problem of intractable computation as we introduce more and more agents into the system. The primary reason the centralized system gets around the ‘prior’ issue is that the scale on which the reward is given is the same as the scale on which an action is taken. We may call this kind of action a ‘joint’ action, because it is the conglomerate action of many agents. Scientists wanting to adhere more to research of the past may be more inclined to model multi-agent systems with centralized agents that take joint actions, because then we can use Markovian processes to model and learn transitions and the optimal actions the centralized agent can take in the environment. But these centralized models are simply not scalable; the computation explodes, and adding more agents to the network doesn’t work. Decentralizing computation allows for a more tractable form of computation that does not grow exponentially with the number of agents, but it comes with its own problems. The primary problem is that, because the physical scale on which actions are taken is much smaller than the metric of collective efficiency, the agents cannot perform their actions with full knowledge of the system. The agent’s observation is extremely limited in the scope of the entire system, so it does not have enough information to make a proper decision - the partial observability takes us away from being able to use Markov processes to learn the system. Many papers concerning the implementation of decentralized algorithms in multi-agent systems discuss improving the partial observability of each agent: a single agent can use the observations of neighboring agents to better understand how to make the best decision it can given the current state of the environment. Remember, the actions of a single agent influence a metric beyond the measurement of that agent (beyond the measurement of any single agent). The more we can expand the vision of a single agent, the better. I quite like this idea, but I feel that, taken too far, it can introduce other problems we don’t want, such as nonstationarity or simply taking on too much for a single agent.
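
A quick back-of-the-envelope sketch of both points - the combinatorial blow-up of the joint action space and the neighbor-observation fix. The function and variable names here are mine, purely for illustration:

```python
# centralized view: with k actions per agent and n agents, the joint action
# space has k**n elements
k = 2  # e.g. two signal phases per intersection
for n in (2, 5, 10, 20):
    print(n, "agents ->", k ** n, "joint actions")  # 20 agents -> 1,048,576

# decentralized view: each agent keeps its own small action set, but widens its
# observation with its neighbors' observations to soften partial observability
def augmented_observation(own_obs, neighbor_obs_list):
    # plain concatenation; real methods might use message passing or attention,
    # this is only the idea
    return list(own_obs) + [x for obs in neighbor_obs_list for x in obs]
```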

There are two large roadblocks in MARL, and they both involve the intractability of computation. Of course, anything can be solved if we throw enough computational power at it; the beauty is in figuring out how to do it with a limited amount of computational power, and how to work within the bounds of what is known to create something new. One of the two roadblocks is the intractability of scaling up multi-agent systems with a centralized agent. The other is attempting to learn the meta-heuristic of a prior action distribution. I have touched on the idea of symmetrical priors in previous blogs. If agents can assume symmetries in information and behaviors, their knowledge of the local environment can be extrapolated to other parts of the environment that are beyond their direct observation. If symmetrical priors exist within the actions of agents, the individual agent can learn how other agents would respond to its action. The implications of symmetrical priors in decentralized structures go far beyond what I will be discussing in this blog, but one thing I will discuss briefly is how they form a system that can react to larger scale problems. A single decentralized agent that can predict how another agent will act, given that agent’s observation, can control that agent’s actions, and in turn have its own actions controlled by other decentralized agents. The final point I want to touch on is to revisit our idea of what an agent is. Our idea of a centralized agent should be different from our idea of a decentralized agent. A decentralized agent does not represent the objective function completely; it is a mechanism through which a centralized, abstract agent accomplishes its objective.
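
One concrete (and admittedly narrow) way to read “symmetrical priors” is parameter sharing across agents. This is my own illustration rather than the exact construction from the earlier blogs:

```python
import torch
import torch.nn as nn

# all agents share one set of policy parameters, so any agent can predict a
# neighbor's likely action simply by running the shared policy on that
# neighbor's observation
shared_policy = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))

def predict_neighbor_action(neighbor_obs: torch.Tensor) -> int:
    # under the symmetry assumption, the neighbor acts exactly as I would in its place
    with torch.no_grad():
        logits = shared_policy(neighbor_obs)
    return int(logits.argmax())
```

Under that assumption, each agent carries a small model of every other agent for free, which is what gives the decentralized system its ability to respond to problems larger than any single agent can see.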
