Decentralized partially observable Markov decision process

The decentralized partially observable Markov decision process (Dec-POMDP)[1][2] is a model for coordination and decision-making among multiple agents. It is a probabilistic model that can consider uncertainty in outcomes, sensors and communication (i.e., costly, delayed, noisy or nonexistent communication).

It is a generalization of a Markov decision process (MDP) and a partially observable Markov decision process (POMDP) to consider multiple decentralized agents.[3]

Definition

Formal definition

A Dec-POMDP is a 7-tuple [math]\displaystyle{ (S,\{A_i\},T,R,\{\Omega_i\},O,\gamma) }[/math], where

  • [math]\displaystyle{ S }[/math] is a set of states,
  • [math]\displaystyle{ A_i }[/math] is a set of actions for agent i, with [math]\displaystyle{ A=\times_i A_i }[/math] the set of joint actions,
  • [math]\displaystyle{ T }[/math] is a set of conditional transition probabilities between states, [math]\displaystyle{ T(s,a,s')=P(s'\mid s,a) }[/math],
  • [math]\displaystyle{ R: S \times A \to \mathbb{R} }[/math] is the reward function,
  • [math]\displaystyle{ \Omega_i }[/math] is a set of observations for agent i, with [math]\displaystyle{ \Omega=\times_i \Omega_i }[/math] the set of joint observations,
  • [math]\displaystyle{ O }[/math] is a set of conditional observation probabilities [math]\displaystyle{ O(s',a, o)=P(o\mid s',a) }[/math], and
  • [math]\displaystyle{ \gamma \in [0, 1] }[/math] is the discount factor.
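
As a purely illustrative sketch, the tuple above can be held in a small Python container. All names here (the DecPOMDP class, the agent_actions and observation fields, the toy "good"/"bad" problem, and the 0.8 observation accuracy) are hypothetical choices made for this example; they are not part of any standard library or of the cited references.

 import itertools
 from dataclasses import dataclass
 from typing import Callable, List, Tuple
 
 JointAction = Tuple[str, ...]
 JointObs = Tuple[str, ...]
 
 @dataclass
 class DecPOMDP:
     """Container mirroring the tuple (S, {A_i}, T, R, {Omega_i}, O, gamma)."""
     states: List[str]                                           # S
     agent_actions: List[List[str]]                              # A_i for each agent i
     transition: Callable[[str, JointAction, str], float]        # T(s, a, s') = P(s' | s, a)
     reward: Callable[[str, JointAction], float]                 # R(s, a)
     agent_observations: List[List[str]]                         # Omega_i for each agent i
     observation: Callable[[str, JointAction, JointObs], float]  # O(s', a, o) = P(o | s', a)
     gamma: float                                                # discount factor in [0, 1]
 
     def joint_actions(self) -> List[JointAction]:
         """A = x_i A_i, the Cartesian product of the individual action sets."""
         return list(itertools.product(*self.agent_actions))
 
     def joint_observations(self) -> List[JointObs]:
         """Omega = x_i Omega_i, the Cartesian product of the observation sets."""
         return list(itertools.product(*self.agent_observations))
 
 # Toy two-agent, two-state instance (illustrative only): the hidden state flips
 # only when both agents choose "toggle"; each agent receives an independent,
 # noisy reading of the new state.
 def T(s: str, a: JointAction, s_next: str) -> float:
     flips = all(ai == "toggle" for ai in a)
     return 1.0 if (s_next != s) == flips else 0.0
 
 def R(s: str, a: JointAction) -> float:
     return 1.0 if s == "good" and all(ai == "wait" for ai in a) else 0.0
 
 def O(s_next: str, a: JointAction, o: JointObs) -> float:
     # Each agent independently observes the true state with probability 0.8.
     p = 1.0
     for oi in o:
         p *= 0.8 if oi == s_next else 0.2
     return p
 
 model = DecPOMDP(
     states=["good", "bad"],
     agent_actions=[["wait", "toggle"], ["wait", "toggle"]],
     transition=T,
     reward=R,
     agent_observations=[["good", "bad"], ["good", "bad"]],
     observation=O,
     gamma=0.95,
 )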

At each time step, each agent takes an action [math]\displaystyle{ a_i \in A_i }[/math], the state updates according to the transition function [math]\displaystyle{ T(s,a,s') }[/math] (using the current state and the joint action), each agent receives an observation according to the observation function [math]\displaystyle{ O(s',a, o) }[/math] (using the next state and the joint action), and a reward is generated for the whole team according to the reward function [math]\displaystyle{ R(s,a) }[/math]. These steps repeat until some given horizon (the finite-horizon case) or forever (the infinite-horizon case). The goal is to maximize the expected cumulative reward over this finite or infinite number of steps. In the infinite-horizon case, the discount factor [math]\displaystyle{ \gamma \in [0,1) }[/math] keeps the cumulative reward finite.
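
Continuing the hypothetical sketch above (it reuses the DecPOMDP instance named model), the following loop mirrors this dynamics description: each agent selects its individual action from its own observation history only, the next state is sampled from T(s, a, ·), a joint observation is sampled from O(s', a, ·), and the discounted team reward is accumulated. The helper and policy names are again assumptions made for illustration, not standard terminology.

 import random
 
 def sample(dist):
     """Draw one outcome from a list of (outcome, probability) pairs."""
     r, acc = random.random(), 0.0
     for outcome, p in dist:
         acc += p
         if r < acc:
             return outcome
     return dist[-1][0]
 
 def simulate_episode(model, policies, s0, horizon):
     """Roll out one finite-horizon episode of the Dec-POMDP sketched above.
 
     `policies` is a list of per-agent functions, each mapping that agent's own
     observation history to its next individual action. Control is decentralized:
     no agent sees the state, the joint action, or the other agents' observations.
     Returns the discounted cumulative team reward.
     """
     s = s0
     histories = [[] for _ in policies]          # one local history per agent
     total, discount = 0.0, 1.0
     for _ in range(horizon):
         # Each agent acts on its own observation history only.
         a = tuple(pi(h) for pi, h in zip(policies, histories))
         total += discount * model.reward(s, a)  # team reward R(s, a)
         # Sample the next state from T(s, a, .).
         s_next = sample([(s2, model.transition(s, a, s2)) for s2 in model.states])
         # Sample a joint observation from O(s', a, .).
         o = sample([(obs, model.observation(s_next, a, obs))
                     for obs in model.joint_observations()])
         for i, oi in enumerate(o):
             histories[i].append(oi)             # agent i sees only its own component
         s, discount = s_next, discount * model.gamma
     return total
 
 # Example: both agents always wait, starting from the "good" state.
 wait_policy = lambda history: "wait"
 print(simulate_episode(model, [wait_policy, wait_policy], "good", horizon=10))

With both agents always choosing "wait" from the "good" state, the toy instance above never leaves the rewarding state, so the printed return is simply the discounted sum of 1 over the ten steps.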

References

  1. Bernstein, Daniel S.; Givan, Robert; Immerman, Neil; Zilberstein, Shlomo (November 2002). "The Complexity of Decentralized Control of Markov Decision Processes". Mathematics of Operations Research 27 (4): 819–840. doi:10.1287/moor.27.4.819.297. ISSN 0364-765X. 
  2. Oliehoek, Frans A.; Amato, Christopher (2016). A Concise Introduction to Decentralized POMDPs. SpringerBriefs in Intelligent Systems. doi:10.1007/978-3-319-28929-8. ISBN 978-3-319-28927-4. http://www.fransoliehoek.net/docs/OliehoekAmato16book.pdf. 
  3. Oliehoek, Frans A.; Amato, Christopher (2016-06-03). A Concise Introduction to Decentralized POMDPs. Springer. ISBN 978-3-319-28929-8. https://books.google.com/books?id=FZRPDAAAQBAJ&q=Decentralized+partially+observable+Markov+decision+process. 
