Decentralized partially observable Markov decision process

The decentralized partially observable Markov decision process (Dec-POMDP){{Cite journal
| last1=Bernstein | first1=Daniel S.
| last2=Givan | first2=Robert
| last3=Immerman | first3=Neil
| last4=Zilberstein | first4=Shlomo
| date=November 2002
| title=The Complexity of Decentralized Control of Markov Decision Processes
| journal=Mathematics of Operations Research
| volume=27
| issue=4
| pages=819–840
| doi=10.1287/moor.27.4.819.297
| issn=0364-765X
| arxiv=1301.3836
| s2cid=1195261}}{{Cite book|title=A Concise Introduction to Decentralized POMDPs|last1=Oliehoek|first1=Frans A.|last2=Amato|first2=Christopher|language=en-gb|doi=10.1007/978-3-319-28929-8|series = SpringerBriefs in Intelligent Systems|year = 2016|isbn = 978-3-319-28927-4|s2cid=3263887|url = http://www.fransoliehoek.net/docs/OliehoekAmato16book.pdf}} is a model for coordination and decision-making among multiple agents. It is a probabilistic model that accounts for uncertainty in outcomes, sensors and communication (that is, communication may be costly, delayed, noisy or nonexistent).

It generalizes the Markov decision process (MDP) and the partially observable Markov decision process (POMDP) to settings with multiple decentralized agents.{{Cite book|last1=Oliehoek|first1=Frans A.|url=https://books.google.com/books?id=FZRPDAAAQBAJ&q=Decentralized+partially+observable+Markov+decision+process|title=A Concise Introduction to Decentralized POMDPs|last2=Amato|first2=Christopher|date=2016-06-03|publisher=Springer|isbn=978-3-319-28929-8|language=en}}

Definition

= Formal definition =

A Dec-POMDP is a 7-tuple $(S,\{A_i\},T,R,\{\Omega_i\},O,\gamma)$, where

$S$ is a set of states,
$A_i$ is a set of actions for agent $i$, where $A=\times_i A_i$ is the set of joint actions,
$T$ is a set of conditional transition probabilities between states, $T(s,a,s')=P(s'\mid s,a)$,
$R: S \times A \to \mathbb{R}$ is the reward function,
$\Omega_i$ is a set of observations for agent $i$, where $\Omega=\times_i \Omega_i$ is the set of joint observations,
$O$ is a set of conditional observation probabilities, $O(s',a,o)=P(o\mid s',a)$, and
$\gamma \in [0, 1]$ is the discount factor.
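
As a rough illustration, the tuple could be stored in code as follows. This is a minimal sketch in Python, not taken from the cited sources: the class name `DecPOMDP`, its field names and the `validate` helper are hypothetical, and the sketch assumes finite state, action and observation sets with $T$ and $O$ held as nested dictionaries of probabilities.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List, Tuple

JointAction = Tuple[str, ...]       # one action per agent
JointObservation = Tuple[str, ...]  # one observation per agent


@dataclass
class DecPOMDP:
    """Hypothetical container for the 7-tuple (S, {A_i}, T, R, {Omega_i}, O, gamma)."""

    states: List[str]              # S
    actions: List[List[str]]       # A_i for each agent i
    observations: List[List[str]]  # Omega_i for each agent i
    T: Dict[Tuple[str, JointAction], Dict[str, float]]               # T[(s, a)][s'] = P(s' | s, a)
    R: Dict[Tuple[str, JointAction], float]                          # R[(s, a)]
    O: Dict[Tuple[str, JointAction], Dict[JointObservation, float]]  # O[(s', a)][o] = P(o | s', a)
    gamma: float                   # discount factor

    def joint_actions(self) -> List[JointAction]:
        # A is the Cartesian product of the per-agent action sets.
        return list(product(*self.actions))

    def joint_observations(self) -> List[JointObservation]:
        # Omega is the Cartesian product of the per-agent observation sets.
        return list(product(*self.observations))

    def validate(self, tol: float = 1e-9) -> None:
        """Check gamma and that every row of T and O is a probability distribution."""
        if not 0.0 <= self.gamma <= 1.0:
            raise ValueError("gamma must lie in [0, 1]")
        for s in self.states:
            for a in self.joint_actions():
                if (s, a) not in self.R:
                    raise ValueError(f"missing reward for {(s, a)}")
                for name, table in (("T", self.T), ("O", self.O)):
                    total = sum(table[(s, a)].values())
                    if abs(total - 1.0) > tol:
                        raise ValueError(f"{name}[{(s, a)}] sums to {total}, not 1")
```

Storing joint actions and joint observations as tuples keeps the per-agent components visible, which matters because each agent only chooses its own action and only sees its own observation.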

At each time step, each agent takes an action $a_i \in A_i$, the state updates according to the transition function $T(s,a,s')$ (using the current state and the joint action), each agent receives an observation according to the observation function $O(s',a,o)$ (using the next state and the joint action), and a reward is generated for the whole team according to the reward function $R(s,a)$. These time steps repeat until some given horizon (the finite-horizon case) or forever (the infinite-horizon case). The goal is to maximize the expected cumulative reward over these steps. In the infinite-horizon case, the discount factor is restricted to $\gamma \in [0,1)$, which keeps the discounted sum of rewards finite.
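
The per-step dynamics can be sketched as a short simulation. The following Python code is illustrative only: the function names `sample` and `rollout`, the dictionary layout of `T`, `R` and `O`, and the toy one-state problem are all assumptions made for this example. It also assumes that each agent's policy maps that agent's own observation history to an action, and that rewards are discounted by $\gamma^t$ at step $t$.

```python
import random


def sample(dist, rng):
    """Draw one outcome from a dict mapping outcomes to probabilities."""
    outcomes = list(dist)
    weights = [dist[k] for k in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def rollout(T, R, O, gamma, policies, s0, horizon, seed=0):
    """Simulate `horizon` steps and return the discounted team reward.

    T[(s, a)] is a distribution over next states, R[(s, a)] is a number,
    and O[(s_next, a)] is a distribution over joint observations.
    """
    rng = random.Random(seed)
    state = s0
    histories = [() for _ in policies]  # each agent sees only its own history
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        # Each agent picks its action from its local observation history.
        joint_action = tuple(pi(h) for pi, h in zip(policies, histories))
        # One shared reward for the whole team, from R(s, a).
        total += discount * R[(state, joint_action)]
        # The state updates from the current state and the joint action.
        next_state = sample(T[(state, joint_action)], rng)
        # Observations depend on the next state and the joint action.
        joint_obs = sample(O[(next_state, joint_action)], rng)
        histories = [h + (o,) for h, o in zip(histories, joint_obs)]
        state = next_state
        discount *= gamma
    return total


# Toy problem (hypothetical): one state, two agents, reward 1 when actions match.
joint = [(x, y) for x in "ab" for y in "ab"]
T = {("s", a): {"s": 1.0} for a in joint}
R = {("s", a): 1.0 if a[0] == a[1] else 0.0 for a in joint}
O = {("s", a): {("n", "n"): 1.0} for a in joint}


def always_a(history):
    return "a"


value = rollout(T, R, O, gamma=0.5, policies=[always_a, always_a], s0="s", horizon=3)
```

In the toy problem both agents always choose `a`, so the team earns a reward of 1 at every step and `value` is the discounted sum over three steps.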

References

External links

[http://masplan.org masplan.org]
[http://rbr.cs.umass.edu/camato/decpomdp/ The Dec-POMDP page]

Category:Markov processes