A sensing policy is proposed for the restless multi-armed bandit problem with stationary but unknown reward distributions. The work is presented in the context of cognitive radios, in which the bandit problem arises when deciding which parts of the spectrum to sense and exploit. It is shown that the proposed policy attains an asymptotically logarithmic weak regret rate when the rewards are bounded and independent and identically distributed, or finite-state Markovian. Simulation results verifying uniformly logarithmic weak regret are also presented. The proposed policy is a centrally coordinated index policy in which the index of a frequency band consists of a sample mean term and a confidence term. The sample mean term promotes spectrum exploitation, whereas the confidence term encourages exploration. The confidence term is designed such that the time interval between consecutive sensing instances of any suboptimal band grows exponentially. This exponential growth between suboptimal sensing time instances leads to logarithmically growing weak regret. Simulation results demonstrate that the proposed policy performs better than other similar methods in the literature.
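The structure of such an index policy can be illustrated with a minimal sketch. The abstract does not state the confidence term, so the sketch below substitutes the standard UCB1 bonus, sqrt(2 ln t / n), as a stand-in; it is not the confidence term designed in this work. The sketch further assumes a single band sensed per time slot and i.i.d. Bernoulli rewards (1 when the sensed band is idle). The names `band_index`, `simulate`, and `idle_probs` are hypothetical.

```python
import math
import random


def band_index(sample_mean, n_sensed, t):
    """Index of one band: exploitation term plus exploration term.

    The confidence term is the standard UCB1 bonus, used as a stand-in;
    it is NOT the confidence term designed in the paper.
    """
    return sample_mean + math.sqrt(2.0 * math.log(t) / n_sensed)


def simulate(idle_probs, horizon, seed=0):
    """Sense one band per slot; the reward is 1 if the band is found idle.

    idle_probs: assumed idle probability of each band (unknown to the policy).
    Returns the number of times each band was sensed and its sample mean.
    """
    rng = random.Random(seed)
    k = len(idle_probs)
    counts = [0] * k   # times each band has been sensed
    means = [0.0] * k  # sample mean reward of each band

    for t in range(1, horizon + 1):
        if t <= k:
            band = t - 1  # sense every band once to initialise the means
        else:
            # Sense the band with the largest index in this slot.
            band = max(range(k), key=lambda i: band_index(means[i], counts[i], t))
        reward = 1.0 if rng.random() < idle_probs[band] else 0.0
        counts[band] += 1
        # Incremental update of the sample mean.
        means[band] += (reward - means[band]) / counts[band]
    return counts, means


if __name__ == "__main__":
    counts, means = simulate([0.2, 0.5, 0.8], horizon=5000)
    print("times sensed per band:", counts)
    print("sample means per band:", [round(m, 3) for m in means])
```

In a run of this sketch, the sensing counts of the two suboptimal bands should grow far more slowly than the horizon, which is the behaviour that a logarithmic regret rate describes. The exponentially growing gap between suboptimal sensing instances is a property of the confidence term proposed in this work and is not guaranteed by the UCB1 stand-in.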