Commit

add support to py3
I hit some errors when I installed vizdoom, so I just modified these files.
wwxFromTju committed Mar 9, 2017
1 parent f711b86 commit 0685c12
Showing 9 changed files with 865 additions and 218 deletions.
128 changes: 87 additions & 41 deletions Contextual-Policy.ipynb
@@ -2,7 +2,10 @@
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"# Simple Reinforcement Learning in Tensorflow Part 1.5: \n",
"## The Contextual Bandits\n",
@@ -15,7 +18,9 @@
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
Expand All @@ -26,17 +31,22 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### The Contextual Bandits\n",
"Here we define our contextual bandits. In this example, we are using three four-armed bandit. What this means is that each bandit has four arms that can be pulled. Each bandit has different success probabilities for each arm, and as such requires different actions to obtain the best result. The pullBandit function generates a random number from a normal distribution with a mean of 0. The lower the bandit number, the more likely a positive reward will be returned. We want our agent to learn to always choose the bandit-arm that will most often give a positive reward, depending on the Bandit presented."
]
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 2,
"metadata": {
"collapsed": true
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@@ -66,17 +76,22 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### The Policy-Based Agent\n",
"The code below established our simple neural agent. It takes as input the current state, and returns an action. This allows the agent to take actions which are conditioned on the state of the environment, a critical step toward being able to solve full RL problems. The agent uses a single set of weights, within which each value is an estimate of the value of the return from choosing a particular arm given a bandit. We use a policy gradient method to update the agent by moving the value for the selected action toward the recieved reward."
]
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 3,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
@@ -102,49 +117,57 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Training the Agent"
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We will train our agent by getting a state from the environment, take an action, and recieve a reward. Using these three things, we can know how to properly update our network in order to more often choose actions given states that will yield the highest rewards over time."
]
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 4,
"metadata": {
"collapsed": false
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Mean reward for the 3 bandits: [ 0. -0.25 0. ]\n",
"Mean reward for the 3 bandits: [ 9. 42. 33.75]\n",
"Mean reward for the 3 bandits: [ 45.5 80. 67.75]\n",
"Mean reward for the 3 bandits: [ 86.25 116.75 101.25]\n",
"Mean reward for the 3 bandits: [ 122.5 153.25 139.5 ]\n",
"Mean reward for the 3 bandits: [ 161.75 186.25 179.25]\n",
"Mean reward for the 3 bandits: [ 201. 224.75 216. ]\n",
"Mean reward for the 3 bandits: [ 240.25 264. 250. ]\n",
"Mean reward for the 3 bandits: [ 280.25 301.75 285.25]\n",
"Mean reward for the 3 bandits: [ 317.75 340.25 322.25]\n",
"Mean reward for the 3 bandits: [ 356.5 377.5 359.25]\n",
"Mean reward for the 3 bandits: [ 396.25 415.25 394.75]\n",
"Mean reward for the 3 bandits: [ 434.75 451.5 430.5 ]\n",
"Mean reward for the 3 bandits: [ 476.75 490. 461.5 ]\n",
"Mean reward for the 3 bandits: [ 513.75 533.75 491.75]\n",
"Mean reward for the 3 bandits: [ 548.25 572. 527.5 ]\n",
"Mean reward for the 3 bandits: [ 587.5 610.75 562. ]\n",
"Mean reward for the 3 bandits: [ 628.75 644.25 600.25]\n",
"Mean reward for the 3 bandits: [ 665.75 684.75 634.75]\n",
"Mean reward for the 3 bandits: [ 705.75 719.75 668.25]\n",
"Mean reward for each of the 3 bandits: [ 0. 0. 0.25]\n",
"Mean reward for each of the 3 bandits: [ 26.5 38.25 35.5 ]\n",
"Mean reward for each of the 3 bandits: [ 68.25 75.25 70.75]\n",
"Mean reward for each of the 3 bandits: [ 104.25 112.25 107.25]\n",
"Mean reward for each of the 3 bandits: [ 142.5 147.5 145.75]\n",
"Mean reward for each of the 3 bandits: [ 181.5 185.75 178.5 ]\n",
"Mean reward for each of the 3 bandits: [ 215.5 223.75 220. ]\n",
"Mean reward for each of the 3 bandits: [ 256.5 260.75 249.5 ]\n",
"Mean reward for each of the 3 bandits: [ 293.5 300.25 287.5 ]\n",
"Mean reward for each of the 3 bandits: [ 330.25 341. 323.5 ]\n",
"Mean reward for each of the 3 bandits: [ 368.75 377. 359. ]\n",
"Mean reward for each of the 3 bandits: [ 411.5 408.75 395. ]\n",
"Mean reward for each of the 3 bandits: [ 447. 447. 429.75]\n",
"Mean reward for each of the 3 bandits: [ 484. 482.75 466. ]\n",
"Mean reward for each of the 3 bandits: [ 522.5 520. 504.75]\n",
"Mean reward for each of the 3 bandits: [ 560.25 557.75 538.25]\n",
"Mean reward for each of the 3 bandits: [ 597.75 596.25 574.75]\n",
"Mean reward for each of the 3 bandits: [ 636.5 630.5 611.25]\n",
"Mean reward for each of the 3 bandits: [ 675.25 670. 644.5 ]\n",
"Mean reward for each of the 3 bandits: [ 710.5 706.5 682.75]\n",
"The agent thinks action 4 for bandit 1 is the most promising....\n",
"...and it was right!\n",
"The agent thinks action 2 for bandit 2 is the most promising....\n",
@@ -189,34 +212,57 @@
" #Update our running tally of scores.\n",
" total_reward[s,action] += reward\n",
" if i % 500 == 0:\n",
" print \"Mean reward for each of the \" + str(cBandit.num_bandits) + \" bandits: \" + str(np.mean(total_reward,axis=1))\n",
" print(\"Mean reward for each of the \" + str(cBandit.num_bandits) + \" bandits: \" + str(np.mean(total_reward,axis=1)))\n",
" i+=1\n",
"for a in range(cBandit.num_bandits):\n",
" print \"The agent thinks action \" + str(np.argmax(ww[a])+1) + \" for bandit \" + str(a+1) + \" is the most promising....\"\n",
" print(\"The agent thinks action \" + str(np.argmax(ww[a])+1) + \" for bandit \" + str(a+1) + \" is the most promising....\")\n",
" if np.argmax(ww[a]) == np.argmin(cBandit.bandits[a]):\n",
" print \"...and it was right!\"\n",
" print(\"...and it was right!\")\n",
" else:\n",
" print \"...and it was wrong!\""
" print(\"...and it was wrong!\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 2",
"display_name": "Python 3",
"language": "python",
"name": "python2"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11"
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
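
Apart from the two new empty cells, the substantive changes in this file are the conversion of Python 2 `print` statements to Python 3 `print()` calls and the switch of the kernelspec and language_info metadata from Python 2 to Python 3. As a hedged side note (not what this commit does), a notebook that needs to stay runnable on both interpreters could instead keep a single code path via a `__future__` import:

```python
# Makes print() behave as a function on Python 2 as well as Python 3,
# so the same cells run unchanged on either interpreter.
from __future__ import print_function

print("Mean reward for each of the 3 bandits:", [0.0, 0.0, 0.25])
```
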