
Enjoy AI in Minecraft (Malmo and MineRL)

Reinforcement learning (RL) is one of the most interesting challenges in gaming, as shown by AlphaGo and AlphaZero, and games are a great place for engineers to learn reinforcement learning.
In this post, I'll show you two amazing projects – Malmo and MineRL (closely related to each other, as you'll see later in this post) – which are used for reinforcement learning in the most popular video game, Minecraft.

I hope this will help you study reinforcement learning (RL), for instance at school or in a hackathon, since many people are familiar with Minecraft and you can enjoy a variety of missions in it.

Malmo – Installation and Settings

Now, let's begin with Malmo.

A variety of Minecraft environments are available as gym-like (not gym-based, but compatible) environments using Malmo, which is built on top of Minecraft (as a Minecraft mod) by Microsoft Research.
With Malmo, you can programmatically simulate a player's (agent's) activities and monitor the results (observations) in these environments.

First, install the Malmo package using pip as follows.

# install prerequisite packages
pip3 install gym lxml numpy pillow
# install malmo
pip3 install --index-url https://test.pypi.org/simple/ malmo

Note : Currently the malmo package for Python 3.6 or 3.7 on Linux is not in the default pip repository (https://pypi.org/simple), but it is in the test repository (https://test.pypi.org/simple), so I specified the test repository for installation as above. (In this post, I'm using Ubuntu 18.04 with Python 3.6.)
When you run on Windows, you can get these packages without the test repository.
See the following URLs for the packages included in each repository.
https://test.pypi.org/simple/malmo/
https://pypi.org/simple/malmo/

Using a bootstrap script, download the Minecraft mod onto your development machine. These files are downloaded into ./MalmoPlatform.

python3 -c "import malmo.minecraftbootstrap; malmo.minecraftbootstrap.download();"

Set the MALMO_XSD_PATH environment variable as follows.
The $HOME/MalmoPlatform below should be the location (directory) of the files downloaded above.

echo -e "export MALMO_XSD_PATH=$HOME/MalmoPlatform/Schemas" >> ~/.bashrc
source ~/.bashrc

Now download the Malmo client, helper files, and samples from the GitHub repo.

git clone https://github.com/microsoft/malmo.git
cd malmo

Malmo runs on the engine of the Minecraft Java edition (as a mod).
You should therefore install a Java runtime and set JAVA_HOME in your environment variables.

sudo apt-get install openjdk-8-jdk
echo -e "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" >> ~/.bashrc
source ~/.bashrc

Finally, set the Malmo version (which is written in the VERSION file) in ./Minecraft/src/main/resources/version.properties .
This is needed for launching Minecraft with this mod.

cd Minecraft
(echo -n "malmomod.version=" && cat ../VERSION) > ./src/main/resources/version.properties

Note : Running Malmo and MineRL is only supported on a machine with a monitor attached. (In this post, I use Ubuntu with an X remote desktop.)
If you run in a console without a monitor, run your program with a virtual monitor – such as xvfb (X Virtual Frame Buffer) – and record the screen as a video file to see the result.

# run program on xvfb
xvfb-run --listen-tcp --server-num 55 --auth-file /tmp/xvfb.auth -s "-ac -screen 0 800x600x24" python3 test01.py
# record above screen (display #55)
ffmpeg -f x11grab -video_size 800x600 -i :55 -codec:v libx264 -r 12 /home/tsmatsuz/test01.mp4
# gracefully stop recording
kill $(pgrep ffmpeg)

Note : See here for details about Malmo installation and settings.

Malmo – Run Your Program

Before running your agent, launch the Minecraft client to render your agent.
In this post, the client is started on port 9000, and later your agent (program) will attach to this listening port 9000.

cd Minecraft
./launchClient.sh -port 9000 -env

Now let's run your agent in another shell.
You can use Malmo with the raw socket API or the MalmoPython class, but in this post I use the wrapped malmoenv class in MalmoEnv/malmoenv .

If you're new to Malmo, you can run and check a Malmo agent step by step in the Python console.

cd MalmoEnv
python3
# In Python console
import malmoenv
env = malmoenv.make()

The beauty of Malmo is the flexibility to build your own environment with an XML definition.
Let's look at the following sample definition. (I copied this from an original sample in AML and changed several settings.)

<Mission xmlns="http://ProjectMalmo.microsoft.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

  <About>
    <Summary>Avoid Lava</Summary>
  </About>

  <ModSettings>
    <MsPerTick>1</MsPerTick>
  </ModSettings>

  <ServerSection>
    <ServerInitialConditions>
      <Time>
        <StartTime>0</StartTime>
        <AllowPassageOfTime>false</AllowPassageOfTime>
      </Time>
      <Weather>clear</Weather>
      <AllowSpawning>false</AllowSpawning>
    </ServerInitialConditions>
    <ServerHandlers>
      <FlatWorldGenerator generatorString="3;7,220*1,5*3,2;3;,biome_1"/>

      <DrawingDecorator>
        <DrawSphere x="-29" y="70" z="-2" radius="100" type="air"/>
        <DrawCuboid x1="-34" y1="70" z1="-7" x2="-24" y2="70" z2="3" type="lava" /> 
      </DrawingDecorator>

      <MazeDecorator>
        <Seed>random</Seed>
        <SizeAndPosition width="5" length="5" height="10" xOrigin="-32" yOrigin="69" zOrigin="-5"/>
        <StartBlock type="emerald_block" fixedToEdge="true"/>
        <EndBlock type="lapis_block" fixedToEdge="true"/>
        <PathBlock type="grass"/>
        <FloorBlock type="air"/>
        <GapBlock type="lava"/>
        <GapProbability>0.6</GapProbability>
        <AllowDiagonalMovement>false</AllowDiagonalMovement>
      </MazeDecorator>

      <ServerQuitFromTimeUp timeLimitMs="15000000" description="out_of_time"/>
      <ServerQuitWhenAnyAgentFinishes/>
    </ServerHandlers>
  </ServerSection>

  <AgentSection mode="Survival">
    <Name>Agent0</Name>

    <AgentStart>
      <Placement x="-28.5" y="71.0" z="-1.5" pitch="70" yaw="0"/>
    </AgentStart>

    <AgentHandlers>

      <VideoProducer want_depth="false">
        <Width>800</Width>
        <Height>600</Height>
      </VideoProducer>

      <ObservationFromFullInventory flat="false"/>
      <ObservationFromFullStats/>
      <ObservationFromCompass/>
      <DiscreteMovementCommands/>

      <RewardForMissionEnd>
        <Reward description="out_of_time" reward="-100" />
      </RewardForMissionEnd>

      <RewardForTouchingBlockType>
        <Block reward="-100" type="lava" behaviour="onceOnly"/>
        <Block reward="100" type="lapis_block" behaviour="onceOnly"/>
      </RewardForTouchingBlockType>

      <RewardForSendingCommand reward="-1"/>

      <AgentQuitFromTouchingBlockType>
        <Block type="lava" />
        <Block type="lapis_block" />
      </AgentQuitFromTouchingBlockType>
    </AgentHandlers>
  </AgentSection>

</Mission>

Let's walk through a brief outline of this mission file.

First, it fills lava into the range between (-34, 70, -7) and (-24, 70, 3).

<DrawingDecorator>
  ...
  <DrawCuboid x1="-34" y1="70" z1="-7" x2="-24" y2="70" z2="3" type="lava" /> 
</DrawingDecorator>

Next it randomly creates a maze, in which the start block is an emerald block and the goal block is a lapis block. The path is filled with grass blocks.

<MazeDecorator>
  <Seed>random</Seed>
  <SizeAndPosition width="5" length="5" height="10" xOrigin="-32" yOrigin="69" zOrigin="-5"/>
  <StartBlock type="emerald_block" fixedToEdge="true"/>
  <EndBlock type="lapis_block" fixedToEdge="true"/>
  <PathBlock type="grass"/>
  <FloorBlock type="air"/>
  <GapBlock type="lava"/>
  <GapProbability>0.6</GapProbability>
  <AllowDiagonalMovement>false</AllowDiagonalMovement>
</MazeDecorator>

In this mission, your agent can receive the following rewards.

  • Time is up : reward = -100
  • The agent touches lava : reward = -100
  • The agent reaches the lapis block (goal) : reward = 100
  • The agent sends a command : reward = -1 per step

<RewardForMissionEnd>
  <Reward description="out_of_time" reward="-100" />
</RewardForMissionEnd>

<RewardForTouchingBlockType>
  <Block reward="-100" type="lava" behaviour="onceOnly"/>
  <Block reward="100" type="lapis_block" behaviour="onceOnly"/>
</RewardForTouchingBlockType>

<RewardForSendingCommand reward="-1"/>

When the agent reaches the lava (which results in death) or the lapis block (goal), the episode ends.

<AgentQuitFromTouchingBlockType>
  <Block type="lava" />
  <Block type="lapis_block" />
</AgentQuitFromTouchingBlockType>

As in a usual Minecraft game, the world is determined by a seed, which is generated automatically by default.
When you need the same world for debugging or testing, you can set a seed manually in the mission file (XML). (The maze also has its own seed, so you can replay the same maze by fixing it.)
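
For example, here is a minimal sketch (the file name lava_maze.xml and the seed value 12345 are just placeholders) that fixes the maze seed by editing the mission XML string before initializing the environment.

from pathlib import Path

# A minimal sketch: replace the random maze seed with a fixed value
# (12345 is arbitrary) so that the same maze is generated every time.
xml = Path('lava_maze.xml').read_text()
xml = xml.replace('<Seed>random</Seed>', '<Seed>12345</Seed>')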

When you run a program, each step has at least some interval (in milliseconds) by default, which is defined by MsPerTick in the mission file. For instance, when you set MsPerTick=5, each step takes at least 5 milliseconds; if the agent issues the next step within 5 milliseconds, the step will wait (sleep) until the 5 milliseconds have elapsed.
ServerQuitFromTimeUp means that the mission will end after timeLimitMs milliseconds. After the mission has ended, the agent cannot take an action.

You can also use other biomes and insert various mobs in certain areas via the mission file.
See other samples in the GitHub repo.

Now, initialize an environment using this mission file as follows. We assume the file name is lava_maze.xml.

from pathlib import Path
xml = Path('lava_maze.xml').read_text()
env.init(xml,
  port=9000,
  server='127.0.0.1',
  role=0,  # index of AgentHandlers
  exp_uid='test1',
  episode=0, # start episode number
  resync=0) # exit and re-sync every N resets (0 means never)

When you reset (start) an environment, the agent enters the generated Minecraft world (which was configured previously in the XML).

obs = env.reset()

Fig. output screen (here I’ve changed to third-person view by F5 key.)

Note : In this screen, you can use your familiar Minecraft commands, such as the F5 key.

The returned obs (which is called an observation in RL) is an integer array of RGB pixels.
In our example, the size of this array is 800 x 600 x 3 = 1,440,000.
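
For instance, here is a minimal sketch (assuming obs is the flat uint8 array described above) that reshapes it into an image and saves the frame for inspection.

import numpy as np
from PIL import Image

# A minimal sketch: reshape the flat observation into height x width x channels
# and save the frame as a PNG file for inspection.
frame = np.asarray(obs, dtype=np.uint8).reshape((600, 800, 3))
Image.fromarray(frame).save('frame0.png')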

Now let's make the agent move to the goal (a lapis block) as follows.

obs, reward, done, info = env.step(0) # forward
obs, reward, done, info = env.step(2) # turn right
obs, reward, done, info = env.step(0) # forward
obs, reward, done, info = env.step(0) # forward
obs, reward, done, info = env.step(2) # turn right
obs, reward, done, info = env.step(0) # forward
obs, reward, done, info = env.step(3) # turn left
obs, reward, done, info = env.step(0) # forward

When the agent has reached the lapis block, it receives 100 in reward and True in done. (The episode ends, since the agent has reached the lapis block.)

Note : The returned reward of step() will be delayed (postponed) in malmoenv.

With these gym-like environments, you can simulate an agent and build your own learner using a variety of algorithms, such as plain Q-learning, DQN, DDQN, PPO, and so on.
Later in this post, I'll show you a sample of reinforcement learning.
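
As a starting point before any learning, here is a minimal sketch of a random-policy rollout on the environment initialized above; it simply samples actions until the episode ends.

import time

# A minimal sketch: run one episode with a random policy (no learning).
obs = env.reset()
done = False
total_reward = 0
while not done:
  action = env.action_space.sample()  # pick a random discrete action
  obs, reward, done, info = env.step(action)
  total_reward += reward
  time.sleep(0.1)  # give the Minecraft client a little time between steps
print('episode finished, total reward =', total_reward)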

MineRL – Installation and Settings

MineRL includes a fork of Malmo, so you can also use mission files (.xml) with OpenAI gym-integrated environments (MineRLEnv, which inherits from gym.core.Env).

Setting up MineRL is very simple.
Just install the package with pip and that's all. (The dependent package, gym, is also installed.)

pip3 install minerl

Note : You can see the mission files in ~/.local/minerl/herobraine/env_specs/missions.

Like Malmo, MineRL also simulates the agent in Minecraft using the engine of the Minecraft Java edition.
You should therefore install a Java runtime and set JAVA_HOME in your environment.

sudo apt-get install openjdk-8-jdk
echo -e "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" >> ~/.bashrc
source ~/.bashrc

Note : As with Malmo, you need a monitor, even when you run on Linux. (Use a virtual monitor when you run your program as a batch job.)

MineRL – Run Your Program

Now, run your program on MineRL.

Using the minerl package, a variety of Minecraft environments are available in a gym-integrated manner.
For instance, the following loads an environment in which the agent gets rewards for obtaining diamonds.
On building the environment, the Minecraft Java edition is launched automatically, so you don't need to launch Minecraft by yourself.

import gym
import minerl

# this will take a while, since the data is so large.
env = gym.make('MineRLObtainDiamond-v0')

Note : MineRL also provides environments for competitions, such as MineRLObtainDiamondDenseVectorObf-v0. (These environments include “VectorObf” in their names.)
In competition environments, both actions and observations are featurized (vectorized) for competition use.
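
For reference, here is a small sketch of inspecting the spaces of such a competition environment; the exact shapes in the comments are assumptions and may vary by environment version.

import gym
import minerl

# A small sketch: in "VectorObf" competition environments, observations and
# actions are flat vectors in addition to the pov image.
env = gym.make('MineRLObtainDiamondDenseVectorObf-v0')
print(env.observation_space)  # e.g. Dict(pov: Box(64, 64, 3), vector: Box(64,))
print(env.action_space)       # e.g. Dict(vector: Box(64,))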

When you reset (start) a gym environment, the agent enters the generated Minecraft world.
As with Malmo, you can also specify a seed for the world when debugging or testing. (Otherwise, the seed is randomly generated by default.)

obs = env.reset()

The result (observation), obs, includes the frame of the agent's view and has the following format. The pov below is the array of RGB pixels of a 64 x 64 image (i.e., 64 x 64 x 3).

OrderedDict
(
  [
    ('compassAngle', array(-22.53418)),
    ('inventory', OrderedDict([('dirt', array(0))])),
    ('pov',
      array(
        [
          [
            [15, 22, 77],
            [14, 20, 77],
            [16, 23, 81],
            ...,
            [36, 25, 19],
            [36, 25, 19],
            [36, 25, 19]
          ],
          [
            [17, 26, 82],
            [16, 25, 81],
            [15, 22, 78],
            ...,
            [27, 18, 14],
            [26, 18, 14],
            [44, 32, 24]
          ],
          [
            [14, 21, 75],
            [16, 23, 79],
            [16, 24, 80],
            ...,
            [26, 18, 14],
            [26, 18, 13],
            [44, 32, 24]
          ],
          ...,
  
          [
            [16, 11,  8],
            [19, 14, 10],
            [12,  8,  6],
            ...,
            [15, 11,  8],
            [15, 11,  8],
            [23, 16, 12]
          ],
          [
            [11,  8,  6],
            [11,  8,  6],
            [16, 11,  8],
            ...,
            [15, 11,  8],
            [15, 10,  7],
            [23, 16, 11]
          ],
          [
            [23, 17, 12],
            [24, 17, 13],
            [15, 11,  8],
            ...,
            [18, 13,  9],
            [18, 13,  9],
            [18, 13,  9]
          ]
        ],
        dtype=uint8
      )
    )
  ]
)
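
A small sketch of reading pieces out of this observation (assuming the structure shown above):

# A small sketch: access the parts of the observation shown above.
pov = obs['pov']                    # uint8 array of shape (64, 64, 3)
print(pov.shape, pov.dtype)
print(float(obs['compassAngle']))   # compass angle toward the goal
print(dict(obs['inventory']))       # current inventory counts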

Now, let's move the agent forward in your program.
The following env.action_space.noop() creates a new “do nothing” action. When you want the agent to move forward, set action['forward'] = 1 and call env.step(action).

When the agent finds and obtains diamonds, it gets a high score in reward and True in done. (See here about the rewards for obtaining diamonds.)

Each action can also include multiple activities at once, such as action['attack'] = 1 and action['camera'] = [0, 90] (turn 90 degrees), as shown in the sketch after the following code.

action = env.action_space.noop()
action['forward'] = 1

# run step-by-step in python console
obs, reward, done, info = env.step(action)
obs, reward, done, info = env.step(action)
obs, reward, done, info = env.step(action)
obs, reward, done, info = env.step(action)
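
As mentioned above, an action can combine multiple activities; the following is a minimal sketch (not a trained policy) that attacks while turning the camera and keeps stepping until the episode ends.

# A minimal sketch: combine multiple activities in a single action and
# keep stepping until the episode ends (this is not a trained policy).
action = env.action_space.noop()
action['attack'] = 1
action['camera'] = [0, 90]  # pitch delta 0, yaw delta 90 degrees

done = False
while not done:
  obs, reward, done, info = env.step(action)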

Note : In MineRL, a human player can also connect to the environment and interact with your agent. (See here.)

MineRL – Sample Data

MineRL also provides a large sample dataset which you can use in your machine learning tasks.
Now let's download the dataset for the obtaining-diamonds mission (MineRLObtainDiamond-v0).

mkdir data01
MINERL_DATA_ROOT="$HOME/data01" python3 -m minerl.data.download 'MineRLObtainDiamond-v0'

The downloaded folder includes a lot of sub-folders, each corresponding to one episode.

Each episode's folder includes 3 files: metadata.json, recording.mp4, and rendered.npz.
recording.mp4 is the original video, which is captured on the Minecraft Java server using the MineRL Recording Mod.

All frames in the video are tagged, and these tags are included in rendered.npz.

from numpy import load

dat = load('MineRLObtainDiamond-v0/v3_ample_salad_doppelganger-1_556-12734/rendered.npz')
files = dat.files
for item in files:
  print(item)
'reward'
'observation$inventory$coal'
'observation$inventory$cobblestone'
'observation$inventory$crafting_table'
'observation$inventory$dirt'
'observation$inventory$furnace'
'observation$inventory$iron_axe'
'observation$inventory$iron_ingot'
'observation$inventory$iron_ore'
'observation$inventory$iron_pickaxe'
'observation$inventory$log'
'observation$inventory$planks'
'observation$inventory$stick'
'observation$inventory$stone'
'observation$inventory$stone_axe'
'observation$inventory$stone_pickaxe'
'observation$inventory$torch'
'observation$inventory$wooden_axe'
'observation$inventory$wooden_pickaxe'
'observation$equipped_items.mainhand.damage'
'observation$equipped_items.mainhand.maxDamage'
'observation$equipped_items.mainhand.type'
'action$forward'
'action$left'
'action$back'
'action$right'
'action$jump'
'action$sneak'
'action$sprint'
'action$attack'
'action$camera'
'action$place'
'action$equip'
'action$craft'
'action$nearbyCraft'
'action$nearbySmelt'

Each entry is an array with one value per video frame.
For instance, in the following example, you can see that the player (agent) moves forward in the first several frames and does not move forward in the last several frames.

print(dat['action$forward'])
array([1, 1, 1, ..., 0, 0, 0])

The following example shows that the player obtains a diamond in the final frame. (See here about the rewards for the obtaining-diamonds mission.)

print(dat['reward'])
array([0, 0, 0, ..., 0, 0, 1024])
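
For instance, here is a small sketch (using the arrays loaded above) to find the frames where rewards were given and the total return of this episode.

import numpy as np

# A small sketch: find the frames with a non-zero reward and the total return.
rewards = dat['reward']
print(np.nonzero(rewards)[0])  # frame indices where a reward was given
print(rewards.sum())           # total return of this episode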

You can use these captured datasets to help your own learning tasks.
For instance, the MineRL tutorial “K-means exploration” demonstrates more natural (human-like) behaviors by clustering sample actions with the K-means algorithm. (Note that this sample uses a competition dataset.)
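
The following is only a rough sketch of that idea (it assumes scikit-learn is installed and uses the plain arrays loaded above, while the tutorial itself works on the vectorized competition dataset): stack a few recorded action components into vectors and cluster them, so that the cluster centers can serve as a small discrete action set.

import numpy as np
from sklearn.cluster import KMeans  # assumes scikit-learn is installed

# A rough sketch of the K-means idea: build action vectors from a few
# recorded components and cluster them into a small discrete action set.
acts = np.column_stack([
  dat['action$forward'],
  dat['action$jump'],
  dat['action$attack'],
  dat['action$camera'][:, 0],  # pitch delta
  dat['action$camera'][:, 1],  # yaw delta
])
kmeans = KMeans(n_clusters=16, random_state=0).fit(acts)
print(kmeans.cluster_centers_)  # candidate "macro" actions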

Reinforcement Learning in Minecraft

Now you can run machine learning workloads with Malmo or MineRL.
In this example, we train an agent to solve the previous lava-maze mission in MineRL, using a reinforcement learning algorithm with the Ray Tune framework.

Before running, install the prerequisite packages (these are needed for Ray) as follows.

pip3 install pandas tensorflow==1.15 tensorboardX tabulate dm_tree lz4 ray==0.8.3 ray[rllib]==0.8.3 ray[tune]==0.8.3

Let's start to train an agent as follows. (See the following Python code.)

For simplicity (to keep things easy), here I'm using the Dueling Double DQN (D3QN) algorithm with the Ray Tune framework on a single working machine without a cluster. (It will therefore take a very long time to reach the goal…)
Please tune the mission as needed – algorithm exploration, hyperparameter tuning, GPU utilization, scaling batches out to multiple workers (using the xvfb virtual monitor), more robust algorithms (such as IMPALA), and so on.
It's now your turn and your homework to tune it! 🙂

# Reinforcement Learning with MineRL (MineRLEnv) 0.3.6 
import gym
import ray
import ray.tune as tune
import minerl.env.core
import minerl.env.comms
import numpy as np

class MineRLEnvWrap(minerl.env.core.MineRLEnv):
  def __init__(self, xml):
    super().__init__(
      xml,
      gym.spaces.Box(low=0, high=255, shape=(84, 84, 3), dtype=np.uint8),
      gym.spaces.Discrete(3),
      None
    )

  def _setup_spaces(self, observation_space, action_space):
    self.observation_space = observation_space
    self.action_space = action_space

  def _process_action(self, action_in) -> str:
    # regardless of settings (allow-list, deny-list) in xml,
    # this overwrites a list of available actions.
    action_to_command_array = [
      'move 1',
      'camera 0 90',  # right turn
      'camera 0 -90']   # left turn
    return action_to_command_array[action_in]

  def _process_observation(self, pov, info):
    # we need only pov to analyze observation
    pov = np.frombuffer(pov, dtype=np.uint8)
    pov = pov.reshape((self.height, self.width, self.depth))
    return pov

def create_env(config):
  # for quick recovery (sometimes socket timed out, see Note)
  minerl.env.core.SOCKTIME = 25.0 # default 60.0 * 4
  minerl.env.comms.retry_timeout = 1 # default 10
  minerl.env.comms.retry_count = 5 # default 20

  mission = config["mission"]
  env = MineRLEnvWrap(mission)
  return env

def stop_check(trial_id, result):
  return result["episode_reward_mean"] >= 80

if __name__ == '__main__':
  tune.register_env("testenv01", create_env)

  ray.init()

  tune.run(
    run_or_experiment="DQN",
    config={
      "log_level": "DEBUG",
      "env": "testenv01",
      "env_config": {
        "mission": "/home/tsmatsuz/lava_maze_minerl.xml"
      },
      "num_gpus": 0,
      "num_workers": 1,
      "ignore_worker_failures": True,
      "double_q": True,
      "dueling": True,
      "explore": True,
      "exploration_config": {
        "type": "EpsilonGreedy",
        "initial_epsilon": 1.0,
        "final_epsilon": 0.02,
        "epsilon_timesteps": 500000
      }
    },
    stop=stop_check,
    checkpoint_freq=2,
    checkpoint_at_end=True,
    local_dir='./logs'
  )

  print('training has done !')
  ray.shutdown()

In this example, I resized the screen from 800 x 600 to 84 x 84 (i.e., reduced the number of pixels) by modifying our original mission file (lava_maze.xml), since the full-size observations are too large to analyze.
In MineRL, the <CameraCommands/> tag is also needed in the mission file when you want to allow your agent to turn (turn right, turn left).

<Mission xmlns="http://ProjectMalmo.microsoft.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

  <About>
    <Summary>$(ENV_NAME)</Summary>
  </About>

  <ModSettings>
    <MsPerTick>50</MsPerTick>
  </ModSettings>

  <ServerSection>
    <ServerInitialConditions>
      <Time>
        <StartTime>0</StartTime>
        <AllowPassageOfTime>false</AllowPassageOfTime>
      </Time>
      <Weather>clear</Weather>
      <AllowSpawning>false</AllowSpawning>
    </ServerInitialConditions>
    <ServerHandlers>
      <FlatWorldGenerator generatorString="3;7,220*1,5*3,2;3;,biome_1"/>

      <DrawingDecorator>
        <DrawSphere x="-29" y="70" z="-2" radius="100" type="air"/>
        <DrawCuboid x1="-34" y1="70" z1="-7" x2="-24" y2="70" z2="3" type="lava" /> 
      </DrawingDecorator>

      <MazeDecorator>
        <Seed>random</Seed>
        <SizeAndPosition width="5" length="6" height="10" xOrigin="-32" yOrigin="69" zOrigin="-5"/>
        <StartBlock type="emerald_block" fixedToEdge="true"/>
        <EndBlock type="lapis_block" fixedToEdge="true"/>
        <PathBlock type="grass"/>
        <FloorBlock type="air"/>
        <GapBlock type="lava"/>
        <GapProbability>0.6</GapProbability>
        <AllowDiagonalMovement>false</AllowDiagonalMovement>
      </MazeDecorator>

      <ServerQuitFromTimeUp timeLimitMs="300000" description="out_of_time"/>
      <ServerQuitWhenAnyAgentFinishes/>
    </ServerHandlers>
  </ServerSection>

  <AgentSection mode="Survival">
    <Name>Agent0</Name>

    <AgentStart>
      <Placement x="-28.5" y="71.0" z="-1.5" pitch="70" yaw="0"/>
    </AgentStart>

    <AgentHandlers>

      <VideoProducer want_depth="false">
        <Width>84</Width>
        <Height>84</Height>
      </VideoProducer>

      <ObservationFromFullInventory flat="false"/>
      <ObservationFromFullStats/>
      <ObservationFromCompass/>

      <DiscreteMovementCommands/>
      <CameraCommands/>

      <RewardForMissionEnd>
        <Reward description="out_of_time" reward="-100" />
      </RewardForMissionEnd>

      <RewardForTouchingBlockType>
        <Block reward="-100" type="lava" behaviour="onceOnly"/>
        <Block reward="100" type="lapis_block" behaviour="onceOnly"/>
      </RewardForTouchingBlockType>

      <RewardForSendingCommand reward="-1"/>

      <AgentQuitFromTouchingBlockType>
        <Block type="lava" />
        <Block type="lapis_block" />
      </AgentQuitFromTouchingBlockType>
    </AgentHandlers>
  </AgentSection>

</Mission>

Note : Here I changed the length attribute in <SizeAndPosition /> to 6, since some types of mazes fail (a socket timeout error in the initial peek, caused by a Minecraft instance crash) in MineRL 0.3.6. This modification mitigates the error.
(When the error happens, the Minecraft instance is restarted and the training continues.)

Fig. running a training (resized observations to 84 x 84 x 3)

Note : During training, a summary of the current progress can be seen in {log folder}/DQN/DQN_{env}_{experiment}/progress.csv .
Even when the training has been interrupted by accident, you can restore it and start again using the checkpoint files.
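
For example, here is a minimal sketch (assuming the same script as above; the checkpoint path below is only a placeholder for your own run) of resuming an interrupted run by passing the checkpoint path to tune.run via the restore argument.

# A minimal sketch: resume training from a saved checkpoint.
# The checkpoint path below is just a placeholder for your own run.
tune.run(
  run_or_experiment="DQN",
  config={
    "env": "testenv01",
    "env_config": {"mission": "/home/tsmatsuz/lava_maze_minerl.xml"},
    "num_gpus": 0,
    "num_workers": 1,
  },
  restore="./logs/DQN/DQN_testenv01_0_2020-05-01/checkpoint_10/checkpoint-10",
  stop=stop_check,
  checkpoint_freq=2,
  checkpoint_at_end=True,
  local_dir='./logs'
)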

Note : You should avoid running multiple Minecraft client instances. (Run only one instance.)
You can also launch the Minecraft client (with the Malmo mod) separately from the training workers as follows, but in this case a crashed Minecraft client cannot be restarted automatically for recovery.

# launchClient.sh (a fork of malmo) runs at the bottom
from minerl.env.malmo import InstanceManager
if __name__ == '__main__':
  instance = InstanceManager.Instance(9000)
  instance.launch()
...
class MineRLEnvWrap(minerl.env.core.MineRLEnv):
  def __init__(self, xml):
    super().__init__(
      xml,
      gym.spaces.Box(low=0, high=255, shape=(84, 84, 3), dtype=np.uint8),
      gym.spaces.Discrete(3),
      None,
      port=9000  # attach to running port
    )
  ...

if __name__ == '__main__':
  env = MineRLEnvWrap("/home/tsmatsuz/lava_maze_minerl.xml")
  env.reset()

Note : Don't use a low-spec machine, since the training worker requests multiple CPUs. (Here I used a Standard D3 v2 with 4 cores and 14 GB RAM on Microsoft Azure.)

Once the agent is trained (and saved as checkpoint files), you can restore this trained agent and run it in another environment.
See the completed example in the GitHub repo.
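
For instance, here is a minimal sketch (assuming the MineRLEnvWrap and create_env definitions from the script above and RLlib 0.8.3's DQNTrainer; the checkpoint path and the internal worker access are assumptions) of loading a saved checkpoint and stepping the trained policy.

import ray
import ray.tune as tune
from ray.rllib.agents.dqn import DQNTrainer

# A minimal sketch: restore the trained policy from a checkpoint and run it.
# (The checkpoint path below is just a placeholder for your own run.)
ray.init()
tune.register_env("testenv01", create_env)
trainer = DQNTrainer(
  env="testenv01",
  config={
    "env_config": {"mission": "/home/tsmatsuz/lava_maze_minerl.xml"},
    "num_workers": 0,
  })
trainer.restore("./logs/DQN/DQN_testenv01_0_2020-05-01/checkpoint_10/checkpoint-10")

# Reuse the environment created by the trainer's local worker, so that only
# one Minecraft instance is running. (This attribute path is an assumption
# about RLlib 0.8.x internals.)
env = trainer.workers.local_worker().env
obs = env.reset()
done = False
while not done:
  action = trainer.compute_action(obs)
  obs, reward, done, info = env.step(action)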

Fig. simulating a trained agent

Fig. progress of reward means

See another example of RL in Malmo, which uses tabular Q-learning.

Reinforcement learning consumes a lot of compute, so a cluster (i.e., multiple machines for each role) might be essential for practical training.
With the help of Azure Machine Learning (AML), you can easily provision pre-configured containers without any cumbersome Ray cluster setup, and the cluster size can be reduced to zero when the training has completed. (The built-in estimator (ReinforcementLearningEstimator) will automatically set up and optimize multiple machines for Ray workers.)
See a sample notebook for MineRL training with Azure Machine Learning.

 

GitHub : Reinforcement Learning for Minecraft (MineRL) Sample
https://github.com/tsmatz/minerl-maze-sample

MineRL Competition
https://minerl.io/competition/

 
