User:Ibada/Tuto Deep Learning

From Grid5000
Warning.png Warning

this tutorial is under construction

Grid5000 can be used to run deep learning software, but some configuration and installation steps are needed first.
The standard grid5000 environment does not (for now) include these deep learning tools, so users must install them in their home directory.

This tutorial will guide you through these configuration and installation steps and show you the best practices for deep learning experiments on grid5000.
The deep learning software covered here is keras, theano, tensorflow and pytorch.

Note.png Note

Basic and recommended requirements for following this tutorial are:

  • You must already have a grid5000 account; if you don't, please request one here:

https://www.grid5000.fr/mediawiki/index.php/Grid5000:Get_an_account

  • You must have a working ssh configuration for grid5000 (private and public ssh keys already generated), read this tutorial:

https://www.grid5000.fr/mediawiki/index.php/SSH

  • You must have basic bash skills and know the commands to browse directories from a shell (grid5000 is command-line only, there is no GUI).
  • Basic linux system administration skills help to understand grid5000 better, but they are not mandatory.
  • It's recommended to read the "Getting Started" (first steps on grid5000) tutorial before this deep learning tutorial:
https://www.grid5000.fr/mediawiki/index.php/Getting_Started



Creation of a virtual environment for python

A virtual environment is a directory that holds its own copy of python and of all the python libraries you install.
It isolates your python installation from the system one, which avoids problems such as conflicting or mismatched library versions
and python code that breaks at execution time because a system library has changed.

The first thing to do is to create a virtual environment in your grid5000 home. There are 2 ways to create one:

  • with the virtualenv command
  • or with anaconda

with virtualenv command

- connect to a grid5000 frontend :

Terminal.png outside:
ssh login@access.grid5000.fr

- go to Nancy grid5000 site:

Terminal.png inside:
ssh nancy

- create the virtual environment for python; here virtualenv creates a directory named "venv" in your home:

Terminal.png inside:
virtualenv /home/login/venv

- once your virtual environment is created you need to activate it :

Terminal.png inside:
source /home/login/venv/bin/activate

alternative syntax :

Terminal.png inside:
. /home/login/venv/bin/activate

once your virtual environment is activated the prompt changes to show its name, like this (assuming the environment is named "venv"):

  (venv) login@fnancy:~
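
A python script can also check by itself whether it runs inside a virtual environment. The snippet below is only a sketch: the helper name `in_virtualenv` is made up for this tutorial, and it assumes that the environment was created either by the virtualenv tool (which sets `sys.real_prefix` in older releases) or by a tool that makes `sys.base_prefix` differ from `sys.prefix`.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter seems to run inside a virtual environment."""
    # Older virtualenv releases add a "real_prefix" attribute to the sys module.
    if hasattr(sys, "real_prefix"):
        return True
    # Python 3 environments keep the original installation in sys.base_prefix.
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


print("virtual environment active:", in_virtualenv())
```

A plain anaconda installation activated through $PATH is not detected by this check, since it is a full python installation rather than a virtual environment.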

with anaconda

- connect to a grid5000 frontend :

Terminal.png outside:
ssh login@access.grid5000.fr

- go to Nancy grid5000 site :

Terminal.png inside:
ssh nancy

We will use miniconda, the light version of anaconda, in order to save disk space. Download the latest version of the miniconda installer:

  • for python 2.7 version ( default version of python in grid5000 ) :
  • for python 3.6 version:

then make the script executable:

  • for python 2.7 version:
Terminal.png inside:
chmod +x Miniconda2-latest-Linux-x86_64.sh
  • for python 3.6 version:
Terminal.png inside:
chmod +x Miniconda3-latest-Linux-x86_64.sh

run the script for installing miniconda:

Terminal.png inside:
bash Miniconda2-latest-Linux-x86_64.sh

or this if you use the python 3.6 version:

Terminal.png inside:
bash Miniconda3-latest-Linux-x86_64.sh

your anaconda environment is activated by adding its bin directory to the $PATH environment variable:

Terminal.png inside:
export PATH=/home/login/miniconda2/bin:$PATH

or this if you use the python 3.6 version:

Terminal.png inside:
export PATH=/home/login/miniconda3/bin:$PATH

Installation of deep learning software

This section assumes that you have created a python virtual environment and activated it (see the previous section).

Note.png Note

Always activate your virtual environment before installing a python module or running a python script.

Installation of keras

Keras is a high-level neural networks API, written in python, which acts as a wrapper around theano, tensorflow or CNTK.
Keras makes it much easier to create deep learning experiments than using theano or tensorflow directly.
It's the recommended tool for beginners, and also for advanced users who don't want to spend too much time on the complexity of low-level libraries such as theano and tensorflow.

with pip

- keras can be installed with the pip command, directly in your virtual environment:

Terminal.png inside:
pip install keras

- if keras is already installed in your virtual environment and you want to update it to the latest version:

Terminal.png inside:
pip install keras --upgrade

- if you need a previous version of keras (for example version 1.2.0), use "==" followed by the version number:

Terminal.png inside:
pip install keras==1.2.0

with anaconda

check that the anaconda binary path is in your $PATH (see the previous section) by typing this command:

Terminal.png inside:
which python

the output must contain the path of your anaconda installation (something like /home/<login>/miniconda*/bin). If it does, you can install keras:

Terminal.png inside:
conda install keras
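
The same check as "which python" can be done from inside a script. This is a minimal sketch, not part of anaconda: the helper name `interpreter_is_under` is hypothetical, and the directory /home/login/miniconda3 is only the example path used in this tutorial.

```python
import os
import sys


def interpreter_is_under(prefix, executable=None):
    """Return True if the python executable is located below the directory `prefix`."""
    # Resolve symbolic links so that both paths are absolute and comparable.
    executable = os.path.realpath(executable or sys.executable)
    prefix = os.path.realpath(os.path.expanduser(prefix))
    return os.path.commonpath([executable, prefix]) == prefix


# Example path from this tutorial; replace "login" with your own login.
print(interpreter_is_under("/home/login/miniconda3"))
```

If the function returns False, the python interpreter that runs the script is not the one from the anaconda installation, which usually means that $PATH was not updated in the current shell.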

Installation of theano

Theano is a low-level API for deep learning, written in python. It's an open-source symbolic tensor manipulation framework developed by the LISA Lab at Université de Montréal.
It can be used as a backend by keras, and it is normally installed automatically as a dependency of keras.
If that's not the case, you can install theano manually in your virtual environment:

with pip

Terminal.png inside:
pip install theano

Warning.png Warning

The pip installation of theano is not reliable:
keras fails with an error because the pygpu python module is missing, and pygpu cannot be installed with pip:
https://github.com/Theano/libgpuarray/issues/537

You then have to install pygpu locally (in your grid5000 home) by following these instructions:
http://deeplearning.net/software/libgpuarray/installation.html

The alternative, and recommended, way is to install pygpu with anaconda.

with anaconda

Terminal.png inside:
conda install theano pygpu

Installation of tensorflow

Tensorflow is also a low-level API for deep learning, used from python. It's an open-source symbolic tensor manipulation framework developed by Google.
It can be used as a backend by keras, but it is not installed automatically as a dependency of keras,
so you have to install it manually with the pip command or with anaconda:

with pip

Terminal.png inside:
pip install tensorflow-gpu

with anaconda

Terminal.png inside:
conda install tensorflow-gpu


Note.png Note

Two tensorflow packages exist:

  • "tensorflow-gpu" which uses the GPU for computations
  • "tensorflow" which uses only the CPU

On grid5000 GPU nodes you have to pick the GPU version ( tensorflow-gpu ).

Installation of pytorch

Pytorch is a deep learning framework similar to theano and tensorflow, but it cannot be used (for now) as a backend for keras.

The pytorch website provides up-to-date installation instructions for conda and pip:

http://pytorch.org/

with pip

You can install pytorch with the pip command in your virtual environment. 3 different versions are available, depending on your python version:

- for python 2.7 ( default version of python in grid5000 ) :

Terminal.png inside:
pip install torchvision

- for python 3.5:

Terminal.png inside:
pip3 install torchvision

- for python 3.6:

Terminal.png inside:
pip3 install torchvision

Note.png Note

These installation instructions date from March 2018 and may be obsolete; please also check the pytorch website: http://pytorch.org/

with anaconda

Terminal.png inside:
conda install pytorch torchvision cuda90 -c pytorch
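
After the installation steps above it can be useful to check which frameworks python can actually find in the active environment. The following sketch uses only the standard library; the helper name `installed_modules` is made up here, and the module names are the usual import names (pytorch is imported as `torch`).

```python
import importlib.util


def installed_modules(names):
    """Map each top-level module name to True if python can locate it."""
    # find_spec() looks the module up without importing it.
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = installed_modules(["keras", "theano", "tensorflow", "torch"])
for name, found in status.items():
    print(name, "found" if found else "missing")
```

This only tells you that a module is present; it does not prove that the module works with the GPU, which still has to be tested on a GPU node.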

Installation of cuDNN library

The cuDNN library, written by nvidia, provides GPU-optimized implementations of the algorithms used in deep learning.
Tensorflow 1.5.0 needs cuDNN 7, and theano can use it, so it's recommended to install the cuDNN library in your home.

manual installation

First you have to register on the nvidia website in order to download the cuDNN library ( version 7 for cuda 9.0.1 ) :

https://developer.nvidia.com/cudnn

Then create this directory structure in your home:

  /home/login/cudnn/include
  /home/login/cudnn/lib

Unzip the file you downloaded from the nvidia website, then put the file "cudnn.h" in /home/login/cudnn/include
and the files "libcudnn.so, libcudnn.so.7, libcudnn.so.7.0.5" in /home/login/cudnn/lib .

Last step: set three bash environment variables: LD_LIBRARY_PATH, C_INCLUDE_PATH, CPLUS_INCLUDE_PATH.

  export LD_LIBRARY_PATH=/home/login/cudnn/lib:$LD_LIBRARY_PATH
  export C_INCLUDE_PATH=/home/login/cudnn/include:$C_INCLUDE_PATH
  export CPLUS_INCLUDE_PATH=/home/login/cudnn/include:$CPLUS_INCLUDE_PATH

Put these export lines inside your bash scripts.
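
If you launch your training from a python wrapper rather than from a bash script, the same three variables can be prepared in python. This is a sketch under assumptions: the helper `prepend_path` is hypothetical, and the directories are the example paths of this tutorial. Unlike a plain export, it does not leave an empty entry at the end of the value when the variable was not set before.

```python
import os


def prepend_path(env, name, directory):
    """Put `directory` at the front of the path-list variable `name` in `env`."""
    current = env.get(name, "")
    # Drop empty entries and any older occurrence of the same directory.
    parts = [p for p in current.split(os.pathsep) if p and p != directory]
    env[name] = os.pathsep.join([directory] + parts)
    return env[name]


# Work on a copy of the environment; it can be passed to subprocess.run(..., env=env).
env = dict(os.environ)
for variable, subdir in [("LD_LIBRARY_PATH", "lib"),
                         ("C_INCLUDE_PATH", "include"),
                         ("CPLUS_INCLUDE_PATH", "include")]:
    prepend_path(env, variable, "/home/login/cudnn/" + subdir)
```

Note that changing LD_LIBRARY_PATH inside an already running python process is generally too late for that process itself; the modified environment is meant for the child process that runs the training.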

installation with anaconda

cuDNN can be installed with anaconda (note: the package may already have been installed along with keras):

Terminal.png inside:
conda install cudnn

then you have to set three environment variables: LD_LIBRARY_PATH, C_INCLUDE_PATH, CPLUS_INCLUDE_PATH.

  • python 2.7 version of anaconda:
  export LD_LIBRARY_PATH=/home/login/miniconda2/lib:$LD_LIBRARY_PATH
  export C_INCLUDE_PATH=/home/login/miniconda2/include:$C_INCLUDE_PATH
  export CPLUS_INCLUDE_PATH=/home/login/miniconda2/include:$CPLUS_INCLUDE_PATH
  • python 3.6 version of anaconda:
  export LD_LIBRARY_PATH=/home/login/miniconda3/lib:$LD_LIBRARY_PATH
  export C_INCLUDE_PATH=/home/login/miniconda3/include:$C_INCLUDE_PATH
  export CPLUS_INCLUDE_PATH=/home/login/miniconda3/include:$CPLUS_INCLUDE_PATH

Note.png Note

If you get a cudnn error ("/usr/bin/ld: cannot find -lcudnn") when running keras with theano, you can try copying the cudnn libraries to /usr/local/lib once the gpu node is ready:

sudo-g5k cp /home/login/miniconda*/lib/libcudnn* /usr/local/lib/ (if you use anaconda)

or

sudo-g5k cp /home/login/cudnn/lib/libcudnn* /usr/local/lib/ (if you have installed cudnn manually)

Configuration of deep learning software

Configuration of keras

You have to choose the deep learning backend for keras: theano or tensorflow. By default keras uses tensorflow if nothing is configured.

Configuration with a keras.json file

Keras keeps its configuration in a json file, located here:

/home/login/.keras/keras.json

This file is created automatically the first time keras is launched.
Its content is:

  {
      "image_data_format": "channels_last",
      "epsilon": 1e-07,
      "floatx": "float32",
      "backend": "tensorflow"
  }

To set a backend, put the desired value ( tensorflow or theano ) in the "backend" field, then save the file under the name "keras.json".
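
The file can also be edited from a script, which is convenient when the same job has to run with both backends. The sketch below uses only the standard json module; the function name `set_keras_backend` is made up for this tutorial, and it keeps any other field already present in the file.

```python
import json
import os


def set_keras_backend(config_path, backend):
    """Write `backend` into the keras.json file, keeping the other fields."""
    if backend not in ("tensorflow", "theano"):
        raise ValueError("unexpected backend: %r" % backend)
    config = {}
    if os.path.exists(config_path):
        with open(config_path) as handle:
            config = json.load(handle)
    config["backend"] = backend
    # Create the ~/.keras directory if keras has never been launched.
    os.makedirs(os.path.dirname(config_path) or ".", exist_ok=True)
    with open(config_path, "w") as handle:
        json.dump(config, handle, indent=4)
    return config
```

A typical call would be `set_keras_backend(os.path.expanduser("~/.keras/keras.json"), "theano")`, done before keras is imported.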

Configuration with a bash environment variable

An alternative way to set the backend is the "KERAS_BACKEND" bash environment variable:

Terminal.png inside:
export KERAS_BACKEND=tensorflow

or this if you want to use theano as backend:

Terminal.png inside:
export KERAS_BACKEND=theano

Configuration of theano

Configuration with a .theanorc file

Theano can be configured with a file named ".theanorc", located here:
/home/login/.theanorc

Example of a configuration that uses the GPU:

  [global]
  floatX = float32
  device = cuda0
  
  [gpuarray]
  preallocate = 1

The most important field is "device", which sets the GPU card to use: "cuda0" for the first GPU card, "cuda1" for the second GPU card, and so on.
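
A job script that picks its GPU at run time may want to generate this file instead of editing it by hand. Here is a possible sketch with the standard configparser module; the function name `write_theanorc` and its default values are assumptions taken from the example above. The `optionxform` line matters because configparser would otherwise write option names in lower case ("floatx" instead of "floatX").

```python
import configparser


def write_theanorc(path, device="cuda0", floatx="float32", preallocate=1):
    """Write a minimal theano configuration file selecting one GPU device."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep the case of option names such as floatX
    parser["global"] = {"floatX": floatx, "device": device}
    parser["gpuarray"] = {"preallocate": str(preallocate)}
    with open(path, "w") as handle:
        parser.write(handle)
```

For example `write_theanorc("/home/login/.theanorc", device="cuda1")` selects the second GPU card.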

Configuration with a bash environment variable

Like keras, theano can also be configured with a bash environment variable: THEANO_FLAGS

Terminal.png inside:
export THEANO_FLAGS='floatX=float32,device=cuda0,gpuarray.preallocate=1'

If you use anaconda for your python virtual environment, you also have to set this variable in order to avoid an error:

Terminal.png inside:
export MKL_THREADING_LAYER=GNU

More information about available configuration options for theano here:
http://deeplearning.net/software/theano/library/config.html

Configuration of tensorflow

Currently tensorflow doesn't need a configuration file, but cuDNN must be installed and the related bash environment variables correctly set (see the previous section about the installation of cuDNN).
There is a feature request on the tensorflow github for a configuration file, so the situation may change in the near future:
https://github.com/tensorflow/tensorflow/issues/6629

Configuration of pytorch

As with tensorflow, no configuration file is needed; pytorch works directly with python.

Running deep learning software on grid5000

Grid5000 has several GPU nodes that can run deep learning software; most of them are located at the Nancy site (production queue):

Nancy site:

  • grimani: 6 nodes, each node has 2 Nvidia K40m GPU cards
  • graphique: 6 nodes, 2 x Nvidia Titan Black ( graphique-1 ), 2 x Nvidia GTX 980 ( other nodes )
  • grele: 14 nodes, each node has 2 Nvidia Geforce 1080 Ti GPU cards

Lille site:

  • chifflet: 8 nodes, each node has 2 Nvidia GTX 1080 Ti GPU cards

Lyon site:

  • orion: 4 nodes, each node has one Nvidia Tesla M2075 GPU card

To run a deep learning script you have to reserve a node with oarsub, either in interactive mode or in non-interactive mode (aka "passive" mode, which is the recommended way).

Making a reservation with oarsub

Reservation in interactive mode

In "interactive mode" you are directly connected to the reserved grid5000 resource with an interactive shell, and you run your python script inside this shell.
All output, error and log messages are printed in this interactive shell.

This mode is useful to tune or test a script that doesn't take a long time to run (less than 20 minutes). For a production script the interactive mode is discouraged:
it's better to use the non-interactive mode.

- First you need to connect to a grid5000 frontend:

Terminal.png outside:
ssh login@access.grid5000.fr

- go to Nancy grid5000 site:

Terminal.png inside:
ssh nancy

- reserve a GPU node in the production queue ( Nancy ), with a small walltime ( here 20 minutes ) :

Terminal.png inside:
oarsub -q production -p "GPU <> 'NO'" -l "nodes=1,walltime=0:20:00" -I

if your reservation succeeds you will get this output:

  Generate a job key...
  OAR_JOB_ID=1498157
  Interactive mode : waiting...
  Starting...
  
  Connect to OAR job 1498157 via the node graphique-4.nancy.grid5000.fr

the last step is to activate your python virtual environment. The method depends on whether you created it with virtualenv or with anaconda:

for virtualenv

Terminal.png inside:
source /home/login/venv/bin/activate

for anaconda

  • python 2.7 version:
Terminal.png inside:
export PATH=/home/login/miniconda2/bin:$PATH
  • python 3.6 version:
Terminal.png inside:
export PATH=/home/login/miniconda3/bin:$PATH

and finally you can run your python script ( keras, theano, tensorflow or pytorch ) :

Terminal.png inside:
python <name of your script>

Reservation in non-interactive mode ( passive )

"Non-interactive" mode is more suitable for python scripts that already work (debugged scripts) and for scripts that take a long time to run.

The steps are the same as in interactive mode (see the previous section); the only difference is the syntax of the oarsub command:

Terminal.png inside:
oarsub -q production -p "GPU <> 'NO'" -l "nodes=1,walltime=0:20:00" <path to a bash script>

Here we don't use the "-I" option; instead we give the path of a bash script ( "<path to a bash script>" ).
The purpose of this bash script is to activate the python virtual environment and to launch the deep learning python script.

Note.png Note

Don't forget to make your script executable (with the "chmod +x <your script>" command), and to put in it the bash command that activates your virtual environment.
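
When many jobs have to be submitted (for example one per hyper-parameter value), the oarsub command line can be assembled in python. This is only a sketch: the function `build_oarsub_command` and the script path /home/login/run_training.sh are hypothetical, and the options are limited to the ones shown in this tutorial.

```python
import shlex


def build_oarsub_command(script, nodes=1, walltime="0:20:00",
                         queue="production", properties="GPU <> 'NO'"):
    """Return the oarsub argument list for a non-interactive GPU job."""
    return ["oarsub", "-q", queue, "-p", properties,
            "-l", "nodes=%d,walltime=%s" % (nodes, walltime), script]


# Hypothetical job script; it must be executable and activate the virtual environment.
command = build_oarsub_command("/home/login/run_training.sh")
# Print a shell-quoted version of the command for the record.
print(" ".join(shlex.quote(part) for part in command))
```

On a frontend the list can then be passed to `subprocess.run(command, check=True)`. Passing a list avoids any shell quoting problem with the -p property string.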


Reserving a particular gpu cluster

Use the "cluster=" property in the "-p" option of your oarsub command, followed by the name of the cluster in single quotes ( for example 'grimani' ) :

  • interactive mode :
Terminal.png inside:
oarsub -q production -p "cluster='grimani'" -l "nodes=1, walltime=0:20:00" -I
  • non-interactive mode :
Terminal.png inside:
oarsub -q production -p "cluster='grimani'" -l "nodes=1, walltime=0:20:00" <path to a bash script>

Reserving just one gpu card on a node

If your script uses only one GPU card, a good practice is not to reserve all the resources of the node; this allows other users to use the second GPU card.

To achieve this, use the "host=1/core=6" option in your oarsub command:

  • interactive mode:
Terminal.png inside:
oarsub -q production -p "GPU <> 'NO'" -l "host=1/core=6, walltime=0:20:00" -I
  • non-interactive mode:
Terminal.png inside:
oarsub -q production -p "GPU <> 'NO'" -l "host=1/core=6, walltime=0:20:00" <path to a bash script>

"host=" is the number of machines to reserve (similar to the "nodes=" option) and "core=" is the number of CPU cores to reserve. Using the "core=" option implicitly tells grid5000 not to reserve the entire node, so the second gpu card remains available for another user.

Note.png Note

To avoid problems (cuda errors) when you reserve only one GPU card, it's recommended to test in your bash script which GPU card (the first or the second) is really available, so that you pick a free card and not the one used by another user.

You can test which gpu card is available with this bash script:

  #!/bin/bash
  
  # Checking gpu usage and identify the first available gpu...
  
  mem0=`nvidia-smi --query --display=MEMORY --id=0 | grep Used | head -1 | sed -r -e "s/.*: *([0-9]+).*/\1/"`
  mem1=`nvidia-smi --query --display=MEMORY --id=1 | grep Used | head -1 | sed -r -e "s/.*: *([0-9]+).*/\1/"`
  
  printf "GPUs status\n\tgpu0 => %d MiB\n\tgpu1 => %d MiB\n" $mem0 $mem1
  
  gpu=gpu0
  if [[ $mem1 -lt $mem0 ]]; then
      gpu=gpu1
  fi

Here the $gpu variable contains the name of the least used gpu card (normally the one that is not used by another user). This matters, for example, if you use theano as backend: theano needs to know which gpu card to use, and you can set it in the theano configuration file ( ~/.theanorc ) or with the $THEANO_FLAGS bash environment variable.

The behaviour of theano and tensorflow regarding gpu cards can also be controlled with a cuda bash environment variable: CUDA_VISIBLE_DEVICES.
CUDA_VISIBLE_DEVICES hides GPU cards from the cuda layer: only the GPUs listed in the variable are visible.

3 different settings:

  • only the first GPU will be seen by cuda ( gpu0 ) :
Terminal.png inside:
export CUDA_VISIBLE_DEVICES=0
  • only the second GPU will be seen by cuda ( gpu1 ) :
Terminal.png inside:
export CUDA_VISIBLE_DEVICES=1
  • first and second GPU will be seen by cuda ( gpu0, gpu1 ) :
Terminal.png inside:
export CUDA_VISIBLE_DEVICES=0,1
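
The GPU selection and the CUDA_VISIBLE_DEVICES setting can be combined in python. The sketch below is an illustration with hypothetical function names. It assumes that the memory usage is obtained with `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits`, which prints one number (MiB) per GPU; check that your driver version supports these options.

```python
import os


def parse_memory_used(csv_output):
    """Parse one 'memory used in MiB' value per line into a list of ints."""
    return [int(line.strip()) for line in csv_output.splitlines() if line.strip()]


def least_used_gpu(memory_used):
    """Return the index of the GPU with the lowest memory use (first one on ties)."""
    return min(range(len(memory_used)), key=lambda i: memory_used[i])


def select_gpu(csv_output, env=os.environ):
    """Expose only the least used GPU to cuda and return its index."""
    index = least_used_gpu(parse_memory_used(csv_output))
    env["CUDA_VISIBLE_DEVICES"] = str(index)
    return index

# On a GPU node the input would come from (assumed nvidia-smi options):
#   subprocess.check_output(["nvidia-smi", "--query-gpu=memory.used",
#                            "--format=csv,noheader,nounits"], text=True)
```

The variable must be set before the deep learning library initialises cuda. Keep in mind that once a single GPU is exposed this way, the process sees it as device 0, whatever its physical index.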

Monitoring a grid5000 job

You can check the state of your grid5000 job with the oarstat command, by reading the log files created by oarsub, or by connecting to a node used by the job.
Jobs can also be killed by their owner.

Oarstat command

Each grid5000 job has an id, which is printed on screen when you make a reservation with the oarsub command:

  Generate a job key...
  OAR_JOB_ID=1498157

To check its state ( "waiting", "running", "finishing" ), type the oarstat command followed by the "-j" option and the id of the job ( here 1498157 ) :

Terminal.png inside:
oarstat -j 1498157

If you forget the id of the job you can use the "-u" option of oarstat followed by your login; it shows the state of all your grid5000 jobs:

Terminal.png inside:
oarstat -u login

Log files

Each time you use the oarsub command in non-interactive mode, 2 log files are generated; they contain the output of your bash script:

  • one for standard output : "OAR.<job id>.stdout"
  • one for error output : "OAR.<job id>.stderr"

By default they are written in the directory from which you ran the oarsub command.

Connecting to a running job

At any moment you can connect to the node where the job is currently running, by using the "oarsub" command with the "-C" option followed by the job id:

Terminal.png inside:
oarsub -C <job id>

Killing a job

A job can be killed with the oardel command followed by the id of the job ( here for example 1498157 ) :

Terminal.png inside:
oardel 1498157

Using several GPU cards

Using the two gpu cards of a node

Each node of the gpu clusters ( except orion ) has 2 GPU cards, so a script running on a single node can already use 2 cards.

Note.png Note

As of keras v2.0.9 you can use the "single-machine multi-GPU data parallelism" feature, see this tutorial :

https://www.pyimagesearch.com/2017/10/30/how-to-multi-gpu-training-with-keras-python-and-deep-learning/

and the official documentation of keras related to multi-gpu:

https://keras.io/utils/#multi_gpu_model

Using multiple gpu nodes

If you want to reserve and use several gpu nodes ( several grid5000 machines ), set the "nodes" option of the oarsub command to the number of desired nodes:

Terminal.png inside:
oarsub -q production -p "GPU <> 'NO'" -l "nodes=<number of nodes>, walltime=0:20:00" <path to a bash script>

Your bash script can get the network names of the reserved nodes by reading the file whose path is stored in the bash environment variable $OAR_FILE_NODES :

Terminal.png inside:
cat $OAR_FILE_NODES | sort -u

Then in your bash script you can connect to the desired node with the "oarsh" command, which is a wrapper around the ssh command for grid5000:

Terminal.png inside:
oarsh <network name of the gpu node>
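
A python launcher can read the same node file. This is a small sketch with a hypothetical helper name, `reserved_nodes`; it assumes, like the `sort -u` command above, that the file contains one host name per line and that a host may appear several times.

```python
import os


def reserved_nodes(node_file=None):
    """Return the sorted, de-duplicated host names listed in the OAR node file."""
    # By default use the path that OAR stores in $OAR_FILE_NODES inside a job.
    node_file = node_file or os.environ["OAR_FILE_NODES"]
    with open(node_file) as handle:
        return sorted({line.strip() for line in handle if line.strip()})
```

Inside a job, `reserved_nodes()` gives the list of machines on which the workers of a distributed training can be started, for example with oarsh.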

Using multiple gpu nodes for deep learning is called "distributed training on multiple machines". With keras this kind of training can be done with dist-keras or elephas, which you have to install:
https://github.com/cerndb/dist-keras
https://github.com/maxpumperla/elephas

The same thing ( distributed training on multiple machines ) can be done with tensorflow ( without keras ), but it's more complex:
https://www.tensorflow.org/deploy/distributed

Using several GPUs with pytorch

Pytorch provides 2 classes for this, depending on whether you want to use data parallelism or distributed data parallelism:

  • torch.nn.DataParallel(module, device_ids=None, output_device=None, dim=0)

This class implements data parallelism at the module level.

  • torch.nn.parallel.DistributedDataParallel(module, device_ids=None, output_device=None, dim=0, broadcast_buffers=True)

This class implements distributed data parallelism at the module level.

Details about how to use these 2 classes :
https://pytorch.org/docs/stable/nn.html#dataparallel-layers-multi-gpu-distributed

A tutorial with examples for using several GPU with pytorch :
https://pytorch.org/tutorials/beginner/former_torchies/parallelism_tutorial.html

Advanced use of grid5000

Creation of a custom grid5000 image

Grid5000 gives users the ability to create a custom environment, with an image that can be deployed on nodes.
A custom image is necessary when the additional software you need cannot be installed in your grid5000 home.

There are 2 ways to create an image:

  • by deploying an existing environment, then modifying it (installing software), then saving the result in a new image
  • from scratch, by creating a new image from an iso image of a linux distribution

The first method is more suitable for beginners, you can find a tutorial here:
https://www.grid5000.fr/mediawiki/index.php/User:Ibada/custom-image

The second method is described in this tutorial:
https://www.grid5000.fr/mediawiki/index.php/Advanced_Kadeploy#Deploy_an_environment_from_a_classical_ISO_installation

Tips and tricks

Put your dataset file in /tmp directory of the node

In order to get the best speed for disk input/output operations it's highly recommended to copy your dataset file to the /tmp directory of the GPU node, with the "cp" bash command.
If you leave your data file in your grid5000 home ( or another nfs share ), every read goes through the network, which creates a bottleneck and slows down your deep learning experiment.
Remember that your grid5000 home ( and any other nfs share ) is stored on a remote server and not on your reserved gpu node; that's why it's recommended to copy the data file to a local directory of the node like /tmp.
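
The copy can also be done at the start of the python script, so that the job works on any node. The following is a minimal sketch with a hypothetical function name, `stage_dataset`; it copies the file only when no copy of the same size and date is already present in the local directory.

```python
import os
import shutil


def stage_dataset(source, local_dir="/tmp"):
    """Copy `source` into `local_dir` unless an up-to-date copy is already there."""
    target = os.path.join(local_dir, os.path.basename(source))
    if (not os.path.exists(target)
            or os.path.getsize(target) != os.path.getsize(source)
            or os.path.getmtime(target) < os.path.getmtime(source)):
        shutil.copy2(source, target)  # copy2 also preserves the modification time
    return target
```

The script then opens the path returned by the function instead of the path in the home directory.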

Use the batch-mode training

In a deep learning context the dataset can be huge ( several gigabytes ). If you try to load the whole content of a very large dataset in memory, your script will likely crash due to a lack of memory.
To avoid this problem you can use "batch-mode training": the script loads small data packets in a loop and sends them one by one to the neural network for training.
For example, a dataset of 10 000 000 vectors can be read as 1000 packets of 10 000 vectors, each packet being sent in turn to your neural network.

This can be achieved easily in keras with api functions such as train_on_batch() and fit_generator(), see this documentation:
https://keras.io/getting-started/faq/#how-can-i-use-keras-with-datasets-that-dont-fit-in-memory
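
The packet arithmetic of the example above can be sketched as a small generator. The function name `batch_ranges` is made up for this tutorial; it only produces index ranges and does not depend on keras.

```python
def batch_ranges(n_samples, batch_size):
    """Yield (start, stop) index pairs that cover n_samples in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_samples, batch_size):
        # The last batch may be shorter than batch_size.
        yield start, min(start + batch_size, n_samples)


# The figures from the text: 10 000 000 vectors read 10 000 at a time.
n_batches = sum(1 for _ in batch_ranges(10_000_000, 10_000))
print(n_batches)  # 1000
```

Each (start, stop) pair can be used to read one slice of the dataset from disk, and the slice can then be given to a function such as train_on_batch().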

More generally, design your script to be robust: it must be able to handle a very large dataset even on a machine with little RAM. Your current dataset may be small, but in 6 months your script may have to handle a much bigger one, and loading and training in batch mode avoids any problem related to memory size.

Avoid text format when the dataset is large

Text formats such as csv or xml are easily readable by a human, but parsing a text file is time-consuming in languages such as python, especially if your dataset is big.

For example, parsing a csv file means splitting every line into fields (with the "split()" function or a csv parser) and converting each field to a number, which is CPU intensive. If your csv file has a lot of rows and columns then retrieving the data will be very slow, and if the file is parsed again at each iteration ( new epoch of the training ) this parsing becomes a bottleneck.

The solution is to prefer a binary format to a text format for your dataset. You can write a script that converts csv/txt files to binary files. A convenient binary format for big data is hdf5, and an excellent python module exists for it, h5py:
http://www.h5py.org/

hdf5 is often used in scientific computing and deep learning. If your dataset is very big ( several gigabytes ) then the hdf5 format is a good choice, with many advantages: the format is portable, data loading is very fast, data can be read from disk on demand instead of being held entirely in RAM, and an hdf5 dataset can be sliced like a numpy array.
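
To illustrate the idea of a one-time conversion without requiring h5py, the sketch below uses the standard `array` module and a raw float32 file. This is a stand-in technique and not the hdf5 format; the function names are hypothetical and the csv file is assumed to contain only numbers, without a header line.

```python
import array
import csv


def csv_to_float32(csv_path, bin_path):
    """Convert a headerless numeric csv file to raw float32 values; return (rows, cols)."""
    rows = cols = 0
    with open(csv_path, newline="") as source, open(bin_path, "wb") as target:
        for record in csv.reader(source):
            if not record:
                continue  # skip empty lines
            values = array.array("f", (float(field) for field in record))
            cols = len(values)
            values.tofile(target)
            rows += 1
    return rows, cols


def read_rows(bin_path, cols, start, stop):
    """Read rows start..stop-1 from the binary file without parsing any text."""
    values = array.array("f")
    with open(bin_path, "rb") as source:
        source.seek(start * cols * values.itemsize)
        values.fromfile(source, (stop - start) * cols)
    return [list(values[i * cols:(i + 1) * cols]) for i in range(stop - start)]
```

The csv file is parsed once; afterwards every epoch reads the binary file directly, batch by batch. With h5py the principle is the same, with the added benefit of a self-describing, portable file.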

Implement a backup/resume mode for the training

When you train a neural network you don't know exactly how much time the training will take. This can be a problem on grid5000 when you reserve a node: if the value of the "walltime" oarsub option is too low, your grid5000 job will stop before the end of your training.
So you need to implement a "backup/resume" feature in your script, in order to avoid redoing the whole training when a grid5000 job is interrupted for any reason ( bad estimation of the walltime, electric power failure, hard disk failure, maintenance periods, etc. ).
You then just have to reserve a new job and load the backup of your model in your script: the training resumes from the state the model was in at the last backup.

Keras provides this feature through the "ModelCheckpoint" callback, which can automatically save the model at the end of each epoch during the training:

  ModelCheckpoint(filepath, monitor='val_loss', verbose=0, save_best_only=False, save_weights_only=False, mode='auto', period=1)

https://keras.io/callbacks/

Another solution is to save the model manually with the "model.save(filepath)" function.

A model ( with its weights ) can be reloaded with the load_model(filepath) function; here the model is saved in hdf5 format:

  from keras.models import load_model
  
  # "model" is assumed to be a keras model already built and compiled earlier in your script
  model.save('my_model.h5')  # creates a HDF5 file 'my_model.h5'
  del model  # deletes the existing model
  
  # returns a compiled model
  # identical to the previous one
  model = load_model('my_model.h5')

https://keras.io/getting-started/faq/#how-can-i-save-a-keras-model
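
Besides the model itself, a resumable script has to remember which epoch it had reached. The sketch below shows one possible bookkeeping with a small json file; the function `train_with_resume` and the `run_epoch` callback are hypothetical names, and the callback is where the real training and the model.save() call would go.

```python
import json
import os


def train_with_resume(state_path, total_epochs, run_epoch):
    """Run the remaining epochs, recording progress in `state_path` after each one."""
    state = {"next_epoch": 0}
    if os.path.exists(state_path):
        with open(state_path) as handle:
            state = json.load(handle)
    for epoch in range(state["next_epoch"], total_epochs):
        run_epoch(epoch)  # e.g. train one epoch, then save the model
        state["next_epoch"] = epoch + 1
        tmp_path = state_path + ".tmp"
        with open(tmp_path, "w") as handle:
            json.dump(state, handle)
        # Rename in one step so that the state file is never half written.
        os.replace(tmp_path, state_path)
    return state["next_epoch"]
```

If the job is killed at the end of the walltime, submitting the same script again continues from the first epoch that was not completed, provided the state file and the saved model are stored in a directory that survives the job (your home, not /tmp).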

nvidia-smi tool

Nvidia provides a useful tool called nvidia-smi ( NVIDIA System Management Interface ), a command line utility to monitor and configure nvidia gpu devices.
You can use it for several purposes:

  • check the gpu usage and video memory usage with an update every 2 seconds :
Terminal.png inside:
nvidia-smi -l 2
  • display system topology :
Terminal.png inside:
nvidia-smi topo --matrix
  • change the gpu compute mode with the "-c" option ( "0" for shared mode where multiple threads can use the GPU, "1" for "exclusive mode" where only one thread can use the GPU, "2" for prohibited mode where no threads can use the GPU )
Terminal.png inside:
sudo-g5k nvidia-smi -c <value>

Testing your deep learning configuration

If you need to test your deep learning configuration, you can use the example scripts provided in the source code of keras:
https://github.com/keras-team/keras/tree/master/examples

These scripts cover various deep learning models (mlp, lstm, cnn).

Reserving additional disk space with grid5000

Grid5000 provides a tool (storage5k) to reserve large disk space for storing very large datasets.
You have to choose the number of "chunks" of your reservation. Each chunk has a size of 10 GB, so if you want to reserve 50 GB you need 5 chunks, and the command to type is:

frontend: storage5k -a add -l chunks=5,walltime=24

where "5" is the number of chunks and "24" the walltime (duration of your reservation in hours).
storage5k will give you a "storage_job_id", and your storage space will be available under this path:

frontend: cd /data/username_storage_job_id
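
The number of chunks for a given dataset size is a simple round-up division. A small sketch, assuming the 10 GB chunk size stated above (the function names are made up for this tutorial):

```python
import math


def chunks_needed(size_gb, chunk_gb=10):
    """Return the number of chunks needed to hold size_gb gigabytes."""
    return math.ceil(size_gb / chunk_gb)


def storage5k_add_command(size_gb, walltime_hours):
    """Build the reservation command shown above for the requested size."""
    return "storage5k -a add -l chunks=%d,walltime=%d" % (
        chunks_needed(size_gb), walltime_hours)


print(storage5k_add_command(50, 24))  # storage5k -a add -l chunks=5,walltime=24
```

A size that is not a multiple of 10 GB is rounded up: 51 GB needs 6 chunks.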

The storage space created by storage5k can be used inside your oarsub reservations with that path, except for reservations of type "deploy", where you have to mount your disk space:

frontend: storage5k -a mount -j node_job_id

In the command above "node_job_id" is the job id of your oarsub reservation.

You can get information about your disk space reservation at any moment with this command:

frontend: storage5k -a info

Other methods for reserving additional disk space exist (like OSIRIM), more details here:
https://www.grid5000.fr/mediawiki/index.php/Storage

Troubleshooting

Random crash and node is not responding

If your grid5000 job crashes randomly with no useful information in the log files, it could be a bug related to cuda and/or to the nvidia graphics driver.

In particular, if your job still has the "running" status when you check it with the "oarstat -j <job id>" command but you cannot connect to the node where the job is running,
it likely means that the linux kernel has crashed ( kernel panic ) on the node because of an nvidia graphics driver bug.

What you can do is contact the grid5000 team by opening a bug report:
https://intranet.grid5000.fr/bugzilla/

and also contact nvidia, providing as much information as possible about the bug:
http://www.nvidia.com/object/driverqualityassurance.html
https://devtalk.nvidia.com/default/board/98/linux/

A workaround is to create a custom grid5000 environment that contains a previous version of the nvidia graphics driver ( not affected by the bug ) and to deploy it on grid5000.

My job has the "waiting" status with a very distant date

If you have reserved a node and you notice that the scheduled start date is very distant ( 2, 3 days or more ), there are several possible explanations:

  • there are not enough gpu nodes available at the moment because there are too many users; you have to wait, reduce your "walltime" option, or try to reserve just one GPU card ( the host=1/core=6 option )
  • a maintenance is scheduled on the gpu nodes, which explains the distant start date; you can check whether a maintenance is scheduled on grid5000 :

https://www.grid5000.fr/status/

Not enough memory, job has been killed

This is a common error, caused by a very large dataset combined with a poor design of the way your python script manages its data.
Even though grid5000 gpu nodes have a lot of memory, you can easily reach the memory limit if you try to load the full content of a large dataset in memory. Loading the full dataset without checking the memory size of the system is a bad practice: it works with small datasets, but sooner or later you will reach the limit with bigger ones.

The best practice is to use the "batch-mode" for loading and training data, as explained in the "tips and tricks" section:
https://www.grid5000.fr/mediawiki/index.php/User:Ibada/Tuto_Deep_Learning#Use_the_batch-mode_training

You can also reduce the risk of memory exhaustion by using the hdf5 format for your dataset: with h5py only the slices you request are read from disk, so the whole file does not have to be held in RAM.
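
A rough estimate of the memory needed by a dense dataset helps to decide before submitting the job. The sketch below is a back-of-the-envelope calculation with hypothetical function names; the 50% safety margin is an arbitrary assumption that leaves room for the model and for temporary copies.

```python
def dataset_bytes(n_rows, n_cols, itemsize=4):
    """Size in bytes of a dense array of n_rows x n_cols values (float32 by default)."""
    return n_rows * n_cols * itemsize


def fits_in_memory(n_rows, n_cols, available_bytes, itemsize=4, safety=0.5):
    """Return True if the array uses at most `safety` times the available memory."""
    return dataset_bytes(n_rows, n_cols, itemsize) <= available_bytes * safety


# Example: 10 000 000 vectors of 100 float32 values need 4 000 000 000 bytes.
print(dataset_bytes(10_000_000, 100))
```

When the estimate does not fit, the batch-mode loading described above is the way to go.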

My deep learning script doesn't work anymore

If your script stops working after an update of keras, theano, tensorflow or pytorch (and you are sure that the error is not due to cuda or grid5000), you need to update your script, because the API of keras/theano/tensorflow/pytorch changes often:
functions can be modified, deprecated or removed, and their syntax (number of parameters, names) can change.
Check the website and the changelog of keras/theano/tensorflow/pytorch regularly in order to anticipate API changes. If your script is older than 6 months, there is a risk that it will not work with the latest version of keras/theano/tensorflow/pytorch.

Useful links

Documentation of deep learning tools

  • keras:

https://keras.io/

  • tensorflow:

https://www.tensorflow.org/get_started/

  • theano:

http://deeplearning.net/software/theano/

  • pytorch:

http://pytorch.org/docs/0.3.0/

  • dist-keras:

https://github.com/cerndb/dist-keras

  • elephas:

https://github.com/maxpumperla/elephas

Tutorials for deep learning

  • an interesting book for deep learning:

http://www.deeplearningbook.org/

  • another tutorial :

http://neuralnetworksanddeeplearning.com/

  • deep learning portal:

http://deeplearning.net/

Various

  • keras blog:

https://blog.keras.io/

  • h5py ( hdf5 python module ):

http://docs.h5py.org/en/latest/quick.html