Machine Learning course

The dates of the course are:

  • 24/09 P320
  • 01/10 A101
  • 15/10 A101
  • 22/10 A101
  • 05/11 A101
  • 12/11 A101
  • 03/12 (exam) A007

Here is the syllabus of a course on Machine Learning that I’m teaching in 2018-2019:


Homework assignment

Objectives

First, download the syslogs corpus: it contains 500k lines of system logs, extracted from the LANL corpus. Each line is generated by a system process. The lines are ordered chronologically.

The final objective is to create a model of the normal execution of this system, so that intrusion attempts can then be detected as anomalies in the observed logs. This last part, intrusion detection, is not part of the homework assignment: you are only required to complete the first part, that is, training a model of the normal execution of the system.

More precisely, the assignment is the following:

  • Implement a character-level LSTM encoder in PyTorch that takes a single line as input (e.g., the line at index t) and outputs a floating-point vector (a “sentence” embedding) that encodes the information contained in that line.
  • Implement a second character-level LSTM, the generator, in PyTorch that takes this vector for line t as input and generates, character by character, the next line t+1. Fix in advance, once and for all, the maximum length of input and output lines to a reasonable value T: when the actual line is shorter, you “pad” it (append a special PADDING character at the end of the line) up to length T; when the actual line is longer, you truncate it and discard the extra characters.
  • Train both LSTMs jointly, end-to-end, as an encoder-decoder to minimize the cross-entropy loss between the generated line t+1 and the actual line t+1.
  • Don’t forget to split the corpus into train/dev/test sets and to evaluate the performance of your model in terms of predicted character accuracy. It is also recommended to report the accuracy obtained for each individual character on the test set, as well as the training and dev loss curves during training.
  • Additional improvements and evaluation metrics may optionally be proposed.
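The data preparation and evaluation steps listed above can be sketched in plain Python, leaving the LSTMs themselves aside. This is only an illustration under stated assumptions, not a reference solution: the names pad_or_truncate, make_pairs, chronological_split and char_accuracy are hypothetical, as are the padding symbol, the tiny value of T, the 80/10/10 split ratios, the choice to split without shuffling, and the choice to ignore padded positions when counting correct characters.

```python
from collections import Counter

PAD = "\0"  # assumed padding symbol: any character that never occurs in the logs
T = 8       # assumed maximum line length, kept tiny here for illustration only


def pad_or_truncate(line, max_len=T, pad=PAD):
    """Cut lines longer than max_len and right-pad shorter ones with pad."""
    return line[:max_len].ljust(max_len, pad)


def make_pairs(lines):
    """Build (line t, line t+1) pairs from chronologically ordered lines."""
    fixed = [pad_or_truncate(line) for line in lines]
    return list(zip(fixed[:-1], fixed[1:]))


def chronological_split(items, train=0.8, dev=0.1):
    """Split into train/dev/test without shuffling (an assumption made here
    because the log lines are ordered in time)."""
    n_train = int(len(items) * train)
    n_dev = int(len(items) * dev)
    return (items[:n_train],
            items[n_train:n_train + n_dev],
            items[n_train + n_dev:])


def char_accuracy(predicted, reference, pad=PAD):
    """Return (overall accuracy, accuracy per reference character).

    Positions where the reference holds the padding symbol are skipped,
    which is one possible convention among others.
    """
    correct, total = Counter(), Counter()
    for pred_line, ref_line in zip(predicted, reference):
        for p, r in zip(pred_line, ref_line):
            if r == pad:
                continue
            total[r] += 1
            correct[r] += int(p == r)
    per_char = {c: correct[c] / total[c] for c in total}
    overall = sum(correct.values()) / max(1, sum(total.values()))
    return overall, per_char
```

In a real experiment, T would be chosen by looking at the distribution of line lengths in the corpus, and the predicted lines passed to char_accuracy would come from the generator LSTM.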

What is expected

Before the 25th of January 2019, send by email to cerisara@loria.fr a zip file containing:

  • Your working directory, including the GIT directory .git/
  • A report in PDF of 5 pages maximum with an analysis of your experimental results
  • The source code

About GIT

This part is important, and differs from most other assignments you are used to, so please read it carefully. As soon as you start your project, you must immediately create a (local) git repository that will contain all the Python source code and README files of your project. On Linux, assuming that you have just created a directory TP for your project and started to write your ideas in the file README.txt, you can create the local git repository as follows:

cd TP
git init
git add README.txt
git commit -am "the beginning of my project"

Then, whenever you add a new Python or README file, you must git add this new file. Also, at the end of every day you work on your project, you must run the git commit command again from within your repository, with a short description of what you’ve done that day, so that your work of the day is saved in the git history. You may check the git history at any time with the command “git log”, or with the GUI “gitk”.

When you build the zip file of your project and send it to me, double check that the .git/ directory (which contains all the git history) is in the zip file. My objective is to evaluate not only the final source code of your project, but also the whole process that you have followed for this homework. In particular, I will check that you have not started your project too late and that you have made regular progress in your code; that you have not simply copy-pasted some code from someone else without understanding it; that you have indeed faced issues when developing, which is normal and expected, and that you have correctly solved these issues and progressed; that you have achieved a sufficient amount of work, etc.
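One possible way to double check the archive before sending it is a short Python script based on the standard zipfile module. This is a minimal sketch: the function name zip_contains_git and the archive name TP.zip are hypothetical, and the check only confirms that at least one entry lies inside a .git/ directory, not that the history is complete.

```python
import zipfile


def zip_contains_git(zip_path):
    """Return True if at least one archive entry lies inside a .git/ directory."""
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Entry names in a zip archive use "/" as separator; the last
            # component is the file name (or "" for a directory entry).
            if ".git" in name.split("/")[:-1]:
                return True
    return False


if __name__ == "__main__":
    import sys

    # Hypothetical default archive name; pass another path as first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "TP.zip"
    print("git history found" if zip_contains_git(path) else ".git/ is missing!")
```

If the script reports that .git/ is missing, rebuild the zip from the parent directory of TP so that hidden directories are included.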

Grading

  • 6 points for analyzing the GIT history: How much did you work? How did you organize your time? How did you progress? What were your first ideas and how did you change your mind during your work? Is it really your own work/ideas or did you rather copy/paste code from someone else? …
  • 6 points for the PDF report: Did you analyze the results correctly? What did you understand from your results? Did you choose to plot the most relevant curves/metrics? What are your conclusions? …
  • 8 points for the source code: Is it working as expected? Does it fulfill the objectives? Is it your own contribution? Is it buggy? Is it clear and well organized? Is it maintainable? …
Written on July 9, 2018