Paris NLP Season 5 Meetup #1

Ludan Stoecklé, CTO of the AI Lab at BNP Paribas CIB


RosaeNLG is the first open-source Natural Language Generation (NLG) engine suitable for production use. Its author, Ludan Stoecklé, will explain what NLG is and its use cases, recount the journey that led him to create RosaeNLG (along with the legal issues encountered), and present the library.

Ludan Stoecklé is a professional of the AI software industry. An IT engineering graduate of INSA Lyon (2003), he was CTO of Yseop, a Natural Language Generation (NLG) software vendor, for 9 years, before building the technical team of Addventa, an AI consulting company.
Today, as CTO of the AI Lab at BNP Paribas CIB, Ludan develops and industrializes AI products internally.
Beyond his passion for NLG, chatbots and running, Ludan is also known worldwide as a paperweight collector.

Philippe de Saint Chamas, Data Scientist, Kili Technology

Data augmentation for NLP: what really works?

Getting the data to build a model can seem an insoluble problem: how do you build a good chatbot intent model without user data? And how do you attract users without a good model?

One lever for this chicken-and-egg problem is data augmentation. However, naive augmentation of text data can introduce biases; can methods such as counterfactual annotation or a systematic exploration of augmentation techniques do better?

In this talk we will see, through concrete examples, how to combine human annotation and automatic augmentation to create large, rich datasets that increase the robustness of the models.
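To make the "naive augmentation" the talk warns about concrete, here is a minimal sketch of synonym-replacement augmentation. The synonym table and the seed sentence are invented for illustration; real systems draw on resources like WordNet, and this is exactly the kind of blind substitution that can introduce the biases discussed above.

```python
import random

# Tiny hand-made synonym table (illustrative only).
SYNONYMS = {
    "good": ["great", "fine"],
    "movie": ["film"],
    "buy": ["purchase"],
}

def augment(sentence, rng):
    """Replace each word found in the synonym table with a random synonym."""
    out = []
    for w in sentence.split():
        choices = SYNONYMS.get(w.lower())
        out.append(rng.choice(choices) if choices else w)
    return " ".join(out)

rng = random.Random(0)
seed = "I want to buy a good movie"
augmented = {augment(seed, rng) for _ in range(10)}
print(sorted(augmented))
```

Note that the substitution ignores context entirely: it would happily rewrite "a good 10 minutes", which is why augmented samples still need human review.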

Paris NLP Season 4 Meetup #6

Wacim Belblidia and Martin d’Hoffschmidt, ILLUIN Technology

We will share and present ILLUIN's research work on Question Answering for the French language.
This work covers the collection of the largest French Question Answering dataset (FQuAD), the training of a native French Question Answering model on the CamemBERT language model, and a large benchmark of multilingual transfer learning methods.


Recent advances in the field of language modeling have improved state-of-the-art results on many Natural Language Processing tasks. Among them, Reading Comprehension has made significant progress over the past few years. However, most results are reported in English since labeled resources available in other languages, such as French, remain scarce. In the present work, we introduce the French Question Answering Dataset (FQuAD). FQuAD is a French Native Reading Comprehension dataset of questions and answers on a set of Wikipedia articles that consists of 25,000+ samples for the 1.0 version and 60,000+ samples for the 1.1 version. We train a baseline model which achieves an F1 score of 92.2 and an exact match ratio of 82.1 on the test set.
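The two figures reported above are the standard extractive-QA metrics: exact match and token-level F1 between the predicted and gold answer spans. A simplified sketch of how they are computed for one prediction (the official SQuAD/FQuAD script additionally strips punctuation and articles, which is omitted here):

```python
from collections import Counter

def exact_match(prediction, gold):
    """1 if the normalized strings are identical, else 0."""
    return int(prediction.strip().lower() == gold.strip().lower())

def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall between the two answers."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Victor Hugo", "victor hugo"))                 # 1
print(round(token_f1("the writer Victor Hugo", "Victor Hugo"), 2))
```

Dataset-level scores like the 92.2 F1 above are these per-question values averaged over the test set (taking the best match over the available gold answers).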

Sebastien Stormacq, AWS

Build conversational interfaces for your customers

Talking and listening is the most natural way to interact; we have been learning to do so since the day we were born. In this session, I will show you how to build great conversational UIs to delight your customers. We will cover the basics of speech recognition and natural language processing, and explore the main programming interfaces and best practices for creating engaging conversational interfaces. I will illustrate these concepts with Amazon Lex and Amazon Polly.



Paris NLP Season 4 Meetup #3

We would first like to thank JobTeaser as host of this Season 4 Meetup #3 and also our speakers for their very interesting presentations.


• Thomas Belhalfaoui, Lead Data Scientist @ JobTeaser

Siamese CNN for job-candidate matching: learning document embeddings with a triplet loss.

At JobTeaser, we are the official career center of more than 500 schools and universities throughout Europe, where companies can multi-post their job offers.
Our mission: help students and recent graduates find their dream job. Among other tools we develop, we try to recommend job offers of interest to our users.

For this purpose, we built a Siamese Convolutional Neural Network that takes job offer and student resume texts as inputs and yields job and resume embeddings in a shared Euclidean space. Recommendation then simply amounts to finding the nearest neighbors.
We train the network with a triplet loss on historical application feedback.
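The triplet loss mentioned above can be sketched in a few lines of plain Python. The 3-d embeddings below are invented stand-ins for the CNN's outputs; in the real pipeline the vectors come from the Siamese network and the loss gradient is backpropagated through it.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss pushing the positive closer to the anchor than the
    negative, by at least `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Hypothetical 3-d embeddings: a job offer (anchor), a resume behind a
# successful application (positive), and one that was not a match (negative).
job     = [0.0, 1.0, 0.0]
hired   = [0.1, 0.9, 0.0]
ignored = [1.0, 0.0, 1.0]

print(triplet_loss(job, hired, ignored))  # 0.0: the margin is already satisfied
```

When the loss is zero, the triplet contributes no gradient; training focuses on triplets where the negative is still too close to the anchor.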

Slides Jobteaser (Siamese CNN job candidate matching)


• Djamé Seddah, Associate Professor in CS @ Inria

Sesame street-based naming schemes must fade out, long live CamemBERT et le French fromage!

As cliché as it sounds, pretrained language models are now ubiquitous in Natural Language Processing, the most prominent one arguably being BERT (Devlin et al., 2018). Many works have shown that BERT-based models are able to capture meaningful syntactic information using nothing but raw text for training (e.g. Jawahar et al., 2019), and this ability is probably one of the reasons for their success.

Still, until very recently, most available models were trained either on English data or on a concatenation of data in multiple languages. In this talk, we'll present the results of a work that investigates the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating our language models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks. We show that the use of web-crawled data is preferable to the use of Wikipedia data. More surprisingly, we show that a relatively small web-crawled dataset (a few gigabytes) leads to results as good as those obtained with datasets two orders of magnitude larger. Our best-performing model, CamemBERT, reaches or improves the state of the art in all four downstream tasks.

Presented by Djamé Seddah, joint work with Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, and Benoît Sagot.

Slides camemBERT

Video of the talks


Paris NLP Season 4 Meetup #2

We would first like to thank Inato as host of this Season #4 Meetup #2 and also our speakers for their very interesting presentations.

• Pierre Bros, Inato [Talk in French]

“Building unique hospital profiles using clustering and classification: an evolution of our approach as Inato grew”

At Inato we match clinical trials with qualified sites [site ~ hospital] worldwide and optimize site performance [performance ~ number of patients recruited] throughout the trial. We help biopharma companies identify high-performing sites, increase the pool of available patients, and deliver efficient, on-time studies.

In order to get the most accurate and complete information possible, we scrape multiple data sources, that come in different languages and formatting. The creation of a unique profile for each site means we need to be able to identify all the different names a site can take.

We will show how we approached this challenge for the last 2 years, and how our solution evolved with Inato.

slides talk Inato


• Edouard d’Archimbaud & Pierre Marcenac, Kili Technology [Talk in French]

“How to scale up training data?”

“It is better to have a standard algorithm on a lot of quality data than a state-of-the-art algorithm on a lot of data.”

Data labelling has thus become an essential, if very painful, step in the modelling process.
However, annotation at scale requires a combination of intuitive interfaces and machine learning (for example, to pre-annotate). Moreover, labelling at scale without compromising data quality requires transparency throughout the labelling process, to facilitate quality monitoring and collaboration, both internally and with external annotators.
We will show how we at Kili have structured our annotation pipeline to scale up and, better yet, to help get models into production by facilitating human supervision and continuous learning.

slides talk Kili Technology


• Pauline Chavallard, Doctrine [Talk in French]

“Structuring legal documents with deep learning”

Court decisions are traditionally long and complex documents. To make things worse, it is not uncommon for a lawyer to only be interested in the operative part of the judgement (for example, the outcome of the trial).
In fact, in general, it is pretty standard to be looking for a specific legal aspect, which can quickly feel like looking for a needle in a haystack. As such, our goal was to detect the underlying structure of decisions on Doctrine (i.e. the table of contents) to help users navigate them more easily.
Decisions can be seen as small stories. While humans can understand them because they are naturally context-aware and have some expectations, how should an algorithm operate?
In order to address this challenging issue, we trained a neural network (bi-LSTM with attention) using PyTorch to help us predict a suitable table of contents given a free text decision.
This talk goes into more detail about our methodology and results.

slides talk Doctrine


Videos of the talks

Paris NLP Season 4 Meetup #1 at Algolia

We would first like to thank Algolia as host of this Season #4 Meetup #1 and also our speakers for their very interesting presentations.

You can find the slides of our speakers below:


• Florian Strub, Research Scientist @ DeepMind
Multimodal learning

While our representation of the world is shaped by our perceptions, our languages, and our interactions, these have traditionally been distinct fields of study in machine learning. Fortunately, this partitioning started opening up with the recent advent of deep learning methods, which standardized raw feature extraction across communities. However, multimodal neural architectures are still in their infancy.
This presentation will focus on visually grounded language learning, for three reasons: (i) vision and language are both well-studied modalities across different scientific fields; (ii) it builds upon deep learning breakthroughs in natural language processing and computer vision; (iii) the interplay between language and vision has long been acknowledged in cognitive science.

This presentation will be divided into three parts:
First, we will motivate our line of research by discussing the language grounding problem. (5-7 min)
Then, we will introduce some fundamental visual grounding tasks that have been explored in the past 3 years. (2-3 min)
Finally, we will focus on a specific kind of multimodal architecture, namely Modulation Layers (i.e., Conditional Batch Norm and FiLM). (10-12 min)


Slides: Language and Perception in Deep Learning


• Felix Le Chevallier, Lead Data Scientist @ Lifen

[PDF] Hacking Interoperability in Healthcare with AI: Structuring Medical Data to digitize medical communications

How we scaled from 0 to 100k daily predictions served to healthcare practitioners to help them communicate more efficiently: from simple heuristics with handcrafted rules and only a couple of clients, to classical machine learning, and then to RNNs for structuring information in free-form medical notes.



• Janna Lipenkova, Founder @ Anacode

[PDF] Applications in data and text analytics often have an ontology as their conceptual backbone – that is, a hierarchical representation of the underlying knowledge domain. However, such representations are tedious to construct, maintain and customize in a manual fashion. In this talk, I will show how text data and lexical relations such as hypernymy, synonymy and meronymy can be leveraged to automatically construct ontologies. After a review of different unsupervised and distant-supervised methods proposed for lexical relation extraction from text, I will explain Anacode’s approach to building and maintaining large-scale, multilingual ontologies for the domain of business and market intelligence.

Paris NLP Season 3 Meetup #6 at Scaleway

We would first like to thank Scaleway as host of this Meetup #6 and also our speakers for their very interesting presentations.


You can find the slides of our speakers below:


• Olga Petrova, Machine Learning DevOps Engineer at Scaleway

Subject: Understanding text with BERT


Reading comprehension is one of those fundamental human skills that nevertheless present a highly non-trivial problem for a machine learning system. One way to begin tackling it is to cast it as question answering over a given text. In this talk we shall look at how to approach this task using one of the latest advances in deep learning for NLP: the Transformer architecture, which has come to replace RNN-based models for many NLP tasks. In particular, we will go through an example of training a model based on BERT, a pre-trained Transformer encoder network, on SQuAD (the Stanford Question Answering Dataset).
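For SQuAD-style question answering, a BERT model outputs two scores per token (one for being the start of the answer, one for the end), and the predicted answer is the span maximizing their sum. A sketch of that decoding step, with dummy scores standing in for the fine-tuned model's logits:

```python
def best_span(start_scores, end_scores, max_len=15):
    """Return (i, j) maximizing start_scores[i] + end_scores[j], with i <= j
    and a bounded span length, as in SQuAD-style answer decoding."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best_score = s + end_scores[j]
                best = (i, j)
    return best

tokens = ["the", "tower", "was", "built", "in", "1889", "in", "paris"]
# Dummy logits standing in for the model's two output heads.
start = [0.1, 0.2, 0.0, 0.1, 0.3, 2.5, 0.1, 0.4]
end   = [0.0, 0.1, 0.1, 0.2, 0.1, 2.8, 0.0, 0.3]

i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # 1889
```

The i <= j constraint is what makes this a span search rather than two independent classifications; production decoders typically also consider the top-k starts and ends rather than brute-forcing every pair.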

[PDF] Meetup_Paris_NLP_24_07_2019_Scaleway


• Axel de Romblay, Machine Learning Engineer at Dailymotion

Subject: How to build a multilingual text classifier?

In this talk, we will introduce one of the biggest challenges we face at Dailymotion: how do we accurately categorize our video catalog at scale using video descriptions?
The purpose is to introduce the whole pipeline running at Dailymotion, which relies on a complex mix of different methods: machine learning for language detection, named entity linking (NEL) to the Wikidata knowledge graph, deep learning using sparse representations, and NLP with multilingual embeddings & robust transfer learning.
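The first stage of such a pipeline, language detection, can be illustrated with a crude stopword-overlap heuristic. The stopword lists below are tiny invented samples; real detectors (including whatever Dailymotion runs) use character n-gram models or learned classifiers, but the scoring idea is the same.

```python
# Minimal illustrative stopword samples per language.
STOPWORDS = {
    "en": {"the", "is", "and", "of", "to", "a"},
    "fr": {"le", "la", "est", "et", "de", "un"},
    "es": {"el", "es", "y", "de", "un", "la"},
}

def detect_language(text):
    """Pick the language whose stopword list overlaps the text the most."""
    words = set(text.lower().split())
    scores = {lang: len(words & sw) for lang, sw in STOPWORDS.items()}
    return max(scores, key=scores.get)

print(detect_language("the cat is on the roof"))   # en
print(detect_language("le chat est sur le toit"))  # fr
```

The weakness is also visible here: very short or stopword-free descriptions score zero everywhere, which is one reason production detectors fall back to character-level features.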

Reference :

[PDF] Meetup_Paris_NLP_24_07_2019_Dailymotion


• Arthur Darcet & Mehdi Hamoumi, Glose

Subject: Measuring text readability with strong and weak supervision


Text complexity is mainly described by three factors:
* Readability: text content, such as vocabulary, syntax, and discourse.
* Legibility: text form, such as character size, font, and formatting (e.g. emphasis).
* Reader-dependent features, such as reading ability and reading context: environment (noisy, calm, classroom, subway) or intent (educational, recreational).

At Glose, we built a product where readers can discover, read, and annotate thousands of e-books, and share them with their friends. It is currently used by thousands of readers worldwide, especially in academia, where collaborative reading is a great feature for professors/teachers and their students.
To improve the reading experience, we are currently working on automatic text readability evaluation to enhance book recommendation, which should ease a reader's learning curve.
We tackle this NLP task with both supervised and unsupervised machine learning approaches.

During this talk, we will present our supervised pipeline [1], which encodes a book's content into a set of features and uses them to fit model parameters able to predict a readability score.
Then, we will introduce an unsupervised approach to this task [2], based on the following hypothesis: the simpler a text is, the better it should be understood by a machine. It consists in correlating the ability of multiple language models (LMs) to fill in Cloze tests with readability-level labels.
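The "better understood by a machine" intuition can be illustrated with the simplest possible language model: a smoothed unigram model, under which easier (more familiar) text gets lower perplexity. This is only a toy proxy; the talk's approach [2] uses full LMs on Cloze tests, and the corpora below are invented.

```python
import math
from collections import Counter

def train_unigram(corpus):
    """Count word frequencies in a (tiny, illustrative) training corpus."""
    counts = Counter(corpus.lower().split())
    return counts, sum(counts.values())

def perplexity(text, counts, total, alpha=1.0):
    """Per-word perplexity under an add-alpha smoothed unigram model."""
    vocab = len(counts) + 1  # +1 for the unknown-word bucket
    words = text.lower().split()
    log_prob = sum(
        math.log((counts[w] + alpha) / (total + alpha * vocab)) for w in words
    )
    return math.exp(-log_prob / len(words))

counts, total = train_unigram("the cat sat on the mat the dog sat on the rug")
easy = "the cat sat on the mat"
hard = "heretofore unconscionable perambulation"
print(perplexity(easy, counts, total) < perplexity(hard, counts, total))  # True
```

A real readability study would compare perplexities from neural LMs across texts with known grade levels and check that the ranking correlates with the labels.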


[PDF] Meetup Paris NLP 24_07_2019_Glose


Meetup’s video:

Paris NLP Season 3 Meetup #5 at LinkFluence

Full room for Paris NLP @ Linkfluence

We would first like to thank Linkfluence as host of this Meetup #5 and also our speakers for their very interesting presentations. And thanks to the many attendees who joined this session!

You can find the slides of our speakers below:

• Alexis Dutot, Linkfluence

At Linkfluence, we analyze millions of social media posts per day in more than 60 languages. This represents thousands of noisy user-generated documents per second passing through our internal enrichment pipeline. This volume, combined with the real-time constraint, prevents us from using cross-lingual BERT-like models.

In this talk we will focus on multilingual sentiment analysis and emotion detection tasks based on social media data. Only a few annotated corpora tackle these tasks, and the vast majority of them are dedicated to the English language. We will see how we fully exploit the potential of emojis, as a universal expression of sentiment and emotion, to build accurate "real-time" deep learning systems for sentiment analysis and emotion detection in several languages using solely English annotated corpora.
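The core idea of using emojis as language-independent labels can be sketched as distant supervision: posts are auto-labeled from the emojis they contain, and the emojis are then stripped so the model must learn from the surrounding words. The emoji-to-label map below is a tiny illustrative sample, not Linkfluence's actual inventory.

```python
# Illustrative emoji-to-sentiment map; real systems use far larger inventories.
EMOJI_SENTIMENT = {
    "😍": "positive", "😊": "positive", "👍": "positive",
    "😡": "negative", "😢": "negative", "👎": "negative",
}

def distant_label(post):
    """Label a post from its emojis, then strip them so the classifier
    cannot simply memorize the emoji itself. Returns None if the post has
    no emoji or carries conflicting signals."""
    labels = {EMOJI_SENTIMENT[ch] for ch in post if ch in EMOJI_SENTIMENT}
    if len(labels) != 1:
        return None
    text = "".join(ch for ch in post if ch not in EMOJI_SENTIMENT).strip()
    return text, labels.pop()

print(distant_label("ce film est génial 😍"))  # ('ce film est génial', 'positive')
print(distant_label("meh 😍😡"))               # None: conflicting emojis
```

Because emojis are shared across languages, the same labeling rule yields training data in any language the platform covers, which is what makes the English-only-corpora claim above plausible.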

[PDF] Alexis Dutot, Linkfluence

• Benoît Lebreton, Sacha Samama and Tom Stringer, Quantmetry

Melusine is an open source library developed by Quantmetry and MAIF. The talk focuses on technical issues raised by Melusine’s open source implementation, as well as underlying neural models and algorithms that are being leveraged.

[PDF] Benoît Lebreton, Sacha Samama and Tom Stringer, Quantmetry


Paris NLP Season 3 Meetup #4 at MeilleursAgents

We would first like to thank MeilleursAgents as host of this meetup, then thank our 3 speakers for their very interesting presentations, and also thank the participants for once again attending in such numbers.

You can find the slides of our three speakers below:

• Syrielle Montariol, LIMSI, CNRS

Word usage, meaning, and connotation change over time, echoing various aspects of the evolution of society (cultural, technological…). For example, the word “Katrina”, originally associated with female names, moved closer to the disaster vocabulary after Hurricane Katrina struck in August 2005.
Diachronic word embeddings are used to capture such change in an unsupervised way: this is useful for linguistic research on the evolution of languages, but also for standard NLP tasks over long time-range corpora.
In this talk, I will introduce a selection of methods for training time-varying word embeddings and evaluating them, with greater emphasis on probabilistic word embedding models.
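One common way to quantify the kind of drift described above is cosine distance between a word's embeddings from different time slices (after aligning the spaces). A sketch with invented 3-d vectors for the "Katrina" example; real diachronic embeddings are hundreds of dimensions and trained per period.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented embeddings of "katrina" from two time slices, plus two probe axes.
katrina_2004 = [0.9, 0.1, 0.0]   # pre-2005 slice
katrina_2006 = [0.2, 0.9, 0.1]   # post-2005 slice
name_axis     = [1.0, 0.0, 0.0]  # direction of first-name vocabulary
disaster_axis = [0.0, 1.0, 0.0]  # direction of disaster vocabulary

drift = 1 - cosine(katrina_2004, katrina_2006)
print(round(drift, 2))
print(cosine(katrina_2006, disaster_axis) > cosine(katrina_2006, name_axis))  # True
```

Words with high cross-period cosine distance are candidates for semantic change, which is how such methods surface examples like "Katrina" without supervision.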

Slides Syrielle Montariol, LIMSI, CNRS

• Pierre Pakey and Dimitri Lozeve, Destygo

If data beats models, why not build models that produce data? Vast quantities of realistic labeled data will always make the difference in machine learning optimization problems. At Destygo, we automatically leverage the interactions between users and our conversational AI agents to produce vast quantities of labelled data and train our natural language understanding algorithms in a reinforcement learning framework. We will present the outline of our self-learning pipeline, its relation to the state-of-the-art literature, and the specificities of the NLP setting. Finally, we will focus on the network responsible for choosing whether or not to try something new, which is one of the important pieces of the process.

Slides Pierre Pakey & Dimitri Lozeve, Destygo

• Julien Perez, Machine Learning and Optimization group, Naver Labs Europe

Over the last 5 years, differentiable programming and deep learning have become the de facto standard for a vast set of decision problems in data science. Three factors have enabled this rapid evolution. First, the availability and systematic collection of data have made it possible to gather and leverage large quantities of traces of intelligent behavior. Second, the development of standardized frameworks has dramatically accelerated differentiable programming and its application to the major modalities of the digital world: image, text, and sound. Third, powerful and affordable computational infrastructure has enabled this new step toward machine intelligence. Beyond these encouraging results, new limits have arisen and need to be addressed. Automatic common-sense acquisition and reasoning capabilities are two of the frontiers that major machine learning research labs are now tackling. In this context, human language has once again become a support of choice for such research. In this talk, we will take a natural language understanding task, machine reading, as a medium to illustrate the problem and describe the research progress made throughout the machine reading project. First, we will describe several limitations of current decision models. Second, we will discuss adversarial learning and how such approaches make learning more robust. Third, we will explore several differentiable transformations aimed at moving toward these goals. Finally, we will discuss ReviewQA, a machine reading corpus over human-generated hotel reviews, which aims to encourage research around these questions.

Slides Julien Perez, Machine Learning and Optimization group, Naver Labs Europe



Paris NLP Season 3 Meetup #3 at Doctrine

We would first like to thank Doctrine as host of this meetup, then thank our 3 speakers for their presentations, and also thank the participants for attending in such numbers.

You can find the slides of our three speakers below:

Hugo Vasselin & Benoît Dumeunier, Artefact

How do you redefine a brand's image with a simple word counter? This talk celebrates the meeting of data science and creative work. It tells how basic NLP techniques, combined with a creative approach, made it possible to redefine a brand. First, we built a tool to gauge how the different brands of a large hotel group are perceived around the world, relative to their competitors. These data brought out a number of values dear to guests, which served as pillars for creative and innovative brand experiences…

Slides Hugo Vasselin & Benoît Dumeunier (Artefact)

Romain Vial, Hyperlex

Hyperlex is a contract analytics and management solution powered by artificial intelligence. Hyperlex helps companies manage and make the most of their contract portfolio by identifying relevant information and data to manage key contractual commitments during the whole life of the contract. Our technology rests on a combination of specifically trained Natural Language Processing (NLP) algorithms and advanced machine learning techniques.

In this talk, I will present some of the challenges we are currently solving at Hyperlex through a focus on two important NLP tasks: (i) learning representations for texts and words using recent language modelling techniques; and (ii) building knowledge from predictions by mining relations in legal documents.

Slides Romain Vial (Hyperlex)

Grégory Châtel, Lead R&D @ Disaitek and member of the Intel AI Software Innovator program

In this talk, I will present two recent research articles from OpenAI and Google AI Language about transfer learning in NLP, and their implementation.

Historically, transfer learning for NLP neural networks was limited to reusing pre-computed word embeddings. Recently a new trend has appeared, much closer to what transfer learning looks like in computer vision, consisting in reusing a much larger part of a pre-trained network. This approach makes it possible to reach state-of-the-art results on many NLP tasks with minimal code modification and training time. In this presentation, I will cover the underlying architectures of these models, the generic pre-training tasks, and an example of using such a network to complete an NLP task.

Slides Grégory Châtel (Disaitek)

Paris NLP Season 3 Meetup #2 at Méritis

Thanks to our host Meritis

• François Yvon, LIMSI/CNRS

Using monolingual data in Neural Machine Translation

Modern machine translation rests on the availability of appropriate parallel corpora, which are scarce and costly to accumulate. Monolingual corpora are much easier to obtain and can be easily integrated into Statistical Machine Translation systems, where they have proven to be of great help. The issue is slightly different in Neural Machine Translation (NMT), and how to take advantage of these resources is still under discussion. In this talk, I will try to summarize a series of recent papers on the topic and comment on the current state of the debate. This will also give me the opportunity to discuss research in NMT in more general terms. This work was conducted jointly with Franck Burlot.


• Kezhan Shi, Data Science Manager at Allianz France

Kezhan will present interesting results obtained with NLP techniques in an insurance project, through an in-depth case study involving:

– string distance and phonetic distance (used in geocoding for fuzzy string matching)
– document classification (for recognizing a construction firm's activity)
– word2vec (for understanding construction firms' activities)
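The string-distance item above typically means something like Levenshtein edit distance: the minimum number of single-character edits separating two strings, used to match noisy names against a reference list. A self-contained sketch; the street names are invented examples, not Allianz data.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions turning string a into string b (rolling-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Fuzzy-matching a misspelled street name against a reference list,
# as in geocoding.
reference = ["rue de rivoli", "rue de la paix", "avenue montaigne"]
query = "rue de rivolli"
best = min(reference, key=lambda r: levenshtein(query, r))
print(best, levenshtein(query, best))  # rue de rivoli 1
```

Phonetic distances (e.g. Soundex-style codes) complement this by matching strings that sound alike even when their spellings differ by more than a few edits.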