Paris NLP Season 4 Meetup #2

We would first like to thank Inato for hosting this Season 4 Meetup #2, as well as our speakers for their very interesting presentations.

• Pierre Bros, Inato (https://inato.com/) [Talk in French]

“Building unique hospital profiles using clustering and classification: an evolution of our approach as Inato grew”

Description:
At Inato we match clinical trials with qualified sites [site ~ hospital] worldwide and optimize site performance [performance ~ number of patients recruited] throughout the trial. We help biopharma companies identify high-performing sites, increase the pool of available patients, and deliver efficient, on-time studies.

In order to get the most accurate and complete information possible, we scrape multiple data sources that come in different languages and formats. Creating a unique profile for each site means we need to be able to identify all the different names a site can go by.

We will show how we have approached this challenge over the last two years, and how our solution evolved as Inato grew.
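To make the name-matching challenge concrete, here is a minimal, hypothetical sketch (not Inato's production system) that groups site-name variants using character n-gram TF-IDF and cosine similarity; the names and the threshold are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical site names scraped from different sources.
names = [
    "Hopital Saint-Louis, Paris",
    "Hôpital St Louis (AP-HP)",
    "CHU de Nantes",
    "Nantes University Hospital",
]

# Character n-grams are robust to accents, abbreviations and word order.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
similarity = cosine_similarity(vectorizer.fit_transform(names))

# Greedily group names whose similarity to a group's first member
# exceeds a threshold (real systems would use proper clustering).
threshold, groups = 0.5, []
for i, name in enumerate(names):
    for group in groups:
        if similarity[i, group[0]] >= threshold:
            group.append(i)
            break
    else:
        groups.append([i])

print([[names[i] for i in g] for g in groups])
```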

Slides: Inato talk

————

• Edouard d’Archimbaud & Pierre Marcenac, Kili Technology (https://kili-technology.com/) [Talk in French]

“How to scale up training data?”

Description:
“It is better to have a standard algorithm trained on a lot of high-quality data than a state-of-the-art algorithm trained on a lot of poor-quality data.”

Data labelling, however painful, has thus become an essential step in the modelling process.
However, annotation at scale requires a combination of intuitive interfaces and machine learning (for example, to pre-annotate). Moreover, labelling at scale without compromising data quality requires transparency throughout the labelling process, to facilitate quality monitoring and collaboration both internally and with external annotators.
We will show how we at Kili have structured our annotation pipeline to scale up and, better yet, to help get models into production by facilitating human supervision and continuous learning.
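As a toy illustration of the pre-annotation idea (a generic sketch, not Kili's implementation): a model trained on the already-labelled pool pre-fills confident labels and routes uncertain examples to human annotators. All data below is invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy pre-annotation model trained on the already-labelled pool.
labelled = ["invoice total due", "patient blood pressure", "payment overdue", "cardiac exam"]
labels = ["finance", "medical", "finance", "medical"]
model = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(labelled, labels)

def pre_annotate(texts, threshold=0.8):
    """Pre-fill confident predictions; route the rest to human annotators."""
    probs = model.predict_proba(texts)
    for text, p in zip(texts, probs):
        best = p.argmax()
        if p[best] >= threshold:
            yield text, model.classes_[best]   # shown to the annotator as a pre-label
        else:
            yield text, None                   # annotated from scratch

print(list(pre_annotate(["unpaid invoice reminder", "quarterly report"])))
```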

Slides: Kili Technology talk

————

• Pauline Chavallard, Doctrine (https://www.doctrine.fr/) [Talk in French]

“Structuring legal documents with deep learning”

Description:
Court decisions are traditionally long and complex documents. To make things worse, it is not uncommon for a lawyer to be interested only in the operative part of the judgment (for example, the outcome of the trial).
In fact, it is pretty standard to be looking for a specific legal aspect, which can quickly feel like looking for a needle in a haystack. Our goal was therefore to detect the underlying structure of decisions on Doctrine (i.e. their table of contents) to help users navigate them more easily.
Decisions can be seen as small stories. Humans can understand them because they are naturally context-aware and have some expectations, but how should an algorithm proceed?
In order to address this challenging issue, we trained a neural network (a bi-LSTM with attention) in PyTorch to predict a suitable table of contents for a free-text decision.
This talk goes into more detail about our methodology and results.
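For readers curious what such a model can look like, here is a minimal PyTorch sketch of a bi-LSTM with attention pooling that classifies a text span into a section label; the dimensions and label set are invented, and Doctrine's actual architecture may differ:

```python
import torch
import torch.nn as nn

class AttentiveBiLSTM(nn.Module):
    """Encodes a span of a decision and predicts its section label."""
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, n_sections=8):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)           # scores each timestep
        self.classifier = nn.Linear(2 * hidden_dim, n_sections)

    def forward(self, token_ids):                          # (batch, seq_len)
        h, _ = self.lstm(self.embedding(token_ids))        # (batch, seq_len, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)       # attention over timesteps
        context = (weights * h).sum(dim=1)                 # attention-pooled summary
        return self.classifier(context)                    # (batch, n_sections)

model = AttentiveBiLSTM(vocab_size=30000)
logits = model(torch.randint(1, 30000, (4, 120)))          # 4 spans of 120 tokens
```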

Slides: Doctrine talk

 

Videos of the talks

Paris NLP Season 4 Meetup #1 at Algolia

We would first like to thank Algolia for hosting this Season 4 Meetup #1, as well as our speakers for their very interesting presentations.

You can find the slides of our speakers below:

———-

• Florian Strub, Research Scientist @ DeepMind
Multimodal learning

Description:
While our representation of the world is shaped by our perceptions, our languages, and our interactions, these have traditionally been distinct fields of study in machine learning. Fortunately, this partitioning started opening up with the recent advent of deep learning methods, which standardized raw feature extraction across communities. However, multimodal neural architectures are still in their infancy.
In this presentation, we will focus on visually grounded language learning for three reasons: (i) vision and language are both well-studied modalities across different scientific fields; (ii) it builds upon deep learning breakthroughs in natural language processing and computer vision; (iii) the interplay between language and vision has been acknowledged in cognitive science.

This presentation will be divided into three parts:
First, we will motivate our line of research by discussing the language grounding problem. (5-7 min)
Then, we will introduce some fundamental visual grounding tasks that have been explored over the past 3 years. (2-3 min)
Finally, we will focus on a specific kind of multimodal architecture, namely modulation layers (i.e., Conditional Batch Norm and FiLM). (10-12 min)
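For reference, a FiLM layer is simple to express: a linear map predicts per-channel scale and shift parameters from the language embedding and applies them to the visual feature maps. A minimal PyTorch sketch (dimensions invented):

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: scale and shift visual feature maps
    with parameters predicted from a language embedding."""
    def __init__(self, lang_dim, n_channels):
        super().__init__()
        self.to_gamma_beta = nn.Linear(lang_dim, 2 * n_channels)

    def forward(self, feature_maps, lang_embedding):
        # feature_maps: (batch, C, H, W); lang_embedding: (batch, lang_dim)
        gamma, beta = self.to_gamma_beta(lang_embedding).chunk(2, dim=-1)
        return gamma[..., None, None] * feature_maps + beta[..., None, None]

film = FiLM(lang_dim=64, n_channels=32)
modulated = film(torch.randn(2, 32, 14, 14), torch.randn(2, 64))  # (2, 32, 14, 14)
```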

Materials:
http://papers.nips.cc/paper/7237-modulating-early-visual-processing-by-language
https://distill.pub/2018/feature-wise-transformations/

Slides: Language and Perception in Deep Learning

———-

• Felix Le Chevallier, Lead Data Scientist @ Lifen

[PDF] Hacking Interoperability in Healthcare with AI: Structuring Medical Data to digitize medical communications

How we scaled from 0 to 100k daily predictions served to healthcare practitioners to help them communicate more efficiently: from simple heuristics with handcrafted rules and only a couple of clients, to classical machine learning, and then to RNNs that structure information in free-form medical notes.

 

———-

• Janna Lipenkova, Founder @ Anacode

[PDF] Applications in data and text analytics often have an ontology as their conceptual backbone – that is, a hierarchical representation of the underlying knowledge domain. However, such representations are tedious to construct, maintain and customize in a manual fashion. In this talk, I will show how text data and lexical relations such as hypernymy, synonymy and meronymy can be leveraged to automatically construct ontologies. After a review of different unsupervised and distant-supervised methods proposed for lexical relation extraction from text, I will explain Anacode’s approach to building and maintaining large-scale, multilingual ontologies for the domain of business and market intelligence.
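As a hedged illustration of the kind of pattern-based relation extraction this line of work builds on (a classic Hearst pattern for hypernymy, not Anacode's system; the sentence is invented):

```python
import re

text = "We track sectors such as fintech, healthtech and edtech."

# Hearst pattern: "X such as Y, Z and W" -> (hyponym, hypernym) pairs.
pattern = re.compile(r"(\w+) such as ((?:\w+(?:, | and )?)+)")
for hypernym, tail in pattern.findall(text):
    for hyponym in re.split(r", | and ", tail):
        print(hyponym, "is-a", hypernym)
```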

Paris NLP Season 3 Meetup #6 at Scaleway

We would first like to thank Scaleway for hosting this Meetup #6, as well as our speakers for their very interesting presentations.

 

You can find the slides of our speakers below:

 

• Olga Petrova, Machine Learning DevOps Engineer at Scaleway

Subject: Understanding text with BERT

Abstract:

Reading comprehension is one of the fundamental human skills that nevertheless presents a highly nontrivial problem for a machine learning system. One way to begin tackling it is to cast it as question answering over a given text. In this talk we shall look at how to approach this task using the latest advance in deep learning for NLP: the Transformer architecture, which has come to replace RNN-based models for many NLP tasks. In particular, we will go through an example of training a model based on BERT, a pre-trained Transformer encoder, on SQuAD (the Stanford Question Answering Dataset).
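As a taste of what this looks like in practice today, a SQuAD-fine-tuned BERT checkpoint can be queried in a few lines with the Hugging Face transformers library (a generic example, not the exact code from the talk):

```python
from transformers import pipeline

# A BERT checkpoint fine-tuned on SQuAD; any compatible QA model works here.
qa = pipeline(
    "question-answering",
    model="bert-large-uncased-whole-word-masking-finetuned-squad",
)

result = qa(
    question="What dataset is the model trained on?",
    context="The model is fine-tuned on SQuAD, the Stanford Question Answering Dataset.",
)
print(result["answer"], result["score"])
```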

[PDF] Meetup_Paris_NLP_24_07_2019_Scaleway

———-

• Axel de Romblay, Machine Learning Engineer at Dailymotion

Subject: How to build a multilingual text classifier?

Abstract:
In this talk, we will introduce one of the biggest challenges we face at Dailymotion: how do we accurately categorize our video catalog at scale using video descriptions?
The purpose is to introduce the whole pipeline running at Dailymotion, which relies on a complex mix of methods: machine learning for language detection, named-entity linking (NEL) to the Wikidata knowledge graph, deep learning using sparse representations, and NLP with multilingual embeddings and robust transfer learning.
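As a small, hedged illustration of the first stage, fastText ships an off-the-shelf language-identification model covering 176 languages (this is a generic example, not Dailymotion's internal component):

```python
import fasttext

# lid.176.ftz: fastText's published language-ID model; download it first from
# https://fasttext.cc/docs/en/language-identification.html
model = fasttext.load_model("lid.176.ftz")

labels, probs = model.predict("Les vidéos les plus regardées cette semaine")
print(labels[0], probs[0])   # e.g. '__label__fr' with a high probability
```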

Reference: https://medium.com/dailymotion/topic-annotation-automatic-algorithms-data-377079d27936

[PDF] Meetup_Paris_NLP_24_07_2019_Dailymotion

———-

• Arthur Darcet & Mehdi Hamoumi, Glose

Subject: Measuring text readability with strong and weak supervision

Abstract:

Text complexity is mainly described by three factors:
* Readability: text content, such as vocabulary, syntax, and discourse.
* Legibility: text form, such as character size, font, and formatting (e.g. emphasis).
* Reader-dependent features: reading ability and reading context, such as environment (noisy, calm, classroom, subway) or intent (educational, recreational).

At Glose, we built a product where readers can discover, read, and annotate thousands of e-books, and share them with their friends. It is currently used by thousands of readers worldwide, especially in the academic field, where collaborative reading is a great feature for professors/teachers and their students.
In order to improve the reading experience, we are currently working on automatic text-readability evaluation to enhance book recommendation, which should ease a reader's learning curve.
We tackle this NLP task with both supervised and unsupervised machine learning approaches.

During this talk, we will present our supervised pipeline [1], which encodes a book's content into a set of features and uses them to fit model parameters that predict a readability score.
Then, we will introduce an unsupervised approach to this task [2], based on the following hypothesis: the simpler a text is, the better it should be understood by a machine. It consists in correlating the ability of multiple language models (LMs) to fill in Cloze tests with readability-level labels.
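A drastically simplified proxy for that hypothesis (the approach in [2] uses Cloze infilling with several LMs; this sketch merely scores average next-token loss with GPT-2) might look like:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def lm_difficulty(text):
    """Average next-token cross-entropy: lower ~ more predictable text."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

print(lm_difficulty("The cat sat on the mat."))
print(lm_difficulty("Heteroscedasticity confounds parsimonious inference."))
```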

References:
[1] https://medium.com/glose-team/how-to-evaluate-text-readability-with-nlp-9c04bd3f46a2
[2] https://storage.cloud.google.com/s5-bucket/research/marc/acl_bea_paper.pdf

[PDF] Meetup Paris NLP 24_07_2019_Glose

 

Video of the meetup:

https://www.youtube.com/watch?v=njXH4af5n8M&feature=youtu.be

Paris NLP Season 3 Meetup #5 at LinkFluence

Full room for Paris NLP @ Linkfluence

We would first like to thank Linkfluence for hosting this Meetup #5, as well as our speakers for their very interesting presentations. And thank you to the many attendees who joined this session!

You can find the slides of our speakers below:

• Alexis Dutot, Linkfluence

At Linkfluence, we analyze millions of social media posts per day in more than 60 languages. This represents thousands of noisy user-generated documents per second passing through our internal enrichment pipeline. This volume, combined with the real-time constraint, prevents us from using cross-lingual BERT-like models.

In this talk we will focus on multilingual sentiment analysis and emotion detection based on social media data. Only a few annotated corpora tackle these tasks, and the vast majority of them are dedicated to the English language. We will see how we fully exploit the potential of emojis as a universal expression of sentiment and emotion in order to build accurate, "real-time" deep learning systems for sentiment analysis and emotion detection in several languages, using solely English annotated corpora.
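A toy sketch of the distant-supervision idea, where emojis serve as noisy labels and are stripped from the text before training (invented data and emoji-to-label mapping, not Linkfluence's pipeline):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Map a few emojis to sentiment classes (placeholder mapping).
EMOJI_LABELS = {"😍": "positive", "😂": "positive", "😡": "negative", "😭": "negative"}

posts = ["I love this show 😍", "worst service ever 😡", "so funny 😂", "my phone died 😭"]

def distant_label(post):
    for emoji, label in EMOJI_LABELS.items():
        if emoji in post:
            # Strip the emoji so the model learns from the words alone.
            return post.replace(emoji, "").strip(), label
    return None

pairs = [p for p in map(distant_label, posts) if p]
texts, labels = zip(*pairs)
clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, labels)
print(clf.predict(["what a wonderful show"]))
```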

[PDF] Alexis Dutot, Linkfluence

• Benoît Lebreton, Sacha Samama and Tom Stringer, Quantmetry

Melusine is an open-source library for email processing developed by Quantmetry and MAIF. The talk focuses on the technical issues raised by Melusine's open-source implementation, as well as the underlying neural models and algorithms that are leveraged.

[PDF] Benoît Lebreton, Sacha Samama and Tom Stringer, Quantmetry

 

Paris NLP Season 3 Meetup #4 at MeilleursAgents

We would first like to thank MeilleursAgents for hosting this meetup, our three speakers for their very interesting presentations, and the many participants who attended this session.

You can find the slides of our three speakers below:

• Syrielle Montariol, LIMSI, CNRS

Word usage, meaning and connotation change over time; this echoes the various aspects (cultural, technological, ...) of the evolution of society. For example, the word "Katrina", originally associated with female first names, came closer to the disaster vocabulary after Hurricane Katrina struck in August 2005.
Diachronic word embeddings are used to grasp such change in an unsupervised way: this is useful for linguistic research on the evolution of languages, but also for standard NLP tasks on corpora spanning long time ranges.
In this talk, I will introduce a selection of methods to train and evaluate time-varying word embeddings, placing greater emphasis on probabilistic word embedding models.
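One common recipe for such experiments (a generic sketch, not necessarily one of the methods covered in the talk) trains one word2vec model per time slice and aligns the spaces with orthogonal Procrustes before measuring a word's drift:

```python
import numpy as np
from gensim.models import Word2Vec
from scipy.linalg import orthogonal_procrustes

# Placeholder corpora: one list of tokenized sentences per time slice.
corpus = {
    "old": [["the", "web", "is", "young"], ["katrina", "is", "a", "name"]] * 50,
    "new": [["the", "web", "is", "everywhere"], ["katrina", "hurricane", "disaster"]] * 50,
}
models = {p: Word2Vec(s, vector_size=50, min_count=1, seed=0) for p, s in corpus.items()}

# Align the "new" space onto the "old" one over their shared vocabulary.
shared = [w for w in models["old"].wv.index_to_key if w in models["new"].wv]
A = np.stack([models["new"].wv[w] for w in shared])
B = np.stack([models["old"].wv[w] for w in shared])
R, _ = orthogonal_procrustes(A, B)

def drift(word):
    """Cosine distance between a word's aligned vectors in the two periods."""
    v_old, v_new = models["old"].wv[word], models["new"].wv[word] @ R
    return 1 - v_old @ v_new / (np.linalg.norm(v_old) * np.linalg.norm(v_new))

print(drift("katrina"), drift("the"))
```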

Slides Syrielle Montariol, LIMSI, CNRS

• Pierre Pakey and Dimitri Lozeve, Destygo

If data beats models, why not build models that produce data? Vast quantities of realistic labeled data will always make the difference in machine learning optimization problems. At Destygo, we automatically leverage the interactions between users and our conversational AI agents to produce vast quantities of labelled data and train our natural language understanding algorithms in a reinforcement learning framework. We will present the outline of our self-learning pipeline, its relation to the state-of-the-art literature, and the specificities of the NLP setting. Finally, we will focus on the network responsible for choosing whether to try something new or not, which is one of the important pieces of the process.
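The talk's explore-or-not network is not public; the textbook baseline for that decision is epsilon-greedy, sketched below with invented action names and values:

```python
import random

def choose(actions, q_values, epsilon=0.1):
    """Epsilon-greedy: mostly exploit the best-known reply,
    occasionally explore to generate fresh training signal."""
    if random.random() < epsilon:
        return random.choice(actions)   # try something new
    return max(actions, key=lambda a: q_values.get(a, 0.0))

q = {"greet": 0.9, "ask_clarification": 0.4, "new_reply": 0.0}
print(choose(list(q), q))
```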

Slides Pierre Pakey & Dimitri Lozeve, Destygo

• Julien Perez, Machine Learning and Optimization group, Naver Labs Europe

Over the last 5 years, differentiable programming and deep learning have become the de-facto standard for a vast set of decision problems in data science. Three factors have enabled this rapid evolution. First, the availability and systematic collection of data have made it possible to gather and leverage large quantities of traces of intelligent behavior. Second, the development of standardized frameworks has dramatically accelerated differentiable programming and its applications to the major modalities of the digital world: image, text, and sound. Third, the availability of powerful and affordable computational infrastructure has enabled this new step toward machine intelligence. Beyond these encouraging results, new limits have arisen and need to be addressed. Automatic common-sense acquisition and reasoning capabilities are two of the frontiers in which the major machine learning research labs are now engaged. In this context, human language has once again become a medium of choice for such research. In this talk, we will use a natural language understanding task, machine reading, to illustrate the problem and describe the research progress achieved throughout the machine reading project. First, we will describe several limitations of current decision models. Second, we will discuss adversarial learning and how such an approach makes learning more robust. Third, we will explore several differentiable transformations that aim at moving toward these goals. Finally, we will discuss ReviewQA, a machine reading corpus over human-generated hotel reviews, which aims at encouraging research around these questions.

Slides Julien Perez, Machine Learning and Optimization group, Naver Labs Europe

 

 

Paris NLP Season 3 Meetup #3 at Doctrine

We would first like to thank Doctrine for hosting this meetup, our three speakers for their presentations, and the many participants who attended this session.

You can find the slides of our three speakers below:

• Hugo Vasselin & Benoît Dumeunier, Artefact

How do you redefine a brand's image with a simple word counter? This talk celebrates the meeting of data science and creative work. It tells the story of how basic NLP techniques, combined with a creative approach, made it possible to redefine a brand. As a first step, we built a tool giving an idea of how the different brands of a large hotel group are perceived around the world, relative to their competitors. This data brought out a number of values dear to guests, which served as pillars for creative and innovative brand experiences…

Slides Hugo Vasselin & Benoît Dumeunier (Artefact)

• Romain Vial, Hyperlex

Hyperlex is a contract analytics and management solution powered by artificial intelligence. Hyperlex helps companies manage and make the most of their contract portfolio by identifying relevant information and data to manage key contractual commitments during the whole life of the contract. Our technology rests on a combination of specifically trained Natural Language Processing (NLP) algorithms and advanced machine learning techniques.

In this talk, I will present some of the challenges we are currently solving at Hyperlex through a focus on two important NLP tasks: (i) learning representations for texts and words using recent language modelling techniques; and (ii) building knowledge from predictions by mining relations in legal documents.

Slides Romain Vial (Hyperlex)

• Grégory Châtel, R&D Lead @ Disaitek and member of the Intel AI Software Innovator program

In this talk, I will present two recent research articles from OpenAI and Google AI Language about transfer learning in NLP, and their implementation.

Historically, transfer learning for NLP neural networks has been limited to reusing pre-computed word embeddings. Recently, a new trend has appeared, much closer to what transfer learning looks like in computer vision, consisting in reusing a much larger part of a pre-trained network. This approach makes it possible to reach state-of-the-art results on many NLP tasks with minimal code modification and training time. In this presentation, I will present the underlying architectures of these models, the generic pre-training tasks, and an example of using such a network to complete an NLP task.
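With today's tooling, reusing "a much larger part of a pre-trained network" takes only a few lines: the whole encoder is loaded and fine-tuned, with just a small task head added on top (a generic example with the Hugging Face transformers library, not the talk's original code):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Reuse the whole pre-trained encoder, not just its word embeddings,
# and add only a small task-specific classification head on top.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["great movie", "terrible plot"], padding=True, return_tensors="pt")
outputs = model(**batch, labels=torch.tensor([1, 0]))
outputs.loss.backward()   # gradients flow through the entire network
```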

Slides Grégory Châtel (Disaitek)

Paris NLP Season 3 Meetup #2 at Méritis

Thanks to our host Meritis

• François Yvon, LIMSI/CNRS

Using monolingual data in Neural Machine Translation

Modern machine translation rests on the availability of appropriate parallel corpora, which are scarce and costly to accumulate. Monolingual corpora are much easier to obtain and can easily be integrated into statistical machine translation systems, where they have been shown to be of great help. The issue is slightly different in Neural Machine Translation (NMT), and how to take advantage of these resources is still under discussion. In this talk, I will try to summarize a series of recent papers on this topic and comment on the current state of the debate. This will also give me the opportunity to discuss research in NMT in more general terms. This work was conducted jointly with Franck Burlot.
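Among the approaches surveyed in this literature, back-translation (Sennrich et al., 2016) is probably the best known: target-side monolingual text is translated back into the source language to create synthetic parallel data. A runnable sketch with stub models (StubNMT stands in for a real NMT toolkit):

```python
class StubNMT:
    """Stand-in for a real NMT model (e.g. a fairseq or OpenNMT wrapper)."""
    def __init__(self, name):
        self.name = name
    def translate(self, sentence):
        return f"<{self.name} translation of: {sentence}>"
    def train(self, pairs):
        print(f"training {self.name} on {len(pairs)} sentence pairs")

def back_translate(mono_target, reverse_model, forward_model, parallel_data):
    # Turn target-side monolingual sentences into synthetic source/target pairs.
    synthetic = [(reverse_model.translate(t), t) for t in mono_target]
    # Mix real and synthetic pairs, then retrain the forward (source->target) model.
    forward_model.train(parallel_data + synthetic)

back_translate(
    mono_target=["The cat sleeps.", "It is raining."],    # English monolingual data
    reverse_model=StubNMT("en-fr"),                       # translates target -> source
    forward_model=StubNMT("fr-en"),                       # the system we want to improve
    parallel_data=[("Le chien dort.", "The dog sleeps.")],
)
```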

presentation_francois_yvon

———-
• Kezhan Shi, Data Science Manager at Allianz France,

will show interesting results obtained with NLP techniques in an insurance project, through an in-depth case study involving:

– string distance and phonetic distance (used in geocoding for fuzzy string matching; see the sketch after this list)
– document classification (for recognizing construction firms' activities)
– word2vec (understanding construction firms' activities)
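As promised above, a small hedged example of fuzzy string and phonetic matching using the jellyfish library (one common choice; function names follow recent jellyfish versions, and the example strings are invented):

```python
import jellyfish

# Fuzzy matching of firm names, as used for geocoding and deduplication.
a, b = "Allianz France", "Alianz Frence"
print(jellyfish.levenshtein_distance(a, b))      # 2: edit distance between the strings
print(jellyfish.jaro_winkler_similarity(a, b))   # close to 1.0 for near-duplicates

# Phonetic keys group names that sound alike despite spelling differences.
print(jellyfish.metaphone("masonry"), jellyfish.metaphone("masonary"))
```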

Paris NLP Season 3 Meetup #1 @Xebia

Thanks to our host: Xebia

Guillaume Lample, FAIR [Talk in English]
Unsupervised machine translation

Machine translation (MT) has achieved impressive results recently, thanks to advances in deep learning and the availability of large-scale parallel corpora. Yet, the effectiveness of these systems strongly relies on the availability of large amounts of parallel sentences, which hinders their applicability to the majority of language pairs.

Previous studies have shown that monolingual data — widely available in most languages — can be used to improve the performance of MT systems. However, these were used to augment, rather than replace, parallel corpora.

In this talk, I will present our recent research on unsupervised machine translation, where we show that it is possible to train MT systems in a fully unsupervised setting, without the need for any cross-lingual dictionary or parallel resources whatsoever, relying only on large monolingual corpora in each language. Beyond translating languages for which there is no parallel data, our method could potentially be used to decipher unknown languages.

Talk_Meetup_NLP_Guillaume_Lample

Thomas Wolf, Hugging Face [Talk in English]
Neural networks based dialog agents: going beyond the seq2seq model

I will present a summary of the technical tools behind our submission to the Conversational Intelligence Challenge 2 which is part of NIPS 2018 (convai.io).

This challenge tests how a dialog agent can incorporate personality as well as common sense reasoning in a free-form setting.

Our submission leads the leaderboard, topping all tested metrics with a significant margin over the second-best model.

These strong improvements are obtained through an innovative use of transfer learning, data augmentation techniques and multi-task learning in a non-seq2seq architecture.

Hugging Face Slides

Paris NLP Meetup #6 Season 2 @ LinkValue

You can find the video of the meetup here: https://www.youtube.com/watch?v=sIX8AxMe_bU

[Talk in English] Guillaume Barrois – Liegey Muller Pons
LMP is a technology company that develops tools to understand public opinion at a very local scale. This talk will present examples of analyses that we apply to original textual data sources in order to extract the dynamics and features of opinion in a given territory.

meetup_nlp_liegey_muller_pons

[Talk in French] Ismael Belghiti – Hiresweet

HireSweet helps companies recruit the best engineers by developing a recommendation engine that ranks candidate profiles against a job offer. This talk will present how different NLP techniques can be applied to compute a matching score between a profile and an offer, comparing their performance on a dedicated ranking metric.
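As a hedged, generic illustration of profile-to-offer matching (not HireSweet's actual model), a bag-of-words cosine score can already rank profiles against an offer; all texts below are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

job_offer = "senior backend engineer python aws microservices"
profiles = [
    "python developer, 6 years, aws, docker",
    "frontend react designer",
    "java backend engineer, kubernetes",
]

vec = TfidfVectorizer().fit([job_offer] + profiles)
scores = cosine_similarity(vec.transform([job_offer]), vec.transform(profiles))[0]
for score, profile in sorted(zip(scores, profiles), reverse=True):
    print(f"{score:.2f}  {profile}")
```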

meetup_nlp_hiresweet

• [Talk in English] Gil Katz earned his PhD in Information Theory from CentraleSupélec in 2017. Today he is a senior data scientist at SAP Conversational AI (previously Recast.AI), based in Paris.

Unsupervised Learning and Word Embeddings

The field of machine learning can be divided into two main branches: supervised and unsupervised learning. While examples of supervised learning applications are easy to come by, the power of unsupervised learning is less intuitive. In this talk, we will use the problem of representing words as a case study. The limitations of simple one-hot encoding will be discussed before describing the modern method of embedding words in a vector space of real numbers. After comparing several approaches, current advances and future challenges will be discussed.
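To make the contrast concrete: with one-hot encoding every pair of distinct words is equally distant, whereas learned embeddings expose graded similarity. A small demonstration with pre-trained GloVe vectors via gensim (the model name is gensim's published download, not the speaker's):

```python
import gensim.downloader as api

# Pre-trained 50-d GloVe vectors (downloads ~66 MB on first use).
wv = api.load("glove-wiki-gigaword-50")

# One-hot vectors make every pair of words equally distant;
# learned embeddings place related words close together.
print(wv.most_similar("king", topn=3))
print(wv.similarity("king", "queen"), wv.similarity("king", "carrot"))
```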

meetup_nlp_recast

Paris NLP Meetup #5 Season 2 @ Snips

• Adrien Ball, Snips

An Introduction to Snips NLU, the Open Source Library behind Snips Voice Platform

Integrating a voice or chatbot interface into a product used to require a cloud-based Natural Language Understanding service. Snips NLU is a Private-by-Design NLU engine. It can run on the edge or on a server with a minimal footprint, while performing as well as or better than cloud solutions.
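For readers who want to try it, the open-source snips-nlu Python package exposes a small fit/parse API; the sketch below follows its README (dataset.json is a placeholder for a dataset in Snips' JSON format):

```python
import io
import json

from snips_nlu import SnipsNLUEngine

# dataset.json: intents, example utterances and entities in Snips' dataset format.
with io.open("dataset.json") as f:
    dataset = json.load(f)

engine = SnipsNLUEngine()
engine.fit(dataset)                         # trains entirely on-device
parsing = engine.parse("Turn on the lights in the living room")
print(json.dumps(parsing, indent=2))        # resolved intent + extracted slots
```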

2018_05_NLP_meetup_snips

• Jérôme Dockes, INRIA

Mapping neuroimaging text reports to spatial distributions over the brain.

We learn the statistical link between anatomical terms and spatial coordinates extracted from the neuroscience literature. This allows us to associate brain images with fragments of text which describe neuroimaging observations. Accessing the unstructured spatial information contained in such reports offers new possibilities for meta-analysis.

2018_05_NLP_meetup_inria

• Charles Borderie, Victor de la Salmonière and Marian Szczesniak, Lettria

LETTRIA develops natural language processing tools exclusively dedicated to understanding French. The focus is on ease of use, performance, and capturing the true meaning of words.

2018_05_NLP_meetup_Lettria