Group meeting talk¶

Yi Liu, 24 September 2019

PyCon UK 2019, 13 - 17 September, Cardiff
Current NLP efforts

PyCon UK 2019¶

Home page: https://2019.pyconuk.org

PyCon UK

Annually in Cardiff around September
Features both introductory and advanced talks
Data science, web development, Python language, MicroPython, misc.
4 days packed with talks + 1 day on sprints (hackathons)
type of talks:
- Workshops: 1.5 / 3 hours
- Talks: 30 mins
- Lightning talks: 5 mins

Themes¶

https://pretalx.com/pyconuk-2019/schedule/

data science:
- pandas, scikit-learn, tf-keras, pytorch
- statistics, machine learning, NLP
web development:
- async io
- django, flask, fastapi, etc.
- building APIs
Python:
- coding styles, typing, packaging, testing, etc.

Workshops¶

Demystifying Neural Networks, Michal Grochmal¶

https://github.com/grochmal/nnag

A workshop on building neural networks from scratch
Use numpy and autograd to disseminate inner workings of pytorch neural net modules

What are they talking about? Mining topics in documents with topic modelling and Python, Marco Bonzanini¶

https://github.com/bonzanini/topic-modelling

NLTK and Gensim
Common strategies in preprocessing text data
Latent Dirichlet Allocation modelling and visualisation

Plug & train: flexible customisation and extension of python's deep learning frameworks, Jan Freyberg and Isobel Weinberg¶

https://github.com/janfreyberg/pycon-tutorial-2019

https://docs.google.com/presentation/d/1wmGzxLU1yuP6_rQiB0rl5kg5Pz_aZwW37QO54XOUgaM/edit#slide=id.g5febf343c3_0_107

Writing custom neural network layers on pytorch and keras
- gradient reversal
- residual layers (adding input to output) to overcome "vanishing gradient problem"
- using entire pretrained models as layers

Talks¶

Writing micro-services in Python... Sure! But which framework?, Emma Delescolle¶

Comparisons of python frameworks for writing microservices and APIs, on various aspects
- Django REST Framework: all batteries included, heavy weight
- flask: minimalistic but relies on the dev team to assemble things
- fastapi: just works, but lacks documentation
- pyramid

Static Typing in Python, Dustin Ingram¶

history of introducing typing to python3
static type analysis libraries in python
- mypy https://github.com/python/mypy
- pytype https://github.com/google/pytype

python3 type annotation (Note: This is my code)

def get_encode(
    text_list: List[str], model_name: str, url: str
) -> List[List[float]]:
    payload = {"text_list": text_list, "model_name": model_name}
    r = requests.post(f"{url}/encode", data=json.dumps(payload))
    res = r.json()["embeddings"]
    return res

mypy

# A function with wrong types
def foobar(x: str) -> str:
    y = x + 1
    return 1

# Static type analysis with mypy
mypy $(find . | grep "\.py$")
funcs/funcs.py:42: error: Unsupported operand types for + ("str" and "int")
funcs/funcs.py:43: error: Incompatible return value type (got "int", expected "str")

Code Styles Aren’t Black and White, Mika Naylor¶

PEP 8 (python style guide) https://www.python.org/dev/peps/pep-0008/
(For R, there are the tidyverse style and some others)
Code styles:
- Indentation: tab? 2 spaces? 4 spaces?
- Length of a line?
Readability matters
Have a consistent code style for collaborated projects

black is an opinionated code formatter to auto format your codebase so you don't have to (on your own, configure editors & IDEs etc)

# in:

def very_important_function(template: str, *variables, file: os.PathLike, engine: str, header: bool = True, debug: bool = False):
    """Applies `variables` to the `template` and writes to `file`."""
    with open(file, 'w') as f:
        ...

# out:

def very_important_function(
    template: str,
    *variables,
    file: os.PathLike,
    engine: str,
    header: bool = True,
    debug: bool = False,
):
    """Applies `variables` to the `template` and writes to `file`."""
    with open(file, "w") as f:
        ...

Code styles aren't black and white.

They should all be black. :)

NLP efforts¶

Natural language processing¶

Over its history NLP has progressed from pure linguistic models to statistical/machine learning models.

word2vec(Mikolov et al., 2013) is among the first batch of models that uses neural networks to encode texts into high-dimensional embedding vectors.

Neural network models start to show advantages over other methods due to its advantages in learning unstructured information.

The current state-of-the-art models are "transformer" models

BERT (Bidirectional Encoder representation from Transformers, Devlin et al., 2019)
GPT / GPT-2
RoBERTa
DistilBERT

A two-stage approach:

Pretraining: pretrain models using general-purpose tasks that are extremely time consuming
Finetuning: finetune pretrained models over downstream tasks

Transformer-related NLP ecosystems are starting to mature in recent months for us to use!

The medium-long-term goal is to use text embeddings generated from various state-of-the-art algorithms and models for various downstream tasks.

Comparing similarity of concepts: "body mass index" versus "body weight"
Name entity recognition: literature mining
Question answering: fact checker

The current efforts are still on building infrastructures and understanding various frameworks and ecosystems.

The BERT¶

A tech demo to play with text embeddings generated by BERT models.

webapp

Vectology¶

A tech demo to test text recommendations

webapp

Similarity in text embedding vectors is not a silver bullet¶

In [7]:

import requests

url = "http://jojo.epi.bris.ac.uk:8560"
text_1 = "I love cats"
text_2 = "I hate cats"
model_name = "biobert_v1.1_pubmed"
payload = {
    "text_1": text_1, 
    "text_2": text_2,
    "model_name": model_name,
}
r = requests.get(f"{url}/cosine_similarity", params=payload)

In [6]:

# text_1 = "I love cats"
# text_2 = "I hate cats"
res = r.json()
res

Out[6]:

{'cosine_similarity': 0.9612499346597072}