Group meeting talk

Yi Liu, 24 September 2019

  • PyCon UK 2019, 13 - 17 September, Cardiff
  • Current NLP efforts

PyCon UK 2019

PyCon UK

  • Annually in Cardiff around September
  • Features both introductory and advanced talks
  • Data science, web development, Python language, MicroPython, misc.
  • 4 days packed with talks + 1 day on sprints (hackathons)
  • type of talks:
    • Workshops: 1.5 / 3 hours
    • Talks: 30 mins
    • Lightning talks: 5 mins

Themes

https://pretalx.com/pyconuk-2019/schedule/

  • data science:
    • pandas, scikit-learn, tf-keras, pytorch
    • statistics, machine learning, NLP
  • web development:
    • async io
    • django, flask, fastapi, etc.
    • building APIs
  • Python:
    • coding styles, typing, packaging, testing, etc.

Workshops

Demystifying Neural Networks, Michal Grochmal

https://github.com/grochmal/nnag

  • A workshop on building neural networks from scratch
  • Use numpy and autograd to disseminate inner workings of pytorch neural net modules

What are they talking about? Mining topics in documents with topic modelling and Python, Marco Bonzanini

https://github.com/bonzanini/topic-modelling

  • NLTK and Gensim
  • Common strategies in preprocessing text data
  • Latent Dirichlet Allocation modelling and visualisation

Plug & train: flexible customisation and extension of python's deep learning frameworks, Jan Freyberg and Isobel Weinberg

https://github.com/janfreyberg/pycon-tutorial-2019

https://docs.google.com/presentation/d/1wmGzxLU1yuP6_rQiB0rl5kg5Pz_aZwW37QO54XOUgaM/edit#slide=id.g5febf343c3_0_107

  • Writing custom neural network layers on pytorch and keras
    • gradient reversal
    • residual layers (adding input to output) to overcome "vanishing gradient problem"
    • using entire pretrained models as layers

Talks

Writing micro-services in Python... Sure! But which framework?, Emma Delescolle

  • Comparisons of python frameworks for writing microservices and APIs, on various aspects
    • Django REST Framework: all batteries included, heavy weight
    • flask: minimalistic but relies on the dev team to assemble things
    • fastapi: just works, but lacks documentation
    • pyramid

Static Typing in Python, Dustin Ingram

python3 type annotation (Note: This is my code)

def get_encode(
    text_list: List[str], model_name: str, url: str
) -> List[List[float]]:
    payload = {"text_list": text_list, "model_name": model_name}
    r = requests.post(f"{url}/encode", data=json.dumps(payload))
    res = r.json()["embeddings"]
    return res

mypy

# A function with wrong types
def foobar(x: str) -> str:
    y = x + 1
    return 1
# Static type analysis with mypy
mypy $(find . | grep "\.py$")
funcs/funcs.py:42: error: Unsupported operand types for + ("str" and "int")
funcs/funcs.py:43: error: Incompatible return value type (got "int", expected "str")

Code Styles Aren’t Black and White, Mika Naylor

  • black is an opinionated code formatter to auto format your codebase so you don't have to (on your own, configure editors & IDEs etc)
# in:

def very_important_function(template: str, *variables, file: os.PathLike, engine: str, header: bool = True, debug: bool = False):
    """Applies `variables` to the `template` and writes to `file`."""
    with open(file, 'w') as f:
        ...

# out:

def very_important_function(
    template: str,
    *variables,
    file: os.PathLike,
    engine: str,
    header: bool = True,
    debug: bool = False,
):
    """Applies `variables` to the `template` and writes to `file`."""
    with open(file, "w") as f:
        ...

Code styles aren't black and white.

They should all be black. :)

NLP efforts

Natural language processing

Over its history NLP has progressed from pure linguistic models to statistical/machine learning models.

word2vec(Mikolov et al., 2013) is among the first batch of models that uses neural networks to encode texts into high-dimensional embedding vectors.

Neural network models start to show advantages over other methods due to its advantages in learning unstructured information.

The current state-of-the-art models are "transformer" models

  • BERT (Bidirectional Encoder representation from Transformers, Devlin et al., 2019)
  • GPT / GPT-2
  • RoBERTa
  • DistilBERT

A two-stage approach:

  • Pretraining: pretrain models using general-purpose tasks that are extremely time consuming
  • Finetuning: finetune pretrained models over downstream tasks

Transformer-related NLP ecosystems are starting to mature in recent months for us to use!

The medium-long-term goal is to use text embeddings generated from various state-of-the-art algorithms and models for various downstream tasks.

  • Comparing similarity of concepts: "body mass index" versus "body weight"
  • Name entity recognition: literature mining
  • Question answering: fact checker

The current efforts are still on building infrastructures and understanding various frameworks and ecosystems.

The BERT

A tech demo to play with text embeddings generated by BERT models.

webapp

Vectology

A tech demo to test text recommendations

webapp

Similarity in text embedding vectors is not a silver bullet

In [7]:
import requests

url = "http://jojo.epi.bris.ac.uk:8560"
text_1 = "I love cats"
text_2 = "I hate cats"
model_name = "biobert_v1.1_pubmed"
payload = {
    "text_1": text_1, 
    "text_2": text_2,
    "model_name": model_name,
}
r = requests.get(f"{url}/cosine_similarity", params=payload)
In [6]:
# text_1 = "I love cats"
# text_2 = "I hate cats"
res = r.json()
res
Out[6]:
{'cosine_similarity': 0.9612499346597072}