Vectorizing Data

Figure: Convolutional operators applied to recognize horizontal and vertical lines.

In this exercise, you will look at a few different ways to convert non-numerical data to vectors and matrices.

Categorical Columns

For categorical data, one-hot encoding (also called dummy encoding) converts a single column into one binary column per category. On these binary columns, all the usual linear algebra tools apply.

import pandas as pd
import seaborn as sns

data = sns.load_dataset("penguins").dropna()

pd.get_dummies(data["species"]).astype(int)

What is the size of the resulting data?

2D Convolution

Figure: A convolutional neural network applied to a face image.

Convolution is the fundamental operation that convolutional neural networks use to interpret images. During the past decade it has driven tremendous progress in image recognition. Convolution slides a 2D matrix (the kernel) over the image and calculates a dot product at each position.

The parameters of the kernel are typically trained as part of a neural network.
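The sliding dot product can be sketched in a few lines of NumPy. Strictly speaking the loop below computes a cross-correlation (the kernel is not flipped), which is what convolutional layers compute in practice. The toy image and the vertical-edge kernel are made up for illustration:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image, taking a dot product at each position
    ("valid" mode: the kernel stays fully inside the image)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# a tiny 5x5 image with a vertical line in the middle column
image = np.zeros((5, 5))
image[:, 2] = 1.0

# a kernel that responds to vertical edges
kernel = np.array([[1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])

response = conv2d(image, kernel)
print(response)
```

The output is largest in magnitude where the kernel overlaps the edges of the vertical line, which is exactly how a trained kernel "recognizes" a pattern.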

On adamharley.com/nn_vis/cnn/3d.html you can see a convolutional neural network that recognizes digits in action. Switch all but the first layer off and observe the row of highlighted images when you draw a shape. Each of them represents the output of one convolutional kernel. What do the individual kernels recognize?

Word Embeddings

Go to projector.tensorflow.org. Type some common words into the search box and look for related words in the vector space.

Vector embeddings for words are learned by extensively training a neural network on large text corpora. The embedding size differs between models, e.g.:

  • word vectors around 2015 had a size of 50-300

  • word vectors optimized for retrieving short text documents have a size of around 700

  • word vectors in GPT-3 have a size of more than 12000.
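Relatedness between embeddings is typically measured with cosine similarity. Here is a minimal sketch using toy 3-dimensional vectors (the numbers are made up for illustration; real embeddings have hundreds or thousands of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 for identical directions."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# made-up toy embeddings: related words point in similar directions
coffee   = np.array([0.9, 0.1, 0.2])
espresso = np.array([0.8, 0.2, 0.3])
giraffe  = np.array([0.1, 0.9, 0.7])

print(cosine_similarity(coffee, espresso))
print(cosine_similarity(coffee, giraffe))
```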

Word Vectors in Spacy

spaCy is a powerful natural language processing library. It provides embeddings for words and documents. Installing the large English model involves a very big download:

pip install spacy
python -m spacy download en_core_web_lg

The following code example uses spaCy to look up some word vectors from the downloaded model and calculates word similarities:

import spacy
import numpy as np

nlp = spacy.load('en_core_web_lg')

# processing a single word returns a Doc whose .vector is the word embedding
w1 = nlp("coffee")
w2 = nlp("espresso")
w3 = nlp("tea")
w4 = nlp("giraffe")
words = [w1, w2, w3, w4]

# dimensionality of the embedding (300 for en_core_web_lg)
print(w1.vector.shape)

def l2_norm(a):
    """Euclidean length of a vector."""
    return np.sqrt((a**2).sum())

def cosim(a, b):
    """Cosine similarity between the vectors of two spaCy documents."""
    return np.dot(a.vector, b.vector) / (l2_norm(a.vector) * l2_norm(b.vector))

# all words against all
for word1 in words:
    for word2 in words:
        similarity = cosim(word1, word2)
        print(f"{str(word1):10} {str(word2):10} {similarity:5.2f}")