Lesson 16: Case Studies

Pragmatic AI Labs

This notebook was produced by Pragmatic AI Labs. You can continue learning about these topics by:

Buying a copy of Pragmatic AI: An Introduction to Cloud-Based Machine Learning
Reading an online copy of Pragmatic AI: An Introduction to Cloud-Based Machine Learning
Watching the video Essential Machine Learning and AI with Python and Jupyter Notebook on Safari Books Online
Watching the video AWS Certified Machine Learning - Specialty
Purchasing the video Essential Machine Learning and AI with Python and Jupyter Notebook
Viewing more content at noahgift.com

16.4 Ludwig (Open Source AutoML)

GitHub Project URL: https://uber.github.io/ludwig/

Install Ludwig

!pip install --upgrade numpy  # restart the Colab runtime after this upgrade
!pip install --upgrade scikit-image
!pip install -q ludwig
!python -m spacy download en

Requirement already up-to-date: numpy in /usr/local/lib/python3.6/dist-packages (1.16.1)
Collecting scikit-image
  Downloading https://files.pythonhosted.org/packages/24/06/d560630eb9e36d90d69fe57d9ff762d8f501664ce478b8a0ae132b3c3008/scikit_image-0.14.2-cp36-cp36m-manylinux1_x86_64.whl (25.3MB)
    100% |████████████████████████████████| 25.3MB 1.9MB/s
Collecting pillow>=4.3.0 (from scikit-image)
  Downloading https://files.pythonhosted.org/packages/85/5e/e91792f198bbc5a0d7d3055ad552bc4062942d27eaf75c3e2783cf64eae5/Pillow-5.4.1-cp36-cp36m-manylinux1_x86_64.whl (2.0MB)
    100% |████████████████████████████████| 2.0MB 18.3MB/s
Requirement already satisfied, skipping upgrade: scipy>=0.17.0 in /usr/local/lib/python3.6/dist-packages (from scikit-image) (1.1.0)
Requirement already satisfied, skipping upgrade: matplotlib>=2.0.0 in /usr/local/lib/python3.6/dist-packages (from scikit-image) (3.0.2)
Requirement already satisfied, skipping upgrade: six>=1.10.0 in /usr/local/lib/python3.6/dist-packages (from scikit-image) (1.11.0)
Requirement already satisfied, skipping upgrade: cloudpickle>=0.2.1 in /usr/local/lib/python3.6/dist-packages (from scikit-image) (0.6.1)
Requirement already satisfied, skipping upgrade: PyWavelets>=0.4.0 in /usr/local/lib/python3.6/dist-packages (from scikit-image) (1.0.1)
Requirement already satisfied, skipping upgrade: networkx>=1.8 in /usr/local/lib/python3.6/dist-packages (from scikit-image) (2.2)
Collecting dask[array]>=1.0.0 (from scikit-image)
  Downloading https://files.pythonhosted.org/packages/7c/2b/cf9e5477bec3bd3b4687719876ea38e9d8c9dc9d3526365c74e836e6a650/dask-1.1.1-py2.py3-none-any.whl (701kB)
    100% |████████████████████████████████| 706kB 25.2MB/s
Requirement already satisfied, skipping upgrade: numpy>=1.8.2 in /usr/local/lib/python3.6/dist-packages (from scipy>=0.17.0->scikit-image) (1.16.1)
Requirement already satisfied, skipping upgrade: cycler>=0.10 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=2.0.0->scikit-image) (0.10.0)
Requirement already satisfied, skipping upgrade: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=2.0.0->scikit-image) (2.3.1)
Requirement already satisfied, skipping upgrade: kiwisolver>=1.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=2.0.0->scikit-image) (1.0.1)
Requirement already satisfied, skipping upgrade: python-dateutil>=2.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=2.0.0->scikit-image) (2.5.3)
Requirement already satisfied, skipping upgrade: decorator>=4.3.0 in /usr/local/lib/python3.6/dist-packages (from networkx>=1.8->scikit-image) (4.3.2)
Requirement already satisfied, skipping upgrade: toolz>=0.7.3; extra == "array" in /usr/local/lib/python3.6/dist-packages (from dask[array]>=1.0.0->scikit-image) (0.9.0)
Requirement already satisfied, skipping upgrade: setuptools in /usr/local/lib/python3.6/dist-packages (from kiwisolver>=1.0.1->matplotlib>=2.0.0->scikit-image) (40.8.0)
featuretools 0.4.1 has requirement pandas>=0.23.0, but you'll have pandas 0.22.0 which is incompatible.
albumentations 0.1.12 has requirement imgaug<0.2.7,>=0.2.5, but you'll have imgaug 0.2.8 which is incompatible.
Installing collected packages: pillow, dask, scikit-image
  Found existing installation: Pillow 4.0.0
    Uninstalling Pillow-4.0.0:
      Successfully uninstalled Pillow-4.0.0
  Found existing installation: dask 0.20.2
    Uninstalling dask-0.20.2:
      Successfully uninstalled dask-0.20.2
  Found existing installation: scikit-image 0.13.1
    Uninstalling scikit-image-0.13.1:
      Successfully uninstalled scikit-image-0.13.1
Successfully installed dask-1.1.1 pillow-5.4.1 scikit-image-0.14.2

Requirement already satisfied: en_core_web_sm==2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm==2.0.0 in /usr/local/lib/python3.6/dist-packages (2.0.0)

    Linking successful
    /usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
    /usr/local/lib/python3.6/dist-packages/spacy/data/en

    You can now load the model via spacy.load('en')

Basic Ideas

Training Models
Prediction (Inference)
Datatypes: binary, numerical, category, set, bag, sequence, text, timeseries, image
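
In CSV input, several of these datatypes arrive as plain strings; the preprocessing defaults printed later in this notebook list 'format': 'space' for set, bag, sequence, and timeseries features. The sketch below is not Ludwig's parser. It is a stdlib-only illustration, with hypothetical helper names and made-up cell values, of how the same space-separated cell reads differently under each datatype.

```python
from collections import Counter

def as_set(cell):
    """Set: unordered, duplicates collapse."""
    return set(cell.split())

def as_bag(cell):
    """Bag: unordered, but how often each item occurs is kept."""
    return Counter(cell.split())

def as_sequence(cell):
    """Sequence: order and repeats both matter."""
    return cell.split()

def as_timeseries(cell):
    """Timeseries: an ordered run of numeric values."""
    return [float(value) for value in cell.split()]

cell = "red blue red"  # hypothetical cell value
print(as_set(cell))
print(as_bag(cell))
print(as_sequence(cell))
print(as_timeseries("1.0 2.5 4"))
```

The remaining datatypes hold one value per cell: a binary flag, a number, a single category label, free text, or a path to an image file.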

Text Classification Example (Book Title to Category)

!wget https://raw.githubusercontent.com/uchidalab/book-dataset/master/Task1/book30-listing-train.csv
!wget https://raw.githubusercontent.com/noahgift/recommendations/master/model_definition.yaml

--2019-02-18 02:44:21--  https://raw.githubusercontent.com/uchidalab/book-dataset/master/Task1/book30-listing-train.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9728786 (9.3M) [text/plain]
Saving to: ‘book30-listing-train.csv.3’

book30-listing-trai 100%[===================>]   9.28M  --.-KB/s    in 0.1s    

2019-02-18 02:44:23 (64.4 MB/s) - ‘book30-listing-train.csv.3’ saved [9728786/9728786]

--2019-02-18 02:44:24--  https://raw.githubusercontent.com/noahgift/recommendations/master/model_definition.yaml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 180 [text/plain]
Saving to: ‘model_definition.yaml.2’

model_definition.ya 100%[===================>]     180  --.-KB/s    in 0s      

2019-02-18 02:44:25 (34.7 MB/s) - ‘model_definition.yaml.2’ saved [180/180]

Ingest

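import pandas as pd

# Load the book listing data (with a header row) directly from GitHub.
df = pd.read_csv("https://media.githubusercontent.com/media/noahgift/recommendations/master/data/book30-listing-train-with-headers.csv")
# Drop the leftover index column that was saved with the CSV.
df = df.drop("Unnamed: 0", axis=1)
df.head()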

	ASIN	FILENAME	IMAGE URL	TITLE	AUTHOR	CATEGORYID	CATEGORY
0	1404803335	1404803335.jpg	http://ecx.images-amazon.com/images/I/51UJnL3T...	Magnets: Pulling Together, Pushing Apart (Amaz...	Natalie M. Rosinsky	4	Children's Books
1	1446276082	1446276082.jpg	http://ecx.images-amazon.com/images/I/51MGUKhk...	Energy Security (SAGE Library of International...	NaN	10	Engineering & Transportation
2	1491522666	1491522666.jpg	http://ecx.images-amazon.com/images/I/51qKvjsi...	An Amish Gathering: Life in Lancaster County	Beth Wiseman	9	Christian Books & Bibles
3	970096410	0970096410.jpg	http://ecx.images-amazon.com/images/I/51qoUENb...	City of Rocks Idaho: A Climber's Guide (Region...	Dave Bingham	26	Sports & Outdoors
4	8436808053	8436808053.jpg	http://ecx.images-amazon.com/images/I/41aDW5pz...	Como vencer el insomnio. Tecnicas, reglas y co...	Choliz Montanes	11	Health, Fitness & Dieting

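# Save a local copy for the Ludwig CLI; the default index=True also writes the row index as an unnamed first column.
df.to_csv("book30-listing-train-with-headers.csv")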

EDA

Columns

df.columns

Index(['ASIN', 'FILENAME', 'IMAGE URL', 'TITLE', 'AUTHOR', 'CATEGORYID',
       'CATEGORY'],
      dtype='object')

Shape

df.shape

(51299, 7)
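
Before training a classifier on CATEGORY, it helps to know what chance-level performance would look like. The sketch below assumes 30 equally likely categories (suggested by the "book30" file name, not verified in this notebook) and computes the accuracy, hits-at-k, and cross-entropy loss of a model that guesses uniformly. The values of n_classes and k are illustrative assumptions.

```python
import math

n_classes = 30   # assumption: taken from the "book30" file name
k = 3            # assumption: size of the top-k list being scored

chance_accuracy = 1 / n_classes
chance_hits_at_k = k / n_classes
uniform_loss = math.log(n_classes)   # cross-entropy of a uniform guess, in nats

print(f"chance accuracy : {chance_accuracy:.4f}")
print(f"chance hits@{k}   : {chance_hits_at_k:.4f}")
print(f"uniform loss    : {uniform_loss:.4f}")
```

Loss and accuracy figures reported during training can be compared against these numbers to judge how far a model has moved from guessing.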

Training with Ludwig

!head book30-listing-train-with-headers.csv

,ASIN,FILENAME,IMAGE URL,TITLE,AUTHOR,CATEGORYID,CATEGORY
0,1404803335,1404803335.jpg,http://ecx.images-amazon.com/images/I/51UJnL3Tx6L.jpg,"Magnets: Pulling Together, Pushing Apart (Amazing Science)",Natalie M. Rosinsky,4,Children's Books
1,1446276082,1446276082.jpg,http://ecx.images-amazon.com/images/I/51MGUKhkyhL.jpg,Energy Security (SAGE Library of International Security),,10,Engineering & Transportation
2,1491522666,1491522666.jpg,http://ecx.images-amazon.com/images/I/51qKvjsi3ML.jpg,An Amish Gathering: Life in Lancaster County,Beth Wiseman,9,Christian Books & Bibles
3,970096410,0970096410.jpg,http://ecx.images-amazon.com/images/I/51qoUENb1CL.jpg,City of Rocks Idaho: A Climber's Guide (Regional Rock Climbing Series),Dave Bingham,26,Sports & Outdoors
4,8436808053,8436808053.jpg,http://ecx.images-amazon.com/images/I/41aDW5pzZBL.jpg,"Como vencer el insomnio. Tecnicas, reglas y consejos practicos para dormir mejor (BIBLIOTECA PRACTICA) (Spanish Edition)",Choliz Montanes,11,"Health, Fitness & Dieting"
5,1848291388,1848291388.jpg,http://ecx.images-amazon.com/images/I/51Lpg7xmrBL.jpg,John Martin Littlejohn: An Enigma of Osteopathy,John O'Brien,16,Medical Books
6,73402656,0073402656.jpg,http://ecx.images-amazon.com/images/I/51WccSzFUrL.jpg,Chemistry: The Molecular Nature of Matter and Change,Martin Silberberg,23,Science & Math
7,323045979,0323045979.jpg,http://ecx.images-amazon.com/images/I/51rJir5EpnL.jpg,"Mosby's Oncology Nursing Advisor: A Comprehensive Guide to Clinical Practice, 1e",Susan Newton MS  RN  AOCN  AOCNS,16,Medical Books
8,1847176968,1847176968.jpg,http://ecx.images-amazon.com/images/I/61KoC743OzL.jpg,Ireland's Wild Atlantic Way,Carsten Krieger,29,Travel

!cat model_definition.yaml

input_features:
    -
        name: TITLE
        type: text
        encoder: parallel_cnn
        level: word

output_features:
    -
        name: CATEGORY
        type: category
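
The same definition can be held as a Python dictionary, which is the shape Ludwig echoes back as the resolved model_definition at the start of a run. The sketch below mirrors the YAML file and checks each feature's type against the datatypes listed under Basic Ideas, and each name against the DataFrame columns. check_features is a hypothetical helper written for this example, not part of Ludwig.

```python
# Datatypes from the "Basic Ideas" list in this notebook.
KNOWN_TYPES = {"binary", "numerical", "category", "set", "bag",
               "sequence", "text", "timeseries", "image"}

# Mirrors model_definition.yaml: predict CATEGORY from TITLE.
model_definition = {
    "input_features": [
        {"name": "TITLE", "type": "text",
         "encoder": "parallel_cnn", "level": "word"},
    ],
    "output_features": [
        {"name": "CATEGORY", "type": "category"},
    ],
}

def check_features(definition, columns):
    """Return a list of problems found in a model definition (hypothetical helper)."""
    problems = []
    for section in ("input_features", "output_features"):
        for feature in definition.get(section, []):
            if feature.get("type") not in KNOWN_TYPES:
                problems.append(f"{section}: unknown type {feature.get('type')!r}")
            if feature.get("name") not in columns:
                problems.append(f"{section}: no column named {feature.get('name')!r}")
    return problems

columns = ["ASIN", "FILENAME", "IMAGE URL", "TITLE", "AUTHOR",
           "CATEGORYID", "CATEGORY"]
print(check_features(model_definition, columns))  # an empty list means no problems found
```

A check like this catches a misspelled column or type name before a long preprocessing step starts.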

!ludwig experiment --data_csv book30-listing-train-with-headers.csv --model_definition_file model_definition.yaml

 _         _        _      
| |_  _ __| |_ __ _(_)__ _ 
| | || / _` \ V  V / / _` |
|_|\_,_\__,_|\_/\_/|_\__, |
                     |___/ 
ludwig v0.1.0 - Experiment

Experiment name: experiment
Model name: run
Output path: results/experiment_run_0

ludwig_version: '0.1.0'
command: ('/usr/local/bin/ludwig experiment --data_csv '
 'book30-listing-train-with-headers.csv --model_definition_file '
 'model_definition.yaml')
dataset_type: 'book30-listing-train-with-headers.csv'
model_definition: {   'combiner': {'type': 'concat'},
    'input_features': [   {   'encoder': 'parallel_cnn',
                              'level': 'word',
                              'name': 'TITLE',
                              'tied_weights': None,
                              'type': 'text'}],
    'output_features': [   {   'dependencies': [],
                               'loss': {   'class_distance_temperature': 0,
                                           'class_weights': 1,
                                           'confidence_penalty': 0,
                                           'distortion': 1,
                                           'labels_smoothing': 0,
                                           'negative_samples': 0,
                                           'robust_lambda': 0,
                                           'sampler': None,
                                           'type': 'softmax_cross_entropy',
                                           'unique': False,
                                           'weight': 1},
                               'name': 'CATEGORY',
                               'reduce_dependencies': 'sum',
                               'reduce_input': 'sum',
                               'top_k': 3,
                               'type': 'category'}],
    'preprocessing': {   'bag': {   'fill_value': '',
                                    'format': 'space',
                                    'lowercase': 10000,
                                    'missing_value_strategy': 'fill_with_const',
                                    'most_common': False},
                         'binary': {   'fill_value': 0,
                                       'missing_value_strategy': 'fill_with_const'},
                         'category': {   'fill_value': '<UNK>',
                                         'lowercase': False,
                                         'missing_value_strategy': 'fill_with_const',
                                         'most_common': 10000},
                         'force_split': False,
                         'image': {'missing_value_strategy': 'backfill'},
                         'numerical': {   'fill_value': 0,
                                          'missing_value_strategy': 'fill_with_const'},
                         'sequence': {   'fill_value': '',
                                         'format': 'space',
                                         'lowercase': False,
                                         'missing_value_strategy': 'fill_with_const',
                                         'most_common': 20000,
                                         'padding': 'right',
                                         'padding_symbol': '<PAD>',
                                         'sequence_length_limit': 256,
                                         'unknown_symbol': '<UNK>'},
                         'set': {   'fill_value': '',
                                    'format': 'space',
                                    'lowercase': False,
                                    'missing_value_strategy': 'fill_with_const',
                                    'most_common': 10000},
                         'split_probabilities': (0.7, 0.1, 0.2),
                         'stratify': None,
                         'text': {   'char_format': 'characters',
                                     'char_most_common': 70,
                                     'char_sequence_length_limit': 1024,
                                     'fill_value': '',
                                     'lowercase': True,
                                     'missing_value_strategy': 'fill_with_const',
                                     'padding': 'right',
                                     'padding_symbol': '<PAD>',
                                     'unknown_symbol': '<UNK>',
                                     'word_format': 'space_punct',
                                     'word_most_common': 20000,
                                     'word_sequence_length_limit': 256},
                         'timeseries': {   'fill_value': '',
                                           'format': 'space',
                                           'missing_value_strategy': 'fill_with_const',
                                           'padding': 'right',
                                           'padding_value': 0,
                                           'timeseries_length_limit': 256}},
    'training': {   'batch_size': 128,
                    'bucketing_field': None,
                    'decay': False,
                    'decay_rate': 0.96,
                    'decay_steps': 10000,
                    'dropout_rate': 0.0,
                    'early_stop': 3,
                    'epochs': 200,
                    'gradient_clipping': None,
                    'increase_batch_size_on_plateau': 0,
                    'increase_batch_size_on_plateau_max': 512,
                    'increase_batch_size_on_plateau_patience': 5,
                    'increase_batch_size_on_plateau_rate': 2,
                    'learning_rate': 0.001,
                    'learning_rate_warmup_epochs': 5,
                    'optimizer': {   'beta1': 0.9,
                                     'beta2': 0.999,
                                     'epsilon': 1e-08,
                                     'type': 'adam'},
                    'reduce_learning_rate_on_plateau': 0,
                    'reduce_learning_rate_on_plateau_patience': 5,
                    'reduce_learning_rate_on_plateau_rate': 0.5,
                    'regularization_lambda': 0,
                    'regularizer': 'l2',
                    'staircase': False,
                    'validation_field': 'combined',
                    'validation_measure': 'loss'}}
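
The resolved configuration above sets 'top_k': 3 on the CATEGORY output, which is the k behind the hits_at_k column in the training tables. The following is a minimal sketch of that metric, assuming it counts an example as a hit when the true label is among the k highest-scoring classes. The scores and labels are made up, and hits_at_k here is a hypothetical helper rather than Ludwig's implementation.

```python
def hits_at_k(score_rows, true_labels, k=3):
    """Fraction of rows whose true label is among the k highest-scoring classes."""
    hits = 0
    for scores, truth in zip(score_rows, true_labels):
        # Rank class names by score, highest first, and keep the top k.
        top = sorted(scores, key=scores.get, reverse=True)[:k]
        hits += truth in top
    return hits / len(true_labels)

# Made-up class scores for two titles.
score_rows = [
    {"Travel": 0.40, "Medical Books": 0.30, "Science & Math": 0.20, "Children's Books": 0.10},
    {"Travel": 0.50, "Medical Books": 0.25, "Science & Math": 0.15, "Children's Books": 0.10},
]
true_labels = ["Science & Math", "Children's Books"]

print(hits_at_k(score_rows, true_labels, k=3))  # first row hits, second misses
print(hits_at_k(score_rows, true_labels, k=1))  # neither true label is ranked first
```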

Using full raw csv, no hdf5 and json file with the same name have been found
Building dataset (it may take a while)
Loading NLP pipeline
Writing dataset
Writing train set metadata with vocabulary
Training set: 36059
Validation set: 5042
Test set: 10198
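
These split sizes line up with the 'split_probabilities': (0.7, 0.1, 0.2) and 'batch_size': 128 settings in the configuration above. A quick arithmetic check, using only numbers printed in this notebook (a sketch; the ceiling division is an assumption about how a final partial batch is counted):

```python
import math

sizes = {"train": 36059, "validation": 5042, "test": 10198}
total = sum(sizes.values())   # should equal the row count from df.shape
batch_size = 128

for name, n in sizes.items():
    share = n / total
    batches = math.ceil(n / batch_size)   # assumes a final partial batch is counted
    print(f"{name:<10} {n:>6} rows  {share:.3f} of data  {batches} batches")
```

The batch counts can be compared with the progress-bar totals shown for each epoch.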
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/losses/losses_impl.py:209: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_grad.py:102: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.

╒══════════╕
│ TRAINING │
╘══════════╛

2019-02-18 01:21:33.899464: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2200000000 Hz
2019-02-18 01:21:33.899801: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x318ac00 executing computations on platform Host. Devices:
2019-02-18 01:21:33.899835: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-02-18 01:21:34.055715: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-02-18 01:21:34.056285: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x318a100 executing computations on platform CUDA. Devices:
2019-02-18 01:21:34.056320: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2019-02-18 01:21:34.056733: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:04.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2019-02-18 01:21:34.056767: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-02-18 01:21:43.842054: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 01:21:43.842116: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-02-18 01:21:43.842133: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-02-18 01:21:43.842364: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2019-02-18 01:21:43.842446: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10752 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)

Epoch   1
Training:   0% 0/282 [00:00<?, ?it/s]
2019-02-18 01:21:44.623868: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
Training: 100% 282/282 [00:20<00:00, 13.95it/s]
Evaluation train: 100% 282/282 [00:04<00:00, 59.52it/s]
Evaluation vali : 100% 40/40 [00:00<00:00, 59.33it/s]
Evaluation test : 100% 80/80 [00:01<00:00, 52.80it/s]
Took 27.3456s
╒════════════╤════════╤════════════╤═════════════╕
│ CATEGORY   │   loss │   accuracy │   hits_at_k │
╞════════════╪════════╪════════════╪═════════════╡
│ train      │ 3.3077 │     0.0791 │      0.1855 │
├────────────┼────────┼────────────┼─────────────┤
│ vali       │ 3.3093 │     0.0768 │      0.1813 │
├────────────┼────────┼────────────┼─────────────┤
│ test       │ 3.3204 │     0.0757 │      0.1757 │
╘════════════╧════════╧════════════╧═════════════╛
Validation loss on combined improved, model saved


Epoch   2
Training: 100% 282/282 [00:17<00:00, 16.85it/s]
Evaluation train: 100% 282/282 [00:04<00:00, 59.75it/s]
Evaluation vali : 100% 40/40 [00:00<00:00, 60.45it/s]
Evaluation test : 100% 80/80 [00:01<00:00, 59.97it/s]
Took 23.9509s
╒════════════╤════════╤════════════╤═════════════╕
│ CATEGORY   │   loss │   accuracy │   hits_at_k │
╞════════════╪════════╪════════════╪═════════════╡
│ train      │ 3.2823 │     0.0904 │      0.2009 │
├────────────┼────────┼────────────┼─────────────┤
│ vali       │ 3.2871 │     0.0851 │      0.1962 │
├────────────┼────────┼────────────┼─────────────┤
│ test       │ 3.3021 │     0.0828 │      0.1897 │
╘════════════╧════════╧════════════╧═════════════╛
Validation loss on combined improved, model saved


Epoch   3
Training: 100% 282/282 [00:17<00:00, 16.92it/s]
Evaluation train: 100% 282/282 [00:04<00:00, 59.96it/s]
Evaluation vali : 100% 40/40 [00:00<00:00, 59.63it/s]
Evaluation test : 100% 80/80 [00:01<00:00, 60.28it/s]
Took 23.9664s
╒════════════╤════════╤════════════╤═════════════╕
│ CATEGORY   │   loss │   accuracy │   hits_at_k │
╞════════════╪════════╪════════════╪═════════════╡
│ train      │ 3.2728 │     0.0940 │      0.2102 │
├────────────┼────────┼────────────┼─────────────┤
│ vali       │ 3.2773 │     0.0898 │      0.2071 │
├────────────┼────────┼────────────┼─────────────┤
│ test       │ 3.2966 │     0.0838 │      0.1968 │
╘════════════╧════════╧════════════╧═════════════╛
Validation loss on combined improved, model saved


Epoch   4
Training: 100% 282/282 [00:17<00:00, 16.86it/s]
Evaluation train: 100% 282/282 [00:04<00:00, 59.63it/s]
Evaluation vali : 100% 40/40 [00:00<00:00, 61.04it/s]
Evaluation test : 100% 80/80 [00:01<00:00, 59.69it/s]
Took 23.9503s
╒════════════╤════════╤════════════╤═════════════╕
│ CATEGORY   │   loss │   accuracy │   hits_at_k │
╞════════════╪════════╪════════════╪═════════════╡
│ train      │ 3.2530 │     0.0970 │      0.2159 │
├────────────┼────────┼────────────┼─────────────┤
│ vali       │ 3.2623 │     0.0926 │      0.2081 │
├────────────┼────────┼────────────┼─────────────┤
│ test       │ 3.2824 │     0.0884 │      0.2033 │
╘════════════╧════════╧════════════╧═════════════╛
Validation loss on combined improved, model saved


Epoch   5
Training: 100% 282/282 [00:17<00:00, 16.83it/s]
Evaluation train: 100% 282/282 [00:04<00:00, 60.61it/s]
Evaluation vali : 100% 40/40 [00:00<00:00, 59.69it/s]
Evaluation test : 100% 80/80 [00:01<00:00, 60.04it/s]
Took 23.9652s
╒════════════╤════════╤════════════╤═════════════╕
│ CATEGORY   │   loss │   accuracy │   hits_at_k │
╞════════════╪════════╪════════════╪═════════════╡
│ train      │ 3.2445 │     0.0983 │      0.2182 │
├────────────┼────────┼────────────┼─────────────┤
│ vali       │ 3.2562 │     0.0908 │      0.2130 │
├────────────┼────────┼────────────┼─────────────┤
│ test       │ 3.2762 │     0.0875 │      0.2024 │
╘════════════╧════════╧════════════╧═════════════╛
Validation loss on combined improved, model saved


Epoch   6
Training: 100% 282/282 [00:17<00:00, 16.85it/s]
Evaluation train: 100% 282/282 [00:04<00:00, 59.89it/s]
Evaluation vali : 100% 40/40 [00:00<00:00, 60.76it/s]
Evaluation test : 100% 80/80 [00:01<00:00, 60.00it/s]
Took 23.9497s
╒════════════╤════════╤════════════╤═════════════╕
│ CATEGORY   │   loss │   accuracy │   hits_at_k │
╞════════════╪════════╪════════════╪═════════════╡
│ train      │ 3.2367 │     0.1004 │      0.2211 │
├────────────┼────────┼────────────┼─────────────┤
│ vali       │ 3.2543 │     0.0898 │      0.2098 │
├────────────┼────────┼────────────┼─────────────┤
│ test       │ 3.2740 │     0.0868 │      0.2043 │
╘════════════╧════════╧════════════╧═════════════╛
Validation loss on combined improved, model saved


Epoch   7
Training: 100% 282/282 [00:17<00:00, 16.83it/s]
Evaluation train: 100% 282/282 [00:04<00:00, 60.08it/s]
Evaluation vali : 100% 40/40 [00:00<00:00, 61.33it/s]
Evaluation test : 100% 80/80 [00:01<00:00, 60.27it/s]
Took 23.9176s
╒════════════╤════════╤════════════╤═════════════╕
│ CATEGORY   │   loss │   accuracy │   hits_at_k │
╞════════════╪════════╪════════════╪═════════════╡
│ train      │ 3.2357 │     0.1012 │      0.2220 │
├────────────┼────────┼────────────┼─────────────┤
│ vali       │ 3.2567 │     0.0916 │      0.2108 │
├────────────┼────────┼────────────┼─────────────┤
│ test       │ 3.2771 │     0.0880 │      0.2010 │
╘════════════╧════════╧════════════╧═════════════╛
Last improvement of loss on combined happened 1 epoch ago


Epoch   8
Training: 100% 282/282 [00:17<00:00, 16.75it/s]
Evaluation train: 100% 282/282 [00:04<00:00, 59.96it/s]
Evaluation vali : 100% 40/40 [00:00<00:00, 60.53it/s]
Evaluation test : 100% 80/80 [00:01<00:00, 60.30it/s]
Took 23.9056s
╒════════════╤════════╤════════════╤═════════════╕
│ CATEGORY   │   loss │   accuracy │   hits_at_k │
╞════════════╪════════╪════════════╪═════════════╡
│ train      │ 3.2256 │     0.1046 │      0.2259 │
├────────────┼────────┼────────────┼─────────────┤
│ vali       │ 3.2541 │     0.0934 │      0.2114 │
├────────────┼────────┼────────────┼─────────────┤
│ test       │ 3.2751 │     0.0913 │      0.1995 │
╘════════════╧════════╧════════════╧═════════════╛
Validation loss on combined improved, model saved


Epoch   9
Training: 100% 282/282 [00:17<00:00, 16.76it/s]
Evaluation train: 100% 282/282 [00:04<00:00, 60.44it/s]
Evaluation vali : 100% 40/40 [00:00<00:00, 61.39it/s]
Evaluation test : 100% 80/80 [00:01<00:00, 60.05it/s]
Took 23.9047s
╒════════════╤════════╤════════════╤═════════════╕
│ CATEGORY   │   loss │   accuracy │   hits_at_k │
╞════════════╪════════╪════════════╪═════════════╡
│ train      │ 3.2222 │     0.1041 │      0.2277 │
├────────────┼────────┼────────────┼─────────────┤
│ vali       │ 3.2547 │     0.0962 │      0.2164 │
├────────────┼────────┼────────────┼─────────────┤
│ test       │ 3.2755 │     0.0917 │      0.2004 │
╘════════════╧════════╧════════════╧═════════════╛
Last improvement of loss on combined happened 1 epoch ago


Epoch  10
Training: 100% 282/282 [00:17<00:00, 16.97it/s]
Evaluation train: 100% 282/282 [00:04<00:00, 59.94it/s]
Evaluation vali : 100% 40/40 [00:00<00:00, 61.11it/s]
Evaluation test : 100% 80/80 [00:01<00:00, 59.82it/s]
Took 23.9255s
╒════════════╤════════╤════════════╤═════════════╕
│ CATEGORY   │   loss │   accuracy │   hits_at_k │
╞════════════╪════════╪════════════╪═════════════╡
│ train      │ 3.2181 │     0.1053 │      0.2331 │
├────────────┼────────┼────────────┼─────────────┤
│ vali       │ 3.2575 │     0.0958 │      0.2196 │
├────────────┼────────┼────────────┼─────────────┤
│ test       │ 3.2789 │     0.0886 │      0.2082 │
╘════════════╧════════╧════════════╧═════════════╛
Last improvement of loss on combined happened 2 epochs ago


Epoch  11
Training: 100% 282/282 [00:17<00:00, 16.88it/s]
Evaluation train: 100% 282/282 [00:04<00:00, 59.83it/s]
Evaluation vali : 100% 40/40 [00:00<00:00, 60.48it/s]
Evaluation test : 100% 80/80 [00:01<00:00, 60.10it/s]
Took 23.8798s
╒════════════╤════════╤════════════╤═════════════╕
│ CATEGORY   │   loss │   accuracy │   hits_at_k │
╞════════════╪════════╪════════════╪═════════════╡
│ train      │ 3.2211 │     0.1051 │      0.2338 │
├────────────┼────────┼────────────┼─────────────┤
│ vali       │ 3.2667 │     0.0936 │      0.2140 │
├────────────┼────────┼────────────┼─────────────┤
│ test       │ 3.2868 │     0.0891 │      0.2045 │
╘════════════╧════════╧════════════╧═════════════╛
Last improvement of loss on combined happened 3 epochs ago

EARLY STOPPING due to lack of validation improvement, it has been 3 epochs since last validation accuracy improvement

Best validation model epoch: 8
Best validation model loss on validation set combined: 3.2541212318720016
Best validation model loss on test set combined: 3.275094079606602
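
The stopping point follows from 'early_stop': 3 in the training configuration: training ends once three consecutive epochs pass without a new best validation loss. The sketch below replays the validation losses from the epoch tables through a simple patience counter. It illustrates the rule and is not Ludwig's code; replay_early_stopping is a hypothetical helper.

```python
def replay_early_stopping(losses, patience=3):
    """Return (best_epoch, stop_epoch), both 1-based, for a patience rule."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(losses)

# Validation ("vali") losses for epochs 1 through 11, copied from the tables above.
vali_losses = [3.3093, 3.2871, 3.2773, 3.2623, 3.2562, 3.2543,
               3.2567, 3.2541, 3.2547, 3.2575, 3.2667]

best_epoch, stop_epoch = replay_early_stopping(vali_losses, patience=3)
print(best_epoch, stop_epoch)  # 8 11
```

Epoch 8 improves on epoch 6 by only 0.0002, which is enough to reset the counter and extend training by two more epochs.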

╒═════════╕
│ PREDICT │
╘═════════╛

Evaluation: 100% 80/80 [00:01<00:00, 57.96it/s]

===== CATEGORY =====
accuracy: 0.0891351245342224
hits_at_k: 0.20445185330456953
loss: 3.286845474856628
overall_stats: { 'avg_f1_score_macro': 0.06812071846149517,
  'avg_f1_score_micro': 0.0891351245342224,
  'avg_f1_score_weighted': 0.06790552679270521,
  'avg_precision_macro': 0.09177260758729153,
  'avg_precision_micro': 0.0891351245342224,
  'avg_precision_weighted': 0.0891351245342224,
  'avg_recall_macro': 0.09056387599530688,
  'avg_recall_micro': 0.0891351245342224,
  'avg_recall_weighted': 0.0891351245342224,
  'kappa_score': 0.058034041734078334,
  'overall_accuracy': 0.0891351245342224}
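
Each per-class entry that follows can be rebuilt from four confusion counts. As a check, the sketch below recomputes a few of the Children's Books figures from its true_positives, false_positives, false_negatives, and true_negatives, using the standard one-vs-rest definitions (variable names are illustrative):

```python
# Confusion counts reported for the "Children's Books" class.
tp, fp, fn, tn = 46, 279, 466, 9407

precision = tp / (tp + fp)                    # share of predicted positives that are right
recall = tp / (tp + fn)                       # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)    # one-vs-rest accuracy for this class

print(f"precision {precision:.4f}")
print(f"recall    {recall:.4f}")
print(f"f1        {f1:.4f}")
print(f"accuracy  {accuracy:.4f}")
```

The one-vs-rest accuracy is high even though recall is low because true negatives dominate: most test titles belong to other categories.
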
per_class_stats: {<UNK>: {   'accuracy': 1.0,
    'f1_score': 0,
    'fall_out': 0.0,
    'false_discovery_rate': 1.0,
    'false_negative_rate': 1.0,
    'false_negatives': 0,
    'false_omission_rate': 0.0,
    'false_positive_rate': 0.0,
    'false_positives': 0,
    'hit_rate': 0,
    'informedness': 0.0,
    'markedness': 0.0,
    'matthews_correlation_coefficient': 0,
    'miss_rate': 1.0,
    'negative_predictive_value': 1.0,
    'positive_predictive_value': 0,
    'precision': 0,
    'recall': 0,
    'sensitivity': 0,
    'specificity': 1.0,
    'true_negative_rate': 1.0,
    'true_negatives': 10198,
    'true_positive_rate': 0,
    'true_positives': 0},
  Children's Books: {   'accuracy': 0.9269464600902138,
    'f1_score': 0.10991636798088411,
    'fall_out': 0.02880446004542636,
    'false_discovery_rate': 0.8584615384615385,
    'false_negative_rate': 0.91015625,
    'false_negatives': 466,
    'false_omission_rate': 0.04719943279651573,
    'false_positive_rate': 0.02880446004542636,
    'false_positives': 279,
    'hit_rate': 0.08984375,
    'informedness': 0.06103928995457375,
    'markedness': 0.09433902874194589,
    'matthews_correlation_coefficient': 0.07588403869993006,
    'miss_rate': 0.91015625,
    'negative_predictive_value': 0.9528005672034843,
    'positive_predictive_value': 0.14153846153846153,
    'precision': 0.14153846153846153,
    'recall': 0.08984375,
    'sensitivity': 0.08984375,
    'specificity': 0.9711955399545736,
    'true_negative_rate': 0.9711955399545736,
    'true_negatives': 9407,
    'true_positive_rate': 0.08984375,
    'true_positives': 46},
  Engineering & Transportation: {   'accuracy': 0.963326142380859,
    'f1_score': 0.04591836734693878,
    'fall_out': 0.029946629768728972,
    'false_discovery_rate': 0.9711538461538461,
    'false_negative_rate': 0.8875,
    'false_negatives': 71,
    'false_omission_rate': 0.0071818733562614145,
    'false_positive_rate': 0.029946629768728972,
    'false_positives': 303,
    'hit_rate': 0.1125,
    'informedness': 0.08255337023127107,
    'markedness': 0.02166428048989233,
    'matthews_correlation_coefficient': 0.042290180516003875,
    'miss_rate': 0.8875,
    'negative_predictive_value': 0.9928181266437386,
    'positive_predictive_value': 0.028846153846153848,
    'precision': 0.028846153846153848,
    'recall': 0.1125,
    'sensitivity': 0.1125,
    'specificity': 0.970053370231271,
    'true_negative_rate': 0.970053370231271,
    'true_negatives': 9815,
    'true_positive_rate': 0.1125,
    'true_positives': 9},
  Christian Books & Bibles: {   'accuracy': 0.9656795450088252,
    'f1_score': 0.005681818181818181,
    'fall_out': 0.034229109454688156,
    'false_discovery_rate': 0.9971428571428571,
    'false_negative_rate': 0.5,
    'false_negatives': 1,
    'false_omission_rate': 0.00010154346060109454,
    'false_positive_rate': 0.034229109454688156,
    'false_positives': 349,
    'hit_rate': 0.5,
    'informedness': 0.46577089054531173,
    'markedness': 0.002755599396541797,
    'matthews_correlation_coefficient': 0.035825660983621235,
    'miss_rate': 0.5,
    'negative_predictive_value': 0.9998984565393989,
    'positive_predictive_value': 0.002857142857142857,
    'precision': 0.002857142857142857,
    'recall': 0.5,
    'sensitivity': 0.5,
    'specificity': 0.9657708905453118,
    'true_negative_rate': 0.9657708905453118,
    'true_negatives': 9847,
    'true_positive_rate': 0.5,
    'true_positives': 1},
  Sports & Outdoors: {   'accuracy': 0.963424200823691,
    'f1_score': 0,
    'fall_out': 0.03297244094488194,
    'false_discovery_rate': 1.0,
    'false_negative_rate': 1.0,
    'false_negatives': 38,
    'false_omission_rate': 0.0038527831288655,
    'false_positive_rate': 0.03297244094488194,
    'false_positives': 335,
    'hit_rate': 0.0,
    'informedness': -0.03297244094488194,
    'markedness': -0.0038527831288655,
    'matthews_correlation_coefficient': -0.011271009901067143,
    'miss_rate': 1.0,
    'negative_predictive_value': 0.9961472168711345,
    'positive_predictive_value': 0.0,
    'precision': 0.0,
    'recall': 0.0,
    'sensitivity': 0.0,
    'specificity': 0.9670275590551181,
    'true_negative_rate': 0.9670275590551181,
    'true_negatives': 9825,
    'true_positive_rate': 0.0,
    'true_positives': 0},
  Health, Fitness & Dieting: {   'accuracy': 0.9297901549323396,
    'f1_score': 0.0427807486631016,
    'fall_out': 0.03329248366013071,
    'false_discovery_rate': 0.9532163742690059,
    'false_negative_rate': 0.9605911330049262,
    'false_negatives': 390,
    'false_omission_rate': 0.03956980519480524,
    'false_positive_rate': 0.03329248366013071,
    'false_positives': 326,
    'hit_rate': 0.03940886699507389,
    'informedness': 0.006116383334943132,
    'markedness': 0.0072138205361889085,
    'matthews_correlation_coefficient': 0.0066424763235420695,
    'miss_rate': 0.9605911330049262,
    'negative_predictive_value': 0.9604301948051948,
    'positive_predictive_value': 0.04678362573099415,
    'precision': 0.04678362573099415,
    'recall': 0.03940886699507389,
    'sensitivity': 0.03940886699507389,
    'specificity': 0.9667075163398693,
    'true_negative_rate': 0.9667075163398693,
    'true_negatives': 9466,
    'true_positive_rate': 0.03940886699507389,
    'true_positives': 16},
  Medical Books: {   'accuracy': 0.9540105903118259,
    'f1_score': 0.07495069033530573,
    'fall_out': 0.0315180530620387,
    'false_discovery_rate': 0.9432835820895522,
    'false_negative_rate': 0.8895348837209303,
    'false_negatives': 153,
    'false_omission_rate': 0.0155125215451688,
    'false_positive_rate': 0.0315180530620387,
    'false_positives': 316,
    'hit_rate': 0.11046511627906977,
    'informedness': 0.07894706321703104,
    'markedness': 0.04120389636527899,
    'matthews_correlation_coefficient': 0.05703443355673547,
    'miss_rate': 0.8895348837209303,
    'negative_predictive_value': 0.9844874784548312,
    'positive_predictive_value': 0.056716417910447764,
    'precision': 0.056716417910447764,
    'recall': 0.11046511627906977,
    'sensitivity': 0.11046511627906977,
    'specificity': 0.9684819469379613,
    'true_negative_rate': 0.9684819469379613,
    'true_negatives': 9710,
    'true_positive_rate': 0.11046511627906977,
    'true_positives': 19},
  Science & Math: {   'accuracy': 0.9558737007256325,
    'f1_score': 0.030172413793103446,
    'fall_out': 0.036212525972098564,
    'false_discovery_rate': 0.9812332439678284,
    'false_negative_rate': 0.9230769230769231,
    'false_negatives': 84,
    'false_omission_rate': 0.008549618320610741,
    'false_positive_rate': 0.036212525972098564,
    'false_positives': 366,
    'hit_rate': 0.07692307692307693,
    'informedness': 0.04071055095097842,
    'markedness': 0.010217137711560742,
    'matthews_correlation_coefficient': 0.020394737198102416,
    'miss_rate': 0.9230769230769231,
    'negative_predictive_value': 0.9914503816793893,
    'positive_predictive_value': 0.01876675603217158,
    'precision': 0.01876675603217158,
    'recall': 0.07692307692307693,
    'sensitivity': 0.07692307692307693,
    'specificity': 0.9637874740279014,
    'true_negative_rate': 0.9637874740279014,
    'true_negatives': 9741,
    'true_positive_rate': 0.07692307692307693,
    'true_positives': 7},
  Travel: {   'accuracy': 0.9540105903118259,
    'f1_score': 0.016771488469601678,
    'fall_out': 0.030505433157212658,
    'false_discovery_rate': 0.9870967741935484,
    'false_negative_rate': 0.9760479041916168,
    'false_negatives': 163,
    'false_omission_rate': 0.016484627831715226,
    'false_positive_rate': 0.030505433157212658,
    'false_positives': 306,
    'hit_rate': 0.023952095808383235,
    'informedness': -0.006553337348829458,
    'markedness': -0.0035814020252635803,
    'matthews_correlation_coefficient': -0.004844598606007852,
    'miss_rate': 0.9760479041916168,
    'negative_predictive_value': 0.9835153721682848,
    'positive_predictive_value': 0.012903225806451613,
    'precision': 0.012903225806451613,
    'recall': 0.023952095808383235,
    'sensitivity': 0.023952095808383235,
    'specificity': 0.9694945668427873,
    'true_negative_rate': 0.9694945668427873,
    'true_negatives': 9725,
    'true_positive_rate': 0.023952095808383235,
    'true_positives': 4},
  Business & Money: {   'accuracy': 0.9681310060796234,
    'f1_score': 0,
    'fall_out': 0.030823598704230903,
    'false_discovery_rate': 1.0,
    'false_negative_rate': 1.0,
    'false_negatives': 11,
    'false_omission_rate': 0.0011129097531363819,
    'false_positive_rate': 0.030823598704230903,
    'false_positives': 314,
    'hit_rate': 0.0,
    'informedness': -0.030823598704230903,
    'markedness': -0.0011129097531363819,
    'matthews_correlation_coefficient': -0.00585695173487886,
    'miss_rate': 1.0,
    'negative_predictive_value': 0.9988870902468636,
    'positive_predictive_value': 0.0,
    'precision': 0.0,
    'recall': 0.0,
    'sensitivity': 0.0,
    'specificity': 0.9691764012957691,
    'true_negative_rate': 0.9691764012957691,
    'true_negatives': 9873,
    'true_positive_rate': 0.0,
    'true_positives': 0},
  Cookbooks, Food & Wine: {   'accuracy': 0.9595018631104139,
    'f1_score': 0.019002375296912115,
    'fall_out': 0.03492846571287622,
    'false_discovery_rate': 0.9888268156424581,
    'false_negative_rate': 0.9365079365079365,
    'false_negatives': 59,
    'false_omission_rate': 0.00599593495934958,
    'false_positive_rate': 0.03492846571287622,
    'false_positives': 354,
    'hit_rate': 0.06349206349206349,
    'informedness': 0.028563597779187155,
    'markedness': 0.005177249398192307,
    'matthews_correlation_coefficient': 0.012160627837924515,
    'miss_rate': 0.9365079365079365,
    'negative_predictive_value': 0.9940040650406504,
    'positive_predictive_value': 0.0111731843575419,
    'precision': 0.0111731843575419,
    'recall': 0.06349206349206349,
    'sensitivity': 0.06349206349206349,
    'specificity': 0.9650715342871238,
    'true_negative_rate': 0.9650715342871238,
    'true_negatives': 9781,
    'true_positive_rate': 0.06349206349206349,
    'true_positives': 4},
  Politics & Social Sciences: {   'accuracy': 0.928025102961365,
    'f1_score': 0.0516795865633075,
    'fall_out': 0.035834609494640124,
    'false_discovery_rate': 0.9460916442048517,
    'false_negative_rate': 0.9503722084367245,
    'false_negatives': 383,
    'false_omission_rate': 0.03897425460466064,
    'false_positive_rate': 0.035834609494640124,
    'false_positives': 351,
    'hit_rate': 0.04962779156327544,
    'informedness': 0.013793182068635224,
    'markedness': 0.014934101190487548,
    'matthews_correlation_coefficient': 0.01435230910870509,
    'miss_rate': 0.9503722084367245,
    'negative_predictive_value': 0.9610257453953394,
    'positive_predictive_value': 0.05390835579514825,
    'precision': 0.05390835579514825,
    'recall': 0.04962779156327544,
    'sensitivity': 0.04962779156327544,
    'specificity': 0.9641653905053599,
    'true_negative_rate': 0.9641653905053599,
    'true_negatives': 9444,
    'true_positive_rate': 0.04962779156327544,
    'true_positives': 20},
  Crafts, Hobbies & Home: {   'accuracy': 0.9681310060796234,
    'f1_score': 0,
    'fall_out': 0.0312990580847724,
    'false_discovery_rate': 1.0,
    'false_negative_rate': 1.0,
    'false_negatives': 6,
    'false_omission_rate': 0.000607348921955686,
    'false_positive_rate': 0.0312990580847724,
    'false_positives': 319,
    'hit_rate': 0.0,
    'informedness': -0.0312990580847724,
    'markedness': -0.000607348921955686,
    'matthews_correlation_coefficient': -0.004359982704783838,
    'miss_rate': 1.0,
    'negative_predictive_value': 0.9993926510780443,
    'positive_predictive_value': 0.0,
    'precision': 0.0,
    'recall': 0.0,
    'sensitivity': 0.0,
    'specificity': 0.9687009419152276,
    'true_negative_rate': 0.9687009419152276,
    'true_negatives': 9873,
    'true_positive_rate': 0.0,
    'true_positives': 0},
  Religion & Spirituality: {   'accuracy': 0.957834869582271,
    'f1_score': 0.009216589861751152,
    'fall_out': 0.03517091483896462,
    'false_discovery_rate': 0.994413407821229,
    'false_negative_rate': 0.9736842105263158,
    'false_negatives': 74,
    'false_omission_rate': 0.007520325203252076,
    'false_positive_rate': 0.03517091483896462,
    'false_positives': 356,
    'hit_rate': 0.02631578947368421,
    'informedness': -0.008855125365280436,
    'markedness': -0.0019337330244810769,
    'matthews_correlation_coefficient': -0.004138048858431092,
    'miss_rate': 0.9736842105263158,
    'negative_predictive_value': 0.9924796747967479,
    'positive_predictive_value': 0.00558659217877095,
    'precision': 0.00558659217877095,
    'recall': 0.02631578947368421,
    'sensitivity': 0.02631578947368421,
    'specificity': 0.9648290851610354,
    'true_negative_rate': 0.9648290851610354,
    'true_negatives': 9766,
    'true_positive_rate': 0.02631578947368421,
    'true_positives': 2},
  Literature & Fiction: {   'accuracy': 0.9111590507942734,
    'f1_score': 0.12884615384615386,
    'fall_out': 0.03088559722659945,
    'false_discovery_rate': 0.814404432132964,
    'false_negative_rate': 0.9013254786450663,
    'false_negatives': 612,
    'false_omission_rate': 0.06221408966148212,
    'false_positive_rate': 0.03088559722659945,
    'false_positives': 294,
    'hit_rate': 0.09867452135493372,
    'informedness': 0.06778892412833426,
    'markedness': 0.12338147820555401,
    'matthews_correlation_coefficient': 0.09145434743585469,
    'miss_rate': 0.9013254786450663,
    'negative_predictive_value': 0.9377859103385179,
    'positive_predictive_value': 0.18559556786703602,
    'precision': 0.18559556786703602,
    'recall': 0.09867452135493372,
    'sensitivity': 0.09867452135493372,
    'specificity': 0.9691144027734006,
    'true_negative_rate': 0.9691144027734006,
    'true_negatives': 9225,
    'true_positive_rate': 0.09867452135493372,
    'true_positives': 67},
  Humor & Entertainment: {   'accuracy': 0.9680329476367915,
    'f1_score': 0,
    'fall_out': 0.031492200529775305,
    'false_discovery_rate': 1.0,
    'false_negative_rate': 1.0,
    'false_negatives': 5,
    'false_omission_rate': 0.0005062265870203753,
    'false_positive_rate': 0.031492200529775305,
    'false_positives': 321,
    'hit_rate': 0.0,
    'informedness': -0.031492200529775305,
    'markedness': -0.0005062265870203753,
    'matthews_correlation_coefficient': -0.003992767109655738,
    'miss_rate': 1.0,
    'negative_predictive_value': 0.9994937734129796,
    'positive_predictive_value': 0.0,
    'precision': 0.0,
    'recall': 0.0,
    'sensitivity': 0.0,
    'specificity': 0.9685077994702247,
    'true_negative_rate': 0.9685077994702247,
    'true_negatives': 9872,
    'true_positive_rate': 0.0,
    'true_positives': 0},
  Law: {   'accuracy': 0.9264561678760541,
    'f1_score': 0.05778894472361809,
    'fall_out': 0.03343246846477288,
    'false_discovery_rate': 0.9340974212034384,
    'false_negative_rate': 0.9485458612975392,
    'false_negatives': 424,
    'false_omission_rate': 0.04305005584323285,
    'false_positive_rate': 0.03343246846477288,
    'false_positives': 326,
    'hit_rate': 0.05145413870246085,
    'informedness': 0.01802167023768808,
    'markedness': 0.022852522953328736,
    'matthews_correlation_coefficient': 0.020293857020391354,
    'miss_rate': 0.9485458612975392,
    'negative_predictive_value': 0.9569499441567672,
    'positive_predictive_value': 0.0659025787965616,
    'precision': 0.0659025787965616,
    'recall': 0.05145413870246085,
    'sensitivity': 0.05145413870246085,
    'specificity': 0.9665675315352271,
    'true_negative_rate': 0.9665675315352271,
    'true_negatives': 9425,
    'true_positive_rate': 0.05145413870246085,
    'true_positives': 23},
  Computers & Technology: {   'accuracy': 0.9531280643263385,
    'f1_score': 0.047808764940239036,
    'fall_out': 0.03508597554915016,
    'false_discovery_rate': 0.9671232876712329,
    'false_negative_rate': 0.9124087591240876,
    'false_negatives': 125,
    'false_omission_rate': 0.01271229533204521,
    'false_positive_rate': 0.03508597554915016,
    'false_positives': 353,
    'hit_rate': 0.08759124087591241,
    'informedness': 0.05250526532676236,
    'markedness': 0.02016441699672189,
    'matthews_correlation_coefficient': 0.03253825540148643,
    'miss_rate': 0.9124087591240876,
    'negative_predictive_value': 0.9872877046679548,
    'positive_predictive_value': 0.03287671232876712,
    'precision': 0.03287671232876712,
    'recall': 0.08759124087591241,
    'sensitivity': 0.08759124087591241,
    'specificity': 0.9649140244508498,
    'true_negative_rate': 0.9649140244508498,
    'true_negatives': 9708,
    'true_positive_rate': 0.08759124087591241,
    'true_positives': 12},
  Test Preparation: {   'accuracy': 0.9327319082172975,
    'f1_score': 0.15099009900990099,
    'fall_out': 0.027274598600247058,
    'false_discovery_rate': 0.8128834355828221,
    'false_negative_rate': 0.8734439834024896,
    'false_negatives': 421,
    'false_omission_rate': 0.04264586709886553,
    'false_positive_rate': 0.027274598600247058,
    'false_positives': 265,
    'hit_rate': 0.12655601659751037,
    'informedness': 0.09928141799726342,
    'markedness': 0.1444706973183123,
    'matthews_correlation_coefficient': 0.11976333198778118,
    'miss_rate': 0.8734439834024896,
    'negative_predictive_value': 0.9573541329011345,
    'positive_predictive_value': 0.18711656441717792,
    'precision': 0.18711656441717792,
    'recall': 0.12655601659751037,
    'sensitivity': 0.12655601659751037,
    'specificity': 0.9727254013997529,
    'true_negative_rate': 0.9727254013997529,
    'true_negatives': 9451,
    'true_positive_rate': 0.12655601659751037,
    'true_positives': 61},
  Arts & Photography: {   'accuracy': 0.941557168072171,
    'f1_score': 0.04487179487179488,
    'fall_out': 0.03171076550191876,
    'false_discovery_rate': 0.9573170731707317,
    'false_negative_rate': 0.9527027027027027,
    'false_negatives': 282,
    'false_omission_rate': 0.02857142857142858,
    'false_positive_rate': 0.03171076550191876,
    'false_positives': 314,
    'hit_rate': 0.0472972972972973,
    'informedness': 0.0155865317953785,
    'markedness': 0.014111498257839639,
    'matthews_correlation_coefficient': 0.01483068832779676,
    'miss_rate': 0.9527027027027027,
    'negative_predictive_value': 0.9714285714285714,
    'positive_predictive_value': 0.042682926829268296,
    'precision': 0.042682926829268296,
    'recall': 0.0472972972972973,
    'sensitivity': 0.0472972972972973,
    'specificity': 0.9682892344980812,
    'true_negative_rate': 0.9682892344980812,
    'true_negatives': 9588,
    'true_positive_rate': 0.0472972972972973,
    'true_positives': 14},
  Parenting & Relationships: {   'accuracy': 0.9494018434987253,
    'f1_score': 0.0851063829787234,
    'fall_out': 0.035068438405435054,
    'false_discovery_rate': 0.9359999999999999,
    'false_negative_rate': 0.873015873015873,
    'false_negatives': 165,
    'false_omission_rate': 0.016797312430011146,
    'false_positive_rate': 0.035068438405435054,
    'false_positives': 351,
    'hit_rate': 0.12698412698412698,
    'informedness': 0.09191568857869203,
    'markedness': 0.04720268756998891,
    'matthews_correlation_coefficient': 0.0658685625375291,
    'miss_rate': 0.873015873015873,
    'negative_predictive_value': 0.9832026875699889,
    'positive_predictive_value': 0.064,
    'precision': 0.064,
    'recall': 0.12698412698412698,
    'sensitivity': 0.12698412698412698,
    'specificity': 0.964931561594565,
    'true_negative_rate': 0.964931561594565,
    'true_negatives': 9658,
    'true_positive_rate': 0.12698412698412698,
    'true_positives': 24},
  Romance: {   'accuracy': 0.9173367326926848,
    'f1_score': 0.11542497376705142,
    'fall_out': 0.030138700594431134,
    'false_discovery_rate': 0.8401162790697674,
    'false_negative_rate': 0.909688013136289,
    'false_negatives': 554,
    'false_omission_rate': 0.05622082403085038,
    'false_positive_rate': 0.030138700594431134,
    'false_positives': 289,
    'hit_rate': 0.090311986863711,
    'informedness': 0.06017328626927987,
    'markedness': 0.10366289689938224,
    'matthews_correlation_coefficient': 0.07897934648140213,
    'miss_rate': 0.909688013136289,
    'negative_predictive_value': 0.9437791759691496,
    'positive_predictive_value': 0.15988372093023256,
    'precision': 0.15988372093023256,
    'recall': 0.090311986863711,
    'sensitivity': 0.090311986863711,
    'specificity': 0.9698612994055689,
    'true_negative_rate': 0.9698612994055689,
    'true_negatives': 9300,
    'true_positive_rate': 0.090311986863711,
    'true_positives': 55},
  History: {   'accuracy': 0.9323396744459698,
    'f1_score': 0.07999999999999999,
    'fall_out': 0.030978427563643773,
    'false_discovery_rate': 0.9099099099099099,
    'false_negative_rate': 0.9280575539568345,
    'false_negatives': 387,
    'false_omission_rate': 0.0392295995945261,
    'false_positive_rate': 0.030978427563643773,
    'false_positives': 303,
    'hit_rate': 0.07194244604316546,
    'informedness': 0.04096401847952169,
    'markedness': 0.05086049049556407,
    'matthews_correlation_coefficient': 0.04564482525476266,
    'miss_rate': 0.9280575539568345,
    'negative_predictive_value': 0.9607704004054739,
    'positive_predictive_value': 0.09009009009009009,
    'precision': 0.09009009009009009,
    'recall': 0.07194244604316546,
    'sensitivity': 0.07194244604316546,
    'specificity': 0.9690215724363562,
    'true_negative_rate': 0.9690215724363562,
    'true_negatives': 9478,
    'true_positive_rate': 0.07194244604316546,
    'true_positives': 30},
  Comics & Graphic Novels: {   'accuracy': 0.9580309864679349,
    'f1_score': 0.218978102189781,
    'fall_out': 0.028708612583775106,
    'false_discovery_rate': 0.8270893371757925,
    'false_negative_rate': 0.7014925373134329,
    'false_negatives': 141,
    'false_omission_rate': 0.014313267688559561,
    'false_positive_rate': 0.028708612583775106,
    'false_positives': 287,
    'hit_rate': 0.29850746268656714,
    'informedness': 0.2697988501027919,
    'markedness': 0.15859739513564786,
    'matthews_correlation_coefficient': 0.20685597607247405,
    'miss_rate': 0.7014925373134329,
    'negative_predictive_value': 0.9856867323114404,
    'positive_predictive_value': 0.1729106628242075,
    'precision': 0.1729106628242075,
    'recall': 0.29850746268656714,
    'sensitivity': 0.29850746268656714,
    'specificity': 0.9712913874162249,
    'true_negative_rate': 0.9712913874162249,
    'true_negatives': 9710,
    'true_positive_rate': 0.29850746268656714,
    'true_positives': 60},
  Reference: {   'accuracy': 0.9581290449107668,
    'f1_score': 0.027334851936218676,
    'fall_out': 0.034220156265453494,
    'false_discovery_rate': 0.9829545454545454,
    'false_negative_rate': 0.9310344827586207,
    'false_negatives': 81,
    'false_omission_rate': 0.00822669104204754,
    'false_positive_rate': 0.034220156265453494,
    'false_positives': 346,
    'hit_rate': 0.06896551724137931,
    'informedness': 0.03474536097592584,
    'markedness': 0.008818763503406934,
    'matthews_correlation_coefficient': 0.01750460286002505,
    'miss_rate': 0.9310344827586207,
    'negative_predictive_value': 0.9917733089579525,
    'positive_predictive_value': 0.017045454545454544,
    'precision': 0.017045454545454544,
    'recall': 0.06896551724137931,
    'sensitivity': 0.06896551724137931,
    'specificity': 0.9657798437345465,
    'true_negative_rate': 0.9657798437345465,
    'true_negatives': 9765,
    'true_positive_rate': 0.06896551724137931,
    'true_positives': 6},
  Teen & Young Adult: {   'accuracy': 0.9515591292410277,
    'f1_score': 0.04263565891472868,
    'fall_out': 0.034176962933439636,
    'false_discovery_rate': 0.9689265536723164,
    'false_negative_rate': 0.9320987654320988,
    'false_negatives': 151,
    'false_omission_rate': 0.015339292970337315,
    'false_positive_rate': 0.034176962933439636,
    'false_positives': 343,
    'hit_rate': 0.06790123456790123,
    'informedness': 0.03372427163446168,
    'markedness': 0.015734153357346292,
    'matthews_correlation_coefficient': 0.023035252587315484,
    'miss_rate': 0.9320987654320988,
    'negative_predictive_value': 0.9846607070296627,
    'positive_predictive_value': 0.031073446327683617,
    'precision': 0.031073446327683617,
    'recall': 0.06790123456790123,
    'sensitivity': 0.06790123456790123,
    'specificity': 0.9658230370665604,
    'true_negative_rate': 0.9658230370665604,
    'true_negatives': 9693,
    'true_positive_rate': 0.06790123456790123,
    'true_positives': 11},
  Self-Help: {   'accuracy': 0.8456560109825456,
    'f1_score': 0.11173814898419863,
    'fall_out': 0.025380130330398987,
    'false_discovery_rate': 0.691588785046729,
    'false_negative_rate': 0.9317711922811854,
    'false_negatives': 1352,
    'false_omission_rate': 0.1368836691303027,
    'false_positive_rate': 0.025380130330398987,
    'false_positives': 222,
    'hit_rate': 0.06822880771881461,
    'informedness': 0.04284867738841558,
    'markedness': 0.17152754582296836,
    'matthews_correlation_coefficient': 0.08573055741213308,
    'miss_rate': 0.9317711922811854,
    'negative_predictive_value': 0.8631163308696973,
    'positive_predictive_value': 0.308411214953271,
    'precision': 0.308411214953271,
    'recall': 0.06822880771881461,
    'sensitivity': 0.06822880771881461,
    'specificity': 0.974619869669601,
    'true_negative_rate': 0.974619869669601,
    'true_negatives': 8525,
    'true_positive_rate': 0.06822880771881461,
    'true_positives': 99},
  Calendars: {   'accuracy': 0.8434006667974112,
    'f1_score': 0.19627579265223954,
    'fall_out': 0.014421385860007074,
    'false_discovery_rate': 0.3867924528301887,
    'false_negative_rate': 0.8831635710005992,
    'false_negatives': 1474,
    'false_omission_rate': 0.14919028340080975,
    'false_positive_rate': 0.014421385860007074,
    'false_positives': 123,
    'hit_rate': 0.11683642899940083,
    'informedness': 0.10241504313939376,
    'markedness': 0.46401726376900143,
    'matthews_correlation_coefficient': 0.21799621117424445,
    'miss_rate': 0.8831635710005992,
    'negative_predictive_value': 0.8508097165991902,
    'positive_predictive_value': 0.6132075471698113,
    'precision': 0.6132075471698113,
    'recall': 0.11683642899940083,
    'sensitivity': 0.11683642899940083,
    'specificity': 0.9855786141399929,
    'true_negative_rate': 0.9855786141399929,
    'true_negatives': 8406,
    'true_positive_rate': 0.11683642899940083,
    'true_positives': 195},
  Science Fiction & Fantasy: {   'accuracy': 0.9561678760541282,
    'f1_score': 0.11485148514851486,
    'fall_out': 0.027994401119776025,
    'false_discovery_rate': 0.9061488673139159,
    'false_negative_rate': 0.8520408163265306,
    'false_negatives': 167,
    'false_omission_rate': 0.016887450702801066,
    'false_positive_rate': 0.027994401119776025,
    'false_positives': 280,
    'hit_rate': 0.14795918367346939,
    'informedness': 0.11996478255369336,
    'markedness': 0.07696368198328307,
    'matthews_correlation_coefficient': 0.09608814377255998,
    'miss_rate': 0.8520408163265306,
    'negative_predictive_value': 0.9831125492971989,
    'positive_predictive_value': 0.09385113268608414,
    'precision': 0.09385113268608414,
    'recall': 0.14795918367346939,
    'sensitivity': 0.14795918367346939,
    'specificity': 0.972005598880224,
    'true_negative_rate': 0.972005598880224,
    'true_negatives': 9722,
    'true_positive_rate': 0.14795918367346939,
    'true_positives': 29},
  Mystery, Thriller & Suspense: {   'accuracy': 0.9433222200431457,
    'f1_score': 0.12158054711246201,
    'fall_out': 0.030069859269008847,
    'false_discovery_rate': 0.8813056379821959,
    'false_negative_rate': 0.8753894080996885,
    'false_negatives': 281,
    'false_omission_rate': 0.028496095730656146,
    'false_positive_rate': 0.030069859269008847,
    'false_positives': 297,
    'hit_rate': 0.12461059190031153,
    'informedness': 0.09454073263130258,
    'markedness': 0.09019826628714811,
    'matthews_correlation_coefficient': 0.09234397748018172,
    'miss_rate': 0.8753894080996885,
    'negative_predictive_value': 0.9715039042693439,
    'positive_predictive_value': 0.11869436201780416,
    'precision': 0.11869436201780416,
    'recall': 0.12461059190031153,
    'sensitivity': 0.12461059190031153,
    'specificity': 0.9699301407309912,
    'true_negative_rate': 0.9699301407309912,
    'true_negatives': 9580,
    'true_positive_rate': 0.12461059190031153,
    'true_positives': 40},
  Biographies & Memoirs: {   'accuracy': 0.8951755246126691,
    'f1_score': 0.09329940627650551,
    'fall_out': 0.03210666666666662,
    'false_discovery_rate': 0.8455056179775281,
    'false_negative_rate': 0.9331713244228432,
    'false_negatives': 768,
    'false_omission_rate': 0.0780329201381833,
    'false_positive_rate': 0.03210666666666662,
    'false_positives': 301,
    'hit_rate': 0.06682867557715674,
    'informedness': 0.03472200891049004,
    'markedness': 0.0764614618842887,
    'matthews_correlation_coefficient': 0.05152567865497131,
    'miss_rate': 0.9331713244228432,
    'negative_predictive_value': 0.9219670798618167,
    'positive_predictive_value': 0.1544943820224719,
    'precision': 0.1544943820224719,
    'recall': 0.06682867557715674,
    'sensitivity': 0.06682867557715674,
    'specificity': 0.9678933333333334,
    'true_negative_rate': 0.9678933333333334,
    'true_negatives': 9074,
    'true_positive_rate': 0.06682867557715674,
    'true_positives': 55}}

Finished: experiment_run
Saved to: results/experiment_run_0
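
The per-class entries above can all be derived from the four confusion-matrix counts reported for each class. As a minimal sketch (the helper name `per_class_metrics` is hypothetical and is not part of Ludwig), the following recomputes a few of the figures for "Children's Books" from its `true_positives`, `false_positives`, `false_negatives`, and `true_negatives`:

```python
def per_class_metrics(tp, fp, fn, tn):
    """Derive common one-vs-rest metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        # Share of all examples that were labelled correctly for this class
        "accuracy": (tp + tn) / total,
        # Of the examples predicted as this class, how many were right
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of the examples that truly belong to this class, how many were found
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        # Of the examples from other classes, how many were correctly rejected
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        # Harmonic mean of precision and recall
        "f1_score": 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0,
    }


# Counts copied from the "Children's Books" entry in the output above.
stats = per_class_metrics(tp=46, fp=279, fn=466, tn=9407)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

Per-class accuracy is high here mainly because true negatives dominate each one-vs-rest count, so precision, recall, and F1 are the more informative numbers for judging this run.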

16.5 Sklearn Algorithm Cheatsheet

Sklearn model selection

url: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

cheat-sheet

Model Explainability with SHAP

https://github.com/slundberg/shap: A unified approach to explain the output of any machine learning model.

Install SHAP

!pip install -q shap

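# Import the submodules explicitly; a bare `import sklearn` does not guarantee they are loaded
import sklearn.model_selection
import sklearn.neighbors
import shap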

shap.initjs()

Load Census Data

Adult dataset

Predict whether income exceeds $50K/yr based on census data. This dataset is also known as the “Census Income” dataset.

X, y = shap.datasets.adult()
X_display, y_display = shap.datasets.adult(display=True)
X_train, X_valid, y_train, y_valid = sklearn.model_selection.train_test_split(X, y, test_size=0.2, random_state=7)

X_train.shape, y_train.shape

((26048, 12), (26048,))

Train a k-nearest neighbor Classifier

knn = sklearn.neighbors.KNeighborsClassifier()
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')

Explain predictions

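# Model output to explain: predicted probability of the positive class
f = lambda x: knn.predict_proba(x)[:, 1]
# Use the per-feature training median as a single background (reference) row
med = X_train.median().values.reshape((1, X_train.shape[1]))
explainer = shap.KernelExplainer(f, med)
# Estimate SHAP values for the first row of the dataset
shap_values_single = explainer.shap_values(X.iloc[0, :], nsamples=1000)

# Plot
shap.initjs()
shap.force_plot(explainer.expected_value, shap_values_single, X_display.iloc[0, :])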
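
KernelExplainer estimates Shapley values by sampling, relative to the background data it is given (here the single median row `med`). To make the quantity being estimated concrete, here is a brute-force sketch for a tiny model with one background row. The names `exact_shapley` and `toy_model` are hypothetical and are not part of the SHAP API; the approach enumerates every feature ordering, so it is only feasible for a handful of features.

```python
from itertools import permutations
from math import factorial


def exact_shapley(f, x, background):
    """Brute-force Shapley values of f at x relative to one background row."""
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        # Start from the background row and switch features to x one at a time
        current = list(background)
        previous = f(current)
        for i in order:
            current[i] = x[i]
            value = f(current)
            # Marginal contribution of feature i in this ordering
            phi[i] += value - previous
            previous = value
    # Average the marginal contributions over all orderings
    n_orders = factorial(n)
    return [total / n_orders for total in phi]


def toy_model(row):
    # Hypothetical model: an interaction between features 0 and 1,
    # plus a linear term in feature 2
    return row[0] * row[1] + 2.0 * row[2]


x = [1.0, 1.0, 1.0]
background = [0.0, 0.0, 0.0]
phi = exact_shapley(toy_model, x, background)
base_value = toy_model(background)

print("Shapley values:", phi)
print("base + sum(phi):", base_value + sum(phi), "model output:", toy_model(x))
```

The interaction term is split evenly between the two features that take part in it, and the base value plus the Shapley values adds up to the model output for the explained row. The force plot visualizes this same additive decomposition.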


16.6 Recommendations

Install Surprise

!pip install -q scikit-surprise


from surprise import SVD
from surprise import Dataset
from surprise.model_selection import cross_validate

# Load the movielens-100k dataset (download it if needed).
data = Dataset.load_builtin('ml-100k')

# Use the famous SVD algorithm.
algo = SVD()

# Run 5-fold cross-validation and print results.
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9460  0.9371  0.9344  0.9300  0.9354  0.9366  0.0053  
MAE (testset)     0.7441  0.7390  0.7338  0.7312  0.7397  0.7376  0.0045  
Fit time          5.30    5.22    5.25    5.23    5.23    5.24    0.03    
Test time         0.16    0.26    0.16    0.15    0.16    0.18    0.04    

{'fit_time': (5.302802085876465,
  5.2162816524505615,
  5.2515764236450195,
  5.2256152629852295,
  5.226689577102661),
 'test_mae': array([0.74405382, 0.73902602, 0.73379062, 0.73123877, 0.73968219]),
 'test_rmse': array([0.94601002, 0.93705768, 0.93435584, 0.93001856, 0.93540059]),
 'test_time': (0.16068744659423828,
  0.26168084144592285,
  0.1584162712097168,
  0.1538381576538086,
  0.16183090209960938)}

Handcoded Similarity Engine

"""Data Science Algorithms"""


def tanimoto(list1, list2):
    """Tanimoto coefficient: shared items divided by distinct items overall.

    In [2]: list2=['39229', '31995', '32015']
    In [3]: list1=['31936', '35989', '27489', '39229', '15468', '31993', '26478']
    In [4]: tanimoto(list1,list2)
    Out[4]: 0.1111111111111111
    """

    # Convert to sets first so duplicate entries do not inflate the denominator
    set1, set2 = set(list1), set(list2)
    intersection = set1 & set2
    return float(len(intersection)) / (len(set1) + len(set2) - len(intersection))
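
A similarity score like this becomes a simple recommendation engine once it is used to rank candidates. The following is a minimal sketch using a set-based variant of the Tanimoto coefficient. The helper `most_similar` and the user names are hypothetical, and the item ids are reused from the docstring example purely for illustration.

```python
def tanimoto(list1, list2):
    """Set-based Tanimoto coefficient: |A & B| / |A | B|."""
    set1, set2 = set(list1), set(list2)
    union = set1 | set2
    if not union:
        # Two empty collections share nothing; avoid dividing by zero
        return 0.0
    return len(set1 & set2) / len(union)


def most_similar(target, candidates):
    """Rank candidate item lists by similarity to the target, best first."""
    scores = [(name, tanimoto(target, items)) for name, items in candidates.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


target = ['39229', '31995', '32015']
candidates = {
    'user_a': ['31936', '35989', '27489', '39229', '15468', '31993', '26478'],
    'user_b': ['39229', '31995'],
    'user_c': ['11111'],
}

ranking = most_similar(target, candidates)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Items held by the top-ranked neighbour but missing from the target list are natural candidates to recommend.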

Collaborative Filtering Recommendation Exploration

KNN Exploration of MovieLens with Surprise

import io  # u.item is ISO-8859-1 encoded, so it is opened with an explicit encoding
from surprise import KNNBaseline
from surprise import Dataset
from surprise import get_dataset_dir

Helper Function to Convert IDs to Names

def read_item_names():
    """Read the u.item file from MovieLens 100-k dataset and return two
    mappings to convert raw ids into movie names and movie names into raw ids.
    """

    file_name = get_dataset_dir() + '/ml-100k/ml-100k/u.item'
    rid_to_name = {}
    name_to_rid = {}
    with io.open(file_name, 'r', encoding='ISO-8859-1') as f:
        for line in f:
            # Fields are pipe-delimited: fields[0] is the raw item id, fields[1] the title.
            fields = line.split('|')
            rid_to_name[fields[0]] = fields[1]
            name_to_rid[fields[1]] = fields[0]

    return rid_to_name, name_to_rid

Train KNN based model

# First, train the algorithm to compute the similarities between items
data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()
sim_options = {'name': 'pearson_baseline', 'user_based': False}
algo = KNNBaseline(sim_options=sim_options)
algo.fit(trainset)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.

<surprise.prediction_algorithms.knns.KNNBaseline at 0x7f596007c1d0>

Recommendations

# Read the mappings raw id <-> movie name
rid_to_name, name_to_rid = read_item_names()

# Retrieve inner id of the movie Toy Story
toy_story_raw_id = name_to_rid['Toy Story (1995)']
toy_story_inner_id = algo.trainset.to_inner_iid(toy_story_raw_id)

# Retrieve inner ids of the nearest neighbors of Toy Story.
toy_story_neighbors = algo.get_neighbors(toy_story_inner_id, k=10)

# Convert inner ids of the neighbors into names.
toy_story_neighbors = (algo.trainset.to_raw_iid(inner_id)
                       for inner_id in toy_story_neighbors)
toy_story_neighbors = (rid_to_name[rid]
                       for rid in toy_story_neighbors)

for movie in toy_story_neighbors:
    print(movie)

Beauty and the Beast (1991)
Raiders of the Lost Ark (1981)
That Thing You Do! (1996)
Lion King, The (1994)
Craft, The (1996)
Liar Liar (1997)
Aladdin (1992)
Cool Hand Luke (1967)
Winnie the Pooh and the Blustery Day (1968)
Indiana Jones and the Last Crusade (1989)
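
The lookup above follows a fixed sequence: title to raw id, raw id to inner id, inner id to neighbor inner ids, then back to raw ids and titles. A minimal sketch of wrapping that sequence in a reusable function is shown below. The name similar_movies is hypothetical; it relies only on the trainset.to_inner_iid, get_neighbors, and trainset.to_raw_iid calls already used in this notebook, and it assumes algo is a fitted item-based KNN model and that the two dictionaries come from read_item_names().

```python
def similar_movies(algo, title, name_to_rid, rid_to_name, k=10):
    """Return the titles of the k nearest neighbors of `title`.

    Hypothetical helper: `algo` is assumed to be a fitted item-based
    Surprise KNN model; the dicts are assumed to come from read_item_names().
    """
    raw_id = name_to_rid[title]                    # title -> raw id
    inner_id = algo.trainset.to_inner_iid(raw_id)  # raw id -> inner id
    neighbor_inner_ids = algo.get_neighbors(inner_id, k=k)
    neighbor_raw_ids = [algo.trainset.to_raw_iid(i) for i in neighbor_inner_ids]
    return [rid_to_name[rid] for rid in neighbor_raw_ids]


# Usage, assuming algo, name_to_rid and rid_to_name exist as in the cells above:
# for movie in similar_movies(algo, 'Toy Story (1995)', name_to_rid, rid_to_name):
#     print(movie)
```

A title that is missing from name_to_rid raises a KeyError, so titles must match the u.item spelling exactly, including the year, as in 'Toy Story (1995)'.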