Preliminary setup

Before we delve into coding, you need to prepare your own PC, or your account on an HPC system, so that you can run the code.

The instructions below are only the highlights. I assume you already know how to install the rest (for example, Jupyter Notebook), or how to run a Python script without a notebook. I also assume you are familiar with the terminal and Linux.

Anaconda/Miniconda

This recipe is intended for Linux, specifically Ubuntu 16.04 or higher (64-bit). If you are using Windows, please consider installing WSL2. There is no official GPU support for macOS, so please avoid it.

  1. Install Miniconda

Miniconda is the recommended approach for installing TensorFlow with GPU support. It creates a separate environment to avoid changing any installed software on your system. This is also the easiest way to install the required software, especially for the GPU setup.

You can use the following command to install Miniconda. During installation, you may need to press enter and type “yes”.

curl https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

You may need to restart your terminal or source ~/.bashrc to enable the conda command. Use conda -V to test if it is installed successfully.

  2. Create a conda environment

Create a new conda environment named qml with the following command.

conda create --name qml python=3.8

You can deactivate and activate it with the following commands.

conda deactivate
conda activate qml

Make sure it is activated for the rest of the installation.

  3. GPU setup

You can skip this section if you only run TensorFlow on the CPU.

First, install the NVIDIA GPU driver if you have not already. You can use the following command to verify that it is installed.

nvidia-smi

Then install CUDA and cuDNN with conda.

conda install -c conda-forge cudatoolkit=11.2 cudnn=8.1.0

Configure the system paths. You can do it with the following command every time you start a new terminal, after activating your conda environment.

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/

For your convenience, it is recommended that you automate this with the following commands. The system paths will then be configured automatically whenever you activate this conda environment.

mkdir -p $CONDA_PREFIX/etc/conda/activate.d
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/' > $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
  4. Install TensorFlow

TensorFlow requires a recent version of pip, so upgrade your pip installation to be sure you’re running the latest version.

pip install --upgrade pip

Then, install TensorFlow with pip. Note: Do not install TensorFlow with conda. It may not have the latest stable version. pip is recommended since TensorFlow is only officially released to PyPI.

pip install tensorflow
  5. Verify install

Verify the CPU setup:

python3 -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"

If a tensor is returned, you’ve installed TensorFlow successfully.

Verify the GPU setup:

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

If a list of GPU devices is returned, you’ve installed TensorFlow successfully.
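Optionally, you can also ask TensorFlow itself whether this build was compiled with CUDA support (a quick extra check, in the same one-liner style as above):

python3 -c "import tensorflow as tf; print(tf.test.is_built_with_cuda())"

If it prints True, the installed package is CUDA-enabled; whether a GPU is actually visible is what the device-list check above tells you.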

We will install other packages as we progress.

Apptainer (formerly Singularity)

Apptainer/Singularity is the most widely used container system for HPC. It is designed to execute applications at bare-metal performance while being secure, portable, and 100% reproducible. Apptainer is an open-source project with a friendly community of developers and users. The user base continues to expand, with Apptainer/Singularity now used across industry and academia in many areas.

Apptainer/Singularity is already installed on carbon.physics.metu.edu.tr, so you can start using it immediately. If you want to install it on your own PC, the instructions are at https://docs.sylabs.io/guides/3.0/user-guide/quick_start.html

NVIDIA kindly provides optimized containers for a wide range of scientific software. Please check out the NGC Catalog.

On carbon, you can find the TensorFlow containers at /share/apps/singularity-containers/

If you want to “pull”, i.e. download and compile, a container from NGC, try something like singularity pull tensorflow-22.09-tf1-py3.sif docker://nvcr.io/nvidia/tensorflow:22.09-tf1-py3. This will download (a lot) and compile (a lot) to produce a .sif image for you. Then you can run this image with singularity run --nv '-B<your working directory>:/host_pwd' --pwd /host_pwd tensorflow-22.09-tf1-py3.sif

Running a Singularity container in an HPC environment is similar to running it on your own computer. An example script can be found here.

Basic regression: Predict fuel efficiency

In a regression problem, the aim is to predict the output of a continuous value, like a price or a probability. Contrast this with a classification problem, where the aim is to select a class from a list of classes (for example, where a picture contains an apple or an orange, recognizing which fruit is in the picture).

This tutorial uses the classic Auto MPG dataset and demonstrates how to build models to predict the fuel efficiency of the late-1970s and early 1980s automobiles. To do this, you will provide the models with a description of many automobiles from that time period. This description includes attributes like cylinders, displacement, horsepower, and weight.

This example uses the Keras API. (Visit the Keras tutorials and guides to learn more.)

# Use seaborn for pairplot.
!pip install -q seaborn
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Make NumPy printouts easier to read.
np.set_printoptions(precision=3, suppress=True)
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers

print(tf.__version__)
2022-10-27 12:48:37.014789: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-10-27 12:48:37.141219: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2022-10-27 12:48:37.713385: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/home/obm/Prog/miniconda3/envs/qml/lib/
2022-10-27 12:48:37.713458: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/home/obm/Prog/miniconda3/envs/qml/lib/
2022-10-27 12:48:37.713463: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2.10.0

The Auto MPG dataset

The dataset is available from the UCI Machine Learning Repository.

Get the data

First download and import the dataset using pandas:

url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin']

raw_dataset = pd.read_csv(url, names=column_names,
                          na_values='?', comment='\t',
                          sep=' ', skipinitialspace=True)
dataset = raw_dataset.copy()
dataset.tail()
MPG Cylinders Displacement Horsepower Weight Acceleration Model Year Origin
393 27.0 4 140.0 86.0 2790.0 15.6 82 1
394 44.0 4 97.0 52.0 2130.0 24.6 82 2
395 32.0 4 135.0 84.0 2295.0 11.6 82 1
396 28.0 4 120.0 79.0 2625.0 18.6 82 1
397 31.0 4 119.0 82.0 2720.0 19.4 82 1

Clean the data

The dataset contains a few unknown values:

dataset.isna().sum()
MPG             0
Cylinders       0
Displacement    0
Horsepower      6
Weight          0
Acceleration    0
Model Year      0
Origin          0
dtype: int64

Drop those rows to keep this initial tutorial simple:

dataset = dataset.dropna()

The "Origin" column is categorical, not numeric. So the next step is to one-hot encode the values in the column with pd.get_dummies.

Note: You can set up the tf.keras.Model to do this kind of transformation for you but that’s beyond the scope of this tutorial. Check out the Classify structured data using Keras preprocessing layers or Load CSV data tutorials for examples.

dataset['Origin'] = dataset['Origin'].map({1: 'USA', 2: 'Europe', 3: 'Japan'})
dataset = pd.get_dummies(dataset, columns=['Origin'], prefix='', prefix_sep='')
dataset.tail()
MPG Cylinders Displacement Horsepower Weight Acceleration Model Year Europe Japan USA
393 27.0 4 140.0 86.0 2790.0 15.6 82 0 0 1
394 44.0 4 97.0 52.0 2130.0 24.6 82 1 0 0
395 32.0 4 135.0 84.0 2295.0 11.6 82 0 0 1
396 28.0 4 120.0 79.0 2625.0 18.6 82 0 0 1
397 31.0 4 119.0 82.0 2720.0 19.4 82 0 0 1

Split the data into training and test sets

Now, split the dataset into a training set and a test set. You will use the test set in the final evaluation of your models.

train_dataset = dataset.sample(frac=0.8, random_state=0)
test_dataset = dataset.drop(train_dataset.index)

Inspect the data

Review the joint distribution of a few pairs of columns from the training set.

The top row suggests that the fuel efficiency (MPG) is a function of all the other parameters. The other rows indicate they are functions of each other.

sns.pairplot(train_dataset[['MPG', 'Cylinders', 'Displacement', 'Weight']], diag_kind='kde')
<seaborn.axisgrid.PairGrid at 0x7f8980f1db20>
[Figure: seaborn pairplot of MPG, Cylinders, Displacement, and Weight from the training set (KDE on the diagonal)]

Let’s also check the overall statistics. Note how each feature covers a very different range:

train_dataset.describe().transpose()
count mean std min 25% 50% 75% max
MPG 314.0 23.310510 7.728652 10.0 17.00 22.0 28.95 46.6
Cylinders 314.0 5.477707 1.699788 3.0 4.00 4.0 8.00 8.0
Displacement 314.0 195.318471 104.331589 68.0 105.50 151.0 265.75 455.0
Horsepower 314.0 104.869427 38.096214 46.0 76.25 94.5 128.00 225.0
Weight 314.0 2990.251592 843.898596 1649.0 2256.50 2822.5 3608.00 5140.0
Acceleration 314.0 15.559236 2.789230 8.0 13.80 15.5 17.20 24.8
Model Year 314.0 75.898089 3.675642 70.0 73.00 76.0 79.00 82.0
Europe 314.0 0.178344 0.383413 0.0 0.00 0.0 0.00 1.0
Japan 314.0 0.197452 0.398712 0.0 0.00 0.0 0.00 1.0
USA 314.0 0.624204 0.485101 0.0 0.00 1.0 1.00 1.0

Split features from labels

Separate the target value—the “label”—from the features. This label is the value that you will train the model to predict.

train_features = train_dataset.copy()
test_features = test_dataset.copy()

train_labels = train_features.pop('MPG')
test_labels = test_features.pop('MPG')

Normalization

In the table of statistics it’s easy to see how different the ranges of each feature are:

train_dataset.describe().transpose()[['mean', 'std']]
mean std
MPG 23.310510 7.728652
Cylinders 5.477707 1.699788
Displacement 195.318471 104.331589
Horsepower 104.869427 38.096214
Weight 2990.251592 843.898596
Acceleration 15.559236 2.789230
Model Year 75.898089 3.675642
Europe 0.178344 0.383413
Japan 0.197452 0.398712
USA 0.624204 0.485101

It is good practice to normalize features that use different scales and ranges.

One reason this is important is because the features are multiplied by the model weights. So, the scale of the outputs and the scale of the gradients are affected by the scale of the inputs.

Although a model might converge without feature normalization, normalization makes training much more stable.
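To see concretely what this scaling does, here is a minimal by-hand sketch (it reuses the train_features DataFrame defined above; the manual_* names are introduced here only for illustration). The Normalization layer introduced below does essentially the same thing, computed inside the model:

# Standardize each column by hand: subtract the mean, divide by the standard deviation.
manual_mean = train_features.mean()
manual_std = train_features.std()
manual_scaled = (train_features - manual_mean) / manual_std

# After scaling, every feature has mean ~0 and standard deviation ~1, so no
# feature dominates the weights or gradients purely because of its units.
print(manual_scaled.describe().transpose()[['mean', 'std']])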

Note: There is no advantage to normalizing the one-hot features—it is done here for simplicity. For more details on how to use the preprocessing layers, refer to the Working with preprocessing layers guide and the Classify structured data using Keras preprocessing layers tutorial.

The Normalization layer

The tf.keras.layers.Normalization is a clean and simple way to add feature normalization into your model.

The first step is to create the layer:

normalizer = tf.keras.layers.Normalization(axis=-1)

Then, fit the state of the preprocessing layer to the data by calling Normalization.adapt:

normalizer.adapt(np.array(train_features))
2022-10-27 12:48:41.928551: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 6911 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:21:00.0, compute capability: 7.5

Calculate the mean and variance, and store them in the layer:

print(normalizer.mean.numpy())
[[   5.478  195.318  104.869 2990.252   15.559   75.898    0.178    0.197
     0.624]]

When the layer is called, it returns the input data, with each feature independently normalized:

first = np.array(train_features[:1])

with np.printoptions(precision=2, suppress=True):
  print('First example:', first)
  print()
  print('Normalized:', normalizer(first).numpy())
First example: [[   4.    90.    75.  2125.    14.5   74.     0.     0.     1. ]]

Normalized: [[-0.87 -1.01 -0.79 -1.03 -0.38 -0.52 -0.47 -0.5   0.78]]

Linear regression

Before building a deep neural network model, start with linear regression using one and several variables.

Linear regression with one variable

Begin with a single-variable linear regression to predict 'MPG' from 'Horsepower'.

Training a model with tf.keras typically starts by defining the model architecture. Use a tf.keras.Sequential model, which represents a sequence of steps.

There are two steps in your single-variable linear regression model:

  • Normalize the 'Horsepower' input features using the tf.keras.layers.Normalization preprocessing layer.

  • Apply a linear transformation (\(y = mx+b\)) to produce 1 output using a linear layer (tf.keras.layers.Dense).

The number of inputs can either be set by the input_shape argument, or automatically when the model is run for the first time.
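As a small illustration of that second option (a sketch; lazy_dense is a throwaway name and reuses the imports above), a Dense layer built without input_shape has no weights until it sees its first batch:

lazy_dense = layers.Dense(units=1)
print(lazy_dense.built)                              # False: input size still unknown
_ = lazy_dense(np.zeros((2, 3), dtype=np.float32))   # first call fixes 3 input features
print(lazy_dense.kernel.shape)                       # (3, 1)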

First, create a NumPy array made of the 'Horsepower' features. Then, instantiate the tf.keras.layers.Normalization and fit its state to the horsepower data:

horsepower = np.array(train_features['Horsepower'])

horsepower_normalizer = layers.Normalization(input_shape=[1,], axis=None)
horsepower_normalizer.adapt(horsepower)

Build the Keras Sequential model:

horsepower_model = tf.keras.Sequential([
    horsepower_normalizer,
    layers.Dense(units=1)
])

horsepower_model.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 normalization_1 (Normalizat  (None, 1)                3         
 ion)                                                            
                                                                 
 dense (Dense)               (None, 1)                 2         
                                                                 
=================================================================
Total params: 5
Trainable params: 2
Non-trainable params: 3
_________________________________________________________________

This model will predict 'MPG' from 'Horsepower'.

Run the untrained model on the first 10 ‘Horsepower’ values. The output won’t be good, but notice that it has the expected shape of (10, 1):

horsepower_model.predict(horsepower[:10])
1/1 [==============================] - ETA: 0s

1/1 [==============================] - 1s 826ms/step
array([[-1.176],
       [-0.664],
       [ 2.171],
       [-1.648],
       [-1.491],
       [-0.585],
       [-1.767],
       [-1.491],
       [-0.389],
       [-0.664]], dtype=float32)

Once the model is built, configure the training procedure using the Keras Model.compile method. The most important arguments to compile are the loss and the optimizer, since these define what will be optimized (mean_absolute_error) and how (using the tf.keras.optimizers.Adam).

horsepower_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
    loss='mean_absolute_error')

Use Keras Model.fit to execute the training for 100 epochs:

%%time
history = horsepower_model.fit(
    train_features['Horsepower'],
    train_labels,
    epochs=100,
    # Suppress logging.
    verbose=0,
    # Calculate validation results on 20% of the training data.
    validation_split = 0.2)
/home/obm/Prog/miniconda3/envs/qml/lib/python3.8/site-packages/keras/engine/data_adapter.py:1699: FutureWarning: The behavior of `series[i:j]` with an integer-dtype index is deprecated. In a future version, this will be treated as *label-based* indexing, consistent with e.g. `series[i]` lookups. To retain the old behavior, use `series.iloc[i:j]`. To get the future behavior, use `series.loc[i:j]`.
  return t[start:end]
CPU times: user 3.58 s, sys: 796 ms, total: 4.37 s
Wall time: 3.16 s

Visualize the model’s training progress using the stats stored in the history object:

hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch
hist.tail()
loss val_loss epoch
95 3.803624 4.189087 95
96 3.804405 4.211332 96
97 3.806713 4.183324 97
98 3.805672 4.184466 98
99 3.802745 4.190432 99
def plot_loss(history):
  plt.plot(history.history['loss'], label='loss')
  plt.plot(history.history['val_loss'], label='val_loss')
  plt.ylim([0, 10])
  plt.xlabel('Epoch')
  plt.ylabel('Error [MPG]')
  plt.legend()
  plt.grid(True)
plot_loss(history)
[Figure: training loss and validation loss (Error [MPG]) vs. epoch for the single-variable linear model]

Collect the results on the test set for later:

test_results = {}

test_results['horsepower_model'] = horsepower_model.evaluate(
    test_features['Horsepower'],
    test_labels, verbose=0)

Since this is a single variable regression, it’s easy to view the model’s predictions as a function of the input:

x = tf.linspace(0.0, 250, 251)
y = horsepower_model.predict(x)
1/8 [==>...........................] - ETA: 0s

8/8 [==============================] - 0s 971us/step
def plot_horsepower(x, y):
  plt.scatter(train_features['Horsepower'], train_labels, label='Data')
  plt.plot(x, y, color='k', label='Predictions')
  plt.xlabel('Horsepower')
  plt.ylabel('MPG')
  plt.legend()
plot_horsepower(x, y)
[Figure: training data scatter of MPG vs. Horsepower with the linear model's predictions overlaid]

Linear regression with multiple inputs

You can use an almost identical setup to make predictions based on multiple inputs. This model still does the same \(y = mx+b\) except that \(m\) is a matrix and \(b\) is a vector.

Create a two-step Keras Sequential model again with the first layer being normalizer (tf.keras.layers.Normalization(axis=-1)) you defined earlier and adapted to the whole dataset:

linear_model = tf.keras.Sequential([
    normalizer,
    layers.Dense(units=1)
])

When you call Model.predict on a batch of inputs, it produces units=1 outputs for each example:

linear_model.predict(train_features[:10])
1/1 [==============================] - ETA: 0s

1/1 [==============================] - 0s 30ms/step
array([[ 1.233],
       [ 0.684],
       [-1.213],
       [ 2.07 ],
       [ 1.148],
       [ 0.034],
       [ 1.103],
       [-1.954],
       [-0.219],
       [ 0.038]], dtype=float32)

When you call the model, its weight matrices will be built—check that the kernel weights (the \(m\) in \(y=mx+b\)) have a shape of (9, 1):

linear_model.layers[1].kernel
<tf.Variable 'dense_1/kernel:0' shape=(9, 1) dtype=float32, numpy=
array([[-0.417],
       [ 0.115],
       [-0.235],
       [-0.45 ],
       [-0.605],
       [ 0.362],
       [-0.222],
       [ 0.462],
       [ 0.544]], dtype=float32)>
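As a quick sanity check (a sketch using the objects defined above), you can reproduce one prediction by hand: normalize the features, multiply by the (9, 1) kernel \(m\), and add the bias \(b\).

# y = normalizer(x) @ m + b, done manually for the first training example.
x0 = np.array(train_features[:1], dtype=np.float32)
m = linear_model.layers[1].kernel.numpy()   # shape (9, 1)
b = linear_model.layers[1].bias.numpy()     # shape (1,)

manual_prediction = normalizer(x0).numpy() @ m + b
print(manual_prediction)
print(linear_model.predict(x0))   # should agree up to floating-point precision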

Configure the model with Keras Model.compile and train with Model.fit for 100 epochs:

linear_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
    loss='mean_absolute_error')
%%time
history = linear_model.fit(
    train_features,
    train_labels,
    epochs=100,
    # Suppress logging.
    verbose=0,
    # Calculate validation results on 20% of the training data.
    validation_split = 0.2)
/home/obm/Prog/miniconda3/envs/qml/lib/python3.8/site-packages/keras/engine/data_adapter.py:1699: FutureWarning: The behavior of `series[i:j]` with an integer-dtype index is deprecated. In a future version, this will be treated as *label-based* indexing, consistent with e.g. `series[i]` lookups. To retain the old behavior, use `series.iloc[i:j]`. To get the future behavior, use `series.loc[i:j]`.
  return t[start:end]
CPU times: user 3.44 s, sys: 711 ms, total: 4.15 s
Wall time: 2.85 s

Using all the inputs in this regression model achieves a much lower training and validation error than the horsepower_model, which had one input:

plot_loss(history)
[Figure: training loss and validation loss vs. epoch for the multiple-input linear model]

Collect the results on the test set for later:

test_results['linear_model'] = linear_model.evaluate(
    test_features, test_labels, verbose=0)

Regression with a deep neural network (DNN)

In the previous section, you implemented two linear models for single and multiple inputs.

Here, you will implement single-input and multiple-input DNN models.

The code is basically the same except the model is expanded to include some “hidden” non-linear layers. The name “hidden” here just means not directly connected to the inputs or outputs.

These models will contain a few more layers than the linear model:

  • The normalization layer, as before (with horsepower_normalizer for a single-input model and normalizer for a multiple-input model).

  • Two hidden, non-linear, Dense layers with the ReLU (relu) activation function nonlinearity.

  • A linear Dense single-output layer.

Both models will use the same training procedure, so the compile method is included in the build_and_compile_model function below.

def build_and_compile_model(norm):
  model = keras.Sequential([
      norm,
      layers.Dense(64, activation='relu'),
      layers.Dense(64, activation='relu'),
      layers.Dense(1)
  ])

  model.compile(loss='mean_absolute_error',
                optimizer=tf.keras.optimizers.Adam(0.001))
  return model

Regression using a DNN and a single input

Create a DNN model with only 'Horsepower' as input and horsepower_normalizer (defined earlier) as the normalization layer:

dnn_horsepower_model = build_and_compile_model(horsepower_normalizer)

This model has quite a few more trainable parameters than the linear models:

dnn_horsepower_model.summary()
Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 normalization_1 (Normalizat  (None, 1)                3         
 ion)                                                            
                                                                 
 dense_2 (Dense)             (None, 64)                128       
                                                                 
 dense_3 (Dense)             (None, 64)                4160      
                                                                 
 dense_4 (Dense)             (None, 1)                 65        
                                                                 
=================================================================
Total params: 4,356
Trainable params: 4,353
Non-trainable params: 3
_________________________________________________________________

Train the model with Keras Model.fit:

%%time
history = dnn_horsepower_model.fit(
    train_features['Horsepower'],
    train_labels,
    validation_split=0.2,
    verbose=0, epochs=100)
/home/obm/Prog/miniconda3/envs/qml/lib/python3.8/site-packages/keras/engine/data_adapter.py:1699: FutureWarning: The behavior of `series[i:j]` with an integer-dtype index is deprecated. In a future version, this will be treated as *label-based* indexing, consistent with e.g. `series[i]` lookups. To retain the old behavior, use `series.iloc[i:j]`. To get the future behavior, use `series.loc[i:j]`.
  return t[start:end]
CPU times: user 3.65 s, sys: 725 ms, total: 4.38 s
Wall time: 3.04 s

This model does slightly better than the linear single-input horsepower_model:

plot_loss(history)
[Figure: training loss and validation loss vs. epoch for the single-input DNN model]

If you plot the predictions as a function of 'Horsepower', you should notice how this model takes advantage of the nonlinearity provided by the hidden layers:

x = tf.linspace(0.0, 250, 251)
y = dnn_horsepower_model.predict(x)
1/8 [==>...........................] - ETA: 0s

8/8 [==============================] - 0s 757us/step
plot_horsepower(x, y)
[Figure: training data scatter of MPG vs. Horsepower with the DNN model's predictions overlaid]

Collect the results on the test set for later:

test_results['dnn_horsepower_model'] = dnn_horsepower_model.evaluate(
    test_features['Horsepower'], test_labels,
    verbose=0)

Regression using a DNN and multiple inputs

Repeat the previous process using all the inputs. The model’s performance slightly improves on the validation dataset.

dnn_model = build_and_compile_model(normalizer)
dnn_model.summary()
Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 normalization (Normalizatio  (None, 9)                19        
 n)                                                              
                                                                 
 dense_5 (Dense)             (None, 64)                640       
                                                                 
 dense_6 (Dense)             (None, 64)                4160      
                                                                 
 dense_7 (Dense)             (None, 1)                 65        
                                                                 
=================================================================
Total params: 4,884
Trainable params: 4,865
Non-trainable params: 19
_________________________________________________________________
%%time
history = dnn_model.fit(
    train_features,
    train_labels,
    validation_split=0.2,
    verbose=0, epochs=100)
CPU times: user 3.55 s, sys: 693 ms, total: 4.24 s
Wall time: 2.94 s
plot_loss(history)
[Figure: training loss and validation loss vs. epoch for the multiple-input DNN model]

Collect the results on the test set:

test_results['dnn_model'] = dnn_model.evaluate(test_features, test_labels, verbose=0)

Performance

Since all models have been trained, you can review their test set performance:

pd.DataFrame(test_results, index=['Mean absolute error [MPG]']).T
Mean absolute error [MPG]
horsepower_model 3.652521
linear_model 2.501625
dnn_horsepower_model 2.879357
dnn_model 1.650163

These results match the validation error observed during training.

Make predictions

You can now make predictions with the dnn_model on the test set using Keras Model.predict and review the loss:

test_predictions = dnn_model.predict(test_features).flatten()

a = plt.axes(aspect='equal')
plt.scatter(test_labels, test_predictions)
plt.xlabel('True Values [MPG]')
plt.ylabel('Predictions [MPG]')
lims = [0, 50]
plt.xlim(lims)
plt.ylim(lims)
_ = plt.plot(lims, lims)
1/3 [=========>....................] - ETA: 0s

3/3 [==============================] - 0s 1ms/step
[Figure: predicted vs. true MPG on the test set, with the diagonal reference line]

It appears that the model predicts reasonably well.

Now, check the error distribution:

error = test_predictions - test_labels
plt.hist(error, bins=25)
plt.xlabel('Prediction Error [MPG]')
_ = plt.ylabel('Count')
[Figure: histogram of prediction errors (Prediction Error [MPG]) on the test set]

If you’re happy with the model, save it for later use with Model.save:

dnn_model.save('dnn_model')
INFO:tensorflow:Assets written to: dnn_model/assets

If you reload the model, it gives identical output:

reloaded = tf.keras.models.load_model('dnn_model')

test_results['reloaded'] = reloaded.evaluate(
    test_features, test_labels, verbose=0)
pd.DataFrame(test_results, index=['Mean absolute error [MPG]']).T
Mean absolute error [MPG]
horsepower_model 3.652521
linear_model 2.501625
dnn_horsepower_model 2.879357
dnn_model 1.650163
reloaded 1.650163

Takeaways:

This part introduced a few techniques to handle a regression problem. Here are a few more tips that may help:

  • Mean squared error (MSE) (tf.keras.losses.MeanSquaredError) and mean absolute error (MAE) (tf.keras.losses.MeanAbsoluteError) are common loss functions used for regression problems. MAE is less sensitive to outliers (see the short sketch after this list). Different loss functions are used for classification problems.

  • Similarly, evaluation metrics used for regression differ from classification.

  • When numeric input data features have values with different ranges, each feature should be scaled independently to the same range.

  • Overfitting is a common problem for DNN models, though it wasn’t a problem for this tutorial.
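Here is the short sketch referred to in the first point above: the same set of toy predictions scored with both losses, with and without a single outlier (the numbers are illustrative only):

# One bad prediction inflates the squared error far more than the absolute error.
y_true = tf.constant([20.0, 25.0, 30.0, 35.0])
y_pred_clean = tf.constant([21.0, 24.0, 31.0, 34.0])
y_pred_outlier = tf.constant([21.0, 24.0, 31.0, 60.0])  # one large miss

mae = tf.keras.losses.MeanAbsoluteError()
mse = tf.keras.losses.MeanSquaredError()

print('MAE:', mae(y_true, y_pred_clean).numpy(), '->', mae(y_true, y_pred_outlier).numpy())
print('MSE:', mse(y_true, y_pred_clean).numpy(), '->', mse(y_true, y_pred_outlier).numpy())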

DO IT YOURSELF 1:

Please repeat the same steps as above, using the superconductivity dataset. Write it as a standalone Python script and run it on carbon.physics.metu.edu.tr using Singularity.
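To get started, here is a minimal sketch of the data-loading and preprocessing steps, assuming you have downloaded the UCI superconductivity data file train.csv into your working directory and that its target column is named critical_temp (check the dataset documentation):

# Sketch: adapt the Auto MPG workflow to the superconductivity data.
# Assumes 'train.csv' from the UCI superconductivity dataset is available
# locally and that the target column is called 'critical_temp'.
import numpy as np
import pandas as pd
import tensorflow as tf

sc = pd.read_csv('train.csv')

sc_train = sc.sample(frac=0.8, random_state=0)
sc_test = sc.drop(sc_train.index)

sc_train_features = sc_train.copy()
sc_test_features = sc_test.copy()
sc_train_labels = sc_train_features.pop('critical_temp')
sc_test_labels = sc_test_features.pop('critical_temp')

sc_normalizer = tf.keras.layers.Normalization(axis=-1)
sc_normalizer.adapt(np.array(sc_train_features))

# From here, reuse build_and_compile_model(sc_normalizer), Model.fit, and
# Model.evaluate exactly as in the MPG example above.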

Overfit and underfit

As always, the code in this example will use the tf.keras API, which you can learn more about in the TensorFlow Keras guide.

In both of the previous examples—classifying text and predicting fuel efficiency—the accuracy of models on the validation data would peak after training for a number of epochs and then stagnate or start decreasing.

In other words, your model would overfit to the training data. Learning how to deal with overfitting is important. Although it’s often possible to achieve high accuracy on the training set, what you really want is to develop models that generalize well to a testing set (or data they haven’t seen before).

The opposite of overfitting is underfitting. Underfitting occurs when there is still room for improvement on the train data. This can happen for a number of reasons: If the model is not powerful enough, is over-regularized, or has simply not been trained long enough. This means the network has not learned the relevant patterns in the training data.

If you train for too long though, the model will start to overfit and learn patterns from the training data that don’t generalize to the test data. You need to strike a balance. Understanding how to train for an appropriate number of epochs as you’ll explore below is a useful skill.

To prevent overfitting, the best solution is to use more complete training data. The dataset should cover the full range of inputs that the model is expected to handle. Additional data may only be useful if it covers new and interesting cases.

A model trained on more complete data will naturally generalize better. When that is no longer possible, the next best solution is to use techniques like regularization. These place constraints on the quantity and type of information your model can store. If a network can only afford to memorize a small number of patterns, the optimization process will force it to focus on the most prominent patterns, which have a better chance of generalizing well.

In this notebook, you’ll explore several common regularization techniques, and use them to improve on a classification model.
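As a preview of what such a constraint looks like in code (a sketch only; this layer is illustrative and is not used elsewhere in this section), an L2 penalty on a Dense layer adds 0.001 * w**2 per weight to the loss, which limits how much the layer can memorize:

from tensorflow.keras import layers, regularizers

l2_regularized_layer = layers.Dense(
    16,
    activation='elu',
    kernel_regularizer=regularizers.l2(0.001))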

Setup

Before getting started, import the necessary packages:

import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import regularizers

print(tf.__version__)
2.10.0
!pip install git+https://github.com/tensorflow/docs

import tensorflow_docs as tfdocs
import tensorflow_docs.modeling
import tensorflow_docs.plots
Collecting git+https://github.com/tensorflow/docs
  Cloning https://github.com/tensorflow/docs to /tmp/pip-req-build-9_r6grn8
  Running command git clone --filter=blob:none --quiet https://github.com/tensorflow/docs /tmp/pip-req-build-9_r6grn8
  Resolved https://github.com/tensorflow/docs to commit 48bbec70106050fedf462e5f79993aceef114cfb
  Preparing metadata (setup.py) ... done
Requirement already satisfied: astor in /home/obm/Prog/miniconda3/envs/qml/lib/python3.8/site-packages (from tensorflow-docs==0.0.0.dev0) (0.8.1)
Requirement already satisfied: absl-py in /home/obm/Prog/miniconda3/envs/qml/lib/python3.8/site-packages (from tensorflow-docs==0.0.0.dev0) (1.3.0)
Requirement already satisfied: jinja2 in /home/obm/Prog/miniconda3/envs/qml/lib/python3.8/site-packages (from tensorflow-docs==0.0.0.dev0) (3.0.3)
Requirement already satisfied: nbformat in /home/obm/Prog/miniconda3/envs/qml/lib/python3.8/site-packages (from tensorflow-docs==0.0.0.dev0) (5.5.0)
Requirement already satisfied: protobuf<3.20,>=3.12.0 in /home/obm/Prog/miniconda3/envs/qml/lib/python3.8/site-packages (from tensorflow-docs==0.0.0.dev0) (3.19.6)
Requirement already satisfied: pyyaml in /home/obm/Prog/miniconda3/envs/qml/lib/python3.8/site-packages (from tensorflow-docs==0.0.0.dev0) (6.0)
Requirement already satisfied: MarkupSafe>=2.0 in /home/obm/Prog/miniconda3/envs/qml/lib/python3.8/site-packages (from jinja2->tensorflow-docs==0.0.0.dev0) (2.1.1)
Requirement already satisfied: jsonschema>=2.6 in /home/obm/Prog/miniconda3/envs/qml/lib/python3.8/site-packages (from nbformat->tensorflow-docs==0.0.0.dev0) (4.16.0)
Requirement already satisfied: fastjsonschema in /home/obm/Prog/miniconda3/envs/qml/lib/python3.8/site-packages (from nbformat->tensorflow-docs==0.0.0.dev0) (2.16.2)
Requirement already satisfied: jupyter_core in /home/obm/Prog/miniconda3/envs/qml/lib/python3.8/site-packages (from nbformat->tensorflow-docs==0.0.0.dev0) (4.11.1)
Requirement already satisfied: traitlets>=5.1 in /home/obm/Prog/miniconda3/envs/qml/lib/python3.8/site-packages (from nbformat->tensorflow-docs==0.0.0.dev0) (5.1.1)
Requirement already satisfied: importlib-resources>=1.4.0 in /home/obm/Prog/miniconda3/envs/qml/lib/python3.8/site-packages (from jsonschema>=2.6->nbformat->tensorflow-docs==0.0.0.dev0) (5.2.0)
Requirement already satisfied: attrs>=17.4.0 in /home/obm/Prog/miniconda3/envs/qml/lib/python3.8/site-packages (from jsonschema>=2.6->nbformat->tensorflow-docs==0.0.0.dev0) (21.4.0)
Requirement already satisfied: pyrsistent!=0.17.0,!=0.17.1,!=0.17.2,>=0.14.0 in /home/obm/Prog/miniconda3/envs/qml/lib/python3.8/site-packages (from jsonschema>=2.6->nbformat->tensorflow-docs==0.0.0.dev0) (0.18.0)
Requirement already satisfied: pkgutil-resolve-name>=1.3.10 in /home/obm/Prog/miniconda3/envs/qml/lib/python3.8/site-packages (from jsonschema>=2.6->nbformat->tensorflow-docs==0.0.0.dev0) (1.3.10)
Requirement already satisfied: zipp>=3.1.0 in /home/obm/Prog/miniconda3/envs/qml/lib/python3.8/site-packages (from importlib-resources>=1.4.0->jsonschema>=2.6->nbformat->tensorflow-docs==0.0.0.dev0) (3.9.0)
from IPython import display
from matplotlib import pyplot as plt

import numpy as np

import pathlib
import shutil
import tempfile
logdir = pathlib.Path(tempfile.mkdtemp())/"tensorboard_logs"
shutil.rmtree(logdir, ignore_errors=True)

The Higgs dataset

The goal of this tutorial is not to do particle physics, so don’t dwell on the details of the dataset. It contains 11,000,000 examples, each with 28 features, and a binary class label.

gz = tf.keras.utils.get_file('HIGGS.csv.gz', 'http://mlphysics.ics.uci.edu/data/higgs/HIGGS.csv.gz')
FEATURES = 28

The tf.data.experimental.CsvDataset class can be used to read csv records directly from a gzip file with no intermediate decompression step.

ds = tf.data.experimental.CsvDataset(gz,[float(),]*(FEATURES+1), compression_type="GZIP")

That csv reader class returns a list of scalars for each record. The following function repacks that list of scalars into a (feature_vector, label) pair.

def pack_row(*row):
  label = row[0]
  features = tf.stack(row[1:],1)
  return features, label

TensorFlow is most efficient when operating on large batches of data.

So, instead of repacking each row individually make a new tf.data.Dataset that takes batches of 10,000 examples, applies the pack_row function to each batch, and then splits the batches back up into individual records:

packed_ds = ds.batch(10000).map(pack_row).unbatch()

Inspect some of the records from this new packed_ds.

The features are not perfectly normalized, but this is sufficient for this tutorial.

for features,label in packed_ds.batch(1000).take(1):
  print(features[0])
  plt.hist(features.numpy().flatten(), bins = 101)
tf.Tensor(
[ 0.869 -0.635  0.226  0.327 -0.69   0.754 -0.249 -1.092  0.     1.375
 -0.654  0.93   1.107  1.139 -1.578 -1.047  0.     0.658 -0.01  -0.046
  3.102  1.354  0.98   0.978  0.92   0.722  0.989  0.877], shape=(28,), dtype=float32)
[Figure: histogram of the first batch of feature values from packed_ds]

To keep this tutorial relatively short, use just the first 1,000 samples for validation, and the next 10,000 for training:

N_VALIDATION = int(1e3)
N_TRAIN = int(1e4)
BUFFER_SIZE = int(1e4)
BATCH_SIZE = 500
STEPS_PER_EPOCH = N_TRAIN//BATCH_SIZE

The Dataset.skip and Dataset.take methods make this easy.

At the same time, use the Dataset.cache method to ensure that the loader doesn’t need to re-read the data from the file on each epoch:

validate_ds = packed_ds.take(N_VALIDATION).cache()
train_ds = packed_ds.skip(N_VALIDATION).take(N_TRAIN).cache()
train_ds
<CacheDataset element_spec=(TensorSpec(shape=(28,), dtype=tf.float32, name=None), TensorSpec(shape=(), dtype=tf.float32, name=None))>

These datasets return individual examples. Use the Dataset.batch method to create batches of an appropriate size for training. Before batching, also remember to use Dataset.shuffle and Dataset.repeat on the training set.

validate_ds = validate_ds.batch(BATCH_SIZE)
train_ds = train_ds.shuffle(BUFFER_SIZE).repeat().batch(BATCH_SIZE)

Demonstrate overfitting

The simplest way to prevent overfitting is to start with a small model: A model with a small number of learnable parameters (which is determined by the number of layers and the number of units per layer). In deep learning, the number of learnable parameters in a model is often referred to as the model’s “capacity”.

Intuitively, a model with more parameters will have more “memorization capacity” and therefore will be able to easily learn a perfect dictionary-like mapping between training samples and their targets, a mapping without any generalization power, but this would be useless when making predictions on previously unseen data.

Always keep this in mind: deep learning models tend to be good at fitting to the training data, but the real challenge is generalization, not fitting.

On the other hand, if the network has limited memorization resources, it will not be able to learn the mapping as easily. To minimize its loss, it will have to learn compressed representations that have more predictive power. At the same time, if you make your model too small, it will have difficulty fitting to the training data. There is a balance between “too much capacity” and “not enough capacity”.

Unfortunately, there is no magical formula to determine the right size or architecture of your model (in terms of the number of layers, or the right size for each layer). You will have to experiment using a series of different architectures.

To find an appropriate model size, it’s best to start with relatively few layers and parameters, then begin increasing the size of the layers or adding new layers until you see diminishing returns on the validation loss.

Start with a simple model using only densely-connected layers (tf.keras.layers.Dense) as a baseline, then create larger models, and compare them.

Training procedure

Many models train better if you gradually reduce the learning rate during training. Use tf.keras.optimizers.schedules to reduce the learning rate over time:

lr_schedule = tf.keras.optimizers.schedules.InverseTimeDecay(
  0.001,
  decay_steps=STEPS_PER_EPOCH*1000,
  decay_rate=1,
  staircase=False)

def get_optimizer():
  return tf.keras.optimizers.Adam(lr_schedule)

The code above sets a tf.keras.optimizers.schedules.InverseTimeDecay to hyperbolically decrease the learning rate to 1/2 of the base rate at 1,000 epochs, 1/3 at 2,000 epochs, and so on.
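As a quick numerical check (a sketch; InverseTimeDecay evaluates initial_learning_rate / (1 + decay_rate * step / decay_steps)), you can query the schedule at a few step counts before plotting it:

print(lr_schedule(0).numpy())                        # the base rate, 0.001
print(lr_schedule(STEPS_PER_EPOCH * 1000).numpy())   # ~0.0005, half the base rate
print(lr_schedule(STEPS_PER_EPOCH * 2000).numpy())   # ~0.00033, a third of the base rate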

step = np.linspace(0,100000)
lr = lr_schedule(step)
plt.figure(figsize = (8,6))
plt.plot(step/STEPS_PER_EPOCH, lr)
plt.ylim([0,max(plt.ylim())])
plt.xlabel('Epoch')
_ = plt.ylabel('Learning Rate')
[Figure: learning rate vs. epoch for the InverseTimeDecay schedule]

Each model in this tutorial will use the same training configuration. So set these up in a reusable way, starting with the list of callbacks.

The training for this tutorial runs for many short epochs. To reduce the logging noise use the tfdocs.EpochDots which simply prints a . for each epoch, and a full set of metrics every 100 epochs.

Next include tf.keras.callbacks.EarlyStopping to avoid long and unnecessary training times. Note that this callback is set to monitor the val_binary_crossentropy, not the val_loss. This difference will be important later.

Use callbacks.TensorBoard to generate TensorBoard logs for the training.

def get_callbacks(name):
  return [
    tfdocs.modeling.EpochDots(),
    tf.keras.callbacks.EarlyStopping(monitor='val_binary_crossentropy', patience=200),
    tf.keras.callbacks.TensorBoard(logdir/name),
  ]

Similarly each model will use the same Model.compile and Model.fit settings:

def compile_and_fit(model, name, optimizer=None, max_epochs=10000):
  if optimizer is None:
    optimizer = get_optimizer()
  model.compile(optimizer=optimizer,
                loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                metrics=[
                  tf.keras.losses.BinaryCrossentropy(
                      from_logits=True, name='binary_crossentropy'),
                  'accuracy'])

  model.summary()

  history = model.fit(
    train_ds,
    steps_per_epoch = STEPS_PER_EPOCH,
    epochs=max_epochs,
    validation_data=validate_ds,
    callbacks=get_callbacks(name),
    verbose=0)
  return history

Tiny model

Start by training a model:

tiny_model = tf.keras.Sequential([
    layers.Dense(16, activation='elu', input_shape=(FEATURES,)),
    layers.Dense(1)
])
size_histories = {}
size_histories['Tiny'] = compile_and_fit(tiny_model, 'sizes/Tiny')
Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_8 (Dense)             (None, 16)                464       
                                                                 
 dense_9 (Dense)             (None, 1)                 17        
                                                                 
=================================================================
Total params: 481
Trainable params: 481
Non-trainable params: 0
_________________________________________________________________
Epoch: 0, accuracy:0.5086,  binary_crossentropy:0.7727,  loss:0.7727,  val_accuracy:0.5040,  val_binary_crossentropy:0.7448,  val_loss:0.7448,  
....................................................................................................
Epoch: 100, accuracy:0.6019,  binary_crossentropy:0.6234,  loss:0.6234,  val_accuracy:0.5760,  val_binary_crossentropy:0.6287,  val_loss:0.6287,  
....................................................................................................
Epoch: 200, accuracy:0.6385,  binary_crossentropy:0.6065,  loss:0.6065,  val_accuracy:0.6270,  val_binary_crossentropy:0.6056,  val_loss:0.6056,  
....................................................................................................
Epoch: 300, accuracy:0.6511,  binary_crossentropy:0.5959,  loss:0.5959,  val_accuracy:0.6450,  val_binary_crossentropy:0.5913,  val_loss:0.5913,  
....................................................................................................
Epoch: 400, accuracy:0.6614,  binary_crossentropy:0.5888,  loss:0.5888,  val_accuracy:0.6610,  val_binary_crossentropy:0.5833,  val_loss:0.5833,  
....................................................................................................
Epoch: 500, accuracy:0.6703,  binary_crossentropy:0.5849,  loss:0.5849,  val_accuracy:0.6760,  val_binary_crossentropy:0.5810,  val_loss:0.5810,  
....................................................................................................
Epoch: 600, accuracy:0.6728,  binary_crossentropy:0.5821,  loss:0.5821,  val_accuracy:0.6500,  val_binary_crossentropy:0.5808,  val_loss:0.5808,  
....................................................................................................
Epoch: 700, accuracy:0.6802,  binary_crossentropy:0.5795,  loss:0.5795,  val_accuracy:0.6710,  val_binary_crossentropy:0.5779,  val_loss:0.5779,  
....................................................................................................
Epoch: 800, accuracy:0.6777,  binary_crossentropy:0.5777,  loss:0.5777,  val_accuracy:0.6790,  val_binary_crossentropy:0.5776,  val_loss:0.5776,  
....................................................................................................
Epoch: 900, accuracy:0.6845,  binary_crossentropy:0.5758,  loss:0.5758,  val_accuracy:0.6790,  val_binary_crossentropy:0.5782,  val_loss:0.5782,  
..................................................

Now check how the model did:

plotter = tfdocs.plots.HistoryPlotter(metric = 'binary_crossentropy', smoothing_std=10)
plotter.plot(size_histories)
plt.ylim([0.5, 0.7])
(0.5, 0.7)
[Figure: smoothed binary cross-entropy vs. epoch for the 'Tiny' model]

Small model

To check whether you can beat the performance of the "Tiny" model, progressively train some larger models.

Try two hidden layers with 16 units each:

small_model = tf.keras.Sequential([
    # `input_shape` is only required here so that `.summary` works.
    layers.Dense(16, activation='elu', input_shape=(FEATURES,)),
    layers.Dense(16, activation='elu'),
    layers.Dense(1)
])
size_histories['Small'] = compile_and_fit(small_model, 'sizes/Small')
Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_10 (Dense)            (None, 16)                464       
                                                                 
 dense_11 (Dense)            (None, 16)                272       
                                                                 
 dense_12 (Dense)            (None, 1)                 17        
                                                                 
=================================================================
Total params: 753
Trainable params: 753
Non-trainable params: 0
_________________________________________________________________
Epoch: 0, accuracy:0.4784,  binary_crossentropy:0.7893,  loss:0.7893,  val_accuracy:0.4910,  val_binary_crossentropy:0.7389,  val_loss:0.7389,  
....................................................................................................
Epoch: 100, accuracy:0.6241,  binary_crossentropy:0.6140,  loss:0.6140,  val_accuracy:0.6270,  val_binary_crossentropy:0.6097,  val_loss:0.6097,  
....................................................................................................
Epoch: 200, accuracy:0.6642,  binary_crossentropy:0.5867,  loss:0.5867,  val_accuracy:0.6460,  val_binary_crossentropy:0.5886,  val_loss:0.5886,  
....................................................................................................
Epoch: 300, accuracy:0.6755,  binary_crossentropy:0.5751,  loss:0.5751,  val_accuracy:0.6780,  val_binary_crossentropy:0.5781,  val_loss:0.5781,  
....................................................................................................
Epoch: 400, accuracy:0.6827,  binary_crossentropy:0.5686,  loss:0.5686,  val_accuracy:0.6580,  val_binary_crossentropy:0.5762,  val_loss:0.5762,  
....................................................................................................
Epoch: 500, accuracy:0.6842,  binary_crossentropy:0.5638,  loss:0.5638,  val_accuracy:0.6760,  val_binary_crossentropy:0.5702,  val_loss:0.5702,  
....................................................................................................
Epoch: 600, accuracy:0.6924,  binary_crossentropy:0.5591,  loss:0.5591,  val_accuracy:0.6810,  val_binary_crossentropy:0.5713,  val_loss:0.5713,  
....................................................................................................
Epoch: 700, accuracy:0.6942,  binary_crossentropy:0.5546,  loss:0.5546,  val_accuracy:0.6730,  val_binary_crossentropy:0.5729,  val_loss:0.5729,  
....................................................................................................
Epoch: 800, accuracy:0.6973,  binary_crossentropy:0.5518,  loss:0.5518,  val_accuracy:0.6650,  val_binary_crossentropy:0.5776,  val_loss:0.5776,  
..................................

Medium model

Now try three hidden layers with 64 units each:

medium_model = tf.keras.Sequential([
    layers.Dense(64, activation='elu', input_shape=(FEATURES,)),
    layers.Dense(64, activation='elu'),
    layers.Dense(64, activation='elu'),
    layers.Dense(1)
])

And train the model using the same data:

size_histories['Medium']  = compile_and_fit(medium_model, "sizes/Medium")
Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_13 (Dense)            (None, 64)                1856      
                                                                 
 dense_14 (Dense)            (None, 64)                4160      
                                                                 
 dense_15 (Dense)            (None, 64)                4160      
                                                                 
 dense_16 (Dense)            (None, 1)                 65        
                                                                 
=================================================================
Total params: 10,241
Trainable params: 10,241
Non-trainable params: 0
_________________________________________________________________
Epoch: 0, accuracy:0.4971,  binary_crossentropy:0.6986,  loss:0.6986,  val_accuracy:0.4890,  val_binary_crossentropy:0.6791,  val_loss:0.6791,  
....................................................................................................
Epoch: 100, accuracy:0.7060,  binary_crossentropy:0.5381,  loss:0.5381,  val_accuracy:0.6780,  val_binary_crossentropy:0.6080,  val_loss:0.6080,  
....................................................................................................
Epoch: 200, accuracy:0.7829,  binary_crossentropy:0.4352,  loss:0.4352,  val_accuracy:0.6390,  val_binary_crossentropy:0.6822,  val_loss:0.6822,  
........................................................................................

Large model

As an exercise, you can create an even larger model and check how quickly it begins overfitting. Next, add to this benchmark a network that has much more capacity, far more than the problem would warrant:

large_model = tf.keras.Sequential([
    layers.Dense(512, activation='elu', input_shape=(FEATURES,)),
    layers.Dense(512, activation='elu'),
    layers.Dense(512, activation='elu'),
    layers.Dense(512, activation='elu'),
    layers.Dense(1)
])

And, again, train the model using the same data:

size_histories['large'] = compile_and_fit(large_model, "sizes/large")
Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_17 (Dense)            (None, 512)               14848     
                                                                 
 dense_18 (Dense)            (None, 512)               262656    
                                                                 
 dense_19 (Dense)            (None, 512)               262656    
                                                                 
 dense_20 (Dense)            (None, 512)               262656    
                                                                 
 dense_21 (Dense)            (None, 1)                 513       
                                                                 
=================================================================
Total params: 803,329
Trainable params: 803,329
Non-trainable params: 0
_________________________________________________________________
Epoch: 0, accuracy:0.5174,  binary_crossentropy:0.8343,  loss:0.8343,  val_accuracy:0.4600,  val_binary_crossentropy:0.6981,  val_loss:0.6981,  
Epoch: 100, accuracy:1.0000,  binary_crossentropy:0.0024,  loss:0.0024,  val_accuracy:0.6720,  val_binary_crossentropy:1.7037,  val_loss:1.7037,
Epoch: 200, accuracy:1.0000,  binary_crossentropy:0.0001,  loss:0.0001,  val_accuracy:0.6680,  val_binary_crossentropy:2.3695,  val_loss:2.3695,

Plot the training and validation losses

The solid lines show the training loss, and the dashed lines show the validation loss (remember: a lower validation loss indicates a better model).

While building a larger model gives it more power, if this power is not constrained somehow it can easily overfit to the training set.

In this example, typically, only the "Tiny" model manages to avoid overfitting altogether, and each of the larger models overfit the data more quickly. This becomes so severe for the "large" model that you need to switch the plot to a log-scale to really figure out what’s happening.

This is apparent if you plot and compare the validation metrics to the training metrics.

  • It’s normal for there to be a small difference.

  • If both metrics are moving in the same direction, everything is fine.

  • If the validation metric begins to stagnate while the training metric continues to improve, you are probably close to overfitting.

  • If the validation metric is going in the wrong direction, the model is clearly overfitting.

plotter.plot(size_histories)
a = plt.xscale('log')
plt.xlim([5, max(plt.xlim())])
plt.ylim([0.5, 0.7])
plt.xlabel("Epochs [Log Scale]")
Text(0.5, 0, 'Epochs [Log Scale]')
[Figure: training and validation binary cross-entropy vs. epochs (log scale) for the different model sizes]

Note: All the above training runs used the tf.keras.callbacks.EarlyStopping callback to end the training once it was clear the model was not making progress.
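
For reference, here is a minimal sketch of such a callback; the monitored metric and patience below are illustrative assumptions, not necessarily the exact settings used inside compile_and_fit:

import tensorflow as tf

# Hypothetical settings, shown only to illustrate the API.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_binary_crossentropy',  # watch the un-regularized validation loss
    patience=200,                       # allow this many epochs without improvement
    restore_best_weights=True)          # roll back to the best weights seen

# It would then be passed to training via model.fit(..., callbacks=[early_stop]).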

View in TensorBoard

These models all wrote TensorBoard logs during training.

Open an embedded TensorBoard viewer inside a notebook:

#docs_infra: no_execute

# Load the TensorBoard notebook extension
%load_ext tensorboard

# Open an embedded TensorBoard viewer
%tensorboard --logdir {logdir}/sizes

You can view the results of a previous run of this notebook on TensorBoard.dev.

TensorBoard.dev is a managed experience for hosting, tracking, and sharing ML experiments with everyone.

It’s also included in an <iframe> for convenience:

display.IFrame(
    src="https://tensorboard.dev/experiment/vW7jmmF9TmKmy3rbheMQpw/#scalars&_smoothingWeight=0.97",
    width="100%", height="800px")

If you want to share TensorBoard results you can upload the logs to TensorBoard.dev by copying the following into a code-cell.

Note: This step requires a Google account.

!tensorboard dev upload --logdir  {logdir}/sizes

Caution: This command does not terminate. It’s designed to continuously upload the results of long-running experiments. Once your data is uploaded you need to stop it using the “interrupt execution” option in your notebook tool.

Strategies to prevent overfitting

Before getting into the content of this section, copy the training logs from the "Tiny" model above to use as a baseline for comparison.

shutil.rmtree(logdir/'regularizers/Tiny', ignore_errors=True)
shutil.copytree(logdir/'sizes/Tiny', logdir/'regularizers/Tiny')
PosixPath('/tmp/tmpt5wb_tsk/tensorboard_logs/regularizers/Tiny')
regularizer_histories = {}
regularizer_histories['Tiny'] = size_histories['Tiny']

Add weight regularization

You may be familiar with Occam’s Razor principle: given two explanations for something, the explanation most likely to be correct is the “simplest” one, the one that makes the least amount of assumptions. This also applies to the models learned by neural networks: given some training data and a network architecture, there are multiple sets of weights values (multiple models) that could explain the data, and simpler models are less likely to overfit than complex ones.

A “simple model” in this context is a model where the distribution of parameter values has less entropy (or a model with fewer parameters altogether, as demonstrated in the section above). Thus a common way to mitigate overfitting is to put constraints on the complexity of a network by forcing its weights only to take small values, which makes the distribution of weight values more “regular”. This is called “weight regularization”, and it is done by adding to the loss function of the network a cost associated with having large weights. This cost comes in two flavors:

  • L1 regularization, where the cost added is proportional to the absolute value of the weights coefficients (i.e. to what is called the “L1 norm” of the weights).

  • L2 regularization, where the cost added is proportional to the square of the value of the weights coefficients (i.e. to what is called the squared “L2 norm” of the weights). L2 regularization is also called weight decay in the context of neural networks. Don’t let the different name confuse you: weight decay is mathematically the exact same as L2 regularization.

L1 regularization pushes weights towards exactly zero, encouraging a sparse model. L2 regularization will penalize the weights parameters without making them sparse since the penalty goes to zero for small weights—one reason why L2 is more common.
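
Both flavors, and their combination, are available as regularizer factories in tf.keras. A quick reference sketch (the coefficients here are illustrative only):

from tensorflow.keras import regularizers

l1_reg = regularizers.l1(0.001)                     # adds 0.001 * sum(|w|) to the loss
l2_reg = regularizers.l2(0.001)                     # adds 0.001 * sum(w**2) to the loss
l1_l2_reg = regularizers.l1_l2(l1=0.001, l2=0.001)  # both penalties combined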

In tf.keras, weight regularization is added by passing weight regularizer instances to layers as keyword arguments. Add L2 weight regularization:

l2_model = tf.keras.Sequential([
    layers.Dense(512, activation='elu',
                 kernel_regularizer=regularizers.l2(0.001),
                 input_shape=(FEATURES,)),
    layers.Dense(512, activation='elu',
                 kernel_regularizer=regularizers.l2(0.001)),
    layers.Dense(512, activation='elu',
                 kernel_regularizer=regularizers.l2(0.001)),
    layers.Dense(512, activation='elu',
                 kernel_regularizer=regularizers.l2(0.001)),
    layers.Dense(1)
])

regularizer_histories['l2'] = compile_and_fit(l2_model, "regularizers/l2")
Model: "sequential_8"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_22 (Dense)            (None, 512)               14848     
                                                                 
 dense_23 (Dense)            (None, 512)               262656    
                                                                 
 dense_24 (Dense)            (None, 512)               262656    
                                                                 
 dense_25 (Dense)            (None, 512)               262656    
                                                                 
 dense_26 (Dense)            (None, 1)                 513       
                                                                 
=================================================================
Total params: 803,329
Trainable params: 803,329
Non-trainable params: 0
_________________________________________________________________
Epoch: 0, accuracy:0.5091,  binary_crossentropy:0.8260,  loss:2.3510,  val_accuracy:0.4620,  val_binary_crossentropy:0.7253,  val_loss:2.1769,  
Epoch: 100, accuracy:0.6579,  binary_crossentropy:0.5974,  loss:0.6198,  val_accuracy:0.6440,  val_binary_crossentropy:0.5902,  val_loss:0.6126,
Epoch: 200, accuracy:0.6784,  binary_crossentropy:0.5840,  loss:0.6071,  val_accuracy:0.6560,  val_binary_crossentropy:0.5865,  val_loss:0.6098,
Epoch: 300, accuracy:0.6804,  binary_crossentropy:0.5726,  loss:0.5946,  val_accuracy:0.6620,  val_binary_crossentropy:0.5803,  val_loss:0.6022,
Epoch: 400, accuracy:0.6882,  binary_crossentropy:0.5658,  loss:0.5899,  val_accuracy:0.6950,  val_binary_crossentropy:0.5791,  val_loss:0.6032,
Epoch: 500, accuracy:0.6920,  binary_crossentropy:0.5629,  loss:0.5881,  val_accuracy:0.6840,  val_binary_crossentropy:0.5796,  val_loss:0.6048,
Epoch: 600, accuracy:0.6990,  binary_crossentropy:0.5513,  loss:0.5797,  val_accuracy:0.6860,  val_binary_crossentropy:0.5772,  val_loss:0.6055,
Epoch: 700, accuracy:0.7068,  binary_crossentropy:0.5456,  loss:0.5754,  val_accuracy:0.6920,  val_binary_crossentropy:0.5734,  val_loss:0.6031,
Epoch: 800, accuracy:0.7142,  binary_crossentropy:0.5349,  loss:0.5648,  val_accuracy:0.6910,  val_binary_crossentropy:0.5796,  val_loss:0.6095,
Epoch: 900, accuracy:0.7179,  binary_crossentropy:0.5303,  loss:0.5610,  val_accuracy:0.6710,  val_binary_crossentropy:0.5826,  val_loss:0.6132,

l2(0.001) means that every coefficient in the weight matrix of the layer will add 0.001 * weight_coefficient_value**2 to the total loss of the network.

That is why we monitor the binary_crossentropy directly: it does not have this regularization component mixed in.
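
As a standalone sketch (not part of the tutorial's training code), you can check the penalty contributed by a single regularized layer and compare it against the formula above:

import tensorflow as tf
from tensorflow.keras import layers, regularizers

layer = layers.Dense(4, kernel_regularizer=regularizers.l2(0.001))
_ = layer(tf.ones((1, 3)))                      # build the layer so the kernel exists
penalty = tf.add_n(layer.losses)                # the regularization term collected by Keras
manual = 0.001 * tf.reduce_sum(tf.square(layer.kernel))
print(float(penalty), float(manual))            # the two values agree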

So, that same "Large" model with an L2 regularization penalty performs much better:

plotter.plot(regularizer_histories)
plt.ylim([0.5, 0.7])
(0.5, 0.7)
[Figure: training and validation binary cross-entropy for the "Tiny" baseline and the L2-regularized model]

As demonstrated in the diagram above, the "L2" regularized model is now much more competitive with the "Tiny" model. This "L2" model is also much more resistant to overfitting than the "Large" model it was based on despite having the same number of parameters.

More info

There are two important things to note about this sort of regularization:

  1. If you are writing your own training loop, then you need to be sure to ask the model for its regularization losses.

result = l2_model(features)
regularization_loss = tf.add_n(l2_model.losses)
  2. This implementation works by adding the weight penalties to the model’s loss, and then applying a standard optimization procedure after that.

There is a second approach that instead only runs the optimizer on the raw loss, and then while applying the calculated step the optimizer also applies some weight decay. This “decoupled weight decay” is used in optimizers like tf.keras.optimizers.Ftrl and tfa.optimizers.AdamW.
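
A hedged sketch of that second approach, assuming the tensorflow-addons package is installed (the coefficients are illustrative):

import tensorflow_addons as tfa

# Decoupled weight decay: the decay is applied during the optimizer step
# rather than being added to the loss.
adamw = tfa.optimizers.AdamW(weight_decay=1e-4, learning_rate=1e-3)
# model.compile(optimizer=adamw, loss=..., metrics=...)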

Add dropout

Dropout is one of the most effective and most commonly used regularization techniques for neural networks, developed by Hinton and his students at the University of Toronto.

The intuitive explanation for dropout is that because individual nodes in the network cannot rely on the output of the others, each node must output features that are useful on their own.

Dropout, applied to a layer, consists of randomly “dropping out” (i.e. set to zero) a number of output features of the layer during training. For example, a given layer would normally have returned a vector [0.2, 0.5, 1.3, 0.8, 1.1] for a given input sample during training; after applying dropout, this vector will have a few zero entries distributed at random, e.g. [0, 0.5, 1.3, 0, 1.1].

The “dropout rate” is the fraction of the features that are being zeroed-out; it is usually set between 0.2 and 0.5. At test time, no units are dropped out, and instead the layer’s output values are scaled down by a factor equal to the dropout rate, so as to balance for the fact that more units are active than at training time.

In Keras, you can introduce dropout in a network via the tf.keras.layers.Dropout layer, which is applied to the output of the layer right before it.
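
Note that Keras implements this with "inverted" dropout: the kept units are scaled up by 1/(1 - rate) during training, so the layer is simply an identity at inference. A quick standalone sketch of that behavior:

import tensorflow as tf

drop = tf.keras.layers.Dropout(0.5)
x = tf.ones((1, 10))
print(drop(x, training=True))   # roughly half the entries zeroed, the rest scaled to 2.0
print(drop(x, training=False))  # identity: all ones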

Add two dropout layers to your network to check how well they do at reducing overfitting:

dropout_model = tf.keras.Sequential([
    layers.Dense(512, activation='elu', input_shape=(FEATURES,)),
    layers.Dropout(0.5),
    layers.Dense(512, activation='elu'),
    layers.Dropout(0.5),
    layers.Dense(512, activation='elu'),
    layers.Dropout(0.5),
    layers.Dense(512, activation='elu'),
    layers.Dropout(0.5),
    layers.Dense(1)
])

regularizer_histories['dropout'] = compile_and_fit(dropout_model, "regularizers/dropout")
Model: "sequential_9"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_27 (Dense)            (None, 512)               14848     
                                                                 
 dropout (Dropout)           (None, 512)               0         
                                                                 
 dense_28 (Dense)            (None, 512)               262656    
                                                                 
 dropout_1 (Dropout)         (None, 512)               0         
                                                                 
 dense_29 (Dense)            (None, 512)               262656    
                                                                 
 dropout_2 (Dropout)         (None, 512)               0         
                                                                 
 dense_30 (Dense)            (None, 512)               262656    
                                                                 
 dropout_3 (Dropout)         (None, 512)               0         
                                                                 
 dense_31 (Dense)            (None, 1)                 513       
                                                                 
=================================================================
Total params: 803,329
Trainable params: 803,329
Non-trainable params: 0
_________________________________________________________________
Epoch: 0, accuracy:0.5018,  binary_crossentropy:0.8187,  loss:0.8187,  val_accuracy:0.5720,  val_binary_crossentropy:0.6890,  val_loss:0.6890,  
Epoch: 100, accuracy:0.6568,  binary_crossentropy:0.5966,  loss:0.5966,  val_accuracy:0.6820,  val_binary_crossentropy:0.5713,  val_loss:0.5713,
Epoch: 200, accuracy:0.6929,  binary_crossentropy:0.5539,  loss:0.5539,  val_accuracy:0.6910,  val_binary_crossentropy:0.5774,  val_loss:0.5774,
Epoch: 300, accuracy:0.7214,  binary_crossentropy:0.5114,  loss:0.5114,  val_accuracy:0.6860,  val_binary_crossentropy:0.5982,  val_loss:0.5982,
plotter.plot(regularizer_histories)
plt.ylim([0.5, 0.7])
(0.5, 0.7)
[Figure: training and validation binary cross-entropy for the "Tiny", L2, and dropout models]

It’s clear from this plot that both of these regularization approaches improve the behavior of the "Large" model. But this still doesn’t beat even the "Tiny" baseline.

Next try them both, together, and see if that does better.

Combined L2 + dropout

combined_model = tf.keras.Sequential([
    layers.Dense(512, kernel_regularizer=regularizers.l2(0.0001),
                 activation='elu', input_shape=(FEATURES,)),
    layers.Dropout(0.5),
    layers.Dense(512, kernel_regularizer=regularizers.l2(0.0001),
                 activation='elu'),
    layers.Dropout(0.5),
    layers.Dense(512, kernel_regularizer=regularizers.l2(0.0001),
                 activation='elu'),
    layers.Dropout(0.5),
    layers.Dense(512, kernel_regularizer=regularizers.l2(0.0001),
                 activation='elu'),
    layers.Dropout(0.5),
    layers.Dense(1)
])

regularizer_histories['combined'] = compile_and_fit(combined_model, "regularizers/combined")
Model: "sequential_10"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_32 (Dense)            (None, 512)               14848     
                                                                 
 dropout_4 (Dropout)         (None, 512)               0         
                                                                 
 dense_33 (Dense)            (None, 512)               262656    
                                                                 
 dropout_5 (Dropout)         (None, 512)               0         
                                                                 
 dense_34 (Dense)            (None, 512)               262656    
                                                                 
 dropout_6 (Dropout)         (None, 512)               0         
                                                                 
 dense_35 (Dense)            (None, 512)               262656    
                                                                 
 dropout_7 (Dropout)         (None, 512)               0         
                                                                 
 dense_36 (Dense)            (None, 1)                 513       
                                                                 
=================================================================
Total params: 803,329
Trainable params: 803,329
Non-trainable params: 0
_________________________________________________________________
Epoch: 0, accuracy:0.4995,  binary_crossentropy:0.8109,  loss:0.9692,  val_accuracy:0.5220,  val_binary_crossentropy:0.6790,  val_loss:0.8366,  
Epoch: 100, accuracy:0.6497,  binary_crossentropy:0.6062,  loss:0.6349,  val_accuracy:0.6440,  val_binary_crossentropy:0.5934,  val_loss:0.6219,
Epoch: 200, accuracy:0.6647,  binary_crossentropy:0.5905,  loss:0.6159,  val_accuracy:0.6830,  val_binary_crossentropy:0.5816,  val_loss:0.6070,
Epoch: 300, accuracy:0.6718,  binary_crossentropy:0.5834,  loss:0.6108,  val_accuracy:0.6800,  val_binary_crossentropy:0.5613,  val_loss:0.5888,
Epoch: 400, accuracy:0.6796,  binary_crossentropy:0.5778,  loss:0.6073,  val_accuracy:0.6740,  val_binary_crossentropy:0.5608,  val_loss:0.5903,
Epoch: 500, accuracy:0.6832,  binary_crossentropy:0.5706,  loss:0.6025,  val_accuracy:0.6830,  val_binary_crossentropy:0.5558,  val_loss:0.5877,
Epoch: 600, accuracy:0.6856,  binary_crossentropy:0.5664,  loss:0.6000,  val_accuracy:0.6870,  val_binary_crossentropy:0.5464,  val_loss:0.5799,
Epoch: 700, accuracy:0.6886,  binary_crossentropy:0.5611,  loss:0.5965,  val_accuracy:0.6890,  val_binary_crossentropy:0.5451,  val_loss:0.5805,
Epoch: 800, accuracy:0.6895,  binary_crossentropy:0.5560,  loss:0.5926,  val_accuracy:0.6880,  val_binary_crossentropy:0.5453,  val_loss:0.5819,
plotter.plot(regularizer_histories)
plt.ylim([0.5, 0.7])
(0.5, 0.7)
[Figure: training and validation binary cross-entropy for the "Tiny", L2, dropout, and combined models]

This model with the "Combined" regularization is obviously the best one so far.

View in TensorBoard

These models also recorded TensorBoard logs.

To open an embedded tensorboard viewer inside a notebook, copy the following into a code-cell:

%tensorboard --logdir {logdir}/regularizers

Takeaway

To recap, here are the most common ways to prevent overfitting in neural networks:

  • Get more training data.

  • Reduce the capacity of the network.

  • Add weight regularization.

  • Add dropout.

Two important approaches not covered in this guide are:

  • Data augmentation

  • Batch normalization (tf.keras.layers.BatchNormalization)
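
Batch normalization is not demonstrated in this guide, but as a hedged starting point for the exercise below, here is a minimal sketch of a Dense block with a BatchNormalization layer placed between the linear transform and the activation (one common choice, not the only one):

import tensorflow as tf
from tensorflow.keras import layers

bn_model = tf.keras.Sequential([
    layers.Dense(64, use_bias=False, input_shape=(FEATURES,)),  # FEATURES as defined earlier
    layers.BatchNormalization(),
    layers.Activation('elu'),
    layers.Dense(1)
])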

DO IT YOURSELF 2:

Modify any example above with a batch normalization layer. Prepare it as a separate Python script. Run it on Carbon.

Save and load models

Model progress can be saved during and after training. This means a model can resume where it left off and avoid long training times. Saving also means you can share your model and others can recreate your work. When publishing research models and techniques, most machine learning practitioners share:

  • code to create the model, and

  • the trained weights, or parameters, for the model

Sharing this data helps others understand how the model works and try it themselves with new data.

Caution: TensorFlow models are code and it is important to be careful with untrusted code. See Using TensorFlow Securely for details.

Options

There are different ways to save TensorFlow models depending on the API you’re using. This guide uses tf.keras—a high-level API to build and train models in TensorFlow. For other approaches, refer to the Using the SavedModel format guide and the Save and load Keras models guide.

Setup

Installs and imports

Install and import TensorFlow and dependencies:

!pip install pyyaml h5py  # Required to save models in HDF5 format
Requirement already satisfied: pyyaml in /home/obm/Prog/miniconda3/envs/qml/lib/python3.8/site-packages (6.0)
Requirement already satisfied: h5py in /home/obm/Prog/miniconda3/envs/qml/lib/python3.8/site-packages (3.7.0)
Requirement already satisfied: numpy>=1.14.5 in /home/obm/Prog/miniconda3/envs/qml/lib/python3.8/site-packages (from h5py) (1.23.4)
import os

import tensorflow as tf
from tensorflow import keras

print(tf.version.VERSION)
2.10.0

Get an example dataset

To demonstrate how to save and load weights, you’ll use the MNIST dataset. To speed up these runs, use the first 1000 examples:

(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.mnist.load_data()

train_labels = train_labels[:1000]
test_labels = test_labels[:1000]

train_images = train_images[:1000].reshape(-1, 28 * 28) / 255.0
test_images = test_images[:1000].reshape(-1, 28 * 28) / 255.0

Define a model

Start by building a simple sequential model:

# Define a simple sequential model
def create_model():
  model = tf.keras.Sequential([
    keras.layers.Dense(512, activation='relu', input_shape=(784,)),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(10)
  ])

  model.compile(optimizer='adam',
                loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])

  return model

# Create a basic model instance
model = create_model()

# Display the model's architecture
model.summary()
Model: "sequential_11"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_37 (Dense)            (None, 512)               401920    
                                                                 
 dropout_8 (Dropout)         (None, 512)               0         
                                                                 
 dense_38 (Dense)            (None, 10)                5130      
                                                                 
=================================================================
Total params: 407,050
Trainable params: 407,050
Non-trainable params: 0
_________________________________________________________________

Save checkpoints during training

You can use a trained model without having to retrain it, or pick-up training where you left off in case the training process was interrupted. The tf.keras.callbacks.ModelCheckpoint callback allows you to continually save the model both during and at the end of training.

Checkpoint callback usage

Create a tf.keras.callbacks.ModelCheckpoint callback that saves weights only during training:

checkpoint_path = "training_1/cp.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)

# Create a callback that saves the model's weights
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                 save_weights_only=True,
                                                 verbose=1)

# Train the model with the new callback
model.fit(train_images, 
          train_labels,  
          epochs=10,
          validation_data=(test_images, test_labels),
          callbacks=[cp_callback])  # Pass callback to training

# This may generate warnings related to saving the state of the optimizer.
# These warnings (and similar warnings throughout this notebook)
# are in place to discourage outdated usage, and can be ignored.
Epoch 1/10
Epoch 1: saving model to training_1/cp.ckpt
32/32 [==============================] - 0s 7ms/step - loss: 1.1511 - sparse_categorical_accuracy: 0.6790 - val_loss: 0.7251 - val_sparse_categorical_accuracy: 0.7710
Epoch 2/10
Epoch 2: saving model to training_1/cp.ckpt
32/32 [==============================] - 0s 4ms/step - loss: 0.4274 - sparse_categorical_accuracy: 0.8780 - val_loss: 0.5376 - val_sparse_categorical_accuracy: 0.8400
Epoch 3/10
Epoch 3: saving model to training_1/cp.ckpt
32/32 [==============================] - 0s 5ms/step - loss: 0.2879 - sparse_categorical_accuracy: 0.9250 - val_loss: 0.4587 - val_sparse_categorical_accuracy: 0.8570
Epoch 4/10
Epoch 4: saving model to training_1/cp.ckpt
32/32 [==============================] - 0s 4ms/step - loss: 0.2094 - sparse_categorical_accuracy: 0.9530 - val_loss: 0.4463 - val_sparse_categorical_accuracy: 0.8620
Epoch 5/10
Epoch 5: saving model to training_1/cp.ckpt
32/32 [==============================] - 0s 4ms/step - loss: 0.1544 - sparse_categorical_accuracy: 0.9680 - val_loss: 0.4129 - val_sparse_categorical_accuracy: 0.8660
Epoch 6/10
Epoch 6: saving model to training_1/cp.ckpt
32/32 [==============================] - 0s 4ms/step - loss: 0.1133 - sparse_categorical_accuracy: 0.9760 - val_loss: 0.4269 - val_sparse_categorical_accuracy: 0.8630
Epoch 7/10
Epoch 7: saving model to training_1/cp.ckpt
32/32 [==============================] - 0s 5ms/step - loss: 0.0943 - sparse_categorical_accuracy: 0.9880 - val_loss: 0.4148 - val_sparse_categorical_accuracy: 0.8640
Epoch 8/10
Epoch 8: saving model to training_1/cp.ckpt
32/32 [==============================] - 0s 4ms/step - loss: 0.0700 - sparse_categorical_accuracy: 0.9920 - val_loss: 0.4009 - val_sparse_categorical_accuracy: 0.8710
Epoch 9/10
Epoch 9: saving model to training_1/cp.ckpt
32/32 [==============================] - 0s 3ms/step - loss: 0.0508 - sparse_categorical_accuracy: 0.9950 - val_loss: 0.4181 - val_sparse_categorical_accuracy: 0.8670
Epoch 10/10
Epoch 10: saving model to training_1/cp.ckpt
32/32 [==============================] - 0s 4ms/step - loss: 0.0389 - sparse_categorical_accuracy: 0.9990 - val_loss: 0.4115 - val_sparse_categorical_accuracy: 0.8670
<keras.callbacks.History at 0x7f8a47f8e2b0>

This creates a single collection of TensorFlow checkpoint files that are updated at the end of each epoch:

os.listdir(checkpoint_dir)
['cp.ckpt.index', 'checkpoint', 'cp.ckpt.data-00000-of-00001']

As long as two models share the same architecture you can share weights between them. So, when restoring a model from weights-only, create a model with the same architecture as the original model and then set its weights.

Now rebuild a fresh, untrained model and evaluate it on the test set. An untrained model will perform at chance levels (~10% accuracy):

# Create a basic model instance
model = create_model()

# Evaluate the model
loss, acc = model.evaluate(test_images, test_labels, verbose=2)
print("Untrained model, accuracy: {:5.2f}%".format(100 * acc))
32/32 - 0s - loss: 2.3659 - sparse_categorical_accuracy: 0.1080 - 106ms/epoch - 3ms/step
Untrained model, accuracy: 10.80%

Then load the weights from the checkpoint and re-evaluate:

# Loads the weights
model.load_weights(checkpoint_path)

# Re-evaluate the model
loss, acc = model.evaluate(test_images, test_labels, verbose=2)
print("Restored model, accuracy: {:5.2f}%".format(100 * acc))
32/32 - 0s - loss: 0.4115 - sparse_categorical_accuracy: 0.8670 - 43ms/epoch - 1ms/step
Restored model, accuracy: 86.70%

Checkpoint callback options

The callback provides several options to give the checkpoints unique names and to adjust the checkpointing frequency.

Train a new model, and save uniquely named checkpoints once every five epochs:

# Include the epoch in the file name (uses `str.format`)
checkpoint_path = "training_2/cp-{epoch:04d}.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)

batch_size = 32

# Create a callback that saves the model's weights every 5 epochs
cp_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_path, 
    verbose=1, 
    save_weights_only=True,
    save_freq=5*batch_size)

# Create a new model instance
model = create_model()

# Save the weights using the `checkpoint_path` format
model.save_weights(checkpoint_path.format(epoch=0))

# Train the model with the new callback
model.fit(train_images, 
          train_labels,
          epochs=50, 
          batch_size=batch_size, 
          callbacks=[cp_callback],
          validation_data=(test_images, test_labels),
          verbose=0)
Epoch 5: saving model to training_2/cp-0005.ckpt
Epoch 10: saving model to training_2/cp-0010.ckpt
Epoch 15: saving model to training_2/cp-0015.ckpt
Epoch 20: saving model to training_2/cp-0020.ckpt
Epoch 25: saving model to training_2/cp-0025.ckpt
Epoch 30: saving model to training_2/cp-0030.ckpt
Epoch 35: saving model to training_2/cp-0035.ckpt
Epoch 40: saving model to training_2/cp-0040.ckpt
Epoch 45: saving model to training_2/cp-0045.ckpt
Epoch 50: saving model to training_2/cp-0050.ckpt
<keras.callbacks.History at 0x7f8acce3f3d0>

Now, review the resulting checkpoints and choose the latest one:

os.listdir(checkpoint_dir)
['cp-0035.ckpt.index',
 'cp-0050.ckpt.index',
 'cp-0000.ckpt.index',
 'cp-0025.ckpt.data-00000-of-00001',
 'cp-0030.ckpt.index',
 'cp-0000.ckpt.data-00000-of-00001',
 'cp-0025.ckpt.index',
 'cp-0045.ckpt.data-00000-of-00001',
 'checkpoint',
 'cp-0015.ckpt.data-00000-of-00001',
 'cp-0030.ckpt.data-00000-of-00001',
 'cp-0015.ckpt.index',
 'cp-0020.ckpt.data-00000-of-00001',
 'cp-0045.ckpt.index',
 'cp-0020.ckpt.index',
 'cp-0050.ckpt.data-00000-of-00001',
 'cp-0035.ckpt.data-00000-of-00001',
 'cp-0005.ckpt.index',
 'cp-0010.ckpt.index',
 'cp-0040.ckpt.index',
 'cp-0040.ckpt.data-00000-of-00001',
 'cp-0010.ckpt.data-00000-of-00001',
 'cp-0005.ckpt.data-00000-of-00001']
latest = tf.train.latest_checkpoint(checkpoint_dir)
latest
'training_2/cp-0050.ckpt'

Note: The default TensorFlow format only saves the 5 most recent checkpoints.
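
If you need explicit control over how many checkpoints are retained, the lower-level tf.train.CheckpointManager exposes a max_to_keep argument. A minimal sketch (the directory name here is illustrative):

import tensorflow as tf

ckpt = tf.train.Checkpoint(model=create_model())
manager = tf.train.CheckpointManager(ckpt, directory="training_3", max_to_keep=5)
manager.save()                      # writes training_3/ckpt-1 and prunes older checkpoints beyond 5
print(manager.latest_checkpoint)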

To test, reset the model, and load the latest checkpoint:

# Create a new model instance
model = create_model()

# Load the previously saved weights
model.load_weights(latest)

# Re-evaluate the model
loss, acc = model.evaluate(test_images, test_labels, verbose=2)
print("Restored model, accuracy: {:5.2f}%".format(100 * acc))
32/32 - 0s - loss: 0.4811 - sparse_categorical_accuracy: 0.8790 - 97ms/epoch - 3ms/step
Restored model, accuracy: 87.90%

What are these files?

The above code stores the weights to a collection of checkpoint-formatted files that contain only the trained weights in a binary format. Checkpoints contain:

  • One or more shards that contain your model’s weights.

  • An index file that indicates which weights are stored in which shard.

If you are training a model on a single machine, you’ll have one shard with the suffix: .data-00000-of-00001
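
To see what a checkpoint actually stores, you can list its variables. This sketch reuses the checkpoint_path from the first ModelCheckpoint example above:

import tensorflow as tf

# Each entry is a (variable name, shape) pair stored in the checkpoint shards.
for name, shape in tf.train.list_variables("training_1/cp.ckpt"):
  print(name, shape)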

Manually save weights

To save weights manually, use tf.keras.Model.save_weights. By default, tf.keras—and the Model.save_weights method in particular—uses the TensorFlow Checkpoint format with a .ckpt extension. To save in the HDF5 format with a .h5 extension, refer to the Save and load models guide.

# Save the weights
model.save_weights('./checkpoints/my_checkpoint')

# Create a new model instance
model = create_model()

# Restore the weights
model.load_weights('./checkpoints/my_checkpoint')

# Evaluate the model
loss, acc = model.evaluate(test_images, test_labels, verbose=2)
print("Restored model, accuracy: {:5.2f}%".format(100 * acc))
WARNING:tensorflow:Detecting that an object or model or tf.train.Checkpoint is being deleted with unrestored values. See the following logs for the specific values in question. To silence these warnings, use `status.expect_partial()`. See https://www.tensorflow.org/api_docs/python/tf/train/Checkpoint#restorefor details about the status object returned by the restore function.
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer.iter
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer.beta_1
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer.beta_2
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer.decay
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer.learning_rate
32/32 - 0s - loss: 0.4811 - sparse_categorical_accuracy: 0.8790 - 97ms/epoch - 3ms/step
Restored model, accuracy: 87.90%

Save the entire model

Call tf.keras.Model.save to save a model’s architecture, weights, and training configuration in a single file/folder. This allows you to export a model so it can be used without access to the original Python code*. Since the optimizer-state is recovered, you can resume training from exactly where you left off.

An entire model can be saved in two different file formats (SavedModel and HDF5). The TensorFlow SavedModel format is the default file format in TF2.x. However, models can also be saved in HDF5 format. More details on saving entire models in the two file formats are described below.

Saving a fully-functional model is very useful—you can load them in TensorFlow.js (Saved Model, HDF5) and then train and run them in web browsers, or convert them to run on mobile devices using TensorFlow Lite (Saved Model, HDF5)

*Custom objects (for example, subclassed models or layers) require special attention when saving and loading. Refer to the Saving custom objects section below.

SavedModel format

The SavedModel format is another way to serialize models. Models saved in this format can be restored using tf.keras.models.load_model and are compatible with TensorFlow Serving. The SavedModel guide goes into detail about how to serve/inspect the SavedModel. The section below illustrates the steps to save and restore the model.

# Create and train a new model instance.
model = create_model()
model.fit(train_images, train_labels, epochs=5)

# Save the entire model as a SavedModel.
!mkdir -p saved_model
model.save('saved_model/my_model') 
Epoch 1/5
32/32 [==============================] - 0s 1ms/step - loss: 1.1544 - sparse_categorical_accuracy: 0.6850
Epoch 2/5
32/32 [==============================] - 0s 2ms/step - loss: 0.4138 - sparse_categorical_accuracy: 0.8860
Epoch 3/5
32/32 [==============================] - 0s 2ms/step - loss: 0.2937 - sparse_categorical_accuracy: 0.9140
Epoch 4/5
32/32 [==============================] - 0s 2ms/step - loss: 0.2069 - sparse_categorical_accuracy: 0.9520
Epoch 5/5
32/32 [==============================] - 0s 2ms/step - loss: 0.1538 - sparse_categorical_accuracy: 0.9720
INFO:tensorflow:Assets written to: saved_model/my_model/assets

The SavedModel format is a directory containing a protobuf binary and a TensorFlow checkpoint. Inspect the saved model directory:

# my_model directory
!ls saved_model

# Contains an assets folder, saved_model.pb, and variables folder.
!ls saved_model/my_model
my_model
assets	keras_metadata.pb  saved_model.pb  variables

Reload a fresh Keras model from the saved model:

new_model = tf.keras.models.load_model('saved_model/my_model')

# Check its architecture
new_model.summary()
Model: "sequential_16"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_47 (Dense)            (None, 512)               401920    
                                                                 
 dropout_13 (Dropout)        (None, 512)               0         
                                                                 
 dense_48 (Dense)            (None, 10)                5130      
                                                                 
=================================================================
Total params: 407,050
Trainable params: 407,050
Non-trainable params: 0
_________________________________________________________________

The restored model is compiled with the same arguments as the original model. Try running evaluate and predict with the loaded model:

# Evaluate the restored model
loss, acc = new_model.evaluate(test_images, test_labels, verbose=2)
print('Restored model, accuracy: {:5.2f}%'.format(100 * acc))

print(new_model.predict(test_images).shape)
32/32 - 0s - loss: 0.4337 - sparse_categorical_accuracy: 0.8570 - 96ms/epoch - 3ms/step
Restored model, accuracy: 85.70%

32/32 [==============================] - 0s 668us/step
(1000, 10)

HDF5 format

Keras provides a basic save format using the HDF5 standard.

# Create and train a new model instance.
model = create_model()
model.fit(train_images, train_labels, epochs=5)

# Save the entire model to a HDF5 file.
# The '.h5' extension indicates that the model should be saved to HDF5.
model.save('my_model.h5') 
Epoch 1/5
32/32 [==============================] - 0s 1ms/step - loss: 1.1472 - sparse_categorical_accuracy: 0.6820
Epoch 2/5
32/32 [==============================] - 0s 1ms/step - loss: 0.4142 - sparse_categorical_accuracy: 0.8860
Epoch 3/5
32/32 [==============================] - 0s 2ms/step - loss: 0.2734 - sparse_categorical_accuracy: 0.9240
Epoch 4/5
32/32 [==============================] - 0s 2ms/step - loss: 0.1980 - sparse_categorical_accuracy: 0.9520
Epoch 5/5
32/32 [==============================] - 0s 2ms/step - loss: 0.1455 - sparse_categorical_accuracy: 0.9720

Now, recreate the model from that file:

# Recreate the exact same model, including its weights and the optimizer
new_model = tf.keras.models.load_model('my_model.h5')

# Show the model architecture
new_model.summary()
Model: "sequential_17"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_49 (Dense)            (None, 512)               401920    
                                                                 
 dropout_14 (Dropout)        (None, 512)               0         
                                                                 
 dense_50 (Dense)            (None, 10)                5130      
                                                                 
=================================================================
Total params: 407,050
Trainable params: 407,050
Non-trainable params: 0
_________________________________________________________________

Check its accuracy:

loss, acc = new_model.evaluate(test_images, test_labels, verbose=2)
print('Restored model, accuracy: {:5.2f}%'.format(100 * acc))
32/32 - 0s - loss: 0.4245 - sparse_categorical_accuracy: 0.8620 - 96ms/epoch - 3ms/step
Restored model, accuracy: 86.20%

Keras saves models by inspecting their architectures. This technique saves everything:

  • The weight values

  • The model’s architecture

  • The model’s training configuration (what you pass to the .compile() method)

  • The optimizer and its state, if any (this enables you to restart training where you left off)

Keras is not able to save the v1.x optimizers (from tf.compat.v1.train) since they aren’t compatible with checkpoints. For v1.x optimizers, you need to re-compile the model after loading—losing the state of the optimizer.

Saving custom objects

If you are using the SavedModel format, you can skip this section. The key difference between HDF5 and SavedModel is that HDF5 uses object configs to save the model architecture, while SavedModel saves the execution graph. Thus, SavedModels are able to save custom objects like subclassed models and custom layers without requiring the original code.

To save custom objects to HDF5, you must do the following:

  1. Define a get_config method in your object, and optionally a from_config classmethod.

  • get_config(self) returns a JSON-serializable dictionary of parameters needed to recreate the object.

  • from_config(cls, config) uses the returned config from get_config to create a new object. By default, this function will use the config as initialization kwargs (return cls(**config)).

  1. Pass the object to the custom_objects argument when loading the model. The argument must be a dictionary mapping the string class name to the Python class. E.g. tf.keras.models.load_model(path, custom_objects={'CustomLayer': CustomLayer})

Refer to the Writing layers and models from scratch tutorial for examples of custom objects and get_config.
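
As a minimal sketch of that pattern (the layer class and file name below are made up for illustration):

import tensorflow as tf

class ScaleLayer(tf.keras.layers.Layer):
  def __init__(self, scale=2.0, **kwargs):
    super().__init__(**kwargs)
    self.scale = scale

  def call(self, inputs):
    return inputs * self.scale

  def get_config(self):
    # JSON-serializable parameters needed to recreate this layer
    config = super().get_config()
    config.update({"scale": self.scale})
    return config

model = tf.keras.Sequential([ScaleLayer(3.0)])
model.build(input_shape=(None, 4))
model.save('scale_model.h5')

# Loading requires mapping the class name back to the Python class
reloaded = tf.keras.models.load_model(
    'scale_model.h5', custom_objects={'ScaleLayer': ScaleLayer})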

TFP Probabilistic Layers: Regression

In this example we show how to fit regression models using TFP’s “probabilistic layers.”

Dependencies & Prerequisites

from pprint import pprint
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

import tensorflow.compat.v2 as tf
tf.enable_v2_behavior()

import tensorflow_probability as tfp

sns.reset_defaults()
#sns.set_style('whitegrid')
#sns.set_context('talk')
sns.set_context(context='talk',font_scale=0.7)

%matplotlib inline

tfd = tfp.distributions

Make things Fast!

Before we dive in, let’s make sure we’re using a GPU for this demo.

To do this, select “Runtime” -> “Change runtime type” -> “Hardware accelerator” -> “GPU”.

The following snippet will verify that we have access to a GPU.

if tf.test.gpu_device_name() != '/device:GPU:0':
  print('WARNING: GPU device not found.')
else:
  print('SUCCESS: Found GPU: {}'.format(tf.test.gpu_device_name()))
SUCCESS: Found GPU: /device:GPU:0
2022-10-27 12:55:33.922075: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-27 12:55:33.922903: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /device:GPU:0 with 6911 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:21:00.0, compute capability: 7.5

Note: if for some reason you cannot access a GPU, this colab will still work. (Training will just take longer.)

Motivation

Wouldn’t it be great if we could use TFP to specify a probabilistic model then simply minimize the negative log-likelihood, i.e.,

negloglik = lambda y, rv_y: -rv_y.log_prob(y)

Well, not only is it possible, but this colab shows how! (In the context of linear regression problems.)

#@title Synthesize dataset.
w0 = 0.125
b0 = 5.
x_range = [-20, 60]

def load_dataset(n=150, n_tst=150):
  np.random.seed(43)
  def s(x):
    g = (x - x_range[0]) / (x_range[1] - x_range[0])
    return 3 * (0.25 + g**2.)
  x = (x_range[1] - x_range[0]) * np.random.rand(n) + x_range[0]
  eps = np.random.randn(n) * s(x)
  y = (w0 * x * (1. + np.sin(x)) + b0) + eps
  x = x[..., np.newaxis]
  x_tst = np.linspace(*x_range, num=n_tst).astype(np.float32)
  x_tst = x_tst[..., np.newaxis]
  return y, x, x_tst

y, x, x_tst = load_dataset()

Case 1: No Uncertainty

# Build model.
model = tf.keras.Sequential([
  tf.keras.layers.Dense(1),
  tfp.layers.DistributionLambda(lambda t: tfd.Normal(loc=t, scale=1)),
])

# Do inference.
model.compile(optimizer=tf.optimizers.Adam(learning_rate=0.01), loss=negloglik)
model.fit(x, y, epochs=1000, verbose=False);

# Profit.
[print(np.squeeze(w.numpy())) for w in model.weights];
yhat = model(x_tst)
assert isinstance(yhat, tfd.Distribution)
0.13465934
5.133879
#@title Figure 1: No uncertainty.
w = np.squeeze(model.layers[-2].kernel.numpy())
b = np.squeeze(model.layers[-2].bias.numpy())

plt.figure(figsize=[6, 1.5])  # inches
#plt.figure(figsize=[8, 5])  # inches
plt.plot(x, y, 'b.', label='observed');
plt.plot(x_tst, yhat.mean(),'r', label='mean', linewidth=4);
plt.ylim(-0.,17);
plt.yticks(np.linspace(0, 15, 4)[1:]);
plt.xticks(np.linspace(*x_range, num=9));

ax=plt.gca();
ax.xaxis.set_ticks_position('bottom')
ax.yaxis.set_ticks_position('left')
ax.spines['left'].set_position(('data', 0))
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
#ax.spines['left'].set_smart_bounds(True)
#ax.spines['bottom'].set_smart_bounds(True)
plt.legend(loc='center left', fancybox=True, framealpha=0., bbox_to_anchor=(1.05, 0.5))

plt.savefig('/tmp/fig1.png', bbox_inches='tight', dpi=300)
[Figure 1: No uncertainty. Observed data with the fitted mean line.]

Case 2: Aleatoric Uncertainty

# Build model.
model = tf.keras.Sequential([
  tf.keras.layers.Dense(1 + 1),
  tfp.layers.DistributionLambda(
      lambda t: tfd.Normal(loc=t[..., :1],
                           scale=1e-3 + tf.math.softplus(0.05 * t[...,1:]))),
])

# Do inference.
model.compile(optimizer=tf.optimizers.Adam(learning_rate=0.01), loss=negloglik)
model.fit(x, y, epochs=1000, verbose=False);

# Profit.
[print(np.squeeze(w.numpy())) for w in model.weights];
yhat = model(x_tst)
assert isinstance(yhat, tfd.Distribution)
[0.126 0.966]
[5.191 6.73 ]
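Here the Dense layer outputs two numbers per input: the first parameterizes the mean, and the second is pushed through a softplus (plus a small 1e-3 floor) so the standard deviation stays strictly positive. A quick way to see the learned heteroscedasticity (illustrative; exact values vary from run to run):

# The predictive stddev now depends on x: small on the left, large on the right.
pred_std = yhat.stddev().numpy().squeeze()
print(pred_std[0], pred_std[-1])  # stddev near the two ends of x_range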
#@title Figure 2: Aleatoric Uncertainty
plt.figure(figsize=[6, 1.5])  # inches
plt.plot(x, y, 'b.', label='observed');

m = yhat.mean()
s = yhat.stddev()

plt.plot(x_tst, m, 'r', linewidth=4, label='mean');
plt.plot(x_tst, m + 2 * s, 'g', linewidth=2, label=r'mean + 2 stddev');
plt.plot(x_tst, m - 2 * s, 'g', linewidth=2, label=r'mean - 2 stddev');

plt.ylim(-0.,17);
plt.yticks(np.linspace(0, 15, 4)[1:]);
plt.xticks(np.linspace(*x_range, num=9));

ax=plt.gca();
ax.xaxis.set_ticks_position('bottom')
ax.yaxis.set_ticks_position('left')
ax.spines['left'].set_position(('data', 0))
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
#ax.spines['left'].set_smart_bounds(True)
#ax.spines['bottom'].set_smart_bounds(True)
plt.legend(loc='center left', fancybox=True, framealpha=0., bbox_to_anchor=(1.05, 0.5))

plt.savefig('/tmp/fig2.png', bbox_inches='tight', dpi=300)
[Figure 2: observed data, fitted mean, and mean ± 2 stddev bands]

Case 3: Epistemic Uncertainty

# Specify the surrogate posterior over `keras.layers.Dense` `kernel` and `bias`.
def posterior_mean_field(kernel_size, bias_size=0, dtype=None):
  n = kernel_size + bias_size
  c = np.log(np.expm1(1.))
  return tf.keras.Sequential([
      tfp.layers.VariableLayer(2 * n, dtype=dtype),
      tfp.layers.DistributionLambda(lambda t: tfd.Independent(
          tfd.Normal(loc=t[..., :n],
                     scale=1e-5 + tf.nn.softplus(c + t[..., n:])),
          reinterpreted_batch_ndims=1)),
  ])
# Specify the prior over `keras.layers.Dense` `kernel` and `bias`.
def prior_trainable(kernel_size, bias_size=0, dtype=None):
  n = kernel_size + bias_size
  return tf.keras.Sequential([
      tfp.layers.VariableLayer(n, dtype=dtype),
      tfp.layers.DistributionLambda(lambda t: tfd.Independent(
          tfd.Normal(loc=t, scale=1),
          reinterpreted_batch_ndims=1)),
  ])
# Build model.
model = tf.keras.Sequential([
  tfp.layers.DenseVariational(1, posterior_mean_field, prior_trainable, kl_weight=1/x.shape[0]),
  tfp.layers.DistributionLambda(lambda t: tfd.Normal(loc=t, scale=1)),
])

# Do inference.
model.compile(optimizer=tf.optimizers.Adam(learning_rate=0.01), loss=negloglik)
model.fit(x, y, epochs=1000, verbose=False);

# Profit.
[print(np.squeeze(w.numpy())) for w in model.weights];
yhat = model(x_tst)
assert isinstance(yhat, tfd.Distribution)
[ 0.139  5.136 -4.123 -2.283]
[0.137 5.122]
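DenseVariational draws a fresh kernel and bias from the learned surrogate posterior on every forward pass, so calling the model twice on the same inputs gives two different fits; that is what the ensemble loop in the figure below exploits. The kl_weight=1/x.shape[0] factor scales the KL term so it is comparable to the per-example likelihood. A quick check (illustrative):

# Two forward passes sample two different weight vectors from the posterior.
m1 = model(x_tst).mean()
m2 = model(x_tst).mean()
print(float(tf.reduce_max(tf.abs(m1 - m2))))  # > 0: predictions differ between calls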
#@title Figure 3: Epistemic Uncertainty
plt.figure(figsize=[6, 1.5])  # inches
plt.clf();
plt.plot(x, y, 'b.', label='observed');

yhats = [model(x_tst) for _ in range(100)]
avgm = np.zeros_like(x_tst[..., 0])
for i, yhat in enumerate(yhats):
  m = np.squeeze(yhat.mean())
  s = np.squeeze(yhat.stddev())
  if i < 25:
    plt.plot(x_tst, m, 'r', label='ensemble means' if i == 0 else None, linewidth=0.5)
  avgm += m
plt.plot(x_tst, avgm/len(yhats), 'r', label='overall mean', linewidth=4)

plt.ylim(-0.,17);
plt.yticks(np.linspace(0, 15, 4)[1:]);
plt.xticks(np.linspace(*x_range, num=9));

ax=plt.gca();
ax.xaxis.set_ticks_position('bottom')
ax.yaxis.set_ticks_position('left')
ax.spines['left'].set_position(('data', 0))
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
#ax.spines['left'].set_smart_bounds(True)
#ax.spines['bottom'].set_smart_bounds(True)
plt.legend(loc='center left', fancybox=True, framealpha=0., bbox_to_anchor=(1.05, 0.5))

plt.savefig('/tmp/fig3.png', bbox_inches='tight', dpi=300)
[Figure 3: observed data, individual ensemble means, and the overall mean]

Case 4: Aleatoric & Epistemic Uncertainty

# Build model.
model = tf.keras.Sequential([
  tfp.layers.DenseVariational(1 + 1, posterior_mean_field, prior_trainable, kl_weight=1/x.shape[0]),
  tfp.layers.DistributionLambda(
      lambda t: tfd.Normal(loc=t[..., :1],
                           scale=1e-3 + tf.math.softplus(0.01 * t[...,1:]))),
])

# Do inference.
model.compile(optimizer=tf.optimizers.Adam(learning_rate=0.01), loss=negloglik)
model.fit(x, y, epochs=1000, verbose=False);

# Profit.
[print(np.squeeze(w.numpy())) for w in model.weights];
yhat = model(x_tst)
assert isinstance(yhat, tfd.Distribution)
[ 0.133  2.172  5.165  3.044 -3.117 -0.886 -2.087 -0.053]
[0.133 2.192 5.172 3.03 ]
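Each forward pass now yields both a sampled weight vector (epistemic) and an input-dependent scale (aleatoric). One common way to summarize the two together, sketched below rather than taken from the original cells, is the law of total variance: average the per-draw variances and add the variance of the per-draw means.

# Combine epistemic and aleatoric uncertainty via the law of total variance (sketch).
draws = [model(x_tst) for _ in range(50)]
means = np.stack([d.mean().numpy().squeeze() for d in draws])    # (50, 150)
stds  = np.stack([d.stddev().numpy().squeeze() for d in draws])  # (50, 150)

total_var = (stds**2).mean(axis=0) + means.var(axis=0)
total_std = np.sqrt(total_var)
print(total_std[0], total_std[-1])  # total predictive stddev at the two ends of x_range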
#@title Figure 4: Both Aleatoric & Epistemic Uncertainty
plt.figure(figsize=[6, 1.5])  # inches
plt.plot(x, y, 'b.', label='observed');

yhats = [model(x_tst) for _ in range(100)]
avgm = np.zeros_like(x_tst[..., 0])
for i, yhat in enumerate(yhats):
  m = np.squeeze(yhat.mean())
  s = np.squeeze(yhat.stddev())
  if i < 15:
    plt.plot(x_tst, m, 'r', label='ensemble means' if i == 0 else None, linewidth=1.)
    plt.plot(x_tst, m + 2 * s, 'g', linewidth=0.5, label='ensemble means + 2 ensemble stdev' if i == 0 else None);
    plt.plot(x_tst, m - 2 * s, 'g', linewidth=0.5, label='ensemble means - 2 ensemble stdev' if i == 0 else None);
  avgm += m
plt.plot(x_tst, avgm/len(yhats), 'r', label='overall mean', linewidth=4)

plt.ylim(-0.,17);
plt.yticks(np.linspace(0, 15, 4)[1:]);
plt.xticks(np.linspace(*x_range, num=9));

ax=plt.gca();
ax.xaxis.set_ticks_position('bottom')
ax.yaxis.set_ticks_position('left')
ax.spines['left'].set_position(('data', 0))
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
#ax.spines['left'].set_smart_bounds(True)
#ax.spines['bottom'].set_smart_bounds(True)
plt.legend(loc='center left', fancybox=True, framealpha=0., bbox_to_anchor=(1.05, 0.5))

plt.savefig('/tmp/fig4.png', bbox_inches='tight', dpi=300)
[Figure 4: observed data, ensemble means ± 2 stddev, and the overall mean]

Case 5: Functional Uncertainty

#@title Custom PSD Kernel
class RBFKernelFn(tf.keras.layers.Layer):
  def __init__(self, **kwargs):
    super(RBFKernelFn, self).__init__(**kwargs)
    dtype = kwargs.get('dtype', None)

    self._amplitude = self.add_variable(
            initializer=tf.constant_initializer(0),
            dtype=dtype,
            name='amplitude')
    
    self._length_scale = self.add_variable(
            initializer=tf.constant_initializer(0),
            dtype=dtype,
            name='length_scale')

  def call(self, x):
    # Never called -- this is just a layer so it can hold variables
    # in a way Keras understands.
    return x

  @property
  def kernel(self):
    return tfp.math.psd_kernels.ExponentiatedQuadratic(
      amplitude=tf.nn.softplus(0.1 * self._amplitude),
      length_scale=tf.nn.softplus(5. * self._length_scale)
    )
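VariationalGaussianProcess only needs an object exposing a .kernel property; wrapping the trainable amplitude and length scale in a Keras Layer is simply a convenient way to let Keras track the variables. The add_variable calls above trigger the deprecation warnings shown in the output further below, which recommend add_weight. An equivalent sketch using add_weight (hypothetical class name RBFKernelFnV2; behaviour is otherwise intended to be identical) would look like:

class RBFKernelFnV2(tf.keras.layers.Layer):
  """Same idea as RBFKernelFn, but built with add_weight instead of add_variable."""
  def __init__(self, **kwargs):
    super().__init__(**kwargs)
    dtype = kwargs.get('dtype', None)
    self._amplitude = self.add_weight(
        initializer=tf.constant_initializer(0), dtype=dtype, name='amplitude')
    self._length_scale = self.add_weight(
        initializer=tf.constant_initializer(0), dtype=dtype, name='length_scale')

  def call(self, x):
    # Never called; the layer only exists to hold trainable variables.
    return x

  @property
  def kernel(self):
    return tfp.math.psd_kernels.ExponentiatedQuadratic(
        amplitude=tf.nn.softplus(0.1 * self._amplitude),
        length_scale=tf.nn.softplus(5. * self._length_scale))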
# For numeric stability, set the default floating-point dtype to float64
tf.keras.backend.set_floatx('float64')

# Build model.
num_inducing_points = 40
model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=[1]),
    tf.keras.layers.Dense(1, kernel_initializer='ones', use_bias=False),
    tfp.layers.VariationalGaussianProcess(
        num_inducing_points=num_inducing_points,
        kernel_provider=RBFKernelFn(),
        event_shape=[1],
        inducing_index_points_initializer=tf.constant_initializer(
            np.linspace(*x_range, num=num_inducing_points,
                        dtype=x.dtype)[..., np.newaxis]),
        unconstrained_observation_noise_variance_initializer=(
            tf.constant_initializer(np.array(0.54).astype(x.dtype))),
    ),
])

# Do inference.
batch_size = 32
loss = lambda y, rv_y: rv_y.variational_loss(
    y, kl_weight=np.array(batch_size, x.dtype) / x.shape[0])
model.compile(optimizer=tf.optimizers.Adam(learning_rate=0.01), loss=loss)
model.fit(x, y, batch_size=batch_size, epochs=1000, verbose=False)

# Profit.
yhat = model(x_tst)
assert isinstance(yhat, tfd.Distribution)
/tmp/ipykernel_81160/1709427333.py:7: UserWarning: `layer.add_variable` is deprecated and will be removed in a future version. Please use the `layer.add_weight()` method instead.
  self._amplitude = self.add_variable(
/tmp/ipykernel_81160/1709427333.py:12: UserWarning: `layer.add_variable` is deprecated and will be removed in a future version. Please use the `layer.add_weight()` method instead.
  self._length_scale = self.add_variable(
WARNING:tensorflow:From /home/obm/Prog/miniconda3/envs/qml/lib/python3.8/site-packages/tensorflow_probability/python/distributions/distribution.py:342: calling GaussianProcess.__init__ (from tensorflow_probability.python.distributions.gaussian_process) with jitter is deprecated and will be removed after 2021-05-10.
Instructions for updating:
`jitter` is deprecated; please use `marginal_fn` directly.
/home/obm/Prog/miniconda3/envs/qml/lib/python3.8/site-packages/tensorflow_probability/python/distributions/gaussian_process.py:402: UserWarning: Unable to detect statically whether the number of index_points is 1. As a result, defaulting to treating the marginal GP at `index_points` as a multivariate Gaussian. This makes some methods, like `cdf` unavailable.
  warnings.warn(
WARNING:tensorflow:From /home/obm/Prog/miniconda3/envs/qml/lib/python3.8/site-packages/tensorflow_probability/python/internal/auto_composite_tensor.py:97: GaussianProcess.jitter (from tensorflow_probability.python.distributions.gaussian_process) is deprecated and will be removed after 2022-02-04.
Instructions for updating:
the `jitter` property of `tfd.GaussianProcess` is deprecated; use the `marginal_fn` property instead.
2022-10-27 12:56:17.168452: I tensorflow/core/util/cuda_solvers.cc:179] Creating GpuSolver handles for stream 0x55ce4315f4c0
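Unlike the earlier cases, the VariationalGaussianProcess head returns a distribution over whole functions, so each call to yhat.sample() in the figure below draws an entire curve evaluated at the test points. A quick check (illustrative):

# Each sample is a full function evaluated at the test points.
sample = yhat.sample().numpy()
print(sample.shape)  # expected (150, 1) here: 150 test points, event_shape=[1]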
#@title Figure 5: Functional Uncertainty

y, x, _ = load_dataset()

plt.figure(figsize=[6, 1.5])  # inches
plt.plot(x, y, 'b.', label='observed');

num_samples = 7
for i in range(num_samples):
  sample_ = yhat.sample().numpy()
  plt.plot(x_tst,
           sample_[..., 0].T,
           'r',
           linewidth=0.9,
           label='ensemble means' if i == 0 else None);

plt.ylim(-0.,17);
plt.yticks(np.linspace(0, 15, 4)[1:]);
plt.xticks(np.linspace(*x_range, num=9));

ax=plt.gca();
ax.xaxis.set_ticks_position('bottom')
ax.yaxis.set_ticks_position('left')
ax.spines['left'].set_position(('data', 0))
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
#ax.spines['left'].set_smart_bounds(True)
#ax.spines['bottom'].set_smart_bounds(True)
plt.legend(loc='center left', fancybox=True, framealpha=0., bbox_to_anchor=(1.05, 0.5))

plt.savefig('/tmp/fig5.png', bbox_inches='tight', dpi=300)
[Figure 5: observed data with sampled functions from the variational GP]