1. Data loading and preprocessing

Neural networks don't process raw data, like text files, encoded JPEG image files, or CSV files. They process vectorized and standardized representations.

  • Text files need to be read into string tensors, then split into words; finally, the words need to be indexed and turned into integer tensors.
  • Images need to be read and decoded into integer tensors, then converted to floating point and normalized to small values (such as [0, 1] or [-1, 1]).
  • CSV data needs to be parsed, with numerical features converted to floating-point tensors and categorical features indexed and converted to integer tensors. Then each feature typically needs to be normalized to zero mean and unit variance.
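As a plain-Python sketch (deliberately not using any Keras utilities), the text steps above — reading strings, splitting into words, then indexing the words into integers — look like this; the toy corpus is made up for illustration:

```python
# Hypothetical toy corpus; in practice these strings would come from text files.
texts = ["the cat sat", "the dog sat"]

# Split each string into words.
tokenized = [t.split() for t in texts]

# Build a word -> integer index (index 0 is often reserved for padding).
vocab = {}
for words in tokenized:
    for w in words:
        if w not in vocab:
            vocab[w] = len(vocab) + 1

# Turn each text into an integer sequence.
sequences = [[vocab[w] for w in words] for words in tokenized]
print(sequences)  # [[1, 2, 3], [1, 4, 3]]
```

The TextVectorization layer introduced later in this article does exactly this kind of work for you, with a vocabulary learned from the training data.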

2. Data loading

Keras models accept three types of inputs:

  • NumPy arrays, just like Scikit-Learn and many other Python-based libraries. This is a good option if your data fits in memory.
  • TensorFlow Dataset objects.
  • Python generators that yield batches of data (such as subclasses of the keras.utils.Sequence class).
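To make the third option concrete, a Python generator that yields batches can be as simple as the following sketch (the arrays and names are made up for illustration):

```python
import numpy as np

def batch_generator(samples, labels, batch_size):
    """Yield (batch_of_samples, batch_of_labels) tuples."""
    for i in range(0, len(samples), batch_size):
        yield samples[i:i + batch_size], labels[i:i + batch_size]

# Toy data: 10 samples of 4 features each.
x = np.arange(40, dtype="float32").reshape(10, 4)
y = np.arange(10)

batches = list(batch_generator(x, y, batch_size=4))
print(len(batches))          # 3 batches: sizes 4, 4, and 2
print(batches[-1][0].shape)  # (2, 4) -- the last batch is smaller
```

keras.utils.Sequence adds indexing and length on top of this idea, which lets Keras shuffle and parallelize batch loading safely.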

Keras features a range of utilities to help you turn raw data on disk into a Dataset:

  • tf.keras.preprocessing.image_dataset_from_directory turns image files sorted into class-specific folders into a labeled dataset of image tensors.
  • tf.keras.preprocessing.text_dataset_from_directory does the same for text files.
  • tf.data.experimental.make_csv_dataset loads structured data from CSV files.

In order to load image data as (image, label) pairs, you need to have your image files sorted by class into different folders, like this:

  main_directory/
  ...class_a/
  ......a_image_1.jpg
  ......a_image_2.jpg
  ...class_b/
  ......b_image_1.jpg
  ......b_image_2.jpg

Then, you can do:

  dataset = keras.preprocessing.image_dataset_from_directory(
      "path/to/main_directory",
      batch_size=64,
      image_size=(224, 224))
  # For demonstration, iterate over the batches yielded by the dataset:
  for data, labels in dataset:
      print(data.shape)    # (64, 224, 224, 3)
      print(data.dtype)    # float32
      print(labels.shape)  # (64,)
      print(labels.dtype)  # int32

The label of a sample is the rank of its folder in alphanumeric order. Naturally, this can also be configured explicitly by passing, e.g., class_names=["class_a", "class_b"], in which case label 0 will be class_a and label 1 will be class_b.
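The "rank in alphanumeric order" rule can be reproduced with plain Python; this sketch mirrors the behavior described above, not the actual Keras implementation:

```python
# Hypothetical class folders found under main_directory/ (listing order is arbitrary).
folders = ["class_b", "class_a"]

# Labels follow the alphanumeric order of the folder names.
class_names = sorted(folders)
label_for = {name: i for i, name in enumerate(class_names)}
print(label_for)  # {'class_a': 0, 'class_b': 1}
```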

Have a try: you can run this code in Colab.

  !curl -O https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip
  !ls
  !unzip -q kagglecatsanddogs_3367a.zip
  !ls PetImages

  import tensorflow as tf
  import tensorflow.keras as keras
  import os

  # Filter out corrupted images (files without a "JFIF" header).
  num_skipped = 0
  for folder_name in ("Cat", "Dog"):
      folder_path = os.path.join("PetImages", folder_name)
      for fname in os.listdir(folder_path):
          fpath = os.path.join(folder_path, fname)
          try:
              fobj = open(fpath, "rb")
              is_jfif = tf.compat.as_bytes("JFIF") in fobj.peek(10)
          finally:
              fobj.close()
          if not is_jfif:
              num_skipped += 1
              # Delete corrupted image
              os.remove(fpath)
  print("Deleted %d images" % num_skipped)

  train_dataset = keras.preprocessing.image_dataset_from_directory(
      "PetImages",
      batch_size=64,
      validation_split=0.2,
      subset="training",
      seed=1337,
      image_size=(224, 224))
  val_dataset = keras.preprocessing.image_dataset_from_directory(
      "PetImages",
      batch_size=64,
      validation_split=0.2,
      subset="validation",
      seed=1337,
      image_size=(224, 224))

  # Inspect one batch:
  for data, labels in train_dataset.take(1):
      print(data.shape)    # (64, 224, 224, 3)
      print(data.dtype)    # float32
      print(labels.shape)  # (64,)
      print(labels.dtype)  # int32

3. Data preprocessing with Keras

Once your data is in the form of string/int/float NumPy arrays, or a Dataset object (or Python generator) that yields batches of string/int/float tensors, it is time to preprocess the data. This can mean:

  • Tokenization of string data, followed by token indexing
  • Feature normalization
  • Rescaling the data to small values (in general, input values to a neural network should be close to zero; typically we expect either data with zero mean and unit variance, or data in the [0, 1] range)

In general, you should seek to do data preprocessing as part of your model as much as possible, not via an external preprocessing pipeline. That's because external data preprocessing makes your models less portable when it is time to use them in production.

It is much easier to simply export an end-to-end model that already includes preprocessing. The ideal model should expect as input something as close as possible to raw data:

  • An image model should expect RGB pixel values in the [0, 255] range.
  • A text model should accept strings of UTF-8 characters.

That way, the consumer of the exported model doesn't have to know about the preprocessing pipeline.

Using Keras preprocessing layers

In Keras, you do in-model data preprocessing via preprocessing layers. This includes:

  • Vectorizing raw strings of text via the TextVectorization layer
  • Feature normalization via the Normalization layer
  • Image rescaling, cropping, or image data augmentation

Example: vectorizing text
  from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
  import numpy as np

  train_data = np.array([
      [....],
      [...]
  ])
  vectorizer = TextVectorization(output_mode='int')  # encode each token as an integer index
  vectorizer.adapt(train_data)           # learn the vocabulary from the training data
  integer_data = vectorizer(train_data)  # transform

Example: normalizing features
  from tensorflow.keras.layers.experimental.preprocessing import Normalization
  import numpy as np

  train_data = np.random.randint(0, 256, size=(64, 224, 224, 3)).astype("float32")
  normalizer = Normalization(axis=-1)       # normalize along the last (channel) dimension
  normalizer.adapt(train_data)              # compute mean and variance from the data
  normalized_data = normalizer(train_data)  # transform
  # The normalizer produces data with variance 1 and mean 0:
  print(np.var(normalized_data))   # ~1.0
  print(np.mean(normalized_data))  # ~0.0

Example: rescaling and center-cropping images

Both the Rescaling layer and the CenterCrop layer are stateless, so it isn't necessary to call adapt() in this case:

  from tensorflow.keras.layers.experimental.preprocessing import CenterCrop, Rescaling
  import numpy as np

  train_data = np.random.randint(0, 256, size=(64, 224, 224, 3)).astype("float32")
  cropper = CenterCrop(height=150, width=150)
  scaler = Rescaling(scale=1.0 / 255.0)
  # Use without calling adapt():
  output_data = scaler(cropper(train_data))
  print("shape:", output_data.shape)  # (64, 150, 150, 3)
  print("min:", np.min(output_data), "\tmax:", np.max(output_data))  # min: 0.0  max: 1.0

You can run this code in Colab:
  !curl -O https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip
  !ls
  !unzip -q kagglecatsanddogs_3367a.zip
  !ls PetImages

  import tensorflow as tf
  import tensorflow.keras as keras
  import numpy as np
  import os
  from tensorflow.keras.layers.experimental.preprocessing import CenterCrop, Rescaling, Normalization

  # Filter out corrupted images (files without a "JFIF" header).
  num_skipped = 0
  for folder_name in ("Cat", "Dog"):
      folder_path = os.path.join("PetImages", folder_name)
      for fname in os.listdir(folder_path):
          fpath = os.path.join(folder_path, fname)
          try:
              fobj = open(fpath, "rb")
              is_jfif = tf.compat.as_bytes("JFIF") in fobj.peek(10)
          finally:
              fobj.close()
          if not is_jfif:
              num_skipped += 1
              # Delete corrupted image
              os.remove(fpath)
  print("Deleted %d images" % num_skipped)

  train_dataset = keras.preprocessing.image_dataset_from_directory(
      "PetImages",
      batch_size=64,
      validation_split=0.2,
      subset="training",
      seed=1337,
      image_size=(224, 224))
  val_dataset = keras.preprocessing.image_dataset_from_directory(
      "PetImages",
      batch_size=64,
      validation_split=0.2,
      subset="validation",
      seed=1337,
      image_size=(224, 224))

  for data, labels in train_dataset.take(1):
      print(data.shape)    # (64, 224, 224, 3)
      print(data.dtype)    # float32
      print(labels.shape)  # (64,)
      print(labels.dtype)  # int32

      # Normalize one batch of images.
      normalizer = Normalization(axis=-1)
      normalizer.adapt(data)
      normalized_data = normalizer(data)
      print(np.var(normalized_data.numpy()))   # ~1.0
      print(np.mean(normalized_data.numpy()))  # ~0.0

      # Center-crop and rescale the same batch.
      cropper = CenterCrop(height=150, width=150)
      scaler = Rescaling(scale=1.0 / 255.0)
      output_data = scaler(cropper(data))
      print(output_data.shape, np.min(output_data), np.max(output_data))
      # (64, 150, 150, 3) 0.0 1.0

4. Building models with the Keras Functional API

A "layer" is a simple input-output transformation (such as the scaling and center-cropping transformations above). For instance, here is a linear projection layer that maps its inputs to a 16-dimensional feature space:

  dense = keras.layers.Dense(units=16)
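Under the hood, such a layer computes y = xW + b, where W and b are trainable weights. A hedged NumPy equivalent, with made-up random inputs and weights, looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 32)).astype("float32")   # batch of 8 samples, 32 features each
W = rng.normal(size=(32, 16)).astype("float32")  # kernel, as Dense(units=16) would create
b = np.zeros(16, dtype="float32")                # bias, initialized to zeros

y = x @ W + b  # linear projection into a 16-dimensional feature space
print(y.shape)  # (8, 16)
```

In the real layer, W and b are created lazily when the layer first sees its input shape, and are updated during training.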

A "model" is a directed acyclic graph of layers. You can think of a model as a "bigger layer" that encompasses multiple sublayers and that can be trained via exposure to data.

If any dimension of your input can vary, you can specify it as None. For instance, an input for a 224x224 RGB image would have shape (224, 224, 3), but an input for an RGB image of any size would have shape (None, None, 3).

You can run this code in Colab:
  !curl -O https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip
  !ls
  !unzip -q kagglecatsanddogs_3367a.zip
  !ls PetImages

  import tensorflow as tf
  import tensorflow.keras as keras
  from tensorflow.keras import layers
  from tensorflow.keras.layers.experimental.preprocessing import CenterCrop, Rescaling
  import os

  # Filter out corrupted images (files without a "JFIF" header).
  num_skipped = 0
  for folder_name in ("Cat", "Dog"):
      folder_path = os.path.join("PetImages", folder_name)
      for fname in os.listdir(folder_path):
          fpath = os.path.join(folder_path, fname)
          try:
              fobj = open(fpath, "rb")
              is_jfif = tf.compat.as_bytes("JFIF") in fobj.peek(10)
          finally:
              fobj.close()
          if not is_jfif:
              num_skipped += 1
              # Delete corrupted image
              os.remove(fpath)
  print("Deleted %d images" % num_skipped)

  train_dataset = keras.preprocessing.image_dataset_from_directory(
      "PetImages",
      batch_size=64,
      validation_split=0.2,
      subset="training",
      seed=1337,
      image_size=(224, 224))
  val_dataset = keras.preprocessing.image_dataset_from_directory(
      "PetImages",
      batch_size=64,
      validation_split=0.2,
      subset="validation",
      seed=1337,
      image_size=(224, 224))

  for data, labels in train_dataset.take(1):
      print(data.shape)    # (64, 224, 224, 3)
      print(data.dtype)    # float32
      print(labels.shape)  # (64,)
      print(labels.dtype)  # int32

  # --------- build model start ------------------------
  inputs = keras.Input(shape=(None, None, 3))  # RGB images of any size
  x = CenterCrop(height=150, width=150)(inputs)
  x = Rescaling(scale=1.0 / 255.0)(x)  # rescale inputs to the [0, 1] range
  x = layers.Conv2D(32, kernel_size=(3, 3), activation="relu")(x)
  x = layers.MaxPooling2D(pool_size=(3, 3))(x)
  x = layers.Conv2D(32, kernel_size=(3, 3), activation="relu")(x)
  x = layers.MaxPooling2D(pool_size=(3, 3))(x)
  x = layers.Conv2D(32, kernel_size=(3, 3), activation="relu")(x)
  x = layers.GlobalAveragePooling2D()(x)
  outputs = layers.Dense(2, activation="softmax")(x)
  model = keras.Model(inputs=inputs, outputs=outputs)
  # --------- build model end ------------------------

  processed_data = model(data)
  print(processed_data.shape)  # (64, 2)

The model summary is:
  Model: "model"
  _________________________________________________________________
  Layer (type)                 Output Shape              Param #
  =================================================================
  input_2 (InputLayer)         [(None, None, None, 3)]  0
  _________________________________________________________________
  center_crop_4 (CenterCrop)   (None, 150, 150, 3)      0
  _________________________________________________________________
  rescaling_4 (Rescaling)      (None, 150, 150, 3)      0
  _________________________________________________________________
  conv2d_1 (Conv2D)            (None, 148, 148, 32)     896
  _________________________________________________________________
  max_pooling2d (MaxPooling2D) (None, 49, 49, 32)       0
  _________________________________________________________________
  conv2d_2 (Conv2D)            (None, 47, 47, 32)       9248
  _________________________________________________________________
  max_pooling2d_1 (MaxPooling2 (None, 15, 15, 32)       0
  _________________________________________________________________
  conv2d_3 (Conv2D)            (None, 13, 13, 32)       9248
  _________________________________________________________________
  global_average_pooling2d (Gl (None, 32)               0
  _________________________________________________________________
  dense (Dense)                (None, 2)                66
  =================================================================
  Total params: 19,458
  Trainable params: 19,458
  Non-trainable params: 0
  _________________________________________________________________

5. Training models with fit()

We have the PetImages dataset; now let's start training a model.

Before you can call fit(), you need to specify an optimizer and a loss function (we assume you are already familiar with these concepts). This is the compile() step:

  model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=1e-3),
                loss=keras.losses.CategoricalCrossentropy())

Loss and optimizer can also be specified via their string identifiers (in this case their default constructor argument values are used):

  model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

Once your model is compiled, you can start fitting the model to the data. Here is what fitting a model looks like with NumPy data:

  model.fit(numpy_array_of_samples, numpy_array_of_labels, batch_size=batch_size, epochs=epochs)

Besides the data, you have to specify two key parameters: batch_size and epochs. When fitting on a Dataset object instead, the data it yields is expected to be already batched, so you don't need to specify batch_size.
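The relationship between batch_size, epochs, and the number of gradient updates is simple arithmetic. A small sketch, assuming a training split of about 18,728 images (which would produce the 293 steps per epoch seen in the training log further down):

```python
import math

num_samples = 18728  # assumed training split size; illustrative only
batch_size = 64
epochs = 10

# One gradient update per batch; the last batch may be smaller.
steps_per_epoch = math.ceil(num_samples / batch_size)
total_updates = steps_per_epoch * epochs
print(steps_per_epoch)  # 293
print(total_updates)    # 2930
```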

Let's look at it in practice with the PetImages dataset. You can run this code in Colab:
  !curl -O https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip
  !ls
  !unzip -q kagglecatsanddogs_3367a.zip
  !ls PetImages

  import tensorflow as tf
  import tensorflow.keras as keras
  from tensorflow.keras import layers
  from tensorflow.keras.layers.experimental.preprocessing import CenterCrop, Rescaling
  import os

  # Filter out corrupted images (files without a "JFIF" header).
  num_skipped = 0
  for folder_name in ("Cat", "Dog"):
      folder_path = os.path.join("PetImages", folder_name)
      for fname in os.listdir(folder_path):
          fpath = os.path.join(folder_path, fname)
          try:
              fobj = open(fpath, "rb")
              is_jfif = tf.compat.as_bytes("JFIF") in fobj.peek(10)
          finally:
              fobj.close()
          if not is_jfif:
              num_skipped += 1
              # Delete corrupted image
              os.remove(fpath)
  print("Deleted %d images" % num_skipped)

  train_dataset = keras.preprocessing.image_dataset_from_directory(
      "PetImages",
      batch_size=64,
      validation_split=0.2,
      subset="training",
      seed=1337,
      image_size=(224, 224))
  val_dataset = keras.preprocessing.image_dataset_from_directory(
      "PetImages",
      batch_size=64,
      validation_split=0.2,
      subset="validation",
      seed=1337,
      image_size=(224, 224))

  for data, labels in train_dataset.take(1):
      print(data.shape)    # (64, 224, 224, 3)
      print(data.dtype)    # float32
      print(labels.shape)  # (64,)
      print(labels.dtype)  # int32

  # --------- build model start ------------------------
  inputs = keras.Input(shape=(None, None, 3))
  x = CenterCrop(height=150, width=150)(inputs)
  x = Rescaling(scale=1.0 / 255.0)(x)  # rescale inputs to the [0, 1] range
  x = layers.Conv2D(32, kernel_size=(3, 3), activation="relu")(x)
  x = layers.MaxPooling2D(pool_size=(3, 3))(x)
  x = layers.Conv2D(32, kernel_size=(3, 3), activation="relu")(x)
  x = layers.MaxPooling2D(pool_size=(3, 3))(x)
  x = layers.Conv2D(32, kernel_size=(3, 3), activation="relu")(x)
  x = layers.GlobalAveragePooling2D()(x)
  outputs = layers.Dense(2, activation="softmax")(x)
  model = keras.Model(inputs=inputs, outputs=outputs)
  # --------- build model end ------------------------

  processed_data = model(data)
  print(processed_data.shape)  # (64, 2)

  # Our labels are integer indices, so we use SparseCategoricalCrossentropy
  # (no one-hot encoding needed).
  model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=1e-3),
                loss=keras.losses.SparseCategoricalCrossentropy(),
                metrics=["acc"])

  """
  The full fit() signature, for reference:
  model.fit(x=None, y=None, batch_size=None, epochs=1, verbose=1,
            callbacks=None, validation_split=0.0, validation_data=None,
            shuffle=True, class_weight=None, sample_weight=None,
            initial_epoch=0, steps_per_epoch=None, validation_steps=None,
            validation_batch_size=None, validation_freq=1,
            max_queue_size=10, workers=1, use_multiprocessing=False)
  """

  # Fit on the Dataset (no batch_size needed: the dataset is already batched).
  history = model.fit(train_dataset,
                      epochs=10,
                      validation_data=val_dataset,
                      use_multiprocessing=True)
Epoch 1/10
293/293 [==============================] - 57s 192ms/step - loss: 0.5152 - acc: 0.7455 - val_loss: 0.5316 - val_acc: 0.7307
Epoch 2/10
293/293 [==============================] - 57s 194ms/step - loss: 0.4909 - acc: 0.7676 - val_loss: 0.5127 - val_acc: 0.7469
Epoch 3/10
293/293 [==============================] - 57s 192ms/step - loss: 0.4934 - acc: 0.7669 - val_loss: 0.4811 - val_acc: 0.7702
Epoch 4/10
293/293 [==============================] - 59s 198ms/step - loss: 0.4862 - acc: 0.7689 - val_loss: 0.4812 - val_acc: 0.7768
Epoch 5/10
293/293 [==============================] - 57s 192ms/step - loss: 0.4686 - acc: 0.7787 - val_loss: 0.5609 - val_acc: 0.7215
Epoch 6/10
293/293 [==============================] - 57s 191ms/step - loss: 0.4705 - acc: 0.7782 - val_loss: 0.4914 - val_acc: 0.7655
Epoch 7/10
293/293 [==============================] - 57s 193ms/step - loss: 0.4652 - acc: 0.7825 - val_loss: 0.6022 - val_acc: 0.6873
Epoch 8/10
293/293 [==============================] - 57s 193ms/step - loss: 0.4570 - acc: 0.7908 - val_loss: 0.4484 - val_acc: 0.7945
Epoch 9/10
293/293 [==============================] - 57s 191ms/step - loss: 0.4472 - acc: 0.7953 - val_loss: 0.4738 - val_acc: 0.7785
Epoch 10/10
293/293 [==============================] - 58s 196ms/step - loss: 0.4434 - acc: 0.7995 - val_loss: 0.5420 - val_acc: 0.7322

Tips:
  1. The fit() call returns a "history" object which records what happened over the course of training. The history.history dict contains per-epoch time series of metric values.
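For instance, the per-epoch values can be used to pick the best epoch. The dict below is a hand-made stand-in for what history.history would contain after a run like the one above (the real keys depend on the compiled metrics):

```python
# Stand-in for history.history after training with metrics=["acc"]
# and a validation set; values are illustrative only.
history_dict = {
    "loss":     [0.5152, 0.4909, 0.4934],
    "acc":      [0.7455, 0.7676, 0.7669],
    "val_loss": [0.5316, 0.5127, 0.4811],
    "val_acc":  [0.7307, 0.7469, 0.7702],
}

# Epoch with the lowest validation loss (0-indexed).
best_epoch = min(range(len(history_dict["val_loss"])),
                 key=lambda i: history_dict["val_loss"][i])
print(best_epoch + 1, history_dict["val_acc"][best_epoch])  # 3 0.7702
```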

6. Using callbacks for checkpointing and more

If training goes on for more than a few minutes, it's important to save your model at regular intervals during training. You can then use your saved models to restart training in case your training process crashes (this is important for multi-worker distributed training, since with many workers, at least one of them is bound to fail at some point).

An important feature of Keras is callbacks, configured in fit(). Callbacks are objects that get called by the model at different points during training, in particular:

  • At the beginning and end of each batch
  • At the beginning and end of each epoch
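Conceptually, the training loop drives the callbacks by invoking their hooks at those points. A minimal pure-Python sketch of that pattern (not the actual Keras code, which passes logs and supports batch-level hooks too):

```python
class RecordingCallback:
    """Toy callback mirroring the epoch-level hook points Keras exposes."""
    def __init__(self):
        self.events = []

    def on_epoch_begin(self, epoch):
        self.events.append(("epoch_begin", epoch))

    def on_epoch_end(self, epoch):
        self.events.append(("epoch_end", epoch))

def toy_training_loop(epochs, callbacks):
    for epoch in range(epochs):
        for cb in callbacks:
            cb.on_epoch_begin(epoch)
        # ... one epoch of batch-level training would run here ...
        for cb in callbacks:
            cb.on_epoch_end(epoch)

cb = RecordingCallback()
toy_training_loop(epochs=2, callbacks=[cb])
print(cb.events)
# [('epoch_begin', 0), ('epoch_end', 0), ('epoch_begin', 1), ('epoch_end', 1)]
```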

You can use callbacks like this, passing the list to fit():

callbacks = [
    keras.callbacks.ModelCheckpoint(
        filepath = 'path/to/my/model_{epoch}',
        save_freq = 'epoch'
    )
]
model.fit(train_dataset, epochs=10, callbacks=callbacks)