1. Data loading and preprocessing

Neural networks don't process raw data, like text files, encoded JPEG image files, or CSV files. They process vectorized and standardized representations.

  • Text files need to be read into string tensors, then split into words; finally, the words need to be indexed and turned into integer tensors.
  • Images need to be read and decoded into integer tensors, then converted to floating point and normalized to small values (such as [0, 1] or [-1, 1]).
  • CSV data needs to be parsed, with numerical features converted to floating-point tensors and categorical features indexed and converted to integer tensors. Then each feature typically needs to be normalized to zero mean and unit variance.
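As a plain-Python sketch (deliberately not using any Keras utilities), the text steps above — reading strings, splitting into words, then indexing the words into integers — look like this; the toy corpus is made up for illustration:

```python
# Hypothetical toy corpus; in practice these strings would come from text files.
texts = ["the cat sat", "the dog sat"]

# Split each string into words.
tokenized = [t.split() for t in texts]

# Build a word -> integer index (index 0 is often reserved for padding).
vocab = {}
for words in tokenized:
    for w in words:
        if w not in vocab:
            vocab[w] = len(vocab) + 1

# Turn each text into an integer sequence.
sequences = [[vocab[w] for w in words] for words in tokenized]
print(sequences)  # [[1, 2, 3], [1, 4, 3]]
```

The TextVectorization layer introduced later in this article does exactly this kind of work for you, with a vocabulary learned from the training data.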

2. Data loading

Keras models accept three types of inputs:

  • NumPy arrays, just like Scikit-Learn and many other Python-based libraries. This is a good option if your data fits in memory.
  • TensorFlow Dataset objects.
  • Python generators that yield batches of data (such as subclasses of the keras.utils.Sequence class).
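To make the third option concrete, a Python generator that yields batches can be as simple as the following sketch (the arrays and names are made up for illustration):

```python
import numpy as np

def batch_generator(samples, labels, batch_size):
    """Yield (batch_of_samples, batch_of_labels) tuples."""
    for i in range(0, len(samples), batch_size):
        yield samples[i:i + batch_size], labels[i:i + batch_size]

# Toy data: 10 samples of 4 features each.
x = np.arange(40, dtype="float32").reshape(10, 4)
y = np.arange(10)

batches = list(batch_generator(x, y, batch_size=4))
print(len(batches))          # 3 batches: sizes 4, 4, and 2
print(batches[-1][0].shape)  # (2, 4) -- the last batch is smaller
```

keras.utils.Sequence adds indexing and length on top of this idea, which lets Keras shuffle and parallelize batch loading safely.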

Keras features a range of utilities to help you turn raw data on disk into a Dataset:

  • tf.keras.preprocessing.image_dataset_from_directory turns image files sorted into class-specific folders into a labeled dataset of image tensors.
  • tf.keras.preprocessing.text_dataset_from_directory does the same for text files.
  • tf.data.experimental.make_csv_dataset loads structured data from CSV files.

In order to load image data as (image, label) pairs, you need to have your image files sorted by class into different folders, like this:

  main_directory/
  ...class_a/
  ......a_image_1.jpg
  ......a_image_2.jpg
  ...class_b/
  ......b_image_1.jpg
  ......b_image_2.jpg

Then, you can do:

  dataset = keras.preprocessing.image_dataset_from_directory(
      "path/to/main_directory",
      batch_size=64,
      image_size=(224, 224))
  # For demonstration, iterate over the batches yielded by the dataset:
  for data, labels in dataset:
      print(data.shape)    # (64, 224, 224, 3)
      print(data.dtype)    # float32
      print(labels.shape)  # (64,)
      print(labels.dtype)  # int32

The label of a sample is the rank of its folder in alphanumeric order. Naturally, this can also be configured explicitly by passing, e.g., class_names=["class_a", "class_b"], in which case label 0 will be class_a and label 1 will be class_b.
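The "rank in alphanumeric order" rule can be reproduced with plain Python; this sketch mirrors the behavior described above, not the actual Keras implementation:

```python
# Hypothetical class folders found under main_directory/ (listing order is arbitrary).
folders = ["class_b", "class_a"]

# Labels follow the alphanumeric order of the folder names.
class_names = sorted(folders)
label_for = {name: i for i, name in enumerate(class_names)}
print(label_for)  # {'class_a': 0, 'class_b': 1}
```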

Have a try: you can run this code in Colab.

  !curl -O https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip
  !ls
  !unzip -q kagglecatsanddogs_3367a.zip
  !ls PetImages

  import tensorflow as tf
  import tensorflow.keras as keras
  import os

  # Filter out corrupted images (files without a "JFIF" header).
  num_skipped = 0
  for folder_name in ("Cat", "Dog"):
      folder_path = os.path.join("PetImages", folder_name)
      for fname in os.listdir(folder_path):
          fpath = os.path.join(folder_path, fname)
          try:
              fobj = open(fpath, "rb")
              is_jfif = tf.compat.as_bytes("JFIF") in fobj.peek(10)
          finally:
              fobj.close()
          if not is_jfif:
              num_skipped += 1
              # Delete corrupted image
              os.remove(fpath)
  print("Deleted %d images" % num_skipped)

  train_dataset = keras.preprocessing.image_dataset_from_directory(
      "PetImages",
      batch_size=64,
      validation_split=0.2,
      subset="training",
      seed=1337,
      image_size=(224, 224))
  val_dataset = keras.preprocessing.image_dataset_from_directory(
      "PetImages",
      batch_size=64,
      validation_split=0.2,
      subset="validation",
      seed=1337,
      image_size=(224, 224))

  # Inspect one batch:
  for data, labels in train_dataset.take(1):
      print(data.shape)    # (64, 224, 224, 3)
      print(data.dtype)    # float32
      print(labels.shape)  # (64,)
      print(labels.dtype)  # int32

3. Data preprocessing with Keras

Once your data is in the form of string/int/float NumPy arrays, or a Dataset object (or Python generator) that yields batches of string/int/float tensors, it is time to preprocess the data. This can mean:

  • Tokenization of string data, followed by token indexing
  • Feature normalization
  • Rescaling the data to small values (in general, input values to a neural network should be close to zero; typically we expect either data with zero mean and unit variance, or data in the [0, 1] range)

In general, you should seek to do data preprocessing as part of your model as much as possible, not via an external preprocessing pipeline. That's because external data preprocessing makes your models less portable when it is time to use them in production.

It is much easier to simply export an end-to-end model that already includes preprocessing. The ideal model should expect as input something as close as possible to raw data:

  • An image model should expect RGB pixel values in the [0, 255] range.
  • A text model should accept strings of UTF-8 characters.

That way, the consumer of the exported model doesn't have to know about the preprocessing pipeline.

Using Keras preprocessing layers

In Keras, you do in-model data preprocessing via preprocessing layers. This includes:

  • Vectorizing raw strings of text via the TextVectorization layer
  • Feature normalization via the Normalization layer
  • Image rescaling, cropping, or image data augmentation

Example: vectorizing text
  from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
  import numpy as np

  train_data = np.array([
      [....],
      [...]
  ])
  vectorizer = TextVectorization(output_mode='int')  # encode each token as an integer index
  vectorizer.adapt(train_data)           # learn the vocabulary from the training data
  integer_data = vectorizer(train_data)  # transform

Example: normalizing features
  from tensorflow.keras.layers.experimental.preprocessing import Normalization
  import numpy as np

  train_data = np.random.randint(0, 256, size=(64, 224, 224, 3)).astype("float32")
  normalizer = Normalization(axis=-1)       # normalize along the last (channel) dimension
  normalizer.adapt(train_data)              # compute mean and variance from the data
  normalized_data = normalizer(train_data)  # transform
  # The normalizer produces data with variance 1 and mean 0:
  print(np.var(normalized_data))   # ~1.0
  print(np.mean(normalized_data))  # ~0.0

Example: rescaling and center-cropping images

Both the Rescaling layer and the CenterCrop layer are stateless, so it isn't necessary to call adapt() in this case:

  from tensorflow.keras.layers.experimental.preprocessing import CenterCrop, Rescaling
  import numpy as np

  train_data = np.random.randint(0, 256, size=(64, 224, 224, 3)).astype("float32")
  cropper = CenterCrop(height=150, width=150)
  scaler = Rescaling(scale=1.0 / 255.0)
  # Use without calling adapt():
  output_data = scaler(cropper(train_data))
  print("shape:", output_data.shape)  # (64, 150, 150, 3)
  print("min:", np.min(output_data), "\tmax:", np.max(output_data))  # min: 0.0  max: 1.0

You can run this code in Colab:
  !curl -O https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip
  !ls
  !unzip -q kagglecatsanddogs_3367a.zip
  !ls PetImages

  import tensorflow as tf
  import tensorflow.keras as keras
  import numpy as np
  import os
  from tensorflow.keras.layers.experimental.preprocessing import CenterCrop, Rescaling, Normalization

  # Filter out corrupted images (files without a "JFIF" header).
  num_skipped = 0
  for folder_name in ("Cat", "Dog"):
      folder_path = os.path.join("PetImages", folder_name)
      for fname in os.listdir(folder_path):
          fpath = os.path.join(folder_path, fname)
          try:
              fobj = open(fpath, "rb")
              is_jfif = tf.compat.as_bytes("JFIF") in fobj.peek(10)
          finally:
              fobj.close()
          if not is_jfif:
              num_skipped += 1
              # Delete corrupted image
              os.remove(fpath)
  print("Deleted %d images" % num_skipped)

  train_dataset = keras.preprocessing.image_dataset_from_directory(
      "PetImages",
      batch_size=64,
      validation_split=0.2,
      subset="training",
      seed=1337,
      image_size=(224, 224))
  val_dataset = keras.preprocessing.image_dataset_from_directory(
      "PetImages",
      batch_size=64,
      validation_split=0.2,
      subset="validation",
      seed=1337,
      image_size=(224, 224))

  for data, labels in train_dataset.take(1):
      print(data.shape)    # (64, 224, 224, 3)
      print(data.dtype)    # float32
      print(labels.shape)  # (64,)
      print(labels.dtype)  # int32

      # Normalize one batch of images.
      normalizer = Normalization(axis=-1)
      normalizer.adapt(data)
      normalized_data = normalizer(data)
      print(np.var(normalized_data.numpy()))   # ~1.0
      print(np.mean(normalized_data.numpy()))  # ~0.0

      # Center-crop and rescale the same batch.
      cropper = CenterCrop(height=150, width=150)
      scaler = Rescaling(scale=1.0 / 255.0)
      output_data = scaler(cropper(data))
      print(output_data.shape, np.min(output_data), np.max(output_data))
      # (64, 150, 150, 3) 0.0 1.0

4. Building models with the Keras Functional API

A "layer" is a simple input-output transformation (such as the scaling and center-cropping transformations above). For instance, here is a linear projection layer that maps its inputs to a 16-dimensional feature space:

  dense = keras.layers.Dense(units=16)
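Under the hood, such a layer computes y = xW + b, where W and b are trainable weights. A hedged NumPy equivalent, with made-up random inputs and weights, looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 32)).astype("float32")   # batch of 8 samples, 32 features each
W = rng.normal(size=(32, 16)).astype("float32")  # kernel, as Dense(units=16) would create
b = np.zeros(16, dtype="float32")                # bias, initialized to zeros

y = x @ W + b  # linear projection into a 16-dimensional feature space
print(y.shape)  # (8, 16)
```

In the real layer, W and b are created lazily when the layer first sees its input shape, and are updated during training.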

A "model" is a directed acyclic graph of layers. You can think of a model as a "bigger layer" that encompasses multiple sublayers and that can be trained via exposure to data.

If any dimension of your input can vary, you can specify it as None. For instance, an input for a 224x224 RGB image would have shape (224, 224, 3), but an input for an RGB image of any size would have shape (None, None, 3).

You can run this code in Colab:
  !curl -O https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip
  !ls
  !unzip -q kagglecatsanddogs_3367a.zip
  !ls PetImages

  import tensorflow as tf
  import tensorflow.keras as keras
  from tensorflow.keras import layers
  from tensorflow.keras.layers.experimental.preprocessing import CenterCrop, Rescaling
  import os

  # Filter out corrupted images (files without a "JFIF" header).
  num_skipped = 0
  for folder_name in ("Cat", "Dog"):
      folder_path = os.path.join("PetImages", folder_name)
      for fname in os.listdir(folder_path):
          fpath = os.path.join(folder_path, fname)
          try:
              fobj = open(fpath, "rb")
              is_jfif = tf.compat.as_bytes("JFIF") in fobj.peek(10)
          finally:
              fobj.close()
          if not is_jfif:
              num_skipped += 1
              # Delete corrupted image
              os.remove(fpath)
  print("Deleted %d images" % num_skipped)

  train_dataset = keras.preprocessing.image_dataset_from_directory(
      "PetImages",
      batch_size=64,
      validation_split=0.2,
      subset="training",
      seed=1337,
      image_size=(224, 224))
  val_dataset = keras.preprocessing.image_dataset_from_directory(
      "PetImages",
      batch_size=64,
      validation_split=0.2,
      subset="validation",
      seed=1337,
      image_size=(224, 224))

  for data, labels in train_dataset.take(1):
      print(data.shape)    # (64, 224, 224, 3)
      print(data.dtype)    # float32
      print(labels.shape)  # (64,)
      print(labels.dtype)  # int32

  # --------- build model start ------------------------
  inputs = keras.Input(shape=(None, None, 3))  # RGB images of any size
  x = CenterCrop(height=150, width=150)(inputs)
  x = Rescaling(scale=1.0 / 255.0)(x)  # rescale inputs to the [0, 1] range
  x = layers.Conv2D(32, kernel_size=(3, 3), activation="relu")(x)
  x = layers.MaxPooling2D(pool_size=(3, 3))(x)
  x = layers.Conv2D(32, kernel_size=(3, 3), activation="relu")(x)
  x = layers.MaxPooling2D(pool_size=(3, 3))(x)
  x = layers.Conv2D(32, kernel_size=(3, 3), activation="relu")(x)
  x = layers.GlobalAveragePooling2D()(x)
  outputs = layers.Dense(2, activation="softmax")(x)
  model = keras.Model(inputs=inputs, outputs=outputs)
  # --------- build model end ------------------------

  processed_data = model(data)
  print(processed_data.shape)  # (64, 2)

The model summary is:
  Model: "model"
  _________________________________________________________________
  Layer (type)                 Output Shape              Param #
  =================================================================
  input_2 (InputLayer)         [(None, None, None, 3)]  0
  _________________________________________________________________
  center_crop_4 (CenterCrop)   (None, 150, 150, 3)      0
  _________________________________________________________________
  rescaling_4 (Rescaling)      (None, 150, 150, 3)      0
  _________________________________________________________________
  conv2d_1 (Conv2D)            (None, 148, 148, 32)     896
  _________________________________________________________________
  max_pooling2d (MaxPooling2D) (None, 49, 49, 32)       0
  _________________________________________________________________
  conv2d_2 (Conv2D)            (None, 47, 47, 32)       9248
  _________________________________________________________________
  max_pooling2d_1 (MaxPooling2 (None, 15, 15, 32)       0
  _________________________________________________________________
  conv2d_3 (Conv2D)            (None, 13, 13, 32)       9248
  _________________________________________________________________
  global_average_pooling2d (Gl (None, 32)               0
  _________________________________________________________________
  dense (Dense)                (None, 2)                66
  =================================================================
  Total params: 19,458
  Trainable params: 19,458
  Non-trainable params: 0
  _________________________________________________________________

5. Training models with fit()

We have the PetImages dataset; now let's start training a model.

Before you can call fit(), you need to specify an optimizer and a loss function (we assume you are already familiar with these concepts). This is the compile() step:

  model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=1e-3),
                loss=keras.losses.CategoricalCrossentropy())

Loss and optimizer can also be specified via their string identifiers (in this case their default constructor argument values are used):

  model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

Once your model is compiled, you can start fitting the model to the data. Here is what fitting a model looks like with NumPy data:

  model.fit(numpy_array_of_samples, numpy_array_of_labels, batch_size=batch_size, epochs=epochs)

Besides the data, you have to specify two key parameters: batch_size and epochs. When fitting on a Dataset object instead, the data it yields is expected to be already batched, so you don't need to specify batch_size.
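The relationship between batch_size, epochs, and the number of gradient updates is simple arithmetic. A small sketch, assuming a training split of about 18,728 images (which would produce the 293 steps per epoch seen in the training log further down):

```python
import math

num_samples = 18728  # assumed training split size; illustrative only
batch_size = 64
epochs = 10

# One gradient update per batch; the last batch may be smaller.
steps_per_epoch = math.ceil(num_samples / batch_size)
total_updates = steps_per_epoch * epochs
print(steps_per_epoch)  # 293
print(total_updates)    # 2930
```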

Let's look at it in practice with the PetImages dataset. You can run this code in Colab:
  !curl -O https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip
  !ls
  !unzip -q kagglecatsanddogs_3367a.zip
  !ls PetImages

  import tensorflow as tf
  import tensorflow.keras as keras
  from tensorflow.keras import layers
  from tensorflow.keras.layers.experimental.preprocessing import CenterCrop, Rescaling
  import os

  # Filter out corrupted images (files without a "JFIF" header).
  num_skipped = 0
  for folder_name in ("Cat", "Dog"):
      folder_path = os.path.join("PetImages", folder_name)
      for fname in os.listdir(folder_path):
          fpath = os.path.join(folder_path, fname)
          try:
              fobj = open(fpath, "rb")
              is_jfif = tf.compat.as_bytes("JFIF") in fobj.peek(10)
          finally:
              fobj.close()
          if not is_jfif:
              num_skipped += 1
              # Delete corrupted image
              os.remove(fpath)
  print("Deleted %d images" % num_skipped)

  train_dataset = keras.preprocessing.image_dataset_from_directory(
      "PetImages",
      batch_size=64,
      validation_split=0.2,
      subset="training",
      seed=1337,
      image_size=(224, 224))
  val_dataset = keras.preprocessing.image_dataset_from_directory(
      "PetImages",
      batch_size=64,
      validation_split=0.2,
      subset="validation",
      seed=1337,
      image_size=(224, 224))

  for data, labels in train_dataset.take(1):
      print(data.shape)    # (64, 224, 224, 3)
      print(data.dtype)    # float32
      print(labels.shape)  # (64,)
      print(labels.dtype)  # int32

  # --------- build model start ------------------------
  inputs = keras.Input(shape=(None, None, 3))
  x = CenterCrop(height=150, width=150)(inputs)
  x = Rescaling(scale=1.0 / 255.0)(x)  # rescale inputs to the [0, 1] range
  x = layers.Conv2D(32, kernel_size=(3, 3), activation="relu")(x)
  x = layers.MaxPooling2D(pool_size=(3, 3))(x)
  x = layers.Conv2D(32, kernel_size=(3, 3), activation="relu")(x)
  x = layers.MaxPooling2D(pool_size=(3, 3))(x)
  x = layers.Conv2D(32, kernel_size=(3, 3), activation="relu")(x)
  x = layers.GlobalAveragePooling2D()(x)
  outputs = layers.Dense(2, activation="softmax")(x)
  model = keras.Model(inputs=inputs, outputs=outputs)
  # --------- build model end ------------------------

  processed_data = model(data)
  print(processed_data.shape)  # (64, 2)

  # Our labels are integer indices, so we use SparseCategoricalCrossentropy
  # (no one-hot encoding needed).
  model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=1e-3),
                loss=keras.losses.SparseCategoricalCrossentropy(),
                metrics=["acc"])

  """
  The full fit() signature, for reference:
  model.fit(x=None, y=None, batch_size=None, epochs=1, verbose=1,
            callbacks=None, validation_split=0.0, validation_data=None,
            shuffle=True, class_weight=None, sample_weight=None,
            initial_epoch=0, steps_per_epoch=None, validation_steps=None,
            validation_batch_size=None, validation_freq=1,
            max_queue_size=10, workers=1, use_multiprocessing=False)
  """

  # Fit on the Dataset (no batch_size needed: the dataset is already batched).
  history = model.fit(train_dataset,
                      epochs=10,
                      validation_data=val_dataset,
                      use_multiprocessing=True)
Epoch 1/10
293/293 [==============================] - 57s 192ms/step - loss: 0.5152 - acc: 0.7455 - val_loss: 0.5316 - val_acc: 0.7307
Epoch 2/10
293/293 [==============================] - 57s 194ms/step - loss: 0.4909 - acc: 0.7676 - val_loss: 0.5127 - val_acc: 0.7469
Epoch 3/10
293/293 [==============================] - 57s 192ms/step - loss: 0.4934 - acc: 0.7669 - val_loss: 0.4811 - val_acc: 0.7702
Epoch 4/10
293/293 [==============================] - 59s 198ms/step - loss: 0.4862 - acc: 0.7689 - val_loss: 0.4812 - val_acc: 0.7768
Epoch 5/10
293/293 [==============================] - 57s 192ms/step - loss: 0.4686 - acc: 0.7787 - val_loss: 0.5609 - val_acc: 0.7215
Epoch 6/10
293/293 [==============================] - 57s 191ms/step - loss: 0.4705 - acc: 0.7782 - val_loss: 0.4914 - val_acc: 0.7655
Epoch 7/10
293/293 [==============================] - 57s 193ms/step - loss: 0.4652 - acc: 0.7825 - val_loss: 0.6022 - val_acc: 0.6873
Epoch 8/10
293/293 [==============================] - 57s 193ms/step - loss: 0.4570 - acc: 0.7908 - val_loss: 0.4484 - val_acc: 0.7945
Epoch 9/10
293/293 [==============================] - 57s 191ms/step - loss: 0.4472 - acc: 0.7953 - val_loss: 0.4738 - val_acc: 0.7785
Epoch 10/10
293/293 [==============================] - 58s 196ms/step - loss: 0.4434 - acc: 0.7995 - val_loss: 0.5420 - val_acc: 0.7322

Tips:
  1. The fit() call returns a "history" object which records what happened over the course of training. The history.history dict contains per-epoch time series of metric values.
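For instance, the per-epoch values can be used to pick the best epoch. The dict below is a hand-made stand-in for what history.history would contain after a run like the one above (the real keys depend on the compiled metrics):

```python
# Stand-in for history.history after training with metrics=["acc"]
# and a validation set; values are illustrative only.
history_dict = {
    "loss":     [0.5152, 0.4909, 0.4934],
    "acc":      [0.7455, 0.7676, 0.7669],
    "val_loss": [0.5316, 0.5127, 0.4811],
    "val_acc":  [0.7307, 0.7469, 0.7702],
}

# Epoch with the lowest validation loss (0-indexed).
best_epoch = min(range(len(history_dict["val_loss"])),
                 key=lambda i: history_dict["val_loss"][i])
print(best_epoch + 1, history_dict["val_acc"][best_epoch])  # 3 0.7702
```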

6. Using callbacks for checkpointing and more

If training goes on for more than a few minutes, it's important to save your model at regular intervals during training. You can then use your saved models to restart training in case your training process crashes (this is important for multi-worker distributed training, since with many workers, at least one of them is bound to fail at some point).

An important feature of Keras is callbacks, configured in fit(). Callbacks are objects that get called by the model at different points during training, in particular:

  • At the beginning and end of each batch
  • At the beginning and end of each epoch
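Conceptually, the training loop drives the callbacks by invoking their hooks at those points. A minimal pure-Python sketch of that pattern (not the actual Keras code, which passes logs and supports batch-level hooks too):

```python
class RecordingCallback:
    """Toy callback mirroring the epoch-level hook points Keras exposes."""
    def __init__(self):
        self.events = []

    def on_epoch_begin(self, epoch):
        self.events.append(("epoch_begin", epoch))

    def on_epoch_end(self, epoch):
        self.events.append(("epoch_end", epoch))

def toy_training_loop(epochs, callbacks):
    for epoch in range(epochs):
        for cb in callbacks:
            cb.on_epoch_begin(epoch)
        # ... one epoch of batch-level training would run here ...
        for cb in callbacks:
            cb.on_epoch_end(epoch)

cb = RecordingCallback()
toy_training_loop(epochs=2, callbacks=[cb])
print(cb.events)
# [('epoch_begin', 0), ('epoch_end', 0), ('epoch_begin', 1), ('epoch_end', 1)]
```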

You can use callbacks like this, passing the list to fit():

callbacks = [
    keras.callbacks.ModelCheckpoint(
        filepath = 'path/to/my/model_{epoch}',
        save_freq = 'epoch'
    )
]
model.fit(train_dataset, epochs=10, callbacks=callbacks)