1. Data loading and preprocessing
Neural networks don't process raw data, like text files, encoded JPEG image files, or CSV files. They process vectorized and standardized representations.
- Text files need to be read into string tensors, then split into words; finally, the words need to be indexed and turned into integer tensors.
- Images need to be read and decoded into integer tensors, then converted to floating point and normalized to small values (such as [0, 1] or [-1, 1]).
- CSV data needs to be parsed, with numerical features converted to floating-point tensors and categorical features indexed and converted to integer tensors. Then each feature typically needs to be normalized to zero mean and unit variance.
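To make the last point concrete, here is a minimal NumPy sketch (illustrative values only) of standardizing a numerical feature to zero mean and unit variance:

```python
import numpy as np

# A hypothetical numerical feature column, e.g. parsed from a CSV file
feature = np.array([10.0, 12.0, 14.0, 16.0], dtype="float32")

# Standardize: subtract the mean, divide by the standard deviation
standardized = (feature - feature.mean()) / feature.std()

print(standardized.mean())  # ~0.0
print(standardized.std())   # ~1.0
```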
2. Data loading
Keras models accept three types of inputs:
- NumPy arrays, just like Scikit-Learn and many other Python-based libraries. This is a good option if your data fits in memory.
- TensorFlow Dataset objects.
- Python generators that yield batches of data (such as subclasses of the keras.utils.Sequence class).
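As a sketch of the third option, a plain Python generator that yields batches might look like this (array shapes are made up for illustration):

```python
import numpy as np

def batch_generator(x, y, batch_size):
    # Yield (samples, labels) batches indefinitely, as Keras expects
    n = len(x)
    while True:
        for i in range(0, n, batch_size):
            yield x[i:i + batch_size], y[i:i + batch_size]

x = np.zeros((100, 8), dtype="float32")  # 100 samples, 8 features
y = np.zeros((100,), dtype="int32")      # 100 integer labels
data, labels = next(batch_generator(x, y, batch_size=32))
print(data.shape)    # (32, 8)
print(labels.shape)  # (32,)
```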
Keras features a range of utilities to help you turn raw data on disk into a Dataset:
- tf.keras.preprocessing.image_dataset_from_directory turns image files sorted into class-specific folders into a labeled dataset of image tensors.
- tf.keras.preprocessing.text_dataset_from_directory does the same for text files.
- tf.data.experimental.make_csv_dataset loads structured data from CSV files.
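What make_csv_dataset automates can be sketched in plain Python (hypothetical two-column data):

```python
import csv
import io

# Hypothetical CSV with one numerical and one categorical column
raw = "age,color\n22,red\n35,blue\n29,red\n"
rows = list(csv.DictReader(io.StringIO(raw)))

ages = [float(r["age"]) for r in rows]               # numerical -> float
vocab = sorted({r["color"] for r in rows})           # category vocabulary
color_ids = [vocab.index(r["color"]) for r in rows]  # categorical -> int

print(ages)       # [22.0, 35.0, 29.0]
print(color_ids)  # [1, 0, 1]
```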
To load image data as (image, label) pairs, you need to have your image files sorted by class into different folders, like this:
main_directory/
...class_a/
......a_image_1.jpg
......a_image_2.jpg
...class_b/
......b_image_1.jpg
......b_image_2.jpg
Then you can do:
dataset = keras.preprocessing.image_dataset_from_directory(
    "path/to/main_directory", batch_size=64, image_size=(224, 224))

# For demonstration, iterate over the batches yielded by the dataset:
for data, labels in dataset:
    print(data.shape)    # (64, 224, 224, 3)
    print(data.dtype)    # float32
    print(labels.shape)  # (64,)
    print(labels.dtype)  # int32
The label of a sample is the rank of its folder in alphanumeric order. Naturally, this can also be configured explicitly by passing, e.g., class_names=["class_a", "class_b"], in which case label 0 will be class_a and 1 will be class_b.
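The folder-to-label mapping can be sketched as follows (hypothetical folder names):

```python
# Hypothetical class folders found on disk, in arbitrary order
folder_names = ["class_b", "class_a"]

# Labels follow the alphanumeric order of the folder names
class_names = sorted(folder_names)
label_map = {name: i for i, name in enumerate(class_names)}
print(label_map)  # {'class_a': 0, 'class_b': 1}
```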
Have a try: you can run this code in Colab.
!curl -O https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip
!ls
!unzip -q kagglecatsanddogs_3367a.zip
!ls PetImages

import os
import tensorflow as tf
import tensorflow.keras as keras

# Filter out corrupted images (files without a JFIF header)
num_skipped = 0
for folder_name in ("Cat", "Dog"):
    folder_path = os.path.join("PetImages", folder_name)
    for fname in os.listdir(folder_path):
        fpath = os.path.join(folder_path, fname)
        try:
            fobj = open(fpath, "rb")
            is_jfif = tf.compat.as_bytes("JFIF") in fobj.peek(10)
        finally:
            fobj.close()
        if not is_jfif:
            num_skipped += 1
            # Delete corrupted image
            os.remove(fpath)
print("Deleted %d images" % num_skipped)

train_dataset = keras.preprocessing.image_dataset_from_directory(
    "PetImages", batch_size=64, validation_split=0.2,
    subset="training", seed=1337, image_size=(224, 224))
val_dataset = keras.preprocessing.image_dataset_from_directory(
    "PetImages", batch_size=64, validation_split=0.2,
    subset="validation", seed=1337, image_size=(224, 224))

for data, labels in train_dataset.take(1):
    print(data.shape)    # (64, 224, 224, 3)
    print(data.dtype)    # float32
    print(labels.shape)  # (64,)
    print(labels.dtype)  # int32
3. Data preprocessing with Keras
Once your data is in the form of string/int/float NumPy arrays, or a Dataset object (or Python generator) that yields batches of string/int/float tensors, it is time to preprocess the data. This can mean:
- Tokenization of string data, followed by token indexing
- Feature normalization
- Rescaling the data to small values (in general, input values to a neural network should be close to zero; typically we expect either data with zero mean and unit variance, or data in the [0, 1] range)
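A plain-Python sketch of the first point, tokenization followed by token indexing (the TextVectorization layer shown later does this for you):

```python
# Toy corpus for illustration
texts = ["the cat sat", "the dog sat"]

tokens = [t.split() for t in texts]  # tokenization (split into words)

# Build a vocabulary mapping each word to an integer index
# (index 0 is conventionally reserved for padding)
words = sorted({w for t in tokens for w in t})
vocab = {w: i + 1 for i, w in enumerate(words)}

indexed = [[vocab[w] for w in t] for t in tokens]  # token indexing
print(indexed)  # [[4, 1, 3], [4, 2, 3]]
```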
In general, you should seek to do data preprocessing as part of your model as much as possible, not via an external preprocessing pipeline. That's because external preprocessing makes your models less portable when it is time to use them in production.
It is much easier to simply export an end-to-end model that already includes preprocessing. The ideal model should expect as input something as close as possible to raw data:
- An image model should expect RGB pixel values in the [0, 255] range.
- A text model should accept strings of UTF-8 characters.
That way, the consumer of the exported model doesn't have to know about the preprocessing pipeline.
Using Keras preprocessing layers
In Keras, you do in-model data preprocessing via preprocessing layers. This includes:
- Vectorizing raw strings of text via the TextVectorization layer
- Feature normalization via the Normalization layer
- Image rescaling, cropping, or image data augmentation
Example:
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
import numpy as np

# Example training data: an array of strings
train_data = np.array([["..."], ["..."]])

vectorizer = TextVectorization(output_mode="int")
vectorizer.adapt(train_data)
integer_data = vectorizer(train_data)
Example: normalizing features
from tensorflow.keras.layers.experimental.preprocessing import Normalization
import numpy as np

train_data = np.random.randint(0, 256, size=(64, 224, 224, 3)).astype("float32")
normalizer = Normalization(axis=-1)       # normalize over the last dim
normalizer.adapt(train_data)              # compute mean and variance
normalized_data = normalizer(train_data)  # transform

# The Normalization layer yields data with variance 1 and mean 0
print(np.var(normalized_data))   # 1.0
print(np.mean(normalized_data))  # -0.0
Example: rescaling and center-cropping images
Both the Rescaling layer and the CenterCrop layer are stateless, so it isn't necessary to call adapt() in this case:
from tensorflow.keras.layers.experimental.preprocessing import CenterCrop, Rescaling
import numpy as np

train_data = np.random.randint(0, 256, size=(64, 224, 224, 3)).astype("float32")
cropper = CenterCrop(height=150, width=150)
scaler = Rescaling(scale=1.0 / 255.0)

# Use without calling adapt()
output_data = scaler(cropper(train_data))
print("shape:", output_data.shape)  # (64, 150, 150, 3)
print("min:", np.min(output_data), "\tmax:", np.max(output_data))  # min: 0.0  max: 1.0
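Center-cropping itself is essentially array slicing; here is a NumPy sketch of the computation CenterCrop performs, assuming channels-last batches:

```python
import numpy as np

def center_crop(images, height, width):
    # Crop an equal margin from both sides of each spatial dimension
    h, w = images.shape[1], images.shape[2]
    top = (h - height) // 2
    left = (w - width) // 2
    return images[:, top:top + height, left:left + width, :]

batch = np.zeros((64, 224, 224, 3), dtype="float32")
print(center_crop(batch, 150, 150).shape)  # (64, 150, 150, 3)
```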
You can run this code in Colab:
!curl -O https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip
!ls
!unzip -q kagglecatsanddogs_3367a.zip
!ls PetImages

import os
import numpy as np
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.layers.experimental.preprocessing import CenterCrop, Rescaling, Normalization

# Filter out corrupted images (files without a JFIF header)
num_skipped = 0
for folder_name in ("Cat", "Dog"):
    folder_path = os.path.join("PetImages", folder_name)
    for fname in os.listdir(folder_path):
        fpath = os.path.join(folder_path, fname)
        try:
            fobj = open(fpath, "rb")
            is_jfif = tf.compat.as_bytes("JFIF") in fobj.peek(10)
        finally:
            fobj.close()
        if not is_jfif:
            num_skipped += 1
            # Delete corrupted image
            os.remove(fpath)
print("Deleted %d images" % num_skipped)

train_dataset = keras.preprocessing.image_dataset_from_directory(
    "PetImages", batch_size=64, validation_split=0.2,
    subset="training", seed=1337, image_size=(224, 224))
val_dataset = keras.preprocessing.image_dataset_from_directory(
    "PetImages", batch_size=64, validation_split=0.2,
    subset="validation", seed=1337, image_size=(224, 224))

for data, labels in train_dataset.take(1):
    print(data.shape)    # (64, 224, 224, 3)
    print(data.dtype)    # float32
    print(labels.shape)  # (64,)
    print(labels.dtype)  # int32

    # Normalize one batch
    normalizer = Normalization(axis=-1)
    normalizer.adapt(data)
    normalized_data = normalizer(data)
    print(np.var(normalized_data.numpy()))   # ~1.0
    print(np.mean(normalized_data.numpy()))  # ~0.0

    # Crop and rescale the same batch
    cropper = CenterCrop(height=150, width=150)
    scaler = Rescaling(scale=1.0 / 255.0)
    output_data = scaler(cropper(data))
    print(output_data.shape, np.min(output_data), np.max(output_data))
    # (64, 150, 150, 3) 0.0 1.0
4. Building models with the Keras Functional API
A "layer" is a simple input-output transformation (such as the scaling and center-cropping transformations above). For instance, here is a linear projection layer that maps its input to a 16-dimensional feature space:
dense = keras.layers.Dense(units=16)
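Under the hood, such a Dense layer computes the linear projection y = xW + b; here is a NumPy sketch with made-up shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 32)).astype("float32")   # batch of 8 samples, 32 features
W = rng.normal(size=(32, 16)).astype("float32")  # kernel: 32 -> 16
b = np.zeros(16, dtype="float32")                # one bias per output unit

y = x @ W + b
print(y.shape)  # (8, 16): each sample mapped to a 16-dimensional space
```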
A "model" is a directed acyclic graph of layers. You can think of a model as a "bigger layer" that encompasses multiple sublayers and that can be trained via exposure to data.
If any dimension of your input can vary, you can specify it as None. For instance, an input for 224x224 RGB images would have shape (224, 224, 3), but an input for RGB images of any size would have shape (None, None, 3).
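A minimal sketch of such a variable-size input in action (a toy model for illustration, not the classifier built below):

```python
import numpy as np
import tensorflow.keras as keras
from tensorflow.keras import layers

# Spatial dimensions are None: images of any size are accepted
inputs = keras.Input(shape=(None, None, 3))
# Global average pooling removes the variable spatial dimensions
outputs = layers.GlobalAveragePooling2D()(inputs)
model = keras.Model(inputs=inputs, outputs=outputs)

# The same model handles batches of different image sizes
print(model(np.zeros((2, 32, 32, 3), dtype="float32")).shape)  # (2, 3)
print(model(np.zeros((2, 64, 48, 3), dtype="float32")).shape)  # (2, 3)
```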
You can run this code in Colab:
!curl -O https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip
!ls
!unzip -q kagglecatsanddogs_3367a.zip
!ls PetImages

import os
import tensorflow as tf
import tensorflow.keras as keras

# Filter out corrupted images (files without a JFIF header)
num_skipped = 0
for folder_name in ("Cat", "Dog"):
    folder_path = os.path.join("PetImages", folder_name)
    for fname in os.listdir(folder_path):
        fpath = os.path.join(folder_path, fname)
        try:
            fobj = open(fpath, "rb")
            is_jfif = tf.compat.as_bytes("JFIF") in fobj.peek(10)
        finally:
            fobj.close()
        if not is_jfif:
            num_skipped += 1
            # Delete corrupted image
            os.remove(fpath)
print("Deleted %d images" % num_skipped)

train_dataset = keras.preprocessing.image_dataset_from_directory(
    "PetImages", batch_size=64, validation_split=0.2,
    subset="training", seed=1337, image_size=(224, 224))
val_dataset = keras.preprocessing.image_dataset_from_directory(
    "PetImages", batch_size=64, validation_split=0.2,
    subset="validation", seed=1337, image_size=(224, 224))

for data, labels in train_dataset.take(1):
    print(data.shape)    # (64, 224, 224, 3)
    print(data.dtype)    # float32
    print(labels.shape)  # (64,)
    print(labels.dtype)  # int32

# --------- build model start ------------------------
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental.preprocessing import CenterCrop, Rescaling

inputs = keras.Input(shape=(None, None, 3))
x = CenterCrop(height=150, width=150)(inputs)
x = Rescaling(scale=1.0 / 255.0)(x)  # rescale inputs to the [0, 1] range
x = layers.Conv2D(32, kernel_size=(3, 3), activation="relu")(x)
x = layers.MaxPooling2D(pool_size=(3, 3))(x)
x = layers.Conv2D(32, kernel_size=(3, 3), activation="relu")(x)
x = layers.MaxPooling2D(pool_size=(3, 3))(x)
x = layers.Conv2D(32, kernel_size=(3, 3), activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(2, activation="softmax")(x)
model = keras.Model(inputs=inputs, outputs=outputs)
# --------- build model end --------------------------

processed_data = model(data)
print(processed_data.shape)  # (64, 2)
The output is (64, 2).
The model summary is:
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_2 (InputLayer)         [(None, None, None, 3)]   0
_________________________________________________________________
center_crop_4 (CenterCrop)   (None, 150, 150, 3)       0
_________________________________________________________________
rescaling_4 (Rescaling)      (None, 150, 150, 3)       0
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 148, 148, 32)      896
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 49, 49, 32)        0
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 47, 47, 32)        9248
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 15, 15, 32)        0
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 13, 13, 32)        9248
_________________________________________________________________
global_average_pooling2d (Gl (None, 32)                0
_________________________________________________________________
dense (Dense)                (None, 2)                 66
=================================================================
Total params: 19,458
Trainable params: 19,458
Non-trainable params: 0
_________________________________________________________________
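The parameter counts in the summary can be verified by hand: a Conv2D layer with k×k kernels, c_in input channels, and c_out filters has (k·k·c_in + 1)·c_out parameters (the +1 is the per-filter bias), and a Dense layer has (inputs + 1)·units:

```python
def conv2d_params(k, c_in, c_out):
    # k*k*c_in weights per filter, plus one bias per filter
    return (k * k * c_in + 1) * c_out

print(conv2d_params(3, 3, 32))   # 896   first Conv2D (RGB input)
print(conv2d_params(3, 32, 32))  # 9248  second and third Conv2D
print((32 + 1) * 2)              # 66    final Dense layer
print(896 + 2 * 9248 + 66)       # 19458 total, matching the summary
```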
5. Training models with fit()
We have the PetImages dataset; now let's start training a model.
Before you can call fit(), you need to specify an optimizer and a loss function (we assume you are familiar with these concepts). This is the compile() step:
model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=1e-3),
              loss=keras.losses.CategoricalCrossentropy())
The loss and optimizer can be specified via their string identifiers (in this case their default constructor argument values are used):
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
Once your model is compiled, you can start fitting the model to the data. Here is what fitting a model looks like with NumPy data:
model.fit(numpy_array_of_samples, numpy_array_of_labels,
          batch_size=batch_size, epochs=epochs)
Besides the data, you have to specify two key parameters: batch_size and epochs. When training on a Dataset, the data it yields is expected to be already batched, so you don't need to specify batch_size there.
Let's look at it in practice with the PetImages dataset. You can run this code in Colab:
!curl -O https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip
!ls
!unzip -q kagglecatsanddogs_3367a.zip
!ls PetImages

import os
import tensorflow as tf
import tensorflow.keras as keras

# Filter out corrupted images (files without a JFIF header)
num_skipped = 0
for folder_name in ("Cat", "Dog"):
    folder_path = os.path.join("PetImages", folder_name)
    for fname in os.listdir(folder_path):
        fpath = os.path.join(folder_path, fname)
        try:
            fobj = open(fpath, "rb")
            is_jfif = tf.compat.as_bytes("JFIF") in fobj.peek(10)
        finally:
            fobj.close()
        if not is_jfif:
            num_skipped += 1
            # Delete corrupted image
            os.remove(fpath)
print("Deleted %d images" % num_skipped)

train_dataset = keras.preprocessing.image_dataset_from_directory(
    "PetImages", batch_size=64, validation_split=0.2,
    subset="training", seed=1337, image_size=(224, 224))
val_dataset = keras.preprocessing.image_dataset_from_directory(
    "PetImages", batch_size=64, validation_split=0.2,
    subset="validation", seed=1337, image_size=(224, 224))

for data, labels in train_dataset.take(1):
    print(data.shape)    # (64, 224, 224, 3)
    print(data.dtype)    # float32
    print(labels.shape)  # (64,)
    print(labels.dtype)  # int32

# --------- build model start ------------------------
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental.preprocessing import CenterCrop, Rescaling

inputs = keras.Input(shape=(None, None, 3))
x = CenterCrop(height=150, width=150)(inputs)
x = Rescaling(scale=1.0 / 255.0)(x)  # rescale inputs to the [0, 1] range
x = layers.Conv2D(32, kernel_size=(3, 3), activation="relu")(x)
x = layers.MaxPooling2D(pool_size=(3, 3))(x)
x = layers.Conv2D(32, kernel_size=(3, 3), activation="relu")(x)
x = layers.MaxPooling2D(pool_size=(3, 3))(x)
x = layers.Conv2D(32, kernel_size=(3, 3), activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(2, activation="softmax")(x)
model = keras.Model(inputs=inputs, outputs=outputs)
# --------- build model end --------------------------

processed_data = model(data)
print(processed_data.shape)  # (64, 2)

# Our labels are integer class indices, so we use
# SparseCategoricalCrossentropy (no one-hot encoding needed)
model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=1e-3),
              loss=keras.losses.SparseCategoricalCrossentropy(),
              metrics=["acc"])

"""
The full fit() signature, for reference:
model.fit(x=None, y=None, batch_size=None, epochs=1, verbose=1,
          callbacks=None, validation_split=0.0, validation_data=None,
          shuffle=True, class_weight=None, sample_weight=None,
          initial_epoch=0, steps_per_epoch=None, validation_steps=None,
          validation_batch_size=None, validation_freq=1,
          max_queue_size=10, workers=1, use_multiprocessing=False)
"""

# Fit the model, monitoring on the validation split
history = model.fit(train_dataset, epochs=10,
                    validation_data=val_dataset,
                    use_multiprocessing=True)
Epoch 1/10
293/293 [==============================] - 57s 192ms/step - loss: 0.5152 - acc: 0.7455 - val_loss: 0.5316 - val_acc: 0.7307
Epoch 2/10
293/293 [==============================] - 57s 194ms/step - loss: 0.4909 - acc: 0.7676 - val_loss: 0.5127 - val_acc: 0.7469
Epoch 3/10
293/293 [==============================] - 57s 192ms/step - loss: 0.4934 - acc: 0.7669 - val_loss: 0.4811 - val_acc: 0.7702
Epoch 4/10
293/293 [==============================] - 59s 198ms/step - loss: 0.4862 - acc: 0.7689 - val_loss: 0.4812 - val_acc: 0.7768
Epoch 5/10
293/293 [==============================] - 57s 192ms/step - loss: 0.4686 - acc: 0.7787 - val_loss: 0.5609 - val_acc: 0.7215
Epoch 6/10
293/293 [==============================] - 57s 191ms/step - loss: 0.4705 - acc: 0.7782 - val_loss: 0.4914 - val_acc: 0.7655
Epoch 7/10
293/293 [==============================] - 57s 193ms/step - loss: 0.4652 - acc: 0.7825 - val_loss: 0.6022 - val_acc: 0.6873
Epoch 8/10
293/293 [==============================] - 57s 193ms/step - loss: 0.4570 - acc: 0.7908 - val_loss: 0.4484 - val_acc: 0.7945
Epoch 9/10
293/293 [==============================] - 57s 191ms/step - loss: 0.4472 - acc: 0.7953 - val_loss: 0.4738 - val_acc: 0.7785
Epoch 10/10
293/293 [==============================] - 58s 196ms/step - loss: 0.4434 - acc: 0.7995 - val_loss: 0.5420 - val_acc: 0.7322
Tips:
- The fit() call returns a "history" object which records what happened over the course of training. The history.history dict contains per-epoch time series of metric values.
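For example, using the first two epochs of the run above, history.history would look roughly like this, and you can query it directly:

```python
# Shape of history.history, filled in with the first two
# epochs of the training log above
history_dict = {
    "loss": [0.5152, 0.4909], "acc": [0.7455, 0.7676],
    "val_loss": [0.5316, 0.5127], "val_acc": [0.7307, 0.7469],
}

# e.g. find the epoch with the best validation accuracy
val_acc = history_dict["val_acc"]
best_epoch = max(range(len(val_acc)), key=lambda i: val_acc[i]) + 1
print(best_epoch)  # 2
```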
6. Using callbacks for checkpointing and more
If training goes on for more than a few minutes, it's important to save your model at regular intervals during training. You can then use your saved models to restart training in case your training process crashes (this is important for multi-worker distributed training, since with many workers, at least one of them is bound to fail at some point).
An important feature of Keras is callbacks, configured in fit(). Callbacks are objects that get called by the model at different points during training, in particular:
- At the beginning and end of each batch
- At the beginning and end of each epoch
You can use callbacks like this:
callbacks = [
    keras.callbacks.ModelCheckpoint(
        filepath='path/to/my/model_{epoch}',
        save_freq='epoch')
]
model.fit(train_dataset, epochs=10, callbacks=callbacks)
