Train TensorFlow models for image/video/features classification or other tasks. Currently the repository is set to train on image classification by default.
Install TensorFlow and the related cuDNN libraries by following the official TensorFlow documentation if the cuDNN libraries are not already set up.
Create a .env file with the following contents, making sure the paths, in particular the CUDA install path, are correct:
XLA_FLAGS="--xla_gpu_cuda_data_dir=/usr/local/cuda"
TF_XLA_FLAGS="--tf_xla_enable_xla_devices --tf_xla_auto_jit=2 --tf_xla_cpu_global_jit"
TF_CPP_MIN_LOG_LEVEL='3'
TF_FORCE_GPU_ALLOW_GROWTH="true"
OMP_NUM_THREADS="15"
KMP_BLOCKTIME="0"
KMP_SETTINGS="1"
KMP_AFFINITY="granularity=fine,verbose,compact,1,0"
CUDA_DEVICE_ORDER="PCI_BUS_ID"
CUDA_VISIBLE_DEVICES="0"
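As a rough illustration of how a file like this can be consumed, the sketch below parses KEY=VALUE lines and copies them into the process environment. It is not the repository's loader (which is not shown here); the helper names `parse_env_text` and `load_env_file` are hypothetical. Variables such as TF_CPP_MIN_LOG_LEVEL should be in place before TensorFlow is imported.

```python
import os


def parse_env_text(text):
    """Parse KEY=VALUE lines, stripping one layer of matching quotes.

    Hypothetical helper for illustration; blank lines and '#' comments are skipped.
    """
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, since values may contain '=' themselves.
        key, _, value = line.partition("=")
        value = value.strip()
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def load_env_file(path=".env"):
    """Read the file at `path` and export its variables into os.environ."""
    with open(path, encoding="utf-8") as handle:
        values = parse_env_text(handle.read())
    os.environ.update(values)
    return values
```

Calling `load_env_file()` at the very top of a script, before `import tensorflow`, is one way to make sure the flags take effect.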
Set up Docker to run with the NVIDIA Container Toolkit first.
Create a checkpoints directory in the current project directory.
bash scripts/build_docker.sh
bash scripts/run_docker.sh -p TF_BOARD_PORT
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
When using a Python venv, the CUDA libraries must be present in the current path. If CUDA was installed to /usr/local/cuda, the following commands should be added to the shell startup file (~/.bash_profile or ~/.bashrc).
# Set cuda LD_LIBRARY_PATH
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"
# add cuda bin dir to path
export PATH="$PATH:/usr/local/cuda/bin"
The environment variable XLA_FLAGS="--xla_gpu_cuda_data_dir=/usr/local/cuda" must also point to the CUDA directory in the .env file.
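To confirm the shell setup took effect, a quick check like the following can help. It is only a sketch: `missing_cuda_paths` is a hypothetical helper, and the default location /usr/local/cuda is the one assumed in the text above.

```python
import os


def missing_cuda_paths(env=None, cuda_home="/usr/local/cuda"):
    """Return the names of variables that lack the expected CUDA entry.

    Hypothetical helper; assumes Linux-style ':'-separated path lists.
    """
    env = os.environ if env is None else env
    expected = {
        "LD_LIBRARY_PATH": cuda_home + "/lib64",
        "PATH": cuda_home + "/bin",
    }
    missing = []
    for var, wanted in expected.items():
        # Drop empty entries and ignore trailing slashes when comparing.
        entries = [e.rstrip("/") for e in env.get(var, "").split(":") if e]
        if wanted not in entries:
            missing.append(var)
    return missing


if __name__ == "__main__":
    print("Variables missing CUDA entries:", missing_cuda_paths())
```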
conda create --name tf_gpu tensorflow-gpu python=3.9 -y
conda activate tf_gpu
while read requirement; do conda install --yes $requirement; done < requirements.txt
Note: Conda installs cuda, cudnn and cudatoolkit automatically, downloading non-Python dependencies as well.
The data directory must be organized according to the following structure, with sub-directories named after the classes and containing the images. The CIFAR-10 dataset in JPG format can be acquired from https://github.com/YoongiKim/CIFAR-10-images for a sample train and test run.
i.e.
data
|_ src_dataset
   |_ class_1
      |_ img1
      |_ img2
      |_ ....
   |_ class_2
      |_ img1
      |_ img2
      |_ ....
   ...
Note: ImageNet-style ordering of data is also supported, i.e. images placed under subdirectories inside the class directories.
i.e.
data
|_ src_dataset
   |_ class_1
      |_ 00d
         |_ img1
         |_ img2
      |_ 01
         |_ img1
         |_ img2
      |_ ...
   |_ ...
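To sanity-check a dataset laid out in either of the two ways above, the following sketch counts images per class, descending into nested subdirectories. The function name and the set of accepted file suffixes are assumptions for illustration, not part of the repository.

```python
from pathlib import Path

# Assumption: these suffixes cover the image files in the dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def count_images_per_class(dataset_dir):
    """Map each top-level class directory name to its image count.

    Images are counted recursively, so the ImageNet-style nested
    layout is handled the same way as the flat layout.
    """
    counts = {}
    for class_dir in sorted(Path(dataset_dir).iterdir()):
        if not class_dir.is_dir():
            continue
        counts[class_dir.name] = sum(
            1
            for path in class_dir.rglob("*")
            if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts
```

For example, `count_images_per_class("data/src_dataset")` gives a quick view of whether the classes are balanced.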
If the classes do not all have an equal number of training samples, data duplication can be done.
python data_preparation/duplicate_data.py --sd data/src_dataset --td data/duplicated_dataset -n NUM_TO_DUPLICATE
# find corrupt images (i.e. that cannot be opened with tf.io.decode_image)
python data_preparation/find_corrupt_imgs.py --rd data/src_dataset
Set the validation and test splits as fractions (e.g. 0.1). Both splits are optional.
python data_preparation/create_train_val_test_split.py --sd data/duplicated_dataset --td data/split_dataset [--vs VAL_SPLIT] [--ts TEST_SPLIT]
# to check the number of images in train, val and test dirs
bash scripts/count_files_per_subdir.sh data/split_dataset
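The fractional split described above can be pictured with the sketch below. It shows one plausible way to divide a list of files, not the logic of create_train_val_test_split.py itself; `split_items` and its `seed` parameter are hypothetical. Applying it once per class directory keeps the class proportions in each split.

```python
import random


def split_items(items, val_split=0.0, test_split=0.0, seed=0):
    """Shuffle `items` and divide them into train, val and test lists.

    `val_split` and `test_split` are fractions of the total; either may be 0.
    """
    if val_split < 0 or test_split < 0 or val_split + test_split >= 1:
        raise ValueError("splits must be non-negative and sum to less than 1")
    shuffled = list(items)
    # A seeded generator makes the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_split)
    n_test = int(len(shuffled) * test_split)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test
```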
Note: The test split should not be converted into tfrecords; the original data->class_sub_directory format should be used.
# convert train files into train tfrecord, select NUM_SAMPLES_PER_SHARDS so that each shard has a size of 100 MB+
python data_preparation/convert_imgs_to_tfrecord.py --sd data/split_dataset/train --td data/tfrecord_dataset/train [--cp CLASS_MAP_TXT_SAVEPATH] [--ns NUM_SAMPLES_PER_SHARDS]
# convert val files into val tfrecord, select NUM_SAMPLES_PER_SHARDS so that each shard has a size of 100 MB+
python data_preparation/convert_imgs_to_tfrecord.py --sd data/split_dataset/val --td data/tfrecord_dataset/val [--cp CLASS_MAP_TXT_SAVEPATH] [--ns NUM_SAMPLES_PER_SHARDS]
# to use multiprocessing use the --mt flag
Note: The test dataset is not converted to tfrecord because fast loading is not a priority there; the test data is only run through once.
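Choosing a per-shard sample count for the 100 MB+ target is simple arithmetic, sketched below. The helper names are hypothetical, and the estimate assumes each record is roughly the size of the image file on disk; the actual record size depends on how the conversion script encodes images, so treat the result as a starting point.

```python
import math


def samples_per_shard(avg_image_bytes, target_shard_bytes=100 * 1024 * 1024):
    """Smallest sample count whose combined size reaches the target shard size."""
    if avg_image_bytes <= 0:
        raise ValueError("avg_image_bytes must be positive")
    return max(1, math.ceil(target_shard_bytes / avg_image_bytes))


def num_shards(num_images, per_shard):
    """Number of shards needed to hold `num_images` at `per_shard` each."""
    return math.ceil(num_images / per_shard)


if __name__ == "__main__":
    per_shard = samples_per_shard(avg_image_bytes=2048)
    print(per_shard, num_shards(120_000, per_shard))
```

With small images such as CIFAR-10 JPGs, a single shard may hold the whole split, since the files are only a few kilobytes each.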
To extract frames from videos into npy.npz files, install opencv and pyav, then run:
python data_preparation/extract_frames_from_video_dataset.py --sd SOURCE_DATA_DIR
# use -h for help
Configure all values in the YAML files inside the config dir. A sample config file for training on the src_dataset directory is provided in config/train_image_clsf.yaml.
The model information repository is located at tf_train/model/models_info.py. New models can be added or model parameters can be modified through this file.
Set the number of GPUs to use, TensorFlow settings, and other system environment variables in .env.
python train.py --cfg CONFIG_YAML_PATH [-r RESUME_CHECKPOINT_PATH]
Notes:
- Using the -r option while training will override the resume_checkpoint param in the config yaml if this param is not null.
- To add tensorflow logs to train/test logs, set the "disable_existing_loggers" parameter to true in tf_train/logging/logger_config.json.
- Out of Memory errors during training could be caused by large batch sizes, model size, or the dataset.cache() call in train preprocessing in tf_train/pipelines/data_pipeline.py.
- When using mixed_float16 precision, the dtypes of the final dense and activation layers must be set to float32.
- An error like "ValueError: Unexpected result of train_function (Empty logs)" could be caused by incorrect paths to the train and validation directories in the config yaml files.
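The precedence rule in the first note (a command-line -r value wins over the resume_checkpoint param in the config) can be sketched as follows. This is an illustration only: the parser mirrors the flags shown in the train command, while `resolve_resume_checkpoint` and the flat config dictionary are assumptions about a layout that is not shown here.

```python
import argparse


def build_parser():
    """Parser mirroring the --cfg and -r options of the train command."""
    parser = argparse.ArgumentParser(description="Illustrative training CLI")
    parser.add_argument("--cfg", required=True, help="path to the YAML config")
    parser.add_argument("-r", dest="resume_checkpoint", default=None,
                        help="checkpoint path to resume from")
    return parser


def resolve_resume_checkpoint(cli_value, config):
    """Prefer the CLI value; fall back to the config entry (may be None)."""
    if cli_value is not None:
        return cli_value
    # Assumption: the config is a flat dict with a 'resume_checkpoint' key.
    return config.get("resume_checkpoint")


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(resolve_resume_checkpoint(args.resume_checkpoint, {}))
```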
tensorboard --logdir=checkpoints/tf_logs/ --port=PORT_NUM
Make sure to set the correct test_data_dir under data and the class_map_txt_path under tester in the yaml config file.
The class_map_txt_path file is generated by the convert_imgs_to_tfrecord.py script when converting images to tfrecord format.
python test.py --cfg CONFIG_YAML_PATH -r TEST_CHECKPOINT_PATH
A dockerized uvicorn and fastapi webserver with triton-server can be used to serve the model through an HTTPS API endpoint. Instructions are at tensorflow_training/server/README.md.
Unit and integration testing with pytest
python -m pytest tf_train # from the top project directory