This is the second post about my Master’s thesis, and here I describe the classifier we used in the project. The task is to classify the main action happening in a video into one of the categories of the UCF-101 dataset. Below I introduce the neural network model we used for it.
Classification model: Local Deep Neural Network
We used a neural network that operates at patch level, which we called a Local Deep Neural Network. The aim of this model is to learn to classify each patch into one of the 101 classes. Once training is complete, we can extract features from a video and classify each of its patches individually. The following image shows the structure of the Local Deep Neural Network.
At the last stage of the network, the box labelled Sigma represents the late fusion procedure, which combines the per-patch outputs into a single label for the video. We evaluated two schemes that work well on other problems:
- Voting: Each patch output yields a label (its highest-scoring class). The votes are counted, and the most voted label becomes the final label for the video.
- Posterior Sum: The output vectors of all patches are summed element-wise into one vector, and the final label is the class with the maximum summed value.
The following diagram shows an example with 2 classes, {1,2}, and 3 outputs, {CL1, CL2, CL3}, corresponding to 3 different patches. In the case of voting, label 1 receives one vote while label 2 receives two votes, so the instance is classified with label 2. With the posterior sum, on the other hand, adding the patch outputs dimension by dimension gives a new vector whose maximum value corresponds to label 1. This example illustrates that the two methods can disagree with each other, so evaluating both is worthwhile.
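The two fusion schemes can be sketched in a few lines of Python. This is only a minimal illustration, not the code used in the thesis: the function names `vote_fusion` and `posterior_sum_fusion` are made up for this post, and the posterior values are hypothetical numbers chosen to reproduce the situation described above (they are not the values from the diagram).

```python
from collections import Counter


def argmax(values):
    """Return the index of the largest element in a sequence."""
    return max(range(len(values)), key=lambda i: values[i])


def vote_fusion(patch_posteriors):
    """Voting: each patch votes for its highest-scoring class.

    Returns the 0-based index of the most voted class. On a tie,
    the class that received its first vote earliest wins.
    """
    patch_labels = [argmax(p) for p in patch_posteriors]
    return Counter(patch_labels).most_common(1)[0][0]


def posterior_sum_fusion(patch_posteriors):
    """Posterior sum: add the patch outputs element-wise, then take the max."""
    summed = [sum(column) for column in zip(*patch_posteriors)]
    return argmax(summed)


# Hypothetical outputs for 3 patches and 2 classes (illustrative values only).
patch_posteriors = [
    [0.90, 0.10],  # CL1: confident in label 1
    [0.40, 0.60],  # CL2: weakly prefers label 2
    [0.45, 0.55],  # CL3: weakly prefers label 2
]

# Classes are numbered from 1 in the text, so shift the 0-based index.
voting_label = vote_fusion(patch_posteriors) + 1
posterior_sum_label = posterior_sum_fusion(patch_posteriors) + 1

print("Voting:", voting_label)
print("Posterior sum:", posterior_sum_label)
```

With these numbers, voting picks label 2 (two votes against one), while the posterior sum picks label 1, because the single confident patch outweighs the two weakly confident ones (1.75 against 1.25).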
Now that the model has been presented, I will show the results and conclusions of the experiments in the last part of this series.