
Predicting action progress in videos [paper reproduction]

Anish Diwan* ; George Sotirchos* ; Chandran Nandkumar*
3mE, Delft University of Technology, Netherlands
* These authors contributed equally to this work and share first authorship

Github Repositories
Some Results (from the original paper) - Qualitative results of the linear model. Each row represents the progression of an action. Progress values are plotted inside the detection box with time on the x axis and progress values on the y axis. Progress targets are shown in green and predicted progresses in blue.


Introduction


As intelligent agents such as autonomous vehicles and assistive robots become prevalent in our everyday world, the demands on their cognition capabilities also increase. The deep learning systems behind this cognition are now expected to process and interpret large amounts of video information, often online and in an efficient manner. Predicting actions and their progress in video is one example of this kind of interpretation. With the ability to predict action progress, intelligent agents can better anticipate future states and even choose to act preemptively; first predicting when an action might take place and then following its progress until it ends. This could be especially useful for tasks such as accident avoidance, intercepting objects mid-action (such as during lane changes or when catching projectiles), and much more. Becattini et al. [1] propose a deep architecture to do exactly this. Their architecture, named ProgressNet, is capable of “predicting when an action takes place in a video, where it is located within the frames, and how far it has progressed during its execution”. ProgressNet is an LSTM-based architecture attached to an off-the-shelf Faster R-CNN backbone. It learns to predict action progress by exploiting the temporal nature of actions, modelling them as time-based sequences of images and region-of-interest bounding boxes.


In this blog post, we aim to explain the workings of ProgressNet, summarizing its key components and notable novel contributions. We then present our reproduction of ProgressNet and discuss the reproduction methods, results, and associated analysis. This reproduction was carried out as a project for the course CS4240 Deep Learning.



Main Ideas & Novel Contributions


What is Action Progress?

As mentioned above, one of the key contributions of the paper is to predict the progress of actions in video. But what exactly is action progress? The authors propose two distinct definitions of progress based on two different classes of actions. They base their classification on linguistics literature, defining actions as either durative or punctual and either telic or atelic. A durative action is one that does not happen instantaneously. Along the same lines, telic actions have definite goals while atelic actions do not. For example, jumping from point A to B (within the video frame) is a durative action, while clapping hands is a punctual action. Walking, however, is an atelic action as its goal is not apparent from video alone, whereas jumping from point A to B is telic as the goal is defined by point B. Hence, by definition, the notion of action progress only exists for durative telic actions (those that are not instantaneous and have some apparent goal).


Knowing which actions can have a notion of progress is not enough; we must also define the characteristics of this progress. Are these actions continuous (say walking or jumping) or do they happen in phases (say running a hurdles race, which requires the athlete to run, jump, then run again)? The authors categorize actions as either linear or phase-based: linear actions progress throughout the video at a constant rate, while phase-based actions are composed of smaller linear actions separated by temporal boundaries. Both interpretations use the idea of an “action tube”, a sequence of bounding boxes (rather than whole video frames) enclosing the subject. Progress is predicted frame by frame and is tied to the bounding box rather than the whole video frame, which allows multiple actions from multiple subjects in the same video frame. Linear action progress is modeled as the percentage of completed tube frames, while phase-based progress is modeled with punctual actions demarcating phase transitions.
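To make these two notions concrete, here is a small sketch of how per-frame progress targets could be generated for a single action tube. The helper names and the equal-share interpolation for phases are our own illustration, not code from the paper.

def linear_progress_targets(tube_length):
    """Progress of frame t as the fraction of the tube completed so far."""
    return [(t + 1) / tube_length for t in range(tube_length)]

def phase_progress_targets(tube_length, phase_boundaries):
    """Piecewise-linear progress (our interpretation): each phase covers an
    equal share of [0, 1] and progress is interpolated linearly within it."""
    boundaries = [0] + list(phase_boundaries) + [tube_length]
    num_phases = len(boundaries) - 1
    targets = []
    for k in range(num_phases):
        start, end = boundaries[k], boundaries[k + 1]
        for t in range(start, end):
            within = (t + 1 - start) / (end - start)
            targets.append((k + within) / num_phases)
    return targets

# e.g. a 10-frame tube with a phase boundary after the 4th frame
print(linear_progress_targets(10))      # 0.1, 0.2, ..., 1.0
print(phase_progress_targets(10, [4]))  # 0.125, 0.25, 0.375, 0.5, 0.58, ..., 1.0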



Boundary Observant Loss

Boundary Observant Loss is a novel loss introduced in the paper, based on the idea that at phase boundaries and other clearly time-defined events we should tolerate less error (and hence apply a comparatively higher loss) on account of greater certainty, while instants that occur in between are less certain and may therefore be allowed more error. The boundary observant loss thus penalises errors closer to the action boundaries far more heavily, helping the model precisely learn when the action begins and ends. An illustration is shown below.

Figure 1 - An illustration of the boundary observant loss on linear and phase-based actions

The BO loss also applies, as a simplified case, to actions with linear progress. Phase boundaries in this case are taken to be the start and end points of the action, and the loss penalises errors closer to these points more severely. The loss is computed as an average weighted error based on the L1 norm, as follows.
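Written out from the paper's description (and matching our implementation further below, with notation ours), the loss over N samples with predictions p_i, targets y_i, and phase intervals with midpoints m_k and half-widths r_k is:

$$
\mathcal{L}_{BO} = \frac{1}{N} \sum_{i=1}^{N} \phi(p_i, y_i)\,\lvert p_i - y_i \rvert,
\qquad
\phi(p, y) = \min_{k} \, \min\!\left(1,\; \left(\frac{y - m_k}{\sqrt{2}\, r_k}\right)^{2} + \left(\frac{p - m_k}{\sqrt{2}\, r_k}\right)^{2}\right)
$$

The weighting term saturates at 1 near the phase boundaries and shrinks towards the middle of a phase, so the same absolute error is penalised more heavily near a boundary.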





ProgressNet Architecture

Figure 2 - ProgressNet architecture (credit: [1])

Figure 2 shows a flowchart of the ProgressNet architecture. The definitions of the illustrated blocks are as follows.

  • CONV5 - a pre-trained Faster R-CNN model that acts as a backbone for detecting (spatio-temporal classification and localization) actions in the frame. It returns a bounding box around the subject and a confidence value (class score). The progress prediction part of the architecture takes the output of CONV5 as its input. The Faster R-CNN model architecture and implementation are explained in detail in a later section.

  • Box Linking - the process of linking per-frame detections across time into action tubes, which are then overlaid on top of the original video frames.

  • SPP - a spatial pyramid pooling layer to summarise information from the original video frame. This encodes contextual information for arbitrarily sized images.

  • ROI - a region of interest pooling layer to summarise information from the detected bounding box region.

  • FC7 - a fully connected layer that takes in the concatenated outputs of SPP and ROI to "blend in" the action detection information.

  • LSTM - stacked long short-term memory (RNN) layers that generalize over the temporal information in the video. The LSTMs learn encodings from the whole history of detections in the action tube up until the current frame.

  • FC8 - a fully connected layer that learns to predict action progress from the LSTM representations of the action detections. This layer has a sigmoid activation and outputs a single progress value between 0 and 1.

Videos are treated as ordered sequences (with action tubes). ReLU activations are used after every fully connected layer, and dropout with probability 0.5 is applied on the fully connected layers to moderate overfitting. ProgressNet trains all layers except the pre-trained detector backbone. All layers are initialized from a uniform distribution as specified in [5]. The paper uses the Adam optimizer with a learning rate of 10^-4.
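A minimal sketch of this training setup, assuming the ProgressNet module we define later in this post; the use of PyTorch's xavier_uniform_ (the uniform scheme of [5]) on the linear layers is our guess at the intended initialization.

import torch
import torch.nn as nn

def init_weights(module):
    # Glorot/Xavier uniform initialization [5] for the trainable layers (our guess)
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        nn.init.zeros_(module.bias)

model = ProgressNet()  # defined further below; the detector backbone stays frozen
model.apply(init_weights)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)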


Evaluation Metrics

There are two main evaluation metrics that are used in the paper - Framewise Mean Squared Error (FMSE) and Average Progress Precision (APP).


The FMSE is used when the spatio-temporal coordinates of the action are known beforehand. It focuses solely on progress prediction, since it disregards the action detection itself. It is computed as the mean squared error between the predicted progress values and the ground truth progress targets.


The APP, meanwhile, is similar to the framewise average precision, with the difference that true positives must lie within certain bounds of the ground truth target: the IoU threshold is set to 0.5 to ensure at least 50% overlap, and the absolute difference between the predicted and ground truth progress must be lower than the respective progress margin. The APP is calculated for each class, and its mean is taken over the different margin values.
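The sketch below shows our reading of the two metrics; the per-frame true-positive test used inside APP is an assumption based on the description above, and the margin values themselves are set elsewhere.

import torch

def framewise_mse(pred_progress, target_progress):
    """FMSE: mean squared error between predicted and ground truth progress."""
    return torch.mean((pred_progress - target_progress) ** 2)

def is_true_positive(pred_box, gt_box, pred_progress, gt_progress, margin, iou_threshold=0.5):
    """Per-frame true-positive test used inside APP (our interpretation):
    at least 50% IoU with the ground truth box and a progress error
    smaller than the given margin."""
    x1 = max(pred_box[0], gt_box[0]); y1 = max(pred_box[1], gt_box[1])
    x2 = min(pred_box[2], gt_box[2]); y2 = min(pred_box[3], gt_box[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_p = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
    area_g = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    iou = inter / (area_p + area_g - inter)
    return iou >= iou_threshold and abs(pred_progress - gt_progress) < margin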




Our Reproduction


Scope

For the purpose of this reproduction, our aim is to create our own implementation of the ProgressNet architecture and to run some preliminary tests to tune the hyperparameters that are not explicitly mentioned in the paper. We also aim to carry out a comparative study of the boundary observant loss against the standard MSE loss in terms of the model's results. In the end, we aim to achieve the following.

  • A working reimplementation of the pre-existing Faster R-CNN backbone (originally implemented partially in Matlab) using PyTorch.

  • An independent reproduction of the ProgressNet network.

  • An independent reproduction of the boundary observant loss and the average mean squared error metric.

  • Train the implemented network as per the available computational resources.

Some small scale tests will also be carried out to obtain the following.

  • Hyperparameter tuning carried out using a smaller fraction of the UCF24 training dataset to obtain the parameters that are not explicitly mentioned in [1]. These include the number of units in the fully connected layers, the kernel size and stride used in the spatial pyramid pooling, and the output size of the region of interest pooling layer.

  • A comparison of BO loss with MSE loss in terms of the training metrics on the test dataset.


Faster R-CNN Implementation

Following in the footsteps of this work’s authors, in order to retrieve frame-wise bounding box detections for the actions, we decided to use a pre-trained Faster R-CNN as the backbone of our framework. To that end, we started from the pre-trained action detector proposed by Saha et al. in 2016 [3] and Singh et al. in 2017 [4], available from their repository.


After cloning the repository along with the model parameter file (rgb-ssd300_ucf24_120000.pth) and the UCF24 dataset from the developers’ Google Drive, we imported the pre-trained model along with the dataset and proceeded to get frame-wise detections.


import torch
from torchinfo import summary  # assumed source of the summary() helper
# build_ssd and the config constants come from the cloned detector repository [3, 4]

# load pre-trained model
net = build_ssd(SSD_DIM, NUM_CLASSES)  # initialize SSD
net.load_state_dict(torch.load(BASENET))

# print a summary of the loaded network's architecture
summary(net)

# generate per-frame detections
detections = detect_actions(net, dataset)  # our own helper that collects per-frame detections
=================================================================
Layer (type:depth-idx)                   Param #
=================================================================
SSD                                      --
├─ModuleList: 1-1                        --
│    └─Conv2d: 2-1                       1,792
│    └─ReLU: 2-2                         --
│    └─Conv2d: 2-3                       36,928
│    └─ReLU: 2-4                         --
│    └─MaxPool2d: 2-5                    --
│    └─Conv2d: 2-6                       73,856
│    └─ReLU: 2-7                         --
│    └─Conv2d: 2-8                       147,584
│    └─ReLU: 2-9                         --
│    └─MaxPool2d: 2-10                   --
│    └─Conv2d: 2-11                      295,168
│    └─ReLU: 2-12                        --
│    └─Conv2d: 2-13                      590,080
│    └─ReLU: 2-14                        --
│    └─Conv2d: 2-15                      590,080
│    └─ReLU: 2-16                        --
│    └─MaxPool2d: 2-17                   --
│    └─Conv2d: 2-18                      1,180,160
│    └─ReLU: 2-19                        --
│    └─Conv2d: 2-20                      2,359,808
│    └─ReLU: 2-21                        --
│    └─Conv2d: 2-22                      2,359,808
│    └─ReLU: 2-23                        --
│    └─MaxPool2d: 2-24                   --
│    └─Conv2d: 2-25                      2,359,808
│    └─ReLU: 2-26                        --
│    └─Conv2d: 2-27                      2,359,808
│    └─ReLU: 2-28                        --
│    └─Conv2d: 2-29                      2,359,808
│    └─ReLU: 2-30                        --
│    └─MaxPool2d: 2-31                   --
│    └─Conv2d: 2-32                      4,719,616
│    └─ReLU: 2-33                        --
│    └─Conv2d: 2-34                      1,049,600
│    └─ReLU: 2-35                        --
├─L2Norm: 1-2                            512
├─ModuleList: 1-3                        --
│    └─Conv2d: 2-36                      262,400
│    └─Conv2d: 2-37                      1,180,160
│    └─Conv2d: 2-38                      65,664
│    └─Conv2d: 2-39                      295,168
│    └─Conv2d: 2-40                      32,896
│    └─Conv2d: 2-41                      295,168
│    └─Conv2d: 2-42                      32,896
│    └─Conv2d: 2-43                      295,168
├─ModuleList: 1-4                        --
│    └─Conv2d: 2-44                      73,744
│    └─Conv2d: 2-45                      221,208
│    └─Conv2d: 2-46                      110,616
│    └─Conv2d: 2-47                      55,320
│    └─Conv2d: 2-48                      36,880
│    └─Conv2d: 2-49                      36,880
├─ModuleList: 1-5                        --
│    └─Conv2d: 2-50                      460,900
│    └─Conv2d: 2-51                      1,382,550
│    └─Conv2d: 2-52                      691,350
│    └─Conv2d: 2-53                      345,750
│    └─Conv2d: 2-54                      230,500
│    └─Conv2d: 2-55                      230,500
├─Softmax: 1-6                           --
=================================================================
Total params: 26,820,134
Trainable params: 26,820,134
Non-trainable params: 0
=================================================================

Faster R-CNN is a popular object detection algorithm introduced by Ren et al. in 2015 [2]. It was built on top of the previous state-of-the-art region-based convolutional neural network (R-CNN) and Fast R-CNN methods, and introduced a new way of generating region proposals, which makes it significantly faster than its predecessors. Faster R-CNN has achieved state-of-the-art results on various object detection benchmarks, and it is widely used in computer vision applications such as autonomous driving, robotics, and surveillance.


In Faster R-CNN, the region proposal network (RPN) is introduced, which generates object proposals by sliding a small network over the convolutional feature map of the input image. The RPN is trained to predict objectness scores and regression offsets, which are used to refine the proposals. The proposals are then fed into a region-based CNN, which performs classification and bounding box regression for each object proposal. The final output of Faster R-CNN is a list of object detections, each with a class label and bounding box coordinates. To leverage this output we had to iterate over every class in every frame and process the detections.


In the RPN, prior boxes (also known as anchor boxes) and loc layers are used to generate object proposals and refine their bounding box coordinates. Prior boxes are a set of fixed bounding boxes with different sizes and aspect ratios that are placed at each spatial location of the feature map produced by the convolutional layers. The RPN uses these prior boxes as reference to generate proposals by predicting the offsets between each prior box and the corresponding object bounding box.


The loc layers are responsible for computing the refined bounding box coordinates of each proposal generated by the RPN. Specifically, the loc layers take as input the features of each proposal and output four numbers that represent the offsets between the proposal and the ground truth bounding box of the object in the image. These offsets are then used to adjust the coordinates of the proposal bounding box and improve its localization accuracy.
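For illustration, the snippet below sketches the standard center-size decoding used in SSD-style detectors, which is what the repository's decode utility (called in the next code block) does to the best of our understanding; the variance values are the conventional scaling constants, not something taken from the paper.

import torch

def decode_boxes(loc, priors, variances=(0.1, 0.2)):
    """Turn predicted offsets (loc) and prior boxes in (cx, cy, w, h) form
    into corner-format boxes (x1, y1, x2, y2). A sketch of the usual SSD decode."""
    centers = priors[:, :2] + loc[:, :2] * variances[0] * priors[:, 2:]
    sizes = priors[:, 2:] * torch.exp(loc[:, 2:] * variances[1])
    return torch.cat((centers - sizes / 2, centers + sizes / 2), dim=1)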


First, the net’s output was split into loc data, prior boxes, and confidence predictions. Then, the prior boxes were refined using the loc data, and a softmax was applied to the confidence predictions to obtain meaningful scores.


def get_scores_and_boxes(output, net):
    """ Retrieve the confidence scores and bounding boxes
    from the net's output. """
    # split the output into:
    loc_data = output[0]  # loc layers' output
    conf_preds = output[1]  # confidence predictions
    prior_data = output[2]  # prior boxes

    # use the loc data to refine the prior boxes' coordinates
    # (decode and the v2 config come from the detector repository's utilities)
    decoded_boxes = decode(loc_data[0].data,
                           prior_data.data,
                           v2['variance']
                           ).clone()

    # apply softmax to the confidence predictions
    conf_scores = net.softmax(conf_preds[0]).data.clone()

    return conf_scores, decoded_boxes

Finally, the frame-wise detections were constructed for every single class after applying a confidence threshold and non-maximum suppression to the per-class confidence scores and decoded bounding boxes.


# filter the class scores with the confidence threshold
conf_mask = class_scores.gt(CONF_THRESH)
class_scores = class_scores[conf_mask].squeeze()

# filter the bounding boxes with the confidence threshold
l_mask = conf_mask.unsqueeze(1).expand_as(class_boxes)
class_boxes = class_boxes[l_mask].view(-1, 4)

# apply non-maximum suppression
# indices of top k highest scoring and non-overlapping
# boxes per class, after nms
ids, counts = nms(class_boxes,
                  class_scores,
                  NMS_THRESH,
                  TOP_K)

class_scores = class_scores[ids[:counts]].cpu().numpy()
class_boxes = class_boxes[ids[:counts]].cpu().numpy()

for ik in range(class_boxes.shape[0]):
    class_boxes[ik, 0] = max(0, class_boxes[ik, 0])
    class_boxes[ik, 2] = min(width, class_boxes[ik, 2])
    class_boxes[ik, 1] = max(0, class_boxes[ik, 1])
    class_boxes[ik, 3] = min(height, class_boxes[ik, 3])

# class_detections will be of shape:
# (classes) * (samples) * (# dets. in sample for class) * (5)
class_detections = np.hstack((
    class_boxes,
    class_scores[:, np.newaxis])
).astype(np.float32, copy=True)

return class_detections

The resulting detections array has the following shape, where the last dimension holds the four bounding box coordinates plus the class confidence score. The per-frame detections were subsequently pooled and used as input to ProgressNet.

num_classes x num_samples x num_detections_per_sample x 5 (x1, y1, x2, y2, score)
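As a small usage example (the index names are ours), the highest-scoring detection of a given class in a given frame can then be pulled out as follows, assuming the nested layout described above with a NumPy array of detections per class and frame.

# detections[class_idx][frame_idx] is an (N, 5) array of [x1, y1, x2, y2, score]
frame_dets = detections[class_idx][frame_idx]
if len(frame_dets) > 0:
    best = frame_dets[frame_dets[:, 4].argmax()]
    bbox, score = best[:4], best[4]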


Model & Associated Hyperparameters

The ProgressNet architecture was created using PyTorch and readily available layers. The architecture itself is quite straightforward and can be constructed as follows.

import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class ProgressNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.spp = nn.MaxPool2d((30, 30), stride=10)
        self.fc7 = nn.Linear(2928, 128)
        self.lstm1 = nn.LSTM(128, 64, num_layers=1)
        self.lstm2 = nn.LSTM(64, 32, num_layers=1)
        self.fc8 = nn.Linear(32, 1)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.5)

    def forward(self, x, bbox):
        '''
        x = image (3 x 300 x 300)
        bbox = list of tensors with x1, y1, x2, y2 as bbox coordinates
        '''
        z = self.spp(x.view(1, 3, 300, 300))                  # frame-level context
        y = roi_pool(x.view(1, 3, 300, 300), bbox, (16, 12))  # region-level feature
        x = torch.cat((z.flatten(), y.flatten())).view(1, -1)
        x = self.fc7(x)
        x = self.relu(x)
        x = self.dropout(x)
        x, (h_n, c_n) = self.lstm1(x.view(1, 1, 128))
        x, (h_n, c_n) = self.lstm2(x)
        x = self.fc8(x)
        return torch.special.expit(x)                         # sigmoid -> progress in (0, 1)
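As a quick sanity check of the dimensions (the dummy inputs below are ours): the SPP branch yields 3 x 28 x 28 = 2352 values on a 300 x 300 frame and the ROI branch 3 x 16 x 12 = 576, which together give the 2928 inputs expected by FC7.

frame = torch.rand(3, 300, 300)                       # dummy RGB frame
boxes = [torch.tensor([[50.0, 60.0, 200.0, 280.0]])]  # one (x1, y1, x2, y2) detection
model = ProgressNet()
progress = model(frame, boxes)
print(progress.shape, float(progress))                # torch.Size([1, 1, 1]), a value in (0, 1)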

In our attempt to independently reimplement the ProgressNet architecture, we found that the original paper is quite sparse in specifying the exact model parameters. The paper only mentions the number of units in the two LSTM layers, the activation functions, and dropout. However, the authors do not mention the number of units in the fully connected layers, the kernel size and stride of the spatial pyramid pooling layer, the output size of the region of interest pooling layer, or the regularization parameters. The paper is also quite hand-wavy about the purpose of the LSTM layers, their input characteristics (sequence lengths, padding, etc.), and how the LSTMs learn progress values. Nor does it mention batch-wise training or why it was most likely omitted.


This, coupled with the lack of any code or similar documentation, makes it quite difficult to reproduce their results. From our limited testing, and based on the image properties in the UCF24 dataset, we set these parameters to the following values. These are, however, unlikely to be the best-tuned values, and further experimentation on the complete dataset could lead to much better results.

  • Spatial Pyramid Pooling --> kernel_size = (30, 30) | stride = 10

  • Region of Interest Pooling --> output_size = (16,12)

  • FC7 --> in_dim = 2928 | out_dim = 128

  • LSTM 1 --> in_dim = 128 | sequence_len = 1 (to allow online inference) | hidden_dim = 64

  • LSTM 2 --> in_dim = 64 | sequence_len = 1 (to allow online inference) | hidden_dim = 32

  • FC8 --> in_dim = 32 | out_dim = 1

  • ProgressNet trained per-frame with batch size = 1

Apart from the architecture, we also independently implemented the boundary observant loss and the average progress MSE metric. These were implemented as independent modules that can be instantiated in the main training script. The BO loss was implemented in PyTorch by considering linear progress as a special case of phase-based progress whose phase boundaries are the start and end of the action tube. It was, however, implemented generically, assuming the existence of phase boundaries (passed as phase intervals), to enable easier adaptation to phase-based progress in the future.


def forward(self, predictions, targets, phase_intervals):
    # L1 error between predicted and target progress
    errors = torch.abs(predictions - targets)
    potentials = []

    for (l_k, u_k) in phase_intervals:
        m_k = (l_k + u_k) / 2   # phase midpoint
        r_k = (u_k - l_k) / 2   # phase half-width
        # boundary-observant weight: small near the phase midpoint,
        # saturating at 1 near the phase boundaries
        e_k = torch.min(
            torch.tensor(1.0),
            ((targets - m_k) / (r_k * torch.sqrt(torch.tensor(2.0)))) ** 2 +
            ((predictions - m_k) / (r_k * torch.sqrt(torch.tensor(2.0)))) ** 2)
        potentials.append(e_k)

    # for each sample, use the weight of the nearest (minimum-potential) phase
    min_potentials = torch.stack(potentials).min(dim=0).values
    bo_loss = torch.mean(min_potentials * errors)

    return bo_loss
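For the linear case we pass a single phase interval spanning the whole normalized tube, so the weight grows towards the start and end of the action. The module name below is hypothetical; it simply wraps the forward() shown above.

import torch

criterion = BOLoss()  # hypothetical wrapper module around the forward() above
predictions = torch.tensor([0.10, 0.45, 0.95])
targets = torch.tensor([0.20, 0.50, 0.80])
loss = criterion(predictions, targets, [(0.0, 1.0)])  # linear progress: one interval over the tube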

Finally, we used the splitfiles and annotation dictionaries from the UCF24 dataset (mentioned in the next section) to generate linear progress values. Since ProgressNet learns with a batch size of 1, linear progress is returned based on the current sample index (by comparing the sample index with the list of action tube durations). This trick lets us avoid saving ground truth progress as part of the dataset and instead compute it in an online fashion.

progress = linear_progess.get_progress_value(sample_idx) 
# where sample_idx is the index of the image being fed into ProgressNet (wrt the dataloader)
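A minimal sketch of what such an index-based lookup could look like, assuming we know the length of each action tube in dataloader order; the class and attribute names here are illustrative, not the actual helper in our repository.

import bisect

class LinearProgress:
    """Map a global sample index to a linear progress value in (0, 1],
    given the action tube lengths in dataloader order."""
    def __init__(self, tube_lengths):
        self.tube_lengths = tube_lengths
        self.cum_ends = []  # cumulative end index of each tube
        total = 0
        for length in tube_lengths:
            total += length
            self.cum_ends.append(total)

    def get_progress_value(self, sample_idx):
        tube = bisect.bisect_right(self.cum_ends, sample_idx)
        tube_start = self.cum_ends[tube] - self.tube_lengths[tube]
        frame_in_tube = sample_idx - tube_start
        return (frame_in_tube + 1) / self.tube_lengths[tube]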

Datasets

The dataset used to train ProgressNet in the original paper is UCF101, a dataset of 101 human actions. It is quite a large dataset and differs in its annotations from the one used in this reproduction. We use the UCF24 dataset, a comparatively smaller dataset with 24 action categories. More importantly, its actions are pre-annotated as bounding boxes over each video frame (as opposed to UCF101, where actions are annotated with human body pose). The authors of ProgressNet had to independently transform the UCF101 dataset to contain bounding boxes; we felt that this was out of scope for our reproduction. UCF24 is the same dataset that [3, 4] use to train the Faster R-CNN model that serves as the backbone for ProgressNet.


As an added benefit, using the UCF24 dataset to train our implementation of ProgressNet helps us avoid unforeseen spatio-temporal detection errors from the Faster R-CNN backbone, since it is more likely to return accurate predictions on data it has already been trained on. This reduces the influence of backbone detection errors on ProgressNet's training. Another advantage of UCF24 is that it comes pre-packaged with utilities and post-prediction functionality as part of the code for the Faster R-CNN backbone, including dataloaders, dataset generation functions, annotation-related utilities, bounding box generation utilities, and more. This lets us smoothly integrate the Faster R-CNN backbone and directly obtain bounding box predictions to pass on to ProgressNet. It also comes with an out-of-the-box dataset split for smoother training and testing; we leverage these splitfiles to directly obtain ground truth progress values from the dataloaders in an online fashion.


Results

Although UCF24 is comparatively smaller than UCF101, it is still quite large. The testing split contains approximately 160k individual images and the training set is approximately 5 times that size. Furthermore, as shown in the previous section, the Faster R-CNN backbone is a very large model, containing 40 independent layers with 26,820,134 parameters. From our tests, one epoch of inference with the Faster R-CNN model coupled with ProgressNet on only 20% of the test data takes approximately 40 minutes on an RTX 3060 laptop GPU.


The figure below shows the results of training ProgressNet on the training data (of a size equivalent to 20% of the test data) for 5 epochs (taking about 3.3 hours in total). The graph plots the training BO loss against training steps. Unfortunately, we could not get any reliable training results from this small training exercise. The loss fluctuates quite drastically and does not decrease or even stabilise.

Figure 3 - Loss vs training steps during one of our longer training experiments

However, when running this trained model on the test data during inference, the average MSE between the ground truth and the predicted progress is quite similar to the results reported in the ProgressNet paper.








Figure 4 - Results of our reimplementation of ProgressNet along with the results presented in the original paper


In the following section, we discuss these results and speculate on the possible reasons behind them. Ultimately, though, we believe that more conclusive results would be obtained by re-training this reimplementation of ProgressNet on the complete UCF24, or possibly the UCF101 dataset, through the use of cloud computing solutions.


Discussion


Upon discussion with the external supervisor, we believe that the poor training performance is most likely due to a few interpretation-related discrepancies, possibly exacerbated by a lack of computational resources to properly train the model. This section outlines our hypotheses and then presents ideas for future improvement.


One of the primary reasons for the poor results could be a discrepancy between the intended ProgressNet architecture and our interpretation of it. Figure 2 shows the architecture as specified in the paper. Our understanding is that the spatial pyramid pooling and region of interest pooling layers take the spatio-temporal detection outputs of the Faster R-CNN as their input. This is also what is mentioned in the paper and what seems to be shown in figure 2. The exact excerpt from the paper is as follows.


"We concatenate a contextual feature, computed by spatial pyramid pooling (SPP) of the whole frame [17], with a region feature extracted with ROI Pooling [45]"


However, the paper does not explain this any further and does not make clear what exactly is fed into these two layers. It could be the case that instead of taking the actual frame and region of interest as inputs, these pooling layers take the output of the final convolutional layer of the Faster R-CNN (the convolutional features). This could be what is depicted by the blue and orange highlighted boxes over the VGG in figure 2. The purpose of the Faster R-CNN may therefore not just be to obtain spatio-temporal detections but also to provide a feature representation of the image and its region of interest. Omitting this could have made the network quite shallow and could be the reason why our implementation of ProgressNet fails to learn progress values.
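To make this alternative reading concrete, here is a speculative sketch of a variant that pools backbone features rather than raw pixels; the channel count, feature map size, pooling sizes, and layer widths are illustrative choices on our part, not values from the paper.

import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class ProgressNetFromFeatures(nn.Module):
    """Speculative variant: SPP and ROI pooling operate on the detector's
    convolutional feature map instead of the raw frame."""
    def __init__(self, in_channels=512, feat_size=38, image_size=300):
        super().__init__()
        self.scale = feat_size / image_size        # maps image coords to feature coords
        self.spp = nn.MaxPool2d((6, 6), stride=4)  # 38x38 -> 9x9 context summary
        self.roi_size = (7, 7)
        in_dim = in_channels * 9 * 9 + in_channels * 7 * 7
        self.fc7 = nn.Linear(in_dim, 128)
        self.lstm1 = nn.LSTM(128, 64)
        self.lstm2 = nn.LSTM(64, 32)
        self.fc8 = nn.Linear(32, 1)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.5)

    def forward(self, features, bbox):
        # features: (1, C, 38, 38) conv map from the backbone
        # bbox: list of tensors with (x1, y1, x2, y2) boxes in image coordinates
        z = self.spp(features)
        y = roi_pool(features, bbox, self.roi_size, spatial_scale=self.scale)
        x = torch.cat((z.flatten(), y.flatten())).view(1, -1)
        x = self.dropout(self.relu(self.fc7(x)))
        x, _ = self.lstm1(x.view(1, 1, 128))
        x, _ = self.lstm2(x)
        return torch.special.expit(self.fc8(x))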


Apart from the interpretation discrepancies, ProgressNet takes quite some time and resources to train. As mentioned before, it took us about 40 minutes to finish one epoch of training on a dedicated GPU with only a small fraction of the UCF24 dataset. A reason for this could be that the batch size is set to 1. This means that we feed in only one frame at a time and potentially do not fully utilise the GPU. The primary reason for setting the batch size to 1 was to replicate the paper accurately. Similar to the previous issue, the paper does not mention any batch size parameters or the nuances of batch-wise training. It was unclear to us whether batches would have to be sampled across tubes or within tubes. From what we gather, if the sequence length is set to 1 -- this was also not mentioned in the paper and was something we inferred, seeing as images are fed in one at a time -- then a batch would have to be sampled across multiple action tubes (i.e. taking the first frame of n tubes, then the second frame, and so on). This would require us to modify the Faster R-CNN dataloader, which we think is out of scope for this reproduction.


It is also possible that we simply did not train the network sufficiently. The bulky nature of the Faster R-CNN model, coupled with the lack of training speed benefits that batch-wise training could have brought, reduced our training capabilities in general. It is possible that training for longer on the complete dataset could have improved performance. Unfortunately, due to a shortage of time and difficulty setting up cloud computing resources, we were not able to train the model further.


Finally, we would like to comment on the surprisingly close match with the results mentioned in the original paper. Unfortunately, in our opinion, this is just happenstance, and there are multiple reasons why these results are not statistically significant. First, the original implementation of ProgressNet was trained and tested on a different dataset than ours. It is unlikely that the two come from the same generating distribution, so results on one cannot indicate an "accurate" reproduction based on results on the other. Second, our version of the network was trained for a substantially shorter duration on a fraction of the total dataset. Finally, it could also be the case that average progress MSE is not a very good indicator of actual model performance; two models with similar error metrics may perform completely differently. From our testing during inference, our version of the network does not reliably predict progress. However, these poor results also do not provide enough evidence to simply discount our reproduction. As highlighted in the previous paragraphs, further training on a complete dataset could indeed improve progress prediction with our network. Further, the discrepancies related to interpreting the original paper shed light on a series of "blind spots" that make it quite challenging to reproduce the original work. This serves as an example of why code submissions should be given higher importance in submissions to scientific journals and conferences. It also highlights the importance of implementation-related sections in academic papers and the consequences of not being explicit about experiment and architecture details. Nevertheless, with the available time and resources, we believe that our reproduction of ProgressNet is fairly close to the upper limit of feasible reproduction closeness.


Future Work

At present, we are working on the following improvements. These are included as future work as they are a bit out of scope considering the time constraints of the Deep Learning course. However, we will try to push these changes as soon as possible.

  • Changes to include an option to take the output of the Faster R-CNN's last convolutional layer as input to the SPP and ROI pooling layers

  • Modifications to the dataloader to sample batches across action tubes

  • Modifications to the dataloader to replicate the data augmentation strategies presented in the paper: first, sampling action tubes with random start and end durations so that the network does not just learn to predict progress as 0% at the start and 100% at the end; and second, sampling action tubes with random frame rates so that the network does not learn to predict action progress at the same rate for all actions

  • Further testing to compare the BO loss with the MSE loss


References


[1] Becattini F, Uricchio T, Seidenari L, Ballan L, Bimbo AD. Am I done? Predicting action progress in videos. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM). 2020 Dec 16;16(4):1-24.

[2] He K, Zhang X, Ren S, Sun J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE transactions on pattern analysis and machine intelligence. 2015 Jan 9;37(9):1904-16.

[3] Saha S, Singh G, Sapienza M, Torr PH, Cuzzolin F. Deep learning for detecting multiple space-time action tubes in videos. arXiv preprint arXiv:1608.01529. 2016 Aug 4.

[4] Singh G, Saha S, Sapienza M, Torr PH, Cuzzolin F. Online real-time multiple spatiotemporal action localisation and prediction. In Proceedings of the IEEE International Conference on Computer Vision 2017 (pp. 3637-3646).

[5] Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics 2010 Mar 31 (pp. 249-256). JMLR Workshop and Conference Proceedings.

