CYCLO: Cyclic Graph Transformer Approach to Multi-Object Relationship Modeling in Aerial Videos

Anonymous Author(s)

Highlights

  • We introduce a new AeroEye dataset for VidSGG in drone videos, augmented with numerous predicates and diverse scenes to capture the complex relationships in aerial videos.
  • We propose the CYCLO approach, which uses circular connectivity among frames to capture periodic and overlapping relationships. It allows the model to capture long-range dependencies and process object interactions in the correct temporal order.
  • The proposed CYCLO approach outperforms prior methods on two large-scale in-the-wild VidSGG datasets, namely OpenPVSG and ASPIRe.

Abstract

Video scene graph generation (VidSGG) has emerged as a transformative approach to capturing and interpreting the intricate relationships among objects and their temporal dynamics in video sequences. In this paper, we introduce the new AeroEye dataset, which focuses on multi-object relationship modeling in aerial videos. AeroEye features a rich variety of drone scenes and includes a comprehensive collection of predicates that capture the intricate relationships and spatial arrangements among objects. To model these relationships, we propose the novel Cyclic Graph Transformer (CYCLO) approach, which captures both direct and long-range temporal dependencies by continuously updating the history of interactions in a circular manner. The proposed approach also handles sequences with inherent cyclical patterns and processes object relationships in the correct sequential order. It can therefore capture periodic and overlapping relationships while minimizing information loss. Extensive experiments on the AeroEye dataset demonstrate the effectiveness of the proposed CYCLO model and its potential for scene understanding in drone videos. Finally, the CYCLO method consistently achieves state-of-the-art (SOTA) results on two in-the-wild benchmarks, i.e., OpenPVSG and ASPIRe.
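To make the idea of circular connectivity among frames concrete, the following is a minimal illustrative sketch, not the CYCLO implementation. It assumes per-frame features are plain numbers and replaces attention with a simple average; the function names `cyclic_history` and `cyclic_update` and the `window` parameter are hypothetical and do not come from the paper.

```python
def cyclic_history(num_frames: int, t: int, window: int) -> list[int]:
    """Indices of the `window` frames preceding frame t, wrapping around
    the end of the sequence (hypothetical helper, for illustration only)."""
    return [(t - k) % num_frames for k in range(1, window + 1)]


def cyclic_update(features: list[float], window: int) -> list[float]:
    """Toy stand-in for a temporal update: each frame is averaged with its
    cyclic history. A real model would use learned attention instead."""
    num_frames = len(features)
    updated = []
    for t in range(num_frames):
        indices = [t] + cyclic_history(num_frames, t, window)
        updated.append(sum(features[i] for i in indices) / len(indices))
    return updated


if __name__ == "__main__":
    # Frame 0 looks back to the last frames of the clip because indices wrap.
    print(cyclic_history(num_frames=5, t=0, window=2))
    print(cyclic_update([1.0, 2.0, 3.0, 4.0], window=1))
```

Because the indices wrap, the first frame of a clip still receives a history, which is one simple way to picture how a circular arrangement keeps periodic interactions connected.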

Qualitative Results

Scene graphs generated by the CYCLO model on the AeroEye dataset.

Data Statistics

AeroEye includes videos from the MAVREC and ERA datasets, along with our own object detection, tracking, and relationship instance annotations.
2.3K videos, 261.5K frames, 2.2M bounding boxes, 43M relationship instances
Statistics for each scene on the AeroEye dataset.
Relationship word cloud on the AeroEye dataset.

Data Annotations

We provide a subset of the AeroEye dataset. The complete AeroEye dataset will be made publicly available upon acceptance of the paper. [Download]

Comparison

Experimental Results

Results on AeroEye Dataset
Our CYCLO approach is compared against baseline methods on three tasks: Predicate Classification (PredCls), Scene Graph Classification (SGCls), and Scene Graph Detection (SGDet).
  • Comparison (mean ± std) against baseline methods at Recall (R). The best results are in bold.
  • Comparison (mean ± std) against baseline methods at mean Recall (mR). The best results are in bold.
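For readers unfamiliar with the two metrics, the sketch below shows how Recall@K and mean Recall@K are commonly computed for scene graph triplets. It is a simplified illustration under stated assumptions, not the evaluation code used for these results: it scores a single sample, matches (subject, predicate, object) triplets by exact label equality, and skips the bounding-box or mask matching that SGCls and SGDet require. The function names and the example triplets are hypothetical.

```python
from collections import defaultdict

Triplet = tuple[str, str, str]  # (subject, predicate, object)


def recall_at_k(gt: list[Triplet], ranked_preds: list[Triplet], k: int) -> float:
    """Fraction of ground-truth triplets found among the top-k predictions."""
    if not gt:
        return 0.0
    top_k = set(ranked_preds[:k])
    return sum(1 for triplet in gt if triplet in top_k) / len(gt)


def mean_recall_at_k(gt: list[Triplet], ranked_preds: list[Triplet], k: int) -> float:
    """Recall computed per predicate class, then averaged over the classes
    present in the ground truth, so rare predicates count as much as common ones."""
    if not gt:
        return 0.0
    top_k = set(ranked_preds[:k])
    total = defaultdict(int)
    hits = defaultdict(int)
    for triplet in gt:
        predicate = triplet[1]
        total[predicate] += 1
        if triplet in top_k:
            hits[predicate] += 1
    return sum(hits[p] / total[p] for p in total) / len(total)


if __name__ == "__main__":
    # Made-up triplets for illustration only.
    gt = [("car", "behind", "truck"), ("person", "near", "car"), ("person", "near", "bike")]
    preds = [("person", "near", "car"), ("car", "behind", "truck"),
             ("dog", "on", "road"), ("person", "near", "bike")]
    print(recall_at_k(gt, preds, k=2), mean_recall_at_k(gt, preds, k=2))
```

In this toy example, two of three ground-truth triplets are recovered at K=2, while the per-predicate average is higher because the single "behind" instance is fully recovered. The two metrics can diverge in this way when predicate frequencies are imbalanced.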

Results on PVSG Dataset
  • Comparative performance (%) of our model and previous methods on the PVSG dataset, evaluated by Recall (R) and mean Recall (mR). The best results are in bold.

Results on ASPIRe Dataset
  • Comparative performance (%) of our model and previous methods on the ASPIRe dataset, evaluated by Recall (R) and mean Recall (mR). The best results are in bold.