SSGR-AR: Semantic-Enhanced Scene Graph Reasoning for Robust Video Action Recognition
Published 2025 in 2025 IEEE International Conference on Knowledge Graph (ICKG)
ABSTRACT
Due to the inherent complexity of video data, video action recognition faces significant challenges in modeling spatial-temporal dynamics and handling diverse scene contexts. Although scene graph-based methods can effectively model interactions between entities, most existing approaches overlook the rich semantic information embedded within scene graphs. Additionally, integrating large language models (LLMs) for semantic enhancement often suffers from hallucination, potentially introducing incorrect reasoning that misleads action recognition. To address these limitations, we propose SSGR-AR, a novel framework that structurally represents videos through scene graphs and constrains LLM reasoning using structured semantic paths derived from scene graph knowledge, ensuring controllable and reliable semantic enrichment. Moreover, we formulate entity alignment as a link prediction task and leverage a graph transformer to model the dynamic evolution of actions, thereby enhancing the model's capacity for long-term temporal reasoning. Experimental results on three widely used benchmark datasets show that our method outperforms state-of-the-art methods in both action recognition accuracy and generalization robustness.
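The abstract's idea of casting cross-frame entity alignment as link prediction can be sketched minimally as follows. This is an illustrative assumption, not the authors' implementation: entity names, embeddings, the cosine score, and the threshold are all hypothetical stand-ins for the paper's learned graph-transformer scoring.

```python
# Hedged sketch: entity alignment as link prediction. Entities detected in
# consecutive frames are represented as embedding vectors; a link ("same
# entity") is predicted when a similarity score clears a threshold. The
# similarity function and greedy matching are illustrative assumptions.
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def predict_links(frame_t, frame_t1, threshold=0.8):
    """Greedily link each entity in frame t to its best-scoring candidate
    in frame t+1, keeping the link only if the score clears the threshold."""
    links = []
    for name_t, emb_t in frame_t.items():
        best_name, best_emb = max(
            frame_t1.items(), key=lambda kv: cosine(emb_t, kv[1])
        )
        if cosine(emb_t, best_emb) >= threshold:
            links.append((name_t, best_name))
    return links

# Toy example: embeddings of the same real-world entity are near-identical
# across frames, so both entities are linked despite renamed detections.
frame_t = {"person": [1.0, 0.1, 0.0], "cup": [0.0, 1.0, 0.2]}
frame_t1 = {"person_a": [0.9, 0.1, 0.0], "mug": [0.1, 1.0, 0.1]}
print(predict_links(frame_t, frame_t1))
```

In the paper's setting, the cosine score would be replaced by a learned link-prediction head over graph-transformer node representations, which also lets the model track how entity interactions evolve over longer time spans.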
PUBLICATION RECORD
- Publication year
2025
- Venue
2025 IEEE International Conference on Knowledge Graph (ICKG)
- Publication date
2025-11-13
- Source metadata
Semantic Scholar