Path-adaptive Spatio-Temporal State Space Model for Event-based Recognition with Arbitrary Duration (2025)

Jiazhou Zhou1,3  Kanghao Chen1  Lei Zhang3  Lin Wang1,2
1AI Thrust, HKUST(GZ)
2Dept. of CSE, HKUST
3 International Digital Economy Academy (IDEA)
{jzhou297,kchen879}@connect.hkust-gz.edu.cn, leizhang@idea.edu.cn, linwang@ust.hk
Corresponding Author

Abstract

Event cameras are bio-inspired sensors that capture intensity changes asynchronously and output event streams with distinct advantages, such as high temporal resolution. To exploit event cameras for object/action recognition, existing methods predominantly sample and aggregate events over second-level durations at every fixed temporal interval (or frequency). However, they often face difficulties in capturing the spatiotemporal relationships of longer, e.g., minute-level, events and in generalizing across varying temporal frequencies. To fill the gap, we present a novel framework, dubbed PAST-SSM, exhibiting superior capacity in recognizing events of arbitrary duration (e.g., 0.1s to 4.5min) and generalizing to varying inference frequencies. Our key insight is to learn the spatiotemporal relationships from the encoded event features via the state space model (SSM), whose linear complexity makes it ideal for modeling high-temporal-resolution events with longer sequences. To achieve this goal, we first propose a Path-Adaptive Event Aggregation and Scan (PEAS) module to encode events of varying duration into features with fixed dimensions by adaptively scanning and selecting aggregated event frames. On top of PEAS, we introduce a novel Multi-faceted Selection Guiding (MSG) loss to minimize the randomness and redundancy of the encoded features. This subtly enhances the model generalization across different inference frequencies. Lastly, the SSM is employed to better learn the spatiotemporal properties from the encoded features. Moreover, we build a minute-level event-based recognition dataset, named ArDVS100, with arbitrary duration for the benefit of the community. Extensive experiments prove that our method outperforms prior arts by +3.45%, +0.38% and +8.31% on the DVS Action, SeAct, and HARDVS datasets, respectively. In addition, it achieves an accuracy of 97.35%, 89.00%, and 100.00% on our ArDVS100, TemArDVS100, and Real-ArDVS10 datasets, respectively, with durations ranging from 1s to 265s. Our method also shows strong generalization with a maximum accuracy drop of only 8.62% across varying inference frequencies, while the baseline's drop reaches 27.59%. Project page: https://vlislab22.github.io/pastssm/.

1 Introduction

Event cameras are bio-inspired sensors that trigger signals when the relative intensity change exceeds a threshold, adapting to scene brightness, motion, and texture. Compared with standard cameras, event cameras output asynchronous event streams instead of frames at fixed rates. They offer distinct advantages, such as high dynamic range, microsecond temporal resolution, and low latency (Gallego et al., 2020; Zheng et al., 2023). Due to these merits, event cameras have been applied to various vision tasks, such as object/action recognition (Deng et al., 2024; Cannici et al., 2020; Klenk et al., 2024; Zheng & Wang, 2024; Zhou et al., 2024; Sabater et al., 2022; de Blegiers et al., 2023; Gao et al., 2023).

The spatiotemporal richness of events introduces complexities in data processing and necessitates models that can efficiently process and interpret them. To address this problem, existing methods predominantly sample and aggregate events at every fixed temporal interval, i.e., frequency. In this way, the raw stream can be converted into dense representations (Zhou et al., 2023; Zubic et al., 2024; Bi et al., 2020; Sabater et al., 2022) akin to multi-channel images. In general, existing methods mainly follow two representative model structures: (a) step-by-step structure models (Xie et al., 2024; Yao et al., 2021; Zhou et al., 2024; 2023; Zheng & Wang, 2024; Kim et al., 2022) and (b) recurrent structure models (Sabater et al., 2022; Zubić et al., 2023). The former processes all time-step event frames in parallel, employing local-range and long-range temporal modeling sequentially, as shown in Fig.2 (a). By contrast, the latter processes event frames sequentially at each time step, updating a memory feature that affects the next input, as illustrated in Fig.2 (b).

[Figure 1]

However, both models face two pivotal challenges, as shown in Fig.1. 1) Limited temporal duration. Our world tells an ongoing story about people and objects and how they interact (Wu & Krahenbuhl, 2021). This indicates that recognizing event streams of arbitrary duration is more practical and beneficial for real-world scenarios. However, existing methods often struggle with longer, e.g., minute-level, spatiotemporal relationships of events: step-by-step structure models face high computational complexity with long events, while recurrent models tend to forget the initial information and require longer training times. 2) Limited generalization to varying frequencies. The performance of existing recognition models declines significantly at inference frequencies that differ from those used during training, which is crucial for high-speed, dynamic visual scenarios (Zubic et al., 2024). For example, as illustrated in Fig.7, existing event sampling strategies exhibit poor generalization when evaluated at both higher and lower sampling frequencies, with a maximum performance drop of 27.59%.

Recently, the selective state space model (SSM) has rivaled previous backbones such as the vision transformer in performance while offering a significant reduction in memory usage and linear complexity, as evidenced by Mamba (Gu & Dao, 2023), Vision Mamba (Zhu et al., 2024), and VideoMamba (Li et al., 2024) in the language, image, and video modalities. Given the inherently long sequences arising from the event stream's high temporal resolution, a natural motivation arises for harnessing the exceptional power of the SSM for event spatiotemporal modeling with linear complexity. This prompts us to explore an interesting question: how can we effectively recognize events of arbitrary duration (e.g., second-level to minute-level) while generalizing across varying inference frequencies with an SSM backbone? To this end, we propose PAST-SSM, a novel framework for recognizing event streams of arbitrary duration (0.1s to 4.5min), as depicted in Fig.1. By harnessing the linear complexity of the SSM, PAST-SSM delivers exceptional recognition performance and frequency generalization. PAST-SSM brings two key technical breakthroughs.

Firstly, the number of aggregated event frames can vary dramatically due to the high temporal resolution of events. For example, if events lasting between 0.1s and 300s are sampled at 50Hz (every 0.02s), the number of resulting frames can range from 5 to 15,000. This variability makes it difficult for the SSM to effectively learn the spatiotemporal properties of events, as the SSM's hidden state updates rely heavily on the sequence length and feature order. To this end, we propose a novel Path-Adaptive Event Aggregation and Scan (PEAS) module to encode events of arbitrary duration into sequence features with fixed dimensions. Concretely, as shown in Fig.3, a selection mask is first learned from the original event frames to facilitate frame selection. Then a bidirectional event scan is conducted on the selected frames to convert them into sequence features. This adaptive process ensures the event scan path is end-to-end learnable and responsive to every event input, thus enabling PAST-SSM to effectively process event streams of arbitrary duration (Tab.4).

Secondly, varying sampling frequencies hinder the framework's generalization during inference, as empirically verified in Tab.8. This suggests that alterations in the input sequence order, resulting from changes in sampling frequency, significantly impact model performance. For this reason, we propose a novel Multi-faceted Selection Guiding (MSG) loss. It minimizes the randomness of the event frame selection caused by the random initialization of the selection mask's weights. As shown in Fig.5, our MSG loss helps alleviate the redundancy of the selected event frames, thus enhancing the effectiveness of the SSM optimization. Meanwhile, it also strengthens the generalization of the SSM model across varying inference frequencies (Tab.8).

Given the absence of datasets for minute-level event-based recognition, we collected the ArDVS100 dataset (1s to 265s) and the more challenging TemArDVS100 dataset (14s to 215s) with temporally fine-grained classes, each containing event streams across 100 classes created through direct concatenation. Besides, we recorded the Real-ArDVS10 dataset, which includes real-world events from 2s to 75s across 10 classes. We believe they will strengthen the evaluation of recognizing event streams of arbitrary duration and inspire further research in this field. We conduct extensive experiments to evaluate PAST-SSM on four publicly available datasets, showing superior or competitive performance with fewer model parameters. For example, it outperforms previous methods by +3.45%, +0.38%, and +8.31% on the DVS Action, SeAct, and HARDVS datasets, respectively. Meanwhile, it achieves 97.35%, 100.00%, and 89.00% Top-1 accuracy on our proposed ArDVS100, Real-ArDVS10, and TemArDVS100 datasets, respectively. Additionally, PAST-SSM shows strong generalization with a maximum performance drop of only 8.62% across varying inference frequencies, compared to 27.59% for the previous sampling method.

2 Related Works

[Figure 2]

Event-based Object / Action Recognition. Existing event-based recognition works cover two main tasks based on the event's duration: object recognition (Zhou et al., 2023; Zheng & Wang, 2024; Gallego et al., 2020; Kim et al., 2021; Zheng et al., 2023; Gehrig et al., 2019; Gu et al., 2020; Deng et al., 2022a; Li et al., 2021; Liu et al., 2022) and action recognition (Zhou et al., 2024; Xie et al., 2024; Sabater et al., 2022; Xie et al., 2023; Gao et al., 2023; Plizzari et al., 2022; Xie et al., 2022; Liu et al., 2021). Specifically, events for object recognition capture stationary objects with durations from 0.1s to 0.3s, whereas action recognition records dynamic human actions over a longer duration (avg. 1-10s). Methods for modeling the spatiotemporal relationships of events with varying duration can be structurally categorized into two types, as shown in Fig.2: 1) step-by-step structure models and 2) recurrent structure models. Initially, the events are sampled into slices at fixed time intervals. Step-by-step structure models then use off-the-shelf backbones to extract local-range spatiotemporal features from event slices and perform long-range temporal modeling using various methods, such as simple averaging (Zhou et al., 2024; 2023), purpose-built modules (Xie et al., 2024; Yao et al., 2021), and loss guidance (Zheng & Wang, 2024; Kim et al., 2022). Recurrent structure models (Sabater et al., 2022; Zubić et al., 2023), on the other hand, process the event slices sequentially, updating their hidden state based on the input at each time step. Both structures ensure adaptability to varying time durations. However, step-by-step structure models struggle with high computational complexity when handling longer-duration events, such as those at minute-level granularity, while recurrent structure models tend to forget the initial information due to their simplistic recurrent design and require longer training time because of their inability to process data in parallel. Additionally, as evidenced in Tab.8, existing methods struggle to generalize across different inference frequencies, which is essential for applications in high-speed, dynamic visual scenarios (Zubic et al., 2024). In this work, we aim to improve event-based recognition for minute-level duration with improved generalization across varying inference frequencies.

State Space Model (SSM). SSMs have recently demonstrated considerable effectiveness in capturing the dynamics and dependencies of long sequences. Various models have been developed, such as S4D (Gu et al., 2022), S5 (Smith et al., 2022), S6 (Wang et al., 2023), and H3 (Fu et al., 2022). Compared to transformers (Brown et al., 2020; Lu et al., 2019), which rely on quadratic-complexity attention, SSMs excel at processing long sequences with linear complexity. Mamba (Gu & Dao, 2023) stands out by introducing a data-dependent SSM layer, a selection mechanism, and hardware-level performance optimizations, employing parallel scanning during training and recurrent computation during evaluation. It has motivated a series of works in the vision (Zhu et al., 2024), video (Li et al., 2024), and point cloud (Zhang et al., 2024) domains. Recently, there has been growing interest in exploring the temporal modeling capabilities of SSMs for event data, given the high temporal resolution of event cameras (Zubic et al., 2024). Specifically, Zubic et al. (2024) first integrate several SSMs with a recurrent ViT framework for event-based object detection, enhancing adaptability to varying sampling frequencies via a low-pass band-limiting loss. However, this approach overlooks generalization across different event durations and achieves unsatisfactory performance in sampling-frequency generalization. In contrast, our work recognizes event streams of arbitrary duration with an SSM by employing a path-adaptive event aggregation and scan module, while generalizing over varying inference frequencies.

3 Preliminaries

Event Stream. Event cameras capture object motion by recording pixel-level log-intensity changes, rather than capturing full frames at fixed intervals as conventional cameras do. The asynchronous events, denoted as $\mathcal{E}=\{e_i=(x_i, y_i, t_i, p_i)\},\ i=1,2,\dots,N$, reflect the brightness change $e_i$ for a pixel at timestamp $t_i$, with coordinates $(x_i, y_i)$ and polarity $p_i \in \{1, -1\}$ (Gallego et al., 2020; Zheng et al., 2023), where 1 and -1 represent positive and negative brightness changes, respectively. Refer to the appendix for more details about the principle of event cameras.

SSM for Vision. SSMs (Gu et al., 2022; Smith et al., 2022; Fu et al., 2022; Wang et al., 2023) originate from the principles of continuous systems that map an input 1D sequence $x(t) \in \mathbb{R}^{L}$ into an output sequence $y(t) \in \mathbb{R}^{L}$ through an underlying hidden state $h(t) \in \mathbb{R}^{N}$. Specifically, the system is formalized as $dh(t)/dt = Ah(t) + Bx(t)$ and $y(t) = Ch(t) + Dx(t)$, where $A \in \mathbb{R}^{N \times N}$, $B \in \mathbb{R}^{N \times 1}$, $C \in \mathbb{R}^{N \times 1}$, and $D \in \mathbb{R}^{N \times 1}$ are the state matrix, the input projection matrix, the output projection matrix, and the feed-forward matrix, respectively. Refer to the appendix for more technical details.
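For reference, a minimal sketch of how such a continuous system is typically discretized before being applied to token sequences, in the zero-order-hold form used by S4-style models and Mamba (the exact discretization adopted here is deferred to the appendix, so this is stated only as the standard construction with step size $\Delta$):

$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B,$$
$$h_{k} = \bar{A}\, h_{k-1} + \bar{B}\, x_{k}, \qquad y_{k} = C\, h_{k} + D\, x_{k}.$$

The recurrence yields linear-time sequence processing at inference, while an equivalent global convolution (or parallel scan) enables parallel training.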

[Figure 3]

4 Proposed Method

Overview. The PAST-SSM framework, as depicted in Fig.3, processes arbitrary-duration events using our PEAS module, followed by the SSM's spatiotemporal modeling to predict various recognition outcomes, including objects, actions, and event streams of arbitrary duration. It comprises two components: 1) the PEAS module introduced in Sec.4.1, which performs event sampling, frame aggregation, path-adaptive event selection, and a bidirectional event scan to encode events into sequence features with fixed dimensions. On top of PEAS, the MSG loss $\mathcal{L}_{MSG}$ detailed in Sec.4.3 is proposed to minimize the randomness and redundancy of the encoded features; and 2) the event spatiotemporal modeling module discussed in Sec.4.2, which predicts the final recognition results. The following subsections provide a detailed description of these components.

4.1 Path-adaptive Event Aggregation and Scan (PEAS) Module

We aim to recognize event streams of arbitrary duration. Following previous frame-based event representation methods, the events are preprocessed into aggregated event frames. Given the high temporal resolution of events, the number of aggregated event frames $P$ for arbitrary duration may vary significantly. For example, if events lasting between 0.1s and 300s are sampled at 50Hz (every 0.02s), the number of resulting frames can range from 5 to 15,000. This variability introduces complexity for spatiotemporal event modeling. Additionally, due to the SSM's recurrent nature, its hidden state update is greatly affected by the input sequence length and feature order, especially when modeling long-range temporal dependencies. To reduce this variability, we propose our PEAS module, which consists of the following four components to encode events of arbitrary duration into sequence features with fixed dimensions in an end-to-end learnable manner.

Event Sampling and Frame Aggregation. Unlike sequential language with compact semantics, the events $\mathcal{E}=\{e_i=(x_i, y_i, t_i, p_i)\} \in \mathbb{R}^{N \times 4},\ i=1,2,\dots,N$ denote asynchronous intensity changes at pixels $(x_i, y_i)$ at times $t_i$ with polarity $p_i \in \{1, -1\}$. The complexity of spatiotemporal event data requires efficient processing of this high-dimensional data. Following previous methods (Zhou et al., 2023; Zubic et al., 2024; Bi et al., 2020; Sabater et al., 2022), we sample events with duration $T$ at every fixed temporal window $1/f$, where $f$ denotes the sampling frequency, e.g., a 50 ms time window $1/f$ corresponds to a sampling frequency of $f = 20$ Hz. We group a number of events $G$ at each sampling time, as shown in Fig.4 (b). This sampling method is more effective and robust than grouping events within fixed time windows as illustrated in Fig.4 (a), as evidenced in Sec.5.3. Therefore, we obtain $P = Tf$ event groups $\mathcal{E}^{\prime} \in \mathbb{R}^{P \times G \times 4}$. Then, we utilize the event frame representation (Zhou et al., 2023) to transform the event groups $\mathcal{E}^{\prime}$ into a series of event frames $F \in \mathbb{R}^{P \times H \times W \times 3}$. This transformation enables the use of traditional computer vision methods designed for frame-based data.

[Figure 4]
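To make the sampling step concrete, below is a minimal NumPy sketch of count-based grouping followed by frame aggregation. The function name, the "take the G events after each sampling time" rule, and the simple per-polarity accumulation are our assumptions; the paper itself uses the event-frame representation of Zhou et al. (2023), which this only approximates.

```python
import numpy as np

def sample_and_aggregate(events, duration, f, G, H, W):
    """Sketch of event sampling and frame aggregation (our naming, not the authors' code).

    events: (N, 4) array with columns (x, y, t, p), sorted by timestamp t (seconds), p in {1, -1}.
    Returns P = duration * f aggregated event frames of shape (H, W, 3).
    """
    P = int(duration * f)                         # number of sampling times
    sample_times = np.arange(P) / f               # one sampling time every 1/f seconds
    frames = np.zeros((P, H, W, 3), dtype=np.float32)
    for k, t0 in enumerate(sample_times):
        # take the G events that follow the k-th sampling time (assumption)
        start = np.searchsorted(events[:, 2], t0)
        group = events[start:start + G]
        x, y, p = group[:, 0].astype(int), group[:, 1].astype(int), group[:, 3]
        # simple per-polarity count accumulation into two channels
        np.add.at(frames[k, :, :, 0], (y, x), (p > 0).astype(np.float32))
        np.add.at(frames[k, :, :, 2], (y, x), (p < 0).astype(np.float32))
    return frames                                  # shape (P, H, W, 3)
```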

Path-adaptive Event Selection.

[Figure 5]

With the aggregated event frames $F$ as input, we then conduct path-adaptive event selection to select $K$ event frames, reducing the variability of events of arbitrary duration. Concretely, as shown in Fig.3, the input of this module is the aggregated event frames $F \in \mathbb{R}^{P \times H \times W \times 3}$. We utilize a lightweight score predictor composed of two 3D convolutional layers, followed by an activation function, to generate a selection mask $M \in \mathbb{R}^{K \times P}$, where $K$ represents the number of selected frames and $P$ the number of original frames. $M$ consists of 0s and 1s, where the position of each 1 indicates the position of a selected event frame. Due to the non-differentiable nature of the max operation applied after the standard Softmax to produce the selection, we employ the differentiable Gumbel Softmax (Jang et al., 2016) to facilitate backpropagation. The Gumbel Softmax is used exclusively during training, while the standard Softmax is applied during inference. Next, we apply an Einsum matrix-matrix multiplication between the selection mask $M$ and the original event frames $F$ to obtain the final selected event frames $F^{\prime} \in \mathbb{R}^{K \times H \times W \times 3}$. The above process ensures that $F^{\prime}$ is derived from the original event frame input $F$ in an end-to-end learnable manner. Please refer to the appendix for the pseudocode of the PEAS module.
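A compact PyTorch sketch of this selection step is given below. The convolution widths, the spatial pooling used to turn conv features into per-frame scores, the Gumbel temperature, and the hard argmax at inference are assumptions not specified in the text; the authors' pseudocode is in their appendix.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PathAdaptiveSelection(nn.Module):
    """Sketch of the PEAS frame-selection step (layer sizes are our assumptions)."""

    def __init__(self, num_select: int, hidden: int = 16):
        super().__init__()
        self.num_select = num_select
        # lightweight score predictor: two 3D convolutions over (C=3, P, H, W)
        self.score_net = nn.Sequential(
            nn.Conv3d(3, hidden, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv3d(hidden, num_select, kernel_size=3, padding=1),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, P, H, W, 3) aggregated event frames
        x = frames.permute(0, 4, 1, 2, 3)             # (B, 3, P, H, W)
        logits = self.score_net(x).mean(dim=(3, 4))   # (B, K, P) per-frame selection scores
        if self.training:
            # differentiable, approximately one-hot mask (one selected frame per row)
            mask = F.gumbel_softmax(logits, tau=1.0, hard=True, dim=-1)
        else:
            # standard softmax followed by a hard argmax at inference (assumption)
            mask = F.one_hot(F.softmax(logits, dim=-1).argmax(-1), frames.shape[1]).float()
        # einsum between the (B, K, P) mask and the (B, P, H, W, 3) frames
        selected = torch.einsum("bkp,bphwc->bkhwc", mask, frames)
        return selected                                # (B, K, H, W, 3)
```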

Fig.5 presents the original event frames alongside the $K$ selected ones at the start (epoch 0) and end (epoch 100) of training. Because events of arbitrary duration lead to different numbers of input event frames, frame padding is necessary to maintain consistent input sizes within a mini-batch. In Fig.5, the black parts represent the padded zero-valued frames within a mini-batch. At epoch 0, the PEAS module selects event frames essentially at random, resulting in unnecessary padded frames and redundant event frames with repetitive information. After 100 epochs, the eight chosen frames exclude redundant frames and non-informative padding, demonstrating the effectiveness of the PEAS module and the MSG loss proposed in Sec.4.3.

Bidirectional Event Scan. Next, with the selected event frames $F^{\prime} \in \mathbb{R}^{K \times H \times W \times 3}$, we convert them into a 1D sequence using a bidirectional event scan, following the spatiotemporal scan proposed in (Li et al., 2024). As illustrated in Fig.3, this scan follows the temporal and spatial order, sweeping from left to right and cascading from top to bottom. In this way, events of arbitrary duration are transformed into encoded features with fixed dimensions.
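Conceptually, the scan can be pictured as flattening the per-frame elements into one temporally and spatially ordered sequence and pairing it with its reversal; note that, as stated in Sec.4.2, the reversal is actually performed inside the bidirectional Mamba blocks, so the sketch below only illustrates the ordering and is not the authors' implementation.

```python
import torch

def bidirectional_scan(tokens: torch.Tensor):
    """Illustrative scan order only.

    tokens: (B, K, Hp, Wp, C) per-frame elements (e.g., patch tokens) for K selected frames.
    Returns the forward sequence and its reversed copy.
    """
    B, K, Hp, Wp, C = tokens.shape
    # temporal order first, then row-major spatial order (left -> right, top -> bottom)
    forward = tokens.reshape(B, K * Hp * Wp, C)
    backward = torch.flip(forward, dims=[1])   # the same sequence scanned in reverse
    return forward, backward
```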

4.2 Event Spatiotemporal Modeling Module

On top of the PEAS module, the events of arbitrary duration are transformed into the event frame sequence $F^{\prime} \in \mathbb{R}^{K \times H \times W \times 3}$. Given the inherently long sequences resulting from the event stream's high temporal resolution, we leverage the SSM for event spatiotemporal modeling with linear complexity. As shown in Fig.3, we first employ a 3D convolution with kernel size $1 \times 16 \times 16$ for patch embedding to transform the event frames into $L$ non-overlapping spatiotemporal tokens $x_e \in \mathbb{R}^{L \times C}$, where $L = T_s \times H \times W / (16 \times 16)$ and $T_s$ denotes the number of selected frames. The SSM, designed for sequential data, is sensitive to token positions, making the preservation of spatiotemporal position information crucial. Thus, we concatenate a learnable classification token $x_{cls} \in \mathbb{R}^{1 \times C}$ at the start of the sequence and then add a learnable spatial position embedding $P_s \in \mathbb{R}^{(1+L) \times C}$ and a temporal embedding $P_t \in \mathbb{R}^{T_s \times C}$ to obtain the final input sequence $x = [x_{cls}, x_e] + P_s + P_t$. Next, the input sequence $x$ passes through the stacked B-Mamba blocks (Gu & Dao, 2023). Note that the bidirectional event scan is actually conducted within the B-Mamba blocks in the code implementation. Finally, the [CLS] token is extracted from the final layer's output and forwarded to the classification head, which consists of a normalization layer and a linear classification layer, to produce the final prediction $y$.
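A sketch of the tokenization described above is shown below, assuming patch size 16, an embedding width of C = 192 (the Tiny setting in Tab.1), and $T_s = K$ selected frames; broadcasting the temporal embedding over the spatial tokens of each frame is our interpretation of $x = [x_{cls}, x_e] + P_s + P_t$.

```python
import torch
import torch.nn as nn


class EventTokenizer(nn.Module):
    """Sketch of the patch-embedding and position-embedding step (Sec. 4.2)."""

    def __init__(self, K: int, H: int, W: int, C: int = 192):
        super().__init__()
        # 1x16x16 3D convolution: one spatial patch grid per selected frame
        self.patch_embed = nn.Conv3d(3, C, kernel_size=(1, 16, 16), stride=(1, 16, 16))
        L = K * (H // 16) * (W // 16)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, C))
        self.pos_spatial = nn.Parameter(torch.zeros(1, 1 + L, C))   # P_s
        self.pos_temporal = nn.Parameter(torch.zeros(1, K, C))      # P_t, one row per frame

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, K, H, W, 3) selected event frames
        x = frames.permute(0, 4, 1, 2, 3)                           # (B, 3, K, H, W)
        x = self.patch_embed(x)                                     # (B, C, K, H/16, W/16)
        B, C, K, Hp, Wp = x.shape
        x = x.flatten(3).permute(0, 2, 3, 1).reshape(B, K * Hp * Wp, C)   # (B, L, C)
        cls = self.cls_token.expand(B, -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_spatial            # prepend [CLS], add P_s
        # repeat each frame's temporal embedding over its Hp*Wp spatial tokens
        temporal = self.pos_temporal.repeat_interleave(Hp * Wp, dim=1)
        x = torch.cat([x[:, :1], x[:, 1:] + temporal], dim=1)
        return x                                                     # (B, 1 + L, C), fed to B-Mamba blocks
```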

4.3 Multi-faceted Selection Guiding (MSG) Loss

While the proposed PEAS module is differentiable and capable of learning through back-propagation, the basic multi-class cross-entropy loss $\mathcal{L}_{CLS}$ is inadequate for effectively guiding model optimization. Due to the random weight initialization of the PEAS module, the selection of event frames is stochastic at the onset of training. Throughout training, the model would then merely optimize its performance for the distribution of the randomly selected event frames, rather than improving the PEAS module to adaptively select and scan the input events. To facilitate effective optimization, we propose the MSG loss, which addresses two crucial aspects: 1) minimizing the randomness of the selection process so that the selected sequence features encapsulate the entirety of the sequence; and 2) ensuring that each selected event feature is distinct from the others, thus eliminating redundancy. The MSG loss comprises three components, detailed in the following subsections.

Within-Frame Event Information Entropy (WEIE) Loss: Given the random initialization of the score predictor's weights in the PEAS module (Sec.4.1), the frame selection process tends to be random. For each selected event frame, we introduce the WEIE loss $\mathcal{L}_{WEIE}$, which quantifies the image entropy of each event frame. Intuitively, a higher WEIE loss indicates that the selected event frame contains more information and richer details. Maximizing this loss therefore guides the optimization to minimize randomness in the selection process. It is defined as follows:

$$\mathcal{L}_{WEIE} = -\frac{1}{K}\sum_{k=1}^{K}\sum_{i=1}^{N} P_i^k \log P_i^k, \qquad P^k = \textit{hist}(\textit{gray}(F^{\prime}_{k})), \quad (1)$$

where hist(·) denotes histogram statistics; $N$ is the number of histogram bins; gray(·) converts RGB event frames to grayscale; $P^k$ is the histogram frequency for the selected event frame $F^{\prime}_{k}$; and $K$ is the number of selected event frames.
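A minimal sketch of Eq. 1 is given below, assuming frames normalized to [0, 1] and standard grayscale weights. Note that the hard histogram (torch.histc) is not differentiable, so gradients would not flow back to the selection mask through this exact form; a soft-binning approximation would be needed in practice, which we omit here.

```python
import torch

def weie_loss(selected_frames: torch.Tensor, num_bins: int = 256) -> torch.Tensor:
    """Sketch of the within-frame event information entropy loss (Eq. 1).

    selected_frames: (K, H, W, 3), assumed normalized to [0, 1].
    """
    K = selected_frames.shape[0]
    weights = torch.tensor([0.299, 0.587, 0.114], device=selected_frames.device)
    gray = selected_frames @ weights                                 # (K, H, W)
    entropy = 0.0
    for k in range(K):
        hist = torch.histc(gray[k], bins=num_bins, min=0.0, max=1.0)
        p = hist / hist.sum().clamp(min=1e-12)                       # histogram frequencies P^k
        p = p[p > 0]                                                 # avoid log(0)
        entropy = entropy + -(p * p.log()).sum()
    # maximized during training: it enters the total objective as -L_WEIE (Eq. 5)
    return entropy / K
```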

Inter-frame Event Mutual Information (IEMI) Loss: The IEMI loss is proposed to reduce redundancy among the selected event frames. In light of mutual information from information theory (Russakoff et al., 2004), the IEMI loss $\mathcal{L}_{IEMI}$ quantifies the information shared between two event frames. Intuitively, a lower IEMI loss signifies greater differences between the frames; thus, minimizing the IEMI loss guides the model to maximize the differences among the selected event frames. While the IEMI loss can be computed between any two event frames, we restrict the calculation to consecutive frames of $F^{\prime} \in \mathbb{R}^{K \times H \times W \times 3}$ to reduce computational cost. Formally, the event mutual information is built on the coordinate-weighted joint event count histogram hist(·) between every two consecutive event frames $F^{\prime}_{k}$ and $F^{\prime}_{k+1}$, combined with their spatial coordinates $C_x$ and $C_y$. The IEMI loss $\mathcal{L}_{IEMI}$ is formulated as follows:

$$P^{k}_{joint} = \textit{hist}(\textit{gray}(F^{\prime}_{k} + F^{\prime}_{k+1} + C_x + C_y)), \quad (2)$$
$$\mathcal{L}_{IEMI} = -\frac{1}{K-1}\sum_{k=1}^{K-1}\sum_{i=1}^{N}\sum_{j=1}^{N} P^{k}_{joint}(i,j)\,\log\!\left(\frac{P(i)\,P(j)}{P^{k}_{joint}(i,j)}\right), \quad (3)$$

where $N$ indicates the number of histogram bins and $K$ is the number of selected event frames.
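The sketch below estimates the mutual information between consecutive grayscale frames via a plain 2D joint histogram. It omits the coordinate weighting ($C_x$, $C_y$) used in Eq. 2, and, as with the WEIE sketch, the hard histogram is non-differentiable, so it only illustrates the quantity being minimized.

```python
import torch

def iemi_loss(selected_frames: torch.Tensor, num_bins: int = 32) -> torch.Tensor:
    """Sketch of the inter-frame event mutual information loss (Eqs. 2-3)."""
    K = selected_frames.shape[0]
    if K < 2:
        return selected_frames.new_zeros(())
    weights = torch.tensor([0.299, 0.587, 0.114], device=selected_frames.device)
    gray = selected_frames @ weights                                      # (K, H, W), assumed in [0, 1]
    bins = (gray * (num_bins - 1)).round().long().clamp(0, num_bins - 1)
    mi_total = 0.0
    for k in range(K - 1):
        a, b = bins[k].flatten(), bins[k + 1].flatten()
        joint = torch.zeros(num_bins, num_bins, device=gray.device)
        joint.index_put_((a, b), torch.ones_like(a, dtype=joint.dtype), accumulate=True)
        p_joint = joint / joint.sum()                                     # P^k_joint(i, j)
        p_a = p_joint.sum(1, keepdim=True)                                # marginal P(i)
        p_b = p_joint.sum(0, keepdim=True)                                # marginal P(j)
        nz = p_joint > 0
        mi = (p_joint[nz] * (p_joint[nz] / (p_a @ p_b)[nz]).log()).sum()  # mutual information
        mi_total = mi_total + mi
    # minimized during training to encourage distinct consecutive frames
    return mi_total / (K - 1)
```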

[Figure 6]

Mask Selection (MS) Loss: Due to the arbitrary length of event streams and the resulting different numbers of input event frames, frame padding is necessary to maintain consistent input sizes within a mini-batch. Therefore, we propose the MS loss $\mathcal{L}_{MS}$ to filter out the padded frames during the selection process. Specifically, as shown in Fig.6, given the original event frame input $F \in \mathbb{R}^{P \times H \times W \times 3}$ and the selection mask $M \in \mathbb{R}^{K \times P}$ mentioned in Sec.4.1, the $\mathcal{L}_{MS}$ loss sums the mask values $M_{j},\ j = Ori+1, \dots, Ori+Pad$ at the positions corresponding to the padded frames $F_{j},\ j = Ori+1, \dots, Ori+Pad$, which is formulated as follows:

$$\mathcal{L}_{MS} = \frac{1}{K \times Pad}\sum_{i=1}^{K}\sum_{j=Ori+1}^{Ori+Pad} M_{i,j}, \quad (4)$$

where $K$, $Ori$ (= $P$), and $Pad$ denote the numbers of selected event frames, original event frames, and padded frames, respectively.
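A tiny sketch of Eq. 4, under the assumption that the padded frames occupy the last columns of the selection mask:

```python
import torch

def ms_loss(mask: torch.Tensor, num_original: int) -> torch.Tensor:
    """Sketch of the mask-selection loss (Eq. 4).

    mask: (K, P) selection mask; columns beyond `num_original` correspond to padded frames.
    """
    K, P = mask.shape
    num_pad = P - num_original
    if num_pad == 0:
        return mask.new_zeros(())
    # average selected mass placed on padded positions; minimized during training
    return mask[:, num_original:].sum() / (K * num_pad)
```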

Total Objective: Given the final predicted class $y$ and the ground-truth class $Y$, the total objective combines the MSG loss $\mathcal{L}_{MSG}$, with its three components, and the commonly used multi-class cross-entropy loss $\mathcal{L}_{CLS}$:

$$\mathcal{L}_{total} = \underbrace{\mathcal{L}_{IEMI} - \mathcal{L}_{WEIE} + \mathcal{L}_{MS}}_{\mathcal{L}_{MSG}} + \mathcal{L}_{CLS}(y, Y). \quad (5)$$
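Combining the sketches above, Eq. 5 can be assembled as follows; the unit weight on every term follows the equation as written, and any additional loss weighting the authors may use is not specified here.

```python
import torch.nn.functional as F

def total_loss(logits, targets, mask, selected_frames, num_original):
    """Sketch of the total objective (Eq. 5), reusing weie_loss, iemi_loss, and ms_loss above."""
    l_cls = F.cross_entropy(logits, targets)
    l_msg = (iemi_loss(selected_frames)
             - weie_loss(selected_frames)
             + ms_loss(mask, num_original))
    return l_cls + l_msg
```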

5 Experiments and Evaluation

5.1 Experimental Settings

Publicly Available Datasets: Four publicly available event-based datasets are evaluated in this paper: 1) DVS Action (Miao et al., 2019), also known as PAF, is an indoor dataset featuring 450 recordings across ten action categories, each lasting around 5s. 2) SeAct (Zhou et al., 2024) is a newly released dataset for event-based action recognition, covering 58 actions within four themes and lasting around 2s-10s. This work uses only class-level labels despite available caption-level labels. 3) HARDVS (Wang et al., 2024b) is currently the largest dataset for event-based action recognition, comprising 107,646 recordings of 300 action categories, with an average duration of 5s and a resolution of 346 × 260. 4) N-Caltech101 (Orchard et al., 2015) contains event streams captured by a moving 180 × 240 ATIS camera (Posch et al., 2010) in front of an LCD monitor presenting the original RGB images of Caltech101. There are 8,246 samples, each about 300 ms in length, covering 101 different categories.

Our Minute-level ArDVS100, Real-ArDVS10 and TemArDVS100 Datasets. Given that existing datasets only provide second-level events, lasting approximately 0.1s to 0.3s for objects and up to 20s for actions (please refer to the appendix for a comparison of all event-based object and action recognition datasets), we propose the first datasets consisting of event streams of arbitrary duration, named ArDVS100 and TemArDVS100. Specifically, the ArDVS100 and TemArDVS100 datasets each contain 100 action classes, with events ranging from 1s to 265s and from 14s to 215s, respectively; TemArDVS100 additionally offers fine-grained temporal labels that capture the temporal order of actions. For instance, in TemArDVS100, 'sit down then get up' and 'get up then sit down' are distinct actions, while in ArDVS100 they are considered the same. Both datasets are synthesized by concatenating event streams, from the HARDVS (Wang et al., 2024b) dataset for ArDVS100 and from HARDVS and DailyDVS-200 (Wang et al., 2024a) for TemArDVS100. We allocate 80% for training and 20% for testing. Additionally, to assess the model's real-world applicability, we created a real-world dataset, named Real-ArDVS10, comprising event-based actions lasting from 2s to 75s across 10 distinct classes selected from ArDVS100. It was recorded using a DVS346 event camera with a resolution of 346 × 240 pixels and is divided into 70% for training and 30% for testing. We hope that ArDVS100, Real-ArDVS10, and TemArDVS100 will strengthen the evaluation of event-based action recognition and inspire further research.

Model Architecture: We utilize the default hyperparameters of the B-Mamba layer (Zhu et al., 2024), setting the state dimension to 16 and the expansion ratio to 2.

Table 1: Model variants.
Model      | Layers L | Dim D | Param.
Tiny (T)   | 24       | 192   | 7M
Small (S)  | 24       | 384   | 25M
Middle (M) | 32       | 576   | 74M

In alignment with ViT (Dosovitskiy et al., 2020), we modify the depth and embedding dimension to match models of comparable sizes, namely Tiny (T), Small (S), and Middle (M), as outlined in Tab.1. The stated parameter counts are estimates, as the actual number varies with the number of categories and the number of selected event frames $K$.

Experimental Settings: We utilize the AdamW optimizer with a cosine learning rate schedule and 5 initial epochs of linear warm-up. Unless otherwise stated, the default learning rate and weight decay are 1e-3 and 0.05, respectively. The model is trained for 100 epochs on the DVS Action, SeAct, and N-Caltech101 datasets and for 50 epochs on the HARDVS and ArDVS100 datasets. Additionally, we employ BFloat16 precision during training to improve stability. For data augmentation, we apply random scaling, random cropping, random flipping, and mixup of the event frames during training. We adopt the pre-trained VideoMamba (Li et al., 2024) checkpoints for initialization. Refer to the appendix for additional experimental settings for each dataset. All ablation studies, unless specifically stated, use the Tiny version on the DVS Action dataset at a sampling frequency of 0.8 Hz with 32 selected event frames.
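For readers reproducing the schedule, a sketch of the optimizer setup described above is shown below; the Adam betas and per-step (rather than per-epoch) scheduling are assumptions, and the BFloat16 autocast wrapper around the forward pass is omitted.

```python
import math
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR


def build_optimizer(model, epochs: int, steps_per_epoch: int,
                    lr: float = 1e-3, weight_decay: float = 0.05, warmup_epochs: int = 5):
    """Sketch: AdamW with a cosine learning-rate schedule and 5 warm-up epochs."""
    optimizer = AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = epochs * steps_per_epoch

    def schedule(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)                         # linear warm-up
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))              # cosine decay
    return optimizer, LambdaLR(optimizer, lr_lambda=schedule)
```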

5.2 Experimental Results

5.2.1 Event-based Arbitrary Duration Recognition Results

In this section, we evaluate our proposed PAST-SSM for the recognition of event streams across three duration ranges: (1) 0.1s to 0.3s, (2) 1s to 10s, and (3) 1s to 265s.

Results for recognizing 0.1s to 0.3s event streams. We evaluate PAST-SSM on the popular event-based object recognition dataset N-Caltech101, whose average duration is 0.3s.

Table 2: Object recognition (avg. 0.1s-0.3s) on N-Caltech101.
Model                              | Param. | Top-1 Accuracy (%)
RG-CNNs (Cannici et al., 2020)     | 19M    | 65.70
Cho et al. (2023)                  | -      | 82.61
EDGCN (Deng et al., 2024)          | 0.77M  | 83.50
Matrix-LSTM (Cannici et al., 2020) | -      | 84.31
Yang et al. (2023)                 | 21M    | 87.66
MEM (Klenk et al., 2024)           | -      | 90.10
EventDance (Zheng & Wang, 2024)    | 26M    | 92.35
PAST-SSM-T-K(1)                    | 7M     | 88.29
PAST-SSM-T-K(2)                    | 7M     | 89.72
PAST-SSM-S-K(1)                    | 25M    | 90.92
PAST-SSM-S-K(2)                    | 25M    | 91.96
PAST-SSM-M-K(1)                    | 74M    | 94.20
PAST-SSM-M-K(2)                    | 74M    | 94.60

As shown in Tab.2, our PAST-SSM-M-K(2) secures a notable advantage, outperforming EventDance (Zheng & Wang, 2024) by +2.25%. This underscores the potential of our purely SSM-based model to efficiently and effectively recognize second-level event streams, highlighting its competence in local-range event spatiotemporal modeling.

Results for recognizing 1s to 20s event streams. Tab.3 presents results on event-based action recognition datasets with average durations of 1s to 10s. Our PAST-SSM-M outperforms previous methods, exceeding ExACT (Zhou et al., 2024) by +3.45% and +0.38% on the DVS Action and SeAct datasets, respectively. Additionally, PAST-SSM-S-K(16) achieves a remarkable 98.41% Top-1 accuracy on the HARDVS dataset, surpassing ExACT (Zhou et al., 2024) by +8.31% while using only 25M parameters, which also reduces computational demands.

Table 3: Action recognition (avg. 1s-10s). Top-1 accuracy (%).
Model                                    | Param. | DVS Action | SeAct | HARDVS
EV-ACT (Gao et al., 2023)                | 21.3M  | 92.60      | -     | -
EventTransAct (de Blegiers et al., 2023) | -      | -          | 57.81 | -
EvT (Sabater et al., 2022)               | 0.48M  | -          | 61.30 | -
TTPOINT (Ren et al., 2023)               | 0.33M  | 92.70      | -     | -
Speck (Yao et al., 2024)                 | -      | -          | -     | 46.70
ASA (Yao et al., 2023)                   | -      | -          | -     | 47.10
ESTF (Wang et al., 2024b)                | -      | -          | -     | 51.22
ExACT (Zhou et al., 2024)                | 471M   | 94.83      | 66.07 | 90.10
PAST-SSM-T-K(8)                          | 7M     | 91.38      | 51.72 | 98.40
PAST-SSM-T-K(16)                         | 7M     | 94.83      | 49.14 | 98.37
PAST-SSM-S-K(8)                          | 25M    | 93.33      | 60.34 | 98.20
PAST-SSM-S-K(16)                         | 25M    | 96.55      | 62.07 | 98.41
PAST-SSM-M-K(8)                          | 74M    | 98.28      | 65.52 | 98.05
PAST-SSM-M-K(16)                         | 74M    | 96.55      | 66.38 | 98.20

Table 4: Arbitrary-duration event recognition (1s to 265s). Top-1 accuracy (%).
Model            | Param. | ArDVS100 | Real-ArDVS10 | TemArDVS100
PAST-SSM-T-K(16) | 7M     | 90.20    | 80.00        | 59.20
PAST-SSM-T-K(32) | 7M     | 93.85    | 93.33        | 89.00
PAST-SSM-S-K(16) | 25M    | 94.90    | 90.00        | 62.90
PAST-SSM-S-K(32) | 25M    | 96.00    | 100.00       | 73.41
PAST-SSM-M-K(16) | 74M    | 96.00    | 93.33        | 71.06
PAST-SSM-M-K(32) | 74M    | 97.35    | 100.00       | 82.50

Results for recognizing 1s to 265s event streams. As illustrated in Tab.4, the linear complexity of PAST-SSM makes it well-suited for end-to-end training with arbitrary-duration event streams. We evaluate PAST-SSM on our ArDVS100 and TemArDVS100 datasets with event streams ranging from 1s to 265s. On ArDVS100, our PAST-SSM-M-K(32) achieves an excellent 97.35% Top-1 accuracy. On the more challenging TemArDVS100 dataset with fine-grained temporal labels, our PAST-SSM-T-K(32) reaches a Top-1 accuracy of 89.00% with reduced computational complexity and less training time, demonstrating its spatiotemporal modeling ability to distinguish the temporal order of actions. Additionally, our PAST-SSM-S-K(32) achieves 100% Top-1 accuracy when recognizing real-world event streams ranging from 2s to 75s across 10 classes, demonstrating its effectiveness for real-world applications. As for comparison methods, we were unable to evaluate previous approaches based on ViT or CNN backbones on our minute-level datasets because of their quadratic computational complexity or limited receptive field. These results highlight PAST-SSM's effectiveness and its potential for future arbitrary-duration event stream comprehension.

5.2.2 Generalization Results across Varying Inference Frequencies

Datasets & Experimental Settings. We train PAST-SSM-S on the DVS Action dataset at varying sampling frequencies, specifically 20 Hz, 60 Hz, and 100 Hz, corresponding to low, medium, and high sampling frequencies, and assess performance at inference frequencies ranging from 20 Hz to 100 Hz. We also examine two frame aggregation methods for sampling at fixed time intervals, which serve as the baselines: fixed 'Time Windows' aggregation and fixed 'Event Counts' aggregation. Fig.4 highlights the difference between the two: 'Time Windows' keeps the temporal range of each aggregation consistent, whereas 'Event Counts' results in varying temporal ranges (please refer to Sec.5.3 for more explanation and discussion).

Results & Discussion. As shown in Fig.7, regardless of whether the model is trained at a low, medium, or high frequency, our models demonstrate consistently strong performance across varying inference frequencies, with a maximum performance drop of only 8.62% when PAST-SSM is trained at 60 Hz and evaluated at 100 Hz. This underscores their robustness and generalizability compared to the baseline methods ('Time Windows' and 'Event Counts'), which typically experience significant performance declines, such as -18.96%, -20.59%, and -29.32% for 'Time Windows' trained at 20 Hz, 60 Hz, and 100 Hz and evaluated at 60 Hz, 100 Hz, and 20 Hz, respectively (please refer to the appendix for the specific statistics underlying Fig.7).

[Figure 7]

5.3 Ablation Study

We conduct ablation experiments on our PAST-SSM framework to evaluate the effectiveness of the PEAS module (Sec.4.1), the $\mathcal{L}_{MSG}$ loss (Sec.4.3), and other hyper-parameters.

Impact of the PEAS module & $\mathcal{L}_{MSG}$ loss. We ablate the two key components of our PAST-SSM model, namely the PEAS module (Sec.4.1) and the $\mathcal{L}_{MSG}$ loss (Sec.4.3). As shown in Tab.5,

Table 5: Ablation of the PEAS module and the $\mathcal{L}_{MSG}$ loss on DVS Action (K(16)).
Settings                  | Top-1 (%) | Top-5 (%)
Random Sampling           | 92.98     | 100.00
PEAS                      | 93.33     | 100.00
PEAS + $\mathcal{L}_{MSG}$ | 94.83     | 100.00

'Random Sampling' refers to the baseline where we select $K$ event frames at random; it achieves 92.98% Top-1 accuracy. With the PEAS module for path-adaptive event frame selection in an end-to-end manner, we achieve 93.33% Top-1 accuracy, a +0.35% gain, proving the effectiveness of the PEAS module. When equipped with both the PEAS module (Sec.4.1) and the $\mathcal{L}_{MSG}$ loss, the full model achieves 94.83% Top-1 accuracy, a +1.85% gain, proving the effectiveness of the proposed $\mathcal{L}_{MSG}$ loss in reducing randomness in the selection and promoting effective sequence feature learning.

Effectiveness of the Multi-faceted Selection Guiding loss $\mathcal{L}_{MSG}$. As presented in Tab.6, we conduct an ablation study on the four loss terms of the total objective (Eq.5).

Table 6: Ablation of the loss terms on DVS Action (K(16)); check marks indicate enabled terms.
$\mathcal{L}_{CLS}$ | $\mathcal{L}_{IEMI}$ | $\mathcal{L}_{WEIE}$ | $\mathcal{L}_{MS}$ | Top-1 (%) | Top-5 (%)
✓                   |                      |                      |                    | 89.65     | 98.25
✓                   | ✓                    |                      |                    | 91.38     | 100.00
✓                   | ✓                    | ✓                    |                    | 93.10     | 100.00
✓                   | ✓                    | ✓                    | ✓                  | 94.83     | 100.00

The component $\mathcal{L}_{CLS}$ serves as the baseline, optimizing exclusively with the standard cross-entropy loss and achieving a Top-1 accuracy of 89.65%. Adding the proposed $\mathcal{L}_{IEMI}$ (Eq.3), which reduces redundancy among the selected frames, attains a Top-1 accuracy of 91.38%, a gain of 1.73%. Further integrating $\mathcal{L}_{WEIE}$ (Eq.1), which encourages informative frames, raises the Top-1 accuracy to 93.10% (+3.45% over the baseline). Lastly, the component $\mathcal{L}_{MS}$ (Eq.4), designed to filter out padded frames, contributes a further improvement to 94.83% Top-1 accuracy (+5.18% over the baseline). In summary, all three proposed components positively impact the final classification, demonstrating their effectiveness.

Frame Aggregation Method: Time Windows vs. Event Counts. To ease the computational burden of processing events with rich spatiotemporal structure, existing methods predominantly sample and aggregate events at fixed temporal intervals, i.e., at a fixed frequency. In general, this aggregation can be categorized into two methods: fixed time windows and fixed event counts. Fig.4 illustrates the distinction: 'Event Counts' aggregation leads to varying temporal ranges per aggregation, while 'Time Windows' keeps them consistent. Tab.8 presents our model's performance with these two aggregation methods at different evaluation frequencies. We observe that 'Event Counts' tends to achieve better Top-1 accuracy than 'Time Windows'; for example, 'Event Counts' reaches 96.55% Top-1 accuracy versus 94.83% for 'Time Windows' when both are trained and evaluated at 60 Hz. However, both methods perform poorly when trained and evaluated at different frequencies, with drops of -24.14% for 'Event Counts' trained at 20 Hz and evaluated at 100 Hz, and -25.86% for 'Time Windows' trained at 100 Hz and evaluated at 20 Hz. This motivates our PEAS module to improve model generalization across inference frequencies.

Frame-based event representations; Top-1/Top-5 accuracy (%) with K(1) on N-Caltech101 and K(16) on DVS Action.
Representation | N-Caltech101 Top-1 | N-Caltech101 Top-5 | DVS Action Top-1 | DVS Action Top-5
Frame (Gray)   | 90.48              | 97.53              | 93.33            | 100.00
Frame (RGB)    | 90.94              | 97.82              | 94.83            | 100.00
Voxel          | 90.19              | 97.02              | 92.47            | 100.00
TBR            | 90.24              | 97.13              | 91.72            | 100.00

Frame-based Event Representation. Tab.8 shows the impact of four existing frame-based event representations. The RGB frame (Zhou et al., 2023) representation attains Top-1 accuracies of 90.94% on N-Caltech101 and 94.83% on DVS Action, surpassing the other three frame-based representations: the gray frame (Zhou et al., 2023), Voxel (Deng et al., 2022b), and TBR (Innocenti et al., 2021).

6 Conclusion and Future Work

In this paper, we have presented a novel approach, named PAST-SSM, for recognizing events of arbitrary duration and generalizing to varying inference frequencies. Extensive experiments show that PAST-SSM outperforms prior arts with fewer parameters on four publicly available datasets and successfully recognizes events of arbitrary duration on our ArDVS100 (1s to 265s), Real-ArDVS10 (2s to 75s), and TemArDVS100 (14s to 215s) datasets. Moreover, it shows strong generalization across varying inference frequencies. We hope this method can pave the way for future model designs for recognizing events of longer duration and for applications in high-speed, dynamic visual scenarios.

Limitation. We observe that larger VideoMamba variants tend to overfit in our experiments, resulting in suboptimal performance. This issue is not limited to our models but is also observed in Mamba (Gu & Dao, 2023) and VideoMamba (Li et al., 2024). Future research could explore training strategies such as self-distillation and advanced data augmentation to mitigate this overfitting.

References

  • Bi et al. (2020) Yin Bi, Aaron Chadha, Alhabib Abbas, Eirina Bourtsoulatze, and Yiannis Andreopoulos. Graph-based spatio-temporal feature learning for neuromorphic vision sensing. IEEE Transactions on Image Processing, 29:9084–9098, 2020.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
  • Cannici et al. (2020) Marco Cannici, Marco Ciccone, Andrea Romanoni, and Matteo Matteucci. A differentiable recurrent surface for asynchronous event-based data. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX 16, pp. 136–152. Springer, 2020.
  • Cho et al. (2023) Hoonhee Cho, Hyeonseong Kim, Yujeong Chae, and Kuk-Jin Yoon. Label-free event-based object recognition via joint learning with image reconstruction from events. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 19866–19877, 2023.
  • de Blegiers et al. (2023) Tristan de Blegiers, Ishan Rajendrakumar Dave, Adeel Yousaf, and Mubarak Shah. EventTransAct: A video transformer-based framework for event-camera based action recognition. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1–7. IEEE, 2023.
  • Deng et al. (2022a) Yongjian Deng, Hao Chen, Hai Liu, and Youfu Li. A voxel graph CNN for object classification with event cameras. In CVPR, 2022a.
  • Deng et al. (2022b) Yongjian Deng, Hao Chen, Hai Liu, and Youfu Li. A voxel graph CNN for object classification with event cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1172–1181, 2022b.
  • Deng et al. (2024) Yongjian Deng, Hao Chen, and Youfu Li. A dynamic GCN with cross-representation distillation for event-based learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 1492–1500, 2024.
  • Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • Fu et al. (2022) Daniel Y. Fu, Tri Dao, Khaled K. Saab, Armin W. Thomas, Atri Rudra, and Christopher Ré. Hungry Hungry Hippos: Towards language modeling with state space models. arXiv preprint arXiv:2212.14052, 2022.
  • Gallego et al. (2020) Guillermo Gallego, Tobi Delbrück, Garrick Orchard, Chiara Bartolozzi, Brian Taba, Andrea Censi, Stefan Leutenegger, Andrew J. Davison, Jörg Conradt, Kostas Daniilidis, et al. Event-based vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(1):154–180, 2020.
  • Gao et al. (2023) Yue Gao, Jiaxuan Lu, Siqi Li, Nan Ma, Shaoyi Du, Yipeng Li, and Qionghai Dai. Action recognition and benchmark using event cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • Gehrig et al. (2019) Daniel Gehrig, Antonio Loquercio, Konstantinos G. Derpanis, and Davide Scaramuzza. End-to-end learning of representations for asynchronous event-based data. In ICCV, 2019.
  • Gu & Dao (2023) Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
  • Gu et al. (2022) Albert Gu, Karan Goel, Ankit Gupta, and Christopher Ré. On the parameterization and initialization of diagonal state space models. Advances in Neural Information Processing Systems, 35:35971–35983, 2022.
  • Gu et al. (2020) Fuqiang Gu, Weicong Sng, Tasbolat Taunyazov, and Harold Soh. TactileSGNet: A spiking graph neural network for event-based tactile object recognition. In IROS, 2020.
  • Innocenti et al. (2021) Simone Undri Innocenti, Federico Becattini, Federico Pernici, and Alberto Del Bimbo. Temporal binary representation for event-based action recognition. In 2020 25th International Conference on Pattern Recognition (ICPR), pp. 10426–10432. IEEE, 2021.
  • Jang et al. (2016) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.
  • Kim et al. (2021) Junho Kim, Jaehyeok Bae, Gangin Park, Dongsu Zhang, and Young Min Kim. N-ImageNet: Towards robust, fine-grained object recognition with event cameras. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2146–2156, 2021.
  • Kim et al. (2022) Junho Kim, Inwoo Hwang, and Young Min Kim. Ev-TTA: Test-time adaptation for event-based object recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17745–17754, 2022.
  • Klenk et al. (2024) Simon Klenk, David Bonello, Lukas Koestler, Nikita Araslanov, and Daniel Cremers. Masked event modeling: Self-supervised pretraining for event cameras. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2378–2388, 2024.
  • Li et al. (2024) Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, and Yu Qiao. VideoMamba: State space model for efficient video understanding. arXiv preprint arXiv:2403.06977, 2024.
  • Li et al. (2021) Yijin Li, Han Zhou, Bangbang Yang, Ye Zhang, Zhaopeng Cui, Hujun Bao, and Guofeng Zhang. Graph-based asynchronous event processing for rapid object recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 934–943, 2021.
  • Liu et al. (2022) Chang Liu, Xiaojuan Qi, Edmund Y. Lam, and Ngai Wong. Fast classification and action recognition with event-based imaging. IEEE Access, 10:55638–55649, 2022.
  • Liu et al. (2021) Qianhui Liu, Dong Xing, Huajin Tang, De Ma, and Gang Pan. Event-based action recognition using motion information and spiking neural networks. In IJCAI, pp. 1743–1749, 2021.
  • Lu et al. (2019) Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in Neural Information Processing Systems, 32, 2019.
  • Miao et al. (2019) Shu Miao, Guang Chen, Xiangyu Ning, Yang Zi, Kejia Ren, Zhenshan Bing, and Alois Knoll. Neuromorphic vision datasets for pedestrian detection, action recognition, and fall detection. Frontiers in Neurorobotics, 13:38, 2019.
  • Orchard et al. (2015) Garrick Orchard, Ajinkya Jayawant, Gregory K. Cohen, and Nitish Thakor. Converting static image datasets to spiking neuromorphic datasets using saccades. Frontiers in Neuroscience, 9:437, 2015.
  • Plizzari et al. (2022) Chiara Plizzari, Mirco Planamente, Gabriele Goletto, Marco Cannici, Emanuele Gusso, Matteo Matteucci, and Barbara Caputo. E2(GO)MOTION: Motion augmented event stream for egocentric action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19935–19947, 2022.
  • Posch et al. (2010) Christoph Posch, Daniel Matolin, and Rainer Wohlgenannt. A QVGA 143 dB dynamic range frame-free PWM image sensor with lossless pixel-level video compression and time-domain CDS. IEEE Journal of Solid-State Circuits, 46(1):259–275, 2010.
  • Ren et al. (2023) Hongwei Ren, Yue Zhou, Haotian Fu, Yulong Huang, Renjing Xu, and Bojun Cheng. TTPoint: A tensorized point cloud network for lightweight action recognition with event cameras. In Proceedings of the 31st ACM International Conference on Multimedia, pp. 8026–8034, 2023.
  • Russakoff et al. (2004) Daniel B. Russakoff, Carlo Tomasi, Torsten Rohlfing, and Calvin R. Maurer. Image similarity using mutual information of regions. In Computer Vision–ECCV 2004: 8th European Conference on Computer Vision, Prague, Czech Republic, May 11–14, 2004, Proceedings, Part III 8, pp. 596–607. Springer, 2004.
  • Sabater et al. (2022) Alberto Sabater, Luis Montesano, and Ana C. Murillo. Event Transformer. A sparse-aware solution for efficient event data processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2677–2686, 2022.
  • Smith et al. (2022) Jimmy T. H. Smith, Andrew Warrington, and Scott W. Linderman. Simplified state space layers for sequence modeling. arXiv preprint arXiv:2208.04933, 2022.
  • Wang et al. (2023) Jue Wang, Wentao Zhu, Pichao Wang, Xiang Yu, Linda Liu, Mohamed Omar, and Raffay Hamid. Selective structured state-spaces for long-form video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6387–6397, 2023.
  • Wang et al. (2024a) Qi Wang, Zhou Xu, Yuming Lin, Jingtao Ye, Hongsheng Li, Guangming Zhu, Syed Afaq Ali Shah, Mohammed Bennamoun, and Liang Zhang. DailyDVS-200: A comprehensive benchmark dataset for event-based action recognition. arXiv preprint arXiv:2407.05106, 2024a.
  • Wang et al. (2024b) Xiao Wang, Zongzhen Wu, Bo Jiang, Zhimin Bao, Lin Zhu, Guoqi Li, Yaowei Wang, and Yonghong Tian. HARDVS: Revisiting human activity recognition with dynamic vision sensors. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 5615–5623, 2024b.
  • Wu & Krahenbuhl (2021) Chao-Yuan Wu and Philipp Krahenbuhl. Towards long-form video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1884–1894, 2021.
  • Xie et al. (2022) Bochen Xie, Yongjian Deng, Zhanpeng Shao, Hai Liu, and Youfu Li. VMV-GCN: Volumetric multi-view based graph CNN for event stream classification. IEEE Robotics and Automation Letters, 7(2):1976–1983, 2022.
  • Xie et al. (2023) Bochen Xie, Yongjian Deng, Zhanpeng Shao, Hai Liu, Qingsong Xu, and Youfu Li. Event voxel set transformer for spatiotemporal representation learning on event streams. arXiv preprint arXiv:2303.03856, 2023.
  • Xie et al. (2024) Bochen Xie, Yongjian Deng, Zhanpeng Shao, Qingsong Xu, and Youfu Li. Event voxel set transformer for spatiotemporal representation learning on event streams. IEEE Transactions on Circuits and Systems for Video Technology, 2024.
  • Yang et al. (2023) Yan Yang, Liyuan Pan, and Liu Liu. Event camera data pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10699–10709, 2023.
  • Yao et al. (2021) Man Yao, Huanhuan Gao, Guangshe Zhao, Dingheng Wang, Yihan Lin, Zhaoxu Yang, and Guoqi Li. Temporal-wise attention spiking neural networks for event streams classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10221–10230, 2021.
  • Yao et al. (2023) Man Yao, Jiakui Hu, Guangshe Zhao, Yaoyuan Wang, Ziyang Zhang, Bo Xu, and Guoqi Li. Inherent redundancy in spiking neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16924–16934, 2023.
  • Yao et al. (2024) Man Yao, Ole Richter, Guangshe Zhao, Ning Qiao, Yannan Xing, Dingheng Wang, Tianxiang Hu, Wei Fang, Tugba Demirci, Michele De Marchi, et al. Spike-based dynamic computing with asynchronous sensing-computing neuromorphic chip. Nature Communications, 15(1):4464, 2024.
  • Zhang et al. (2024) Tao Zhang, Xiangtai Li, Haobo Yuan, Shunping Ji, and Shuicheng Yan. Point Cloud Mamba: Point cloud learning via state space model. arXiv preprint arXiv:2403.00762, 2024.
  • Zheng & Wang (2024) Xu Zheng and Lin Wang. EventDance: Unsupervised source-free cross-modal adaptation for event-based object recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17448–17458, 2024.
  • Zheng et al. (2023) Xu Zheng, Yexin Liu, Yunfan Lu, Tongyan Hua, Tianbo Pan, Weiming Zhang, Dacheng Tao, and Lin Wang. Deep learning for event-based vision: A comprehensive survey and benchmarks. arXiv preprint arXiv:2302.08890, 2023.
  • Zhou et al. (2023) Jiazhou Zhou, Xu Zheng, Yuanhuiyi Lyu, and Lin Wang. E-CLIP: Towards label-efficient event-based open-world understanding by CLIP. arXiv preprint arXiv:2308.03135, 2023.
  • Zhou et al. (2024) Jiazhou Zhou, Xu Zheng, Yuanhuiyi Lyu, and Lin Wang. ExACT: Language-guided conceptual reasoning and uncertainty estimation for event-based action recognition and more. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18633–18643, 2024.
  • Zhu et al. (2024) Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang. Vision Mamba: Efficient visual representation learning with bidirectional state space model. arXiv preprint arXiv:2401.09417, 2024.
  • Zubić et al. (2023) Nikola Zubić, Daniel Gehrig, Mathias Gehrig, and Davide Scaramuzza. From chaos comes order: Ordering event representations for object recognition and detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12846–12856, 2023.
  • Zubic et al. (2024) Nikola Zubic, Mathias Gehrig, and Davide Scaramuzza. State space models for event cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5819–5828, 2024.

Appendix A Appendix

A.1 Additional technical details of SSMs.

SSMs (Gu et al., 2022; Smith et al., 2022; Fu et al., 2022; Wang et al., 2023) originate from the principles of continuous systems that map an input 1D sequence $x(t)\in\mathbb{R}^{L}$ into an output sequence $y(t)\in\mathbb{R}^{L}$ through an underlying hidden state $h(t)\in\mathbb{R}^{N}$. Specifically, the system is formalized as:

$\frac{dh(t)}{dt} = Ah(t) + Bx(t),$    (6)
$y(t) = Ch(t) + Dx(t),$    (7)

where $A\in\mathbb{R}^{N\times N}$ is the state (or system) matrix, $B\in\mathbb{R}^{N\times 1}$ the input projection matrix, $C\in\mathbb{R}^{1\times N}$ the output projection matrix, and $D\in\mathbb{R}$ the feed-forward term.

The discretization process of SSMs is essential for integrating continuous-time models into deep-learning algorithms (Wang et al., 2023). We adopt the Mamba (Gu & Dao, 2023) strategy, treating $D$ as a fixed network parameter while introducing a timescale parameter $\Delta$ to transform the continuous parameters $A$ and $B$ into their discrete counterparts $\hat{A}$ and $\hat{B}$, formulated as follows:

$\hat{A} = \exp(\Delta A)$    (8)
$\hat{B} = (\Delta A)^{-1}(\exp(\Delta A) - I)\cdot\Delta B$    (9)
$h_t = \hat{A}h_{t-1} + \hat{B}x_t,$    (10)
$y_t = Ch_t.$    (11)

Compared to previous linear time-invariant SSMs, Mamba proposes a selective scan mechanism that directly derives the parameters $B$, $C$, and $\Delta$ from the input during training, enabling better contextual sensitivity and adaptive weight modulation.
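As an illustration only, the discretized recurrence in Eqs. 8–11 can be sketched in a few lines of NumPy under the simplifying assumption of a diagonal state matrix $A$ (so the matrix exponential and inverse become elementwise) and fixed, non-selective parameters; the selective variant would additionally predict $B$, $C$, and $\Delta$ from each input token.

import numpy as np

def ssm_scan(A, B, C, x, delta):
    # A, B, C: (N,) parameters of a diagonal SSM; x: (L,) input sequence; delta: scalar timescale
    A_bar = np.exp(delta * A)                 # Eq. 8 (elementwise for diagonal A)
    B_bar = (np.expm1(delta * A) / A) * B     # Eq. 9 reduces to (exp(delta*A) - 1) / A * B for diagonal A
    h = np.zeros_like(A)
    ys = []
    for x_t in x:
        h = A_bar * h + B_bar * x_t           # Eq. 10: hidden-state update
        ys.append(np.dot(C, h))               # Eq. 11: output projection
    return np.array(ys)

# Example: ssm_scan(A=-np.linspace(1, 4, 4), B=np.ones(4), C=np.ones(4),
#                   x=np.sin(np.linspace(0, 6, 50)), delta=0.1)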

A.2 PyTorch-style Pseudocode for the Proposed PEAS Module.

In Algorithm 1, we present the PyTorch-style pseudocode of the proposed PEAS module to facilitate readers' understanding.

# B, C, H, W: batch size, channels, height, width
# P, K: number of input and selected output event frames
# x: input event frames with shape (B, P, C, H, W)
# y: output selected frames with shape (B, K, C, H, W)

s = ScorePredictor(x)  # two-layer CNN; predicts a selection score for each frame, shape (B, K, P)

if self.training:  # differentiable (straight-through) selection during training
    selection_mask = F.gumbel_softmax(s, dim=2, hard=True)
else:  # hard argmax selection during evaluation
    idx_argmax = s.max(dim=2, keepdim=True)[1]
    selection_mask = torch.zeros_like(s).scatter_(dim=2, index=idx_argmax, value=1.0)

B, K, P = selection_mask.shape

# Recover the temporal (frame) index chosen by each of the K one-hot rows
indices = torch.where(selection_mask.eq(1))

# Sort the K selected frames so that they follow the original temporal order
indices_sorted = torch.argsort(indices[2].reshape(B, K), dim=1)

# Rearrange the mask rows according to the temporal order
for i in range(B):
    selection_mask[i, :, :] = selection_mask[i, indices_sorted[i], :]

# Perform frame selection with the mask: y[b, k] = sum_p mask[b, k, p] * x[b, p]
y = torch.einsum('bkp,bpchw->bkchw', selection_mask, x)  # (B, K, C, H, W)

A.3 Existing event-based recognition datasets comparison.

We compare our proposed ArDVS100 dataset with existing event-based recognition datasets. As shown in Tab.7, previous datasets contain second-level event streams lasting from 0.1s to 20s, while our proposed ArDVS100 dataset provides minute-level duration event streams lasting from 1s to 265s. ArDVS100 has 100 classes with normal class labels. We believe that our ArDVS100 will provide enhanced evaluation platforms for recognizing event streams of arbitrary durations and inspire further research in this field.

Dataset | Year | Sensor | Object | Scale | Class | Duration
MNIST-DVS | 2013 | DAVIS128 | Image | 30,000 | 10 | -
N-Caltech101 | 2015 | ATIS | Image | 8,709 | 101 | 0.3s
N-MNIST | 2015 | ATIS | Image | 70,000 | 10 | 0.3s
CIFAR10-DVS | 2017 | DAVIS128 | Image | 10,000 | 10 | 1.2s
N-ImageNet | 2021 | Samsung-Gen3 | Image | 1,781,167 | 1,000 | 0.1s
ES-ImageNet | 2021 | - | Image | 1,306,916 | 1,000 | -
ASLAN-DVS | 2011 | DAVIS240c | Action | 3,697 | 432 | -
DvsGesture | 2017 | DAVIS128 | Action | 1,342 | 11 | 6s
N-CARS | 2018 | ATIS | Car | 24,029 | 2 | 0.1s
ASL-DVS | 2019 | DAVIS240 | Hand | 100,800 | 24 | 0.1s
DVS Action | 2019 | DAVIS346 | Action | 450 | 10 | 5s
HMDB-DVS | 2019 | DAVIS240c | Action | 6,766 | 51 | 19s
UCF-DVS | 2019 | DAVIS240c | Action | 13,320 | 101 | 25s
DailyAction | 2021 | DAVIS346 | Action | 1,440 | 12 | 5s
HARDVS | 2022 | DAVIS346 | Action | 107,646 | 300 | 5s
THU-EACT-50 | 2023 | CeleX-V | Action | 10,500 | 50 | 2s-5s
THU-EACT-50-CHL | 2023 | DAVIS346 | Action | 2,330 | 50 | 2s-6s
Bullying10K | 2023 | DAVIS346 | Action | 10,000 | 10 | 1s-20s
SeAct | 2024 | DAVIS346 | Action | 580 | 58 | 2s-10s
DailyDVS-200 | 2024 | DVXplorer Lite | Action | 22,046 | 200 | 2s-20s
ArDVS100 | 2024 | DAVIS346 | Action | 10,000 | 100 | 1s-265s
Real-ArDVS10 | 2024 | DAVIS346 | Action | 100 | 10 | 2s-75s
TemArDVS100 | 2024 | DAVIS346 | Action | 10,000 | 100 | 14s-215s

A.4 Detailed statistics for model generalization across varying inference frequencies.

In Tab. 8, we present the detailed statistics underlying Fig. 7 for further comparison.

Top-1 Accuracy and Performance Drop (%)
Train f | Settings | Val f = 20 Hz | 40 Hz | 60 Hz | 80 Hz | 100 Hz
20 Hz | Time Windows | 93.10 | 89.65 (-3.45) | 74.14 (-18.96) | 72.41 (-20.69) | 68.97 (-24.13)
20 Hz | Event Counts | 94.83 | 87.93 (-6.90) | 75.86 (-18.97) | 75.86 (-18.97) | 70.69 (-24.14)
20 Hz | Event Counts + PAST-SSM-S | 93.10 | 89.65 (-3.45) | 89.65 (-3.45) | 86.21 (-6.89) | 84.48 (-8.62)
60 Hz | Time Windows | 79.31 (-15.52) | 87.93 (-6.90) | 94.83 | 89.65 (-5.18) | 75.86 (-18.97)
60 Hz | Event Counts | 81.03 (-15.52) | 89.65 (-6.90) | 96.55 | 87.93 (-8.62) | 79.89 (-16.66)
60 Hz | Event Counts + PAST-SSM-S | 89.66 (-6.89) | 93.10 (-3.45) | 96.55 | 91.38 (-5.17) | 87.93 (-8.62)
100 Hz | Time Windows | 65.51 (-25.86) | 87.93 (-3.44) | 91.37 (-0.00) | 91.37 (-0.00) | 91.37
100 Hz | Event Counts | 67.24 (-27.59) | 86.21 (-8.62) | 89.65 (-5.18) | 93.10 (-1.73) | 94.83
100 Hz | Event Counts + PAST-SSM-S | 89.66 (-5.17) | 94.83 (-0.00) | 93.10 (-1.73) | 93.10 (-1.73) | 94.83

A.5 The settings of sampling frequency and aggregated event counts per frame for different datasets.

The additional experiment settings of sampling frequency and aggregated event count per frame for each dataset are presented in Tab. 9; a configuration-style summary follows the table.

Dataset | Sampling Frequency | Aggregated Event Count / Frame
N-Caltech101 | 200 Hz | 50,000
DVS Action | 80 Hz | 100,000
SeAct | 80 Hz | 80,000
HARDVS | 100 Hz | 80,000
ArDVS100 | 50 Hz | 80,000
Real-ArDVS10 | 50 Hz | 80,000
TemArDVS100 | 50 Hz | 80,000
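For convenience, the settings in Tab. 9 can be written as a small Python configuration mapping; the variable name and tuple layout below are illustrative choices of ours, while the numbers come directly from the table.

EVENT_AGGREGATION_SETTINGS = {
    # dataset: (sampling frequency in Hz, aggregated event count per frame)
    "N-Caltech101": (200, 50_000),
    "DVS Action":   (80, 100_000),
    "SeAct":        (80, 80_000),
    "HARDVS":       (100, 80_000),
    "ArDVS100":     (50, 80_000),
    "Real-ArDVS10": (50, 80_000),
    "TemArDVS100":  (50, 80_000),
}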