Jiazhou Zhou^{1,3} Kanghao Chen^{1} Lei Zhang^{3} Lin Wang^{1,2}^{1}AI Thrust, HKUST(GZ) ^{2}Dept. of CSE, HKUST ^{3} International Digital Economy Academy (IDEA)

{jzhou297,kchen879}@connect.hkust-gz.edu.cn, leizhang@idea.edu.cn, linwang@ust.hkCorresponding Author

###### Abstract

Event cameras are bio-inspired sensors that capture the intensity changes asynchronously and output event streams with distinct advantages, such as high temporal resolution.To exploit event cameras for object/action recognition, existing methods predominantly sample and aggregate events in a second-level duration at every fixed temporal interval (or frequency). However, they often face difficulties in capturing the spatiotemporal relationships for longer,e.g., minute-level, events and generalizing across varying temporal frequencies. To fill the gap, we present a novel framework, dubbed PAST-SSM, exhibiting superior capacity in recognizing events of arbitrary duration (e.g., 0.1s to 4.5min) and generalizing to varying inference frequencies.Our key insight is to learn the spatiotemporal relationships from the encoded event features via the state space model (SSM) – whose linear complexity makes it ideal for modeling high temporal resolution events with longer sequences. To achieve this goal, we first propose a Path-Adaptive Event Aggregation and Scan (PEAS) module to encode events of varying duration into features with fixed dimensions by adaptively scanning and selecting aggregated event frames. On top of PEAS, we introduce a novel Multi-faceted Selection Guiding (MSG) loss to minimize the randomness and redundancy of the encoded features. This subtly enhances the model generalization across different inference frequencies. Lastly, the SSM is employed to better learn the spatiotemporal properties from the encoded features. Moreover, we build a minute-level event-based recognition dataset, named ArDVS100, with arbitrary duration for the benefit of the community. Extensive experiments prove that our method outperforms prior arts by +3.45%, +0.38% and +8.31% on the DVS Action, SeAct, and HARDVS datasets, respectively. In addition, it achieves an accuracy of 97.35%, 89.00%, and 100.00% in our ArDVS100, TemArDVS100, and Real-ArDVS10 datasets respectively with the duration from 1s to 265s. Our method also shows strong generalization with a maximum accuracy drop of only 8.62% for varying inference frequencies while the baseline’s drop reaches 27.59%. Project page: https://vlislab22.github.io/pastssm/.

## 1 Introduction

Event cameras are bio-inspired sensors that trigger signals when the relative intensity change exceeds a threshold, adapting to scene brightness, motion, and texture. Compared with standard cameras, event cameras output asynchronous event streams, instead of fixed frame rates. They offer distinct advantages, such as high dynamic range, microsecond temporal resolution, and low latency(Gallego etal., 2020; Zheng etal., 2023). Due to these merits, event cameras have been applied to address various vision tasks, such as object/action recognition(Deng etal., 2024; Cannici etal., 2020; Klenk etal., 2024; Zheng & Wang, 2024; Zhou etal., 2024; Sabater etal., 2022; deBlegiers etal., 2023; Gao etal., 2023)

The spatiotemporal richness of events introduces complexities in data processing and necessitates models that can efficiently process and interpret them. To address this problem, existing methods predominantly sample and aggregate them at every fixed temporal interval, i.e., frequency. In this way, the raw stream can be converted into denserepresentations(Zhou etal., 2023; Zubic etal., 2024; Bi etal., 2020; Sabater etal., 2022) akin to multi-channel images. In general, existing methods mainly follow two representative model structures: (a) step-by-step structure models(Xie etal., 2024; Yao etal., 2021; Zhou etal., 2024; 2023; Zheng & Wang, 2024; Kim etal., 2022) and (b) recurrent structure models (Sabater etal., 2022; Zubić etal., 2023). The former processes all time step event frames in parallel, employing local-range and long-range temporal modeling sequentially, as shown in Fig.2 (a). By contrast, the latter process event frames sequentially at each time step, updating a memory feature that affects the next input, as illustrated in Fig.2 (b).

However, both models face two pivotal challenges, as shown in Fig.1. 1) Limited temporal duration. Our world tells an ongoing story about people and objects and how they interact(Wu & Krahenbuhl, 2021). This indicates that recognizing event streams of arbitrary duration is more practical and beneficial for real-world scenarios. However, existing methodsoften struggle with longer, e.g., minute-level, spatiotemporal relationships of events because step-by-step structure models face high computational complexity with long events, while recurrent models struggle with forgetting nature of initial information and longer training times.2) Limited generalization to varying frequencies.The performance of existing recognition models significantly declines at inference frequencies that differ from those used during training, which is crucial for high-speed, dynamic visual scenarios(Zubic etal., 2024). For example, as illustrated in Fig. 7, the existing event sampling strategies exhibit poor generalization when evaluated at both higher and lower sampling frequencies with a maximum performance drop of 27.59%.

Recently, the selective state space model (SSM) rivals previous backbones such as vision transformer in performance while offering a significant reduction in memory usage and linear-scale complexity, as evidenced by Mamba(Gu & Dao, 2023), Vision Mamba(Zhu etal., 2024), and Video Mamba(Li etal., 2024) in language, image, and video modalities.Given the inherently longer sequences because of the event stream’s high temporal resolution, a natural motivation arises for harnessing the exceptional power of SSM for event spatiotemporal modeling with linear complexity. This prompts us to explore an interesting question: how to effectively recognize events of arbitrary duration (e.g., second-level to minuter-level) while generalizing across varying inference frequencies based on the SSM backbone?To this end, we propose PAST-SSM, a novel framework for recognizing event streams of arbitrary duration (0.1s to 4.5min), as depicted in Fig.1. By harnessing the linear complexity of SSM, PAST-SSM delivers exceptional recognition performance and frequency generalization.Our PAST-SSM brings two key technical breakthroughs.

Firstly, the number of aggregated event frames can vary dramatically due to the high temporal resolution of events. For example, if events lasting between 0.1s and 300s are sampled at 50Hz (every 0.02s), the number of resulting frames can range from 5 to 15,000. This variability causes difficulties for SSM in effectively learning the spatiotemporal properties from events, as SSM’s hidden state updates rely heavily on the sequence length and feature order.To this end, we propose a novel Path-Adaptive Event Aggregation and Scan (PEAS) module to encode events of arbitrary duration into sequence features with fixed dimensions.Concretely, as shown in Fig.3, a selection mask is first learned from the original event frames to facilitate frame selection. Then the bidirectional event scan is conducted on the selected frames to convert them into sequence features. This adaptive process ensures the event scan path is end-to-end learnable and responsive to every event input, thus enabling our PAST-SSM to effectively process event streams of arbitrary duration (Tab.4).

Secondly, the varying sampling frequencies hinder the framework’s generalization during the inference, as empirically verified in Tab.8. This suggests that alterations in the input sequence order, resulting from changes in sampling frequency, significantly impact model performance. For this reason, we propose a novel Multi-faceted Selection Guiding (MSG) loss. It minimizes the randomness of the event frame selection caused by the random initialization of the selection mask’s weight.As evidently shown in Fig.5, our MSG loss better facilitates alleviating the redundancy issue of the selected event frames, thus enhancing the SSM optimization’s effectiveness. Meanwhile, it also strengthens the generalization of the SSM model in varying inference frequencies (Tab.8).

Given the absence of datasets for minute-level duration event-based recognition, we collected ArDVS100 dataset (1s to 265s) and the more challenging TemArDVS100 dataset (14s to 215s) with temporal fine-grained classes through direct concatenation, each containing event streams across 100 classes, created through direct concatenation. Besides, we recorded the Real-ArDVS10 dataset, which includes real-world events from 2s to 75s across 10 classes. We believe they will enhance evaluation for recognizing event streams of arbitrary duration and inspire further research in this field.We conduct extensive experiments to evaluate our PAST-SSM on four publicly available datasets, showing superior or competitive performance with fewer model parameters. For example, it outperforms previous methods by +3.45%, +0.38%, and +8.31% on the DVS Action, SeAct, and HARDVS datasets, respectively. Meanwhile, it achieves 97.35%, 100.00% and 89.00% Top-1 accuracy on our proposed ArDVS100, Real-ArDVS10, and TemArDVS datasets respectively. Additionally, our PAST-SSM shows strong generalization with a maximum performance drop of only 8.62% across varying inference frequencies, compared to 27.59% for the previous sampling method.

## 2 Related Works

Event-based Object / Action Recognition. Existing event-based recognition works cover two main tasks based on the event’s duration: object recognition(Zhou etal., 2023; Zheng & Wang, 2024; Gallego etal., 2020; Kim etal., 2021; Zheng etal., 2023; Gehrig etal., 2019; Gu etal., 2020; Deng etal., 2022a; Li etal., 2021; Liu etal., 2022) and action recognition(Zhou etal., 2024; Xie etal., 2024; Sabater etal., 2022; Xie etal., 2023; Gao etal., 2023; Plizzari etal., 2022; Xie etal., 2022; Liu etal., 2021). Specifically, events for object recognition capture stationary objects with duration from 0.1 to 0.3 s, whereas action recognition records dynamic human actions over a longer duration (avg. 1-10 s). Among them, methods for modeling spatiotemporal relationships of events with varying duration can be structurally categorized into two types, as shown in Fig.2: 1) step-by-step structure models and 2) recurrent structure models. Initially, the events are sampled into slices at fixed time intervals. The step-by-step structure models then use off-the-shelf backbones to extract local-range spatiotemporal features from event slices and then perform long-range temporal modeling using various methods, such as simple average operation(Zhou etal., 2024; 2023), proposedmodules(Xie etal., 2024; Yao etal., 2021) and loss guidance(Zheng & Wang, 2024; Kim etal., 2022). Recurrent structure models(Sabater etal., 2022; Zubić etal., 2023), on the other hand, process the event slices sequentially, updating their hidden state based on the input at each time step. Both structures ensure adaptability to varying time durations. However, step-by-step structure models struggle with high computational complexity when handling longer-duration events, such as those at minute-level granularity. Recurrent structure models tend to forget the initial information due to their simplistic recurrent design and require longer training time because of their inability to process data in parallel. Additionally, as evidenced in Tab.8, existing methods struggle to generalize across different inference frequencies, which is essential for applications in high-speed, dynamic visual scenarios(Zubic etal., 2024). In this work, we aim to improve event-based recognition for minute-level duration with improved generalization across varying inference frequencies.

State Space Model (SSM).It has recently demonstrated considerable effectiveness in capturing the dynamics and dependencies of long sequences. Various models have been developed, such as S4D(Gu etal., 2022), S5(Smith etal., 2022), S6(Wang etal., 2023) and H3(Fu etal., 2022). Mamba(Gu & Dao, 2023) stands out by introducing a data-dependent SSM layer, a selection mechanism, and performance optimizations at the hardware level. Compared to transformers(Brown etal., 2020; Lu etal., 2019), which rely on quadratic complexity attention, SSMs excel at processing long sequences with linear complexity. Mamba(Gu & Dao, 2023) distinguishes itself by introducing a data-dependent SSM layer and a selection mechanism, employing parallel scanning as input during training and recurrent input during evaluation. It motivates a step-by-step of works in the vision(Zhu etal., 2024), video(Li etal., 2024), and point cloud(Zhang etal., 2024) domains.Recently, there has been growing interest in exploring the temporal modeling capabilities of SSMs for event data, given the high temporal resolution of event cameras(Zubic etal., 2024). Specifically, Zubic etal. (2024) first integrates several SSMs with a recurrent ViT framework for event-based object detection. It enhances the adaptability for varying sampling frequencies by low-pass band-limiting loss. However, it overlooks generalization across different event durations and achieves unsatisfactory performance in sampling frequency generalization. In contrast, our work seeks to recognize event streams of arbitrary duration based on SSM by employing a path-adaptive event scan module and generalizing over varying inference frequencies.

## 3 Preliminaries

Event Stream. Event cameras capture object movement by recording the pixel-level log intensity changes, rather than capturing full-frame at fixed intervals for conventional cameras. The asynchronous events, denoted as $\mathcal{E}=\{e_{i}(x_{i},y_{i},t_{i},p_{i})\},i=1,2,...,N$, reflects the brightness change $e_{i}$ for a pixel at the timestamp $t_{i}$, with coordinates $(x_{i},y_{i})$, and polarity $p_{i}\in\{1,-1\}$(Gallego etal., 2020; Zheng etal., 2023). Here, 1 and -1 represent the positive and negative brightness changes. Refer to the appendix for more details about the principle of event cameras.

SSM for Vision. SSMs(Gu etal., 2022; Smith etal., 2022; Fu etal., 2022; Wang etal., 2023) originate from the principles of continuous systems that map an input 1D sequence $x(t)\in\mathbb{R}^{L}$ into the output sequence $y(t)\in\mathbb{R}^{L}$ through an underlying hidden state $h(t)\in\mathbb{R}^{N}$. Specifically, it is formalized by $dh(t)/dt=Ah(t)+Bx(t)$ and $y(t)=Ch(t)+Dx(t)$, where $A\in\mathbb{R}^{N\times N}$, $B\in\mathbb{R}^{N\times 1}$, $C\in\mathbb{R}^{N\times 1}$, $D\in\mathbb{R}^{N\times 1}$ are the state matrix, the input projection matrix, the output projection matrix, and the feed-forward matrix. Refer to the appendix for more technical details.

## 4 Proposed Method

Overview. The PAST-SSM framework, as depicted in Fig.3, processes arbitrary-duration events using our PEAS module, followed by the SSM’s spatiotemporal modeling to predict various recognition outcomes, including objects, actions, and event streams of arbitrary duration. It comprises two components: 1) the PEAS module introduced in Sec.4.1 for event sampling, frame aggregation path-adaptive event selection, and bidirectional event scan to encode events into sequence features with fixed dimensions. On top of PEAS, the MSG loss $\mathcal{L}_{MSG}$ detailed in Sec.4.3 is proposed for minimizing the randomness and redundancy of encoded features; and 2) the event spatiotemporal modeling module discussed in Sec.4.2 to predict the final recognition results. The following subsections provide a detailed description of these components.

### 4.1 Path-adaptive Event Aggregation and Scan (PEAS) Module

We aim to recognize event streams of arbitrary duration. Following the previous frame-based event presentation methods, the events are prepossessed into aggregated event frames. Given the high temporal resolution of events, the number of aggregated event frames $P$ for arbitrary duration may vary significantly. For example, if events lasting between 0.1s and 300s are sampled at 50Hz (every 0.02s), the number of resulting frames can range from 5 to 15K. This variability introduces complexity for spatiotemporal event modeling. Additionally, due to SSM’s recurrent nature, its hidden state update is greatly affected by the input sequence length and feature order, especially when modeling the long-range temporal dependencies. To reduce this variability, we propose our PEAS module, which consists of the following four components to encode events of arbitrary duration into sequence features with fixed dimensions in an end-to-end learning manner.

Event Sampling and Frame Aggregation.Unlike sequential language with compact semantics, events $\mathcal{E}=\{e_{i}(x_{i},y_{i},t_{i},p_{i})\}\in\mathbb{R}^{N\times 4},i=1,2,%...,N$ denotes the asynchronous intensity change at the pixel $(x_{i},y_{i})$ at time $t_{i}$ with polarity $p_{i}\in\{1,-1\}$. The complexity of spatiotemporal event data requires efficient processing of this high-dimensional data. Following previous methods(Zhou etal., 2023; Zubic etal., 2024; Bi etal., 2020; Sabater etal., 2022),

we sample events with duration $T$ at every fixed temporal windows $1/f$, where $f$ denotes the sampling frequency, e.g.50 ms time windows $1/f$ corresponding to sampling frequency $f=20Hz$. We group a number of events $G$ at each sampling time, as shown in Fig.4 (b)). This sampling method is more effective and robust than grouping events within fixed time windows as illustrated in Fig.4 (a), as evidenced in the following Sec5.3. Therefore, we obtain $P=Tf$ event groups $\mathcal{E}^{{}^{\prime}}\in\mathbb{R}^{P\times G\times 4}$. Then, we utilize the event frame representation(Zhou etal., 2023) to transform the event groups $\mathcal{E}^{{}^{\prime}}$ into a series of event frames $F\in\mathbb{R}^{P\times H\times W\times 3}$. This transformation enables the use of traditional computer vision methods designed for frame-based data.

Path-adaptive Event Selection.

With the aggregated event frame input $F$, we then conduct our path-adaptive event selection to select $K$ event frames to reduce the variability of events of arbitrary duration. Concretely, as shown in Fig.3, the input of this module is the aggregated event frames $F\in$ $\mathbb{R}^{P\times H\times W\times 3}$. We utilize a lightweight score predictor composed of two 3D convolutional layers, followed by an activation function to generate a selection mask $M\in\mathbb{R}^{K\times P}$, where $K$ represents the number of selected frames and $P$ represents the number of original frames. $M$ consists of 0s and 1s, where the position of each 1 indicates the corresponding position of the selected event frame. Due to the non-differentiable nature of the max operation applied after the standard Softmax function to produce class probabilities, we employ a differentiable Gumbel Softmax(Jang etal., 2016) to facilitate backpropagation. To enhance the training process, the Gumbel Softmax is used exclusively during training, while the standard Softmax is applied during inference.Next, we utilize the Einsum matrix-matrix multiplication between the selection mask $M$ and the original event frames $F$ to obtain the final selected event frames $F^{{}^{\prime}}\in$ $\mathbb{R}^{K\times H\times W\times 3}$. The above process ensures that $F^{{}^{\prime}}$ can be derived from the original event frame input $F$ through an end-to-end learning approach. Please refer to the appendix for the pseudocode for the PEAS Module.

Fig.5 presents the original event frames alongside the $K$ selected ones at the start (epoch 0) and end (epoch 100) of the training process. Due to the events of arbitrary duration leading to different numbers of input event frames, frame padding is necessary to maintain consistent input sizes to ensure training among a mini-batch. In Fig.5, the black parts resent the padded zero-valued frames within a mini-batch. At epoch 0, the PEAS module randomly selects event frames, resulting in unnecessary padded frames and redundant event frames with repetitive information. After 100 epochs, the eight chosen frames exclude redundant frames and non-informative padding, demonstrating the effectiveness of the PEAS module and the MSG loss proposed in Sect.4.3.

Bidirectional Event Scan. Next, with the obtained selected event frames $F^{{}^{\prime}}\in$ $\mathbb{R}^{K\times H\times W\times 3}$, we convert the selected event frames into a 1D sequence using the bidirectional event scan, following the spatiotemporal scan proposed in (Li etal., 2024). As illustrated in Fig.3, this scan elegantly follows the temporal and spatial order, sweeping from left to right and cascading from top to bottom. In this way, the events of arbitrary duration are transformed into encoded features with fixed dimensions.

### 4.2 Event Spatiotemporal Modeling Module

On top of the PEAS module, the events of arbitrary duration are transformed into the event frame sequence $F^{{}^{\prime}}\in\mathbb{R}^{K\times H\times W\times 3}$. Given the inherently longer sequences because of the event stream’s high temporal resolution, we leverage the SSM for event spatiotemporal modeling with linear complexity.As shown in Fig.3, we first employ the 3D convolution with kernel size $1\times 16\times 16$ for patch embedding to transform the event frames into $L$ non-overlapping spatiotemporal tokens $x_{e}\in\mathbb{R}^{L\times C}$, where $L=T_{s}\times H\times W/16\times 16$. The SSM model, designed for sequential data, is sensitive to token positions, making preserving spatiotemporal position information crucial. Thus, we concatenate a learnable classification token $X_{cls}\in\mathbb{R}^{1\times C}$ at the start of the sequence and then add a learnable spatial position embedding $P_{s}\in\mathbb{R}^{(1+L)\times C}$ and temporal embedding $P_{t}\in\mathbb{R}^{T_{s}\times C}$ to obtain the final input sequence $x=[x_{cls},x_{e}]+P_{s}+P_{t}$. Next, the input sequence $x$ passes into $L$ layers of stacked B-Mamba blocks.(Gu & Dao, 2023). Note that the bidirectional event scan is actually conducted in the B-Mamba blocks for code implementation.Finally, the [CLS] token is extracted from the final layer’s output and forwarded to the classification head, which consists of the normalization layer and the linear classification layer for the final prediction $y$.

### 4.3 Multi-faceted Selection Guiding (MSG) Loss

While the proposed PEAS module is differentiable and capable of learning through back-propagation, the basic multi-class cross-entropy loss, ${L}_{CLS}$, is inadequate for effectively guiding model optimization.Due to the random weight initialization of the PEAS module, the selection of event frames is stochastic at the onset of training. However, throughout the training process, the model is limited to optimizing its performance based on the distribution of the randomly selected event frames, rather than enhancing the PEAS module to facilitate adaptive selection and scanning of the input events. To facilitate effective optimization, we propose the MSG loss that addresses two crucial aspects:1)minimizing the randomness of the selection process to ensure the selected sequence features can encapsulate the entirety of the sequence; and 2) guaranteeing that each selected events feature stands out with each other, thus eliminating redundancy.The MSG loss comprises three components, which will be detailed in the subsequent subsections.

Within-Frame Event Information Entropy (WEIE) Loss: Given the random initialization of the score predictor’s weight proposed in the PEAS module (Sec.4.1), the frame selection process tends to be random. For each selected event frame, we introduce a WEIE Loss $\mathcal{L}_{WEIE}$, which quantifies the image entropy of each event frame. Intuitively, a higher WEIE loss indicates that the selected event frame contains more information and richer details. Maximizing this loss helps enhance model optimization to minimize randomness in the selection process. It is defined as follows:

$\mathcal{L}_{WEIE}=-\sum_{k=1}^{K}\sum_{i=1}^{N}P_{i}^{k}logP^{k}_{i}/K,\quad P%^{k}=\textit{hist}(\textit{gray}(F^{{}^{\prime}}_{k})),$ | (1) |

wherehist$(.)$ indicateshistogram statistics; $N$ is the number of histogrambins; gray$(.)$ converts RGB event frames to grayscale; $P^{k}$ indicates the histogram statistics frequency for selected event frame $F^{{}^{\prime}}_{k}$; $K$ indicates the number of selected event frames.

Inter-frame Event Mutual Information (IEMI) Loss: The IEMI loss is proposed to reduce redundancy among the selected event frames. In light of the mutual information from the information theory(Russakoff etal., 2004), the IEMI loss is defined to $\mathcal{L}_{IEMI}$ quantifies the uncommon information between two event frames. Intuitively, a lower IEMI loss signifies greater differences between the frames. Thus, minimizing IEMI loss guides the model to maximize the difference of selected event frames. While the IEMI loss can be computed between any two event frames, we restrict our calculation within every consecutive event frames $F^{{}^{\prime}}\in$ $\mathbb{R}^{K\times H\times W\times 3}$ to reduce computational cost. Formally, the proposed event mutual information is composed of the coordinate-weighted joint event count histogram $hist(.)$ between every two consecutive event frames $F^{{}^{\prime}}_{k}$ and $F^{{}^{\prime}}_{k+1}$, added with their spatial coordinates $C_{x}$ and $C_{y}$. The IEMI loss $\mathcal{L}_{IEMI}$ is formulated as follows:

$P^{k}_{joint}=\textit{hist}(\textit{gray}(F^{{}^{\prime}}_{k}+F^{{}^{\prime}}_%{k+1}+C_{x}+C_{y})),$ | (2) |

$\mathcal{L}_{IEMI}=-\sum_{k=1}^{K-1}(\sum_{i=1}^{N}\sum_{j=1}^{N}P^{k}_{joint}%(i,j)log(P(i)P(j)/P^{k}_{joint}(i,j)))/(K-1),$ | (3) |

where $N$ indicates the number of histogram bins and $K$ is the number of selected event frames;

Mask Selection (MS) Loss: Due to the arbitrary length of event streams with different numbers of input event frames, frame padding is necessary to maintain consistent input sizes to ensure

training among a mini-batch. Therefore, we propose an MS loss $\mathcal{L}_{MS}$ to filter out the padded frames during the selection process. Specifically, as shown in Fig.6, given original event frames input $F\in\mathbb{R}^{P\times H\times W\times 3}$ and the selection mask $M\in\mathbb{R}^{K\times P}$ mentioned in Sec.4.1, the $\mathcal{L}_{MS}$ loss sum the mask value $M_{j},j=Ori+1,...,Ori+Pad$ at the corresponding position of the padding frame in $F_{j},j=ori+1,...,Ori+Pad$, which is formulated as follows:

$\mathcal{L}_{MS}=\sum_{i=1}^{K}\sum_{j=Ori+1}^{Pad}M_{i,j}/(K\times Pad),$ | (4) |

$K$, $Ori=P$, and $Pad$ indicate the number of selected event frames, original event frames, and padding frames respectively.

Total Objective:Given the final prediction class $y$ and the ground-truth class $Y$, the total objective is composed by the MSG loss $\mathcal{L}_{MSG}$ with three components and the commonly used multiclass cross-entropy loss $\mathcal{L}_{CLS}$:

$\mathcal{L}_{total}=\underbrace{\mathcal{L}_{IEMI}-\mathcal{L}_{WEIE}+\mathcal%{L}_{MS}}_{\mathcal{L}_{MSG}}+\mathcal{L}_{CLS}(y,Y).$ | (5) |

## 5 Experiments and Evaluation

### 5.1 Experiments settings

Public Available Datasets: Four publicly available event-based datasets are evaluated in this paper as follows: 1) DVS Action(Miao etal., 2019), also known as PAF, is an indoor dataset featuring 450 recordings across ten action categories lasting around 5s. 2) SeAct(Zhou etal., 2024) is a newly released dataset for event-based action recognition, covering 58 actions within four themes lasting around 2s-10s. This work uses only class-level labels despite available caption-level labels. 3) HARDVS(Wang etal., 2024b) is currently the largest dataset for event-based action recognition, comprising 107,646 recordings of 300 action categories. It also has an average duration of 5s and a resolution of 346 $\times$ 260.4) N-Caltech101(Orchard etal., 2015) contains event streams captured by an event camera in front of a mobile 180 $\times$ 240 ATIS system(Posch etal., 2010) with the LCD monitor presenting the original RGB images in Caltech101. There are 8,246 samples comprising 300 ms in length, covering 101 different types of items.

Our Minute-level ArDVS100, Real-ArDVS10 and TemArDVS100 Dataset. Given existing datasets only provide second-level duration events lasting approximately 0.1s to 0.3 s for objects and up to 20s for actions (Please refer to the appendix for all event-based object & action recognition dataset comparison),we propose the first arbitrary-duration dataset consisting of event streams of arbitrary durations, named ArDVS100 and TemArDVS datasets. Specifically, both the ArDVS100 and TemArDVS datasets contain 100 action classes with events ranging from 1s to 256s and 14s to 215s respectively; however, TemArDVS offers with fine-grained temporal labels that highlight the temporal sequence of actions. For instance, in TemArDVS100, ‘sit down then get up’ and ‘get up then sit down’ are distinct actions, while in ArDVS100, they are considered the same. Both datasets are synthesized by concatenating event streams from the HARDVS(Wang etal., 2024b) dataset for ArDVS100 and from HARDVS and DailyDVS-200(Wang etal., 2024a) for the TemArDVS dataset. We allocated 80% for training and 20% for testing (evaluating). Additionally, to assess the model’s real-world applicability, we created a real-world dataset, named Real-ArDVS10, comprising event-based actions lasting from 2s to 75s, encompassing 10 distinct classes selected from the ArDVS100 datasets. It was recorded using the DVS346 event camera, which has a resolution of 346 $\times$ 240 pixels. It is divided into 70% for training and 30% for testing (evaluation). We aim for our ArDVS100, Real-ArDVS10, and TemArDVS datasets to enhance evaluation for event-based action recognition and inspire further research.

Model Architecture: We utilize the default hyperparameters for the B-Mamba layer(Zhu etal., 2024), setting the state dimension to 16 and the expansion ratio to 2.

Model | Layer $L$ | Dim $D$ | Param. |

Tiny(T) | 24 | 192 | 7M |

Small(S) | 24 | 384 | 25M |

Middle(M) | 32 | 576 | 74M |

In alignment with ViT(Dosovitskiy etal., 2020), we modify the depth and embedding dimensions to match models of comparable sizes, including Tiny (T), Small (S), and Middle (M), as outlined in Tab.1. The stated model parameter is an estimate, as the actual parameter varies depending on the number of categories and selected event frames amount $K$.

Experimental Settings: We utilize the AdamW optimizer with a cosine learning rate schedule with the initial 5 epochs for linear warm-up. Unless a special statement, the default settings for the learning rate and weight decay are 1e-3 and 0.05, respectively. The model is trained with 100 epochs for DVS Action, SeAct, and N-Caltech101 datasets and 50 epochs for HARDVS and our ArDVS100 datasets. Additionally, we employ BFloat16 precision during training to improve stability. For data augmentation, we implement random scaling, random cropping, random flipping, and data mixup of the event frames during the training phase. We adopt the pre-trained VideoMamba(Li etal., 2024) model checkpoints for initialization. Refer to the appendix for additional experimental settings for each dataset. All ablation studies, unless specifically stated, use the Tiny version on the DVS Action dataset at a sampling frequency of 0.8 Hz with 32 selected event frames.

### 5.2 Experiments Results

#### 5.2.1 Event-based Arbitrary Duration Recognition Results

In this section, we evaluate our proposed PAST-SSM method for the recognition of event streams across three time duration: (1) 0.1s to 0.3s, (2) 1s to 10s, and (3) 1s to 265s.

Results for recognizing 0.1s to 0.3s event streamsWe evaluate our PAST-SSM on the popular event-based object recognition datasets, namely N-Caltech101, the average duration of which is 0.3s.

Object Recognition (Avg. 0.1s-0.3s) | ||

Model | Param. | Top-1 Accuracy(%) |

RG-CNNs(Cannici etal., 2020) | 19M | 65.70 |

Cho etal. (2023) | - | 82.61 |

EDGCN(Deng etal., 2024) | 0.77M | 83.50 |

Matrix-LSTM(Cannici etal., 2020) | - | 84.31 |

Yang etal. (2023) | 21M | 87.66 |

MEM(Klenk etal., 2024) | - | 90.10 |

EventDance(Zheng & Wang, 2024) | 26M | 92.35 |

PAST-SSM-T-$K$(1) | 88.29 | |

PAST-SSM-T-$K$(2) | 7M | 89.72 |

PAST-SSM-S-$K$(1) | 90.92 | |

PAST-SSM-S-$K$(2) | 25M | 91.96 |

PAST-SSM-M-$K$(1) | 94.20 | |

PAST-SSM-M-$K$(2) | 74M | 94.60 |

As shown in Tab.2, our PAST-SSM-M-K(2) secures a notable advantage, outperforming EventDance(Zheng & Wang, 2024) by +2.25%. This achievement underscores the potential of our purely SSM-based model in efficiently and effectively recognizing second-level event streams, highlighting its competence for local-rang event spatiotemporal modeling.

Results for recognizing 1s to 20s event streams Tab.3 presents results from event-based action recognition datasets with average durations of 1s to 10s. Our PAST-SSM-M outperforms previous methods, exceeding ExAct(Zhou etal., 2024) by +3.45% and +0.38% on the DVS Action and SeAct datasets, respectively. Additionally, the PAST-SSM-S-K(16) achieves a remarkable 98.41% Top-1 accuracy on the HARDVS dataset, surpassing ExAct(Zhou etal., 2024) by +8.31% while using only 25M parameters. This advancement also reduces computational demands due to the fewer parameters.

Action Recognition (Avg. 1s-10s) | ||||

Top-1 Accuracy(%) | ||||

Model | Param. | DVS Action | SeAct | HARDVS |

EV-ACT(Gao etal., 2023) | 21.3M | 92.60 | - | - |

EventTransAct(deBlegiers etal., 2023) | - | - | 57.81 | - |

EvT(Sabater etal., 2022) | 0.48M | 61.30 | - | |

TTPIONT(Ren etal., 2023) | 0.33M | 92.70 | - | - |

Speck(Yao etal., 2024) | - | - | - | 46.70 |

ASA(Yao etal., 2023) | - | - | - | 47.10 |

ESTF(Wang etal., 2024b) | - | - | - | 51.22 |

ExACT(Zhou etal., 2024) | 471M | 94.83 | 66.07 | 90.10 |

PAST-SSM-T-$K$(8) | 91.38 | 51.72 | 98.40 | |

PAST-SSM-T-$K$(16) | 7M | 94.83 | 49.14 | 98.37 |

PAST-SSM-S-$K$(8) | 93.33 | 60.34 | 98.20 | |

PAST-SSM-S-$K$(16) | 25M | 96.55 | 62.07 | 98.41 |

PAST-SSM-M-$K$(8) | 98.28 | 65.52 | 98.05 | |

PAST-SSM-M-$K$(16) | 74M | 96.55 | 66.38 | 98.20 |

Arbitrary-duration Event Recognition (Avg. 1s to 265s) | ||||

Top-1 Accuracy(%) | ||||

Model | Param. | ArDVS100 | Real-ArDVS10 | TemArDVS100 |

PAST-SSM-T-$K$(16) | 90.20 | 80.00 | 59.20 | |

PAST-SSM-T-$K$(32) | 7M | 93.85 | 93.33 | 89.00 |

PAST-SSM-S-$K$(16) | 94.90 | 90.00 | 62.90 | |

PAST-SSM-S-$K$(32) | 25M | 96.00 | 100.00 | 73.41 |

PAST-SSM-M-$K$(16) | 96.00 | 93.33 | 71.06 | |

PAST-SSM-M-$K$(32) | 74M | 97.35 | 100.00 | 82.50 |

Results for recognizing 1s to 265s event streamsAs illustrated in Tab.4, the linear complexity of PAST-SSM makes it well-suited for end-to-end training with arbitrary-duration event streams. We evaluate PAST-SSM on our ArDVS100 and TemArDVS100 datasets with event streams ranging from 1s to 265s. For ArDVS100 dataset, our PAST-SSM-M-K(32) achieves excellent 97.35% Top-1 accuracy. In the case of the more challenging TemArDVS100 dataset with fine-grained temporal labels, our PAST-SSM-T-K(32) reaches a Top-1 accuracy of 89.00% with reduced computational complexity and less training time, proving its advanced spatiotemporal modeling ability for distinguishing the timing of each action. Additionally, our PAST-SSM-S-K(32) achieved 100% Top-1 accuracy for recognizing the real-word event stream raging from 2s to 75s across 10 classes, demonstrating its effectiveness for real-world applications. For comparison methods, we fail to evaluate previous methods based on ViT or CNN backbones on our minute-level datasets because of their quadratic computational complexity or limited receptive field. This result highlights PAST-SSM’s effectiveness and its great potential for future arbitrary-duration event stream comprehension.

#### 5.2.2 Generalization results across Varying Inference Frequencies.

Datasets & Specific Experiments settings We trained our PAST-SSM-S on the DVS Action dataset across varying sampling frequencies, specifically at 20 Hz, 60 Hz, and 100 Hz, which correspond to low, medium, and high sampling frequencies, respectively. We assessed their performance under different inference frequencies ranging from 20 Hz to 100 Hz. We also examine two frame aggregation methods for sampling at fixed time intervals, which serve as the baseline: fixed ’Time Windows’ aggregation and fixed ’Event Counts’ aggregation. Fig.4 highlights the differences between these methods: ’Time Windows’ results in varying temporal ranges for aggregation, whereas ’Event Counts’ ensure consistent temporal ranges for aggregation. (please refer to Sec.5.3 for more explanation and discussion.)

Results & DiscussionAs shown in Fig.7, regardless of whether the model is trained at low, medium, or high frequencies, our models demonstrate consistently strong performance across various inference frequencies, with a maximum performance drop of only 8.62% when our PAST-SSM model trained at 60Hz and evaluated at 100HZ. This finding underscores their robustness and generalizability compared to the baseline methods (’Time Windows’ and ’Event Counts’), which typically experience significant performance declines, such as -18.96%, -20.59%, -29.32% for ’Time Windows’ trained at 20 Hz, 60 Hz, and 100 Hz and evaluated at 60 Hz, 100 Hz, and 20 Hz, respectively. (Please refer to the appendix for the specific statistics result for Fig.7.)

### 5.3 Ablation Study

We conduct ablation experiments on our PAST-SSM framework to evaluate the effectiveness of the PEAS module (Sec.4.1), $\mathcal{L}_{MSG}$ loss (Sec.4.3), and other hyper-parameters.

Impact of PEAS module & $\mathcal{L}_{MSG}$ loss. We ablate the key two components of our PAST-SSM model, namely the PEAS module (Sec.4.1) and the $\mathcal{L}_{MSG}$ loss (Sec.4.3). As shown in Tab.5,

Settings | DVS Action ($K$(16)) | |

Top1(%) | Top5(%) | |

Random Sampling | 92.98% | 100.00% |

PEAS | 93.33% | 100.00% |

PEAS + $\mathcal{L}_{MSG}$ | 94.83% | 100.00% |

the ’Random Selection’ refers to the baseline where we select $K$ event frames randomly and it achieves 92.98% Top-1 accuracy. With the PEAS module for path-adaptive event frame selection in an end-to-end manner, we achieve 93.33% Top-1 accuracy with +0.35% performance gain, thus proving the effectiveness of the PEAS module. When equipped with both the PEAS module and (Sec.4.1) and the $\mathcal{L}_{MSG}$ loss, the full model achieves 94.83% Top-1 accuracy with +1.85% performance gain, thus proving the effectiveness of proposed $\mathcal{L}_{MSG}$ loss to reduce randomness in the selection and promote effective sequence feature learning.

Effectiveness of Multi-faceted Selection Guiding Loss $\mathcal{L}_{MSG}$. As presented in Tab. 6, we conduct

$\mathcal{L}_{MSG}$ | DVS Action($K$(16)) | ||||

$\mathcal{L}_{CLS}$ | $\mathcal{L}_{IEMI}$ | $\mathcal{L}_{WEIE}$ | $\mathcal{L}_{MS}$ | Top1(%) | Top5(%) |

✓ | ✗ | ✗ | ✗ | 89.65 | 98.25 |

✓ | ✓ | ✗ | ✗ | 91.38 | 100.00 |

✓ | ✓ | ✓ | ✗ | 93.10 | 100.00 |

✓ | ✓ | ✓ | ✓ | 94.83 | 100.00 |

an ablation study on the four components of$\mathcal{L}_{MSG}$ (Eq.5). The component $\mathcal{L}_{CLS}$ serves as the baseline, performing optimization exclusively with the standard cross-entropy loss and achieving a Top-1 accuracy of 89.65%. By employing the proposed $\mathcal{L}_{IEMI}$ (Eq.2) to enhance comprehension of the selected input sequence, we attain a Top-1 accuracy of 91.38%, representing a performance gain of 1.73%. The integration of $\mathcal{L}_{WEIE}$ (Eq.1) for frame distinctness yields an additional 3.45% increase in accuracy, resulting in a Top-1 accuracy of 93.10%. Lastly, the component $\mathcal{L}_{MS}$ (Eq.4), designed for filtering out padded frames, also contributes a 5.18% improvement in accuracy, achieving a Top-1 accuracy of 94.83%. In summary, all three proposed components positively impact the final classification, thereby demonstrating their effectiveness.

Frame Aggregation Method: Time Windows vs. Event Counts. To erase the computational burden when processing the event with spatiotemporal richness, existing methodspredominantly sample and aggregate events at every fixed temporal interval, i.e., frequency. In general, this aggregation process can be categorized into two methods: fixed time windows and fixed event counts. Fig.4 illustrates distinctions between the two methods: ’Event Counts’ aggregation leads to varying aggregation temporal ranges, while ’Time Windows’ keeps them consistent. Tab. 8 presents our model’s performance with these two aggregation methods at different evaluated frequencies. We observe that ’Event Counts’ tend to achieve better Top-1 accuracy compared to ’Time Windows’. For example, ’Event Counts’ achieves 96.55% Top-1 accuracy in comparison to 94.83% Top-1 accuracy for ’Time Windows’ when both trained and evaluated at 60Hz. However, both methods perform poorly when training and evaluating at different frequencies, with -24.14% for ’Event Counts’ at 20 Hz evaluated at 100 Hz, and -25.86% for ’Time Windows’ at 100 Hz evaluated at 20 Hz. This leads us to propose the PEAS module to improve model generalization across inference frequencies.

Representation | N-Caltech101 ($K$(1)) | DVS Action ($K$(16)) | ||

Top1(%) | Top5(%) | Top1(%) | Top5(%) | |

Frame(Gray) | 90.48% | 97.53% | 93.33% | 100.00% |

Frame(RGB) | 90.94% | 97.82% | 94.83% | 100.00% |

Voxel | 90.19% | 97.02% | 92.47% | 100.00% |

TBR | 90.24% | 97.13% | 91.72% | 100.00% |

Frame-based Event Representation.Tab.8 displays the impact of four existing frame-based event representations. The RGB frame(Zhou etal., 2023) representation attains Top-1 accuracy rates of 90.94% on the N-Caltech101 dataset and 94.83% on the DVS Action dataset, surpassing the performance of the other three frame-based representations, including gray frame(Zhou etal., 2023), Voxell(Deng etal., 2022b) and TBR(Innocenti etal., 2021).

## 6 Conclusion and Future Work

In this paper, we have presented a novel approach, named PAST-SSM, for recognizing events of arbitrary duration and generalizing to varying inference frequencies.Extensive experiments prove that PAST-SSM outperforms prior arts with fewer parameters on four publicly available datasets and can successfully recognize events of arbitrary duration on our ArDVS100 (1s to 256s), Real-ArDVS10 (2s to 75s) and TemArDVS (14s to 215s) datasets. Moreover, it also shows strong generalization across varying inference frequencies. We hope this method can pave the way for future model design for recognizing events with longer duration and applications for high-seed dynamic visual scenarios.

Limitation. We observe that larger VideoMamba tends to overfit during our experiments, resulting to suboptimal performance. This issue is not limited to our models but also observed in VMamba(Gu & Dao, 2023) and VideoMamba(Li etal., 2024). Future research could explore training strategies such as Self-Distillation and advanced data augmentation to mitigate this overfitting.

## References

- Bi etal. (2020)Yin Bi, Aaron Chadha, Alhabib Abbas, Eirina Bourtsoulatze, and Yiannis Andreopoulos.Graph-based spatio-temporal feature learning for neuromorphic vision sensing.
*IEEE Transactions on Image Processing*, 29:9084–9098, 2020. - Brown etal. (2020)Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, JaredD Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, etal.Language models are few-shot learners.
*Advances in neural information processing systems*, 33:1877–1901, 2020. - Cannici etal. (2020)Marco Cannici, Marco Ciccone, Andrea Romanoni, and Matteo Matteucci.A differentiable recurrent surface for asynchronous event-based data.In
*Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX 16*, pp. 136–152. Springer, 2020. - Cho etal. (2023)Hoonhee Cho, Hyeonseong Kim, Yujeong Chae, and Kuk-Jin Yoon.Label-free event-based object recognition via joint learning with image reconstruction from events.In
*Proceedings of the IEEE/CVF International Conference on Computer Vision*, pp. 19866–19877, 2023. - deBlegiers etal. (2023)Tristan deBlegiers, IshanRajendrakumar Dave, Adeel Yousaf, and Mubarak Shah.Eventtransact: A video transformer-based framework for event-camera based action recognition.In
*2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)*, pp. 1–7. IEEE, 2023. - Deng etal. (2022a)Yongjian Deng, Hao Chen, Hai Liu, and Youfu Li.A Voxel Graph CNN for Object Classification With Event Cameras.In
*CVPR*, 2022a. - Deng etal. (2022b)Yongjian Deng, Hao Chen, Hai Liu, and Youfu Li.A voxel graph cnn for object classification with event cameras.In
*Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pp. 1172–1181, 2022b. - Deng etal. (2024)Yongjian Deng, Hao Chen, and Youfu Li.A dynamic gcn with cross-representation distillation for event-based learning.In
*Proceedings of the AAAI Conference on Artificial Intelligence*, volume38, pp. 1492–1500, 2024. - Dosovitskiy etal. (2020)Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, etal.An image is worth 16x16 words: Transformers for image recognition at scale.
*arXiv preprint arXiv:2010.11929*, 2020. - Fu etal. (2022)DanielY Fu, Tri Dao, KhaledK Saab, ArminW Thomas, Atri Rudra, and Christopher Ré.Hungry hungry hippos: Towards language modeling with state space models.
*arXiv preprint arXiv:2212.14052*, 2022. - Gallego etal. (2020)Guillermo Gallego, Tobi Delbrück, Garrick Orchard, Chiara Bartolozzi, Brian Taba, Andrea Censi, Stefan Leutenegger, AndrewJ Davison, Jörg Conradt, Kostas Daniilidis, etal.Event-based vision: A survey.
*IEEE transactions on pattern analysis and machine intelligence*, 44(1):154–180, 2020. - Gao etal. (2023)Yue Gao, Jiaxuan Lu, Siqi Li, Nan Ma, Shaoyi Du, Yipeng Li, and Qionghai Dai.Action recognition and benchmark using event cameras.
*IEEE Transactions on Pattern Analysis and Machine Intelligence*, 2023. - Gehrig etal. (2019)Daniel Gehrig, Antonio Loquercio, KonstantinosG Derpanis, and Davide Scaramuzza.End-to-end learning of representations for asynchronous event-based data.In
*ICCV*, 2019. - Gu & Dao (2023)Albert Gu and Tri Dao.Mamba: Linear-time sequence modeling with selective state spaces.
*arXiv preprint arXiv:2312.00752*, 2023. - Gu etal. (2022)Albert Gu, Karan Goel, Ankit Gupta, and Christopher Ré.On the parameterization and initialization of diagonal state space models.
*Advances in Neural Information Processing Systems*, 35:35971–35983, 2022. - Gu etal. (2020)Fuqiang Gu, Weicong Sng, Tasbolat Taunyazov, and Harold Soh.Tactilesgnet: A spiking graph neural network for event-based tactile object recognition.In
*IROS*, 2020. - Innocenti etal. (2021)SimoneUndri Innocenti, Federico Becattini, Federico Pernici, and Alberto DelBimbo.Temporal binary representation for event-based action recognition.In
*2020 25th International Conference on Pattern Recognition (ICPR)*, pp. 10426–10432. IEEE, 2021. - Jang etal. (2016)Eric Jang, Shixiang Gu, and Ben Poole.Categorical reparameterization with gumbel-softmax.
*arXiv preprint arXiv:1611.01144*, 2016. - Kim etal. (2021)Junho Kim, Jaehyeok Bae, Gangin Park, Dongsu Zhang, and YoungMin Kim.N-imagenet: Towards robust, fine-grained object recognition with event cameras.In
*Proceedings of the IEEE/CVF international conference on computer vision*, pp. 2146–2156, 2021. - Kim etal. (2022)Junho Kim, Inwoo Hwang, and YoungMin Kim.Ev-tta: Test-time adaptation for event-based object recognition.In
*Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pp. 17745–17754, 2022. - Klenk etal. (2024)Simon Klenk, David Bonello, Lukas Koestler, Nikita Araslanov, and Daniel Cremers.Masked event modeling: Self-supervised pretraining for event cameras.In
*Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision*, pp. 2378–2388, 2024. - Li etal. (2024)Kunchang Li, Xinhao Li, YiWang, Yinan He, Yali Wang, Limin Wang, and YuQiao.Videomamba: State space model for efficient video understanding.
*arXiv preprint arXiv:2403.06977*, 2024. - Li etal. (2021)Yijin Li, Han Zhou, Bangbang Yang, YeZhang, Zhaopeng Cui, Hujun Bao, and Guofeng Zhang.Graph-based asynchronous event processing for rapid object recognition.In
*Proceedings of the IEEE/CVF International Conference on Computer Vision*, pp. 934–943, 2021. - Liu etal. (2022)Chang Liu, Xiaojuan Qi, EdmundY Lam, and Ngai Wong.Fast classification and action recognition with event-based imaging.
*IEEE Access*, 10:55638–55649, 2022. - Liu etal. (2021)Qianhui Liu, Dong Xing, Huajin Tang, DeMa, and Gang Pan.Event-based action recognition using motion information and spiking neural networks.In
*IJCAI*, pp. 1743–1749, 2021. - Lu etal. (2019)Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee.Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks.
*Advances in neural information processing systems*, 32, 2019. - Miao etal. (2019)Shu Miao, Guang Chen, Xiangyu Ning, Yang Zi, Kejia Ren, Zhenshan Bing, and Alois Knoll.Neuromorphic vision datasets for pedestrian detection, action recognition, and fall detection.
*Frontiers in neurorobotics*, 13:38, 2019. - Orchard etal. (2015)Garrick Orchard, Ajinkya Jayawant, GregoryK Cohen, and Nitish Thakor.Converting static image datasets to spiking neuromorphic datasets using saccades.
*Frontiers in neuroscience*, 9:437, 2015. - Plizzari etal. (2022)Chiara Plizzari, Mirco Planamente, Gabriele Goletto, Marco Cannici, Emanuele Gusso, Matteo Matteucci, and Barbara Caputo.E2 (go) motion: Motion augmented event stream for egocentric action recognition.In
*Proceedings of the IEEE/CVF conference on computer vision and pattern recognition*, pp. 19935–19947, 2022. - Posch etal. (2010)Christoph Posch, Daniel Matolin, and Rainer Wohlgenannt.A qvga 143 db dynamic range frame-free pwm image sensor with lossless pixel-level video compression and time-domain cds.
*IEEE Journal of Solid-State Circuits*, 46(1):259–275, 2010. - Ren etal. (2023)Hongwei Ren, Yue Zhou, Haotian Fu, Yulong Huang, Renjing Xu, and Bojun Cheng.Ttpoint: A tensorized point cloud network for lightweight action recognition with event cameras.In
*Proceedings of the 31st ACM International Conference on Multimedia*, pp. 8026–8034, 2023. - Russakoff etal. (2004)DanielB Russakoff, Carlo Tomasi, Torsten Rohlfing, and CalvinR Maurer.Image similarity using mutual information of regions.In
*Computer Vision-ECCV 2004: 8th European Conference on Computer Vision, Prague, Czech Republic, May 11-14, 2004. Proceedings, Part III 8*, pp. 596–607. Springer, 2004. - Sabater etal. (2022)Alberto Sabater, Luis Montesano, and AnaC Murillo.Event transformer. a sparse-aware solution for efficient event data processing.In
*Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pp. 2677–2686, 2022. - Smith etal. (2022)JimmyTH Smith, Andrew Warrington, and ScottW Linderman.Simplified state space layers for sequence modeling.
*arXiv preprint arXiv:2208.04933*, 2022. - Wang etal. (2023)Jue Wang, Wentao Zhu, Pichao Wang, Xiang Yu, Linda Liu, Mohamed Omar, and Raffay Hamid.Selective structured state-spaces for long-form video understanding.In
*Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pp. 6387–6397, 2023. - Wang etal. (2024a)QiWang, Zhou Xu, Yuming Lin, Jingtao Ye, Hongsheng Li, Guangming Zhu, Syed AfaqAli Shah, Mohammed Bennamoun, and Liang Zhang.Dailydvs-200: A comprehensive benchmark dataset for event-based action recognition.
*arXiv preprint arXiv:2407.05106*, 2024a. - Wang etal. (2024b)Xiao Wang, Zongzhen Wu, BoJiang, Zhimin Bao, Lin Zhu, Guoqi Li, Yaowei Wang, and Yonghong Tian.Hardvs: Revisiting human activity recognition with dynamic vision sensors.In
*Proceedings of the AAAI Conference on Artificial Intelligence*, volume38, pp. 5615–5623, 2024b. - Wu & Krahenbuhl (2021)Chao-Yuan Wu and Philipp Krahenbuhl.Towards long-form video understanding.In
*Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pp. 1884–1894, 2021. - Xie etal. (2022)Bochen Xie, Yongjian Deng, Zhanpeng Shao, Hai Liu, and Youfu Li.Vmv-gcn: Volumetric multi-view based graph cnn for event stream classification.
*IEEE Robotics and Automation Letters*, 7(2):1976–1983, 2022. - Xie etal. (2023)Bochen Xie, Yongjian Deng, Zhanpeng Shao, Hai Liu, Qingsong Xu, and Youfu Li.Event voxel set transformer for spatiotemporal representation learning on event streams.
*arXiv preprint arXiv:2303.03856*, 2023. - Xie etal. (2024)Bochen Xie, Yongjian Deng, Zhanpeng Shao, Qingsong Xu, and Youfu Li.Event voxel set transformer for spatiotemporal representation learning on event streams.
*IEEE Transactions on Circuits and Systems for Video Technology*, 2024. - Yang etal. (2023)Yan Yang, Liyuan Pan, and Liu Liu.Event camera data pre-training.In
*Proceedings of the IEEE/CVF International Conference on Computer Vision*, pp. 10699–10709, 2023. - Yao etal. (2021)Man Yao, Huanhuan Gao, Guangshe Zhao, Dingheng Wang, Yihan Lin, Zhaoxu Yang, and Guoqi Li.Temporal-wise attention spiking neural networks for event streams classification.In
*Proceedings of the IEEE/CVF International Conference on Computer Vision*, pp. 10221–10230, 2021. - Yao etal. (2023)Man Yao, Jiakui Hu, Guangshe Zhao, Yaoyuan Wang, Ziyang Zhang, BoXu, and Guoqi Li.Inherent redundancy in spiking neural networks.In
*Proceedings of the IEEE/CVF international conference on computer vision*, pp. 16924–16934, 2023. - Yao etal. (2024)Man Yao, Ole Richter, Guangshe Zhao, Ning Qiao, Yannan Xing, Dingheng Wang, Tianxiang Hu, Wei Fang, Tugba Demirci, Michele DeMarchi, etal.Spike-based dynamic computing with asynchronous sensing-computing neuromorphic chip.
*Nature Communications*, 15(1):4464, 2024. - Zhang etal. (2024)Tao Zhang, Xiangtai Li, Haobo Yuan, Shunping Ji, and Shuicheng Yan.Point could mamba: Point cloud learning via state space model.
*arXiv preprint arXiv:2403.00762*, 2024. - Zheng & Wang (2024)XuZheng and Lin Wang.Eventdance: Unsupervised source-free cross-modal adaptation for event-based object recognition.In
*Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pp. 17448–17458, 2024. - Zheng etal. (2023)XuZheng, Yexin Liu, Yunfan Lu, Tongyan Hua, Tianbo Pan, Weiming Zhang, Dacheng Tao, and Lin Wang.Deep learning for event-based vision: A comprehensive survey and benchmarks.
*arXiv preprint arXiv:2302.08890*, 2023. - Zhou etal. (2023)Jiazhou Zhou, XuZheng, Yuanhuiyi Lyu, and Lin Wang.E-clip: Towards label-efficient event-based open-world understanding by clip.
*arXiv preprint arXiv:2308.03135*, 2023. - Zhou etal. (2024)Jiazhou Zhou, XuZheng, Yuanhuiyi Lyu, and Lin Wang.Exact: Language-guided conceptual reasoning and uncertainty estimation for event-based action recognition and more.In
*Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pp. 18633–18643, 2024. - Zhu etal. (2024)Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang.Vision mamba: Efficient visual representation learning with bidirectional state space model.
*arXiv preprint arXiv:2401.09417*, 2024. - Zubić etal. (2023)Nikola Zubić, Daniel Gehrig, Mathias Gehrig, and Davide Scaramuzza.From chaos comes order: Ordering event representations for object recognition and detection.In
*Proceedings of the IEEE/CVF International Conference on Computer Vision*, pp. 12846–12856, 2023. - Zubic etal. (2024)Nikola Zubic, Mathias Gehrig, and Davide Scaramuzza.State space models for event cameras.In
*Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pp. 5819–5828, 2024.

## Appendix A Appendix

### A.1 Additional technical details of SSMs.

SSMs(Gu etal., 2022; Smith etal., 2022; Fu etal., 2022; Wang etal., 2023) originate from the principles of continuous systems that map an input 1D sequence $x(t)\in\mathbb{R}^{L}$ into the output sequence $y(t)\in\mathbb{R}^{L}$ through an underlying hidden state $h(t)\in\mathbb{R}^{N}$. Specifically, it is formalized by $dh(t)/dt=Ah(t)+Bx(t)$ and $y(t)=Ch(t)+Dx(t)$, where $A\in\mathbb{R}^{N\times N}$, $B\in\mathbb{R}^{N\times 1}$, $C\in\mathbb{R}^{N\times 1}$, $D\in\mathbb{R}^{N\times 1}$ are the state matrix, the input projection matrix, the output projection matrix and the feed-forward matrix.

$dh(t)/dt=Ah(t)+Bx(t),$ | (6) |

$y(t)=Ch(t)+Dx(t),$ | (7) |

where $A\in\mathbb{R}^{N\times N}$, $B\in\mathbb{R}^{N\times 1}$, $C\in\mathbb{R}^{N\times 1}$, $D\in\mathbb{R}^{N\times 1}$ are the state (or system) matrix, the input projection matrix, the output projection matrix and the feed-forward matrix.

The discretization process of SSMs is essential for integrating continuous-time models into deep-learning algorithms.(Wang etal., 2023). We adopt Mamba(Gu & Dao, 2023) strategy, treating $D$ as fixed network parameters while introducing timescale parameter $\Delta$ to transform the continuous parameters $A$, $B$ into their discrete counterparts $\hat{A}$, $\hat{B}$, formulated as follows:

$\hat{A}=exp(\Delta A)$ | (8) |

$\hat{B}=(\Delta A)^{-1}(exp(\Delta A)-I)\cdot\Delta B$ | (9) |

$h_{t}=\hat{A}h_{t-1}+\hat{B}x_{t},$ | (10) |

$y_{t}=Ch_{t}.$ | (11) |

Compared to previous linear time-invariant SSMs, Mamba proposed a selective scan mechanism that directly derived the parameters $B$, $C$, and $\Delta$ from the input during the training process, thus enabling better contextual sensitivity and adaptive weight modulation.

### A.2 PyTorch-style Pseudocode for the Proposed PEAS Module.

In Algorithm1, we present the PyTorch-style pseudocode of the proposed PEAS module to facilitate readers’ understanding.

### A.3 Existing event-based recognition datasets comparison.

We compare our proposed ArDVS100 dataset with existing event-based recognition datasets. As shown in Tab.7, previous datasets contain second-level event streams lasting from 0.1s to 20s, while our proposed ArDVS100 dataset provides minute-level duration event streams lasting from 1s to 265s. ArDVS100 has 100 classes with normal class labels. We believe that our ArDVS100 will provide enhanced evaluation platforms for recognizing event streams of arbitrary durations and inspire further research in this field.

Dataset | Year | Sensors | Object | Scale | Class | Real | TemporalFine-grainedLabels | Duration(s) |

MNISTDVS | 2013 | DAVIS128 | Image | 30,000 | 10 | ✗ | ✗ | - |

N-Caltech101 | 2015 | ATIS | Image | 8,709 | 101 | ✗ | ✗ | 0.3s |

N-MNIST | 2015 | ATIS | Image | 70,000 | 10 | ✗ | ✗ | 0.3s |

CIFAR10-DVS | 2017 | DAVIS128 | Image | 10,000 | 10 | ✗ | ✗ | 1.2s |

N-ImageNet | 2021 | Samsung-Gen3 | Image | 1,781,167 | 1,000 | ✗ | ✗ | 0.1s |

ES-lmageNet | 2021 | - | Image | 1,306,916 | 1,000 | ✗ | ✗ | - |

ASLAN-DVS | 2011 | DAVIS240c | Action | 3,697 | 432 | ✗ | ✗ | - |

DvsGesture | 2017 | DAVIS128 | Action | 1,342 | 11 | ✓ | ✗ | 6s |

N-CARS | 2018 | ATIS | Car | 24,029 | 2 | ✓ | ✗ | 0.1s |

ASL-DVS | 2019 | DAVIS240 | Hand | 100,800 | 24 | ✓ | ✗ | 0.1s |

DVS Action | 2019 | DAVIS346 | Action | 450 | 10 | ✓ | ✗ | 5s |

HMDB-DVS | 2019 | DAVIS240c | Action | 6,766 | 51 | ✗ | ✗ | 19s |

UCF-DVS | 2019 | DAVIS240c | Action | 13,320 | 101 | ✗ | ✗ | 25s |

DailyAction | 2021 | DAVIS346 | Action | 1,440 | 12 | ✓ | ✗ | 5s |

HARDVS | 2022 | DAVIS346 | Action | 107,646 | 300 | ✓ | ✗ | 5s |

THUEACT50 | 2023 | CeleX-V | Action | 10,500 | 50 | ✓ | ✗ | 2s-5s |

THUEAC50CHL | 2023 | DAVIS346 | Action | 2,330 | 50 | ✓ | ✗ | 2s-6s |

Bullying10K | 2023 | DAVIS346 | Action | 10,000 | 10 | ✓ | ✗ | 1s-20s |

SeAct | 2024 | DAVIS346 | Action | 580 | 58 | ✓ | ✗ | 2s-10s |

DailyDVS-200 | 2024 | DVXplorer Lite | Action | 22,046 | 200 | ✓ | ✗ | 2s-20s |

ArDVS100 | 2024 | DAVIS346 | Action | 10,000 | 100 | ✗ | ✗ | 1s-265s |

Real-ArDVS10 | 2024 | DAVIS346 | Action | 100 | 10 | ✓ | ✗ | 2s-75s |

TemArDVS100 | 2024 | DAVIS346 | Action | 10,000 | 100 | ✗ | ✓ | 14s-215s |

### A.4 The specific statistics result for model generalization across varying inference frequencies.

In Tab.8, we present the specific statistics result for Fig.7 for further comparison.

Top-1 Accuracy & Performance Drop (%) | ||||||

Val $f$ | ||||||

Train $f$ | Settings | 20 Hz | 40 Hz | 60 Hz | 80 Hz | 100 Hz |

Time Windows | 93.10 | 89.65-3.45 | 74.14-18.96 | 72.41-20.69 | 68.97-24.13 | |

Event Counts | 94.83 | 87.93-6.90 | 75.86-18.97 | 75.86-18.97 | 70.69-24.14 | |

20 Hz | Event Counts + PAST-SSM-S | 93.10 | 89.65-3.45 | 89.65-3.45 | 86.21-6.89 | 84.48-8.62 |

Time Windows | 79.31-15.52 | 87.93-6.90 | 94.83 | 89.65-5.18 | 75.86-18.97 | |

Event Counts | 81.03-15.52 | 89.65-6.90 | 96.55 | 87.93-8.62 | 79.89-16.66 | |

60 Hz | Event Counts + PAST-SSM-S | 89.66-6.89 | 93.1-3.45 | 96.55 | 91.38-5.17 | 87.93-8.62 |

Time Windows | 65.51-25.86 | 87.93-3.44 | 91.37-0 | 91.37-0 | 91.37 | |

Event Counts | 67.24-27.59 | 86.21-8.62 | 89.65-5.18 | 93.1-1.73 | 94.83 | |

100 Hz | Event Counts + PAST-SSM-S | 89.66-5.17 | 94,83-0 | 93.1-1.73 | 93.1-1.73 | 94.83 |

### A.5 The settings of sampling frequency and aggregated event counts per frame for different datasets.

The additional experiment settings of sampling frequency and aggregated event count per frame for different datasets are presented in Tab.9.

Dataset | Sampling Frequency | Aggregated Event Count / Frame |

N-Caltech101 | 200 Hz | 50,000 |

DVS Action | 80 Hz | 100,000 |

SeAct | 80 Hz | 80,000 |

HARDVS | 100 Hz | 80,000 |

ArDVS100 | 50 Hz | 80,000 |

Real-ArDVS10 | 50 Hz | 80,000 |

TemArDVS100 | 50 Hz | 80,000 |