Introduction
Intelligent surveillance research has significant implications across various domains, including public safety, security, and law enforcement. It can be employed in public spaces, transportation hubs, schools, and other crowded areas to identify and prevent potential threats, mitigating risks associated with armed violence and terrorist activities. By integrating such systems into law enforcement operations, authorities can identify and apprehend individuals transporting contraband or harmful objects, reducing the occurrence of violent crimes. Mass-shooting prevention can be achieved by detecting contraband in locations like schools and public events, while border security is essential for preventing contraband trafficking (United Nations, 2006). Military settings can benefit from such systems in identifying and neutralizing enemy threats, providing added protection for troops. Airport and aviation security can be enhanced by detecting concealed contraband in carry-on luggage and other areas. Prisons and correctional facilities can also benefit from these technologies, enhancing overall security for staff and inmates. AI-based image identification, infrared imaging, millimeter-wave scanners, and improved sensor systems are examples of technological advances in contraband detection.
Quantum computing is a field of computer science and physics that applies the principles of quantum mechanics to develop new types of computers. Traditional computers use bits representing a 0 or a 1 to process and store information, whereas quantum computers use quantum bits, or qubits, which can exist in multiple states simultaneously. This enables quantum computers to perform certain computations much faster than classical ones, especially for complex problems such as factorization and optimization. Built on ideas like quantum superposition, entanglement, and interference, quantum computing could transform several industries, including banking, drug research, encryption, and AI. However, problems remain to be solved, such as building dependable quantum hardware and designing new quantum algorithms.
Quantum deep learning is an emerging technology that combines quantum computing with deep learning methods to tackle complex problems in fields such as image recognition, natural language processing, and drug development. Carrying out deep learning tasks on quantum hardware entails designing both quantum algorithms and the circuits that run them: data is encoded into quantum states, and quantum gates process and transform those states. Although this remains technically challenging, there is significant interest in quantum deep learning, and organizations and academics worldwide are exploring how it may lead to new developments in AI.
In a hybrid CNN-QCNN, the QCNN handles later layers, such as classification or detection, while a traditional CNN handles earlier stages, like feature extraction and reduction. Such a model can be trained with conventional deep learning approaches like backpropagation, or with quantum-inspired optimization algorithms such as the Variational Quantum Eigensolver (VQE) or the Quantum Approximate Optimization Algorithm (QAOA). Quantum deep learning is a promising technology for the future, since combining classical CNNs with quantum methods can improve performance on certain types of problems. For instance, the data may be preprocessed with a traditional CNN to extract relevant features, which are then fed into the QCNN for additional processing and analysis.
While these technical advancements hold immense potential for enhancing intelligent systems, their implementation may raise ethical concerns, including privacy risks and biases in surveillance technology. These aspects warrant further investigation in future studies. Real-time warnings and integration with surveillance systems enable a rapid response to potential threats. Overall, this research has far-reaching consequences for the safety and security of individuals and communities. However, achieving a balance between security measures and individual rights and privacy requires careful consideration of ethical and legal factors. Collaborations between researchers, law enforcement agencies, and policymakers are critical for developing and applying responsible intelligent surveillance technologies.
Literature review
Countries with large stocks of contraband, such as firearms, often experience elevated crime rates. Research indicates that illegal activities contribute to severe consequences, including murder, theft, destruction of infrastructure, and loss of billions of dollars (Atif et al., 2023). These findings underscore the critical need for effective measures to combat contraband-related crimes.
Traditional CCTV cameras are used to monitor specific areas, but this surveillance technique relies largely on manual observation (Gopinath & Krishna, 2014). However, deep learning has changed this scenario, and researchers have developed more sophisticated models to identify multiple forms of contraband. This article examines these advancements, ranging from early contraband detection systems to the latest designs (Lai & Maples, 2017). This journey began with manual methods and has culminated in fully automated intelligent systems (Uddin et al., 2020).
The initial methods relied on manual techniques that incorporated weighted Gaussian mixtures, polarization signal-based methodologies, multiresolution mosaicism, three-dimensional (3D) computed tomography (CT), and the Haar cascade method (Uddin et al., 2020). Later, machine learning approaches emerged, introducing tools such as the Visual Background Extractor Algorithm, 3-layer ANNs used in conjunction with active mmWave radar, X-ray-based methodologies, and ANN-based MPEG-7 classifiers (Ho et al., 2019).
With the introduction of Speeded Up Robust Features (SURF), Harris Interest Point Detector (HIPD), and Fast Retina Keypoint (FREAK), the era of convolutional neural networks began (Grega et al., 2016). The arrival of various models was also announced to improve efficiency. Related to this, many scientific publications have explored the application of deep convolutional networks and transfer learning technology (Wang et al., 2019). For example, X-rays are used for classification purposes, while infrared imaging is employed to detect concealed contraband.
Recently, a family of CNN models that are accurate and significantly speed up contraband detection in streaming video was introduced; it includes R-CNN, Fast R-CNN, Faster R-CNN, Inception, YOLO, VGG-Net, ZF-Net, and YOLO-V3. In addition to speed and accuracy, another critical factor is complexity: lower-complexity models run more smoothly on small devices such as smartphones and can be used in IoT applications for surveillance (Agurto et al., 2007). Beyond these advancements, numerous other models have been developed, such as the Regional Proposal Network (RPN), GoogleNet, SqueezeNet, HyperNet, RetinaNet, LeNet, AlexNet, ZFNet, VGGNet, ResNet, ResNeXt, SENet, MobileNet V1/V2, NASNet, PNASNet, ENASNet, EfficientNet, Inception-ResNetV2, and ResNet50. This study also examines terminology frequently used in the literature, covering concepts like “Complex Backbone vs. Light Backbone,” “Two-Phase Detector vs. One-Phase Detector,” and the pros and cons of different models (Tiwari & Verma, 2015).
It is important to note that this study not only presents research models and their evolution over time but also aims to help researchers establish a solid foundation for future studies (Uddin et al., 2020). This review discusses the evolution of intelligent surveillance models and their performance in detecting guns and pistols (Piyadasa, 2020). Traditional methods for firearm detection, such as X-ray technology or millimetric wave imaging, are expensive and impractical due to their reaction to all metallic items, including different categories of contraband (Jain et al., 2020).
However, deep learning has proven to be the most effective learning method, with convolutional neural networks (CNNs) outperforming traditional methods (Zhao et al., 2019). Transfer learning, which re-utilizes information from one domain to another related domain, is becoming popular. Other popular techniques include Scale-Invariant Feature Transform (SIFT), Rotation-Invariant Feature Transform (RIFT), and Fast Retina Keypoint (FREAK) (Ali Shah et al., 2020).
Uddin et al. (2022) proposed a reinforcement learning-based framework to optimize policy decisions for COVID-19 prevention. Their study highlights how machine learning can dynamically adapt public health policies in response to evolving pandemic data. By modeling different policy scenarios, the system learns optimal actions to minimize infection rates while balancing social and economic factors. This work is relevant in demonstrating the potential of AI in public safety and crisis management.
Initial efforts have been made to detect pistols in images, but these approaches struggle to identify multiple pistols within a single image (Olmos et al., 2019). The Bag of Words Surveillance System (BoWSS) algorithm (Danilov et al., 2021) was used to detect guns in images, and Faster R-CNN deep learning was used to detect a hand-held gun (Ashraf et al., 2022). However, that model can only detect and locate pistols and often fails to detect other types of contraband, such as machine guns (Uddin et al., 2020). Fernandez et al. presented a new CNN model for detecting guns and knives in video surveillance systems and conducted a comparative analysis with GoogleNet and SqueezeNet (Khan et al., 2020). The results indicated that GoogleNet performed better in knife detection (Dong et al., 2023), while SqueezeNet demonstrated superior performance in gun detection. A sequential deep neural network-based approach was developed to resolve three major problems in automated Ballistic and Concealed Guns (BCG) detection and learning (Gelana & Yadav, 2019). Nevertheless, real-time detection of pistols remains a challenge due to factors such as distance, visibility, type of pistol, scale, rotation, and shape. Advancements in these domains are being made to improve accuracy and performance in contraband detection (Atif et al., 2023).

Ali Shah et al. (2021) conducted a comprehensive review of existing weapon detection techniques, specifically focusing on their applicability to street-crime scenarios. The study evaluates various methods, including image processing, machine learning, and sensor-based systems, highlighting their strengths and limitations. This review provides a foundational understanding of the technological landscape in weapon detection, aiding future research in developing more robust and context-aware security systems. Khalid et al. (2023) developed a weapon detection system aimed at enhancing surveillance and security in real-time environments.
Utilizing advanced computer vision and machine learning algorithms, the system was designed to identify weapons in live video feeds with high accuracy. The research contributes significantly to intelligent surveillance solutions by addressing the growing need for automated threat detection in public and private security settings. Kambhatla and Ahmed (2024) explored advanced weapon detection techniques using YOLO and Faster R-CNN, optimizing speed and accuracy through model pruning and ensembling, achieving high average precision (AP) scores. This highlights the significance of refining architectures for practical applications (Kambhatla & Ahmed, 2024). Xie and Wang (2023) introduced a deep learning pipeline incorporating multiple base models (BMs) to mitigate false positives and negatives. Their ensemble approach outperformed individual architectures, achieving robust detection capabilities (Xie & Wang, 2023).
Dataset
For this research, we designed our dataset to include the types of contraband most commonly used in third-world countries.
Other datasets include a variety of weaponry but usually feature the types of contraband seen in Hollywood movies or in developed countries; when a model trained on them is deployed in third-world countries, it is less accurate. To make our work internationally applicable, we built our dataset around armories used in all parts of the world. This research primarily focuses on the types of weaponry used in street crimes, such as short-range rifles, shotguns, pistols, and knives, collectively referred to as the Street Crimes Arms Dataset (SCAD). Each category contains approximately 5,000 images, bringing the total to about 20,000 images. To optimize the dataset for effective model training, various alterations have been applied, such as augmentation and normalization.
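To illustrate the preprocessing step, the sketch below shows min-max normalization and a simple flip augmentation in NumPy. This is a toy example with made-up values, not the actual SCAD pipeline:

```python
import numpy as np

def normalize(image):
    """Scale 8-bit pixel intensities from [0, 255] to [0, 1]."""
    return image.astype(np.float32) / 255.0

def horizontal_flip(image):
    """Mirror the image left-to-right, a common augmentation."""
    return image[:, ::-1]

# A toy 2x3 grayscale "image" standing in for a SCAD sample
img = np.array([[0, 128, 255],
                [64, 32, 16]], dtype=np.uint8)

norm = normalize(img)
flipped = horizontal_flip(img)
```

Augmentations such as flips effectively enlarge the training set without new data collection, which matters for a dataset of this size.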
Methodology
This research integrates the power of quantum computing with conventional deep learning, specifically convolutional neural networks. Both technologies have their strengths and weaknesses. Quantum computing is famous for its tremendous speed, but due to the limited availability of hardware, its implementation is still uncommon. On the other hand, although deep learning applications are widely used, they lack speed in many real-time scenarios. Combining the two opens a new dimension and offsets their respective weaknesses. To our knowledge, no prior work integrates a QCNN with RetinaNet for intelligent surveillance applications. In our model, the feature-extraction task of RetinaNet is replaced by a QCNN, so it gains the speed of quantum computing while retaining the accuracy of RetinaNet.
RetinaNet
RetinaNet is a state-of-the-art object detection algorithm that was introduced by researchers at Facebook AI Research in 2017. It is designed to solve the problem of detecting objects in an image, where the location and type of objects can vary widely. The key innovation of RetinaNet is the use of a novel loss function called Focal Loss, which addresses the issue of class imbalance in object detection. Object detection typically has many more negative examples (background) than positive examples (objects of interest). RetinaNet utilizes Focal Loss to reduce bias toward negative examples and focuses on hard examples in the positive class, which helps the model identify rare and important positive cases. RetinaNet’s feature pyramid network (FPN) design mixes low-resolution and high-resolution features to recognize objects of varying sizes and resolutions. The model has achieved cutting-edge performance on benchmarks like COCO and PASCAL VOC, making it a popular choice for object identification tasks in industrial and commercial applications.
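As a concrete illustration, the binary focal loss FL(p_t) = -α_t (1 - p_t)^γ log(p_t) can be sketched in a few lines of NumPy. This is a toy example, not RetinaNet's actual implementation; α = 0.25 and γ = 2 are the defaults reported in the original RetinaNet paper:

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for a predicted foreground probability p
    and a ground-truth label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p             # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

# An easy negative (background correctly scored low) is strongly down-weighted,
# while a hard positive (an object scored low) keeps a large loss.
easy = focal_loss(p=0.1, y=0)
hard = focal_loss(p=0.1, y=1)
```

With γ = 0 and α = 1 the expression reduces to standard cross-entropy, which is one way to sanity-check an implementation.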
Quantum-RetinaNet
The core functionality of RetinaNet is divided into two main parts. The first, the feature extractor, deals with the extraction of features. The second, the task-specific networks, is responsible for classification and bounding-box regression. The feature extractor uses convolution and pooling layers to extract features, which is a time-consuming process. In Quantum-RetinaNet, this time is reduced by using a Quantum Convolutional Neural Network.
Fig. 1 summarizes how Q-RetinaNet works. This model converts an image into a vector of corresponding bits for quantum computing. Qubits are fabricated, and feature extraction is performed using QCNN. These features are then forwarded to Task Specific Networks, which draw bounding boxes and detect desired objects.

Note: the first part deals with the extraction of features, and the second one deals with task-specific networks. Derived from research.
Figure 1 The Basic Architecture of the RetinaNet.
Feature Extraction using Qonvolutional Neural Network
A Quantum Convolutional Neural Network, or Qonvolutional Neural Network, uses quantum circuits to conduct convolutional operations on input data. The procedure entails converting the incoming data into a quantum state through methods like amplitude encoding or qubit encoding. Convolutional filters are used to identify specific characteristics, such as edges or textures. After that, the data is pooled using a pooling layer, which lowers the dimensionality while maintaining crucial properties, as shown in Fig. 2.

Note: QCNN uses quantum circuits to conduct convolutional operations on input data, using amplitude or qubit encoding to convert incoming data into a quantum state. Derived from research.
Figure 2 Quantum Convolutional Neural Network or Qonvolutional Neural Network
For classification tasks, the output is processed using additional layers of quantum circuits, employing methods such as variational quantum classifiers or quantum SVMs. To improve network performance, the parameters of the quantum circuits are optimized using strategies like quantum gradient descent or variational approaches.
Bits to Qubits. Classical data is encoded into the quantum state of a qubit by transforming each classical bit into a quantum bit. Unlike a classical bit, which can only exist in one state at a time, a qubit may exist in a superposition of several states at once. For example, to encode a classical bit into a qubit, initialize the qubit in the |0⟩ state, which corresponds to the classical 0. If the classical bit is 1, apply the X gate to flip the qubit from |0⟩ to |1⟩. The qubit’s final state is the encoded form of the classical bit. Several additional techniques can convert classical bits into qubits, depending on the exact application.
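A minimal sketch of this encoding, simulated classically with amplitude vectors rather than real qubits (illustrative only):

```python
import numpy as np

ket0 = np.array([1.0, 0.0])        # |0> as an amplitude vector
X = np.array([[0.0, 1.0],          # Pauli-X ("quantum NOT") gate
              [1.0, 0.0]])

def encode_bit(b):
    """Encode classical bit b into a qubit state: 0 -> |0>, 1 -> X|0> = |1>."""
    state = ket0
    if b == 1:
        state = X @ state
    return state
```

On real hardware the same effect is achieved by preparing the qubit in |0⟩ and conditionally applying an X gate, exactly as described above.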
Quantum Convolutions. Quantum convolutions are implemented using quantum circuits, gates, and quantum algorithms. While gates, such as the Hadamard-Walsh transform gate for discrete cosine transform (DCT) and discrete wavelet transform (DWT) operations, apply the convolutional filter directly to a quantum state, circuits employ quantum Fourier transforms (QFTs) to transform a quantum state. Quantum algorithms, such as the quantum singular value transformation (QSVT) or the quantum Fourier transform (QFT), are designed to operate on quantum data and perform various tasks, including quantum convolution.
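To make the QFT route concrete, here is a classical NumPy simulation (illustrative only, not the paper's implementation): on amplitude vectors the QFT acts as a unitary discrete Fourier transform, so pointwise multiplication in the frequency domain realizes a circular convolution.

```python
import numpy as np

def qft(state):
    """Classical stand-in for the QFT: a unitary DFT on the amplitudes."""
    return np.fft.fft(state) / np.sqrt(len(state))

def inverse_qft(state):
    return np.fft.ifft(state) * np.sqrt(len(state))

def circular_convolution_via_qft(state, kernel):
    """Convolution theorem: transform, multiply pointwise, transform back."""
    return inverse_qft(qft(state) * np.fft.fft(kernel))

signal = np.array([1.0, 2.0, 3.0, 4.0])  # amplitudes of a 2-qubit register
kernel = np.array([1.0, 0.0, 0.0, 0.0])  # identity (delta) filter
out = circular_convolution_via_qft(signal, kernel)
```

With the delta kernel the output equals the input, a convenient sanity check before substituting an edge- or texture-detecting filter.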
Quantum Activation Functions. Quantum activation functions are quantum analogs of classical activation functions in neural networks. They are mathematical functions that transform the output of a quantum circuit into a new quantum state, introducing nonlinearity into the output of a neural network. There are several types of quantum activation functions, including Quantum ReLU (Q-ReLU), Quantum Sigmoid (Q-Sigmoid), and Quantum Softmax (Q-Softmax). Quantum activation functions can be used in quantum machine learning algorithms to improve their performance on specific problems.
Quantum Pooling
The quantum pooling approach is employed in quantum machine learning to decrease the spatial dimensions of a quantum feature map while keeping the most critical data. Since quantum states are inherently uncertain, identifying the most important details can be challenging. Numerous quantum pooling strategies have been suggested to overcome this issue, including quantum maximum pooling, quantum mean pooling, and quantum median pooling. Quantum amplitude pooling, in particular, is implemented using quantum gates and circuits.
The input quantum state is transformed into the frequency domain using a quantum Fourier transform, and the amplitude of each frequency component is squared using a quantum circuit containing Hadamard gates and Controlled-NOT (CNOT) gates. Through a measurement operation, the squared amplitudes are then used to construct a new quantum state with fewer qubits.
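The squared-amplitude procedure can be simulated classically as a toy sketch; the choices of keeping the lowest-frequency components and renormalizing are illustrative assumptions, not the paper's exact circuit:

```python
import numpy as np

def amplitude_pool(state, keep):
    """Toy amplitude pooling: QFT to the frequency domain, square the
    magnitudes, keep `keep` low-frequency components, renormalize."""
    freq = np.fft.fft(state) / np.sqrt(len(state))  # unitary QFT analogue
    squared = np.abs(freq) ** 2                     # squared amplitudes
    pooled = squared[:keep]                         # reduced register
    return pooled / np.linalg.norm(pooled)          # valid state vector again

state = np.array([0.5, 0.5, 0.5, 0.5])  # uniform 2-qubit state
pooled = amplitude_pool(state, keep=2)  # pooled down to a 1-qubit register
```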
Qubit to bit. Quantum computing involves measuring the state of a qubit to convert it to a classical bit, with the probability determined by the quantum state of the qubit. Quantum convolution operations are used in quantum neural networks, specifically in Qonvolutional Neural Networks (QCNNs). In Qonvolutional, input data is encoded into a quantum state, usually using qubits, and quantum gates are applied to the state to perform the convolution operation. The output is then decoded from the resulting quantum state.
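The measurement step can be sketched classically via the Born rule; the `measure` helper below is hypothetical, not hardware code:

```python
import numpy as np

def measure(state, rng=None):
    """Collapse a single-qubit state [a, b] to a classical bit via the
    Born rule: P(0) = |a|^2, P(1) = |b|^2."""
    rng = rng or np.random.default_rng(0)
    p1 = abs(state[1]) ** 2
    return int(rng.random() < p1)

bit0 = measure(np.array([1.0, 0.0]))      # deterministic: always 0
bit1 = measure(np.array([0.0, 1.0]))      # deterministic: always 1
plus = np.array([1.0, 1.0]) / np.sqrt(2)  # |+> superposition state
sample = measure(plus)                    # 0 or 1, each with probability 1/2
```

Basis states decode deterministically, while superposition states yield probabilistic bits, which is why repeated measurement (shots) is used in practice.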
Task Specific Networks. For each of the X anchors and Y object classes, the classification subnet predicts the likelihood of an object’s presence at each spatial position. The subnet is a small fully convolutional network attached to every pyramid level, with its parameters shared across levels. Its design consists of a channel input feature map, four convolutional layers with ReLU activations, and a final convolutional layer with sigmoid activations that outputs binary predictions for each spatial location. The object classification subnet and the box regression subnet share the same structure, but the box regression subnet performs class-agnostic bounding-box regression with fewer output parameters. This strategy is equally successful, and the two subnets employ distinct parameters for regressing the offset from anchor boxes to neighboring ground-truth objects.
The detailed process architecture of Q-RetinaNet is shown in Fig. 3. The process begins with an image being fed into the model, where it is converted to a vector of corresponding bits. For quantum computing, these bits are transformed into qubits. Once the qubits are prepared, feature extraction is performed by the QCNN, the quantum equivalent of a CNN that operates only on qubits. After feature extraction, the feature maps are forwarded to the task-specific networks. Due to compatibility constraints, the qubits must first be transformed back into conventional bits. Finally, these bits are passed to the task-specific networks, which draw bounding boxes and detect the desired objects in the image.
Analysis and Results
Various comparative analysis approaches are utilized to compare the performance of the different models employed in the research.
Accuracy
Fig. 4 illustrates the accuracy of LeNet, AlexNet, VGG, RetinaNet, and Quantum-RetinaNet. Although Q-RetinaNet contains fewer quantum layers and only partially implements the QCNN compared to a full CNN, it still achieves sufficient accuracy, and the results are quite impressive.
Confusion Matrix
In machine learning and deep learning, a classification model’s performance is assessed using a confusion matrix, which contrasts the predicted and actual class labels. The confusion matrix reports the counts of true positive (TP), false positive (FP), false negative (FN), and true negative (TN) predictions. Each row of the matrix corresponds to a predicted class, whereas each column corresponds to an actual class, as shown in Fig. 5.

Note: derived from research.
Figure 5 The Comparison of the Confusion Matrix of Lenet, Alexnet, VGG, RetinaNet, and Quantum-RetinaNet
Several performance indicators are calculated using the confusion matrix, including accuracy, precision, recall, and F1-score. It offers a thorough analysis of the categorization model and aids in determining the model’s advantages and disadvantages.
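For instance, a confusion matrix following the row/column convention described above can be built with a short sketch (toy labels, not the SCAD results):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Build a confusion matrix with rows = predicted class and
    columns = actual class, per the convention above."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[p, t] += 1
    return cm

# Toy binary example: 0 = background, 1 = contraband
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
cm = confusion_matrix(y_true, y_pred, n_classes=2)
# cm[1, 1] counts true positives, cm[1, 0] false positives,
# cm[0, 1] false negatives, and cm[0, 0] true negatives.
```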
F1-SCORE
Deep learning often uses the F1-score to track performance, particularly in binary classification. It is the harmonic mean of precision and recall, where precision is the ratio of true positives to all positive predictions (TP + FP), and recall is the ratio of true positives to all actual positives (TP + FN). The F1-score is computed as:
F1-score = 2 * (precision * recall) / (precision + recall)
The F1 score is crucial for evaluating a classification model’s accuracy and recall. It ranges from 0 to 1, with 0 indicating no predictive capacity and 1 indicating perfect precision and recall. It can be computed independently for each class and averaged to provide a performance indicator. See Fig. 6 for detailed information.
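These quantities follow directly from the confusion-matrix counts, as the small sketch below shows (example counts are made up):

```python
def f1_score(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)  # correct positives among positive predictions
    recall = tp / (tp + fn)     # correct positives among actual positives
    return 2 * precision * recall / (precision + recall)

# Example: 8 true positives, 2 false positives, 4 false negatives
score = f1_score(tp=8, fp=2, fn=4)  # precision 0.8, recall 2/3
```

An equivalent closed form is F1 = 2·TP / (2·TP + FP + FN), useful for avoiding division by zero when precision and recall are both 0.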
ROC
A ROC curve in deep learning measures the accuracy of binary classifier models by plotting the true positive rate (TPR) against the false positive rate (FPR) at different threshold settings, as shown in Fig. 7. It helps determine the optimal threshold setting and compares the performance of different classifier models. The area under the ROC curve (AUC) is a commonly used metric, ranging from 0 to 1: a score of 1 indicates perfect classification, while 0.5 corresponds to random guessing.
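A minimal sketch of computing ROC points and the AUC by the trapezoid rule (toy scores and labels, not the reported results):

```python
import numpy as np

def roc_points(scores, labels, thresholds):
    """(FPR, TPR) pairs at each decision threshold."""
    pts = []
    for t in thresholds:
        pred = scores >= t
        tpr = np.sum(pred & (labels == 1)) / np.sum(labels == 1)
        fpr = np.sum(pred & (labels == 0)) / np.sum(labels == 0)
        pts.append((fpr, tpr))
    return pts

scores = np.array([0.9, 0.8, 0.3, 0.1])  # classifier confidences
labels = np.array([1, 1, 0, 0])          # a perfectly separable toy case
pts = roc_points(scores, labels, thresholds=[1.1, 0.5, 0.0])

# AUC via the trapezoid rule over consecutive (FPR, TPR) points
auc = sum((f2 - f1) * (t1 + t2) / 2
          for (f1, t1), (f2, t2) in zip(pts, pts[1:]))
```

For the separable toy case the curve passes through (0, 1), so the AUC is exactly 1.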
Conclusion
Hybrid neural networks (QCNN-RetinaNet) for object detection, integrating conventional and quantum computing methods, have attracted much interest. RetinaNet is utilized for object detection, while the QCNN layers are responsible for feature extraction and reduction. Quantum gates and circuits can be used to implement these layers, potentially resulting in exponential speedups for some applications. RetinaNet, which was created to overcome class imbalance in object detection, may be trained with traditional deep learning methods like backpropagation and the focal loss function. When solving specific object detection problems, the hybrid QCNN-RetinaNet performs better than a QCNN or RetinaNet alone. Constructing a hybrid QCNN-RetinaNet, however, requires competence in deep learning and neural networks as well as quantum computing, while combining the benefits of both classical and quantum methods.
Ethics
This research generated a dataset designed to include a range of armories such as pistols, shotguns, short-range rifles, knives, and Kalashnikovs. It was defined as a dataset containing the types of armories most commonly seen in the streets. The dataset features armories frequently found in third-world countries, as the project focuses on surveillance in developing countries, which makes it relevant for such regions. Our investigation involved consulting various sources, including law enforcement, police, private security agencies, websites, the Internet, and the crime branch of a major news channel, among other security departments.
Author contribution statement
Dr. Syed Atif Ali Shah: Conceptualization idea, methodology, investigation, experiment implementation, data collection, and writing (80%).
Dr. Nasir Ageelani: Supervision and review (15%).
Dr. Najeeb Al-Sammurrai: Supervision and validation (5%).
Data availability statement
The data supporting the results of this study will be made available by the corresponding author, Dr. Syed Atif Ali Shah, upon reasonable request.
Preprint A Preprint version of this paper was deposited in: https://arxiv.org/abs/2309.03231