Current Optics and Photonics
Curr. Opt. Photon. 2022; 6(5): 463-472
Published online October 25, 2022 https://doi.org/10.3807/COPP.2022.6.5.463
Copyright © Optical Society of Korea.
Leihong Zhang1, Zimin Shen1, Weihong Lin1, Dawei Zhang2,3
Corresponding author: *923722470@qq.com, ORCID 0000-0001-6699-1247
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
At certain wavelengths, single-pixel imaging is considered a solution that can achieve high-quality imaging while also reducing costs. However, imaging complex scenes is an overhead-intensive process for single-pixel imaging systems, so low efficiency and high consumption are the biggest obstacles to their practical application. Improving efficiency to reduce overhead is the key to this problem. Salient object detection is usually used as a pre-processing step in computer vision tasks; by mimicking the behavior of the human visual system in complex natural scenes, it reduces overhead and improves efficiency by focusing on the regions that carry the most information. Therefore, in this paper we explore the implementation of salient object detection downstream of a single-pixel imaging system, and propose a scheme that reconstructs images from Fourier basis patterns and uses the U2Net model for salient object detection.
Keywords: Fourier transform, Salient object detection, Single-pixel imaging
OCIS codes: (070.4340) Nonlinear optical signal processing; (100.5010) Pattern recognition; (110.2970) Image detection systems; (110.1758) Computational imaging
Photos and videos commonly seen in everyday life are usually created by capturing photons (the building blocks of light) with digital sensors: ambient light reflects off an object and a lens focuses it onto a screen made up of tiny photosensitive elements, or pixels. The image is the pattern formed by the light and dark spots created by the reflected light. The most common digital camera, for example, consists of millions of pixels that form an image by detecting the intensity and color of light at each point in space. A 3D image can also be generated by placing several cameras around the object and photographing it from multiple angles, or by scanning the object with a stream of photons and reconstructing it in three dimensions. Regardless of the method used, however, the image is constructed by collecting spatial information about the scene. In contrast, single-pixel imaging is an imaging technique developed from computational ghost imaging [1]. It is based on the principle of correlation measurement and relies solely on collected light-intensity information to image the object. Single-pixel imaging uses structured light on the illumination side and a single-pixel light-intensity detector on the detection side to collect the signal. When the illumination structure changes, the corresponding change in the light intensity from the object reflects the degree of correlation between the illumination structure and the spatial information of the object. By continuously changing the illumination structure and accumulating the correlation information, the final image of the object is obtained [2]. Since a single-pixel camera requires only light-intensity detection at the detection end, its detector requirements are much lower than those of the array detectors used in ordinary imaging.
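As a toy illustration of this correlation-measurement principle, the following Python sketch simulates random structured illumination and recovers the object from the bucket intensities alone. All sizes, pattern statistics, and variable names here are illustrative assumptions, not the apparatus used in this paper:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 16, 5000                      # toy image size and number of patterns
obj = np.zeros((N, N))
obj[4:12, 5:10] = 1.0                # hypothetical object reflectivity

# Structured illumination: K random patterns; one bucket value per pattern
patterns = rng.random((K, N, N))
I = np.tensordot(patterns, obj, axes=2)      # single-pixel (bucket) intensities

# Correlate intensity fluctuations with pattern fluctuations to image the object
recon = np.tensordot(I - I.mean(), patterns - patterns.mean(axis=0), axes=1) / K

print(np.corrcoef(recon.ravel(), obj.ravel())[0, 1])   # close to 1
```

With enough patterns, the correlation average converges to a scaled copy of the object, which is exactly the "accumulating correlation information" step described above.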
Therefore, under special conditions, such as wavebands where focal-plane array detector technology is not yet mature, single-pixel imaging has great practical advantages.
For example, imaging through biological tissues and improving the spatial resolution of optical microscopy have been two major biomedical imaging challenges. Non-visible imaging is the current solution to these challenges, because wavelengths in the optical window of biological tissues penetrate deeper with less scattering and absorption.
Based on the study of the above critical issues, this paper proposes Fourier-basis image reconstruction in order to apply salient object detection techniques to single-pixel imaging systems. Salient object detection is commonly used as a pre-processing procedure in many complex vision tasks, including target detection [1, 20], semantic segmentation [21–24], and image classification. It mimics the human ability to quickly capture the regions of attention-grabbing objects in complex natural scenes and to process these regions in sufficient detail, while relatively irrelevant information in other regions is left unprocessed; this is largely what keeps the human visual system working efficiently and properly. Invoking this capability of salient object detection as an image pre-processing step in single-pixel imaging systems has important scientific significance and application value, especially in medical imaging and subsequent medical image processing. The scheme not only improves the imaging efficiency of single-pixel imaging systems, but can also choose different sampling rates according to different tasks, which greatly improves efficiency and saves overhead. It therefore becomes particularly important to select a deep learning model with good performance for salient object detection, since it serves as a pre-processing procedure for complex vision tasks and its performance lays the foundation for subsequent tasks.
Suppose a two-dimensional image of an object is
Single-pixel imaging is a computational imaging method that uses a known modulation mask matrix and a sequence of detected measurement signals to solve for the target image. The target image can then be obtained from Eq. 2 as follows:
However, to solve it according to this formula, it is necessary to ensure that the modulation mask matrix
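In matrix form the measurement model can be written y = Ax, where each row of the modulation mask matrix A is one flattened illumination pattern and x is the flattened image; when A has full column rank, x can be recovered by a linear solve. A minimal sketch under these assumptions (the random Gaussian A is an illustrative stand-in for a real mask set):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8                                     # toy image is n x n
x = rng.random(n * n)                     # flattened target image

A = rng.standard_normal((n * n, n * n))   # modulation mask matrix, one mask per row
y = A @ x                                 # sequence of single-pixel measurements

# Recover the image; lstsq also covers the overdetermined case
x_hat = np.linalg.lstsq(A, y, rcond=None)[0]
assert np.allclose(x_hat, x)
```

This is why the number and conditioning of the masks matter: with fewer than n² well-chosen measurements, the system is underdetermined and the plain linear solve no longer applies.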
where
Fourier analysis was initially applied only to one-dimensional continuous signals, but it applies equally to two-dimensional discrete signals; since a digital image is a two-dimensional discrete signal, Fourier analysis is also suitable for digital images. Equations 5 and 6 represent the two-dimensional discrete Fourier forward transform and the two-dimensional discrete Fourier inverse transform, respectively.
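As a sanity check on the transform pair referred to as Eqs. 5 and 6, the sketch below evaluates the forward 2D discrete Fourier transform by its explicit double sum, F(u, v) = Σ_x Σ_y f(x, y) e^{−j2π(ux/M + vy/N)}, confirms it matches a library FFT, and verifies that the inverse transform (with the usual 1/(MN) normalization) recovers the image. Sizes and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 6, 8
f = rng.random((M, N))                      # a 2D discrete signal (an image)

# Forward 2D DFT by the explicit double sum over x, y for every (u, v)
x = np.arange(M)[:, None, None, None]
y = np.arange(N)[None, :, None, None]
u = np.arange(M)[None, None, :, None]
v = np.arange(N)[None, None, None, :]
kernel = np.exp(-2j * np.pi * (u * x / M + v * y / N))
F = np.sum(f[:, :, None, None] * kernel, axis=(0, 1))

assert np.allclose(F, np.fft.fft2(f))       # matches the library FFT
# Inverse transform (with the 1/(MN) normalization) recovers the image
assert np.allclose(np.fft.ifft2(F).real, f)
```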
where
The process of acquiring the Fourier spectrum involves acquiring the weights of each Fourier base pattern corresponding to the image,
where n = 1, 2, 3, 4 corresponds to the four initial phases of the four-step phase-shift scheme, 0,
Assuming that the target object is reflective and the reflected intensity of the object in the direction of the single pixel detector measured relative to the direction of illumination used in projecting the Fourier base pattern is
where S is the projection region of the Fourier base pattern. The optical response of the single-pixel detector is given by Eq. 1.
where
According to the four-step phase-shift method, the Fourier coefficients can be obtained as
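A simulated version of the four-step scheme, assuming sinusoidal patterns P_φ = a + b·cos(θ + φ) with initial phases 0, π/2, π, 3π/2 and the standard differential combination (D₁ − D₃) + j(D₂ − D₄) used in Fourier single-pixel imaging (the object, sizes, and a, b below are illustrative). Up to the factor 2b, the combination equals the corresponding 2D-DFT coefficient of the object:

```python
import numpy as np

rng = np.random.default_rng(0)
M = N = 16
obj = rng.random((M, N))          # hypothetical object reflectivity

a, b = 0.5, 0.5                   # pattern offset and contrast (assumed)
fx, fy = 3, 2                     # one spatial-frequency pair to probe

y, x = np.mgrid[0:M, 0:N]
theta = 2 * np.pi * (fx * x / N + fy * y / M)

# Four bucket measurements with initial phases 0, pi/2, pi, 3*pi/2
D = [np.sum(obj * (a + b * np.cos(theta + phi)))
     for phi in (0, np.pi / 2, np.pi, 3 * np.pi / 2)]

# Differential combination cancels the DC offset and background
C = (D[0] - D[2]) + 1j * (D[1] - D[3])

# Compare with the corresponding 2D-DFT coefficient of the object
F = np.fft.fft2(obj)
assert np.allclose(C, 2 * b * F[fy, fx])
```

Repeating this for every (fx, fy) pair fills in the full Fourier spectrum, four measurements per coefficient.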
In summary, it can be seen that each Fourier coefficient
Accordingly, the Fourier spectrum of the obtained object image can be reconstructed by implementing the Fourier inversion as in Eq. 13.
where
To reconstruct an image with a resolution of
The image energy is mainly concentrated in the low-frequency part of the Fourier spectrum, while the high-frequency Fourier coefficients have small or even near-zero moduli. Therefore, the object image can be reconstructed by projecting only the Fourier base patterns of low spatial frequencies, obtaining the low-frequency Fourier coefficients, and directly setting the high-frequency coefficients to zero, thus achieving imaging with fewer measurements.
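The effect of this under-sampling can be illustrated by keeping only a central low-frequency disc of the spectrum and zeroing the rest; for a smooth object the reconstruction error stays small even with a few percent of the coefficients. The object and mask radius below are arbitrary choices for illustration:

```python
import numpy as np

# Smooth synthetic object: energy concentrated at low spatial frequencies
N = 64
y, x = np.mgrid[0:N, 0:N]
obj = np.exp(-((x - 32) ** 2 + (y - 32) ** 2) / 50.0)

F = np.fft.fftshift(np.fft.fft2(obj))

# Keep a central low-frequency disc (~5% of coefficients), zero the rest
r = np.hypot(x - N // 2, y - N // 2)
F_low = np.where(r <= N // 8, F, 0)

recon = np.real(np.fft.ifft2(np.fft.ifftshift(F_low)))
rel_err = np.linalg.norm(recon - obj) / np.linalg.norm(obj)
print(f"relative error with ~5% of coefficients: {rel_err:.4f}")   # small
```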
The human visual system can quickly acquire information from images, and its dynamic selection ability, called the visual attention mechanism, plays an important role in processing what humans perceive around them. The visual attention mechanism allows people to quickly and accurately capture the part of an image that is of most interest, called the salient region; Fig. 1 shows the result obtained after salient object detection on an image. The salient region often contains the most important information in the image. Ordinary image processing operates on the entire image, which wastes computing power and time and makes processing inefficient. Introducing this dynamic selection capability into image processing as salient object detection extracts salient regions from the image and concentrates processing on them, greatly improving efficiency. In many application scenarios for single-pixel imaging systems, we can save overhead by selecting the sampling rate appropriate to the task. We therefore need to choose a suitable model for salient object detection.
Most of today’s salient object detection networks extract deep features using a backbone trained for image classification. In the literature, the method used for salient object detection after single-pixel imaging is the PoolNet architecture with a ResNet-50 backbone. ResNet-50 is designed for the ultimate purpose of image classification, so the extracted features represent semantic information rather than the local and global contrast information that matters most for salient object detection. Instead of using a model pre-trained for image classification as the backbone, the network used in this paper employs a two-level nested U-shaped structure trained from scratch. The specific network model is shown in Fig. 2.
The network consists of five encoders on the left side, five decoders on the right side, and one further stage at the bottom, combined into a large U-shaped structure; there is also a U-shaped structure within each encoder and decoder. The U-shaped network in each encoder and decoder stage is called a residual U-block (RSU). The RSU module is shown in Fig. 3, and it consists of three main parts:
(1) Input convolutional layer: an ordinary convolutional layer that converts the input feature map into an intermediate feature map with a given number of channels and extracts local features.
(2) U-shaped feature extraction layer: takes the intermediate feature map as input and extracts multi-scale contextual information through a symmetric encoder-decoder structure. A parameter controls the depth of the RSU module: the deeper the RSU, the more pooling operations and the larger the receptive field.
(3) Feature fusion layer: the fusion of local features and multi-scale features.
For example, the RSU modules in Fig. 3 have 4, 5, 6, and 7 layers, respectively; more layers yield a larger receptive field, and fewer layers a smaller one. Map 1 to Map 6 are the six groups of feature maps produced by the different RSU modules, and they are fused to obtain the final feature map. Because the RSU modules have different depths, the fused feature map contains rich global and local information.
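The receptive-field growth with RSU depth can be estimated with a simple recurrence, assuming one 3 × 3 convolution per level and 2× downsampling between levels (a simplification of the actual RSU blocks, for intuition only):

```python
def receptive_field(depth, kernel=3):
    """Receptive field of a simplified RSU-style encoder path:
    one 3x3 conv per level, 2x downsampling between levels (assumed)."""
    rf, jump = 1, 1
    for level in range(depth):
        rf += (kernel - 1) * jump   # each 3x3 conv widens the field by 2*jump
        if level < depth - 1:
            jump *= 2               # pooling doubles the effective stride
    return rf

for d in (4, 5, 6, 7):
    print(f"RSU depth {d}: receptive field {receptive_field(d)}")
```

Under these assumptions the field roughly doubles per extra level, which is why deeper RSUs capture more global context.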
The use of a double-layer nested U-shaped structure allows for more efficient extraction of multi-scale features within each phase and multi-level features in the aggregation phase.
The dataset used to train the U2Net model in the experiment is the DUTS dataset reconstructed from Fourier coefficients acquired over the Fourier spectrum; the specific process of Fourier-basis image reconstruction is shown in Fig. 4. The original DUTS [25] dataset contains 10,553 images. We selected some images in the training set and reconstructed them at sampling rates of 100%, 25%, 6.25%, 1.56%, and 0.39%; the results are shown in Fig. 5. When the sampling rate is 100%, 25%, or 6.25%, the reconstructed image is relatively clear. The average time required to reconstruct an image at each sampling rate is given in Table 1: at the 100% and 25% sampling rates reconstruction takes seconds, while at 6.25% and below it takes milliseconds, which greatly reduces the overhead of the single-pixel imaging system without significantly reducing the quality of the reconstructed image.
TABLE 1 Average time (s) required to reconstruct images at different sampling rates
Sampling Rates | 100% | 25% | 6.25% | 1.56% | 0.39% |
---|---|---|---|---|---|
Time (s) | 11.3682 | 2.8137 | 0.6646 | 0.1742 | 0.0493 |
A total of 21,106 images with sampling rates of 1.56% and 0.39% were used to train the network. The data and labels used for training are shown in Fig. 6. The dataset used to test the salient object detection results is the ECSSD [25] dataset reconstructed from Fourier coefficients acquired over the Fourier spectrum. The original dataset contains one thousand images; reconstructions at sampling rates of 100%, 25%, 6.25%, 1.56%, and 0.39% form five datasets, which are input to the model for detection. The reconstructed ECSSD dataset is shown in Fig. 7.
In this scheme, U2Net is used as the deep learning model. To train the network, the 10,553 images in the DUTS training set are first converted to grayscale and resized to 128 × 128. The pre-processed images are then reconstructed by a simulated inverse Fourier transform in MATLAB at sampling rates of 100%, 25%, 6.25%, 1.56%, and 0.39%, giving single-pixel images based on Fourier-spectrum acquisition at the corresponding sampling rates. To save training time, the 21,106 reconstructed images of 128 × 128 pixels at the 1.56% and 0.39% sampling rates are used to train the network. We train our network from scratch, with all convolutional layers initialized by Xavier, setting the loss weights
Since the
TABLE 2 maxFβ and MAE of the salient object detection results at different sampling rates
Sampling Rate | 100% | 25% | 6.25% | 1.56% | 0.39% |
---|---|---|---|---|---|
maxFβ | 0.4769 | 0.5011 | 0.5103 | 0.8901 | 0.4911 |
MAE | 0.3457 | 0.2803 | 0.2781 | 0.1137 | 0.2906 |
where Precision = |B∩G|/|B| and Recall = |B∩G|/|G|; B is the mask generated by binarizing the saliency map S with a threshold; G is the ground-truth saliency map; |·| counts the non-zero entries; and β² in the F-measure is empirically set to 0.3. The expression for MAE is Eq. 15.
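The two metrics can be computed as follows, assuming maxFβ is the F-measure maximized over binarization thresholds with β² = 0.3 (as stated above) and MAE is the mean absolute difference between the saliency map and the ground truth; array shapes and the toy example are illustrative:

```python
import numpy as np

def max_f_beta(sal, gt, beta2=0.3, steps=256):
    """Max F-measure over binarization thresholds (beta^2 = 0.3 as in the text)."""
    best = 0.0
    for t in np.linspace(0, 1, steps, endpoint=False):
        b = sal > t
        inter = np.logical_and(b, gt).sum()
        if b.sum() == 0 or inter == 0:
            continue
        prec = inter / b.sum()
        rec = inter / gt.sum()
        f = (1 + beta2) * prec * rec / (beta2 * prec + rec)
        best = max(best, f)
    return best

def mae(sal, gt):
    """Mean absolute error between saliency map and ground-truth map."""
    return np.abs(sal - gt.astype(float)).mean()

# Toy example: a saliency map that matches the ground-truth square
gt = np.zeros((8, 8), dtype=bool)
gt[2:6, 2:6] = True
sal = np.where(gt, 0.9, 0.1)
print(max_f_beta(sal, gt), mae(sal, gt))
```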
In this paper, a scheme is proposed to perform image reconstruction based on a Fourier basis, then train a salient object detection model using the reconstructed images as its training set, and test it. The feasibility of the scheme is verified by computer simulation. In the simulation, the theoretical feasibility of acquiring Fourier coefficients over the Fourier spectrum and reconstructing images by Fourier inversion was verified, and the performance of the trained U2Net model was observed on input images reconstructed at five sampling rates: 100%, 25%, 6.25%, 1.56%, and 0.39%. Experimentally, the 1,000 images in the ECSSD dataset were converted to grayscale and resized to 128 × 128 as the test object scenes to be imaged with a single pixel. The test scene images were reconstructed at the five sampling rates, and the resulting 5,000 test images of 128 × 128 pixels were divided into five batches by sampling rate and input to the trained U2Net-based model. The test results are shown in Fig. 8. The detection results on the 100% sampling-rate images were not good, with folds along the edges of the target. Our initial judgment is that, because the training set used the 1.56% and 0.39% sampling rates, images at a 100% sampling rate differ considerably from the training data; the 25%, 6.25%, and 1.56% rates give better detection results, and even a sampling rate as low as 0.39% still yields the approximate salient region. Comparing the test results with the ground-truth saliency maps and calculating the max
Because of the good performance shown by the model on the 1.56% test set, the reconstructed images at the 1.56% sampling rate were used to compare the salient object detection scheme in this paper with other methods: the classical ITTI method, and the PoolNet-based implementation used in [1]. For the ITTI method, we used reconstructed images (of size 128 × 128) based on the ECSSD dataset at a sampling rate of 1.56% as the test input. For better comparability with PoolNet, we trained its ResNet-50 backbone on the 21,106 reconstructed images (size 128 × 128) based on the DUTS dataset at the 1.56% and 0.39% sampling rates, tested the resulting model on the same dataset, and obtained the comparison results in Fig. 9. The scheme in this paper can accurately detect the region of the target, and the combination of max
TABLE 3 Comparison of evaluation indexes of three methods under 1.56% sampling rate
Evaluation Indexes | This Work | ITTI | PoolNet |
---|---|---|---|
maxFβ | 0.8901 | 0.2331 | 0.3583 |
MAE | 0.1137 | 0.2257 | 0.4412 |
It was found that single-pixel imaging based on the Fourier spectrum gives good imaging results. Moreover, the U2Net network model tested in this scheme detects the reconstructed images well: it obtains good detection results at sampling rates of 25% and below, although the salient object detection effect is poor at high sampling rates. The best results are achieved on images at the 1.56% sampling rate, and good detection results are still obtained at 0.39%. The detection results of this scheme are superior to both the traditional ITTI method and the deep-learning PoolNet model.
In this paper, we discussed the implementation of salient object detection based on single-pixel imaging systems and proposed a salient object detection scheme based on Fourier-basis reconstructed images and deep learning models. Based on the U2Net model, salient object detection is performed on images reconstructed from under-sampled data. The proposed scheme shows good detection results as well as robustness, providing a new idea for applying single-pixel imaging systems to complex vision tasks. The experimental results and analysis also demonstrate the flexibility of the single-pixel salient object detection system in adapting to different application requirements. More measurements can be taken in applications such as image segmentation and image synthesis, where well-defined boundaries are required, while measurements can be reduced in applications such as visual tracking and target localization, where only a rough idea of the target’s location and extent is needed. This not only improves the efficiency of the single-pixel imaging system, but also greatly saves its overhead.
The authors declare no conflicts of interest.
Data underlying the results presented in this paper are not publicly available at the time of publication, but may be obtained from the authors upon reasonable request.
Natural Science Foundation of Shanghai (Grant No. 18ZR1425800); the National Natural Science Foundation of China (Grant Nos. 61775140 and 61875125).
1College of Communication and Art Design, University of Shanghai for Science and Technology, Shanghai 200093, China
2School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
3Shanghai Institute of Intelligent Science and Technology, Tongji University, Shanghai 200092, China
At certain wavelengths, single-pixel imaging is considered to be a solution that can achieve high quality imaging and also reduce costs. However, achieving imaging of complex scenes is an overhead-intensive process for single-pixel imaging systems, so low efficiency and high consumption are the biggest obstacles to their practical application. Improving efficiency to reduce overhead is the solution to this problem. Salient object detection is usually used as a pre-processing step in computer vision tasks, mimicking human functions in complex natural scenes, to reduce overhead and improve efficiency by focusing on regions with a large amount of information. Therefore, in this paper, we explore the implementation of salient object detection based on single-pixel imaging after a single pixel, and propose a scheme to reconstruct images based on Fourier bases and use U2Net models for salient object detection.
Keywords: Fourier transform, Salient object detection, Single-pixel imaging
Photos and videos commonly seen in life are often created by capturing photons (the building blocks of light) using digital sensors, meaning that ambient light reflects off an object and the lens focuses it on a screen made up of tiny photosensitive elements, or pixels. The image is a pattern formed by the light and dark spots created by the reflected light. The most common digital camera, for example, consists of hundreds of pixels that form an image by detecting the intensity and color of light at each point in space. Also, a 3D image can be generated by placing several cameras around the object and photographing it from multiple angles, or by scanning the object using a stream of photons and reconstructing it in three dimensions. However, regardless of the method used, the image is constructed by collecting spatial information about the scene. In contrast, single-pixel imaging is an imaging technique developed from computational ghost imaging [1]. It is based on the principle of correlation measurement and relies solely on the collection of light intensity information to image the object. Single-pixel imaging takes structured light illumination on the illumination side and uses a single-pixel light intensity detector on the detection side to collect the signal. When the illumination structure changes, the corresponding change in the light intensity of the object reflects the degree of correlation between the illumination structure and the spatial information of the object. By continuously changing the illumination structure and accumulating the correlation information, the final imaging of the object is achieved [2]. Since a single-pixel camera requires only light intensity detection at the detection end, its requirements are much lower than those of front-facing detectors in ordinary imaging. 
Therefore, for some special conditions, such as for some bands where the technology of surface array detectors is not mature, single-pixel imaging technology has great application advantages.
For example, imaging through biologic tissues and improving the spatial resolution of optical microscopy have been two major biomedical imaging challenges. Non-visible imaging is the current solution to these challenges, because the use of wavelengths in the optical window of biological tissues (
Based on the study of the above critical issues, this paper proposes image reconstruction based on a Fourier basis to apply salient object detection techniques to single-pixel imaging systems. Salient object detection, commonly used as a pre-processing procedure in many complex vision tasks, including target detection [1, 20], semantic segmentation [21–24], image classification, etc., all mimic human functions to quickly capture regions of attractive objects in complex natural scenes and process more information in these regions with sufficient details, while relatively irrelevant information in other regions is not processed, which greatly ensures that the human visual system works efficiently and properly. Invoking this capability of salient object detection as an image pre-processing step in single-pixel imaging systems has important scientific significance and application value, especially in medical imaging and subsequent medical image information processing. This scheme not only improves the imaging efficiency of single-pixel imaging systems, but also can decide different sampling rates according to different tasks, which greatly improves efficiency and saves the overhead of single-pixel imaging systems. It also becomes particularly important to select a deep learning model with good performance for saliency target detection, which serves as a pre-processing procedure for complex vision tasks, and its excellent performance lays a solid foundation for subsequent tasks.
Suppose a two-dimensional image of an object is
Single-pixel imaging is a computational imaging method that uses a known modulation mask matrix and a sequence of detected measurement signals to solve for the target image. Then the solved target image can be solved by Eq. 2 as follows:
However, to solve it according to this formula, it is necessary to ensure that the modulation mask matrix
Notation must be legible, clear, compact, and consistent with standard usage. In general, acronyms should be spelled out the first time they are used. Adherence to the following guidelines will greatly assist the production process:
where
In the beginning, Fourier analysis was only used for one-dimensional continuous signals, but in fact, it is also applicable to two-dimensional discrete signals, and images are actually two-dimensional discrete signals, so Fourier analysis is also suitable for digital images. Equations 5 and 6 represent the two-dimensional discrete Fourier normal transform and the two-dimensional discrete Fourier inverse transform, respectively.
where
The process of acquiring the Fourier spectrum involves acquiring the weights of each Fourier base pattern corresponding to the image,
Where n is 1, 2, 3, 4 corresponding to m for 0,
Assuming that the target object is reflective and the reflected intensity of the object in the direction of the single pixel detector measured relative to the direction of illumination used in projecting the Fourier base pattern is
where S is the projection region of the Fourier base pattern. The value of the optical response of the single-pixel detector is Eq. 1.
where
According to the four-step phase shift method can be obtained about the Fourier coefficients
In summary, it can be seen that each Fourier coefficient
Accordingly, the Fourier spectrum of the obtained object image can be reconstructed by implementing the Fourier inversion as in Eq. 13.
where
To reconstruct an image with a resolution of
The image energy is mainly concentrated in the low-frequency part of the Fourier spectrum, while the Fourier coefficients of high frequencies have small or even near-zero modes. Therefore, the object image can be reconstructed by projecting only the Fourier base pattern of low spatial frequencies, obtaining the Fourier coefficients of low frequencies and directly setting the coefficients of high frequencies to zero, thus achieving the goal of fewer measurements.
The human visual system can quickly acquire information in images, and the dynamic selection ability of the human visual system, called the visual attention mechanism, plays an important role in the processing of what humans perceive around them. The visual attention mechanism allows people to quickly and accurately capture the part of an image that is of most interest, which is called the salient region as shown in Fig. 1, it is the result obtained after the salient object detection in the image. For an image, the salient region often contains the most important information in the image. Normal image processing processes the entire image, which is a waste of computer computing power and time, resulting in inefficient image processing. This dynamic selection capability in human vision is introduced into image processing as salient object detection, which extracts salient regions from the image and mainly processes the selected regions, greatly improving the efficiency of image processing. In many application scenarios in single-pixel imaging systems, we can save overhead by selecting the appropriate sampling rate according to the task. We choose a model for salient object detection.
Most of today’s salient object detection networks are based on extracting deep features using a backbone trained for image classification purposes. In the literature, the method used for salient object detection after single-pixel imaging is the PoolNet network architecture with a ResNet-50 backbone. ResNet-50 is designed for the ultimate purpose of image classification, and the extracted features represent the semantic rather than the most important local and global contrast information for saliency target detection. Instead of using a pre-trained model for image classification as the main backbone, the network structure used in this paper employs a double-layer nested U-shaped structure network trained from scratch. The specific network model is shown in Fig. 2.
The network consists of five encoders on the left side, five decoders on the right side, and one decoder on the bottom side, which are combined in a U-shaped structure, and there is also a U-shaped structure in each decoder and encoder. The U-shaped structure network in each stage of decoder and encoder is called residual U-block (RSU). RSU module is shown in Fig. 3, and it consists of three main parts:
(1) Input convolutional layer: the input feature map is converted into an intermediate feature map with the number of channels, which is the ordinary convolutional layer for extracting local features.
(2) Network-based feature extraction layer: the intermediate feature map is used as the input, and multi-scale contextual information is extracted through the processing of the network, which is denoted as the network. There are parameters controlling the depth of the RSU module; the larger the RSU layer, the more pooling operations and the larger the receptive field.
(3) Feature fusion layer: the fusion of local features and multi-scale features.
For example, in Fig. 3, there are 4, 5, 6, and 7 layers of RSU modules, respectively. The more layers of the module used to get a larger receptive field, conversely, the smaller. Map 1 to Map 6 are the 6 groups of feature maps obtained by different RSU modules, and they are fused to obtain the final fused feature maps. Because of the different depths of the RSU modules, the final fused feature maps contain rich global and local information.
The use of a double-layer nested U-shaped structure allows for more efficient extraction of multi-scale features within each phase and multi-level features in the aggregation phase.
The dataset used for the training U2Net model in the experiment is the DUTS dataset reconstructed based on the Fourier spectrum to obtain the Fourier coefficients, the specific process of image reconstruction based on Fourier basis is shown in Fig. 4. And the original DUTS [25] dataset contains 10,553 images. We selected some images in the training set with sampling rates of 100%, 25%, 6.25%, 1.56%, and 0.39% for reconstruction, and the results are shown in Fig. 5. It was found that when the sampling rate is 100%, 25% and 6.25%, the reconstructed image is relatively clear. After calculation, it was found that the average time required for the reconstructed image with different sampling rates is as shown in Table 1. It was found that the time required to reconstruct the image at 100% and 25% sampling rates is seconds, while the corresponding speed below 6.25% is milliseconds, which greatly saves the overhead of the single pixel imaging system, but does not significantly reduce the quality of the reconstructed image.
TABLE 1. Average time (s) required to reconstruct images at different sampling rates.
Sampling Rate | 100% | 25% | 6.25% | 1.56% | 0.39% |
---|---|---|---|---|---|
Time (s) | 11.3682 | 2.8137 | 0.6646 | 0.1742 | 0.0493 |
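The under-sampled reconstruction process can be sketched in simulation as follows. This is a simplified numpy sketch of Fourier-basis single-pixel imaging, assuming (as the text suggests) that the lowest-spatial-frequency coefficients are the ones measured; the function name `fourier_reconstruct` and the circular low-frequency selection rule are illustrative assumptions, not the authors' exact Matlab procedure.

```python
import numpy as np

def fourier_reconstruct(img, sampling_rate):
    """Reconstruct an image from a subset of its Fourier coefficients.

    Simulates Fourier-basis single-pixel imaging: only the fraction
    `sampling_rate` of coefficients with the lowest spatial frequency
    is 'measured'; the rest are set to zero before the inverse FFT.
    """
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = F.shape
    # Rank coefficients by distance from the spectrum centre (low freq. first)
    yy, xx = np.mgrid[0:h, 0:w]
    dist = (yy - h // 2) ** 2 + (xx - w // 2) ** 2
    n_keep = max(1, int(round(sampling_rate * h * w)))
    thresh = np.sort(dist.ravel())[n_keep - 1]
    mask = dist <= thresh
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))

img = np.random.rand(128, 128)          # stand-in for a 128 x 128 scene
recon = fourier_reconstruct(img, 0.0625)  # 6.25 % sampling rate
```

Because natural images concentrate most of their energy in low spatial frequencies, keeping only a few percent of the coefficients still yields a recognizable reconstruction, which is why the 6.25% and even 1.56% images in Fig. 5 remain usable.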
A total of 21,106 images with sampling rates of 1.56% and 0.39% were used to train the network. The data and labels used for training are shown in Fig. 6. The dataset used to test the salient object detection results is the ECSSD [25] dataset reconstructed from Fourier coefficients obtained from the Fourier spectrum. The original dataset contains 1,000 images; reconstructions at sampling rates of 100%, 25%, 6.25%, 1.56%, and 0.39% form five test datasets, which are input to the model for detection. The reconstructed ECSSD dataset is shown in Fig. 7.
In this scheme, U2Net is used as the deep learning model. To train the network, the 10,553 images in the DUTS training set are first converted to grayscale and resized to 128 × 128 pixels. The pre-processed images are then reconstructed by a simulated inverse Fourier transform in Matlab at sampling rates of 100%, 25%, 6.25%, 1.56%, and 0.39%, yielding single-pixel images based on Fourier-spectrum acquisition at the corresponding sampling rates. To save training time, the 21,106 reconstructed images of 128 × 128 pixels at the 1.56% and 0.39% sampling rates are used to train the network. We train the network from scratch, with all convolutional layers initialized by Xavier initialization, setting the loss weights
TABLE 2. Maximum F-measure (maxFβ) and MAE of the salient object detection results at different sampling rates.
Sampling Rate | 100% | 25% | 6.25% | 1.56% | 0.39% |
---|---|---|---|---|---|
maxFβ | 0.4769 | 0.5011 | 0.5103 | 0.8901 | 0.4911 |
MAE | 0.3457 | 0.2803 | 0.2781 | 0.1137 | 0.2906 |
where Precision = |B∩G|/|B| and Recall = |B∩G|/|G|; B is the mask generated by binarizing the saliency map S with a threshold; G is the ground-truth saliency map; and |·| counts the non-zero elements. The F-measure is Fβ = (1 + β²)·Precision·Recall / (β²·Precision + Recall), where β² is empirically set to 0.3; maxFβ is the maximum Fβ over all binarization thresholds. The MAE is the mean absolute difference between the saliency map and the ground truth, MAE = (1/(W × H)) Σ|S(x, y) − G(x, y)|, as given in Eq. 15.
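The two evaluation metrics above can be computed as follows. This is a straightforward sketch of the standard definitions (β² = 0.3, threshold sweep for maxFβ); the function names and the choice of 255 evenly spaced thresholds are illustrative assumptions.

```python
import numpy as np

def max_f_measure(sal, gt, beta2=0.3, n_thresholds=255):
    """Maximum F-measure over binarization thresholds (beta^2 = 0.3)."""
    gt = gt > 0.5                      # ground-truth mask G
    best = 0.0
    for t in np.linspace(0.0, 1.0, n_thresholds):
        b = sal > t                    # binarized mask B
        tp = np.logical_and(b, gt).sum()  # |B ∩ G|
        if tp == 0:
            continue                   # Precision/Recall undefined or zero
        precision = tp / b.sum()       # |B ∩ G| / |B|
        recall = tp / gt.sum()         # |B ∩ G| / |G|
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall)
        best = max(best, f)
    return best

def mae(sal, gt):
    """Mean absolute error between saliency map and ground truth (Eq. 15)."""
    return np.abs(sal.astype(float) - gt.astype(float)).mean()
```

A perfect prediction gives maxFβ = 1 and MAE = 0, so in Table 2 a higher maxFβ and a lower MAE both indicate better detection.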
In this paper, a scheme is proposed that first reconstructs images based on a Fourier basis and then uses the reconstructed images to train and test a salient object detection model. The feasibility of the scheme was verified by computer simulation. The simulation confirmed the theoretical feasibility of obtaining Fourier coefficients from the Fourier spectrum and reconstructing images by inverse Fourier transform, and the performance of the trained U2Net model was observed on input images reconstructed at five sampling rates: 100%, 25%, 6.25%, 1.56%, and 0.39%. In the experiment, the 1,000 images in the ECSSD dataset were converted to grayscale and resized to 128 × 128 pixels as the object scenes to be imaged with a single pixel. These scene images were reconstructed at the five sampling rates, and the resulting 5,000 test images of 128 × 128 pixels were divided into five batches by sampling rate and input to the trained U2Net-based model. The test results are shown in Fig. 8. The detection results at the 100% sampling rate were poor, with folds at the edges of the target. Our initial judgment is that this is because the training set was reconstructed at the 1.56% and 0.39% sampling rates, which differ greatly from images at a 100% sampling rate; accordingly, the 25%, 6.25%, and 1.56% images yield better detection results. Even at a sampling rate as low as 0.39%, the model still gives a rough region of saliency. The detection results were compared with the ground-truth saliency maps, and the maxFβ and MAE values were calculated; they are listed in Table 2.
Because of the good performance of the model on the 1.56% test set, the reconstructed images at the 1.56% sampling rate were used to compare the salient object detection scheme in this paper with other methods: the classical traditional ITTI method and the PoolNet-based method used in [1]. For the ITTI method, reconstructed images (128 × 128 pixels) from the ECSSD dataset at a 1.56% sampling rate were used as the test input. For better comparability with PoolNet, the 21,106 reconstructed images (128 × 128 pixels) from the DUTS dataset at the 1.56% and 0.39% sampling rates were used to train PoolNet on a ResNet-50 backbone, and the resulting model was tested on the same dataset; the comparison results are shown in Fig. 9. The scheme in this paper accurately detects the region of the target, and the maxFβ and MAE values in Table 3 show that it outperforms both methods.
TABLE 3. Comparison of evaluation indexes of three methods under 1.56% sampling rate.
Evaluation Indexes | This Work | ITTI | PoolNet |
---|---|---|---|
maxFβ | 0.8901 | 0.2331 | 0.3583 |
MAE | 0.1137 | 0.2257 | 0.4412 |
It was found that single-pixel imaging based on the Fourier spectrum yields good imaging results. Moreover, the U2Net network model used in this scheme detects salient objects well in the reconstructed images: the model performs well at sampling rates of 25% and below, although detection is poor at high sampling rates. The best results are achieved at the 1.56% sampling rate, and good detection results are still obtained at 0.39%. The detection results of this scheme are superior to those of the traditional ITTI method and the deep learning PoolNet model.
In this paper, we discussed the implementation of salient object detection based on single-pixel imaging systems and proposed a salient object detection scheme based on Fourier-basis reconstructed images and deep learning models. Based on the U2Net model, salient object detection is performed on images reconstructed from under-sampled data. The proposed scheme shows good detection results and robustness, providing a new idea for applying single-pixel imaging systems to complex vision tasks. The experimental results and analysis also demonstrate the flexibility of the single-pixel salient object detection system in adapting to different application requirements: more measurements can be taken in applications such as image segmentation and image synthesis, where well-defined boundaries are required, and fewer measurements suffice in applications such as visual tracking and target localization, where only a rough idea of the target's location and area is needed. This not only improves the efficiency of the single-pixel imaging system but also greatly reduces its overhead.
The authors declare no conflicts of interest.
Data underlying the results presented in this paper are not publicly available at the time of publication but may be obtained from the authors upon reasonable request.
Natural Science Foundation of Shanghai (Grant No. 18ZR1425800); the National Natural Science Foundation of China (Grant Nos. 61775140, 61875125).