## Article

Curr. Opt. Photon. 2022; 6(5): 463-472

Published online October 25, 2022 https://doi.org/10.3807/COPP.2022.6.5.463

## U2Net-based Single-pixel Imaging Salient Object Detection

Leihong Zhang1, Zimin Shen1, Weihong Lin1, Dawei Zhang2,3

1College of Communication and Art design, University of Shanghai for Science and Technology, Shanghai 200093, China
2School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
3Shanghai Institute of Intelligent Science and Technology, Tongji University, Shanghai 200092, China

Corresponding author: *923722470@qq.com, ORCID 0000-0001-6699-1247

Received: April 15, 2022; Revised: June 22, 2022; Accepted: July 11, 2022

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

At certain wavelengths, single-pixel imaging is considered a way to achieve high-quality imaging at reduced cost. However, imaging complex scenes is an overhead-intensive process for single-pixel imaging systems, so low efficiency and high consumption are the biggest obstacles to their practical application. Improving efficiency to reduce overhead is the solution to this problem. Salient object detection is commonly used as a pre-processing step in computer vision tasks; it mimics the way human vision handles complex natural scenes, reducing overhead and improving efficiency by focusing on the regions that carry the most information. Therefore, in this paper, we explore salient object detection on images acquired by single-pixel imaging, and propose a scheme that reconstructs images from a Fourier basis and uses the U2Net model for salient object detection.

Keywords: Fourier transform, Salient object detection, Single-pixel imaging

OCIS codes: (070.4340) Nonlinear optical signal processing; (100.5010) Pattern recognition; (110.2970) Image detection systems; (110.1758) Computational imaging

### I. INTRODUCTION

The photos and videos seen in everyday life are usually created by capturing photons (the building blocks of light) with digital sensors: ambient light reflects off an object, and a lens focuses it onto a sensor made up of tiny photosensitive elements, or pixels. The image is the pattern of light and dark spots formed by the reflected light. The most common digital camera, for example, consists of millions of pixels that form an image by detecting the intensity and color of light at each point in space. A 3D image can also be generated by placing several cameras around the object and photographing it from multiple angles, or by scanning the object with a stream of photons and reconstructing it in three dimensions. Regardless of the method used, however, the image is constructed by collecting spatial information about the scene. In contrast, single-pixel imaging is an imaging technique developed from computational ghost imaging [1]. It is based on the principle of correlation measurement and relies solely on the collection of light intensity information to image the object. Single-pixel imaging uses structured illumination on the illumination side and a single-pixel light intensity detector on the detection side to collect the signal. When the illumination structure changes, the corresponding change in the detected light intensity reflects the degree of correlation between the illumination structure and the spatial information of the object. By continuously changing the illumination structure and accumulating the correlation information, the object is finally imaged [2]. Since a single-pixel camera requires only light intensity detection at the detection end, its detector requirements are much lower than those of the surface array detectors used in ordinary imaging. Therefore, under some special conditions, such as in wavelength bands where surface array detector technology is not yet mature, single-pixel imaging has great application advantages.

For example, imaging through biological tissues and improving the spatial resolution of optical microscopy have been two major challenges in biomedical imaging. Non-visible imaging is the current solution to these challenges: using wavelengths in the optical window of biological tissues (e.g. X-ray imaging [3], computed tomography (CT) imaging [4], ultrasound imaging [5], positron emission imaging [6, 7], single photon emission imaging [8-11], terahertz imaging [12-16], etc.) can effectively overcome the scattering of light by biological media, and imaging at wavelengths shorter than visible light can achieve higher spatial resolution. However, traditional optical imaging methods require a surface array detector, and the complexity, difficulty, and high cost of making surface array detectors that operate in non-visible wavelength bands make it hard to improve the quality of non-visible imaging with traditional methods. As a new imaging method, single-pixel imaging is considered a solution for high-quality non-visible imaging, because it needs only a single point detector without spatial resolution to acquire the spatial information of the target object and reconstruct its image. However, single-pixel imaging technology has so far only reached the stage of being able to form images, and it still faces problems in practical applications. First, the imaging quality still lags behind that of traditional imaging techniques: because the image reconstruction mechanism relies on random and statistical mathematics, the reconstructed image shows obvious noise. Second, the number of measurements required for single-pixel imaging is huge, which means low efficiency and high overhead. Many scholars have studied how to improve the imaging efficiency of single-pixel imaging and proposed a Fourier-based single-pixel imaging technique [19], which is the most sampling-efficient technique so far. Li et al. [18] of Shanghai Jiaotong University proposed single-pixel imaging based on a discrete cosine basis, with the sampling rate of the single-pixel imaging system decided according to the demands of subsequent tasks. However, that scheme has low imaging efficiency, the selected deep learning model performs poorly, and it cannot maximize the overhead savings of the single-pixel imaging system.

Based on the above issues, this paper proposes image reconstruction based on a Fourier basis in order to apply salient object detection to single-pixel imaging systems. Salient object detection is commonly used as a pre-processing step in many complex vision tasks, including target detection [1, 20], semantic segmentation [21-24], and image classification. It mimics the human ability to quickly locate attractive objects in complex natural scenes and to process those regions in sufficient detail while leaving relatively irrelevant regions unprocessed, which is what allows the human visual system to work efficiently. Using salient object detection as an image pre-processing step in single-pixel imaging systems has important scientific significance and application value, especially in medical imaging and subsequent medical image processing. This scheme not only improves the imaging efficiency of single-pixel imaging systems, but also allows the sampling rate to be chosen according to the task, which greatly improves efficiency and saves overhead. Because salient object detection serves as a pre-processing step for complex vision tasks, it is also particularly important to select a deep learning model that performs well: its performance lays the foundation for the subsequent tasks.

### 2.1. Single-pixel Imaging

Suppose the two-dimensional image of an object is $I \in \mathbb{R}^{n_1 \times n_2}$, containing a total of $N$ pixels, where $N = n_1 \times n_2$. To acquire this image, a series of spatially resolved modulation mask patterns is loaded onto a digital micromirror device (DMD). For the modulation mask sequence $P = [P_1, P_2, \ldots, P_M] \in \mathbb{R}^{M \times n_1 \times n_2}$, where $P_i \in \mathbb{R}^{n_1 \times n_2}$ denotes the $i$th frame of the modulation mask and $M$ denotes the number of modulation masks, the corresponding light intensity values $s = [s_1, s_2, \ldots, s_M] \in \mathbb{R}^M$ are captured by the bucket detector after the masks interact with the object. For convenience, the two-dimensional image is flattened into a vector, $I \in \mathbb{R}^N$; similarly, the mask sequence is written as a two-dimensional matrix, $P \in \mathbb{R}^{M \times N}$, in which each row represents one mask frame. The measurement can then be written as Eq. 1:

$$PI = s \tag{1}$$

Single-pixel imaging is a computational imaging method that uses the known modulation mask matrix and the sequence of detected measurement signals to solve for the target image. The target image is obtained from Eq. 2:

$$I = P^{-1}s \tag{2}$$

However, solving for the image with this formula requires the modulation mask matrix $P$ to be orthogonal, with $M = N$.
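The measurement model of Eqs. 1 and 2 can be illustrated with a minimal NumPy sketch. It is an idealized simulation, not the authors' implementation: the function names `measure` and `reconstruct` are hypothetical, and the orthogonal mask matrix is simply the Q factor of a random matrix (real DMD masks are non-negative, which this sketch ignores).

```python
import numpy as np

def measure(P, image):
    """Simulate bucket-detector readings s = P I for a mask matrix P (M x N)."""
    return P @ image.reshape(-1)

def reconstruct(P, s, shape):
    """Recover the image when P is square and orthogonal, so P^-1 = P^T."""
    return (P.T @ s).reshape(shape)

rng = np.random.default_rng(0)
n1, n2 = 8, 8
N = n1 * n2
image = rng.random((n1, n2))

# Assumption: an arbitrary orthogonal mask set with M = N,
# taken from the QR decomposition of a random square matrix.
P, _ = np.linalg.qr(rng.standard_normal((N, N)))

s = measure(P, image)                    # Eq. 1
recovered = reconstruct(P, s, (n1, n2))  # Eq. 2
print(np.allclose(recovered, image))
```

With fewer masks than pixels (M < N) the matrix is no longer invertible, which is why the following section turns to a basis in which most of the image energy sits in a small number of coefficients.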

### 2.2. Fourier Basis Based Image Reconstruction

For a two-dimensional continuous signal, the Fourier transform and its inverse are given by Eqs. 3 and 4, respectively:

$$C(f_x,f_y)=\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} I(x,y)\exp[-j2\pi(f_x x+f_y y)]\,dx\,dy \tag{3}$$

$$I(x,y)=\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} C(f_x,f_y)\exp[j2\pi(f_x x+f_y y)]\,df_x\,df_y \tag{4}$$

where $x, y$ are the Cartesian coordinates in the spatial domain; $f_x, f_y$ are the Cartesian coordinates in the Fourier domain, corresponding to the spatial frequencies in the $x$ and $y$ directions; $I(x, y)$ denotes a two-dimensional image; $C(f_x, f_y)$ denotes the Fourier spectrum of the two-dimensional image; and $j$ is the imaginary unit.

Fourier analysis was originally applied to one-dimensional continuous signals, but it is equally applicable to two-dimensional discrete signals. Since digital images are two-dimensional discrete signals, Fourier analysis is also suitable for them. Equations 5 and 6 give the two-dimensional discrete Fourier transform and its inverse, respectively.

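$$C(u,v)=\sum_{x=0}^{M-1}\sum_{y=0}^{N-1} I(x,y)\exp\left[-j2\pi\left(\frac{ux}{M}+\frac{vy}{N}\right)\right] \tag{5}$$

$$I(x,y)=\frac{1}{MN}\sum_{u=0}^{M-1}\sum_{v=0}^{N-1} C(u,v)\exp\left[j2\pi\left(\frac{ux}{M}+\frac{vy}{N}\right)\right] \tag{6}$$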

where u, v are the discretized forms of the spatial frequencies fx, fy, respectively. According to Fourier analysis, any image is a linear combination of a series of Fourier base patterns. The weight corresponding to each Fourier base pattern is the Fourier coefficient C(u, v).
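As a small check of the discrete transform pair, the sketch below evaluates the double sums directly and compares them with NumPy's FFT. The helper names `dft2_direct` and `idft2_direct` are hypothetical; the sign and normalization follow NumPy's convention, which places the minus sign in the forward transform and the 1/(MN) factor in the inverse.

```python
import numpy as np

def dft2_direct(image):
    """C(u,v) = sum_x sum_y I(x,y) exp[-j2*pi*(ux/M + vy/N)], as matrix products."""
    M, N = image.shape
    # W_M[u, x] = exp(-j2*pi*ux/M), W_N[v, y] = exp(-j2*pi*vy/N)
    W_M = np.exp(-2j * np.pi * np.outer(np.arange(M), np.arange(M)) / M)
    W_N = np.exp(-2j * np.pi * np.outer(np.arange(N), np.arange(N)) / N)
    return W_M @ image @ W_N.T

def idft2_direct(spectrum):
    """I(x,y) = 1/(MN) sum_u sum_v C(u,v) exp[+j2*pi*(ux/M + vy/N)]."""
    M, N = spectrum.shape
    W_M = np.exp(2j * np.pi * np.outer(np.arange(M), np.arange(M)) / M)
    W_N = np.exp(2j * np.pi * np.outer(np.arange(N), np.arange(N)) / N)
    return (W_M @ spectrum @ W_N.T) / (M * N)

rng = np.random.default_rng(1)
image = rng.random((6, 5))
spectrum = dft2_direct(image)

print(np.allclose(spectrum, np.fft.fft2(image)))
print(np.allclose(idft2_direct(spectrum).real, image))
```

The inverse sum makes the "linear combination" statement concrete: each coefficient C(u, v) weights one complex exponential pattern, and adding the weighted patterns returns the image.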

Acquiring the Fourier spectrum means acquiring the weight of each Fourier base pattern in the image, i.e. the Fourier coefficients. A computer generates the Fourier base patterns, each pattern is projected onto the target object, and a bucket detector collects the intensity of the resulting light signal. For each spatial frequency $(f_x, f_y)$, four Fourier base patterns $P_1, P_2, P_3, P_4$ with initial phases 0, π/2, π, and 3π/2 are projected onto the target object, as given by Eq. 7:

$$P_n(x,y;f_x,f_y)=a+b\cos(2\pi f_x x+2\pi f_y y+\phi) \tag{7}$$

where $n = 1, 2, 3, 4$ corresponds to $\phi = 0, \pi/2, \pi, 3\pi/2$; $a$ is the average light intensity; $b$ is the contrast; $x, y$ are the Cartesian coordinates in the plane of the target object; and $f_x, f_y$ are the spatial frequencies in the $x$ and $y$ directions, respectively. Using trigonometric identities, Eq. 7 can be rewritten as Eq. 8:

$$\begin{aligned}
P_1(x,y;f_x,f_y)&=a+b\cos(2\pi f_x x+2\pi f_y y),\\
P_2(x,y;f_x,f_y)&=a-b\sin(2\pi f_x x+2\pi f_y y),\\
P_3(x,y;f_x,f_y)&=a-b\cos(2\pi f_x x+2\pi f_y y),\\
P_4(x,y;f_x,f_y)&=a+b\sin(2\pi f_x x+2\pi f_y y).
\end{aligned} \tag{8}$$

Assume that the target object is reflective, that its reflected intensity in the direction of the single-pixel detector, relative to the direction of illumination used to project the Fourier base pattern, is $R(x, y)$, and that the object image is denoted $I(x, y)$. The object image is then proportional to the reflected intensity distribution, $I(x, y) \propto R(x, y)$. The total intensity of the light reflected from the target object when it is illuminated by the Fourier base pattern $P(x, y; f_x, f_y, \phi)$ is given by Eq. 9.

$$E_\phi(f_x,f_y)=\iint_S R(x,y)\,P(x,y;f_x,f_y,\phi)\,dx\,dy \tag{9}$$

where $S$ is the projection region of the Fourier base pattern. The optical response of the single-pixel detector is given by Eq. 10.

$$D_\phi(f_x,f_y)=D_n+\beta E_\phi(f_x,f_y) \tag{10}$$

where $D_n$ is the optical response caused by background illumination at the detector position, and $\beta$ is a factor related to the magnification of the single-pixel detector and the spatial relationship between the detector and the object. Substituting the four Fourier base patterns of Eq. 7 into Eqs. 9 and 10 gives the response values $D_1, D_2, D_3, D_4$ in Eq. 11:

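$$\begin{aligned}
D_1(f_x,f_y)&=D_n+a\beta\iint_S R(x,y)\,dx\,dy+b\beta\iint_S R(x,y)\cos(2\pi f_x x+2\pi f_y y)\,dx\,dy,\\
D_2(f_x,f_y)&=D_n+a\beta\iint_S R(x,y)\,dx\,dy-b\beta\iint_S R(x,y)\sin(2\pi f_x x+2\pi f_y y)\,dx\,dy,\\
D_3(f_x,f_y)&=D_n+a\beta\iint_S R(x,y)\,dx\,dy-b\beta\iint_S R(x,y)\cos(2\pi f_x x+2\pi f_y y)\,dx\,dy,\\
D_4(f_x,f_y)&=D_n+a\beta\iint_S R(x,y)\,dx\,dy+b\beta\iint_S R(x,y)\sin(2\pi f_x x+2\pi f_y y)\,dx\,dy.
\end{aligned} \tag{11}$$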

According to the four-step phase-shifting method, the Fourier coefficient $C(f_x, f_y)$ is obtained as in Eq. 12:

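$$\begin{aligned}
C(f_x,f_y)&=[D_1(f_x,f_y)-D_3(f_x,f_y)]+j[D_2(f_x,f_y)-D_4(f_x,f_y)]\\
&=2b\beta\iint_S R(x,y)\{\cos[2\pi(f_x x+f_y y)]-j\sin[2\pi(f_x x+f_y y)]\}\,dx\,dy\\
&=2b\beta\iint_S R(x,y)\exp[-j2\pi(f_x x+f_y y)]\,dx\,dy\\
&=2b\beta\,F\{R(x,y)\}.
\end{aligned} \tag{12}$$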

In summary, it can be seen that each Fourier coefficient C( fx, fy) can be obtained from the four Fourier base patterns with initial phases of 0, π/2, π, and 3π/2 , respectively.
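The four-step acquisition of a single coefficient can be simulated numerically. The following is a sketch under stated assumptions rather than the authors' code: the integrals are replaced by sums over a pixel grid, the spatial frequencies are taken as $f_x = u/M$ and $f_y = v/N$, and the function names and the values chosen for $a$, $b$, $\beta$ and $D_n$ are hypothetical.

```python
import numpy as np

def fourier_pattern(shape, u, v, phase, a=0.5, b=0.5):
    """Pattern a + b*cos(2*pi*(u*x/M + v*y/N) + phase) on an M x N grid."""
    M, N = shape
    x = np.arange(M)[:, None]
    y = np.arange(N)[None, :]
    return a + b * np.cos(2 * np.pi * (u * x / M + v * y / N) + phase)

def detector_response(R, pattern, beta, D_n):
    """Single-pixel reading D = D_n + beta * sum(R * P)."""
    return D_n + beta * np.sum(R * pattern)

def four_step_coefficient(R, u, v, a=0.5, b=0.5, beta=1.0, D_n=0.0):
    """Combine four phase-shifted readings as (D1 - D3) + j(D2 - D4)."""
    phases = (0.0, np.pi / 2, np.pi, 3 * np.pi / 2)
    D1, D2, D3, D4 = (
        detector_response(R, fourier_pattern(R.shape, u, v, p, a, b), beta, D_n)
        for p in phases
    )
    return (D1 - D3) + 1j * (D2 - D4)

rng = np.random.default_rng(2)
R = rng.random((16, 12))          # hypothetical reflectivity map
b, beta, D_n = 0.5, 0.7, 3.2      # arbitrary contrast, gain and background
C = four_step_coefficient(R, u=2, v=3, b=b, beta=beta, D_n=D_n)

# The background D_n and the mean term a cancel in the differences,
# leaving 2*b*beta times the DFT coefficient of R.
print(np.isclose(C, 2 * b * beta * np.fft.fft2(R)[2, 3]))
```

Because both the background term and the average intensity drop out of the differences, the result depends only on the contrast, the detector factor and the object itself, which is the practical appeal of the differential measurement.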

Accordingly, the object image can be reconstructed from the acquired Fourier spectrum by applying the inverse Fourier transform, as in Eq. 13.

$$I(x,y)=F^{-1}\{C(f_x,f_y)\}=F^{-1}\{[D_1(f_x,f_y)-D_3(f_x,f_y)]+j[D_2(f_x,f_y)-D_4(f_x,f_y)]\} \tag{13}$$

where $F^{-1}\{\cdot\}$ denotes the inverse Fourier transform.

To reconstruct an image with a resolution of M × N pixels without distortion, all M × N Fourier coefficients in its Fourier spectrum must be acquired. Each Fourier coefficient requires projecting four phase-shifted Fourier base patterns, i.e. four measurements. However, because the object image is mathematically a real-valued matrix, its Fourier spectrum is conjugate-symmetric, so only about half of the coefficients are independent and a total of 4 × (M × N)/2 = 2 × M × N measurements is required. In other words, for distortion-free reconstruction, acquiring the Fourier spectrum with the four-step phase-shifting method takes twice as many measurements as there are pixels in the reconstructed image; the underlying reason is that the four-step phase-shifting method is essentially a differential measurement.
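The measurement count can be worked through for the 128 × 128 images used later in the experiments. This sketch rests on two assumptions that the text does not state explicitly: that the sampling rate is the fraction of Fourier coefficients acquired, and that the quoted rates 1.56% and 0.39% are rounded values of 1/64 and 1/256. The function name is hypothetical.

```python
def four_step_measurements(M, N, sampling_rate=1.0):
    """4 measurements per coefficient x (M*N/2 independent coefficients) x rate."""
    independent_coefficients = M * N / 2  # conjugate symmetry, as counted in the text
    return round(4 * independent_coefficients * sampling_rate)

# Assumed exact fractions behind 100%, 25%, 6.25%, 1.56% and 0.39%.
rates = [1, 1 / 4, 1 / 16, 1 / 64, 1 / 256]
counts = [four_step_measurements(128, 128, r) for r in rates]
print(counts)  # [32768, 8192, 2048, 512, 128]
```

Under these assumptions, full sampling of a 128 × 128 scene needs 32,768 projections, while the lowest rate needs only 128.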

The image energy is mainly concentrated in the low-frequency part of the Fourier spectrum, while the high-frequency Fourier coefficients have small or even near-zero magnitudes. Therefore, the object image can be reconstructed by projecting only the Fourier base patterns of low spatial frequencies, acquiring the low-frequency Fourier coefficients, and setting the high-frequency coefficients directly to zero, which reduces the number of measurements.
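Low-frequency truncation can be sketched with NumPy's FFT in place of physical measurements. The shape of the retained region is an assumption: a centered rectangle whose area fraction equals the sampling rate (the paper does not specify the sampling path), and `lowpass_reconstruct` is a hypothetical name.

```python
import numpy as np

def lowpass_reconstruct(image, sampling_rate):
    """Keep a centered low-frequency block of the spectrum, zero the rest, invert."""
    M, N = image.shape
    spectrum = np.fft.fftshift(np.fft.fft2(image))  # move zero frequency to the center
    scale = np.sqrt(sampling_rate)                  # side-length fraction of the block
    h = max(1, int(round(M * scale)))
    w = max(1, int(round(N * scale)))
    top, left = (M - h) // 2, (N - w) // 2
    mask = np.zeros((M, N), dtype=bool)
    mask[top:top + h, left:left + w] = True
    truncated = np.where(mask, spectrum, 0)
    # Taking the real part discards the small imaginary residue left by
    # the slightly asymmetric mask.
    recon = np.fft.ifft2(np.fft.ifftshift(truncated)).real
    return recon, mask.mean()

# Hypothetical test scene: a bright square on a dark background.
image = np.zeros((128, 128))
image[40:90, 50:100] = 1.0

errors = []
for rate in [1, 1 / 4, 1 / 16, 1 / 64, 1 / 256]:
    recon, kept = lowpass_reconstruct(image, rate)
    errors.append(np.sqrt(np.mean((recon - image) ** 2)))
    print(f"kept {kept:.4f} of coefficients, RMSE {errors[-1]:.4f}")
```

The reconstruction error grows as fewer coefficients are kept, but the coarse shape and position of the object survive, which is the property the saliency detector relies on at low sampling rates.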

### 2.3. Salient Object Detection

The human visual system can quickly acquire information from images. Its ability to select dynamically, called the visual attention mechanism, plays an important role in how humans process what they perceive around them. The visual attention mechanism allows people to quickly and accurately capture the part of an image that is of most interest, which is called the salient region; Fig. 1 shows the result obtained after salient object detection on an image. For an image, the salient region often contains the most important information. Conventional image processing handles the entire image, which wastes computing power and time and makes processing inefficient. Salient object detection introduces the dynamic selection capability of human vision into image processing: it extracts the salient regions from the image and mainly processes those regions, which greatly improves the efficiency of image processing. In many application scenarios of single-pixel imaging systems, overhead can be saved by selecting a sampling rate appropriate to the task. We therefore choose a deep learning model for salient object detection.

Figure 1. Salient object detection.

### 2.4. U2Net Model

Most of today’s salient object detection networks extract deep features using a backbone trained for image classification. In the literature, the method used for salient object detection after single-pixel imaging is the PoolNet architecture with a ResNet-50 backbone. ResNet-50 is designed for image classification, so the features it extracts represent semantics rather than the local and global contrast information that matters most for salient object detection. Instead of using a model pre-trained for image classification as the backbone, the network used in this paper is a two-level nested U-shaped structure trained from scratch. The network model is shown in Fig. 2.

Figure 2. U2Net model architecture.

The network consists of five encoders on the left side, five decoders on the right side, and one decoder on the bottom side, combined in a U-shaped structure; each encoder and decoder stage itself also contains a U-shaped structure. The U-shaped network inside each encoder and decoder stage is called a residual U-block (RSU). The RSU module is shown in Fig. 3 and consists of three main parts:

Figure 3. Residual U-block (RSU) module: (a) RSU7, (b) RSU6, (c) RSU5, (d) RSU4, and (e) RSU4F.

(1) Input convolutional layer: the input feature map is converted into an intermediate feature map with the desired number of output channels; this is an ordinary convolutional layer for extracting local features.

(2) U-Net-like feature extraction layer: the intermediate feature map is used as the input, and multi-scale contextual information is extracted by a U-Net-like encoder-decoder sub-network. A parameter controls the depth of the RSU module; the deeper the RSU, the more pooling operations it contains and the larger its receptive field.

(3) Feature fusion layer: the fusion of local features and multi-scale features.

For example, the RSU modules in Fig. 3 have 4, 5, 6, and 7 layers, respectively. The more layers a module has, the larger its receptive field; the fewer layers, the smaller. Map 1 to Map 6 are the six groups of feature maps produced by different RSU modules, and they are fused to obtain the final feature map. Because the RSU modules have different depths, the final fused feature map contains rich global and local information.

The use of a double-layer nested U-shaped structure allows for more efficient extraction of multi-scale features within each phase and multi-level features in the aggregation phase.

### 3.1. Dataset

The dataset used to train the U2Net model in the experiment is the DUTS dataset, reconstructed by acquiring Fourier coefficients from the Fourier spectrum; the image reconstruction process based on the Fourier basis is shown in Fig. 4. The original DUTS [25] dataset contains 10,553 images. We reconstructed some images from the training set at sampling rates of 100%, 25%, 6.25%, 1.56%, and 0.39%, and the results are shown in Fig. 5. At sampling rates of 100%, 25%, and 6.25%, the reconstructed images are relatively clear. The average time required to reconstruct an image at each sampling rate is given in Table 1. Reconstruction takes seconds at the 100% and 25% sampling rates (11.37 s and 2.81 s), but less than one second at 6.25% and below (0.66 s down to 0.05 s), which greatly reduces the overhead of the single-pixel imaging system without significantly reducing the quality of the reconstructed image.

TABLE 1 Average time (s) required to reconstruct images at different sampling rates

| Sampling rate | 100% | 25% | 6.25% | 1.56% | 0.39% |
|---|---|---|---|---|---|
| Time (s) | 11.3682 | 2.8137 | 0.6646 | 0.1742 | 0.0493 |

Figure 4. Fourier basis-based image reconstruction process.
Figure 5. Reconstructed images of partial training set at 100%, 25%, 6.25%, 1.56%, and 0.39% sampling rate.

A total of 21,106 images, reconstructed at sampling rates of 1.56% and 0.39%, were used to train the network. The data and labels used for training are shown in Fig. 6. The dataset used to test the salient object detection results is the ECSSD [25] dataset, reconstructed in the same way from Fourier coefficients acquired from the Fourier spectrum. The original dataset contains 1,000 images; reconstructing them at sampling rates of 100%, 25%, 6.25%, 1.56%, and 0.39% yields five datasets, which are input to the model for detection. The reconstructed ECSSD dataset is shown in Fig. 7.

Figure 6. Partial training set and labels.
Figure 7. Partial test set.

### 3.2. Experimental Parameters

In this scheme, U2Net is used as the deep learning model. To train the network, the 10,553 images in the DUTS training set are first converted to grayscale and resized to 128 × 128. The pre-processed images are then reconstructed by simulated inverse Fourier transform in Matlab at sampling rates of 100%, 25%, 6.25%, 1.56%, and 0.39%, giving single-pixel images based on Fourier spectrum acquisition at the corresponding sampling rates. To save training time, only the 21,106 reconstructed images of size 128 × 128 pixels at sampling rates of 1.56% and 0.39% are used in the training phase. We train the network from scratch, with all convolutional layers initialized by Xavier initialization and the loss weights $w_{side}^{(m)}$ and $w_{fuse}$ both set to 1. We use the Adam optimizer with its hyperparameters set to their default values [initial learning rate lr = 1e-3, betas = (0.9, 0.999), eps = 1e-8, and weight decay = 0]. The loss function started to converge after 600k iterations (batch size 12), and the whole training process took about 120 hours. The program was run on a graphics processing unit (NVIDIA GeForce RTX 3060 with 12 GB of video memory) with Python 3.8.2 and PyTorch 1.8.2.

### 3.3. Evaluation Indicators

Since the F-measure score and the mean absolute error (MAE) are common metrics for evaluating salient object detection models, Table 2 gives the mean values of max Fβ and MAE over all test scenes at each sampling rate. A higher F-measure and a lower MAE indicate higher accuracy. The F-measure is denoted $F_\beta$ and is defined by Eq. 14:

$$F_\beta=\frac{(1+\beta^2)\times \mathrm{Precision}\times \mathrm{Recall}}{\beta^2\times \mathrm{Precision}+\mathrm{Recall}} \tag{14}$$

where Precision = |B∩G|/|B|; Recall = |B∩G|/|G|; B is the mask generated by binarizing the saliency map S with a threshold; G is the ground-truth saliency map; |·| denotes the number of non-zero entries; and β² is empirically set to 0.3. The MAE is defined by Eq. 15:

$$\mathrm{MAE}=\frac{1}{MN}\sum_{i=0}^{M-1}\sum_{j=0}^{N-1}|S(i,j)-G(i,j)| \tag{15}$$

TABLE 2 Values of max Fβ and mean absolute error (MAE) for the proposed scheme at different sampling rates

| Sampling rate | 100% | 25% | 6.25% | 1.56% | 0.39% |
|---|---|---|---|---|---|
| max Fβ | 0.4769 | 0.5011 | 0.5103 | 0.8901 | 0.4911 |
| MAE | 0.3457 | 0.2803 | 0.2781 | 0.1137 | 0.2906 |
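The two metrics can be sketched for a single image as follows. This is an illustrative implementation, not the evaluation code used in the paper: the function names are hypothetical, the ground truth is binarized at 0.5, and max Fβ is taken here as the per-image maximum over a sweep of thresholds, whereas benchmark protocols often average precision and recall over the dataset first.

```python
import numpy as np

def f_measure(saliency, ground_truth, threshold, beta2=0.3):
    """F_beta for one threshold; beta2 is beta squared (0.3 as in the text)."""
    B = saliency >= threshold        # binarized prediction
    G = ground_truth > 0.5           # binary ground truth (assumed cut at 0.5)
    overlap = np.count_nonzero(B & G)
    if overlap == 0:
        return 0.0
    precision = overlap / np.count_nonzero(B)
    recall = overlap / np.count_nonzero(G)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)

def max_f_measure(saliency, ground_truth, num_thresholds=256, beta2=0.3):
    """Largest F_beta over evenly spaced thresholds in [0, 1]."""
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    return max(f_measure(saliency, ground_truth, t, beta2) for t in thresholds)

def mae(saliency, ground_truth):
    """Mean absolute difference between saliency map and ground truth."""
    return float(np.mean(np.abs(saliency - ground_truth)))

S = np.array([[0.9, 0.8], [0.2, 0.1]])   # toy saliency map in [0, 1]
G = np.array([[1.0, 0.0], [0.0, 0.0]])   # toy ground truth
print(f_measure(S, G, threshold=0.5))    # precision 0.5, recall 1.0
print(max_f_measure(S, G))
print(mae(S, G))
```

With β² = 0.3 the score weights precision more heavily than recall, so a prediction that spills outside the object is penalized more than one that misses part of it.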

### 3.4. Simulation Analysis

In this paper, a scheme is proposed that reconstructs images based on a Fourier basis, trains a salient object detection model using the reconstructed images as the training set, and then tests it. The feasibility of the scheme is verified by computer simulation. The simulation verified that acquiring Fourier coefficients from the Fourier spectrum and reconstructing images by inverse Fourier transform is feasible in theory, and the performance of the trained U2Net model was observed on input images reconstructed at five sampling rates: 100%, 25%, 6.25%, 1.56%, and 0.39%. The 1,000 images in the ECSSD dataset were converted to grayscale and resized to 128 × 128 to serve as the test scenes to be imaged with a single pixel, and each test scene was reconstructed at the five sampling rates. The resulting 5,000 reconstructed test images of 128 × 128 pixels were divided into five batches by sampling rate and input to the trained U2Net-based model. The test results are shown in Fig. 8. At the 100% sampling rate, the salient object detection results were poor, and the edges of the detected target were wrinkled. Our preliminary explanation is that the training set was reconstructed at sampling rates of 1.56% and 0.39%, so images at the 100% sampling rate differ greatly from the training data, whereas the 25%, 6.25%, and 1.56% sampling rates give better detection results. Even a sampling rate as low as 0.39% roughly locates the salient region. Comparing the test results with the ground-truth saliency maps and calculating the max Fβ and MAE values (Table 2) shows that the model gives good results at 25%, 6.25%, and 1.56%, with particularly good performance at 1.56% (max Fβ = 0.8901, MAE = 0.1137).

Figure 8. Results obtained for different sample rate test sets.

Because the model performed well on the 1.56% test set, the salient object detection scheme in this paper was compared with other methods on images reconstructed at the 1.56% sampling rate. We selected the classical ITTI method and the PoolNet-based implementation of salient object detection used in [18]. For the ITTI method, we used images (of size 128 × 128) reconstructed from the ECSSD dataset at a sampling rate of 1.56% as test input. To make the comparison with PoolNet fair, we trained PoolNet with a ResNet-50 backbone on the same training set of 21,106 images (size 128 × 128) reconstructed from the DUTS dataset at sampling rates of 1.56% and 0.39%, and tested the resulting model with the same dataset; the comparison results are shown in Fig. 9. The scheme in this paper accurately detects the target region, and the max Fβ and MAE values in Table 3 also show that it performs far better than the other methods.

TABLE 3 Comparison of evaluation indexes of three methods under 1.56% sampling rate

| Evaluation index | This work | ITTI | PoolNet |
|---|---|---|---|
| max Fβ | 0.8901 | 0.2331 | 0.3583 |
| MAE | 0.1137 | 0.2257 | 0.4412 |

Figure 9. Comparison of different salient object detection methods at 1.56% sampling rate.

Single-pixel imaging based on the Fourier spectrum was found to give good imaging results. Moreover, the U2Net model tested in this scheme detects salient objects well in the reconstructed images: the trained model gives good detection results at sampling rates of 25% and below, although its detection performance is poor at high sampling rates. The best results are achieved on images with a 1.56% sampling rate, and good detection results are still obtained at the 0.39% sampling rate. The detection results of this scheme are superior to those of the traditional ITTI method and the deep learning PoolNet model.

### IV. CONCLUSION

In this paper, we discussed the implementation of salient object detection in single-pixel imaging systems and proposed a salient object detection scheme based on Fourier-basis image reconstruction and a deep learning model. Using the U2Net model, salient object detection is performed on images reconstructed from under-sampled data. The proposed scheme shows better detection results as well as robustness, providing a new approach to complex vision tasks for single-pixel imaging systems. The experimental results and analysis also demonstrate that the single-pixel salient object detection system adapts flexibly to different application requirements. More data can be measured in applications such as image segmentation and image synthesis, where well-defined boundaries are required, and the number of measurements can be reduced in applications such as visual tracking and target localization, where only a rough idea of the target’s location and area is needed. This not only improves the efficiency of the single-pixel imaging system, but also greatly reduces its overhead.

The authors declare no conflicts of interest.

### DATA AVAILABILITY

Data underlying the results presented in this paper are not publicly available at the time of publication, but may be obtained from the authors upon reasonable request.

Natural Science Foundation of Shanghai (Grant No. 18ZR1425800); the National Natural Science Foundation of China (Grant No. 61775140, 61875125).

1. Y. Bromberg, O. Katz, and Y. Silberberg, “Ghost imaging with a single detector,” Phys. Rev. A 79, 053840 (2009).
2. M. P. Edgar, G. M. Gibson, and M. J. Padgett, “Principles and prospects for single-pixel imaging,” Nat. Photonics 13, 13-20 (2019).
3. S. J. Hansen, “X-ray imaging system,” U.S. Patent 5,521,957A. (1996).
4. V. Cnudde and M. N. Boone, “High-resolution X-ray computed tomography in geosciences: A review of the current technology and applications,” Earth-Sci. Rev. 123, 1-17 (2013).
5. A. Fenster and D. B. Downey, “3-D ultrasound imaging: A review,” IEEE Eng. Med. Biol. Mag. 15, 41-51 (1996).
6. M. E. Phelps and J. C. Mazziotta, “Positron emission tomography: human brain function and biochemistry,” Science 228, 799-809 (1985).
7. D. L. Bailey, D. W. Townsend, P. E. Valk, and M. N. Maisey, Positron emission tomography (Springer, 2005).
8. T. A. Holly, B. G. Abbott, M. Al-Mallah, D. A. Calnon, M. C. Cohen, F. P. DiFilippo, E. P. Ficaro, M. R. Freeman, R. C. Hendel, D. Jain, S. M. Leonard, K. J. Nichols, D. M. Polk, and P. Soman, “Single photon-emission computed tomography,” J. Nucl. Cardiol. 17, 941-973 (2010).
9. A. Wagner, H. Mahrholdt, T. A. Holly, M. D. Elliott, M. Regenfus, M. Parker, F. J. Klocke, R. O. Bonow, R. J. Kim, and R. M. Judd, “Contrast-enhanced MRI and routine single photon emission computed tomography (SPECT) perfusion imaging for detection of subendocardial myocardial infarcts: an imaging study,” The Lancet 361, 374-379 (2003).
10. R. Hachamovitch, D. S. Berman, L. J. Shaw, H. Kiat, I. Cohen, J. A. Cabico, J. Friedman, and G. A. Diamond, “Incremental prognostic value of myocardial perfusion single photon emission computed tomography for the prediction of cardiac death,” Circulation 97, 535-543 (1998).
11. R. Hachamovitch, S. W. Hayes, J. D. Friedman, I. Cohen, and D. S. Berman, “Comparison of the short-term survival benefit associated with revascularization compared with medical therapy in patients with no prior coronary artery disease undergoing stress myocardial perfusion single photon emission computed tomography,” Circulation 107, 2900-2907 (2003).
12. B. B. Hu and M. C. Nuss, “Imaging with terahertz waves,” Opt. Lett. 20, 1716-1718 (1995).
13. D. M. Mittleman, M. Gupta, R. Neelamani, R. G. Baraniuk, J. V. Rudd, and M. Koch, “Recent advances in terahertz imaging,” Appl. Phys. B 68, 1085-1094 (1999).
14. W. L. Chan, K. Charan, D. Takhar, K. F. Kelly, R. G. Baraniuk, and D. M. Mittleman, “A single-pixel terahertz imaging system based on compressed sensing,” Appl. Phys. Lett. 93, 121105 (2008).
15. D. Shrekenhamer, C. M. Watts, and W. J. Padilla, “Terahertz single pixel imaging with an optically controlled dynamic spatial light modulator,” Opt. Express 21, 12507-12518 (2013).
16. C. M. Watts, D. Shrekenhamer, J. Montoya, G. Lipworth, J. Hunt, T. Sleasman, S. Krishna, D. R. Smith, and W. J. Padilla, “Terahertz compressive imaging with metamaterial spatial light modulators,” Nat. Photonics 8, 605-609 (2014).
17. Z. Ren, S. Gao, L.-T. Chia, and I. W.-H. Tsang, “Region-based saliency detection and its application in object recognition,” IEEE Trans. Circuits Syst. Video Technol. 24, 769-779 (2014).
18. Y. Li, J. Shi, L. Sun, X. Wu, and G. Zeng, “Single-pixel salient object detection via discrete cosine spectrum acquisition and deep learning,” IEEE Photonics Technol. Lett. 32, 1381-1384 (2020).
19. Z. Zhang, X. Ma, and J. Zhong, “Single-pixel imaging by means of Fourier spectrum acquisition,” Nat. Commun. 6, 6225 (2015).
20. D. Zhang, D. Meng, L. Zhao, and J. Han, “Bridging saliency detection to weakly supervised object detection based on self-paced curriculum learning,” in Proc. International Joint Conferences on Artificial Intelligence (NY, USA, Jul. 9-15, 2016), pp. 3538-3544.
21. Y. Wei, J. Feng, X. Liang, M.-M. Cheng, Y. Zhao, and S. Yan, “Object region mining with adversarial erasing: A simple classification to semantic segmentation approach,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (Honolulu, Hawaii, USA, Jul. 22-25, 2017), pp. 1568-1576.
22. X. Wang, S. You, X. Li, and H. Ma, “Weakly-supervised semantic segmentation by iteratively mining common object features,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (Salt Lake City, USA, Jul. 18-22, 2018), pp. 1354-1362.
23. G. Sun, W. Wang, J. Dai, and L. Van Gool, “Mining cross-image semantics for weakly supervised semantic segmentation,” in Proc. European Conference on Computer Vision (Glasgow, UK, Aug. 23-28, 2020), pp. 347-365.
24. L. Wang, H. Lu, Y. Wang, M. Feng, D. Wang, B. Yin, and X. Ruan, “Learning to detect salient objects with image-level supervision,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (Honolulu, Hawaii, USA, Jul. 21-26, 2017), pp. 136-145.
25. Q. Yan, L. Xu, J. Shi, and J. Jia, “Hierarchical saliency detection,” in Proc. Computer Vision and Pattern Recognition (Portland, OR, USA, Jun. 23-28, 2013), pp. 1155-1162.

Curr. Opt. Photon. 2022; 6(5): 463-472

Published online October 25, 2022 https://doi.org/10.3807/COPP.2022.6.5.463

## U2Net-based Single-pixel Imaging Salient Object Detection

Leihong Zhang1, Zimin Shen1 , Weihong Lin1, Dawei Zhang2,3

1College of Communication and Art design, University of Shanghai for Science and Technology, Shanghai 200093, China
2School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
3Shanghai Institute of Intelligent Science and Technology, Tongji University, Shanghai 200092, China

Correspondence to:*923722470@qq.com, ORCID 0000-0001-6699-1247

Received: April 15, 2022; Revised: June 22, 2022; Accepted: July 11, 2022

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

### Abstract

At certain wavelengths, single-pixel imaging is considered to be a solution that can achieve high-quality imaging and also reduce costs. However, imaging complex scenes is an overhead-intensive process for single-pixel imaging systems, so low efficiency and high consumption are the biggest obstacles to their practical application. Improving efficiency to reduce overhead is the solution to this problem. Salient object detection is usually used as a pre-processing step in computer vision tasks, mimicking human functions in complex natural scenes, to reduce overhead and improve efficiency by focusing on regions with a large amount of information. Therefore, in this paper, we explore salient object detection on images acquired by single-pixel imaging, and propose a scheme that reconstructs images based on Fourier bases and uses the U2Net model for salient object detection.

Keywords: Fourier transform, Salient object detection, Single-pixel imaging

### I. INTRODUCTION

Photos and videos in everyday life are usually created by capturing photons with digital sensors: ambient light reflects off an object, and a lens focuses it onto a sensor made up of tiny photosensitive elements, or pixels. The image is the pattern of light and dark spots formed by the reflected light. The most common digital camera, for example, consists of millions of pixels that form an image by detecting the intensity and color of light at each point in space. A 3D image can also be generated by placing several cameras around the object and photographing it from multiple angles, or by scanning the object with a stream of photons and reconstructing it in three dimensions. Regardless of the method used, the image is constructed by collecting spatial information about the scene. In contrast, single-pixel imaging is an imaging technique developed from computational ghost imaging [1]. It is based on the principle of correlation measurement and relies solely on the collection of light intensity information to image the object. Single-pixel imaging uses structured illumination on the illumination side and a single-pixel light intensity detector on the detection side to collect the signal. When the illumination structure changes, the corresponding change in the light intensity from the object reflects the degree of correlation between the illumination structure and the spatial information of the object. By continuously changing the illumination structure and accumulating the correlation information, an image of the object is finally obtained [2]. Since a single-pixel camera requires only light intensity detection at the detection end, its detector requirements are much lower than those of the array detectors used in ordinary imaging. Therefore, under some special conditions, such as wavelength bands where array detector technology is not mature, single-pixel imaging has great advantages.

For example, imaging through biological tissue and improving the spatial resolution of optical microscopy have been two major challenges in biomedical imaging. Non-visible imaging is the current solution to these challenges: wavelengths in the optical window of biological tissue (e.g. X-ray imaging [3], computed tomography (CT) imaging [4], ultrasound imaging [5], positron emission imaging [6, 7], single photon emission imaging [8-11], terahertz imaging [12-16], etc.) can effectively overcome the scattering of light by biological media, and wavelengths shorter than visible light can achieve higher spatial resolution. However, traditional optical imaging methods require an array detector, and array detectors that operate in non-visible wavelength bands are complex, difficult, and expensive to make, which makes it hard to improve the quality of non-visible imaging with traditional methods. As a new imaging method, single-pixel imaging is considered a route to high-quality non-visible imaging because it needs only a single point detector without spatial resolution to acquire the spatial information of the target object and reconstruct its image. However, single-pixel imaging technology has so far only reached the stage of being able to form images, and it still faces problems in practical applications. First, the imaging quality still lags behind that of traditional imaging techniques: because the reconstruction relies on random and statistical mathematics, the reconstructed image shows obvious noise. Second, the number of measurements is huge, which means that single-pixel imaging has low efficiency and high overhead. Many researchers have studied how to improve the efficiency of single-pixel imaging and proposed the Fourier-based single-pixel imaging technique [19], which is the most sampling-efficient technique so far. Li et al. [18] of Shanghai Jiao Tong University proposed single-pixel imaging based on a discrete cosine basis, with the sampling rate of the single-pixel imaging system decided according to the demands of the subsequent task. However, that scheme has low imaging efficiency, its deep learning model performs poorly, and it cannot maximize the overhead savings of the single-pixel imaging system.

Based on the study of the above issues, this paper proposes applying salient object detection to single-pixel imaging systems using image reconstruction based on a Fourier basis. Salient object detection is commonly used as a pre-processing step in many complex vision tasks, including target detection [17, 20], semantic segmentation [21-24], and image classification. It mimics the human ability to quickly capture the regions of attractive objects in complex natural scenes and to process these regions in detail while leaving relatively irrelevant information in other regions unprocessed, which is what allows the human visual system to work efficiently. Using salient object detection as an image pre-processing step in single-pixel imaging systems has important scientific significance and application value, especially in medical imaging and subsequent medical image processing. The proposed scheme not only improves the imaging efficiency of single-pixel imaging systems but also allows the sampling rate to be chosen according to the task, which greatly improves efficiency and reduces overhead. Because salient object detection serves as a pre-processing step for complex vision tasks, it is also particularly important to select a deep learning model that performs well, since its performance lays the foundation for subsequent tasks.

### 2.1. Single-pixel Imaging

Suppose the two-dimensional image of an object is $I \in \mathbb{R}^{n_1 \times n_2}$, containing a total of N pixels, where $N = n_1 \times n_2$. To acquire this image, a series of spatially resolved modulation mask patterns is loaded onto the digital micromirror device (DMD). For the modulation mask sequence $P = [P_1, P_2, \ldots, P_M] \in \mathbb{R}^{M \times n_1 \times n_2}$, where $P_i \in \mathbb{R}^{n_1 \times n_2}$ denotes the ith frame of the modulation mask and M denotes the number of modulation masks, the corresponding light intensity values $s = [s_1, s_2, \ldots, s_M] \in \mathbb{R}^{M}$ are captured by the bucket detector after the light interacts with the object. For convenience of expression, the two-dimensional image is flattened into a vector, i.e. $I \in \mathbb{R}^{N}$; similarly, the modulation mask sequence is represented as a two-dimensional matrix, i.e. $P \in \mathbb{R}^{M \times N}$, where each row represents one frame of the mask. The measurement process can then be written as Eq. 1:

$P I = s \qquad (1)$

Single-pixel imaging is a computational imaging method that uses the known modulation mask matrix and the sequence of detected measurement signals to solve for the target image. The target image can then be obtained from Eq. 2:

$I = P^{-1} s \qquad (2)$

However, to solve it according to this formula, it is necessary to ensure that the modulation mask matrix P is orthogonal and M = N.
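
The measurement model of Eq. 1 and the inversion of Eq. 2 can be illustrated with a small numerical sketch. This is a toy example and not the authors' code: the 4 × 4 scene, the random orthogonal mask matrix built with a QR factorization, and all variable names are assumptions chosen only to satisfy the conditions stated above (P orthogonal, M = N).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy scene: an n1 x n2 image flattened into a vector I of length N.
n1, n2 = 4, 4
N = n1 * n2
image = rng.random((n1, n2))
I = image.reshape(N)

# Hypothetical mask set: M = N orthonormal rows taken from the QR
# factorization of a random matrix. A physical DMD would display
# binary or grayscale patterns instead.
M = N
Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
P = Q.T  # each row is one flattened modulation mask

# Eq. 1: every bucket-detector reading is the inner product of one
# mask with the scene.
s = P @ I

# Eq. 2: for an orthogonal P the inverse is simply the transpose.
I_rec = P.T @ s
image_rec = I_rec.reshape(n1, n2)

print(np.allclose(image_rec, image))
```

When M < N or P is not orthogonal, the transpose no longer inverts the system, which is why the condition above matters.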

### 2.2. Fourier Basis Based Image Reconstruction

The Fourier transform decomposes an image into its spatial-frequency components. The two-dimensional continuous Fourier transform and its inverse are given by Eqs. 3 and 4, respectively:

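$C(f_x,f_y)=\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} I(x,y)\exp[-j\cdot 2\pi(f_x x+f_y y)]\,dx\,dy \qquad (3)$

$I(x,y)=\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} C(f_x,f_y)\exp[j\cdot 2\pi(f_x x+f_y y)]\,df_x\,df_y \qquad (4)$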

where x, y are the Cartesian coordinates in the spatial domain; fx, fy are the Cartesian coordinates in the Fourier domain, corresponding to the spatial frequencies in the x and y directions; I(x, y) denotes a two-dimensional image; C(fx, fy) denotes the Fourier spectrum of the two-dimensional image; and j is the imaginary unit.

Fourier analysis was originally developed for one-dimensional continuous signals, but it applies equally to two-dimensional discrete signals. Since a digital image is a two-dimensional discrete signal, Fourier analysis is also suitable for digital images. Equations 5 and 6 give the two-dimensional discrete Fourier transform and its inverse, respectively, for an image of M × N pixels (in this section M and N denote the image dimensions).

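$C(u,v)=\sum_{x=0}^{M-1}\sum_{y=0}^{N-1} I(x,y)\exp\left[-j2\pi\left(\frac{ux}{M}+\frac{vy}{N}\right)\right] \qquad (5)$

$I(x,y)=\frac{1}{MN}\sum_{u=0}^{M-1}\sum_{v=0}^{N-1} C(u,v)\exp\left[j2\pi\left(\frac{ux}{M}+\frac{vy}{N}\right)\right] \qquad (6)$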

where u, v are the discretized forms of the spatial frequencies fx, fy, respectively. According to Fourier analysis, any image is a linear combination of a series of Fourier base patterns. The weight corresponding to each Fourier base pattern is the Fourier coefficient C(u, v).
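
As a quick check of the discrete transform pair, the sums can be evaluated directly and compared with a library FFT. The sketch below is illustrative only; the function names are made up, and it assumes the common convention in which the inverse transform carries the 1/(MN) factor, which is also the default convention of `numpy.fft`.

```python
import numpy as np

def _kernel(n, sign):
    """n x n matrix with entries exp(sign * j * 2*pi * k * m / n)."""
    k = np.arange(n)
    return np.exp(sign * 2j * np.pi * np.outer(k, k) / n)

def dft2(image):
    """Forward 2-D DFT: direct evaluation of the double sum over x and y."""
    M, N = image.shape
    return _kernel(M, -1) @ image @ _kernel(N, -1)

def idft2(spectrum):
    """Inverse 2-D DFT: double sum over u and v, scaled by 1/(M*N)."""
    M, N = spectrum.shape
    return _kernel(M, +1) @ spectrum @ _kernel(N, +1) / (M * N)

rng = np.random.default_rng(0)
image = rng.random((6, 8))
C = dft2(image)            # Fourier coefficients C(u, v)
image_back = idft2(C)      # linear combination of Fourier basis patterns

print(np.allclose(C, np.fft.fft2(image)))
print(np.allclose(image_back.real, image))
```

The matrix form costs O(M²N + MN²) operations and is only meant to mirror the formulas; a practical implementation would call the FFT directly.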

Acquiring the Fourier spectrum means acquiring the weight of each Fourier basis pattern in the image, i.e. the Fourier coefficients. We generate the Fourier basis patterns on a computer, project each pattern onto the target object, and use a bucket detector to record the intensity of the resulting light signal. For each spatial frequency (fx, fy), four Fourier basis patterns P1, P2, P3, P4 with initial phases 0, π/2, π, and 3π/2 are projected onto the target object, as in Eq. 7.

$P_n(x,y;f_x,f_y)=a+b\cdot\cos(2\pi f_x x+2\pi f_y y+\varphi_n) \qquad (7)$

where n = 1, 2, 3, 4 corresponds to the initial phase φn = 0, π/2, π, 3π/2; a is the average light intensity; b is the contrast; x, y are the Cartesian coordinates of the plane in which the target object lies; and fx, fy are the spatial frequencies in the x and y directions, respectively. Applying trigonometric identities, Eq. 7 can be rewritten as Eq. 8.

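$\begin{aligned} P_1(x,y;f_x,f_y)&=a+b\cdot\cos(2\pi f_x x+2\pi f_y y),\\ P_2(x,y;f_x,f_y)&=a-b\cdot\sin(2\pi f_x x+2\pi f_y y),\\ P_3(x,y;f_x,f_y)&=a-b\cdot\cos(2\pi f_x x+2\pi f_y y),\\ P_4(x,y;f_x,f_y)&=a+b\cdot\sin(2\pi f_x x+2\pi f_y y). \end{aligned} \qquad (8)$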

Assume that the target object is reflective, and let R(x, y) denote its reflected intensity toward the single-pixel detector under the illumination direction used to project the Fourier basis patterns. Denoting the object image as I(x, y), the image is proportional to this reflectance distribution, I(x, y) ∝ R(x, y). The total reflected light intensity when the object is illuminated by the Fourier basis pattern P(x, y; fx, fy, φ) is given by Eq. 9.

$E_\phi(f_x,f_y)=\iint_S R(x,y)\cdot P(x,y;f_x,f_y,\phi)\,dx\,dy \qquad (9)$

where S is the projection region of the Fourier basis pattern. The optical response of the single-pixel detector is given by Eq. 10.

$D_\phi(f_x,f_y)=D_n+\beta\cdot E_\phi(f_x,f_y) \qquad (10)$

where Dn is the detector response caused by the background illumination at the detector position, and β is a factor related to the magnification of the single-pixel detector and the spatial relationship between the detector and the object. Combining the four Fourier basis pattern expressions of Eq. 8 with Eqs. 9 and 10 gives the response values D1, D2, D3, D4 in Eq. 11:

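$\begin{aligned} D_1(f_x,f_y)&=D_n+a\beta\iint_S R(x,y)\,dx\,dy+b\beta\iint_S R(x,y)\cdot\cos(2\pi f_x x+2\pi f_y y)\,dx\,dy,\\ D_2(f_x,f_y)&=D_n+a\beta\iint_S R(x,y)\,dx\,dy-b\beta\iint_S R(x,y)\cdot\sin(2\pi f_x x+2\pi f_y y)\,dx\,dy,\\ D_3(f_x,f_y)&=D_n+a\beta\iint_S R(x,y)\,dx\,dy-b\beta\iint_S R(x,y)\cdot\cos(2\pi f_x x+2\pi f_y y)\,dx\,dy,\\ D_4(f_x,f_y)&=D_n+a\beta\iint_S R(x,y)\,dx\,dy+b\beta\iint_S R(x,y)\cdot\sin(2\pi f_x x+2\pi f_y y)\,dx\,dy. \end{aligned} \qquad (11)$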

Using the four-step phase-shift method, the Fourier coefficient C(fx, fy) is obtained as in Eq. 12:

$\begin{aligned} C(f_x,f_y)&=[D_1(f_x,f_y)-D_3(f_x,f_y)]+j\cdot[D_2(f_x,f_y)-D_4(f_x,f_y)]\\ &=2b\beta\iint_S R(x,y)\cdot\{\cos[2\pi(f_x x+f_y y)]-j\cdot\sin[2\pi(f_x x+f_y y)]\}\,dx\,dy\\ &=2b\beta\iint_S R(x,y)\cdot\exp[-j\cdot 2\pi(f_x x+f_y y)]\,dx\,dy\\ &=2b\beta\cdot F\{R(x,y)\}. \end{aligned} \qquad (12)$

In summary, it can be seen that each Fourier coefficient C( fx, fy) can be obtained from the four Fourier base patterns with initial phases of 0, π/2, π, and 3π/2 , respectively.
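
The intermediate step behind this result can be written out explicitly. Subtracting the detector responses in pairs cancels both the background term Dn and the average-intensity term, under the assumption that the background stays constant over the four measurements:

$\begin{aligned} D_1-D_3&=2b\beta\iint_S R(x,y)\cos[2\pi(f_x x+f_y y)]\,dx\,dy,\\ D_2-D_4&=-2b\beta\iint_S R(x,y)\sin[2\pi(f_x x+f_y y)]\,dx\,dy,\\ (D_1-D_3)+j\,(D_2-D_4)&=2b\beta\iint_S R(x,y)\,e^{-j2\pi(f_x x+f_y y)}\,dx\,dy. \end{aligned}$

The last line follows from Euler's formula, cos θ − j sin θ = exp(−jθ), so the measured combination equals the Fourier transform of R(x, y) scaled by 2bβ.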

Accordingly, the object image can be reconstructed from the acquired Fourier spectrum by applying the inverse Fourier transform, as in Eq. 13.

$I(x,y)=F^{-1}\{C(f_x,f_y)\}=F^{-1}\{[D_1(f_x,f_y)-D_3(f_x,f_y)]+j\cdot[D_2(f_x,f_y)-D_4(f_x,f_y)]\} \qquad (13)$

where F−1{ } denotes the Fourier inverse transform.
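
The acquisition and reconstruction steps above can be simulated end to end on a small scene. The following is a minimal sketch under stated assumptions, not the authors' MATLAB implementation: the function name `fourier_spi`, the default values of a, b, β, and the background level, and the discretization fx = u/M, fy = v/N are all illustrative choices.

```python
import numpy as np

def fourier_spi(scene, a=0.5, b=0.5, beta=1.0, d_n=0.1):
    """Simulate four-step phase-shifting Fourier single-pixel imaging."""
    M, N = scene.shape
    x = np.arange(M).reshape(M, 1)
    y = np.arange(N).reshape(1, N)
    phases = (0.0, np.pi / 2, np.pi, 3 * np.pi / 2)
    C = np.zeros((M, N), dtype=complex)
    for u in range(M):
        for v in range(N):
            theta = 2 * np.pi * (u * x / M + v * y / N)
            # Detector readings D1..D4: background d_n plus beta times the
            # total light reflected under each phase-shifted pattern.
            D = [d_n + beta * np.sum(scene * (a + b * np.cos(theta + phi)))
                 for phi in phases]
            # Differential combination: d_n and the DC term cancel.
            C[u, v] = (D[0] - D[2]) + 1j * (D[1] - D[3])
    # Inverse transform; np.fft.ifft2 already includes the 1/(M*N) factor.
    # Dividing by the known gain 2*b*beta recovers the reflectance itself.
    return np.real(np.fft.ifft2(C)) / (2 * b * beta)

rng = np.random.default_rng(0)
scene = rng.random((8, 8))          # toy reflectance R(x, y)
rec = fourier_spi(scene)
print(np.max(np.abs(rec - scene)))  # reconstruction error
```

With a = b = 0.5 every pattern stays within [0, 1], so it could in principle be displayed as a grayscale illumination pattern. The double loop measures all M × N coefficients and ignores the conjugate symmetry discussed below, so it uses 4 × M × N simulated measurements.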

To reconstruct an image with a resolution of M × N pixels without distortion, all M × N Fourier coefficients in its Fourier spectrum must be acquired. Each Fourier coefficient requires projecting four phase-shifted Fourier basis patterns, i.e. four measurements. However, because the object image is a real-valued matrix, its Fourier spectrum is conjugate symmetric, so only about half of the coefficients need to be measured, and a total of 2 × M × N measurements suffices to acquire all M × N Fourier coefficients. In other words, for distortion-free reconstruction, the four-step phase-shift method takes twice as many measurements as there are pixels in the reconstructed image. The underlying reason is that the four-step phase-shift method is essentially a differential measurement.

The image energy is mainly concentrated in the low-frequency part of the Fourier spectrum, while the high-frequency Fourier coefficients have small or even near-zero magnitudes. Therefore, the object image can be reconstructed by projecting only the Fourier basis patterns of low spatial frequencies, acquiring the low-frequency coefficients, and setting the high-frequency coefficients directly to zero, which reduces the number of measurements.
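
Under-sampling by keeping only low frequencies can be sketched as follows. The shape of the retained region is not specified in the text, so this example assumes a centered square block of the shifted spectrum; for a 128 × 128 image, blocks of side 128, 64, 32, 16, and 8 give sampling ratios of 100%, 25%, 6.25%, 1.56%, and 0.39%. The function name and the random test image are hypothetical.

```python
import numpy as np

def lowpass_reconstruct(image, side):
    """Keep a centered side x side block of low-frequency coefficients."""
    M, N = image.shape
    # fftshift moves the zero-frequency (DC) coefficient to the center.
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    mask = np.zeros((M, N), dtype=bool)
    r0, c0 = (M - side) // 2, (N - side) // 2
    mask[r0:r0 + side, c0:c0 + side] = True
    # High-frequency coefficients outside the block are set to zero.
    truncated = np.where(mask, spectrum, 0)
    rec = np.real(np.fft.ifft2(np.fft.ifftshift(truncated)))
    return rec, mask.mean()  # reconstruction and fraction of kept coefficients

rng = np.random.default_rng(0)
image = rng.random((128, 128))
for side in (128, 64, 32, 16, 8):
    rec, ratio = lowpass_reconstruct(image, side)
    err = np.sqrt(np.mean((rec - image) ** 2))
    print(f"side={side:3d}  ratio={ratio:.4%}  rmse={err:.4f}")
```

A random image has a flat spectrum, so its error grows quickly as the block shrinks; natural images concentrate energy at low frequencies and degrade far more gracefully, which is the property the scheme relies on.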

### 2.3. Salient Object Detection

The human visual system can quickly acquire information from images. Its dynamic selection ability, called the visual attention mechanism, plays an important role in how humans process what they perceive around them. The visual attention mechanism allows people to quickly and accurately capture the part of an image that is of most interest, which is called the salient region. Figure 1 shows the result obtained after salient object detection on an image. The salient region often contains the most important information in the image. Conventional image processing operates on the entire image, which wastes computing power and time and makes processing inefficient. Salient object detection introduces the dynamic selection capability of human vision into image processing: it extracts the salient regions from the image and mainly processes those regions, which greatly improves efficiency. In many application scenarios of single-pixel imaging systems, overhead can be saved by selecting a sampling rate appropriate to the task. We therefore choose a deep learning model to perform salient object detection on the reconstructed images.

Figure 1. Salient object detection.

### 2.4. U2Net Model

Most current salient object detection networks extract deep features using a backbone trained for image classification. In the literature, the method used for salient object detection after single-pixel imaging is the PoolNet architecture with a ResNet-50 backbone. ResNet-50 was designed for image classification, so the features it extracts represent semantics rather than the local and global contrast information that matters most for salient object detection. Instead of using a model pre-trained for image classification as the backbone, the network used in this paper is a two-level nested U-shaped network trained from scratch. The network model is shown in Fig. 2.

Figure 2. U2Net model architecture.

The network consists of five encoder stages on the left side, five decoder stages on the right side, and one stage at the bottom, which are combined in a U-shaped structure; each stage also contains a U-shaped structure of its own. The U-shaped network inside each stage is called a residual U-block (RSU). The RSU module is shown in Fig. 3, and it consists of three main parts:

Figure 3. Residual U-block (RSU) module (a) RSU7, (b) RSU6, (c) RSU5, (d) RSU4, and (e) RSU4F.

(1) Input convolutional layer: the input feature map x is converted into an intermediate feature map F1(x) with C_out channels. This is an ordinary convolutional layer for extracting local features.

(2) U-Net-like feature extraction layer: the intermediate feature map F1(x) is used as the input, and multi-scale contextual information, denoted U(F1(x)), is extracted by a U-Net-like symmetric encoder-decoder, where U denotes that network. The parameter L controls the depth of the RSU module; the larger L is, the more pooling operations the module contains and the larger its receptive field.

(3) Feature fusion layer: local features and multi-scale features are fused through a residual connection, F1(x) + U(F1(x)).

For example, the RSU modules in Fig. 3 have 7, 6, 5, and 4 layers (RSU7, RSU6, RSU5, and RSU4). The more layers a module has, the larger its receptive field, and vice versa. Map 1 to Map 6 are the six groups of feature maps obtained from different RSU modules; they are fused to obtain the final feature map. Because the RSU modules have different depths, the final fused feature map contains rich global and local information.

The use of a double-layer nested U-shaped structure allows for more efficient extraction of multi-scale features within each phase and multi-level features in the aggregation phase.
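
The data flow of a residual U-block can be illustrated with a deliberately simplified, weight-free toy. This is not the U2Net implementation: learned convolutions are replaced by a fixed 3 × 3 mean filter, skip connections are merged by averaging rather than channel concatenation, only a single channel is used, and the `depth` argument simply counts pooling steps. All function names are hypothetical. The sketch only shows how an input layer, a small inner U-shaped path, and a residual sum fit together.

```python
import numpy as np

def box3(x):
    """3x3 mean filter with edge padding (stand-in for a learned convolution)."""
    p = np.pad(x, 1, mode="edge")
    h, w = x.shape
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def down2(x):
    """2x2 max pooling (height and width must be even)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def up2(x):
    """Nearest-neighbour upsampling by a factor of 2."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def rsu_like(x, depth):
    """Toy residual U-block: local features plus multi-scale features."""
    f1 = box3(x)                   # (1) input layer -> intermediate map
    skips, h = [], f1
    for _ in range(depth):         # (2) encoder path: filter, store skip, pool
        h = box3(h)
        skips.append(h)
        h = down2(h)
    h = box3(h)                    # bottom of the inner U
    for s in reversed(skips):      # (2) decoder path: upsample, merge, filter
        h = box3((up2(h) + s) / 2.0)
    return f1 + h                  # (3) residual fusion

rng = np.random.default_rng(0)
x = rng.random((32, 32))
out = rsu_like(x, depth=4)
print(out.shape)
```

Each extra pooling step halves the resolution at the bottom of the inner U, so a deeper block mixes information from a wider neighbourhood, which is the receptive-field effect described above.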

### 3.1. Dataset

The dataset used to train the U2Net model is the DUTS dataset after reconstruction from Fourier coefficients acquired in the Fourier spectrum. The process of image reconstruction based on the Fourier basis is shown in Fig. 4. The original DUTS [24] training set contains 10,553 images. We reconstructed some images in the training set at sampling rates of 100%, 25%, 6.25%, 1.56%, and 0.39%; the results are shown in Fig. 5. At sampling rates of 100%, 25%, and 6.25%, the reconstructed images are relatively clear. The average time required to reconstruct an image at each sampling rate is listed in Table 1. Reconstruction takes several seconds at the 100% and 25% sampling rates (11.3682 s and 2.8137 s), less than one second at 6.25% (0.6646 s), and only 0.1742 s and 0.0493 s at 1.56% and 0.39%. Lowering the sampling rate therefore greatly reduces the overhead of the single-pixel imaging system without significantly reducing the quality of the reconstructed image.

TABLE 1. Average time (s) required to reconstruct images at different sampling rates.

| Sampling rate | 100% | 25% | 6.25% | 1.56% | 0.39% |
|---|---|---|---|---|---|
| Time (s) | 11.3682 | 2.8137 | 0.6646 | 0.1742 | 0.0493 |

Figure 4. Fourier basis-based image reconstruction process.
Figure 5. Reconstructed images of partial training set at 100%, 25%, 6.25%, 1.56%, and 0.39% sampling rate.

A total of 21,106 images reconstructed at sampling rates of 1.56% and 0.39% (10,553 at each rate) were used to train the network. The data and labels used for training are shown in Fig. 6. The dataset used to test salient object detection is the ECSSD [25] dataset, reconstructed in the same way from Fourier coefficients. The original dataset contains 1,000 images; sampling rates of 100%, 25%, 6.25%, 1.56%, and 0.39% are used to form five test sets, which are input to the model for detection. The reconstructed ECSSD dataset is shown in Fig. 7.

Figure 6. Partial training set and labels.
Figure 7. Partial test set.

### 3.2. Experimental Parameters

In this scheme, U2Net is used as the deep learning model. To train the network, the 10,553 images in the DUTS training set are first converted to grayscale and resized to 128 × 128. The pre-processed images are then reconstructed in MATLAB by simulating Fourier spectrum acquisition followed by the inverse Fourier transform at sampling rates of 100%, 25%, 6.25%, 1.56%, and 0.39%, which yields simulated Fourier single-pixel images at each sampling rate. To save training time, only the 21,106 reconstructed images of 128 × 128 pixels at sampling rates of 1.56% and 0.39% are used in the training phase. We train the network from scratch, with all convolutional layers initialized by Xavier initialization and the loss weights $w_{side}^{(m)}$ and $w_{fuse}$ both set to 1. We use the Adam optimizer with its hyperparameters at default values [initial learning rate lr = 1e-3, betas = (0.9, 0.999), eps = 1e-8, and weight decay = 0]. The loss began to converge after 600k iterations (batch size 12), and the whole training process took about 120 hours. The program was run on a graphics processing unit (NVIDIA GeForce RTX 3060 with 12 GB of video memory) with Python 3.8.2 and PyTorch 1.8.2.
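
For context, the loss weights mentioned above refer to the deeply supervised training loss defined in the original U2Net paper. The form below follows that paper and is assumed, not confirmed by the text, to be the loss used here; the symbol K for the number of side outputs is our own notation, chosen to avoid a clash with M elsewhere in this article.

$\mathcal{L}=\sum_{m=1}^{K} w_{side}^{(m)}\,\ell_{side}^{(m)}+w_{fuse}\,\ell_{fuse}$

Each term is a binary cross-entropy between a predicted saliency map S and the ground truth G:

$\ell=-\sum_{(i,j)}\left[G(i,j)\log S(i,j)+(1-G(i,j))\log(1-S(i,j))\right]$

With K = 6 side outputs (Map 1 to Map 6) and all weights set to 1, the total loss is simply the sum of seven cross-entropy terms: six side-output losses plus the loss of the fused output.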

### 3.3. Evaluation Indicators

Since the F-measure and the mean absolute error (MAE) are common metrics for evaluating salient object detection models, Table 2 gives the mean values of max Fβ and MAE over all test scenes at each sampling rate. A higher F-measure and a lower MAE indicate higher accuracy. The F-measure is denoted Fβ and is defined in Eq. 14:

TABLE 2. max Fβ and mean absolute error (MAE) of the proposed scheme at different sampling rates.

| Sampling rate | 100% | 25% | 6.25% | 1.56% | 0.39% |
|---|---|---|---|---|---|
| max Fβ | 0.4769 | 0.5011 | 0.5103 | 0.8901 | 0.4911 |
| MAE | 0.3457 | 0.2803 | 0.2781 | 0.1137 | 0.2906 |

$F_\beta=\frac{(1+\beta^2)\times Precision\times Recall}{\beta^2\times Precision+Recall} \qquad (14)$

where Precision = |B∩G|/|B|; Recall = |B∩G|/|G|; B is the mask generated by binarizing the saliency map S with a threshold; G is the ground-truth saliency map; |·| denotes the number of non-zero entries; and β² is empirically set to 0.3. The expression for MAE is Eq. 15, where M and N are the height and width of the saliency map.

$MAE=\frac{1}{MN}\sum_{i=0}^{M-1}\sum_{j=0}^{N-1}|S(i,j)-G(i,j)| \qquad (15)$
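
Both metrics are straightforward to compute for a single image. The sketch below is illustrative rather than the evaluation code used for the tables: the function names, the sweep over 256 evenly spaced thresholds, and the per-image maximum are assumptions (benchmark toolkits often aggregate precision and recall over the whole dataset before taking the maximum).

```python
import numpy as np

def f_measure(S, G, threshold, beta2=0.3):
    """F-measure for one saliency map S in [0, 1] and binary ground truth G."""
    B = S >= threshold                  # binarized prediction mask
    G = G > 0.5                         # ground truth as a boolean mask
    tp = np.logical_and(B, G).sum()     # |B intersect G|
    if tp == 0:
        return 0.0                      # avoids 0/0 when nothing overlaps
    precision = tp / B.sum()
    recall = tp / G.sum()
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)

def max_f_measure(S, G, num_thresholds=256):
    """Largest F-measure over a sweep of binarization thresholds."""
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    return max(f_measure(S, G, t) for t in thresholds)

def mae(S, G):
    """Mean absolute error between saliency map and ground truth."""
    return float(np.mean(np.abs(S - G)))

# Tiny worked example: one true salient pixel, one false alarm at 0.7.
S = np.array([[0.9, 0.2],
              [0.7, 0.1]])
G = np.array([[1.0, 0.0],
              [0.0, 0.0]])
print(f_measure(S, G, threshold=0.5))
print(max_f_measure(S, G))
print(mae(S, G))
```

At a threshold of 0.5 the false alarm halves the precision, while a threshold between 0.7 and 0.9 removes it, which is why the maximum over thresholds is reported.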

### 3.4. Simulation Analysis

This paper proposes a scheme that reconstructs images based on a Fourier basis and then trains and tests a salient object detection model using the reconstructed images. The feasibility of the scheme is verified by computer simulation. The simulation verifies that images can be reconstructed by the inverse Fourier transform from Fourier coefficients acquired in the Fourier spectrum, and evaluates the trained U2Net model on reconstructed input images at five sampling rates: 100%, 25%, 6.25%, 1.56%, and 0.39%. The 1,000 images in the ECSSD dataset were converted to grayscale and resized to 128 × 128 to serve as the test scenes for single-pixel imaging. The test scenes were reconstructed at the five sampling rates, and the resulting 5,000 reconstructed images of 128 × 128 pixels were divided into five batches by sampling rate and input to the trained U2Net model. The test results are shown in Fig. 8. At the 100% sampling rate the detection results were poor, and wrinkle-like artifacts appeared along the edges of the target. Our preliminary explanation is that the training set contains only images at 1.56% and 0.39% sampling rates, which differ considerably from fully sampled images. Detection was better at 25%, 6.25%, and 1.56%, and even a sampling rate as low as 0.39% gives the approximate salient region. Comparing the test results with the ground-truth saliency maps and calculating max Fβ and MAE (Table 2) shows that the model gives its better results at 25%, 6.25%, and 1.56%, with by far the best performance at 1.56% (max Fβ = 0.8901, MAE = 0.1137).

Figure 8. Results obtained for different sample rate test sets.

Because the model performed well on the 1.56% test set, we compared the proposed scheme with other salient object detection methods on images reconstructed at the 1.56% sampling rate. We selected the classical ITTI method and the PoolNet-based implementation used in [18]. For the ITTI method, we used reconstructed images (128 × 128) from the ECSSD dataset at a sampling rate of 1.56% as test input. For a fair comparison with PoolNet, we trained it with a ResNet-50 backbone on the same 21,106 reconstructed images (128 × 128) from the DUTS dataset at 1.56% and 0.39% sampling rates, and tested the resulting model on the same test set. The comparison results are shown in Fig. 9. The proposed scheme accurately detects the target region, and the max Fβ and MAE values in Table 3 show that it outperforms the other two methods by a wide margin.

TABLE 3. Comparison of evaluation indexes of three methods under 1.56% sampling rate.

| Evaluation index | This work | ITTI | PoolNet |
|---|---|---|---|
| max Fβ | 0.8901 | 0.2331 | 0.3583 |
| MAE | 0.1137 | 0.2257 | 0.4412 |

Figure 9. Comparison of different salient object detection methods at 1.56% sampling rate.

These results show that single-pixel imaging based on Fourier spectrum acquisition yields usable reconstructions. Moreover, the U2Net model used in this scheme detects salient objects well in the reconstructed images at sampling rates of 25% and below, although its detection performance is poor at the 100% sampling rate. The best results are achieved on images at the 1.56% sampling rate, and the approximate salient region is still detected at 0.39%. The detection results of this scheme are superior to those of the traditional ITTI method and the deep learning PoolNet model.

### IV. CONCLUSION

In this paper, we discussed the implementation of salient object detection in single-pixel imaging systems and proposed a salient object detection scheme based on Fourier-basis reconstructed images and a deep learning model. Using the U2Net model, salient object detection is performed on images reconstructed from under-sampled data. The proposed scheme shows better detection results and robustness, providing a new approach for applying single-pixel imaging systems to complex vision tasks. The experimental results and analysis also demonstrate that the single-pixel salient object detection system adapts flexibly to different application requirements. More data can be measured in applications such as image segmentation and image synthesis, where well-defined boundaries are required, and the number of measurements can be reduced in applications such as visual tracking and target localization, where only a rough idea of the target's location and area is needed. This not only improves the efficiency of the single-pixel imaging system but also greatly reduces its overhead.

### DISCLOSURES

The authors declare no conflicts of interest.

### DATA AVAILABILITY

Data underlying the results presented in this paper are not publicly available at the time of publication, but may be obtained from the authors upon reasonable request.

### FUNDING

Natural Science Foundation of Shanghai (Grant No. 18ZR1425800); the National Natural Science Foundation of China (Grant No. 61775140, 61875125).

### Fig 1.

Figure 1.Salient object detection.
Current Optics and Photonics 2022; 6: 463-472https://doi.org/10.3807/COPP.2022.6.5.463

### Fig 2.

Figure 2.U2Net model architecture.
Current Optics and Photonics 2022; 6: 463-472https://doi.org/10.3807/COPP.2022.6.5.463

### Fig 3.

Figure 3.Residual U-block (RSU) module (a) RSU7, (b) RSU6, (c) RSU5, (d) RSU4, and (e) RSU4F.
Current Optics and Photonics 2022; 6: 463-472https://doi.org/10.3807/COPP.2022.6.5.463

### Fig 4.

Figure 4. Fourier basis-based image reconstruction process.

### Fig 5.

Figure 5. Reconstructed images from part of the training set at sampling rates of 100%, 25%, 6.25%, 1.56%, and 0.39%.

### Fig 6.

Figure 6. Part of the training set and the corresponding labels.

### Fig 7.

Figure 7. Part of the test set.

### Fig 8.

Figure 8. Results obtained for test sets at different sampling rates.

### Fig 9.

Figure 9. Comparison of different salient object detection methods at a 1.56% sampling rate.

TABLE 1 Average time (s) required to reconstruct images at different sampling rates

| Sampling rate | 100% | 25% | 6.25% | 1.56% | 0.39% |
| --- | --- | --- | --- | --- | --- |
| Time (s) | 11.3682 | 2.8137 | 0.6646 | 0.1742 | 0.0493 |
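
The sampling rates in Table 1 are successive quarters of the full spectrum (1, 1/4, 1/16, 1/64, 1/256). As an illustration only, the following sketch simulates under-sampled Fourier reconstruction on a digital image. It assumes that under-sampling keeps a centred square of low-frequency coefficients, and it replaces the optical acquisition of the Fourier coefficients with a numerical FFT; the function name `undersample_fourier` and the mask shape are hypothetical and not taken from the paper.

```python
import numpy as np


def undersample_fourier(image, side_fraction):
    """Reconstruct `image` from a centred square of low-frequency coefficients.

    Assumption: the retained coefficients form a square whose side is
    `side_fraction` of the image side, so the sampling rate is roughly
    side_fraction ** 2. Returns (reconstruction, actual sampling rate).
    """
    h, w = image.shape
    # Shift the zero-frequency (DC) term to the centre of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(image))

    kh = max(1, int(round(h * side_fraction)))
    kw = max(1, int(round(w * side_fraction)))
    top, left = (h - kh) // 2, (w - kw) // 2

    # Boolean mask marking the coefficients that are "measured".
    mask = np.zeros((h, w), dtype=bool)
    mask[top:top + kh, left:left + kw] = True

    # Zero the unmeasured coefficients and invert; keep the real part,
    # since the truncated spectrum is not exactly conjugate-symmetric.
    recon = np.fft.ifft2(np.fft.ifftshift(spectrum * mask)).real
    return recon, float(mask.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    test_image = rng.random((256, 256))  # stand-in for a real scene
    for fraction in (1.0, 0.5, 0.25, 0.125, 0.0625):
        _, rate = undersample_fourier(test_image, fraction)
        print(f"side fraction {fraction}: sampling rate {100 * rate:.2f}%")
```

For a 256 × 256 image, side fractions of 1, 1/2, 1/4, 1/8, and 1/16 give the five sampling rates listed in Table 1. Fewer retained coefficients mean fewer measurements, which is consistent with the shorter reconstruction times reported there.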

TABLE 2 Maximum F-measure (max Fβ) and mean absolute error (MAE) of the proposed scheme at different sampling rates

| Sampling rate | 100% | 25% | 6.25% | 1.56% | 0.39% |
| --- | --- | --- | --- | --- | --- |
| max Fβ | 0.4769 | 0.5011 | 0.5103 | 0.8901 | 0.4911 |
| MAE | 0.3457 | 0.2803 | 0.2781 | 0.1137 | 0.2906 |
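
For readers who want to reproduce this kind of evaluation, the sketch below shows one common way to compute the two metrics for a single saliency map. It is not the authors' evaluation code. It assumes a saliency map with values in [0, 1], a binary ground-truth mask, the weight β² = 0.3 that is conventional in the salient object detection literature, and a per-image maximum over 256 thresholds; benchmarks often average precision and recall over the whole data set before taking the maximum. The function names are hypothetical.

```python
import numpy as np


def mean_absolute_error(saliency, ground_truth):
    """Mean absolute pixel difference between a saliency map and its mask."""
    saliency = np.asarray(saliency, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    return float(np.mean(np.abs(saliency - ground_truth)))


def max_f_beta(saliency, ground_truth, beta_sq=0.3, num_thresholds=256):
    """Largest F-beta score over a sweep of binarisation thresholds.

    beta_sq = 0.3 is an assumed, conventional choice that weights
    precision more heavily than recall.
    """
    saliency = np.asarray(saliency, dtype=float)
    gt = np.asarray(ground_truth) > 0.5
    best = 0.0
    for threshold in np.linspace(0.0, 1.0, num_thresholds):
        pred = saliency >= threshold
        true_pos = np.logical_and(pred, gt).sum()
        if true_pos == 0:
            continue  # precision and recall are both zero here
        precision = true_pos / pred.sum()
        recall = true_pos / gt.sum()
        f_beta = (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
        best = max(best, float(f_beta))
    return best


if __name__ == "__main__":
    mask = np.zeros((8, 8))
    mask[2:6, 2:6] = 1.0          # toy ground truth: a 4 x 4 square
    noisy = 0.8 * mask + 0.1      # toy prediction: 0.9 inside, 0.1 outside
    print("MAE:", mean_absolute_error(noisy, mask))
    print("max F-beta:", max_f_beta(noisy, mask))
```

A lower MAE and a higher max Fβ both indicate closer agreement with the ground truth.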

TABLE 3 Comparison of evaluation indexes of three methods under 1.56% sampling rate

| Evaluation index | This work | ITTI | PoolNet |
| --- | --- | --- | --- |
| max Fβ | 0.8901 | 0.2331 | 0.3583 |
| MAE | 0.1137 | 0.2257 | 0.4412 |

### References

1. Y. Bromberg, O. Katz, and Y. Silberberg, “Ghost imaging with a single detector,” Phys. Rev. A 79, 053840 (2009).
2. M. P. Edgar, G. M. Gibson, and M. J. Padgett, “Principles and prospects for single-pixel imaging,” Nat. Photonics 13, 13-20 (2019).
3. S. J. Hansen, “X-ray imaging system,” U.S. Patent 5,521,957A. (1996).
4. V. Cnudde and M. N. Boone, “High-resolution X-ray computed tomography in geosciences: A review of the current technology and applications,” Earth-Sci. Rev. 123, 1-17 (2013).
5. A. Fenster and D. B. Downey, “3-D ultrasound imaging: A review,” IEEE Eng. Med. Biol. Mag. 15, 41-51 (1996).
6. M. E. Phelps and J. C. Mazziotta, “Positon emission tomography: human brain function and biochemistry,” Science 228, 799-809 (1985).
7. D. L. Bailey, D. W. Townsend, P. E. Valk, and M. N. Maisey, Positron emission tomography (Springer, 2005).
8. T. A. Holly, B. G. Abbott, M. Al-Mallah, D. A. Calnon, M. C. Cohen, F. P. DiFilippo, E. P. Ficaro, M. R. Freeman, R. C. Hendel, D. Jain, S. M. Leonard, K. J. Nichols, D. M. Polk, and P. Soman, “Single photon-emission computed tomography,” J. Nucl. Cardiolo. 17, 941-973 (2010).
9. A. Wagner, H. Mahrholdt, T. A. Holly, M. D. Elliott, M. Regenfus, M. Parker, F. J. Klocke, R. O. Bonow, R. J. Kim, and R. M. Judd, “Contrast-enhanced MRI and routine single photon emission computed tomography (SPECT) perfusion imaging for detection of subendocardial myocardial infarcts: an imaging study,” The Lancet 361, 374-379 (2013).
10. R. Hachamovitch, D. S. Berman, L. J. Shaw, H. Kiat, I. Cohen, J. A. Cabico, J. Friedman, and G. A. Diamond, “Incremental prognostic value of myocardial perfusion single photon emission computed tomography for the prediction of cardiac death,” Circulation 97, 535-543 (1998).
11. R. Hachamovitch, S. W. Hayes, J. D. Friedman, I. Cohen, and D. S. Berman, “Comparison of the short-term survival benefit associated with revascularization compared with medical therapy in patients with no prior coronary artery disease undergoing stress myocardial perfusion single photon emission computed tomography,” Circulation 107, 2900-2907 (2003).
12. B. B. Hu and M. C. Nuss, “Imaging with terahertz waves,” Opt. Lett. 20, 1716-1718 (1995).
13. D. M. Mittleman, M. Gupta, R. Neelamani, R. G. Baraniuk, J. V. Rudd, and M. Koch, “Recent advances in terahertz imaging,” Appl. Phys. B 68, 1085-1094 (1999).
14. W. L. Chan, K. Charan, D. Takhar, K. F. Kelly, R. G. Baraniuk, and D. M. Mittleman, “A single-pixel terahertz imaging system based on compressed sensing,” Appl. Phys. Lett. 93, 121105 (2008).
15. D. Shrekenhamer, C. M. Watts, and W. J. Padilla, “Terahertz single pixel imaging with an optically controlled dynamic spatial light modulator,” Opt. Express 21, 12507-12518 (2013).
16. C. M. Watts, D. Shrekenhamer, J. Montoya, G. Lipworth, J. Hunt, T. Sleasman, S. Krishna, D. R. Smith, and W. J. Padilla, “Terahertz compressive imaging with metamaterial spatial light modulators,” Nat. Photonics 8, 605-609 (2014).
17. Z. Ren, S. Gao, L.-T. Chia, and I. W.-H. Tsang, “Region-based saliency detection and its application in object recognition,” IEEE Trans. Circuits Syst. Video Technol. 24, 769-779 (2014).
18. Y. Li, J. Shi, L. Sun, X. Wu, and G. Zeng, “Single-pixel salient object detection via discrete cosine spectrum acquisition and deep learning,” IEEE Photonics Technol. Lett. 32,1381-1384 (2020).
19. Z. Zhang, X. Ma, and J. Zhong, “Single-pixel imaging by means of Fourier spectrum acquisition,” Nat. Commun. 6, 6225 (2015).
20. D. Zhang, D. Meng, L. Zhao, and J. Han, “Bridging saliency detection to weakly supervised object detection based on self-paced curriculum learning,” in Proc. International Joint Conferences on Artificial Intelligence (NY, USA, Jul. 9-15, 2016), pp. 3538-3544.
21. Y . Wei, J. Feng, X. Liang, M.-M. Cheng, Y . Zhao, and S. Yan, “Object region mining with adversarial erasing: A simple classi-fication to semantic segmentation approach,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (Honolulu, Hawaii, USA, Jul. 22-25, 2017), pp. 1568-1576.
22. X. Wang, S. You, X. Li, and H. Ma, “Weakly-supervised semantic segmentation by iteratively mining common object features,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (Salt Lake City, USA, Jul. 18-22, 2018), pp. 1354-1362.
23. G. Sun, W. Wang, J. Dai, and L. Van Gool, “Mining cross-image semantics for weakly supervised semantic segmentation,” in Proc. European Conference on Computer Vision (Glasgow, UK, Aug. 23-28, 2020), pp. 347-365.
24. L. Wang, H. Lu, Y. Wang, M. Feng, D. Wang, B. Yin, and X. Ruan, “Learning to detect salient objects with image-level supervision,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (Honolulu, Hawaii, USA, Jul. 21-26, 2017), pp. 136-145.
25. Q. Yan, L. Xu, J. Shi, and J. Jia, “Hierarchical saliency detection,” in Proc. Computer Vision and Pattern Recognition (Portland, OR, USA, Jun. 23-28, 2013), pp. 1155-1162.
