Current Optics and Photonics
Curr. Opt. Photon. 2024; 8(5): 463-471
Published online October 25, 2024 https://doi.org/10.3807/COPP.2024.8.5.463
Copyright © Optical Society of Korea.
Kai Liu1, Leihong Zhang1, Runchu Xu1, Dawei Zhang1, Haima Yang1, Quan Sun2
1School of Optical-electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
2College of Advanced Interdisciplinary Studies, National University of Defense Technology, Changsha 410073, China
Corresponding author: *lhzhang@usst.edu.cn, ORCID 0000-0002-1787-2978
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Multimode fibers (MMFs) possess high information throughput and small core diameter, making them highly promising for applications such as endoscopy and communication. However, modal dispersion hinders the direct use of MMFs for image transmission. By training neural networks on time-series waveforms collected from MMFs, it is possible to reconstruct images, transforming blurred speckle patterns into recognizable images. This paper proposes a fully convolutional neural-network model, MSMDFNet, for image restoration in MMFs. The network employs an encoder-decoder architecture, integrating multiscale convolutional modules in the decoding layers to enhance the receptive field for feature extraction. Additionally, attention mechanisms are incorporated from both the spatial and channel dimensions to improve the network’s feature-perception capabilities. The algorithm demonstrates excellent performance on MNIST and Fashion-MNIST datasets collected through MMFs, showing significant improvements in various metrics such as SSIM.
Keywords: Deep learning, Multi-dimensional attention mechanism, Multimode fiber image reconstruction, Multi-scale convolution
OCIS codes: (060.2350) Fiber optics imaging; (100.3020) Image reconstruction-restoration
As a high-capacity device for information transmission, multimode fiber (MMF) has garnered increasing attention and application in areas such as endoscopy [1–4], optical imaging [5, 6], and fiber-optic communication [7]. Compared to single-mode and few-mode fibers, MMFs boast advantages such as rich modal diversity, high information capacity, and large numerical aperture [8, 9]. Additionally, their low manufacturing cost and ease of connectivity make them attractive for endoscopic medical diagnostics. However, MMF is a typical scattering medium in which different modes exhibit varying group velocities, leading to modal dispersion during signal transmission. Images transmitted through an MMF therefore appear as speckle patterns with spatially random amplitude and phase distributions, making the original images unrecognizable [10–12]. Furthermore, MMF imaging is highly sensitive to environmental conditions; factors such as temperature, length, bending, and vibration significantly affect speckle imaging, posing challenges to the application of MMFs in image transmission. Therefore, understanding the transmission and mapping relationships within MMFs is crucial to advancing MMF imaging research.
Traditional optical methods for MMF image transmission have yielded some results. Techniques such as digital phase conjugation [13–15], spot-scanning imaging [16], and the transmission matrix [17, 18] have been employed to overcome modal dispersion, fiber coupling, and other external factors affecting MMF imaging. However, these mathematical models are often constrained by the uncertainty of environmental interference and require pre-calibration before each experiment to model the nonlinear relationships within MMFs, making image-information recovery challenging. Thus, there is a need to explore methods that do not rely on phase information or pre-modulation to overcome the challenges of MMF imaging.
The successful application of deep learning in optical imaging has spurred rapid development in imaging through scattering media, offering new possibilities for MMF endoscopy [19–24]. Rahmani et al. [25] demonstrated that convolutional neural networks (CNNs) could learn the nonlinear mapping relationships between MMF inputs and outputs. Fan et al. [26] found that CNNs exhibited excellent generalization capabilities across various MMF transmission states. Resisi et al. [27] used deep-learning methods to successfully reconstruct input images from the output speckles of perturbed MMFs. Hu et al. [28] achieved high-fidelity, full-color cellular imaging using unsupervised learning methods, which do not require paired images and allow for more flexible calibration.
However, traditional CNN models may not effectively restore certain image features, necessitating improvements to the original methods. For image restoration through scattering media, the U-net network [29] is typically employed for model training. However, conventional U-net struggles with complex image details, as it cannot focus on the critical pixels during feature extraction, and its single-scale convolution limits effective extraction of key features. Additionally, the physical instrumentation for MMF transmission is relatively complex, making this task more specialized compared to other image-restoration tasks. When the input image is converted into a two-dimensional (2D) signal, some features of the original image are lost due to compression and data processing, leading to key information missing from 2D image restoration and hindering accurate reconstruction of the original image.
To address these issues, this paper proposes a novel, fully convolutional neural network, MSMDFNet, which uses data collected from a self-built MMF experimental platform for reconstruction tasks. MSMDFNet employs an encoder-decoder architecture, replacing the pooling layers in the encoder with convolutional layers to expand the receptive field and enhance feature-extraction capabilities. Additionally, multiscale convolutional modules and multidimensional attention mechanisms are introduced in the decoding layers to ensure efficient and accurate image restoration, improving the network’s transfer-learning ability. In our experiments, MSMDFNet outperforms traditional networks such as U-net across multiple evaluation metrics, demonstrating strong adaptability and robustness. Its potential applications in medicine and communication could be significant.
The MMF speckle-collection system is illustrated in Fig. 1. The light source is a 633-nm He-Ne laser. After emission, the laser is focused onto a pinhole by the first objective lens, with subsequent lenses used for filtering, collimation, and beam expansion. The laser beam then passes through a polarizer and a beam-splitter cube before illuminating the spatial light modulator (SLM), which carries the image information. The light reflected from the SLM is directed to an objective lens and coupled into the input end of the MMF (core diameter 200 µm, NA 0.22, M122L01; Thorlabs, NJ, USA). The light carrying the object’s information undergoes complex internal transformations within the MMF and exits at the output end, where an objective lens modifies its divergence angle. A charge-coupled device (CCD) camera then captures the speckle pattern emerging from the MMF through the objective lens.
The traditional U-net model employs convolution and downsampling in its encoding layers to extract image features and reduce dimensionality, followed by upsampling in the decoding layers to increase image size and obtain deeper features. Each convolutional block in the encoding layer is connected to a corresponding convolutional block in the decoding layer, allowing the fusion of low-level features from the early stages with high-level features from the later stages. This integration of multilevel information and global connections effectively enhances the network’s ability to reconstruct speckle images.
In this study, considering that the single-scale convolutions of the traditional U-net may not fully capture critical image information and tend to overlook features lying between the local and global scales, we introduce a multiscale feature-fusion module. This module replaces the convolutional layers of the decoder in the original U-net with parallel combinations of convolutional layers using 1 × 1, 3 × 3, and 5 × 5 kernels. The multiscale feature-fusion module captures details and features of the image at different scales; by fusing information from various scales it improves the performance of image-reconstruction tasks, enhancing the model’s robustness and accuracy.
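A rough PyTorch sketch of such a parallel-kernel block is given below; the even channel split across the three branches and the BN/ReLU fusion step are assumptions, not details specified in the paper.

```python
import torch
import torch.nn as nn

class MultiScaleConv(nn.Module):
    """Parallel 1x1, 3x3, and 5x5 convolutions whose outputs are
    concatenated along the channel dimension (illustrative sketch)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        branch_ch = out_ch // 3  # assumed roughly even split across branches
        self.b1 = nn.Conv2d(in_ch, branch_ch, kernel_size=1)
        self.b3 = nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=1)
        self.b5 = nn.Conv2d(in_ch, out_ch - 2 * branch_ch, kernel_size=5, padding=2)
        self.fuse = nn.Sequential(nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        # Padding keeps the same spatial size on every branch,
        # so the three outputs can be concatenated channel-wise.
        y = torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1)
        return self.fuse(y)
```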
Additionally, a multidimensional attention-mechanism module is incorporated into the network’s decoder. This module aims to learn the attention weights of upsampled data and to assign optimal weights and biases, enabling the network to focus on key information and suppress irrelevant information. The learned feature maps are skip-connected with the corresponding layers in the upsampling process, facilitating multilevel feature fusion. This approach not only enhances the extraction of critical features, but also ensures the invariance of key feature information on a global scale. The multidimensional attention-mechanism module learns and extracts features from both channel and spatial dimensions, offering better results compared to attention mechanisms focusing solely on channels.
Moreover, our network employs a fully convolutional architecture. In downsampling, single-layer convolutions replace pooling layers, and in upsampling, transposed convolutions are used for information extraction from feature maps. Convolutional layers have a larger receptive field compared to pooling layers, helping to capture broader contextual information. Convolutional layers can learn more complex features and minimize information loss, whereas pooling layers simply reduce the size of feature maps, potentially losing some detail. Consequently, the MSMDFNet model can efficiently learn valuable information from the collected data, yielding significantly better image reconstruction than the traditional U-net model.
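A minimal sketch of these two building blocks follows; the 3 × 3 downsampling kernel, the 2 × 2 transposed-convolution kernel, and the BN/ReLU pairing are assumptions, since the paper fixes only the use of stride-2 convolutions in place of pooling and transposed convolutions for upsampling.

```python
import torch.nn as nn

def down_block(in_ch, out_ch):
    """Strided convolution used instead of pooling: halves H and W."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

def up_block(in_ch, out_ch):
    """Transposed convolution used for upsampling: doubles H and W."""
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```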
The structure of the MSMDFNet network model is illustrated in Fig. 2. The first layer of the network inputs a single-channel 128 × 128 image, which transforms into a 64-channel feature map after passing through the initial convolutional layer. In the encoder, pooling layers in downsampling are replaced by convolutional layers with a stride of 2, reducing the width and height of feature maps by half while learning more complex features. After a series of downsampling steps, a 1,024-channel feature map of size 8 × 8 is obtained. During decoding, the upsampling and multiscale convolution operations, along with the multidimensional attention mechanism shown in Fig. 3, connect the corresponding downsampling layers to the current layers, progressively restoring image details and improving image accuracy.
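As a quick check of these dimensions, a bare-bones encoder could look like the sketch below; the channel doubling at every stage and the 3 × 3 kernels are assumptions, with only the stride-2 downsampling and the 128 × 128 → 8 × 8, 64 → 1,024 progression stated above.

```python
import torch
import torch.nn as nn

# Assumed channel progression 1 -> 64 -> 128 -> 256 -> 512 -> 1024;
# each stride-2 convolution halves the spatial size: 128 -> 64 -> 32 -> 16 -> 8.
encoder = nn.Sequential(
    nn.Conv2d(1, 64, 3, padding=1),               # 64   x 128 x 128
    nn.Conv2d(64, 128, 3, stride=2, padding=1),   # 128  x 64  x 64
    nn.Conv2d(128, 256, 3, stride=2, padding=1),  # 256  x 32  x 32
    nn.Conv2d(256, 512, 3, stride=2, padding=1),  # 512  x 16  x 16
    nn.Conv2d(512, 1024, 3, stride=2, padding=1), # 1024 x 8   x 8
)

print(encoder(torch.randn(1, 1, 128, 128)).shape)  # torch.Size([1, 1024, 8, 8])
```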
Figure 2(a) depicts the multiscale and multidimensional fusion module. C1 undergoes downsampling and a convolutional layer to obtain C2, followed by upsampling to yield C3. C3 and C1 serve as inputs to the multidimensional attention-mechanism module. Figure 3 illustrates the structure of this module, in which the input passes through sequential channel- and spatial-attention modules to produce the output. The output is concatenated with C1 to obtain C4, which then passes through the multiscale convolution module shown in Fig. 2(b) to yield C5.

The input feature map first enters the channel-attention module, where average pooling and max pooling are performed along the spatial dimensions to obtain two one-dimensional feature vectors, representing the global average features and the globally most salient features respectively. These pooled vectors are mapped to a compact space through shared fully connected layers to generate two channel-weight vectors, which are summed and normalized with a sigmoid activation to obtain the attention weight for each channel. This weight reweights the original feature map, highlighting key channel features and suppressing less important ones. The weighted feature map then enters the spatial-attention module, where global average pooling and max pooling are performed along the channel dimension to obtain two feature maps. These are concatenated along the channel dimension and convolved to generate a two-dimensional spatial-attention weight map, which is normalized via a sigmoid activation and used to reweight the input feature map, thereby highlighting key spatial information.

The multiscale convolution module consists of parallel single-layer convolutions with kernel sizes of 1 × 1, 3 × 3, and 5 × 5, whose outputs are concatenated along the channel dimension. The three kernel scales provide a broader receptive field for the network. Additionally, a convolutional layer connects the attention-mechanism module end to end, to better preserve the detailed textures of the feature maps. During decoding, the network thus improves its generalization ability and optimization efficiency while extracting more complex feature information.
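The channel- and spatial-attention sequence described above can be sketched as follows; the reduction ratio of the shared fully connected layers and the 7 × 7 spatial-attention kernel are assumed values, not parameters given in the paper.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Sequential channel attention then spatial attention (illustrative sketch)."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # Shared MLP applied to both the average-pooled and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=spatial_kernel,
                                      padding=spatial_kernel // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        # ----- Channel attention -----
        avg = self.mlp(x.mean(dim=(2, 3)))             # global average descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))              # global max descriptor
        ca = torch.sigmoid(avg + mx).view(b, c, 1, 1)  # per-channel weights
        x = x * ca
        # ----- Spatial attention -----
        avg_map = x.mean(dim=1, keepdim=True)          # pool along channels
        max_map = x.amax(dim=1, keepdim=True)
        sa = torch.sigmoid(self.spatial_conv(torch.cat([avg_map, max_map], dim=1)))
        return x * sa
```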
In our model, batch normalization (BN) and rectified linear unit (ReLU) layers are added after some convolutional layers. The BN layer helps standardize the data, and the ReLU layer prevents overfitting and reduces computational load. After a series of upsampling steps, the multichannel feature maps are compressed and optimized using a single-layer convolution and the tanh activation function, producing a single-channel image with a resolution of 128 × 128.
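A minimal sketch of this output head is shown below, assuming the last decoder stage produces 64 channels and the final single-layer convolution uses a 1 × 1 kernel; neither detail is stated explicitly above.

```python
import torch.nn as nn

# Compress the final multichannel feature map (64 channels assumed) to one channel.
output_head = nn.Sequential(
    nn.Conv2d(64, 1, kernel_size=1),
    nn.Tanh(),  # single-channel 128 x 128 reconstruction, values in [-1, 1]
)
```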
In our experiments, the network model is trained on a high-performance workstation equipped with an NVIDIA RTX 3090 GPU, an Intel i7-11700K CPU, and 128 GB of RAM. Python 3.9.18 and PyTorch 1.12 are used to implement the model’s algorithm. Training is conducted over 200 iterations, with Adam with weight decay (AdamW) chosen as the optimizer. Each iteration uses a batch size of 20 samples, and the learning rate is set to 1 × 10−3. The mean squared error (MSE) is selected as the loss function, defined by Eq. (1):
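Assuming the conventional definition, the loss in Eq. (1) takes the form

$$ \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^{2}, \qquad (1) $$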
where n is the number of samples, Yi represents the true values, and Ŷi represents the predicted values. Using MSE as the loss function facilitates faster convergence to the minimum value, even when the learning rate remains fixed.
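A minimal training loop consistent with these settings is sketched below; the random tensors stand in for the collected speckle/label pairs, the single convolution stands in for MSMDFNet, and reading the 200 iterations as 200 passes over the training set is an assumption.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data: 2,000 speckle/label pairs of size 1 x 128 x 128 (assumed layout).
speckles = torch.randn(2000, 1, 128, 128)
labels = torch.randn(2000, 1, 128, 128)
loader = DataLoader(TensorDataset(speckles, labels), batch_size=20, shuffle=True)

model = nn.Sequential(nn.Conv2d(1, 1, 3, padding=1))  # stand-in for MSMDFNet
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

for epoch in range(200):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```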
To evaluate the quality of speckle image reconstruction, the structural similarity (SSIM) index is chosen, as it more closely aligns with human visual perception. SSIM defines structural information from the perspective of image composition, independent of brightness and contrast, and effectively reflects the structural attributes of objects in the scene, as shown in Eq. (2) below:
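In its standard form, consistent with the definitions that follow, the index reads

$$ \mathrm{SSIM}(A, B) = \frac{(2\mu_A\mu_B + c_1)(2\sigma_{AB} + c_2)}{\left(\mu_A^{2} + \mu_B^{2} + c_1\right)\left(\sigma_A^{2} + \sigma_B^{2} + c_2\right)}. \qquad (2) $$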
μA and μB are the means of images A and B, σA² and σB² their variances, σAB is the covariance of images A and B, and c1 and c2 are regularization parameters. For the assessment of SSIM between the restored image and the reference image, higher SSIM values indicate better reconstruction quality.
In addition, peak signal-to-noise ratio (PSNR) is employed as an auxiliary evaluation method. Higher PSNR values indicate better reconstruction effects, and the PSNR’s objectivity and stability can well reflect the quality of the image, as shown in Eq. (3) below:
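Assuming the conventional definition, with MAX the maximum possible pixel value and MSE as in Eq. (1):

$$ \mathrm{PSNR} = 10\log_{10}\!\left(\frac{\mathrm{MAX}^{2}}{\mathrm{MSE}}\right). \qquad (3) $$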
Furthermore, the correlation coefficient (CORR) is used as an evaluation metric to assess the correlation between the reconstructed image and the original image, as shown in Eq. (4) below:
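Taking CORR as the standard correlation coefficient computed over all pixel positions i:

$$ \mathrm{CORR}(A, B) = \frac{\sum_{i}\left(A_i - \bar{A}\right)\left(B_i - \bar{B}\right)}{\sqrt{\sum_{i}\left(A_i - \bar{A}\right)^{2}}\,\sqrt{\sum_{i}\left(B_i - \bar{B}\right)^{2}}}. \qquad (4) $$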
Ā and B̄ represent the mean pixel values of the reconstructed and original images, respectively. The closer CORR is to 1, the more similar the two images are.
To better evaluate the performance of the MSMDFNet model in the task of MMF speckle-image restoration, we compare different networks on the experimentally collected MNIST [30] speckle dataset. The networks compared are U-net, Vgg-Unet [31], and R2U-net [32]. Vgg-Unet uses the VGG network as the encoder with a decoder identical to that of U-net, while R2U-net is a recurrent residual convolutional neural network based on the U-net model, integrating RNN and ResNet structures into the encoder-decoder architecture. We also conduct ablation experiments to demonstrate the contribution of the multiscale and multidimensional modules to speckle-image restoration. The specific comparison models are as follows: the traditional U-net model, Net1 (with only the multiscale module added), Net2 (with only the multidimensional attention-mechanism module added), and MSMDFNet (with both modules added, and the pooling layers in the encoder replaced by convolutional layers). Finally, to test the proposed network’s generalization ability, we use the same models to restore speckle data collected from the Fashion-MNIST [33] (hereinafter FMNIST) dataset under the same environmental conditions.
Using the experimental setup shown in Fig. 1, a total of 2,200 speckle images from the MNIST dataset and another 2,200 speckle images from the FMNIST dataset are collected at the output end of the MMF, corresponding to their respective original images. Of these, the initial 2,000 image pairs are selected as the training set, and the remaining 200 pairs are used as the test set. To ensure the suitability of the network training, and to retain as much useful information as possible, the collected speckle image size is standardized to 128 × 128.
The 2,000 image-sample training data pairs are used to train the U-net, Vgg-Unet, R2U-net, and MSMDFNet models, resulting in the optimal models. The trained weights and biases are then loaded, and the 200-image MNIST test set is used to test the models under the same acquisition conditions as the training set. The test results are shown in Table 1.
TABLE 1. Comparison of test results for MNIST speckle image recovery by different networks.

| Model | SSIM | PSNR (dB) | CORR |
|---|---|---|---|
| U-net | 0.7626 | 18.1678 | 0.8886 |
| Vgg-Unet | 0.7461 | 17.1990 | 0.8601 |
| R2U-net | 0.7584 | 17.7645 | 0.8811 |
| MSMDFNet | 0.8158 | 20.3670 | 0.9331 |
The four network models are tested on speckle images corresponding to handwritten digits 0–9. As shown in Table 1, the MSMDFNet model achieves higher scores across all three evaluation metrics, compared to the other networks. This demonstrates that the MSMDFNet, which incorporates multiscale and multidimensional attention mechanisms, performs better in restoring MMF speckle images. The multiscale module increases the network’s receptive field for feature extraction, allowing it to capture detailed information at different scales, thus compensating for key features lost due to single-scale convolutional kernels. The multidimensional attention mechanism enhances the network’s focus on restoring central digit information in both channel and spatial dimensions, effectively focusing on the sparse, abstract 2D waveform images. Consequently, the reconstructed images have better contrast and brightness, with higher SSIM and PSNR values, leading to higher CORR in the reconstructed images.
Figure 4 presents the information-recovery results of the speckle images tested by the four networks. From a human visual perspective, it is evident that the image in Fig. 4(f), reconstructed by the MSMDFNet model, accurately restores the original. The reconstructed handwritten digits have smooth edges without indentations, matching the shape of the label image without distortion. In contrast, the reconstructions in Figs. 4(c)–4(e) have visibly lower quality; some exhibit pronounced jagged edges, especially for the digit 5. Compared to the MSMDFNet model, the U-net, R2U-net, and Vgg-Unet models fail to adequately restore the label images, exhibiting significant distortions.
To more clearly demonstrate the advantages of the MSMDFNet model in MMF speckle-image reconstruction, Fig. 5 records the loss, SSIM, PSNR, and CORR for each training iteration of the networks. Figure 5(a) shows the training loss: Although the loss-convergence regions of the four networks are similar, the other three networks exhibit significant fluctuations and occasional spikes in error, whereas the MSMDFNet model’s loss remains stable with minimal oscillation. This indicates that our network effectively learns the hidden features and patterns of the input information and has a clear advantage in stability. Figures 5(b)–5(d) display the SSIM, PSNR, and CORR of the test set during network training. The MSMDFNet model significantly outperforms the other networks in MMF speckle-image reconstruction quality across all three evaluation metrics, with smaller fluctuations and greater stability during training.
Using 2,000 pairs of MMF speckle-image data, we train the U-net, Net1, Net2, and MSMDFNet models to test the effect of the different modules on speckle-image reconstruction quality. A test set of 200 speckle images, collected under the same experimental conditions, is used to observe the models’ reconstruction of the label images. Table 2 presents the average SSIM, PSNR, and CORR values for MMF speckle-image reconstruction by the four networks.
TABLE 2. Comparison of test results for MNIST speckle image recovery by different modules.

| Model | SSIM | PSNR (dB) | CORR |
|---|---|---|---|
| U-net | 0.7626 | 18.1678 | 0.8886 |
| Net1 | 0.7693 | 18.24 | 0.8895 |
| Net2 | 0.7851 | 19.13 | 0.9118 |
| MSMDFNet | 0.8158 | 20.3670 | 0.9331 |
From the data in Table 2, the following observations can be made: (1) Net1, which includes the multiscale module, shows higher values on all metrics than the traditional U-net, indicating that attending to features at different scales helps to improve the quality of MMF speckle-image restoration. (2) Net2, incorporating the multidimensional attention-mechanism module, demonstrates a significant improvement in reconstruction results, showing that attending to different dimensions of the network’s information can effectively enhance image-reconstruction quality; the multidimensional attention mechanism concentrates the network’s attention on the central, information-rich regions of the handwritten digits, resulting in better reconstruction. (3) The MSMDFNet model, which combines both modules, significantly outperforms the other models, demonstrating the substantial advantage of the proposed network in MMF speckle-image reconstruction.
Using the four network models trained solely on the MNIST speckle dataset, we reconstruct 200 FMNIST speckle images collected under the same conditions, to test the models’ transfer-learning capabilities in speckle-image restoration. As shown in Table 3, the performance metrics drop significantly compared to those for the simpler MNIST dataset, owing to the increased complexity of FMNIST images. However, the MSMDFNet model still exhibits superior adaptability, with all metrics surpassing those of the other models.
TABLE 3. Test results for FMNIST speckle image recovery.

| Model | SSIM | PSNR (dB) | CORR |
|---|---|---|---|
| U-net | 0.5810 | 17.0114 | 0.8519 |
| Vgg-Unet | 0.5538 | 16.4643 | 0.8437 |
| R2U-net | 0.5670 | 17.0968 | 0.8543 |
| MSMDFNet | 0.6214 | 18.4463 | 0.8857 |
From Fig. 6, it is evident that the image details reconstructed in Fig. 6(b) are superior to those in Figs. 6(c)–6(e); MSMDFNet’s advantage over the other models is thus even more pronounced in terms of detail reconstruction. These results demonstrate that the multiscale module and the multidimensional attention-mechanism module contribute significantly to information recovery from MMF speckle images. The MSMDFNet network model excels in extracting image-feature information and proves effective for transfer learning between different image datasets.
This paper proposes a fully convolutional neural-network model, MSMDFNet, which integrates multiscale information and multidimensional attention mechanisms for image restoration in MMF. This approach fuses feature information from different scales, enhancing the capability to capture global information. Combined with multidimensional attention mechanisms, the network’s focus is concentrated on regions with complex information, providing superior detail handling in MMF speckle reconstruction. The MSMDFNet model also demonstrates superior generalization ability compared to other networks, maintaining excellent reconstruction performance for different datasets. In summary, the MSMDFNet model exhibits strong learning and demodulation capabilities for the nonlinear optical transmission mechanism from input to output in MMF, making it more advantageous in image reconstruction. It holds significant application potential in medical endoscopic imaging and secure communication.
The authors thank the National Natural Science Foundation of China, Shanghai Industrial Collaborative Innovation Project, and the Development Fund for Shanghai Talents for help in identifying collaborators for this work.
The National Natural Science Foundation of China (Grant no. 62275153, 62005165); the Shanghai Industrial Collaborative Innovation Project (Grant no. HCXBCY-2022-006); the Development Fund for Shanghai Talents (Grant no. 2021005); the Key Laboratory of Space Active Opto-electronics Technology of Chinese Academy of Sciences (Grant no. 2021ZDKF4); the Shanghai Science and Technology Innovation Action Plan (Grant no. 22dz1201300).
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Data underlying the results presented in this paper are not publicly available at this time, but may be obtained from the authors upon reasonable request.