Reconstruction method – focal-line-based back-projection algorithm
In PACT, universal back-projection (UBP) is frequently used for 3D image reconstruction [21]. The reconstruction is given by the following formula:
$$ {p}_0\left(\overrightarrow{r}\right)=\frac{1}{\Omega_0}\underset{S}{\int }d\Omega {\left.\left[2p\left(\overrightarrow{r_d},t\right)-2t\frac{\partial p\left(\overrightarrow{r_d},t\right)}{\partial t}\right]\right|}_{t=\left|\overrightarrow{r_d}-\overrightarrow{r}\right|/{v}_s} $$
Here, \( {p}_0\left(\overrightarrow{r}\right) \) is the initial PA pressure at \( \overrightarrow{r} \), \( p\left(\overrightarrow{r_d},t\right) \) is the acoustic pressure detected at \( \overrightarrow{r_d} \), and the delay time \( t \) is the travel time \( \left|\overrightarrow{r_d}-\overrightarrow{r}\right|/{v}_s \), in which \( {v}_s \) is the speed of sound in tissue (1.54 mm/µs, i.e., 1540 m/s). \( \Omega_0 \) is the solid angle subtended by the transducer surface \( S \). The UBP algorithm was developed for point-like transducers and is inaccurate for focused transducers, such as linear transducer arrays with a focus along the axial direction. In this case, because of the finite element aperture, the time delay cannot be computed directly from the distance between the point source and the center of the element. The focal-line reconstruction algorithm addresses this issue by using a focal line that passes through the foci of all transducer elements. The travel path (and hence the time of arrival) from any point in 3D space is determined by its intersection with the focal line, because only the path that crosses the focal line produces the strongest response in the transducer. Detailed descriptions of this method can be found in [9, 22].
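As an illustration of the point-detector case only, the following C++ sketch evaluates the delay \( \left|\overrightarrow{r_d}-\overrightarrow{r}\right|/{v}_s \) and the bracketed back-projection term \( 2p-2t\,\partial p/\partial t \) for one sampled channel. The names (Vec3, delaySeconds, delayToSample, backProjectionTerm), the central-difference derivative, and the assumption that sample 0 corresponds to t = 0 are hypothetical choices made for this sketch and are not taken from the authors' code; the focal-line correction described in [9, 22] is not included.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical 3D point type; coordinates in metres.
struct Vec3 { double x, y, z; };

// Travel time |r_d - r| / v_s from a source point r to a point detector r_d.
double delaySeconds(const Vec3& r, const Vec3& rd, double vs) {
    assert(vs > 0.0);
    const double dx = rd.x - r.x, dy = rd.y - r.y, dz = rd.z - r.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz) / vs;
}

// Nearest sample index for a delay t at sampling rate fs,
// assuming sample 0 was acquired at t = 0.
std::size_t delayToSample(double t, double fs) {
    assert(t >= 0.0 && fs > 0.0);
    return static_cast<std::size_t>(std::llround(t * fs));
}

// Bracketed UBP term 2 p(t) - 2 t dp/dt evaluated at sample k.
// dp/dt is approximated by a central difference (an assumption of this sketch).
double backProjectionTerm(const std::vector<double>& p, std::size_t k, double fs) {
    assert(k >= 1 && k + 1 < p.size());
    const double t = static_cast<double>(k) / fs;
    const double dpdt = (p[k + 1] - p[k - 1]) * fs / 2.0;
    return 2.0 * p[k] - 2.0 * t * dpdt;
}
```

For example, a source 15.4 mm from a detector at 1540 m/s gives a delay of 10 µs, which is sample 400 at an assumed 40 MHz sampling rate.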
MEXCUDA function generation
As mentioned above, MATLAB serves as the main platform for pre- and post-processing the data, while all computationally intensive processing is performed in C++. We therefore need a “gateway” between CUDA/C++ and MATLAB, and a MEXCUDA function provides this connection. It passes inputs from MATLAB to C++, performs the calculation in C++, and returns the output to MATLAB. In detail, MEXCUDA extends the MATLAB mex function, which executes C/C++ code through the MEX API. The difference between mex and MEXCUDA is that MEXCUDA source code is compiled by the NVIDIA CUDA compiler (nvcc), which enables GPU execution from C++ for improved performance.
We first need to generate a MEXCUDA function before calling it in MATLAB. The source code for the MEXCUDA function is a CU file written in CUDA C++. The CU file has three main building blocks. The first block is initialization, which serves two purposes. First, it initializes the MathWorks GPU library by calling mxInitGPU from the mxGPU API. Second, it creates mxGPUArray objects (mxGPUArray is the mxGPU API type that represents GPU arrays) to hold the gpuArray inputs from MATLAB and an output matrix “pa_img” that represents the reconstructed image. The next block is parallel computation. It contains several kernel functions in the device code that calculate pa_img from the input mxGPUArray objects in parallel. The last block is finalization. It returns pa_img to the MATLAB code and destroys the GPU matrices to free memory. From this source code, we create the compiled MEXCUDA function by using the mexcuda command in MATLAB. The resulting MEXCUDA function is a mexw64 file, i.e., nvcc-compiled code for the 64-bit Windows operating system. It can be called directly in MATLAB like any other function.
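The three building blocks can be mirrored in a host-only C++ sketch. This is a CPU stand-in and not the authors' CU file: in an actual MEXCUDA source the entry point is mexFunction, initialization would call mxInitGPU and wrap the inputs with mxGPUCreateFromMxArray, the outer loop below would run as a CUDA kernel, and finalization would use mxGPUCreateMxArrayOnGPU and mxGPUDestroyGPUArray. The function name reconstructSketch, the precomputed per-voxel delay table, and the plain nearest-sample summation are hypothetical simplifications.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical CPU stand-in for the three blocks of the CU file.
// rf:     channel data, nChannels x nSamples, row-major
// delays: sample index per (voxel, channel), nVoxels x nChannels, row-major
std::vector<float> reconstructSketch(const std::vector<float>& rf,
                                     const std::vector<int>& delays,
                                     std::size_t nChannels,
                                     std::size_t nSamples,
                                     std::size_t nVoxels) {
    // Block 1, initialization: validate the inputs and allocate the output.
    assert(nChannels > 0 && nSamples > 0);
    if (rf.size() != nChannels * nSamples ||
        delays.size() != nVoxels * nChannels) {
        throw std::invalid_argument("input sizes do not match dimensions");
    }
    std::vector<float> pa_img(nVoxels, 0.0f);

    // Block 2, computation: voxels are independent of each other, so on the
    // GPU each iteration of this outer loop would map to one thread.
    for (std::size_t v = 0; v < nVoxels; ++v) {
        float sum = 0.0f;
        for (std::size_t c = 0; c < nChannels; ++c) {
            const int k = delays[v * nChannels + c];
            // Skip delays that fall outside the recorded window.
            if (k >= 0 && static_cast<std::size_t>(k) < nSamples) {
                sum += rf[c * nSamples + static_cast<std::size_t>(k)];
            }
        }
        pa_img[v] = sum;
    }

    // Block 3, finalization: hand the image back to the caller.
    return pa_img;
}
```

Keeping validation and allocation outside the loop reflects the structure described above: only the per-voxel work is a candidate for a kernel.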
The workflow of a function execution by MEXCUDA is shown in Fig. 1. First, in the MATLAB front-end code, users load raw data, convert CPU-based matrices into GPU matrices, and set reconstruction parameters. Then, users send the inputs to the MEXCUDA function. After executing the building blocks described above, the function returns the final reconstructed image to MATLAB. Finally, with post-processing steps in MATLAB, users can visualize and examine the reconstructed 3D structure.
Heterogeneous computing in CUDA/C++
The process flow executed in C++ employs a widely used programming approach called heterogeneous computing to maximize performance. The GPU excels at computing each matrix value in parallel, but it handles traditional serial tasks, such as checking input compatibility, pre-allocating memory, and creating output arrays, less effectively. The CPU is faster at these steps and is therefore better suited for pre- and post-processing data. Accordingly, the CPU is employed in the initialization and finalization blocks, while the GPU is used in the parallel computation block. This processing flow is presented in Fig. 2.
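One way to picture this split is that the CPU computes the launch configuration serially, and each GPU thread then converts its own linear index into voxel coordinates. The sketch below writes both steps as plain C++; the helper names (blocksFor, toVoxel, VoxelIndex) and the x-fastest memory layout are assumptions for illustration, not details reported for the code described here.

```cpp
#include <cassert>
#include <cstddef>

// Host-side (CPU) step: number of thread blocks needed to cover n voxels.
std::size_t blocksFor(std::size_t n, std::size_t threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return (n + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
}

// Hypothetical voxel coordinate triple.
struct VoxelIndex { std::size_t ix, iy, iz; };

// Device-side (GPU) step, written here as plain C++: map a linear thread
// index to 3D voxel coordinates, with x varying fastest (assumed layout).
VoxelIndex toVoxel(std::size_t idx, std::size_t nx, std::size_t ny) {
    assert(nx > 0 && ny > 0);
    return { idx % nx, (idx / nx) % ny, idx / (nx * ny) };
}
```

In a CUDA kernel the linear index would typically be blockIdx.x * blockDim.x + threadIdx.x, and threads whose index is not smaller than the voxel count would return early, because ceiling division can launch more threads than there are voxels.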
Validation experiments
To evaluate the efficiency of the optimized code, we scanned the breast of a healthy volunteer to acquire 3D vascular data. The human imaging study was performed in compliance with the University at Buffalo IRB protocols. The PACT imaging system consists of three main parts: a 10-ns-pulsed Nd:YAG laser with 10 Hz pulse repetition rate and 1064 nm output wavelength, a customized linear array with 128 elements and 2.25 MHz central frequency, and a Verasonics Vantage data acquisition system with 128 receive channels. Light was delivered through a bifurcated fiber bundle with a 1.1-cm-diameter circular input and two 7.5-cm-length line outputs (Light CAM #2, Schott Fostec). During the experiment, the input laser energy was around 800 mJ/pulse and the efficiency of the fiber bundle was 60%, so the laser output from the fiber bundle was around 480 mJ/pulse. Since the laser beam on the object’s surface was approximately 2.5 cm × 8.0 cm (about 20 cm2), the laser fluence was about 24 mJ/cm2, which is much lower than the safety limit of 100 mJ/cm2 [23]. The transducer was scanned along the elevation direction over 40 mm with a 0.1 mm step size. The entire imaging area was 8.6 cm (lateral width of the probe) × 4 cm (scanning distance). A schematic of the experimental setup is shown in Fig. 3. Following data collection, we performed 3D focal-line reconstruction with MCCC, MCC and MWGC for comparison.