**A tutorial using Python and scientific libraries to implement pair correlation function (pCF) analysis of a big time series of images from fluorescence microscopy on a personal computer.**

Updated on June, 28, 2022

Presented at the Big Data Image Processing & Analysis BigDIPA workshops 2016, 2017, and 2018

Supported by the National Institute of Health grant numbers 1R25EB022366-01 and 2P41GM103540-31

Released under the creative commons Attribution 4.0 International license

Spatiotemporal analysis of fluorescence fluctuations of optical microscopy measurements on living samples is a powerful tool that can provide insight into dynamic molecular processes of high biological relevance (Di Rienzo et al., 2016).

Pair correlation function (pCF) analysis of fluorescence microscopy image time series acquired using very fast cameras is one of the emerging and promising spatiotemporal techniques. However, it is computationally very expensive, and the analysis of big image data sets used to take hours or be impractical on personal computers.

In this tutorial, we use the open-source Python programming language and scientific libraries to compute the pCF analysis of a large time series of fluorescence images acquired with Selective Plane Illumination Microscopy (SPIM).

First, we implement a function to calculate the cross-correlation of two time series. We demonstrate the limitations of Python for efficient numerical computations and several ways to overcome them.

Next, we implement the pCF analysis of a small simulated image time series and optimize its speed by almost two orders of magnitude.

Finally, we use this pCF analysis function to analyze a multi-gigabyte image time series in small, overlapping chunks.

This is a tutorial on computing pair correlations in big images. Refer to the references for an introduction to the pair correlation method and how the computed pair correlations can be analyzed and visualized to study molecular flow in cells.

To follow this tutorial and run its code, the following prerequisites are needed:

- pair correlation function analysis of fluorescence fluctuations (e.g. Gratton and Digman lectures)
- programming and nD-array computing (e.g. Matlab, numpy)
- signal processing, time and frequency domain

- 64-bit Windows 10, macOS, or Linux based operating system
- Core i5 CPU with 4 cores
- 8 GB RAM
- SSD drive with 50 GB free space
- NVIDIA GPU with CUDA drivers
- A modern web browser supporting WebSockets
- Disabled on-access antivirus scanning for the working and scratch directories

- CPython 3.8 64-bit with development header files and libraries
- Python packages: Jupyter, IPython, numpy, scipy, matplotlib, scikit-image, h5py, Cython, dask, numba, and CuPy (optional)
- CUDA Toolkit (optional, used for CuPy)
- A Python distutils compatible C compiler with OpenMP support: Visual Studio 2019 or gcc

Clone the source code from the ipcf.ipynb repository to a working directory:

`git clone https://github.com/cgohlke/ipcf.ipynb`

Extract the example data files from the Simulation_Channel.bin.zip and nih3t3-egfp_2.zip archives to the ipcf.ipynb directory:

`unzip Simulation_Channel.bin.zip -d ipcf.ipynb unzip nih3t3-egfp_2.zip -d ipcf.ipynb`

Open the

`ipcf.ipynb`

notebook from within the ipcf.ipynb directory, e.g. using locally installed jupyter or a docker image, e.g.:`jupyter-nbclassic ipcf.ipynb docker run --rm -p 8888:8888 -v ${PWD}/ipcf.ipynb:/home/jovyan/work/ipcf.ipynb jupyter/scipy-notebook:d990a62010ae`

The notebook depends on a few platform specific variables. Adjust the path to the example data and the path to a scratch directory where large intermediate and output files will be saved:

In [1]:

```
# import common modules
import os
import sys
import math
import random
import datetime
import warnings
import threading
import multiprocessing
# limit the number CPUs to use
MAXCPUS = min(8, multiprocessing.cpu_count() // 2)
# read-only directory where the example data were extracted
DATA_PATH = './'
# writable directory where large intermediate and output files will be saved
# must not be a network drive
SCRATCH_PATH = './'
# for the BigDIPA workshop cluster
if os.path.exists('../../data/02_fcs_computation/'):
DATA_PATH = '../../data/02_fcs_computation/'
SCRATCH_PATH = '../../scratch/'
# use sequential MKL to prevent thread oversubscription
os.environ['MKL_NUM_THREADS'] = '1'
# os.environ['OMP_NUM_THREADS'] = '1'
import numpy
# set compiler and linker arguments for OpenMP
if sys.platform == 'win32':
OPENMP_ARGS = '--compile-args=/openmp'
else:
OPENMP_ARGS = '--compile-args=-fopenmp --link-args=-fopenmp'
# tell numba where to find CUDA NVVM on Windows
cuda_path = os.environ.get('CUDA_PATH', 'CUDA not found')
if os.path.exists(cuda_path):
os.environ['PATH'] += r';%s\bin;%s\nvvm\bin' % (cuda_path, cuda_path)
os.environ['CUDA_HOME'] = cuda_path
# ignore warnings
warnings.simplefilter('ignore')
# acquire a lock object to force single threaded execution
THREADLOCK = threading.RLock()
# initialize random number generators
random.seed(42)
numpy.random.seed(42)
# display plots within Jupyter notebook
%matplotlib inline
# detect if CUDA is available
try:
import cupy
SKIP_CUDA = False
except ImportError:
SKIP_CUDA = True
# record the current time
START_TIME = datetime.datetime.now()
```

The challenge is to **compute the pair correlation function analysis (pCF) of a large time series of images using Python on a personal computer in reasonable time**.

Our dataset is a 34.5 GB time series of SPIM images of a biological cell as **35,000 TIFF files of 1024x512 16-bit greyscale samples** each:

As part of molecular flow analysis, we need to **cross-correlate the time series at each pixel of the image with the time series of all its neighbor pixels at a specific radius** after correcting the time series for photobleaching:

For a radius of 6 there are 32 neighbor pixels. To analyze an image of 1024x512 pixels, the **number of cross-correlation calculations needed is 16,777,216** (including padding of the border by 6 pixels).

We would like to perform the cross-correlation calculations on the dataset in **about 15 minutes**.

We have a modern notebook or desktop computer with a minimum of **4 CPU cores, 8 GB RAM, and SSD**, running a 64-bit OS with a scientific Python 3.6 distribution and a C compiler installed.

To compute 16,777,216 cross-correlations on 4 CPU cores in 15 minutes, **a single cross-correlation computation must finish in about 215 µs** (`4*15*60*1000*1000/16777216`

).

Here is a **pseudo code for the image pair correlation function (ipCF)** analysis that will later be completed to be the reference implementation:

In [2]:

```
def ipcf_reference(image_timeseries, circle_coordinates, bins):
"""Return pair correlation function analysis of image time series.
Cross-correlate the time series of each pixel in the image
with all its neighbors at a certain radius and return all
the log-binned and smoothed correlation curves.
"""
ntimes, height, width = image_timeseries.shape
npoints = len(circle_coordinates)
radius = circle_coordinates[0, 0]
result = zeros((height-2*radius, width-2*radius, npoints, len(bins)),
'float32')
for y in range(radius, height-radius):
for x in range(radius, width-radius):
a = image_timeseries[:, y, x]
for i in range(npoints):
u, v = circle_coordinates[i]
b = image_timeseries[:, y+v, x+u]
c = correlate(b, a)
result[y-radius, x-radius, i] = smooth(average(c, bins))
return result
# functions that need to be implemented
def correlate(a, b):
"""Return cross-correlation of two arrays."""
...
def average(c, bins):
"""Return averaged chunks of array."""
...
def smooth(c):
"""Return smoothed array."""
...
def circle(npoints, radius):
"""Return circle coordinates."""
...
def logbins(size, nbins):
"""Return up to nbins exponentially increasing integers from 1 to size."""
...
```

In this section, we

- review the mathematical definition and some properties of cross-correlation
- implement an unnormalized cross-correlation function in pure Python
- compare its speed with an implementation in C
- try several Python libraries to speed up the cross-correlation calculation: threading, numpy, scipy, numba, numba.cuda, CuPy, and Cython
- use the cross-correlation theorem, Cython, and the fft2d C library to implement a very fast circular correlation function

Along the way we learn

- about CPython and its limitations for numerical computations
- to write Python C extensions and interface with C libraries using Cython
- to use the Jupyter notebook to interactively manage data, code, visualizations, and explanatory text

The techniques learned in the first section are applied to implement the pCF analysis of a small simulated image time series.

In this section, we

- load and explore a time series of images from a simulation of fluorescence fluctuations
- implement functions to normalize, average, and smooth correlation functions for image fluctuation analysis
- run the reference implementation of image pCF analysis
- visualize the result of the image pCF analysis
- optimize the algorithm and implementation of the image pCF analysis for speed

In this section, we demonstrate some methods to process data that fit on disk but are larger than RAM, aka out-of-core processing. We

- interactively browse the 34.5 GB image time series consisting of 35,000 TIFF files
- semi-automatically select a subset of the data for further analysis using the scikit-image library
- save the selection as a blocked dataset in a HDF5 file
- implement a function to correct time series for photobleaching
- use the dask library to chop big data in smaller, overlapping blocks/chunks and schedule the analysis of individual blocks
- run the photobleaching correction and the image pair correlation function implemented in the second part on overlapping blocks of the big SPIM dataset

In this section, we implement a fast 1D cross-correlation function. The goal is to perform a single cross-correlation of two long (2^{14}) 1D sequences of real numbers in about 200 µs.

The **unnormalized discrete correlation of two sampled functions** $a$ and $b$ as generally defined in signal processing texts (e.g. numpy.correlate and MATLAB xcorr) is:

This is known as the **sliding dot product**.

The above definition is not unique and sometimes correlation may be defined differently (e.g. Wikipedia):

$$c'_{ab}[delay] = \sum_n conj(a[n]) \cdot b[n+delay],$$which is related to the first definition:

$$c'_{ab}[delay] = c_{ab}[-delay] = c_{ba}[delay]$$There is also an ambiguity for which range of $delay$s to compute the correlations and how to deal with out of bounds $delay$ values.

In this tutorial, $a$, $b$ and $c$ will have the same length. $delay$ will range from negative half length to positive half length.

There are two ways of dealing with **out of bounds $delay$ values**:

- for
**linear correlation**, the out of bounds indices are set to zero (zero padding):

In [3]:

```
a = [1, 2]
b = [3, 4]
c = [ 0 * b[0] + a[1 - 1] * b[1], # delay -1 a[-1]=0
a[0 + 0] * b[0] + a[1 + 0] * b[1]] # delay 0
print(c)
```

[4, 11]

- for
**circular correlation**, the out of bounds indices are wrapped around:

In [4]:

```
a = [1, 2]
b = [3, 4]
c = [a[0 - 1] * b[0] + a[1 - 1] * b[1], # delay -1 a[-1]=a[1]
a[0 + 0] * b[0] + a[1 + 0] * b[1]] # delay 0
print(c)
```

[10, 11]

Linear correlation can be calculated using the circular algorithm by zero padding the input arrays on both sides to twice their lengths:

In [5]:

```
a = [0, 1, 2, 0]
b = [0, 3, 4, 0]
c = [a[0 - 1] * b[0] + a[1 - 1] * b[1] + a[2 - 1] * b[2] + a[3 - 1] * b[3],
a[0 + 0] * b[0] + a[1 + 0] * b[1] + a[2 + 0] * b[2] + a[3 + 0] * b[3]]
print(c)
```

[4, 11]

Some properties of the cross-correlation function are relevant for the pair correlation calculations:

According to the

**cross correlation theorem**, the cross-correlation ($\star$) of functions $a(t)$ and $b(t)$ can be calculated using the Fourier transform $\mathcal{F}\{\}$:$$ a\star b = \mathcal{F}^{-1}(\mathcal{F}\{a\} \cdot (\mathcal{F}\{b\})^*)$$

where $()^*$ denotes the complex conjugate. For long vectors this method is faster to calculate than the sliding dot product.

The cross-correlation of functions $a(t)$ and $b(t)$ is equivalent to the

**convolution**($*$) of $a(t)$ and $b^*(-t)$:$$ a\star b = a*b^*(-t),$$

where $b^*$ denotes the conjugate of $b$.

- For real valued input, the following symmetry applies:

- The discrete
**autocorrelation**of a sampled function is the discrete**correlation of the function with itself**. It is always symmetric with respect to positive and negative delays:

- The circular correlation can calculate linear correlation by zero-padding the vectors.

Let's implement the discrete linear correlation function for real input sequences in pure Python using the **sliding dot product** as defined previously:

In [6]:

```
def dot_python(a, b, start, stop, delay):
"""Return dot product of two sequences in range."""
sum = 0
for n in range(start, stop):
sum += a[n + delay] * b[n]
return sum
def correlate_python(a, b):
"""Return linear correlation of two sequences."""
size = len(a)
c = [0] * size # allocate output array/list
for index in range(size):
delay = index - size // 2
if delay < 0:
c[index] = dot_python(a, b, -delay, size, delay)
else:
c[index] = dot_python(a, b, 0, size-delay, delay)
return c
```

It is good practice to write tests to verify code:

In [7]:

```
def test_correlate(correlate_function):
"""Test linear correlate function using known result."""
# even lengths
c = correlate_function([1, 2], [3, 4])
assert list(c) == [4, 11], c
# uneven lengths
c = correlate_function([1, 2, 3], [4, 8, 16])
assert list(c) == [40, 68, 32], c
test_correlate(correlate_python)
```

The tests passed, no exception is raised.

Let's time the cross-correlation of two random sequences of length 8192, which are created using list comprehension:

In [8]:

```
import random
A = [random.random()-0.5 for _ in range(2**13)]
B = [random.random()-0.5 for _ in range(2**13)]
%time c = correlate_python(A, B)
```

Wall time: 4.49 s

This implementation is about **25,000 times slower than desired** (~200 µs) for our pair correlation function image analysis task.

Using the matplotlib 2D plotting library, we plot the auto-correlation of a short random sequence and embed it into the Juyter notebook:

In [9]:

```
from matplotlib import pyplot
def plot_autocorrelation(size=200):
"""Plot autocorrelation of a random sequence."""
a = [random.random()-0.5 for _ in range(size)]
c = correlate_python(a, a)
delays = list(range(-len(a) // 2, len(a) // 2))
pyplot.figure(figsize=(6, 6))
pyplot.subplot(2, 1, 1)
pyplot.title('random sequence')
pyplot.ylabel('intensity')
pyplot.plot(a, 'g')
pyplot.subplot(2, 1, 2)
pyplot.title('auto-correlation')
pyplot.xlabel('delay')
pyplot.ylabel('correlation')
pyplot.plot(delays, c, 'r')
pyplot.tight_layout()
pyplot.show()
plot_autocorrelation()
```

The autocorrelation is always **symmetric** with respect to positive and negative delays.

For long random sequences, the autocorrelation approaches an **impulse function**.

Using IPython widgets, we plot the cross-correlation of two short random sequences with peak, where the peak in sequence $b$ is delayed with respect to sequence $a$:

In [10]:

```
from matplotlib import pyplot
from ipywidgets import interact, IntSlider, Dropdown
def plot_crosscorrelation(size=100):
"""Interactively plot cross-correlation of signals with delayed peak."""
delays = list(range(-size//2, size//2))
a = [random.random()-0.5 for _ in range(size)]
b = [random.random()-0.5 for _ in range(size)]
a[size//2] = 10 # add peak in middle of sequence
def _plot(option, delay):
b_ = b.copy()
b_[size//2 + delay] = 10 # add peak at shifted position
if option.endswith('b'):
c = correlate_python(a, b_)
else:
c = correlate_python(b_, a)
pyplot.figure(figsize=(6, 6))
pyplot.subplot(2, 1, 1)
pyplot.title('random sequences with peak')
pyplot.ylabel('intensity')
pyplot.plot(a, 'g', label='a')
pyplot.plot(b_, 'b', label='b')
pyplot.ylim([-2, 12])
pyplot.yticks([0, 5, 10])
pyplot.legend(fancybox=True, framealpha=0.5)
pyplot.subplot(2, 1, 2)
pyplot.title('cross-correlation')
pyplot.xlabel('delay')
pyplot.ylabel('correlation')
pyplot.xlim([-size//2, size//2])
pyplot.ylim([-20, 120])
pyplot.yticks([0, 50, 100])
pyplot.plot(delays, c, 'r', label=option)
pyplot.legend(fancybox=True, framealpha=0.5)
pyplot.tight_layout()
pyplot.show()
interact(_plot,
option=Dropdown(options=['a\u2605b', 'b\u2605a']),
delay=IntSlider(value=size//5, min=2-size//2, max=size//2-1,
continuous_update=False))
plot_crosscorrelation()
```

A positive delay of the peak in sequence $b$ with respect to the peak in sequence $a$ shows as a peak at a negative delay in the cross-correlation of a and b ($a \star b$).

Let's try to use Python's concurrent.futures module to **run several correlation functions in parallel on multiple CPU cores** within the same process using threads (the smallest sequences of programmed instructions that can be managed independently by the operating system):

In [11]:

```
from functools import partial
from concurrent.futures import ThreadPoolExecutor
def map_threaded(function, *iterables, max_workers=MAXCPUS, **kwargs):
"""Apply function to every item of iterable and return list of results.
Use a pool of threads to execute calls asynchronously.
"""
if kwargs:
function = partial(function, **kwargs)
with ThreadPoolExecutor(max_workers) as executor:
return list(executor.map(function, *iterables))
```

In [12]:

```
%time c = map_threaded(correlate_python, [A, A], [B, B])
assert c[0] == c[1]
```

Wall time: 12.1 s

There is **no improvement** over executing the functions sequentially.

In CPython, the reference Python implementation we are using, **only one thread can execute Python code at once** due to the **Global Interpreter Lock (GIL)**.

Threading is still an appropriate model in Python to run multiple I/O-bound tasks simultaneously.

We demonstrate later that Python functions implemented in C can release the GIL and be executed on multiple CPU cores using threads.

Let's compare the performance of the pure Python function to an implementation in C.

The speed of compiled C code is often used as a reference when comparing single-threaded performance.

In [13]:

```
%%writefile correlate_c.c
/* A linear correlate function implemented in C. */
#include <stddef.h>
#include <stdlib.h>
#include <stdio.h>
typedef ptrdiff_t ssize_t;
/* Compute dot product of two sequences in range. */
double dot_c(double *a, double *b, ssize_t start, ssize_t end, ssize_t delay)
{
ssize_t n;
double sum = 0.0;
for (n = start; n < end; n++)
sum += a[n + delay] * b[n];
return sum;
}
/* Compute linear correlation of two one-dimensional sequences. */
void correlate_c(double *a, double *b, double *c, ssize_t size)
{
ssize_t index, delay;
for(index = 0; index < size; index++) {
delay = index - size / 2;
if (delay < 0) {
c[index] = dot_c(a, b, -delay, size, delay);
}
else {
c[index] = dot_c(a, b, 0, size-delay, delay);
}
}
}
/* Time the correlate_c function. */
int main()
{
ssize_t i;
ssize_t size = 8192;
ssize_t loops = 25;
double *a = (double*)malloc(size * sizeof(double));
double *b = (double*)malloc(size * sizeof(double));
for (i = 0; i < size; i++) {
a[i] = (double)rand()/(double)(RAND_MAX) - 0.5;
b[i] = (double)rand()/(double)(RAND_MAX) - 0.5;
}
for (i = 0; i < loops; i++) {
double *c = (double*)malloc(size * sizeof(double));
correlate_c(a, b, c, size);
free(c);
}
free(a);
free(b);
return 0;
}
```

Writing correlate_c.c

Python's distutils.ccompiler module can be used to compile and link the C code:

In [14]:

```
from distutils import ccompiler
compiler = ccompiler.new_compiler()
objects = compiler.compile(['correlate_c.c'], extra_postargs=['-O2'])
compiler.link_executable(objects, 'correlate_c')
```

The generated executable can be timed using Jupyter's magick:

In [15]:

```
correlate_executable = './correlate_c'
t = %timeit -r 1 -q -o ! $correlate_executable
print('{:.2f} ms per loop'.format(t.best * 1000 / 25))
```

40.84 ms per loop

The C program calculates the correlation about **two orders of magnitudes faster than Python**.

So far, we have used Python's built-in list type to store sequences of floating-point numbers.

Depending on the Python implementation and platform:

every item of a list is an 8-byte pointer to an object storing the value.

every floating-point number is stored as a 24-byte object.

Hence, Python lists are very inefficient for storing large number of homogeneous numerical data:

the numbers are

**not stored contiguously**in a Python list.a Python list of floating-point numbers is about

**4x larger than a C array**:

In [16]:

```
import sys
import random
size = 8192
alist = [random.random() for _ in range(size)]
print('Storage size of Python list: {:>6} bytes'.format(
sys.getsizeof(alist) + sys.getsizeof(alist[0]) * size))
print('Storage size of C array: {:>6} bytes'.format(8 + size * 8))
```

Storage size of Python list: 265760 bytes Storage size of C array: 65544 bytes

So far, we have shown that:

- Python built-in lists cannot efficiently store homogeneous numerical data.
- Python runs numerical code orders of magnitudes slower than compiled C.
- Python code cannot be run in parallel on multiple CPU cores in the same process.

Note that this applies to CPython, the Python reference implementation, only. Other Python implementations (pypy, Jython, IronPython) might not have these limitations.

There are technical solutions to overcome those limitations:

The

**numpy library**provides a**standardized, efficient N-dimensional array object**to store homogeneous numerical data.**Many third-party libraries**(numpy, scipy, scikit-image, etc.) provide fast implementations of numerical functions operating on numpy arrays.Python can be

**extended using modules written in C**, which can release the GIL.Python code can be

**type annotated and compiled to C code**using Cython.Python code using a subset of the language synthax can be

**just in time compiled to LLVM, CUDA, or OpenCL**and executed on CPU or GPU, e.g. via numba.

Putting the limitations into perspective: besides CPU bound numerical calculations, there are many other tasks that are part of an efficient image processing pipeline:

Many tasks are

**I/O bound**(load or save data from/to the Internet, hard drive, or databases) and can be efficiently multi-threaded in Python.Besides threading, there are other

**methods to analyze data in parallel**: SIMD, multiprocessing, distributed.Python can be used to

**drive/control/schedule compile and compute tasks**, e.g. generate, compile, and execute C/OpenCL/CUDA code at runtime.

As an image analyst or end user of imaging software, Python can be used

as a

**glue language**for external libraries or executables written in C, Fortran, R, Java, .NET, Matlab, etc.for

**data munging**, i.e. mapping image and meta-data from a diversity of formats (raw binary, html, CSV, TIFF, HDF5, etc.) and sources (file system, databases, http, ftp, cloud storage) into more convenient formats.as a

**scripting language for imaging software**such as ZEN, μManager, CellProfiler, MeVisLab, ArcGIS, Amira, ParaView, VisIt, GIMP, Blender, OMERO, BisQue, etc.

In [17]:

```
import numpy
def correlate_numpy(a, b):
"""Return linear correlation of two one-dimensional arrays."""
return numpy.correlate(a, b, mode='same')
def test_correlate(correlate_function, **kwargs):
"""Test correlate function using known results."""
c = correlate_function(numpy.array([1., 2., 3.]),
numpy.array([4., 8., 16.]), **kwargs)
assert numpy.allclose(c, [40., 68., 32.]), c
c = correlate_function(numpy.array([1., 2., 3., 4.]),
numpy.array([5., 6., 7., 8.]), **kwargs)
assert numpy.allclose(c, [23.0, 44.0, 70.0, 56.0]), c
test_correlate(correlate_numpy)
A = numpy.random.random(2**13) - 0.5
B = numpy.random.random(2**13) - 0.5
%timeit correlate_numpy(A, B)
```

9.67 ms ± 13.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Depending on how the numpy library was compiled, the **numpy.correlate function is several times faster than our implementation in C**.

Numpy uses an optimized version of the dot product (from the BLAS library) for calculating the sliding dot product.

The storage size of the numpy array is close to a C array. The overhead of less than 100 bytes matters only for scalar values and small arrays:

In [18]:

```
print('Storage size of numpy array: {} bytes'.format(sys.getsizeof(A)))
```

Storage size of numpy array: 65632 bytes

Many numpy functions release the GIL and can be run in parallel on multiple CPU cores:

In [19]:

```
%timeit map_threaded(correlate_numpy, [A, A], [B, B])
```

10.6 ms ± 40 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

However, depending on the numpy build options, some functions might use threads internally, which can cause oversubscription and parallel execution to be slower than sequential. In such cases, when using numpy built with Intel's MKL library, set the `MKL_NUM_THREADS`

environment variable to 1 to disable the libraries internal use of threads.

Cython is an optimizing static compiler for both the Python programming language and the extended Cython programming language.

Cython makes writing C extensions for Python and interfacing with numpy arrays and C libraries easy. Unlike numba, Cython requires an external, Python compatible C compiler.

Cython integrates well with the Jupyter Notebook:

In [20]:

```
%reload_ext Cython
```

Here we use Cython to implement the sliding dot product cross-correlation by

**type annotating**the Python code**releasing the GIL**- using
**typed memoryviews**to access data in numpy arrays **compiling the code to machine code**via Python C extension

In [21]:

```
%%cython --compile-args=-O2
#
#cython: boundscheck=False
#cython: wraparound=False
import numpy
cdef double dot_cython(double[::1] a, double[::1] b,
ssize_t start, ssize_t end, ssize_t delay) nogil:
"""Return dot product of two sequences in range."""
cdef ssize_t n
cdef double sum
sum = 0.0
for n in range(start, end):
sum += a[n + delay] * b[n]
return sum
def correlate_cython(double[::1] a not None, double[::1] b not None):
"""Return linear correlation of two one-dimensional arrays."""
cdef ssize_t size, delay, index
size = len(a)
result = numpy.empty(size, dtype='float64')
# numpy array objects cannot be accessed in a nogil section
# use a Cython typed memoryview instead
cdef double[::1] c = result
with nogil:
for index in range(size):
delay = index - size // 2
if delay < 0:
c[index] = dot_cython(a, b, -delay, size, delay)
else:
c[index] = dot_cython(a, b, 0, size-delay, delay)
return result
```

In [22]:

```
test_correlate(correlate_cython)
%timeit correlate_cython(A, B)
%timeit map_threaded(correlate_cython, [A, A], [B, B])
```

The Cython implementation is about **as fast as the C implementation**. Since the function releases the GIL, it can efficiently run in parallel on multiple CPU cores using multi-threading.

Cython's code analysis can be helpful in type annotating and optimizing code. Try it using `%%cython --annotate`

at the top of the previous cell.

OpenMP (Open Multi-Processing) is an application programming interface (API) that supports multi-platform **shared memory multiprocessing** programming in C, C++, and Fortran.

Cython's `prange`

function is implemented using OpenMP's "`#pragma omp parallel for`

" directive.

We instruct the C compiler to use OpenMP by specifying extra platform specific compiler and linker arguments, which were defined at the beginning of the notebook in the `OPENMP_ARGS`

variable:

In [23]:

```
%%cython --compile-args=-O2 $OPENMP_ARGS
#
#cython: boundscheck=False
#cython: wraparound=False
import numpy
from cython.parallel import prange, parallel
cdef double dot_cython(double[::1] a, double[::1] b,
ssize_t start, ssize_t end, ssize_t delay) nogil:
"""Return dot product of two sequences in range."""
cdef ssize_t n
cdef double sum
sum = 0.0
for n in range(start, end):
sum += a[n + delay] * b[n]
return sum
def correlate_cython_omp(double[::1] a not None, double[::1] b not None,
int num_threads=0):
"""Return linear correlation of two one-dimensional arrays."""
cdef ssize_t size, delay, index
size = a.size
result = numpy.empty(size, dtype='float64')
cdef double[::1] c = result
with nogil, parallel(num_threads=num_threads):
for index in prange(size):
delay = index - size // 2
if delay < 0:
c[index] = dot_cython(a, b, -delay, size, delay)
else:
c[index] = dot_cython(a, b, 0, size-delay, delay)
return result
```

In [24]:

```
test_correlate(correlate_cython_omp)
%timeit correlate_cython_omp(A, B)
%timeit map_threaded(correlate_cython_omp, [A, A], [B, B])
```

Depending on the number of CPU cores, this function is several times faster than the previous implementation, but it can no longer be efficiently multi-threaded because the function already uses all CPU cores via OpenMP.

This implementation is slightly faster than using the `vsldCorrExec1D`

function in direct mode from Intel's Math Kernel Library (MKL) (not shown).

In [25]:

```
import numpy
import numba
@numba.jit(nogil=True)
def dot_numba(a, b, start, stop, delay):
"""Return dot product of two sequences in range."""
sum = 0.0
for n in range(start, stop):
sum += a[n + delay] * b[n]
return sum
@numba.jit
def correlate_numba(a, b):
"""Return linear correlation of two one-dimensional arrays."""
size = len(a)
c = numpy.empty(size, 'float64') # allocate output numpy array
for index in range(size):
delay = index - size // 2
if delay < 0:
c[index] = dot_numba(a, b, -delay, size, delay)
else:
c[index] = dot_numba(a, b, 0, size-delay, delay)
return c
test_correlate(correlate_numba)
%timeit correlate_numba(A, B)
%timeit map_threaded(correlate_numba, [A, A], [B, B])
```

Compiling Python code with numba achieves **speed comparable to C**.

The function cannot be run in parallel on CPU cores even though the inner `dot`

function releases the GIL. The loops need to be factored out to a function that does not hold the GIL.

To further improve the Numba implementation of the correlation function, we

move the code of the outer loop

`for index in range(size):`

into its own function named`correlate_numba_jit`

and release the GILuse the

`numba.prange`

function to parallelize the outer loop

In [26]:

```
import numpy
import numba
@numba.jit(nogil=True)
def dot_numba(a, b, start, stop, delay):
"""Return dot product of two sequences in range."""
sum = 0.0
for n in range(start, stop):
sum += a[n + delay] * b[n]
return sum
@numba.jit(nogil=True, parallel=True)
def correlate_numba_jit(c, a, b, size):
"""Compute linear correlation of two arrays using sliding-dot product."""
for index in numba.prange(size):
delay = index - size // 2
if delay < 0:
c[index] = dot_numba(a, b, -delay, size, delay)
else:
c[index] = dot_numba(a, b, 0, size-delay, delay)
def correlate_numba_parallel(a, b):
"""Return linear correlation of two one-dimensional arrays."""
size = len(a)
c = numpy.empty(size, 'float64')
with THREADLOCK:
correlate_numba_jit(c, a, b, size)
return c
test_correlate(correlate_numba_parallel)
%timeit correlate_numba_parallel(A, B)
%timeit map_threaded(correlate_numba_parallel, [A, A], [B, B])
```

The numba.cuda decorator can translate Python functions into PTX code, which execute on the CUDA hardware, e.g. a NVidia graphics card with thousands of cores.

In the **CUDA execution model**, a **kernel** function is executed once by each thread on a **grid of blocks of threads**. The grid can be 1, 2, or 3 dimensional. The threads within each block can synchronize and share memory. Thread blocks can execute independently in any order. The kernel function can determine the absolute position of the current thread in the entire grid of blocks.

In [27]:

```
import numpy
import numba.cuda
@numba.cuda.jit(device=True, inline=True)
def dot_cuda(a, b, start, stop, delay):
"""Return dot product of two sequences in range."""
sum = 0.0
for i in range(start, stop):
sum += a[i + delay] * b[i]
return sum
@numba.cuda.jit()
def correlate_cuda_kernel(c, a, b, size):
"""CUDA kernel to compute linear correlation of two arrays."""
# global position of the thread in the 1D grid
index = numba.cuda.grid(1)
if index < size:
delay = index - size // 2
if delay < 0:
c[index] = dot_cuda(a, b, -delay, size, delay)
else:
c[index] = dot_cuda(a, b, 0, size-delay, delay)
def correlate_numba_cuda(a, b):
"""Return linear correlation of two one-dimensional arrays."""
size = a.size
c = numpy.zeros(size, 'float64')
# launch the CUDA kernel
threadsperblock = 32
blockspergrid = (size + (threadsperblock - 1)) // threadsperblock
correlate_cuda_kernel[blockspergrid, threadsperblock](c, a, b, size)
return c
if not SKIP_CUDA:
test_correlate(correlate_numba_cuda)
%timeit correlate_numba_cuda(A, B)
%timeit map_threaded(correlate_numba_cuda, [A, A], [B, B])
```

Running this on a 1920 core GPU is about 20 times faster than on a single core CPU device. For longer input arrays, the GPU will be significant faster (see below).

So far, we have calculated the cross-correlation in the **time domain** using the **sliding dot product**.

For longer sequences, it is more efficient to use the **cross-correlation theorem** to calculate the cross-correlation in the **frequency domain** using Fourier transforms $\mathcal{F}\{\}$:

The scipy library provides many efficient numerical routines such as numerical integration and optimization.

The `scipy.signal.fftconvolve`

function uses zero-padding and the **Fast Fourier Transform (FFT)** according to the convolution theorem to calculate the convolution of two arrays.

Recall that the cross-correlation of functions $a(t)$ and $b(t)$ is equivalent to the **convolution** ($*$) of $a(t)$ and $b^*(-t)$:

It means that **correlation can be calculated using convolution by reversing the second input array**:

In [28]:

```
import scipy.signal
def correlate_scipy(a, b):
"""Return circular correlation of two one-dimensional arrays."""
return scipy.signal.fftconvolve(a, b[::-1], 'same')
test_correlate(correlate_scipy)
%timeit correlate_scipy(A, B)
%timeit map_threaded(correlate_scipy, [A, A], [B, B])
```

This is about **an order of magnitude faster** and **scales much better with larger array sizes** (not shown) than the multi-threaded Cython implementation of the sliding dot product.

Let's implement a circular correlation function using numpy's FFT functions according to the **cross-correlation theorem**:

In [29]:

```
import numpy.fft
def correlate_fft(a, b):
"""Return circular correlation of two one-dimensional arrays."""
# forward DFT
a = numpy.fft.rfft(a)
b = numpy.fft.rfft(b)
# multiply by complex conjugate
a *= b.conj()
# reverse DFT
c = numpy.fft.irfft(a)
# shift
c = numpy.fft.fftshift(c)
return c
%timeit correlate_fft(A, B)
%timeit map_threaded(correlate_fft, [A, A], [B, B])
```

Recall that the **circular correlation function can calculate the linear correlation by zero-padding the arrays** (to twice the size for even length arrays):

In [30]:

```
import numpy
c = correlate_fft(numpy.pad(A, A.size//2, mode='constant'),
numpy.pad(B, B.size//2, mode='constant'))
# remove zero-padding from result
c = c[A.size//2: -A.size//2]
# compare to linear correlation
assert numpy.allclose(c, correlate_numpy(A, B))
```

In general, circular correlation should only be used to analyze cyclic functions. However, later we demonstrate that for our specific application the results obtained from circular correlation do not significantly differ from linear correlation.

CuPy is an implementation of a NumPy-compatible multi-dimensional array on CUDA.

To run the FFT based circular correlation function on a GPU, we

- move the input numpy arrays to the current GPU device using
`cupy.asarray()`

- use FFT functions from
`cupy.fft`

instead of`numpy.fft`

- move the result array from the GPU device to the host using
`cupy.asnumpy()`

In [31]:

```
if not SKIP_CUDA:
import cupy.fft
def correlate_cufft(a, b):
"""Return circular correlation of two one-dimensional arrays."""
# move arrays to the current GPU device
a = cupy.asarray(a)
b = cupy.asarray(b)
a = cupy.fft.rfft(a)
b = cupy.fft.rfft(b)
a *= b.conj()
c = cupy.fft.irfft(a)
c = cupy.fft.fftshift(c)
# move array from GPU device to the host
return cupy.asnumpy(c)
if not SKIP_CUDA:
%timeit correlate_cufft(A, B)
%timeit map_threaded(correlate_cufft, [A, A], [B, B])
```

This GPU version runs slower than the CPU version due to the overhead of moving small arrays from/to the device and executing several small kernel functions.

The fft2d C library by Takuya Ooura provides efficient functions to compute Fast Fourier Transforms (FFT).

Cython makes it relatively easy to use C libraries from Python.

The file ** fftsg.c defines a function rdft**, which computes forward and invers DFT of real input arrays:

```
Fast Fourier/Cosine/Sine Transform
dimension :one
data length :power of 2
decimation :frequency
radix :split-radix
data :inplace
table :use
functions
rdft: Real Discrete Fourier Transform
function prototypes
void rdft(int, int, double *, int *, double *);
-------- Real DFT / Inverse of Real DFT --------
[definition]
<case1> RDFT
R[k] = sum_j=0^n-1 a[j]*cos(2*pi*j*k/n), 0<=k<=n/2
I[k] = sum_j=0^n-1 a[j]*sin(2*pi*j*k/n), 0<k<n/2
<case2> IRDFT (excluding scale)
a[k] = (R[0] + R[n/2]*cos(pi*k))/2 +
sum_j=1^n/2-1 R[j]*cos(2*pi*j*k/n) +
sum_j=1^n/2-1 I[j]*sin(2*pi*j*k/n), 0<=k<n
[usage]
<case1>
ip[0] = 0; // first time only
rdft(n, 1, a, ip, w);
<case2>
ip[0] = 0; // first time only
rdft(n, -1, a, ip, w);
[parameters]
n :data length (int)
n >= 2, n = power of 2
a[0...n-1] :input/output data (double *)
<case1>
output data
a[2*k] = R[k], 0<=k<n/2
a[2*k+1] = I[k], 0<k<n/2
a[1] = R[n/2]
<case2>
input data
a[2*j] = R[j], 0<=j<n/2
a[2*j+1] = I[j], 0<j<n/2
a[1] = R[n/2]
ip[0...*] :work area for bit reversal (int *)
length of ip >= 2+sqrt(n/2)
strictly,
length of ip >=
2+(1<<(int)(log(n/2+0.5)/log(2))/2).
ip[0],ip[1] are pointers of the cos/sin table.
w[0...n/2-1] :cos/sin table (double *)
w[],ip[] are initialized if ip[0] == 0.
[remark]
Inverse of
rdft(n, 1, a, ip, w);
is
rdft(n, -1, a, ip, w);
for (j = 0; j <= n - 1; j++) {
a[j] *= 2.0 / n;
}
```

We use Python's distutils module to compile the `fftsg.c`

C code into a static link library:

In [32]:

```
from distutils import ccompiler
compiler = ccompiler.new_compiler()
objects = compiler.compile(['fft2d/fftsg.c'], extra_postargs=['-fPIC', '-O2'])
compiler.create_static_lib(objects, 'ftt2d', output_dir='.')
```

To use the `ftt2d`

library from Cython, we need to include the declaration from the C header file and allocate temporary arrays:

In [33]:

```
%%cython --compile-args=-O2 -I. -l./ftt2d
#
#cython: boundscheck=False
#cython: wraparound=False
import numpy
from libc.stdlib cimport malloc, free
from libc.math cimport sqrt
cdef extern from 'fft2d.h':
void rdft(int n, int isgn, double *a, int *ip, double *w) nogil
def correlate_cython_fft2d(a, b):
"""Return circular correlation of two one-dimensional arrays."""
cdef:
ssize_t size = a.size
double scale = 2.0 / size
double[::1] a_
double[::1] b_
double *w_
int *ip_
int s
# copy input arrays. rdft computes in-place
result = numpy.copy(a)
a_ = result
b_ = numpy.copy(b)
with nogil:
# allocate cos/sin table
w_ = <double *>malloc((size // 2) * sizeof(double))
if not w_:
with gil:
raise MemoryError('could not allocate w_')
# allocate work area for bit reversal
ip_ = <int *>malloc((2 + <int>(sqrt((size//2) + 0.5))) * sizeof(int))
if not ip_:
with gil:
raise MemoryError('could not allocate ip_')
ip_[0] = 0
# forward DFT
rdft(size, 1, &b_[0], ip_, w_)
rdft(size, 1, &a_[0], ip_, w_)
# multiply by complex conjugate
multiply_conj(a_, b_, size)
# reverse DFT
rdft(size, -1, &a_[0], ip_, w_)
# shift and scale results
fftshift(a_, size, scale)
free(w_)
free(ip_)
return result
cdef void multiply_conj(double[::1] a, double[::1] b, ssize_t size) nogil:
"""In-place multiply a by complex conjugate of b."""
cdef:
ssize_t i
double ar, br, ai, bi
a[0] = a[0] * b[0]
a[1] = a[1] * b[1]
for i in range(2, size, 2):
ar = a[i]
ai = a[i+1]
br = b[i]
bi = b[i+1]
a[i] = ar * br + ai * bi
a[i+1] = ai * br - ar * bi
cdef void fftshift(double[::1] a, ssize_t size, double scale) nogil:
"""In-place shift zero-frequency component to center of spectrum."""
cdef:
ssize_t i
double t
size //= 2
for i in range(size):
t = a[i]
a[i] = a[i + size] * scale
a[i + size] = t * scale
```

In [34]:

```
assert numpy.allclose(correlate_cython_fft2d(A, B),
correlate_fft(A, B))
%timeit correlate_cython_fft2d(A, B)
%timeit map_threaded(correlate_cython_fft2d, [A, A], [B, B])
```

So far this is the fastest implementation of the correlate function. It can be run multi-threaded, although **for short input arrays the overhead of multi-threading is detrimental**.

Finally, we compare several implementations of the cross-correlate function using longer time series of 16384 samples:

In [35]:

```
import timeit
import numpy
from IPython.display import display
from ipywidgets import IntProgress
def time_functions(functions, size=2**14, max_workers=MAXCPUS):
"""Return runtimes of single and multi-threaded correlation functions."""
progress = IntProgress(min=0, max=2*len(functions))
display(progress)
a = numpy.random.random(size) - 0.5
b = numpy.random.random(size) - 0.5
ab = [a]*max_workers, [b]*max_workers
result = []
for function in functions:
try:
func = globals()[function]
t0 = timeit.Timer(lambda: func(a, b)).timeit(number=1)
number = max(2, int(1 / t0))
t0 = timeit.Timer(lambda: func(a, b)).timeit(number)
progress.value += 1
t1 = timeit.Timer(lambda: map_threaded(func, *ab,
max_workers=max_workers)
).timeit(number)
progress.value += 1
result.append(['{:.2f}'.format(t0 * 1e3 / number),
'{:.2f}'.format(t1 * 1e3 / number),
'{:.1f}'.format(t0/t1 * max_workers)])
except Exception:
result.append([float('nan')] * 3)
progress.close()
try:
import pandas
columns = ['1 thread / ms', '{} threads / ms'.format(max_workers),
'speedup']
return pandas.DataFrame(result, index=functions, columns=columns)
except ImportError:
return result
display(time_functions([
# 'correlate_python'
'correlate_numpy',
'correlate_cython',
'correlate_cython_omp',
'correlate_numba',
'correlate_numba_parallel',
'correlate_numba_cuda',
'correlate_scipy',
'correlate_fft',
'correlate_cufft',
'correlate_cython_fft2d']))
```

1 thread / ms | 6 threads / ms | speedup | |
---|---|---|---|

correlate_numpy | 46.65 | 49.06 | 5.7 |

correlate_cython | 152.29 | 161.85 | 5.6 |

correlate_cython_omp | 19.83 | 112.32 | 1.1 |

correlate_numba | 156.38 | 958.76 | 1.0 |

correlate_numba_parallel | 32.20 | 194.29 | 1.0 |

correlate_numba_cuda | 3.39 | 22.02 | 0.9 |

correlate_scipy | 1.30 | 4.16 | 1.9 |

correlate_fft | 0.57 | 2.73 | 1.3 |

correlate_cufft | 0.78 | 8.57 | 0.5 |

correlate_cython_fft2d | 0.33 | 1.81 | 1.1 |

The best implementation of the correlate function is **almost as fast as desired (~200 µs)** for the pCF analysis of images.

The correlate function could be further optimized by implementing it in C++ and using the `DFTi`

functions of the closed source Intel MKL library. Expect another 30% speed improvement.

Now that we have developed a fast cross-correlation function and learned techniques to speed-up Python code, we use them to analyze a small simulated time series of images.

The ** Simulation_Channel.bin** file contains the result of a simulation of fluorescent particles diffusing on a

The simulated images at **32,000 time steps** are stored contiguously as **16-bit unsigned integers** in the file.

The time samples are not stored contiguously. Accessing time series will be inefficient while spatial access will be fast: