Pair Correlation Function Analysis of Fluorescence Fluctuations in Big Image Time Series using Python

A tutorial using Python and scientific libraries to implement pair correlation function (pCF) analysis of a big time series of images from fluorescence microscopy on a personal computer.

by Christoph Gohlke

Updated on June, 28, 2022

Presented at the Big Data Image Processing & Analysis BigDIPA workshops 2016, 2017, and 2018

Supported by the National Institute of Health grant numbers 1R25EB022366-01 and 2P41GM103540-31

Released under the creative commons Attribution 4.0 International license


Spatiotemporal analysis of fluorescence fluctuations of optical microscopy measurements on living samples is a powerful tool that can provide insight into dynamic molecular processes of high biological relevance (Di Rienzo et al., 2016).

Pair correlation function (pCF) analysis of fluorescence microscopy image time series acquired using very fast cameras is one of the emerging and promising spatiotemporal techniques. However, it is computationally very expensive, and the analysis of big image data sets used to take hours or be impractical on personal computers.

In this tutorial, we use the open-source Python programming language and scientific libraries to compute the pCF analysis of a large time series of fluorescence images acquired with Selective Plane Illumination Microscopy (SPIM).

First, we implement a function to calculate the cross-correlation of two time series. We demonstrate the limitations of Python for efficient numerical computations and several ways to overcome them.

Next, we implement the pCF analysis of a small simulated image time series and optimize its speed by almost two orders of magnitude.

Finally, we use this pCF analysis function to analyze a multi-gigabyte image time series in small, overlapping chunks.

This is a tutorial on computing pair correlations in big images. Refer to the references for an introduction to the pair correlation method and how the computed pair correlations can be analyzed and visualized to study molecular flow in cells.


To follow this tutorial and run its code, the following prerequisites are needed:

Familiarity with

Minimum computer specifications

Python development environment

Tutorial source code and data files

Configure the runtime environment

The notebook depends on a few platform specific variables. Adjust the path to the example data and the path to a scratch directory where large intermediate and output files will be saved:

The Challenge

The challenge is to compute the pair correlation function analysis (pCF) of a large time series of images using Python on a personal computer in reasonable time.

Our dataset is a 34.5 GB time series of SPIM images of a biological cell as 35,000 TIFF files of 1024x512 16-bit greyscale samples each:

image timeseries

As part of molecular flow analysis, we need to cross-correlate the time series at each pixel of the image with the time series of all its neighbor pixels at a specific radius after correcting the time series for photobleaching:

image pair correlation function analysis

For a radius of 6 there are 32 neighbor pixels. To analyze an image of 1024x512 pixels, the number of cross-correlation calculations needed is 16,777,216 (including padding of the border by 6 pixels).

We would like to perform the cross-correlation calculations on the dataset in about 15 minutes.

We have a modern notebook or desktop computer with a minimum of 4 CPU cores, 8 GB RAM, and SSD, running a 64-bit OS with a scientific Python 3.6 distribution and a C compiler installed.

To compute 16,777,216 cross-correlations on 4 CPU cores in 15 minutes, a single cross-correlation computation must finish in about 215 µs (4*15*60*1000*1000/16777216).

Here is a pseudo code for the image pair correlation function (ipCF) analysis that will later be completed to be the reference implementation:


1. Implement a fast cross-correlation function

In this section, we

Along the way we learn

2. Implement pair correlation function analysis of small image time series

The techniques learned in the first section are applied to implement the pCF analysis of a small simulated image time series.

In this section, we

3. Implement out-of-core pair correlation function analysis of big image time series

In this section, we demonstrate some methods to process data that fit on disk but are larger than RAM, aka out-of-core processing. We

1. Implement a fast cross-correlation function

In this section, we implement a fast 1D cross-correlation function. The goal is to perform a single cross-correlation of two long (214) 1D sequences of real numbers in about 200 µs.

Definition of cross-correlation

The unnormalized discrete correlation of two sampled functions $a$ and $b$ as generally defined in signal processing texts (e.g. numpy.correlate and MATLAB xcorr) is:

$$c_{ab}[delay] = \sum_n a[n+delay] \cdot conj(b[n])$$

This is known as the sliding dot product.

The above definition is not unique and sometimes correlation may be defined differently (e.g. Wikipedia):

$$c'_{ab}[delay] = \sum_n conj(a[n]) \cdot b[n+delay],$$

which is related to the first definition:

$$c'_{ab}[delay] = c_{ab}[-delay] = c_{ba}[delay]$$

There is also an ambiguity for which range of $delay$s to compute the correlations and how to deal with out of bounds $delay$ values.

In this tutorial, $a$, $b$ and $c$ will have the same length. $delay$ will range from negative half length to positive half length.

Linear and circular cross-correlation

There are two ways of dealing with out of bounds $delay$ values:

Linear correlation can be calculated using the circular algorithm by zero padding the input arrays on both sides to twice their lengths:

Properties of cross-correlation

Some properties of the cross-correlation function are relevant for the pair correlation calculations:

$$c_{ab}[delay] = c_{ba}[-delay]$$ $$c_{aa}[delay] = c_{aa}[-delay]$$

Cross-correlation using pure Python

Let's implement the discrete linear correlation function for real input sequences in pure Python using the sliding dot product as defined previously:

$$c_{ab}[delay] = \sum_n a[n+delay] * b[n]$$

It is good practice to write tests to verify code:

The tests passed, no exception is raised.

Let's time the cross-correlation of two random sequences of length 8192, which are created using list comprehension:

This implementation is about 25,000 times slower than desired (~200 µs) for our pair correlation function image analysis task.

Plot auto-correlation

Using the matplotlib 2D plotting library, we plot the auto-correlation of a short random sequence and embed it into the Juyter notebook:

The autocorrelation is always symmetric with respect to positive and negative delays.

For long random sequences, the autocorrelation approaches an impulse function.

Interactively plot cross-correlation

Using IPython widgets, we plot the cross-correlation of two short random sequences with peak, where the peak in sequence $b$ is delayed with respect to sequence $a$:

A positive delay of the peak in sequence $b$ with respect to the peak in sequence $a$ shows as a peak at a negative delay in the cross-correlation of a and b ($a \star b$).


Let's try to use Python's concurrent.futures module to run several correlation functions in parallel on multiple CPU cores within the same process using threads (the smallest sequences of programmed instructions that can be managed independently by the operating system):

There is no improvement over executing the functions sequentially.

In CPython, the reference Python implementation we are using, only one thread can execute Python code at once due to the Global Interpreter Lock (GIL).

Threading is still an appropriate model in Python to run multiple I/O-bound tasks simultaneously.

We demonstrate later that Python functions implemented in C can release the GIL and be executed on multiple CPU cores using threads.

Cross-correlation using C

Let's compare the performance of the pure Python function to an implementation in C.

The speed of compiled C code is often used as a reference when comparing single-threaded performance.

Python's distutils.ccompiler module can be used to compile and link the C code:

The generated executable can be timed using Jupyter's magick:

The C program calculates the correlation about two orders of magnitudes faster than Python.

Python lists

So far, we have used Python's built-in list type to store sequences of floating-point numbers.

Depending on the Python implementation and platform:

Hence, Python lists are very inefficient for storing large number of homogeneous numerical data:

Why Python?

So far, we have shown that:

Note that this applies to CPython, the Python reference implementation, only. Other Python implementations (pypy, Jython, IronPython) might not have these limitations.

Why do we consider Python for big data image processing and analysis?

There are technical solutions to overcome those limitations:

Putting the limitations into perspective: besides CPU bound numerical calculations, there are many other tasks that are part of an efficient image processing pipeline:

As an image analyst or end user of imaging software, Python can be used

Cross-correlation using Numpy

Besides an efficient N-dimensional array object, the numpy library provides useful, optimized functions operating on the arrays, including random number capabilities and a correlate function.

Let's redefine the correlate and test functions using numpy:

Depending on how the numpy library was compiled, the numpy.correlate function is several times faster than our implementation in C.

Numpy uses an optimized version of the dot product (from the BLAS library) for calculating the sliding dot product.

The storage size of the numpy array is close to a C array. The overhead of less than 100 bytes matters only for scalar values and small arrays:

Numpy multi-threaded

Many numpy functions release the GIL and can be run in parallel on multiple CPU cores:

However, depending on the numpy build options, some functions might use threads internally, which can cause oversubscription and parallel execution to be slower than sequential. In such cases, when using numpy built with Intel's MKL library, set the MKL_NUM_THREADS environment variable to 1 to disable the libraries internal use of threads.

Cross-correlation using Cython

Cython is an optimizing static compiler for both the Python programming language and the extended Cython programming language.

Cython makes writing C extensions for Python and interfacing with numpy arrays and C libraries easy. Unlike numba, Cython requires an external, Python compatible C compiler.

Cython integrates well with the Jupyter Notebook:

Here we use Cython to implement the sliding dot product cross-correlation by

The Cython implementation is about as fast as the C implementation. Since the function releases the GIL, it can efficiently run in parallel on multiple CPU cores using multi-threading.

Cython's code analysis can be helpful in type annotating and optimizing code. Try it using %%cython --annotate at the top of the previous cell.

Using Cython with OpenMP

OpenMP (Open Multi-Processing) is an application programming interface (API) that supports multi-platform shared memory multiprocessing programming in C, C++, and Fortran.

Cython's prange function is implemented using OpenMP's "#pragma omp parallel for" directive.

We instruct the C compiler to use OpenMP by specifying extra platform specific compiler and linker arguments, which were defined at the beginning of the notebook in the OPENMP_ARGS variable:

Depending on the number of CPU cores, this function is several times faster than the previous implementation, but it can no longer be efficiently multi-threaded because the function already uses all CPU cores via OpenMP.

This implementation is slightly faster than using the vsldCorrExec1D function in direct mode from Intel's Math Kernel Library (MKL) (not shown).

Just-in-time compile Python code using Numba

The numba library can generate optimized machine code from Python code using the LLVM compiler infrastructure. No external C compiler is required.

We can simply decorate the pure Python functions with numba.jit and use numpy array instead of lists:

Compiling Python code with numba achieves speed comparable to C.

The function cannot be run in parallel on CPU cores even though the inner dot function releases the GIL. The loops need to be factored out to a function that does not hold the GIL.

Parallelized numba code

To further improve the Numba implementation of the correlation function, we

  1. move the code of the outer loop for index in range(size): into its own function named correlate_numba_jit and release the GIL

  2. use the numba.prange function to parallelize the outer loop

Just-in-time compile Python code to CUDA using Numba

The numba.cuda decorator can translate Python functions into PTX code, which execute on the CUDA hardware, e.g. a NVidia graphics card with thousands of cores.

In the CUDA execution model, a kernel function is executed once by each thread on a grid of blocks of threads. The grid can be 1, 2, or 3 dimensional. The threads within each block can synchronize and share memory. Thread blocks can execute independently in any order. The kernel function can determine the absolute position of the current thread in the entire grid of blocks.


Running this on a 1920 core GPU is about 20 times faster than on a single core CPU device. For longer input arrays, the GPU will be significant faster (see below).

Switching to frequency domain

So far, we have calculated the cross-correlation in the time domain using the sliding dot product.

For longer sequences, it is more efficient to use the cross-correlation theorem to calculate the cross-correlation in the frequency domain using Fourier transforms $\mathcal{F}\{\}$:

$$ a\star b = \mathcal{F}^{-1}(\mathcal{F}\{a\} \cdot (\mathcal{F}\{b\})^*)$$

Cross-correlation using Scipy's convolution function

The scipy library provides many efficient numerical routines such as numerical integration and optimization.

The scipy.signal.fftconvolve function uses zero-padding and the Fast Fourier Transform (FFT) according to the convolution theorem to calculate the convolution of two arrays.

Recall that the cross-correlation of functions $a(t)$ and $b(t)$ is equivalent to the convolution ($*$) of $a(t)$ and $b^*(-t)$:

$$ a\star b = a*b^*(-t),$$

It means that correlation can be calculated using convolution by reversing the second input array:

This is about an order of magnitude faster and scales much better with larger array sizes (not shown) than the multi-threaded Cython implementation of the sliding dot product.

Circular cross-correlation using FFT

Let's implement a circular correlation function using numpy's FFT functions according to the cross-correlation theorem:

$$ a\star b = \mathcal{F}^{-1}(\mathcal{F}\{a\} \cdot (\mathcal{F}\{b\})^*)$$

Recall that the circular correlation function can calculate the linear correlation by zero-padding the arrays (to twice the size for even length arrays):

In general, circular correlation should only be used to analyze cyclic functions. However, later we demonstrate that for our specific application the results obtained from circular correlation do not significantly differ from linear correlation.

Circular cross-correlation using CUDA FFT

CuPy is an implementation of a NumPy-compatible multi-dimensional array on CUDA.

To run the FFT based circular correlation function on a GPU, we

This GPU version runs slower than the CPU version due to the overhead of moving small arrays from/to the device and executing several small kernel functions.

Use Cython with a C FFT library

The fft2d C library by Takuya Ooura provides efficient functions to compute Fast Fourier Transforms (FFT).

Cython makes it relatively easy to use C libraries from Python.

The file fftsg.c defines a function rdft, which computes forward and inverse DFT of real input arrays:

Fast Fourier/Cosine/Sine Transform
    dimension   :one
    data length :power of 2
    decimation  :frequency
    radix       :split-radix
    data        :inplace
    table       :use

    rdft: Real Discrete Fourier Transform

function prototypes
    void rdft(int, int, double *, int *, double *);

-------- Real DFT / Inverse of Real DFT --------
        <case1> RDFT
            R[k] = sum_j=0^n-1 a[j]*cos(2*pi*j*k/n), 0<=k<=n/2
            I[k] = sum_j=0^n-1 a[j]*sin(2*pi*j*k/n), 0<k<n/2
        <case2> IRDFT (excluding scale)
            a[k] = (R[0] + R[n/2]*cos(pi*k))/2 + 
                   sum_j=1^n/2-1 R[j]*cos(2*pi*j*k/n) + 
                   sum_j=1^n/2-1 I[j]*sin(2*pi*j*k/n), 0<=k<n
            ip[0] = 0; // first time only
            rdft(n, 1, a, ip, w);
            ip[0] = 0; // first time only
            rdft(n, -1, a, ip, w);
        n              :data length (int)
                        n >= 2, n = power of 2
        a[0...n-1]     :input/output data (double *)
                            output data
                                a[2*k] = R[k], 0<=k<n/2
                                a[2*k+1] = I[k], 0<k<n/2
                                a[1] = R[n/2]
                            input data
                                a[2*j] = R[j], 0<=j<n/2
                                a[2*j+1] = I[j], 0<j<n/2
                                a[1] = R[n/2]
        ip[0...*]      :work area for bit reversal (int *)
                        length of ip >= 2+sqrt(n/2)
                        length of ip >= 
                        ip[0],ip[1] are pointers of the cos/sin table.
        w[0...n/2-1]   :cos/sin table (double *)
                        w[],ip[] are initialized if ip[0] == 0.
        Inverse of 
            rdft(n, 1, a, ip, w);
            rdft(n, -1, a, ip, w);
            for (j = 0; j <= n - 1; j++) {
                a[j] *= 2.0 / n;

We use Python's distutils module to compile the fftsg.c C code into a static link library:

To use the ftt2d library from Cython, we need to include the declaration from the C header file and allocate temporary arrays:

So far this is the fastest implementation of the correlate function. It can be run multi-threaded, although for short input arrays the overhead of multi-threading is detrimental.

Compare implementations

Finally, we compare several implementations of the cross-correlate function using longer time series of 16384 samples:

The best implementation of the correlate function is almost as fast as desired (~200 µs) for the pCF analysis of images.

The correlate function could be further optimized by implementing it in C++ and using the DFTi functions of the closed source Intel MKL library. Expect another 30% speed improvement.

2. Implement pair correlation function analysis of small image time series

Now that we have developed a fast cross-correlation function and learned techniques to speed-up Python code, we use them to analyze a small simulated time series of images.

Load and explore simulated images

The Simulation_Channel.bin file contains the result of a simulation of fluorescent particles diffusing on a 64x64 grid. The grid contains a diagonal, 300 nm wide channel, which restricts free diffusion. The file was produced using the Globals for Images · SimFCS software.

The simulated images at 32,000 time steps are stored contiguously as 16-bit unsigned integers in the file.

The time samples are not stored contiguously. Accessing time series will be inefficient while spatial access will be fast: