HyperFit API

HyperFit has one main class and associated members: LinFit.

We also provide a few classes containing example data, but these can be ignored if you want to get started fitting your own datasets.

LinFit Class

class hyperfit.linfit.LinFit(data, cov, weights=None, vertaxis=-1)

The LinFit class.

Implements methods to fit straight lines or planes to data with a covariance matrix, either finding just the best fit or running an MCMC. Has five main attributes, which are useful for accessing other information after running ‘optimize’, ‘emcee’ or ‘zeus’.

coords

N dimensional array holding the best-fitting HyperFit parameters in the data coordinates after a call to ‘optimize’ or one of the MCMC routines. Otherwise zeros.

Type: ndarray

normal

N dimensional array holding the best-fitting HyperFit parameters in the normal unit vectors after a call to ‘optimize’ or one of the MCMC routines. Otherwise zeros.

Type: ndarray

vert_scat

Holds the best-fitting scatter in the vertical axis of the data coordinates, after a call to ‘optimize’ or one of the MCMC routines. Otherwise zero.

Type: float

norm_scat

Holds the best-fitting scatter normal to the plane, after a call to ‘optimize’ or one of the MCMC routines. Otherwise zero.

Type: float

normal_bounds

Holds the prior bounds in the normal unit vectors, after bounds in the data coordinates have been passed to a call to ‘optimize’ or one of the MCMC routines. Otherwise None.

Type: sequence

Parameters

data (ndarray) – The N x D dimensional data vector.
cov (ndarray) – The N x N x D dimensional set of covariance matrices.
weights (ndarray, optional) – D dimensional array of weights, one for each data point. Default is None, in which case unit weights are assumed for each data point.
vertaxis (float, optional) – Specifies which of the coordinate axes is to be treated as the ‘vertical’ axis (i.e., ‘y’ for 2D data). Default is -1, in which case the last axis will be treated as vertical.
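
As an illustration of the expected shapes, the sketch below builds inputs for a 2D fit (N = 2 dimensions, D data points) with NumPy. The helper names make_inputs and build_linfit and the synthetic values are assumptions made for this example; only the LinFit(data, cov, weights=..., vertaxis=...) signature is taken from the description above.

```python
import numpy as np


def make_inputs(npoints=50, seed=0):
    """Build synthetic 2D inputs with the shapes described above (illustrative values)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 10.0, npoints)
    y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, npoints)

    data = np.vstack([x, y])          # shape (N, D) = (2, npoints)
    cov = np.zeros((2, 2, npoints))   # shape (N, N, D): one 2 x 2 matrix per data point
    cov[0, 0, :] = 0.1 ** 2           # variance of each x measurement
    cov[1, 1, :] = 0.5 ** 2           # variance of each y measurement
    weights = np.ones(npoints)        # shape (D,): unit weights
    return data, cov, weights


def build_linfit(data, cov, weights=None):
    """Construct a LinFit object; hyperfit is imported lazily, so it is only needed here."""
    from hyperfit.linfit import LinFit

    return LinFit(data, cov, weights=weights, vertaxis=-1)


data, cov, weights = make_inputs()
```

With HyperFit installed, build_linfit(data, cov, weights) would return an object ready for ‘optimize’ or one of the samplers.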

bessel_cochran(sigma)

Corrects the sample scatter to the population scatter using the Bessel and Cochran corrections.

The intrinsic scatter fit from the likelihood is generally not equal to the underlying population scatter. This is 1) because the standard deviation is estimated from a finite number of data samples, and 2) because the maximum likelihood value of the variance is not the maximum likelihood value of the standard deviation. These are corrected by the so-called Bessel and Cochran corrections respectively. This function applies these corrections based on the number of data points and dimensionality of the fitted plane.

Parameters: sigma (ndarray) – M dimensional array of scatter values.
Returns: sigma_corr – M dimensional array of corrected scatter values.
Return type: ndarray
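
The two corrections can be sketched as a single multiplicative factor. This is a minimal illustration that assumes the textbook forms of the Bessel and Cochran corrections with (number of data points minus number of dimensions) degrees of freedom; the function name is hypothetical and the exact expression used inside bessel_cochran may differ.

```python
import math


def bessel_cochran_factor(npoints, ndims):
    """Factor that multiplies the fitted scatter (assumes npoints - ndims degrees of freedom)."""
    dof = npoints - ndims
    if dof <= 0:
        raise ValueError("need more data points than dimensions")

    # Bessel correction: makes the variance estimate unbiased.
    bessel = math.sqrt(npoints / dof)

    # Cochran correction: makes the standard deviation estimate unbiased,
    # sqrt(dof / 2) * Gamma(dof / 2) / Gamma((dof + 1) / 2), evaluated via log-gamma.
    cochran = math.sqrt(dof / 2.0) * math.exp(
        math.lgamma(dof / 2.0) - math.lgamma((dof + 1.0) / 2.0)
    )
    return bessel * cochran


factor_small = bessel_cochran_factor(10, 2)       # noticeable correction for few points
factor_large = bessel_cochran_factor(100000, 2)   # approaches 1 for large samples
```

The factor is always greater than one and shrinks towards one as the number of data points grows, so the correction matters most for small samples.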

compute_cartesian(normal=None, norm_scat=None)

Converts from the normal vector to the data coordinates.

Parameters

normal (ndarray, optional) – N x M dimensional array of unit vectors. Default is None, which means use the values stored in the self.normal attribute.
norm_scat (ndarray, optional) – M dimensional array of scatter values normal to the plane. Default is None, which means use the values stored in the self.norm_scat attribute.

Returns

coords (ndarray) – N x M dimensional array of points in the data coordinates.
vert_scat (ndarray) – M dimensional array of scatters along the vertical axis of the data.

compute_normal(coords=None, vert_scat=None)

Converts from data coordinates to the normal vector.

Parameters

coords (ndarray, optional) – N x M dimensional array of coordinates. Default is None, which means use the values stored in the self.coords attribute.
vert_scat (ndarray, optional) – M dimensional array of scatter values. Default is None, which means use the values stored in the self.vert_scat attribute.

Returns

normal (ndarray) – N x M dimensional array of normal unit vectors.
norm_scat (ndarray) – M dimensional array of scatters normal to the N-1 dimensional plane.
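
The relationship between the two parameterisations can be sketched for a single model. The sketch assumes that the normal vector n points from the origin to the closest point on the plane, so that points x on the plane satisfy n·x = n·n, and that the vertical axis is the last one. That parameterisation and the helper names are assumptions for illustration and may not match what compute_normal and compute_cartesian do internally.

```python
import numpy as np


def line_to_normal(slopes, intercept, vert_scat):
    """Slopes and intercept along the last axis -> normal vector and normal scatter."""
    v = np.append(slopes, -1.0)               # direction perpendicular to the plane
    normal = -intercept * v / np.dot(v, v)    # closest point on the plane to the origin
    norm_scat = vert_scat / np.sqrt(np.dot(v, v))
    return normal, norm_scat


def normal_to_line(normal, norm_scat):
    """Normal vector and normal scatter -> slopes, intercept and vertical scatter."""
    n_vert = normal[-1]
    slopes = -normal[:-1] / n_vert
    intercept = np.dot(normal, normal) / n_vert
    vert_scat = norm_scat * np.sqrt(np.dot(normal, normal)) / abs(n_vert)
    return slopes, intercept, vert_scat


# Round trip for the line y = 2x + 1 with a vertical scatter of 0.5.
normal, norm_scat = line_to_normal([2.0], 1.0, 0.5)
slopes, intercept, vert_scat = normal_to_line(normal, norm_scat)
```

Under this assumption a plane passing through the origin (zero intercept) maps to a zero-length normal vector, so the round trip is not defined in that case.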

emcee(bounds, max_iter=100000, batchsize=1000, ntau=50.0, tautol=0.05, verbose=False)

Run an MCMC on the data using the emcee sampler (Foreman-Mackey et al., 2013).

The MCMC runs in batches, checking convergence at the end of each batch until either the chain is well converged or the maximum number of iterations has been reached. Convergence is defined as the point when the chain is longer than ntau autocorrelation lengths, and the estimate of the autocorrelation length varies less than tautol between batches. Burn-in is then removed from the samples, before they are flattened and returned.
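
The convergence test described above can be sketched as a small stand-alone function. The function name and the list-based inputs are hypothetical; in practice the autocorrelation lengths would come from the sampler at the end of each batch.

```python
def chain_converged(niter, tau, old_tau, ntau=50.0, tautol=0.05):
    """Return True if a chain of niter steps passes both convergence criteria.

    tau and old_tau hold the autocorrelation length of each parameter,
    estimated at the end of the current and previous batch respectively.
    """
    # Criterion 1: the chain is longer than ntau autocorrelation lengths.
    long_enough = all(niter > ntau * t for t in tau)
    # Criterion 2: the estimate changed by less than tautol (fractionally) between batches.
    stable = all(abs(t - t_old) / t < tautol for t, t_old in zip(tau, old_tau))
    return long_enough and stable


converged = chain_converged(10000, [100.0, 120.0], [102.0, 118.0])
too_short = chain_converged(4000, [100.0, 120.0], [102.0, 118.0])
unstable = chain_converged(10000, [100.0], [80.0])
```

Requiring both conditions guards against stopping early on a chain whose autocorrelation estimate happens to look stable only because it is still too short to be reliable.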

Parameters

bounds (sequence) – Bounds for variables. Must be a set of N + 1 (min, max) pairs, one for each free parameter, defining the finite lower and upper bounds. Passed straight through to scipy.optimize.differential_evolution, and used to set the prior for the MCMC sampler.
max_iter (int, optional) – The maximum number of MCMC iterations.
batchsize (int, optional) – The size of each batch, between which we check convergence.
ntau (float, optional) – The minimum number of autocorrelation lengths to require before convergence.
tautol (float, optional) – The maximum fractional deviation between successive values of the autocorrelation length required for convergence.
verbose (bool, optional) – Whether or not to print out convergence statistics and progress.

Returns

mcmc_samples (ndarray) – (N + 1) x Nsamples dimensional array containing the flattened, burnt-in MCMC samples. First N dimensions are the parameters of the plane. Last dimension is intrinsic scatter in the vertical axis.
mcmc_lnlike (ndarray) – Nsamples dimensional array containing the log-likelihood for each MCMC sample.

Raises

ValueError – If the number of pairs in ‘bounds’ is not equal to N + 1.

Note

Also calls ‘optimize’ and stores the results in the relevant class attributes if you want to access the best-fit.

get_sigmas(normal=None, norm_scat=None)

Calculates the offset between each data point and a plane in units of the standard deviation, i.e., in terms of x-sigma.

Parameters

normal (ndarray, optional) – N x M dimensional array of unit vectors. Default is None, which means use the values stored in the self.normal attribute.
norm_scat (ndarray, optional) – M dimensional array of scatter values normal to the plane. Default is None, which means use the values stored in the self.norm_scat attribute.

Returns

sigmas – D x M dimensional array containing the offsets of the D data points, in units of the standard deviation from the M models.

Return type

ndarray
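
A minimal sketch of such an offset for a single model is given below. It assumes the offset is measured perpendicular to the plane, that the plane satisfies n·x = n·n for normal vector n, and that the variance along the normal is the measurement covariance projected onto the normal plus the intrinsic scatter squared. These are assumptions for illustration; the function name is hypothetical and get_sigmas may differ in detail.

```python
import numpy as np


def sigma_offsets(data, cov, normal, norm_scat):
    """Offsets of D points from one plane, in units of the standard deviation.

    data has shape (N, D), cov has shape (N, N, D), normal has shape (N,).
    """
    nTn = np.dot(normal, normal)
    # Measurement variance of each point projected onto the normal direction.
    nTcn = np.einsum("i,ijk,j->k", normal, cov, normal)
    orth_var = norm_scat ** 2 + nTcn / nTn
    # Perpendicular distance of each point from the plane n.x = n.n.
    orth_offset = (normal @ data - nTn) / np.sqrt(nTn)
    return np.abs(orth_offset) / np.sqrt(orth_var)


# Line y = 2x + 1: one point on the line, one at unit perpendicular distance.
points = np.array([[0.0, 0.0], [1.0, 1.0 + np.sqrt(5.0)]])
no_errors = np.zeros((2, 2, 2))
sigmas = sigma_offsets(points, no_errors, np.array([-0.4, 0.2]), 0.5)
```

With no measurement errors and a normal scatter of 0.5, the first point sits at 0 sigma and the second at 2 sigma.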

optimize(bounds, tol=1e-06, verbose=False)

Find the best-fitting line/plane/hyperplane.

Fits the N x D dimensional self.data using scipy.optimize’s basinhopping + Nelder-Mead algorithm, which is generally robust.

Parameters

bounds (sequence) – Bounds for variables. Must be a set of N + 1 (min, max) pairs, one for each free parameter, defining the finite lower and upper bounds. Passed straight through to scipy.optimize.differential_evolution.
tol (float, optional) – The optimisation tolerance.
verbose (bool, optional) – If True prints out the full dictionary returned by scipy.optimize.basinhopping.

Returns

coords (ndarray) – N dimensional array containing the best-fitting parameters.
vert_scat (float) – The scatter in the vertical axis, corrected using the Bessel-Cochran correction.
log_posterior (float) – The log posterior at the best-fitting parameters.

Raises

ValueError – If the number of pairs in ‘bounds’ is not equal to N + 1.

Note

If you want to access the best-fitting parameters in the normal coordinates and the scatter normal to the plane, these are stored in the self.normal and self.norm_scat class attributes respectively following a call to optimize.
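
A typical call might look like the sketch below. The helpers check_bounds and run_fit are hypothetical, and the ordering of the bounds (plane parameters first, scatter last) is an assumption based on the ordering of the returned MCMC samples; the LinFit and optimize signatures, return values and attributes are those documented above. hyperfit is imported lazily, so the snippet can be read and run without it until run_fit is called.

```python
def check_bounds(bounds, ndims):
    """Validate that there are ndims + 1 finite-ordered (min, max) pairs."""
    if len(bounds) != ndims + 1:
        raise ValueError(f"expected {ndims + 1} (min, max) pairs, got {len(bounds)}")
    for lower, upper in bounds:
        if not lower < upper:
            raise ValueError("each pair must satisfy min < max")
    return list(bounds)


def run_fit(data, cov, bounds):
    """Fit the best line/plane and return data-coordinate and normal-coordinate results."""
    from hyperfit.linfit import LinFit

    hf = LinFit(data, cov)
    coords, vert_scat, log_posterior = hf.optimize(check_bounds(bounds, data.shape[0]))
    # The normal-coordinate results are stored on the instance after optimize.
    return coords, vert_scat, log_posterior, hf.normal, hf.norm_scat


# Illustrative bounds for a 2D fit: slope, intercept, then vertical scatter.
bounds_2d = [(-5.0, 5.0), (-10.0, 10.0), (1.0e-5, 5.0)]
checked = check_bounds(bounds_2d, 2)
```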

snowline(bounds, num_global_samples=400, num_gauss_samples=400, max_ncalls=100000, min_ess=400, max_improvement_loops=4, heavytail_laplaceapprox=True, verbose=False)

Get posterior samples and Bayesian evidence using the snowline package (https://johannesbuchner.github.io/snowline/).

Input kwargs are passed directly to snowline and are named the same, so see the snowline documentation for more details on these. Although snowline runs its own optimisation, self.optimize is also called to ensure some useful attributes are stored, and for consistency with the emcee and zeus functions.

Parameters

bounds (sequence) – Bounds for variables. Must be a set of N + 1 (min, max) pairs, one for each free parameter, defining the finite lower and upper bounds. Passed straight through to scipy.optimize.differential_evolution, and used to set the prior for the sampler.
num_global_samples (int, optional) – Number of samples to draw from the prior.
num_gauss_samples (int, optional) – Number of samples to draw from initial Gaussian likelihood approximation before improving the approximation.
max_ncalls (int, optional) – Maximum number of likelihood function evaluations.
min_ess (int, optional) – Number of effective samples to draw.
max_improvement_loops (int, optional) – Number of times the proposal should be improved.

Returns

mcmc_samples (ndarray) – (N + 1) x Nsamples dimensional array containing the flattened, burnt-in MCMC samples. First N dimensions are the parameters of the plane. Last dimension is intrinsic scatter in the vertical axis.
mcmc_lnlike (ndarray) – Nsamples dimensional array containing the log-likelihood for each MCMC sample.
logz (float) – The Bayesian evidence.
logzerr (float) – Error on the Bayesian evidence.

Raises

ValueError – If the number of pairs in ‘bounds’ is not equal to N + 1.

Note

Also calls ‘optimize’ and stores the results in the relevant class attributes if you want to access the best-fit.

zeus(bounds, max_iter=100000, batchsize=1000, ntau=10.0, tautol=0.05, verbose=False)

Run an MCMC on the data using the zeus sampler (Karamanis and Beutler 2020).

The MCMC runs in batches, checking convergence at the end of each batch until either the chain is well converged or the maximum number of iterations has been reached. Convergence is defined as the point when the chain is longer than ntau autocorrelation lengths, and the estimate of the autocorrelation length varies less than tautol between batches. Burn-in is then removed from the samples, before they are flattened and returned.

Parameters

bounds (sequence) – Bounds for variables. Must be a set of N + 1 (min, max) pairs, one for each free parameter, defining the finite lower and upper bounds. Passed straight through to scipy.optimize.differential_evolution, and used to set the prior for the MCMC sampler.
max_iter (int, optional) – The maximum number of MCMC iterations.
batchsize (int, optional) – The size of each batch, between which we check convergence.
ntau (float, optional) – The minimum number of autocorrelation lengths to require before convergence.
tautol (float, optional) – The maximum fractional deviation between successive values of the autocorrelation length required for convergence.
verbose (bool, optional) – Whether or not to print out convergence statistics and progress.

Returns

mcmc_samples (ndarray) – (N + 1) x Nsamples dimensional array containing the flattened, burnt-in MCMC samples. First N dimensions are the parameters of the plane. Last dimension is intrinsic scatter in the vertical axis.
mcmc_lnlike (ndarray) – Nsamples dimensional array containing the log-likelihood for each MCMC sample.

Raises

ValueError – If the number of pairs in ‘bounds’ is not equal to N + 1.

Note

Also calls ‘optimize’ and stores the results in the relevant class attributes if you want to access the best-fit.

Data Classes

class hyperfit.data.FitData

Abstract base class for the test data included with HyperFit.

Meant to be accessed only through the various listed data subclasses. The attributes below are inherited by these subclasses.

xs

The N x D dimensional data vector.

Type: ndarray

cov

The N x N x D dimensional set of covariance matrices.

Type: ndarray

weights

D dimensional array of weights, one for each data point. Default is None, in which case unit weights are assumed for each data point.

Type: ndarray, optional

plot(linfit=None)

Produces a plot of the data where implemented, colour-coded by weight. If a LinFit class instance is specified, it will also plot the best fit and instead colour-code the points by sigma offset.

Parameters: linfit (object, optional) – LinFit class instance from which the best fit to the data can be accessed, and the sigma offsets computed.
Raises: NotImplementedError – If called from a subclass with 3-D data (MJB or FP6dFGS).

class hyperfit.data.GAMAsmVsize

GAMA mass-size relation data from Lange et al., 2015.

Contains a 2 x 1854 array of log(mass) in solar masses and log(effective_radius) in kpc, along with a diagonal covariance matrix (uncorrelated measurement pairs) and weights.