mzapy.MZA

This object serves as the primary interface for interacting with raw data in the MZA format.

Module Reference

Initialization

class mzapy.MZA(h5_file, cache_metadata_headers=['Scan', 'MSLevel', 'RetentionTime', 'IonMobilityFrame', 'IonMobilityBin', 'IonMobilityTime'], ms1lvl=1, io_threads=8, cache_scan_data=False, mza_version='new')

object for accessing MS data from HDF5 formatted files

Attributes:
mza_version : str

version of the underlying .mza format

h5_file : str

file name (and optionally path to) the source HDF5 file

h5 : h5py.File

h5py File object for extracting information from the HDF5 file

mz_full : numpy.ndarray(float)

full m/z array

min_mz : float

minimum m/z value

max_mz : float

maximum m/z value

idx : numpy.ndarray(int)

indices of all scans

dt : numpy.ndarray(float)

drift time of all scans

min_dt : float

minimum drift time

max_dt : float

maximum drift time

rt : numpy.ndarray(float)

retention time of all scans

min_rt : float

minimum retention time

max_rt : float

maximum retention time

imb : numpy.ndarray(int)

ion mobility bin of all scans

mslvl : numpy.ndarray(int)

MS level of all scans

ms1_frame2frame_idx : dict(:)

mapping ?

frame_idx2ms1_frame : dict(:)

mapping ?

Methods

close()

closes the underlying h5py File object; trying to read data after closing will cause errors

collect_atd_arrays_by_rt_mz(mz_min, mz_max, ...)

loads ATD (dt, intensity) as arrays for target mass within an RT window

collect_dtmz_arrays_by_rt(rt_min, rt_max[, ...])

loads DTMZ data (dt, mz, intensity) as arrays within a target RT range, using optional m/z and DT bounds

collect_ms1_arrays_by_dt(dt_min, dt_max[, ...])

loads MS1 spectrum (m/z, intensity) as arrays for m/z and DT range

collect_ms1_arrays_by_rt(rt_min, rt_max[, ...])

loads MS1 spectrum (m/z, intensity) as arrays for m/z and RT range, ignoring DT if present

collect_ms1_arrays_by_rt_dt(rt_min, rt_max, ...)

loads MS1 spectrum (m/z, intensity) as arrays for m/z, RT, and DT range

collect_ms1_df_by_rt(rt_min, rt_max[, ...])

collects MS1 data within an RT range; the IM dimension is collapsed/ignored

collect_ms1_df_by_rt_dt(rt_min, rt_max, ...)

collects MS1 data within RT and DT ranges

collect_ms2_arrays_by_rt(rt_min, rt_max[, ...])

loads MS2 spectrum (m/z, intensity) as arrays for m/z and RT range, ignoring DT if present

collect_ms2_arrays_by_rt_dt(rt_min, rt_max, ...)

loads MS2 spectrum (m/z, intensity) as arrays for m/z, RT, and DT range

collect_ms2_df_by_rt(rt_min, rt_max[, ...])

collects MS2 data within an RT range; the IM dimension is collapsed/ignored

collect_ms2_df_by_rt_dt(rt_min, rt_max, ...)

collects MS2 data within RT and DT ranges

collect_rtdt_arrays_by_mz(mz_min, mz_max[, ...])

loads RTDT data (rt, dt, intensity) as arrays within a target m/z range, using optional RT and DT bounds

collect_rtmz_arrays([rt_bounds, mz_bounds, ...])

loads RTMZ data (rt, mz, intensity) as arrays ignoring/collapsing DT, using optional m/z and RT bounds

collect_rtmz_arrays_by_dt(dt_min, dt_max[, ...])

loads RTMZ data (rt, mz, intensity) as arrays within a target DT range, using optional m/z and RT bounds

collect_xic_arrays_by_mz(mz_min, mz_max[, ...])

loads XIC (retention time, intensity) as arrays for m/z range, ignoring DT if present

collect_xic_arrays_by_mz_dt(mz_min, mz_max, ...)

loads XIC (RT, intensity) as arrays for target mass within a DT window

load_scan_cache([scan_cache_file, ...])

loads the scan cache from file

metadata(header)

retrieve the array of metadata from the specified header

save_scan_cache([scan_cache_file])

saves the scan cache to file for fast loading later

mzapy.MZA.__init__(self, h5_file, cache_metadata_headers=['Scan', 'MSLevel', 'RetentionTime', 'IonMobilityFrame', 'IonMobilityBin', 'IonMobilityTime'], ms1lvl=1, io_threads=8, cache_scan_data=False, mza_version='new')

init a new MZA instance using the path to the source HDF5 file

Parameters:
h5_file : str

file name (and optionally path to) the source HDF5 file

cache_metadata_headers : list(str) or str, default=_CACHE_METADATA_HEADERS

specifies which metadata headers to cache in memory for faster access: [] caches none, 'all' caches all; by default the set defined in _config._CACHE_METADATA_HEADERS is used

ms1lvl : int, default=1

mslvl value corresponding to MS1 data (1 if MSMS data is present, 0 otherwise)

io_threads : int, default=8

number of threads to use for performing IO tasks

cache_scan_data : bool, default=False

whether to cache extracted scan data for faster subsequent access

mza_version : str, default='new'

temporary measure for indicating whether the scan indexing needs to account for partitioned scan data ('new') or not ('old'). This is only temporary: at some point the mza version will be encoded as metadata in the file and this accommodation can be made automatically.

Note

The cache_metadata_headers kwarg is used to control which metadata headers are cached in memory for faster access. Caching more metadata headers reduces access time, but there is a tradeoff with the memory footprint of the MZA instance.

controlling which headers are cached at initialization
from mzapy import MZA

# load some data from ./data/example.h5, cache the default metadata headers
h5_cache_std = MZA('./data/example.h5')

# load some data from ./data/example.h5, do not cache any metadata headers (smallest memory footprint)
h5_cache_none = MZA('./data/example.h5', cache_metadata_headers=[])

# load some data from ./data/example.h5, cache all metadata headers (largest memory footprint)
h5_cache_all = MZA('./data/example.h5', cache_metadata_headers='all')
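
The memory side of the tradeoff described in the note above can be estimated roughly. The sketch below assumes every cached header holds one fixed-size value per scan (8 bytes here); that layout is an illustrative assumption, not something taken from mzapy, and `estimate_cache_bytes` is a hypothetical helper.

```python
def estimate_cache_bytes(n_scans, n_headers, bytes_per_value=8):
    """Rough footprint of cached metadata.

    Assumes one fixed-size value per scan per header (an assumption for
    illustration only, not mzapy's documented storage layout).
    """
    return n_scans * n_headers * bytes_per_value


# six headers (the size of the default list) over one million scans
default_bytes = estimate_cache_bytes(1_000_000, 6)
print(f"approx. {default_bytes / 1e6:.0f} MB")
```

Under these assumptions the estimate scales linearly with both the number of scans and the number of cached headers, so passing an empty list removes the cost entirely.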

mzapy.MZA.close(self)

closes the underlying h5py File object; trying to read data after closing will cause errors
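
Because reads fail after `close()`, it can help to tie the call to a `with` block so it always runs, even when an exception is raised. The sketch below uses the standard library's `contextlib.closing`, which works with any object that has a `close()` method. `FakeReader` is a hypothetical stand-in for an MZA instance so that the example runs without a data file.

```python
from contextlib import closing


class FakeReader:
    """Hypothetical stand-in for an MZA instance (not part of mzapy)."""

    def __init__(self):
        self.closed = False

    def metadata(self, header):
        if self.closed:
            raise RuntimeError("reader is closed")
        return [0.1, 0.2, 0.3]

    def close(self):
        self.closed = True


def first_value(reader, header):
    # closing() calls reader.close() when the block exits, even on error
    with closing(reader) as r:
        return r.metadata(header)[0]


reader = FakeReader()
value = first_value(reader, "RetentionTime")
```

With a real file the same pattern would read `with closing(MZA('./data/example.h5')) as h5: ...`. Whether MZA supports the `with` statement directly is not stated in this reference.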

Scan Caching

mzapy.MZA.load_scan_cache(self, scan_cache_file=None, ignore_no_cache_file=False)

loads the scan cache from file

Parameters:
scan_cache_file : str, optional

if provided, override the default scan cache file name

ignore_no_cache_file : bool, default=False

do not raise an exception if the cache file is not found; silently use {} as the scan cache instead

mzapy.MZA.save_scan_cache(self, scan_cache_file=None)

saves the scan cache to file for fast loading later

Parameters:
scan_cache_file : str, optional

if provided, override the default scan cache file name
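
The save/load behavior described above, including the `ignore_no_cache_file` fallback to `{}`, follows a common pattern. The sketch below shows that pattern in isolation. The pickle-based file format and the helper names `save_cache` and `load_cache` are assumptions for illustration; this is not mzapy's actual implementation or cache format.

```python
import os
import pickle
import tempfile


def save_cache(cache, path):
    """Write the cache dict to disk (pickle is an assumed format)."""
    with open(path, "wb") as f:
        pickle.dump(cache, f)


def load_cache(path, ignore_missing=False):
    """Read the cache dict; optionally fall back to {} if the file is absent."""
    if not os.path.isfile(path):
        if ignore_missing:
            return {}
        raise FileNotFoundError(path)
    with open(path, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "example.cache")
    missing = load_cache(cache_path, ignore_missing=True)  # no file yet
    save_cache({"scan_42": [1, 2, 3]}, cache_path)
    restored = load_cache(cache_path)
```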

Metadata

mzapy.MZA.metadata(self, header)

retrieve the array of metadata from the specified header

Parameters:
header : str

specify which metadata header to fetch

Returns:
metadata_column : numpy.ndarray(?)

array of metadata (could be type float, int, or bytes depending on the header)
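
Since the returned values may be bytes for some headers, downstream code often wants to normalize them to text. The sketch below decodes bytes entries and passes numbers through unchanged. `decode_metadata` is a hypothetical helper, and UTF-8 is an assumed encoding.

```python
def decode_metadata(values, encoding="utf-8"):
    """Return a list where bytes entries are decoded to str.

    Non-bytes entries (e.g. int or float) are passed through unchanged.
    The encoding is an assumption; adjust it if the file uses another one.
    """
    return [v.decode(encoding) if isinstance(v, bytes) else v for v in values]


labels = decode_metadata([b"positive", b"negative"])
mixed = decode_metadata([b"a", 1, 2.5])
```

A call such as `decode_metadata(h5.metadata(header))` would apply this to a real column, where `h5` is an open MZA instance.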

Extracted Ion Chromatograms

mzapy.MZA.collect_xic_arrays_by_mz(self, mz_min, mz_max, rt_bounds=None, mslvl=1, verbose=False)

loads XIC (retention time, intensity) as arrays for m/z range, ignoring DT if present

Parameters:
mz_min : float

lower m/z bound

mz_max : float

upper m/z bound

rt_bounds : tuple(float, float), optional

(lower, upper) RT bounds

mslvl : int, default=1

MS level to select from

verbose : bool, default=False

print information about the progress

Returns:
xic_rt : numpy.ndarray(float)

retention time component of XIC

xic_int : numpy.ndarray(int)

intensity component of XIC
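
The `mz_min` and `mz_max` arguments are often derived from a target m/z and a tolerance in ppm. The sketch below shows that arithmetic. `mz_window` is a hypothetical helper, not part of mzapy, and the symmetric ppm window is one common convention rather than a requirement of the method.

```python
def mz_window(target_mz, ppm):
    """Return (mz_min, mz_max) for a symmetric ppm tolerance around target_mz."""
    delta = target_mz * ppm * 1e-6
    return target_mz - delta, target_mz + delta


# a 20 ppm window around m/z 500 is +/- 0.01
mz_min, mz_max = mz_window(500.0, 20.0)

# with an open MZA instance h5, the bounds would be passed on as:
# xic_rt, xic_int = h5.collect_xic_arrays_by_mz(mz_min, mz_max)
```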

mzapy.MZA.collect_xic_arrays_by_mz_dt(self, mz_min, mz_max, dt_min, dt_max, rt_bounds=None, mslvl=1, verbose=False)

loads XIC (RT, intensity) as arrays for target mass within a DT window

Parameters:
mz_min : float

lower m/z bound

mz_max : float

upper m/z bound

dt_min : float

lower DT bound

dt_max : float

upper DT bound

rt_bounds : tuple(float, float), optional

(lower, upper) RT bounds

mslvl : int, default=1

MS level to select from

verbose : bool, default=False

print information about the progress

Returns:
xic_rt : numpy.ndarray(float)

retention time component of XIC

xic_int : numpy.ndarray(int)

intensity component of XIC

Arrival Time Distributions

mzapy.MZA.collect_atd_arrays_by_rt_mz(self, mz_min, mz_max, rt_min, rt_max, dt_bounds=None, mslvl=1, verbose=False)

loads ATD (dt, intensity) as arrays for target mass within an RT window

Parameters:
mz_min : float

lower m/z bound

mz_max : float

upper m/z bound

rt_min : float

lower RT bound

rt_max : float

upper RT bound

dt_bounds : tuple(float, float), optional

(lower, upper) drift time bounds; tightening the DT bounds around the area of interest reduces extraction time

mslvl : int, default=1

MS level to select from

verbose : bool, default=False

print information about the progress

Returns:
atd_dt : numpy.ndarray(float)

drift time component of ATD

atd_int : numpy.ndarray(int)

intensity component of ATD
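
A typical next step after extracting an ATD is to locate its apex. The sketch below does this for two parallel sequences such as `atd_dt` and `atd_int`. `atd_apex` is a hypothetical helper, and it assumes both sequences have the same length, with one intensity per drift time.

```python
def atd_apex(atd_dt, atd_int):
    """Return (drift_time, intensity) at the most intense point, or None if empty."""
    if len(atd_dt) != len(atd_int):
        raise ValueError("drift time and intensity must have the same length")
    if len(atd_int) == 0:
        return None
    # index of the largest intensity; ties resolve to the first occurrence
    i = max(range(len(atd_int)), key=lambda k: atd_int[k])
    return atd_dt[i], atd_int[i]


apex = atd_apex([10.0, 10.5, 11.0], [5, 40, 12])
```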

MS1 Spectra

mzapy.MZA.collect_ms1_arrays_by_rt(self, rt_min, rt_max, mz_bounds=None)

loads MS1 spectrum (m/z, intensity) as arrays for m/z and RT range, ignoring DT if present

Parameters:
rt_min : float

lower RT bound

rt_max : float

upper RT bound

mz_bounds : tuple(float, float), optional

(lower, upper) m/z bounds; filtering is applied after extraction, so it has no effect on extraction time

Returns:
ms1_mz : numpy.ndarray(float)

m/z component of MS1 spectrum

ms1_int : numpy.ndarray(int)

intensity component of MS1 spectrum
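
The statement that `mz_bounds` filters data after extraction means the full spectrum is read first and then trimmed. The sketch below illustrates that kind of post-extraction filter on plain sequences. `filter_by_mz` is a hypothetical helper, and treating both bounds as inclusive is an assumption, since this reference does not say how the edges are handled.

```python
def filter_by_mz(mz, intensity, mz_bounds=None):
    """Keep only points whose m/z lies within mz_bounds (assumed inclusive)."""
    if mz_bounds is None:
        return list(mz), list(intensity)
    lower, upper = mz_bounds
    kept = [(m, i) for m, i in zip(mz, intensity) if lower <= m <= upper]
    return [m for m, _ in kept], [i for _, i in kept]


mz_kept, int_kept = filter_by_mz(
    [100.0, 200.0, 300.0], [1, 2, 3], mz_bounds=(150.0, 300.0)
)
```

Because the trimming happens on data that has already been read, narrowing `mz_bounds` reduces the size of the result but not the time spent reading.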

mzapy.MZA.collect_ms1_arrays_by_dt(self, dt_min, dt_max, mz_bounds=None)

loads MS1 spectrum (m/z, intensity) as arrays for m/z and DT range

Parameters:
dt_min : float

lower DT bound

dt_max : float

upper DT bound

mz_bounds : tuple(float, float), optional

(lower, upper) m/z bounds; filtering is applied after extraction, so it has no effect on extraction time

Returns:
ms1_mz : numpy.ndarray(float)

m/z component of MS1 spectrum

ms1_int : numpy.ndarray(int)

intensity component of MS1 spectrum

mzapy.MZA.collect_ms1_arrays_by_rt_dt(self, rt_min, rt_max, dt_min, dt_max, mz_bounds=None, verbose=False)

loads MS1 spectrum (m/z, intensity) as arrays for m/z, RT, and DT range

Parameters:
rt_min : float

lower RT bound

rt_max : float

upper RT bound

dt_min : float

lower bound of DT window to select data from

dt_max : float

upper bound of DT window to select data from

mz_bounds : tuple(float, float), optional

(lower, upper) m/z bounds; filtering is applied after extraction, so it has no effect on extraction time

verbose : bool, default=False

print information about the progress

Returns:
ms1_mz : numpy.ndarray(float)

m/z component of MS1 spectrum

ms1_int : numpy.ndarray(int)

intensity component of MS1 spectrum

mzapy.MZA.collect_ms1_df_by_rt(self, rt_min, rt_max, mz_bounds=None, verbose=False)

collects MS1 data within an RT range; the IM dimension is collapsed/ignored

Parameters:
rt_min : float

lower bound of RT window to select data from

rt_max : float

upper bound of RT window to select data from

mz_bounds : tuple(float, float), optional

(lower, upper) m/z bounds

verbose : bool, default=False

print information about the progress

Returns:
data : pandas.DataFrame

data frame with columns mzbin, mz, intensity, rt, frame

mzapy.MZA.collect_ms1_df_by_rt_dt(self, rt_min, rt_max, dt_min, dt_max, mz_bounds=None, verbose=False)

collects MS1 data within RT and DT ranges

Parameters:
rt_min : float

lower bound of RT window to select data from

rt_max : float

upper bound of RT window to select data from

dt_min : float

lower bound of DT window to select data from

dt_max : float

upper bound of DT window to select data from

mz_bounds : tuple(float, float), optional

(lower, upper) m/z bounds

verbose : bool, default=False

print information about the progress

Returns:
data : pandas.DataFrame

data frame with columns mzbin, mz, intensity, rt, dt, frame

MS2 Spectra

mzapy.MZA.collect_ms2_arrays_by_rt_dt(self, rt_min, rt_max, dt_min, dt_max, mz_bounds=None, verbose=False)

loads MS2 spectrum (m/z, intensity) as arrays for m/z, RT, and DT range

Parameters:
rt_min : float

lower RT bound

rt_max : float

upper RT bound

dt_min : float

lower bound of DT window to select data from

dt_max : float

upper bound of DT window to select data from

mz_bounds : tuple(float, float), optional

(lower, upper) m/z bounds; filtering is applied after extraction, so it has no effect on extraction time

verbose : bool, default=False

print information about the progress

Returns:
ms2_mz : numpy.ndarray(float)

m/z component of MS2 spectrum

ms2_int : numpy.ndarray(int)

intensity component of MS2 spectrum
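
Extracted spectra are frequently rescaled to the base peak before they are compared or plotted. The sketch below shows that normalization for an intensity sequence such as the one returned here. `to_relative_intensity` is a hypothetical helper and the 0 to 100 scale is a common convention, not something mzapy prescribes.

```python
def to_relative_intensity(intensities):
    """Scale intensities so that the most intense peak equals 100."""
    base = max(intensities, default=0)
    if base == 0:
        # empty or all-zero spectrum: nothing to scale against
        return [0.0 for _ in intensities]
    return [100.0 * i / base for i in intensities]


relative = to_relative_intensity([50, 200, 100])
```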

mzapy.MZA.collect_ms2_df_by_rt(self, rt_min, rt_max, mz_bounds=None, verbose=False)

collects MS2 data within an RT range; the IM dimension is collapsed/ignored

Parameters:
rt_min : float

lower bound of RT window to select data from

rt_max : float

upper bound of RT window to select data from

mz_bounds : tuple(float, float), optional

(lower, upper) m/z bounds

verbose : bool, default=False

print information about the progress

Returns:
data : pandas.DataFrame

data frame with columns mzbin, mz, intensity, rt

mzapy.MZA.collect_ms2_df_by_rt_dt(self, rt_min, rt_max, dt_min, dt_max, mz_bounds=None, verbose=False)

collects MS2 data within RT and DT ranges

Parameters:
rt_min : float

lower bound of RT window to select data from

rt_max : float

upper bound of RT window to select data from

dt_min : float

lower bound of DT window to select data from

dt_max : float

upper bound of DT window to select data from

mz_bounds : tuple(float, float), optional

(lower, upper) m/z bounds

verbose : bool, default=False

print information about the progress

Returns:
data : pandas.DataFrame

data frame with columns mzbin, mz, intensity, rt, dt

2-Dimensional Data

mzapy.MZA.collect_rtmz_arrays(self, rt_bounds=None, mz_bounds=None, verbose=False)

loads RTMZ data (rt, mz, intensity) as arrays ignoring/collapsing DT, using optional m/z and RT bounds

Parameters:
rt_bounds : tuple(float, float), optional

(lower, upper) retention time bounds; tightening the RT bounds around the area of interest reduces extraction time

mz_bounds : tuple(float, float), optional

(lower, upper) m/z bounds; filtering is applied after extraction, so it has no effect on extraction time

verbose : bool, default=False

print information about the progress

Returns:
rtmz_rt : numpy.ndarray(float)

retention time component of RTMZ data

rtmz_mz : numpy.ndarray(float)

m/z component of RTMZ data

rtmz_int : numpy.ndarray(int)

intensity component of RTMZ data
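
One way to work with the three returned components is to collapse the m/z dimension and sum intensity per retention time, which gives a chromatogram. The sketch below assumes the components are parallel sequences with one entry per data point; that layout is an assumption for illustration. `collapse_to_chromatogram` is a hypothetical helper.

```python
from collections import defaultdict


def collapse_to_chromatogram(rtmz_rt, rtmz_int):
    """Sum intensities that share a retention time; return sorted (rt, total) lists."""
    totals = defaultdict(int)
    for rt, intensity in zip(rtmz_rt, rtmz_int):
        totals[rt] += intensity
    rts = sorted(totals)
    return rts, [totals[rt] for rt in rts]


chrom_rt, chrom_int = collapse_to_chromatogram([1.0, 1.0, 2.0], [5, 7, 3])
```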

mzapy.MZA.collect_rtmz_arrays_by_dt(self, dt_min, dt_max, rt_bounds=None, mz_bounds=None, verbose=False)

loads RTMZ data (rt, mz, intensity) as arrays within a target DT range, using optional m/z and RT bounds

Parameters:
dt_min : float

lower DT bound

dt_max : float

upper DT bound

rt_bounds : tuple(float, float), optional

(lower, upper) retention time bounds; tightening the RT bounds around the area of interest reduces extraction time

mz_bounds : tuple(float, float), optional

(lower, upper) m/z bounds; filtering is applied after extraction, so it has no effect on extraction time

verbose : bool, default=False

print information about the progress

Returns:
rtmz_rt : numpy.ndarray(float)

retention time component of RTMZ data

rtmz_mz : numpy.ndarray(float)

m/z component of RTMZ data

rtmz_int : numpy.ndarray(int)

intensity component of RTMZ data

mzapy.MZA.collect_dtmz_arrays_by_rt(self, rt_min, rt_max, dt_bounds=None, mz_bounds=None, verbose=False)

loads DTMZ data (dt, mz, intensity) as arrays within a target RT range, using optional m/z and DT bounds

Parameters:
rt_min : float

lower RT bound

rt_max : float

upper RT bound

dt_bounds : tuple(float, float), optional

(lower, upper) drift time bounds; tightening the DT bounds around the area of interest reduces extraction time

mz_bounds : tuple(float, float), optional

(lower, upper) m/z bounds; filtering is applied after extraction, so it has no effect on extraction time

verbose : bool, default=False

print information about the progress

Returns:
dtmz_dt : numpy.ndarray(float)

drift time component of DTMZ data

dtmz_mz : numpy.ndarray(float)

m/z component of DTMZ data

dtmz_int : numpy.ndarray(int)

intensity component of DTMZ data

mzapy.MZA.collect_rtdt_arrays_by_mz(self, mz_min, mz_max, rt_bounds=None, dt_bounds=None, verbose=False)

loads RTDT data (rt, dt, intensity) as arrays within a target m/z range, using optional RT and DT bounds

Parameters:
mz_min : float

lower m/z bound

mz_max : float

upper m/z bound

rt_bounds : tuple(float, float), optional

(lower, upper) retention time bounds; tightening the RT bounds around the area of interest reduces extraction time

dt_bounds : tuple(float, float), optional

(lower, upper) drift time bounds; tightening the DT bounds around the area of interest reduces extraction time

verbose : bool, default=False

print information about the progress

Returns:
rtdt_rt : numpy.ndarray(float)

retention time component of RTDT data

rtdt_dt : numpy.ndarray(float)

drift time component of RTDT data

rtdt_int : numpy.ndarray(int)

intensity component of RTDT data
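
A simple summary of RTDT data is the intensity-weighted centroid in both dimensions, which gives a single (RT, DT) position for a feature. The sketch below assumes the three components are parallel sequences of equal length; `rtdt_centroid` is a hypothetical helper and not part of mzapy.

```python
def rtdt_centroid(rtdt_rt, rtdt_dt, rtdt_int):
    """Return the intensity-weighted (rt, dt) centroid, or None if there is no signal."""
    total = sum(rtdt_int)
    if total == 0:
        return None
    rt_center = sum(r * i for r, i in zip(rtdt_rt, rtdt_int)) / total
    dt_center = sum(d * i for d, i in zip(rtdt_dt, rtdt_int)) / total
    return rt_center, dt_center


# two equally intense points: the centroid lies halfway between them
centroid = rtdt_centroid([1.0, 3.0], [10.0, 20.0], [1, 1])
```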