`mzapy.MZA`

This object serves as the primary interface for interacting with raw data in the MZA format.

Module Reference

Initialization

class mzapy.MZA(h5_file, cache_metadata_headers=['Scan', 'MSLevel', 'RetentionTime', 'IonMobilityFrame', 'IonMobilityBin', 'IonMobilityTime'], ms1lvl=1, io_threads=8, cache_scan_data=False, mza_version='new')

object for accessing MS data from HDF5 formatted files

Attributes:

mza_version (str): version of the underlying .mza format
h5_file (str): file name (and optionally path to) the source HDF5 file
h5 (h5py.File): h5py File object for extracting information from the HDF5 file
mz_full (numpy.ndarray(float)): full m/z array
min_mz (float): minimum m/z value
max_mz (float): maximum m/z value
idx (numpy.ndarray(int)): indices of all scans
dt (numpy.ndarray(float)): drift time of all scans
min_dt (float): minimum drift time
max_dt (float): maximum drift time
rt (numpy.ndarray(float)): retention time of all scans
min_rt (float): minimum retention time
max_rt (float): maximum retention time
imb (numpy.ndarray(int)): ion mobility bin of all scans
mslvl (numpy.ndarray(int)): MS level of all scans
ms1_frame2frame_idx (dict(:)): mapping ?
frame_idx2ms1_frame (dict(:)): mapping ?

Methods

`close`()	closes the underlying h5py File object, trying to read data after closing will cause errors
`collect_atd_arrays_by_rt_mz`(mz_min, mz_max, ...)	loads ATD (dt, intensity) as arrays for target mass within an RT window
`collect_dtmz_arrays_by_rt`(rt_min, rt_max[, ...])	loads DTMZ data (dt, mz, intensity) as arrays within a target RT range, using optional m/z and DT bounds
`collect_ms1_arrays_by_dt`(dt_min, dt_max[, ...])	loads MS1 spectrum (m/z, intensity) as arrays for m/z and DT range
`collect_ms1_arrays_by_rt`(rt_min, rt_max[, ...])	loads MS1 spectrum (m/z, intensity) as arrays for m/z and RT range, ignoring DT if present
`collect_ms1_arrays_by_rt_dt`(rt_min, rt_max, ...)	loads MS1 spectrum (m/z, intensity) as arrays for m/z, RT, and DT range
`collect_ms1_df_by_rt`(rt_min, rt_max[, ...])	collects MS1 data within RT ranges, IM dimension is collapsed/ignored
`collect_ms1_df_by_rt_dt`(rt_min, rt_max, ...)	collects MS1 data within RT and DT ranges
`collect_ms2_arrays_by_rt`(rt_min, rt_max[, ...])	loads MS2 spectrum (m/z, intensity) as arrays for m/z and RT range, ignoring DT if present
`collect_ms2_arrays_by_rt_dt`(rt_min, rt_max, ...)	loads MS2 spectrum (m/z, intensity) as arrays for m/z, RT, and DT range
`collect_ms2_df_by_rt`(rt_min, rt_max[, ...])	collects MS2 data within RT ranges, IM dimension is collapsed/ignored
`collect_ms2_df_by_rt_dt`(rt_min, rt_max, ...)	collects MS2 data within RT and DT ranges
`collect_rtdt_arrays_by_mz`(mz_min, mz_max[, ...])	loads RTDT data (rt, dt, intensity) as arrays within a target m/z range, using optional RT and DT bounds
`collect_rtmz_arrays`([rt_bounds, mz_bounds, ...])	loads RTMZ data (rt, mz, intensity) as arrays ignoring/collapsing DT, using optional m/z and RT bounds
`collect_rtmz_arrays_by_dt`(dt_min, dt_max[, ...])	loads RTMZ data (rt, mz, intensity) as arrays within a target DT range, using optional m/z and RT bounds
`collect_xic_arrays_by_mz`(mz_min, mz_max[, ...])	loads XIC (retention time, intensity) as arrays for m/z range, ignoring DT if present
`collect_xic_arrays_by_mz_dt`(mz_min, mz_max, ...)	loads XIC (RT, intensity) as arrays for target mass within a DT window
`load_scan_cache`([scan_cache_file, ...])	load the scan cache from file
`metadata`(header)	retrieve the array of metadata from the specified header
`save_scan_cache`([scan_cache_file])	saves the scan cache to file for fast loading later

mzapy.MZA.__init__(self, h5_file, cache_metadata_headers=['Scan', 'MSLevel', 'RetentionTime', 'IonMobilityFrame', 'IonMobilityBin', 'IonMobilityTime'], ms1lvl=1, io_threads=8, cache_scan_data=False, mza_version='new')

init a new MZA instance using the path to the source HDF5 file

Parameters:

h5_file (str): file name (and optionally path to) the source HDF5 file
cache_metadata_headers (list(str) or str, default=_CACHE_METADATA_HEADERS): specify which metadata headers to cache in memory for faster access: [] to cache none, 'all' to cache all; by default the set defined in _config._CACHE_METADATA_HEADERS is used
ms1lvl (int, default=1): mslvl value corresponding to MS1 data (1 if MSMS data is present, 0 otherwise)
io_threads (int, default=8): number of threads to use for performing IO tasks
cache_scan_data (bool, default=False): whether to cache extracted scan data for faster subsequent access
mza_version (str, default='new'): temporary measure for indicating whether the scan indexing needs to account for partitioned scan data ('new') or not ('old'). This is only temporary: at some point the mza version will be encoded as metadata in the file and this accommodation can be made automatically.

Note

The cache_metadata_headers kwarg is used to control which metadata headers are cached in memory for faster access. Caching more metadata headers reduces access time, but there is a tradeoff with the memory footprint of the MZA instance.

controlling which headers are cached at initialization

from mzapy import MZA

# load some data from ./data/example.h5, cache the default metadata headers
h5_cache_std = MZA('./data/example.h5')

# load some data from ./data/example.h5, do not cache any metadata headers (smallest memory footprint)
h5_cache_none = MZA('./data/example.h5', cache_metadata_headers=[])

# load some data from ./data/example.h5, cache all metadata headers (largest memory footprint)
h5_cache_all = MZA('./data/example.h5', cache_metadata_headers='all')

mzapy.MZA.close(self): closes the underlying h5py File object; trying to read data after closing will cause errors
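
Because reads fail once the file is closed, it can help to tie the lifetime of the instance to a `with` block. The following is a minimal sketch, assuming only that the object exposes the `close()` and `collect_xic_arrays_by_mz()` methods documented on this page. The `mza_factory` parameter and the function name are hypothetical; the factory is passed in so the sketch can be run without the package or a data file.

```python
from contextlib import closing


def xic_from_file(mza_factory, h5_file, mz_min, mz_max):
    """Open a file, extract one XIC, and close the file even if extraction fails.

    mza_factory is a hypothetical parameter: pass mzapy.MZA in real use.
    """
    # closing() calls .close() on exit, whether or not an exception was raised
    with closing(mza_factory(h5_file)) as mza:
        return mza.collect_xic_arrays_by_mz(mz_min, mz_max)
```

In real use this would presumably be called as `xic_from_file(MZA, './data/example.h5', mz_min, mz_max)` after `from mzapy import MZA`.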

Scan Caching

mzapy.MZA.load_scan_cache(self, scan_cache_file=None, ignore_no_cache_file=False)

load the scan cache from file

Parameters:

scan_cache_file (str, optional): if provided, override the default scan cache file name
ignore_no_cache_file (bool, default=False): do not raise an exception if the cache file is not found, just silently use {} as the scan cache

mzapy.MZA.save_scan_cache(self, scan_cache_file=None)

saves the scan cache to file for fast loading later

Parameters:

scan_cache_file (str, optional): if provided, override the default scan cache file name
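
The two scan cache methods are naturally used as a pair around a batch of extractions. The sketch below shows one possible pattern, using only the signatures listed above. It assumes the instance was created with `cache_scan_data=True`; whether the cache methods require that is not stated here. The function name `run_with_scan_cache` and the `work` callback are hypothetical.

```python
def run_with_scan_cache(mza, work, scan_cache_file=None):
    """Load an existing scan cache (if any), run work(mza), then save the cache.

    Assumption: mza was created with cache_scan_data=True so that extractions
    performed inside work() populate the scan cache.
    """
    # On a first run there is no cache file yet, so start from an empty cache
    mza.load_scan_cache(scan_cache_file=scan_cache_file, ignore_no_cache_file=True)
    result = work(mza)
    # Persist whatever was cached during work() for faster loading next time
    mza.save_scan_cache(scan_cache_file=scan_cache_file)
    return result
```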

Metadata

mzapy.MZA.metadata(self, header)

retrieve the array of metadata from the specified header

Parameters:

header (str): specify which metadata header to fetch

Returns:

metadata_column (numpy.ndarray(?)): array of metadata (could be type float, int, or bytes depending on the header)
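
As an illustration of working with a metadata column, the sketch below counts scans per MS level. It assumes `'MSLevel'` is a valid header (it appears in the default `cache_metadata_headers` list) and that the returned column holds one integer-like value per scan. The function name is hypothetical.

```python
from collections import Counter


def scans_per_ms_level(mza):
    """Return {ms_level: number_of_scans} from the 'MSLevel' metadata column."""
    levels = mza.metadata('MSLevel')
    # Convert to plain int so the keys do not depend on the array's dtype
    return dict(Counter(int(level) for level in levels))
```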

Extracted Ion Chromatograms

mzapy.MZA.collect_xic_arrays_by_mz(self, mz_min, mz_max, rt_bounds=None, mslvl=1, verbose=False)

loads XIC (retention time, intensity) as arrays for m/z range, ignoring DT if present

Parameters:

mz_min (float): lower m/z bound
mz_max (float): upper m/z bound
rt_bounds (tuple(float, float), optional): (lower, upper) RT bounds
mslvl (int, default=1): MS level to select from
verbose (bool, default=False): print information about the progress

Returns:

xic_rt (numpy.ndarray(float)): retention time component of XIC
xic_int (numpy.ndarray(int)): intensity component of XIC

mzapy.MZA.collect_xic_arrays_by_mz_dt(self, mz_min, mz_max, dt_min, dt_max, rt_bounds=None, mslvl=1, verbose=False)

loads XIC (RT, intensity) as arrays for target mass within a DT window

Parameters:

mz_min (float): lower m/z bound
mz_max (float): upper m/z bound
dt_min (float): lower DT bound
dt_max (float): upper DT bound
rt_bounds (tuple(float, float), optional): (lower, upper) RT bounds
mslvl (int, default=1): MS level to select from
verbose (bool, default=False): print information about the progress

Returns:

xic_rt (numpy.ndarray(float)): retention time component of XIC
xic_int (numpy.ndarray(int)): intensity component of XIC
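
The XIC methods take explicit m/z bounds, so a target mass and tolerance need to be converted to a window first. The sketch below shows one way to do that with a ppm tolerance and then locate the XIC apex. Both function names and the 20 ppm default are illustrative assumptions; only `collect_xic_arrays_by_mz` and its `rt_bounds` keyword come from the reference above.

```python
def ppm_window(mz, ppm):
    """Return (mz_min, mz_max) for a symmetric ppm tolerance around mz."""
    delta = mz * ppm * 1e-6
    return mz - delta, mz + delta


def xic_apex(mza, target_mz, ppm=20.0, rt_bounds=None):
    """Return (rt, intensity) at the most intense XIC point, or None if empty."""
    mz_min, mz_max = ppm_window(target_mz, ppm)
    xic_rt, xic_int = mza.collect_xic_arrays_by_mz(mz_min, mz_max, rt_bounds=rt_bounds)
    if len(xic_int) == 0:
        return None
    # Index of the largest intensity value
    apex = max(range(len(xic_int)), key=lambda i: xic_int[i])
    return float(xic_rt[apex]), float(xic_int[apex])
```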

Arrival Time Distributions

mzapy.MZA.collect_atd_arrays_by_rt_mz(self, mz_min, mz_max, rt_min, rt_max, dt_bounds=None, mslvl=1, verbose=False)

loads ATD (dt, intensity) as arrays for target mass within an RT window

Parameters:

mz_min (float): lower m/z bound
mz_max (float): upper m/z bound
rt_min (float): lower RT bound
rt_max (float): upper RT bound
dt_bounds (tuple(float, float), optional): (lower, upper) drift time bounds; tightening DT bounds around the area of interest reduces extraction time
mslvl (int, default=1): MS level to select from
verbose (bool, default=False): print information about the progress

Returns:

atd_dt (numpy.ndarray(float)): drift time component of ATD
atd_int (numpy.ndarray(int)): intensity component of ATD
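
A common first summary of an ATD is its intensity-weighted mean drift time. The sketch below computes that from the two returned arrays; it is an illustrative helper (the name `atd_centroid` is hypothetical), not part of mzapy, and it is not a substitute for proper peak fitting.

```python
def atd_centroid(atd_dt, atd_int):
    """Intensity-weighted mean drift time of an ATD, or None if there is no signal."""
    total = float(sum(atd_int))
    if total == 0:
        return None
    weighted = sum(float(dt) * float(i) for dt, i in zip(atd_dt, atd_int))
    return weighted / total
```

The inputs would typically be the `(atd_dt, atd_int)` pair returned by `collect_atd_arrays_by_rt_mz`.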

MS1 Spectra

mzapy.MZA.collect_ms1_arrays_by_rt(self, rt_min, rt_max, mz_bounds=None)

loads MS1 spectrum (m/z, intensity) as arrays for m/z and RT range, ignoring DT if present

Parameters:

rt_min (float): lower RT bound
rt_max (float): upper RT bound
mz_bounds (tuple(float, float), optional): (lower, upper) m/z bounds; filters data after extraction, so it has no effect on extraction time

Returns:

ms1_mz (numpy.ndarray(float)): m/z component of MS1 spectrum
ms1_int (numpy.ndarray(int)): intensity component of MS1 spectrum

mzapy.MZA.collect_ms1_arrays_by_dt(self, dt_min, dt_max, mz_bounds=None)

loads MS1 spectrum (m/z, intensity) as arrays for m/z and DT range

Parameters:

dt_min (float): lower DT bound
dt_max (float): upper DT bound
mz_bounds (tuple(float, float), optional): (lower, upper) m/z bounds; filters data after extraction, so it has no effect on extraction time

Returns:

ms1_mz (numpy.ndarray(float)): m/z component of MS1 spectrum
ms1_int (numpy.ndarray(int)): intensity component of MS1 spectrum

mzapy.MZA.collect_ms1_arrays_by_rt_dt(self, rt_min, rt_max, dt_min, dt_max, mz_bounds=None, verbose=False)

loads MS1 spectrum (m/z, intensity) as arrays for m/z, RT, and DT range

Parameters:

rt_min (float): lower RT bound
rt_max (float): upper RT bound
dt_min (float): lower bound of DT window to select data from
dt_max (float): upper bound of DT window to select data from
mz_bounds (tuple(float, float), optional): (lower, upper) m/z bounds; filters data after extraction, so it has no effect on extraction time
verbose (bool, default=False): print information about the progress

Returns:

ms1_mz (numpy.ndarray(float)): m/z component of MS1 spectrum
ms1_int (numpy.ndarray(int)): intensity component of MS1 spectrum
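
The array-returning MS1 methods all produce a pair of parallel arrays, so simple post-processing can be written once and reused. The sketch below picks the most intense points of a spectrum. It assumes the two arrays have equal length; the function name is hypothetical, and this selects raw data points rather than performing peak picking.

```python
def most_intense_points(ms1_mz, ms1_int, n=5):
    """Return up to n (mz, intensity) pairs, ordered by decreasing intensity."""
    pairs = sorted(zip(ms1_mz, ms1_int), key=lambda p: p[1], reverse=True)
    return [(float(mz), float(intensity)) for mz, intensity in pairs[:n]]
```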

mzapy.MZA.collect_ms1_df_by_rt(self, rt_min, rt_max, mz_bounds=None, verbose=False)

collects MS1 data within RT ranges, IM dimension is collapsed/ignored

Parameters:

rt_min (float): lower bound of RT window to select data from
rt_max (float): upper bound of RT window to select data from
mz_bounds (tuple(float, float), optional): mz_min, mz_max
verbose (bool, default=False): print information about the progress

Returns:

data (pandas.DataFrame): data frame with columns mzbin, mz, intensity, rt, frame

mzapy.MZA.collect_ms1_df_by_rt_dt(self, rt_min, rt_max, dt_min, dt_max, mz_bounds=None, verbose=False)

collects MS1 data within RT and DT ranges

Parameters:

rt_min (float): lower bound of RT window to select data from
rt_max (float): upper bound of RT window to select data from
dt_min (float): lower bound of DT window to select data from
dt_max (float): upper bound of DT window to select data from
mz_bounds (tuple(float, float), optional): mz_min, mz_max
verbose (bool, default=False): print information about the progress

Returns:

data (pandas.DataFrame): data frame with columns mzbin, mz, intensity, rt, dt, frame

MS2 Spectra

mzapy.MZA.collect_ms2_arrays_by_rt_dt(self, rt_min, rt_max, dt_min, dt_max, mz_bounds=None, verbose=False)

loads MS2 spectrum (m/z, intensity) as arrays for m/z, RT, and DT range

Parameters:

rt_min (float): lower RT bound
rt_max (float): upper RT bound
dt_min (float): lower bound of DT window to select data from
dt_max (float): upper bound of DT window to select data from
mz_bounds (tuple(float, float), optional): (lower, upper) m/z bounds; filters data after extraction, so it has no effect on extraction time
verbose (bool, default=False): print information about the progress

Returns:

ms2_mz (numpy.ndarray(float)): m/z component of MS2 spectrum
ms2_int (numpy.ndarray(int)): intensity component of MS2 spectrum

mzapy.MZA.collect_ms2_df_by_rt(self, rt_min, rt_max, mz_bounds=None, verbose=False)

collects MS2 data within RT ranges, IM dimension is collapsed/ignored

Parameters:

rt_min (float): lower bound of RT window to select data from
rt_max (float): upper bound of RT window to select data from
mz_bounds (tuple(float, float), optional): mz_min, mz_max
verbose (bool, default=False): print information about the progress

Returns:

data (pandas.DataFrame): data frame with columns mzbin, mz, intensity, rt

mzapy.MZA.collect_ms2_df_by_rt_dt(self, rt_min, rt_max, dt_min, dt_max, mz_bounds=None, verbose=False)

collects MS2 data within RT and DT ranges

Parameters:

rt_min (float): lower bound of RT window to select data from
rt_max (float): upper bound of RT window to select data from
dt_min (float): lower bound of DT window to select data from
dt_max (float): upper bound of DT window to select data from
mz_bounds (tuple(float, float), optional): mz_min, mz_max
verbose (bool, default=False): print information about the progress

Returns:

data (pandas.DataFrame): data frame with columns mzbin, mz, intensity, rt, dt

2-Dimensional Data

mzapy.MZA.collect_rtmz_arrays(self, rt_bounds=None, mz_bounds=None, verbose=False)

loads RTMZ data (rt, mz, intensity) as arrays ignoring/collapsing DT, using optional m/z and RT bounds

Parameters:

rt_bounds (tuple(float, float), optional): (lower, upper) retention time bounds; tightening RT bounds around the area of interest reduces extraction time
mz_bounds (tuple(float, float), optional): (lower, upper) m/z bounds; filters data after extraction, so it has no effect on extraction time
verbose (bool, default=False): print information about the progress

Returns:

rtmz_rt (numpy.ndarray(float)): retention time component of RTMZ data
rtmz_mz (numpy.ndarray(float)): m/z component of RTMZ data
rtmz_int (numpy.ndarray(int)): intensity component of RTMZ data

mzapy.MZA.collect_rtmz_arrays_by_dt(self, dt_min, dt_max, rt_bounds=None, mz_bounds=None, verbose=False)

loads RTMZ data (rt, mz, intensity) as arrays within a target DT range, using optional m/z and RT bounds

Parameters:

dt_min (float): lower DT bound
dt_max (float): upper DT bound
rt_bounds (tuple(float, float), optional): (lower, upper) retention time bounds; tightening RT bounds around the area of interest reduces extraction time
mz_bounds (tuple(float, float), optional): (lower, upper) m/z bounds; filters data after extraction, so it has no effect on extraction time
verbose (bool, default=False): print information about the progress

Returns:

rtmz_rt (numpy.ndarray(float)): retention time component of RTMZ data
rtmz_mz (numpy.ndarray(float)): m/z component of RTMZ data
rtmz_int (numpy.ndarray(int)): intensity component of RTMZ data

mzapy.MZA.collect_dtmz_arrays_by_rt(self, rt_min, rt_max, dt_bounds=None, mz_bounds=None, verbose=False)

loads DTMZ data (dt, mz, intensity) as arrays within a target RT range, using optional m/z and DT bounds

Parameters:

rt_min (float): lower RT bound
rt_max (float): upper RT bound
dt_bounds (tuple(float, float), optional): (lower, upper) drift time bounds; tightening DT bounds around the area of interest reduces extraction time
mz_bounds (tuple(float, float), optional): (lower, upper) m/z bounds; filters data after extraction, so it has no effect on extraction time
verbose (bool, default=False): print information about the progress

Returns:

dtmz_dt (numpy.ndarray(float)): drift time component of DTMZ data
dtmz_mz (numpy.ndarray(float)): m/z component of DTMZ data
dtmz_int (numpy.ndarray(int)): intensity component of DTMZ data

mzapy.MZA.collect_rtdt_arrays_by_mz(self, mz_min, mz_max, rt_bounds=None, dt_bounds=None, verbose=False)

loads RTDT data (rt, dt, intensity) as arrays within a target m/z range, using optional RT and DT bounds

Parameters:

mz_min (float): lower m/z bound
mz_max (float): upper m/z bound
rt_bounds (tuple(float, float), optional): (lower, upper) retention time bounds; tightening RT bounds around the area of interest reduces extraction time
dt_bounds (tuple(float, float), optional): (lower, upper) drift time bounds; tightening DT bounds around the area of interest reduces extraction time
verbose (bool, default=False): print information about the progress

Returns:

rtdt_rt (numpy.ndarray(float)): retention time component of RTDT data
rtdt_dt (numpy.ndarray(float)): drift time component of RTDT data
rtdt_int (numpy.ndarray(int)): intensity component of RTDT data
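
The 2-dimensional methods return three flat arrays rather than a matrix. The sketch below assumes these arrays are parallel, with one entry per data point (this layout is an assumption, not stated above), and collapses one axis by summing intensity per retention time. The function name is hypothetical.

```python
from collections import defaultdict


def collapse_to_chromatogram(rt_values, intensities):
    """Sum intensity over all points sharing the same retention time.

    Returns (sorted_rts, summed_intensities) as two plain lists.
    """
    totals = defaultdict(float)
    for rt, intensity in zip(rt_values, intensities):
        totals[float(rt)] += float(intensity)
    rts = sorted(totals)
    return rts, [totals[rt] for rt in rts]
```

For example, passing `rtmz_rt` and `rtmz_int` from `collect_rtmz_arrays` would, under the stated assumption, give a summed chromatogram over the selected m/z bounds.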
