mzapy.MZA
This object serves as the primary interface for interacting with raw data in the MZA format.
Module Reference
Initialization
- class mzapy.MZA(h5_file, cache_metadata_headers=['Scan', 'MSLevel', 'RetentionTime', 'IonMobilityFrame', 'IonMobilityBin', 'IonMobilityTime'], ms1lvl=1, io_threads=8, cache_scan_data=False, mza_version='new')
object for accessing MS data from HDF5 formatted files
- Attributes:
- mza_version
str version of the underlying .mza format
- h5_file
str file name (and optionally path to) the source HDF5 file
- h5
h5py.File h5py File object for extracting information from the HDF5 file
- mz_full
numpy.ndarray(float) full m/z array
- min_mz
float minimum m/z value
- max_mz
float maximum m/z value
- idx
numpy.ndarray(int) indices of all scans
- dt
numpy.ndarray(float) drift time of all scans
- min_dt
float minimum drift time
- max_dt
float maximum drift time
- rt
numpy.ndarray(float) retention time of all scans
- min_rt
float minimum retention time
- max_rt
float maximum retention time
- imb
numpy.ndarray(int) ion mobility bin of all scans
- mslvl
numpy.ndarray(int) MS level of all scans
- ms1_frame2frame_idx
dict(:) mapping ?
- frame_idx2ms1_frame
dict(:) mapping ?
Methods
- close()
closes the underlying h5py File object, trying to read data after closing will cause errors
- collect_atd_arrays_by_rt_mz(mz_min, mz_max, ...)
loads ATD (dt, intensity) as arrays for target mass within an RT window
- collect_dtmz_arrays_by_rt(rt_min, rt_max[, ...])
loads DTMZ data (dt, mz, intensity) as arrays within a target RT range, using optional m/z and DT bounds
- collect_ms1_arrays_by_dt(dt_min, dt_max[, ...])
loads MS1 spectrum (m/z, intensity) as arrays for m/z and DT range
- collect_ms1_arrays_by_rt(rt_min, rt_max[, ...])
loads MS1 spectrum (m/z, intensity) as arrays for m/z and RT range, ignoring DT if present
- collect_ms1_arrays_by_rt_dt(rt_min, rt_max, ...)
loads MS1 spectrum (m/z, intensity) as arrays for m/z, RT, and DT range
- collect_ms1_df_by_rt(rt_min, rt_max[, ...])
collects MS1 data within RT ranges, IM dimension is collapsed/ignored
- collect_ms1_df_by_rt_dt(rt_min, rt_max, ...)
collects MS1 data within RT and DT ranges
- collect_ms2_arrays_by_rt(rt_min, rt_max[, ...])
loads MS2 spectrum (m/z, intensity) as arrays for m/z and RT range, ignoring DT if present
- collect_ms2_arrays_by_rt_dt(rt_min, rt_max, ...)
loads MS2 spectrum (m/z, intensity) as arrays for m/z, RT, and DT range
- collect_ms2_df_by_rt(rt_min, rt_max[, ...])
collects MS2 data within RT ranges, IM dimension is collapsed/ignored
- collect_ms2_df_by_rt_dt(rt_min, rt_max, ...)
collects MS2 data within RT and DT ranges
- collect_rtdt_arrays_by_mz(mz_min, mz_max[, ...])
loads RTDT data (rt, dt, intensity) as arrays within a target m/z range, using optional RT and DT bounds
- collect_rtmz_arrays([rt_bounds, mz_bounds, ...])
loads RTMZ data (rt, mz, intensity) as arrays ignoring/collapsing DT, using optional m/z and RT bounds
- collect_rtmz_arrays_by_dt(dt_min, dt_max[, ...])
loads RTMZ data (rt, mz, intensity) as arrays within a target DT range, using optional m/z and RT bounds
- collect_xic_arrays_by_mz(mz_min, mz_max[, ...])
loads XIC (retention time, intensity) as arrays for m/z range, ignoring DT if present
- collect_xic_arrays_by_mz_dt(mz_min, mz_max, ...)
loads XIC (RT, intensity) as arrays for target mass within a DT window
- load_scan_cache([scan_cache_file, ...])
load the scan cache from file
- metadata(header)
retrieve the array of metadata from the specified header
- save_scan_cache([scan_cache_file])
saves the scan cache to file for fast loading later
- mzapy.MZA.__init__(self, h5_file, cache_metadata_headers=['Scan', 'MSLevel', 'RetentionTime', 'IonMobilityFrame', 'IonMobilityBin', 'IonMobilityTime'], ms1lvl=1, io_threads=8, cache_scan_data=False, mza_version='new')
initializes a new MZA instance from the path to the source HDF5 file
- Parameters:
- h5_file
str file name (and optionally path to) the source HDF5 file
- cache_metadata_headers
list(str) or str, default=_CACHE_METADATA_HEADERS specifies which metadata headers to cache in memory for faster access: [] caches none, 'all' caches all; by default the set defined in _config._CACHE_METADATA_HEADERS is used
- ms1lvl
int, default=1 mslvl value corresponding to MS1 data (1 if MS/MS data is present, 0 otherwise)
- io_threads
int, default=8 number of threads to use for performing IO tasks
- cache_scan_data
bool, default=False whether to cache extracted scan data for faster subsequent access
- mza_version
str, default='new' temporary measure for indicating whether the scan indexing needs to account for partitioned scan data ('new') or not ('old'). This is only temporary: at some point the MZA version will be encoded as metadata in the file and this accommodation can be made automatically.
Note
The cache_metadata_headers kwarg is used to control which metadata headers are cached in memory for faster access.
Caching more metadata headers reduces access time, but there is a tradeoff with the memory footprint of the MZA
instance.
from mzapy import MZA
# load some data from ./data/example.h5, cache the default metadata headers
h5_cache_std = MZA('./data/example.h5')
# load some data from ./data/example.h5, do not cache any metadata headers (smallest memory footprint)
h5_cache_none = MZA('./data/example.h5', cache_metadata_headers=[])
# load some data from ./data/example.h5, cache all metadata headers (largest memory footprint)
h5_cache_all = MZA('./data/example.h5', cache_metadata_headers='all')
- mzapy.MZA.close(self)
closes the underlying h5py File object, trying to read data after closing will cause errors
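Because reads fail once close() has been called, it can be convenient to tie the lifetime of an instance to a with block. The sketch below uses the standard library's contextlib.closing, which calls close() on exit; it assumes only that the object exposes close() as documented above. _StandInMZA and run_with_closing are hypothetical names used so the example runs without a data file; in practice the stand-in would be replaced by something like MZA('./data/example.h5').

```python
from contextlib import closing


class _StandInMZA:
    """Hypothetical placeholder for mzapy.MZA; it only mimics close()."""

    def __init__(self, h5_file):
        self.h5_file = h5_file
        self.is_open = True

    def close(self):
        self.is_open = False


def run_with_closing(mza, work):
    """Call work(mza) and always close mza afterwards, even if work raises."""
    with closing(mza) as handle:
        return work(handle)


stand_in = _StandInMZA('./data/example.h5')
result = run_with_closing(stand_in, lambda m: m.h5_file)
```

After the call returns, the stand-in has been closed, so any further reads should happen inside the work callable.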
Scan Caching
- mzapy.MZA.load_scan_cache(self, scan_cache_file=None, ignore_no_cache_file=False)
load the scan cache from file
- Parameters:
- scan_cache_file
str, optional if provided, override the default scan cache file name
- ignore_no_cache_file
bool, default=False do not raise an exception if the cache file is not found, just silently use {} as the scan cache
- mzapy.MZA.save_scan_cache(self, scan_cache_file=None)
saves the scan cache to file for fast loading later
- Parameters:
- scan_cache_file
str, optional if provided, override the default scan cache file name
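One way to combine the two methods above is to load any existing cache before extracting data and to save it again afterwards. The following is a minimal sketch under that assumption; extract_with_scan_cache and _RecordingMZA are hypothetical names, and the stand-in only records calls so the example runs without a data file. The scan cache is presumably only populated when the instance was created with cache_scan_data=True; that is an inference from the parameter description, not something this reference states.

```python
def extract_with_scan_cache(mza, work, scan_cache_file=None):
    """Load an existing scan cache if present, run work(mza), then save the cache."""
    # ignore_no_cache_file=True: a missing cache file is not an error
    mza.load_scan_cache(scan_cache_file=scan_cache_file, ignore_no_cache_file=True)
    try:
        return work(mza)
    finally:
        # persist the cache so a later run can load it
        mza.save_scan_cache(scan_cache_file=scan_cache_file)


class _RecordingMZA:
    """Hypothetical stand-in that records calls instead of touching files."""

    def __init__(self):
        self.calls = []

    def load_scan_cache(self, scan_cache_file=None, ignore_no_cache_file=False):
        self.calls.append(('load', scan_cache_file, ignore_no_cache_file))

    def save_scan_cache(self, scan_cache_file=None):
        self.calls.append(('save', scan_cache_file))


demo = _RecordingMZA()
answer = extract_with_scan_cache(demo, lambda m: 42)
```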
Metadata
- mzapy.MZA.metadata(self, header)
retrieve the array of metadata from the specified header
- Parameters:
- header
str specify which metadata header to fetch
- Returns:
- metadata_column
numpy.ndarray(?) array of metadata (could be type float, int, or bytes depending on the header)
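A metadata column holds one value per scan, so it can be used to pick out scans of interest before extracting data. The sketch below selects the positions whose value falls inside a window; indices_in_window is a hypothetical helper, the inclusive bounds are an assumption, and the values in rt_column are made up for illustration.

```python
def indices_in_window(values, lower, upper):
    """Return the positions i where lower <= values[i] <= upper."""
    return [i for i, v in enumerate(values) if lower <= v <= upper]


# made-up retention times standing in for a real metadata column
rt_column = [0.10, 0.55, 1.20, 1.85, 2.40]
selected = indices_in_window(rt_column, 0.5, 2.0)
```

With a real instance, rt_column would come from a call such as mza.metadata('RetentionTime'), one of the headers listed in the default cache_metadata_headers.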
Extracted Ion Chromatograms
- mzapy.MZA.collect_xic_arrays_by_mz(self, mz_min, mz_max, rt_bounds=None, mslvl=1, verbose=False)
loads XIC (retention time, intensity) as arrays for m/z range, ignoring DT if present
- Parameters:
- mz_min
float lower m/z bound
- mz_max
float upper m/z bound
- rt_bounds
tuple(float, float), optional (lower, upper) RT bounds
- mslvl
int, default=1 MS level to select from
- verbose
bool, default=False print information about the progress
- Returns:
- xic_rt
numpy.ndarray(float) retention time component of XIC
- xic_int
numpy.ndarray(int) intensity component of XIC
- mzapy.MZA.collect_xic_arrays_by_mz_dt(self, mz_min, mz_max, dt_min, dt_max, rt_bounds=None, mslvl=1, verbose=False)
loads XIC (RT, intensity) as arrays for target mass within a DT window
- Parameters:
- mz_min
float lower m/z bound
- mz_max
float upper m/z bound
- dt_min
float lower DT bound
- dt_max
float upper DT bound
- rt_bounds
tuple(float, float), optional (lower, upper) RT bounds
- mslvl
int, default=1 MS level to select from
- verbose
bool, default=False print information about the progress
- Returns:
- xic_rt
numpy.ndarray(float) retention time component of XIC
- xic_int
numpy.ndarray(int) intensity component of XIC
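The XIC methods take explicit mz_min and mz_max bounds, whereas a target is often specified as an m/z value with a ppm tolerance. The sketch below converts one into the other and locates the apex of a returned XIC. mz_window and xic_apex are hypothetical helpers, not part of mzapy, and the XIC values are made up.

```python
def mz_window(target_mz, ppm):
    """Return (mz_min, mz_max) for a symmetric ppm tolerance around target_mz."""
    delta = target_mz * ppm * 1e-6
    return target_mz - delta, target_mz + delta


def xic_apex(xic_rt, xic_int):
    """Return (rt, intensity) at the most intense point of an XIC."""
    if len(xic_rt) != len(xic_int) or len(xic_rt) == 0:
        raise ValueError('xic_rt and xic_int must be non-empty and of equal length')
    best = max(range(len(xic_int)), key=lambda i: xic_int[i])
    return xic_rt[best], xic_int[best]


mz_min, mz_max = mz_window(500.0, 20.0)
apex_rt, apex_int = xic_apex([1.0, 1.1, 1.2, 1.3], [10, 250, 900, 40])
```

The bounds from mz_window can be passed as mz_min and mz_max to collect_xic_arrays_by_mz, and the two arrays it returns can be passed to xic_apex.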
Arrival Time Distributions
- mzapy.MZA.collect_atd_arrays_by_rt_mz(self, mz_min, mz_max, rt_min, rt_max, dt_bounds=None, mslvl=1, verbose=False)
loads ATD (dt, intensity) as arrays for target mass within an RT window
- Parameters:
- mz_min
float lower m/z bound
- mz_max
float upper m/z bound
- rt_min
float lower RT bound
- rt_max
float upper RT bound
- dt_bounds
tuple(float), optional (lower, upper) drift time bounds, tightening DT bounds around area of interest reduces extraction time
- mslvl
int, default=1 MS level to select from
- verbose
bool, default=False print information about the progress
- Returns:
- atd_dt
numpy.ndarray(float) drift time component of ATD
- atd_int
numpy.ndarray(int) intensity component of ATD
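A common first summary of an ATD is its intensity-weighted mean drift time. The sketch below computes it from the two arrays described above; atd_centroid is a hypothetical helper and the values are made up.

```python
def atd_centroid(atd_dt, atd_int):
    """Return the intensity-weighted mean drift time of an ATD."""
    total = sum(atd_int)
    if total <= 0:
        raise ValueError('ATD has no signal')
    return sum(dt * i for dt, i in zip(atd_dt, atd_int)) / total


centroid = atd_centroid([20.0, 21.0, 22.0], [100, 200, 100])
```

For a symmetric distribution like the one above, the centroid coincides with the middle drift time; peak fitting would be needed for overlapping features.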
MS1 Spectra
- mzapy.MZA.collect_ms1_arrays_by_rt(self, rt_min, rt_max, mz_bounds=None)
loads MS1 spectrum (m/z, intensity) as arrays for m/z and RT range, ignoring DT if present
- Parameters:
- rt_min
float lower RT bound
- rt_max
float upper RT bound
- mz_bounds
tuple(float), optional (lower, upper) m/z bounds, filters data after extraction so no effect on extraction time
- Returns:
- ms1_mz
numpy.ndarray(float) m/z component of MS1 spectrum
- ms1_int
numpy.ndarray(int) intensity component of MS1 spectrum
- mzapy.MZA.collect_ms1_arrays_by_dt(self, dt_min, dt_max, mz_bounds=None)
loads MS1 spectrum (m/z, intensity) as arrays for m/z and DT range
- Parameters:
- dt_min
float lower DT bound
- dt_max
float upper DT bound
- mz_bounds
tuple(float), optional (lower, upper) m/z bounds, filters data after extraction so no effect on extraction time
- Returns:
- ms1_mz
numpy.ndarray(float) m/z component of MS1 spectrum
- ms1_int
numpy.ndarray(int) intensity component of MS1 spectrum
- mzapy.MZA.collect_ms1_arrays_by_rt_dt(self, rt_min, rt_max, dt_min, dt_max, mz_bounds=None, verbose=False)
loads MS1 spectrum (m/z, intensity) as arrays for m/z, RT, and DT range
- Parameters:
- rt_min
float lower RT bound
- rt_max
float upper RT bound
- dt_min
float lower bound of DT window to select data from
- dt_max
float upper bound of DT window to select data from
- mz_bounds
tuple(float), optional (lower, upper) m/z bounds, filters data after extraction so no effect on extraction time
- verbose
bool, default=False print information about the progress
- Returns:
- ms1_mz
numpy.ndarray(float) m/z component of MS1 spectrum
- ms1_int
numpy.ndarray(int) intensity component of MS1 spectrum
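The array-returning MS1 methods above each return an m/z array and an intensity array of equal length. As an illustration of working with that pair, the sketch below lists the most intense points of a spectrum; top_peaks is a hypothetical helper and the spectrum values are made up. It ranks individual data points and does not perform peak detection.

```python
def top_peaks(ms_mz, ms_int, n=3):
    """Return the n most intense (m/z, intensity) pairs, most intense first."""
    pairs = sorted(zip(ms_mz, ms_int), key=lambda p: p[1], reverse=True)
    return pairs[:n]


peaks = top_peaks([100.1, 200.2, 300.3, 400.4], [50, 900, 10, 400], n=2)
```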
- mzapy.MZA.collect_ms1_df_by_rt(self, rt_min, rt_max, mz_bounds=None, verbose=False)
collects MS1 data within RT ranges, IM dimension is collapsed/ignored
- Parameters:
- rt_min
float lower bound of RT window to select data from
- rt_max
float upper bound of RT window to select data from
- mz_bounds
tuple(float, float), optional mz_min, mz_max
- verbose
bool, default=False print information about the progress
- Returns:
- data
pandas.DataFrame data frame with columns mzbin, mz, intensity, rt, frame
- mzapy.MZA.collect_ms1_df_by_rt_dt(self, rt_min, rt_max, dt_min, dt_max, mz_bounds=None, verbose=False)
collects MS1 data within RT and DT ranges
- Parameters:
- rt_min
float lower bound of RT window to select data from
- rt_max
float upper bound of RT window to select data from
- dt_min
float lower bound of DT window to select data from
- dt_max
float upper bound of DT window to select data from
- mz_bounds
tuple(float, float), optional mz_min, mz_max
- verbose
bool, default=False print information about the progress
- Returns:
- data
pandas.DataFrame data frame with columns mzbin, mz, intensity, rt, dt, frame
MS2 Spectra
- mzapy.MZA.collect_ms2_arrays_by_rt_dt(self, rt_min, rt_max, dt_min, dt_max, mz_bounds=None, verbose=False)
loads MS2 spectrum (m/z, intensity) as arrays for m/z, RT, and DT range
- Parameters:
- rt_min
float lower RT bound
- rt_max
float upper RT bound
- dt_min
float lower bound of DT window to select data from
- dt_max
float upper bound of DT window to select data from
- mz_bounds
tuple(float), optional (lower, upper) m/z bounds, filters data after extraction so no effect on extraction time
- verbose
bool, default=False print information about the progress
- Returns:
- ms2_mz
numpy.ndarray(float) m/z component of MS2 spectrum
- ms2_int
numpy.ndarray(int) intensity component of MS2 spectrum
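MS2 spectra often contain many low-intensity points, and a typical follow-up step is to keep only fragments above some fraction of the base peak. The sketch below applies such a filter to the two returned arrays; filter_by_relative_intensity and its min_rel threshold are hypothetical, and the fragment values are made up.

```python
def filter_by_relative_intensity(ms2_mz, ms2_int, min_rel=0.05):
    """Keep points whose intensity is at least min_rel times the base peak."""
    if len(ms2_int) == 0:
        return [], []
    base = max(ms2_int)
    if base <= 0:
        return [], []
    kept = [(mz, i) for mz, i in zip(ms2_mz, ms2_int) if i >= min_rel * base]
    return [mz for mz, _ in kept], [i for _, i in kept]


frag_mz, frag_int = filter_by_relative_intensity(
    [85.0, 120.1, 184.1, 250.2], [3, 40, 1000, 60]
)
```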
- mzapy.MZA.collect_ms2_df_by_rt(self, rt_min, rt_max, mz_bounds=None, verbose=False)
collects MS2 data within RT ranges, IM dimension is collapsed/ignored
- Parameters:
- rt_min
float lower bound of RT window to select data from
- rt_max
float upper bound of RT window to select data from
- mz_bounds
tuple(float, float), optional mz_min, mz_max
- verbose
bool, default=False print information about the progress
- Returns:
- data
pandas.DataFrame data frame with columns mzbin, mz, intensity, rt
- mzapy.MZA.collect_ms2_df_by_rt_dt(self, rt_min, rt_max, dt_min, dt_max, mz_bounds=None, verbose=False)
collects MS2 data within RT and DT ranges
- Parameters:
- rt_min
float lower bound of RT window to select data from
- rt_max
float upper bound of RT window to select data from
- dt_min
float lower bound of DT window to select data from
- dt_max
float upper bound of DT window to select data from
- mz_bounds
tuple(float, float), optional mz_min, mz_max
- verbose
bool, default=False print information about the progress
- Returns:
- data
pandas.DataFrame data frame with columns mzbin, mz, intensity, rt, dt
2-Dimensional Data
- mzapy.MZA.collect_rtmz_arrays(self, rt_bounds=None, mz_bounds=None, verbose=False)
loads RTMZ data (rt, mz, intensity) as arrays ignoring/collapsing DT, using optional m/z and RT bounds
- Parameters:
- rt_bounds
tuple(float), optional (lower, upper) retention time bounds, tightening RT bounds around area of interest reduces extraction time
- mz_bounds
tuple(float), optional (lower, upper) m/z bounds, filters data after extraction so no effect on extraction time
- verbose
bool, default=False print information about the progress
- Returns:
- rtmz_rt
numpy.ndarray(float) retention time component of RTMZ data
- rtmz_mz
numpy.ndarray(float) m/z component of RTMZ data
- rtmz_int
numpy.ndarray(int) intensity component of RTMZ data
- mzapy.MZA.collect_rtmz_arrays_by_dt(self, dt_min, dt_max, rt_bounds=None, mz_bounds=None, verbose=False)
loads RTMZ data (rt, mz, intensity) as arrays within a target DT range, using optional m/z and RT bounds
- Parameters:
- dt_min
float lower DT bound
- dt_max
float upper DT bound
- rt_bounds
tuple(float), optional (lower, upper) retention time bounds, tightening RT bounds around area of interest reduces extraction time
- mz_bounds
tuple(float), optional (lower, upper) m/z bounds, filters data after extraction so no effect on extraction time
- verbose
bool, default=False print information about the progress
- Returns:
- rtmz_rt
numpy.ndarray(float) retention time component of RTMZ data
- rtmz_mz
numpy.ndarray(float) m/z component of RTMZ data
- rtmz_int
numpy.ndarray(int) intensity component of RTMZ data
- mzapy.MZA.collect_dtmz_arrays_by_rt(self, rt_min, rt_max, dt_bounds=None, mz_bounds=None, verbose=False)
loads DTMZ data (dt, mz, intensity) as arrays within a target RT range, using optional m/z and DT bounds
- Parameters:
- rt_min
float lower RT bound
- rt_max
float upper RT bound
- dt_bounds
tuple(float), optional (lower, upper) drift time bounds, tightening DT bounds around area of interest reduces extraction time
- mz_bounds
tuple(float), optional (lower, upper) m/z bounds, filters data after extraction so no effect on extraction time
- verbose
bool, default=False print information about the progress
- Returns:
- dtmz_dt
numpy.ndarray(float) drift time component of DTMZ data
- dtmz_mz
numpy.ndarray(float) m/z component of DTMZ data
- dtmz_int
numpy.ndarray(int) intensity component of DTMZ data
- mzapy.MZA.collect_rtdt_arrays_by_mz(self, mz_min, mz_max, rt_bounds=None, dt_bounds=None, verbose=False)
loads RTDT data (rt, dt, intensity) as arrays within a target m/z range, using optional RT and DT bounds
- Parameters:
- mz_min
float lower m/z bound
- mz_max
float upper m/z bound
- rt_bounds
tuple(float), optional (lower, upper) retention time bounds, tightening RT bounds around area of interest reduces extraction time
- dt_bounds
tuple(float), optional (lower, upper) drift time bounds, tightening DT bounds around area of interest reduces extraction time
- verbose
bool, default=False print information about the progress
- Returns:
- rtdt_rt
numpy.ndarray(float) retention time component of RTDT data
- rtdt_dt
numpy.ndarray(float) drift time component of RTDT data
- rtdt_int
numpy.ndarray(int) intensity component of RTDT data
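Each of the 2-dimensional methods above returns three arrays. Assuming these are parallel, with one entry per data point (an assumption; the reference does not spell out the layout), a 2-dimensional result can be collapsed to one dimension by summing intensity over the unwanted axis. The sketch below does this for made-up RTDT data; collapse_second_axis is a hypothetical helper.

```python
from collections import defaultdict


def collapse_second_axis(axis_a, axis_b, intensity):
    """Sum intensity over axis_b for each distinct value of axis_a."""
    totals = defaultdict(int)
    for a, _b, i in zip(axis_a, axis_b, intensity):
        totals[a] += i
    keys = sorted(totals)
    return keys, [totals[k] for k in keys]


# made-up (rt, dt, intensity) triples standing in for RTDT data
rt_axis, summed = collapse_second_axis(
    [1.0, 1.0, 1.1, 1.1, 1.1],
    [20.0, 21.0, 20.0, 21.0, 22.0],
    [5, 7, 1, 2, 3],
)
```

Swapping the first two arguments collapses the other axis instead, giving total intensity per drift time.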