Implemented loaders¶
For some commonly used datasets, we provide functions that load the
dataset directly. The loaders might be limited by I/O, but if you have
enough memory, you can simply cache the dataset with
dataset.cache() or convert it to tfrecords.
The RuNNer format¶
RuNNer data (used by the RuNNer code: http://www.uni-goettingen.de/en/560580.html) has the format:
begin
lattice float float float
lattice float float float
lattice float float float
atom floatcoordx floatcoordy floatcoordz int_atom_symbol floatq 0 floatforcex floatforcey floatforcez
atom 1 2 3 4 5 6 7 8 9
energy float
charge float
comment arbitrary string
end
The order of the lines within the begin/end block is arbitrary. Coordinates, charges, energies and forces are all in atomic units.
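To make the layout above concrete, here is a minimal sketch of a reader for begin/end blocks. It is not the parser used by pinn.io.load_runner. The function names `parse_runner_block` and `read_runner`, the dictionary keys, and the sample frame (with made-up numbers) are all assumptions for illustration. The element column is kept as a string, and the seventh column of each atom line is skipped, following the format listing.

```python
def parse_runner_block(lines):
    """Parse the lines between 'begin' and 'end' into a dict (illustrative only)."""
    frame = {"lattice": [], "coord": [], "elems": [], "charges": [],
             "forces": [], "energy": None, "charge": None, "comment": None}
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        key = parts[0]
        if key == "lattice":
            frame["lattice"].append([float(x) for x in parts[1:4]])
        elif key == "atom":
            frame["coord"].append([float(x) for x in parts[1:4]])
            frame["elems"].append(parts[4])           # element column, kept as text
            frame["charges"].append(float(parts[5]))  # per-atom charge
            # parts[6] is the unused column written as 0 in the format listing
            frame["forces"].append([float(x) for x in parts[7:10]])
        elif key == "energy":
            frame["energy"] = float(parts[1])
        elif key == "charge":
            frame["charge"] = float(parts[1])
        elif key == "comment":
            frame["comment"] = line.split(None, 1)[1].strip() if len(parts) > 1 else ""
    return frame


def read_runner(text):
    """Split text into begin/end blocks and parse each one."""
    frames, block = [], None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "begin":
            block = []
        elif stripped == "end":
            if block is not None:
                frames.append(parse_runner_block(block))
            block = None
        elif block is not None:
            block.append(stripped)
    return frames


# A made-up frame, only to exercise the reader; the numbers carry no physical meaning.
sample = """begin
lattice 10.0 0.0 0.0
lattice 0.0 10.0 0.0
lattice 0.0 0.0 10.0
atom 0.0 0.0 0.0 H 0.0 0 0.1 0.0 0.0
atom 1.4 0.0 0.0 H 0.0 0 -0.1 0.0 0.0
energy -1.17
charge 0.0
comment made-up H2 frame
end
"""
frames = read_runner(sample)
```

Because each line is dispatched on its leading keyword, the reader does not depend on the order of lines inside a block.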
-
pinn.io.load_runner(flist, **kwargs)¶ Loads RuNNer-formatted trajectories
Parameters: - flist (str) – one RuNNer-formatted trajectory file, or a list of them
- **kwargs – split options, see
pinn.io.base.split_list
The CP2K format¶
Loads output from CP2K. This loader expects coordinates, forces, energy and cell outputs in respective output files.
-
pinn.io.load_cp2k(coord_file, force_file, ener_file, cell_file, **kwargs)¶ Loads CP2K formatted trajectories
CP2K outputs the coord, force, energy and cell in separate files. It is assumed that different files come in consistent units (no unit conversion is done in the loader).
Parameters: - coord_file – one or a list of CP2K .xyz files for coordinates
- force_file – one or a list of CP2K .xyz files for forces
- ener_file – one or a list of CP2K .ener files
- cell_file – one or a list of CP2K .cell files
- **kwargs – split options, see
pinn.io.base.split_list
QM9 dataset¶
The QM9 dataset includes many computed properties for 134K stable organic molecules. See ref. [ramakrishnan_dral_dral_rupp_anatole_von_lilienfeld_2017] for more details.
The default behavior here is to label the internal energy “U0” as
“e_data”. This behavior can be tweaked with the label_map
parameter.
-
pinn.io.load_qm9(flist, label_map={'e_data': 'U0'}, **kwargs)¶ Loads the QM9 dataset
QM9 provides a variety of labels, but typically we train on only one target, e.g. U0. A
label_map option is offered to choose the output dataset structure. By default, it takes only “U0” and maps it to “e_data”, i.e. label_map={'e_data': 'U0'}. The available labels are:
['tag', 'index', 'A', 'B', 'C', 'mu', 'alpha', 'homo', 'lumo', 'gap', 'r2', 'zpve', 'U0', 'U', 'H', 'G', 'Cv']
Descriptions of those tags can be found in QM9’s description file.
Parameters: - flist (list) – list of QM9-formatted data files.
- label_map (dict) – dictionary mapping output keys to QM9 labels, e.g. {'e_data': 'U0'}
- **kwargs – split options, see
pinn.io.base.split_list
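The effect of label_map can be illustrated with plain dictionaries. The sketch below is not the code inside pinn.io.load_qm9. The helper `apply_label_map`, the output key `gap_data`, and the record values are hypothetical, chosen only to show how each output key picks one QM9 label.

```python
# Label names as listed in the documentation above.
QM9_LABELS = ['tag', 'index', 'A', 'B', 'C', 'mu', 'alpha', 'homo', 'lumo',
              'gap', 'r2', 'zpve', 'U0', 'U', 'H', 'G', 'Cv']


def apply_label_map(record, label_map):
    """Return one entry per label_map item (hypothetical helper, for illustration)."""
    unknown = [src for src in label_map.values() if src not in QM9_LABELS]
    if unknown:
        raise KeyError(f"unknown QM9 labels: {unknown}")
    return {out_key: record[src] for out_key, src in label_map.items()}


# Placeholder values, not taken from the QM9 dataset.
record = {'U0': -1.0, 'gap': 0.5, 'mu': 0.0}

default_output = apply_label_map(record, {'e_data': 'U0'})
custom_output = apply_label_map(record, {'e_data': 'U0', 'gap_data': 'gap'})
```

With the default map only `e_data` appears in the output. Adding more entries to the map adds more keys.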
ANI-1 dataset¶
The ANI-1 dataset consists of 20M off-equilibrium DFT energies for organic molecules. See ref. [smith_isayev_roitberg_2017] for more details.
-
pinn.io.load_ani(filelist, cycle_length=4, **kwargs)¶ Loads the ANI-1 dataset
Parameters: - filelist (list) – filenames of ANI-1 h5 files.
- cycle_length (int) – number of parallel threads used to read the h5 files
- **kwargs – split options, see
pinn.io.base.split_list
Numpy dataset¶
Another easy way to generate your own dataset is to store the data as a dictionary of numpy arrays. See how it’s done in the toy problem.
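As a rough sketch of what such a dictionary might look like: only the key `e_data` appears on this page. The keys `coord` and `elems`, the array shapes, the random contents, and the helper `check_dataset` are assumptions made for illustration, so check the toy problem for the layout the loaders actually expect.

```python
import numpy as np

n_samples, n_atoms = 4, 3
rng = np.random.default_rng(0)

# One array per quantity; the first axis indexes the samples.
# Key names other than 'e_data' are assumed, and the values are random placeholders.
data = {
    "coord": rng.normal(size=(n_samples, n_atoms, 3)),      # atomic positions
    "elems": np.tile(np.array([8, 1, 1]), (n_samples, 1)),  # atomic numbers
    "e_data": rng.normal(size=(n_samples,)),                # one energy label per sample
}


def check_dataset(data):
    """Check that all arrays share the same number of samples and return it."""
    sizes = {key: value.shape[0] for key, value in data.items()}
    if len(set(sizes.values())) != 1:
        raise ValueError(f"inconsistent sample counts: {sizes}")
    return next(iter(sizes.values()))


n_found = check_dataset(data)
```

A consistency check like this catches mismatched array lengths before the data is handed to a loader.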