Implemented loaders

For some commonly used datasets, we provide functions that load the dataset directly. The loaders might be limited by IO; if you have enough memory, you can cache the dataset with dataset.cache() or convert it to tfrecords.

The RuNNer format

RuNNer data (used by the RuNNer code: http://www.uni-goettingen.de/en/560580.html) has the format:

begin
lattice float float float
lattice float float float
lattice float float float
atom floatcoordx floatcoordy floatcoordz int_atom_symbol floatq 0  floatforcex floatforcey floatforcez
atom 1           2           3           4               5      6  7           8           9
energy float
charge float
comment arbitrary string
end

The order of the lines within the begin/end block is arbitrary. Coordinates, charges, energies and forces are all in atomic units.
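To make the layout above concrete, here is a minimal sketch of a parser for one or more begin/end blocks. It is not the code used by pinn.io.load_runner; the function name parse_runner and the dictionary keys other than e_data are hypothetical, chosen only for illustration. The atom symbol in column 4 is kept as a plain string, and the unused column 6 is ignored.

```python
def parse_runner(text):
    """Parse RuNNer-formatted text into a list of per-structure dicts.

    Illustrative sketch only; key names are assumptions, not PiNN's.
    """
    structures = []
    current = None
    for raw in text.splitlines():
        tokens = raw.split()
        if not tokens:
            continue
        key = tokens[0]
        if key == "begin":
            current = {"cell": [], "elems": [], "coord": [], "q": [], "f_data": []}
        elif current is None:
            continue  # ignore anything outside a begin/end block
        elif key == "lattice":
            current["cell"].append([float(t) for t in tokens[1:4]])
        elif key == "atom":
            current["coord"].append([float(t) for t in tokens[1:4]])
            current["elems"].append(tokens[4])
            current["q"].append(float(tokens[5]))
            # tokens[6] is the unused column (written as 0 in the format)
            current["f_data"].append([float(t) for t in tokens[7:10]])
        elif key == "energy":
            current["e_data"] = float(tokens[1])
        elif key == "charge":
            current["charge"] = float(tokens[1])
        elif key == "comment":
            current["comment"] = " ".join(tokens[1:])
        elif key == "end":
            structures.append(current)
            current = None
    return structures


# Made-up sample block; the numbers are placeholders, not real data.
# The comment line comes first to show that line order is arbitrary.
SAMPLE = """\
begin
comment toy frame
lattice 10.0 0.0 0.0
lattice 0.0 10.0 0.0
lattice 0.0 0.0 10.0
atom 0.0 0.0 0.0 O -0.8 0 0.01 0.02 0.03
atom 1.8 0.0 0.0 H 0.4 0 -0.01 0.0 0.0
energy -76.4
charge 0.0
end
"""

frames = parse_runner(SAMPLE)
print(len(frames), frames[0]["elems"], frames[0]["e_data"])
```

Because each line is dispatched on its leading keyword, the parser does not depend on the order of lines inside a block.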

pinn.io.load_runner(flist, **kwargs)

Loads RuNNer-formatted trajectories

Parameters:
    flist (str) – one or a list of RuNNer-formatted trajectories
    **kwargs – split options, see `pinn.io.base.split_list`

The CP2K format

Loads output from CP2K. This loader expects coordinates, forces, energy and cell outputs in respective output files.

pinn.io.load_cp2k(coord_file, force_file, ener_file, cell_file, **kwargs)

Loads CP2K formatted trajectories

CP2K outputs the coord, force, energy and cell in separate files. It is assumed that different files come in consistent units (no unit conversion is done in the loader).

Parameters:
    coord_file – one or a list of CP2K .xyz files for coordinates
    force_file – one or a list of CP2K .xyz files for forces
    ener_file – one or a list of CP2K .ener files
    cell_file – one or a list of CP2K .cell files
    **kwargs – split options, see `pinn.io.base.split_list`
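Since the coordinates and forces come from separate files and the loader performs no unit conversion, it can be worth checking beforehand that the files describe the same frames. The sketch below is not part of PiNN: read_xyz_frames and check_consistent are hypothetical helpers, and the sketch assumes the usual multi-frame .xyz layout (an atom count line, a comment line, then one line per atom). It cannot detect a unit mismatch; it only compares frame counts and atom order.

```python
def read_xyz_frames(text):
    """Split multi-frame .xyz text into a list of (symbols, values) frames."""
    lines = text.splitlines()
    frames = []
    i = 0
    while i < len(lines):
        if not lines[i].strip():
            i += 1  # tolerate blank lines between frames
            continue
        natoms = int(lines[i].split()[0])
        # lines[i + 1] is the per-frame comment line; it is skipped here
        body = lines[i + 2 : i + 2 + natoms]
        if len(body) != natoms:
            raise ValueError("truncated frame starting at line %d" % (i + 1))
        symbols = [row.split()[0] for row in body]
        values = [[float(v) for v in row.split()[1:4]] for row in body]
        frames.append((symbols, values))
        i += 2 + natoms
    return frames


def check_consistent(coord_text, force_text):
    """Return the frame count if both files agree, else raise ValueError."""
    coords = read_xyz_frames(coord_text)
    forces = read_xyz_frames(force_text)
    if len(coords) != len(forces):
        raise ValueError(
            "frame count mismatch: %d vs %d" % (len(coords), len(forces))
        )
    for n, ((sym_c, _), (sym_f, _)) in enumerate(zip(coords, forces)):
        if sym_c != sym_f:
            raise ValueError("atom order differs in frame %d" % n)
    return len(coords)


# Made-up two-frame example; the numbers are placeholders, not real output.
COORDS = """\
2
 i = 0, time = 0.000
H 0.0 0.0 0.0
H 0.0 0.0 0.74
2
 i = 1, time = 0.500
H 0.0 0.0 0.01
H 0.0 0.0 0.75
"""
FORCES = """\
2
 i = 0, time = 0.000
H 0.0 0.0 0.001
H 0.0 0.0 -0.001
2
 i = 1, time = 0.500
H 0.0 0.0 0.002
H 0.0 0.0 -0.002
"""

n_frames = check_consistent(COORDS, FORCES)
print(n_frames)
```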

QM9 dataset

The QM9 dataset includes many computed properties for 134K stable organic molecules. See ref. [ramakrishnan_dral_dral_rupp_anatole_von_lilienfeld_2017] for more details.

The default behavior here is to label the internal energy “U0” as “e_data”. This behavior can be tweaked with the label_map parameter.

pinn.io.load_qm9(flist, label_map={'e_data': 'U0'}, **kwargs)

Loads the QM9 dataset

QM9 provides a variety of labels, but we typically train on only one target, e.g. U0. The label_map option selects the structure of the output dataset. By default, it takes only 'U0' and maps it to 'e_data', i.e. label_map={'e_data': 'U0'}.

Other available labels are:

['tag', 'index', 'A', 'B', 'C', 'mu', 'alpha', 'homo', 'lumo',
 'gap', 'r2', 'zpve', 'U0', 'U', 'H', 'G', 'Cv']

Descriptions of those tags can be found in QM9’s description file.
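The effect of label_map can be sketched in plain Python. This is only an illustration, assuming that the dictionary maps output keys to QM9 label names as the default suggests; apply_label_map is a hypothetical helper, not a PiNN function, and the record values are placeholders rather than real QM9 data.

```python
QM9_LABELS = ['tag', 'index', 'A', 'B', 'C', 'mu', 'alpha', 'homo', 'lumo',
              'gap', 'r2', 'zpve', 'U0', 'U', 'H', 'G', 'Cv']


def apply_label_map(record, label_map=None):
    """Select and rename labels from one record according to label_map."""
    if label_map is None:
        label_map = {'e_data': 'U0'}  # the default described above
    return {out_key: record[label] for out_key, label in label_map.items()}


# Placeholder record: each label simply holds its own position in the list.
record = {name: float(i) for i, name in enumerate(QM9_LABELS)}

default_output = apply_label_map(record)
custom_output = apply_label_map(record, {'e_data': 'G'})
print(default_output, custom_output)
```

With the default map the output contains only e_data, taken from U0. Passing {'e_data': 'G'} would train on the free energy instead while keeping the same output key.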

Parameters:
    flist (list) – list of QM9-formatted data files.
    label_map (dict) – dictionary mapping output keys to QM9 labels
    **kwargs – split options, see `pinn.io.base.split_list`

ANI-1 dataset

The ANI-1 dataset consists of 20M off-equilibrium DFT energies for organic molecules. See ref. [smith_isayev_roitberg_2017] for more details.

pinn.io.load_ani(filelist, cycle_length=4, **kwargs)

Loads the ANI-1 dataset

Parameters:
    filelist (list) – filenames of ANI-1 h5 files.
    cycle_length (int) – number of parallel threads used to read the h5 files
    **kwargs – split options, see `pinn.io.base.split_list`

Numpy dataset

Another easy way to generate your own dataset is to store the data as a dictionary of numpy arrays. See how it’s done in the toy problem.
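As a rough illustration of what "a dictionary of numpy arrays" can look like, the sketch below builds a small synthetic dataset in which every array shares the same leading (frame) dimension. Only the key e_data appears elsewhere in this document; the keys elems and coord, the helper check_dataset, and all values are assumptions made for this example. How such a dictionary is passed to PiNN is shown in the toy problem and is not reproduced here.

```python
import numpy as np

n_frames, n_atoms = 5, 3
rng = np.random.default_rng(0)

# Synthetic placeholder data: random numbers, not a physical system.
data = {
    "elems": np.tile(np.array([8, 1, 1]), (n_frames, 1)),  # atomic numbers
    "coord": rng.normal(size=(n_frames, n_atoms, 3)),      # positions
    "e_data": rng.normal(size=(n_frames,)),                # one energy per frame
}


def check_dataset(data):
    """Return the number of frames, or raise if the arrays disagree."""
    lengths = {len(value) for value in data.values()}
    if len(lengths) != 1:
        raise ValueError("arrays disagree on the number of frames")
    return lengths.pop()


n = check_dataset(data)

# A single frame is obtained by indexing every array at the same position.
frame0 = {key: value[0] for key, value in data.items()}
print(n, frame0["elems"], frame0["coord"].shape)
```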