Implemented loaders

For some commonly used datasets and file formats, we provide functions that load the data directly. These loaders may be limited by I/O; if you have enough memory, you can cache the dataset with dataset.cache() or convert it to TFRecords.

The RuNNer format

RuNNer data (used by the RuNNer code: http://www.uni-goettingen.de/en/560580.html) has the format:

begin
lattice float float float
lattice float float float
lattice float float float
atom floatcoordx floatcoordy floatcoordz int_atom_symbol floatq 0  floatforcex floatforcey floatforcez
atom 1           2           3           4               5      6  7           8           9
energy float
charge float
comment arbitrary string
end

The order of the lines within the begin/end block is arbitrary. Coordinates, charges, energies and forces are all in atomic units.
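To make the block layout concrete, here is a minimal pure-Python sketch that parses text in the format described above. It is not the parser used by pinn.io.load_runner; the function name parse_runner and the dictionary keys are hypothetical, and the atom symbol column is kept as a raw string because its exact type is not specified here. Since the line order inside a block is arbitrary, the sketch dispatches on the first token of each line.

```python
def parse_runner(text):
    """Parse RuNNer-style text into a list of frame dictionaries (illustrative only)."""
    frames, frame = [], None
    for raw in text.splitlines():
        tokens = raw.split()
        if not tokens:
            continue
        key = tokens[0]
        if key == "begin":
            frame = {"lattice": [], "atoms": [], "comment": None}
        elif frame is None:
            continue  # ignore anything outside a begin/end block
        elif key == "lattice":
            frame["lattice"].append([float(x) for x in tokens[1:4]])
        elif key == "atom":
            # columns: 1-3 coordinates, 4 symbol, 5 charge, 6 unused, 7-9 forces
            frame["atoms"].append({
                "coord": [float(x) for x in tokens[1:4]],
                "symbol": tokens[4],
                "charge": float(tokens[5]),
                "force": [float(x) for x in tokens[7:10]],
            })
        elif key in ("energy", "charge"):
            frame[key] = float(tokens[1])
        elif key == "comment":
            frame["comment"] = raw.split(None, 1)[1].strip() if len(tokens) > 1 else ""
        elif key == "end":
            frames.append(frame)
            frame = None
    return frames


# Made-up example frame; the numbers are placeholders, not reference data.
sample = """begin
lattice 10.0 0.0 0.0
lattice 0.0 10.0 0.0
lattice 0.0 0.0 10.0
atom 0.0 0.0 0.0 H 0.0 0 0.01 0.0 0.0
atom 1.4 0.0 0.0 H 0.0 0 -0.01 0.0 0.0
energy -1.17
charge 0.0
comment H2 example
end
"""
frames = parse_runner(sample)
```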

pinn.io.load_runner(flist, **kwargs)

Loads RuNNer-formatted trajectories

Parameters:
  • flist (str) – one RuNNer-formatted trajectory file, or a list of such files
  • **kwargs – split options, see pinn.io.base.split_list
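The split options themselves are documented under pinn.io.base.split_list and are not reproduced here. As a rough illustration of what splitting a list of samples can look like, the sketch below shuffles a list and divides it by integer ratios. The helper split_by_ratio, its arguments, and the subset names "train", "vali" and "test" are all hypothetical and should not be read as the signature of split_list.

```python
import random


def split_by_ratio(items, ratios, seed=0):
    """Hypothetical helper: shuffle items and split them by integer ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)
    total = sum(ratios.values())
    names = list(ratios)
    splits, start = {}, 0
    for i, name in enumerate(names):
        if i == len(names) - 1:
            end = len(items)  # the last subset takes whatever remains
        else:
            end = start + len(items) * ratios[name] // total
        splits[name] = items[start:end]
        start = end
    return splits


parts = split_by_ratio(range(10), {"train": 8, "vali": 1, "test": 1})
```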

The CP2K format

This loader reads output from CP2K. It expects the coordinates, forces, energies and cell to be written to separate output files.

pinn.io.load_cp2k(coord_file, force_file, ener_file, cell_file, **kwargs)

Loads CP2K formatted trajectories

CP2K writes the coordinates, forces, energies and cell to separate files. The files are assumed to use consistent units; the loader performs no unit conversion.

Parameters:
  • coord_file – one or a list of CP2K .xyz files for coordinates
  • force_file – one or a list of CP2K .xyz files for forces
  • ener_file – one or a list of CP2K .ener files
  • cell_file – one or a list of CP2K .cell files
  • **kwargs – split options, see pinn.io.base.split_list
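Because the coordinates and forces arrive in separate .xyz files, it can be useful to check that the two files describe the same frames before loading them. The sketch below reads multi-frame text in the standard XYZ layout (atom count, comment line, then one line per atom) and compares the two. It is an illustration only, not the reader used by pinn.io.load_cp2k; the name read_xyz_frames and the sample contents are hypothetical.

```python
def read_xyz_frames(text):
    """Read multi-frame XYZ text into a list of frames (illustrative only)."""
    lines = text.splitlines()
    frames, pos = [], 0
    while pos < len(lines):
        if not lines[pos].strip():
            pos += 1  # skip blank lines between frames
            continue
        n_atoms = int(lines[pos])
        comment = lines[pos + 1].strip()
        symbols, values = [], []
        for line in lines[pos + 2 : pos + 2 + n_atoms]:
            tokens = line.split()
            symbols.append(tokens[0])
            values.append([float(x) for x in tokens[1:4]])
        frames.append({"comment": comment, "symbols": symbols, "values": values})
        pos += 2 + n_atoms
    return frames


# Made-up file contents; the comment lines and numbers are placeholders.
coord_text = """2
 i = 0
H 0.0 0.0 0.0
H 0.74 0.0 0.0
"""
force_text = """2
 i = 0
H 0.01 0.0 0.0
H -0.01 0.0 0.0
"""
coords = read_xyz_frames(coord_text)
forces = read_xyz_frames(force_text)

# Both files should contain the same number of frames with the same atoms.
same_frames = len(coords) == len(forces)
same_atoms = all(c["symbols"] == f["symbols"] for c, f in zip(coords, forces))
```

A check like this does not detect unit mismatches, which remain the user's responsibility.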

QM9 dataset

The QM9 dataset includes many computed properties for 134K stable organic molecules. See ref. [ramakrishnan_dral_dral_rupp_anatole_von_lilienfeld_2017] for more details.

The default behavior here is to label the internal energy “U0” as “e_data”. This behavior can be tweaked with the label_map parameter.

pinn.io.load_qm9(flist, label_map={'e_data': 'U0'}, **kwargs)

Loads the QM9 dataset

QM9 provides a variety of labels, but typically only one target, e.g. U0, is used for training. The label_map option selects the structure of the output dataset. By default, it takes only “U0” and maps it to “e_data”, i.e. label_map={'e_data': 'U0'}.

The available labels are:

['tag', 'index', 'A', 'B', 'C', 'mu', 'alpha', 'homo', 'lumo',
 'gap', 'r2', 'zpve', 'U0', 'U', 'H', 'G', 'Cv']

Descriptions of these labels can be found in QM9’s description file.

Parameters:
  • flist (list) – list of QM9-formatted data files.
  • label_map (dict) – dictionary mapping keys of the output dataset to QM9 labels
  • **kwargs – split options, see pinn.io.base.split_list
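The effect of label_map can be pictured as a key renaming: each entry pairs an output key with the QM9 label it is taken from. The sketch below mimics this on a plain dictionary. It assumes nothing about how load_qm9 is implemented; the helper apply_label_map, the output key "d_data" and the record values are hypothetical placeholders.

```python
def apply_label_map(record, label_map):
    """Build a dict keyed by output names, with values taken from QM9 labels."""
    return {out_name: record[qm9_label] for out_name, qm9_label in label_map.items()}


# Placeholder values, not taken from the QM9 dataset.
record = {"U0": -40.5, "mu": 0.0, "gap": 0.5}

# Default mapping described above: only U0 is kept, under the key e_data.
default = apply_label_map(record, {"e_data": "U0"})

# A custom mapping that also keeps the dipole moment under a hypothetical key.
custom = apply_label_map(record, {"e_data": "U0", "d_data": "mu"})
```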

ANI-1 dataset

The ANI-1 dataset consists of 20M off-equilibrium DFT energies for organic molecules. See ref. [smith_isayev_roitberg_2017] for more details.

pinn.io.load_ani(filelist, cycle_length=4, **kwargs)

Loads the ANI-1 dataset

Parameters:
  • filelist (list) – filenames of ANI-1 h5 files.
  • cycle_length (int) – number of parallel threads used to read the h5 files
  • **kwargs – split options, see pinn.io.base.split_list

Numpy dataset

Another easy way to generate your own dataset is to store the data as a dictionary of NumPy arrays. See the toy problem for an example.
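As a rough sketch of such a dictionary, the block below builds one from random data. The key "e_data" follows the energy label used in the QM9 section; the keys "coord" and "elems", the array shapes, and the random values are assumptions made for illustration. How the dictionary is then passed to PiNN is not shown here; the toy problem remains the reference.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_atoms = 5, 3

# One entry per quantity; the first axis of every array indexes the samples.
dataset = {
    "coord": rng.uniform(0.0, 5.0, size=(n_samples, n_atoms, 3)),  # positions
    "elems": np.ones((n_samples, n_atoms), dtype=np.int32),        # atomic numbers
    "e_data": rng.normal(size=n_samples),                          # energies
}

# Every array should agree on the number of samples.
consistent = all(value.shape[0] == n_samples for value in dataset.values())
```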