Manipulating datasets
Batching the dataset
Most TensorFlow dataset operations (caching, repeating, shuffling) can be
applied directly to the dataset. However, the structures in a dataset
often contain different numbers of atoms. To handle this, we use a
special sparse_batch operation that creates minibatches of the data in
a sparse form. For example:
from pinn.io import sparse_batch
dataset = ...  # Placeholder: load some dataset here
batched = dataset.apply(sparse_batch(100))
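To illustrate what "sparse form" means here, the following is a minimal conceptual sketch in plain Python. It is not PiNN's implementation of sparse_batch; the function name sparse_batch_sketch and the key "struct_index" are hypothetical and chosen only for this illustration. The idea shown is that atoms from structures of different sizes are concatenated rather than padded, with an index recording which structure each atom came from.

```python
# Conceptual sketch only: NOT PiNN's implementation of sparse_batch.
# The names sparse_batch_sketch and "struct_index" are hypothetical.

def sparse_batch_sketch(samples):
    """Merge structures with different atom counts into one flat batch.

    Each sample is a dict with per-atom lists 'elems' and 'coord'.
    Atoms are concatenated instead of padded, and 'struct_index'
    records which structure each atom belongs to.
    """
    batch = {"elems": [], "coord": [], "struct_index": []}
    for i, sample in enumerate(samples):
        n_atoms = len(sample["elems"])
        batch["elems"].extend(sample["elems"])
        batch["coord"].extend(sample["coord"])
        batch["struct_index"].extend([i] * n_atoms)
    return batch


# Two toy structures with 3 and 2 atoms (made-up coordinates).
water = {"elems": [8, 1, 1],
         "coord": [[0.0, 0.0, 0.0], [0.0, 0.8, 0.6], [0.0, -0.8, 0.6]]}
hydrogen = {"elems": [1, 1],
            "coord": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.7]]}

batch = sparse_batch_sketch([water, hydrogen])
print(batch["elems"])         # [8, 1, 1, 1, 1]
print(batch["struct_index"])  # [0, 0, 0, 1, 1]
```

No padding entries are stored, so the batch size in atoms is simply the sum of the atom counts of its structures.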
API reference
pinn.io.sparse_batch(batch_size, drop_remainder=False, num_parallel_calls=8, atomic_props=['f_data', 'q_data', 'f_weights'])
Returns a dataset operation that transforms single samples into sparse batched samples. atomic_props must include every property that is defined on a per-atom basis, other than 'coord' and 'elems'.
Parameters:
- batch_size (int) – number of samples in each batch
- drop_remainder (bool) – option for padded_batch
- num_parallel_calls (int) – option for map
- atomic_props (list) – list of atomic properties
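The reason atomic properties have to be listed can be sketched as follows. This is a plain-Python illustration under stated assumptions, not PiNN's code: the function batch_with_atomic_props and the key "struct_index" are hypothetical, "e_data" is used here only as an example of a per-structure property, and all numbers are made up. Keys named as atomic are concatenated over atoms, while every other key keeps one value per structure.

```python
# Conceptual sketch only: NOT PiNN's implementation.
# batch_with_atomic_props and "struct_index" are hypothetical names.

def batch_with_atomic_props(samples, atomic_props=("f_data",)):
    """Batch samples, flattening per-atom keys and stacking the rest."""
    per_atom = {"elems", "coord", *atomic_props}
    batch = {"struct_index": []}
    for i, sample in enumerate(samples):
        batch["struct_index"].extend([i] * len(sample["elems"]))
        for key, value in sample.items():
            if key in per_atom:
                # One entry per atom: concatenate along the atom axis.
                batch.setdefault(key, []).extend(value)
            else:
                # One entry per structure: keep one value per sample.
                batch.setdefault(key, []).append(value)
    return batch


# Made-up values, for illustration only.
samples = [
    {"elems": [8, 1, 1], "e_data": -76.4,
     "f_data": [[0.0, 0.0, 0.1], [0.0, 0.1, 0.0], [0.0, -0.1, 0.0]]},
    {"elems": [1, 1], "e_data": -1.2,
     "f_data": [[0.0, 0.0, 0.2], [0.0, 0.0, -0.2]]},
]

batch = batch_with_atomic_props(samples)
print(len(batch["f_data"]))  # 5, one force vector per atom
print(batch["e_data"])       # [-76.4, -1.2], one energy per structure
```

In this sketch, leaving "f_data" out of atomic_props would cause the two force arrays to be stored as two ragged per-structure entries instead of five per-atom rows.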
TFRecord
The tfrecord format is a serialized format for efficient data reading in TensorFlow. The format is especially useful for streaming the data over a network. It can also be used for caching preprocessed data.
The tfrecord writer/loader also supports batched and preprocessed datasets. When writing the dataset, a .yml file records the data structure of the dataset, and a .tfr file holds the data. For example:
from glob import glob
from pinn.io import load_QM9, sparse_batch
from pinn.io import write_tfrecord, load_tfrecord
# Load QM9 dataset
filelist = glob('/home/yunqi/datasets/QM9/dsgdb9nsd/*.xyz')
train_set = load_QM9(filelist)['train'].apply(sparse_batch(10))
# Write as tfrecord, then read it back
write_tfrecord('train.yml', train_set)
dataset = load_tfrecord('train.yml')
The training set is saved in train.tfr, while
train.yml holds the information about the data structure.
API reference
pinn.io.write_tfrecord(fname, dataset, log_every=100, pre_fn=None)
Helper function to convert a dataset object into a tfrecord file.
fname must end with .yml or .yaml. The data will be written to a .tfr file with the same base name.
Parameters:
- fname (str) – filename of the dataset to be saved.
- dataset (Dataset) – input dataset.
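The pairing between the metadata file and the data file can be sketched as below. This is a hedged illustration of the naming rule described above, not the library's own code: the helper tfr_path_for is hypothetical and only shows how a .tfr name could be derived from a .yml or .yaml name.

```python
from pathlib import PurePosixPath


def tfr_path_for(fname):
    """Hypothetical helper: map a .yml/.yaml metadata name to a .tfr name."""
    path = PurePosixPath(fname)
    if path.suffix not in (".yml", ".yaml"):
        raise ValueError(f"fname must end with .yml or .yaml, got {fname!r}")
    # Keep the directory and base name, swap only the extension.
    return str(path.with_suffix(".tfr"))


print(tfr_path_for("train.yml"))        # train.tfr
print(tfr_path_for("data/valid.yaml"))  # data/valid.tfr
```

Keeping both files side by side matters because the .yml file is the one passed to load_tfrecord, which then needs to locate the matching data file.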
pinn.io.load_tfrecord(fname)
Load tfrecord dataset.
Parameters:
- fname (str) – filename of the .yml metadata file to be loaded.
- dtypes (dict) – dtype of dataset. This parameter does not appear in the signature above.