Loading data

Reading a dataset

To shuffle and split a dataset, we require it to be represented as a list of datums. In the simplest case, the dataset is a list of structure files, each containing one structure and its label (that is, one sample). PiNN provides a list_loader decorator, which turns a function that reads a single sample into a function that transforms a list of samples into a dataset. For example:

from pinn.io import list_loader

@list_loader()
def load_file_list(filename):
    # read a single file here
    coord = ...
    elems = ...
    e_data = ...
    datum = {'coord': coord, 'elems': elems, 'e_data': e_data}
    return datum

An example notebook on preparing datasets can be found here.
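
As a concrete illustration of a single-sample reader, here is a minimal sketch that parses an XYZ-style file into a datum with the same keys as above. The file layout (energy as the first token of the comment line), the function name read_xyz_sample, the ATOMIC_NUMBERS table, and the choice to store elems as atomic numbers are all assumptions made for this example, not part of PiNN. The sketch uses only the standard library; with PiNN installed, a function like this could be decorated with @list_loader() as shown earlier.

```python
import os
import tempfile

# Hypothetical lookup table for this example; extend it for your own data.
ATOMIC_NUMBERS = {'H': 1, 'C': 6, 'N': 7, 'O': 8}


def read_xyz_sample(filename):
    """Read one structure and its energy from an XYZ-style file.

    Assumed layout: line 1 is the atom count, line 2 starts with the
    energy, and each following line is "symbol x y z".
    """
    with open(filename) as f:
        n_atoms = int(f.readline())
        e_data = float(f.readline().split()[0])
        elems, coord = [], []
        for _ in range(n_atoms):
            symbol, x, y, z = f.readline().split()[:4]
            elems.append(ATOMIC_NUMBERS[symbol])
            coord.append([float(x), float(y), float(z)])
    return {'coord': coord, 'elems': elems, 'e_data': e_data}


# Usage: write a small water molecule to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'water.xyz')
    with open(path, 'w') as f:
        f.write("3\n-76.4\nO 0.0 0.0 0.0\nH 0.0 0.0 0.96\nH 0.93 0.0 -0.24\n")
    datum = read_xyz_sample(path)
```

The reader only has to handle one file; turning a list of such files into a dataset is what the decorator is for.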

Splitting the dataset

It is a common practice to split the dataset into subsets for validation in machine learning tasks. Our dataset loaders support a split option to do this. The split can be a nested dictionary of relative ratios of subsets. The dataset loader will return a nested structure of datasets with corresponding ratios. For example:

from pinn.io import load_qm9

dataset = load_qm9(filelist, split={'train': 8, 'test': [1, 2, 3]})
train = dataset['train']
test1 = dataset['test'][0]

Here train and test1 are tf.data.Dataset objects that can be consumed by our models. By default, the dataset is split into three subsets (train: 80%, test: 10%, vali: 10%). Note that the loaders also take a seed parameter, which keeps the split consistent between runs; its default value is 0.
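
To make the idea of a nested split with relative ratios concrete, the following sketch splits a plain Python list in the same spirit. It is not PiNN's implementation: the function split_list and its rounding of subset boundaries are assumptions for illustration, and it works on lists rather than tf.data.Dataset objects.

```python
import random


def split_list(items, split, seed=0):
    """Shuffle items and divide them by a nested structure of relative ratios."""

    def flatten(node):
        # Collect the ratios in a fixed traversal order.
        if isinstance(node, dict):
            return [r for v in node.values() for r in flatten(v)]
        if isinstance(node, (list, tuple)):
            return [r for v in node for r in flatten(v)]
        return [node]

    ratios = flatten(split)
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # same seed, same shuffle order

    # Turn cumulative ratios into slice boundaries.
    total = sum(ratios)
    bounds, acc = [0], 0
    for r in ratios:
        acc += r
        bounds.append(round(len(shuffled) * acc / total))
    chunks = iter([shuffled[a:b] for a, b in zip(bounds, bounds[1:])])

    def rebuild(node):
        # Walk the structure in the same order and hand out the chunks.
        if isinstance(node, dict):
            return {k: rebuild(v) for k, v in node.items()}
        if isinstance(node, (list, tuple)):
            return [rebuild(v) for v in node]
        return next(chunks)

    return rebuild(split)


parts = split_list(range(140), {'train': 8, 'test': [1, 2, 3]})
print(len(parts['train']), [len(p) for p in parts['test']])  # 80 [10, 20, 30]
```

The ratios sum to 14, so 140 items are divided into subsets of 80, 10, 20 and 30. Calling split_list again with the same seed reproduces the same subsets, which is why a fixed seed matters when the training and test sets are loaded in separate runs.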