Exploring the Datasets
The Dataset collection represents a table whose rows correspond to Molecules and whose columns correspond to properties.
Columns may either result from QCFractal-based calculations or be contributed from outside sources.
For example, the QM9 dataset in QCArchive contains small organic molecules with up to 9 heavy atoms, and includes the original reported PBE0 energies, as well as energies calculated with a variety of other density functionals and basis sets.
The existing Datasets can be listed with FractalClient.list_collections("Dataset") and obtained using FractalClient.get_collection("Dataset", name).
Querying the Data
Available result specifications (method, basis set, program, keyword, driver combinations) in a Dataset may be listed with the list_values method. Values are queried with the get_values method. For results computed using QCFractal, the underlying Records are retrieved with get_records.
For further details about how to query Datasets, see the QCArchive examples.
Statistics and Visualization
Statistical operations on Datasets may be performed using the statistics command and plotted using the visualize command.
For examples of visualizing Datasets, see the QCArchive examples.
Creating the Datasets
Construct an empty Dataset:
>>> import qcportal as ptl
>>> client = ptl.FractalClient() # add server and login information as needed
>>> ds = ptl.collections.Dataset("name", client=client)
The primary index of a Dataset is a list of Molecules. Molecules can be added to a Dataset with add_entry:
>>> ds.add_entry(name, molecule)
Once all Molecules are added, the changes can be committed to the server with the save method.
Note that this requires write permissions.
>>> ds.save()
Computational Tasks
Computations on the molecules within the Datasets can be performed using the compute command.
If the results of the requested computation already exist in the Dataset, they will be reused to avoid recomputation. Note that performing computations requires compute permissions.
>>> models = {('b3lyp', 'def2-svp'), ('mp2', 'cc-pVDZ')}
>>> for method, basis in models:
...     print(method, basis)
...     spec = {"program": "psi4",
...             "method": method,
...             "basis": basis,
...             "keywords": "my_keywords",
...             "tag": "mgwtfm"}
...     ds.compute(**spec)
Note
A default quantum chemical program and a set of computational keywords can be specified for a Dataset.
These default values will be used in the compute, get_values, and get_records methods.
>>> ds.set_default_program("psi4")
>>> keywords = ptl.models.KeywordSet(values={'maxiter': 1000,
...                                          'e_convergence': 8,
...                                          'guess': 'sad',
...                                          'scf_type': 'df'})
>>> ds.add_keywords("my_keywords", "psi4", keywords, default=True)
>>> ds.save()
API
- class qcportal.collections.Dataset(name: str, client: Optional[FractalClient] = None, **kwargs: Any)
The Dataset class for homogeneous computations on many molecules.
- Variables
client (client.FractalClient) – A FractalClient connected to a server
data (dict) – JSON representation of the database backbone
df (pd.DataFrame) – The underlying dataframe for the Dataset object
- class DataModel(*, id: str = 'local', name: str, collection: str, provenance: Dict[str, str] = {}, tags: List[str] = [], tagline: str = None, description: str = None, group: str = 'default', visibility: bool = True, view_url_hdf5: str = None, view_url_plaintext: str = None, view_metadata: Dict[str, str] = None, view_available: bool = False, metadata: Dict[str, Any] = {}, default_program: str = None, default_keywords: Dict[str, str] = {}, default_driver: str = 'energy', default_units: str = 'kcal / mol', default_benchmark: str = None, alias_keywords: Dict[str, Dict[str, str]] = {}, records: List[qcportal.collections.dataset.MoleculeEntry] = None, contributed_values: Dict[str, qcportal.collections.dataset.ContributedValues] = None, history: Set[Tuple[str, str, str, Optional[str], Optional[str]]] = {}, history_keys: Tuple[str, str, str, str, str] = ('driver', 'program', 'method', 'basis', 'keywords'))
- Parameters
id (str, Default: local)
name (str)
collection (str)
provenance (Dict[str, str], Default: {})
tags (List[str], Default: [])
tagline (str, Optional)
description (str, Optional)
group (str, Default: default)
visibility (bool, Default: True)
view_url_hdf5 (str, Optional)
view_url_plaintext (str, Optional)
view_metadata (Dict[str, str], Optional)
view_available (bool, Default: False)
metadata (Dict[str, Any], Default: {})
default_program (str, Optional)
default_keywords (Dict[str, str], Default: {})
default_driver (str, Default: energy)
default_units (str, Default: kcal / mol)
default_benchmark (str, Optional)
alias_keywords (Dict[str, Dict[str, str]], Default: {})
records (List[MoleculeEntry], Optional)
contributed_values (Dict[str, ContributedValues], Optional)
history (Set[Tuple[str, str, str, str, str]], Default: set())
history_keys (Tuple[str, str, str, str, str], Default: (‘driver’, ‘program’, ‘method’, ‘basis’, ‘keywords’))
- add_contributed_values(contrib: qcportal.collections.dataset.ContributedValues, overwrite: bool = False) → None
Adds a ContributedValues to the database. Be sure to call save() to commit changes to the server.
- Parameters
contrib (ContributedValues) – The ContributedValues to add.
overwrite (bool, optional) – Overwrites pre-existing values
- add_entry(name: str, molecule: Molecule, **kwargs: Dict[str, Any]) → None
Adds a new entry to the Dataset.
- Parameters
name (str) – The name of the record
molecule (Molecule) – The Molecule associated with this record
**kwargs (Dict[str, Any]) – Additional arguments to pass to the record
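A minimal sketch of the add-then-save pattern, assuming only the `add_entry(name, molecule)` and `save()` calls documented on this page. `populate_dataset` is a hypothetical helper, and `RecordingDataset` is an offline stand-in (not part of qcportal) that records the calls it receives so the sketch runs without a server. The molecule values are placeholder strings rather than real Molecule objects.

```python
from typing import Any, Dict, List


def populate_dataset(ds: Any, molecules: Dict[str, Any]) -> List[str]:
    """Stage every (name, molecule) pair on ds, then commit once with save()."""
    added = []
    for name, molecule in molecules.items():
        ds.add_entry(name, molecule)
        added.append(name)
    ds.save()  # one commit after all entries are staged; requires write permissions
    return added


class RecordingDataset:
    """Offline stand-in that mimics the two methods used above."""

    def __init__(self):
        self.entries = {}
        self.save_calls = 0

    def add_entry(self, name, molecule, **kwargs):
        self.entries[name] = molecule

    def save(self):
        self.save_calls += 1


ds = RecordingDataset()
names = populate_dataset(ds, {"water": "<water molecule>", "methane": "<methane molecule>"})
print(names, ds.save_calls)
```

Calling save once at the end, rather than after every entry, follows the order given in the guide: add all Molecules first, then commit.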
- add_keywords(alias: str, program: str, keyword: KeywordSet, default: bool = False) → bool
Adds an option alias to the dataset. Note that keywords are not present until a save call has been completed.
- Parameters
alias (str) – The alias of the option
program (str) – The compute program the alias is for
keyword (KeywordSet) – The Keywords object to use.
default (bool, optional) – Sets this option as the default for the program
- compute(method: str, basis: Optional[str] = None, *, keywords: Optional[str] = None, program: Optional[str] = None, tag: Optional[str] = None, priority: Optional[str] = None, protocols: Optional[Dict[str, Any]] = None) → qcportal.models.rest_models.ComputeResponse
Executes a computational method for all reactions in the Dataset. Previously completed computations are not repeated.
- Parameters
method (str) – The computational method to compute (B3LYP)
basis (Optional[str], optional) – The computational basis to compute (6-31G)
keywords (Optional[str], optional) – The keyword alias for the requested compute
program (Optional[str], optional) – The underlying QC program
tag (Optional[str], optional) – The queue tag to use when submitting compute requests.
priority (Optional[str], optional) – The priority of the jobs: low, medium, or high.
protocols (Optional[Dict[str, Any]], optional) – Protocols for storing more or less data per field. Current valid protocols: {‘wavefunction’}
- Returns
- An object that contains the submitted ObjectIds of the new compute. This object has the following fields:
ids: The ObjectId’s of the task in the order of input molecules
submitted: A list of ObjectId’s that were submitted to the compute queue
existing: A list of ObjectId’s of tasks already in the database
- Return type
ComputeResponse
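To illustrate how the three response fields relate when some computations already exist, here is a toy in-memory model. It is not the QCFractal implementation: `ToyResponse` and `ToyQueue` are hypothetical names, task ids are simple counters, and the real server decides what counts as an existing task.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ToyResponse:
    """Hypothetical stand-in mirroring the three documented fields."""
    ids: List[str] = field(default_factory=list)
    submitted: List[str] = field(default_factory=list)
    existing: List[str] = field(default_factory=list)


class ToyQueue:
    """Remembers every (molecule, method, basis) task it has seen."""

    def __init__(self) -> None:
        self._tasks: Dict[Tuple[str, str, str], str] = {}

    def compute(self, molecules: List[str], method: str, basis: str) -> ToyResponse:
        response = ToyResponse()
        for mol in molecules:
            key = (mol, method, basis)
            if key in self._tasks:
                # Already known: reuse the task instead of resubmitting it.
                task_id = self._tasks[key]
                response.existing.append(task_id)
            else:
                task_id = str(len(self._tasks) + 1)
                self._tasks[key] = task_id
                response.submitted.append(task_id)
            response.ids.append(task_id)  # always in input-molecule order
        return response


queue = ToyQueue()
first = queue.compute(["water", "methane"], "b3lyp", "def2-svp")
second = queue.compute(["water", "ammonia"], "b3lyp", "def2-svp")
print(second)
```

In the second call only ammonia is new, so it is the only task in `submitted`, while water's earlier task appears in `existing`. Both ids show up in `ids`, in the order of the input molecules.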
- download(local_path: Optional[Union[str, pathlib.Path]] = None, verify: bool = True, progress_bar: bool = True) → None
Download a remote view if available. The dataset will use this view to avoid server queries for calls to get_entries, get_molecules, get_values, and list_values.
- Parameters
local_path (Optional[Union[str, Path]], optional) – Local path to store the downloaded view. If None, the view will be stored in a temporary file and deleted on exit.
verify (bool, optional) – Verify download checksum. Default: True.
progress_bar (bool, optional) – Display a download progress bar. Default: True
- get_entries(subset: Optional[List[str]] = None, force: bool = False) → pandas.core.frame.DataFrame
Provides a list of entries for the dataset.
- Parameters
subset (Optional[List[str]], optional) – The indices of the desired subset. Return all indices if subset is None.
force (bool, optional) – skip cache
- Returns
A dataframe containing entry names and specifications. For Dataset, specifications are molecule ids. For ReactionDataset, specifications describe reaction stoichiometry.
- Return type
pd.DataFrame
- get_index(subset: Optional[List[str]] = None, force: bool = False) → List[str]
Returns the current index of the database.
- Returns
ret – The names of all reactions in the database
- Return type
List[str]
- get_keywords(alias: str, program: str, return_id: bool = False) → Union[KeywordSet, str]
Pulls the keywords alias from the server for inspection.
- Parameters
alias (str) – The keywords alias.
program (str) – The program the keywords correspond to.
return_id (bool, optional) – If True, returns the id rather than the KeywordSet object.
- Returns
The requested KeywordSet or KeywordSet id.
- Return type
Union[‘KeywordSet’, str]
- get_molecules(subset: Optional[Union[str, Set[str]]] = None, force: bool = False) → Union[pandas.core.frame.DataFrame, Molecule]
Queries full Molecules from the database.
- Parameters
subset (Optional[Union[str, Set[str]]], optional) – The index subset to query on
force (bool, optional) – Force pull of molecules from server
- Returns
Either a DataFrame of indexed Molecules or a single Molecule if a single subset string was provided.
- Return type
Union[pd.DataFrame, ‘Molecule’]
- get_records(method: str, basis: Optional[str] = None, *, keywords: Optional[str] = None, program: Optional[str] = None, include: Optional[List[str]] = None, subset: Optional[Union[str, Set[str]]] = None, merge: bool = False) → Union[pandas.core.frame.DataFrame, ResultRecord]
Queries full ResultRecord objects from the database.
- Parameters
method (str) – The computational method to query on (B3LYP)
basis (Optional[str], optional) – The computational basis to query on (6-31G)
keywords (Optional[str], optional) – The option token desired
program (Optional[str], optional) – The program to query on
include (Optional[List[str]], optional) – The attributes to return. Otherwise returns ResultRecord objects.
subset (Optional[Union[str, Set[str]]], optional) – The index subset to query on
merge (bool) – Merge multiple results into one (as in the case of DFT-D3). This only works when include=[‘return_results’], as in get_values.
- Returns
Either a DataFrame of indexed ResultRecords or a single ResultRecord if a single subset string was provided.
- Return type
Union[pd.DataFrame, ‘ResultRecord’]
- get_values(method: Optional[Union[List[str], str]] = None, basis: Optional[Union[List[str], str]] = None, keywords: Optional[str] = None, program: Optional[str] = None, driver: Optional[str] = None, name: Optional[Union[List[str], str]] = None, native: Optional[bool] = None, subset: Optional[Union[List[str], str]] = None, force: bool = False) → pandas.core.frame.DataFrame
Obtains values matching the search parameters provided for the expected return_result values. Defaults to the standard programs and keywords if not provided.
Note that unlike get_records, get_values will automatically expand searches and return multiple method and basis combinations simultaneously.
None is a wildcard selector. To search for None, use “None”.
- Parameters
method (Optional[Union[str, List[str]]], optional) – The computational method (B3LYP)
basis (Optional[Union[str, List[str]]], optional) – The computational basis (6-31G)
keywords (Optional[str], optional) – The keyword alias
program (Optional[str], optional) – The underlying QC program
driver (Optional[str], optional) – The type of calculation (e.g. energy, gradient, hessian, dipole…)
name (Optional[Union[str, List[str]]], optional) – Canonical name of the record. Overrides the above selectors.
native (Optional[bool], optional) – True: only include data computed with QCFractal; False: only include data contributed from outside sources; None: include both
subset (Optional[List[str]], optional) – The indices of the desired subset. Return all indices if subset is None.
force (bool, optional) – Data is typically cached, forces a new query if True
- Returns
A DataFrame of values with columns corresponding to methods and rows corresponding to molecule entries.
- Return type
DataFrame
- list_keywords() → pandas.core.frame.DataFrame
Lists keyword aliases for each program in the dataset.
- Returns
A dataframe containing programs, keyword aliases, KeywordSet ids, and whether those keywords are the default for a program. Indexed on program.
- Return type
pd.DataFrame
- list_records(dftd3: bool = False, pretty: bool = True, **search: Optional[Union[List[str], str]]) → pandas.core.frame.DataFrame
Lists specifications of available records, i.e. method, program, basis set, keyword set, driver combinations. None is a wildcard selector. To search for None, use “None”.
- Parameters
pretty (bool) – Replace NaN with “None” in returned DataFrame
**search (Dict[str, Optional[str]]) – Allows searching to narrow down return.
- Returns
Record specifications matching **search.
- Return type
DataFrame
- list_values(method: Optional[Union[List[str], str]] = None, basis: Optional[Union[List[str], str]] = None, keywords: Optional[str] = None, program: Optional[str] = None, driver: Optional[str] = None, name: Optional[Union[List[str], str]] = None, native: Optional[bool] = None, force: bool = False) → pandas.core.frame.DataFrame
Lists available data that may be queried with get_values. Results may be narrowed by providing search keys. None is a wildcard selector. To search for None, use “None”.
- Parameters
method (Optional[Union[str, List[str]]], optional) – The computational method (B3LYP)
basis (Optional[Union[str, List[str]]], optional) – The computational basis (6-31G)
keywords (Optional[str], optional) – The keyword alias
program (Optional[str], optional) – The underlying QC program
driver (Optional[str], optional) – The type of calculation (e.g. energy, gradient, hessian, dipole…)
name (Optional[Union[str, List[str]]], optional) – The canonical name of the data column
native (Optional[bool], optional) – True: only include data computed with QCFractal; False: only include data contributed from outside sources; None: include both
force (bool, optional) – Data is typically cached, forces a new query if True
- Returns
A DataFrame of the matching data specifications
- Return type
DataFrame
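The wildcard rule stated for get_values, list_records, and list_values (None matches anything, while the string "None" matches a missing value) can be sketched with a plain-Python filter. This is an illustration of the selection rule only, not the library's implementation. `filter_specs` and the rows in `SPECS` are hypothetical, and the real methods operate on DataFrames.

```python
from typing import Any, Dict, List, Optional

# Hypothetical specification rows; a basis of None means "no basis set recorded".
SPECS: List[Dict[str, Optional[str]]] = [
    {"method": "b3lyp", "basis": "def2-svp", "program": "psi4"},
    {"method": "mp2", "basis": "cc-pvdz", "program": "psi4"},
    {"method": "gfn2-xtb", "basis": None, "program": "xtb"},
]


def filter_specs(specs: List[Dict[str, Any]], **search: Optional[str]) -> List[Dict[str, Any]]:
    """Keep rows that match every search key under the wildcard rule."""
    matches = []
    for row in specs:
        keep = True
        for key, wanted in search.items():
            if wanted is None:
                continue  # wildcard: this key places no constraint on the row
            target = None if wanted == "None" else wanted
            if row.get(key) != target:
                keep = False
                break
        if keep:
            matches.append(row)
    return matches


print(filter_specs(SPECS, basis=None))    # wildcard on basis
print(filter_specs(SPECS, basis="None"))  # only rows without a basis set
```

The distinction matters for methods that have no basis set: passing basis=None leaves the search unconstrained, so the string form is needed to select exactly those rows.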
- set_default_benchmark(benchmark: str) → bool
Sets the default benchmark value.
- Parameters
benchmark (str) – The benchmark to default to.
- set_default_program(program: str) → bool
Sets the default program.
- Parameters
program (str) – The program to default to.
- set_view(path: Union[str, pathlib.Path]) → None
Set a dataset to use a local view.
- Parameters
path (Union[str, Path]) – path to an hdf5 file representing a view for this dataset
- statistics(stype: str, value: str, bench: Optional[str] = None, **kwargs: Dict[str, Any]) → Union[numpy.ndarray, pandas.core.series.Series, numpy.float64]
Provides statistics for various columns in the underlying dataframe.
- Parameters
stype (str) – The type of statistic in question
value (str) – The method string to compare
bench (str, optional) – The benchmark method for the comparison, defaults to default_benchmark.
kwargs (Dict[str, Any]) – Additional kwargs to pass to the statistics functions
- Returns
Returns an ndarray, Series, or float with the requested statistics depending on input.
- Return type
np.ndarray, pd.Series, float
- to_file(path: Union[str, pathlib.Path], encoding: str) → None
Writes a view of the dataset to a file.
- Parameters
path (Union[str, Path]) – Where to write the file
encoding (str) – Options: plaintext, hdf5
- visualize(method: Optional[str] = None, basis: Optional[str] = None, keywords: Optional[str] = None, program: Optional[str] = None, groupby: Optional[str] = None, metric: str = 'UE', bench: Optional[str] = None, kind: str = 'bar', return_figure: Optional[bool] = None, show_incomplete: bool = False) → plotly.Figure
- Parameters
method (Optional[str], optional) – Methods to query
basis (Optional[str], optional) – Bases to query
keywords (Optional[str], optional) – Keyword aliases to query
program (Optional[str], optional) – Programs aliases to query
groupby (Optional[str], optional) – Groups the plot by this index.
metric (str, optional) – The metric to use either UE (unsigned error) or URE (unsigned relative error)
bench (Optional[str], optional) – The benchmark level of theory to use
kind (str, optional) – The kind of chart to produce, either ‘bar’ or ‘violin’
return_figure (Optional[bool], optional) – If True, return the raw plotly figure. If False, return a hosted iPlot. If None, return an iPlot display in a Jupyter notebook and a raw plotly figure in all other circumstances.
show_incomplete (bool, optional) – Display statistics method/basis set combinations where results are incomplete
- Returns
The requested figure.
- Return type
plotly.Figure
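As a rough illustration of the two metrics named above, the sketch below computes an unsigned error (UE) and an unsigned relative error (URE) per entry against a benchmark column. The formulas are assumptions based on the metric names: URE is expressed here as a fraction of the benchmark magnitude, and the library's exact normalization may differ. The helper names are hypothetical and the numbers are made up for easy arithmetic, not real results.

```python
from typing import List, Sequence


def unsigned_errors(values: Sequence[float], bench: Sequence[float]) -> List[float]:
    """UE per entry: |value - benchmark|."""
    return [abs(v - b) for v, b in zip(values, bench)]


def unsigned_relative_errors(values: Sequence[float], bench: Sequence[float]) -> List[float]:
    """URE per entry: |value - benchmark| / |benchmark|, as a fraction (assumed form)."""
    return [abs(v - b) / abs(b) for v, b in zip(values, bench)]


# Made-up numbers, one per molecule entry.
method_values = [2.5, -4.0, 8.0]
benchmark_values = [2.0, -5.0, 8.0]

ue = unsigned_errors(method_values, benchmark_values)
ure = unsigned_relative_errors(method_values, benchmark_values)
mean_ue = sum(ue) / len(ue)  # a mean unsigned error over all entries
print(ue, ure, mean_ue)
```

A benchmark value of zero would make the relative error undefined, so any real use of a relative metric needs a policy for such entries.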