Introduction to DeepChem — Driving AI in Science

Arun Thiagarajan
Published in CodeX
Aug 27, 2021

--

At the Tokyo 2020 Olympics, many records and personal bests tumbled in the track and field events. For Mondo, the Italian company that makes surfaces and artificial turf systems, it was a different sort of triumph. It took the company almost three years to come up with the track surface used in Tokyo: testing different versions, sourcing materials, experimenting with different rubbers, and collecting feedback from athletes about the surfaces.

Traditionally, solutions to scientific problems are found through experimental research. This involves conducting experiments over a long period of time, analyzing the results and backing them with theory, which makes finding a solution costly in both time and resources. For example, in the search for a new drug for a disease, the number of candidate drugs is very large, so finding the one life-saving drug is expensive. Deep learning can help here: combined with approaches such as simulation and theory-based calculation, which can be used to create datasets, AI algorithms can be applied to find solutions to problems in science.

Machine learning methods help by framing such problems as prediction or classification tasks, for example predicting a new drug for a disease or finding new materials based on structural properties. To this end, DeepChem is a tool that accelerates the use of AI in science.

The use of deep learning in science comes with barriers, and DeepChem helps to lower them. In the rest of this blog post, we discuss how DeepChem helps scientists, researchers and engineers apply deep learning to science.

The success of machine learning methods depends on the dataset used. To apply deep learning to a scientific problem, one needs to collect data and then featurize, split and transform it for the learning task. DeepChem helps at every stage of this process. For data, users can bring their own dataset or use one from the MoleculeNet suite of datasets to evaluate their approach.

Featurization represents data points in a form suitable for machine learning. There is no single standard featurization technique that works for all datasets. For example, predicting the reaction energy of molecules requires one kind of molecular featurization, while protein folding requires the molecules to be featurized in a different way. DeepChem provides a suite of featurizers that scientists can apply directly to their own datasets for deep learning applications.
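To make the idea of featurization concrete, here is a minimal plain-Python sketch that turns a chemical formula into a fixed-length vector of element counts. It is a toy illustration only and does not use DeepChem's featurizers; the function name featurize_formula and the four-element vocabulary are assumptions made for this example.

```python
import re

# Hypothetical fixed element vocabulary for this toy example.
ELEMENTS = ["H", "C", "N", "O"]


def featurize_formula(formula):
    """Turn a simple formula such as 'C2H6O' into a vector of element counts."""
    counts = {element: 0 for element in ELEMENTS}
    # Each match is an element symbol followed by an optional count.
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol not in counts:
            raise ValueError(f"element {symbol!r} is not in the vocabulary")
        counts[symbol] += int(number) if number else 1
    # Always emit the counts in the same order so every vector lines up.
    return [counts[element] for element in ELEMENTS]


features = [featurize_formula(f) for f in ["H2O", "C2H6O", "CH4N2O"]]
print(features)
```

Every molecule ends up as a list of the same length, which is what a learning algorithm needs. Real featurizers encode far more than composition, such as bonds and 3D structure.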

In most traditional machine learning problems, data points are split randomly into train, validation and test sets, or into k folds, for evaluation. But a random split is not always appropriate for scientific tasks. DeepChem provides the deepchem.splits API for splitting datasets in a scientifically aware way. For example, consider the task of splitting molecules for drug discovery, where some molecules are small (0–100 atoms) and others are large (>100 atoms). An ideal split contains small and large molecules in the same proportions as the original dataset. In imbalanced datasets, a random split cannot always achieve this. DeepChem helps split the data by a property of the data points (the number of atoms in this example) to get a scientifically meaningful split.
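The small-versus-large molecule example can be sketched in a few lines of plain Python. This is not the deepchem.splits implementation, just an illustration of the idea under simple assumptions: the helper stratified_split, the size_class rule and the 90/10 toy dataset are all made up for this example.

```python
import random


def stratified_split(items, key, frac_train=0.8, seed=0):
    """Split items so each group defined by key keeps the same share in train and test."""
    rng = random.Random(seed)
    groups = {}
    for item in items:
        groups.setdefault(key(item), []).append(item)
    train, test = [], []
    for members in groups.values():
        members = members[:]  # copy so the caller's data is not reordered
        rng.shuffle(members)
        cut = round(frac_train * len(members))
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


def size_class(atom_count):
    """Label a molecule by size using the 100-atom threshold from the text."""
    return "large" if atom_count > 100 else "small"


# Hypothetical imbalanced data: 90 small molecules and 10 large ones,
# each represented only by its atom count.
atom_counts = [20] * 90 + [150] * 10
train, test = stratified_split(atom_counts, key=size_class)
print(len(train), len(test))
```

Because the split is done inside each size group, both the train and test sets keep the original 9:1 ratio of small to large molecules, which a purely random split does not guarantee.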

DeepChem also has the deepchem.trans module, which helps in transforming features (for example, with a min-max transformer). Without DeepChem, all of these steps require installing and using several different libraries. DeepChem exposes the key functionality of those libraries through a unified API and ships it as a single package, making it easier for scientists, engineers and other users to get started.
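As a rough picture of what a min-max transformation does, the plain-Python sketch below rescales each feature column to the range [0, 1] using statistics computed on the training data only. The class name ToyMinMaxTransformer and the sample values are assumptions for illustration; this is not the deepchem.trans API.

```python
class ToyMinMaxTransformer:
    """Rescale each feature column to [0, 1] using training-set minima and maxima."""

    def fit(self, rows):
        columns = list(zip(*rows))
        self.mins = [min(col) for col in columns]
        self.maxs = [max(col) for col in columns]
        return self

    def transform(self, rows):
        scaled = []
        for row in rows:
            scaled.append([
                # Constant columns carry no information, so map them to 0.0.
                (value - lo) / (hi - lo) if hi > lo else 0.0
                for value, lo, hi in zip(row, self.mins, self.maxs)
            ])
        return scaled


train_X = [[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]]
test_X = [[2.0, 40.0]]

transformer = ToyMinMaxTransformer().fit(train_X)
train_scaled = transformer.transform(train_X)
test_scaled = transformer.transform(test_X)
print(train_scaled, test_scaled)
```

Fitting on the training set and reusing the same minima and maxima for the test set avoids leaking test information into training. It also means test values can fall outside [0, 1], as the second test feature does here.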

To make predictions for classification and regression tasks on a dataset, one can either use the suite of models provided by DeepChem or build one's own. DeepChem supports wrapping a wide range of models from other machine learning frameworks such as TensorFlow, PyTorch, JAX and scikit-learn, which makes it suitable for different scientific applications. Users can also bring their own models from these frameworks and use them within DeepChem.

Metrics help in evaluating machine learning models. DeepChem provides the standard scikit-learn metrics through the deepchem.metrics API, as well as other metrics used in science such as the BEDROC score. For hyperparameters, which cannot be learned directly during training, DeepChem provides hyperparameter optimization algorithms that can be used during validation, making the parameters easier to tune. DeepChem also integrates with other tools such as Weights & Biases, which can help with experiment tracking, dataset versioning and model management.
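The interplay between a metric, a validation set and a hyperparameter can be sketched with a tiny grid search in plain Python. This is a conceptual illustration, not DeepChem's hyperparameter optimization API: the one-parameter ridge model, the function names and the data points are all assumptions chosen to keep the example short.

```python
def fit_ridge_slope(xs, ys, penalty):
    """Closed-form slope of a no-intercept ridge regression y = w * x."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)


def mean_squared_error(xs, ys, slope):
    """Metric used to score a fitted slope on a dataset."""
    return sum((y - slope * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grid_search(train, valid, penalties):
    """Fit on train for each penalty and score on valid; lower score is better."""
    scores = {}
    for penalty in penalties:
        slope = fit_ridge_slope(*train, penalty)
        scores[penalty] = mean_squared_error(*valid, slope)
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical data: noisy training targets, clean validation targets (y = 2x).
train = ([1.0, 2.0, 3.0], [2.3, 4.2, 6.4])
valid = ([4.0, 5.0], [8.0, 10.0])

best_penalty, scores = grid_search(train, valid, penalties=[0.0, 1.0, 10.0])
print(best_penalty, scores)
```

The penalty is never learned from the training data. It is chosen by comparing the metric on the validation set across candidate values, which is the role hyperparameter optimization plays in the workflow.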

The figure shows the design of DeepChem: data loading, then featurization, splitting and transformation of the dataset, followed by training a machine learning model on the dataset and evaluating the results.
Figure: Current design of DeepChem. Source: https://deepchem.io/

Overall, DeepChem is a powerful tool that can drive AI in science. It is currently being developed to solve problems in semiconductors, materials science, bioinformatics and many other application areas, in addition to its core strengths in cheminformatics and drug discovery.

Getting in Touch

You can get in touch with the DeepChem community via GitHub and Gitter. For tutorials, see the DeepChem tutorials page and the YouTube channel. You can also follow DeepChem on Twitter.
