scDTI (Li, Michelle, et al.)

DataLoader

Dataset Description

To curate target information for a therapeutic area, we examine the drugs indicated for the therapeutic area of interest and its descendants. The two therapeutic areas examined are rheumatoid arthritis (RA) and inflammatory bowel disease (IBD).

Positive examples (i.e., where the label y = 1) are proteins targeted by drugs that have completed at least phase 2 of clinical trials for treating a specific therapeutic area. The rationale is that a protein is a promising candidate if a compound that targets it is safe for humans and effective for treating the disease. Negative examples (i.e., where the label y = 0) are druggable proteins that have no known association with the therapeutic area of interest according to Open Targets. A protein is deemed druggable if it is targeted by at least one existing drug. We extract drugs and their nominal targets from DrugBank. For both positive and negative training examples, we retain only proteins activated in at least one cell type-specific protein interaction network.

Note: to get the exact cell type-specific data and labels used in the PINNACLE paper, please refer to the tdc_ml.scDTI benchmark group.
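The labeling rules above can be sketched as a small function. This is only an illustration under stated assumptions: the function name, the set-based inputs, and the toy protein identifiers are hypothetical and are not part of the dataset or of any library API.

```python
def label_protein(protein, clinical_targets, druggable,
                  disease_associated, active_in_network):
    """Return 1 (positive), 0 (negative), or None (excluded) for one
    protein and one therapeutic area. All inputs are hypothetical sets."""
    if protein not in active_in_network:
        # Not activated in any cell type-specific network: dropped.
        return None
    if protein in clinical_targets:
        # Targeted by a drug that completed at least phase 2 for the area.
        return 1
    if protein in druggable and protein not in disease_associated:
        # Druggable, with no known association with the area.
        return 0
    # Neither a positive nor a valid negative.
    return None


# Toy, made-up inputs (placeholders, not real annotations).
clinical_targets = {"P1"}
druggable = {"P1", "P2", "P3"}
disease_associated = {"P1", "P3"}
active_in_network = {"P1", "P2", "P3", "P4"}

labels = {
    p: label_protein(p, clinical_targets, druggable,
                     disease_associated, active_in_network)
    for p in ["P1", "P2", "P3", "P4", "P5"]
}
print(labels)
```

In this toy example, a protein that is druggable but already associated with the disease is excluded rather than labeled negative, which mirrors the definition of negatives given above.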

Task Description

Classification. Given a protein and its cell context, predict whether the protein is a therapeutic target.

Dataset Statistics

The final dataset contains 152 positive and 1,465 negative samples for RA, and 114 positive and 1,377 negative samples for IBD. In PINNACLE, this dataset was augmented to include 156 cell types.
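To make the class balance implied by these counts explicit, the short calculation below derives the total sample count and the positive fraction for each therapeutic area. The counts are taken from the statistics above; the variable and function names are illustrative.

```python
# (positives, negatives) per therapeutic area, from the statistics above.
counts = {"RA": (152, 1465), "IBD": (114, 1377)}


def positive_fraction(pos, neg):
    """Fraction of samples that carry the positive label."""
    return pos / (pos + neg)


for area, (pos, neg) in counts.items():
    total = pos + neg
    frac = positive_fraction(pos, neg)
    print(f"{area}: {total} samples, {frac:.1%} positive")
```

Positives make up less than 10% of the samples in both areas, so accuracy alone can be misleading when evaluating a classifier on this task.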

Available Splits

Cold Protein Split
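A cold protein split generally means that no protein appears in more than one of the train, validation, and test sets. The sketch below illustrates that idea on toy data. It is not the library's implementation: the function name, the 70/10/20 fractions, the seed, and the row layout are assumptions made for illustration.

```python
import random


def cold_protein_split(rows, frac=(0.7, 0.1, 0.2), seed=0):
    """Assign each protein to exactly one fold, then route every
    (protein, cell_type, label) row to the fold of its protein."""
    proteins = sorted({protein for protein, _, _ in rows})
    random.Random(seed).shuffle(proteins)

    n_train = int(frac[0] * len(proteins))
    n_valid = int(frac[1] * len(proteins))

    fold_of = {}
    for i, protein in enumerate(proteins):
        if i < n_train:
            fold_of[protein] = "train"
        elif i < n_train + n_valid:
            fold_of[protein] = "valid"
        else:
            fold_of[protein] = "test"

    split = {"train": [], "valid": [], "test": []}
    for row in rows:
        split[fold_of[row[0]]].append(row)
    return split


# Toy rows: 10 made-up proteins, each observed in two made-up cell types.
rows = [(f"P{i}", cell, i % 2) for i in range(10) for cell in ("cellA", "cellB")]
split = cold_protein_split(rows)

train_proteins = {p for p, _, _ in split["train"]}
valid_proteins = {p for p, _, _ in split["valid"]}
test_proteins = {p for p, _, _ in split["test"]}
print(len(train_proteins), len(valid_proteins), len(test_proteins))
```

Splitting at the protein level rather than the row level matters here because the same protein can occur in several cell contexts; a row-level split would let a model see a test protein during training.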

Usage Example

from tdc_ml.multi_pred import scDTI

data = scDTI(name='opentargets_dti')

# Access the data
df = data.get_data()
print(df.head())

# Get train/val/test splits
split = data.get_split()
print(split)

License

This dataset is licensed under CC BY 4.0.