DisGeNET

GDA

Dataset Description

DisGeNET is a discovery platform containing one of the largest publicly available collections of genes and variants associated with human diseases. It integrates data from expert-curated repositories, GWAS catalogues, animal models, and the scientific literature. The data are homogeneously annotated with controlled vocabularies and community-driven ontologies. TDC uses the curated subset drawn from UniProt, CGI, ClinGen, Genomics England, CTD (human subset), PsyGeNET, and Orphanet. TDC maps each disease ID to a disease definition through MedGen and each GeneID to its UniProt amino acid sequence.

Task Description

Regression. Given the disease description and the amino acid sequence of the gene, predict a continuous score for their association.
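
As a rough illustration of how a regression model on this task might be scored, the sketch below compares a constant mean-of-training-labels baseline against held-out labels using mean absolute error (MAE). MAE is only one common choice and is an assumption here, not a metric prescribed by this card; the label values are made up.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up association scores standing in for real labels.
train_scores = [0.25, 0.5, 0.75]
test_scores = [0.5, 1.0]

# Baseline: always predict the mean of the training labels.
baseline = sum(train_scores) / len(train_scores)
baseline_predictions = [baseline] * len(test_scores)

mae = mean_absolute_error(test_scores, baseline_predictions)
print(f"baseline={baseline}, MAE={mae}")
```

Any model trained on the disease description and amino acid sequence should at least beat this constant baseline on the test split.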

Dataset Statistics

52,476 gene-disease pairs, 7,399 genes, 7,095 diseases
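
Counts like these can be recomputed from the loaded table. The sketch below is a minimal illustration on toy records, assuming each row carries a gene identifier and a disease identifier under the hypothetical keys `Gene_ID` and `Disease_ID`; check the actual column names (for example with `df.columns`) before adapting it.

```python
def summarize_pairs(records, gene_key="Gene_ID", disease_key="Disease_ID"):
    """Count unique gene-disease pairs, genes, and diseases.

    The key names are assumptions for illustration, not confirmed column names.
    """
    pairs = {(r[gene_key], r[disease_key]) for r in records}
    genes = {gene for gene, _ in pairs}
    diseases = {disease for _, disease in pairs}
    return {"pairs": len(pairs), "genes": len(genes), "diseases": len(diseases)}


# Toy rows standing in for the real data (made-up identifiers and scores).
toy_records = [
    {"Gene_ID": "G1", "Disease_ID": "D1", "Y": 0.3},
    {"Gene_ID": "G1", "Disease_ID": "D2", "Y": 0.6},
    {"Gene_ID": "G2", "Disease_ID": "D1", "Y": 0.4},
]

stats = summarize_pairs(toy_records)
print(stats)
```

With a pandas DataFrame, `df.to_dict("records")` produces rows in the shape this function expects.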

Available Splits

Random Split
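
For intuition, a random split shuffles the rows and cuts them into train, validation, and test portions. The sketch below shows the idea on row indices only. The function name `random_split`, the 70/10/20 fractions, and the seed are illustrative assumptions, not the library's documented behavior.

```python
import random


def random_split(n_rows, frac=(0.7, 0.1, 0.2), seed=42):
    """Shuffle row indices and cut them into train/valid/test index lists.

    The fractions and seed are illustrative assumptions.
    """
    if abs(sum(frac) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = int(frac[0] * n_rows)
    n_valid = int(frac[1] * n_rows)
    return {
        "train": indices[:n_train],
        "valid": indices[n_train:n_train + n_valid],
        "test": indices[n_train + n_valid:],  # remainder goes to test
    }


split_idx = random_split(10)
print({name: len(idx) for name, idx in split_idx.items()})
```

Because pairs are assigned at random, the same gene or disease can appear in more than one portion, so test performance reflects generalization to new pairs rather than to unseen genes or diseases.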

Usage Example

from tdc_ml.multi_pred import GDA

data = GDA(name='DisGeNET')

# Access the data
df = data.get_data()
print(df.head())

# Get train/val/test splits
split = data.get_split()
print(split)

License

This dataset is licensed under CC BY-NC-SA 4.0.