Repositories

A Data Repository is a system that stores and publishes Data Files for download. Such repositories have their own architecture, data access controls, data portals and download clients. Generally speaking, there are two types of repositories:

  1. Cloud - offers facilities for compute and storage
  2. Non-Cloud - provides storage functionality only

This page gives an overview of these repositories, their purpose and function, and links to each repository's important pages and resources.

All ICGC data can be searched using the ICGC Data Portal. The data is generally divided into projects. Every repository has a repository code (ID) that identifies it in the ICGC API.
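When scripting against the portal, it can be handy to keep the repository codes in a small lookup table. The Python sketch below encodes only the names, types and codes listed on this page; the `REPOSITORIES` dictionary and the two helper functions are illustrative names and are not part of any ICGC client or API.

```python
# Repository codes, names and types as listed in the tables on this page.
# The dictionary and helper names are hypothetical, for illustration only.
REPOSITORIES = {
    "collaboratory": ("Cancer Genome Collaboratory", "Cloud"),
    "aws-virginia": ("ICGC Storage Server (hosted at AWS)", "Cloud"),
    "ega": ("European Genome-phenome Archive", "Non-Cloud"),
    "gdc": ("Genomic Data Commons", "Non-Cloud"),
    "pdc": ("Bionimbus Protected Data Cloud", "Cloud"),
    "azure": ("Microsoft Azure", "Cloud"),
}


def repository_type(repo_code):
    """Return 'Cloud' or 'Non-Cloud' for a known repository code."""
    try:
        return REPOSITORIES[repo_code][1]
    except KeyError:
        raise ValueError(f"Unknown repository code: {repo_code!r}") from None


def cloud_repositories():
    """Return the codes of all repositories that also offer compute."""
    return sorted(code for code, (_, kind) in REPOSITORIES.items() if kind == "Cloud")
```

For example, `repository_type("ega")` returns `"Non-Cloud"`, and an unknown code raises a `ValueError` rather than failing silently.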

Collaboratory

Academic research cloud infrastructure built to house ICGC data.

| Property | Value |
| --- | --- |
| Name | Cancer Genome Collaboratory |
| Contact | ICGC ARGO Contact page |
| Repository type | Cloud |
| ICGC Portal Page | Portal |
| Download Client | Tarball, Docker |
| Repo Code | collaboratory |

Obtaining Data Access

To obtain access to ICGC Controlled Data, learn more about the DACO application process here, and then start a DACO application.

Download Client Operation

To operate the Score download client, follow the instructions here.

AWS

Amazon cloud service containing ICGC data.

| Property | Value |
| --- | --- |
| Name | ICGC Storage Server (hosted at AWS) |
| Contact | ICGC ARGO Contact page |
| Repository type | Cloud |
| ICGC Portal Page | Portal |
| Download Client | Tarball, Docker |
| Repo Code | aws-virginia |

Obtaining Data Access

To obtain access to ICGC Controlled Data, learn more about the DACO application process here, and then start a DACO application.

Download Client Operation

To operate the Score download client, follow the instructions here.

EGA

The European Genome-Phenome Archive (EGA) is co-managed by EBI and CRG. Data can only be downloaded through the EGA download client, but metadata may be viewed on the EGA website. Files are grouped into datasets based on the study they were collected in, and access is granted on a dataset-by-dataset basis. This repository carries both ICGC and non-ICGC data.

| Property | Value |
| --- | --- |
| Name | European Genome-phenome Archive |
| Contact | helpdesk@ega-archive.org |
| Repository type | Non-Cloud |
| Official Website | https://ega-archive.org |
| ICGC Portal Page | Portal |
| Download Client | Zipfile |
| Repo Code | ega |

Obtaining Data Access

To obtain access to ICGC Controlled Data, learn more about the DACO application process here, and then start a DACO application. Once approved by ICGC DACO, you will need to contact EGA to have your EGA account set up.

Download Client Operation

To operate the EGA download client, follow the instructions here.

GDC

The Genomic Data Commons (GDC) is a data repository for cancer genomic information run by the US government (NIH / NCI). Notably, it carries data from The Cancer Genome Atlas (TCGA) and the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program. Currently the GDC is the largest single repository of ICGC data. It focuses on studies in the United States.

| Property | Value |
| --- | --- |
| Name | Genomic Data Commons |
| Contact | support@nci-gdc.datacommons.io |
| Repository type | Non-Cloud |
| Official Data Portal | https://portal.gdc.cancer.gov/ |
| ICGC Portal Page | Portal |
| Download Client | Download the client here |
| Repo Code | gdc |

Obtaining Data Access

To obtain access, you must have an eRA Commons account and dbGaP access to the GDC data you are interested in. Talk to your team leader if you do not have this access. Once this account is set up, you can log in to the GDC using your dbGaP credentials.

https://gdc.cancer.gov/access-data/obtaining-access-controlled-data

Download Client Operation

Once you or your project leader have obtained access to the research project, you will need to download access tokens from the GDC Data Portal. A comprehensive guide on how to use the GDC client is available here.

PDC

The Bionimbus Protected Data Cloud (PDC) is a secure biomedical cloud operated at FISMA moderate as IaaS with an NIH Trusted Partner status for analyzing and sharing protected datasets. The Bionimbus PDC is a collaboration between the University of Chicago Center for Data Intensive Science (CDIS) and the Open Commons Consortium (OCC). The Bionimbus PDC allows users authorized by NIH to compute over human genomic data in a secure compliant fashion.

It is a secure data cloud that stores US PCAWG data.

| Property | Value |
| --- | --- |
| Name | Bionimbus Protected Data Cloud |
| Contact | support@opensciencedatacloud.org |
| Repository type | Cloud |
| Official Website | https://bionimbus-pdc.opensciencedatacloud.org |
| ICGC Portal Page | Portal |
| Download Client | Amazon Web Services Command Line Interface |
| Client Documentation | AWS Guide |
| Repo Code | pdc |

Obtaining Data Access

The process is the same as obtaining data access to the GDC.

Download Client Operation

The data in the PDC can be accessed using the AWS CLI. You will first need to enter your key and secret key by running aws configure and following the prompts. The key can be downloaded from the projects tab of the official PDC website. Once your credentials have been entered, you can begin downloading.

Download manifest file from ICGC Portal

As described in the Search for PCAWG data section, once you are satisfied with the search results for TCGA data files, click the "Download Manifest" button to retrieve the manifest tarball (named manifest.*.tar.gz). Unpack the tarball; you should get a file named manifest.pdc.*.sh, e.g., manifest.pdc.1586448715169.sh.
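The unpacking step can also be scripted. The following Python sketch uses only the standard library to pull the PDC manifest out of the tarball; the function name `extract_pdc_manifest` is hypothetical, and the sketch assumes the tarball is gzip-compressed and contains a file matching `manifest.pdc.*.sh`, as described above.

```python
import fnmatch
import tarfile
from pathlib import Path


def extract_pdc_manifest(tarball, dest="."):
    """Copy the first manifest.pdc.*.sh file found in the tarball into dest.

    Only the file's base name is used when writing, so unexpected paths
    stored inside the archive cannot place files outside dest.
    """
    dest = Path(dest)
    with tarfile.open(tarball, "r:gz") as archive:
        matches = [
            member
            for member in archive.getmembers()
            if member.isfile()
            and fnmatch.fnmatch(Path(member.name).name, "manifest.pdc.*.sh")
        ]
        if not matches:
            raise FileNotFoundError(f"No manifest.pdc.*.sh file found in {tarball}")
        member = matches[0]
        data = archive.extractfile(member).read()
    target = dest / Path(member.name).name
    target.write_bytes(data)
    return target
```

Calling `extract_pdc_manifest("manifest.1586448715169.tar.gz")` (an example file name) would return the path of the extracted manifest script, which is the input for the conversion step.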

Convert ICGC manifest file to PDC's Gen3 manifest file

The ICGC manifest file must be converted to PDC's Gen3 manifest format before the actual data files can be downloaded. A Python script (dcc_to_gen3.py) performs the conversion. It can be downloaded with the following command:

wget https://raw.githubusercontent.com/uc-cdis/pdc_tools/1.0/dcc_manifest_conversion/dcc_to_gen3.py

You need to have Python 3 and the required libraries (such as numpy and pandas) installed. Once they are installed, you can run the script to produce the Gen3 manifest file. Remember to replace the ICGC manifest name with your own file name.

python dcc_to_gen3.py --manifest manifest.pdc.1586448715169.sh

This will produce a Gen3 manifest file named gen3_manifest_manifest.pdc.1586448715169.sh.json, which contains the information needed to download the actual data from PDC using the gen3-client tool.
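To check the conversion result from a script, a sketch like the one below may help. It infers the output naming pattern (`gen3_manifest_` + input name + `.json`) from the single example above, and it assumes the Gen3 manifest is a JSON array with one entry per file; both points are assumptions to verify against your own output. The function names are hypothetical.

```python
import json
from pathlib import Path


def gen3_manifest_name(icgc_manifest):
    """Return the output file name expected for an ICGC manifest.

    The pattern is inferred from the example on this page, not from
    the conversion script's documentation.
    """
    return f"gen3_manifest_{Path(icgc_manifest).name}.json"


def count_manifest_entries(gen3_manifest):
    """Return the number of entries in a Gen3 manifest file.

    Assumes the top-level JSON value is a list with one entry per file.
    """
    entries = json.loads(Path(gen3_manifest).read_text())
    if not isinstance(entries, list):
        raise ValueError(f"{gen3_manifest} does not contain a JSON list")
    return len(entries)
```

Comparing the entry count against the number of files selected in the portal is a quick sanity check before starting a large download.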

Install Gen3-client

Run the following commands to install gen3-client if you are using macOS:

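# Create a directory for the client and add it to PATH in ~/.bashrc
mkdir -p ~/.gen3
echo "" >> ~/.bashrc
echo "export PATH=\$PATH:~/.gen3" >> ~/.bashrc
# Look up the latest macOS release asset and download it (pattern quoted to avoid shell globbing)
curl https://api.github.com/repos/uc-cdis/cdis-data-client/releases/latest | grep 'browser_download_url.*osx' | cut -d '"' -f 4 | wget -qi -
# Unpack the client, move it into the PATH directory, and clean up
unzip dataclient_osx.zip
mv gen3-client ~/.gen3
rm dataclient_osx.zip
# Reload the shell configuration so the new PATH takes effect
source ~/.bashrc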

You should now be able to run the gen3-client command from your console and see the usage message.

To install gen3-client on other operating systems, please follow the instructions here.

Get gen3-client API key and configure your profile

Next, create a gen3-client API key at https://icgc.bionimbus.org after authenticating via NIH eRA Commons. To do that, go to the login page and click the "Login with NIH" button. After authenticating successfully, go to https://icgc.bionimbus.org/identity to create the API key. In the popup dialog, click "Download json" to retrieve the API key.

The API key will be saved as credentials.json. You can then use it to configure a profile. In this example the profile is named icgc:

gen3-client configure --profile=icgc --cred=credentials.json --apiendpoint=https://icgc.bionimbus.org/

Upon success, you should see a message: Profile 'icgc' has been configured successfully.
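If the configure command fails, one thing worth ruling out is a truncated or malformed credentials file. The sketch below is a hedged example: it assumes credentials.json is a JSON object with `api_key` and `key_id` fields, which is how Gen3 credential files are commonly laid out, but you should confirm this against your own download. The function name is hypothetical.

```python
import json
from pathlib import Path

# Assumed field names; check them against your downloaded credentials.json.
REQUIRED_FIELDS = ("api_key", "key_id")


def missing_credential_fields(path="credentials.json"):
    """Return the assumed required fields that are absent or empty in the file."""
    data = json.loads(Path(path).read_text())
    if not isinstance(data, dict):
        raise ValueError(f"{path} does not contain a JSON object")
    return [field for field in REQUIRED_FIELDS if not data.get(field)]
```

An empty list means both assumed fields are present. The function never prints the key itself, so it is safe to run in a shared terminal.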

Download data using gen3-client download-multiple command

With the icgc profile configured, you can download the PCAWG data using the Gen3 manifest prepared earlier, as follows:

gen3-client download-multiple --profile=icgc --manifest=gen3_manifest_manifest.pdc.1586448715169.sh.json --no-prompt
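When the download is driven from a pipeline, the same command can be assembled programmatically. The Python sketch below builds the argument list using only the flags shown on this page; the function name `download_command` is hypothetical, and the sketch assumes gen3-client is already installed and on your PATH.

```python
import shlex


def download_command(manifest, profile="icgc", no_prompt=True):
    """Build the gen3-client download-multiple argument list."""
    args = [
        "gen3-client",
        "download-multiple",
        f"--profile={profile}",
        f"--manifest={manifest}",
    ]
    if no_prompt:
        # Skip interactive confirmation, as in the command shown on this page.
        args.append("--no-prompt")
    return args


if __name__ == "__main__":
    command = download_command("gen3_manifest_manifest.pdc.1586448715169.sh.json")
    # Print the command for review; shlex.join quotes arguments where needed.
    print(shlex.join(command))
```

To actually start the download, the list can be passed to `subprocess.run(command, check=True)`. Passing a list instead of a single string avoids shell quoting problems if the manifest path contains spaces.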

Azure

Microsoft Azure cloud service containing ICGC data.

| Property | Value |
| --- | --- |
| Name | Microsoft Azure |
| Contact | ICGC ARGO Contact page |
| Repository type | Cloud |
| ICGC Portal Page | Portal |
| Download Client | Tarball, Docker |
| Repo Code | azure |

Obtaining Data Access

To obtain access to ICGC Controlled Data, learn more about the DACO application process here, and then start a DACO application.

Download Client Operation

To operate the Score download client, follow the instructions here.