Data Download Guide

Overview

This user guide describes the steps to securely explore and download ICGC data stored in Amazon (AWS) or Collaboratory (OpenStack) cloud environments. For more information about ICGC cloud initiatives, please see ICGC in the Cloud.

Please see Terms for a glossary of terms used in this guide.

Process at a glance

The figure below illustrates the overall process and systems involved:

  1. Authorization
       • Apply for DACO Access if not already approved
       • Upon approval, log in to the Data Portal
       • Generate an Access Token for cloud download
  2. Compute Prerequisites
       • Provision a Compute Instance in the target cloud
  3. Installation
       • Download and install the Score Client
  4. Configuration
       • Configure the Score Client to use the generated Access Token
  5. File Search
       • Identify files of interest using the Data Portal
  6. Score Client Usage
       • Download or view data with the provided Score Client or via an external tool

Cloud Process Diagram

The subsequent sections will provide additional details on each of these topics.

Authorization

There are two prerequisites to using the Score Client: DACO Approval and a self-provisioned Access Token.

DACO Cloud Access

DACO Cloud Access is acquired after your DACO application has been approved. To apply for DACO access, please follow the instructions provided at https://daco.icgc-argo.org/. Once approved, you will be able to log in to the Data Portal to generate an Access Token. To log in, click on the “Login” link in the upper right-hand corner of the page. When prompted, choose to log in with your ICGC DACO-approved account. After successful authentication, you will know that you have Cloud Access to the controlled tier if the “Login” link is replaced with a green cloud icon:

DACO Cloud Access Link

Access Tokens

The Access Token model used for protecting the cloud data set follows a similar process to GitHub’s personal access tokens. Tokens are used instead of a username / password to securely access ICGC resources.

Related to Access Tokens is the concept of Scopes. Tokens allow you to associate Scopes which limit access to that needed for the target environment. This enhances security by following the Principle of Least Privilege. Cloud specific Scopes will become available after acquiring DACO Approval. An instance of a cloud download token will grant access to all of the available data in that environment.

Token Manager

To acquire an Access Token, you must first obtain DACO Approval and log in to the Data Portal. After a successful login, there will be a Token Manager link in the upper right corner of the page. Clicking on this link will display the Token Manager dialog:

Token Manager Link

From this dialog, you can manage the Access Tokens associated with your account. Importantly, you may delete and regenerate an access token if you believe that it has been compromised.

Token Manager Screenshot

When creating an Access Token, you will need to specify the Scope associated with the target cloud(s).

AWS

In the case of the ICGC AWS data set, an access token with the aws.read scope is required to access the controlled access data.

Collaboratory

In the case of the ICGC Collaboratory data set, an access token with the collab.read scope is required to access the controlled access data.

Azure

In the case of the ICGC Azure data set, an access token with the azure.read scope is required to access the controlled access data.

You can verify that your Access Token has the desired scopes by inspecting it in the table at the bottom of the dialog. For security purposes, Access Tokens must remain private and not be shared with anyone.

Following the creation of a Compute Instance, discussed in the next section, you will need to edit the Score Client configuration file to include the generated Access Token. See the Configuration section for additional information.

Compute Prerequisites

Compute Instance

As a first step in downloading data, you will need to create a Compute Instance to run the Score Client and any other supporting software.

AWS

In order to run within EC2, you will need your own AWS account to provision a running EC2 instance. Any data processing will be charged to this account. Note that ICGC data download from S3 to the same EC2 region is free of charge. Please see Amazon's documentation for detailed instructions.

Collaboratory

In order to run within Collaboratory, you will need to be enrolled. To begin the enrollment process, please send an email to help@cancercollaboratory.org.

UPDATE: Downloading data objects hosted in Collaboratory is no longer required to be performed in a Collaboratory compute instance.

Azure

In order to run within Azure, you will need your own Microsoft Azure account to provision a running Azure instance. Any data processing will be charged to this account.

The following sections provide guidance on selecting and configuring the chosen instance type.

Resources

As data files are quite large, users should have enough local disk space to store files downloaded from the remote repository.

More processing cores will give greater parallelism and, therefore, better throughput of downloads.

By default, the Score Client is configured to use a maximum of 3 GB of RAM. Most of the time this is more than sufficient.
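Before starting a large download, it can help to confirm that the instance has enough free disk space and to see how many cores are available. The following is a minimal sketch using standard tools only; the function names has_free_space and core_count are hypothetical helpers, not part of the Score Client.

```shell
# has_free_space DIR REQUIRED_BYTES
# Succeeds when the file system holding DIR has at least REQUIRED_BYTES free.
has_free_space() (
  dir=$1
  required_bytes=$2
  # df -Pk prints POSIX-format output in 1024-byte blocks; column 4 is "Available".
  available_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  # Round the requirement up to whole kilobytes.
  required_kb=$(( (required_bytes + 1023) / 1024 ))
  [ "$available_kb" -ge "$required_kb" ]
)

# core_count
# Prints the number of online processor cores.
core_count() {
  getconf _NPROCESSORS_ONLN
}
```

The first function could be called with the target download directory and the combined size of the files of interest, and the second gives a starting point for the number of transfer threads.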

Operating System

The Score Client has been designed to work on modern Mac and Linux distributions. Windows should work as well but remains untested.

Installation of the Score Client

This section describes how to install the Score Client. There are two options: (a) from a tarball and (b) from a Docker image hosted on Docker Hub.

Install from Tarball

The Score Client requires Java 11 to be installed. The procedure for installing Java 11 varies depending on the operating system and package manager used. As an example, the following shows how to install Oracle JDK 11 or OpenJDK 11 on Ubuntu Linux:

# Install Oracle JDK 11
sudo add-apt-repository ppa:linuxuprising/java
sudo apt-get update
sudo apt-get upgrade -y
sudo apt install oracle-java11-installer-local oracle-java11-set-default-local
# Or you can install OpenJDK 11:
sudo apt-get install -y openjdk-11-jdk
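After installing Java, it is worth checking which version is actually picked up from the PATH. The sketch below parses the version string printed by java -version; the function names are hypothetical helpers introduced for illustration. It handles both the current scheme (for example 11.0.2) and the legacy scheme used up to Java 8 (for example 1.8.0_292).

```shell
# java_major_version VERSION_STRING
# Prints the major version for strings such as "11.0.2" or the legacy "1.8.0_292".
java_major_version() (
  version=$1
  case $version in
    1.*) version=${version#1.} ;;   # legacy scheme: 1.8.0_292 becomes 8.0_292
  esac
  # Keep only the leading digits.
  echo "${version%%[!0-9]*}"
)

# installed_java_version
# Prints the version string reported by the java found on the PATH.
# java -version writes to standard error, so it is redirected first.
installed_java_version() {
  java -version 2>&1 | awk -F '"' '/version/ { print $2; exit }'
}
```

Running java_major_version "$(installed_java_version)" should print 11 on an instance set up as described above.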

In order to use the mount feature, FUSE is required. On most Linux-based systems this requires installing libfuse-dev, fuse and other packages. The following command installs them on Ubuntu:

sudo apt-get install -y libfuse-dev fuse curl wget software-properties-common

With dependencies installed, now we can install the Score Client itself. The latest version can be downloaded from here, or use the following commands to download from command line.

wget -O score-client.tar.gz https://artifacts.oicr.on.ca/artifactory/dcc-release/bio/overture/score-client/[RELEASE]/score-client-[RELEASE]-dist.tar.gz
tar -xvzf score-client.tar.gz
cd score-client-2.0.0  # or newer version
bin/score-client
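The release placeholder in the download URL appears twice and must be replaced with the same version number in both places. As a small sketch, the hypothetical helper below builds the URL from a single release argument by following the pattern shown above. Whether a given release (such as the 2.0.0 used in the directory name above) is available at that location is an assumption to verify.

```shell
# score_client_url RELEASE
# Prints the distribution URL for RELEASE, following the documented URL pattern.
score_client_url() (
  release=$1
  base=https://artifacts.oicr.on.ca/artifactory/dcc-release/bio/overture/score-client
  echo "${base}/${release}/score-client-${release}-dist.tar.gz"
)
```

The result can then be passed, double-quoted, to wget in place of the literal URL.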

Install from Docker Image

We also support a Docker image of the Score Client that is bundled with Java 8 for easy deployment.

The image is hosted at https://hub.docker.com/r/overture/score/ and can be downloaded by issuing the following command:

docker pull overture/score

Once pulled, you can open a shell in the container by executing:

docker run -it overture/score
bin/score-client

There is no entry point or command defined for the image. The software is located at score-client which is also the working directory of the container. All other steps for using the Score Client will be the same for both Docker and tarball installations.

Configuration

The configuration of the Score Client is stored in the conf/application.properties file of the distribution.

Access Configuration

The main configuration element is the access token generated in Access Token above.

Edit application.properties and add the generated access token to the line shown below (remember to remove the leading '#' to uncomment the line):

accessToken=<access token>

When using Docker, this can also be set with an environment variable:

docker run -it -e ACCESSTOKEN=<access token> overture/score
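For a tarball installation, the edit to application.properties can also be scripted. The following is a minimal sketch, assuming the file contains at most one accessToken line that may be commented out with a leading '#'. The function name set_access_token is hypothetical, and the chmod step is an extra precaution rather than a documented requirement.

```shell
# set_access_token PROPERTIES_FILE TOKEN
# Replaces the (possibly commented-out) accessToken line, or appends one if absent.
set_access_token() (
  file=$1
  token=$2
  tmp=$(mktemp) || exit 1
  awk -v tok="$token" '
    /^#? *accessToken=/ && !done { print "accessToken=" tok; done = 1; next }
    { print }
    END { if (!done) print "accessToken=" tok }
  ' "$file" > "$tmp" &&
    mv "$tmp" "$file" &&
    # The token is a credential, so restrict the file to its owner.
    chmod 600 "$file"
)
```

It would be called from the distribution directory with conf/application.properties and the token copied from the Token Manager.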

Collaboratory

In addition to the above, you will need to change the bin/score-client script to add the line STORAGE_PROFILE=collab. This can also be performed externally via the environment variable of the same name: export STORAGE_PROFILE=collab. Note that it is also possible to override this per execution using bin/score-client's --profile collab argument.

Transport Configuration

Based on the target Compute Instance defined in Compute Prerequisites and transfer speed requirements, it may be necessary to make changes to how the Score Client transfers data. This is achieved by setting transport.parallel and transport.memory in file conf/application.properties:

  • transport.parallel controls the number of concurrent threads for multi-part data transfers. It is recommended to set this to the number of cores of the Compute Instance.
  • transport.memory is the amount of non-heap memory per thread, in gigabytes. It is recommended to set this to a value of 1 (1 GB). Be sure to leave enough memory for the operating system and any other software that may be running on the Compute Instance.
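The recommendations above can be turned into concrete property lines. The sketch below prints the two settings for a given core count and estimates the non-heap memory they imply. Treating the total as threads multiplied by memory per thread is an interpretation of the descriptions above, and both function names are hypothetical.

```shell
# transport_properties CORES [MEMORY_GB]
# Prints the two property lines: one thread per core, MEMORY_GB per thread (default 1).
transport_properties() (
  cores=$1
  memory_gb=${2:-1}
  echo "transport.parallel=$cores"
  echo "transport.memory=$memory_gb"
)

# non_heap_total_gb CORES [MEMORY_GB]
# Prints an upper bound on non-heap transfer memory: threads x memory per thread.
non_heap_total_gb() (
  cores=$1
  memory_gb=${2:-1}
  echo $(( cores * memory_gb ))
)
```

On an 8-core instance this suggests transport.parallel=8 and roughly 8 GB of non-heap memory, which should be compared with the RAM actually available on the Compute Instance.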

File Search

Finding files of interest can be done via the Data Portal. Objects are identified by their Object ID.

  • Navigate to Data Repositories section of the ICGC Data Portal
  • Click on the AWS or Collaboratory filter in the left hand pane
  • Filter based on properties of interest (e.g. donor id, specimen id, etc.)
  • Export a Manifest for future use with the Score Client

The Manifest is the main way to define what files should be downloaded by the Score Client. However, knowing the Object ID is sufficient for a single file download. To generate a Manifest, click on the "Download Files" link in the Data Repository browser. You will be prompted with a "Download Files" dialog:

Download Manifest Dialog

Manifests downloaded from the Data Portal can be transferred to the Score Client instance by using SFTP or SCP. For convenience, when files to be downloaded are all from a single repository, it is also possible to use a Manifest ID saved on the Data Portal by clicking on the "Manifest ID" button. See the Score Client Usage section for usage information.

Score Client Usage

This section provides information on how to use the Score Client once it has been properly downloaded and configured. It assumes the user possesses and has configured the requisite access token discussed previously.

The Score Client has the general syntax:

score-client [options] [command] [command options]

It offers a set of commands, where each command has its own set of options to influence its operation.

Note that the example commands used below assume downloads from Collaboratory. To select it, set the STORAGE_PROFILE environment variable to collab:

export STORAGE_PROFILE=collab

Help

The Score Client provides a --help option to list the available commands and a brief description of their supported options:

bin/score-client --help

It is also possible to get information on a specific command using the help command:

bin/score-client help download

URL Command

The url command is the most basic command supported by the Score Client. It allows one to resolve the underlying S3 URL for the requested object. This is useful if one wants to directly access the URL via HTTPS with an external client or tool (e.g. curl, wget, etc.):

bin/score-client url --object-id ddcdd044-adda-5f09-8849-27d6038f8ccd

An example of using wget:

bin/score-client url --object-id ddcdd044-adda-5f09-8849-27d6038f8ccd
Resolving URL for object: ddcdd044-adda-5f09-8849-27d6038f8ccd
https://object.cancercollaboratory.org:9080/oicr.icgc.28/data/ddcdd044-adda-5f09-8849-27d6038f8ccd?...[snip]

wget -O ddcdd044-adda-5f09-8849-27d6038f8ccd "https://object.cancercollaboratory.org:9080/oicr.icgc.28/data/ddcdd044-adda-5f09-8849-27d6038f8ccd?...[snip]"

You should always double-quote the URL that you pass to wget.
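Quoting matters because the query string of a pre-signed URL typically contains characters such as ? and &, which the shell would otherwise interpret. The sketch below shows one way to capture the URL in a variable instead of copying it by hand. It assumes the url command prints the URL on a line of its own, as in the output above; extract_url is a hypothetical helper.

```shell
# extract_url
# Reads the output of "score-client url" on standard input and prints only
# the first line that starts with http:// or https://.
extract_url() {
  awk '/^https?:\/\// { print; exit }'
}

# Example use (requires a configured Score Client):
#   object_id=ddcdd044-adda-5f09-8849-27d6038f8ccd
#   url=$(bin/score-client url --object-id "$object_id" | extract_url)
#   wget -O "$object_id" "$url"
```

Because the variable is expanded inside double quotes, the URL reaches wget unchanged.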

Download Command

The download command allows fast parallel download of remote objects. It can be run in one of two modes: (a) single object mode and (b) Manifest driven mode.

Note that the Score Client is able to resume an interrupted download session. Simply rerun the same command again and it will continue.
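Since rerunning the same command resumes an interrupted session, the rerun can be automated with a small retry loop. This is a sketch only: it assumes the client exits with a non-zero status when a download fails, which should be verified for your version. The retry function and the RETRY_DELAY variable are hypothetical names.

```shell
# retry MAX_ATTEMPTS COMMAND [ARGS...]
# Reruns COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# Waits RETRY_DELAY seconds (default 30) between attempts.
retry() (
  max_attempts=$1
  shift
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      exit 1
    fi
    attempt=$((attempt + 1))
    sleep "${RETRY_DELAY:-30}"
  done
)
```

Any of the download commands in this section can be passed as the command to rerun, for example retry 5 followed by the full bin/score-client download command line.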

Object ID

This mode is useful when downloading an ad-hoc list of one or more objects with known Object IDs, perhaps acquired from the Data Portal:

bin/score-client download --object-id ddcdd044-adda-5f09-8849-27d6038f8ccd --output-dir data

You can also specify multiple Object IDs separated by spaces:

bin/score-client download --object-id ddcdd044-adda-5f09-8849-27d6038f8ccd 5cc35183-9291-5711-967d-30afcf20e71f --output-dir data
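When the list of Object IDs is kept in a text file, the space-separated argument list can be built from it. The following sketch assumes one Object ID per line. The download_ids function and the SCORE_CLIENT variable are hypothetical; the variable exists only so the client path can be overridden.

```shell
# Command used to invoke the client; override it to point at another location.
SCORE_CLIENT=${SCORE_CLIENT:-bin/score-client}

# download_ids ID_FILE OUTPUT_DIR
# Downloads every Object ID listed in ID_FILE (one per line, blank lines ignored).
download_ids() (
  id_file=$1
  output_dir=$2
  # Join the IDs with single spaces.
  ids=$(awk 'NF { printf "%s%s", sep, $1; sep = " " }' "$id_file")
  [ -n "$ids" ] || { echo "no Object IDs in $id_file" >&2; exit 1; }
  # $ids is intentionally unquoted so that each ID becomes its own argument.
  $SCORE_CLIENT download --object-id $ids --output-dir "$output_dir"
)
```

For longer or recurring lists, a Manifest is usually the more convenient choice.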

Downloads will be stored in the folder specified by --output-dir.

Manifest

Using a Manifest is ideal for downloading multiple files identified through the Data Portal. The repository file search allows one to generate a Manifest file that can be supplied for bulk downloading files. It also provides some additional metadata for selected files that gives the donor, specimen and sample context.

bin/score-client download --manifest manifest.tsv --output-dir data

The optional --output-layout option can be used to organize the downloads into a couple of predefined directory layouts. See the --help output for additional information.

Manifest Command

The manifest command allows a user to quickly view the contents of a download Manifest produced by the Data Portal. A Manifest can come from:

  • The local file system
  • A Manifest ID that is hosted on the Data Portal
  • Any URL

A Manifest is a TSV file that contains both file identifying fields and satellite metadata for understanding the relationships to other data including donor, project and study.
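Because a Manifest is a plain TSV file, it can also be inspected with standard tools. The sketch below lists the column names and counts the entries. It assumes the first line is a header row; no particular column names are assumed, and both function names are hypothetical.

```shell
# manifest_columns MANIFEST_TSV
# Prints each header field of a tab-separated Manifest with its column number.
manifest_columns() {
  awk -F '\t' 'NR == 1 { for (i = 1; i <= NF; i++) print i ": " $i; exit }' "$1"
}

# manifest_rows MANIFEST_TSV
# Prints the number of non-empty data rows, excluding the header line.
manifest_rows() {
  awk 'NR > 1 && NF { n++ } END { print n + 0 }' "$1"
}
```

This gives a quick check that an exported Manifest contains the expected number of files before it is transferred to the Compute Instance.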

Manifest from a Local File

An example of using a local file system Manifest:

bin/score-client manifest --manifest manifest.aws-virginia.1444232116728.txt

Manifest from the Data Portal

An example of using a Data Portal hosted Manifest:

bin/score-client manifest --manifest 49e91614-7811-11e5-8a58-34363bcf803c

Manifest from a URL

An example of using a URL hosted Manifest:

bin/score-client manifest --manifest http://hastebin.com/raw/ujajodilih

View Command

The view command is a minimal version of samtools view. It allows one to request a “genomic slice” of the remote BAM file, freeing the user from having to download the entire file locally, saving bytes and time.

The following example will download reads overlapping the region 1 - 10,000 on chromosome 1:

bin/score-client view --object-id ddcdd044-adda-5f09-8849-27d6038f8ccd --query 1:1-10000

The BAI is automatically discovered and streamed as part of the operation.

For quickly accessing only the BAM header one can issue:

bin/score-client view --header-only --object-id ddcdd044-adda-5f09-8849-27d6038f8ccd

It is also possible to pipe the output of the above to samtools, etc. for pipelining a workflow:

bin/score-client view --object-id ddcdd044-adda-5f09-8849-27d6038f8ccd --query 1:1-100000 | samtools mpileup -

Batch/Manifest Slicing

As of version 1.0.14, the client supports slicing across "batches" of specimens listed in a manifest file (tab-delimited files produced by the Data Portal as opposed to XML-format manifests from GNOS). The output of this feature is illustrated here:

Multiple query regions can be specified at the command line,

bin/score-client view --manifest manifest.txt --query 1:1245-1425 1:1578-1818 1:18100-19780 1:81011-81491 1:18100-19780 1:2772220-2772272 --output-dir data

or in a BED file

bin/score-client view --manifest manifest.txt --bed-query test.bed --output-dir data
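A BED file is a tab-separated list of chromosome, start and end, where the start is 0-based and the end is exclusive. The sketch below converts region strings of the form used with --query into BED lines. It assumes those regions are 1-based and inclusive, as in samtools, and that --bed-query expects standard BED coordinates; both points should be checked against the client's --help output. The function name regions_to_bed is hypothetical.

```shell
# regions_to_bed REGION...
# Converts 1-based inclusive regions (chrom:start-end) into BED lines,
# which use a 0-based start and an exclusive end.
regions_to_bed() {
  for region in "$@"; do
    chrom=${region%%:*}     # text before the first colon
    range=${region#*:}      # text after the first colon
    start=${range%-*}
    end=${range#*-}
    printf '%s\t%d\t%d\n' "$chrom" "$((start - 1))" "$end"
  done
}
```

Redirecting the output to a file, for example regions_to_bed 1:1245-1425 1:1578-1818 > test.bed, produces input suitable for the command above under these assumptions.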

There is also a switch to have indexes generated for the output

bin/score-client view --manifest manifest.txt --output-index --bed-query test.bed --output-dir data

Mount Command

The mount command can be used to mount the remote S3 bucket as a read-only FUSE file system. This is very useful to browse and explore the available files, as well as quickly see their size and date of modification using common commands such as ls, find, du and tree. It also works very well with standard analysis tools such as samtools.

Files are organized into a virtual directory structure. The following shows the default bundle layout:

/bundleId1/fileName1
/bundleId1/fileName2
...
/bundleId1/fileNamei
...
/bundleIdn/fileName1
/bundleIdn/fileName2
...
/bundleIdn/fileNamej

where bundleId and fileName are the original Bundle ID and file name of the file respectively. It is possible to control the layout using the --layout option. Using --layout object-id will instead produce a flat list of files named by their associated Object ID.

The file system implementation's performance is optimized for serial reads. Frequent random access patterns will lead to very poor performance. Under the covers, each random seek requires a new HTTP connection to S3 with the appropriate Range header set which is an expensive operation. For this reason, it is only recommended for streaming analysis (e.g. samtools view like functionality).

Mount All Files

To mount all available files locally, issue the following:

# Create the mount point
sudo mkdir /mnt/icgc
sudo chmod 777 /mnt/icgc/

# Mount
bin/score-client mount --mount-point /mnt/icgc --cache-metadata

NOTE: Mounting all files is not recommended because it may take a very long time. See the following section for how to mount fewer objects using a Manifest.

To speed up subsequent mounts, one can specify the --cache-metadata flag above which will locally store an index of the file system.

Once mounted, you can use standard analysis tools against files found under the mount point:

# Slice
samtools view /mnt/icgc/fff75930-0f8c-4c99-9b48-732e7ed4c625/443a7a6ab964e41c011cc9a303bc086c.bam 1:10000-20000

Mount Only Manifest Entries

To filter the mount to only include the files specified in a Manifest, issue the following:

# Mount
bin/score-client mount --mount-point /mnt/icgc --manifest <manifest_id or manifest_file>

See the manifest command for more details on how to specify a Manifest.

Mount in Docker

To avoid having to install the FUSE and Java dependencies when working with the mount command, it is very convenient to mount from within a Docker container. This is also useful for creating a custom image for analysis that derives from the one published by ICGC. First, ensure that Docker and the Score Client image are installed. See the Installation section for details.

Next, export the access token generated from the portal:

# Export access token, please replace <accessToken> with your own token
export ACCESSTOKEN=<accessToken>

And then mount the file system inside the container against the empty /mnt directory:

# Alias for ease of use, assume we use collab profile
alias docker-score-client="docker run -it --rm -e ACCESSTOKEN --privileged -v `pwd`:/score-client/manifest overture/score bin/score-client --profile collab"
# Mount the file system in the container
docker-score-client mount --mount-point /mnt --manifest <manifest_id or manifest/manifest_file> 

Note that the --privileged Docker option is required for FUSE in order to access the host's /dev/fuse device.

In another terminal, you can access the newly mounted file system:

# List all files recursively
docker exec -it $(docker ps -lq) find /mnt

To perform analysis within the container:

# Open a shell in the previously created container
docker exec -it $(docker ps -lq) bash

# Install samtools
apt-get install samtools

# Slice
samtools view /mnt/fff75930-0f8c-4c99-9b48-732e7ed4c625/443a7a6ab964e41c011cc9a303bc086c.bam 1:10000-20000

Due to a limitation of Docker it is not possible to access a FUSE mounted file system from the host operating system. Please see here for more details.

FAQs

Where can I find the Bundle ID associated with an Object ID?

Currently the only way to retrieve the Bundle ID of an Object ID is by viewing the file entity page in the Data Portal. Navigate to the Data Repository browser and enter the Object ID in the "File" filter and click on the resulting record.

Where are detailed Score Client logs stored?

The Score Client log file is stored at logs/client.log

How long will pre-signed URLs remain valid?

Pre-signed URLs are valid for 1 day from the time they are issued. For security purposes, a URL issued to one user must not be used by another and must be kept private.

Does the client maintain state?

Yes. For downloads, the client maintains state in the working directory in a hidden file, .<Object ID>/meta, which includes cached pre-signed URLs. If your downloads fail unexpectedly, try deleting the hidden .<Object ID> directory to purge pre-signed URLs that may have expired. Also, when using the mount command with the --cache-metadata option, .entities.cache and .objects.cache are stored in the current working directory.
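Purging this state can be scripted when many objects are involved. The following is a cautious sketch, assuming each state directory is named with a dot followed by a lowercase UUID and contains a meta entry, as described above. It deletes files, so review it before use; purge_download_state is a hypothetical helper.

```shell
# purge_download_state DIR
# Removes hidden ".<Object ID>" state directories (UUID-named) directly under DIR.
purge_download_state() (
  cd "$1" || exit 1
  hex='[0-9a-f]'
  h4=$hex$hex$hex$hex
  # The unquoted expansion below is a glob for .xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
  for state_dir in .$h4$h4-$h4-$h4-$h4-$h4$h4$h4; do
    # Skip the unexpanded pattern and anything without a meta entry.
    [ -e "$state_dir/meta" ] || continue
    rm -rf -- "$state_dir"
  done
)
```

Other hidden files and the downloaded data themselves are left untouched, since only names matching the UUID pattern are considered.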

Why do I get a security exception when I try to download an object?

If you are targeting the AWS cloud, ensure that you are running within the us-east-1 region.

I can’t use the result of a url command with samtools:

samtools doesn’t support the HTTPS protocol*, which is required by ICGC to access S3-stored data files. Use the client view command to pipe data to samtools, download the desired files locally, or use the mount command to create a FUSE mount of the ICGC data files.

* Update: As of commit fe1f08a samtools now supports file access over HTTPS and Amazon S3.

Why is my 'Total bytes read' count different from my 'Total bytes written'?

./bin/score-client --profile collab download --object-id 6d89e978-34f6-5074-b30e-01b7203fcbb3 --output-dir /tmp
Downloading...
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
100% [##################################################]  Parts: 208/208, Checksum: 100%, Write/sec: 47.7M/s, Read/sec: 48.1M/s
Finalizing...
Total execution time:         1.301 h

Total bytes read    : 225,224,593,891
Total bytes written : 223,331,450,672

Because of the size of BAM files, ICGC upload/downloads tend to be long-running, making them susceptible to any of the myriad ways a network can fail. ICGC attempts to recover from these usually-brief outages automatically and this often necessitates repeat downloads of sub-parts of the file. This will result in a "Total bytes read" amount larger than the "Total bytes written". The total byte counts are informational only and not used to determine "correctness" or "completeness" of any given download.
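To put the two totals in perspective, the difference between them is the number of bytes that were fetched more than once. A small sketch of that arithmetic follows; the function name reread_bytes is hypothetical.

```shell
# reread_bytes READ WRITTEN
# Prints how many bytes were read more than once.
# Thousands separators in the arguments are removed before subtracting.
reread_bytes() (
  read_bytes=$(printf '%s' "$1" | tr -d ',')
  written_bytes=$(printf '%s' "$2" | tr -d ',')
  echo $(( read_bytes - written_bytes ))
)
```

For the run shown above, reread_bytes 225,224,593,891 223,331,450,672 gives 1893143219, that is, about 1.9 GB of repeated reads, or less than 1% of the file.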

How do I report a bug in the software?

Please contact the DCC and include the version of the software in the body of the message (bin/score-client --version).

Terms

Related terms and their definitions are given below:

Term: Meaning
Access Token: An authorization mechanism created by the Data Portal to access data.
Bundle ID: An identifier that refers to a submission bundle of related files. Typically the files produced by analysis workflows are packaged as a single unit. However, when a bundle is imported into a cloud repository each file in the bundle is given its own Object ID.
Compute Instance: A user virtual machine operating in a cloud environment.
DACO: The ICGC Data Access Compliance Office which handles requests from researchers for access to controlled data from the ICGC.
DACO Cloud Access: Access to cloud-hosted controlled data. DACO approval means automatic Cloud Access.
DCC: The ICGC Data Coordination Center (DCC) performs quality assessment, curation and data releases and also manages the data flow from projects and centers to the central ICGC database and public repositories.
Data Portal: The ICGC data portal located at https://dcc.icgc.org.
FUSE: Filesystem in Userspace is an operating system mechanism for Unix-like computer operating systems that lets non-privileged users create their own file systems without editing kernel code.
Manifest: A file used as input to the Score Client to describe and identify files to be downloaded.
Object ID: The unique identifier of an object expressed as a UUID. In the command line interface this is referred to as --object-id.
OpenStack: OpenStack is a cloud operating system that controls large pools of compute, storage, and networking resources throughout a datacenter, all managed through a dashboard that gives administrators control while empowering their users to provision resources through a web interface.
Scope: A user permission or authorization to access a resource.
Score Client: Software provided by ICGC required to download data from AWS S3.
S3: Amazon Simple Storage Service, the physical store of the ICGC AWS data.
Token Manager: Section of the portal used to manage Access Tokens.