This user guide describes the steps to securely explore and analyze ICGC data stored in Amazon (AWS) or Collaboratory (OpenStack) cloud environments. For more information about ICGC cloud initiatives, please see ICGC in the Cloud.
Please see Terms for a glossary of terms used in this guide.
The figure below illustrates the overall process and systems involved:
- Authorization: Apply for DACO Cloud Access if not already approved. Upon approval, log in to the Data Portal and generate an Access Token for cloud download.
- Compute Prerequisites: Provision a Compute Instance in the target cloud.
- Installation: Download and install the ICGC Storage Client.
- Configuration: Configure the Storage Client to use the generated Access Token.
- File Search: Identify files of interest using the Data Portal.
- Storage Client Usage: Download or view data with the provided Storage Client or via an external tool.
The subsequent sections will provide additional details on each of these topics.
The usage of the distributed Storage Client is required to provide additional security while operating in participating cloud environments and to enhance user download speeds.
Security is enforced by a coordinating ICGC storage server. The client communicates with the server, which brokers downloads and converts ICGC Access Tokens into Amazon pre-signed URLs. Once a successful authorization handshake between the client and the server is established, downloads are transferred directly from S3 to the client, maintaining fast access to the data.
It is important to note that the provided software only functions within the us-east-1 EC2 region of AWS located in Northern Virginia, U.S., where the data is physically stored. Attempts to use the software outside of this region will be denied data access.
Similarly, Collaboratory downloads only function inside Collaboratory's OpenStack environment.
Lastly, it is the user’s responsibility to protect the data after it has been obtained. This includes any subsequent analyses and storage on cloud and downstream resources.
There are two prerequisites to using the Storage Client: DACO Cloud Access status and a self-provisioned Access Token.
DACO Cloud Access
DACO Cloud Access is a prerequisite to using the Storage Client. To apply for DACO access, please follow the instructions provided at https://icgc.org/daco. Once approved, you will be able to log in to the Data Portal to generate an Access Token. To log in, click on the “Login” link in the upper right-hand corner of the page. When prompted, choose to log in with either your ICGC.org login or one of the supported OpenID providers (e.g. Google). After successful authentication, you will know that you have Cloud Access to the controlled tier if the “Login” link is replaced with a green cloud icon:
The Access Token model used for protecting the cloud data set follows a similar process to GitHub’s personal access tokens. Tokens are used instead of a username and password to securely access ICGC resources.
Related to Access Tokens is the concept of Scopes. Tokens allow you to associate Scopes which limit access to that needed for the target environment. This enhances security by following the Principle of Least Privilege. Cloud specific Scopes will become available after acquiring DACO Cloud Access. An instance of a cloud download token will grant access to all of the available data in that environment.
To acquire an Access Token, you must first obtain DACO Cloud Access and log in to the Data Portal. After a successful login, there will be a Token Manager link in the upper right corner of the page. Clicking on this link will display the Token Manager dialog:
From this dialog, you can manage the Access Tokens associated with your account. Importantly, you may delete and regenerate an access token if you believe that it has been compromised.
When creating an Access Token, you will need to specify the Scope associated with the target cloud(s).
In the case of the ICGC AWS data set, an access token with the aws.download scope is required to access the controlled access data.
In the case of the ICGC Collaboratory data set, an access token with the collab.download scope is required to access the controlled access data.
You can verify that your Access Token has the desired scopes by inspecting it in the table at the bottom of the dialog. For security purposes, Access Tokens must remain private and not be shared with anyone.
Following the creation of a Compute Instance, discussed in the next section, you will need to edit the Storage Client configuration file to include the generated Access Token. See the Configuration section for additional information.
As a first step in analyzing data, you will need to create a Compute Instance to run the Storage Client and any other supporting software.
In order to run within EC2, you will need your own AWS account to provision a running EC2 instance. Any data processing will be charged to this account. Note that ICGC data download from S3 to the same EC2 region is free of charge. Please see Amazon's documentation for detailed instructions.
The following sections provide guidance on selecting and configuring the chosen instance type.
As data files are quite large, users should have enough local disk space to store files downloaded from the remote repository.
More processing cores will give greater parallelism and, therefore, better throughput of downloads.
By default, the Storage Client is configured to use a maximum of 3G of RAM. Most of the time this is more than sufficient.
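As a rough pre-flight check for the disk space guidance above, the sketch below tests whether the file system behind a download directory has a given amount of free space. The `has_space` helper and the 250 GB figure are illustrative assumptions and not part of the Storage Client; size the threshold to the files you actually plan to download.

```shell
#!/bin/sh
# has_space DIR REQUIRED_GB (hypothetical helper, not part of the Storage Client):
# succeeds when the file system holding DIR has at least REQUIRED_GB gigabytes free.
has_space() {
    # df -P prints one POSIX-format data line; field 4 is the available space in KB.
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    required_kb=$(( $2 * 1024 * 1024 ))
    [ "$avail_kb" -ge "$required_kb" ]
}

# Example: check the current directory against an assumed 250 GB requirement.
if has_space . 250; then
    echo "enough free space for the planned download"
else
    echo "not enough free space; attach a larger volume first"
fi
```

Running the check before a long download avoids discovering a full disk partway through a multi-hour transfer.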
The Storage Client has been designed to work on modern Mac and Linux distributions. Windows should work as well but remains untested.
The Storage Client requires Java 8 to be installed. It has been tested using the Oracle distribution. The procedure for installing Java 8 will vary depending on the operating system and package manager used.
In order to use the mount feature, FUSE is required. On most Linux based systems this will require installing
This section describes how to install the Storage Client. There are two options: (a) from a tarball and (b) from a Docker image hosted on Docker Hub.
Install from Tarball
To begin using the Storage Client, the first step is to download the distribution. The latest version can be downloaded from here.
wget -O icgc-storage-client.tar.gz https://dcc.icgc.org/api/v1/ui/software/icgc-storage-client/latest
tar -xvzf icgc-storage-client.tar.gz
After untarring the archive, the Storage Client will be available at bin/icgc-storage-client. Steps to verify the authenticity and integrity of the download can be found on our software page.
Install from Docker Image
We also support a Docker image of the Storage Client that is bundled with Java 8 for easy deployment.
The image is hosted at https://hub.docker.com/r/icgc/icgc-storage-client/ and can be downloaded by issuing the following command:
docker pull icgc/icgc-storage-client
Once pulled, you can open a shell in the container by executing:
docker run -it icgc/icgc-storage-client
There is no entry point or command defined for the image. The software is located at /icgc/icgc-storage-client, which is also the working directory of the container. All other steps for using the Storage Client are the same for both Docker and tarball installations.
The configuration of the Storage Client is stored in the conf/application.properties file of the distribution. The main configuration element is the access token generated in Access Token above. Edit conf/application.properties and set the access token entry to the generated access token.
When using Docker, the access token can also be set with an environment variable:
docker run -it -e ACCESSTOKEN=<access token> icgc/icgc-storage-client
In addition to the above, when targeting Collaboratory you will need to change the bin/icgc-storage-client script to set STORAGE_PROFILE=collab. This can also be performed externally via the environment variable of the same name. Note that it is also possible to override this per execution using the --profile collab argument.
Based on the target Compute Instance defined in Compute Prerequisites and transfer speed requirements, it may be necessary to make changes to how the Storage Client transfers data. This is achieved by setting the following:
- transport.parallel controls the number of concurrent threads for multi-part data transfers. It is recommended to set this to the number of cores of the Compute Instance.
- transport.memory is the amount of non-heap memory per thread, in gigabytes. It is recommended to set this to a value of 1 (1 GB). Be sure to leave enough memory for the operating system and any other software that may be running on the Compute Instance.
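The recommendations for transport.parallel and transport.memory can be applied with a small script. The following is a minimal sketch, assuming the settings live as `key=value` lines in a properties file; `set_property` is a hypothetical helper, and a temporary file stands in for conf/application.properties so the example can be run safely anywhere.

```shell
#!/bin/sh
# set_property FILE KEY VALUE (hypothetical helper): replace the existing
# "KEY=..." line in a properties file, or append one if it is missing.
# Note: the rewritten file takes the owner-only permissions of the mktemp file.
set_property() {
    file=$1 key=$2 value=$3
    [ -f "$file" ] || : > "$file"
    tmp=$(mktemp)
    # Copy every line that does not start with the literal text "KEY=".
    awk -v k="$key=" 'index($0, k) != 1' "$file" > "$tmp"
    printf '%s=%s\n' "$key" "$value" >> "$tmp"
    mv "$tmp" "$file"
}

# Stand-in for conf/application.properties so the sketch runs anywhere.
conf=$(mktemp)

# Number of cores on this machine (nproc where available, getconf otherwise).
cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)

# One transfer thread per core, and 1 GB of non-heap memory per thread.
set_property "$conf" transport.parallel "$cores"
set_property "$conf" transport.memory 1

cat "$conf"
```

Because the helper replaces an existing line instead of appending blindly, it can be rerun after resizing the Compute Instance without leaving duplicate entries.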
Finding files of interest can be done via the Data Portal. Objects are identified by their Object ID.
- Navigate to repository file search
- Click on the Collaboratory filter in the left-hand pane
- Filter based on properties of interest (e.g. donor id, specimen id, etc.)
- Export a Manifest for future use with the Storage Client
The Manifest is the main way to define what files should be downloaded by the Storage Client. However, knowing the Object ID is sufficient for a single file download. To generate a Manifest, click on the "Download manifests" link in the Data Repository browser. You will be prompted with a "Download manifests" dialog:
Manifests downloaded from the Data Portal can be transferred to the Storage Client instance by using SFTP or SCP. For convenience, it is also possible to use a Manifest ID saved on the Data Portal by clicking on the "Manifest ID" button. See the Storage Client Usage section for usage information.
Storage Client Usage
This section provides information on how to use the Storage Client once it has been properly downloaded and configured. It assumes the user possesses and has configured the requisite access token discussed previously.
The Storage Client has the general syntax:
icgc-storage-client [options] [command] [command options]
It offers a set of commands, where each command has its own set of options to influence its operation.
The Storage Client provides a --help option to list the available commands and a brief description of their supported options.
It is also possible to get information on a specific command using the help command:
bin/icgc-storage-client help download
The url command is the most basic command supported by the Storage Client. It allows one to resolve the underlying S3 URL for the requested object. This is useful if one wants to directly access the URL via HTTPS with an external client or tool. For example:
bin/icgc-storage-client url --object-id ddcdd044-adda-5f09-8849-27d6038f8ccd
An example of using the resolved URL with wget:
bin/icgc-storage-client url --object-id <Object ID>
Resolving URL for object: ddcdd044-adda-5f09-8849-27d6038f8ccd (offset = 0, length = -1)
https://s3-external-1.amazonaws.com/...[snip]
wget "https://s3-external-1.amazonaws.com/...[snip]"
You should always double-quote the URL that you pass to wget.
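The quoting matters because pre-signed URLs carry a query string containing `?` and `&`, which an unquoted shell command line would treat as special characters. The sketch below captures the URL into a variable and passes it on quoted. The `extract_url` helper and the sample query parameters are illustrative assumptions, and it also assumes the client prints the URL on its own line.

```shell
#!/bin/sh
# extract_url (hypothetical helper): read the url command's output on stdin
# and print the first line that starts with "https://".
extract_url() {
    sed -n '/^https:\/\//{p;q;}'
}

# Sample output with a made-up query string, shaped like a pre-signed URL.
sample='Resolving URL for object: ddcdd044-adda-5f09-8849-27d6038f8ccd (offset = 0, length = -1)
https://s3-external-1.amazonaws.com/example-object?Expires=1444232116&Signature=abc'

url=$(printf '%s\n' "$sample" | extract_url)

# Quoting keeps "?" and "&" literal, so the URL reaches the tool as one argument.
printf 'would fetch: %s\n' "$url"
```

With the real client, the same helper could sit at the end of a pipeline such as `bin/icgc-storage-client url --object-id <Object ID> | extract_url`, followed by `wget "$url"`.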
The download command allows fast parallel download of remote objects. It can be run in one of two modes: (a) single object mode and (b) Manifest driven mode.
Note that the Storage Client is able to resume an interrupted download session. Simply rerun the same command again and it will continue.
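Since rerunning the same command resumes an interrupted session, unattended downloads can be wrapped in a simple retry loop. This is a minimal sketch, assuming the client exits with a non-zero status when a download fails; `retry` and `RETRY_DELAY` are hypothetical names, and a stand-in command is used so the example runs without the client.

```shell
#!/bin/sh
# retry MAX_ATTEMPTS COMMAND [ARGS...] (hypothetical helper):
# rerun COMMAND until it exits with status 0 or MAX_ATTEMPTS is reached.
retry() {
    max=$1
    shift
    attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        # Pause between attempts; RETRY_DELAY (seconds) is an assumed knob.
        sleep "${RETRY_DELAY:-10}"
    done
}

# Demonstration with a stand-in command that fails twice and then succeeds.
counter=$(mktemp)
echo 0 > "$counter"
flaky() {
    n=$(( $(cat "$counter") + 1 ))
    echo "$n" > "$counter"
    [ "$n" -ge 3 ]
}

RETRY_DELAY=0
retry 5 flaky && echo "succeeded after $(cat "$counter") attempts"
```

In practice the stand-in would be replaced by the actual download command, for example `retry 5 bin/icgc-storage-client download --manifest manifest.txt --output-dir data`.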
This mode is useful when downloading an ad-hoc list of one or more objects with known Object IDs, perhaps acquired from the Data Portal:
bin/icgc-storage-client download --object-id ddcdd044-adda-5f09-8849-27d6038f8ccd --output-dir data
You can also specify multiple Object IDs separated by spaces:
bin/icgc-storage-client download --object-id ddcdd044-adda-5f09-8849-27d6038f8ccd 008da0c1-70cc-61ae-3bab-09aa17fad451 --output-dir data
Downloads will be stored in the directory given by --output-dir.
Using a Manifest is ideal for downloading multiple files identified through the Data Portal. The repository file search allows one to generate a Manifest file that can be supplied for bulk downloading files. It also provides some additional metadata for selected files that gives the donor, specimen and sample context.
bin/icgc-storage-client download --manifest manifest.txt --output-dir data
The --output-layout option can be used to organize the downloads into a couple of predefined directory layouts. See --help for additional information.
The manifest command allows a user to quickly view the contents of a download Manifest produced by the Data Portal. A Manifest can come from:
- The local file system
- A Manifest ID that is hosted on the Data Portal
- Any URL
A Manifest is a TSV file that contains both file identifying fields and satellite metadata for understanding the relationships to other data including donor, project and study.
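Because a Manifest is plain TSV, it can also be inspected with standard shell tools before it is handed to the client. The sketch below lists the header columns and counts the entries. It assumes the first row is a header, and the column names in the sample file are made up for illustration; real Manifests may use different ones.

```shell
#!/bin/sh
# list_columns FILE: print each header column name with its position.
list_columns() {
    head -n 1 "$1" | tr '\t' '\n' | awk '{ printf "%d\t%s\n", NR, $0 }'
}

# count_entries FILE: count the non-empty rows after the header.
count_entries() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Tiny made-up manifest; real column names may differ.
manifest=$(mktemp)
printf 'object_id\tfile_name\tfile_size\n' > "$manifest"
printf 'ddcdd044-adda-5f09-8849-27d6038f8ccd\texample.bam\t1024\n' >> "$manifest"

list_columns "$manifest"
count_entries "$manifest"
```

Knowing the column positions makes it straightforward to pull out a single field with `cut -f` or `awk` when scripting around a Manifest.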
Manifest from a Local File
An example of using a local file system Manifest:
bin/icgc-storage-client manifest --manifest manifest.aws-virginia.1444232116728.txt
Manifest from the Data Portal
An example of using a Data Portal hosted Manifest:
bin/icgc-storage-client manifest --manifest 49e91614-7811-11e5-8a58-34363bcf803c
Manifest from a URL
An example of using a URL hosted Manifest:
bin/icgc-storage-client manifest --manifest http://hastebin.com/raw/ujajodilih
The view command is a minimal version of samtools view. It allows one to request a “genomic slice” of the remote BAM file, freeing the user from having to download the entire file locally, saving bytes and time.
The following example will download reads overlapping the region 1 - 100000 in chromosome 1:
bin/icgc-storage-client view --object-id ddcdd044-adda-5f09-8849-27d6038f8ccd --query 1:1-100000
The BAI is automatically discovered and streamed as part of the operation.
For quickly accessing only the BAM header one can issue:
bin/icgc-storage-client view --header-only --object-id ddcdd044-adda-5f09-8849-27d6038f8ccd
It is also possible to pipe the output of the above to samtools, etc. for pipelining a workflow:
bin/icgc-storage-client view --stdout --object-id ddcdd044-adda-5f09-8849-27d6038f8ccd | samtools mpileup -
As of version 1.0.14, the client supports slicing across "batches" of specimens listed in a manifest file (tab-delimited files produced by the Data Portal as opposed to XML-format manifests from GNOS). The output of this feature is illustrated here:
Multiple query regions can be specified at the command line,
bin/icgc-storage-client view --manifest /data/manifest.txt --query 1:1245-1425 1:1578-1818 1:18100-19780 1:81011-81491 1:18100-19780 1:2772220-2772272 --output-dir /data/query-results
or in a BED file
bin/icgc-storage-client view --manifest /data/manifest.txt --bed-query /data/query/profile12.bed --output-dir /data/query-results
There is also a switch to have indexes generated for the output:
bin/icgc-storage-client view --manifest /data/manifest.txt --output-index --bed-query /data/query/profile12.bed --output-dir /data/query-results
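When moving between the two ways of specifying regions, note that BED files use 0-based, half-open intervals, whereas samtools-style region strings such as 1:1-100000 are 1-based and inclusive. The sketch below converts BED lines into region strings under the assumption that --query follows the samtools convention; `bed_to_regions` is a hypothetical helper and not part of the client.

```shell
#!/bin/sh
# bed_to_regions FILE (hypothetical helper): print one "chrom:start-end"
# region per BED line, shifting the 0-based BED start to a 1-based start.
bed_to_regions() {
    awk -F '\t' '
        /^#/ || /^track/ || /^browser/ { next }   # skip comments and BED header lines
        NF >= 3 { printf "%s:%d-%d\n", $1, $2 + 1, $3 }
    ' "$1"
}

# Small sample BED file with two intervals on chromosome 1.
bed=$(mktemp)
printf '1\t0\t100000\n1\t1244\t1425\n' > "$bed"

bed_to_regions "$bed"
```

For the sample file this yields 1:1-100000 and 1:1245-1425, which could then be passed as --query arguments.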
The mount command can be used to mount the remote S3 bucket as a read-only FUSE file system. This is very useful to browse and explore the available files, as well as quickly see their size and date of modification using common commands such as tree. It also works very well with standard analysis tools.
Files are organized into a virtual directory structure. The following shows the default layout:
/bundleId1/fileName1
/bundleId1/fileName2
...
/bundleId1/fileNamei
...
/bundleIdn/fileName1
/bundleIdn/fileName2
...
/bundleIdn/fileNamej
bundleId and fileName are the original Bundle ID and file name of the file, respectively. It is possible to control the layout using the --layout option. Using --layout object-id will instead produce a flat list of files named by their associated Object ID.
The file system implementation's performance is optimized for serial reads. Frequent random access patterns will lead to very poor performance. Under the covers, each random seek requires a new HTTP connection to S3 with the appropriate Range header set, which is an expensive operation. For this reason, it is only recommended for streaming analysis (e.g. samtools view-like functionality).
Mount All Files
To mount all available files locally, issue the following:
# Create the mount point
mkdir /mnt/icgc
# Mount
bin/icgc-storage-client mount --mount-point /mnt/icgc
To speed up subsequent mounts, one can add the --cache-metadata flag to the command above, which will locally store an index of the file system.
Once mounted, you can use standard analysis tools against files found under the mount point:
# Slice
samtools view /mnt/icgc/fff75930-0f8c-4c99-9b48-732e7ed4c625/443a7a6ab964e41c011cc9a303bc086c.bam 1:10000-20000
Mount Only Manifest Entries
To filter the mount to only include the files specified in a Manifest, issue the following:
# Create the mount point
mkdir /mnt/icgc
# Mount
bin/icgc-storage-client mount --mount-point /mnt/icgc --manifest <manifest>
See the manifest command for more details on how to specify a Manifest.
Mount in Docker
To avoid having to install the FUSE and Java dependencies when working with the mount command, it is very convenient to mount from within a Docker container. This is also useful for creating a custom image for analysis that derives from the one published by ICGC. First, ensure that Docker and the Storage Client image are installed. See the Installation section for details.
Next, export the access token generated from the portal:
# Export access token
export ACCESSTOKEN=<accessToken>
And then mount the file system inside the container against the empty /mnt directory:
# Alias for ease of use
alias icgc-storage-client="docker run -it --rm -e ACCESSTOKEN --privileged icgc/icgc-storage-client bin/icgc-storage-client"
# Mount the file system in the container
icgc-storage-client mount --mount-point /mnt
Note that the --privileged Docker option is required for FUSE to work inside the container.
In another terminal, you can access the newly mounted file system:
# List all files recursively
docker exec -it $(docker ps -lq) find /mnt
To perform analysis within the container:
# Open a shell in the previously created container
docker exec -it $(docker ps -lq) bash
# Install samtools
apt-get install samtools
# Slice
samtools view /mnt/fff75930-0f8c-4c99-9b48-732e7ed4c625/443a7a6ab964e41c011cc9a303bc086c.bam 1:10000-20000
Due to a limitation of Docker it is not possible to access a FUSE mounted file system from the host operating system. Please see here for more details.
Where can I find the Bundle ID associated with an Object ID?
Currently the only way to retrieve the Bundle ID of an Object ID is by viewing the file entity page in the Data Portal. Navigate to the Data Repository browser and enter the Object ID in the "File" filter and click on the resulting record.
Where are detailed Storage Client logs stored?
The Storage Client log file is stored at
How long will pre-signed URLs remain valid?
Pre-signed URLs are valid for 1 day from the time they are issued. For security purposes, a URL issued to one user must not be used by another and must be kept private.
Does the client maintain state?
Yes, the client maintains state for downloads in the working directory, in a hidden file .<Object ID>/meta. This file includes cached pre-signed URLs. If your downloads fail unexpectedly, try deleting this directory to purge pre-signed URLs that may have expired. Also, when using the mount command with the --cache-metadata flag, cache files such as .objects.cache are stored in the current working directory.
Why do I get a security exception when I try to download an object?
If you are targeting the AWS cloud, ensure that you are running within the us-east-1 region. If you are targeting Collaboratory, make sure you are inside the OpenStack environment.
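For the AWS case, the region can be confirmed from the instance itself before attempting a download. The sketch below checks an availability zone name against us-east-1. The `in_required_region` helper is hypothetical, and a fixed zone value is used so the example runs outside EC2; the commented metadata lookup is one common way to obtain the real value.

```shell
#!/bin/sh
# in_required_region AZ (hypothetical helper): succeed when the availability
# zone name, such as "us-east-1a", belongs to the us-east-1 region.
in_required_region() {
    case $1 in
        us-east-1[a-z]) return 0 ;;
        *) return 1 ;;
    esac
}

# On EC2 the zone could be read from the instance metadata service, e.g.
#   az=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
# (instances that enforce IMDSv2 also need a session token). A fixed value is
# used here so the sketch runs anywhere.
az="us-east-1a"

if in_required_region "$az"; then
    echo "$az is inside us-east-1"
else
    echo "$az is outside us-east-1; AWS downloads will be denied"
fi
```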
I can’t use the result of a url command with samtools.
samtools doesn’t support the HTTPS protocol*, which is required by ICGC to access S3-stored data files. Use the client view command to pipe data to samtools, download the desired files locally, or use the mount command to create a FUSE mount of the ICGC data files.
* Update: As of commit fe1f08a, samtools now supports file access over HTTPS and Amazon S3.
Why is my 'Total bytes read' count different from my 'Total bytes written'?
./icgc/bin/icgc-storage-client --profile collab download --object-id 6d89e978-34f6-5074-b30e-01b7203fcbb3 --output-dir /tmp
Downloading...
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
100% [##################################################] Parts: 208/208, Checksum: 100%, Write/sec: 47.7M/s, Read/sec: 48.1M/s
Finalizing...
Total execution time: 1.301 h
Total bytes read    : 225,224,593,891
Total bytes written : 223,331,450,672
Because of the size of BAM files, ICGC upload/downloads tend to be long-running, making them susceptible to any of the myriad ways a network can fail. ICGC attempts to recover from these usually-brief outages automatically and this often necessitates repeat downloads of sub-parts of the file. This will result in a "Total bytes read" amount larger than the "Total bytes written". The total byte counts are informational only and not used to determine "correctness" or "completeness" of any given download.
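To put the difference in perspective, the extra bytes read can be expressed as a percentage of the bytes written. The sketch below does this for the totals in the sample output above; `overhead_percent` is a hypothetical helper used only for this illustration.

```shell
#!/bin/sh
# Totals copied from the sample download output.
read_bytes='225,224,593,891'
written_bytes='223,331,450,672'

# overhead_percent READ WRITTEN (hypothetical helper): extra bytes read,
# expressed as a percentage of the bytes written.
overhead_percent() {
    # Strip the thousands separators before doing arithmetic.
    r=$(printf '%s' "$1" | tr -d ',')
    w=$(printf '%s' "$2" | tr -d ',')
    awk -v r="$r" -v w="$w" 'BEGIN { printf "%.2f\n", (r - w) / w * 100 }'
}

overhead_percent "$read_bytes" "$written_bytes"   # prints 0.85
```

In this example the repeated parts account for under one percent of the transfer, which is consistent with brief, automatically recovered outages.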
How do I report a bug in the software?
Please contact email@example.com and include the version of the software in the body of the message.
Related terms and their definitions are given below:
|Access Token||An authorization mechanism created by the Data Portal to access data.|
|Bundle ID||An identifier that refers to a submission bundle of related files. Typically the files produced by analysis workflows are packaged as a single unit. However, when a bundle is imported into a cloud repository each file in the bundle is given its own Object ID.|
|Compute Instance||A user virtual machine operating in a cloud environment.|
|DACO||The Data Access Compliance Office which handles requests from researchers for access to controlled data from the ICGC.|
|DACO Cloud Access||DACO access with supplemental approved Cloud Access status.|
|DCC||The ICGC Data Coordination Center (DCC) performs quality assessment, curation and data releases and also manages the data flow from projects and centers to the central ICGC database and public repositories.|
|Data Portal||The ICGC data portal located at https://dcc.icgc.org.|
|FUSE||Filesystem in Userspace is an operating system mechanism for Unix-like computer operating systems that lets non-privileged users create their own file systems without editing kernel code.|
|Manifest||A file used as input to the Storage Client to describe and identify files to be downloaded.|
|Object ID||The unique identifier of an object expressed as a UUID. In the command line interface this is referred to as --object-id.|
|OpenStack||OpenStack is a cloud operating system that controls large pools of compute, storage, and networking resources throughout a datacenter, all managed through a dashboard that gives administrators control while empowering their users to provision resources through a web interface.|
|Scope||A user permission or authorization to access a resource.|
|Storage Client||Software provided by ICGC required to download data from AWS S3.|
|S3||Amazon Simple Storage Service, the physical store of the ICGC AWS data.|
|Token Manager||Section of the portal used to manage Access Tokens.|