icgc-get User Guide

Overview

ICGC data resides in many data repositories and compute clouds around the world. These data repositories each have their own environment (public cloud, private cloud, on-premise file systems, etc.), access controls (DACO, OAuth, asymmetric keys, IP filtering), data download clients and configuration mechanisms. Thus, there is much for a user to learn and perform before actually acquiring the data. This is compounded by the fact that the number of environments is increasing over time and their characteristics change frequently. A coordinated mechanism to bootstrap and streamline the data access process is highly desirable. This is the problem the icgc-get tool helps to solve.

Please note: Prerequisites for accessing the different repositories, as described in the Cloud User Guide, still apply. Most notably, access to Collaboratory and AWS-Virginia requires icgc-get to be run from within the corresponding cloud environment.

As depicted in the figure below, downloading data with icgc-get is an easy three-step process:

  • Select data of interest using the ICGC Data Repository browser in the ICGC Data Portal. These files can belong to multiple repositories.
  • Obtain an icgc-get manifest ID for this dataset by clicking on the icgc-get button above the file table.
  • On the computer where you want to download the data, run the icgc-get download command with your manifest ID as the parameter.

icgc-get will then download the requested data files from the repositories you have access to, in a configurable order of preference. As simple as that! More details and advanced features are described in the sections below.

Quickstart

To quickly get started with icgc-get:

  1. Download and install the client
  2. Run the icgc-get configure command to set up your environment
  3. Run the icgc-get check command to ensure your credentials are correct
  4. Generate a Manifest ID via the Repository Browser
  5. Run the icgc-get report command to inspect your request before downloading
  6. Run icgc-get download -m <manifest-id> to download files in your manifest
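
The command-line steps above can be strung together as a short shell session. This is only a sketch: the manifest ID is a placeholder, the run helper is hypothetical and merely prints each command so that the sequence can be reviewed before anything is executed, and passing -m to report is an assumption based on report taking the same primary inputs as download.

```shell
#!/bin/sh
# Dry-run sketch of the quickstart sequence (steps 2, 3, 5 and 6).
# MANIFEST_ID is a placeholder; replace it with the ID generated by the portal.
MANIFEST_ID="${MANIFEST_ID:-<manifest-id>}"

# Hypothetical helper: prints the command instead of executing it.
# Change the body to "$@" to actually run each step.
run() {
    printf '+ %s\n' "$*"
}

quickstart() {
    # Step 2: set up the environment interactively.
    run ./icgc-get configure
    # Step 3: verify credentials (PDC, GNOS and GDC checks also need IDs).
    run ./icgc-get check
    # Step 5: inspect the request (assumes report accepts -m like download).
    run ./icgc-get report -m "$MANIFEST_ID"
    # Step 6: download the files in the manifest.
    run ./icgc-get download -m "$MANIFEST_ID"
}

quickstart
```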

Prerequisites

Operating System

icgc-get is supported on MacOS and Linux environments. Windows is currently unsupported.

Dependencies

Docker is required to use the pre-packaged download clients of the various repositories. To install Docker, please see the installation guide.

If not using Docker, it is expected that the user will have installed each of the clients required to access repositories of interest.

Please see the Architecture section for more information on how the dependencies are used.

Installation

The distribution may be downloaded from the binaries page.

To install the latest version on Mac or Linux, issue the following in a terminal:

curl https://dcc.icgc.org/api/v1/ui/software/icgc-get/linux/latest -o icgc-get.latest.zip -L
unzip icgc-get.latest.zip

This will extract the icgc-get executable in the current directory.

Configuration

After installing icgc-get, you may want to configure some of the essential usage parameters, such as your access credentials, usage mode and output directory. The simplest way to do this is to invoke the icgc-get configure command and follow the prompts. This will keep the operation of the tool simpler in the future.

To specify which config file to use, either pass an absolute path to the config file on the command line with --config, or declare an environment variable ICGCGET_CONFIG that contains the absolute path. If neither of these options is chosen, the tool will look for .icgc-get/config.yaml in your home directory.
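
The lookup order described above can be sketched as a small shell function. The function name resolve_config is hypothetical and not part of icgc-get, and the guide does not state which source wins when both --config and ICGCGET_CONFIG are set; the sketch assumes the command line takes priority.

```shell
#!/bin/sh
# Sketch of the documented lookup order for the configuration file:
#   1. the path given with --config
#   2. the ICGCGET_CONFIG environment variable
#   3. .icgc-get/config.yaml in the home directory
# resolve_config is a hypothetical helper, not part of icgc-get.
# Assumption: --config wins when both it and ICGCGET_CONFIG are set.
resolve_config() {
    cli_path="$1"   # value passed to --config, or empty if the option was not used
    if [ -n "$cli_path" ]; then
        printf '%s\n' "$cli_path"
    elif [ -n "${ICGCGET_CONFIG:-}" ]; then
        printf '%s\n' "$ICGCGET_CONFIG"
    else
        printf '%s\n' "$HOME/.icgc-get/config.yaml"
    fi
}

# Prints the path that would be used when --config is absent.
resolve_config ""
```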

Should you wish to get started right away, it is possible to run the tool without making a configuration file at all. Most of the parameters have default options if no config file is loaded. The only exceptions are --output, --repos, access credentials and, if you are not using Docker, tool paths. These options can be passed via the appropriate command line options.

Configuration Overrides

In addition to using the configure command, most configuration options can be overridden through the command line, by assigning environment variables, or by directly editing the config.yaml file. Configuration options have the same name regardless of how they are supplied, but each input method has its own syntax. Environment variables are in all caps, use underscores as separators, and are prefixed by ICGCGET_. Command line options use dashes as separators and are prefixed by two dashes. Config file options use a colon, followed by a newline and two spaces, as separators.
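
As an illustration of the naming convention, the following sketch derives the environment-variable spelling of an option from its command-line spelling. The helper to_env_name is hypothetical; it only restates the rules given above and does not query icgc-get for the options it actually supports.

```shell
#!/bin/sh
# Hypothetical helper: derive the environment-variable spelling of an option
# from its command-line spelling, following the convention described above
# (upper case, underscores as separators, ICGCGET_ prefix).
to_env_name() {
    name="${1#--}"    # drop the leading two dashes, if present
    upper=$(printf '%s' "$name" | tr '[:lower:]' '[:upper:]' | sed 's/-/_/g')
    printf 'ICGCGET_%s\n' "$upper"
}

to_env_name --logfile     # ICGCGET_LOGFILE
to_env_name --repo-path   # ICGCGET_REPO_PATH
```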

icgc-get creates a logfile of its operations and collects the logfiles generated by some clients. The location and name of this logfile can be set with the --logfile option, the logfile: configuration parameter or the ICGCGET_LOGFILE environment variable. When downloading files from Collaboratory, the ICGC Storage Server, GNOS repositories or the GDC, additional client logfiles will be saved in the same directory as the specified logfile.

If you wish to use a different version of the Docker container, this can be controlled via the ICGCGET_CONTAINER_TAG environment variable or the corresponding configuration option. This is not recommended, as non-default container versions may not be compatible with your installation of icgc-get.

If you are running icgc-get locally, you must specify the directory in which downloaded files are saved with the --output argument.

Repository Precedence

It is also recommended to specify a common list of repositories in your preferred order of precedence. When downloading a file, the tool will first try to find the file on the first specified repository, then the second, etc.

Please use the following format to define your repositories in the configuration file:

repos:
 - collaboratory
 - pcawg-chicago-icgc
 - pdc
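
The precedence rule can be sketched as follows: walk the preferred repositories in order and pick the first one that hosts the file. The helper pick_repo is hypothetical and only illustrates the rule; it is not icgc-get's own selection code.

```shell
#!/bin/sh
# Sketch of the precedence rule: walk the preferred repositories in order
# and pick the first one that hosts the file. pick_repo is a hypothetical
# helper; icgc-get's real selection logic is not shown in this guide.
#   $1: space-separated repository codes in order of preference
#   $2: space-separated repository codes that host the file
pick_repo() {
    for preferred in $1; do
        for hosting in $2; do
            if [ "$preferred" = "$hosting" ]; then
                printf '%s\n' "$preferred"
                return 0
            fi
        done
    done
    return 1    # the file is not available on any configured repository
}

# With the repos list above, a file hosted on pdc and pcawg-chicago-icgc
# is fetched from pcawg-chicago-icgc, because it is listed before pdc.
pick_repo "collaboratory pcawg-chicago-icgc pdc" "pdc pcawg-chicago-icgc"
```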

Valid repositories are:

Code                  Repository
aws-virginia          Amazon Web Services
collaboratory         Collaboratory
ega                   European Genome-phenome Archive
gdc                   Genomic Data Commons
pcawg-chicago-icgc    Pan-Cancer Chicago repository
pcawg-chicago-tcga    Pan-Cancer Chicago repository TCGA data
pcawg-cghub           Pan-Cancer Santa-Cruz repository
pcawg-heidelberg      Pan-Cancer Heidelberg repository
pcawg-london          Pan-Cancer London repository
pcawg-tokyo           Pan-Cancer Tokyo repository
pcawg-seoul           Pan-Cancer Seoul repository
pcawg-barcelona       Pan-Cancer Barcelona repository
pdc                   Bionimbus Protected Data Cloud

All clients require an absolute path to your local client installation, set either as the ICGCGET_REPO_PATH environment variable or under --repo-path in the config file, unless they are being run through Docker. All clients support configuring the number of data streams used when downloading, under --repo-transport-parallel or REPO_TRANSPORT_PARALLEL. Most clients can be made to download using the UDT protocol with the --repo-udt config option.

Using the Portal generated ManifestId

In addition to manually specifying repository precedence via the command line, the precedence can also be specified using a manifestId. When selecting files using the Portal, a manifestId can be generated with a user-defined precedence. When file copies reside in more than one repository, this feature can be used to download a unique file copy from the first available repository (i.e. no duplicate file downloads). A typical use case is when file copies exist in different geographical locations, and the user orders the repositories by increasing distance from their own location in order to optimize download speeds.

The files can be downloaded using the following command: ./icgc-get download -m <manifestId>. If one or more repositories defined by the manifestId are missing from the current configuration, an error message will be returned containing a list of the missing repositories.

Repository Credentials

Collaboratory and AWS Credentials

These repositories are both accessed through the ICGC Storage Client, and share their configuration parameters under the icgc namespace. For both of these repositories, provide the UUID of your ICGC access token to the --icgc-access parameter. You may also specify the transport file from protocol under --icgc-transport-file.from.

EGA Credentials

EGA access should be provided as your EGA username to --ega-username and your EGA password to --ega-password. Note that reliability issues have been observed when the transport parallel setting of the EGA client is increased beyond 1.

GDC Credentials

GDC access should be provided as the full GDC access token or a path to the token file to --gdc-token. Though there are unsecured files present on the GDC data repository, for simplicity a GDC access token is required for all downloads from the GDC.

GNOS Credentials

GNOS access should be provided as a key to --gnos-key-repo where repo is the repository code for the GNOS repository you need to access.

PDC Credentials

PDC access should be provided as a key to --pdc-key and a secret key to --pdc-secret.

Commands

All commands except configure share the --config, --logfile, --verbose/-v and --docker options.

Universal Option    Description
--config            Path to configuration file
--logfile           Path to log file
-v, --verbose       Flag that increases tool verbosity
-d, --docker        Option controlling the hosting of clients in docker

Configure Command

This command will start a series of prompts for you to enter application paths, access credentials, output directories and logfile locations. Any of these prompts can be bypassed by immediately pressing the enter key if the parameter is not relevant for your planned use of icgc-get. By default, configure will write to the default config file, but the destination can be changed with the -c option. Should there be an existing configuration file at the target destination, existing configuration values can be kept by pressing enter in response to the prompt. Please note that some passwords and secret keys will not be shown on the command prompt for security reasons, but they can still be entered, or kept at their current value by pressing enter.

Option          Description
-c, --config    Destination for new or existing config file

Check Command

This command will test the provided credentials for each repository specified.

Due to the security protocols of each client, there are two ways in which this access check can occur. For PDC, GNOS and GDC, icgc-get is only capable of determining if you have access to the specific files targeted for download, not the state of your permissions for the repository as a whole. When performing an access check for these repositories, you must provide a manifest ID or list of files using the same formatting as the download command. For more detailed information about your permissions on these repositories, contact their respective support departments.

For the AWS, Collaboratory, and EGA repositories, the access check will determine if you have access to the entire repository or not. These checks will occur even if file prioritization leads to no files being downloaded from any of these repositories.

Option             Description
IDS                Specify FI ids to check access to. Only required for PDC and GDC
-r, --repos        Repeatable option used to specify repositories to download from
-o, --override     Flag used to override warning messages
--no-ssl-verify    Flag used to disable ssl verification. Not recommended

To run an access check on the same files used in the report examples below:

./icgc-get check FI99996 FI99990 FI250134 -r collaboratory -r gdc

Sample output:

Valid access to the Collaboratory.
Valid access to the GDC files.

Report Command

Another useful subcommand is report. This takes the same primary inputs as download, but instead of downloading the specified files, it will provide a list of all files that are about to be downloaded, including their size, data type, name and the repository they are hosted on.

Option                Description
-f, --table-format    Controls output format. Valid options are json and tsv
-t, --data-type       Controls output type. Valid argument is summary
-r, --repos           Repeatable option used to specify repositories to download from
-o, --override        Flag used to override warning messages
--no-ssl-verify       Flag used to disable ssl verification. Not recommended

By default the command outputs a table, but the output can be switched to json via -f json or to tsv via -f tsv. Should you find file-by-file output too granular for a particularly large download, the option -t summary can be used to switch to a summarized version of the table. If an output directory is specified, then the command will search that directory to determine if any of the files are already present, and add a downloaded column that marks these files.

Example invocation of the report command:

./icgc-get report FI99996 FI99990 FI250134 -r collaboratory -r gdc

Example invocation of the report command with the summary option:

./icgc-get report FI99996 FI99990 FI250134 -r collaboratory -r gdc -t summary

Sample output (the file-by-file table, followed by the summary table):

╒══════════╤════════╤════════╤═══════════════╤═══════════════╤═══════════════╕
│          │   Size │ Unit   │ File Format   │ Data Type     │ Repo          │
╞══════════╪════════╪════════╪═══════════════╪═══════════════╪═══════════════╡
│ FI99996  │   3.52 │ GB     │ BAM           │ Aligned Reads │ gdc           │
├──────────┼────────┼────────┼───────────────┼───────────────┼───────────────┤
│ FI99990  │  435.7 │ MB     │ BAM           │ Aligned Reads │ gdc           │
├──────────┼────────┼────────┼───────────────┼───────────────┼───────────────┤
│ FI250134 │ 197.44 │ KB     │ VCF           │ StGV          │ collaboratory │
╘══════════╧════════╧════════╧═══════════════╧═══════════════╧═══════════════╛
╒══════════════════════╤════════╤════════╤══════════════╤═══════════════╕
│                      │   Size │ Unit   │   File Count │   Donor_Count │
╞══════════════════════╪════════╪════════╪══════════════╪═══════════════╡
│ collaboratory        │ 197.44 │ KB     │            1 │             1 │
├──────────────────────┼────────┼────────┼──────────────┼───────────────┤
│ collaboratory: StGV  │ 197.44 │ KB     │            1 │             1 │
├──────────────────────┼────────┼────────┼──────────────┼───────────────┤
│ gdc                  │   3.94 │ GB     │            2 │             2 │
├──────────────────────┼────────┼────────┼──────────────┼───────────────┤
│ gdc: Aligned Reads   │   3.94 │ GB     │            2 │             2 │
├──────────────────────┼────────┼────────┼──────────────┼───────────────┤
│ Total                │   3.94 │ GB     │            3 │             3 │
├──────────────────────┼────────┼────────┼──────────────┼───────────────┤
│ Total: Aligned Reads │   3.94 │ GB     │            2 │             2 │
├──────────────────────┼────────┼────────┼──────────────┼───────────────┤
│ Total: StGV          │ 197.44 │ KB     │            1 │             1 │
╘══════════════════════╧════════╧════════╧══════════════╧═══════════════╛

Download Command

The two syntaxes for performing a download using icgc-get are:

./icgc-get --config [CONFIG] --docker [true|false] download <file-ids> [REPO] [OPTIONS]
./icgc-get --config [CONFIG] --docker [true|false] download -m <manifest-id> [REPO] [OPTIONS]

Where <file-ids> is a space separated list of ICGC file IDs, identifiable by their FI prefix (e.g. FI1234).

Option             Description
-m, --manifest     Flag used to specify that a manifest id has been passed
-r, --repos        Repeatable option used to specify repositories to download from
-o, --override     Flag used to override warning messages
--no-ssl-verify    Flag used to disable ssl verification. Not recommended

The first required argument is the set of ICGC file IDs, or the ICGC manifest ID, corresponding to the file or files you wish to download. This should be either one or more FI IDs (FI followed by a number) or a manifest UUID. If you are passing a manifest ID, add the -m or --manifest option. These IDs may be retrieved from the ICGC Data Portal through the icgc-get button on the Data Repositories page. icgc-get is not capable of parsing client manifest files on the local machine. For more information on repository precedence using the manifestId, refer to Using the Portal generated ManifestId.
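
To make the two identifier forms concrete, the following sketch classifies an argument as a file ID or a manifest ID before it is handed to icgc-get. The helper classify_id is hypothetical, and the pattern used for manifest IDs assumes the standard 8-4-4-4-12 hexadecimal UUID text layout, which this guide does not spell out.

```shell
#!/bin/sh
# Hypothetical helper: classify an identifier as an ICGC file ID ("FI"
# followed by digits) or a manifest UUID. The 8-4-4-4-12 hexadecimal layout
# is the standard UUID text form and is assumed here for manifest IDs.
classify_id() {
    if printf '%s\n' "$1" | grep -Eq '^FI[0-9]+$'; then
        echo "file-id"
    elif printf '%s\n' "$1" | grep -Eiq '^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$'; then
        echo "manifest-id"
    else
        echo "unknown"
        return 1
    fi
}

classify_id FI378424    # file-id
```

A wrapper script could use the result to decide whether to add -m to the download command.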

Using this command also requires you to specify the repository or repositories that are being targeted for download and the output directory, provided they have not been added to the config file.

Prepend each repository with -r, for example -r aws-virginia -r ega. The order that the repositories are listed is important: files will be downloaded from the first specified repository if possible, and subsequent repositories only if the file was not found on any previous repository.

The download command comes with an automatic prompt that warns the user if the projected download size approaches the total available space in the download directory. It is possible to suppress this warning using the -o flag.
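
A similar check can be scripted ahead of time, for example before starting an unattended download with -o. The sketch below is not icgc-get's own logic: has_enough_space is a hypothetical helper, the projected size has to be supplied by the caller (for instance from the report command), and the comparison simply fails when the download would not fit, since the threshold icgc-get uses is not documented here.

```shell
#!/bin/sh
# Hypothetical pre-flight check, similar in spirit to the built-in warning:
# compare a projected download size with the free space in a directory.
#   $1: output directory
#   $2: projected download size in kilobytes
has_enough_space() {
    # df -Pk prints POSIX-format output in 1024-byte blocks;
    # column 4 of the second line is the available space.
    available_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$2" -le "$available_kb" ]
}

if has_enough_space . 1; then
    echo "enough space"
else
    echo "not enough space" >&2
fi
```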

Sample invocation of the download command:

./icgc-get download FI378424 -r collaboratory

Version Command

This is an informative command that displays the version of all clients used by icgc-get. It checks the version of each client whose tool path is specified in the config file provided.

./icgc-get version

Sample output:

ICGC-Get Version: 0.2.8
Clients:
 AWS CLI Version:             1.10.34
 EGA Client Version:          2.2.2
 GDC Client Version:          0.7
 ICGC Storage Client Version: 1.0.13

Architecture

This section describes the inner workings of icgc-get. Understanding this section is not required for operation of the tool. However, it may be useful for context and those curious about implementation details.

General Operation

In both modes of operation, icgc-get must be passed a manifest or file id that has been received from the ICGC Data Portal. This identifier is used by icgc-get to query more in-depth file metadata from the ICGC API. This takes two calls to the API, and the gathered data is used to identify the client to call and how to execute the call.

In the default mode of operation, all of the download clients are installed in the user's file system. The user is required to tell icgc-get where to find the installed clients. This enables icgc-get to directly communicate with the download clients, and for the clients to directly place their output into the local filesystem.

Operation using Docker

Alternatively, when the client is run using Docker, the download clients are no longer directly accessible, as they are in a separate Linux container. To communicate with the download clients, icgc-get needs to use the Docker daemon as an intermediary. Similarly, when the clients finish downloading the files, they are unable to place them directly into the local filesystem because of the isolated nature of the Linux container. They instead place them in a special mounted directory, which is shared between the Linux container and the local filesystem. icgc-get monitors the Docker daemon for a signal that the download client has finished working and, upon that signal, moves all of the files in the mounted directory to their proper place in the local filesystem.
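
The hand-off through the shared directory can be illustrated without Docker. In the sketch below, a stand-in function plays the role of the containerised download client; all names (fake_client, staging_dir, output_dir, example.bam) are hypothetical, and the real tool waits on the Docker daemon rather than on a function call.

```shell
#!/bin/sh
# Sketch of the hand-off described above, with the container replaced by a
# stand-in function. All names here are hypothetical.
staging_dir=$(mktemp -d)    # plays the role of the shared mounted directory
output_dir=$(mktemp -d)     # plays the role of the final output directory

# Stand-in for a containerised download client: it can only write into
# the shared directory, never into the output directory itself.
fake_client() {
    printf 'data\n' > "$1/example.bam"
}

# The client runs to completion first (icgc-get waits for the daemon's signal).
fake_client "$staging_dir"

# Only then are the results moved to their proper place.
mv "$staging_dir"/* "$output_dir"/
rmdir "$staging_dir"

ls "$output_dir"
```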