Submission File Format

File Structure

  • The submission data must be in tab-delimited format.
  • Each column corresponds to a data element defined in the DCC Data Element specification.
  • Column order and case must match the data elements in the DCC Data Element specification.
  • Extra columns are not allowed.
  • Required data elements cannot have null values.
  • Each mutation/variant is represented as one row (one mutation per row).

An example file is shown below (note that parts of the lines are omitted for readability):

analysis_id analyzed_sample_id mutation_type chromosome chromosome_start chromosome_end reference_genome_allele control_genotype mutated_from_allele mutated_to_allele tumour_genotype
m124 ssm_3396649 3 20 49510011 49510012 GA GA/GA GA - GA/-
m124 ssm_61023021 2 X 115303927 115303927 - -/- - T -/T
m124 ssm_175270973 4 15 39884779 39884787 ACTCAGACC ACTCAGACC/ACTCAGACC ACTCAGACC TTGT ACTCAGACC/TTGT
m124 ssm_175270973 1 15 39884792 39884792 C C/C C T C/T
m124 ssm_4545634 3 12 23454340 23454341 GA GA/GA GA - GA/-
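
The structural rules listed under File Structure can be checked mechanically before submitting. The following Python sketch is only an illustration, not the DCC validator: the function name check_structure is hypothetical, the expected and required column lists must come from the DCC Data Element specification, and it assumes that a "null" value means an empty field.

```python
import csv
import io


def check_structure(text, expected_columns, required_columns):
    """Return a list of problems found in a tab-delimited submission (empty if none)."""
    problems = []
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader, None)
    if header is None:
        return ["file is empty"]
    # Column names, case and order must match exactly; extra columns also fail here.
    if header != list(expected_columns):
        return ["header does not match the expected columns (names, case, order)"]
    for line_no, row in enumerate(reader, start=2):
        # Every row must have exactly one field per column.
        if len(row) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} fields, found {len(row)}"
            )
            continue
        # Assumption: an empty field counts as a null value.
        for name, value in zip(header, row):
            if name in required_columns and value.strip() == "":
                problems.append(f"line {line_no}: required column {name} is empty")
    return problems


if __name__ == "__main__":
    columns = ["analysis_id", "analyzed_sample_id", "mutation_type"]
    sample = "analysis_id\tanalyzed_sample_id\tmutation_type\nm124\tssm_3396649\t3\n"
    print(check_structure(sample, columns, set(columns)))
```

An empty list means that no structural problem was found; it does not mean that the values themselves are valid.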

ICGC DCC Data File Specification

ICGC DCC provides a data file specification for each data type, which details the format required to construct a valid submission file. You can view the current ICGC DCC Data Specification here.

Column             Description
Data Element ID    Name of the column that must be included in the submission file
Name               The descriptive name of the Data Element ID
Description        Definition of the Data Element ID
Data Type          The type required for the given Data Element ID (e.g. integer, text, controlled vocabulary)
CV Codes           Controlled vocabulary (if applicable to the Data Element ID)
Required?          Indicates whether the Data Element ID requires a value
N/A Code Valid     Indicates whether the Data Element ID accepts the reserved codes -777 or -888
Controlled Access  Indicates whether the Data Element ID is open or controlled access
Regexp             A Java regular expression indicating the required format
Examples           Examples of valid values
Notes              Additional notes describing requirements/restrictions and cross-field validation checks
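
To illustrate how the Required?, N/A Code Valid, CV Codes and Regexp columns might combine when a single value is checked, here is a minimal Python sketch. It is not the DCC validator: the function check_value, its parameters and the order of the checks are assumptions made for illustration. The specification uses Java regular expressions; Python's re module behaves the same way for simple patterns but is not identical, and matching the whole value (rather than a substring) is also an assumption.

```python
import re

# Reserved codes named in the specification table.
RESERVED_CODES = {"-777", "-888"}


def check_value(value, *, required, na_code_valid, regexp=None, cv_codes=None):
    """Return None if the value passes, otherwise a short reason string."""
    # Assumption: an empty string represents a missing value.
    if value == "":
        return "missing required value" if required else None
    # Reserved codes are only acceptable where the data element allows them.
    if value in RESERVED_CODES:
        return None if na_code_valid else "reserved code not accepted for this element"
    # Controlled vocabulary check, if the element defines one.
    if cv_codes is not None and value not in cv_codes:
        return "value not in controlled vocabulary"
    # Pattern check, if the element defines one (whole-value match assumed).
    if regexp is not None and re.fullmatch(regexp, value) is None:
        return "value does not match the required pattern"
    return None


if __name__ == "__main__":
    # The pattern and code list below are hypothetical, not taken from the dictionary.
    print(check_value("49510011", required=True, na_code_valid=False, regexp=r"[0-9]+"))
    print(check_value("5", required=True, na_code_valid=False, cv_codes={"1", "2", "3", "4"}))
```

Cross-field validation checks described in the Notes column are outside the scope of this sketch.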

Current Dictionary and Codelists

To view the current dictionary, go to the Dictionary Viewer. Green-highlighted rows, such as "donor_id", are considered identifier data fields (foreign keys) and must be unique for each row.

Alternatively, you can access the DCC Data Specification in JSON format via a REST web service. See the Submission API for details.

File Naming Conventions

Clinical/Experimental Files

Category    Data type    File type    File name    Description

Core Clinical Files

donor donor.txt[.gz|.bz2] Donor information
specimen specimen.txt[.gz|.bz2] Specimen information
sample sample.txt[.gz|.bz2] Analyzed sample information

Optional Clinical Files

surgery surgery.txt[.gz|.bz2] Donor surgery information
exposure exposure.txt[.gz|.bz2] Donor environmental exposure
family family.txt[.gz|.bz2] Donor family history
biomarker biomarker.txt[.gz|.bz2] Donor biomarkers
therapy therapy.txt[.gz|.bz2] Donor therapy

Experimental Files

ssm      metadata          ssm_m.txt[.gz|.bz2]      Simple somatic mutations including single base substitutions and indels of ≤200 bp
         primary           ssm_p.txt[.gz|.bz2]
sgv      metadata          sgv_m.txt[.gz|.bz2]      Simple germline variations including single base substitutions and indels of ≤200 bp
         primary           sgv_p.txt[.gz|.bz2]
cnsm     metadata          cnsm_m.txt[.gz|.bz2]     Copy number somatic mutations
         primary           cnsm_p.txt[.gz|.bz2]
         secondary         cnsm_s.txt[.gz|.bz2]
stsm     metadata          stsm_m.txt[.gz|.bz2]     Structural somatic mutations
         primary           stsm_p.txt[.gz|.bz2]
         secondary         stsm_s.txt[.gz|.bz2]
exp      metadata          exp_m.txt[.gz|.bz2]      Gene expression
         gene expression   exp_g.txt[.gz|.bz2]
mirna    metadata          mirna_m.txt[.gz|.bz2]    miRNA expression
         primary           mirna_p.txt[.gz|.bz2]
         secondary         mirna_s.txt[.gz|.bz2]
jcn      metadata          jcn_m.txt[.gz|.bz2]      Exon junction
         primary           jcn_p.txt[.gz|.bz2]
pexp     metadata          pexp_m.txt[.gz|.bz2]     Protein expression
         primary           pexp_p.txt[.gz|.bz2]
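
The naming conventions in the tables above can be expressed as a small lookup. The Python sketch below is only an illustration: the function classify and the constant names are hypothetical, the suffix letters (m, p, s, g) are read off the file names listed above, and it assumes that every file name carries the .txt extension followed by an optional .gz or .bz2.

```python
import re

# Clinical data types listed in the tables above (one file each, no suffix).
CLINICAL_TYPES = {
    "donor", "specimen", "sample",
    "surgery", "exposure", "family", "biomarker", "therapy",
}

# Suffix letter -> file type, as it appears in the experimental file names.
FILE_TYPE_SUFFIXES = {
    "m": "metadata", "p": "primary", "s": "secondary", "g": "gene expression",
}

# Experimental data type -> suffix letters listed for it.
EXPERIMENTAL_FILES = {
    "ssm": "mp", "sgv": "mp", "cnsm": "mps", "stsm": "mps",
    "exp": "mg", "mirna": "mps", "jcn": "mp", "pexp": "mp",
}

NAME_PATTERN = re.compile(
    r"(?P<stem>[a-z]+(?:_[a-z])?)\.txt(?P<compression>\.gz|\.bz2)?"
)


def classify(file_name):
    """Return (data_type, file_type, compression), or None if the name is not recognised."""
    match = NAME_PATTERN.fullmatch(file_name)
    if match is None:
        return None
    stem = match.group("stem")
    compression = (match.group("compression") or "").lstrip(".") or None
    if stem in CLINICAL_TYPES:
        return (stem, None, compression)
    data_type, _, suffix = stem.partition("_")
    # The suffix must be one of the letters listed for this data type.
    if suffix and suffix in EXPERIMENTAL_FILES.get(data_type, ""):
        return (data_type, FILE_TYPE_SUFFIXES[suffix], compression)
    return None


if __name__ == "__main__":
    for name in ["donor.txt", "ssm_p.txt.gz", "exp_g.txt.bz2", "ssm_s.txt"]:
        print(name, classify(name))
```

A name such as ssm_s.txt is rejected because the table lists no secondary file for ssm.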