clinvartab2bigquery

command
v0.0.0-...-66a4754
Published: Nov 10, 2022 License: BSD-3-Clause Imports: 12 Imported by: 0

README

# From https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/README
Last updated February 10, 2020
README file for ftp://ftp.ncbi.nih.gov/pub/clinvar

This directory contains reports from the ClinVar dataset and documents about ClinVar development. Sections in this README are divided by type of content. 

This directory has a folder for documents related to the collaboration with ClinGen (http://www.clinicalgenome.org/). For more details, see http://www.ncbi.nlm.nih.gov/clinvar/docs/review_guidelines.

This README file also documents ClinVar-related data in other directories, such as 
        ftp://ftp.ncbi.nlm.nih.gov/pub/GTR/standard_terms, for terminology used by both GTR and ClinVar.
        ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/mim2gene_medgen, for gene-disease relationships related to OMIM

http://www.ncbi.nlm.nih.gov/clinvar/


================================================================================
SUBMISSIONS
================================================================================
  Details about how to submit are provided at this site:
       http://www.ncbi.nlm.nih.gov/clinvar/docs/submit/

--------------------------------------------------
clinvar_submission.xsd
--------------------------------------------------
  
  If you have any questions about submitting data to ClinVar as an XML file, please contact us via clinvar@ncbi.nlm.nih.gov
  This is the .xsd file for validating records submitted to ClinVar as xml.

------------------------------------------------------------
history of the versions of clinvar_submission.xsd
subdirectory xsd_submission
------------------------------------------------------------
This subdirectory archives different versions of the submission xsd.  The current one is also accessible as clinvar_submission.xsd


--------------------------------------------------
submission_templates 
subdirectory ftp://ftp.ncbi.nih.gov/pub/clinvar/submission_templates
--------------------------------------------------
  This subdirectory contains excel spreadsheets that can be used to 
  submit data to ClinVar. 
 
    There are two templates in this directory:

         SubmissionTemplate.xlsx
         SubmissionTemplateLite.xlsx

    SubmissionTemplate.xlsx is the standard submission template.

    SubmissionTemplateLite.xlsx is simpler, and is designed for
    submissions with less supporting evidence.

================================================================================
EXTRACTS OF CLINVAR DATA
================================================================================
ClinVar data are provided for download as extracts in xml, vcf and tab-delimited 
formats in the directories described below.

-------------------------------------------------
Updates
-------------------------------------------------
Data on the ftp site are updated monthly and weekly.
Please note for each file how and when the data are refreshed.

Monthly:  usually the first Thursday of the month.
          A copy is also made to the archive sub-directory.

Weekly:   usually on Mondays
          XML: Weekly releases are not archived, and accumulate in the weekly_release subdirectory until the next monthly release.
          tab-delimited: The latest version is retained in the tab-delimited path.
                         The archive subdirectory retains the copy consistent with the monthly release.
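
The monthly schedule above can be turned into a date. The following is a minimal sketch, not part of ClinVar's tooling: the function name firstThursday is made up for illustration, and because the README says "usually", the result is only the expected release date, not a guarantee.

```go
package main

import (
	"fmt"
	"time"
)

// firstThursday returns the first Thursday of the given month (UTC).
// Hypothetical helper: it only encodes the "first Thursday" rule of thumb.
func firstThursday(year int, month time.Month) time.Time {
	first := time.Date(year, month, 1, 0, 0, 0, 0, time.UTC)
	// Number of days from the 1st to the next Thursday (0 if the 1st is a Thursday).
	offset := (int(time.Thursday) - int(first.Weekday()) + 7) % 7
	return first.AddDate(0, 0, offset)
}

func main() {
	fmt.Println(firstThursday(2020, time.February).Format("2006-01-02"))
}
```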

-------------------------------------------------
clinvar_public.xsd
-------------------------------------------------
The schema for the export version of the XML 
clinvar_public.xsd		Link to the current version such as /xsd_public/clinvar_public_1.5.xsd

------------------------------------------------------------
history of the versions of clinvar_public.xsd
subdirectory xsd_public
------------------------------------------------------------
This subdirectory archives different versions of the xsd used to validate ClinVar's comprehensive export as xml.  The current one is also accessible as clinvar_public.xsd

The version number is represented in the file name.

----------------------------------------
release_notes
----------------------------------------
The release_notes subdirectory  contains reports of the differences between versions of clinvar_public.xsd


================================================================================
NAMES OF PHENOTYPES
================================================================================

--------------------------------------------------
disease_names
--------------------------------------------------

  This document is updated daily, and is provided to report the preferred names and
  identifiers used in GTR and ClinVar. Please note there may be more than one
  line per condition, when a name is used by more than one source. This
  differs from the gene_condition_source_id file because it is comprehensive,
  and does not require knowledge of any gene-to-disease relationship.
    NOTE: In February 2020, the scope of the sources being reported was modified.
          In particular, specific submitters and historical references to GeneTests were removed.
          Attribution to GARD and SNOMED CT was suspended until maintenance is more up to date.
	 
  Tab-delimited file with the following 7 fields:

DiseaseName:          The name preferred by GTR and ClinVar
SourceName:           Sources that also use this preferred name
ConceptID:            The identifier assigned to a disorder associated with this
                        gene. If the value starts with a C and is followed by digits,
                        the ConceptID is a value from UMLS; if a value begins with CN,
                        it was created by NCBI-based processing.
SourceID:             Identifier used by the source reported in column 2 (SourceName)
DiseaseMIM:           MIM number for the condition.
LastUpdated:          Last time this record was modified by NCBI staff
Category:             Category of disease (as reported in ClinVar's XML), one of:
                        - Blood group
                        - Disease
                        - Finding
                        - Named protein variant
                        - Pharmacological response
                        - phenotype instruction
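
The ConceptID rule described above (C followed by digits for UMLS, a CN prefix for NCBI-assigned values) can be sketched as a small classifier. This is an illustrative sketch only: the function names and the sample identifiers are made up, and anything that fits neither pattern is reported as "unknown".

```go
package main

import (
	"fmt"
	"strings"
)

// allDigits reports whether s is non-empty and consists only of ASCII digits.
func allDigits(s string) bool {
	for _, r := range s {
		if r < '0' || r > '9' {
			return false
		}
	}
	return s != ""
}

// conceptSource applies the README's rule: "CN..." was created by NCBI-based
// processing, "C" followed by digits comes from UMLS.
func conceptSource(id string) string {
	switch {
	case strings.HasPrefix(id, "CN") && len(id) > 2:
		return "NCBI"
	case len(id) > 1 && id[0] == 'C' && allDigits(id[1:]):
		return "UMLS"
	default:
		return "unknown"
	}
}

func main() {
	// Made-up identifiers, for illustration only.
	for _, id := range []string{"C0000001", "CN000001", "12345"} {
		fmt.Println(id, conceptSource(id))
	}
}
```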


--------------------------------------------------
gene_condition_source_id
--------------------------------------------------

  This document is updated daily, and is provided to report gene-disease relationships used in ClinVar, Gene, GTR and MedGen.
  The sources of information for the gene-disease relationship include OMIM, GeneReviews, and a limited amount of curation by NCBI staff.
  The scope of disorders reported in this file is a subset of the disease_names file because a gene-to-disease relationship is required.  
  Tab-delimited file with the following fields:

GeneID:               The NCBI GeneID
GeneSymbol:           The preferred symbol corresponding to the GeneID
ConceptID:            The identifier assigned to a disorder associated with this
                        gene. If the value starts with a C and is followed by digits,
                        the ConceptID is a value from UMLS; if a value begins with
                        CN, it was created by NCBI-based processing
DiseaseName:          Full name for the condition
SourceName:           Sources that use this name
SourceID:             The identifier used by this source
DiseaseMIM:           MIM number for the condition
LastUpdated:          Last time this record was modified by NCBI staff

--------------------------------------------------
ConceptID_history.txt 
--------------------------------------------------
  --Added to the directory October 24, 2012
  
  This document is updated daily, and is provided to help track changes in
identifiers assigned to phenotypes over time. The ConceptID values in the
first column are no longer active, and are either discontinued (the value in
column 2 is 'No longer reported') or replaced by a record with a different
identifier.  That replacement may result either because of a merge (one record
becoming secondary to another) or because of a change in numbering, usually
because an identifier assigned by NCBI (starting with CN) is now thought to be
represented by a ConceptID from UMLS (starting with C followed by numerals).

Previous ConceptID     the outdated identifier
Current ConceptID      the current identifier
Date of Action         the date this change occurred
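
One way to use this file is to map an outdated ConceptID to its current value. The sketch below assumes the first two columns have already been loaded into a map, and it assumes that a replacement may itself have been replaced later, so it follows the chain; neither the function name nor that chaining behaviour is stated by the README. The identifiers are made up.

```go
package main

import "fmt"

// discontinued is the column 2 value the README documents for retired IDs.
const discontinued = "No longer reported"

// resolve follows Previous -> Current records until it reaches an ID that is
// not listed as "previous". It returns false for discontinued IDs and for
// (unexpected) cycles in the history.
func resolve(history map[string]string, id string) (string, bool) {
	seen := make(map[string]bool)
	for {
		next, replaced := history[id]
		if !replaced {
			return id, true // not in the history, so treated as still active
		}
		if next == discontinued || seen[id] {
			return "", false
		}
		seen[id] = true
		id = next
	}
}

func main() {
	// Made-up history records, for illustration only.
	history := map[string]string{
		"CN000001": "CN000002",
		"CN000002": "C0000002",
		"CN000003": discontinued,
	}
	fmt.Println(resolve(history, "CN000001"))
	fmt.Println(resolve(history, "CN000003"))
}
```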


--------------------------------------------------
dbGaP_frequency_study_list 
--------------------------------------------------
  --Added to the directory September 29, 2015

Text and html files reporting the studies in dbGaP that were assessed for single nucleotide variants reported in ClinVar.
dbGaP_frequency_study_list.html
dbGaP_frequency_study_list.txt

================================================================================
SUBDIRECTORIES
================================================================================
community              files generated in the initial design of ClinVar
presentations          slides or other documents about ClinVar
submission_templates   templates for submission by spreadsheet
tab_delimited          flattened tabular data summaries of several types
-----------------------VCF------------------------
See the README specific to ClinVar's VCF files:
ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/README_VCF.txt
ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/README_VCF.txt
--------------------------------------------------
xml                    An extraction of data in ClinVar as xml
The xsd for the export version of the XML is clinvar_public.xsd
For more details about the files in the xml directory, please refer to 
ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/_README


==================================================
tab_delimited sub-directory

ClinVar makes every attempt to retain backward compatibility in its information products. 
If changes are made to these tab-delimited files, new columns are usually added to the end. 
If, however, a column needs to be removed, the names of the remaining column headers will be stable. 

Most of the files in this directory are generated weekly, usually on Mondays.
Not all of the files in this directory are archived, so please note for each file whether
  monthly versions are copied to the archive subdirectory the first of each month.

--------------------------------------------------------------------------------
1. gene_specific_summary
--------------------------------------------------------------------------------
Generated weekly
Archived monthly (first Thursday of each month)

Although this report is generated each week, it is currently based on statistics that are captured the first day of each month.
Therefore there will be some discrepancies between what is reported in this file and what may be viewed interactively on the web.

A tab-delimited report, for each gene, of the number of submissions and the number of different variants (alleles).
Because some variant-gene relationships are submitted, and some are calculated from overlapping annotation, in January of 2015, the report was modified to indicate when the gene-variant relationship was submitted.

Symbol                 Gene symbol (if officially named, from HGNC, else from NCBI's Gene database)                
GeneID                 Unique identifier from  NCBI's Gene database
Total_submissions      Total submissions to ClinVar with variants in/overlapping this gene
Total_alleles          Number of alleles submitted to ClinVar for this gene
Submissions_reporting_this_gene
                       Subset of the total submissions that also reported the gene
Alleles_reported_Pathogenic_Likely_pathogenic
                       Number of variants reported as pathogenic or likely pathogenic
                       Excludes structural variants that may overlap a gene
Gene_MIM_Number        The MIM number for this gene
Number_Uncertain       Submissions with an interpretation of 'Uncertain significance'
Number_with_conflicts  Number of VariationIDs for this gene with conflicting interpretations
--------------------------------------------------------------------------------
2. variant_summary.txt
--------------------------------------------------------------------------------
Generated weekly
Archived monthly (first Thursday of each month)


A tab-delimited report based on each variant at a location on the genome for which data have been submitted to ClinVar.  
The data for the variant are reported for each assembly, so most variants have a line for GRCh37 (hg19) and another line for GRCh38 (hg38).
 
Please note: Beginning in October 2016, this file was modified to restrict reporting to attributes of an AlleleID, not a mixture of AlleleID and VariationID.  The modifications were announced in our September 2016 release notes:  ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/release_notes/20160901_data_release_notes.pdf.

             The last file that reported VariationID was ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/archive/variant_summary_2016-09.txt.gz
             The first with the new set of columns is ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/archive/variant_summary_2016-10.txt.gz
             Content that used to be in this file may be found in 
                    ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/hgvs4variation.txt.gz
                    ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variation_allele.txt.gz

Please note: Beginning in November 2019, the values for referenceAllele and alternateAllele are being written according to the VCF
             standard.  For single nucleotide variants there was no change in the value.

             See also the authoritative file for identifiers assigned to genes represented by NCBI, namely:
                    ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz

AlleleID               integer value as stored in the AlleleID field in ClinVar  (//Measure/@ID in the XML)
Type                   character, the type of variant represented by the AlleleID
Name                   character, ClinVar's preferred name for the record with this AlleleID
GeneID                 integer, GeneID in NCBI's Gene database, reported if there is a single gene, otherwise reported as -1.
GeneSymbol             character, comma-separated list of symbols for the genes overlapping the variant
HGNC_ID                string, of format HGNC:integer, reported if there is a single GeneID. Otherwise reported as '-'
ClinicalSignificance   character, comma-separated list of aggregate values of clinical significance calculated for this variant
                       If the value is 'no interpretation for the single variant', this allele was submitted to
                       ClinVar as part of a haplotype or genotype, and its specific significance has not been submitted to ClinVar.
                       NOTE: Now that the aggregate values of clinical significance give precedence to records
                             with assertion criteria and evidence, the values in this column may appear to be in
                             conflict with the value reported in ClinSigSimple. 
ClinSigSimple          integer, 0 = no current value of Likely pathogenic or Pathogenic
                                1 = at least one current record submitted with an interpretation of Likely pathogenic or
                                    Pathogenic (independent of whether that record includes assertion criteria and evidence).
                               -1 = no values for clinical significance at all for this variant or set of variants; used for
                                    the "included" variants that are only in ClinVar because they are included in a
                                    haplotype or genotype with an interpretation
                       NOTE: Now that the aggregate values of clinical significance give precedence to records with
                             assertion criteria and evidence, the values in this column may appear to be in conflict with the
                             value reported in ClinicalSignificance.  In other words, if a submission without assertion criteria and
                             evidence interpreted an allele as pathogenic, and those with assertion criteria and evidence interpreted
                             as benign, then ClinicalSignificance would be reported as Benign and ClinSigSimple as 1.
LastEvaluated          date, the latest date any submitter reported clinical significance
RS# (dbSNP)            integer, rs# in dbSNP, reported as -1 if missing
nsv/esv (dbVar)        character, the NSV identifier for the region in dbVar
RCVaccession           character, list of RCV accessions that report this variant
PhenotypeIDs           character, list of identifiers for phenotype(s) interpreted for this variant
PhenotypeList          character, list of names corresponding to PhenotypeIDs
Origin                 character, list of all allelic origins for this variant
OriginSimple           character, processed from Origin to make it easier to distinguish between germline and somatic
Assembly               character, name of the assembly on which locations are based  
ChromosomeAccession    Accession and version of the RefSeq sequence defining the position reported in the start and stop columns. 
                            Please note some of these accessions may be for sub-chromosomal regions.
Chromosome             character, chromosomal location
Start                  integer, starting location, in pter->qter orientation
Stop                   integer, end location, in pter->qter orientation
ReferenceAllele        The reference allele according to the vcf standard.
AlternateAllele        The alternate allele according to the vcf standard.
Cytogenetic            character, ISCN band
ReviewStatus           character, highest review status for reporting this measure. For the key to the terms, 
                           and their relationship to the star graphics ClinVar displays on its web pages, 
                           see http://www.ncbi.nlm.nih.gov/clinvar/docs/variation_report/#interpretation
                           Note also that 'no interpretation for the single variant' is used for AlleleIDs in ClinVar
                           that were submitted as part of the definition of a complex allele, but not interpreted
                           individually.
NumberSubmitters       integer, number of submitters describing this variant
Guidelines             character, ACMG only right now, for the reporting of incidental variation in a Gene 
                       enumerates whether the guideline is from 2013 (ACMG2013, PubMed 23788249) or 2016 (ACMG2016, PubMed 27854360)
                               (NOTE: if ACMG, not specific to the AlleleID but to the Gene in which the AlleleID is found)
TestedInGTR            character, Y/N for Yes/No if there is a test registered as specific to this variant 
                          in the NIH Genetic Testing Registry (GTR)
OtherIDs               character, list of other identifiers or sources of information about this variant
SubmitterCategories    coded value to indicate whether data were submitted by another resource (1), any other type of source (2) or both (3)
VariationID            The identifier ClinVar uses specific to the AlleleID.  Not all VariationIDs that may be related to
                           the AlleleID are reported in this file. For a comprehensive mapping of AlleleID to VariationID,
                           please use ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variation_allele.txt.gz.
                           Note also that some of the values for VariationID are not supported in the current
                           default web display, but will be reported from ClinVar 2.0, as currently seen on our preview site.
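
Several variant_summary columns use sentinel values (-1 for GeneID and RS#, '-' for HGNC_ID), and new columns may be appended over time. The following sketch shows one way a reader might handle both; it is not the parser used by this command. The Variant type, the function names and the sample record are made up, and stripping a leading '#' from header names is an assumption about the file's header line. Columns are looked up by name so that appended columns do not shift anything.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Variant holds a few variant_summary columns. A nil pointer or empty string
// replaces the sentinel value the README documents for that column.
type Variant struct {
	AlleleID      int
	GeneID        *int   // nil when reported as -1 (no single gene)
	RSID          *int   // nil when reported as -1 (no rs# in dbSNP)
	HGNCID        string // "" when reported as "-"
	ClinSigSimple int    // 0, 1 or -1, kept as documented
}

// optionalInt parses an integer column in which -1 is a sentinel.
func optionalInt(s string) (*int, error) {
	n, err := strconv.Atoi(s)
	if err != nil {
		return nil, err
	}
	if n == -1 {
		return nil, nil
	}
	return &n, nil
}

// parseVariant reads one record, locating each column by its header name.
func parseVariant(header, fields []string) (Variant, error) {
	col := make(map[string]string, len(header))
	for i, name := range header {
		if i < len(fields) {
			// Assumption: the header line may start with '#'.
			col[strings.TrimPrefix(name, "#")] = fields[i]
		}
	}
	var v Variant
	var err error
	if v.AlleleID, err = strconv.Atoi(col["AlleleID"]); err != nil {
		return v, fmt.Errorf("AlleleID: %w", err)
	}
	if v.GeneID, err = optionalInt(col["GeneID"]); err != nil {
		return v, fmt.Errorf("GeneID: %w", err)
	}
	if v.RSID, err = optionalInt(col["RS# (dbSNP)"]); err != nil {
		return v, fmt.Errorf("RS# (dbSNP): %w", err)
	}
	if v.ClinSigSimple, err = strconv.Atoi(col["ClinSigSimple"]); err != nil {
		return v, fmt.Errorf("ClinSigSimple: %w", err)
	}
	if id := col["HGNC_ID"]; id != "-" {
		v.HGNCID = id
	}
	return v, nil
}

func main() {
	header := strings.Split("#AlleleID\tGeneID\tHGNC_ID\tClinSigSimple\tRS# (dbSNP)", "\t")
	// Made-up record, not taken from ClinVar.
	fields := strings.Split("12345\t-1\t-\t1\t-1", "\t")
	v, err := parseVariant(header, fields)
	if err != nil {
		panic(err)
	}
	fmt.Println(v.AlleleID, v.GeneID == nil, v.RSID == nil, v.HGNCID == "", v.ClinSigSimple)
}
```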
							  
                       
--------------------------------------------------------------------------------
3. cross_references.txt
--------------------------------------------------------------------------------
Generated weekly
Not archived

A tab-delimited report based on each variant in ClinVar, providing identifiers in other databases and when those data were last updated. This file is based on AlleleID rather than VariationID for complex alleles, so it corresponds to a unique genomic location.
NOTE: this file is preliminary and currently includes only identifiers in dbSNP and dbVar. Identifiers from more databases will be added in the future.

AlleleID               integer value as stored in the AlleleID field in ClinVar  (//Measure/@ID in the XML)
Database               name of the database
ID                     identifier used by that database
last_updated           date the identifier/AlleleID relationship was created or last updated

--------------------------------------------------------------------------------
4. var_citations.txt
--------------------------------------------------------------------------------
Generated weekly
Not archived

A tab-delimited report of citations associated with data in ClinVar, connected to the AlleleID, the VariationID, and either rs# from dbSNP or nsv in dbVar.

AlleleID                       integer value as stored in the AlleleID field in ClinVar  (//Measure/@ID in the XML)
VariationID                    The identifier ClinVar uses to anchor its default display. (in the XML,  //MeasureSet/@ID)
rs                             rs identifier from dbSNP, null if missing
nsv                            nsv identifier from dbVar, null if missing
citation_source                The source of the citation, either PubMed, PubMedCentral, or the NCBI Bookshelf
citation_id                    The identifier used by that source

--------------------------------------------------------------------------------
5. summary_of_conflicting_interpretations.txt
--------------------------------------------------------------------------------
Generated weekly
Archived monthly (first Thursday of each month)

This file first became available in January, 2016. It replaces summary_of_conflicting_data.txt (documented below) and differs in that
   a. It is limited to differences in interpretation (i.e. does not report differences in the phenotype being interpreted)
   b. Reports all pairwise differences, so that if submitter a differs from submitters b and c, and submitter b differs from c,  a-b, a-c, b-c will all be reported instead of just a-b and a-c
   c. Reports fewer columns
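
Point b above amounts to enumerating every unordered pair of submitters whose interpretations differ. A minimal sketch of that enumeration follows; the type and function names are made up for illustration and this is not how ClinVar generates the file.

```go
package main

import "fmt"

// submission pairs a submitter with the clinical significance it asserted.
type submission struct {
	Submitter string
	ClinSig   string
}

// conflictingPairs returns every unordered pair of submitters whose
// interpretations differ, in input order (a-b, a-c, b-c, ...).
func conflictingPairs(subs []submission) [][2]string {
	var out [][2]string
	for i := 0; i < len(subs); i++ {
		for j := i + 1; j < len(subs); j++ {
			if subs[i].ClinSig != subs[j].ClinSig {
				out = append(out, [2]string{subs[i].Submitter, subs[j].Submitter})
			}
		}
	}
	return out
}

func main() {
	subs := []submission{
		{"a", "Pathogenic"},
		{"b", "Uncertain significance"},
		{"c", "Benign"},
	}
	for _, p := range conflictingPairs(subs) {
		fmt.Printf("%s-%s\n", p[0], p[1])
	}
}
```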

Gene_Symbol                          If in a gene, its symbol
NCBI_Variation_ID                    The identifier ClinVar uses to anchor its default display. (in the XML,  //MeasureSet/@ID)
ClinVar_Preferred                    The preferred description ClinVar uses for this VariationID
Submitter1                           Name of this submitter
Submitter1_SCV                       Accession assigned to this submission
Submitter1_ClinSig                   Clinical significance asserted by this submitter
Submitter1_LastEval                  Date last evaluated by this submitter
Submitter1_ReviewStatus              Review status of this submission
Submitter1_Sub_Condition             Submitted name of condition
Submitter1_Description               Description of the interpretation
Submitter2                           Name of this submitter
Submitter2_SCV                       Accession assigned to this submission 
Submitter2_ClinSig                   Clinical significance asserted by this submitter
Submitter2_LastEval                  Date last evaluated by this submitter
Submitter2_ReviewStatus              Review status of this submission
Submitter2_Sub_Condition             Submitted name of condition
Submitter2_Description               Description of the interpretation
Rank_diff                            Rank value assigned to the differences in interpretation:
                                        -1: one of the interpretations is not in the set of Pathogenic, Likely pathogenic, Uncertain significance, Likely benign, Benign
                                         0: difference in phenotype only
                                         1-4, difference when both interpretations are in the set of Pathogenic, Likely pathogenic, Uncertain significance, 
                                              Likely benign, Benign, where 4 is most divergent
Conflict_Reported                    yes or no.  Useful to supplement the Rank_diff column when Rank_diff = 1 but a conflict is still reported.
Variant_Type                         the type of variant being described
Submitter1_Method                    the collection method(s) reported by this submitter
Submitter2_Method                    the collection method(s) reported by this submitter
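
The Rank_diff values can be illustrated with a short sketch. Note the assumption: the README gives only the range 1-4 and says 4 is most divergent; computing the rank as the distance between two terms on the five-step scale is a guess made for this example, not a documented formula. The function names are made up.

```go
package main

import "fmt"

// scale lists the five terms named in the README, from Pathogenic to Benign.
var scale = []string{
	"Pathogenic",
	"Likely pathogenic",
	"Uncertain significance",
	"Likely benign",
	"Benign",
}

// position returns the index of term on the scale, or -1 if it is not on it.
func position(term string) int {
	for i, s := range scale {
		if s == term {
			return i
		}
	}
	return -1
}

// rankDiff returns -1 if either term is off the scale, otherwise the number
// of steps between the two terms (ASSUMED reading of ranks 0 to 4).
func rankDiff(a, b string) int {
	ia, ib := position(a), position(b)
	if ia < 0 || ib < 0 {
		return -1
	}
	if ia > ib {
		return ia - ib
	}
	return ib - ia
}

func main() {
	fmt.Println(rankDiff("Pathogenic", "Benign"))
	fmt.Println(rankDiff("Likely benign", "Benign"))
	fmt.Println(rankDiff("Pathogenic", "drug response"))
}
```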
--------------------------------------------------------------------------------
6. hgvs4variation.txt.gz
--------------------------------------------------------------------------------
Updated weekly
Not archived

A compressed report of HGVS expressions ClinVar reports per VariationID and AlleleID. These are broadly categorized by type, based on the reference sequence (coding, genomic, non-coding, protein, RNA) and on the complexity of the submission represented by the VariationID (CompoundHeterozygote, Distinct chromosomes, Haplotype, Phase unknown). 

The header of the file explains the columns, which include the VariationID, the AlleleID, the type and the HGVS expression.  The NCBI GeneID and GeneSymbol are included for ready filtering of lines in the file by gene. The assembly is provided for the HGVS expressions based on chromosome sequences; otherwise the assembly is reported as 'na'.
HINT:  Please note that for human, the accession of the RefSeq representing each chromosome indicates the chromosome being represented. In other words, NC_000001 is for chromosome 1, NC_000002 is for chromosome 2, ... NC_000023 is for X, and NC_000024 is for Y.
In the December release, 3 columns were added to support those wishing to identify which HGVS expressions are used for naming, which HGVS expressions were provided explicitly by a submitter, and which are based on RefSeqs that are reference standards on RefSeqGenes.
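
The HINT above about chromosome accessions can be sketched as a lookup. This is an illustrative helper with a made-up name; it assumes the accession has the form NC_ plus six digits, optionally followed by a ".version" suffix, and it returns false for anything outside NC_000001 to NC_000024.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// chromosomeFromAccession maps a human chromosome RefSeq accession to its
// chromosome name: NC_000001 -> "1", ..., NC_000023 -> "X", NC_000024 -> "Y".
func chromosomeFromAccession(acc string) (string, bool) {
	if !strings.HasPrefix(acc, "NC_") {
		return "", false
	}
	num := strings.TrimPrefix(acc, "NC_")
	// Drop an optional version suffix such as ".11".
	if i := strings.IndexByte(num, '.'); i >= 0 {
		num = num[:i]
	}
	if len(num) != 6 {
		return "", false
	}
	n, err := strconv.Atoi(num)
	if err != nil {
		return "", false
	}
	switch {
	case n >= 1 && n <= 22:
		return strconv.Itoa(n), true
	case n == 23:
		return "X", true
	case n == 24:
		return "Y", true
	}
	return "", false
}

func main() {
	for _, acc := range []string{"NC_000001.11", "NC_000023.11", "NC_000024.10", "NC_012920.1"} {
		chrom, ok := chromosomeFromAccession(acc)
		fmt.Println(acc, chrom, ok)
	}
}
```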
--------------------------------------------------------------------------------
7. variation_allele.txt
--------------------------------------------------------------------------------
Updated weekly
Not archived

Mapping of ClinVar's VariationID (used to build the URL on the web site) and the AlleleIDs assigned to each simple variant.

1. VariationID:            the identifier assigned by ClinVar and used to build the URL, namely https://ncbi.nlm.nih.gov/clinvar/VariationID
2. Type:                   Types of VariationID include Variant (simple variant), Haplotype, CompoundHeterozygote, Complex, Phase unknown, Distinct chromosomes
3. AlleleID:               the integer identifier assigned by ClinVar to each simple allele
4. Interpreted:            _yes_ indicates an interpretation was submitted about the VariationID specifically,
                           _no_ indicates that information about the VariationID was submitted as a component of a different record.
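
Because a Haplotype or CompoundHeterozygote VariationID spans several AlleleIDs, a reader of this file usually wants the rows grouped by VariationID. The sketch below does that for lines that have already been read into memory. The function name and sample rows are made up, and skipping lines that start with '#' is an assumption about how the file marks its header.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// allelesByVariation groups AlleleIDs (column 3) under their VariationID
// (column 1). Empty lines and lines starting with '#' are skipped.
func allelesByVariation(lines []string) (map[int][]int, error) {
	out := make(map[int][]int)
	for _, line := range lines {
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		f := strings.Split(line, "\t")
		if len(f) < 4 {
			return nil, fmt.Errorf("expected 4 columns, got %d", len(f))
		}
		variationID, err := strconv.Atoi(f[0])
		if err != nil {
			return nil, fmt.Errorf("VariationID: %w", err)
		}
		alleleID, err := strconv.Atoi(f[2])
		if err != nil {
			return nil, fmt.Errorf("AlleleID: %w", err)
		}
		out[variationID] = append(out[variationID], alleleID)
	}
	return out, nil
}

func main() {
	// Made-up rows, for illustration only.
	lines := []string{
		"#VariationID\tType\tAlleleID\tInterpreted",
		"100\tHaplotype\t11\tyes",
		"100\tHaplotype\t12\tyes",
		"200\tVariant\t13\tyes",
	}
	m, err := allelesByVariation(lines)
	if err != nil {
		panic(err)
	}
	fmt.Println(m[100], m[200])
}
```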

--------------------------------------------------------------------------------
8. submission_summary.txt
--------------------------------------------------------------------------------
Generated weekly
Archived monthly (first Thursday of each month)

   Overview of interpretation, phenotypes, observations, and methods reported in each current submission 

1.  VariationID:              the identifier assigned by ClinVar and used to build the URL, namely https://ncbi.nlm.nih.gov/clinvar/VariationID
2.  ClinicalSignificance:     interpretation of the variation-condition relationship
3.  DateLastEvaluated:        the last date the variation-condition relationship was evaluated by this submitter
4.  Description:              an optional free text description of the basis of the interpretation
5.  SubmittedPhenotypeInfo:   the name(s) or identifier(s)  submitted for the condition that was interpreted relative to the variant
6.  ReportedPhenotypeInfo:    the MedGen identifier/name combinations ClinVar uses to report the condition that was interpreted. 'na' means there is no public identifier in MedGen for the condition.
7.  ReviewStatus:             the level of review for this submission, namely http://www.ncbi.nlm.nih.gov/clinvar/docs/variation_report/#review_status
8.  CollectionMethod:         the method by which the submitter obtained the information provided
9.  OriginCounts:             the reported origin and the number of observations for each origin
10. Submitter:                the submitter of this record
11. SCV:                      the accession and current version assigned by ClinVar to the submitted interpretation of the variation-condition relationship
12. SubmittedGeneSymbol:      the symbol provided by the submitter for the gene affected by the variant. May be null.

--------------------------------------------------------------------------------
9. allele_gene.txt
--------------------------------------------------------------------------------
Updated weekly
Not archived

Reports, per ClinVar AlleleID, the genes that are related to that allele and how they are related. The values for category are:

asserted, but not computed:          Submitted as related to a gene, but not within the location of that gene on the genome
genes overlapped by variant          The gene and variant overlap
near gene, downstream                Outside the location of the gene on the genome, within 5 kb
near gene, upstream                  Outside the location of the gene on the genome, within 5 kb
within multiple genes by overlap     The variant is within genes that overlap on the genome. Includes introns.
within single gene                   The variant is in only one gene.  Includes introns.

AlleleID:                            the integer identifier assigned by ClinVar to each simple allele
GeneID:                              integer, GeneID in NCBI's Gene database  
Symbol                               character, Symbol preferred in NCBI's Gene database. Is the symbol from HGNC when available
Name                                 character, full name of the gene
GenesPerAlleleID                     integer, number of genes related to the allele
Category                             character, type of allele-gene relationship
Source                               character, was the relationship submitted or calculated?

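--------------------------------------------------------------------------------
10. organization_summary.txt
--------------------------------------------------------------------------------
Updated weekly
Not archived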

organization                         the name of the lab and the institution of which it is part	
organization ID                      the id used in ClinVar and GTR; often reported as OrgID;
                                     append to https://www.ncbi.nlm.nih.gov/clinvar/submitters to review more details
institution type                     type of organization
street address                       street address
country                              country
number of ClinVar submissions        number of submissions to ClinVar
date last submitted                  last date on a public submission from this organization
maximum review status                the 'most stars' valid for any submission from this organization
collection methods                   comma-delimited list of methods used to determine information for the submission
novel and updates                    values are novel, novel and updates.  The latter indicates the submitter has provided updates.
clinical significance categories submitted
                                     list of types of interpretations from this organization
number of submissions from clinical testing
                                     number of submissions for the list of categories in 'collection methods'
number of submissions from research
                                     number of submissions for the list of categories in 'collection methods'
number of submissions from literature only
                                     number of submissions for the list of categories in 'collection methods'
number of submissions from curation
                                     number of submissions for the list of categories in 'collection methods'
number of submissions from phenotyping
                                     number of submissions for the list of categories in 'collection methods'
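
Two of the fields above lend themselves to a short illustration: 'organization ID', which is appended to the submitters URL, and 'collection methods', which is a comma-delimited list. The sketch below is hedged: the helper names are hypothetical, and joining the ID as a path segment with a trailing slash is an assumption about how "append" is meant.

```python
SUBMITTERS_URL = "https://www.ncbi.nlm.nih.gov/clinvar/submitters"


def submitter_url(org_id):
    """Build the page URL for an organization ID (OrgID)."""
    # Assumption: the ID is appended as a path segment with a trailing slash.
    return f"{SUBMITTERS_URL}/{org_id}/"


def split_methods(value):
    """Split the comma-delimited 'collection methods' field into a list."""
    return [part.strip() for part in value.split(",") if part.strip()]


# Made-up values for illustration only.
example_url = submitter_url(12345)
example_methods = split_methods("clinical testing, research")
```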
												 

11. special_requests subdirectory
This path contains files that were requested by more than one user. The reports are not kept current, but if the contents are considered useful, they may be converted to production level and updated regularly.  For more details, see the README.txt file in that path.

================================================================================
VALIDATING FILE DOWNLOADS
================================================================================
We provide md5 checksum files so that you can confirm your ftp transfer is complete. An md5 checksum is a string of letters and numbers that acts as a fingerprint for a file. We currently provide md5 files for some of the tab_delimited files and for the ClinVarFullRelease XML file. After you download a file, use a utility to generate its md5 hash and compare that value with the one in our md5 checksum file; matching values indicate that your download has the entire file. In the Linux environment, the utility is likely md5sum. Freeware tools are available for other environments.
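
Where md5sum is not available, the same comparison can be sketched with Python's standard hashlib module. The function names are hypothetical, and the sketch assumes the checksum file uses the md5sum layout, in which the hex digest is the first whitespace-separated token on the line.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large downloads need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, md5_line):
    """Compare a file's md5 with one line taken from a checksum file."""
    # Assumption: md5sum-style line, "<hex digest>  <file name>".
    expected = md5_line.split()[0].lower()
    return md5_of_file(path) == expected
```

For example, matches_checksum("variant_summary.txt.gz", line) would return True only when the downloaded file is byte-for-byte identical to the one the checksum was generated from, where line is the text read from the corresponding md5 file.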

================================================================================
RELATED SITES
================================================================================
--------------------------------------------------
Gene-disease relationships
--------------------------------------------------
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/mim2gene_medgen

This file, completely documented in ftp://ftp.ncbi.nih.gov/gene/README, maintains information about gene-disease relationships inferred from gene-variation and variation-disease relationships or reported by OMIM.

--------------------------------------------------
standard terms
--------------------------------------------------
ftp://ftp.ncbi.nlm.nih.gov/pub/GTR/standard_terms/

This directory contains terms used by ClinVar and GTR in specified categories.  Cross-references to the term in other databases may also be provided.

================================================================================
ClinGen 
================================================================================
ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/ClinGen/

ClinVar is an active participant in the ClinGen project (http://clinicalgenome.org/).
As part of that collaboration, ClinVar requests that submitters wishing to be identified as an expert panel provide documentation about their methods of determining Clinical Significance. The form to be completed is provided in this directory.


================================================================================
MISCELLANEOUS
================================================================================
--------------------------------------------------
2013.1-hgmd-public.tsv
--------------------------------------------------
  This file was removed June 19, 2013 on request from HGMD.

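--------------------------------------------------
vcf
--------------------------------------------------
  This path was removed to make explicit the difference between VCF files on GRCh37 and GRCh38.
  It was replaced with vcf_GRCh37 and vcf_GRCh38.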


======================================================================
Partial revision history
======================================================================
December 3, 2014:    Added the file /tab_delimited/summary_of_conflicting_data.txt
January 12, 2015:    Documented the changes to gene_specific_summary.txt and explained the new directory structure
                     for vcf files.
September 10, 2015:  Documented providing md5 files.
April 7, 2016:       Discontinued generating summary_of_conflicting_data.txt
                     The historical reports are maintained in the archive directory (ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/archive/)


summary_of_conflicting_data.txt was defined as follows:
Generated monthly, the first Thursday of the month
A tab-delimited report based on variants in ClinVar for which information has been provided by more than one submitter, and for which there is inconsistency in reporting phenotype or interpretation.  The file includes some basic information about the variant, and then describes what each submitter said for which a discrepancy was noted.  Although generated primarily for the submitter, or groups reviewing evidence supporting variant assessments, this file may be of interest to others as well. Please note *all* data conflicts are reported.  To focus on a subset of interest, we suggest you use either the Rank_diff or Conflict_Reported columns. For example, look for rows in the report where Rank_diff >= 2 or Conflict_Reported is 'yes'.

Gene_Symbol                          If in a gene, its symbol
NCBI_Variation_ID                    The identifier ClinVar uses to anchor its default display. (in the XML,  //MeasureSet/@ID)
NCBI_AlleleID                        The identifier ClinVar uses to define an allele. (in the XML,  //Measure/@ID)
HGVS                                 The default HGVS expression for this variant
Submitter1                           Name of this submitter
Submitter1_ID                        Identifier for this variant provided by this submitter, or constructed by NCBI for the submitter
Submitter1_SCV                       Accession assigned to this submission
Submitter1_Definition                Variant as defined by this submitter  (currently missing for those not submitted by HGVS)
Submitter1_ClinSig                   Clinical significance asserted by this submitter
Submitter1_LastEval                  Date last evaluated by this submitter
Submitter1_AssertionMethod           Assertion method
Submitter1_Sub_Condition             Submitted name of condition
Submitter1_Calc_Condition            Name ClinVar uses for this condition
Submitter1_Description               Description of the interpretation
Submitter2                           Name of this submitter
Submitter2_ID                        Identifier for this variant provided by this submitter, or constructed by NCBI for the submitter
Submitter2_SCV                       Accession assigned to this submission 
Submitter2_Definition                Variant as defined by this submitter  (currently missing for those not submitted by HGVS)
Submitter2_ClinSig                   Clinical significance asserted by this submitter
Submitter2_LastEval                  Date last evaluated by this submitter
Submitter2_AssertionMethod           Assertion method
Submitter2_Sub_Condition             Submitted name of condition
Submitter2_Calc_Condition            Name ClinVar uses for this condition
Submitter2_Description               Description of the interpretation
Rank_diff                            Rank value assigned to the differences in interpretation:
                                        -1: one of the interpretations is not in the set of Pathogenic, Likely pathogenic, Uncertain significance, Likely benign, Benign
                                         0: difference in phenotype only
                                         1-4: difference when both interpretations are in the set of Pathogenic, Likely pathogenic, Uncertain significance,
                                              Likely benign, Benign, where 4 is most divergent
Conflict_Reported                    yes or no.  Useful to supplement the Rank_diff column when Rank_diff = 1 but a conflict is still reported.
Variant_Type                         the type of variant being described
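
The suggestion to focus on a subset using Rank_diff and Conflict_Reported can be sketched as a small filter. This is a hedged example: it assumes a tab-delimited file with a header row, it reads the suggested condition as "Rank_diff >= 2 or Conflict_Reported is 'yes'", and the function name and sample rows are hypothetical.

```python
import csv
import io

# Made-up rows for illustration only; most columns are omitted.
SAMPLE = (
    "Gene_Symbol\tRank_diff\tConflict_Reported\n"
    "GENE1\t0\tno\n"
    "GENE2\t1\tyes\n"
    "GENE3\t3\tno\n"
    "GENE4\t-1\tno\n"
)


def notable_conflicts(lines, min_rank=2):
    """Yield rows with Rank_diff >= min_rank or Conflict_Reported == 'yes'."""
    # QUOTE_NONE: treat quote characters as ordinary text in tab-delimited data.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        rank = int(row["Rank_diff"])
        reported = row["Conflict_Reported"].strip().lower() == "yes"
        if rank >= min_rank or reported:
            yield row


selected = [row["Gene_Symbol"] for row in notable_conflicts(io.StringIO(SAMPLE))]
```

Rows with Rank_diff of 1 are kept only when Conflict_Reported is 'yes', which matches the stated purpose of that column as a supplement to Rank_diff.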

August 4, 2016:        Added documentation for several new files in the tab-delimited subdirectory
                       * hgvs4variation.txt.gz
                       * submission_summary.txt
                       * variation_allele.txt
                       Added explanation of numbering systems for locations in VCF vs. all other reports
                                      
September 24, 2016:    Added documentation for the allele_gene files in the tab-delimited directory
November   9, 2016:    Corrected documentation for variant_summary that should have been included in the October update.
December   9, 2016:    Added SubmittedGeneSymbol to the description of submission_summary.txt, and updated the description of hgvs4variation.txt.gz
January   19, 2017:    Corrected inconsistent definition of NumberSubmitters in variant_summary.txt
November  30, 2017:    Modification to summary_of_conflicting_interpretations.txt
February  20, 2019:    Added documentation of organization_summary.txt
November  12, 2019:    Noted change in reporting referenceAllele and alternateAllele in variant_summary.txt; added more information about releases.

Documentation

Overview

clinvar2bigquery takes ClinVar data, filters it using criteria that are useful to my line of work, and formats it so that it can be ingested into BigQuery. Specifically, it ingests the file located at https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz

See https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/README for info on interpreting the fields.
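
The general shape of such a pipeline (read a gzipped tab-delimited file, keep some rows, write a format BigQuery can load) can be sketched as follows. This is not the tool's implementation: the tool itself is a Go command, its actual filter criteria and output format are not described here, and the function name, the keep predicate, and the choice of newline-delimited JSON are all assumptions made for illustration.

```python
import csv
import gzip
import io
import json


def tsv_gz_to_ndjson(src, dst, keep=lambda row: True):
    """Convert gzipped tab-delimited text to newline-delimited JSON.

    src is a binary file object (or a path) holding gzip data, dst is a
    text file object, and keep is a hypothetical row filter. Returns the
    number of rows written.
    """
    written = 0
    with gzip.open(src, mode="rt", encoding="utf-8", newline="") as text:
        # QUOTE_NONE: treat quote characters as ordinary text.
        reader = csv.reader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
        # Assumption: the header may start with '#', so strip it.
        header = [name.lstrip("#") for name in next(reader)]
        for fields in reader:
            row = dict(zip(header, fields))
            if keep(row):
                dst.write(json.dumps(row) + "\n")
                written += 1
    return written
```

A call such as tsv_gz_to_ndjson(open("variant_summary.txt.gz", "rb"), open("out.json", "w"), keep=my_filter) would produce one JSON object per retained row, where my_filter stands in for whatever criteria are wanted. All values are left as strings here; a real loader would likely convert types to match the destination table schema.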
