Sequence Read Archive

{{Short description|Database of DNA sequencing data}}

{{infobox biodatabase

|title =Sequence Read Archive

|logo = File:Database.png

|description = FASTQ Sequences
BAM data

|scope =

|organism = all

|center = National Center for Biotechnology Information
European Bioinformatics Institute

DNA Data Bank of Japan

|laboratory =

|author =

|pmid =

|released =

|standard =

|format =

|url = {{URL|https://www.ncbi.nlm.nih.gov/sra/}}
{{URL|http://www.ebi.ac.uk/ena/}}

{{URL|http://trace.ddbj.nig.ac.jp/dra/index_e.html}}

|download =

|webservice =

|sql =

|sparql =

|webapp =

|standalone =

|license =

|versioning =

|frequency =

|curation =

|bookmark =

|version =

}}

The Sequence Read Archive (SRA, previously known as the Short Read Archive) is a bioinformatics database that provides a public repository for DNA sequencing data, especially the "short reads" generated by high-throughput sequencing, which are typically less than 1,000 base pairs in length.{{cite journal|last1=Wheeler|first1=DL|last2=Barrett|first2=T|last3=Benson|first3=DA|last4=Bryant|first4=SH|last5=Canese|first5=K|last6=Chetvernin|first6=V|last7=Church|first7=DM|last8=Dicuccio|first8=M|last9=Edgar|first9=R|last10=Federhen|first10=S|last11=Feolo|first11=M|last12=Geer|first12=LY|last13=Helmberg|first13=W|last14=Kapustin|first14=Y|last15=Khovayko|first15=O|last16=Landsman|first16=D|last17=Lipman|first17=DJ|last18=Madden|first18=TL|last19=Maglott|first19=DR|author-link19=Donna R. Maglott|last20=Miller|first20=V|last21=Ostell|first21=J|last22=Pruitt|first22=KD|last23=Schuler|first23=GD|last24=Shumway|first24=M|last25=Sequeira|first25=E|last26=Sherry|first26=ST|last27=Sirotkin|first27=K|last28=Souvorov|first28=A|last29=Starchenko|first29=G|last30=Tatusov|first30=RL|last31=Tatusova|first31=TA|last32=Wagner|first32=L|last33=Yaschenko|first33=E|title=Database resources of the National Center for Biotechnology Information.|journal=Nucleic Acids Research|date=Jan 2008|volume=36|issue=Database issue|pages=D13-21|pmid=18045790|doi=10.1093/nar/gkm1000|pmc=2238880}} The archive is part of the International Nucleotide Sequence Database Collaboration (INSDC) and is run jointly by the NCBI, the European Bioinformatics Institute (EBI), and the DNA Data Bank of Japan (DDBJ).

The archive was established by the National Center for Biotechnology Information (NCBI) in 2007 in order to provide a repository for data produced by RNA-Seq and ChIP-Seq studies as well as large-scale studies including the Human Microbiome Project and the 1000 Genomes Project.{{cite journal|last1=Galperin|first1=M. Y.|last2=Fernandez-Suarez|first2=X. M.|title=The 2012 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection|journal=Nucleic Acids Research|date=5 December 2011|volume=40|issue=D1|pages=D1–D8|doi=10.1093/nar/gkr1196|pmid=22144685|pmc=3245068}} Originally called the Short Read Archive, the name was changed in anticipation of future sequencing technologies being able to produce longer sequence reads.{{cite news | first = Jim | last = Ostell | title = NCBI's Sequence Read Archive: A Core Enabling Infrastructure | year = 2009 | url = http://www.bio-itworld.com/BioIT_Article.aspx?id=94069 | work = Bio IT World | access-date = 2013-01-08}}

[[File:History (and predicted future) size of the Sequence Read Archive.svg|thumb|Historical and predicted size of the Sequence Read Archive.{{cite journal|last1=Kodama|first1=Y.|last2=Shumway|first2=M.|last3=Leinonen|first3=R.|title=The sequence read archive: explosive growth of sequencing data|journal=Nucleic Acids Research|volume=40|issue=D1|year=2011|pages=D54–D56|issn=0305-1048|doi=10.1093/nar/gkr854|pmid=22009675|pmc=3245110}}]]

The volume of data deposited in the Sequence Read Archive has grown rapidly. As of September 2010, 65% of the SRA was human genomic sequence, with another 16% consisting of human metagenome sequence reads.{{cite journal |author1=Leinonen R |author2=Sugawara H |author3=Shumway M |title=The sequence read archive |journal=Nucleic Acids Res. |volume=39 |issue=Database issue |pages=D19–21 |date=January 2011 |pmid=21062823 |pmc=3013647 |doi=10.1093/nar/gkq1019 }} Much of this data was deposited through the 1000 Genomes Project. In June 2011, the SRA passed 100 terabases of DNA sequence.

The preferred data format for files submitted to the SRA is the BAM format, which can store both aligned and unaligned reads. Internally, the SRA relies on the NCBI SRA Toolkit, used at all three INSDC member databases, to provide flexible data compression, API access, and conversion to other formats such as FASTQ.
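As an illustration of the FASTQ format mentioned above, the following minimal Python sketch reads FASTQ text and summarises read lengths. It assumes the common layout of exactly four lines per record (header, bases, separator, qualities) and does not handle multi-line FASTQ. The names <code>FastqRecord</code>, <code>parse_fastq</code> and <code>summarize</code> are hypothetical and are not part of the SRA Toolkit.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


@dataclass
class FastqRecord:
    name: str      # read identifier, without the leading "@"
    sequence: str  # base calls
    quality: str   # per-base quality string, same length as sequence


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text.

    Assumption: every record occupies exactly four lines.
    """
    it = (line.rstrip("\r\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        try:
            sequence, separator, quality = next(it), next(it), next(it)
        except StopIteration:
            raise ValueError("truncated FASTQ record") from None
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Keep only the identifier, dropping any description after a space.
        yield FastqRecord(header[1:].split(" ", 1)[0], sequence, quality)


def summarize(records: Iterable[FastqRecord]) -> Tuple[int, int, int]:
    """Return (number of reads, total bases, longest read length)."""
    reads = bases = longest = 0
    for record in records:
        reads += 1
        bases += len(record.sequence)
        longest = max(longest, len(record.sequence))
    return reads, bases, longest


if __name__ == "__main__":
    example = "@read1 sample\nACGT\n+\nIIII\n@read2\nGGCCA\n+\nIIIII\n"
    print(summarize(parse_fastq(example.splitlines())))
```

A file on disk could be processed the same way by passing an open text file handle to <code>parse_fastq</code>, since the function only iterates over lines.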

In February 2011, NCBI announced plans to close its SRA because of reduced funding.{{cite journal|last1=GB Editorial Team|title=Closure of the NCBI SRA and implications for the long-term future of genomics data storage|journal=Genome Biology|date=Mar 22, 2011|volume=12|issue=3|page=402|pmc=3129670|doi=10.1186/gb-2011-12-3-402|pmid=21418618 |doi-access=free }} However, EBI and DDBJ announced that they would continue to support the SRA.{{cite web|title=DDBJ will continue Sequence Raw Data Archiving|url=http://www.ddbj.nig.ac.jp/whatsnew/2011/DRA20110222.html|website=www.ddbj.nig.ac.jp|access-date=2 September 2014}} In October 2011, NCBI announced that funding for the SRA would continue.

Deposition of data in the SRA is mandated by most funding agencies and open-access journals. Nature Publishing Group journals require that DNA and RNA sequencing data be made available through the SRA.{{cite web|title=Availability of data and materials : authors and referees @ npg|url=http://www.nature.com/authors/policies/availability.html|website=www.nature.com|access-date=2 September 2014}}

References

{{Reflist}}