Supercomputer architecture

{{Short description|Design of high-performance computers}}

[[File:Jade CINES.jpg|thumb|An SGI Altix supercomputer with 23,000 processors at the CINES facility in France]]

Approaches to supercomputer architecture have taken dramatic turns since the earliest systems were introduced in the 1960s. Early supercomputer architectures pioneered by Seymour Cray relied on compact innovative designs and local parallelism to achieve superior computational peak performance.{{cite book|author1=Sao-Jie Chen|author2=Guang-Huei Lin|author3=Pao-Ann Hsiung|author4=Yu-Hen Hu|title=Hardware Software Co-Design of a Multimedia Soc Platform|url=https://books.google.com/books?id=OXyo3om9ZOkC|access-date=15 June 2012|date=9 February 2009|publisher=Springer|isbn=978-1-4020-9622-8|pages=70–72}} However, in time the demand for increased computational power ushered in the age of massively parallel systems.

While the supercomputers of the 1970s used only a few processors, in the 1990s machines with thousands of processors began to appear, and by the end of the 20th century massively parallel supercomputers with tens of thousands of commercial off-the-shelf processors were the norm. Supercomputers of the 21st century can use over 100,000 processors (some of them graphics processing units) connected by fast networks.

Throughout the decades, the management of heat density has remained a key issue for most centralized supercomputers. The large amount of heat generated by a system may also have other effects, such as reducing the lifetime of other system components. There have been diverse approaches to heat management, from pumping Fluorinert through the system, to a hybrid liquid-air cooling system or air cooling with normal air conditioning temperatures.

Systems with a massive number of processors generally take one of two paths. In one approach, e.g., grid computing, the processing power of a large number of computers in distributed, diverse administrative domains is used opportunistically whenever a computer is available. In another approach, a large number of processors are used in close proximity to each other, e.g., in a computer cluster. In such a centralized massively parallel system the speed and flexibility of the interconnect become very important, and modern supercomputers have used various approaches ranging from enhanced Infiniband systems to three-dimensional torus interconnects.

==Context and overview==

Since the late 1960s the growth in the power and proliferation of supercomputers has been dramatic, and the underlying architectural directions of these systems have taken significant turns. While the early supercomputers relied on a small number of closely connected processors that accessed shared memory, the supercomputers of the 21st century use over 100,000 processors connected by fast networks.

Throughout the decades, the management of heat density has remained a key issue for most centralized supercomputers. Seymour Cray's "get the heat out" motto was central to his design philosophy and has continued to be a key issue in supercomputer architectures, e.g., in large-scale experiments such as Blue Waters.{{cite book|last=Murray|first=Charles J.|title=The supermen : the story of Seymour Cray and the technical wizards behind the supercomputer|year=1997|publisher=John Wiley|location=New York|isbn=978-0-471-04885-5|pages=[https://archive.org/details/supermenstory00murr/page/133 133–135]|url=https://archive.org/details/supermenstory00murr/page/133}} The large amount of heat generated by a system may also have other effects, such as reducing the lifetime of other system components.

[[File:IBM BLADECENTER HS22 7870 JPN JPY.jpg|thumb|An IBM BladeCenter HS22 blade]]

There have been diverse approaches to heat management, e.g., the Cray 2 pumped Fluorinert through the system, while System X used a hybrid liquid-air cooling system and the Blue Gene/P is air-cooled with normal air conditioning temperatures.{{cite book|last=Tokhi|first=M. O.|title=Parallel computing for real-time signal processing and control|year=2003|publisher=Springer|location=London [u.a.]|isbn=978-1-85233-599-1|pages=201–202|author2=Hossain, M. A. |author3=Shaheed, M. H. }}{{cite book|last=Varadarajan|first=S.|title=Proceedings 13th International Conference on Computer Communications and Networks (IEEE Cat No 04EX969) ICCCN-04 |chapter=Keynote I: "System X building the virginia tech supercomputer" |pages=1|date=14 March 2005|doi=10.1109/ICCCN.2004.1401570|issn=1095-2055|isbn=978-0-7803-8814-7}}{{cite web|last=Prickett Morgan|first=Timothy|title=IBM uncloaks 20 petaflops BlueGene/Q super|url=https://www.theregister.co.uk/2010/11/22/ibm_blue_gene_q_super/|work=The Register|date=22 November 2010}} The heat from the Aquasar supercomputer is used to warm a university campus.{{cite web|title=IBM Hot Water-Cooled Supercomputer Goes Live at ETH Zurich |url=http://www.hpcwire.com/hpcwire/2010-07-02/ibm_hot_water-cooled_supercomputer_goes_live_at_eth_zurich.html |work=HPCwire |location=Zurich |date=2 July 2010 |url-status=dead |archive-url=https://web.archive.org/web/20120813212211/http://www.hpcwire.com/hpcwire/2010-07-02/ibm_hot_water-cooled_supercomputer_goes_live_at_eth_zurich.html |archive-date=13 August 2012 }}{{cite web|last=LaMonica|first=Martin|title=IBM liquid-cooled supercomputer heats building|url=http://news.cnet.com/8301-11128_3-20004543-54.html|work=Green Tech|publisher=Cnet|date=10 May 2010|access-date=5 February 2012|archive-date=1 November 2013|archive-url=https://web.archive.org/web/20131101060256/http://news.cnet.com/8301-11128_3-20004543-54.html|url-status=dead}}

The heat density generated by a supercomputer has a direct dependence on the processor type used in the system, with more powerful processors typically generating more heat, given similar underlying semiconductor technologies.{{cite book|title=Supercomputing research advances|year=2008|publisher=Nova Science Publishers|location=New York|isbn=978-1-60456-186-9|pages=313–314|editor=Yongge Huáng}} While early supercomputers used a few fast, closely packed processors that took advantage of local parallelism (e.g., pipelining and vector processing), in time the number of processors grew, and computing nodes could be placed further away, e.g., in a computer cluster, or could be geographically dispersed in grid computing.{{cite encyclopedia |year=2008 |title =Supercomputer Architecture |encyclopedia=Encyclopedia of Computer Science and Technology |location= |page=217|last=Henderson|first=Harry|isbn=978-0-8160-6382-6}} As the number of processors in a supercomputer grows, "component failure rate" begins to become a serious issue. If a supercomputer uses thousands of nodes, each of which may fail once per year on the average, then the system will experience several node failures each day.{{cite book|title=Computational science -- ICCS 2005. 5th international conference, Atlanta, GA, USA, May 22-25, 2005 : proceedings|year=2005|publisher=Springer|location=Berlin|isbn=978-3-540-26043-1|edition=1st|editor=Vaidy S. Sunderam|pages=60–67}}
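
For illustration, assuming a machine with 10,000 nodes and a mean rate of one failure per node per year (the figures are illustrative, in line with the estimate above), the expected number of node failures on a typical day is roughly

:<math>10\,000 \times \tfrac{1}{365} \approx 27.</math>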

As the price/performance of general-purpose graphics processing units (GPGPUs) has improved, a number of petaflop supercomputers such as Tianhe-I and Nebulae have started to rely on them.{{cite web|last=Prickett Morgan|first=Timothy|title=Top 500 supers – The Dawning of the GPUs|url=https://www.theregister.co.uk/2010/05/31/top_500_supers_jun2010/|work=The Register|date=31 May 2010}} However, other systems such as the K computer continue to use conventional processors such as SPARC-based designs, and the overall applicability of GPGPUs in general-purpose high-performance computing applications has been the subject of debate: while a GPGPU may be tuned to score well on specific benchmarks, its applicability to everyday algorithms may be limited unless significant effort is spent tuning the application for it.{{cite book|author1=Rainer Keller|author2=David Kramer|author3=Jan-Philipp Weiss|title=Facing the Multicore-Challenge: Aspects of New Paradigms and Technologies in Parallel Computing|url=https://books.google.com/books?id=-luqXPiew_UC&pg=PA118|access-date=15 June 2012|date=1 December 2010|publisher=Springer|isbn=978-3-642-16232-9|pages=118–121}} Nevertheless, GPUs are gaining ground, and in 2012 the Jaguar supercomputer was transformed into Titan by retrofitting CPUs with GPUs.{{cite journal|last=Poeter|first=Damon|title=Cray's Titan Supercomputer for ORNL could be world's fastest|journal=PC Magazine|date=11 October 2011|url=https://www.pcmag.com/article2/0,2817,2394515,00.asp}}{{cite web|last=Feldman|first=Michael|title=GPUs Will Morph ORNL's Jaguar Into 20-Petaflop Titan|work=HPC Wire|date=11 October 2011|url=http://www.hpcwire.com/hpcwire/2011-10-11/gpus_will_morph_ornl_s_jaguar_into_20-petaflop_titan.html}}{{cite web|last=Prickett Morgan|first=Timothy|title=Oak Ridge changes Jaguar's spots from CPUs to GPUs|website=The Register |date=11 October 2011|url=https://www.theregister.co.uk/2011/10/11/oak_ridge_cray_nvidia_titan/}}

As the number of independent processors in a supercomputer increases, the way they access data in the file system and how they share and access secondary storage resources become prominent issues. Over the years a number of systems for distributed file management have been developed, e.g., the IBM General Parallel File System, BeeGFS, the Parallel Virtual File System, Hadoop, etc.{{cite book|title=Euro-Par 2009 parallel processing workshops : HPPC, HeteroPar, PROPER, ROIA, UNICORE, VHPC, Delft, The Netherlands, August 25-28, 2009; workshops|year=2010|publisher=Springer|location=Berlin|isbn=978-3-642-14121-8|page=345|edition=Online-Ausg.|editor1=Hai-Xiang Lin |editor2=Michael Alexander |editor3=Martti Forsell }}{{cite book|author1=Reiner Dumke|author2=René Braungarten|author3=Günter Büren|title=Software Process and Product Measurement: International Conferences, IWSM 2008, MetriKon 2008, and Mensura 2008, Munich, Germany, November 18-19, 2008 : Proceedings|url=https://books.google.com/books?id=5OiwaRX6g5YC|access-date=15 June 2012|date=3 December 2008|publisher=Springer|isbn=978-3-540-89402-5|pages=144–117}} A number of supercomputers on the TOP100 list, such as Tianhe-I, use the Linux-based Lustre file system.
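
How such parallel file systems are typically used can be sketched with a short program. The following minimal example is illustrative only: it assumes an MPI installation and a mounted parallel file system, and the file name is hypothetical. Every process writes its own block of a single shared file at a distinct offset, and the underlying file system (e.g. Lustre or GPFS) stripes that file across its storage servers.

<syntaxhighlight lang="c">
/* Minimal MPI-IO sketch: every rank writes its own block of one shared file.
   Illustrative only; assumes an MPI installation and a parallel file system.
   Compile with: mpicc -o pio pio.c */
#include <mpi.h>
#include <stdio.h>

#define BLOCK 1024  /* integers written by each rank */

int main(int argc, char **argv)
{
    int rank, i, buf[BLOCK];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < BLOCK; i++)
        buf[i] = rank;  /* fill the block with this rank's id */

    /* All ranks open the same file collectively. */
    MPI_File_open(MPI_COMM_WORLD, "shared_output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes at its own offset, so no two ranks overlap. */
    MPI_Offset offset = (MPI_Offset)rank * BLOCK * sizeof(int);
    MPI_File_write_at_all(fh, offset, buf, BLOCK, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
</syntaxhighlight>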

==Early systems with a few processors==

{{See also|History of supercomputing}}

The CDC 6600 series of computers were very early attempts at supercomputing and gained their advantage over the existing systems by relegating work to peripheral devices, freeing the central processing unit (CPU) to process actual data. With the Minnesota FORTRAN compiler the 6600 could sustain 500 kiloflops on standard mathematical operations.{{cite journal|last=Frisch|first=Michael J.|title=Remarks on algorithm 352 [S22], algorithm 385 [S13], algorithm 392 [D3]|journal=Communications of the ACM|date=December 1972|volume=15|issue=12|page=1074|doi=10.1145/361598.361914|s2cid=6571977|doi-access=free}}

[[File:Cray2.jpeg|thumb|The compact design of the Cray-2 relied on centralized access, keeping distances short and uniform.{{cite book|last1=Hill|first1=Mark D.|title=Readings in computer architecture|year=2000|publisher=Morgan Kaufmann|location=San Francisco|isbn=978-1-55860-539-8|pages=40–49|author-link2=Norman Jouppi|first2=Norman P.|last2=Jouppi|first3=Gurindar|last3=Sohi}}]]

Other early supercomputers, such as the Cray 1 and the Cray 2 that appeared afterwards, used a small number of fast processors that worked in harmony and were uniformly connected to the largest amount of shared memory that could be managed at the time.

These early architectures introduced parallel processing at the processor level, with innovations such as vector processing, in which the processor can perform several operations during one clock cycle, rather than having to wait for successive cycles.
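
The same idea survives in today's SIMD units. As a rough, minimal sketch (not drawn from any particular supercomputer), the loop below is one that a modern vectorizing compiler can map onto instructions operating on several array elements at once:

<syntaxhighlight lang="c">
/* Illustrative only: a simple loop that a vectorising compiler can turn into
   SIMD instructions, echoing the vector-processing idea of applying one
   operation to many operands at once.  Compile with e.g.: gcc -O3 axpy.c */
#include <stdio.h>
#include <stddef.h>

/* y[i] = a*x[i] + y[i] -- the classic "axpy" kernel */
static void axpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    double x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    double y[8] = {0};

    axpy(8, 2.0, x, y);           /* y becomes 2*x */
    printf("y[7] = %f\n", y[7]);  /* prints 16.000000 */
    return 0;
}
</syntaxhighlight>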

In time, as the number of processors increased, different architectural issues emerged.

Two issues that need to be addressed as the number of processors increases are the distribution of memory and processing. In the distributed memory approach, each processor is physically packaged close to some local memory. The memory associated with other processors is then "further away", based on bandwidth and latency parameters, in non-uniform memory access.

In the 1960s pipelining was viewed as an innovation, and by the 1970s the use of vector processors had been well established. By the 1980s, many supercomputers used parallel vector processors.{{cite book|last=Hoffman|first=Allan R.|title=Supercomputers : directions in technology and applications|year=1989|publisher=National Academy Press|location=Washington, D.C.|isbn=978-0-309-04088-4|pages=35–47}}

The relatively small number of processors in early systems allowed them to easily use a shared memory architecture, which allows processors to access a common pool of memory. In the early days a common approach was the use of uniform memory access (UMA), in which access time to a memory location was similar between processors. The use of non-uniform memory access (NUMA) allowed a processor to access its own local memory faster than other memory locations, while cache-only memory architectures (COMA) allowed for the local memory of each processor to be used as cache, thus requiring coordination as memory values changed.{{cite book|last=El-Rewini|first=Hesham|title=Advanced computer architecture and parallel processing|year=2005|publisher=Wiley-Interscience|location=Hoboken, NJ|isbn=978-0-471-46740-3|pages=77–80|author2=Mostafa Abd-El-Barr }}
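
The practical consequence of non-uniform access can be sketched in code. The example below is a minimal, illustrative sketch (not tied to any particular supercomputer) that assumes a Linux system with the libnuma library installed: it pins the running thread to one memory node, allocates one buffer there and another on a different node, and timing accesses to the two buffers would expose the local-versus-remote latency gap.

<syntaxhighlight lang="c">
/* Minimal NUMA sketch using libnuma (Linux).  Illustrative only; requires a
   multi-node NUMA machine.  Compile with: gcc numa_demo.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <string.h>

#define SIZE (64UL * 1024 * 1024)   /* 64 MiB per buffer */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    int last = numa_max_node();          /* highest NUMA node id */
    numa_run_on_node(0);                 /* pin this thread to node 0 */

    char *local  = numa_alloc_onnode(SIZE, 0);     /* memory on our node */
    char *remote = numa_alloc_onnode(SIZE, last);  /* memory on another node
                                                      (same node if only one exists) */
    if (!local || !remote) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    /* Touching the pages places them on the requested nodes; timing these
       two loops would reveal the local/remote access-latency difference. */
    memset(local, 1, SIZE);
    memset(remote, 1, SIZE);

    numa_free(local, SIZE);
    numa_free(remote, SIZE);
    return 0;
}
</syntaxhighlight>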

As the number of processors increases, efficient interprocessor communication and synchronization on a supercomputer becomes a challenge. A number of approaches may be used to achieve this goal. For instance, in the early 1980s, in the Cray X-MP system, shared registers were used. In this approach, all processors had access to shared registers that did not move data back and forth but were only used for interprocessor communication and synchronization. However, inherent challenges in managing a large amount of shared memory among many processors resulted in a move to more distributed architectures.{{cite book|author=J. J. Dongarra|author2=L. Grandinetti|author3=J. Kowalik|author4=G.R. Joubert|title=High Performance Computing: Technology, Methods and Applications|url=https://books.google.com/books?id=iqSWDaSFNvkC&pg=PR4|access-date=15 June 2012|date=13 September 1995|publisher=Elsevier|isbn=978-0-444-82163-8|pages=123–125}}
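
As a loose modern analogy (not the actual Cray X-MP hardware mechanism), the sketch below uses a single shared atomic counter purely for synchronization, while the bulk data stays in ordinary memory and is never funnelled through the synchronization variable:

<syntaxhighlight lang="c">
/* Loose analogy to shared synchronization registers, using C11 threads and
   atomics: a shared counter coordinates the workers, but carries no data.
   Compile with: gcc -std=c11 sync_demo.c -lpthread */
#include <stdatomic.h>
#include <threads.h>
#include <stdio.h>

#define NTHREADS 4

atomic_int arrived = 0;   /* plays the role of a shared "sync register" */
double partial[NTHREADS]; /* bulk data stays in ordinary memory */

int worker(void *arg)
{
    int id = *(int *)arg;
    partial[id] = id * 1.0;          /* do some local work */
    atomic_fetch_add(&arrived, 1);   /* signal completion via the counter */
    return 0;
}

int main(void)
{
    thrd_t t[NTHREADS];
    int ids[NTHREADS];

    for (int i = 0; i < NTHREADS; i++) {
        ids[i] = i;
        thrd_create(&t[i], worker, &ids[i]);
    }

    /* Spin on the shared counter until every worker has checked in,
       much as hardware synchronization registers were polled. */
    while (atomic_load(&arrived) < NTHREADS)
        ;

    double sum = 0.0;
    for (int i = 0; i < NTHREADS; i++)
        sum += partial[i];
    printf("sum = %f\n", sum);

    for (int i = 0; i < NTHREADS; i++)
        thrd_join(t[i], NULL);
    return 0;
}
</syntaxhighlight>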

==Massive centralized parallelism==

{{See also|Computer cluster|Massively parallel (computing)}}

[[File:BlueGeneL cabinet.jpg|thumb|A Blue Gene/L cabinet showing the stacked blades, each holding many processors]]

During the 1980s, as the demand for computing power increased, the trend to a much larger number of processors began, ushering in the age of massively parallel systems, with distributed memory and distributed file systems, given that shared memory architectures could not scale to a large number of processors.{{cite book |author=Greg Astfalk |url=https://books.google.com/books?id=43cfAvRSSAAC&pg=PR4 |title=Applications on Advanced Architecture Computers |publisher=SIAM |year=1996 |isbn=978-0-89871-368-8 |pages=62 |access-date=15 June 2012}} Hybrid approaches such as distributed shared memory also appeared after the early systems.{{cite book|author1=Jelica Protić|author2=Milo Tomašević|author3=Milo Tomasevic|author4=Veljko Milutinović|title=Distributed shared memory: concepts and systems|url=https://books.google.com/books?id=Jd1QAAAAMAAJ|access-date=15 June 2012|year=1998|publisher=IEEE Computer Society Press|isbn=978-0-8186-7737-3|pages=ix–x}}

The computer clustering approach connects a number of readily available computing nodes (e.g. personal computers used as servers) via a fast, private local area network.{{cite book|title=Network-based information systems : first international conference, NBiS 2007, Regensburg, Germany, September 3-7, 2007 : proceedings|year=2007|publisher=Springer|location=Berlin|isbn=978-3-540-74572-3|editor1=Tomoya Enokido |editor2=Leonard Barolli |editor3=Makoto Takizawa |page=375}} The activities of the computing nodes are orchestrated by "clustering middleware", a software layer that sits atop the nodes and allows the users to treat the cluster as by and large one cohesive computing unit, e.g. via a single system image concept.
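
A minimal sketch of this "one cohesive unit" view, assuming an MPI-based cluster (the launcher options and hostfile name below are illustrative), is a program in which every process simply reports which node it landed on; the middleware decides the placement:

<syntaxhighlight lang="c">
/* Minimal sketch of a cluster behaving as one unit: every process reports
   its rank and the node it runs on.  Assumes an MPI installation.
   Compile with: mpicc -o hello_cluster hello_cluster.c */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char node[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(node, &len);

    printf("rank %d of %d running on node %s\n", rank, size, node);

    MPI_Finalize();
    return 0;
}
</syntaxhighlight>

With Open MPI, for example, <code>mpirun -np 64 --hostfile nodes.txt ./hello_cluster</code> starts 64 such processes spread across the machines listed in the (hypothetical) <code>nodes.txt</code>, yet the user interacts with them as a single parallel job.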

Computer clustering relies on a centralized management approach which makes the nodes available as orchestrated shared servers. It is distinct from other approaches such as peer-to-peer or grid computing which also use many nodes, but with a far more distributed nature. By the 21st century, the TOP500 organization's semiannual list of the 500 fastest supercomputers often includes many clusters, e.g. the world's fastest in 2011, the K computer, with a distributed-memory cluster architecture.[https://web.archive.org/web/20120120015214/http://i.top500.org/sublist TOP500 list] (To view all clusters on the TOP500 list, select "cluster" as architecture from the "sublist menu" on the TOP500 site.){{cite book|last1=Yokokawa|first1=M.|date=22 August 2011|doi=10.1109/ISLPED.2011.5993668|pages=371–372|last2=Shoji|first2=Fumiyoshi|last3=Uno|first3=Atsuya|last4=Kurokawa|first4=Motoyoshi|last5=Watanabe|first5=Tadashi|title=IEEE/ACM International Symposium on Low Power Electronics and Design |chapter=The K computer: Japanese next-generation supercomputer development project |isbn=978-1-61284-658-3|s2cid=13436840}}

When a large number of local semi-independent computing nodes are used (e.g. in a cluster architecture) the speed and flexibility of the interconnect becomes very important. Modern supercomputers have taken different approaches to address this issue, e.g. Tianhe-1 uses a proprietary high-speed network based on the Infiniband QDR, enhanced with FeiTeng-1000 CPUs. On the other hand, the Blue Gene/L system uses a three-dimensional torus interconnect with auxiliary networks for global communications.{{cite web|last=Knight|first=Will|title=IBM creates world's most powerful computer|url=https://www.newscientist.com/article/dn12145-ibm-creates-worlds-most-powerful-computer.html|work=New Scientist|date=27 June 2007}} In this approach each node is connected to its six nearest neighbors. A similar torus was used by the Cray T3E.{{cite journal|last=Adiga |first=N. R. |author2=Blumrich, M. A.; Chen, D.; Coteus, P.; Gara, A.; Giampapa, M. E.; Heidelberger, P.; Singh, S.; Steinmacher-Burow, B. D.; Takken, T.; Tsao, M.; Vranas, P. |title=Blue Gene/L torus interconnection network |journal=IBM Journal of Research and Development |date=March 2005 |volume=49 |issue=2.3 |pages=265–276 |doi=10.1147/rd.492.0265 |url=http://www.cc.gatech.edu/classes/AY2008/cs8803hpc_spring/papers/bgLtorusnetwork.pdf |url-status=dead |archive-url=https://web.archive.org/web/20110815102821/http://www.cc.gatech.edu/classes/AY2008/cs8803hpc_spring/papers/bgLtorusnetwork.pdf |archive-date=2011-08-15 }}
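
The nearest-neighbour pattern that such a torus supports can be sketched with MPI's Cartesian topology support; the example below is illustrative and uses no vendor-specific interconnect API. Each rank exchanges a value with its six neighbours in the ±x, ±y and ±z directions, with wrap-around in every dimension.

<syntaxhighlight lang="c">
/* Illustrative 3-D torus-style nearest-neighbour exchange using MPI's
   Cartesian topology support (not a vendor interconnect API).
   Compile with: mpicc -o torus torus.c ; run with e.g. mpirun -np 64 ./torus */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int size, rank;
    int dims[3] = {0, 0, 0};      /* let MPI choose a 3-D factorisation */
    int periods[3] = {1, 1, 1};   /* wrap around in every dimension: a torus */
    MPI_Comm torus;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Dims_create(size, 3, dims);
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &torus);
    MPI_Comm_rank(torus, &rank);

    double sendval = (double)rank, recvval;

    /* For each of the three dimensions, exchange with the +1 and -1 neighbours. */
    for (int dim = 0; dim < 3; dim++) {
        int prev, next;
        MPI_Cart_shift(torus, dim, 1, &prev, &next);

        /* send "up" the dimension and receive from "below", then the reverse */
        MPI_Sendrecv(&sendval, 1, MPI_DOUBLE, next, 0,
                     &recvval, 1, MPI_DOUBLE, prev, 0,
                     torus, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&sendval, 1, MPI_DOUBLE, prev, 1,
                     &recvval, 1, MPI_DOUBLE, next, 1,
                     torus, MPI_STATUS_IGNORE);
    }

    if (rank == 0)
        printf("3-D torus of %d x %d x %d ranks\n", dims[0], dims[1], dims[2]);

    MPI_Comm_free(&torus);
    MPI_Finalize();
    return 0;
}
</syntaxhighlight>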

Massive centralized systems at times use special-purpose processors designed for a specific application, and may use field-programmable gate array (FPGA) chips to gain performance by sacrificing generality. Examples of special-purpose supercomputers include Belle,Condon, J.H. and K. Thompson, "Belle Chess Hardware", in Advances in Computer Chess 3 (ed. M.R.B. Clarke), Pergamon Press, 1982. Deep Blue,{{Cite book|last=Hsu|first=Feng-hsiung|author-link=Feng-hsiung Hsu|year=2002|title=Behind Deep Blue: Building the Computer that Defeated the World Chess Champion|publisher=Princeton University Press|isbn=978-0-691-09065-8}} and Hydra,{{cite book|last=Donninger|first=Chrilly|author2=Ulf Lorenz |title=Field Programmable Logic and Application |chapter=The Chess Monster Hydra |year=2004|volume=3203|pages=927–932|doi=10.1007/978-3-540-30117-2_101|series=Lecture Notes in Computer Science|isbn=978-3-540-22989-6|s2cid=5467762}} for playing chess, Gravity Pipe for astrophysics,{{cite book|last=Makino|first=Junichiro|title=Scientific simulations with special purpose computers : the GRAPE systems|year=1998|publisher=Wiley|location=Chichester [u.a.]|isbn=978-0-471-96946-4|author2=Makoto Taiji }} MDGRAPE-3 for protein structure computation and molecular dynamics,RIKEN press release, [http://www.riken.jp/engn/r-world/info/release/press/2006/060619/index.html Completion of a one-petaflops computer system for simulation of molecular dynamics] {{Webarchive|url=https://web.archive.org/web/20121202053547/http://www.riken.jp/engn/r-world/info/release/press/2006/060619/index.html |date=2012-12-02 }} and Deep Crack,{{cite book |title=Cracking DES – Secrets of Encryption Research, Wiretap Politics & Chip Design |author=Electronic Frontier Foundation |isbn=978-1-56592-520-5 |publisher=Oreilly & Associates Inc |year=1998 |url=https://archive.org/details/crackingdes00elec }} for breaking the DES cipher.

==Massive distributed parallelism==

{{Main|Grid computing|Quasi-opportunistic supercomputing}}

[[File:ArchitectureCloudLinksSameSite.png|thumb]]

Grid computing uses a large number of computers in distributed, diverse administrative domains. It is an opportunistic approach which uses resources whenever they are available.{{cite book|last=Prodan|first=Radu|title=Grid computing experiment management, tool integration, and scientific workflows|year=2007|publisher=Springer|location=Berlin|isbn=978-3-540-69261-4|pages=1–4|author2=Thomas Fahringer }} An example is BOINC, a volunteer-based, opportunistic grid system.{{cite book|last=Vega|first=Francisco Fernández de Vega|title=Parallel and distributed computational intelligence|year=2010|publisher=Springer-Verlag|location=Berlin|isbn=978-3-642-10674-3|pages=65–68|edition=Online-Ausg.|editor=Erick Cantú-Paz}} Some BOINC applications have reached multi-petaflop levels by using close to half a million computers connected over the internet, whenever volunteer resources become available.[http://www.boincstats.com/stats/project_graph.php?pr=bo BOINC statistics, 2011] {{Webarchive|url=https://web.archive.org/web/20100919090657/http://boincstats.com/stats/project_graph.php?pr=bo |date=2010-09-19 }} However, these types of results often do not appear in the TOP500 ratings because they do not run the general-purpose Linpack benchmark.

Although grid computing has had success in parallel task execution, demanding supercomputer applications such as weather simulations or computational fluid dynamics have remained out of reach, partly due to the barriers in reliable sub-assignment of a large number of tasks as well as the reliable availability of resources at a given time.{{cite book|title=Languages and compilers for parallel computing : 22nd international workshop, LCPC 2009, Newark, DE, USA, October 8-10, 2009, revised selected papers|year=2010|publisher=Springer|location=Berlin|isbn=978-3-642-13373-2|pages=10–11|edition=1st|editor=Guang R. Gao}}{{cite book|title=Euro-par 2010, Parallel Processing Workshops Heteropar, Hpcc, Hibb, Coregrid, Uchpc, Hpcf, Proper, Ccpi, Vhpc, Iscia, Italy, August 31 - September 3, 2010.|publisher=Springer-Verlag New York Inc|location=Berlin [u.a.]|isbn=978-3-642-21877-4|pages=274–277|editor=Mario R. Guarracino|date=2011-06-24}}

In quasi-opportunistic supercomputing a large number of geographically dispersed computers are orchestrated with built-in safeguards. The quasi-opportunistic approach goes beyond volunteer computing on highly distributed systems such as BOINC, or general grid computing on a system such as Globus, by allowing the middleware to provide almost seamless access to many computing clusters so that existing programs in languages such as Fortran or C can be distributed among multiple computing resources.

Quasi-opportunistic supercomputing aims to provide a higher quality of service than opportunistic resource sharing.{{cite book|title=Computational science -- ICCS 2008 : 8th international conference, Krakow, Poland, June 23-25, 2008; proceedings|year=2008|publisher=Springer|location=Berlin|isbn=978-3-540-69383-3|edition=Online-Ausg.|editor=Marian Bubak|pages=112–113}} The quasi-opportunistic approach enables the execution of demanding applications within computer grids by establishing grid-wise resource allocation agreements; and fault tolerant message passing to abstractly shield against the failures of the underlying resources, thus maintaining some opportunism, while allowing a higher level of control.{{cite journal|last=Kravtsov|first=Valentin|author2=David Carmeli |author3=Werner Dubitzky |author4=Ariel Orda |author5=Assaf Schuster |author6=Benny Yoshpa |title=Quasi-opportunistic supercomputing in grids|journal=IEEE International Symposium on High Performance Distributed Computing|year=2007|pages=233–244|url=http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.135.8993}}{{cite book|title=Computational science - ICCS 2009 : 9th international conference, Baton Rouge, LA, USA, May 25-27, 2009; proceedings|year=2009|publisher=Springer|location=Berlin|isbn=978-3-642-01969-2|pages=387–388|editor=Gabrielle Allen|editor-link= Gabrielle Allen}}

==See also==

==References==