{{short description|Facilities containing Google servers}}
{{Use American English|date=July 2021}}
{{Use mdy dates|date=July 2021}}
{{external media
| float = right
| caption = YouTube
| headerimage= File:YouTube 2024.svg
| video1 = [https://www.youtube.com/watch?v=zDAYZU4A3w0&t=17s Google Data Center 360° Tour]
}}
Google data centers are the large data center facilities Google uses to provide their services, which combine large drives, computer nodes organized in aisles of racks, internal and external networking, environmental controls (mainly cooling and humidification control), and operations software (especially as concerns load balancing and fault tolerance).
There is no official data on how many servers are in Google data centers, but Gartner estimated in a July 2016 report that Google at the time had 2.5 million servers. This number is changing as the company expands capacity and refreshes its hardware.{{cite web |title=How Many Servers Does Google Have? |url=https://www.datacenterknowledge.com/archives/2017/03/16/google-data-center-faq |website=Data Center Knowledge |date=March 16, 2017 |access-date=September 20, 2018}}
Locations
The locations of Google's various data centers by continent are as follows:{{cite web|url=https://about.google/locations/?region=north-america&office=mountain-view|title=Google data centers, locations|access-date=July 21, 2014}}{{Cite web |title=ISO/IEC 27001 - Compliance |url=https://cloud.google.com/security/compliance/iso-27001 |access-date=2023-07-11 |website=Google Cloud |language=en}}
Hardware
= Original hardware =
The original hardware (circa 1998) that was used by Google when it was located at Stanford University included:{{cite web|url=http://google.stanford.edu/googlehardware.html |title="Google Stanford Hardware" |access-date=March 23, 2017 |url-status=bot: unknown |archive-url=https://web.archive.org/web/19990209043945/http://google.stanford.edu/googlehardware.html |archive-date=February 9, 1999 }}. Stanford University (provided by Internet Archive). Retrieved on July 10, 2006.
- Sun Microsystems Ultra II with dual 200 MHz processors, and 256 MB of RAM. This was the main machine for the original Backrub system.
- 2 × 300 MHz dual Pentium II servers donated by Intel; they included 512 MB of RAM and 10 × 9 GB hard drives between the two. The main search ran on these machines.
- F50 IBM RS/6000 donated by IBM, included 4 processors, 512 MB of memory and 8 × 9 GB hard disk drives.
- Two additional boxes included 3 × 9 GB hard drives and 6 × 4 GB hard disk drives respectively (the original storage for Backrub). These were attached to the Sun Ultra II.
- SSD disk expansion box with another 8 × 9 GB hard disk drives donated by IBM.
- Homemade disk box which contained 10 × 9 GB SCSI hard disk drives.
= Google Cluster =
The state of Google infrastructure in 2003 was described in a report by Luiz André Barroso, Jeff Dean, and Urs Hölzle as a "reliable computing infrastructure from clusters of unreliable commodity PCs".{{Cite journal |last1=Barroso |first1=L.A. |last2=Dean |first2=J. |last3=Holzle |first3=U. |date=March 2003 |title=Web search for a planet: The Google cluster architecture |url=https://ieeexplore.ieee.org/document/1196112 |journal=IEEE Micro |volume=23 |issue=2 |pages=22–28 |doi=10.1109/MM.2003.1196112 |issn=1937-4143}}
On average, a single search query reads ~100 MB of data and consumes tens of billions of CPU cycles. During peak time, Google serves ~1000 queries per second. To handle this peak load, they built a compute cluster with ~15,000 commodity-class PCs instead of expensive supercomputer hardware to save money. To make up for the lower hardware reliability, they wrote fault-tolerant software.
The cluster consists of 5 parts. Central Google Web servers (GWS) face the public Internet. Upon receiving a user request, the Google Web server communicates with a spell checker, an advertisement server, many index servers, and many document servers. Each of these 4 parts responds to a part of the request, and the GWS assembles their responses and serves the final response to the user.
The raw documents were ~100 TB, and the index files were ~10 TB. The index files are sharded, and each shard is served by a "pool" of index servers. Similarly, the raw documents are also sharded. Each query to the index file results in a list of document IDs, which are then sent to the document servers to retrieve the title and the keyword-in-context snippets.
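The two-phase flow described above (index shards return document IDs, then document servers return titles and keyword-in-context snippets) can be illustrated with a minimal sketch. The shard contents, the modulo routing of document IDs, and the function names are all hypothetical and chosen only for illustration; they are not taken from the Google paper.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy data: each index shard maps a word to the doc IDs it holds.
INDEX_SHARDS = [
    {"google": [0, 2], "cluster": [0]},      # shard 0 indexes docs 0 and 2
    {"google": [1], "cluster": [1, 3]},      # shard 1 indexes docs 1 and 3
]

# Hypothetical document shards: doc ID -> (title, text), routed by doc_id % 2.
DOC_SHARDS = [
    {0: ("Cluster design", "the google cluster uses commodity pcs"),
     2: ("Web servers", "google web servers face the internet")},
    {1: ("Search", "a cluster behind google search"),
     3: ("Scheduling", "cluster scheduling basics")},
]


def search_shard(shard, words):
    """Return the doc IDs in one index shard that contain every query word."""
    postings = [set(shard.get(w, [])) for w in words]
    return set.intersection(*postings) if postings else set()


def snippet(text, word, width=2):
    """Return a keyword-in-context snippet: `width` tokens around `word`."""
    tokens = text.split()
    i = tokens.index(word)
    return " ".join(tokens[max(0, i - width): i + width + 1])


def handle_query(query):
    """Phase 1: ask all index shards in parallel. Phase 2: fetch snippets."""
    words = query.lower().split()
    with ThreadPoolExecutor() as pool:
        partial = pool.map(lambda s: search_shard(s, words), INDEX_SHARDS)
    doc_ids = sorted(set().union(*partial))
    results = []
    for doc_id in doc_ids:
        title, text = DOC_SHARDS[doc_id % len(DOC_SHARDS)][doc_id]
        results.append((doc_id, title, snippet(text, words[0])))
    return results


print(handle_query("google cluster"))
```

Because each shard answers independently, the only communication is the merge of document IDs and the snippet fetch, which is the property the sharded design relies on.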
There were several CPU generations in use, ranging from single-processor 533 MHz Intel Celeron-based servers to dual 1.4 GHz Intel Pentium III servers. Each server contains one or more hard drives, 80 GB each. Index servers have less disk space than document servers. Each rack has two Ethernet switches, one per side. The servers on each side interconnect via that side's 100-Mbps Ethernet switch. Each switch has a ~250 MB/sec uplink to a central switch that connects to all racks.
The design objectives include:
- Use low-reliability consumer hardware and make up for it with fault-tolerant software.
- Maximize parallelism, such as by splitting a single document match lookup in a large index into a MapReduce over many small indices.
- Partition index data and computation to minimize communication and evenly balance the load across servers, because the cluster is built from networked commodity PCs rather than a large shared-memory machine.
- Minimize system management overheads by developing all software in-house.
- Pick hardware that maximizes performance/price, not absolute performance.
- Pick hardware that favors high throughput over low latency. This is because queries are served with massive parallelism, with very few dependent steps and minimal communication between servers, so high latency does not matter.
Due to the massive parallelism, scaling up hardware scales up the throughput linearly, i.e. doubling the compute cluster doubles the number of queries servable per second.
The cluster is made of server racks in 2 configurations: 40 × 1U per side with 2 sides, or 20 × 2U per side with 2 sides. Each rack draws 10 kW, at a density of 400 W/ft², consuming 10 MWh and costing $1,500 per month.
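The per-rack figures above can be checked with a short calculation. This is a rough sketch under stated assumptions: a 30-day month and an electricity price of $0.15 per kWh, which is the rate that reproduces the quoted $1,500. A 10 kW rack running continuously uses about 7.2 MWh per month, so one plausible reading is that the quoted 10 MWh also covers overhead such as cooling.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: a 30-day month


def monthly_energy_mwh(power_kw, hours=HOURS_PER_MONTH):
    """Energy drawn by a constant load over one month, in MWh."""
    return power_kw * hours / 1000


def monthly_cost_usd(energy_mwh, usd_per_kwh):
    """Electricity cost for a given monthly energy use."""
    return energy_mwh * 1000 * usd_per_kwh


it_load_mwh = monthly_energy_mwh(10)         # 10 kW rack, from the text
quoted_mwh = 10                              # monthly figure from the text
overhead_factor = quoted_mwh / it_load_mwh   # implied overhead multiplier
cost = monthly_cost_usd(quoted_mwh, 0.15)    # assumed price: $0.15 per kWh

print(f"IT load alone: {it_load_mwh:.1f} MWh/month")
print(f"Implied overhead factor: {overhead_factor:.2f}")
print(f"Monthly cost at assumed rate: ${cost:,.0f}")
```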
= Production hardware =
As of 2014, Google has used a heavily customized version of Debian Linux. They migrated from a Red Hat-based system incrementally in 2013.{{Cite web|url=https://events.static.linuxfound.org/sites/events/files/lcjp13_merlin.pdf|title=Case Study: Live upgrading many thousand of servers from an ancient Red Hat distribution to a 10 year newer Debian based one|last=Merlin|first=Marc|date=2013|website=Linux Foundation|access-date=June 9, 2017}}
The customization goal is to purchase CPU generations that offer the best performance per dollar, not absolute performance. How this is measured is unclear, but it is likely to incorporate running costs of the entire server, and CPU power consumption could be a significant factor.{{cite book|title=Strategies for E-business|author1=Tawfik Jelassi |author2=Albrecht Enders |chapter=Case study 16 — Google|publisher=Pearson Education|year=2004|isbn=978-0-273-68840-2|page=424}} Servers as of 2009–2010 consisted of custom-made open-top systems containing two processors (each with several cores), a considerable amount of RAM spread over 8 DIMM slots housing double-height DIMMs, and at least two SATA hard disk drives connected through a non-standard ATX-sized power supply unit.{{YouTube|M5wfv7RE_J4|Google's secret power supplies}} The servers were open top so more servers could fit into a rack. According to CNET and a book by John Hennessy, each server had a novel 12-volt battery to reduce costs and improve power efficiency.Computer Architecture, Fifth Edition: A Quantitative Approach, {{ISBN|978-0123838728}}; Chapter Six; 6.7 "A Google Warehouse-Scale Computer" [https://books.google.com/books?id=gQ-fSqbLfFoC&q=google+ups&pg=PA471 page 471] "Designing motherboards that only need a single 12-volt supply so that the UPS function could be supplied by standard batteries associated with each server"[https://www.cnet.com/news/google-uncloaks-once-secret-server-10209580/ Google uncloaks once-secret server], April 1, 2009.
According to Google, their global data center operation electrical power ranges between 500 and 681 megawatts.{{Cite web|url=https://sustainability.google/|title=Google Sustainability|website=Google Sustainability}}{{Cite web |url=http://www.analyticspress.com/datacenters.html |title=Analytics Press Growth in data center electricity use 2005 to 2010 |access-date=May 22, 2012 |archive-url=https://web.archive.org/web/20120111082558/http://www.analyticspress.com/datacenters.html |archive-date=January 11, 2012 |url-status=dead }}
The combined processing power of these servers might have reached from 20 to 100 petaflops in 2008.[http://blogs.nmscommunications.com/communications/2008/05/google-surpasses-supercomputer-community-unnoticed.html Google Surpasses Supercomputer Community, Unnoticed?] {{Webarchive|url=https://web.archive.org/web/20081205075355/http://blogs.nmscommunications.com/communications/2008/05/google-surpasses-supercomputer-community-unnoticed.html |date=December 5, 2008 }}, May 20, 2008.
= Network topology =
Details of the Google worldwide private networks are not publicly available, but Google publications{{Citation | url = http://research.google.com/pubs/pub36603.html | title = Research | year = 2010 | volume = 48 | issue = 7 | contribution = Fiber Optic Communication Technologies: What's Needed for Datacenter Network Operations| last1 = Lam | first1 = Cedric F. | last2 = Liu | first2 = Hong | last3 = Koley | first3 = Bikash | last4 = Zhao | first4 = Xiaoxue | last5 = Kamalov | first5 = Valey | last6 = Gill | first6 = Vijay }}{{Citation | url = https://storage.googleapis.com/pub-tools-public-publication-data/pdf/36936.pdf|page=4|author=Lam, Cedric F.|date=2010 |title=FTTH look ahead — technologies & architectures}} make references to the "Atlas Top 10" report that ranks Google as the third largest ISP behind Level 3.
In order to run such a large network, with direct connections to as many ISPs as possible at the lowest possible cost, Google has a very open peering policy.{{Citation | url = http://www.peeringdb.com/view.php?asn=15169 | contribution = kumara ASN15169 | title = Peering DB}}
According to PeeringDB, the Google network can be accessed from 67 public exchange points and 69 different locations across the world. As of May 2012, Google had 882 Gbit/s of public connectivity (not counting private peering agreements that Google has with the largest ISPs). This public network is used to distribute content to Google users as well as to crawl the internet to build its search indexes.
The private side of the network is not publicly documented, but a disclosure from Google{{Citation | url = http://opennetsummit.org/speakers.html | contribution = Urs Holzle | title = Speakers | publisher = Open Network Summit | access-date = May 22, 2012 | archive-url = https://web.archive.org/web/20120510132234/http://opennetsummit.org/speakers.html | archive-date = May 10, 2012 | url-status = dead }} indicates that they use custom-built high-radix switch-routers (with a capacity of 128 × 10 Gigabit Ethernet ports) for the wide area network. Assuming no fewer than two routers per datacenter (for redundancy), the Google network scales in the terabit per second range (with two fully loaded routers, the bisection bandwidth amounts to 1,280 Gbit/s).
These custom switch-routers are connected to DWDM devices to interconnect data centers and points of presence (PoPs) via dark fiber.
From a datacenter view, the network starts at the rack level, where 19-inch racks are custom-made and contain 40 to 80 servers (20 to 40 1U servers on either side, while new servers are 2U rackmount systems).[http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/googlecluster-ieee.pdf Web Search for a Planet: The Google Cluster Architecture] (Luiz André Barroso, Jeffrey Dean, Urs Hölzle) Each rack has an Ethernet switch. Servers are connected via a 1 Gbit/s Ethernet link to the top-of-rack (TOR) switch. TOR switches are then connected to a gigabit cluster switch using multiple gigabit or ten-gigabit uplinks.{{Cite web |url=http://bnrg.eecs.berkeley.edu/~randy/Courses/CS294.F09/wharehousesizedcomputers.pdf |title=Warehouse size computers |access-date=May 22, 2012 |archive-date=June 19, 2012 |archive-url=https://web.archive.org/web/20120619202125/http://bnrg.eecs.berkeley.edu/~randy/Courses/CS294.F09/wharehousesizedcomputers.pdf |url-status=dead }} The cluster switches themselves are interconnected and form the datacenter interconnect fabric (most likely using a dragonfly design rather than a classic butterfly or flattened butterfly layout[http://research.google.com/pubs/archive/37069.pdf Dennis Abts High Performance Datacenter Networks: Architectures, Algorithms, and Opportunities]).
From an operation standpoint, when a client computer attempts to connect to Google, several DNS servers resolve www.google.com into multiple IP addresses via a round-robin policy. This acts as the first level of load balancing and directs the client to different Google clusters. A Google cluster has thousands of servers, and once the client has connected to the server, additional load balancing is done to send the queries to the least loaded web server. This makes Google one of the largest and most complex content delivery networks.{{cite book|title=Network Programming in .NET|author=Fiach Reid|pages=[https://archive.org/details/networkprogrammi00fiac/page/251 251–253]|chapter=Case Study: The Google search engine|publisher=Digital Press|year=2004|isbn=978-1-55558-315-6|chapter-url=https://archive.org/details/networkprogrammi00fiac/page/251}}
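The two levels of load balancing described above can be sketched as follows. This is a simplified illustration, not Google's implementation: the class names, the server names, and the IP addresses (taken from the reserved documentation range 203.0.113.0/24) are hypothetical, and "load" is modeled as a simple count of dispatched requests.

```python
import itertools


class RoundRobinDNS:
    """First level: hand out cluster addresses in rotation."""

    def __init__(self, cluster_ips):
        self._cycle = itertools.cycle(cluster_ips)

    def resolve(self, hostname):
        # The hostname is ignored in this toy model; every lookup
        # simply returns the next cluster address in the rotation.
        return next(self._cycle)


class Cluster:
    """Second level: send each request to the least loaded web server."""

    def __init__(self, servers):
        self.load = {name: 0 for name in servers}

    def dispatch(self):
        # Ties are broken by insertion order of the server names.
        server = min(self.load, key=self.load.get)
        self.load[server] += 1
        return server


dns = RoundRobinDNS(["203.0.113.1", "203.0.113.2"])
cluster = Cluster(["gws-a", "gws-b"])

for _ in range(3):
    print(dns.resolve("www.google.com"), cluster.dispatch())
```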
Google has numerous data centers scattered around the world. At least 12 significant Google data center installations are located in the United States. The largest known centers are located in The Dalles, Oregon; Atlanta, Georgia; Reston, Virginia; Lenoir, North Carolina; and Moncks Corner, South Carolina.{{cite web | url = http://www.datacenterknowledge.com/archives/2008/03/27/google-data-center-faq/ | title = Google Data Center FAQ | author = Rich Miller | work = Data Center Knowledge | date = March 27, 2008 | access-date = March 15, 2009 | archive-url = https://web.archive.org/web/20090313082216/http://www.datacenterknowledge.com/archives/2008/03/27/google-data-center-faq/ | archive-date = March 13, 2009 | url-status = dead | df = mdy-all }} In Europe, the largest known centers are in Eemshaven and Groningen in the Netherlands and Mons, Belgium. Google's Oceania Data Center is located in Sydney, Australia.{{cite web| url = http://www.itnews.com.au/News/168772,found-google-australias-secret-data-network.aspx | title =Found: Google Australia's secret data network | author = Brett Winterford| work = ITNews | date= March 5, 2010 | access-date = March 20, 2010}}
= Data center network topology =
To support fault tolerance, increase the scale of data centers and accommodate low-radix switches, Google has adopted various modified Clos topologies in the past.{{Cite book|last1=Singh|first1=Arjun|last2=Ong|first2=Joon|last3=Agarwal|first3=Amit|last4=Anderson|first4=Glen|last5=Armistead|first5=Ashby|last6=Bannon|first6=Roy|last7=Boving|first7=Seb|last8=Desai|first8=Gaurav|last9=Felderman|first9=Bob|last10=Germano|first10=Paulie|last11=Kanagala|first11=Anand|title=Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication |chapter=Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google's Datacenter Network |date=2015 |pages=183–197 |chapter-url=https://research.google/pubs/pub43837/ |doi=10.1145/2785956.2787508|s2cid=2817692|doi-access=free|isbn=978-1-4503-3542-3 }}
= Project 02 =
One of the largest Google data centers is located in the town of The Dalles, Oregon, on the Columbia River, approximately 80 miles (129 km) from Portland. Codenamed "Project 02", the complex was built in 2006 and is approximately the size of two American football fields, with cooling towers four stories high.Markoff, John; Hansell, Saul. "[https://www.nytimes.com/2006/06/14/technology/14search.html Hiding in Plain Sight, Google Seeks More Power.]" New York Times. June 14, 2006. Retrieved on October 15, 2008.Google "[https://www.google.com/datacenter/thedalles/ The Dalles, Oregon Data Center]" Retrieved on January 3, 2011. The site was chosen to take advantage of inexpensive hydroelectric power, and to tap into the region's large surplus of fiber optic cable, a remnant of the dot-com boom. A blueprint of the site appeared in 2008.Strand, Ginger. "[http://harpers.org/media/slideshow/annot/2008-03/zoom.html Google Data Center]" Harper's Magazine. March 2008. Retrieved on October 15, 2008. {{webarchive |url=https://web.archive.org/web/20120830120156/http://harpers.org/media/slideshow/annot/2008-03/zoom.html |date=August 30, 2012 }}
= Summa papermill =
In February 2009, Stora Enso announced that they had sold the Summa paper mill in Hamina, Finland to Google for 40 million euros.{{cite web | url = http://www.storaenso.com/media-centre/press-releases/2009/02/Pages/stora-enso-divests-summa-mill.aspx | title = Stora Enso divests Summa Mill premises in Finland for million | publisher = Stora Enso | date = February 12, 2009 | access-date = December 2, 2009 | archive-url = https://web.archive.org/web/20090413023322/http://www.storaenso.com/media-centre/press-releases/2009/02/Pages/stora-enso-divests-summa-mill.aspx | archive-date = April 13, 2009 | url-status = dead }}{{dead link|date=November 2017}} {{Cite journal | url = http://www.kauppalehti.fi/5/i/talous/uutiset/etusivu/uutinen.jsp?oid=2009/02/18987 | title = Stooora yllätys: Google ostaa Summan tehtaan | journal = Kauppalehti | date = February 12, 2009 | place = Helsinki | access-date = February 12, 2009 | language = fi | archive-url = https://web.archive.org/web/20090214232026/http://www.kauppalehti.fi/5/i/talous/uutiset/etusivu/uutinen.jsp?oid=2009%2F02%2F18987 | archive-date = February 14, 2009 | url-status = dead }} Google invested 200 million euros in the site to build a data center and announced an additional 150 million euro investment in 2012.{{Cite journal | url = http://www.taloussanomat.fi/talous/2009/03/04/google-investoi-200-miljoonaa-euroa-haminaan/20095951/133 | title = Google investoi 200 miljoonaa euroa Haminaan | journal = Taloussanomat | date = February 4, 2009 | place = Helsinki | access-date = March 15, 2009 | language = fi}}{{cite web| url=https://www.google.com/about/datacenters/inside/locations/hamina/ |title=Hamina, Finland|access-date=April 23, 2018}} Google chose this location due to the availability and proximity of renewable energy sources.[http://www.fincloud.freehostingcloud.com/ Finland – First Choice for Siting Your Cloud Computing Data Center.] {{webarchive|url=https://web.archive.org/web/20130706112506/http://www.fincloud.freehostingcloud.com/ |date=July 6, 2013 }} Accessed August 4, 2010.
= Floating data centers =
{{See also|Google barges}}
In 2013, the press revealed the existence of Google's floating data centers along the coasts of the states of California (Treasure Island's Building 3) and Maine. The development project was maintained under tight secrecy. The data centers are 250 feet long, 72 feet wide, and 16 feet deep. The patent for an in-ocean data center cooling technology was bought by Google in 2009{{cite web|url=https://www.theguardian.com/technology/2013/oct/30/google-secret-floating-data-centers-california-maine|title=Google's worst-kept secret: floating data centers off US coasts|website=Theguardian.com|date=October 30, 2013|author=Rory Carroll|access-date=December 8, 2018}}{{cite web|url=https://www.datacenterknowledge.com/archives/2009/04/29/google-gets-patent-for-data-center-barges|title=Google Gets Patent for Data Center Barges|website=Datacenterknowledge.com|date=April 29, 2009|author=Rich Miller|access-date=December 8, 2018}} (along with a wave-powered ship-based data center patent in 2008{{cite web|url=https://www.cnet.com/news/google-files-patent-for-wave-powered-floating-data-center/|title=Google files patent for wave-powered floating data center|website=Cnet.com|date=September 8, 2008|author=Martin Lamonica|access-date=December 8, 2018}}{{cite web|url=https://www.datacenterdynamics.com/news/googles-ship-based-datacenter-patent-application-surfaces/|title=Google's ship based datacenter patent application surfaces|website=Datacenterdynamics.com|date=September 7, 2008|access-date=December 8, 2018}}). Shortly thereafter, Google declared that the two massive and secretly built infrastructures were merely "interactive learning centers, [...] a space where people can learn about new technology."{{cite web|url=https://www.theguardian.com/technology/2013/nov/06/google-barge-mystery-solved-interactive-learning-center|title=Google barge mystery solved: they're for 'interactive learning centers'|website=Theguardian.com|date=November 6, 2013|access-date=December 8, 2018}}
Google halted work on the barges in late 2013 and began selling off the barges in 2014.{{cite news|url=https://www.mercurynews.com/2014/08/01/google-confirms-selling-a-mystery-barge/|title=Google confirms selling a mystery barge|date=August 1, 2014|work=San Jose Mercury News|author=Brandon Bailey|access-date=April 7, 2015}}{{cite news|url=https://consumerist.com/2014/11/07/what-happened-to-those-google-barges/|title=What Happened To Those Google Barges?|date=November 7, 2014|work=Consumerist|author=Chris Morran|access-date=January 15, 2017}}
Software
Most of the software stack that Google uses on their servers was developed in-house.{{cite book | title=An Introduction to Search Engines and Web Navigation|author=Mark Levene|publisher= Pearson Education |year=2005|isbn=978-0-321-30677-7|page=73}} According to a well-known former Google employee in 2006, C++, Java, Python and (more recently) Go are favored over other programming languages.{{cite web|url=http://www.artima.com/weblogs/viewpost.jsp?thread=143947 |title=Python Status Update |publisher=Artima |date=January 10, 2006 | access-date = February 17, 2012}} For example, the back end of Gmail is written in Java and the back end of Google Search is written in C++.{{cite web|url=http://panela.blog-city.com/python_at_google_greg_stein__sdforum.htm |title=Warning |work=Panela |publisher=Blog-city |access-date=February 17, 2012 |url-status=dead |archive-url=https://web.archive.org/web/20111228034932/http://panela.blog-city.com/python_at_google_greg_stein__sdforum.htm |archive-date=December 28, 2011 }} Google has acknowledged that Python has played an important role from the beginning, and that it continues to do so as the system grows and evolves.{{cite web| url = http://python.org/about/quotes/ |title=Quotes about Python |publisher=Python |access-date = February 17, 2012}}
The software that runs the Google infrastructure includes:{{cite web|url= http://highscalability.com/google-architecture |title=Google Architecture |publisher= High Scalability |date=November 22, 2008 |access-date=February 17, 2012}}
- Google Web Server (GWS){{snd}} custom Linux-based Web server that Google uses for its online services.
- Storage systems:
- Google File System and its successor, Colossus{{Citation | first = Andrew | last = Fikes | url = http://research.google.com/university/relations/facultysummit2010/storage_architecture_and_challenges.pdf | contribution = Storage Architecture and Challenges | title = TechTalk | date = July 29, 2010 }}{{dead link|date=April 2019 |bot=InternetArchiveBot |fix-attempted=yes }}{{cite web | url = http://www.systutorials.com/3202/colossus-successor-to-google-file-system-gfs/ | title = Colossus: Successor to the Google File System (GFS) | publisher = SysTutorials | date = November 29, 2012 | access-date=May 10, 2016}}
- Bigtable{{snd}} structured storage built upon GFS/Colossus
- Spanner{{snd}} planet-scale database, supporting externally-consistent distributed transactions{{Citation | first = Jeffrey 'Jeff' | last = Dean | year = 2009 | url = http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf | contribution = Design, Lessons and Advice from Building Large Distributed Systems | format = keynote talk presentation | title = Ladis | publisher = Cornell}}
- Google F1{{snd}} a distributed, quasi-SQL DBMS based on Spanner, substituting a custom version of MySQL.{{Citation | url = http://research.google.com/pubs/pub38125.html | title = Research | format = presentation | place = Sigmod | contribution = F1 — the Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business | year = 2012 | first1 = Jeffrey 'Jeff' | last1 = Shute | first2 = Mircea | last2 = Oancea | first3 = Stephan | last3 = Ellner | first4 = Benjamin 'Ben' | last4 = Handy | first5 = Eric | last5 = Rollins | first6 = Bart | last6 = Samwel | first7 = Radek | last7 = Vingralek | first8 = Chad | last8 = Whipkey | first9 = Xin | last9 = Chen | first10 = Beat | last10 = Jegerlehner | first11 = Kyle | last11 = Littlefield | first12 = Phoenix | last12 = Tong | access-date = September 18, 2012 | archive-date = June 9, 2017 | archive-url = https://web.archive.org/web/20170609055135/https://research.google.com/pubs/pub38125.html | url-status = dead }}
- Chubby lock service
- MapReduce and Sawzall programming language
- Indexing/search systems:
- TeraGoogle{{snd}} Google's large search index (launched in early 2006){{Cite web|date=2008-07-28|title=Google alums rev up a new search engine|url=https://www.latimes.com/archives/la-xpm-2008-jul-28-fi-search28-story.html|access-date=2021-09-15|website=Los Angeles Times|language=en-US}}
- Caffeine (Percolator){{snd}} continuous indexing system (launched in 2010).
- Hummingbird{{snd}} major search index update, including complex search and voice search.{{cite web|url=http://insidesearch.blogspot.co.il/2013/09/fifteen-years-onand-were-just-getting.html |title=Google official release note |access-date=September 28, 2013}}
- Borg declarative process scheduling software
Google has developed several abstractions which it uses for storing most of its data:{{cite web |url=http://www.eweekeurope.co.uk/news/news-it-infrastructure/google-developing-caffeine-storage-system-1620 |title=Google Developing Caffeine Storage System | TechWeekEurope UK |publisher=Eweekeurope.co.uk |date=August 18, 2009 |access-date=February 17, 2012 |archive-url=https://web.archive.org/web/20111115162704/http://www.eweekeurope.co.uk/news/news-it-infrastructure/google-developing-caffeine-storage-system-1620 |archive-date=November 15, 2011 |url-status=dead }}
- Protocol Buffers{{snd}} "Google's lingua franca for data",{{cite web|url=https://code.google.com/apis/protocolbuffers/docs/overview.html |title=Developer Guide – Protocol Buffers – Google Code |access-date=February 17, 2012}} a binary serialization format which is widely used within the company.
- SSTable (Sorted Strings Table){{snd}} a persistent, ordered, immutable map from keys to values, where both keys and values are arbitrary byte strings. It is also used as one of the building blocks of Bigtable.http://static.googleusercontent.com/media/research.google.com/en/us/archive/bigtable-osdi06.pdf {{Bare URL PDF|date=March 2022}}
- RecordIO{{snd}} a sequence of variable sized records.{{cite web|author=windley on |url=http://www.windley.com/archives/2008/06/velocity_08_storage_at_scale.shtml |title=Phil Windley's Technometria | Velocity 08: Storage at Scale |publisher=Windley.com |date=June 24, 2008 |access-date=February 17, 2012}}{{cite web|url=https://groups.google.com/group/protobuf/browse_thread/thread/ee27572aef9da70a |title=Message limit – Protocol Buffers | Google Groups |access-date=February 17, 2012}}
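The SSTable abstraction listed above (an ordered, immutable map from byte-string keys to byte-string values) can be illustrated with a minimal in-memory sketch. The class and method names are hypothetical, and the sketch leaves out persistence entirely: a real SSTable is stored on disk, whereas this one only models the sorted, read-only lookup behavior.

```python
import bisect


class SSTable:
    """Toy sorted, immutable key/value table (in memory only)."""

    def __init__(self, items):
        # Sort once at construction; tuples keep the table read-only.
        pairs = sorted(dict(items).items())
        self._keys = tuple(k for k, _ in pairs)
        self._values = tuple(v for _, v in pairs)

    def get(self, key):
        """Binary-search for an exact key; return None if absent."""
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None

    def scan(self, start, end):
        """Return all (key, value) pairs with start <= key < end, in order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return list(zip(self._keys[lo:hi], self._values[lo:hi]))


table = SSTable({b"b": b"2", b"a": b"1", b"c": b"3"})
print(table.get(b"b"))
print(table.scan(b"a", b"c"))
```

Keeping the keys sorted is what makes both point lookups and ordered range scans cheap, and immutability means a table never needs locking once written.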
= Software development practices =
Most operations are read-only. When an update is required, queries are redirected to other servers, so as to simplify consistency issues. Queries are divided into sub-queries, which may be sent to different servers in parallel, thus reducing latency.
To lessen the effects of unavoidable hardware failure, software is designed to be fault tolerant. Thus, when a system goes down, data is still available on other servers, which increases reliability.
Search infrastructure
= Index =
Like most search engines, Google indexes documents by building a data structure known as an inverted index. Such an index maps each query word to the list of documents that contain it. The index is very large due to the number of documents stored in the servers.
The index is partitioned by document IDs into many pieces called shards. Each shard is replicated onto multiple servers. Initially, the index was served from hard disk drives, as is done in traditional information retrieval (IR) systems. Google dealt with the increasing query volume by increasing the number of replicas of each shard and thus the number of servers. Soon they found that they had enough servers to keep a copy of the whole index in main memory (although with low replication or no replication at all), and in early 2001 Google switched to an in-memory index system. This switch "radically changed many design parameters" of their search system, and allowed for a significant increase in throughput and a large decrease in latency of queries.{{cite web|url=http://research.google.com/people/jeff/WSDM09-keynote.pdf |title=Jeff Dean's keynote at WSDM 2009 |access-date=February 17, 2012}}
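A small sketch may help make document-ID partitioning concrete. It builds an inverted index in which each document's postings live only in that document's shard. The modulo assignment of documents to shards, the whitespace tokenizer, and the function names are assumptions made for illustration; the source does not describe how Google assigns documents to shards.

```python
from collections import defaultdict


def build_sharded_index(docs, num_shards):
    """Build one inverted index per shard, partitioned by document ID."""
    shards = [defaultdict(list) for _ in range(num_shards)]
    for doc_id in sorted(docs):
        shard = shards[doc_id % num_shards]   # assumed assignment rule
        for word in sorted(set(docs[doc_id].lower().split())):
            shard[word].append(doc_id)        # postings stay sorted by doc ID
    return shards


def lookup(shards, word):
    """Query every shard and merge the partial posting lists."""
    hits = []
    for shard in shards:
        hits.extend(shard.get(word, []))
    return sorted(hits)


docs = {
    1: "data center cooling",
    2: "data center network",
    3: "network switch",
}
shards = build_sharded_index(docs, num_shards=2)
print(lookup(shards, "network"))
```

With this layout, a query must visit every shard, but each shard holds only a fraction of the postings, so shards can be searched in parallel and replicated independently.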
In June 2010, Google rolled out a next-generation indexing and serving system called "Caffeine" which can continuously crawl and update the search index. Previously, Google updated its search index in batches using a series of MapReduce jobs. The index was separated into several layers, some of which were updated faster than the others, and the main layer wouldn't be updated for as long as two weeks. With Caffeine, the entire index is updated incrementally on a continuous basis. Later Google revealed a distributed data processing system called "Percolator"Daniel Peng, Frank Dabek. (2010). [https://research.google.com/pubs/pub36726.html Large-scale Incremental Processing Using Distributed Transactions and Notifications]. Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation. which is said to be the basis of the Caffeine indexing system.The Register. [https://www.theregister.co.uk/2010/06/09/google_completes_caffeine_search_index_overhaul/ Google Caffeine jolts worldwide search machine]The Register. [https://www.theregister.co.uk/2010/09/24/google_percolator/ Google Percolator – global search jolt sans MapReduce comedown]
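The difference between batch and incremental indexing can be shown with a toy example. This is only a sketch of the general idea on a single machine: it does not model Percolator's distributed transactions or notifications, and the function names are hypothetical. A batch rebuild rescans every document, while an incremental update touches only the words that changed in one document.

```python
def tokenize(text):
    """Split a document into a set of lowercase words."""
    return set(text.lower().split())


def rebuild_index(docs):
    """Batch approach: rescan every document to produce a fresh index."""
    index = {}
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index.setdefault(word, set()).add(doc_id)
    return index


def update_document(index, docs, doc_id, new_text):
    """Incremental approach: touch only the words that changed."""
    old_words = tokenize(docs.get(doc_id, ""))
    new_words = tokenize(new_text)
    for word in old_words - new_words:
        index[word].discard(doc_id)
        if not index[word]:
            del index[word]                 # drop empty posting sets
    for word in new_words - old_words:
        index.setdefault(word, set()).add(doc_id)
    docs[doc_id] = new_text


docs = {1: "old page", 2: "other page"}
index = rebuild_index(docs)
update_document(index, docs, 1, "new page")
print(index == rebuild_index(docs))   # both approaches agree
```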
= Server types =
Google's server infrastructure is divided into several types, each assigned to a different purpose:{{cite book|title=Future of Google Earth|author=Chandler Evans|chapter=Google Platform|publisher=Madison Publishing Company|year=2008|isbn=978-1-4196-8903-1|page=299}}{{cite book|title=Google Power|author=Chris Sherman|pages=[https://archive.org/details/googlepowerunlea0000sher/page/10 10–11]|publisher=McGraw-Hill Professional|year=2005|isbn=978-0-07-225787-8|chapter=How Google Works|url=https://archive.org/details/googlepowerunlea0000sher/page/10}}{{cite book|chapter=How Google Works|pages=[https://archive.org/details/googlepediaultim0000mill/page/17 17–18]|title=Googlepedia|author=Michael Miller|publisher=Pearson Technology Group|year=2007|isbn=978-0-7897-3639-0|url=https://archive.org/details/googlepediaultim0000mill/page/17}}
- Web servers coordinate the execution of queries sent by users, then format the result into an HTML page. The execution consists of sending queries to index servers, merging the results, computing their rank, retrieving a summary for each hit (using the document server), asking for suggestions from the spelling servers, and finally getting a list of advertisements from the ad server.
- Data-gathering servers are permanently dedicated to spidering the Web, using Google's web crawler, known as Googlebot. These servers update the index and document databases and apply Google's algorithms to assign ranks to pages.
- Index servers each contain a set of index shards. They return a list of document IDs ("docids") identifying the documents that contain the query word. These servers need less disk space, but bear the greatest CPU workload.
- Document servers store documents. Each document is stored on dozens of document servers. When performing a search, a document server returns a summary for the document based on query words. They can also fetch the complete document when asked. These servers need more disk space.
- Ad servers manage advertisements offered by services like AdWords and AdSense.
- Spelling servers make suggestions about the spelling of queries.
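The query flow described in the list above can be sketched in a few lines of Python. This is a simplified, single-process illustration and not Google's implementation: the function names (`search_shard`, `make_summary`, `handle_query`), the dictionary-based shards, and the precomputed `rank` scores are all assumptions introduced for the example.

```python
def search_shard(shard, word):
    """Index server role: return the docids in one shard that contain the word."""
    return shard.get(word, set())


def make_summary(text, word, width=20):
    """Document server role: return a short snippet around the query word."""
    pos = text.lower().find(word.lower())
    if pos < 0:
        return text[: 2 * width]
    start = max(0, pos - width)
    return text[start : pos + len(word) + width]


def handle_query(word, shards, rank, documents, spelling, ads, limit=10):
    """Web server role: fan out to index shards, merge, rank, and decorate."""
    # 1. Send the query to every index shard and merge the docid lists.
    docids = set()
    for shard in shards:
        docids |= search_shard(shard, word)
    # 2. Order hits by rank score, highest first (docid breaks ties).
    ranked = sorted(docids, key=lambda d: (-rank.get(d, 0.0), d))[:limit]
    # 3. Ask the document server for a summary of each hit.
    hits = [(d, make_summary(documents[d], word)) for d in ranked]
    # 4. Attach a spelling suggestion and advertisements, if any exist.
    return {
        "hits": hits,
        "suggestion": spelling.get(word),
        "ads": ads.get(word, []),
    }
```

A real deployment would issue the shard, document, spelling, and ad requests over the network and largely in parallel; the sequential calls here only show the order of the steps.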
There are also "canary requests", whereby a request is first sent to one or two leaf servers to see whether they respond successfully within a reasonable time. If they do not, the request fails and is not sent to the remaining servers. This provides a measure of robustness against requests that would crash or stall many servers at once, whether because of programming errors or denial-of-service attacks.{{Cite journal |last=Dean |first=Jeffrey |last2=Barroso |first2=Luiz André |date=February 2013 |title=The tail at scale |url=https://dl.acm.org/doi/10.1145/2408776.2408794 |journal=Communications of the ACM |language=en |volume=56 |issue=2 |pages=74–80 |doi=10.1145/2408776.2408794 |issn=0001-0782}}
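A minimal Python sketch of the canary idea follows. It is an illustration under stated assumptions, not the mechanism used in production: `fan_out_with_canary`, `CanaryFailed`, the `send` callback, and the default `timeout` are hypothetical names and values, and the servers are contacted sequentially for simplicity.

```python
import time


class CanaryFailed(Exception):
    """Raised when a canary server errors out or responds too slowly."""


def fan_out_with_canary(request, servers, send, canaries=1, timeout=0.5):
    """Try the request on a few servers first; only then send it to the rest."""
    responses = []
    for server in servers[:canaries]:
        started = time.monotonic()
        try:
            response = send(server, request)
        except Exception as exc:
            # The canary crashed: do not expose the other servers to the request.
            raise CanaryFailed(f"canary {server!r} failed") from exc
        # Simplification: the elapsed time is checked after the call returns;
        # a real system would enforce the deadline on the call itself.
        if time.monotonic() - started > timeout:
            raise CanaryFailed(f"canary {server!r} was too slow")
        responses.append(response)
    # The canaries responded in time, so the remaining servers are queried.
    for server in servers[canaries:]:
        responses.append(send(server, request))
    return responses
```

If `send` raises on the first server, `CanaryFailed` propagates and none of the remaining servers ever sees the request, which is the protective property the canary is meant to provide.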
Security
{{external media
| float = right
| caption = YouTube
| headerimage= File:YouTube 2024.svg
| video1 = [https://www.youtube.com/watch?v=kd33UVZhnAA Google Data Center Security: 6 Layers Deep]
}}
In October 2013, The Washington Post reported that the U.S. National Security Agency intercepted communications between Google's data centers, as part of a program named MUSCULAR.{{cite news|url=https://www.washingtonpost.com/world/national-security/nsa-infiltrates-links-to-yahoo-google-data-centers-worldwide-snowden-documents-say/2013/10/30/e51d661e-4166-11e3-8b74-d89d714ca4dd_story.html|title=NSA infiltrates links to Yahoo, Google data centers worldwide, Snowden documents say|last1=Gellman|first1=Barton|date=October 30, 2013|newspaper=The Washington Post|access-date=November 1, 2013|last2=Soltani|first2=Ashkan}}{{cite web|url=https://www.nytimes.com/2013/10/31/technology/nsa-is-mining-google-and-yahoo-abroad.html|title=N.S.A. Said to Tap Google and Yahoo Abroad|last1=Savage|first1=Charlie|last2=Miller|first2=Claire Cain|date=October 30, 2013|website=The New York Times|access-date=March 9, 2017|last3=Perlroth|first3=Nicole}} This wiretapping was made possible because, at the time, Google did not encrypt data passed inside its own network.{{cite web|url=https://arstechnica.com/information-technology/2013/10/how-the-nsas-muscular-tapped-googles-and-yahoos-private-networks/|title=How the NSA's MUSCULAR tapped Google's and Yahoo's private networks|last=Gallagher|first=Sean|date=October 31, 2013|website=Ars Technica|publisher=Condé Nast|access-date=March 9, 2017}} This was rectified when Google began encrypting data sent between data centers in 2013.{{cite web|url=https://www.nytimes.com/2013/11/01/technology/angry-over-us-surveillance-tech-giants-bolster-defenses.html|title=Angry Over U.S. Surveillance, Tech Giants Bolster Defenses|last=Miller|first=Claire Cain|date=October 31, 2013|website=The New York Times|access-date=March 9, 2017}}
Environmental impact
[[File:Google Mayes County P0004991a.jpg|thumb|Google data center in Mayes County at MidAmerica Industrial Park]]
Google's most efficient data center runs at {{convert|35|C}} using only fresh air cooling, requiring no electrically powered air conditioning.{{cite web|url=http://www.geek.com/chips/googles-most-efficient-data-center-runs-at-95-degrees-1478473/|title=Google's most efficient data center runs at 95 degrees|last=Humphries|first=Matthew|date=March 27, 2012|website=geek.com|archive-url=https://web.archive.org/web/20160613112324/http://www.geek.com/chips/googles-most-efficient-data-center-runs-at-95-degrees-1478473/|archive-date=June 13, 2016|access-date=June 13, 2016}}
In December 2016, Google announced that, starting in 2017, it would purchase enough renewable energy to match 100% of the energy usage of its data centers and offices. According to the company, the commitment would make Google "the world's largest corporate buyer of renewable power, with commitments reaching 2.6 gigawatts (2,600 megawatts) of wind and solar energy".{{cite web|url=https://blog.google/topics/environment/100-percent-renewable-energy/|title=We're set to reach 100% renewable energy — and it's just the beginning|last=Hölzle|first=Urs|date=December 6, 2016|website=The Keyword Google Blog|access-date=December 8, 2016}}{{cite web|url=https://www.theverge.com/2016/12/6/13852004/google-data-center-oklahoma-renewable-energy-climate-change|title=Google just notched a big victory in the fight against climate change|last=Statt|first=Nick|date=December 6, 2016|website=The Verge|publisher=Vox Media|access-date=December 8, 2016}}{{cite web|url=https://techcrunch.com/2016/12/06/google-says-it-will-hit-100-renewable-energy-by-2017/|title=Google says it will hit 100% renewable energy by 2017|last=Etherington|first=Darrell|date=December 7, 2016|website=TechCrunch|publisher=AOL|access-date=December 8, 2016}}
{{Clear}}
References
{{Reflist|30em}}
Further reading
{{refbegin}}
- {{cite journal|journal=IEEE Micro|volume=23|pages=22–28|date=March–April 2003|author1=L.A. Barroso |author2=J. Dean |author3=U. Hölzle |title=Web search for a planet: The Google cluster architecture|doi=10.1109/MM.2003.1196112|url=http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/googlecluster-ieee.pdf|issue=2}}
- Shankland, Stephen, CNET news "[http://news.cnet.com/8301-1001_3-10209580-92.html Google uncloaks once-secret server] {{Webarchive|url=https://web.archive.org/web/20140716084210/http://news.cnet.com/8301-1001_3-10209580-92.html |date=July 16, 2014 }}." April 1, 2009.
{{refend}}
External links
- [http://research.google.com/pubs/papers.html Google Research Publications]
- [http://research.google.com/archive/googlecluster-ieee.pdf Web Search for a Planet: The Google Cluster Architecture] (Luiz André Barroso, Jeffrey Dean, Urs Hölzle)
{{Google LLC}}