Monday, April 2, 2007

Technology

(Photo from April 9, 2007 issue of BusinessWeek)
One of the most remarkable things about Google is how it works: "Our business relies on our software and hardware infrastructure, which provides substantial computing resources at low cost. We currently use a combination of off-the-shelf and custom software running on clusters of commodity computers. Our considerable investment in developing this infrastructure has produced several key benefits. It simplifies the storage and processing of large amounts of data, eases the deployment and operation of large-scale global products and services, and automates much of the administration of large-scale clusters of computers." (How Google Works: http://www.baselinemag.com/print_article2/0,1217,a=182560,00.asp)

"Google runs on hundreds of thousands of servers—by one estimate, in excess of 450,000—racked up in thousands of clusters in dozens of data centers around the world. It has data centers in Dublin, Ireland; in Virginia; and in California, where it just acquired the million-square-foot headquarters it had been leasing. It recently opened a new center in Atlanta, and is currently building two football-field-sized centers in The Dalles, Oregon." (How Google Works: http://www.baselinemag.com/print_article2/0,1217,a=182560,00.asp)

What lets Google run the most efficient search engine and handle massive CPU and networking workloads is called grid computing. Part of what some call Web 2.0, it is the wave of the future. Grid computing means using a parallel infrastructure of many computers and distributing the workload evenly by pooling multiple computers' system resources. So instead of one big all-powerful and, more importantly, expensive supercomputer, you split the work over smaller, more common, and cheaper machines. It is the same technology the Human Genome Project used to map the human genome (an effort that has since grown into the Human Proteome Project), the same technology SETI@home (http://setiathome.berkeley.edu/) uses, and the same thing medical research companies use in the search for a cure for cancer.
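The core idea above — split one big job across many cheap machines instead of buying one supercomputer — can be sketched in a few lines. This is a minimal illustration, not Google's code: the corpus, the `count_words` task, and the worker count are all made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk):
    """Each 'machine' in the grid counts the words in its own slice."""
    return sum(len(line.split()) for line in chunk)

def grid_count(corpus, workers=4):
    """Split the corpus into even slices and fan them out to a worker pool,
    then combine the partial answers — the essence of grid computing."""
    size = max(1, len(corpus) // workers)
    chunks = [corpus[i:i + size] for i in range(0, len(corpus), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_words, chunks))

corpus = ["the quick brown fox", "jumps over", "the lazy dog"] * 100
total = grid_count(corpus)  # same answer as one machine, just spread out
```

The pooled result is identical to what a single machine would compute; only the work is divided.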

As Google CEO Eric Schmidt puts it: "Google does more than simply buy lots of PC-class servers and stuff them in racks; we're really building what we think of internally as supercomputers."

Google has also managed to develop portable data centers: it can pack a data center into a 20- or 40-foot shipping container, load it onto a tractor-trailer rig, and deploy it anywhere. Each trailer contains 5,000 processors and 3.5 petabytes of storage, and can be delivered overnight.

What Google does with all this storage space is cache a large part of the Internet in its data centers (about 737,000,000 websites). Its crawlers jump from site to site through links and build an index. Then, when someone types a search query at http://www.google.com/, the query is compared against that index to find the best matches, based both on content and on the number of links from other sites. The link-analysis system, called PageRank, was developed by Page and Brin; AltaVista displayed the number of links associated with a site but never actually used that information for ranking. All of this happens in just milliseconds, in order to determine the best matches and display them in ranked order for your convenience — the time taken to locate your results is always displayed in the upper right-hand corner.
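The published PageRank idea — a page is important if important pages link to it — can be computed by repeated passes over the link graph. Here is a minimal sketch of that iteration; the tiny three-page "web", the damping factor of 0.85, and the iteration count are illustrative choices, not Google's production values.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns a score per page.
    Each pass, every page passes `damping` of its score along its
    outgoing links; the rest is spread evenly over all pages."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if not outs:  # a page with no links shares its score with everyone
                for q in pages:
                    new[q] += damping * rank[p] / n
            else:
                for q in outs:
                    new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = pagerank(web)
```

Here page "c" outranks "b" because it collects links from both "a" and "b", while "b" gets only half of "a"'s vote — exactly the "links as votes" behavior described above.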

Larry Page and Sergey Brin also came up with BigFiles, which splits large files into small pieces stored across many computers pooling their hard drive space. Google runs the Google File System (GFS) on its servers, which ensures at least three copies of each piece are stored on separate computers for data consistency and error prevention, so Google can store its data reliably on low-cost, less reliable machines.
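The three-copies rule can be sketched as a placement function: given a chunk of a file, pick three distinct servers to hold replicas. The hashing scheme below is a simple stand-in of my own, not GFS's actual placement policy, and the server names are invented.

```python
import hashlib

def place_replicas(chunk_id, servers, copies=3):
    """Choose `copies` distinct servers for one chunk. Hashing the chunk id
    spreads chunks across the cluster deterministically (toy policy)."""
    start = int(hashlib.md5(chunk_id.encode()).hexdigest(), 16) % len(servers)
    return [servers[(start + i) % len(servers)] for i in range(copies)]

servers = ["srv%02d" % i for i in range(10)]
layout = {c: place_replicas(c, servers) for c in ["chunk-1", "chunk-2", "chunk-3"]}
```

Because every chunk lives on three different machines, losing any single cheap server still leaves two good copies — which is the whole point of the design.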

Google upgraded its software by creating BigTable, a database management system that stores structured data for Google search, Google Maps, Google Earth, and Search History. Rather than relying on a standard relational database such as MySQL, BigTable simply breaks tables down into smaller pieces stored across multiple computers.
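That "break the table into pieces" idea can be shown concretely: sort the rows by key, cut the sorted keys into fixed-size pieces (BigTable calls them tablets), and find a row by first locating its tablet. This is a toy sketch — the row data, tablet size, and lookup helper are all illustrative, not BigTable's real interface.

```python
import bisect

def split_into_tablets(rows, tablet_size):
    """rows: {row_key: value}. Returns a list of (start_key, piece) pairs,
    each piece holding a contiguous range of sorted row keys."""
    keys = sorted(rows)
    tablets = []
    for i in range(0, len(keys), tablet_size):
        part = keys[i:i + tablet_size]
        tablets.append((part[0], {k: rows[k] for k in part}))
    return tablets

def lookup(tablets, key):
    """Binary-search for the tablet whose key range covers `key`, then
    read the row from that one small piece."""
    starts = [start for start, _ in tablets]
    idx = bisect.bisect_right(starts, key) - 1
    return tablets[idx][1].get(key)

rows = {"com.example/a": 1, "com.example/b": 2, "com.google/search": 4,
        "net.archive/web": 5, "org.wikipedia/main": 3}
tablets = split_into_tablets(rows, tablet_size=2)
value = lookup(tablets, "com.google/search")
```

A lookup only ever touches one small piece, so the pieces can live on different machines and the table can grow far beyond what one computer could hold.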

Google uses a version of Red Hat Linux with kernel-level modifications made by its own programmers. Its distributed file system is the Google File System, its distributed scheduling system is called the Global Work Queue, and its database management systems are BigTable and Berkeley DB, which is now owned by Oracle.
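The Global Work Queue's job — hand the next pending task to whichever machine is idle — is the classic work-queue pattern. Here is a minimal sketch with threads standing in for cluster machines; the task (squaring numbers) and worker count are invented for illustration and have nothing to do with Google's real scheduler.

```python
import queue
import threading

def run_work_queue(tasks, workers=4):
    """A toy global work queue: idle workers pull the next task until
    the queue is empty, so faster workers naturally take on more work."""
    q = queue.Queue()
    for task in tasks:
        q.put(task)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task = q.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            outcome = task * task  # stand-in for real computation
            with lock:
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)

answers = run_work_queue(list(range(5)))
```

Pull-based scheduling like this balances load automatically: no central planner needs to know which machine is fast or slow.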

http://www.baselinemag.com/print_article2/0,1217,a=182560,00.asp
