Google Books Ngram Viewer

Google used some of the data obtained from 15 million scanned books to build Google Books Ngram Viewer.

“The datasets we’re making available today to further humanities research are based on a subset of that corpus, weighing in at 500 billion words from 5.2 million books in Chinese, English, French, German, Russian, and Spanish. The datasets contain phrases of up to five words with counts of how often they occurred in each year. (…) The Ngram Viewer lets you graph and compare phrases from these datasets over time, showing how their usage has waxed and waned over the years,” says Jon Orwant, from the Google Books team.


The nice thing is that the raw data is licensed under Creative Commons Attribution and can be downloaded for free. Maybe Google should use the same license for the Ngram database obtained from indexing the web.

Easter Egg in Google Books Ngram Viewer

Rickrolling seems to be Google’s favorite prank. If you try to search for [never gonna give you up] in Google’s recently launched Ngram viewer, you’ll have a pleasant surprise: a YouTube video of Rick Astley’s “Never Gonna Give You Up”.

Never gonna give you up,
Never gonna let you down
Never gonna run around and desert you
Never gonna make you cry,
Never gonna say goodbye
Never gonna tell a lie and hurt you…


{ Thanks, Federico. }

Find out what’s in a word, or five, with the Google Books Ngram Viewer

Scholars interested in topics such as philosophy, religion, politics, art and language have employed qualitative approaches such as literary and critical analysis with great success. As more of the world’s literature becomes available online, it’s increasingly possible to apply quantitative methods to complement that research. So today Will Brockman and I are happy to announce a new visualization tool called the Google Books Ngram Viewer, available on Google Labs. We’re also making the datasets backing the Ngram Viewer, produced by Matthew Gray and intern Yuan K. Shen, freely downloadable so that scholars will be able to create replicable experiments in the style of traditional scientific discovery.

Comparing instances of [flute], [guitar], [drum] and [trumpet] (blue, red, yellow and green respectively) in English literature from 1750 to 2008


Since 2004, Google has digitized more than 15 million books worldwide. The datasets we’re making available today to further humanities research are based on a subset of that corpus, weighing in at 500 billion words from 5.2 million books in Chinese, English, French, German, Russian, and Spanish. The datasets contain phrases of up to five words with counts of how often they occurred in each year.
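To make the shape of that data concrete, here is a minimal sketch of how phrases of up to five words could be counted per year. It is an illustration only, not how the datasets were actually produced: the function names, the whitespace tokenization and the toy corpus are all assumptions made for this example.

```python
from collections import Counter, defaultdict

MAX_N = 5  # the datasets cover phrases of up to five words


def ngrams(tokens, max_n=MAX_N):
    """Yield every phrase made of 1..max_n consecutive tokens."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def count_by_year(books):
    """Take (year, text) pairs and return {year: Counter(phrase -> count)}."""
    counts = defaultdict(Counter)
    for year, text in books:
        # Naive lowercase + whitespace tokenization (an assumption for this sketch).
        tokens = text.lower().split()
        counts[year].update(ngrams(tokens))
    return counts


# Hypothetical toy corpus, invented purely for illustration.
toy_books = [
    (1900, "the great war began"),
    (1900, "the great war ended"),
    (1950, "world war two ended"),
]

yearly = count_by_year(toy_books)
print(yearly[1900]["the great war"])  # occurrences of the phrase in books dated 1900
print(yearly[1950]["war"])            # occurrences of the word in books dated 1950
```

Keeping one counter per year means a lookup for a phrase in a given year is a single dictionary access, which is roughly the kind of question the datasets are meant to answer.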

These datasets were the basis of a research project led by Harvard University’s Jean-Baptiste Michel and Erez Lieberman Aiden, published today in Science and coauthored by several Googlers. Their work offers several examples of how quantitative methods can provide insights into topics as diverse as the spread of innovations, the effects of youth and profession on fame, and trends in censorship.

The Ngram Viewer lets you graph and compare phrases from these datasets over time, showing how their usage has waxed and waned over the years. One of the advantages of having data online is that it lowers the barrier to serendipity: you can stumble across something in these 500 billion words and be the first person ever to make that discovery. Below I’ve listed a few interesting queries to pique your interest:

World War I, Great War
child care, nursery school, kindergarten
fax, phone, email
look before you leap, he who hesitates is lost
virus, bacteria
tofu, hot dog
burnt, burned
flute, guitar, trumpet, drum
Paris, London, New York, Boston, Rome
laptop, mainframe, microcomputer, minicomputer
fry, bake, grill, roast
George Washington, Thomas Jefferson, Abraham Lincoln
supercalifragilisticexpialidocious
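For readers who download the datasets and want to compare phrases themselves, a rough sketch of one possible approach follows. Dividing each phrase count by the total for that year is an assumption of this example (it keeps years with more scanned text from dominating the comparison), and every number below is invented for illustration.

```python
def relative_frequency(phrase_counts, totals):
    """Return {year: phrase count / total count} for years with a nonzero total."""
    return {
        year: phrase_counts.get(year, 0) / total
        for year, total in sorted(totals.items())
        if total > 0
    }


# Hypothetical yearly counts, not taken from the real datasets.
totals = {1900: 1000, 1950: 2000, 2000: 4000}
burnt = {1900: 10, 1950: 10, 2000: 8}
burned = {1900: 5, 1950: 20, 2000: 60}

burnt_share = relative_frequency(burnt, totals)
burned_share = relative_frequency(burned, totals)

for year in sorted(totals):
    leader = "burnt" if burnt_share[year] > burned_share[year] else "burned"
    print(year, leader)
```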

We know nothing can replace the balance of art and science that is the qualitative cornerstone of research in the humanities. But we hope the Google Books Ngram Viewer will spark some new hypotheses ripe for in-depth investigation, and invite casual exploration at the same time. We’ve started working with some researchers already via our Digital Humanities Research Awards, and look forward to additional collaboration with like-minded researchers in the future.

Posted by Jon Orwant, Engineering Manager, Google Books

You can count the number of books in the world on 25,972,976 hands

Ever wonder just how many different books there are in the world? After some intensive analysis, we’ve come up with a number. We got there by standing on the shoulders of giants (libraries and cataloging organizations) and by drawing on our computational resources and our experience organizing millions of books through our Books Library Project and Books Partner Program since 2004.

As of today, we estimate that there are 129,864,880 different books in the world. That’s a lot of knowledge captured in the written word! This calculation used an algorithm that combines book information from multiple sources including libraries, WorldCat, national union catalogs and commercial providers. And the actual number of books is always increasing.
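The hand count in the title appears to follow from that estimate by simple division. The quick check below assumes five fingers per hand, which is an inference from the title rather than something the post spells out.

```python
BOOKS_ESTIMATE = 129_864_880  # the estimate quoted above
FINGERS_PER_HAND = 5          # assumption: one book counted per finger

# divmod returns the number of full hands and any leftover books.
hands, leftover = divmod(BOOKS_ESTIMATE, FINGERS_PER_HAND)
print(f"{hands:,} hands, {leftover} books left over")
```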

Ultimately, it is truly incredible to contemplate the depth and breadth of published works out there in the world. To find out how we calculated this number (no, we didn’t count them on our fingers :), check out the Google Books blog.

Posted by Leonid Taycher

Google Books goes Dutch

In recent months, I’ve got to know a group of people in The Hague who are working on an ambitious project to make the rich fabric of Dutch cultural and political history as widely accessible as possible – via the Internet.

That team is from the National Library of the Netherlands, the Koninklijke Bibliotheek (KB), and as of today, we’ll be working in partnership to add to the library’s own extensive digitisation efforts. We’ll be scanning more than 160,000 of its public domain books, and making this collection available globally via Google Books. The library will receive copies of the scans so that they can also be viewed via the library’s website. And significantly for Europe, the library also plans to make the digitised works available via Europeana, Europe’s cultural portal.

The books we’ll be scanning constitute nearly the library’s entire collection of out-of-copyright books, written during the 18th and 19th centuries. The collection covers a tumultuous period of Dutch history, which saw the establishment of the country’s constitution and its parliamentary democracy. Anyone interested in Dutch history will be able to access and view a fascinating range of works by prominent Dutch thinkers, statesmen, poets and academics and gain new insights into the development of the Netherlands as a nation state.

This is the third agreement we’ve announced in Europe this year, following our projects with the Italian Ministry of Cultural Heritage and the Austrian National Library. The Dutch national library is already well underway with its own ambitious scanning programme, which will eventually see all of its Dutch books, newspapers and periodicals from 1470 onwards being made available online. By any measure, this is a huge task, requiring significant resources, and we’re pleased to be able to help the library accelerate towards its goal of making all Dutch books accessible anywhere in the world, at the click of a mouse.

It’s exciting to note just how many libraries and cultural ministries are now looking to preserve and improve access to their collections by bringing them online. Much of humanity’s cultural, historical, scientific and religious knowledge, collected and curated over centuries, sits in Europe’s libraries, and it’s great to see that we are all striving towards the same goal of improving access to knowledge for all.

Google and other technology companies have an important role to play in achieving this goal, and we hope that by partnering with major European cultural institutions such as the Dutch national library, we will be able to accelerate the rapid growth of Europe’s digital library.

Posted by Philippe Colombet