culturomics

Culturomics is a form of computational lexicology that studies human behavior and cultural trends through the quantitative analysis of digitized texts.{{cite news | first=Patricia | last=Cohen | title=In 500 Billion Words, New Window on Culture | newspaper=New York Times | url=https://www.nytimes.com/2010/12/17/books/17words.html | date=16 December 2010}}{{cite journal|first=Brian|last=Hayes|url=http://www.americanscientist.org/issues/id.12418,y.2011,no.3,content.true,page.1,css.print/issue.aspx|title=Bit Lit|journal=American Scientist|volume=99|issue=3|page=190|doi=10.1511/2011.90.190|date=May–June 2011|access-date=2011-09-09|archive-url=https://web.archive.org/web/20161018204747/http://www.americanscientist.org/issues/id.12418,y.2011,no.3,content.true,page.1,css.print/issue.aspx|archive-date=2016-10-18|url-status=dead}} Researchers mine large digital archives to investigate cultural phenomena reflected in language and word usage.{{cite journal | url=http://www.amhighed.com/documents/charleston2011/AIHE2011_Proceedings.pdf#page=228 | first=David W. | last=Letcher | title=Cultoromics: A New Way to See Temporal Changes in the Prevalence of Words and Phrases | journal=American Institute of Higher Education 6th International Conference Proceedings | volume=4 | issue=1 | page=228 | date=April 6, 2011 | access-date=September 9, 2011 | archive-url=https://web.archive.org/web/20160303215026/http://www.amhighed.com/documents/charleston2011/AIHE2011_Proceedings.pdf#page=228 | archive-date=March 3, 2016 | url-status=dead }} The term is an American neologism first described in a 2010 Science article titled Quantitative Analysis of Culture Using Millions of Digitized Books, co-authored by Harvard researchers Jean-Baptiste Michel and Erez Lieberman Aiden.{{cite journal | first1=Jean-Baptiste | last1=Michel | first2=Erez | last2=Lieberman Aiden | title=Quantitative Analysis of Culture Using Millions of Digitized Books | journal=Science | date=16 December 2010 | doi=10.1126/science.1199644 | pmid=21163965 | volume=331 | issue=6014 | pmc=3279742 | pages=176–82}}

Michel and Aiden helped create the Google Labs project Google Ngram Viewer, which uses n-grams to analyze the Google Books digital library for cultural patterns in language use over time.
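
The idea of tracking an n-gram over time can be illustrated with a minimal Python sketch. It is not the Ngram Viewer's actual implementation: the function names, the whitespace tokenizer, and the three-sentence toy corpus are all assumptions made for illustration. For each year, it computes the share of n-grams of the same length that match a given phrase.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_frequency_by_year(corpus, phrase):
    """Map each year to the share of same-length n-grams equal to `phrase`.

    `corpus` is an iterable of (year, text) pairs. The pair layout and the
    lowercase whitespace tokenizer are simplifying assumptions.
    """
    target = tuple(phrase.lower().split())
    n = len(target)
    hits = Counter()
    totals = Counter()
    for year, text in corpus:
        grams = ngrams(text.lower().split(), n)
        totals[year] += len(grams)
        hits[year] += sum(1 for gram in grams if gram == target)
    # Normalise by the yearly total so that years with more text are comparable.
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}


# Hypothetical toy corpus, invented for this example.
toy_corpus = [
    (1900, "the steam engine drives the mill"),
    (1900, "a steam engine is loud"),
    (2000, "the search engine indexes the web"),
]
trend = relative_frequency_by_year(toy_corpus, "steam engine")
```

Dividing by the yearly total rather than reporting raw counts is what makes years with different amounts of digitized text comparable. It does not correct for which kinds of books were digitized.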

Because the Google Ngram data set is not an unbiased sample{{Cite journal|last1=Pechenick|first1=Eitan Adam|last2=Danforth|first2=Christopher M.|last3=Dodds|first3=Peter Sheridan|date=2015-10-07|title=Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution|journal=PLOS ONE|volume=10|issue=10|pages=e0137041|doi=10.1371/journal.pone.0137041|issn=1932-6203|pmc=4596490|pmid=26445406|arxiv=1501.00960|bibcode=2015PLoSO..1037041P|doi-access=free}} and does not include metadata,{{Cite journal|last=Koplenig|first=Alexander|title=The impact of lacking metadata for the measurement of cultural and linguistic change using the Google Ngram data sets—Reconstructing the composition of the German corpus in times of WWII|journal=Digital Scholarship in the Humanities|date=April 2017|volume=32|issue=1|pages=169–188|doi=10.1093/llc/fqv037|issn=2055-7671}} there are several pitfalls when using it to study language or the popularity of terms.{{Cite magazine|url=https://www.wired.com/2015/10/pitfalls-of-studying-language-with-google-ngram/|title=The Pitfalls of Using Google Ngram to Study Language|last=Zhang|first=Sarah|magazine=WIRED|access-date=2017-05-24}} Medical literature accounts for a large but shifting share of the corpus,[https://books.google.com/ngrams/graph?content=sunlight%2Csummer%2Cwinter%2Chealth+care%2Ctreatment%2Cvirus&year_start=1800 Comparison of example terms] and the corpus does not take into account how often a work is printed or read.

Studies

In a study called Culturomics 2.0, Kalev H. Leetaru examined news archives including print and broadcast media (television and radio transcripts) for words that imparted tone or "mood" as well as geographic data.{{cite journal | first=Kalev H. | last=Leetaru | title=Culturomics 2.0: Forecasting Large-Scale Human Behavior Using Global News Media Tone In Time And Space | journal=First Monday | volume=16 | issue=9 | date=5 September 2011 | doi=10.5210/fm.v16i9.3663 | doi-access= free}}{{cite web |url=http://www.gizmag.com/culturomics-using-media-coverage/19749/ |title=Culturomics research uses quarter-century of media coverage to forecast human behavior |first=Darren|last=Quick |date=7 September 2011 |publisher=Gizmag.com |access-date=9 September 2011}} The research retroactively predicted the 2011 Arab Spring and successfully estimated the final location of Osama bin Laden to within {{convert|124|miles|km}}.

A 2012 paper by Alexander M. Petersen and co-authors{{cite journal | title=Statistical Laws Governing Fluctuations in Word Use from Word Birth to Word Death | first=Alexander M. |last=Petersen | journal=Scientific Reports | date=15 March 2012 | volume=2 | pages=313 |doi=10.1038/srep00313 | pmid=22423321 | pmc=3304511 | arxiv=1107.3707 | bibcode=2012NatSR...2..313P }} found a "dramatic shift in the birth rate and death rates of words": deaths have increased and births have slowed. The authors also identified a universal "tipping point" in the life cycle of new words: about 30 to 50 years after their origin, they either enter the long-term lexicon or fall into disuse.[https://www.wsj.com/articles/SB10001424052702304459804577285610212146258 "The New Science of the Birth and Death of Words"], Christopher Shea, Wall Street Journal, March 16, 2012

Culturomic approaches have been taken in the analysis of newspaper content in a number of studies by I. Flaounas and co-authors. These studies showed macroscopic trends across different news outlets and countries. In 2012, a study of 2.5 million articles suggested that gender bias in news coverage depends on topic, and showed how the readability of newspaper articles is related to topic.{{Cite journal|doi = 10.1080/21670811.2012.714928|title = Research Methods in the Age of Digital Journalism|year = 2013|last1 = Flaounas|first1 = Ilias|last2 = Ali|first2 = Omar|last3 = Lansdall-Welfare|first3 = Thomas|last4 = De Bie|first4 = Tijl|last5 = Mosdell|first5 = Nick|last6 = Lewis|first6 = Justin|last7 = Cristianini|first7 = Nello|journal = Digital Journalism|volume = 1|pages = 102–116|s2cid = 61080552|doi-access = free}} A separate study by the same researchers, covering 1.3 million articles from 27 countries,{{Cite journal|doi = 10.1371/journal.pone.0014243|title = The Structure of the EU Mediasphere|year = 2010|last1 = Flaounas|first1 = Ilias|last2 = Turchi|first2 = Marco|last3 = Ali|first3 = Omar|last4 = Fyson|first4 = Nick|last5 = De Bie|first5 = Tijl|last6 = Mosdell|first6 = Nick|last7 = Lewis|first7 = Justin|last8 = Cristianini|first8 = Nello|journal = PLOS ONE|volume = 5|issue = 12|pages = e14243|pmid = 21170383|pmc = 2999531|bibcode = 2010PLoSO...514243F|doi-access = free}} showed macroscopic patterns in the choice of stories to cover. In particular, countries made similar choices when they were related by economic, geographical and cultural links. The cultural links were revealed by the similarity in voting for the Eurovision Song Contest. This study was performed on a vast scale using statistical machine translation, text categorisation and information extraction techniques.

The possibility of detecting mood shifts in a vast population by analysing Twitter content was demonstrated in a study by T. Lansdall-Welfare and co-authors.{{Cite book|doi = 10.1145/2187980.2188264|chapter = Effects of the recession on public mood in the UK|title = Proceedings of the 21st international conference companion on World Wide Web - WWW '12 Companion|year = 2012|last1 = Lansdall-Welfare|first1 = Thomas|last2 = Lampos|first2 = Vasileios|last3 = Cristianini|first3 = Nello|page = 1221|isbn = 9781450312301|s2cid = 1825992}} The study considered 84 million tweets generated by more than 9.8 million users from the United Kingdom over a period of 31 months, showing how public sentiment in the UK changed with the announcement of spending cuts.

In a 2013 study by S. Sudhahar and co-authors, the automatic parsing of textual corpora enabled the extraction of actors and their relational networks on a vast scale, turning textual data into network data. The resulting networks, which can contain thousands of nodes, are then analysed using tools from network theory to identify the key actors, the key communities or parties, and general properties such as the robustness or structural stability of the overall network, or the centrality of certain nodes.{{Cite journal|doi = 10.1017/S1351324913000247|title = Network analysis of narrative content in large corpora|year = 2015|last1 = Sudhahar|first1 = Saatviga|last2 = De Fazio|first2 = Gianluca|last3 = Franzosi|first3 = Roberto|last4 = Cristianini|first4 = Nello|journal = Natural Language Engineering|volume = 21|pages = 81–112|url = https://research-information.bris.ac.uk/en/publications/network-analysis-of-narrative-content-in-large-corpora(dfb87140-42e2-486a-91d5-55f9007042df).html|hdl = 1983/dfb87140-42e2-486a-91d5-55f9007042df|s2cid = 3385681|hdl-access = free}}
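
The step from text to network data can be sketched in Python as follows. This is a simplified illustration and not the method used in the study. It assumes that a parser has already produced (subject, verb, object) triplets. The triplets and function names are invented for the example, and degree centrality stands in for the wider set of network measures mentioned above.

```python
from collections import defaultdict


def build_network(triplets):
    """Build an undirected graph linking the subject and object of each triplet."""
    neighbours = defaultdict(set)
    for subject, _verb, obj in triplets:
        if subject != obj:  # ignore self-references
            neighbours[subject].add(obj)
            neighbours[obj].add(subject)
    return neighbours


def degree_centrality(neighbours):
    """Return each node's degree divided by the maximum possible degree (n - 1)."""
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    return {node: len(adjacent) / (n - 1) for node, adjacent in neighbours.items()}


# Hypothetical triplets, standing in for the output of a parsing pipeline.
toy_triplets = [
    ("Senate", "passes", "Bill"),
    ("President", "signs", "Bill"),
    ("President", "addresses", "Senate"),
    ("Court", "reviews", "Bill"),
]
centrality = degree_centrality(build_network(toy_triplets))
key_actor = max(centrality, key=centrality.get)
```

In this toy network, the node connected to every other node receives the highest score. On a real corpus, measures such as betweenness centrality or community detection would typically be computed with a dedicated graph library.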

In a 2014 study by T. Lansdall-Welfare and co-authors, 5 million news articles were collected over 5 years{{Cite book|doi = 10.1109/BigData.2014.7004454|chapter = On the coverage of science in the media: A big data study on the impact of the Fukushima disaster|title = 2014 IEEE International Conference on Big Data (Big Data)|year = 2014|last1 = Lansdall-Welfare|first1 = Thomas|last2 = Sudhahar|first2 = Saatviga|last3 = Veltri|first3 = Giuseppe A.|last4 = Cristianini|first4 = Nello|pages = 60–66|hdl = 2381/31439|isbn = 978-1-4799-5666-1|s2cid = 7686818}} and then analyzed, suggesting a significant shift in sentiment in the coverage of nuclear power that corresponded with the Fukushima disaster. The study also extracted concepts that were associated with nuclear power before and after the disaster, explaining the change in sentiment with a change in narrative framing.

In 2015, a study revealed the bias of the Google Books data set, which "suffers from a number of limitations which make it an obscure mask of cultural popularity," and called into question the significance of many of the earlier results.

Culturomic approaches can also contribute towards conservation science through a better understanding of human-nature relationships, with the first research published by McCallum and Bury in 2013.{{cite journal | doi=10.1007/s10531-013-0476-6 | title=Google search patterns suggest declining interest in the environment | date=2013 | last1=McCallum | first1=Malcolm L. | last2=Bury | first2=Gwendolyn W. | journal=Biodiversity and Conservation | volume=22 | issue=6–7 | pages=1355–1367 | bibcode=2013BiCon..22.1355M }} This study reported a precipitous decline in public interest in environmental issues. In 2016, a publication by Richard Ladle and colleagues{{cite journal | doi=10.1002/fee.1260 | title=Conservation culturomics | year=2016 | last1=Ladle | first1=Richard J. | last2=Correia | first2=Ricardo A. | last3=Do | first3=Yuno | last4=Joo | first4=Gea-Jae | last5=Malhado | first5=Ana CM | last6=Proulx | first6=Raphaël | last7=Roberge | first7=Jean-Michel | last8=Jepson | first8=Paul | journal=Frontiers in Ecology and the Environment | volume=14 | issue=5 | pages=269–275 | bibcode=2016FrEE...14..269L | s2cid=199392763 | url=https://ora.ox.ac.uk/objects/uuid:54fec40c-eac0-4b45-8403-4276ad4a3dd9 }} highlighted five key areas where culturomics can be used to advance the practice and science of conservation: recognizing conservation-oriented constituencies and demonstrating public interest in nature; identifying conservation emblems; providing new metrics and tools for near-real-time environmental monitoring and for supporting conservation decision making; assessing the cultural impact of conservation interventions; and framing conservation issues and promoting public understanding.

In 2017, a study correlated joint pain with Google search activity and temperature.{{Cite journal|last1=Telfer|first1=Scott|last2=Obradovich|first2=Nick|date=2017-08-09|title=Local weather is associated with rates of online searches for musculoskeletal pain symptoms|journal=PLOS ONE|volume=12|issue=8|pages=e0181266|doi=10.1371/journal.pone.0181266|pmid=28792953|pmc=5549896|bibcode=2017PLoSO..1281266T|issn=1932-6203|doi-access=free}} While the study observed higher search activity for hip and knee pain (but not arthritis) during higher temperatures, it did not (and could not) control for other relevant factors such as activity. Mass media misinterpreted this as "myth busted: rain does not increase joint pain",{{Cite news|url=https://www.nbcnews.com/health/health-news/does-rain-increase-joint-pain-google-says-no-n791226|title=Are achy joints associated with rain? Google suggests otherwise|work=NBC News|access-date=2017-08-10}}{{Cite news|url=http://www.menshealth.com/health/joint-pain-and-weather|title=This Myth About Joint Pain Is Total Crap|date=2017-08-10|work=Men's Health|access-date=2017-08-10}} while the authors speculated that the observed correlation is due to "changes in physical activity levels".{{Cite news|url=https://www.sciencedaily.com/releases/2017/08/170809142022.htm|title=Rain increases joint pain? Google suggests otherwise: People's activity levels -- increasing as temperatures rise, to a point -- are likelier than the weather itself to cause pain that motivates online searches, researchers say|work=ScienceDaily|access-date=2017-08-10}}

Criticism

Linguists and lexicographers have expressed skepticism regarding the methods and results of some of these studies, including one by Petersen et al.[http://bostonglobe.com/ideas/2013/02/10/when-physicists-linguistics/ZoHNxhE6uunmM7976nWsRP/story.html "When physicists do linguistics"], Ben Zimmer, Boston Globe, February 10, 2013 Others have demonstrated bias in the Ngram data set. Their results "call into question the vast majority of existing claims drawn from the Google Books corpus": "Instead of speaking about general linguistic or cultural change, it seems to be preferable to explicitly restrict the results to linguistic or cultural change ‘as it is represented in the Google Ngram data’" because it is unclear what caused the observed change in the sample. Ficetola criticized McCallum and Bury's use of Google Trends, suggesting that interest in the environment was actually increasing.{{cite journal | doi=10.1007/s10531-013-0552-y | title= Is interest toward the environment really declining? The complexity of analysing trends using internet search data | year=2014 | last1=Ficetola | first1=G. F. | journal=Biodiversity and Conservation | volume=23 | issue=12 | pages=2983–2988 | s2cid= 17003129 | url=https://link.springer.com/article/10.1007/s10531-013-0552-y }} In their rebuttal, McCallum and Bury{{cite journal | title=Public interest in the environment is falling: a response to Ficetola (2013) | year=2014 | last1=McCallum | first1=Malcolm L. | journal=Biodiversity and Conservation | volume=23 | issue=2 | pages=1057–1062 | doi=10.1007/s10531-014-0640-7| bibcode=2014BiCon..23.1057M | s2cid=7056654 |url=https://link.springer.com/article/10.1007/s10531-014-0640-7}} argued that, as far as public policy is concerned, proportional data are important and absolute numbers irrelevant. They explained that policy is driven by the opinion of the largest portion of the population rather than by absolute numbers, with decisions made according to majority influence, not simply the number of votes.

References

External links

[http://www.culturomics.org/home Culturomics.org], website by The Cultural Observatory at Harvard directed by Erez Lieberman Aiden and Jean-Baptiste Michel

Category:Computational linguistics

Category:2010s neologisms
