Dplyr

{{Short description|R package}}

{{Infobox software

| name = dplyr

| title = dplyr

| logo = Dplyr hex logo.svg

| logo caption =

| screenshot =

| caption =

| collapsible =

| author = Hadley Wickham, Romain François, Lionel Henry, Kirill Müller, Davis Vaughan

| developer =

| released = {{Start date and age|2014|01|07}}

| discontinued =

| latest release version = 1.1.0

| latest release date = {{Start date and age|2023|01|29}}

| latest preview version =

| latest preview date =

| programming language = R

| license = MIT License

| alexa =

| website = {{URL|https://dplyr.tidyverse.org/}}

}}

dplyr is an R package whose functions are designed to make manipulating a dataframe (a spreadsheet-like data structure) intuitive and user-friendly. It is one of the core packages of the tidyverse, a popular collection of packages for the R programming language.{{Cite journal |last1=Wickham |first1=Hadley |last2=Averick |first2=Mara |last3=Bryan |first3=Jennifer |last4=Chang |first4=Winston |last5=McGowan |first5=Lucy D'Agostino |last6=François |first6=Romain |last7=Grolemund |first7=Garrett |last8=Hayes |first8=Alex |last9=Henry |first9=Lionel |last10=Hester |first10=Jim |last11=Kuhn |first11=Max |last12=Pedersen |first12=Thomas Lin |last13=Miller |first13=Evan |last14=Bache |first14=Stephan Milton |last15=Müller |first15=Kirill |date=2019-11-21 |title=Welcome to the Tidyverse |journal=Journal of Open Source Software |language=en |volume=4 |issue=43 |pages=1686 |doi=10.21105/joss.01686 |issn=2475-9066|doi-access=free }} Data analysts typically use dplyr to transform existing datasets into a format better suited to a particular type of analysis or data visualization.{{Cite web|last=Yadav|first=Rohit|date=2019-10-29|title=Python's Pandas vs R's Tidyverse: Who Comes Out On Top?|url=https://analyticsindiamag.com/pythons-pandas-vs-rs-tidyverse-who-wins/|access-date=2021-02-06|website=Analytics India Magazine|language=en-US}}{{Cite web|last=Krill|first=Paul|date=2015-06-30|title=Why R? The pros and cons of the R language|url=https://www.infoworld.com/article/2940864/r-programming-language-statistical-data-analysis.html|access-date=2021-02-06|website=InfoWorld|language=en}}

For instance, someone analyzing a large dataset may wish to view only a subset of the data. Alternatively, a user may wish to rearrange the data so that the rows are ranked by some numerical value, or by a combination of values from the original dataset. dplyr provides functions for each of these tasks.

dplyr was launched in 2014.{{Cite web|title=Introducing dplyr|url=https://blog.rstudio.com/2014/01/17/introducing-dplyr/|access-date=2020-09-02|website=blog.rstudio.com|date=17 January 2014 |language=en-us}} On the dplyr web page, the package is described as "a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges."{{Cite web|title=Function reference|url=https://dplyr.tidyverse.org/reference/index.html|access-date=2021-02-06|website=dplyr.tidyverse.org|language=en}}

The five core verbs

Although dplyr includes several dozen functions for data manipulation, the package is built around five primary verbs, or actions:{{Cite book|last1=Grolemund|first1=Garrett|url=https://r4ds.had.co.nz/|title=5 Data transformation {{!}} R for Data Science|last2=Wickham|first2=Hadley}}

  • filter(), which is used to keep the rows of a dataframe that satisfy conditions specified by the user;
  • select(), which is used to subset a dataframe by its columns;
  • arrange(), which is used to sort the rows of a dataframe by the values in one or more columns;
  • mutate(), which is used to create new variables by altering or combining values from existing columns; and
  • summarize(), also spelled summarise(), which is used to collapse the values in a dataframe (or in each group of a grouped dataframe) into a single summary row.
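
The five verbs are commonly chained together in a single pipeline. The following is a minimal sketch rather than an excerpt from the dplyr documentation: the `scores` dataframe and its column names are hypothetical and invented for illustration, and the `%>%` pipe is the one that dplyr re-exports.

```r
library(dplyr)

# Hypothetical toy data, invented for this illustration.
scores <- data.frame(
  student = c("Ana", "Ben", "Cara", "Dev"),
  math    = c(90, 72, 85, 60),
  art     = c(75, 88, 95, 65),
  stringsAsFactors = FALSE
)

result <- scores %>%
  filter(math >= 70) %>%          # keep rows that satisfy a condition
  select(student, math, art) %>%  # keep only the named columns
  mutate(total = math + art) %>%  # add a column computed from existing ones
  arrange(desc(total))            # sort rows, highest total first

# Collapse the remaining rows into a single summary row.
overall <- result %>%
  summarize(n = n(), mean_total = mean(total))

print(result)   # rows in order: Cara (180), Ana (165), Ben (160)
print(overall)
```

Each verb takes a dataframe as its first argument and returns a new dataframe, which is what allows the steps to be piped one into the next.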

Additional functions

In addition to its five main verbs, dplyr also includes several other functions that enable exploration and manipulation of dataframes. Included among these are:

  • count(), which is used to count the number of observations (rows) for each unique value of one or more variables;
  • rename(), which enables a user to change column names, often to make a dataset easier to use and understand;
  • slice_max(), which returns the rows with the highest values of a particular variable;
  • slice_min(), which returns the rows with the lowest values of a particular variable.
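
As a rough illustration of the four functions listed above, the sketch below applies each of them to a small dataframe. The `pets` data and its column names are hypothetical; only the dplyr function names come from the package.

```r
library(dplyr)

# Hypothetical toy data, invented for this illustration.
pets <- data.frame(
  nm      = c("Rex", "Tom", "Fido", "Luna", "Kiwi"),
  species = c("dog", "cat", "dog", "cat", "bird"),
  weight  = c(30, 4, 12, 5, 0.1),
  stringsAsFactors = FALSE
)

# One row per distinct species, with the row count in a column named n.
per_species <- pets %>% count(species)

# rename() uses the form new_name = old_name.
renamed <- pets %>% rename(name = nm)

# The two rows with the largest weight, and the single row with the smallest.
heaviest <- pets %>% slice_max(weight, n = 2)
lightest <- pets %>% slice_min(weight, n = 1)

print(per_species)
print(heaviest)
print(lightest)
```

By default slice_max() and slice_min() keep tied rows, so they can return more than `n` rows when values are tied at the cutoff; the toy data here avoids ties.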

Built-in datasets

The dplyr package comes with five datasets: band_instruments, band_instruments2, band_members, starwars, and storms.
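
Once the package is attached, these datasets can be referred to by name. The sketch below assumes an installed copy of dplyr and uses base R's `data()` to list what the package ships; the variable name `dataset_names` is chosen here for illustration.

```r
library(dplyr)

# List the names of the datasets bundled with the installed dplyr package.
dataset_names <- data(package = "dplyr")$results[, "Item"]
print(dataset_names)

# The datasets are ordinary dataframes and can be passed straight to dplyr functions.
starwars %>% count(species) %>% head()
glimpse(band_members)
```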

Copyright & license

The copyright to dplyr is held by Posit PBC, formerly RStudio PBC. dplyr was originally released under a GPL license{{citation_needed|date=January 2023}}, but in 2022, Posit changed the license terms for the package to the "more permissive" MIT License.{{Cite web|title=A Grammar of Data Manipulation|url=https://dplyr.tidyverse.org|access-date=2023-01-14|website=tidyverse.org|language=en}} The main difference between the two licenses is that the MIT License allows code to be reused in proprietary software, whereas the GPL requires that distributed derivative works also be licensed under the GPL.
