Python vs R for Data Science
Python vs R: R is an established language. Python is rapidly growing. Pitting one against the other, which will be the best language for Data Science?
As the world’s internet population grows, so does the amount of data to be collected, organized, and utilized for any purpose. Forbes reports that, by 2025, the amount of digital data worldwide will be 163 zettabytes. With these large amounts of data will come an increasing need for data scientists and data analysts. Data analysis is needed across multiple industries, from large corporations to schools to hospitals and research facilities. The work done by data scientists in these fields greatly affects our society and everyday lives. Research on product sales drives product improvements and discontinuations. Models rendered from medical research and studies help to develop new drugs and treatments for various illnesses. Our lives would be drastically different without the collection, analysis, and utilization of data.
Today’s discussion is an examination of two major programming languages used in data science: Python and R. We will review each language with talking points focused on market saturation, ease of use and learning, data collection, visualizations, and machine learning. Is there one language that’s better than the other?
Python vs R: Introduction
R is an interpreted language developed by Ross Ihaka and Robert Gentleman and released as open-source software in 1995. Development took place over roughly four years, between 1991 and 1995, and after its public debut it took another five years for the first stable release, version 1.0.0, to come out in 2000. R was developed as an improvement on the S programming language (a statistical computing language developed in the 1970s), combining it with the lexical scoping semantics of Scheme, under which a function looks up its variables in the environment where it was defined rather than where it is called.
An interpreted language executes code through an interpreter at run time instead of first compiling it into a standalone executable, as compiled languages such as C do. Without a separate compile step, R code can be executed directly from a command window, though today applications like RStudio (discussed later) are more commonly used.
With its background in statistics, R is used across multiple industries often for statistical analysis, machine learning, finance, and academic research and experiments.
Python is an open-source, multi-paradigm programming language created by Guido van Rossum, who began work on it in 1989. The first public release followed in 1991, and version 1.0 arrived in 1994. After working on ABC (a general-purpose programming language with its own integrated environment), van Rossum originally designed Python to interface with the Amoeba operating system.
As a multi-paradigm language, Python supports OOP, structured programming, and aspect-oriented programming. Python scripts can be run straight from the command line, but most developers work in an Integrated Development Environment (IDE). Common choices include the PyDev plugin for the ever-popular Eclipse IDE, PyCharm by JetBrains, and Visual Studio Code. Text editors like Vim and Sublime Text are also widely used, either alongside an IDE or instead of one.
Given the versatility of multi-paradigm support, Python can be found anywhere in the fields of computer & data science: software development, web applications, task automation, AI & machine learning, data analysis, and visualization.
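To make "multi-paradigm" a little more concrete, here is a minimal sketch of the same small task (adding up a list of numbers) written in three styles. The data and all function and class names are hypothetical, chosen only for illustration.

```python
from functools import reduce

readings = [3, 8, 5, 12]  # hypothetical sample data


# Structured (procedural) style: an explicit loop and an accumulator.
def total_procedural(values):
    total = 0
    for value in values:
        total += value
    return total


# Object-oriented style: the data and the behavior live together in a class.
class Readings:
    def __init__(self, values):
        self.values = list(values)

    def total(self):
        return sum(self.values)


# Functional style: fold the values together with a higher-order function.
def total_functional(values):
    return reduce(lambda acc, value: acc + value, values, 0)


print(total_procedural(readings), Readings(readings).total(), total_functional(readings))
```

All three approaches give the same result; which one reads best usually depends on the size of the project and the background of the team.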
Python vs R: Quick Comparison
Python vs R: Market Saturation
When choosing a programming language, market saturation is a factor that should be acknowledged. Questions to consider include:
- How common is this programming language in the field?
- How hard or easy will it be for me to learn this language and use it on a daily basis?
- How easy will it be for me to find support when I need it?
The sections below address each of these points with linked resources for further reading.
How Popular is R? Python?
Data from the TIOBE Index shows that, between April 2021 and April 2022, Python has risen from rank #3 to #1. These rankings are based on the number of worldwide courses, third party vendors, and skilled engineers related to each language the TIOBE Index reports on. R, despite being lower on the totem pole, is still a very popular language. During this same timeframe, R increased from rank #16 in April 2021 to rank #11.
These trends are very similar in the PYPL (PopularitY of Programming Language) Index, with April 2022 showing Python as rank #1 and R #7 worldwide. PYPL uses raw data from Google Trends to determine the number of searches related to tutorials for each programming language in the market worldwide. Despite dipping slightly (2.2%) over the last year, Python still held the largest share of searches at 27.95% of total tutorial searches in April 2022. R saw growth of 0.5%, totaling 4.41% of searches. Given that Python is not a specialized statistical language like R, the disparities need to be taken with a grain of salt. While R is typically limited to the data science, analytics, and statistical fields, Python has a much larger reach through computer science and coding fields.
Overall, however, Python does appear to be the more popular programming language for data scientists. In October 2019, Kaggle surveyed nearly 20,000 data professionals and found that 87% of the surveyed population used Python on a daily basis, whereas 31% used R.
What Support is Available?
Both languages have amassed a large amount of support in communities such as Stack Overflow, Stack Exchange, and GitHub. There are over 270,000 repositories for Python and more than 26,000 repositories for R on GitHub alone. On Stack Exchange, a specialized community called Cross Validated focuses on statistics and is a useful resource for R users.
Social media sites such as YouTube and Reddit are also fantastic resources for help and learning. In addition to r/Programming, there are multiple specialized subreddits available to offer their support:
- r/Python - offers assistance for Python users and learners
- r/matlab - handles questions about MATLAB, a separate numerical computing language that is often compared with Python and R
- r/Rlanguage - assistance for R users and learners
- r/stats - specific assistance related to R and statistics
In addition, each language has its own official forums. Python’s community forums on python.org offer boards for libraries, general support, new ideas, and other topics. RStudio also hosts its own discussion board for help with R and package development.
Python vs R: Packages and Libraries
Beyond the built-in functionality of each language are pre-written packages and libraries. In R these are usually called packages, while in Python they are usually called libraries. Each serves its own purpose, whether supporting new processes or improving upon existing ones.
Packages are available for R on CRAN, the Comprehensive R Archive Network. Currently, there are almost 20,000 packages available for download on CRAN. This number increases every day, with users submitting new packages for download. Packages in the CRAN repository are governed by the CRAN Repository Policy, a strict set of requirements a package must meet for acceptance. In addition, GitHub hosts a large number of R packages available for download.
Python libraries are most often installed from the Python Package Index (PyPI), the official repository used by the pip installer, and their source code is commonly hosted in public repositories on GitHub. Many of the more popular libraries, like SciPy and Pandas, have their own websites with documentation and support.
Learning R and Python
Object-oriented programming (OOP) languages are often considered approachable because objects in OOP are based on the concept of real-world objects, and methods and functions can be thought of as procedures and actions. Most data scientists will agree that Python is the easier of the two to learn, especially for those who already have a background in programming and are familiar with OOP languages. Its syntax was designed to be readable and close to plain English, making it faster to learn and code. The general consensus is that R is initially more challenging to learn because its functional, statistics-oriented style is less intuitive to many programmers than OOP.
As mentioned above, learning tools are widely available online, including:
- For R specifically, RStudio curates and maintains a large list of resources, ranging from R basics to advanced books on deep learning
- Python.org offers free Python video courses, tutorials and ebooks for download
- YouTube videos and tutorials are offered by many content creators
- Free and low-cost certification courses through accredited schools are offered through sites like EdX.org and Coursera
- FreeCodeCamp.org has tutorials available on their website and web courses available on YouTube
Which Language Has More Resources?
Due to its rise in popularity over the last 10 years, the market is heavily saturated with support and resources for Python. In addition, its user-friendly design streamlines the learning process, making it the recommended language for beginners and for those who do not want to limit its use to data science applications only.
Python vs R: Gathering and Sorting Data
Naturally, data scientists and analysts need data to conduct their daily tasks, so strategies are needed to gather and store it. Many companies keep data in-house using databases such as Microsoft SQL Server or MySQL. Some data may also be collected from the web through web scraping, web crawlers, and application programming interfaces (APIs). For the purpose of this article, we will focus on data wrangling for our language comparison.
What is Data Wrangling?
Data wrangling is a catch-all term for the processes that transform raw data into easily used formats, such as merging data from multiple sources into a single dataset, identifying and filling gaps in those datasets, removing unnecessary or unused data, and identifying outliers that may skew analysis results. Data wrangling is also commonly referred to as data munging. To gather raw data, web scraping is common in data analytics. Web scraping, also called web harvesting or web data extraction, is a method of extracting data from websites using a web browser and a script or process. For web scraping in general, familiarity with HTML and CSS is required in both R and Python.
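The wrangling steps described above (merging sources, filling gaps, dropping unused fields) can be sketched in a few lines. This is a minimal illustration that uses only the Python standard library so it runs without extra installs; in practice a library such as Pandas would typically handle these steps. The records, field names, and the choice to fill gaps with the mean are all assumptions made for the example.

```python
from statistics import mean

# Two hypothetical raw sources that share an "id" key.
sales = [
    {"id": 1, "units": 10, "note": "promo"},
    {"id": 2, "units": None, "note": ""},   # gap to be filled
    {"id": 3, "units": 14, "note": ""},
]
regions = {1: "North", 2: "South", 3: "North"}


def wrangle(sales_rows, region_by_id):
    # Fill gaps with the mean of the known values (one simple strategy among many).
    known = [row["units"] for row in sales_rows if row["units"] is not None]
    fill_value = mean(known)

    merged = []
    for row in sales_rows:
        merged.append({
            "id": row["id"],
            "units": row["units"] if row["units"] is not None else fill_value,
            # Merge in the second source; the unused "note" field is dropped.
            "region": region_by_id.get(row["id"], "unknown"),
        })
    return merged


tidy = wrangle(sales, regions)
print(tidy)
```

Whether the mean is a sensible fill value depends on the dataset; the point is only that each wrangling step maps to a small, explicit transformation.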
Web Scraping with R
Web Scraping with Python
Two popular Python libraries were developed with web scraping in mind: Beautiful Soup and Scrapy. Once the data is gathered, other libraries such as Pandas, NumPy, and Matplotlib make it easy to organize and inspect the scraped data. When writing web scraping scripts, the syntax of Python is easy to read and understand. Small bits of code can handle large amounts of data, though a simple script that requests one page at a time can become quite slow on large jobs, and the global interpreter lock (GIL) in the standard CPython interpreter limits how much threads can speed up CPU-heavy parsing.
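As a rough sketch of the parsing step in a scraper, the example below pulls link targets out of a page. It deliberately uses the standard library's html.parser module instead of Beautiful Soup or Scrapy so that it runs without extra installs, and it parses a hypothetical hard-coded page rather than downloading one. The class and function names are illustrative.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical page content; a real scraper would download this first.
SAMPLE_HTML = """
<html><body>
  <h1>Reports</h1>
  <a href="/reports/2021.csv">2021</a>
  <a href="/reports/2022.csv">2022</a>
  <a name="top">no link here</a>
</body></html>
"""


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


print(extract_links(SAMPLE_HTML))
```

Dedicated scraping libraries add conveniences on top of this idea, such as CSS selectors and tolerant handling of malformed markup.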
Storing Scraped Data
Once data is gathered, both Python and R are able to write data to databases like MySQL, Oracle, Sybase, and PostgreSQL. Both languages can also organize data into data frames, with Python exporting data as CSV and JSON files, and R exporting data into CSV files. More information on storage, including packages and libraries, will be discussed in the next section.
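For the Python side, exporting to CSV and JSON might look like the following minimal sketch. It uses only the standard library's csv and json modules and writes to in-memory strings; the records and field names are made up for the example.

```python
import csv
import io
import json

# Hypothetical scraped records.
rows = [
    {"city": "Oslo", "visits": 120},
    {"city": "Lima", "visits": 95},
]


def to_csv(records):
    # Write to an in-memory buffer; pass a real file object to save to disk instead.
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["city", "visits"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_json(records):
    return json.dumps(records, indent=2)


csv_text = to_csv(rows)
json_text = to_json(rows)
print(csv_text)
print(json_text)
```

CSV keeps the data flat and spreadsheet-friendly, while JSON preserves types and nesting, so the better format depends on what will consume the file.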
Which Language is Better for Data Wrangling?
In today’s market, the two languages are roughly equal in terms of data wrangling. The intended use is what determines which language one would choose. For example, if the wrangled data will be used in web or proprietary applications, then Python offers more flexibility. If data is being used for statistical analysis or research, R is the better option (for now).
Python vs R: Data Analysis & Statistical Analysis
After data is gathered, it needs to be cleaned up and organized before use. “Use” is meant loosely here, as it can involve reporting, data visualization, or preparing data for other business or application purposes. Here we will talk about cleaning and organizing data and database integration.
Cleaning and Organizing Raw Data
Data cleaning is an imperative step before completing any analytical process on raw data. As with any project, one must start with a clean slate or results will likely be skewed due to bad data. Data cleaning not only includes familiarizing oneself with the dataset and file sizes, but also identifying duplicates, outliers, missing data, wrong data, and formatting issues. A few packages R offers for cleaning data include:
- tidyr - identifies variables in a dataset and provides functions to gather, separate, or spread the data
- sqldf - a package that allows the user to run SQL queries against R data frames
- Janitor - finds duplicates
- RMarkdown - while this isn’t a data-cleaning tool per se, it aids in embedding documentation into the project, which helps with organization
Python also offers multiple libraries to aid in data cleaning. The most popular options include:
- Pandas - a library that reads data files such as CSV into dataframes, which can then be processed, cleaned, and written back out
- NumPy - designed to work with arrays, NumPy is often paired with Pandas for general data cleaning
- Matplotlib - a visualization library that’s used to create distribution plots to find areas where the data is insufficient
- Seaborn - another visualization library built on top of matplotlib that offers more customization features, which can come in useful when visualizing usable data
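The cleaning tasks named above (duplicates, missing data, and wrong or outlying values) can be illustrated with a short sketch. It is written in plain Python with the standard library so it runs anywhere; Pandas offers equivalent operations such as drop_duplicates and fillna. The sample rows, the plausible age range of 0 to 120, and the median fill are assumptions chosen for the example.

```python
from statistics import median

# Hypothetical raw records with typical problems.
raw = [
    {"id": 1, "age": 34},
    {"id": 2, "age": 29},
    {"id": 2, "age": 29},    # exact duplicate
    {"id": 3, "age": None},  # missing value
    {"id": 4, "age": 31},
    {"id": 5, "age": 240},   # implausible value
]


def drop_duplicates(rows):
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def find_outliers(rows, field, low, high):
    # Flag rows outside a plausible range chosen by the analyst.
    return [
        row for row in rows
        if row[field] is not None and not (low <= row[field] <= high)
    ]


def clean(rows):
    unique = drop_duplicates(rows)
    outlier_ids = {row["id"] for row in find_outliers(unique, "age", 0, 120)}
    kept = [row for row in unique if row["id"] not in outlier_ids]
    # Fill remaining gaps with the median of the known values.
    fill = median(row["age"] for row in kept if row["age"] is not None)
    return [
        {**row, "age": row["age"] if row["age"] is not None else fill}
        for row in kept
    ]


cleaned = clean(raw)
print(cleaned)
```

Removing outliers before computing the fill value keeps the implausible reading from distorting the replacement for the missing one.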
Both languages have drawbacks during data cleaning, mostly involving speed. Because R holds data in memory, it is typically regarded as the slower of the two. However, data cleaning typically involves very large sets of data, and in those cases Python can actually be at a disadvantage, because the global interpreter lock (GIL) in the standard CPython interpreter prevents threads from running Python code in parallel.
Data might not always be used immediately, so long-term storage is needed. We’ve discussed that both R and Python are able to export data tables into .csv and text files. Most often, data will then be moved into a database such as MySQL, PostgreSQL, or Oracle.
Database Connections with R
RStudio has a built-in Connections Pane that streamlines database connections, so data sources can be reached directly from within the platform. There are a number of ODBC drivers available on RStudio’s database page (db.rstudio.com). Each driver’s page offers a complete guide to the packages required, the connection settings, and any known issues with the connection.
In addition to the ODBC drivers, R has several packages available to make the database integration easier:
RStudio’s database website has a full explanation of the use of each of these packages along with installation and usage instructions available for free.
Database Connections with Python
Database connections with Python typically require installing a connector library for the target database. The items necessary for a database connection with Python are a working Python environment, the chosen database software, and an installed support library or database API. The pyodbc module is a commonly recommended option for database connections. It can connect to numerous databases, and its GitHub repository has richly developed documentation covering an array of databases.
Less often discussed is Python’s built-in database support. By simply importing the sqlite3 module from the standard library, an SQLite database can be used without installing or connecting to any other database software. This option is much more lightweight than a full-fledged database product and is less secure, but if only basic SQL functions and table handling are required, it is a viable option.
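A minimal sketch of the built-in option, using the standard library's sqlite3 module with an in-memory database, might look like this. The table name, columns, and values are hypothetical.

```python
import sqlite3


def build_demo_db():
    # ":memory:" creates a temporary database that lives only in RAM;
    # pass a file path instead to persist the data on disk.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE measurements (site TEXT, value REAL)")
    conn.executemany(
        "INSERT INTO measurements (site, value) VALUES (?, ?)",
        [("A", 1.5), ("A", 2.5), ("B", 4.0)],
    )
    conn.commit()
    return conn


def average_by_site(conn):
    cursor = conn.execute(
        "SELECT site, AVG(value) FROM measurements GROUP BY site ORDER BY site"
    )
    return cursor.fetchall()


conn = build_demo_db()
averages = average_by_site(conn)
print(averages)
conn.close()
```

The "?" placeholders let the driver handle quoting, which is generally safer than building SQL strings by hand.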
Which Language is Better for Data Analysis?
R is the winner for this section. The factors that affected this decision were the speed of data cleaning, the packages available for data cleaning, and the ease with which database connections are established. Packages used in R for data cleaning are well-established and, once learned, easy to use. As mentioned above, despite R typically being the slower of the two languages, its handling time for larger sets of data is faster. Finally, documentation and instructions for ODBC connection packages are easy to find, clear and concise, and all housed in one location.
Python vs R: Visualization
Data visualization is the graphical representation of data. Modeling and data visualization allow data scientists and analysts to turn large datasets into graphs and images. Visual representation of data allows organizations to track progress, make decisions, and display predictions for their given metrics. Common visualizations include:
- Bar & pie charts
- Scatter/line plots
- Time series
- Relationship maps
- Heat maps
- Geographic maps
- 3D plots
- Higher-dimensional plots
- Word clouds
A common sentiment in data science is the idiom, “a picture is worth a thousand words”. Graphics are key for data analysts to present their data in their given subfields. We’ll discuss libraries, packages, and techniques used for each language for visualizations.
There are nearly 2,000 repositories on GitHub containing packages designed for data visualization using R. Below are some of the most common packages used, along with a brief description of each:
- ggplot2 - the top-recommended package for chart and graph creation, designed for easy creation of aesthetically pleasing graphs
- lattice - high-level data visualization built as an enhancement to the base graphics and modified to accept multivariate data, allowing easy creation of multiple small graphs
- rgl - interactive 3D plots
The pros and cons of visualization with R echo points made earlier in this article. R has a robust set of packages available to handle almost any visualization, though the major con is that it is difficult for a beginner to learn.
Python has thousands of libraries and library modifications available for data visualization. Below are some of the most common libraries used for this purpose:
- Matplotlib - low-level interface used for scatter plots, line plots, histograms, etc.
- Seaborn - high-level interface built on top of matplotlib that adds design customization such as visually pleasing styles and color palettes
- Plotly - used for plotting, similar to matplotlib, but has tools built-in to allow for outliers and anomalies, more graph customization, and more attractive design components
There are a large number of libraries and repositories available for Python, and with the number of daily Python users growing, so will this number. For very large datasets, memory does become an issue, causing slow performance.
Which Language is Better for Visualizations?
R still has a leg up on Python for visualization. Both offer many packages and libraries for low-level and highly detailed visualizations. Of the two, R is more mature in its offerings and has more developed support for this purpose. Python’s lack of mobile support is also a drawback, as more of today’s communications and business must be mobile friendly.
Python vs R: Machine Learning and Deep Learning
The demand for skills in machine learning and deep learning is growing at a rapid pace. Take a look below at what machine learning is and how Python and R are equipped to handle these tasks.
What the Heck is Machine Learning? Deep Learning?
Machine learning is a type of AI that uses algorithms to learn patterns from historical data and then uses those patterns to predict future output. Search engines, bank fraud monitoring software, speech-to-text recognition software, and GPS arrival estimates and route management are all great examples of the effects of machine learning on our daily lives. Diving deeper, real-life applications of deep learning are becoming more common, such as self-driving cars, social media algorithms, and virtual assistants like Siri.
Deep learning is a specialized type of machine learning modeled after the way human brains identify objects, categorize them, and learn. Deep learning differs from other machine learning in that, while it can accept structured datasets as input to perform its tasks, it doesn’t need to. Instead of relying on human input about which characteristics to look at, deep learning algorithms can identify the features of different items in raw, unstructured data; distinguish them from one another; and accurately categorize them.
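The basic machine learning idea described above, learning from historical data and then predicting a future value, can be shown with one of the simplest possible models. The sketch below fits a straight line by ordinary least squares in plain Python; it is an illustration of the fit-then-predict workflow, not a stand-in for a real machine learning library, and the monthly sales figures are invented.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    slope, intercept = model
    return slope * x + intercept


# Hypothetical history: months 1 to 4 and the units sold in each month.
months = [1, 2, 3, 4]
units = [12, 14, 16, 18]

model = fit_line(months, units)   # "training" on historical data
forecast = predict(model, 5)      # predicting an unseen month
print(model, forecast)
```

Libraries in both R and Python wrap this same two-step pattern, fit on known data and then predict on new data, around far more capable models.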
Machine Learning with R
To conduct machine learning with R, one (or more) of the following packages is highly encouraged. The packages below are specifically designed with machine learning in mind:
- caret - short for Classification And REgression Training. Designed to streamline the creation of predictive models, with tools for data splitting, pre-processing, model tuning, etc.
- dplyr - tool for working with data frame objects both in and out of memory
- randomForest - builds a large number of decision trees from numeric data or factors. The output chosen by the largest number of trees is taken as the final output. Helps to solve regression and classification tasks
- e1071 - implements support vector machines (SVM), shortest path computation, bagged clustering, Naive Bayes classifier, fuzzy clustering, etc.
- rpart - short for Recursive Partitioning and Regression Trees. Builds classification and regression trees in a two-stage procedure
Pros and Cons of Machine Learning with R
Pros:
- Well-established in machine learning with multiple packages available
- Community support is available
- Designed for use with large data sets
Cons:
- Uses a large amount of memory
- Nature of the free open source packages means that the algorithms are inconsistent across the board, requiring dedicated learning time for each new package being used
Machine Learning with Python
Python also has a number of libraries that are both designed for and useful for machine and deep learning:
- NumPy - designed for large multi-dimensional array and matrix processing. It has a large collection of built-in high-level mathematical functions
- SciPy - contains modules for linear algebra, integration, linear optimization, and statistics. SciPy is also useful for image manipulation
- Scikit-learn - built on top of NumPy and SciPy, this is one of the more popular Python machine learning libraries. It can also be used for data mining and data analysis
- Theano - used to define, evaluate, and optimize mathematical expressions. Theano can easily handle multi-dimensional arrays in its computations. It also offers unit testing and self-verification to detect errors
- TensorFlow - developed by Google Brain for deep learning and neural networks. It allows distribution of work onto multiple CPU or GPU cores (this feature is not commercially supported, so any support is community-based)
Pros and Cons of Machine Learning with Python
Pros:
- Since ML has become more popular, algorithms no longer need to be coded manually and are available in multiple libraries
- Since Python requires less coding due to its streamlined syntax, ML applications can be developed faster
- Support for both OOP and procedural programming models makes it easy to replicate real-world functions in reusable code
Cons:
- The global interpreter lock in the standard CPython interpreter keeps threads from running Python code in parallel, which can bottleneck CPU-heavy steps of the machine learning process
- Execution can be slow since Python is interpreted rather than compiled ahead of time
- Mobile environments aren’t easily supported
- Database access layers are present, but are underdeveloped, making it not ideal to use with very large sets of data
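A note on the threading point above: the standard CPython interpreter does provide threads, but its global interpreter lock lets only one thread execute Python bytecode at a time, so threads mainly help with work that waits on I/O, such as downloads or database calls. The sketch below illustrates this with the standard library's ThreadPoolExecutor; the fake_download function is a hypothetical stand-in that simply sleeps to imitate a network wait.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(page):
    # Stand-in for a network request. Sleeping releases the GIL,
    # much as waiting on real I/O would.
    time.sleep(0.2)
    return f"page-{page}"


def fetch_all(pages, workers=4):
    # pool.map preserves the order of the inputs in its results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_download, pages))


start = time.perf_counter()
results = fetch_all(range(4))
elapsed = time.perf_counter() - start
print(results, round(elapsed, 2))
```

With four workers the four waits overlap, so the total time should be close to a single wait instead of four. CPU-bound work does not get the same benefit from threads and is usually handled with multiple processes or with libraries that do their heavy computation outside the interpreter.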
Which Language is Better for Machine Learning?
While Python is quickly on its way to surpassing R for machine learning, it hasn’t done so just yet. R is currently the stronger and better-supported language for ML, but Python has the potential to pull ahead in the near future.
Python vs R: Which Language is Better, R or Python?
As the amount of data grows worldwide, so does the need for individuals who can gather, process, and use it. Newcomers to the field often ask, “Which language should I learn?” After conducting the research needed to write this article, we have determined that the answer is:
If a language is being learned with a specific task in mind, then there is a clear winner based on that task. For example, if ease of learning and an abundance of knowledge and support are driving factors, then Python would be the obvious choice. If languages are being evaluated for a position related to statistics or machine learning, the best option would be R, at least for right now.
Otherwise, if the decision is based on the sections listed above, R still just barely pulls ahead of Python. As time goes on and more developments are made, it is a very real possibility that Python will outshine R in most tasks within the next 5 to 10 years.
If you have no goal in mind other than diving into the world of data science and analytics, flip a coin. Both R and Python are robust, well-developed languages with very similar outputs, albeit very different approaches.
For more information on each language, initial download, and proprietary support we recommend the enterprise websites below: