{"id":1948,"date":"2019-03-29T06:33:51","date_gmt":"2019-03-29T06:33:51","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2019\/03\/29\/python-for-data-science\/"},"modified":"2019-03-29T06:33:51","modified_gmt":"2019-03-29T06:33:51","slug":"python-for-data-science","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2019\/03\/29\/python-for-data-science\/","title":{"rendered":"Python for Data Science"},"content":{"rendered":"<p>Author: datascience@berkeley Staff<\/p>\n<div>\n<p><img decoding=\"async\" src=\"https:\/\/corp-mktg.s3.amazonaws.com\/cask\/prod\/ucb-mids\/content\/a1badde8117142aebd829f260c9b2c7a\/UCB-MIDS_Python-for-Data-Science_hero.jpg\" width=\"768\"><\/p>\n<p>The programming languages behind the apps, programs, and environments you use are sophisticated and numerous: according to the TIOBE Index, <span><a href=\"https:\/\/www.tiobe.com\/tiobe-index\/programming-languages-definition\/#instances\" target=\"_blank\" rel=\"noopener noreferrer\">there are more than 250 programming languages currently in existence<\/a>. One of the most popular of these is <a href=\"https:\/\/docs.python.org\/2\/faq\/general.html#why-was-python-created-in-the-first-place\" target=\"_blank\" rel=\"noopener noreferrer\">Python, an open-source language that\u2019s been around since February 1991.<\/a> Data scientists have been using Python regularly for years, so let\u2019s take a closer look at what Python is and why it\u2019s popular among them. <\/span><\/p>\n<h2>Introducing Python<\/h2>\n<p>Python is an extensible and portable programming language that runs on Unix, Mac, and Windows. Because of this accessibility and portability, it has no shortage of users. New Python users can quickly learn enough to work with code, and a large community supports their efforts. 
A 2016 O\u2019Reilly Media survey found that <span><a href=\"https:\/\/www.oreilly.com\/ideas\/2016-data-science-salary-survey-results\" target=\"_blank\" rel=\"noopener noreferrer\">54 percent of data scientists use Python in their work,<\/a> up from 40 percent in 2013. <em>The Economist<\/em> even claimed in 2018 that <a href=\"https:\/\/www.economist.com\/graphic-detail\/2018\/07\/26\/python-is-becoming-the-worlds-most-popular-coding-language\" target=\"_blank\" rel=\"noopener noreferrer\">Python is becoming the world\u2019s most popular coding language.<\/a> <\/span><\/p>\n<p>Corporate and research usage supports these numbers. For years, <span><a href=\"https:\/\/code.fb.com\/production-engineering\/python-in-production-engineering\/\" target=\"_blank\" rel=\"noopener noreferrer\">Python has been a language of choice for production engineers at Facebook;<\/a> in fact, it is the third-most popular language at the company. And <a href=\"https:\/\/quintagroup.com\/cms\/python\/google\">Python is one of Google\u2019s official languages<\/a>, meaning it can be deployed to production within the company. <a href=\"https:\/\/pydanny-event-notes.readthedocs.io\/en\/latest\/socalpiggies\/20110526-wda.html\" target=\"_blank\" rel=\"noopener noreferrer\">Walt Disney Animation Studios<\/a> uses Python for many creative tasks. <a href=\"https:\/\/realpython.com\/world-class-companies-using-python\/\" target=\"_blank\" rel=\"noopener noreferrer\">Companies like Industrial Light and Magic, Spotify, Quora, Netflix, Dropbox, and Reddit all rely on Python<\/a> for everything from moviemaking to social news aggregation. 
<a href=\"https:\/\/cacm.acm.org\/blogs\/blog-cacm\/176450-python-is-now-the-most-popular-introductory-teaching-language-at-top-u-s-universities\/fulltext\" target=\"_blank\" rel=\"noopener noreferrer\">Python is even the most popular introductory coding language taught<\/a> at top US universities, in part because of its popularity in so many settings.<\/span><\/p>\n<p>A wide range of companies and institutions with very different goals all prefer to use Python, which is a testament to its flexibility. But how does it work, exactly?<\/p>\n<p>For starters,\u00a0<a href=\"https:\/\/www.datasciencegraduateprograms.com\/python\/\" target=\"_blank\" rel=\"noopener noreferrer\">Python supports multiple paradigms,<\/a> including functional programming, object-oriented programming, structured programming, and procedural programming. It\u2019s the Swiss Army knife of languages, allowing production teams and researchers to use the same <a href=\"https:\/\/www.fastcompany.com\/3030877\/businesses-can-now-use-the-same-stats-language-as-universities-thanks-to-pandas\" target=\"_blank\" rel=\"noopener noreferrer\">tools<\/a>. This means that it can handle website construction, data mining, and much more, all in the same language.<\/p>\n<p>Furthermore, Python can be extended via libraries to allow data scientists to tackle machine learning, data analysis, and beyond. <a href=\"https:\/\/medium.freecodecamp.org\/the-hitchhikers-guide-to-machine-learning-algorithms-in-python-bfad66adb378\" target=\"_blank\" rel=\"noopener noreferrer\">The active community of Python users provides easy-to-follow tutorials<\/a> that make it simple and quick to get started with machine learning. 
This makes Python more than just a programming language; it\u2019s one of many tools that data scientists can use to explore and analyze their datasets.<\/p>\n<h2>Why do data scientists use Python?<\/h2>\n<p>Because the language is multifaceted, flexible, and easy to read, Python is an obvious choice in the field. However, Python\u2019s widespread use in data science is relatively new, driven largely by libraries such as <a href=\"http:\/\/pandas.pydata.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">Pandas<\/a>, which help individuals clean up data and perform advanced <a href=\"https:\/\/www.datasciencegraduateprograms.com\/python\/\" target=\"_blank\" rel=\"noopener noreferrer\">manipulation<\/a>.<\/p>\n<p>Numbers on Pandas usage are hard to come by, but Quartz notes that <a href=\"https:\/\/qz.com\/1126615\/the-story-of-the-most-important-tool-in-data-science\/\" target=\"_blank\" rel=\"noopener noreferrer\">Stack Overflow saw 5 million visits to questions on Pandas from 1 million unique visitors in October 2017 alone.<\/a><\/p>\n<p><a href=\"https:\/\/stackoverflow.blog\/2017\/09\/14\/python-growing-quickly\/\" target=\"_blank\" rel=\"noopener noreferrer\">The growth of Python in data science has gone hand in hand with that of Pandas,<\/a> which opened the use of Python for data analysis to a broader audience by enabling it to deal with row-and-column datasets, import CSV files, and much more.<\/p>\n<p>While Pandas may be the best-known library, there are hundreds of other specialized libraries, such as SymPy (symbolic mathematics), PyMC (Bayesian statistical modeling), matplotlib (plotting and visualization), and PyTables (storage and data formatting). These and other specialized libraries aid in everything from machine learning to data preprocessing to neural networks. 
One of the main benefits of Python is that its flexible nature enables the data scientist to use one tool every step of the way.<\/p>\n<p>Another plus is the large community of data scientists, machine learning experts, and programmers who go out of their way not only to make it easy to learn Python and machine learning but also to provide <a href=\"https:\/\/archive.ics.uci.edu\/ml\/index.php\" target=\"_blank\" rel=\"noopener noreferrer\">datasets to test a Python student&#8217;s mastery of their newfound skills.<\/a> Whether you are a social scientist who needs Python for advanced data analysis or an experienced developer interested in a growing field, some part of the Python community is ready to help you out.<\/p>\n<p>However, with so many resources available to help you use Python, how can you know which one will be best for you?<\/p>\n<p>Learning from a trusted source like UC Berkeley can ensure that you are able to use the programming language with confidence. Through datascience@berkeley, UC Berkeley\u2019s <a href=\"https:\/\/datascience.berkeley.edu\/\" target=\"_blank\" rel=\"noopener noreferrer\">online Master of Information and Data Science<\/a>, you can take an entire course on <a href=\"https:\/\/datascience.berkeley.edu\/academics\/curriculum\/python-for-data-science\/\" target=\"_blank\" rel=\"noopener noreferrer\">Python for data science.<\/a> Students are introduced to a range of Python objects and control structures; they then build on this knowledge with classes and object-oriented programming before delving into Python\u2019s system of packages for data analysis.<\/p>\n<h2>Python vs. R: What&#8217;s the difference?<\/h2>\n<p>Like Python, R is an open-source programming language that was developed in the 1990s, with an initial release in 1995. 
Also like Python, <a href=\"https:\/\/www.r-project.org\/about.html\" target=\"_blank\" rel=\"noopener noreferrer\">R is available for Windows, Unix, and macOS.<\/a> However, while Python is a general-purpose language that can handle everything from data mining to website construction, R is a domain-specific language, developed with statisticians in mind.<\/p>\n<p>Because of this, R is known for providing statistical and graphical techniques, including linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, and clustering. In addition, since it\u2019s closely tied to academia, packages usually exist for new research methods, keeping R on the cutting edge and making it great for use in data science. In fact, <span><a href=\"https:\/\/www.infoworld.com\/article\/2940864\/application-development\/r-programming-language-statistical-data-analysis.html\" target=\"_blank\" rel=\"noopener noreferrer\">many popular machine learning algorithms are implemented in R.<\/a><\/span><\/p>\n<p>In the aforementioned TIOBE Index, R ranked 16th among programming languages as of December 2018, compared to Python\u2019s third-place ranking. As mentioned, however, R was developed with a specific audience in mind, whereas Python offers broader flexibility, which may account for some of the difference in popularity as measured by TIOBE.<\/p>\n<p><a href=\"https:\/\/www.datacamp.com\/community\/tutorials\/r-or-python-for-data-analysis\" target=\"_blank\" rel=\"noopener noreferrer\">DataCamp breaks down the differences between Python and R further,<\/a> saying that &#8220;R focuses on better user-friendly data analysis, statistics, and graphical models,&#8221; whereas &#8220;Python emphasizes productivity and code readability.&#8221; The usability of R versus the flexibility of Python may seem to put the two languages in competition with one another, but the fact is that they\u2019re both useful. 
The choice of language simply depends on its intended purpose.<\/p>\n<p>By that measure, one data editor says that <a href=\"https:\/\/qz.com\/1063071\/the-great-r-versus-python-for-data-science-debate\/\" target=\"_blank\" rel=\"noopener noreferrer\">Python is better used for repeated tasks such as data manipulation, while R is good for exploring datasets on an ad hoc basis.<\/a><\/p>\n<h2>Which is better for a data scientist: Python or R?<\/h2>\n<p>Python and R have a lot of strengths in common. Both have active communities online, as evidenced by dedicated mailing lists, knowledgeable Stack Overflow users, and user-contributed documentation. The two are comparable in overall usage, with 52 percent of data scientists using <a href=\"https:\/\/www.oreilly.com\/ideas\/2016-data-science-salary-survey-results\" target=\"_blank\" rel=\"noopener noreferrer\">R<\/a> (vs. 54 percent for Python). And both are free, open source, and extensible, giving them flexibility and increased usability across disciplines.<\/p>\n<p>Python, however, stands out for the following features:<\/p>\n<ul>\n<li>An extensive standard library: <a href=\"https:\/\/www.oracle.com\/technetwork\/articles\/piotrowski-pythoncore-084049.html\" target=\"_blank\" rel=\"noopener noreferrer\">Python covers areas like string processing, internet protocols, software engineering, and operating system interfaces<\/a><\/li>\n<li>High stability: new, stable Python releases are issued roughly every 18 months<\/li>\n<li>Ease of use: Python\u2019s simple syntax and readability make code easy to understand and maintain<\/li>\n<\/ul>\n<p>R is known for the following strengths:<\/p>\n<ul>\n<li>High-quality plots: publication-quality plots (including mathematical symbols and formulae) can be produced easily<\/li>\n<li>Vast package ecosystem: R is readily extensible, and packages exist for most statistical techniques<\/li>\n<\/ul>\n<p>This isn\u2019t to say that these languages don\u2019t have their weaknesses, 
and it\u2019s important to note that both languages share similar strengths. Python, despite its booming presence in the data science field, can handle some, but certainly not all, advanced manipulation. And experts in R cite memory, speed, and efficiency among its top weaknesses.<\/p>\n<p>Overall, neither programming language is truly better for data science; it all depends on the functionality the user needs. Specifically, if you\u2019re considering one or the other, you should ask yourself these questions:<\/p>\n<ul>\n<li>What is the problem you\u2019re looking to solve?<\/li>\n<li>Is it statistics-heavy, or could it be tackled in a different manner?<\/li>\n<li>What kind of learning curve are you prepared to take on? (Keep in mind that R has a steeper learning curve than Python.)<\/li>\n<\/ul>\n<p>If you\u2019re looking to extend your work into data analysis using a popular programming language that can also be used outside the data science field, Python is a good place to start.<\/p>\n<p><span>Citation for this content:\u00a0<a href=\"https:\/\/datascience.berkeley.edu\/\">datascience@berkeley, the online Master of Information and Data Science from UC Berkeley<\/a><\/span><\/p>\n<\/div>\n<p><a href=\"https:\/\/datascience.berkeley.edu\/blog\/python-data-science\/\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: datascience@berkeley Staff Programming languages that build the apps, programs and environments you use are sophisticated and, according to the TIOBE Index, there are more [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2019\/03\/29\/python-for-data-science\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":1949,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[26],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/1948"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=1948"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/1948\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/1949"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.p
hp\/wp-json\/wp\/v2\/media?parent=1948"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=1948"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=1948"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}