Seven Python Tools All Data Scientists Should Know How to Use
Published: 2019-06-10


If you’re an aspiring data scientist, you’re inquisitive – always exploring, learning, and asking questions. Online tutorials and videos can help prepare you for your first role, but the best way to ensure that you’re ready to be a data scientist is by making sure you’re fluent in the tools people use in the industry.

I asked our data science faculty to put together seven Python tools that they think all data scientists should know how to use. Our programs focus on making sure students spend ample time immersed in these technologies, and investing the time to gain a deep understanding of these tools will give you a major advantage when you apply for your first job. Check them out below:

IPython is a command shell for interactive computing in multiple programming languages, originally developed for the Python programming language, that offers enhanced introspection, rich media, additional shell syntax, tab completion, and rich history. IPython provides the following features:

  • Powerful interactive shells (terminal and Qt-based)
  • A browser-based notebook with support for code, text, mathematical expressions, inline plots and other rich media
  • Support for interactive data visualization and use of GUI toolkits
  • Flexible, embeddable interpreters to load into one’s own projects
  • Easy to use, high performance tools for parallel computing
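
To give a concrete feel for the workflow, here is a minimal sketch of an IPython session (output omitted; my_script.py is a hypothetical file used only for illustration):

    import math
    # '?' introspection prints the signature and docstring of any object:
    math.sqrt?
    # magic commands add shell-style conveniences; %timeit runs quick micro-benchmarks:
    %timeit sum(range(1000))
    # %run executes a script inside the interactive namespace so its variables stay available:
    %run my_script.py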

Contributed by the Director of Science, Head of Galvanize Experts

GraphLab Create is a Python library, backed by a C++ engine, for quickly building large-scale, high-performance data products.

Here are a few of the features of GraphLab Create:

  • Ability to analyze terabyte scale data at interactive speeds, on your desktop
  • A single platform for tabular data, graphs, text, and images
  • State of the art machine learning algorithms including deep learning, boosted trees, and factorization machines
  • Run the same code on your laptop or in a distributed system, using a Hadoop YARN or EC2 cluster
  • Focus on tasks or machine learning with the flexible API
  • Easily deploy data products in the cloud using Predictive Services
  • Visualize data for exploration and production monitoring
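
As a rough sketch from memory of the GraphLab Create API (so treat the exact function names here as assumptions rather than a definitive reference), loading data into an SFrame and fitting a model looks roughly like this:

    import graphlab as gl

    # Build a small in-memory SFrame; gl.SFrame.read_csv() handles files
    # larger than RAM by spilling to disk.
    sf = gl.SFrame({"sqft": [1000, 1500, 2000], "price": [200, 280, 360]})

    # Fit a model with one create() call; similar create() functions exist
    # for boosted trees, factorization machines, and deep learning.
    model = gl.linear_regression.create(sf, target="price", features=["sqft"])
    print(model.predict(sf))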

Contributed by the Lead Data Science Instructor at Galvanize

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like R.

Combined with the excellent IPython toolkit and other libraries, the environment for doing data analysis in Python excels in performance, productivity, and the ability to collaborate. pandas does not implement significant modeling functionality outside of linear and panel regression; for this, look to statsmodels and scikit-learn. More work is still needed to make Python a first class statistical modeling environment, but we are well on our way toward that goal.
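
A minimal sketch of that workflow with pandas, using hypothetical sales data:

    import pandas as pd

    # Hypothetical data; in practice this usually comes from pd.read_csv() or a database.
    df = pd.DataFrame({
        "city":  ["NYC", "SF", "NYC", "SF"],
        "sales": [10, 20, 15, 25],
    })

    # Vectorized arithmetic and split-apply-combine in a few lines:
    df["sales_k"] = df["sales"] / 1000.0
    summary = df.groupby("city")["sales"].agg(["mean", "sum"])
    print(summary)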

Contributed by the Director of Science, Head of Galvanize Experts

Linear Programming is a type of optimisation where an objective function should be maximised given some constraints. PuLP is a Linear Programming modeler written in Python. PuLP can generate LP files and call highly optimized solvers (GLPK, COIN CLP/CBC, CPLEX, and GUROBI) to solve these linear problems.
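
As an illustration, a toy linear program with made-up numbers can be stated and solved in a few lines using PuLP's bundled default solver:

    from pulp import LpMaximize, LpProblem, LpVariable, LpStatus, value

    # Maximize 3x + 2y subject to x + y <= 4 and x <= 2, with x, y >= 0.
    prob = LpProblem("toy_example", LpMaximize)
    x = LpVariable("x", lowBound=0)
    y = LpVariable("y", lowBound=0)
    prob += 3 * x + 2 * y      # the first expression added is the objective
    prob += x + y <= 4         # constraints are added the same way
    prob += x <= 2
    prob.solve()               # default solver; an installed GLPK or CPLEX can be passed instead
    print(LpStatus[prob.status], value(x), value(y), value(prob.objective))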

Contributed by a Data Science Instructor at Galvanize

matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. matplotlib can be used in Python scripts, the Python and IPython shells (à la MATLAB® or Mathematica®), web application servers, and six graphical user interface toolkits.

matplotlib tries to make easy things easy and hard things possible. You can generate plots, histograms, power spectra, bar charts, error charts, scatter plots, etc., with just a few lines of code.

For simple plotting, the pyplot interface provides a MATLAB-like interface, particularly when combined with IPython. For the power user, you have full control of line styles, font properties, axes properties, etc., via an object-oriented interface or via a set of functions familiar to MATLAB users.
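
A minimal sketch using the object-oriented interface described above (the output filename demo.png is arbitrary):

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(0, 2 * np.pi, 200)
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.plot(x, np.sin(x), label="sin(x)")    # line plot with a legend entry
    ax1.legend()
    ax2.hist(np.random.randn(1000), bins=30)  # histogram of random noise
    ax2.set_title("histogram")
    fig.savefig("demo.png", dpi=150)          # or plt.show() in an interactive session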

Contributed by the Chief Science Officer at Galvanize

Scikit-learn is a simple and efficient tool for data mining and data analysis. What is so great about it is that it’s accessible to everybody and reusable in various contexts. It is built on NumPy, SciPy, and matplotlib. Scikit-learn is also open source and commercially usable (BSD license). Scikit-learn has the following features:

  • Classification – Identifying which category an object belongs to
  • Regression – Predicting a continuous-valued attribute associated with an object
  • Clustering – Automatic grouping of similar objects into sets
  • Dimensionality Reduction – Reducing the number of random variables to consider
  • Model Selection – Comparing, validating and choosing parameters and models
  • Preprocessing – Feature extraction and normalization
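
A minimal sketch of the fit/predict pattern, assuming a reasonably recent scikit-learn and its bundled iris dataset:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)                        # every estimator exposes fit()/predict()
    print(accuracy_score(y_test, clf.predict(X_test)))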

Contributed by a Data Science Instructor at Galvanize

Spark consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster. The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.

A second abstraction in Spark is shared variables that can be used in parallel operations. By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. Sometimes, a variable needs to be shared across tasks, or between tasks and the driver program. Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only “added” to, such as counters and sums.
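
A minimal local-mode sketch with PySpark, showing a toy RDD together with both kinds of shared variable:

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "rdd-demo")     # local mode with 2 worker threads

    lookup = sc.broadcast({"a": 1, "b": 2})       # read-only value cached on every node
    misses = sc.accumulator(0)                    # add-only counter aggregated on the driver

    def score(word):
        if word not in lookup.value:
            misses.add(1)
            return 0
        return lookup.value[word]

    words = sc.parallelize(["a", "b", "c", "a"])  # toy RDD; sc.textFile() builds one from HDFS or local files
    total = words.map(score).reduce(lambda x, y: x + y)
    print(total, misses.value)                    # 4 1
    sc.stop()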

Contributed by the Lead Data Science Instructor at Galvanize


Reposted from: https://www.cnblogs.com/yymn/p/4652377.html
