Exploring Datasets with Pandas Inbuilt Functionality

Tags: python, data analysis, machine learning


Nearly every data scientist uses Pandas, the famous data-manipulation library built on top of the Python programming language. It is a powerful Python package that makes importing, cleaning, analyzing, and exporting data easier.

In a nutshell, Pandas is like Excel for Python, with tables (which in pandas are called DataFrames), rows and columns (which in pandas are called Series), and many functionalities that make it an awesome library for data processing, inspection, and manipulation.

Here are some of the great insights and hacks in pandas that make data analysis more fun and handy.

import pandas as pd
import numpy as np  #needed for np.nan in the example dataframe below

While reading a dataframe, we often face the problem that the complete set of rows is not visible, which makes analyzing the data quite difficult. Pandas provides the "set_option" function, which lets us define the maximum number of rows and columns to be displayed.

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
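These options can be reverted at any time with reset_option:

pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')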

1. Importing the Data

Here are the different formats of data that can be imported using the pandas read functionality.

CSV, Excel, HTML, binary files, Pickle files, JSON files, SQL queries
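Each of these formats has a corresponding pd.read_* reader. As a quick sketch (the file names and the connection object here are only placeholders):

pd.read_excel("data.xlsx") #Excel
pd.read_html("https://example.com/table.html") #HTML tables, returns a list of DataFrames
pd.read_pickle("data.pkl") #Pickle files
pd.read_json("data.json") #JSON
pd.read_sql("SELECT * FROM cars", connection) #SQL query, given an open database connection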

The format most commonly used for machine learning is CSV (comma-separated values). Every data scientist encounters CSV files on a daily basis, so we will restrict the discussion to CSV files.

Some of the important keyword arguments when reading a CSV file in pandas are:

  • delimiter: a blank space, comma, or other character or symbol that separates different cells in a row

  • header: the row to be used as the column names

  • index_col: the column(s) to be used as row labels

  • usecols: the names of the columns to read; if provided, only that subset of the file is loaded

  • skiprows: the number of lines to skip at the start of the file, useful when a file has blank lines or unnecessary content

pd.read_csv("data.csv", delimiter=",", header=0, names=None, index_col=None,
            usecols=None, skiprows=None, parse_dates=False, keep_date_col=False,
            chunksize=None)

Once the data is imported, it is held in what pandas calls a DataFrame.

2. Understanding the Data

df = pd.DataFrame({
    "Company": ["Kia", "Hyundai", "Hyundai", "Hyundai", "Hyundai", "Honda", "Honda", "Honda", "Honda", "Kia"],
    "Segment": ["premium", "budget", "luxury", "premium", "budget", "premium", "budget", "budget", "premium", "luxury"],
    "Type": ["large", "small", "large", "small", "small", "large", "small", "small", "large", "large"],
    "CrashRating": [4.5, 2.5, 4, np.nan, 3, 4, 3, 4.2, 4.5, 4.2],
    "CustomerFeedback": [9, 7, 5, 5, 8, 5.6, np.nan, 9, 9, 4.8]})

We can check the first n or the last n rows of the raw data by using the functions head(n) and tail(n):

df.head(5) #here 5 is first 5 items in the dataframe
df.tail(5) #displays the bottom 5 rows of dataframe

This data is in a table format, the same as visualized in Excel or any other CSV reader. To interact more with the data, let's see some useful inbuilt functions.

  • info: This function provides a summary of the dataframe: the number of rows and columns, and the name of each column along with its count of non-null values and its data type

df.info()
  • describe: used to analyze the data statistically, and thus only returns results for numerical columns in the dataframe. Returns a table comprising the count, mean, standard deviation, minimum, maximum, and quantile values, which are useful for detecting outliers and seeing the distribution of the data.

df.describe()
  • memory_usage: used to understand the memory usage of each column in bytes

df.memory_usage()
  • dtypes: to analyze the data type of each column within the dataframe. Returns a Series with the data type of each column

df.dtypes
  • isnull or isna: used to find missing values in the dataframe. Used on its own it returns a boolean mask (True where a value is NA), and we can use it in multiple ways:

df.isnull().sum() #count the number of missing values in each column
df.isnull().mean()*100 #return the percentage of missing values in each column
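The same boolean mask can also be used to pull out the incomplete rows themselves:

df[df.isnull().any(axis=1)] #rows containing at least one missing value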
  • unique: used when we need the distinct values in one pandas Series (i.e. in one specific column of the dataframe); the companion nunique returns how many there are. Generally used to analyze categorical columns.

df['Company'].unique() #the distinct values in the column
df['Company'].nunique() #how many distinct values there are
  • shape: returns the dimensionality of the dataframe as a (rows, columns) tuple

df.shape

3. Exploring the Data

Now that we have loaded our data into a DataFrame and understood its structure, let's select and inspect parts of the data. When it comes to selecting your data, you can do it either with indexes or based on certain conditions. In this section, let's go through each of these methods and do some exploratory analysis.

  • Selecting the Columns

The set of columns we need to analyze can be selected in the following ways:

df[['Company', 'Type']]
df.loc[:,['Company', 'Type']]
df.iloc[:,[0,1]]
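As a quick aside, single and double brackets return different things here:

df['Company'] #single brackets return a Series
df[['Company']] #double brackets return a one-column DataFrame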
  • Selecting the Rows

Selecting specific rows for analysis can be achieved in the following manner:

df.iloc[[1,2], :]
df.loc[[1,2], :]
  • Selecting columns of a specific type

Sometimes it is helpful to select the subset of columns having a specific data type; the select_dtypes function can be used for this:

df.select_dtypes(include=['object'])
  • Selecting both rows and columns

You might wonder whether pandas is so weak that only one axis can be indexed at a time, either rows or columns. No: we can select a subset of rows and columns in a single expression.

df.iloc[0:2][['Segment', 'Type']]
df.iloc[0:2,1:3]
  • Applying filters

Now, in a real-world scenario, selecting rows by their index positions is quite impractical. The actual requirement is usually to filter out the rows that satisfy a certain condition. With respect to our dataset, we can filter by conditions such as the following:

df[df['Type']=='large']
df[(df['Type']=='large') & (df['Segment']=='luxury')]
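For membership tests or more readable compound conditions, pandas also offers isin and query; a small sketch on the example dataframe:

df[df['Company'].isin(['Kia', 'Honda'])] #keep rows whose Company is in the list
df.query("Type == 'large' and Segment == 'luxury'") #the same filter as above, as a string expression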

4. Handling and Transforming the Data

After doing basic exploratory analysis on the data, it's time to handle missing values and transform the data to perform some more advanced exploration.

  • Missing data handling

Handling missing values is one of the trickiest and most crucial parts of data manipulation, because replacing the missing cells changes the distribution of the data. Depending on the characteristics of the dataset and the task, we can choose to:

  • Drop missing values: We can drop a row or column having missing values. A common rule of thumb is that a column with more than 40% missing values is dropped from the analysis entirely. Dropping a row eliminates the whole observation, thus reducing the size of the dataframe (see the dropna sketch after the fillna example below).

  • Replace missing values: Depending on the distribution of the column, we can replace the missing values with a special value or an aggregate such as the mean, the median, or some other dynamic value like the average of similar observations. For time-series data, missing values are generally filled from a window of values before and after the observation.

df['CrashRating'].fillna(df['CrashRating'].mean())
df.fillna(axis=0, method='ffill', limit=1)
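The dropping strategies from the first bullet look like this as a sketch (the 40% threshold is only illustrative):

df.dropna() #drop every row that has any missing value
df.dropna(axis=1, thresh=int(0.6 * len(df))) #drop columns with more than 40% missing values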
  • Drop columns or rows

Remove rows or columns by specifying label names and the corresponding axis, or by specifying index or column names directly.

Dropping a column is helpful with a labeled dataframe where we want to remove y_true from the training and test data.

df.drop(['CrashRating'], axis=1)
df.drop([0,1], axis=0)
  • Group By

In many situations, we split the data into sets, such as grouping records into buckets by categorical values, and apply some functionality to each subset. In the apply functionality, we can perform the following operations (see the sketch after the list):

  • Aggregation − computing a summary statistic
  • Transformation − performing some group-specific operation
  • Filtration − discarding data based on some condition
df.groupby(['Company','Segment']).mean()
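A minimal sketch of the three patterns on the example dataframe:

df.groupby('Company')['CrashRating'].agg(['mean', 'count']) #aggregation: one summary row per group
df.groupby('Company')['CustomerFeedback'].transform(lambda s: s - s.mean()) #transformation: result aligned with the original rows
df.groupby('Company').filter(lambda g: len(g) >= 4) #filtration: keep only companies with at least 4 records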
  • Pivot table

This is very similar to groupby: a pivot table is also composed of counts, sums, or other aggregations derived from a table of data. You may have used this feature in spreadsheets, where you choose the rows and columns to aggregate on, and the values for those rows and columns. It allows us to summarize data grouped by different values, including values in categorical columns.

The pivot table function in pandas takes certain arguments as input:

  • index, columns = the keys to group by on the rows and columns of the resulting table

  • values = the name of the column whose values are aggregated in the resulting table, grouped by the index and columns and aggregated according to the aggregation function

  • aggfunc = (aggregation function) how the rows are summarized, such as sum, mean, or count

df.pivot_table(index=['Company', 'Type'], columns=['Segment'], values=['CrashRating'], aggfunc='mean')
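pivot_table also accepts fill_value to replace empty cells and margins to append row/column totals; for example:

df.pivot_table(index='Company', columns='Segment', values='CrashRating', aggfunc='mean', fill_value=0, margins=True)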
  • Merge and Concatenation

When data is imported from multiple files into separate dataframes, it becomes necessary to concat, merge, or join those dataframes into one.

  • concat() — performs all the concatenation operations along an axis while performing optional set logic (union or intersection) of the indexes

df_new = pd.DataFrame({
    "Company": ["Hyundai", "Honda", "Honda", "Honda", "Kia"],
    "Segment": ["premium", "budget", "luxury", "luxury", "luxury"],
    "Type": ["large", "small", "large", "large", "large"],
    "CrashRating": [3.8, 3.5, 4, 4.2, 3],
    "CustomerFeedback": [8, 7, 7, 6, 7.5]})

df_result = pd.concat([df, df_new])
df_result = pd.concat([df, df_new], keys=['old', 'new'])
  • merge() — pandas has in-memory join operations very similar to those in relational databases such as SQL, where we use "join" to combine two tables on a common key

df_launchingyear = pd.DataFrame({
    "Company": ["Hyundai", "Honda", "Honda", "Honda", "Kia"],
    "LaunchingYear": [2015, 2018, 2017, 2012, 2019]})

pd.merge(df, df_launchingyear, on='Company')
  • Create dummy variables

Categorical variables whose type is 'object' cannot be used as-is for training an ML model; we need to create dummy variables for such columns using pandas' get_dummies function.

pd.get_dummies(df['Company'])
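get_dummies can also encode several columns of the dataframe at once, and drop_first=True removes one redundant level per column:

pd.get_dummies(df, columns=['Company', 'Segment'], drop_first=True)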

5. Saving a Dataframe

After performing exploratory analysis on the dataset, we may want to store the observations as a new CSV file: for example, a table returned by the pivot_table function, a dataframe with unnecessary details filtered out, or a new dataframe obtained after running concat or merge operations.

Exporting the results as a CSV file is a simple step, as we just need to call the to_csv() function with some arguments, which are the same as the ones we used while reading data from CSV.

df.to_csv('./data.csv', index_label=False)

Conclusion

In this article, we have listed some general pandas functions used to analyze the datasets we have gathered while working with Python and Jupyter Notebooks. We are sure these simple hacks will be of use to you and that you will take something away from this article. Till then, Happy Coding!

Let us know if you like the blog, please comment with any queries or suggestions, and follow us on LinkedIn and Instagram. Your love and support inspire us to share our learning in a much better way!

Translated from: https://medium.com/@datasciencewhoopees/exploring-datasets-with-pandas-inbuilt-functionality-6c322c0cdd7d

