What is Pandas?

Pandas is an open-source library for the Python programming language. The name Pandas is from panel data, a statistical term for data with an empirical context. With Pandas, developers can import data from a variety of sources and create an object in Python. 

The objects that Pandas creates are called DataFrames. A DataFrame is a two-dimensional set of tabular data, much like a table in a relational database or a sheet in an Excel file. DataFrame objects are like any other object in Python, so you can manipulate them with Python syntax or with the methods included in the Pandas library.

How Do You Install Pandas?

Pandas is open-source and free, which means that the installation process is similar to many other Python libraries. The library comes bundled with Anaconda, a Python distribution intended for statistical and scientific computing.

In Anaconda, Pandas installation requires the following command:

conda install pandas

Pip users can install with this syntax:

pip install pandas

While Debian and Ubuntu support this command-line install:

sudo apt-get install python3-pandas

Pandas is dependent on the NumPy library, which supports multi-dimensional matrices.

How Do You Load Data in Pandas?

Pandas is extremely versatile in terms of the data it can import. To import any file, you must use the appropriate read command. The most common file types are

  • CSV: pandas.read_CSV() allows you to specify the separator, with commas as the default.
  • SQL: Pandas supports several types of SQL import, such as pandas.read_sql_table() which imports an SQL table and pandas.read_sql_query() which imports the results of an SQL query
  • Excelpandas.read_Excel() supports arguments to specify the sheet name and which columns to use.
  • JSONpandas.read_json allows you to specify orients within the JSON, or you can normalize the object with pandas.json_normalize().
  • HTML: Read a HTML table into memory with pandas.read_HTML()
  • Parquet: pandas.read_partquet() allows you to read from a columnar Parquet table and specify your preferred engine in the arguments
  • Clipboard data: Pandas can read directly from the data with pandas.read_clipboard()

All of the import commands will create a new DataFrame containing data from the source file. You can also create a new DataFrame from a Python source, such as a dictionary or a NumPy matrix, using the command pandas.DataFrame()

What Are the Main Features of Pandas?

Pandas is a large library with some useful functions and features. A few of the more popular options include: 

1. Multi-format Input

As well as the options listed above, Pandas supports data imports from fixed-width format files, feather format, ORC objects, Stata files, and data from Google Big Query. 

2. Data Cleansing

Pandas includes a few native commands that you can use to improve the quality of data within a DataFrame. For example, the command pandas.dataframe.isnull() will return all null values within a DataFrame. You can use this to weed out blank or incomplete records in the data set.

The replace command can also help to harmonize data in a set. For example, in a list of addresses where you want to standardize the state names, you could use commands like pandas.dataframe.replace([‘TX'],['Texas']) to perform a find & replace on all matching values. 

3. Hierarchical Indexing

A popular feature in Pandas is multi-indexing or hierarchical indexing. This allows you to use multiple columns as a single index within a DataFrame. The syntax for setting up such an index looks like this:

pandas.dataframe.setindex(['Column1', 'Column2'], inplace=True)

This feature provides you with a two-column index, allowing for more complex queries. This can also return hierarchical results, first sorted by Column1 and then sorted by Column2. 

Multi-indexing can also improve performance, but only if you sort the DataFrame after indexing. You perform this operation with 

pandas.Index.sort()

4. Joining and Merging

Pandas can combine two DataFrames into a single object. You can achieve this by using the following commands:

  • pandas.dataframe.append(DataFrame2): This command appends one data frame to the end of another. If the second DataFrame has columns that do not appear in the first, then Pandas will create new columns.
  • pandas.dataframe.concat(DataFrame1, DataFrame2): This command is a horizontal merge. The function adds the second object to the first as a series of new columns.
  • pandas.dataframe.join(DataFrame2, on=Column1, how=’inner/outer/left/right'): This command replicates an SQL Join, allowing you to merge tables with a shared column. As with an SQL Join, you can specify this as an inner, left, right, or outer join. 

The result of these commands is a single DataFrame object containing data from both sources.

5. Slicing and Series

Pandas includes a number of functions that can slice DataFrames and break them into subsets. The most basic command is .loc, which allows you to select values starting from a specified point.

The simplest form of this command is pandas.DataFrame.loc[‘rowname’, 'columnname']. This returns the value of a specific cell. By adding further arguments, you can slice the DataFrame with great accuracy.

.loc can also produce a series, which is essentially a single column, or a one-dimensional data array. You can then operate on these objects using the pandas.Series commands.

6. Styling

You can use Pandas to add styling options to a DataFrame, which can help to make some of the results more readable. 

Styling involves the pandas.Styler constructor, along with some elements of CSS. These styles can apply in a uniform fashion to values, such as applying a standard format to cash values, like $#.##.

Pandas also permits conditional values, including styles for maximum and minimum values, which allows you to create easy to read reports. Styler.export() allows you to export your current styles, which you can then import into another frame.  

7. Masking

Pandas includes a simple function for data obfuscation or data maskingpandas.DataFrame.mask() if you need to hide sensitive data values held in the DataFrame object

Valid conditions for masking are Boolean series, array-like or callable. There's also the option to perform the masking in-place on the data. You can also specify an alignment level or alignment axis if required.

Share This Article
facebook linkedin twitter

Glossary of Terms

A guide to the nomenclature of data integration technology.