A class for Pandas

There are a few things that we always do when we look at a dataset. We want to know

how many records do we have
what are the column headers
how many variables are we dealing with
are there any missing values
do we need to index the columns
how do the stats look, like minimum, maximum and average
how many unique values of a column
how many rows of a given value
a sample of records

There are many, many more basic things that we want to know before we get down to data exploration and analysis.

It would be nice to have a tool that accepts a CSV file and emits some of the information to the console. We can then take it forward from there. So now we know where we are at with the data on hand.

A wrapper for Pandas

We know that we are going to work with pandas package for data analysis. However, we write the same boilerplate code everytime we work with a new CSV file.

We are going to look at a way to reduce the startup time for setup and fetch the basic information from our dataset.

We will build a wrapper class around the pandas dataframe. So, let's jump right in.

The constructor accepts the file to begin with. If the file is not available at the path, then the program terminates with a message. Otherwise, the object is initialised with a pandas dataframe using the dataset in the CSV file.

def __init__(self, filepath):
        if not os.path.exists(filepath):
            print(f'file {filepath} does not exist.')
            os._exit(1)
        self.filepath = filepath
        self.df =  pd.read_csv(self.filepath)

The file path is passed as a parameter to the init() method. Next, we check to see if the file exists at that location. If it is not present we quit with an error message.

If the file exists at the location specified, a dataframe is obtained by calling the _readcsv() method of the pandas object. Make sure you have imported pandas first.

We provide a function getDataframe() to return the dataframe if necessary.

The shape property of the dataframe holds the information of rows and columns of the dataset. The function getDimension() returns it as a tuple.

We can get the information regarding the columns from the function getHeaders(). It returns the column names as a list object. The columns are indexed so that we could perform operations based on column index.

Pandas provides a describe() function for the basic statistics of a dataset such as mean, average, max, min and so on. The function getStats() returns this information.

getMissingValues() returns information regarding missing values in the dataset. It returns 0 if all numeric fields are filled.

To see a snapshot of the data we call the function getSampleData() . It returns the first and last five rows separated by a dotted line.

Excel provides pivot tables for aggregate metrics on columns. Pandas can do the same with pivot_table function. Our wrapper class has a function getColumnAggregates(). It accepts two parameters - the column name and the aggregate requested.

Here's the entire class with a sample CSV file of the weather data. You may swap in your own dataset and test it.

Let me know your feedback here or DM me at anand_b@twitter.com.

import  os
import  pandas as pd

class BasicCSV:

    def __init__(self, filepath):
        if not os.path.exists(filepath):
            print(f'file {filepath} does not exist.')
            os._exit(1)
        self.filepath = filepath
        self.df =  pd.read_csv(self.filepath)

    def getDataframe(self):
        ''' returns the dataframe '''
        return self.df

    def getDimension(self):
        ''' returns rows and columns as a tuple '''
        return self.df.shape

    def getHeaders(self):
        ''' returns indexed column names '''
        return list(enumerate(self.df.columns.to_list()))

    def getStats(self):
        ''' returns mean, min, max and other values of data '''
        return self.df.describe()

    def getSampleData(self):
        ''' returns the first and last five rows of data '''
        head = self.df.head()
        tail = self.df.tail()
        sep = '...... '
        return (head, sep, tail)

    def show(self, row_list, col_list):
        ''' returns data slice based on rows and columns specified '''
        return self.df.iloc[row_list, col_list]

    def getColumnAggregates(self, name, aggregate):
        ''' returns the aggregate grouped by column name '''
        return self.df.pivot_table(index = [name], aggfunc=aggregate)

    def getMissingValues(self):
        ''' returns a total of all missing values '''
        return self.df.isnull().sum().sum()

    def __repr__(self):
        return self.df.to_string()

    def showHelp(self):
        print('This class supports the following functions.\n')
        print(f'getDataframe()\n\t{self.getDataframe.__doc__}')
        print(f'getDimension()\n\t{self.getDimension.__doc__}')
        print(f'getHeaders()\n\t{self.getHeaders.__doc__}')
        print(f'getSampleData()\n\t{self.getSampleData.__doc__}')
        print(f'getMissingValues()\n\t{self.getMissingValues.__doc__}')
        print(f'getColumnAggregates()\n\t{self.getColumnAggregates.__doc__}')
        print(f'getStats()\n\t{self.getStats.__doc__}')
        print(f'show()\n\t{self.show.__doc__}')
        print(f'\nTest run\nfrom basic_csv import BasicCSV\nfile = BasicCSV("{self.filepath}")\nfile.info()\n')

    def info(self):
        ''' display basic info '''
        print(self.getDimension())
        print(self.getHeaders())
        print(self.getSampleData())
        print(self.getMissingValues())
        print(self.getStats())
        self.showHelp()

data = BasicCSV('weather.csv')
data.info()

Exercise

Enhance the class to display as much general information as possible regarding any dataset.

Happy coding. Enjoy.

Nan's Tech Blog

Nan's Tech Blog

A class for Pandas

Automate the boring stuff