There are a few things that we always do when we look at a dataset. We want to know
- how many records do we have
- what are the column headers
- how many variables are we dealing with
- are there any missing values
- do we need to index the columns
- how do the stats look, like minimum, maximum and average
- how many unique values of a column
- how many rows of a given value
- a sample of records
There are many, many more basic things that we want to know before we get down to data exploration and analysis.
It would be nice to have a tool that accepts a CSV file and emits some of the information to the console. We can then take it forward from there. So now we know where we are at with the data on hand.
A wrapper for Pandas
We know that we are going to work with pandas package for data analysis. However, we write the same boilerplate code everytime we work with a new CSV file.
We are going to look at a way to reduce the startup time for setup and fetch the basic information from our dataset.
We will build a wrapper class around the pandas dataframe. So, let's jump right in.
The constructor accepts the file to begin with. If the file is not available at the path, then the program terminates with a message. Otherwise, the object is initialised with a pandas dataframe using the dataset in the CSV file.
def __init__(self, filepath):
if not os.path.exists(filepath):
print(f'file {filepath} does not exist.')
os._exit(1)
self.filepath = filepath
self.df = pd.read_csv(self.filepath)
The file path is passed as a parameter to the init() method. Next, we check to see if the file exists at that location. If it is not present we quit with an error message.
If the file exists at the location specified, a dataframe is obtained by calling the _readcsv() method of the pandas object. Make sure you have imported pandas first.
We provide a function getDataframe() to return the dataframe if necessary.
The shape property of the dataframe holds the information of rows and columns of the dataset. The function getDimension() returns it as a tuple.
We can get the information regarding the columns from the function getHeaders(). It returns the column names as a list object. The columns are indexed so that we could perform operations based on column index.
Pandas provides a describe() function for the basic statistics of a dataset such as mean, average, max, min and so on. The function getStats() returns this information.
getMissingValues() returns information regarding missing values in the dataset. It returns 0 if all numeric fields are filled.
To see a snapshot of the data we call the function getSampleData() . It returns the first and last five rows separated by a dotted line.
Excel provides pivot tables for aggregate metrics on columns. Pandas can do the same with pivot_table function. Our wrapper class has a function getColumnAggregates(). It accepts two parameters - the column name and the aggregate requested.
Here's the entire class with a sample CSV file of the weather data. You may swap in your own dataset and test it.
Let me know your feedback here or DM me at anand_b@twitter.com.
import os
import pandas as pd
class BasicCSV:
def __init__(self, filepath):
if not os.path.exists(filepath):
print(f'file {filepath} does not exist.')
os._exit(1)
self.filepath = filepath
self.df = pd.read_csv(self.filepath)
def getDataframe(self):
''' returns the dataframe '''
return self.df
def getDimension(self):
''' returns rows and columns as a tuple '''
return self.df.shape
def getHeaders(self):
''' returns indexed column names '''
return list(enumerate(self.df.columns.to_list()))
def getStats(self):
''' returns mean, min, max and other values of data '''
return self.df.describe()
def getSampleData(self):
''' returns the first and last five rows of data '''
head = self.df.head()
tail = self.df.tail()
sep = '...... '
return (head, sep, tail)
def show(self, row_list, col_list):
''' returns data slice based on rows and columns specified '''
return self.df.iloc[row_list, col_list]
def getColumnAggregates(self, name, aggregate):
''' returns the aggregate grouped by column name '''
return self.df.pivot_table(index = [name], aggfunc=aggregate)
def getMissingValues(self):
''' returns a total of all missing values '''
return self.df.isnull().sum().sum()
def __repr__(self):
return self.df.to_string()
def showHelp(self):
print('This class supports the following functions.\n')
print(f'getDataframe()\n\t{self.getDataframe.__doc__}')
print(f'getDimension()\n\t{self.getDimension.__doc__}')
print(f'getHeaders()\n\t{self.getHeaders.__doc__}')
print(f'getSampleData()\n\t{self.getSampleData.__doc__}')
print(f'getMissingValues()\n\t{self.getMissingValues.__doc__}')
print(f'getColumnAggregates()\n\t{self.getColumnAggregates.__doc__}')
print(f'getStats()\n\t{self.getStats.__doc__}')
print(f'show()\n\t{self.show.__doc__}')
print(f'\nTest run\nfrom basic_csv import BasicCSV\nfile = BasicCSV("{self.filepath}")\nfile.info()\n')
def info(self):
''' display basic info '''
print(self.getDimension())
print(self.getHeaders())
print(self.getSampleData())
print(self.getMissingValues())
print(self.getStats())
self.showHelp()
data = BasicCSV('weather.csv')
data.info()
Exercise
Enhance the class to display as much general information as possible regarding any dataset.
Happy coding. Enjoy.