Read CSV in Python and Pandas

Baby steps to Data Science

In this post we will read data from a CSV file in two ways:

  • Python's CSV reader
  • Pandas CSV reader

Reading a CSV file is most often the first step in a data science project.

A CSV file can also be opened in a spreadsheet application like Microsoft Excel or Google Sheets. These applications provide excellent support for data analysis, but when it comes to automating your work and scaling to large datasets, Python is one of the best tools for Exploratory Data Analysis (EDA), the bedrock of data science.

CSV without Pandas

Python has a csv module. All you need to do is to import it in your program. Nothing to install.

The steps to read CSV in Python:

  1. Import csv module
  2. Open file to read using Python's file open() function
  3. Call the reader() function in the csv module
  4. Each row from the reader object can then be converted to a Python dictionary.
  5. Better yet, use DictReader to build a list of dictionaries, one for each row of data

Here's the code to do that. Notice that we put in a check to see if the CSV file contains a header. This is useful for skipping the header when reading in the data.

The CSV file we are going to use consists of weather data downloaded from the link here.

CSV reader

After the import, csv.reader() is called with an open file object (not the file name) as its argument. The reader is an iterator: each call to next() on the reader object returns the following row as a list of strings.

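import csv

csv_file = './weather.csv'
# csv.reader needs an open file object, not the file name
with open(csv_file, 'r', newline='') as file:
    # ',' is the default delimiter; pass '\t' for tab-separated data
    csv_data = csv.reader(file, delimiter=',')
    header = next(csv_data)  # the first call to next() returns the first row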

The delimiter option specifies the character that separates the column values, for example a tab ('\t') or a space. The default is a comma.
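To make the delimiter option concrete, here is a minimal sketch that reads tab-separated text. The sample data and the names tsv_text, tsv_reader, tsv_header, first_row and remaining are made up for illustration; io.StringIO simply stands in for an open file.

```python
import csv
import io

# Hypothetical tab-separated sample standing in for a real file
tsv_text = (
    'date\tcity\ttemp_c\n'
    '2021-01-01\tOslo\t-3.5\n'
    '2021-01-02\tCairo\t19.0\n'
)

# io.StringIO behaves like an open text file
tsv_reader = csv.reader(io.StringIO(tsv_text), delimiter='\t')

tsv_header = next(tsv_reader)   # first call to next() returns the first row
first_row = next(tsv_reader)    # second call returns the following row
remaining = list(tsv_reader)    # whatever is left in the iterator

print(tsv_header)
print(first_row)
print(remaining)
```

Every value comes back as a string, so numeric columns need an explicit conversion before any arithmetic.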

Place the CSV file in the same folder as the Python program. If you want to load the file from another location, provide the path to it.

In the code below, we read in the file, then perform the following operations.

  1. Create a reader object
  2. Check if the file has a header row by passing a sample of its text to the has_header() method of the Sniffer class in the csv module
  3. Move the iterator pointer to the first row using next
  4. Read the header row
  5. Read each row from the reader object and store in a list.
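  6. Outside the file operation, index the header using Python's enumerate function

import csv

csv_file = './weather.csv'
header = []
rows = []
data = dict()
data_list = []
# newline='' lets the csv module handle line endings itself
with open(csv_file, 'r', newline='') as file:
    csv_data = csv.reader(file)
    # has_header() is a method: it needs a sample of the file's text
    has_header = csv.Sniffer().has_header(file.read(2048))
    file.seek(0)  # rewind so the reader starts from the first line
    if has_header:
        header = next(csv_data)
    for row in csv_data:
        rows.append(row)
# No file.close() needed: the with block closes the file for us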

header_indexes = list(enumerate(header))
print(f'Column headers: {header_indexes}\n')
print(f'Number of rows : {len(rows)}\n')
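A note on the header check: has_header is a method of csv.Sniffer that takes a sample of the file's text and returns a heuristic guess, so it only gives a meaningful answer when it is called with that sample. The sketch below tries it on two small in-memory samples. The city names and temperatures are made up for illustration, and since the check is a heuristic it can guess wrong on unusual data.

```python
import csv

# Hypothetical samples: the first has a header row, the second does not
with_header = 'city,temp_c\nOslo,-3.5\nCairo,19.0\nLima,18.2\n'
without_header = 'Oslo,-3.5\nCairo,19.0\nLima,18.2\nQuito,14.0\n'

sniffer = csv.Sniffer()

# The sniffer compares the first row against the rows that follow it:
# a first row of text above a column of numbers looks like a header
found_header = sniffer.has_header(with_header)
found_no_header = sniffer.has_header(without_header)

print(f'with_header sample has a header: {found_header}')
print(f'without_header sample has a header: {found_no_header}')
```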

Our file does have a column header. We index it using Python's enumerate function.

header_indexes = list(enumerate(header))

We will use column index to fetch data from the rows.
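As a rough sketch of what fetching by column index can look like, the example below builds a name-to-position lookup from the header and pulls one column out of the rows. The header names, the row values and the column_index and temps variables are hypothetical; they only mimic the shape of the header and rows lists built earlier.

```python
# Hypothetical header and rows, shaped like the lists built by csv.reader
header = ['date', 'city', 'temp_c']
rows = [
    ['2021-01-01', 'Oslo', '-3.5'],
    ['2021-01-02', 'Oslo', '-1.0'],
    ['2021-01-03', 'Oslo', '0.5'],
]

# Map each column name to its position in a row
column_index = {name: i for i, name in enumerate(header)}

# csv.reader yields strings, so convert before doing any arithmetic
temp_col = column_index['temp_c']
temps = [float(row[temp_col]) for row in rows]
average_temp = sum(temps) / len(temps)

print(f'Average temperature: {average_temp:.2f}')
```

Looking columns up by name through column_index keeps the code readable even if the column order in the file changes.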

We store the data from the CSV file in a dictionary. We could of course run our little program while reading the file line by line, but as the program grows in complexity and you have to go through the data multiple times, a dictionary comes in handy.

We have two options here:

  • manually create a dictionary
  • use csv module's DictReader

Create a dictionary manually

for row in rows:
    data = dict(zip(header, row))
    data_list.append(data)

The manual operation shown here may be useful for beginners who want to understand what happens under the hood, but the real power of the csv module comes from the DictReader class.

CSV DictReader

Build a list of dictionary objects for every row in the CSV file. The dictionary object has header names for keys and column data as values. Display first two rows.

import csv

csv_file = './weather.csv'

# csv DictReader
print('\nreading from DictReader\n')
with open(csv_file, 'r', newline='') as file:
    dict_reader = csv.DictReader(file)
    # Print only the first two data rows
    for i, row in enumerate(dict_reader):
        if i == 2:
            break
        print(row)

print()
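The listing above only prints the first two rows. To actually build the list of dictionaries, the DictReader can be passed straight to list(). This is a minimal sketch on made-up in-memory data; csv_text, data_list and cities are illustrative names, and io.StringIO stands in for the open file.

```python
import csv
import io

# Hypothetical sample standing in for the contents of a CSV file
csv_text = (
    'date,city,temp_c\n'
    '2021-01-01,Oslo,-3.5\n'
    '2021-01-02,Cairo,19.0\n'
)

with io.StringIO(csv_text) as file:
    # DictReader takes the keys from the first row and
    # yields one dictionary per remaining row
    data_list = list(csv.DictReader(file))

print(data_list[0])

# Columns are now reachable by name instead of by position
cities = [row['city'] for row in data_list]
print(cities)
```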

CSV with Pandas

Pandas, a third-party package used extensively in data science, is a powerful tool to analyse and visualise data. In combination with NumPy, scikit-learn and other packages like Seaborn, Pandas provides an end-to-end toolset for machine learning projects.

In order to use Pandas we will need to install it first.

Install Pandas

pip install pandas

pip is Python's package installer. For best results, update pip before installing any package.

# Using Pandas
import pandas as pd

csv_file = './weather.csv'
df = pd.read_csv(csv_file)

header = df.columns
header_indexes = list(enumerate(header))
print(header_indexes)

rows = len(df)
print(f'rows = {rows}')

# Start the table on its own line so the columns stay aligned
print(f'display first 2 rows after header:\n{df.head(2)}')
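For comparison with the csv module, here is a small sketch of the same ideas in Pandas on made-up in-memory data: read_csv infers column types, a column is selected by name, and to_dict(orient='records') gives a list of dictionaries much like the one built with DictReader. The sample values and the names csv_text, mean_temp and records are assumptions for illustration, and the example needs Pandas installed.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the contents of a CSV file
csv_text = (
    'date,city,temp_c\n'
    '2021-01-01,Oslo,-3.5\n'
    '2021-01-02,Cairo,19.0\n'
)

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# Numeric columns are parsed as numbers, no manual float() needed
mean_temp = df['temp_c'].mean()
print(f'mean temperature = {mean_temp}')

# One dictionary per row, similar to the csv.DictReader result
records = df.to_dict(orient='records')
print(records[0])
```

Unlike csv.DictReader, where every value is a string, the temp_c values in records are already floats.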