Having to read data from a relatively large number of machine-generated (i.e. similarly named and formatted) csv files is a common task, and doing it in an organized and efficient manner can save you hours and hours of work.

So, here are a few steps and techniques I tend to use in these cases:

  1. Load files with a generator function
  2. Interact directly with the filesystem (no hardcoded filenames)
  3. Narrow down the data to the necessary amount
  4. Use regex for filtering and extracting information

1. Use Python generators

As a starting point, you can use pandas.read_csv() “manually” with a handful of files, but it can easily get out of control:

import pandas as pd

# Read csv files
data_1 = pd.read_csv('data_file_1.csv')
data_2 = pd.read_csv('data_file_2.csv')
data_3 = pd.read_csv('data_file_3.csv')
# ... possibly lots of other files

# "Manually" concatenate dataframes
data = pd.concat([
	data_1, 
	data_2, 
	data_3,
	# ... possibly lots of other dataframes
]) 

We need a few tricks to do this more efficiently. The first step is to introduce a simple Python generator function (load_files() in this case), which lets us concatenate lots of dataframes later with a single line of code.

import pandas as pd

# List of filenames to read in
DATA_FILES = [
	'data_file_1.csv',
	'data_file_2.csv',
	'data_file_3.csv',
	# ...
]

# This is the actual generator
def load_files(filenames):
	for filename in filenames:
		yield pd.read_csv(filename)
		
# Concat dataframes, as if they were items of a list
data = pd.concat(load_files(DATA_FILES))

Without getting bogged down in details, generators in Python are simple functions that, rather than returning a single value as “normal” functions do, yield a series of values, and act like an iterable object (e.g. a list) when called.

The key difference: generators are “lazy”, meaning they only do the work needed to yield the next value when they are asked for it, which makes them extremely valuable when we want to read large quantities of data into memory. (See the Python Wiki for further details on generators.)
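To see this laziness in action, note that calling the generator function does not read anything by itself; each file is only opened when pd.concat() (or a plain loop) asks for the next dataframe. A tiny sketch with made-up filenames:

import pandas as pd

def load_files(filenames):
    for filename in filenames:
        print(f'reading {filename}')  # just to show when the work happens
        yield pd.read_csv(filename)

gen = load_files(['data_file_1.csv', 'data_file_2.csv'])  # nothing is read yet
first_df = next(gen)  # only now is data_file_1.csv opened and parsed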

2. Use Python's os module

There is still one huge limitation in our code: hardcoded filenames.

Luckily, Python's os module has multiple ways to interact with the filesystem. Here, os.listdir() will do the trick, assuming we have a single directory that contains exactly the files we need to read: it returns a list with the names of the entries in that directory. Note that these are bare filenames, not full paths, so we join each name with the directory when reading. (The other very useful function is os.walk(), check the docs for details.)

So, instead of hardcoding our filenames, we can read an arbitrary number of files with os.listdir() and use them as before. Neat, huh?

import os
import pandas as pd

# Directory containing the data files
DATA_DIR = 'path/to/datafiles'

# Read filenames from the given path
data_files = os.listdir(DATA_DIR)


def load_files(filenames):
    for filename in filenames:
        # os.listdir() returns bare filenames, so join them with the directory
        yield pd.read_csv(os.path.join(DATA_DIR, filename))


data = pd.concat(load_files(data_files))
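If the csv files were scattered across subdirectories, the os.walk() function mentioned above could collect them recursively. A minimal sketch, assuming the same 'path/to/datafiles' root directory:

import os

# Walk the directory tree and collect the full path of every csv file
csv_paths = [
    os.path.join(dirpath, name)
    for dirpath, _, filenames in os.walk('path/to/datafiles')
    for name in filenames
    if name.endswith('.csv')
]

Since these paths already include the directory part, they could be fed straight to pd.read_csv() without further joining.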

Alternatively, we could use Python's built-in glob module. With the glob.glob() and glob.iglob() functions we can do very similar things to what we did with os.listdir() (though not exactly the same way), so it's worth reading the docs.
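A minimal sketch of the glob version, assuming every csv file in the directory is relevant: glob.glob() returns ready-to-use paths matching a wildcard pattern (glob.iglob() is the lazy, iterator-returning variant).

import glob
import pandas as pd


def load_files(paths):
    for path in paths:
        yield pd.read_csv(path)


# The returned paths already include the directory part
data = pd.concat(load_files(glob.glob('path/to/datafiles/*.csv')))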

3. Only use what you need

Keeping only the necessary data from each file is good practice for using the least possible amount of memory when loading a series of large csv files. Luckily, pandas.read_csv() is one of the “richest” functions in the library, and its behavior can be fine-tuned to a great extent.

One minor shortfall of read_csv() is that it cannot skip rows based on their values, i.e. it is not possible to filter the dataset while loading the csv (the skiprows parameter only works with row positions). So we have to load all rows of the necessary columns and do the filtering on the dataframe itself.
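(If even a single file were too big to load before filtering, read_csv() could still be asked to return the data in chunks via the chunksize parameter, so the filtering happens piece by piece and memory use stays bounded. A minimal sketch, with an arbitrarily chosen chunk size and the category filter used below:)

import pandas as pd


def load_filtered_chunks(filename, chunksize=100_000):
    # pd.read_csv() with chunksize returns an iterator of smaller dataframes
    for chunk in pd.read_csv(filename, chunksize=chunksize):
        yield chunk.query('Category in ["Outdoor", "Sports"]')


filtered = pd.concat(load_filtered_chunks('data_file_1.csv'))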

Extending our example, let's say we only need the transaction ID, amount and SKU from our imaginary reports (out of dozens of columns), and we only want to keep rows where Category is either “Outdoor” or “Sports”. We can use the usecols parameter (keeping the Category column as well, since we need it for the filtering) and filter the rows with DataFrame.query().

import os
import pandas as pd

DATA_DIR = 'path/to/datafiles'

data_files = os.listdir(DATA_DIR)


def load_files(filenames):
    for filename in filenames:
        # Only use certain columns and rows from each file
        yield (
            pd.read_csv(
                os.path.join(DATA_DIR, filename),
                usecols=['ID', 'Amount', 'SKU', 'Category']
            )
            .query('Category in ["Outdoor", "Sports"]')
        )


data = pd.concat(load_files(data_files))

(Note: .query() can be replaced by .pipe(lambda x: x[x['Category'].isin(['Outdoor', 'Sports'])]), which looks a bit more complicated, but also works with column names that contain spaces.)
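Written into the pipeline, the .pipe() variant of the same filtering step would look like this:

import pandas as pd

# The same filtering step expressed with .pipe() instead of .query()
df = (
    pd.read_csv(
        'data_file_1.csv',
        usecols=['ID', 'Amount', 'SKU', 'Category']
    )
    .pipe(lambda x: x[x['Category'].isin(['Outdoor', 'Sports'])])
)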

4. Leverage regular expressions

Up to now we have read all the files in our data directory. But what if we only need a subset of them, based on their filenames? To simply filter the filenames, we can do something like this:

data_files = [f for f in os.listdir('path/to/datafiles') if 'monthly' in f]

The drawback is that we have to be sure that the files we need are exactly those with “monthly” somewhere in their name. Usually things are more complicated: we have to deal with almost identical filenames and we need to extract crucial information from the filenames themselves.

So we need the dreaded regular expressions and Python’s re module.

Say we have a whole lot of irrelevant files and a bunch of monthly sales reports in our data directory, named monthly_XXXX_YYYY-MM-DD.csv, where XXXX is the four-digit ID of the store and YYYY-MM-DD is the date when the report was generated.

Assuming that

  1. we only want to load the monthly reports AND
  2. we also need to add the date and store ID as columns to our dataframe,

we can do something like this:

import os
import re

import pandas as pd

DATA_DIR = 'path/to/datafiles'

data_files = os.listdir(DATA_DIR)


def load_files(filenames):
    # Pre-compile regex for code readability
    regex = re.compile(r'^monthly_(\d{4})_(\d{4}-\d{2}-\d{2})\.csv$')

    # Map filenames to match objects, filter out non-matching names
    matches = [m for m in map(regex.match, filenames) if m is not None]

    # Iterate over match objects instead of filenames
    for match in matches:
        yield (
            pd.read_csv(
                # .group(0) is the whole match, i.e. the filename itself
                os.path.join(DATA_DIR, match.group(0)),
                usecols=['ID', 'Amount', 'SKU', 'Category']
            )
            .query('Category in ["Outdoor", "Sports"]')
            # Add new columns extracted from the filename
            .assign(store_id=match.group(1), date=match.group(2))
            # Set types
            .astype({'store_id': 'int', 'date': 'datetime64[ns]'})
        )


data = pd.concat(load_files(data_files))

We did both the filename filtering and the extraction with regex, so quite a lot happened here. If you are not familiar with regular expressions, check out the docs for re and try regexpal.com, where you can play around with regex patterns.
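As a quick illustration, here is the pattern from above applied to a couple of made-up filenames:

import re

regex = re.compile(r'^monthly_(\d{4})_(\d{4}-\d{2}-\d{2})\.csv$')

match = regex.match('monthly_0042_2021-03-31.csv')
match.group(0)  # 'monthly_0042_2021-03-31.csv' -- the whole match
match.group(1)  # '0042' -- the store ID
match.group(2)  # '2021-03-31' -- the report date

regex.match('some_irrelevant_file.csv')  # None, so it gets filtered out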

On top of the regex stuff, notice the magical .assign() method, one of our best friends in method chaining, and the .astype() call for parsing strings into more memory-efficient types (you could also parse the date with dateutil.parser or pd.to_datetime() “on the fly”, which would be useful if you wanted to filter on the dates as well).
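For instance, a variant that parses the captured date up front with pd.to_datetime() could skip reports older than some cutoff without even opening them. A minimal sketch (the cutoff date is made up for illustration):

import os
import re

import pandas as pd

DATA_DIR = 'path/to/datafiles'
CUTOFF = pd.Timestamp('2021-01-01')  # hypothetical cutoff date


def load_recent_files(filenames):
    regex = re.compile(r'^monthly_(\d{4})_(\d{4}-\d{2}-\d{2})\.csv$')
    for match in filter(None, map(regex.match, filenames)):
        report_date = pd.to_datetime(match.group(2))
        if report_date < CUTOFF:
            continue  # skip old reports without reading them
        yield (
            pd.read_csv(os.path.join(DATA_DIR, match.group(0)))
            .assign(store_id=int(match.group(1)), date=report_date)
        )


data = pd.concat(load_recent_files(os.listdir(DATA_DIR)))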

And the result

So, we have a fairly decent solution for loading an arbitrary number of csv files with a distinct naming pattern from a particular directory, filtering them down to the required amount of data, and then merging them into a single dataframe.

And all this with a neat generator, ninja-style regex and pretty method chaining.

import os
import re

import pandas as pd

DATA_DIR = 'path/to/datafiles'

data_files = os.listdir(DATA_DIR)


def load_files(filenames):
    regex = re.compile(r'^monthly_(\d{4})_(\d{4}-\d{2}-\d{2})\.csv$')
    matches = [m for m in map(regex.match, filenames) if m is not None]

    for match in matches:
        yield (
            pd.read_csv(
                os.path.join(DATA_DIR, match.group(0)),
                usecols=['ID', 'Amount', 'SKU', 'Category']
            )
            .query('Category in ["Outdoor", "Sports"]')
            .assign(store_id=match.group(1), date=match.group(2))
            .astype({'store_id': 'int', 'date': 'datetime64[ns]'})
        )


data = pd.concat(load_files(data_files))