This is a guest post from Timothy O’Connell.
So recently, while working with an ancient, hastily written calendar application, it fell to me to pry a decade’s worth of data loose from a .mdb file.
And while the original specifications for the project only called for a csv for each of the database’s tables, scope began to creep as the client realized that he needed his data to be reformatted and, in some cases, recast. Due to the persnickety nature of this client horrifyingly random nature of his data (some dates were MM/DD/YY while others where a legit timestamp; some commas were escaped, others were not, etc.), I realized that I was going to have to implement an industrial strength solution: something that would scale. So rather than writing a series of one-off python scripts (using the totally kickass csv module), I decided to get the data into a Postgres database and then query this database as necessary.
First, I’ll lay out (most of) my program and then I’ll do a blow-by-blow, dwelling briefly on certain important or noteworthy parts.
Remember, what we’re doing here is grabbing one table at a time and stuffing it into a Postgres database (and maybe doing a little string sanitizing and type casting as we go). Anything more automated might not work, given the totally messed-up nature of the data and of the .mdb format in general.
#!/usr/bin/env python from cStringIO import StringIO import csv, os, psycopg2, re, subprocess, sys, time dbHost = "localhost" dbUser = "toconnell" dbPass = "XXXXXXXXXX" dbName = "toconnell" # This program is run from the CLI: arguments one and two are the .mdb file # and the table in that database that we're trying to import mdbFile = sys.argv mdbTable = sys.argv mdbDump = "/usr/bin/mdb-export" # Dynamically name the table, depending on the day the import is run tableName = "%s_%s" % (os.path.basename(mdbFile.replace(".","_").lower()),time.strftime("%Y_%m_%d")) def dumpMDB(mdbFile): command = [mdbDump,mdbFile,mdbTable] p = subprocess.Popen(command,stdout=subprocess.PIPE) mdbData = [line for line in p.stdout.readlines()] # Grab the column names columns = mdbData.pop(0) # Now write a CSV to the buffer for line in mdbData: tmpFile.write(line) # Now use the csv module to make a list where each line is a list tmpFile.seek(0,0) importData = csv.reader(tmpFile) dataList =  dataList.extend(importData) tmpFile.close() return columns,dataList def createTable(columns): # Take this string from the pretend CSV file, break it up and remake it as something like a query columnsList =  for item in columns.split(","): item = item.strip().lower() # Also do a little type casting in there (beats having to do it later) if item == "date": columnsList.append(item + " TIMESTAMP") else: columnsList.append(item + " TEXT") columns = ",".join(columnsList) print columns # Now connect and do table stuff conn = psycopg2.connect("dbname=%s user=%s host=%s password=%s" % (dbName,dbUser,dbHost,dbPass)) cursor = conn.cursor() # Check for previous tables: cursor.execute("SELECT * FROM pg_tables WHERE tablename LIKE '%%%s%%'" % tableName) results = cursor.fetchone() if results != None: # Drop previous ones: cursor.execute("DROP TABLE %s" % tableName) else: # Create a new one: cursor.execute("CREATE TABLE %s (%s)" % (tableName,columns)) conn.commit() def populateTable(columns,query): conn = psycopg2.connect("dbname=%s user=%s host=%s password=%s" % (dbName,dbUser,dbHost,dbPass)) cursor = conn.cursor() cursor.execute("INSERT INTO %s(%s) VALUES(%s)" % (tableName,columns,query)) conn.commit() def sanitize(query): # You can do as many of these as necessary: I've included a single one as an example finalList =  for item in query: item = item.replace("'","\'") finalList.append(item) return finalList if __name__ == "__main__": # First, instantiate our pretend cvs file tmpFile = StringIO() # Now get the data columns,dataList = dumpMDB(mdbFile) # Now reformat the "columns" string a little bit columnsList = [item.strip().lower().replace(" ","") for item in columns.split(",")] columns = ",".join(columnsList) # Now create a new table (delete the previous one) createTable(columns) # Finally, insert data, reporting progress on the CLI (better than doing # select count(*) every 15 seconds) total = len(dataList) n = 1 for item in dataList: item = sanitize(item) formatQuery = "'" + "','".join(item) + "'" populateTable(columns,formatQuery) print "Inserting data %s/%s..." % (n,total) n += 1 print "Done."
In my opinion, there are two noteworthy aspects of the above: the use of the csv module and the use of cStringIO to create a csv file in the buffer (instead of simply creating a file and deleting it).
In order to better explain those two aspects of the program, here’s that function again, broken into more easily digested pieces and presented with an in-line, blow-by-blow commentary:
def dumpMDB(mdbFile): command = [mdbDump,mdbFile,mdbTable] p = subprocess.Popen(command,stdout=subprocess.PIPE) mdbData = [line for line in p.stdout.readlines()]
Nothing special here: I do some pretty standard subprocess syntax that executes mdb-export (a part of the mdbtools package on Debian) on my file and dumps a table in csv format, one line at a time, into a list. Normally the program would just dump them to stdout: I’m just grabbing them up with subprocess’s Popen function.
# Grab the column names columns = mdbData.pop(0)
Now I do a list.pop(0) on this list to get the first line (i.e. the column names) of what I just dumped with mdb-export; using the built-in pop function without an integer gets you the last item in your list (in 2.5, at least).
# Now write a CSV to the buffer for line in mdbData: tmpFile.write(line) # Now use the csv module to make a list where each line is a list tmpFile.seek(0,0) importData = csv.reader(tmpFile) dataList =  dataList.extend(importData) tmpFile.close() return columns,dataList
This isn’t fancy code. I’m essentially creating a CSV file and reading each of its lines into a list. I do this instead of simply using the list I generated with the subprocess call because I’m nervous about splitting the strings that compose that list: as I mentioned above, there is a very real possibility that a given string will contain unescaped commas and other inappropriate characters within it and I don’t even want to have to contemplate how to split those strings correctly.
What I want to do is trust the csv module to figure that out for me.
So, in order to make that I happen, I need to give the csv module a CSV file it can read. Rather than writing a file to the filesystem (which no self-respecting sysadmin will do if he can help it), I need to make some cStringIO magic happen. And in order to make that buffer magic happen, I need to do three things:
- import cStringIO,
- instantiate the file-like object and
- do that weird seek to it.
The import (
from cStringIO import StringIO
) is fairly simple; if you’ve ever used cStringIO for anything, you’ve done this. The instantiation (
tmpFile = StringIO()
) is also straight out of the documentation.
But once I write all of the lines of my list to the cStringIO buffer, I’ve got to do
in order to then read from that file-like object with python’s built-in csv module. There are a number of obscure/arcane reasons for this, but, as I understand it, the seek(0,0) is necessary because, without it, your file-like object is merely a collection of strings and not file-like in the way that the csv module expects it to be file-like: once you do the seek, you’re ready to “read” your file-like object with the csv module.
From there, it’s a simple matter of using the reader() function of the csv module to get your CSV data from your cStringIO object, manipulating it however you see fit and then looping over it and inserting it in your database.
And then never, ever having to work with (i.e. around) MS Access again.