Using data from spreadsheets in Fedora with Python

Python is one of the most popular and powerful programming languages available. Because it’s free and open source, it’s available to everyone — and most Fedora systems come with the language already installed. Python is useful for a wide variety of tasks, but among them is processing comma-separated value (CSV) data. CSV files often start off life as tables or spreadsheets. This article shows how to get started working with CSV data in Python 3.

CSV data is precisely what it sounds like. A CSV file includes one row of data at a time, with data values separated by commas. Each row is defined by the same fields. Short CSV files are often easily read and understood. But longer data files, or those with more fields, may be harder to parse with the naked eye, so computers work better in those cases.

Here’s a simple example where the fields are Name, Email, and Country. In this example, the CSV data includes a field definition as the first row, although that is not always the case.

Name,Email,Country
John Q. Smith,jqsmith@example.com,USA
Petr Novak,pnovak@example.com,CZ
Bernard Jones,bjones@example.com,UK

Reading CSV from spreadsheets

Python helpfully includes a csv module that has functions for reading and writing CSV data. Most spreadsheet applications, both native like Excel or Numbers, and web-based such as Google Sheets, can export CSV data. In fact, many other services that can publish tabular reports will also export as CSV (PayPal for instance).

The Python csv module has a built in reader method called DictReader that can deal with each data row as an ordered dictionary (OrderedDict). It expects a file object to access the CSV data. So if our file above is called example.csv in the current directory, this code snippet is one way to get at this data:

f = open('example.csv', 'r')
from csv import DictReader
d = DictReader(f)
data = []
for row in d:
    data.append(row)

Now the data object in memory is a list of OrderedDict objects :

[OrderedDict([('Name', 'John Q. Smith'),
               ('Email', 'jqsmith@example.com'),
               ('Country', 'USA')]),
  OrderedDict([('Name', 'Petr Novak'),
               ('Email', 'pnovak@example.com'),
               ('Country', 'CZ')]),
  OrderedDict([('Name', 'Bernard Jones'),
               ('Email', 'bjones@example.com'),
               ('Country', 'UK')])]

Referencing each of these objects is easy:

>>> print(data[0]['Country'])
USA
>>> print(data[2]['Email'])
bjones@example.com

By the way, if you have to deal with a CSV file with no header row of field names, the DictReader class lets you define them. In the example above, add the fieldnames argument and pass a sequence of the names:

d = DictReader(f, fieldnames=['Name', 'Email', 'Country'])

A real world example

I recently wanted to pick a random winner from a long list of individuals. The CSV data I pulled from spreadsheets was a simple list of names and email addresses.

Fortunately, Python also has a helpful random module good for generating random values. The randrange function in the Random class from that module was just what I needed. You can give it a regular range of numbers — like integers — and a step value between them. The function then generates a random result, meaning I could get a random integer (or row number!) back within the total number of rows in my data.

So this small program worked well:

from csv import DictReader
from random import Random

d = DictReader(open('mydata.csv'))
data = []
for row in d:
    data.append(row)

r = Random()
winner = data[r.randrange(0, len(data), 1)]
print('The winner is:', winner['Name'])
print('Email address:', winner['Email'])

Obviously this example is extremely simple. Spreadsheets themselves include sophisticated ways to analyze data. However, if you want to do something outside the realm of your spreadsheet app, Python may be just the trick!


Photo by Isaac Smith on Unsplash.

FAQs and Guides For Developers Using Software

7 Comments

  1. Christian

    Thank you!. It is an excellent example.

  2. Paweł

    How about random.choice(data) ? 🙂

  3. Mark

    Very useful for people comfortable with using Python I suppose. However most people would not use Python for simple (or even complex) scripting.

    The ‘real world example can be simply done in bash (or any other shell) without needing to import functions or use arrays.

    # get number of lines in the example file mydata.csv
    linecount=`cat mydata.csv | wc -l`
    # get random number between 2 (to omit line 1 header line) and linecount
    # note: shuf is not limited to the shells RANDOM variable max of 32767 so
    # can be used on huge files
    linerandom=`shuf -i 2-${linecount} -n 1`
    # extract the line and format the output
    linedata=`head -${linerandom} mydata.csv | tail -1 \
    | awk -F, ‘{print “The winner is:”$1″\nEmail address:”$2}’`
    # display the output
    echo “${linedata}”

    A good article in the way it explains the layout of the array created and how to reference the results, as a non Python user I was easily able to see how it all works with that information.

    Having seen this article I did a quick search on DictReader and found this page https://courses.cs.washington.edu/courses/cse140/13wi/csv-parsing.html where the example shows using Dictreader against a CSV file in a way that almost emulates the intent of a sql query against a database which does make the function seem a little useful for people who prefer working with large CSV files for data queries rather than databases.

    I see there is also a DictWriter function, so pherhaps the article examples could be updated to show how to use the two functions in a SQL type way to convert a CSV file A into a CSV file B with fewer output fields/columns and fewer records based on a selection criteria ?. That may be a better example as the only main reasons I can think of for such python functions are to split out data in this way to create new smaller partial CSV datasets, perhaps in a corporate environment which is too cheap to use a database and stores data in spreadsheets but only wants people to see subsets of the spreadsheet.

    • Hopefully the formatting above is fixed for you now. Thank you for the contribution!

    • Vernon Van Steenkist

      While I agree 100% with your approach and your observation that using simple tools that process files line by line are much more efficient and scalable to large data sets than loading everything into Python tables, I would like to suggest a couple of improvements:

      First
      linecount=

      cat mydata.csv | wc -l

      I prefer
      linecount=$(wc -l mydata.csv | cut -f 1 -d” “)
      because it only processes every line in mydata.csv once.

      Second
      head -${linerandom} mydata.csv | tail -1
      I believe can be more simply written
      sed -n $linerandom’p’ mydata.csv

  4. I am surprised we’re talking about tabulated data in Python and not talking about Pandas, which is a WONDERFUL library for working with tabulated data.
    https://pandas.pydata.org/

Comments are Closed

The opinions expressed on this website are those of each author, not of the author's employer or of Red Hat. Fedora Magazine aspires to publish all content under a Creative Commons license but may not be able to do so in all cases. You are responsible for ensuring that you have the necessary permission to reuse any work on this site. The Fedora logo is a trademark of Red Hat, Inc. Terms and Conditions