Using data from spreadsheets in Fedora with Python

Python is one of the most popular and powerful programming languages available. Because it’s free and open source, it’s available to everyone — and most Fedora systems come with the language already installed. Python is useful for a wide variety of tasks, but among them is processing comma-separated value (CSV) data. CSV files often start off life as tables or spreadsheets. This article shows how to get started working with CSV data in Python 3.

CSV data is precisely what it sounds like. Each row of a CSV file holds one record, with the data values separated by commas, and every row contains the same fields. Short CSV files are often easy to read and understand. But longer data files, or those with more fields, are harder to parse by eye, so computers work better in those cases.

Here’s a simple example where the fields are Name, Email, and Country. In this example, the CSV data includes a field definition as the first row, although that is not always the case.

Name,Email,Country
John Q. Smith,,USA
Petr Novak,,CZ
Bernard Jones,,UK

Reading CSV from spreadsheets

Python helpfully includes a csv module that has functions for reading and writing CSV data. Most spreadsheet applications, both native apps like Excel or Numbers and web-based ones such as Google Sheets, can export CSV data. In fact, many other services that can publish tabular reports will also export CSV (PayPal, for instance).

The Python csv module has a built-in reader class called DictReader that presents each data row as an ordered dictionary (OrderedDict; in Python 3.8 and later, rows come back as plain dict objects). It expects a file object to access the CSV data. So if our file above is called example.csv in the current directory, this code snippet is one way to get at this data:

from csv import DictReader

f = open('example.csv', 'r')
d = DictReader(f)
data = []
for row in d:
    data.append(row)
f.close()

Now the data object in memory is a list of OrderedDict objects:

[OrderedDict([('Name', 'John Q. Smith'),
               ('Email', ''),
               ('Country', 'USA')]),
  OrderedDict([('Name', 'Petr Novak'),
               ('Email', ''),
               ('Country', 'CZ')]),
  OrderedDict([('Name', 'Bernard Jones'),
               ('Email', ''),
               ('Country', 'UK')])]

Referencing each of these objects is easy:

>>> print(data[0]['Country'])
USA
>>> print(data[2]['Email'])
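Since the snippet earlier opens and closes the file by hand, it's worth noting that a with statement handles the closing automatically. Here is a minimal, self-contained sketch; it writes out the sample rows first so it runs on its own:

```python
from csv import DictReader

# Write a small sample file matching the article's example data.
with open('example.csv', 'w', newline='') as f:
    f.write('Name,Email,Country\n')
    f.write('John Q. Smith,,USA\n')
    f.write('Petr Novak,,CZ\n')

# The with statement closes the file automatically, even on error.
# newline='' is the open mode the csv module documentation recommends.
with open('example.csv', newline='') as f:
    data = list(DictReader(f))

print(data[0]['Country'])  # USA
```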

By the way, if you have to deal with a CSV file with no header row of field names, the DictReader class lets you define them yourself. Add the fieldnames argument and pass it a sequence of names; DictReader then treats the first row of the file as data rather than as headers:

d = DictReader(f, fieldnames=['Name', 'Email', 'Country'])
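Putting that together, here is a small self-contained sketch of reading a headerless file. The file name noheader.csv and its single row are made up for illustration:

```python
from csv import DictReader

# Create a headerless sample file for the sketch.
with open('noheader.csv', 'w', newline='') as f:
    f.write('Bernard Jones,,UK\n')

# With fieldnames given, the first row is read as data, not headers.
with open('noheader.csv', newline='') as f:
    d = DictReader(f, fieldnames=['Name', 'Email', 'Country'])
    rows = list(d)

print(rows[0]['Country'])  # UK
```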

A real world example

I recently wanted to pick a random winner from a long list of individuals. The CSV data I pulled from spreadsheets was a simple list of names and email addresses.

Fortunately, Python also has a helpful random module, good for generating random values. The randrange method of the Random class in that module was just what I needed. You can give it a range of integers and a step value between them. The method then generates a random integer from that range, meaning I could get a random row number back within the total number of rows in my data.
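To see what randrange returns, here is a quick sketch; the row count of 100 is an arbitrary stand-in for the length of the data:

```python
from random import Random

r = Random()
rows = 100  # stand-in for len(data)

# randrange(start, stop, step) returns an integer in [start, stop),
# so every result is a valid list index for a list of `rows` items.
index = r.randrange(0, rows, 1)
assert 0 <= index < rows
```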

So this small program worked well:

from csv import DictReader
from random import Random

d = DictReader(open('mydata.csv'))
data = []
for row in d:
    data.append(row)

r = Random()
winner = data[r.randrange(0, len(data), 1)]
print('The winner is:', winner['Name'])
print('Email address:', winner['Email'])
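As an aside, the index arithmetic can be skipped entirely: the random module also provides a choice function that picks an element from a sequence directly. A sketch with made-up stand-in rows:

```python
from random import choice

# Hypothetical rows standing in for the CSV data.
data = [{'Name': 'John Q. Smith', 'Email': ''},
        {'Name': 'Petr Novak', 'Email': ''}]

# choice picks one element from the list at random.
winner = choice(data)
print('The winner is:', winner['Name'])
```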

Obviously this example is extremely simple. Spreadsheets themselves include sophisticated ways to analyze data. However, if you want to do something outside the realm of your spreadsheet app, Python may be just the trick!

Photo by Isaac Smith on Unsplash.



  1. Christian

    Thank you! It is an excellent example.

  2. Paweł

    How about random.choice(data) ? 🙂

  3. Mark

    Very useful for people comfortable with using Python, I suppose. However, most people would not use Python for simple (or even complex) scripting.

    The ‘real world’ example can be done simply in bash (or any other shell) without needing to import functions or use arrays.

    # get number of lines in the example file mydata.csv
    linecount=`cat mydata.csv | wc -l`
    # get random number between 2 (to omit line 1 header line) and linecount
    # note: shuf is not limited to the shells RANDOM variable max of 32767 so
    # can be used on huge files
    linerandom=`shuf -i 2-${linecount} -n 1`
    # extract the line and format the output
    linedata=`head -${linerandom} mydata.csv | tail -1 \
    | awk -F, '{print "The winner is:"$1"\nEmail address:"$2}'`
    # display the output
    echo "${linedata}"

    A good article in the way it explains the layout of the array created and how to reference the results, as a non Python user I was easily able to see how it all works with that information.

    Having seen this article I did a quick search on DictReader and found this page where the example shows using DictReader against a CSV file in a way that almost emulates the intent of a SQL query against a database, which does make the function seem a little useful for people who prefer working with large CSV files for data queries rather than databases.

    I see there is also a DictWriter function, so perhaps the article examples could be updated to show how to use the two functions in a SQL-like way to convert a CSV file A into a CSV file B with fewer output fields/columns and fewer records based on selection criteria? That may be a better example, as the main reason I can think of for such Python functions is to split out data this way to create new, smaller partial CSV datasets, perhaps in a corporate environment which is too cheap to use a database and stores data in spreadsheets but only wants people to see subsets of the spreadsheet.

    • Hopefully the formatting above is fixed for you now. Thank you for the contribution!

    • Vernon Van Steenkist

      While I agree 100% with your approach and your observation that using simple tools that process files line by line are much more efficient and scalable to large data sets than loading everything into Python tables, I would like to suggest a couple of improvements:


      cat mydata.csv | wc -l

      I prefer
      linecount=$(wc -l mydata.csv | cut -f 1 -d" ")
      because it only processes every line in mydata.csv once.

      head -${linerandom} mydata.csv | tail -1
      I believe can be more simply written
      sed -n "${linerandom}p" mydata.csv

  4. I am surprised we’re talking about tabulated data in Python and not talking about Pandas, which is a WONDERFUL library for working with tabulated data.

