Python: How to read and write CSV files


(Sponsors) Get started learning Python with DataCamp's free Intro to Python tutorial. Learn Data Science by completing interactive coding challenges and watching videos by expert instructors. Start Now!


What is CSV File?

CSV (Comma-separated values) is a common data exchange format used by the applications to produce and consume data. Some other well-known data exchange formats are XML, HTML, JSON etc.

A CSV file is a simple text file where each line contains a list of values (or fields) delimited by commas.

Although the term “Comma” appears in the format name itself, but you will encounter CSV files where data is delimited using tab ( \t) or pipe ( |) or any other character that can be used as a delimiter.

The first line of the CSV file represents the header containing a list of column names in the file. The header is optional but highly recommended.

The CSV file is commonly used to represent tabular data. For example, consider the following table:

IDNameCountryCodeDistrictPopulation
1KabulAFGKabol1780000
2QandaharAFGQandahar237500
3HeratAFGHerat186800
4Mazar-e-SharifAFGBalkh127800
5AmsterdamNLDNoord-Holland731200

The above table can be represented using CSV format as follows:

If the values in your CSV file contains commas, then it must be enclosed inside double quotes. For example:

NameAgeAddress
Jerry102776 McDowell Street, Nashville, Tennessee
Tom203171 Jessie Street, Westerville, Ohio
Mike301818 Sherman Street, Hope, Kansas

To retain the commas inside the Address field enclose it in double quotation marks, as follows:

Similarly, if you have double quotes embedded inside a field, it must be escaped with another double quote character. Otherwise, they will not be interpreted correctly. For example:

IdUserComment
1BobJohn said “Hello World”
2Tom“The Magician”

To retain the retain the double quotes inside the Comment field use two double quotes.

It is important to note that the CSV format is not fully standardized. So the rules we have just mentioned are not universal. Once in a while, you will encounter CSV files that have a different way of representing fields.

Fortunately, to make things easier for us Python provides the csv module.

Before we start reading and writing CSV files, you should have a good understanding of how to work with files in general. If you need a refresher, consider reading how to read and write file in Python.

The csv module is used for reading and writing files. It mainly provides following classes and functions:

  1. reader()
  2. writer()
  3. DictReader()
  4. DictWriter()

Let’s start with the reader() function.

Reading a CSV File with reader()

The reader() function takes a file object and returns a _csv.reader object that can be used to iterate over the contents of a CSV file. The syntax of reader() function is as follows:

Syntax: reader(fileobj [, dialect='excel' [, **fmtparam] ]) ->  _csv.reader

ArgumentDescription
fileobj(required) It refers to the file object
dialect(optional) Dialect refers to the different ways of formatting the CSV document. By default, the csv module uses the same format as Microsoft Excel. We will discuss dialect in detail later in this post.
fmtparam(optional) It refers to the set of keyword arguments to customize the dialect (see the next section).

Let’s say we have following CSV file:

employees.csv

Here is how to read this CSV file:

Expected Output:

Notice that each line in the CSV file is returned as a list of strings.

To get the data from certain fields, you can use indexing. For example:

Expected Output:

If you want to skip heading call the next() built-in function on the _csv.reader object and then loop over the remaining lines as usual.

Expected Output:

Customizing the reader()

By default, the csv module works according to the format used by Microsoft excel, but you can also define your own format using something called Dialect.

The following are some additional arguments that you can pass to the reader() function to customize its working.

  • delimiter – It refers to the character used to separate values (or fields) in the CSV file. It defaults to comma ( , ).
  • skipinitialspace  – It controls how the space following the delimiter will be interpreted. If True , the initial whitespaces will be removed. It defaults to False .
  • lineterminator – It refers to the character sequence used to terminate the line. It defaults to \r\n .
  • quotechar – It refers to the single character string that will be used to quote values if special characters (like delimiter) appears inside the field. It defaults to " .
  • quoting  – controls when quotes should be generated by the writer or recognized by the reader. It can take one of the following constants:
    • csv.QUOTE_MINIMAL  means add quote only when required, for example, when a
      field contains either the quotechar or the delimiter. This is the default.
    • csv.QUOTE_ALL means quotes everything regardless of the field type.
    • csv.QUOTE_NONNUMERIC means quotes everything except integers and floats.
    • csv.QUOTE_NONE means that do not quote anything on output. However, while reading quotes are included around the field values.
  • escapechar – It refers to the one-character string used to escape
    the delimiter when quoting is set to QUOTE_NONE. It defaults to None.
  • doublequote  – controls the handling of quotes inside fields. When
    True, two consecutive quotes are interpreted as one during read,
    and when writing, each quote character embedded in the data is
    written as two quotes.

Let’s workthrough some examples to better understand how these arguments work:

delimiter argument

employees_pipe.csv

This file uses pipe ( |) character as a delimiter. Here is how to read this CSV file:

Expected Output:

skipinitialspace argument

baseball_players.csv

This CSV file contains spaces following the comma ( ,). To read this CSV file correctly, set skipinitialspace to True, as follows:

Expected Output:

quotechar argument

addresses.csv

There are two things to notice in this file. First, the address field is wrapped using single quote ( ') instead of " double quote (which is the default). Second, there are spaces following the comma ( ,).

If you try to read this file without changing the quote character, you will get the output as follows:

Expected Output:

Notice that the address is split into three fields, which is certainly not correct. To fix the issue simply change quote character to a single quote ( ') using the quotechar argument:

Expected Output:

escapechar argument

comments.csv

This file uses backslash ( \) character to escape the embedded double quotes. However, by default the default csv module uses a double quote character to escape double quote character.

If try to read this file using default options you will get the output like this:

Expected Output:

This output is certainly not desirable. To get the correct output change the escape character using escapechar argument, as follows:

Expected Output:

doublequote argument

dialogues.csv

This file uses double quote to escape the embedded double quote characters in the field. By default, doublequote is set to True. As a result, while reading two consecutive double quotes are interpreted as one.

Expected Output:

If, however you set doublequote to False, the consecutive double quotes will appear in the output.

Expected Output:

Writing CSV files with writer()

To write data to a CSV file we use the writer() function. It accepts the same argument as the reader() function but returns a writer object (i.e _csv.writer):

Syntax: writer(fileobj [, dialect='excel' [, **fmtparam] ]) -> csv_writer

ArgumentDescription
fileobj(required) It refers to the file object
dialect(optional) Dialect refers to the different ways of formatting the CSV document. By default, the csv module uses the same format as Microsoft Excel. We will discuss dialect in detail later in this post.
fmtparam(optional) Formatting parameters, work same as the reader()’s function.

The writer instance provides the following two methods to write data:

MethodDescription
writerow(row)Writes a single row of data and returns the number of characters written. The row must be a sequence of strings and number.
writerows(rows)Writes multiple rows of data and returns None. The rows must be a sequence.

Here are examples:

Example 1: Using writerow()

Example 2: Using writerows()

The output generated by both listing will be the same and it looks like this:

customers.csv

Notice that only the address field is wrapped around double quotes. This is because by default the quoting argument is set to QUOTE_MINIMAL. In other words, fields will be quoted only when quotechar or delimiter appears in the data.

Let’s say you want double quotes around all textual data. To achieve this, set quoting argument to QUOTE_NONNUMERIC.

Expected Output:

customers.csv

Now all the names and addresses have double quotes around them.

If you want double quotes around all fields regardless of whether quotechar or delimiter appears in the data or not, set quoting to csv.QUOTE_ALL.

Expected Output:

Everything is double-quoted now.

It is important to note that when quoting is on (i.e quoting parameter has a value other than csv.QUOTE_NONE), the csv module uses the quotechar (which defaults to ") to quote field.

The following listing changes the quote character from double quote ( ") to a single quote ( ').

Expected Output:

In this case, the csv module uses the single quote ( ') instead of ( ") to quote fields containing quotechar or delimiter.

We can also turn off quoting all-together by setting quoting to csv.QUOTE_NONE. However, if you do that and delimiter character appears in the data then you will get an error like this:

Expected Output:

The problem is that the address field contains embedded commas ( ,) and since we have turned off the ability to quote fields, the csv module doesn’t know how to escape them properly.

This where the escapechar argument comes into play. It takes a one-character string that will be used to escape the delimiter when the the quoting is turned off (i.e quoting=csv.QUOTE_NONE).

The following listing set the escapechar to backslash ( \).

Expected Output:

Notice that the commas ( ,) in the address field is escaped using the backslash ( \) character.

You should now have a good understanding of various formatting arguments and the context in which they are used with the reader() and writer() function. In the next section will see some other ways to read and write data.

Reading a CSV file with DictReader

DictReader works almost exactly like reader() but instead of retuning a line as a list, it returns a dictionary. Its syntax is as follows:

Syntax:: DictReader(fileobj, fieldnames=None, restkey=None, restval=None, dialect='excel', **fmtparam)

ArgumentDescription
fileobj(required) It refers to the file object.
fieldnames(optional) It refers to the list of keys that will be used in the returned dictionary in order. If omitted, the field names are inferred from the first row of the CSV file.
restkey(optional) If the row has more fields than specified in the fieldnames parameter, then the remaining fields is stored as a sequence keyed by the value of restkey argument.
restval(optional) It provides value to fields which are missing from the input.
dialect(optional) Dialect refers to the different ways of formatting the CSV document. By default, the csv module uses the same format as Microsoft excel. We will discuss dialect in detail later in this post.
fmtparamIt refers to formatting arguments and works exactly like reader() and writer().

Let’s take some examples:

Example 1:

customers.csv

Expected Output:

Note: Order of keys in the result may vary. Since dictionary doesn’t preserve the order of elements.

In this case, the field names are inferred from the first line (or header) of the CSV file.

Example 2: Using fieldnames parameter

This CSV file has no header. So we have to provide field names via the fieldnames parameter.

Expected Output:

Example 3: Using restkey parameter

Expected Output:

Notice that the address and zip code are now stored as a sequence keyed by value extra.

Example 4: Using restval

Expected Output:

In this case, we have specified field two extra fields: phone and email. The values for extra fields is provided by the restval argument.

Writing CSV files with DictWriter()

The DictWriter object writes a dictionary to a CSV file. Its syntax is as follows:

Syntax: DictWriter(fileobj, fieldnames, restval='', extrasaction='raise', dialect='excel', **fmtparam)

ArgumentDescription
fileobjIt refers to the file object
fieldnamesIt refers to the field names and the order in which they will be written the file.
restvalIt provides the missing value for the keys which doesn’t exist in the dictionary.
extrasactionIt controls what action to take if the dictionary contains a key, that is not found in the fieldnames argument. By default, extrasaction is set to raise, which means an exception will be raised in such an event. If you want to ignore the extra values set extrasaction to ignore.

The DictWriter provides the following three methods to write data.

MethodDescription
writeheader()Writes the header (i.e fieldnames) to the CSV file and returns None.
writerow(row)Writes a single row of data and returns the number of characters written. The row must be a sequence of strings and number.
writerows(rows)Writes multiple rows of data and returns None. The rows must be a sequence.

Let’s take some examples:

Example 1:

Expected Output:

dictcustomers.csv

Example 2: Using restval

Expected Output:

dictcustomers.csv

In this case, the value of email field is missing from the dictionaries. As s result, the value of restval will be used for the email field.

Example 3: Using extrasaction

Here, the dictionary contains an extra key named zip which is not present in the header list. To prevent the exception from being raised we have set extrasaction to ignore.

Expected Output:

dictcustomers.csv

Creating Dialect

Earlier in this post, we have learned various formatting parameters that allow us to customize the reader and writer object to accommodate for differences in the CSV conventions.

If you find yourself passing the same set of formatting parameters over and over again. Consider creating your own Dialect.

A dialect object or (simply dialect) is a way to group various formatting parameters. Once you have created the dialect object, simply pass it to the reader or writer, rather than passing each formatting argument separately.

To create a new dialect, we use register_dialect() function. It accepts dialect name as a string and one or more formatting parameters as keyword arguments.

The following table lists all the formatting arguments along with their default values:

ArgumentDefaultDescription
delimiter ,It refers to the character used to separate values (or fields) in the CSV file.
skipinitialspace FalseIt controls how the space following the delimiter will be interpreted. If True, the initial whitespaces will be removed.
lineterminator \r\nIt refers to the character sequence used to terminate the line.
quotechar "It refers to the single character string that will be used to quote values if special characters (like delimiter) appears inside the field.
quoting csv.QUOTE_NONEcontrols when quotes should be generated by the writer or recognized by the reader (see above for other options).
escapechar NoneIt refers to the one-character string used to escape the delimiter when quoting is set to.
doublequote Truecontrols the handling of quotes inside fields. When True, two consecutive quotes are interpreted as one during read, and when writing, each quote character embedded in the data is written as two quotes.

Let’s create a simple dialect.

Expected Output:

oscars.csv

 


Other Tutorials (Sponsors)

This site generously supported by DataCamp. DataCamp offers online interactive Python Tutorials for Data Science. Join over a million other learners and get started learning Python for data science today!


Leave a Reply

Your email address will not be published. Required fields are marked *