Programmer's Python Data - Text Files & CSV
Written by Mike James   
Tuesday, 10 June 2025

Text Formats

You might be working on a project where the text file you are processing really is a text file, i.e. it is a document intended for a human to read. In such a case you can read the entire file, or line by line, and process it using nothing but string manipulation. In most cases, however, the file will not be a document intended primarily for human consumption. It will contain a mixture of data of different types and it is up to your program to convert it into an internal representation.

Clearly we are going to need to find a way to read parts of a text file. You can use read(size) to specify the number of characters, not bytes, to read, but a more common approach is to divide the file into lines using the end of line character. The file.readline(size) method will read the file until it encounters a newline character or has read size characters. You can use this with a binary file, in which case you have to make sure it includes newline characters, but this is unusual. If you leave out size then the file is read until the end of the line is reached and the string returned includes the newline character. If the read reaches the end of the file before encountering a newline then it returns the string with no newline character. This means that you can distinguish between a blank line in the text file, which returns just a newline character, and the end of the file, which returns an empty string.
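As a minimal sketch of the loop this implies, the following reads a file with readline until an empty string signals the end of the file. The file name demo.txt and the use of a temporary directory are assumptions made only for illustration:

```python
import pathlib
import tempfile

# Hypothetical file in a temporary directory, used only for this sketch
with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "demo.txt"
    # Two lines of text separated by a blank line, no newline at the very end
    path.write_text("first\n\nlast", encoding="utf-8")

    lines = []
    with path.open(mode="rt", encoding="utf-8") as f:
        while True:
            line = f.readline()
            if line == "":      # an empty string means end of file
                break
            lines.append(line)  # a blank line arrives as "\n"

print(lines)
```

The blank line in the middle of the file comes back as a string containing only a newline, so the loop does not stop early.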

Of course this raises the question of what is a newline character? You can control how line endings are treated using the newline parameter of open. If you don’t specify newline, or pass None, then the three most common newline indicators \n, \r and \r\n are converted into \n when you read a text file. On writing, by default any \n characters are converted into the system’s default line separator. If you don’t want any translation you can set newline to the empty string and then the common line endings are still recognized when reading, but they are included unchanged in the string. The only other legal values are \n, \r and \r\n. If you use one of these then only that sequence is treated as the end of a line when reading, with no translation, and on writing any \n is converted into it.
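To see the difference between the default translation and no translation, here is a small sketch. The file is written as raw bytes so that it contains all three line endings, and the file name is again just an assumption for the example:

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "newlines.txt"
    # Write raw bytes so that no translation happens on the way out
    path.write_bytes(b"unix\nold mac\rwindows\r\n")

    # Default (newline=None): every line ending is translated to "\n"
    with path.open(mode="rt", newline=None) as f:
        translated = f.readlines()

    # newline="": line endings are recognized but returned unchanged
    with path.open(mode="rt", newline="") as f:
        untranslated = f.readlines()

print(translated)
print(untranslated)
```

In both cases the file is split into the same three lines. Only the characters at the end of each string differ.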

In text mode a file object can be used as an iterable with each line being an element of the iteration. So to write and read a file in text mode you can use:

import pathlib

path = pathlib.Path("myTextFile.txt")
with path.open(mode="wt") as f:
    for i in range(43):
        f.write("The answer is: " + str(i) + "\n")
with path.open(mode="rt") as f:
    for line in f:
        print(line)

You can see that to write the lines we have to construct a string that includes a newline character. To read the file we simply iterate through its lines. If you try the program you will discover that the output is double spaced. The reason is that we get one newline character from the file and one from print, which by default ends its output with a newline.

The print function has some additional features that make it particularly useful for working with text files. You can use the print function to send text to a text file using the file parameter and you can set the end of line character using the end parameter. The advantage of using the print function to send data to a text file is that it automatically converts everything to its string representation and it adds a line feed at the end. This means you could write the previous example as:

path = pathlib.Path("myTextFile.txt")
with path.open(mode="wt") as f:
    for i in range(43):
        # print puts a space between its arguments, so none is needed after the colon
        print("The answer is:", i, file=f)
with path.open(mode="rt") as f:
    for line in f:
        print(line, end="")

Also notice that now we get single spaced output as the final call to print suppresses its linefeed character.

In many cases the print function is preferable to the write method when working with text files because it automatically forms strings from general objects and automatically adds a newline character.
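The difference between the two approaches can be sketched as follows. Here io.StringIO is used as a stand-in for an open text file so that the result can be inspected without touching the disk. This is a choice made for the example, not something the discussion above depends on:

```python
import io

# io.StringIO stands in for an open text file in this sketch
f = io.StringIO()

# write accepts only a string, so values must be converted by hand
f.write("id=" + str(42) + "\n")

# print converts each argument with str() and appends the newline itself
print("id=", 42, sep="", file=f)

# sep and end control what goes between arguments and after the last one
print("mike", 42, 3.145, sep=";", end="|\n", file=f)

text = f.getvalue()
print(text, end="")
```

The first two lines written are identical, but the print version needs neither the explicit str call nor the explicit newline.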

Text Data Formats – CSV

These techniques are adequate when you want to work with text, but often you want to extract other data from the text. For example, suppose you want to write and read the person record as text. In this case you need to create a string with the data encoded and when you read it back you have to process the string to extract it. To make this easy you need to adopt a standard format for the data. The simplest and oldest of all text data formats is Comma Separated Values or CSV. The basic idea is that each line of the file is a record containing the field values separated by commas. For example, to write the person record you might use:

import dataclasses
import pathlib

@dataclasses.dataclass
class person:
    name: str = ""
    id: int = 0
    score: float = 0.0

me = person("mike", 42, 3.145)
path = pathlib.Path("myTextFile.csv")
with path.open(mode="wt") as f:
    print(me.name, ",", me.id, ",", me.score, file=f)

which writes:

mike , 42 , 3.145

to the file.

The only problem with this approach is that when you have several fields there are too many commas to type in accurately. There are many ways of approaching this problem.
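One possible approach, sketched here under the assumption that every field has a sensible str representation, is to let print or join supply the commas. As before, io.StringIO stands in for the open file:

```python
import dataclasses
import io

@dataclasses.dataclass
class person:
    name: str = ""
    id: int = 0
    score: float = 0.0

me = person("mike", 42, 3.145)
f = io.StringIO()  # stands in for the open CSV file

# dataclasses.astuple gives the field values in declaration order;
# sep="," makes print place a comma, and no space, between them
print(*dataclasses.astuple(me), sep=",", file=f)

# An equivalent way of building the same line explicitly
line = ",".join(str(value) for value in dataclasses.astuple(me))

print(f.getvalue(), end="")
```

Both versions produce the line mike,42,3.145 with no spaces around the commas, and neither needs changing if a field is added to the dataclass.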

When it comes to reading the data back we have to use the commas to divide the fields:

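with path.open(mode="rt") as f:
    # split on the commas, then strip the spaces and newline around each field
    name, id, score = [field.strip() for field in f.readline().split(",")]
print(name, id, score)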

You can see that the split method gives us a list of the different fields of the record and we can unpack these into suitable variables. The program displays:

mike 42 3.145

This looks as if it is what we want, but if you try to compute id+1, say, you will discover that the value is a string not a number. To complete the data decoding we also have to convert the parts of the string into the correct data type:

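with path.open(mode="rt") as f:
    # split on the commas, then strip the spaces and newline around each field
    name, id, score = [field.strip() for field in f.readline().split(",")]
    id = int(id)
    score = float(score)
print(name, id, score)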

This works, but it introduces a potential problem if the id or the score aren’t correct string representations of an int or a float, in which case the conversion raises a ValueError. If your program wrote the file then presumably it will write a correctly formatted text file. If a user edits the file to correct something then things aren’t so certain. Another problem is that there is the potential to lose accuracy in floating-point values by not storing an adequate number of decimal places. Binary representations don’t have this problem as they use the internal representation with no conversion.
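A defensive version of the decoding step might look something like the following sketch. The function name parse_person and the choice to return None for a bad line are assumptions made for illustration. A real program might prefer to report the error or skip the record:

```python
def parse_person(line):
    """Return (name, id, score), or None if the line is not valid."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 3:
        return None          # wrong number of fields
    name, id_text, score_text = fields
    try:
        # int() and float() raise ValueError on malformed text
        return name, int(id_text), float(score_text)
    except ValueError:
        return None

print(parse_person("mike , 42 , 3.145\n"))
print(parse_person("mike , forty-two , 3.145\n"))
```

The first call returns the decoded record and the second returns None because the id field cannot be converted to an int.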

There are a few other problems with CSV files. For example, how do you include a comma in a string? One solution is to insist that every string is encoded with quotes around it to mark it out as a string and hence any commas it contains are to be preserved. This works, but introduces the problem of how to include a quote in a string. In short, there are many possible conventions about how to represent different types of data and these are generally referred to as “dialects” of CSV.
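To see why a simple split isn’t enough once quoting is involved, consider this short sketch. The record shown is made up for the example:

```python
# A record whose first field is a quoted string containing a comma
record = '"James, Mike",42,3.145'

# split knows nothing about quotes, so it cuts the name in two
fields = record.split(",")

print(len(fields))
print(fields)
```

The split produces four fields rather than three, with the quotes left attached to the two halves of the name, so any quoting convention needs a reader that understands it.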




Last Updated ( Tuesday, 10 June 2025 )