File Processing
Until now, you have been reading from standard input and writing to standard output.
What if you want to read an external file and write to a different file?
We will see how to use actual data files in this lab. Python provides the basic functions
and methods needed to manipulate files without any extra installation. Most file manipulation is done through a file object.
Objects
We introduced objects in the earlier lab, but we will review them again.
Since Python is an object-oriented programming language, almost everything in Python is an object with its own data and methods.
https://www.w3schools.com/python/python_classes.asp
Classes
A class is like an object constructor, or a blueprint for creating objects.
https://www.w3schools.com/python/python_classes.asp
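As a rough illustration of the idea, the sketch below defines a small class and creates two objects from it. The class name Counter and its method increment() are made up for this example. A file object returned by open() works in a similar way: it holds data about the open file and offers methods such as read() and close().

```python
class Counter:
    """A hypothetical class: the blueprint for counter objects."""

    def __init__(self, start=0):
        # Data stored inside each object created from the class
        self.value = start

    def increment(self):
        # A method: a function that works on the object's own data
        self.value = self.value + 1
        return self.value


# Create two independent objects from the same blueprint
first = Counter()
second = Counter(10)

first.increment()
first.increment()
second.increment()

print(first.value)   # each object keeps its own data
print(second.value)
```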
The open() function
Working with text files is easy in Python. The first step is to create a file
object corresponding to a file.
This is done using the open() function.
open() returns a file object, and is most commonly used with two arguments:
open(filename, mode)
In this lab, the mode parameter is either the string "r" (read) or "w" (write), depending on whether we want to
read from the file or write to the file.
For example, to open a file called "input.txt" for reading, we can do the following:
infile = open("input.txt", "r")
Now we can use the file object infile to read the contents of input.txt.
The input file name is input.txt. The .txt file extension means it is a text file.
Python provides three related operations for reading information from a file:
file_object.read()
Reads the entire content of the file as a single (potentially large, multi-line) string.
file_object.readline()
The readline() method returns the next line of the file, that is, all text up to and including the next newline character.
What is the newline character?
The newline character '\n' is used to mark the end of a line and the beginning of a new line.
https://www.freecodecamp.org/news/python-new-line-and-how-to-python-print-without-a-newline/
file_object.readlines()
Returns all the lines in a file as a list, where each
element is one line of the file. Each list item is a single line, including the newline character at the end.
Examples of the file read(), readline(), and readlines() methods can be found here:
https://www.guru99.com/python-file-readline.html
https://www.w3schools.com/python/ref_file_readlines.asp
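To compare the three operations side by side, here is a minimal sketch. It first creates a small sample file in a temporary folder so that it can run on its own; the file name sample.txt and the use of the tempfile module are assumptions made only for this illustration.

```python
import os
import tempfile

# Create a small sample file in a temporary folder (illustration only)
folder = tempfile.mkdtemp()
path = os.path.join(folder, "sample.txt")
outfile = open(path, "w")
print("230", file=outfile)
print("500", file=outfile)
print("100", file=outfile)
outfile.close()

# read(): the whole file as one string
infile = open(path, "r")
whole = infile.read()
infile.close()

# readline(): one line per call, newline character included
infile = open(path, "r")
first_line = infile.readline()
second_line = infile.readline()
infile.close()

# readlines(): a list with one string per line
infile = open(path, "r")
all_lines = infile.readlines()
infile.close()

# repr() makes the newline characters visible as \n
print(repr(whole))
print(repr(first_line), repr(second_line))
print(all_lines)
```

Printing with repr() is a handy way to see exactly where the newline characters are.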
Reading data from a text file
It is critically important that the data file be a text file. It cannot be a Microsoft Word file. Some file methods only work for
text files. For example, the NumPy loadtxt() function expects plain text in which each line ends with a line feed (LF). A Microsoft Word file
stores formatting information along with the text and may use carriage returns (CR), which confuses loadtxt().
close()
The file method close() closes an open file. A closed file can no longer be read or written.
It is good practice to call close() when you are finished with a file.
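Python also offers the with statement, which is not used elsewhere in this lab but closes the file automatically when its indented block ends. The sketch below is one possible way to use it; the temporary folder and the file name sample.txt are assumptions so that the example runs on its own. It also shows what happens when a closed file is read.

```python
import os
import tempfile

# Temporary location for the sample file (illustration only)
folder = tempfile.mkdtemp()
path = os.path.join(folder, "sample.txt")

# The file is closed automatically when the indented block ends
with open(path, "w") as outfile:
    print("230", file=outfile)

with open(path, "r") as infile:
    data = infile.read()

print(data)
print(infile.closed)   # the closed attribute reports whether the file is closed

# Reading from a closed file raises a ValueError
message = ""
try:
    infile.read()
except ValueError as error:
    message = str(error)
    print("Cannot read:", message)
```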
If you are using replit, you can add a file and name it input.txt.
If you are using desktop Python, create a text file "input.txt" using Notepad or Notepad++ (for PCs) or Sublime or BBEdit (for Macs) that contains the following values.
230
500
100
25
600
The following Python program opens the input file input.txt you just created and outputs the file content on the monitor.
Note: The input.txt file and the Python program file must be placed in the same directory/folder.
1. read()
"""
This program opens an input file "input.txt".
The entire contents of input.txt are read as one large
string and stored in the variable data.
print(data) will display the contents of input.txt on the monitor.
"""
def main():
    infile = open("input.txt", 'r')
    data = infile.read()
    print(data)
    infile.close()
main()
The output is listed below.
230
500
100
25
600
2. readline()
readline() reads only one complete line from the file.
def main():
    infile = open("input.txt", 'r')
    data = infile.readline()
    print(data)
    #print(data[:-1])
    infile.close()
main()
Pay attention to line #4 and line #5 of the program. Try running it with each one: what is the difference between the two outputs?
The following example prints out the first three lines of the input file input.txt.
def main():
    infile = open("input.txt", 'r')
    for i in range(3):
        data = infile.readline()
        print(data[:-1])  # Slicing removes the newline character at the end of the line.
        # print(data) would output an extra blank line between the lines of the file,
        # because print() automatically adds its own newline.
    infile.close()
main()
The output is listed below.
230
500
100
3. readlines()
To read all the lines from a given file, readlines() reads the entire contents of the file and saves the lines in a list.
def main():
    infile = open("input.txt", 'r')
    data = infile.readlines()
    print(data)
    infile.close()
main()
Here is the output as a list.
['230\n', '500\n', '100\n', '25\n', '600']
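You can also loop through the entire contents of a file like the following, since Python treats
the file as a sequence of lines.
def main():
    infile = open("input.txt", 'r')
    for data in infile:
        # rstrip("\n") removes the newline only if it is there.
        # data[:-1] would cut off the last digit of the final line, which has no newline.
        print(data.rstrip("\n"))
    infile.close()
main()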
The following example reads each number from the input file and outputs the sum of all numbers
on the screen.
We can read the numbers one by one using for i in infile and add each one to a running total:
def main():
    infile = open("input.txt", 'r')
    sum = 0
    for i in infile:
        number = float(i)  # Converts the string from the input file into a float.
        sum = sum + number
    print("The total is:", sum)
    infile.close()
main()
We can also use the readlines() method which returns a list containing each line in the file as a list item.
def main():
    infile = open("input.txt", 'r')
    num_file = infile.readlines()
    sum = 0
    for i in num_file:
        number = float(i)
        sum = sum + number
    print("The total is:", sum)
    infile.close()
main()
File Output
To write data to a file, we open it with open() in write mode. If the named file does not exist, a new file will be created; if it already exists, its contents will be overwritten.
Choose a meaningful name for the output file.
outfile = open("output.txt", "w")
We can use the print() function to write to a file; we just need an extra keyword parameter:
file=<outputFile>
The following example takes the numbers from input.txt, calculates the sum of all numbers and writes the
sum to the output file output.txt.
# output.txt will be created in the same directory/folder where the input file and the program file are.
def main():
    infile = open("input.txt", 'r')
    outfile = open("output.txt", 'w')
    data = infile.readlines()
    sum = 0
    for i in data:
        number = float(i)
        sum = sum + number
    print("The total is:", sum, file=outfile)
    print("output.txt has the sum of all numbers.")
    infile.close()
    outfile.close()
main()
By using
file=outfile
we save the calculated sum in the output file.
An output file named output.txt will appear in the same project or the same directory/folder when you run the above program.
The sum of all numbers is listed in the output file.
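The short sketch below writes two lines with print() and the file= parameter, reads them back to check what was stored, and then opens the same file in "w" mode again to show that the old contents are replaced. The temporary folder and the file name report.txt are assumptions so that the example runs on its own.

```python
import os
import tempfile

# Temporary location for the output file (illustration only)
folder = tempfile.mkdtemp()
path = os.path.join(folder, "report.txt")

# print() with file= sends the text to the file instead of the screen
outfile = open(path, "w")
print("The total is:", 1455.0, file=outfile)
print("Done", file=outfile)
outfile.close()

infile = open(path, "r")
first_contents = infile.read()
infile.close()

# Opening the same file with "w" again discards what was there before
outfile = open(path, "w")
print("Fresh start", file=outfile)
outfile.close()

infile = open(path, "r")
second_contents = infile.read()
infile.close()

print(first_contents)
print(second_contents)
```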
Reading data from a CSV file
When analyzing large sets of data with multiple columns,
the simplest way is to use the NumPy loadtxt() function.
We can use an example to illustrate reading and writing CSV files.
What is a CSV file?
Any Excel spreadsheet data file can be saved as a comma-separated values (CSV) file.
So if you have a Grades.xlsx file, the CSV file saved using Excel's Save As command
would be Grades.csv
Here is the Grades.csv file. Make sure the Grades.csv and the python program are in the same directory/folder.
You can download a copy of Grades.csv here.
We will use some made-up CS100 student grades as an example.
import numpy as np
# The delimiter argument is necessary when loading a CSV file.
# The delimiter argument is not necessary when loading a whitespace-separated txt file.
# list() in front of zip() is necessary.
id, final = np.loadtxt("Grades.csv", skiprows=2, unpack=True, usecols=(0, 3), delimiter=',')
np.savetxt('output.txt', list(zip(id, final)), fmt="%8d")
There are four columns in the CSV file: student ID, midterm marks, lab marks and final marks.
To read in all four columns, we use the loadtxt() function to read the data
into four different arrays with the following statement:
id,mid,lab,final= np.loadtxt("Grades.csv",skiprows=2,unpack=True, delimiter=',')
The loadtxt() call above takes four arguments:
- The first argument is a string that is the name of the file to be read.
- The second argument, skiprows=2, tells loadtxt() to skip the first two lines at the top of the file, usually the header or information about the file.
- The third argument, unpack=True, tells loadtxt() to return each column as a separate array, so the columns can be assigned to separate variables.
- The fourth argument, delimiter=',', tells loadtxt() that the columns are separated by commas instead of whitespace such as spaces or tabs.
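To see the effect of these arguments without needing Grades.csv, here is a small sketch that feeds loadtxt() some made-up text through io.StringIO, which loadtxt() accepts in place of a file name. The header lines and the marks are invented for illustration and are not the contents of the real Grades.csv.

```python
import io
import numpy as np

# Hypothetical data: two header lines, then four comma-separated columns
text = """CS100 grades
id,midterm,lab,final
1001,80,75,90
1002,65,70,72
1003,55,60,58
"""

# With unpack=True, each column comes back as its own array
ids, mid, lab, final = np.loadtxt(io.StringIO(text), skiprows=2,
                                  unpack=True, delimiter=",")

# Without unpack=True, the result is one table with a row per student
table = np.loadtxt(io.StringIO(text), skiprows=2, delimiter=",")

print(ids)
print(final)
print(table.shape)
```

Note that loadtxt() returns floating-point numbers by default, so the IDs are stored as values such as 1001.0.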
If we only want to read in two specific columns, like the student ID and the final grades, we can add the usecols argument.
As a consequence, only two array names are included to the left of the "=" sign, corresponding to the two columns that are read:
id,final= np.loadtxt("Grades.csv",skiprows=2, unpack=True, usecols=(0,3),delimiter=',')
Writing data to a text file
One simple way for writing data files in text format is to use the NumPy savetxt() function.
np.savetxt('output.txt',list(zip(id,final)), fmt="%8d")
The first argument of savetxt() is a string, the name of the data file to be created. The output file
name is output.txt. Be careful with the file name: if you already have a file with the same name, the old file will be overwritten.
The second argument is the data array that is to be written to the data file. The zip() function packages the two arrays
as one and produces tuples, where the ith tuple contains the ith element from each of the arrays listed as its arguments.
Since there are two arrays, each row will be a tuple with two entries, producing a table with two columns.
In Python 3, zip() returns an iterator rather than a list, so the list() function is needed to construct the list of tuples.
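The pairing done by zip() and list() can be tried on its own with two short lists. The IDs and marks below are made up for illustration.

```python
ids = [1001, 1002, 1003]      # hypothetical student IDs
finals = [90, 72, 58]         # hypothetical final marks

pairs = zip(ids, finals)      # an iterator, not yet a list
rows = list(pairs)            # build the list of tuples

print(rows)
```

Each tuple in rows becomes one line of the output file when the list is passed to savetxt().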
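The third argument is the format of the output. With "%8d", each number is written as an integer (d), right-aligned in a field at least 8 characters wide
and padded with leading blanks. A two-digit mark such as 90 gets 6 leading blanks, which together with the default single-space delimiter gives 7 blanks between the columns.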
Here is the output file output.txt.
200435102       90
200445409       70
200438963       60
200420222       56
200222111       90
If the format is "%5.2f", each number is written as a float in a field at least 5 characters wide.
The precision, i.e. the number following the "." in the placeholder, is set to 2, so two digits are printed after the decimal point.
Finally, the last character "f" of the placeholder stands for "float".
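These format strings can be tried directly with Python's % operator, which uses the same placeholders. The values below are arbitrary examples, and repr() is used so that the padding blanks are visible inside the quotes.

```python
wide_int = "%8d" % 90              # padded on the left to 8 characters
long_int = "%8d" % 200435102       # already longer than 8, so no padding
two_decimals = "%5.2f" % 3.14159   # 5 characters wide, 2 digits after the "."
long_float = "%5.2f" % 1234.5      # the width is a minimum, not a limit

print(repr(wide_int))
print(repr(long_int))
print(repr(two_decimals))
print(repr(long_float))
```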
For writing data to a CSV file, you would specify a comma as the delimiter.
np.savetxt('output.csv',list(zip(id,final)), fmt="%8d", delimiter=",")
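As a quick check of the comma delimiter, the sketch below writes two made-up arrays to an in-memory text buffer (io.StringIO) instead of a real file, then reads the result back with loadtxt(). The buffer, the data and the choice of "%d" instead of "%8d" (so that no padding blanks appear) are assumptions made for this illustration.

```python
import io
import numpy as np

ids = np.array([1001, 1002, 1003])   # hypothetical student IDs
finals = np.array([90, 72, 58])      # hypothetical final marks

# Write to an in-memory buffer so no file is left behind
buffer = io.StringIO()
np.savetxt(buffer, list(zip(ids, finals)), fmt="%d", delimiter=",")
csv_text = buffer.getvalue()
print(csv_text)

# Read the CSV text back into two arrays
ids_back, finals_back = np.loadtxt(io.StringIO(csv_text), delimiter=",",
                                   unpack=True)
print(ids_back, finals_back)
```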
The following link provides the details:
https://physics.nyu.edu/pine/pymanual/html/chap4/chap4_io.html#file-input
Lab Assignment
1. Modify the following io_asg1.py program, so it will:
- Add a loop structure to go through all the numbers in the input file.
- Sum up all the numbers.
- Calculate the average of all numbers.
- Print the sum and the average in the output file.
- Format the sum and the average with two decimal places.
def main():
    infile = open("input.txt", 'r')
    outfile = open("output.txt", 'w')
    data = infile.readlines()
    sum = 0
    # add a variable named count to keep track of how many numbers are in the input file; it is needed to calculate the average
    # add code here to loop through the numbers
    # add code here to sum up all the numbers
    # add code here to calculate the average of the numbers
    print("The sum of all numbers is:", sum, file=outfile)
    # add code here to print the average of all numbers in the output file, formatted to 2 decimal places
    print("The sum and average of the numbers are in output.txt")
    infile.close()
    outfile.close()
main()
Here is the input.txt file if you need to create it.
230.78
500.34
100.50
25
600
The output.txt file should look similar to the following:
The sum of all numbers is: 1456.62
The average of all numbers is: 291.32
2. Add code in io_asg2.py to read the data from MyData.csv into four NumPy arrays with the variable names lab1, lab2, lab3 and lab4.
In the same script, write the data out to MyDataOut.csv data file,
without the header and only the data displayed in four columns. Use a single format specifier and set it to "%0.2f".
MyData.csv is already in the replit team project; you can also download a copy of MyData.csv here.
Here is what MyDataOut.csv looks like.
7.50,8.90,5.00,6.70
6.40,9.50,7.20,9.10
Demonstrate your completed programs to the lab instructor if the lab is in person.