Thermopia: class notes on thermodynamics: April 2009

Tuesday, April 14, 2009

Sums in NumPy

Class,

It was just pointed out to me that there is a difference between sum and np.sum. Consider the following,

>>> a = np.array([[0,2,3], [5,2,3]], float)
>>> sum(a)
array([ 5.,  4.,  6.])
>>> np.sum(a)
15.0

The sum function will only sum an array along the smallest axis, while the np.sum function sums the entire array. For your dot products, you want to use the np.sum function instead of just sum.

MSS

Dear class-

There is a typo in the global minimum energy for N=9. It should read 24.1134, not 21.1134.

Also, if you are having trouble getting f2py working on Windows, here are some potential troubleshooting steps:

(1) Close out and re-open your command prompt after any change to the system (e.g., environment variables).

(2) Re-download libpython25.a from the course website and install it in c:\python25\libs (plural).

(3) When running f2py at the command line, try "python c:\python25\scripts\f2py.py" instead of just "f2py".

(4) If all else fails, reinstall Python. First, uninstall VPython, SciPy, NumPy, and Python from Add/Remove Programs. Then, delete the folder c:\python25 entirely. Next reinstall these utilities in this order: Python, NumPy, SciPy, and VPython. You may want to hold off installing VPython until you've tried to get f2py to work.

I apologize for any headaches here, but once you have everything installed, it will be smooth-sailing from now on. Moreover, rest assured the Python/SciPy/NumPy development communities are working to automate /streamline installs so that you shouldn't have to worry about these issues down the road.

-MSS

Monday, April 13, 2009

Dot products in exercise 2

The vectors described in the notes on minimization correspond to vectors in 3N-dimensional space. Therefore, the dot products for the conjugate gradient formulas indicate the multidimensional dot product in 3N-dimensional space. In other words

f^N . f^N = fx,1^2 + fy,1^2 + fz,1^2 + fx,2^2 + ... + fz,N^2

The result should be a scalar.

You will be working with (N,3) arrays in NumPy. For these kinds of arrays, NumPy's dot product will not give you the behavior above, since this function is a generalized matrix multiplication and NumPy will think your two-dimensional array is intended as a matrix, not a 3N-dimensional vector.

Instead you want to use something like:

np.sum(Forces**2)

Or,

np.sum((Forces - OldForces) * Forces)

Here, this code indicates that the Forces array will be squared on an element-by-element basis, and then these 3N elements will be summed.

Cheers,
MSS

Correction on installing f2py in Windows

Dear class-

I realized there is a mistake on the web page and in the tutorial on f2py. It says to download and place the file libpython25.a in the following location:

c:\Python25\lib\libpython25.a

This is an error. There is both a lib folder and a libs folder, and the file should be placed in:

c:\Python25\libs\libpython25.a

If there is already a file there with that name, you can overwrite it. The true libpython25.a should be about 700 kB in size.

MSS

Thursday, April 9, 2009

Ex1

Dear class,

If you are a chemical engineering grad student, I am going to put your Exercise 1 results in your boxes so that you have the comments as you begin to work on Exercise 2. If you are in a different department or are an undergrad, you can see me in my office to get yours back, or you can just wait until Tuesday.

There were only two main coding points that I want to emphasize.

The first is to keep in mind that Python indexing starts at 0, not 1. Moreover, slice notation is exclusive of the upper bound. So to take text from the third to and including the 10th columns in a pdb file, use line[2:10]. The 2 and 10 designate the third and 11th characters in line, but the upper bound is not included in the slice.

The second is to keep in mind that Python enables you to iterate over lists directly. Instead of:

for i in range(len(PdbFiles)):
  PdbFile = PdbFiles[i]
  ...

use the following:

for PdbFile in PdbFiles:
  ...

In addition, we can use list constructors to make our code much clearer. Consider the following code:

Phobics = ["ALA", "CYS", ...]
IsPhobic = []
for Res in ResNames:
  if Res in Phobics:
      IsPhobic.append(True)
  else:
      IsPhobic.append(False)

This can be made much shorter and clearer using:

Phobics = ["ALA", "CYS", ...]
IsPhobic = [Res in Phobics for Res in ResNames]

These hints aren't to curb your programming style, but rather help you take advantage of Python's programming features so that ultimately you will be able to write code very efficiently. Moreover, when you take advantage of Python's and NumPy's built-in routines and syntax, your code actually executes much faster.

Hope that helps as you are beginning Exercise 2.

Cheers,
MSS

Wednesday, April 8, 2009

Some benchmarks and coding hints

Dear class,

If you are having trouble with your code, let me provide some help here.

First, if you are wondering what some typical results are, I am getting Rg values to be in the range of about 14-24 (in Angstroms) and the ratio to vary between roughly 0.90 and 0.98.

Second, it is important that you consider the notation |r| to denote a vector norm. That is, it gives the length of the vector r. Thus,

|a-b|^2

gives the squared length of the vector pointing from b to a. This could be expanded as:

(xa - xb)^2 + (ya - yb)^2 + (za - zb)^2

where a = (xa, ya, za) and b = (zb, yb, zb).

Now, consider the radius of gyration. For a set of points, this is simply the root-mean-square-distance of the points from their centroid, where the centroid is defined as the average position of the points.

For the function you are to write called RadiusOfGyration, you need to compute the centroid and, subsequently, the squared distances of each position from it. To compute the centroid, you might write this code:

rm = np.zeros(3, float)
for (x, y, z) in Pos:
  rm[0] = rm[0] + x
  rm[1] = rm[1] + y
  rm[2] = rm[2] + z
rm = rm / len(Pos)

Better yet, we could use the fact that NumPy allows element-by-element operations on arrays:

rm = np.zeros(3, float)
for Point in Pos:
  rm = rm + Point
rm = rm / len(Pos)

Even better yet, we could use the sum function and specify that we want to sum over the first axis, the one corresponding to the particle index:

rm = Pos.sum(axis=0) / len(Pos)

Best yet, we can use the mean function:

rm = Pos.mean(axis=0)

Once we have computed the centroid, we might compute the Rg as follows:

RgSq = 0.
for (x, y, z) in Pos:
  RgSq = RgSq + (x-rm[0])**2 + (y-rm[1])**2 + (z-rm[2])**2
Rg = np.sqrt(RgSq / len(Pos))

Again, we can use element-by-element operations:

RgSq = 0.
for Point in Pos:
  RgSq = RgSq + sum((Point - rm)**2)
Rg = np.sqrt(RgSq / len(Pos))

Even better, we can use NumPy's broadcasting feature:

RgSq = sum((Pos - rm)**2)
Rg = np.sqrt(RgSq / len(Pos))

Here, Pos is an (N,3) array and rm is a (3) array. In taking the difference between the two, NumPy automatically broadcasts rm to be the same shape as Pos by duplicating it over N times.

Putting everything together, the RadiusOfGyration function can be completely written in a single line:

return np.sqrt(sum( (Pos - Pos.mean(axis=0))**2 ) / len(Pos))

Hope that helps,

MSS

Tuesday, April 7, 2009

Correction to ex1

Hi class-

In the exercise 1 assignment, it states to use:

Pos[IsPhobic]

to grab just the positions of the hydrophobic residues. This does work, but only if IsPhobic is a NumPy array of Boolean values. If you have made IsPhobic as a list (which is most likely the case), you'll have to convert it to an array:

IsPhobic = np.array(IsPhobic, bool)

Apologies for the confusion,
MSS

Monday, April 6, 2009

EX1

Hi class-

It was just pointed out to me that some of the protein structure files I posted have some complicating factors in them that can result in multiple alpha carbon coordinates for the same amino acid. I have now preprocessed and cleaned up these files. You should download the new proteins.zip or proteins.tar.gz files from the website.

Also, here are some more programming hints:

When you are parsing a pdb file, be sure to take the exact columns that the pdb format specifies. The different fields are not necessarily separated by spaces and so splitting a string apart can fail. You can access specific subsections of a string using slice notation. For example, if line is the variable containing the contents of a line read in from a file, you can get columns 18-20 (inclusive) using the notation line[17:20].

Also, you can use Python's list constructors to your advantage. For your function that returns the IsPhobic array, you can easily iterate over the residues in a form like:

[Res in Phobics for Res in ResNames]

where Phobics is a list containing hydrophobic amino acid names. The "in" statement here will result in either a True or False value so that the constructed array will contain only Booleans.

Cheers,
MSS

More hints for exercise 1

When reading a pdb file, you might want to first create two empty lists and add to them as you are parsing the lines. At the end, you can turn the position list into an array. For example,

Pos = []
ResNames = []
for line in file(PdbFile, 'r').readlines():
 if line.startswith("TER"):
     #break out of the loop over lines
     break
 elif line.startswith("ATOM"):
     #check to see if it's an alpha carbon
     if line[12:16].strip() == "CA":  
         #.... your code here to get x, y, z
         #and ResName from the string line...
         Pos.append([x, y, z])
         ResNames.append(ResName)
Pos = np.array(Pos, float)

MSS

Thursday, April 2, 2009

Python and support packages on Macs

Nathan D. has graciously shared some instructions for getting Python, NumPy, SciPy, and gfortran installed on Macs. Let me know if this doesn't work out for you:

1) Install the python package from here:
http://wiki.python.org/moin/MacPython/Leopard

2) Run this script to install SciPy, NumPy, and gfortran which you can download here:
http://macinscience.org/?page_id=6

MSS

Hint on EX1

Dear CHE210D class:

In working on exercise 1, there are many ways to make your code very short using Python's string functions and list expressions & objects. Your final code should actually be quite short. For example, to test whether or not a given amino acid name is hydrophobic, you could use something like the following

>>> Phobics = ["ALA", ... , "TRP"]
>>> s = "ALA"
>>> s in Phobics
True
>>> s = "ARG"
>>> s in Phobics
False

-MSS

Thermopia: class notes on thermodynamics