# Writing a SimPlot Clone in Python

May 2, 2020 | 5 minutes

SimPlot is a great little program written by Stuart Ray, have a look here.

As you’ve probably already gathered from the title, I’m going to go over how to recreate the basic functionality of SimPlot in Python. Why? I wanted to accomplish two things: 1. Figure out exactly what’s happening behind the scenes including how some edge cases are handled. 2. Have something which I can run easily on my Mac or Linux computer.

### How Does it Work?

Provided with an alignment of two or more sequences (nucleotide or amino acid), the original SimPlot will calculate the % identity between them within a rolling window and then plot it. See below.

Of course, in practice the window will usually be much larger, as will the sequences! First things first, let’s write some python to deal with the pointer and rolling window.

## Creating the Pointer Class

We need to keep track of the current pointer position and the location of the window bounds (which we’ll call window_lower and window_upper) given the window size, step size, and current iteration. This is fairly simple but we need to remember that in Python all counters start at 0. We also need to decide on the pointer location — it always has to be an integer and so we can’t always just divide the window size in half.

Looking at the image above you’ll notice that we need to know whether the the window size is even or odd to set the pointer position. For the window bounds the calculation is the same. The solution that I’ve come up with might be different to how you would have done it but that’s ok!

``````class Pointer:

# Set all initial values using calculate() on load
def __init__(self, window, step):
self.window = window
self.step = step
self.iteration = 0
self.calculate()

# Read the current parameters and set values
def calculate(self):
self.window_lower = self.step * self.iteration
self.window_upper = self.window - 1 + self.step * self.iteration

# Is the window size odd or even?
# Set correct pointer location
if (self.window % 2) == 0:
self.pointer = (self.window / 2) - 1 + self.step * self.iteration
else:
self.pointer = (self.window - 1) / 2 + self.step * self.iteration

# Increase the iteration count and recalculate all values
def increment(self):
self.iteration += 1
self.calculate()

# Reset the iteration count to zero and recalculate all values
def reset(self):
self.iteration = 0
self.calculate()``````

## Calculating Similarity

Using the positions provided by the Pointer class, we can now extract the sequences within the rolling window (from the full-length sequences) and compare them. To keep things simple, we’ll abstract this into two functions, extract and compare.

``````# Take the Pointer object and extract a windowed sequence
# from the full-length sequence
def extract(pointer, sequence):

# Get the window boundaries
lower = pointer.window_lower
upper = pointer.window_upper
windowed_sequence = sequence[lower:upper]
return windowed_sequence``````

The extract function is not super neccessary (it can easily be shortened to one line) but it’s useful to have this functionality separate in case we build upon this in future.

We can run extract twice — once per sequence — and then pass the outputs into the compare function to get the identity.

``````def compare(sequenceA, sequenceB):
length = len(sequenceA)

# Initiate a counter
sum = 0

# Look at each position in turn
for base in range(0, length):

# Is the base/residue the same?
if ( sequenceA[base] == sequenceB[base] ):

# Increase the counter
sum += 1

# Convert the final count to a percentage
identity = sum / length * 100
return identity``````

Now we pretty much have all the basics sorted, lets create the main loop.

## The Main Loop

We need a way to get the aligned sequences into the script, for this tutorial we’ll just paste them in as strings but for the final program I’ll write a quick class to extract them from a file.

``````sequenceA = "AGAGCATGGATGGCAAGAAAGTT----ATAA ...continued "
sequenceB = "AGATGATGGATGGGGAGAAAGATCACTATAA ...continued "``````

Now to put everything above together…

``````# Setup the basic params
window = 100
step   = 5

# Initiate an empty dict to hold the identity values
combined_identities = {}

# Initiate the pointer class
pointer = Pointer(window,step)

# Iterate through the sequences
# but stop when the window goes past the last base/residue
while pointer.window_upper <= len(sequenceA):

# Extract the part of each sequence to compare
compareA = extract(pointer, sequenceA)
compareB = extract(pointer, sequenceA)

# Compare the sequences and get the % identity
identity = compare(compareA, compareB)

# Add the identity to the dict with the pointer as a key
combined_identities[pointer.pointer] = identity

# Increment the pointer and window location
pointer.increment()

# Complete! Let's see what the values are.
print(combined_identities)``````

You’ll have noticed that we also don’t have a nice way to export the output yet. Ideally I would like to export a csv file, which can be read into something like Excel or R, but again this is past the scope of this post. I’ll include the functionality in the final script.

## Playing with parameters

For the following plots, I took the output from above and make some nice figures using ggplot2 in R.