Working algos for Biological Data: Simple to Complex problems

Posts

Calculate Cys-Richness for a protein

June 06, 2017

'''
Code description: Flag a protein as Cys-rich when it meets both criteria:
    >= 4 'C's over the length of the protein AND >= 5% total cysteine content
Function: Take input sequence => Count number of 'C's & length => Calculate percentage
Output: 'Cys-rich' if both criteria are met, otherwise 'No'
'''
from Bio import SeqIO

def Cys_rich(record_seq):
    C_count = record_seq.count('C')
    seq_len = len(record_seq)
    if seq_len == 0:  # Guard against empty records (avoids division by zero)
        return 'No'
    Cys_perc = float(C_count) / float(seq_len) * 100
    if C_count >= 4 and Cys_perc >= 5.0:
        return 'Cys-rich'
    else:
        return 'No'

# Collect every record that passes both criteria
CysRichSeq = []
for record in SeqIO.parse('filename.fasta', 'fasta'):
    if Cys_rich(record.seq) == 'Cys-rich':
        CysRichSeq.append(record)

SeqIO.write(CysRichSeq, 'Cys-rich_sequences.fasta', 'fasta')
print('Cys-rich sequences written to file..')
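To see how the two criteria interact, here is a minimal sketch that applies the same rule to plain Python strings, so it runs without Biopython or a FASTA file. The function names `cys_stats` and `is_cys_rich`, the keyword parameters and the toy sequences are made up for illustration; only the thresholds (at least 4 'C's and at least 5% cysteine) come from the post.

```python
def cys_stats(sequence):
    """Return (number of 'C's, cysteine percentage) for a sequence string."""
    count = sequence.count("C")
    # An empty sequence has no cysteine content; avoid dividing by zero.
    percent = 100.0 * count / len(sequence) if sequence else 0.0
    return count, percent


def is_cys_rich(sequence, min_count=4, min_percent=5.0):
    """True only when BOTH the count and the percentage criteria are met."""
    count, percent = cys_stats(sequence)
    return count >= min_count and percent >= min_percent


# Toy sequences (hypothetical), one per case of interest.
examples = {
    "passes both": "ACDCGCHCKL",            # 4 C's in 10 residues
    "too few C's": "CCCAAAAAAA",            # 3 C's in 10 residues
    "too dilute": "CCCC" + "A" * 96,        # 4 C's in 100 residues
    "exactly on threshold": "CCCC" + "A" * 76,  # 4 C's in 80 residues
}

for label, seq in examples.items():
    count, percent = cys_stats(seq)
    print(label, count, round(percent, 1), is_cys_rich(seq))
```

A short peptide can have a high cysteine percentage with very few cysteines, and a long protein can have many cysteines at a low percentage, which is why both conditions are checked together.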

Map multiple annotations using pandas

December 29, 2016

A simple pandas solution to map multiple annotations for a protein. A protein or gene file will have annotations curated by different methods, and biologists will frequently encounter more than one annotation for a single protein. Picking the right annotation is a task in itself. One simple approach is to consolidate them and pick the right ones once enough evidence is known. Coding this in MATLAB or other languages may take more lines to achieve the same output. Here, a simple pandas 'groupby' produces the same output in seconds!

# Map multiple annotation for protein ID and join with a delimiter
'''
Sample input
A_xx Annotation1
A_xx Annotation2
B_xx Annotation1

Sample output
A_xx Annotation1, Annotation2
B_xx Annotation1
'''
import pandas as pd
data = pd.read_csv('input_file.txt', delimiter='\t')
dfc = data.groupby(['PROTID'])['ANNOTATION...
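The excerpt above is cut off, so the following is only a sketch of how such a consolidation could look, not the post's actual code. It assumes a tab-separated table with columns named `PROTID` and `ANNOTATION` (the second name is a guess, since the original is truncated) and reads the sample input from an in-memory string instead of `input_file.txt`. The helper name `consolidate_annotations` is hypothetical.

```python
import io

import pandas as pd

# Sample input from the post, with an assumed header row.
SAMPLE = (
    "PROTID\tANNOTATION\n"
    "A_xx\tAnnotation1\n"
    "A_xx\tAnnotation2\n"
    "B_xx\tAnnotation1\n"
)


def consolidate_annotations(df, id_col="PROTID", ann_col="ANNOTATION", sep=", "):
    """Join all annotations that share a protein ID into one delimited string."""
    # sort=False keeps the protein IDs in order of first appearance.
    grouped = df.groupby(id_col, sort=False)[ann_col].agg(sep.join)
    return grouped.reset_index()


data = pd.read_csv(io.StringIO(SAMPLE), delimiter="\t")
merged = consolidate_annotations(data)
print(merged.to_csv(sep="\t", index=False))
```

Passing `sep.join` to `agg` applies the string join to each group's annotations, so every protein ID ends up on a single row.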

Pick Matching lines with list of keywords

October 25, 2016

# Simple code to find the occurrence of a list of search terms in a single line of a huge file.
Search_terms = ['A', 'B', 'C', 'D']

with open('BigFile.txt', 'r') as infile:
    entries = infile.read()
each_line = entries.splitlines()

new_list = []
for ix, row in enumerate(each_line):
    element = row.split("\t")
    if set(Search_terms) == set(element):  # Use <= if you do not want the strict option
        new_list.append(element)
        print(ix + 1)  # List the matching line number

# Write the matching lines; the with-block closes the file
with open('Output.txt', 'w') as thefile:
    for item in new_list:
        print(item, file=thefile)
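The difference between the strict (`==`) and non-strict (`<=`) comparison is easy to miss, so here is a small self-contained sketch of both on in-memory lines. The function name `match_lines`, the `strict` flag and the sample rows are made up for illustration.

```python
def match_lines(lines, search_terms, strict=True):
    """Return 1-based line numbers whose tab-separated fields match the terms.

    strict=True  -> the set of fields must equal the set of search terms.
    strict=False -> every search term must be present; extra fields are allowed.
    """
    wanted = set(search_terms)
    hits = []
    for line_no, row in enumerate(lines, start=1):
        fields = set(row.rstrip("\n").split("\t"))
        matched = (wanted == fields) if strict else (wanted <= fields)
        if matched:
            hits.append(line_no)
    return hits


sample = [
    "A\tB\tC\tD",       # exact set of terms
    "D\tC\tB\tA",       # same terms, different order
    "A\tB\tC",          # one term missing
    "A\tB\tC\tD\tE",    # all terms plus an extra field
    "A\tA\tB\tC\tD",    # repeated field, same set
]
terms = ["A", "B", "C", "D"]

print("strict:", match_lines(sample, terms, strict=True))
print("non-strict:", match_lines(sample, terms, strict=False))
```

Because sets ignore order and repeats, a line with shuffled or duplicated fields still counts as a strict match. Only an extra or missing field changes the result.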

Install Parallel versions of Python from source

August 03, 2016

A few programs require specific versions of Python. Even when you do not have root permission on your Unix machine, you can still install a Python version in your /home/usr directory and run it in parallel with the built-in version. Follow these steps:

1. Unpack with tar -xvf python-x.x.tar
2. ./configure
3. make altinstall prefix=~ exec-prefix=~ (./lib and ./bin will be created in your home directory (~))
4. Create an alias for python-x.x, i.e. cd ~/bin/ and then ln -s python-x.x python. If you create this outside the bin directory, you will get a warning (ln: symbolic link already exists !! remember it refers to inbuilt python)
5. Complete the alias creation by editing ~/.bashrc (vi ~/.bashrc) and adding alias python='~/bin/python'

Now any software you install with python setup.py will use the version you installed in your home directory. You can still install software using your version while skipping steps 4 & 5, but every time you need to keep specifying the export PATH ...

Fasta_Header_Rename

May 20, 2016

A simple MATLAB code to rename the headers in a fasta file. Self-explanatory variable names.

%Author = Arun Prasanna
%Rename the headers in fasta file to desired choice
%For example: The input fasta file used here had header in >num_name format
%strtok is used to strip and extract the required format
clear; clc; tic;
Path = 'Drive\Path\ToReadFile';
FileList = dir(Path);
[rFL, cFL] = size(FileList);
for i = 3:rFL %i of 1 & 2 are . & .. respectively
    Fas_Fname{i-2,1} = FileList(i).name; %FileList is a structure
end
[rFas,cFas] = size(Fas_Fname);
for i = 1:rFas
    clear Header Seq ProtID Sp new_Header
    OpenFile = cell2mat(strcat(Path,Fas_Fname(i)));
    [Header, Seq] = fastaread(OpenFile);
    [rH,cH] = size(Header);
    for j = 1:cH ...
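The MATLAB excerpt is cut off before the renaming step, so the target header format is not known. As a rough Python sketch of the same idea, the snippet below assumes the goal is to drop the leading number from a `>num_name` header and keep the name. That choice, the function names `rename_header` and `rename_fasta_lines`, and the sample records are assumptions made for illustration, not the author's method.

```python
def rename_header(header):
    """Split 'num_name' at the first underscore and keep the part after it.

    Headers without an underscore (or with nothing after it) are returned
    unchanged. This mirrors a strtok-style split on '_'.
    """
    prefix, sep, rest = header.partition("_")
    return rest if sep and rest else header


def rename_fasta_lines(lines):
    """Rewrite header lines (starting with '>') and pass sequence lines through."""
    renamed = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            renamed.append(">" + rename_header(line[1:].strip()))
        else:
            renamed.append(line)
    return renamed


# Hypothetical records in the >num_name format described in the post.
sample = [">12_ProtA", "MKTCC", ">7_Prot_B", "MCCA", ">NoUnderscore", "MKV"]
for line in rename_fasta_lines(sample):
    print(line)
```

Only the first underscore is used as the split point, so names that themselves contain underscores are kept whole.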

Fasta_Duplicate_Header

May 20, 2016

A simple, self-explanatory MATLAB code to identify duplicate headers in fasta files.

%Simple matlab code to check for the duplicate header in fasta files
%Store the size of header -> unique(header) ->size of new header
%Copy the Table data in excel and compare the two values
clear; clc; tic;
Path = 'Drive\Path\FileName';
FileList = dir(Path);
[rFL, cFL] = size(FileList);
for i = 3:rFL %i of 1 & 2 are . & .. respectively
    Fas_Fname{i-2,1} = FileList(i).name; %FileList is a structure
end
[rFas,cFas] = size(Fas_Fname);
for i = 1:rFas
    clear Header Seq Old_Header Unik_header
    OpenFile = cell2mat(strcat(Path,Fas_Fname(i)));
    [Header, Seq] = fastaread(OpenFile);
    [rH,cH] = size(Header);
    Old_Header = length(Header);
    Unik_header = length(...
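The check described in the comments (compare the number of headers with the number of unique headers) can be sketched in Python as follows. This is an illustrative equivalent rather than a translation of the truncated MATLAB code. The function names `fasta_headers` and `duplicate_headers` and the sample records are hypothetical.

```python
from collections import Counter


def fasta_headers(lines):
    """Collect header text (without the leading '>') from FASTA lines."""
    return [line[1:].strip() for line in lines if line.startswith(">")]


def duplicate_headers(headers):
    """Return {header: occurrences} for every header seen more than once."""
    counts = Counter(headers)
    return {header: n for header, n in counts.items() if n > 1}


# Hypothetical FASTA content with one repeated header.
sample = """>1_protA
MKTCC
>2_protB
MCCA
>1_protA
MKV
""".splitlines()

headers = fasta_headers(sample)
# Same comparison as in the post: total headers vs. unique headers.
print("headers:", len(headers), "unique:", len(set(headers)))
print("duplicates:", duplicate_headers(headers))
```

If the two counts differ, the file contains duplicates, and the `Counter` shows which headers are repeated and how often, which removes the need to compare the values by hand in a spreadsheet.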
