Posts

Showing posts from 2016

Map multiple annotations using pandas

A simple pandas solution to map multiple annotations for a protein. A protein or gene file will have annotations curated by different methods. Most frequently, biologists will encounter more than one annotation for a single protein. It is a task in itself to pick the right annotation. One of the simple ways is to consolidate them and pick the right ones after enough evidence is known. Coding in matlab or other languages may require more number of lines to achieve the same output. Here, a simple 'groupby' of pandas can do produce the same outputin seconds !# Map multiple annotation for protein ID and join with a delimiter''' Sample input A_xx Annotation1 A_xx Annotation2 B_xx Annotation1 Sample output A_xx Annotation1, Annotation2 B_xx Annotation1'''importpandasaspd data = pd.read_csv('input_file.txt', delimiter='\t') dfc = data.groupby(['PROTID'])['ANNOTATION'].apply(", ".join) print dfc dfc.to_csv('Outp…

Pick Matching lines with list of keywords

#Simple code to find the occurrence of list of search terms in single line in a huge file. Search_terms=['A','B','C','D']withopen('BigFile.txt','r')asinfile:entries=infile.read()each_line=entries.splitlines()new_list=[]forix,rowinenumerate(each_line):element=row.split("\t")ifset(Search_terms)==set(element):#Use <= if you want no strict optionnew_list.append(element)printix+1#List the matching line number !thefile=open('Output.txt','w')foriteminnew_list:print>>thefile,item

Install Parallel versions of Python from source

Few programs require certain versions of python. Especially, when you do not have root permission in your unix machine, you can still install a python version in your /home/usr directory. You can run it in parallel to in-built version. Follow these steps:

1. Unpack with tar -xvf python-x.x.tar

2. ./configure

3. make altinstall prefix=~ exec-prefix=~ (./lib and ./bin will be created in your home directory(~))

4. create alias for python-x.x => i.e cd ~/bin/ > ln -s python-x.x python! If you create this outside bin directory, you will get a warning (ln: symbolic link already exists !! remember it refers to inbuilt python)

5. Complete alias creation by editing: vi ~/.bashrc with alias python='~/bin/python'

Now any software you try to install with python setup.py it will access the version you installed in your home !

You can still install softwares using your version by skipping steps 4 & 5. But, everytime you need to keep specifying the export PATH & export PYTHONP…

Fasta_Header_Rename

A simple matlab code to rename the headers in fasta file. Self-explanatory variable names.

1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31%Author = Arun Prasanna%Rename the headers in fasta file to desired choice%For example: The input fasta file used here had header in >num_name format%strtok is used to strip and extract the required format clear; clc; tic; Path = 'Drive\Path\ToReadFile';% FileList = dir(Path); [rFL, cFL] = size(FileList); for i = 3:rFL %i of 1 & 2 are . & .. respectively     Fas_Fname{i-2,1} = FileList(i).name; %FileList is a structureend [rFas,cFas] = size(Fas_Fname); for i = 1:rFas     clear Header Seq ProtID Sp new_Header     OpenFile = cell2mat(strcat(Path,Fas_Fname(i)));     [Header, Seq] = fastaread(OpenFile);[rH,cH] = size(Header);     for j = 1:cH         [ProtID, Sp] = strtok(Header(1,j),'_'); %First Sp = _name         [Sp, rm] = strtok(Sp,'_'); %Second SP = name ! which we ne…

Fasta_Dupicate_Header

A simple, self-explanatory matlab code to identify duplicate headers in fasta files.

1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26%Simple matlab code to check for the duplicate header in fasta files%Store the size of header -> unique(header) ->size of new header%Copy the Table data in excel and compare the two values clear; clc; tic; Path = 'Drive\Path\FileName';% FileList = dir(Path); [rFL, cFL] = size(FileList); for i = 3:rFL %i of 1 & 2 are . & .. respectively     Fas_Fname{i-2,1} = FileList(i).name; %FileList is a structureend [rFas,cFas] = size(Fas_Fname); for i = 1:rFas     clear Header Seq Old_Header Unik_header     OpenFile = cell2mat(strcat(Path,Fas_Fname(i)));     [Header, Seq] = fastaread(OpenFile);[rH,cH] = size(Header);     Old_Header = length(Header);     Unik_header = length(unique(Header));     Table{i,1} = Fas_Fname(i);     Table{i,2} = num2str(Old_Header);     Table{i,3} = num2str(Unik_header);     fprintf('…

Fasta Header Replacer V2.0

Extension of previous code 'Fasta Header Replacer.m' to process files in batch mode. Keep all/only the .fasta files inside the specified directory.

%Author: Arun Prasanna%Version 2.0 of Fasta_Header_replacer.m!.%Efficient to process files in batch mode. clear;clc;FileList=dir('D:\BRC_POSTDOC-RESEARCH\ARMILLARIA_Project\PROTEIN_FASTA');[rFL,cFL]=size(FileList);fori=3:rFL%i of 1 & 2 are . & .. respectivelyOrg_name{i-2,1}=FileList(i).name;%FileList is a structureend[rOn,cOn]=size(Org_name);forOL=1:rOnFileName=char(Org_name{OL});[Header,Seq]=fastaread(FileName);Header=Header';Seq=Seq';[rH,cH]=size(Header);check(OL,1)=rH;forIL=1:rHid=num2str(IL);[tok,rem]=strtok(FileName,'.');%Extract org name from FileName itself.new{IL,1}=strcat(id,'_',tok);endOutFileName=strcat(tok,'_mod',rem);fastawrite(OutFileName,new,Seq)TransTab=horzcat(new,Header);%===========Section-to-write-cell-array-2-Txt-file===========%TT=strcat(

Fasta Header Replacer

Handling sequence files (like .fasta) is one of the trickiest problems for novice in Bioinformatics. Bio-Perl, Bio-python are quite useful but looks really scary :-( !.
MATLAB offers a cool solution with its in-built Bioinformatics toolbox !!. Reading a fasta file with 'fastaread' is as easy as 'xlsread' ...followingly the same with 'fastawrite'/'xlswrite' :-)

fastaread simply extract the sequence headers & sequences in cell arrays !. Voila !!! Once it does...then one can do all kinds of manipulation they want.

Here is a simple-self-explanatory, one-file-at-a-time code to replace the header with an user-defined headers. Besides, creates a translation table. If you want to process multiple file then one can readily loop it over directory operations.

INPUT (sequence.fasta)

>gi|154163|gb|M83220.1|STYLEXA Salmonella typhimurium lexA (repressor of DNA damage inducible genes) gene, 5' end
ATGCGCCAGCTGCAAAATTTAAAT

>gi|154164|gb|M83220.1|STYLEXA S…

Presence Absence Matrix

Given a cluster file, one can create a Presence-Absence matrix (PA map). With this self-explanatory simple matlab file it is easy to create one.

Input Format:

Cluster file.xlsx:
(1) A  B  C  D
(2) A  A  A
(3) B C
(4) D D A

List File.xlsx:
A
B
C
D

Output: (of course, the output will have the file with only numbers printed)

     A B C D
(1) 1  1  1  1
(2) 1  0  0  0
(3) 0  1  1  0
(4) 1  0  0  1



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26%Author = Arun Prasanna%Create a presence absence matrix (PA map) from cluster informationclear;clc;[mat1,mat]=xlsread('ClusterFile.xlsx','Sheet1');clearmat1[mat2,head]=xlsread('List.xlsx','Header');clearmat2new_head=head(:,col_val);%col_val = 2 => column that has unique sp/gene list[rmat,cmat]=size(mat);[rhead,chead]=size(new_head);out=zeros(rhead,rmat);fori=1:rheadforj=1:rmatfork=1:cmatcmp=strcmp(new_head(i,1),mat(j,k));if(cmp==1)out(i,j)=1;break;endendendendOF=xlswrite('Cluster…

Gene Copy Number Matrix

Given a cluster file, one can create a gene copy number matrix (GCN). With this self-explanatory simple matlab file it is easy to create one.
Input Format:
Cluster file.xlsx:
(1) A  B  C  D
(2) A  A  A
(3) B C
(4) D D A

List File.xlsx:
A
B
C
D

Output: (of course, the output will have the file with only numbers printed)

     A B C D
(1) 1  1  1  1
(2) 3  0  0  0
(3) 0  1  1  0
(4) 1  0  0  2


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29%Author = Arun Prasanna%Create a gene copy number matrix from cluster informationclear;clc;tic[mat1,mat]=xlsread('ClusterFile.xlsx','Sheet1');clearmat1[mat2,head]=xlsread('Organism_list.xlsx','Sheet1');clearmat2%species/gene namenew_head=head(:,1)';%transpose to make it as header[rmat,cmat]=size(mat);[rhead,chead]=size(new_head);out=zeros(rmat,chead);counter=0;fori=1:cheadi%print value of i to track progressforj=1:rmatfork=1:cmatcmp=strcmp(new_head(1,i),mat(j,k));chk=strcmp(mat…

Copy | Rm first column in text file

Simple, intutive & self-explanatory code to copy or remove columns in tab delimited text files. Remember AWK or Sed one liners are very handy too. But sometimes, if there are space or inconsistencies in file they may fail.




1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43__author__ ='Arun Prasanna''''Remove first column of the txt file'''withopen('Inputfile.txt','r') as infile: entries = infile.read() each_line = entries.splitlines() new_list = [] for row in each_line: element = row.split("\t") ele_size =len(element) for i inrange(1, ele_size): tmp = element[i] new_list.append(tmp) new_list.append('\t') new_list.append('\n') f =open('Output.txt','w') out= f.writelines(new_list) f.close() print"Program complete"========================================== __autho…

Hash_lookup for Cluster data

Hashes are exciting !! Hash tables are one of the most powerful lookup operations !. It gives the power of random access & hence lightning fast :-). 
Imagine that, you have non-homogenous data (again: inconsistent number of column values, for instance cluster data). You have a list and you have to map the entries or fish the desired value from the huge file ! For example, your query file has 65000 entries and cluster file has 100000 entries in which the first column is the correspondence. If you just write the basic 'for loop' it is going to iterate atleast 65000 * 100000 (in case of thorough search) or slightly lesser if you break after a match is found. In any case, it can time humongous amount of time. Solution ? Hashes !! For same number of entries, my job took 3-7 seconds ! 
Dictionaries in python can also do the same thing ! 

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35#Code to map cluster & query data. …

Count elements in each row

Python code to count the number of elements (genes | proteins | genus ...) in each row in a non-homogenous cluster files.

Example Input:

g1  g2   g3  g4
g2
g4  g6  g7

Example Output:
4
1
3


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18__author__='Arun Prasanna''''Python code to count number of elements in non-homogenous text file. Small, simple & self explanatory code ! '''withopen('Input.txt','r')asinfile:entries=infile.read()each_line=entries.splitlines()new_list=[]forrowineach_line:element=row.split("\t")ele_size=len(element)new_list.append(str(ele_size))new_list.append('\n')f=open('Count_EachRowElements.txt','w')out=f.writelines(new_list)f.close()print"Program complete"

Strip Gene IDs

Code is useful to refine clustering data with IDs tagged with genus name. The output can be used to count protein copy numbers and etc., to create phyletic matrices or copy number matrices.

Example Input:

123_g1   NID_g2  
4567_g3   xx_g4    012_g6
NID_g10  ACC_g4

Example Output:

g1    g2
g3    g4    g6
g10  g4



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24__author__='Arun Prasanna'''' Program to read the input file with 'genus_number' format names and convert that to => Each lines can have different number of elements (non-homogenous data)1. 'genus_ID' to 'genus' format => [0] in split2. 'ID_genus' to 'genus' format => [1] in split'''withopen('Input_file.txt','r')asinfile:entries=infile.read()each_line=entries.splitlines()new_list=[]forrowineach_line:element=row.split("\t")ele_size=len(element)foriinrange(0,ele_size):#CHANGE: if clustname is there make …

Presence-absence Matrix to Fasta format

Convert the binary matrix to fasta format with this simple code in Python !. Recommended for larger file sizes !

Sample Input:
Sp11111
Sp20000
Sp31100

Sample Output:
>Sp1
['1', '1', '1', '1']
>Sp2
['0', '0', '0', '0']
>Sp3
['1', '1', '0', '0']

Workaround the output file in any text file to remove [,'  ] to generate final output as:
>Sp1
1111
>Sp2
0000
>Sp3
1100

__author__='Arun Prasanna''''This is a simple python code to convert binary matrix into fasta format. There are many public softwares available. But each one has size limits (<=2MB). This code processes 61 x 51000 character matrix in less than 10 seconds in python 2.7! '''withopen('infile_matrix-cp','r')asinfile:entries=infile.read().strip()each_line=entries.splitlines()Header=[]forrowineach_line:element=row.split("\t")Sym='>'#Add > sy…

BLAST_TABULAR_OUTPUT Generator

Few MPI versions do not support the very useful -m 8 tabular output format. The following code can be used to parse the raw blast output obtained from mpi_blast to tabular output file. The creditials of original creator and modifications are given in comments section of the code. #=========================================================================================# Modified BLAST Parser - Script for parsing the BLAST output in tabular format# Version 1.1.6 (November 26, 2015)# Modified by Arun Prasanna (arunprasanna83@gmail.com) from orginal script by Kirill Kryukov (http://kirill-kryukov.com/kirr/). He also has many other useful utilities..Check out !!# # # Purpose: Few versions of MPI-BLAST do not support tabular output or often the user faces # error while opting for tabular output. Hence the original code is modified to produce # tabular Output verbatim -m 8 option. The output file is compliant with other clustering # programs like Silix# # Changes made are:# 1. Added …