Posts

Showing posts from 2016

Map multiple annotations using pandas

A simple pandas solution to map multiple annotations for a protein.  A protein or gene file will have annotations curated by different methods. Most frequently, biologists will encounter more than one annotation for a single protein. It is a task in itself to pick the right annotation.  One of the simple ways is to consolidate them and pick the right ones after enough evidence is known.  Coding i n matlab or other languages may require more number of lines to achieve the same output. Here, a simple 'groupby' of pandas can do produce the same outputin seconds ! # Map multiple annotation for protein ID and join with a delimiter ''' Sample input A_xx Annotation1 A_xx Annotation2 B_xx Annotation1 Sample output A_xx Annotation1, Annotation2 B_xx Annotation1 ''' import pandas as pd data = pd . read_csv( 'input_file.txt' , delimiter = ' \t ' ) dfc = data . groupby([ 'PROTID' ])[ 'ANNOTATION&

Pick Matching lines with list of keywords

#Simple code to find the occurrence of list of search terms in single line in a huge file. Search_terms = [ 'A' , 'B' , 'C' , 'D' ] with open ( 'BigFile.txt' , 'r' ) as infile : entries = infile . read () each_line = entries . splitlines () new_list = [] for ix , row in enumerate ( each_line ): element = row . split ( "\t" ) if set ( Search_terms ) == set ( element ): #Use <= if you want no strict option new_list . append ( element ) print ix + 1 #List the matching line number ! thefile = open ( 'Output.txt' , 'w' ) for item in new_list : print >> thefile , item

Install Parallel versions of Python from source

Few programs require certain versions of python. Especially, when you do not have root permission in your unix machine, you can still install a python version in your /home/usr directory. You can run it in parallel to in-built version. Follow these steps: 1. Unpack with  tar -xvf python-x.x.tar 2. ./configure 3. make altinstall prefix=~ exec-prefix=~ (./lib and ./bin will be created in your home directory(~)) 4. create alias for python-x.x => i.e cd ~/bin/ > ln -s python-x.x python ! If you create this outside bin directory, you will get a warning (ln: symbolic link already exists !! remember it refers to inbuilt python) 5. Complete alias creation by editing: vi ~/.bashrc with alias python='~/bin/python' Now any software you try to install with python setup.py it will access the version you installed in your home ! You can still install softwares using your version by skipping steps 4 & 5. But, everytime you need to keep specifying the export PATH &

Fasta_Header_Rename

A simple matlab code to rename the headers in fasta file. Self-explanatory variable names. 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 %Author = Arun Prasanna %Rename the headers in fasta file to desired choice %For example: The input fasta file used here had header in >num_name format %strtok is used to strip and extract the required format clear; clc; tic; Path = 'Drive\Path\ToReadFile' ; % FileList = dir(Path); [rFL, cFL] = size(FileList); for i = 3:rFL %i of 1 & 2 are . & .. respectively     Fas_Fname{i-2,1} = FileList(i).name; %FileList is a structure end [rFas,cFas] = size(Fas_Fname); for i = 1:rFas     clear Header Seq ProtID Sp new_Header     OpenFile = cell2mat(strcat(Path,Fas_Fname(i)));     [Header, Seq] = fastaread(OpenFile);[rH,cH] = size(Header);     for j = 1:cH         [ProtID, Sp] = strtok(Header(1,j), '_' ); %First Sp = _name         [Sp, rm] = strtok(Sp, '_' ); %Se

Fasta_Dupicate_Header

A simple, self-explanatory matlab code to identify duplicate headers in fasta files. 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 %Simple matlab code to check for the duplicate header in fasta files %Store the size of header -> unique(header) ->size of new header %Copy the Table data in excel and compare the two values clear; clc; tic; Path = 'Drive\Path\FileName' ; % FileList = dir(Path); [rFL, cFL] = size(FileList); for i = 3:rFL %i of 1 & 2 are . & .. respectively     Fas_Fname{i-2,1} = FileList(i).name; %FileList is a structure end [rFas,cFas] = size(Fas_Fname); for i = 1:rFas     clear Header Seq Old_Header Unik_header     OpenFile = cell2mat(strcat(Path,Fas_Fname(i)));     [Header, Seq] = fastaread(OpenFile);[rH,cH] = size(Header);     Old_Header = length(Header);     Unik_header = length(unique(Header));     Table{i,1} = Fas_Fname(i);     Table{i,2} = num2str(Old_Header);     Table{i,3} = num2str(Unik_heade

Fasta Header Replacer V2.0

Extension of previous code 'Fasta Header Replacer.m' to process files in batch mode. Keep all/only the .fasta files inside the specified directory. %Author: Arun Prasanna %Version 2.0 of Fasta_Header_replacer.m!. %Efficient to process files in batch mode. clear ; clc ; FileList = dir ( 'D:\BRC_POSTDOC-RESEARCH\ARMILLARIA_Project\PROTEIN_FASTA' ); [ rFL , cFL ] = size ( FileList ); for i = 3 : rFL %i of 1 & 2 are . & .. respectively     Org_name { i - 2 , 1 } = FileList ( i ). name ; %FileList is a structure end [ rOn , cOn ] = size ( Org_name ); for OL = 1 : rOn     FileName = char ( Org_name { OL });     [ Header , Seq ] = fastaread ( FileName );     Header = Header ' ;     Seq = Seq ' ;     [ rH , cH ] = size ( Header );     check ( OL , 1 ) = rH ;     for IL = 1 : rH         id = num2str ( IL );         [ tok , rem ] = strtok ( FileName , '.' ); %Extract org name from FileName its

Fasta Header Replacer

Handling sequence files (like .fasta) is one of the trickiest problems for novice in Bioinformatics. Bio-Perl, Bio-python are quite useful but looks really scary :-( !. MATLAB offers a cool solution with its in-built Bioinformatics toolbox !!. Reading a fasta file with 'fastaread' is as easy as 'xlsread' ...followingly the same with 'fastawrite'/'xlswrite' :-) fastaread simply extract the sequence headers & sequences in cell arrays !. Voila !!! Once it does...then one can do all kinds of manipulation they want. Here is a simple-self-explanatory, one-file-at-a-time code to replace the header with an user-defined headers. Besides, creates a translation table. If you want to process multiple file then one can readily loop it over directory operations. INPUT (sequence.fasta) >gi|154163|gb|M83220.1|STYLEXA Salmonella typhimurium lexA (repressor of DNA damage inducible genes) gene, 5' end ATGCGCCAGCTGCAAAATTTAAAT >gi|154164|gb|M8322

Presence Absence Matrix

Given a cluster file, one can create a Presence-Absence matrix (PA map). With this self-explanatory simple matlab file it is easy to create one. Input Format: Cluster file.xlsx: (1) A  B  C  D (2) A  A  A (3) B C (4) D D A List File.xlsx: A B C D Output : (of course, the output will have the file with only numbers printed)      A B C D (1) 1  1  1  1 (2) 1  0  0  0 (3) 0  1  1  0 (4) 1  0  0  1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 %Author = Arun Prasanna %Create a presence absence matrix (PA map) from cluster information clear ; clc ; [ mat1 , mat ] = xlsread ( 'ClusterFile.xlsx' , 'Sheet1' ); clear mat1 [ mat2 , head ] = xlsread ( 'List.xlsx' , 'Header' ); clear mat2 new_head = head (:, col_val ); %col_val = 2 => column that has unique sp/gene list [ rmat , cmat ] = size ( mat ); [ rhead , chead ] = size ( new_head ); out = zeros ( rhead , rmat ); for

Gene Copy Number Matrix

Given a cluster file, one can create a gene copy number matrix (GCN). With this self-explanatory simple matlab file it is easy to create one. Input Format: Cluster file.xlsx: (1) A  B  C  D (2) A  A  A (3) B C (4) D D A List File.xlsx: A B C D Output: (of course, the output will have the file with only numbers printed)      A B C D (1) 1  1  1  1 (2) 3  0  0  0 (3) 0  1  1  0 (4) 1  0  0  2 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 %Author = Arun Prasanna %Create a gene copy number matrix from cluster information clear ; clc ; tic [ mat1 , mat ] = xlsread ( 'ClusterFile.xlsx' , 'Sheet1' ); clear mat1 [ mat2 , head ] = xlsread ( 'Organism_list.xlsx' , 'Sheet1' ); clear mat2 %species/gene name new_head = head (:, 1 ) ' ; %transpose to make it as header [ rmat , cmat ] = size ( mat ); [ rhead , chead ] = size ( new_head ); out = zeros ( rmat , chead );

Copy | Rm first column in text file

Simple, intutive & self-explanatory code to copy or remove columns in tab delimited text files. Remember AWK or Sed one liners are very handy too. But sometimes, if there are space or inconsistencies in file they may fail. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 __author__ = 'Arun Prasanna' ''' Remove first column of the txt file ''' with open ( 'Inputfile.txt' , 'r' ) as infile: entries = infile . read() each_line = entries . splitlines() new_list = [] for row in each_line: element = row . split( " \t " ) ele_size = len (element) for i in range ( 1 , ele_size): tmp = element[i] new_list . append(tmp) new_list . append( ' \t ' ) new_list . append( ' \n ' ) f = open ( 'Output.txt' , 'w' ) out = f . writelines(new_list) f . close() print

Hash_lookup for Cluster data

Hashes are exciting !! Hash tables are one of the most powerful lookup operations !. It gives the power of random access & hence lightning fast :-).  Imagine that, you have non-homogenous data (again: inconsistent number of column values, for instance cluster data). You have a list and you have to map the entries or fish the desired value from the huge file ! For example, your query file has 65000 entries and cluster file has 100000 entries in which the first column is the correspondence. If you just write the basic 'for loop' it is going to iterate atleast 65000 * 100000 (in case of thorough search) or slightly lesser if you break after a match is found. In any case, it can time humongous amount of time. Solution ? Hashes !! For same number of entries, my job took 3-7 seconds !  Dictionaries in python can also do the same thing !  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 #Code to map cluster &

Count elements in each row

Python code to count the number of elements (genes | proteins | genus ...) in each row in a non-homogenous cluster files. Example Input: g1  g2   g3  g4 g2 g4  g6  g7 Example Output: 4 1 3 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 __author__ = 'Arun Prasanna' ''' Python code to count number of elements in non-homogenous text file. Small, simple & self explanatory code ! ''' with open ( 'Input.txt' , 'r' ) as infile : entries = infile . read () each_line = entries . splitlines () new_list = [] for row in each_line : element = row . split ( "\t" ) ele_size = len ( element ) new_list . append ( str ( ele_size )) new_list . append ( '\n' ) f = open ( 'Count_EachRowElements.txt' , 'w' ) out = f . writelines ( new_list ) f . close () print "Program complete"

Strip Gene IDs

Code is useful to refine clustering data with IDs tagged with genus name. The output can be used to count protein copy numbers and etc., to create phyletic matrices or copy number matrices. Example Input: 123_g1   NID_g2   4567_g3   xx_g4    012_g6 NID_g10  ACC_g4 Example Output: g1    g2 g3    g4    g6 g10  g4 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 __author__ = 'Arun Prasanna' ''' Program to read the input file with 'genus_number' format names and convert that to => Each lines can have different number of elements (non-homogenous data) 1. 'genus_ID' to 'genus' format => [0] in split 2. 'ID_genus' to 'genus' format => [1] in split ''' with open ( 'Input_file.txt' , 'r' ) as infile : entries = infile . read () each_line = entries . splitlines () new_list = [] for row in each_line : element = row . split (

Presence-absence Matrix to Fasta format

Convert the binary matrix to fasta format with this simple code in Python !. Recommended for larger file sizes ! Sample Input: Sp1 1 1 1 1 Sp2 0 0 0 0 Sp3 1 1 0 0 Sample Output: >Sp1 ['1', '1', '1', '1'] >Sp2 ['0', '0', '0', '0'] >Sp3 ['1', '1', '0', '0'] Workaround the output file in any text file to remove [,'  ] to generate final output as: >Sp1 1111 >Sp2 0000 >Sp3 1100 __author__ = 'Arun Prasanna' ''' This is a simple python code to convert binary matrix into fasta format. There  are many public softwares available. But each one has size limits (<=2MB).  This code processes 61 x 51000 character matrix in less than 10 seconds in  python 2.7! ''' with open ( 'infile_matrix-cp' , 'r' ) as infile : entries = infile . read () . strip () each_line = entries .

BLAST_TABULAR_OUTPUT Generator

Few MPI versions do not support the very useful -m 8 tabular output format. The following code can be used to parse the raw blast output obtained from mpi_blast to tabular output file. The creditials of original creator and modifications are given in comments section of the code. #========================================================================================= # Modified BLAST Parser - Script for parsing the BLAST output in tabular format # Version 1.1.6 (November 26, 2015) # Modified by Arun Prasanna (arunprasanna83@gmail.com) from orginal script by Kirill Kryukov (http://kirill-kryukov.com/kirr/). He also has many other useful utilities..Check out !! # # # Purpose: Few versions of MPI-BLAST do not support tabular output or often the user faces # error while opting for tabular output. Hence the original code is modified to produce # tabular Output verbatim -m 8 option. The output file is compliant with other clustering # programs like Silix # # Changes made