Working algos for Biological Data: Simple to Complex problems

Posts

Fasta Header Replacer V2.0

April 05, 2016

Extension of the previous code 'Fasta Header Replacer.m' to process files in batch mode. Keep only the .fasta files inside the specified directory.

    %Author: Arun Prasanna
    %Version 2.0 of Fasta_Header_replacer.m
    %Efficient to process files in batch mode.
    clear; clc;
    FileList = dir('D:\BRC_POSTDOC-RESEARCH\ARMILLARIA_Project\PROTEIN_FASTA');
    [rFL, cFL] = size(FileList);
    for i = 3:rFL  %i of 1 & 2 are . & .. respectively
        Org_name{i-2, 1} = FileList(i).name;  %FileList is a structure
    end
    [rOn, cOn] = size(Org_name);
    for OL = 1:rOn
        FileName = char(Org_name{OL});
        [Header, Seq] = fastaread(FileName);
        Header = Header';
        Seq = Seq';
        [rH, cH] = size(Header);
        check(OL, 1) = rH;
        for IL = 1:rH
        ...
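For comparison, the batch idea can be sketched in Python. This is only an illustration, not the author's MATLAB code: the helper name `list_fasta_files` and the folder name `PROTEIN_FASTA` are hypothetical. Filtering on the file extension means the folder does not have to contain only .fasta files, and there is no need to skip the '.' and '..' entries by index.

```python
from pathlib import Path


def list_fasta_files(directory):
    """Return the .fasta files in `directory`, sorted by name.

    Hypothetical helper for illustration; only the extension is checked.
    """
    folder = Path(directory)
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".fasta"
    )


if __name__ == "__main__":
    # "PROTEIN_FASTA" is a placeholder folder name, not a real path.
    if Path("PROTEIN_FASTA").is_dir():
        for fasta_path in list_fasta_files("PROTEIN_FASTA"):
            print(fasta_path.name)
```

Each returned path can then be handed to whatever per-file routine is needed, which plays the role of the outer `for OL = 1:rOn` loop.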

Fasta Header Replacer

April 05, 2016

Handling sequence files (such as .fasta) is one of the trickiest problems for a novice in bioinformatics. BioPerl and Biopython are quite useful but can look really scary :-( MATLAB offers a cool solution with its Bioinformatics Toolbox! Reading a FASTA file with 'fastaread' is as easy as reading a spreadsheet with 'xlsread', and writing with 'fastawrite' is just as easy as 'xlswrite' :-) fastaread simply extracts the sequence headers and sequences into cell arrays. Voila! Once it does, one can do any kind of manipulation. Here is a simple, self-explanatory, one-file-at-a-time code to replace the headers with user-defined headers. It also creates a translation table. If you want to process multiple files, you can readily loop it over a directory listing.

INPUT (sequence.fasta)

    >gi|154163|gb|M83220.1|STYLEXA Salmonella typhimurium lexA (repressor of DNA damage inducible genes) gene, 5' end
    ATGCGCCAGCTGCAAAATTTAAAT
    >gi|154164|gb|M8322...
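The header replacement and translation table described above can also be sketched in plain Python. This is a minimal illustration, not the author's MATLAB code: the function name `replace_fasta_headers` and the '<prefix>_<n>' naming scheme are assumptions made for the example, and the header in the demo is shortened from the input shown in the post.

```python
def replace_fasta_headers(lines, prefix="seq"):
    """Replace every FASTA header with '<prefix>_<n>'.

    Returns (new_lines, table), where table is a list of
    (new_header, old_header) pairs, i.e. the translation table.
    The naming scheme is an assumption made for this sketch.
    """
    new_lines = []
    table = []
    count = 0
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            count += 1
            new_header = "{}_{}".format(prefix, count)
            table.append((new_header, line[1:]))
            new_lines.append(">" + new_header)
        elif line:
            # Sequence lines are copied unchanged; blank lines are dropped.
            new_lines.append(line)
    return new_lines, table


if __name__ == "__main__":
    example = [
        ">gi|154163|gb|M83220.1|STYLEXA Salmonella typhimurium lexA",
        "ATGCGCCAGCTGCAAAATTTAAAT",
    ]
    new_lines, table = replace_fasta_headers(example, prefix="STY")
    print("\n".join(new_lines))
    for new_header, old_header in table:
        print(new_header + "\t" + old_header)
```

Keeping the old header next to the new one makes it possible to map results back to the original records later.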

Presence Absence Matrix

March 02, 2016

Given a cluster file, one can create a presence-absence matrix (PA map). This simple, self-explanatory MATLAB file makes it easy to create one.

Input format:

Cluster file (ClusterFile.xlsx):

    (1) A B C D
    (2) A A A
    (3) B C
    (4) D D A

List file (List.xlsx):

    A B C D

Output (of course, the output file will contain only the numbers):

        A B C D
    (1) 1 1 1 1
    (2) 1 0 0 0
    (3) 0 1 1 0
    (4) 1 0 0 1

Code:

    %Author = Arun Prasanna
    %Create a presence absence matrix (PA map) from cluster information
    clear; clc;
    [mat1, mat] = xlsread('ClusterFile.xlsx', 'Sheet1');
    clear mat1
    [mat2, head] = xlsread('List.xlsx', 'Header');
    clear mat2
    new_head = head(:, col_val);  %col_val = 2 => column that has unique sp/gene list
    [rmat, cmat] ...
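The presence-absence logic itself is small enough to sketch in Python. This is an illustrative sketch under the assumption that the clusters have already been read into lists of names. The function name `presence_absence` is hypothetical, and the spreadsheet reading done by the MATLAB code is left out.

```python
def presence_absence(clusters, names):
    """Return one 0/1 row per cluster, with one column per entry in names."""
    matrix = []
    for cluster in clusters:
        members = set(cluster)  # duplicates inside a cluster count once
        matrix.append([1 if name in members else 0 for name in names])
    return matrix


if __name__ == "__main__":
    # Same toy data as the input format shown in the post.
    clusters = [
        ["A", "B", "C", "D"],
        ["A", "A", "A"],
        ["B", "C"],
        ["D", "D", "A"],
    ]
    names = ["A", "B", "C", "D"]
    for row in presence_absence(clusters, names):
        print(" ".join(str(value) for value in row))
```

Converting each cluster to a set first means repeated members such as 'A A A' still yield a single 1.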

Gene Copy Number Matrix

March 02, 2016

Given a cluster file, one can create a gene copy number (GCN) matrix. This simple, self-explanatory MATLAB file makes it easy to create one.

Input format:

Cluster file (ClusterFile.xlsx):

    (1) A B C D
    (2) A A A
    (3) B C
    (4) D D A

List file (Organism_list.xlsx):

    A B C D

Output (of course, the output file will contain only the numbers):

        A B C D
    (1) 1 1 1 1
    (2) 3 0 0 0
    (3) 0 1 1 0
    (4) 1 0 0 2

Code:

    %Author = Arun Prasanna
    %Create a gene copy number matrix from cluster information
    clear; clc; tic
    [mat1, mat] = xlsread('ClusterFile.xlsx', 'Sheet1');
    clear mat1
    [mat2, head] = xlsread('Organism_list.xlsx', 'Sheet1');
    clear mat2
    %species/gene name
    new_head = head(:, 1)';  %transpose to make it as header
    [rmat, cm...
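The counting step can be sketched in Python with `collections.Counter`. This is an illustration only, assuming the clusters are already available as lists of names. The function name `copy_number_matrix` is hypothetical, and the spreadsheet reading done by the MATLAB code is not reproduced.

```python
from collections import Counter


def copy_number_matrix(clusters, names):
    """Return one row per cluster holding the count of each name in names."""
    matrix = []
    for cluster in clusters:
        counts = Counter(cluster)  # missing names count as 0
        matrix.append([counts[name] for name in names])
    return matrix


if __name__ == "__main__":
    # Same toy data as the input format shown in the post.
    clusters = [
        ["A", "B", "C", "D"],
        ["A", "A", "A"],
        ["B", "C"],
        ["D", "D", "A"],
    ]
    names = ["A", "B", "C", "D"]
    for row in copy_number_matrix(clusters, names):
        print(" ".join(str(value) for value in row))
```

The only difference from a presence-absence map is that the count is kept instead of being collapsed to 0 or 1.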

Copy | Rm first column in text file

February 17, 2016

Simple, intuitive and self-explanatory code to copy or remove columns in tab-delimited text files. Remember that awk or sed one-liners are very handy too, but they may fail if the file contains stray spaces or other inconsistencies.

    __author__ = 'Arun Prasanna'
    '''
    Remove first column of the txt file
    '''
    with open('Inputfile.txt', 'r') as infile:
        entries = infile.read()
    each_line = entries.splitlines()
    new_list = []
    for row in each_line:
        element = row.split("\t")
        ele_size = len(element)
        for i in range(1, ele_size):
            tmp = element[i]
            new_list.append(tmp)
            new_list.append('\t')
        new_list.append('\n')
    f = open('Output.txt', 'w')
    out = f.writelines(new_list)
    f.close()
    print...
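A compact variant of the same idea, written as a sketch rather than as the author's script: the function names `drop_first_column` and `first_column` are hypothetical, and the input is assumed to be a list of tab-delimited lines. Joining the remaining fields with the separator avoids a trailing tab at the end of each row.

```python
def drop_first_column(lines, sep="\t"):
    """Return each line without its first field."""
    result = []
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        result.append(sep.join(fields[1:]))
    return result


def first_column(lines, sep="\t"):
    """Return only the first field of each line (the 'copy' case)."""
    return [line.rstrip("\n").split(sep)[0] for line in lines]


if __name__ == "__main__":
    rows = ["id1\tg1\tg2\n", "id2\tg3\n"]
    print(drop_first_column(rows))
    print(first_column(rows))
```

Rows with a different number of fields are handled the same way, since each line is split independently.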

Hash_lookup for Cluster data

February 16, 2016

Hashes are exciting!! Hash tables offer one of the most powerful lookup operations. They give near-constant-time access by key and are hence lightning fast :-) Imagine that you have non-homogeneous data (again: an inconsistent number of column values, as in cluster data). You have a list, and you have to map its entries to, or fish the desired values from, a huge file! For example, your query file has 65,000 entries and the cluster file has 100,000 entries, in which the first column is the correspondence. If you just write a basic nested 'for loop', it is going to perform up to 65,000 * 100,000 comparisons (in case of a thorough search), or somewhat fewer if you break after a match is found. In any case, it can take a humongous amount of time. Solution? Hashes!! For the same number of entries, my job took 3-7 seconds! Dictionaries in Python can do the same thing!

    #Code to map clu...
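Since the post notes that Python dictionaries can do the same job, here is a minimal sketch of the lookup with a dict. It is not the author's code: the function names `build_index` and `lookup` are hypothetical, rows are assumed to be already split into lists, and keeping the first row when a key repeats is an assumption of this example.

```python
def build_index(cluster_rows):
    """Map the first column of each row to the rest of that row."""
    index = {}
    for row in cluster_rows:
        if row:
            # Assumption: if a key repeats, the first occurrence is kept.
            index.setdefault(row[0], row[1:])
    return index


def lookup(queries, index):
    """Return {query: matched values}, with None for queries not found."""
    return {query: index.get(query) for query in queries}


if __name__ == "__main__":
    cluster_rows = [
        ["c1", "g1", "g2", "g3"],
        ["c2", "g4"],
        ["c3", "g5", "g6"],
    ]
    index = build_index(cluster_rows)
    print(lookup(["c3", "c1", "c9"], index))
```

The index is built in one pass over the cluster file, and each query then costs a single dictionary lookup on average, instead of a scan over all cluster rows.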

Count elements in each row

February 13, 2016

Python code to count the number of elements (genes | proteins | genera ...) in each row of a non-homogeneous cluster file.

Example input (tab-delimited):

    g1 g2 g3 g4
    g2
    g4 g6 g7

Example output:

    4
    1
    3

Code:

    __author__ = 'Arun Prasanna'
    '''
    Python code to count number of elements in non-homogeneous text file.
    Small, simple & self explanatory code !
    '''
    with open('Input.txt', 'r') as infile:
        entries = infile.read()
    each_line = entries.splitlines()
    new_list = []
    for row in each_line:
        element = row.split("\t")
        ele_size = len(element)
        new_list.append(str(ele_size))
        new_list.append('\n')
    f = open('Count_EachRowElements.txt', 'w')
    out = f.writelines(new_list)
    f.close()
    print("Program complete")