Posts

Showing posts from February, 2016

Copy | Rm first column in text file

Simple, intuitive and self-explanatory code to copy or remove columns in tab-delimited text files. AWK or sed one-liners are very handy too, but they can fail when a file contains stray spaces or other inconsistencies.

__author__ = 'Arun Prasanna'
'''
Remove the first column of a tab-delimited text file
'''
with open('Inputfile.txt', 'r') as infile:
    entries = infile.read()
each_line = entries.splitlines()
new_list = []
for row in each_line:
    element = row.split("\t")
    ele_size = len(element)
    for i in range(1, ele_size):
        new_list.append(element[i])
        if i < ele_size - 1:        # avoid a trailing tab at the end of each row
            new_list.append('\t')
    new_list.append('\n')
f = open('Output.txt', 'w')
f.writelines(new_list)
f.close()
print "Program complete"
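For the same task, a shorter line-by-line variant is possible; this is only a minimal sketch (it assumes the same Inputfile.txt / Output.txt names and tab-delimited input), and it avoids trailing tabs by joining the remaining fields per row. Changing the slice lets you copy a single column instead of dropping the first one.

# Sketch: drop the first tab-delimited column, one line at a time.
# 'Inputfile.txt' and 'Output.txt' are just the example names used above.
with open('Inputfile.txt', 'r') as infile, open('Output.txt', 'w') as outfile:
    for row in infile:
        cols = row.rstrip('\n').split('\t')
        outfile.write('\t'.join(cols[1:]) + '\n')   # cols[1:2] would instead copy only the second column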

Hash_lookup for Cluster data

Hashes are exciting! Hash tables are one of the most powerful lookup structures: they give (effectively) constant-time random access, and are therefore lightning fast :-). Imagine you have non-homogeneous data (again, an inconsistent number of column values per row, for instance cluster data). You have a query list and need to map its entries to, or fish the desired values out of, a huge file. For example, your query file has 65000 entries and the cluster file has 100000 entries, with the first column serving as the key that links them. A basic 'for' loop would iterate up to 65000 * 100000 times (for an exhaustive search), or slightly fewer if you break after a match is found; either way it can take a huge amount of time. The solution? Hashes! For the same number of entries, my job took 3-7 seconds. Dictionaries in Python do exactly the same thing.

#Code to map cluster & ...
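The full script is cut off in this excerpt. As a minimal sketch of the approach described above (the file names, the tab delimiter, and the first-column key are assumptions), the cluster file can be loaded into a dictionary once, after which every query is looked up in constant time:

# Sketch only: build a dict keyed on the first column of the cluster file,
# then look each query ID up instead of re-scanning the whole file.
cluster = {}
with open('cluster_file.txt', 'r') as cfile:          # assumed file name
    for line in cfile:
        cols = line.rstrip('\n').split('\t')
        if cols and cols[0]:
            cluster[cols[0]] = cols[1:]               # key -> rest of the row

with open('query_file.txt', 'r') as qfile, open('mapped.txt', 'w') as out:   # assumed file names
    for line in qfile:
        key = line.strip()
        if key in cluster:                            # O(1) average-time lookup
            out.write(key + '\t' + '\t'.join(cluster[key]) + '\n')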

Count elements in each row

Python code to count the number of elements (genes | proteins | genera ...) in each row of a non-homogeneous cluster file.

Example Input:
g1  g2  g3  g4
g2
g4  g6  g7

Example Output:
4
1
3

__author__ = 'Arun Prasanna'
'''
Python code to count the number of elements in each row of a non-homogeneous text file.
Small, simple & self-explanatory code!
'''
with open('Input.txt', 'r') as infile:
    entries = infile.read()
each_line = entries.splitlines()
new_list = []
for row in each_line:
    element = row.split("\t")
    ele_size = len(element)
    new_list.append(str(ele_size))
    new_list.append('\n')
f = open('Count_EachRowElements.txt', 'w')
f.writelines(new_list)
f.close()
print "Program complete"
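A streaming variant (a sketch only, under the same file-name assumptions) produces the same counts without reading the whole file into memory, which helps for very large cluster files:

# Sketch: count tab-separated elements per line, one line at a time.
with open('Input.txt', 'r') as infile, open('Count_EachRowElements.txt', 'w') as out:
    for row in infile:
        out.write(str(len(row.rstrip('\n').split('\t'))) + '\n')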

Strip Gene IDs

This code is useful for refining clustering data whose IDs are tagged with a genus name. The output can be used, for example, to count protein copy numbers and to build phyletic or copy-number matrices.

Example Input:
123_g1  NID_g2  4567_g3  xx_g4  012_g6
NID_g10  ACC_g4

Example Output:
g1  g2  g3  g4  g6
g10  g4

__author__ = 'Arun Prasanna'
'''
Program to read an input file with 'genus_number'-style names, where each line can have a
different number of elements (non-homogeneous data), and convert:
1. 'genus_ID' to 'genus' format => take [0] after the split
2. 'ID_genus' to 'genus' format => take [1] after the split
'''
with open('Input_file.txt', 'r') as infile:
    entries = infile.read()
each_line = entries.splitlines()
new_list = []
for row in each_line:
    element = row.split( ...
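The excerpt cuts off inside the row loop. A minimal sketch of the idea described above (file names are assumed; it splits each ID on '_' and keeps the part after it, i.e. case 2 in the docstring) could look like this:

# Sketch: strip the prefix before '_' from every ID and keep only the genus part.
with open('Input_file.txt', 'r') as infile, open('Stripped_IDs.txt', 'w') as out:   # output name is hypothetical
    for row in infile:
        ids = row.rstrip('\n').split('\t')
        genus_only = [i.split('_')[1] if '_' in i else i for i in ids]   # use [0] instead for 'genus_ID' input
        out.write('\t'.join(genus_only) + '\n')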

Presence-absence Matrix to Fasta format

Convert a binary matrix to FASTA format with this simple Python code. Recommended for larger files: many public tools do the same thing, but each one has a size limit (<= 2 MB).

Sample Input:
Sp1 1 1 1 1
Sp2 0 0 0 0
Sp3 1 1 0 0

Sample Output:
>Sp1
['1', '1', '1', '1']
>Sp2
['0', '0', '0', '0']
>Sp3
['1', '1', '0', '0']

Open the output in any text editor and remove the brackets, quotes and commas to generate the final output:
>Sp1
1111
>Sp2
0000
>Sp3
1100

__author__ = 'Arun Prasanna'
'''
This is a simple Python code to convert a binary matrix into FASTA format. There
are many public tools available, but each one has size limits (<= 2MB).
This code processes a 61 x 51000 character matrix in less than 10 seconds in
Python 2.7!
'''
with open('infile_matrix-cp', 'r') as infile:
    entries = infile.read().strip()
each_line = entries. ...
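The rest of the script is cut off in this excerpt. As a minimal sketch of the conversion (assuming whitespace-separated columns with the species name first; the output file name is made up), joining the digits directly also avoids the manual clean-up step described above:

# Sketch: write '>name' followed by the concatenated presence/absence string.
with open('infile_matrix-cp', 'r') as infile, open('matrix.fasta', 'w') as out:   # output name is hypothetical
    for line in infile:
        cols = line.split()
        if not cols:
            continue
        out.write('>' + cols[0] + '\n' + ''.join(cols[1:]) + '\n')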

BLAST_TABULAR_OUTPUT Generator

A few MPI-BLAST versions do not support the very useful -m 8 tabular output format. The following code can be used to parse the raw BLAST output obtained from mpiBLAST into a tabular output file. Credits to the original creator and the modifications made are given in the comments section of the code.

#=========================================================================================
# Modified BLAST Parser - Script for parsing BLAST output into tabular format
# Version 1.1.6 (November 26, 2015)
# Modified by Arun Prasanna (arunprasanna83@gmail.com) from the original script by
# Kirill Kryukov (http://kirill-kryukov.com/kirr/). He also has many other useful
# utilities - check them out!
#
# Purpose: A few versions of MPI-BLAST do not support tabular output, or the user often
# faces errors when opting for tabular output. Hence the original code is modified to
# produce tabular output exactly as the -m 8 option would. The output file is compliant
# with other clustering programs such as Silix.
#
# Changes made ...
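For reference, legacy BLAST's -m 8 output has 12 tab-separated columns. The short sketch below only illustrates that column order; the hit values and names are purely made up and are not part of the script above:

# Sketch: the classic -m 8 column order, written as one tab-separated line.
M8_COLUMNS = ['query_id', 'subject_id', 'pct_identity', 'aln_length', 'mismatches',
              'gap_openings', 'q_start', 'q_end', 's_start', 's_end', 'evalue', 'bit_score']

hit = {'query_id': 'q1', 'subject_id': 's1', 'pct_identity': 98.5, 'aln_length': 200,
       'mismatches': 3, 'gap_openings': 0, 'q_start': 1, 'q_end': 200,
       's_start': 1, 's_end': 200, 'evalue': 1e-50, 'bit_score': 380.0}   # made-up values

print '\t'.join(str(hit[c]) for c in M8_COLUMNS)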

Perl_Line_Picker

Line_Picker.pl => Average execution time: a few seconds for a file with >20000 lines.
Input file: text file with non-homogeneous entries separated by tab characters.
Desired output: text file containing only the lines with the desired number of entries.

open(out, ">Output_file.txt");
open(infile, "Input_File.txt");
while(<infile>)
{
    @line = split /\s+/;
    $num = scalar @line;
    if($num > 4 && $num < 101){ print out "@line\n"; }
}
close(infile);
close(out);
print "Program complete\n";

Example Input:
1. A  B  C  D
2. E  F  G
3. H

Example Output (with the condition adjusted to keep only lines with more than 1 entry):
A  B  C  D
E  F  G
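An equivalent filter in Python (a sketch only; the file names and the entry-count bounds simply mirror the Perl script above) would be:

# Sketch: keep only lines whose whitespace-separated entry count falls in a chosen range.
with open('Input_File.txt', 'r') as infile, open('Output_file.txt', 'w') as out:
    for line in infile:
        entries = line.split()
        if 4 < len(entries) < 101:      # adjust the bounds as needed (e.g. > 1 for the example above)
            out.write(' '.join(entries) + '\n')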