Hash lookup for cluster data
Hashes are exciting!! Hash tables are one of the most powerful lookup structures: they give you the power of random access, and hence they are lightning fast :-).
Imagine you have non-homogeneous data (that is, an inconsistent number of column values per row, as in cluster data). You have a query list and need to fish the desired values out of a huge file! For example, say your query file has 65,000 entries and your cluster file has 100,000 entries, with the first column as the correspondence between them. If you just write the basic nested 'for loop', it will iterate up to 65,000 * 100,000 times for a thorough search, or slightly fewer if you break after a match is found. Either way, it can take a humongous amount of time. The solution? Hashes!! Build the hash in one pass over the cluster file, and each query then becomes a single constant-time lookup. For the same number of entries, my job took 3-7 seconds!
Dictionaries in Python can do the same thing! (There's a short Python sketch after the Perl code below.)
#Code to map cluster & query data. Create a hash and look up.
#Partly written by Priya & re-worked by Arun Prasanna
use strict;
use warnings;

my %hash = ();

open(my $clust, '<', "Clusterfile.txt") or die "File not found!";
while (<$clust>) {
    chomp;
    my @line  = split /\s+/;
    my $end   = (scalar @line) - 1;
    my @array = @line[1..$end];
    my $tag   = join("\t", @array);
    $hash{$line[0]} = $tag;  #This makes the first entry the key & all the other elements a single string
}
close $clust;

open(my $out,   '>', "Outputfilename.txt") or die "Cannot write output!";
open(my $query, '<', "Querylist.txt")      or die "File not found!";
while (<$query>) {
    chomp;
    my @ln = split /\s+/;
    if (exists($hash{$ln[1]})) {  #Look up the query ID (second column here) in the hash
        print $out "$ln[1]\t$hash{$ln[1]}\n";
    }
}
close $query;
close $out;

print "Program complete\n";
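And here is the Python point from above made concrete: a minimal dictionary-based sketch of the same lookup. It assumes the same file names and column layout as the Perl script (whitespace-separated columns, first column of the cluster file as the ID, second column of the query file as the lookup key); adjust those to match your own files.

# Minimal Python sketch of the same hash lookup, assuming the same
# file layout as the Perl script above.

# Build the dictionary in one pass over the cluster file:
# key = first column, value = remaining columns joined by tabs.
cluster = {}
with open("Clusterfile.txt") as clust:
    for line in clust:
        fields = line.split()
        if fields:
            cluster[fields[0]] = "\t".join(fields[1:])

# Each query is then a single constant-time dictionary lookup,
# instead of another pass over all 100,000 cluster entries.
with open("Querylist.txt") as query, open("Outputfilename.txt", "w") as out:
    for line in query:
        fields = line.split()
        # The Perl script keys on the second column of the query file;
        # use fields[0] instead if your IDs are in the first column.
        if len(fields) > 1 and fields[1] in cluster:
            out.write(f"{fields[1]}\t{cluster[fields[1]]}\n")

print("Program complete")

Either way, the trick is the same: one pass to build the table, one pass to look things up, instead of a loop inside a loop.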