Strip Gene IDs

February 13, 2016

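This code refines clustering data in which each ID is tagged with a genus name, keeping only the genus part of each entry. The output can then be used to count protein copy numbers, for example to build phyletic matrices or copy number matrices. The script expects the entries on each line to be separated by tabs.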

Example Input:

123_g1 NID_g2
4567_g3 xx_g4 012_g6
NID_g10 ACC_g4

Example Output:

g1 g2
g3 g4 g6
g10 g4
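
As a quick check of the mapping above, here is a minimal sketch of the per-line transformation written as a function. The name `strip_ids` and the `genus_index` parameter are hypothetical and not part of the original script; the sketch assumes tab-separated fields and that every field contains at least one underscore.

```python
def strip_ids(line, genus_index=1):
    """Return the genus part of every tab-separated field on one line.

    genus_index=1 keeps the text after the first '_'  ('ID_genus' names);
    genus_index=0 keeps the text before the first '_' ('genus_ID' names).
    """
    fields = line.split("\t")
    # maxsplit=1 splits at the first underscore only
    return [field.split("_", 1)[genus_index] for field in fields]


# The example input from above, with tabs between the entries (assumed)
example_input = ["123_g1\tNID_g2", "4567_g3\txx_g4\t012_g6", "NID_g10\tACC_g4"]
for line in example_input:
    print("\t".join(strip_ids(line)))
```

Passing `genus_index=0` covers the case where the genus name comes first, which corresponds to switching `[1]` to `[0]` in the script.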

__author__ = 'Arun Prasanna'
''' Program to read an input file of tab-separated names in 'genus_ID' or 'ID_genus'
format and convert each name to the bare 'genus' name.
=> Lines can have different numbers of elements (non-homogeneous data)
1. 'genus_ID' to 'genus' format => [0] in split
2. 'ID_genus' to 'genus' format => [1] in split
'''

# Read the tab-separated input, one cluster per line
with open('Input_file.txt', 'r') as infile:
    each_line = infile.read().splitlines()
new_list = []
for row in each_line:
    if not row.strip():
        continue  # skip blank lines, which have nothing to split
    element = row.split("\t")
    ele_size = len(element)
    stripped = []
    for i in range(0, ele_size): #CHANGE: if clustname is there make it (1, ele_size)
        # Split at the first '_' only, so the rest of the name stays intact
        tmp = element[i].split('_', 1)[1] #CHANGE: keep [0] for 'genus_ID' names or [1] for 'ID_genus' names
        stripped.append(tmp)
    # Join with tabs so that no trailing tab is written at the end of a line
    new_list.append('\t'.join(stripped) + '\n')
# Write the output into file
with open('Output.txt', 'w') as f:
    f.writelines(new_list)
print("Program complete")
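
The stripped output can feed the copy number counting mentioned at the top. The following is only a sketch of one way to do that, not part of the original program: the function name `copy_number_matrix` is hypothetical, and it assumes each output line is one cluster with tab-separated genus names.

```python
from collections import Counter


def copy_number_matrix(lines):
    """Count how often each genus occurs on each tab-separated line."""
    # One Counter per non-blank line; empty fields (e.g. from a trailing tab) are ignored
    counts = [Counter(f for f in line.split("\t") if f)
              for line in lines if line.strip()]
    # Column order: all genus names seen, sorted alphabetically
    genera = sorted(set().union(*counts)) if counts else []
    matrix = [[c[g] for g in genera] for c in counts]
    return genera, matrix


# Lines as they would appear in Output.txt for the example above
stripped = ["g1\tg2", "g3\tg4\tg6", "g10\tg4"]
genera, matrix = copy_number_matrix(stripped)
print("\t".join(genera))
for row in matrix:
    print("\t".join(str(n) for n in row))
```

A phyletic (presence/absence) matrix follows from the same counts by replacing every non-zero count with 1.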


Working algos for Biological Data: Simple to Complex problems
