Strip Gene IDs

This script refines clustering data in which each ID is tagged with a genus name: it strips the ID part of every entry and keeps only the genus. The output can then be used to count protein copy numbers per genus, for example to build phyletic matrices or copy number matrices.

Example Input:

123_g1   NID_g2  
4567_g3   xx_g4    012_g6
NID_g10  ACC_g4

Example Output:

g1    g2
g3    g4    g6
g10  g4
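
The mapping from the example input to the example output can be sketched as a small function. This is only an illustration under assumptions: the name `strip_ids` and its `keep` parameter are hypothetical, and the fields are assumed to be tab-separated names in 'ID_genus' form.

```python
def strip_ids(line, keep=1):
    """Return the genus part of each tab-separated field on one line.

    keep=1 suits 'ID_genus' names, keep=0 suits 'genus_ID' names.
    (Hypothetical helper, for illustration only.)
    """
    # Drop empty fields left behind by trailing or doubled tabs
    fields = [f.strip() for f in line.rstrip("\n").split("\t") if f.strip()]
    return [f.split("_")[keep] for f in fields]


example = ["123_g1\tNID_g2", "4567_g3\txx_g4\t012_g6", "NID_g10\tACC_g4"]
for line in example:
    print("\t".join(strip_ids(line)))
```

Each line is handled on its own, so rows with different numbers of entries need no special treatment.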



__author__ = 'Arun Prasanna'
''' Program to read an input file of tab-separated names and reduce each name to its genus
=> Lines can have different numbers of elements (non-homogeneous data)
1. 'genus_ID' to 'genus' format => keep [0] of the split
2. 'ID_genus' to 'genus' format => keep [1] of the split
'''

KEEP = 1   # CHANGE: 0 for 'genus_ID' names, 1 for 'ID_genus' names
START = 0  # CHANGE: set to 1 if the first column is a cluster name

with open('Input_file.txt', 'r') as infile:
    each_line = infile.read().splitlines()

new_list = []
for row in each_line:
    # Skip empty fields left by trailing or doubled tabs
    element = [e.strip() for e in row.split('\t')[START:] if e.strip()]
    # Split each element at '_' and keep only the genus part
    genera = [e.split('_')[KEEP] for e in element]
    # Join with tabs so no trailing tab is written at the end of the line
    new_list.append('\t'.join(genera) + '\n')

# Write the output into file
with open('Output.txt', 'w') as f:
    f.writelines(new_list)
print("Program complete")
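
As one possible use of the output, the copies of each genus in a cluster can be counted with `collections.Counter`. This is a sketch under assumptions, not the copy number matrix script from this blog: the function name `copy_numbers` is hypothetical, the input lines follow the example output format (one cluster per line), and the sample data is made up for illustration.

```python
from collections import Counter


def copy_numbers(lines):
    """Count how often each genus occurs on each non-empty line (cluster)."""
    return [Counter(line.split()) for line in lines if line.strip()]


# Made-up sample in the format of the example output
clusters = ["g1\tg2", "g3\tg4\tg6", "g10\tg4\tg4"]
counts = copy_numbers(clusters)
genera = sorted({g for c in counts for g in c})

# One row per cluster, one column per genus; absent genera count as 0
print("\t".join(genera))
for c in counts:
    print("\t".join(str(c[g]) for g in genera))
```

Replacing each count with 1 or 0 (present or absent) would give a phyletic matrix instead of a copy number matrix.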
