Strip Gene IDs
This script refines clustering data in which every protein ID is tagged with a genus name: it strips the ID from each entry and keeps only the genus. The output can then be used to count protein copy numbers per genus, for example to build phyletic matrices or copy-number matrices.
Example Input:
123_g1 NID_g2
4567_g3 xx_g4 012_g6
NID_g10 ACC_g4
Example Output:
g1 g2
g3 g4 g6
g10 g4
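The transformation shown in the example can be sketched as a small function. This is only an illustration: the name `strip_ids` and its `genus_index` and `sep` parameters are hypothetical, and it assumes tab-separated fields with an underscore between ID and genus.

```python
def strip_ids(line, genus_index=1, sep="\t"):
    """Return the genus part of every entry on one line.

    genus_index=1 keeps the text after the first underscore ('ID_genus');
    genus_index=0 keeps the text before it ('genus_ID').
    The function name and parameters are illustrative assumptions.
    """
    fields = [f for f in line.split(sep) if f]  # ignore empty fields
    return [field.split("_", 1)[genus_index] for field in fields]


# The example input from above, written with explicit tab separators
example = ["123_g1\tNID_g2", "4567_g3\txx_g4\t012_g6", "NID_g10\tACC_g4"]
stripped = [strip_ids(row) for row in example]
for genera in stripped:
    print("\t".join(genera))
```

An entry without any underscore raises an `IndexError` when `genus_index=1`, so malformed input fails loudly instead of being silently dropped.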
__author__ = 'Arun Prasanna'
'''
Program to read an input file whose names are tagged with a genus
and convert each name to the bare 'genus' format.
=> Each line can have a different number of elements (non-homogeneous data)
1. 'genus_ID' to 'genus' format => [0] in split
2. 'ID_genus' to 'genus' format => [1] in split
'''
with open('Input_file.txt', 'r') as infile:
    each_line = infile.read().splitlines()

new_list = []
for row in each_line:
    element = row.split('\t')  # fields are tab-separated
    # CHANGE: if a cluster name is in the first column, use element[1:]
    # CHANGE: split each element at the first '_' and keep [0] or [1],
    #         whichever holds the genus
    genera = [ele.split('_', 1)[1] for ele in element]
    # Join with tabs so that lines do not end in a trailing tab
    new_list.append('\t'.join(genera) + '\n')

# Write the output to a file
with open('Output.txt', 'w') as outfile:
    outfile.writelines(new_list)

print("Program complete")
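As a rough sketch of the downstream use mentioned at the top, the stripped output could be turned into a copy-number matrix by counting how often each genus occurs on each line. The function name `copy_number_matrix` is hypothetical, and the sketch assumes one cluster per line with tab-separated genus names, as in the example output.

```python
from collections import Counter


def copy_number_matrix(lines, sep="\t"):
    """Count occurrences of each genus per cluster (one cluster per line).

    Returns the sorted genus names (columns) and one row of counts
    per input line. Name and signature are illustrative assumptions.
    """
    counts = [
        Counter(f for f in line.rstrip("\n").split(sep) if f)
        for line in lines
    ]
    genera = sorted({g for c in counts for g in c})
    matrix = [[c[g] for g in genera] for c in counts]
    return genera, matrix


# The example output from above, with explicit tab separators
clusters = ["g1\tg2", "g3\tg4\tg6", "g10\tg4"]
genera, matrix = copy_number_matrix(clusters)

# A phyletic (presence/absence) matrix follows by thresholding the counts
phyletic = [[1 if n > 0 else 0 for n in row] for row in matrix]

print("\t".join(genera))
for row in matrix:
    print("\t".join(str(n) for n in row))
```

Columns are sorted as plain strings, so `g10` comes before `g2`. A different column order would need an explicit list of genera.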