Showing posts from December, 2016

Map multiple annotations using pandas

A simple pandas solution to map multiple annotations for a protein. A protein or gene file will have annotations curated by different methods, so biologists frequently encounter more than one annotation for a single protein. Picking the right annotation is a task in itself. One simple approach is to consolidate the annotations and pick the right ones once enough evidence is available. Coding this in MATLAB or other languages may take more lines to achieve the same output. Here, a simple pandas 'groupby' produces the same output in seconds!

# Map multiple annotations for a protein ID and join with a delimiter
'''
Sample input
A_xx Annotation1
A_xx Annotation2
B_xx Annotation1

Sample output
A_xx Annotation1, Annotation2
B_xx Annotation1
'''
import pandas as pd
data = pd.read_csv('input_file.txt', delimiter='\t')
dfc = data.groupby(['PROTID'])['ANNOTATION&
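The groupby-and-join idea described in this post can be sketched as follows. This is a minimal, self-contained illustration rather than the post's exact code: the inline sample data stands in for 'input_file.txt', the column name ANNOTATION and the ", " delimiter are assumptions taken from the sample output, and the variable name merged is hypothetical.

```python
import io

import pandas as pd

# Hypothetical tab-separated input mirroring the sample in the post.
# The header names PROTID and ANNOTATION are assumptions.
sample = io.StringIO(
    "PROTID\tANNOTATION\n"
    "A_xx\tAnnotation1\n"
    "A_xx\tAnnotation2\n"
    "B_xx\tAnnotation1\n"
)
data = pd.read_csv(sample, sep="\t")

# Collect every annotation for each protein ID and join them with ", ".
merged = (
    data.groupby("PROTID")["ANNOTATION"]
    .agg(", ".join)
    .reset_index()
)

# Print the consolidated table as tab-separated text.
print(merged.to_csv(sep="\t", index=False))
```

Each protein ID ends up on a single row with all of its annotations in one delimited string, which makes it easy to review the candidates side by side before choosing one.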