Map multiple annotations using pandas

A simple pandas solution to map multiple annotations for a protein. 

A protein or gene file often carries annotations curated by different methods, so biologists frequently encounter more than one annotation for a single protein. Picking the right annotation is a task in itself.

One simple approach is to consolidate all the annotations for each protein and pick the right ones once enough evidence is available.

Coding this in MATLAB or other languages may take more lines to achieve the same output. Here, a simple pandas 'groupby' produces the same output in seconds!

# Map multiple annotations for a protein ID and join with a delimiter
 Sample input (tab-separated; the code expects a header row with the columns PROTID and ANNOTATION)
 A_xx  Annotation1
 A_xx  Annotation2
 B_xx  Annotation1 

 Sample output
 A_xx Annotation1, Annotation2
 B_xx Annotation1

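import pandas as pd
# Read the tab-separated input; it needs PROTID and ANNOTATION columns
data = pd.read_csv('input_file.txt', delimiter='\t')
# Collect the annotations of each protein ID into one comma-separated string
dfc = data.groupby(['PROTID'])['ANNOTATION'].apply(", ".join)
print(dfc)
# Write each protein ID with its joined annotations as tab-separated text
dfc.to_csv('Output.txt', sep='\t', encoding='utf-8')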
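
To try the idea without preparing a file first, here is a minimal sketch that runs the same groupby on the sample input held in memory. The `sample` string and the `merged` variable are made up for illustration, and the header row (PROTID, ANNOTATION) is an assumption taken from the column names used in the code above.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for input_file.txt:
# tab-separated, with an assumed header row PROTID / ANNOTATION.
sample = (
    "PROTID\tANNOTATION\n"
    "A_xx\tAnnotation1\n"
    "A_xx\tAnnotation2\n"
    "B_xx\tAnnotation1\n"
)

data = pd.read_csv(io.StringIO(sample), sep="\t")

# Group rows by protein ID and join each group's annotations with ", ".
merged = (
    data.groupby("PROTID")["ANNOTATION"]
    .agg(", ".join)
    .reset_index()  # turn the grouped Series back into a two-column table
)

print(merged)
```

Calling reset_index() keeps PROTID as an ordinary column, which can be handy if the result needs to be merged with other tables later. If some annotations are missing in a real file, the join would fail on the empty values, so dropping or filling them first is probably needed.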

