Map multiple annotations using pandas

December 29, 2016

A simple pandas solution to map multiple annotations for a protein.

A protein or gene file often carries annotations curated by different methods, so biologists frequently encounter more than one annotation for a single protein. Picking the right annotation is a task in itself.

One simple approach is to consolidate all the annotations for each protein and pick the right ones later, once enough evidence is known.

Coding this in MATLAB or other languages may take more lines to achieve the same output. Here, a simple pandas 'groupby' can produce the same output in seconds!

# Map multiple annotations to each protein ID and join them with a delimiter
'''
 Sample input
 A_xx  Annotation1
 A_xx  Annotation2
 B_xx  Annotation1 

 Sample output
 A_xx Annotation1, Annotation2
 B_xx Annotation1

'''
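import pandas as pd

# Read the tab-delimited input; the file is assumed to have a header row
# with the column names PROTID and ANNOTATION
data = pd.read_csv('input_file.txt', delimiter='\t')
# Collect all annotations for each protein ID and join them with ", "
dfc = data.groupby(['PROTID'])['ANNOTATION'].apply(", ".join)
print(dfc)
# Write the consolidated result as tab-separated text
dfc.to_csv('Output.txt', sep='\t', encoding='utf-8')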
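
Real annotation files often contain repeated or empty annotations for the same protein, and a plain ", ".join will repeat the duplicates and fail on missing values. The following is a minimal sketch of one way to handle that, not part of the original script. It assumes the same PROTID and ANNOTATION column names, uses a small in-memory sample instead of 'input_file.txt', and the helper name join_unique is hypothetical.

```python
import io

import pandas as pd

# Hypothetical in-memory sample standing in for 'input_file.txt'
# (tab-separated, with a PROTID/ANNOTATION header row assumed).
sample = (
    "PROTID\tANNOTATION\n"
    "A_xx\tAnnotation1\n"
    "A_xx\tAnnotation2\n"
    "B_xx\tAnnotation1\n"
    "A_xx\tAnnotation1\n"  # repeated annotation for A_xx
)

data = pd.read_csv(io.StringIO(sample), sep="\t")


def join_unique(values):
    """Join annotations with ', ', skipping missing values and repeats."""
    cleaned = values.dropna().astype(str).drop_duplicates()
    return ", ".join(cleaned)


# One row per protein ID, with its distinct annotations in first-seen order
merged = data.groupby("PROTID")["ANNOTATION"].agg(join_unique)
print(merged)
```

Dropping duplicates keeps the consolidated list short when several sources agree on an annotation, while the first-seen order preserves the order of the input file.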
