Showing posts from December, 2016

Map multiple annotations using pandas

A simple pandas solution for mapping multiple annotations to a protein. A protein or gene file often carries annotations curated by different methods, so biologists frequently encounter more than one annotation for a single protein. Picking the right annotation is a task in itself. One simple approach is to consolidate all annotations per protein and choose among them once enough evidence is available. Coding this in MATLAB or other languages may take more lines to achieve the same output. Here, a simple pandas 'groupby' produces the same output in seconds!

    # Map multiple annotations to each protein ID and join them with a delimiter
    '''
    Sample input
    A_xx    Annotation1
    A_xx    Annotation2
    B_xx    Annotation1

    Sample output
    A_xx    Annotation1, Annotation2
    B_xx    Annotation1
    '''
    import pandas as pd

    # The column names used below imply a tab-separated file with a header row
    # containing PROTID and ANNOTATION
    data = pd.read_csv('input_file.txt', delimiter='\t')
    # Collect all annotations of each protein into one comma-separated string
    dfc = data.groupby(['PROTID'])['ANNOTATION'].apply(", ".join)
    print(dfc)
    dfc.to_csv('Outp…
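The groupby-and-join idea can also be tried end to end without an input file. The sketch below is only an illustration under a few assumptions: the data has a header row with `PROTID` and `ANNOTATION` columns (the names used in the snippet above), the sample rows are held in an in-memory string, and the helper name `consolidate_annotations` is hypothetical. Rows with a missing annotation are dropped first, because joining strings fails on missing values.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for the tab-separated input file.
# Assumption: a header row names the PROTID and ANNOTATION columns.
SAMPLE = (
    "PROTID\tANNOTATION\n"
    "A_xx\tAnnotation1\n"
    "A_xx\tAnnotation2\n"
    "B_xx\tAnnotation1\n"
)


def consolidate_annotations(source, sep="\t", joiner=", "):
    """Return one row per protein ID with all its annotations joined."""
    data = pd.read_csv(source, sep=sep, dtype=str)
    # Drop rows without an annotation; str.join cannot handle NaN values.
    data = data.dropna(subset=["ANNOTATION"])
    # sort=False keeps protein IDs in the order they first appear.
    return data.groupby("PROTID", sort=False)["ANNOTATION"].apply(joiner.join)


merged = consolidate_annotations(io.StringIO(SAMPLE))
print(merged.to_string())
```

Passing a file path instead of the `io.StringIO` object reads from disk in the same way, and a different `joiner` (for example `"; "`) can be used when the annotations themselves contain commas.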