Posts

Condense fasta header

    '''
    Biopython hack to condense FASTA headers.
    When a FASTA file has a lengthy header such as
    >geneid1213 len = 234 covStat = val otherparam = sval,
    shorten it to >geneid1213.
    '''
    from Bio import SeqIO

    new_header = []
    # The "rU" mode was removed in Python 3.11; plain "r" already uses universal newlines
    with open("test.fasta", "r") as infile:
        for record in SeqIO.parse(infile, "fasta"):
            # Keep only the first whitespace-delimited token of the header
            record.description = record.name
            record.id = record.name
            new_header.append(record)

    SeqIO.write(new_header, "short_header.fasta", "fasta")
    print("program complete")
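To make the effect of the shortening concrete, here is a minimal sketch of the same idea using only the standard library. The helpers `condense_header` and `condense_fasta_lines` are hypothetical names used for illustration; they are not part of Biopython, and they assume the identifier is the first whitespace-delimited token of the header.

```python
def condense_header(header_line):
    """Reduce a FASTA header to '>' plus its first whitespace-delimited token.

    Hypothetical helper for illustration; not part of Biopython.
    """
    if not header_line.startswith(">"):
        raise ValueError("not a FASTA header line")
    tokens = header_line[1:].split()
    return ">" + (tokens[0] if tokens else "")


def condense_fasta_lines(lines):
    """Yield lines with headers shortened and sequence lines left untouched."""
    for line in lines:
        line = line.rstrip("\n")
        yield condense_header(line) if line.startswith(">") else line


example = [
    ">geneid1213 len = 234 covStat = val otherparam = sval",
    "MKCCLLA",
]
condensed = list(condense_fasta_lines(example))
print(condensed)
```

The Biopython version is still preferable for real files, since it also validates the records and re-wraps the sequence lines on output.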

Calculate Cys-Richness for a protein

    '''
    Code description:
    Calculate Cys-richness of a protein with criteria set as:
    >=4 'C's over the length of the protein AND >=5% total cysteine content
    Function: Take input sequence => Count number of 'C's & length => Calculate percentage
    Output 'Cys-rich' if the criteria are met, otherwise 'No'
    '''
    from Bio import SeqIO

    def Cys_rich(record_seq):
        C_count = record_seq.count('C')
        seq_len = len(record_seq)
        Cys_perc = float(C_count) / float(seq_len) * 100
        if C_count >= 4 and Cys_perc >= 5.0:
            return 'Cys-rich'
        else:
            return 'No'

    CysRichSeq = []
    for record in SeqIO.parse('filename.fasta', 'fasta'):
        if Cys_rich(record.seq) == 'Cys-rich':
            CysRichSeq.append(record)

    SeqIO.write(CysRichSeq, 'Cys-rich_sequences.fasta', 'fasta')
    print('Cys-rich sequences written to file..')
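The two criteria are easy to misread as a single threshold, so here is a small worked sketch on plain strings. The function name `is_cys_rich`, its keyword parameters, and the empty-sequence guard are assumptions added for illustration; it returns a boolean rather than the 'Cys-rich'/'No' strings used above.

```python
def is_cys_rich(sequence, min_count=4, min_percent=5.0):
    """Return True when both the count and the percentage criteria hold.

    Hypothetical helper: thresholds default to the values quoted in the post.
    """
    if not sequence:
        # Assumption: an empty sequence is simply not Cys-rich
        return False
    c_count = sequence.count("C")
    cys_percent = c_count / len(sequence) * 100
    return c_count >= min_count and cys_percent >= min_percent


# 4 cysteines in 20 residues: 20% and count >= 4, so both criteria hold
short_rich = is_cys_rich("CACACACA" + "A" * 12)

# 4 cysteines in 100 residues: count is fine, but 4% is below 5%
long_sparse = is_cys_rich("CCCC" + "A" * 96)

# 3 cysteines in 10 residues: 30% is fine, but the count is below 4
tiny_dense = is_cys_rich("CCC" + "A" * 7)

print(short_rich, long_sparse, tiny_dense)
```

Requiring both conditions keeps very short peptides with a couple of cysteines, and very long proteins with a few scattered ones, out of the result.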

Map multiple annotations using pandas

A simple pandas solution to map multiple annotations for a protein. A protein or gene file will have annotations curated by different methods, and biologists frequently encounter more than one annotation for a single protein. Picking the right annotation is a task in itself. One simple approach is to consolidate them and pick the right ones once enough evidence is known. Coding this in MATLAB or other languages may take more lines to achieve the same output. Here, a simple pandas 'groupby' can produce the same output in seconds!

    # Map multiple annotations for a protein ID and join with a delimiter
    '''
    Sample input
    A_xx Annotation1
    A_xx Annotation2
    B_xx Annotation1

    Sample output
    A_xx Annotation1, Annotation2
    B_xx Annotation1
    '''
    import pandas as pd
    data = pd.read_csv('input_file.txt', delimiter='\t')
    dfc = data.groupby(['PROTID'])['ANNOTATION&
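The excerpt above is cut off mid-line, so the following is only a sketch of one way such a consolidation can be written, not necessarily the author's code. It assumes a tab-separated input with columns named `PROTID` and `ANNOTATION` (the second name is inferred from the truncated line) and reads the sample data from an in-memory string instead of `input_file.txt`.

```python
import io

import pandas as pd

# Sample input from the post, with assumed column names as the header row
sample = (
    "PROTID\tANNOTATION\n"
    "A_xx\tAnnotation1\n"
    "A_xx\tAnnotation2\n"
    "B_xx\tAnnotation1\n"
)

data = pd.read_csv(io.StringIO(sample), delimiter="\t")

# Group rows by protein ID and join each group's annotations with ", "
merged = data.groupby("PROTID")["ANNOTATION"].agg(", ".join).reset_index()

print(merged.to_string(index=False))
```

Passing `", ".join` to `agg` works here because every annotation is a string; a column containing missing values would need to be cleaned or cast first.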

Pick Matching lines with list of keywords

    # Simple code to find lines in a huge file whose tab-separated fields match a list of search terms.
    Search_terms = ['A', 'B', 'C', 'D']

    with open('BigFile.txt', 'r') as infile:
        entries = infile.read()
    each_line = entries.splitlines()

    new_list = []
    for ix, row in enumerate(each_line):
        element = row.split("\t")
        # Use <= instead of == for a non-strict match (the line may hold extra fields)
        if set(Search_terms) == set(element):
            new_list.append(element)
            print(ix + 1)  # List the matching line number

    # Write the matching rows, one per line
    with open('Output.txt', 'w') as thefile:
        for item in new_list:
            print(item, file=thefile)
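The difference between the strict `==` test and the relaxed `<=` test mentioned in the comment can be seen on a few in-memory rows. This is a small illustrative sketch; the sample rows are made up for the example.

```python
search_terms = {"A", "B", "C", "D"}

# Made-up rows: exact match, reordered match, extra field, missing field
rows = [
    "A\tB\tC\tD",
    "D\tC\tB\tA",
    "A\tB\tC\tD\tE",
    "A\tB\tC",
]

results = []
for number, row in enumerate(rows, start=1):
    fields = set(row.split("\t"))
    exact = search_terms == fields      # same fields, order ignored
    contained = search_terms <= fields  # every search term present, extras allowed
    results.append((number, exact, contained))
    print(number, exact, contained)
```

Because sets ignore order and repetition, a row listing the same terms in a different order still counts as an exact match.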

Install Parallel versions of Python from source

A few programs require specific versions of Python. Even when you do not have root permission on your Unix machine, you can still install a Python version in your /home/usr directory and run it in parallel with the built-in version. Follow these steps:

1. Unpack with tar -xvf python-x.x.tar
2. ./configure
3. make altinstall prefix=~ exec-prefix=~ (./lib and ./bin will be created in your home directory (~))
4. Create an alias for python-x.x, i.e. cd ~/bin/ and then ln -s python-x.x python. If you create this outside the bin directory, you will get a warning (ln: symbolic link already exists). Remember that it refers to the built-in python.
5. Complete the alias creation by editing ~/.bashrc (vi ~/.bashrc) and adding alias python='~/bin/python'

Now any software you install with python setup.py will use the version you installed in your home directory. You can still install software using your version while skipping steps 4 & 5. But every time, you need to keep specifying the export PATH &

Fasta_Header_Rename

A simple MATLAB code to rename the headers in a FASTA file. Self-explanatory variable names.

    %Author = Arun Prasanna
    %Rename the headers in fasta file to desired choice
    %For example: The input fasta file used here had header in >num_name format
    %strtok is used to strip and extract the required format
    clear; clc; tic;
    Path = 'Drive\Path\ToReadFile'; %
    FileList = dir(Path);
    [rFL, cFL] = size(FileList);
    for i = 3:rFL %i of 1 & 2 are . & .. respectively
        Fas_Fname{i-2,1} = FileList(i).name; %FileList is a structure
    end
    [rFas,cFas] = size(Fas_Fname);
    for i = 1:rFas
        clear Header Seq ProtID Sp new_Header
        OpenFile = cell2mat(strcat(Path,Fas_Fname(i)));
        [Header, Seq] = fastaread(OpenFile);[rH,cH] = size(Header);
        for j = 1:cH
            [ProtID, Sp] = strtok(Header(1,j), '_'); %First Sp = _name
            [Sp, rm] = strtok(Sp, '_'); %Se

Fasta_Duplicate_Header

A simple, self-explanatory MATLAB code to identify duplicate headers in FASTA files.

    %Simple matlab code to check for the duplicate header in fasta files
    %Store the size of header -> unique(header) ->size of new header
    %Copy the Table data in excel and compare the two values
    clear; clc; tic;
    Path = 'Drive\Path\FileName'; %
    FileList = dir(Path);
    [rFL, cFL] = size(FileList);
    for i = 3:rFL %i of 1 & 2 are . & .. respectively
        Fas_Fname{i-2,1} = FileList(i).name; %FileList is a structure
    end
    [rFas,cFas] = size(Fas_Fname);
    for i = 1:rFas
        clear Header Seq Old_Header Unik_header
        OpenFile = cell2mat(strcat(Path,Fas_Fname(i)));
        [Header, Seq] = fastaread(OpenFile);[rH,cH] = size(Header);
        Old_Header = length(Header);
        Unik_header = length(unique(Header));
        Table{i,1} = Fas_Fname(i);
        Table{i,2} = num2str(Old_Header);
        Table{i,3} = num2str(Unik_heade
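The comparison described in the comments (number of headers versus number of unique headers) can also be sketched in Python with only the standard library. This is an illustrative translation of the idea, not the author's code; the function name `duplicate_headers` and the sample records are hypothetical.

```python
from collections import Counter


def duplicate_headers(lines):
    """Return (total headers, unique headers, sorted duplicated headers).

    Hypothetical helper: treats every line starting with '>' as a header.
    """
    headers = [line.rstrip("\n")[1:] for line in lines if line.startswith(">")]
    counts = Counter(headers)
    duplicated = sorted(name for name, n in counts.items() if n > 1)
    return len(headers), len(counts), duplicated


# Made-up FASTA lines in which one header appears twice
example = [
    ">1_human", "MKV",
    ">2_mouse", "MKL",
    ">1_human", "MKV",
]

total, unique, duplicated = duplicate_headers(example)
print(total, unique, duplicated)
```

When the first two numbers differ, the file contains repeated headers, and the third value names them directly instead of requiring a manual comparison in a spreadsheet.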