Posts

Condense fasta header

'''Biopython hack to condense a FASTA header.
When a FASTA file has a lengthy header like the following:
    >geneid1213 len = 234 covStat = val otherparam = sval
shorten it to:
    >geneid1213
'''
from Bio import SeqIO

new_header = []
# Plain "r" mode: the old "rU" flag was removed in Python 3.11
with open("test.fasta", "r") as infile:
    for record in SeqIO.parse(infile, "fasta"):
        # record.name holds the first whitespace-delimited word of the header
        record.description = record.name
        record.id = record.name
        new_header.append(record)

SeqIO.write(new_header, "short_header.fasta", "fasta")
print("program complete")
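To see what the snippet above does to each header without needing a FASTA file on disk, here is a minimal standard-library sketch. It assumes the short identifier is simply the first whitespace-delimited word after '>', which is how Biopython's FASTA parser fills record.id and record.name. The function name condense_header is hypothetical and not part of Biopython.

```python
def condense_header(header_line):
    """Return a FASTA header reduced to its first whitespace-delimited word."""
    if not header_line.startswith(">"):
        raise ValueError("not a FASTA header line: %r" % header_line)
    fields = header_line[1:].split()
    # A header that is only ">" has no identifier, so it is returned unchanged
    return ">" + fields[0] if fields else ">"


long_header = ">geneid1213 len = 234 covStat = val otherparam = sval"
print(condense_header(long_header))  # prints ">geneid1213"
```

Everything after the first space is discarded, so any information stored there (length, coverage and so on) is lost in the shortened file.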

Calculate Cys-Richness for a protein

'''Code description: Calculate the Cys-richness of a protein, with the criteria set as:
>= 4 'C's over the length of the protein AND >= 5% total cysteine content.
Function: take input sequence => count number of 'C's and length => calculate percentage.
Output: 'Cys-rich' if the criteria are met, otherwise 'No'.
'''
from Bio import SeqIO

def Cys_rich(record_seq):
    C_count = record_seq.count('C')
    seq_len = len(record_seq)
    Cys_perc = float(C_count) / float(seq_len) * 100
    if C_count >= 4 and Cys_perc >= 5.0:
        return 'Cys-rich'
    else:
        return 'No'

# Keep only the records that satisfy both criteria
CysRichSeq = []
for record in SeqIO.parse('filename.fasta', 'fasta'):
    if Cys_rich(record.seq) == 'Cys-rich':
        CysRichSeq.append(record)

SeqIO.write(CysRichSeq, 'Cys-rich_sequences.fasta', 'fasta')
print('Cys-rich sequences written to file..')
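The two criteria are easy to check by hand on short strings. The sketch below applies the same rule to plain Python strings, so it runs without Biopython. The function name is_cys_rich and its keyword parameters are hypothetical, and returning False for an empty sequence is an assumption that the original snippet does not make.

```python
def is_cys_rich(sequence, min_count=4, min_percent=5.0):
    """Apply the two Cys-richness criteria to a plain protein string."""
    if not sequence:
        return False  # assumption: an empty sequence is never Cys-rich
    c_count = sequence.count("C")
    cys_percent = 100.0 * c_count / len(sequence)
    return c_count >= min_count and cys_percent >= min_percent


print(is_cys_rich("CCCC" + "A" * 36))  # 4 C in 40 residues (10%): True
print(is_cys_rich("CCCC" + "A" * 96))  # 4 C in 100 residues (4%): False
print(is_cys_rich("CCC" + "A" * 7))    # 3 C in 10 residues (30%): False
```

The last case shows why both conditions are needed: a very short peptide can have a high cysteine percentage with only a few cysteines.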

Map multiple annotations using pandas

A simple pandas solution to map multiple annotations to a protein. A protein or gene file will have annotations curated by different methods, so biologists frequently encounter more than one annotation for a single protein. Picking the right annotation is a task in itself. One simple approach is to consolidate them all and pick the right ones once enough evidence is known. Coding this in MATLAB or other languages may take more lines to achieve the same output. Here, a simple pandas 'groupby' produces the same output in seconds!

# Map multiple annotations for a protein ID and join them with a delimiter
'''
Sample input
A_xx Annotation1
A_xx Annotation2
B_xx Annotation1

Sample output
A_xx Annotation1, Annotation2
B_xx Annotation1
'''
import pandas as pd

data = pd.read_csv('input_file.txt', delimiter='\t')
# Collect all annotations of each protein ID into one comma-separated string
dfc = data.groupby(['PROTID'])['ANNOTATION'].apply(", ".join)
print(dfc)
dfc.to_csv('Outp…
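To make the expected behaviour of the groupby concrete, here is a plain-Python sketch (deliberately not pandas, so it needs only the standard library) that reproduces the sample input and output from the docstring above. The function name join_annotations is hypothetical.

```python
from collections import defaultdict


def join_annotations(rows, delimiter=", "):
    """Group (protein_id, annotation) pairs and join each group's annotations."""
    grouped = defaultdict(list)
    for prot_id, annotation in rows:
        grouped[prot_id].append(annotation)
    return {prot_id: delimiter.join(annots) for prot_id, annots in grouped.items()}


sample = [
    ("A_xx", "Annotation1"),
    ("A_xx", "Annotation2"),
    ("B_xx", "Annotation1"),
]
for prot_id, joined in join_annotations(sample).items():
    print(prot_id, joined)
```

One difference to keep in mind: this sketch keeps protein IDs in first-seen order, whereas pandas groupby sorts the group keys by default.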

Pick Matching lines with list of keywords

# Simple code to find lines in a huge file whose fields match a list of search terms.
Search_terms = ['A', 'B', 'C', 'D']

with open('BigFile.txt', 'r') as infile:
    entries = infile.read()
each_line = entries.splitlines()

new_list = []
for ix, row in enumerate(each_line):
    element = row.split("\t")
    if set(Search_terms) == set(element):  # Use <= if you want the non-strict option
        new_list.append(element)
        print(ix + 1)  # List the matching line number!

# Write the matching rows to the output file
with open('Output.txt', 'w') as thefile:
    for item in new_list:
        print(item, file=thefile)
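The difference between the strict (==) and non-strict (<=) comparison mentioned in the comment is easiest to see on a few in-memory lines. This is a minimal sketch under the same tab-separated assumption as the snippet above. The function name matching_line_numbers and the strict flag are hypothetical.

```python
def matching_line_numbers(lines, search_terms, strict=True):
    """Return 1-based numbers of lines whose tab-separated fields match the terms."""
    wanted = set(search_terms)
    hits = []
    for number, row in enumerate(lines, start=1):
        fields = set(row.split("\t"))
        # strict: fields are exactly the terms; non-strict: terms are a subset
        matched = (wanted == fields) if strict else (wanted <= fields)
        if matched:
            hits.append(number)
    return hits


lines = ["A\tB\tC\tD", "D\tC\tB\tA", "A\tB\tC\tD\tE", "A\tB"]
terms = ["A", "B", "C", "D"]
print(matching_line_numbers(lines, terms))                # [1, 2]
print(matching_line_numbers(lines, terms, strict=False))  # [1, 2, 3]
```

Because sets ignore order and repetition, line 2 matches even though its fields are reversed. Line 3 matches only in the non-strict mode because of the extra field E.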

Install Parallel versions of Python from source

A few programs require specific versions of Python. Even when you do not have root permission on your Unix machine, you can still install a Python version in your /home/usr directory and run it in parallel with the built-in version. Follow these steps:

1. Unpack with tar -xvf python-x.x.tar

2. ./configure

3. make altinstall prefix=~ exec-prefix=~ (lib and bin directories will be created in your home directory, ~)

4. Create a symbolic link for python-x.x, i.e. cd ~/bin/ and then ln -s python-x.x python. If you create this link outside the bin directory, you will get a warning (ln: symbolic link already exists); remember that the existing link refers to the built-in python.

5. Complete the alias creation by editing ~/.bashrc (vi ~/.bashrc) and adding alias python='~/bin/python'

Now any software you install with python setup.py will use the version you installed in your home directory!
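A quick way to confirm which interpreter the python command now resolves to is to ask the interpreter itself. The sketch below uses only the standard library. The function name interpreter_report is hypothetical, and the in_home check simply assumes that a home-directory install has an executable path starting with your home directory.

```python
import os
import sys


def interpreter_report():
    """Describe the Python interpreter that is running this script."""
    executable = sys.executable or ""  # may be empty in unusual embedded setups
    home = os.path.expanduser("~") + os.sep
    return {
        "executable": executable,
        "version": "%d.%d.%d" % sys.version_info[:3],
        "in_home": executable.startswith(home),
    }


for key, value in interpreter_report().items():
    print(key, "=", value)
```

If in_home prints False after the steps above, the shell is most likely still picking up the built-in interpreter, for example because ~/.bashrc has not been re-read in the current session.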

You can still install software using your version while skipping steps 4 & 5. But every time, you need to keep specifying the export PATH & export PYTHONP…

Fasta_Header_Rename

A simple MATLAB code to rename the headers in a FASTA file. The variable names are self-explanatory.

%Author = Arun Prasanna
%Rename the headers in fasta file to desired choice
%For example: The input fasta file used here had header in >num_name format
%strtok is used to strip and extract the required format
clear; clc; tic;
Path = 'Drive\Path\ToReadFile';%
FileList = dir(Path);
[rFL, cFL] = size(FileList);
for i = 3:rFL %i of 1 & 2 are . & .. respectively
    Fas_Fname{i-2,1} = FileList(i).name; %FileList is a structure
end
[rFas,cFas] = size(Fas_Fname);
for i = 1:rFas
    clear Header Seq ProtID Sp new_Header
    OpenFile = cell2mat(strcat(Path,Fas_Fname(i)));
    [Header, Seq] = fastaread(OpenFile);
    [rH,cH] = size(Header);
    for j = 1:cH
        [ProtID, Sp] = strtok(Header(1,j),'_'); %First Sp = _name
        [Sp, rm] = strtok(Sp,'_'); %Second SP = name ! which we ne…
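The two strtok calls are the heart of the excerpt above: the first splits the header at the underscore and leaves "_name" as the remainder, and the second strips that leading underscore to recover the name. The Python sketch below loosely mimics that behaviour for a single-character delimiter. It is an illustration only, not the author's code: the function name strtok_like and the sample header "12_name" are hypothetical.

```python
def strtok_like(text, delimiter):
    """Loosely mimic MATLAB's strtok for a single-character delimiter."""
    stripped = text.lstrip(delimiter)  # leading delimiters are ignored
    cut = stripped.find(delimiter)
    if cut == -1:
        return stripped, ""            # no delimiter left: empty remainder
    return stripped[:cut], stripped[cut:]  # remainder keeps its delimiter


prot_id, rest = strtok_like("12_name", "_")  # hypothetical >num_name header
name, leftover = strtok_like(rest, "_")
print(prot_id)   # 12
print(rest)      # _name
print(name)      # name
```

This is why the MATLAB comments describe the first Sp as "_name" and the second as "name".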

Fasta_Duplicate_Header

A simple, self-explanatory MATLAB code to identify duplicate headers in FASTA files.

%Simple matlab code to check for the duplicate header in fasta files
%Store the size of header -> unique(header) ->size of new header
%Copy the Table data in excel and compare the two values
clear; clc; tic;
Path = 'Drive\Path\FileName';%
FileList = dir(Path);
[rFL, cFL] = size(FileList);
for i = 3:rFL %i of 1 & 2 are . & .. respectively
    Fas_Fname{i-2,1} = FileList(i).name; %FileList is a structure
end
[rFas,cFas] = size(Fas_Fname);
for i = 1:rFas
    clear Header Seq Old_Header Unik_header
    OpenFile = cell2mat(strcat(Path,Fas_Fname(i)));
    [Header, Seq] = fastaread(OpenFile);
    [rH,cH] = size(Header);
    Old_Header = length(Header);
    Unik_header = length(unique(Header));
    Table{i,1} = Fas_Fname(i);
    Table{i,2} = num2str(Old_Header);
    Table{i,3} = num2str(Unik_header);
    fprintf('…
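The check above boils down to comparing the number of headers with the number of unique headers. For comparison, the same idea can be sketched in Python with the standard library. This is not the author's MATLAB code: the function name header_counts is hypothetical, and it assumes every header line starts with '>'. It also lists which headers are repeated, which the two counts alone do not reveal.

```python
from collections import Counter


def header_counts(fasta_lines):
    """Return (total headers, unique headers, sorted list of duplicated headers)."""
    headers = [line[1:].strip() for line in fasta_lines if line.startswith(">")]
    counts = Counter(headers)
    duplicates = sorted(h for h, n in counts.items() if n > 1)
    return len(headers), len(counts), duplicates


example = [">seq1", "MKV", ">seq2", "MCC", ">seq1", "MAA"]
total, unique, duplicates = header_counts(example)
print(total, unique, duplicates)  # 3 2 ['seq1']
```

A file is free of duplicate headers exactly when the first two numbers are equal.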