Fasta Header Replacer
Handling sequence files (such as .fasta) is one of the trickiest problems for novices in bioinformatics. BioPerl and Biopython are quite useful, but they can look really scary :-(
MATLAB offers a cool solution with its Bioinformatics Toolbox (an add-on product, so it has to be installed). Reading a FASTA file with 'fastaread' is as easy as reading a spreadsheet with 'xlsread', and writing one with 'fastawrite' is as easy as 'xlswrite' :-)
fastaread simply extracts the sequence headers and the sequences into cell arrays (when the file holds more than one record). Voila! Once that is done, you can do all kinds of manipulation on them.
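To make the cell-array idea concrete, here is a minimal sketch. It assumes the Bioinformatics Toolbox is installed; the file name 'demo.fasta' and its two records are made up purely for illustration. The only real caveat it shows is that a file with a single record comes back as plain character vectors rather than cell arrays.

```matlab
% Write a tiny two-record demo file (hypothetical name and contents)
% so that this snippet is self-contained.
fid = fopen('demo.fasta', 'w');
fprintf(fid, '>seqA first demo record\nATGCGCCAGC\n>seqB second demo record\nATGCAAAT\n');
fclose(fid);

% Two-output form: headers and sequences come back separately.
[Header, Seq] = fastaread('demo.fasta');

% With a single record, fastaread returns character vectors instead of
% cell arrays, so wrap them to keep the code below uniform.
if ischar(Header)
    Header = {Header};
    Seq = {Seq};
end

% List every record with its header and sequence length.
for k = 1:numel(Header)
    fprintf('%d: %s (%d residues)\n', k, Header{k}, length(Seq{k}));
end
```

After this, Header{k} and Seq{k} always refer to the k-th record, whether the file held one sequence or many.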
Here is a simple, self-explanatory, one-file-at-a-time script that replaces each header with a user-defined header. It also writes a translation table that maps the old headers to the new ones. To process multiple files, you can readily wrap it in a loop over a directory listing.
INPUT (sequence.fasta)
>gi|154163|gb|M83220.1|STYLEXA Salmonella typhimurium lexA (repressor of DNA damage inducible genes) gene, 5' end
ATGCGCCAGCTGCAAAATTTAAAT
>gi|154164|gb|M83220.1|STYLEXA Salmonella typhimurium lexA (repressor of DNA damage inducible genes) gene, 5'end
ATGCGCCAGCTGCAAAATTTAAAT
OUTPUT 1 (Testout.fasta)
>1_Org
ATGCGCCAGCTGCAAAATTTAAAT
>2_Org
ATGCGCCAGCTGCAAAATTTAAAT
OUTPUT 2 (TransTab.txt)
gi|154163|gb|M83220.1|STYLEXA Salmonella typhimurium lexA .....5' end 1_Org
gi|154164|gb|M83220.1|STYLEXA Salmonella typhimurium lexA .....5' end 2_Org
% Author: Arun Prasanna
% M-code to replace old headers with new ones in a FASTA file.
% Output files: 1. FASTA file with the new headers and the sequences
%               2. Translation table with old and new headers
% Algorithm: read a FASTA file -> store headers and sequences as cell arrays
%            -> generate new header names -> write the new header/sequence
%            pairs -> write the translation table
clear; clc;

[Header, Seq] = fastaread('sequence.fasta');
Header = Header';    % column cell array, one header per row
Seq = Seq';
[rH, cH] = size(Header);

new = cell(rH, 1);   % preallocate the list of new headers
for i = 1:rH
    id = num2str(i);
    org_name = '_Org';
    new{i,1} = strcat(id, org_name);   % e.g. '1_Org'
end

% Note: fastawrite appends to Testout.fasta if the file already exists,
% so delete any old copy before re-running the script.
fastawrite('Testout.fasta', new, Seq);
TransTab = horzcat(Header, new);

%=========== Section to write the cell array to a text file ===========%
fileID = fopen('TransTab.txt', 'w');
[nrows, ncols] = size(TransTab);
for row = 1:nrows
    fprintf(fileID, '%s\t%s\n', TransTab{row,:});
end
fclose(fileID);      % close the file so the table is flushed to disk
disp('Program Complete')
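For multiple files, the same idea can be looped over a directory listing. The following is only a sketch under some assumptions: the folder names 'fasta_in' and 'fasta_out' and the '_renamed' / '_TransTab' output suffixes are hypothetical, and it uses the one-output form of fastaread, which returns a struct array with Header and Sequence fields, rather than the two cell arrays used in the script above.

```matlab
% Hypothetical layout: inputs are *.fasta files in inDir; outputs go to
% outDir, which is created if it does not exist yet.
inDir  = 'fasta_in';
outDir = 'fasta_out';
if ~exist(outDir, 'dir')
    mkdir(outDir);
end

files = dir(fullfile(inDir, '*.fasta'));
for f = 1:numel(files)
    inFile = fullfile(inDir, files(f).name);
    [~, base] = fileparts(files(f).name);
    outFasta = fullfile(outDir, [base '_renamed.fasta']);
    outTable = fullfile(outDir, [base '_TransTab.txt']);

    % One-output form: struct array with fields Header and Sequence.
    records = fastaread(inFile);

    % fastawrite appends to an existing file, so remove stale output first.
    if exist(outFasta, 'file')
        delete(outFasta);
    end

    % Write the old/new header pairs, then swap in the new header.
    fid = fopen(outTable, 'w');
    for k = 1:numel(records)
        newHeader = sprintf('%d_Org', k);
        fprintf(fid, '%s\t%s\n', records(k).Header, newHeader);
        records(k).Header = newHeader;
    end
    fclose(fid);

    fastawrite(outFasta, records);
end
```

Each input file gets its own renamed FASTA file and its own translation table, so the numbering restarts at 1_Org for every file. If you need headers that are unique across the whole folder, one option is to build the new header from the file's base name as well.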