-- Author: admin
-- Posted: 9/23/2004 2:05:00 AM
-- bioinformatics(4)
Sender: happymood (土豆块儿), Board: Bioinformatics
Subject: bioinformatics(4)
Site: 北大未名站 (Tuesday, April 10, 2001, 16:15:35), on-site mail

Sender: spaceman (estranged), Board: LifeScience
Subject: Bioinformatics, Genomics, and Proteomics (forwarded)
Site: BBS 水木清华站 (Wed Nov 29 10:34:37 2000)

The Scientist 14[23]:26, Nov. 27, 2000

PROFILE

Bioinformatics, Genomics, and Proteomics
Scientific discovery advances as technology paves the path
By Christopher M. Smith

Data Mining Software for Genomics, Proteomics and Expression Data (Part 1)
Data Mining Software for Genomics, Proteomics and Expression Data (Part 2)

High-throughput (HT) sequencing, microarray screening, and protein expression profiling technologies drive discovery efforts in today's genomics and proteomics laboratories. These tools allow researchers to generate massive amounts of data, at a rate orders of magnitude greater than scientists ever anticipated. Initiatives to sequence entire genomes have resulted in single data sets ranging in size from 1.8 million nucleotides (the Haemophilus influenzae genome) to more than 3 billion (the human genome); a single microarray assay can easily produce information on thousands of genes, and a temporal protein expression profile may capture a data picture of 6,000 proteins.1

[Figure: Integration of Genomica's LinkMapper with ABI's Gene Mapper]

It's what you do with the data that counts, however, and that's where bioinformatics takes over. Researchers in bioinformatics are dedicated to the development of applications that can store, compare, and analyze the voluminous quantities of data generated by the use of new technologies.

One of the original functions of bioinformatics was to provide a mechanism to compare a query DNA or protein sequence against all sequences in a database. Several comparison algorithms, such as Smith-Waterman, FASTA, and BLAST, have provided some successful and powerful computational applications.2
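Algorithms such as Smith-Waterman find, by dynamic programming, the best-scoring local alignment between two sequences. A minimal, score-only sketch in Python (the scoring parameters here are illustrative defaults, not those of any particular package):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Score-only Smith-Waterman local alignment between sequences a and b.

    Illustrative scoring parameters; real tools use substitution
    matrices and affine gap penalties.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best local alignment score ending at a[i-1], b[j-1]
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell may restart from zero
            cell = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            score[i][j] = cell
            best = max(best, cell)
    return best
```

FASTA and BLAST trade this exhaustive search, which scales with the product of the two sequence lengths, for word-seeding heuristics, which is what makes database-scale queries practical.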
Early on, query sequences or sets of query sequences were relatively small: individual queries ranged from a few to 10,000 nucleotides, and query sets from 10 to 1,000 sequences. Because of the proliferation and improvement of HT sequencing technologies, it is now common to find query sequences with 10,000 nucleotides and data sets containing up to 1 million sequences.

The kinds of data developed and the methods for processing and analysis also have changed. Previously, small-scale DNA sequencing projects would perhaps generate 100 sequences (usually 50-400 nucleotides) that could be assembled relatively easily into a contiguous DNA sequence (a contig). Today, contig assembly may involve 1 million sequences with up to 5,000 nucleotides. The burgeoning fields of proteomics and microarray technologies provide another degree of complexity, adding multidimensional information to the biological data cornucopia.

New Scientific Challenges

The exponential rate of discovery in the era of modern molecular biology has been nothing short of phenomenal, culminating with the announcement in June 2000 that preliminary sequencing of the human genome had been completed.3 However, this achievement is just a taste of the scientific successes that are to come in the 21st century. As impressive as it is, the determination of the sequence of the approximately 3.2 billion nucleotides of the human genome, encoding an estimated 100,000 proteins, represents only the first step down a long road. Gene identification does not automatically translate into an understanding of gene function. Although mapping and cloning studies have linked a number of genes to heritable genetic diseases, the true (i.e., "normal") function of a majority of these genes remains unknown.

This dichotomy between gene identity and function will be one source of new research challenges in the 21st century, encompassing problems in biological science, computational biology, and computer science.
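The contig assembly described above rests on one core operation: detecting where the end of one sequencing read overlaps the start of another, then merging the pair. A minimal greedy sketch in Python (the function names and the fixed minimum-overlap threshold are illustrative; production assemblers must also handle sequencing errors and the reverse strand):

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of read a that matches a prefix of
    read b (at least min_len long), or 0 if there is none."""
    start = 0
    while True:
        # Jump to the next place a could begin a qualifying overlap
        start = a.find(b[:min_len], start)
        if start == -1:
            return 0
        # Check that the rest of a's suffix really is a prefix of b
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1

def merge(a, b, min_len=3):
    """Merge two reads at their suffix-prefix overlap; None if no overlap."""
    o = overlap(a, b, min_len)
    return a + b[o:] if o else None
```

Repeating this pairwise merge greedily over a read set yields a naive assembler; the scale cited in the article (a million reads) is why real assemblers index k-mers rather than compare all pairs.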
Biologists will need to decipher the genetic makeup of genomes, map genotypes to phenotypic traits, determine gene and protein structure and function, design and develop therapeutic agents (recombinant and genetically engineered proteins, and small-molecule ligands), and unravel biochemical pathways and cellular physiology. Tackling these biological issues will require innovations in computational biology that will be met by the development of new algorithms and methods for comparison of DNA and protein sequence, design of novel metrics for similarity and homology analyses, tools to outline biochemical pathways and interactions, and construction of physiological models. Success in the computational biology arena will require improvements in computational and informatics infrastructure, including development of novel databases as well as annotation, curation, and dissemination tools for those databases; design of parallel computation methods; and development of supercomputers. These latter challenges are particularly important, as high-performance computing (HPC) and bioinformatics applications need to be retooled to accommodate the fast interrogation of a plethora of databases, comparisons between relatively long strings of data, and data with varying degrees of complexity and annotation.

The lion's share of interest and effort over the past few years has been directed toward protein identification (proteomics), structure-function characterization (structural bioinformatics), and bioinformatics database mining. The pharmaceutical industry has for the most part driven these efforts in the search for new therapeutic agents. Identifying proteins from the cellular pool and/or determining structure-function in the absence of concrete biological data is a daunting task, but novel technological approaches are helping scientists to make headway on these fronts.
Proteomics: Protein Expression Profiling

Proteomics refers to the science and the process of analyzing and cataloging all the proteins encoded by a genome (a proteome). Since the majority of all known and predicted proteins have no known cellular function, the hope is that proteomics will bridge the chasm separating what raw DNA and protein primary sequence reveals about a protein and its cellular function. Determining protein function on a genomewide scale can provide critical pieces to the metabolic puzzle of cells. Because proteins are involved in one measure or another in disease states (whether induced by bacterial or viral infection, stress, or genetic anomaly), complete descriptions of proteins, including sequence, structure, and function, will substantially aid the current pharmaceutical approach to therapeutics development. This process, known as rational drug design, involves the use of specific structural and functional aspects of a protein to design better proteins or small-molecule ligands that can serve as activators or inhibitors of protein function. A recent technology profile in LabConsumer4 and a meeting review5 detail companies providing proteomics tools.

The multidimensional nature of proteomics data (for example, 2D-PAGE gel images) presents novel collection, normalization, and analysis challenges. Data collection issues are being overcome by sophisticated proteomic systems that semiautomate and integrate the experimental process with data collection. Improvements in the experimental technology have increased the number of proteins that can be identified, with consistency, within a single gel; however, making comparisons and looking for patterns and relationships between proteins and/or particular environmental, disease, or developmental states requires data mining and knowledge discovery tools.
Finding the Needle in the Haystack

Data mining refers to a new genre of bioinformatics tools used to sift through the mass of raw data, finding and extracting relevant information and developing relationships between them.6 As advances in instrumentation and experimental techniques have led to the accumulation of massive amounts of data, data mining applications are providing the tools to harvest the fruit of these labors. Maximally useful data mining applications should:

* process data from disparate experimental techniques and technologies and data that has both temporal (time studies) and spatial (organism, organ, cell type, subcellular location) dimensions;
* be capable of identifying and interpreting outlying data;
* use data analysis in an iterative process, applying gained knowledge to constantly examine and reexamine data; and
* use novel comparison techniques that extend beyond the standard Bayesian (similarity search) methods.

Data mining applications are built on complex algorithms that derive explanatory and predictive models from large sets of complex data by identifying patterns in data and developing probable relationships. Data mining workbenches also incorporate mechanisms to filter, standardize/normalize, and cluster data, and to visualize results.

As a tool to identify open reading frames (ORFs) or hypothetical genes in genomic data, data mining is a new twist on existing gene discovery applications, such as programs that identify intron/exon boundaries in genomic DNA. One of data mining's greatest practical applications will be in the area of HT, microarray-based gene- and protein-expression profiling, where massive data sets need to be examined to identify sometimes subtle intrinsic patterns and relationships. Differential gene analysis has the potential to explicitly describe the interrelationships of genes during development, under physiological stress, and during pathogenesis.
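Pattern-finding in expression data of this kind typically begins by clustering genes whose profiles move together across conditions. A minimal, deterministic k-means sketch in plain Python (the initialization from the first k profiles is an illustrative simplification, not how production workbenches seed clusters):

```python
def kmeans(points, k, iters=20):
    """Naive k-means: group expression profiles by Euclidean distance.

    points: list of equal-length numeric tuples (one per gene);
    returns a cluster index for each point. Deterministic because
    centroids start from the first k points (illustrative choice).
    """
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared distance
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for i, p in enumerate(points) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign
```

Each point here would be one gene's expression values across arrays or time points; genes landing in the same cluster are candidates for co-regulation, which is the kind of "intrinsic pattern" the article describes.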
The data mining approach taken to analyze microarray data is a function of experimental design and purpose. Investigations analyzing defined perturbations of a given genetic stasis use hypothesis-testing computational methods, whereas genetic surveys and research into fundamental cellular biology use statistical methods. Similarly, the same methods are utilized in analyzing large-scale proteomics data sets.

An extension of data mining is the concept of knowledge discovery (KD), in which the results of data mining experiments open up new avenues of research,7 with obvious and subtle findings forming the basis of new questions from different perspectives. Some of the more prominent data mining applications and KD workbenches are described in the accompanying table.

Predicting Protein Structure and Function

Structural bioinformatics involves the process of determining a protein's three-dimensional structure using comparative primary sequence alignment, secondary and tertiary structure prediction methods, homology modeling, and crystallographic diffraction pattern analyses. Currently, there is no reliable de novo predictive method for protein 3D-structure determination. Over the past half-century, protein structure has been determined by purifying a protein, crystallizing it, then bombarding it with X-rays. The X-ray diffraction pattern from the bombardment is recorded electronically and analyzed using software that creates a rough draft of the 3D structure. Biological scientists and crystallographers then tweak and manipulate the rough draft considerably. The resulting spatial coordinate file can be examined using modeling-structure software to study the gross and subtle features of the protein's structure.

One major bottleneck associated with this classic crystallography technology is the inordinate amount of time it takes to successfully grow protein crystals.
This problem is being addressed by HT technology under development that streamlines the crystallization process. This HT crystallography technology runs many crystallization conditions in parallel with real-time photo-video crystal monitoring. This enables the researcher to test thousands of crystallization conditions simultaneously, aborting those conditions that do not work at an early stage and selecting "perfect" crystals suitable for X-ray analysis.

Efforts to bypass the excessive time needed to tweak the rough draft of X-ray crystallographic structures have led to the advancement of computational modeling (homology and ab initio modeling) approaches. These techniques have been under development, in one form or another, since the first protein structure (of myoglobin) was determined in the late 1950s.8 Computational modeling utilizes predictive and comparative methods to fashion a new protein structure. Ab initio methods use the physicochemical properties of the amino acid sequence of a protein to literally calculate a 3D structure (lowest-energy model) based on protein folding. As opposed to determining the structure of an entire protein, ab initio methods are typically used to predict and model protein folds (domains). This method is gaining considerably, in part due to the development of novel mathematical approaches, a boost in available computational resources (for example, tera- and petaFLOPS supercomputers), and considerable interest from researchers investigating protein-ligand (or drug) interactions. Having the structure, even if only hypothetical, for the part of the protein that interacts with a ligand can potentially hasten drug exploration research.

In homology modeling, the structural and functional characteristics of known proteins are used as a template to create a hypothesized structure for an "unknown" protein with similar functional and structural features.
Protein structure researchers estimate that 10,000 protein structures will provide enough data to define most, if not all, of the approximately 1,000 to 5,000 different folds that a protein can assume;9 hence, predictive structure modeling will become more accurate and important as more and more structures are derived. The homology modeling approach has become very important to the pharmaceutical industry, where expense and time are major drawbacks to the classical methods of determining protein structure, even if automation shortens the discovery cycle. Hypothesized models provide an electronic footprint with which researchers may computationally design various "shoes," such as inhibitors, activators, and ligands.10 This provides for better engineering of potential drugs and reduces the number of compounds that need to be tested in vitro and in vivo.

A variety of companies and research initiatives have undertaken these modern approaches to 3D protein structure determination. Most produce structure prediction/modeling applications useful in drug development and basic science research, provide access to proprietary structure databases, and/or will develop customized analysis services for researchers. LabConsumer will present a profile on molecular modeling applications, including those that are key players in homology modeling, early next year.

Tools for the 21st Century

Modern experimental technologies are providing seemingly endless opportunities to generate massive amounts of sequence, expression, and functional data. The drive to capitalize on this enormous pool of information in order to understand fundamental biological phenomena and develop novel therapeutics is pushing the development of new computational tools to capture, organize, categorize, analyze, mine, retrieve, and share data and results. Most current computational applications will suffice for analyses of specific questions using relatively small data sets.
But to expand scientific horizons, to accommodate larger and larger data sets, and to find patterns and see relationships that span temporal and spatial scales, new tools that broaden the scope and complexity of the analyses are needed. Many of these data mining tools are available from the companies highlighted in the accompanying table. These new products and those listed in a previous LabConsumer profile11 have the capacity to expand research opportunities immeasurably.

Christopher M. Smith (csmith@sdsc.edu) is a freelance science writer in San Diego.

References

1. W.P. Blackstock, M.P. Weir, "Proteomics: quantitative and physical mapping of cellular proteins," Trends in Biotechnology, 17:121-7, 1999.
2. R.F. Doolittle, "Computer methods for macromolecular sequence analysis," Methods in Enzymology, Vol. 206, San Diego, Academic Press, 1996.
3. A. Emmett, "The Human Genome," The Scientist, 14[15]:1, July 24, 2000.
4. L. De Francesco, "One step beyond: Going beyond genomics with proteomics and two-dimensional technology," The Scientist, 13[1]:16, January 4, 1999.
5. S. Borman, "Proteomics: Taking over where genomics leaves off," Chemical & Engineering News, 78[31]:31-7, July 31, 2000.
6. J.L. Houle et al., "Database mining in the human genome initiative," www.biodatabases.com/whitepaper.html, Amita Corp., 2000.
7. G. Zweiger, "Knowledge discovery in gene-expression-microarray data: mining information output of the genome," Trends in Biotechnology, 17:429-36, 1999.
8. J.C. Kendrew et al., "Structure of myoglobin," Nature, 185:422-7, 1960.
9. L. Holm, C. Sander, "Mapping the protein universe," Science, 273:595-602, 1996.
10. J. Skolnick, J.S. Fetrow, "From genes to protein structure and function: Novel applications of computational approaches in the genomics era," Trends in Biotechnology, 18:34-9, 2000.
11. C. Smith, "Computational gold: Data mining and bioinformatics software for the next millennium," The Scientist, 13[9]:21-3, April 26, 1999.
12. R.H. Gross, "CMS molecular biology resource," Biotech Software & Internet Journal, 1:5-9, 2000.

Bioinformatics on the Web
Portals to data analysis

The heart of bioinformatics analyses is the software and the databases upon which many of the analyses are based. Traditionally, bioinformatics software has required high-end workstations (desktop to mid-range servers) with a multitude of visualization plug-ins and/or peripheral equipment, and a user (or administrator) willing to routinely download database updates. The mid-range UNIX server is still the standard bioinformatics platform, though there are also a fair number of Microsoft Windows and Apple PowerMac computers. There are also a number of specialized platforms that integrate hardware and custom software into a powerful data analysis tool, such as DeCypher, produced by Incline Village, Nev.'s TimeLogic (http://www.timelogic.com/); Bioccelerator, from Compugen Ltd. of Tel Aviv, Israel (http://www.cgen.com/); and GeneMatcher, manufactured by Paracel Inc. (http://www.paracel.com/) of Pasadena, Calif. Yet the amount of time, money, and effort needed to purchase and maintain the hardware, software, and databases required for bioinformatics research can be a considerable burden to a research laboratory.

[Figure: 2D-gel analysis with Compugen's Z3OnWeb.com]

To circumvent many of these problems, a few commercial entities are now providing fee-based bioinformatics analysis services through the World Wide Web. These services offer several advantages over local stand-alone or server-based analyses. Because they are provided through a Web interface, these services are platform-independent and may be accessed by practically any Web browser. Also, they are world accessible.
No longer must researchers struggle with different applications (doing the same function), different computer systems, file formats, and other hurdles to access their data and results. Bioinformatics Web portals truly provide universal access. Some of the more recent application service providers of Web-based bioinformatics tools are presented below.

Bionavigator (http://www.bionavigator.com/) is a product of eBioinformatics Inc., of Sunnyvale, Calif., a spin-off venture of the Australian National Genomic Information Service. This service primarily targets academic researchers and provides access to more than 20 databases and 200 analytical tools, including those for database searching, DNA/protein sequence analysis, phylogenetic analyses, and molecular modeling. Another attractive and useful feature of Bionavigator is that it can generate publication-quality result output (for example, color-coded multiple sequence alignments and graphic phylogenetic trees).

DoubleTwist.com, formerly Pangea Systems of Oakland, Calif., is a major purveyor of annotated sequence data through its Prophecy database. DoubleTwist has recently added fee-based bioinformatics services through an integrated life science portal. Using any one of a number of "research agents," researchers can analyze protein and DNA sequence data. DNA analysis tools provide for the identification of new gene family members, potential full-length cDNAs, and sequence homologs, whereas the protein tools include routines to identify protein family associations, protein-protein interactions, and conserved protein domains.

GeneSolutions.com, a product of HySeq Inc., of Sunnyvale, Calif., provides access to information describing proprietary gene sequences and related data from more than 1.4 million expressed sequence tags (ESTs) analyzed by HySeq using its proprietary SBH process. The GeneSolutions Portfolio contains gene sequences, homology data, and gene expression data generated by HySeq.
More than 35,000 genes are reported to have been identified and characterized in HySeq's proprietary databases.

Incyte Genomics OnLine Research (www.incyte.com/online) provides a Web portal to the numerous databases developed and maintained by Incyte Genomics Inc., of Palo Alto, Calif., and a personal workbench where researchers can store their sequences, perform analyses, and search the company's databases.

LabOnWeb.com (http://www.labonweb.com/), developed by Compugen Ltd., is an Internet life science research engine providing access to a variety of gene discovery tools. First introduced in December 1999, the latest version (2.0), released in September 2000, includes a variety of tools for the prediction of open reading frames and polypeptides (including an InstantRACE module that uses public and proprietary databases to return a complete cDNA sequence given an input EST), alternative splicing sites, gene function (by similarity to protein domain profiles), and tissue distribution, among others.

Z3OnWeb.com (http://www.2dgels.com/) is another service provided by Compugen for the analysis of 2D-gel image data using Z3 software. Researchers have the option of purchasing and operating the software from their own workstations, or they may upload their image data to the Web-accessible Z3 platform for analysis.

For researchers working on a nonexistent bioinformatics budget, there are still a host of powerful bioinformatics applications, accessible without charge, on the Web. If the researcher needs only to perform one or two types of analyses, and if data security, having to work through several disparate applications, and output format are not critical issues, then these gratis Web tools are a bargain.
A comprehensive listing of more than 2,300 Web-based bioinformatics tools (and information sources), organized according to the type of analyses they perform, is available through the CMS Molecular Biology Resource12 (www.sdsc.edu/restools) at the San Diego Supercomputer Center, University of California. A good place to start is the National Institutes of Health's National Center for Biotechnology Information Web site (http://www.ncbi.nlm.nih.gov/). This server contains sequencing and mapping data for nearly 800 different organisms through the GenBank database, all searchable using the BLAST tool. NCBI also contains an ORF finder, the Online Mendelian Inheritance in Man (OMIM) database of human genes, and a variety of other useful tools, most of them cross-indexed to the NCBI PubMed MEDLINE database.

--Christopher M. Smith

--
※ Source: 北大未名站 bbs.pku.edu.cn [FROM: 166.111.185.231]
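The ORF finding mentioned above reduces, at its simplest, to scanning each reading frame for an ATG start codon followed in-frame by a stop codon. A minimal sketch in Python (forward frames only, and a hypothetical minimum-length parameter; real tools such as NCBI's ORF finder also scan the reverse complement and apply further filters):

```python
def find_orfs(seq, min_codons=2):
    """Return (start, end) positions of ORFs (ATG .. in-frame stop)
    in the three forward reading frames of a DNA string.

    min_codons is the minimum number of codons between start and stop,
    counting the ATG (an illustrative cutoff, not a biological standard).
    """
    stops = {"TAA", "TAG", "TGA"}
    orfs = []
    for frame in range(3):
        start = None  # position of the most recent unmatched ATG
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in stops:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # end includes the stop codon
                start = None
    return orfs
```

Applied to genomic data at scale, exactly this kind of frame scan is the first pass that produces the "hypothetical genes" the article says data mining tools must then sift and annotate.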