The twenty-first century is presenting humankind with unprecedented environmental and medical challenges. The ability to design novel proteins tailored for specific purposes would potentially transform our ability to respond to these issues in a timely manner. Recent advances in the field of artificial intelligence are now setting the stage to make this goal achievable. Protein sequences are inherently similar to natural languages: amino acids arrange in a multitude of combinations to form structures that carry function, the same way as letters form words and sentences carry meaning. Accordingly, it is not surprising that, throughout the history of natural language processing (NLP), many of its techniques have been applied to protein research problems. In the past few years we have witnessed revolutionary breakthroughs in the field of NLP. The implementation of transformer pre-trained models has enabled text generation with human-like capabilities, including texts with specific properties such as style or subject. Motivated by its considerable success in NLP tasks, we expect dedicated transformers to dominate custom protein sequence generation in the near future. Fine-tuning pre-trained models on protein families will enable the extension of their repertoires with novel sequences that could be highly divergent but still potentially functional. The combination of control tags such as cellular compartment or function will further enable the controllable design of novel protein functions. Moreover, recent model interpretability methods will allow us to open the ‘black box’ and thus enhance our understanding of folding principles. Early initiatives show the enormous potential of generative language models to design functional sequences. We believe that using generative text models to create novel proteins is a promising and largely unexplored field, and we discuss its foreseeable impact on protein design.

ーーControllable protein design with language models
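
The control-tag idea in the abstract above can be sketched as a preprocessing step: tags are prepended to the amino-acid sequence so that a generative model can later be conditioned on them. This is a minimal illustration only. The tag names, the special tokens and the `encode`/`decode` helpers are assumptions made for this example, not something taken from the paper.

```python
# Minimal sketch of control-tag conditioning for a protein language model.
# All tag names and special tokens below are hypothetical.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
CONTROL_TAGS = [
    "<compartment:membrane>",
    "<compartment:cytoplasm>",
    "<function:hydrolase>",
]
SPECIAL = ["<bos>", "<eos>"]

# One integer id per token: special tokens, then tags, then residues.
VOCAB = {tok: i for i, tok in enumerate(SPECIAL + CONTROL_TAGS + list(AMINO_ACIDS))}
INVERSE_VOCAB = {i: tok for tok, i in VOCAB.items()}


def encode(sequence, tags):
    """Return token ids for: <bos>, control tags, one token per residue, <eos>."""
    unknown = [t for t in tags if t not in CONTROL_TAGS]
    if unknown:
        raise ValueError(f"unknown control tags: {unknown}")
    bad = sorted(set(sequence) - set(AMINO_ACIDS))
    if bad:
        raise ValueError(f"non-standard residues: {bad}")
    tokens = ["<bos>", *tags, *sequence, "<eos>"]
    return [VOCAB[t] for t in tokens]


def decode(ids):
    """Map token ids back to their string tokens."""
    return [INVERSE_VOCAB[i] for i in ids]


if __name__ == "__main__":
    ids = encode("MKT", ["<compartment:membrane>"])
    print(decode(ids))
```

At generation time, a model trained on sequences encoded this way would be prompted with `<bos>` plus the desired tags and asked to continue with residues. Whether the tags actually steer the output depends on the training data and is not shown here.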

To accelerate the biomedical research process, deep-learning systems are being developed to automatically acquire knowledge about molecule entities by reading large-scale biomedical data. Inspired by how humans learn deep molecule knowledge from versatile reading of both molecule structure and biomedical text, we propose a knowledgeable machine reading system that bridges both types of information in a unified deep-learning framework for comprehensive biomedical research assistance. We solve the problem that existing machine reading models can only process different types of data separately, and thus achieve a comprehensive and thorough understanding of molecule entities. By grasping meta-knowledge in an unsupervised fashion within and across different information sources, our system can facilitate various real-world biomedical applications, including molecular property prediction and biomedical relation extraction. Experimental results show that our system even surpasses human professionals in molecular property comprehension, and also reveal its promising potential for facilitating automatic drug discovery and documentation in the future.

ーーA deep-learning system bridging molecule structure and biomedical text with comprehension comparable to human professionals

ーーKV-PLM-Download

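Protein sequences are inherently similar to natural languages: amino acids arrange in a multitude of combinations to form structures that carry function, just as letters form words and sentences carry meaning. Accordingly, throughout the history of natural language processing (NLP), many of its techniques have been applied to protein research problems. The implementation of pre-trained Transformer models has enabled text generation with human-like capabilities, including text with specific properties such as style or subject. Motivated by their considerable success in NLP tasks, dedicated Transformers are expected to dominate custom protein sequence generation in the near future. Fine-tuning pre-trained models on protein families will allow those families to be extended with novel sequences that may be highly divergent yet still potentially functional. Combining control tags, such as cellular compartment or function, will further enable the controllable design of novel protein functions. Moreover, recent model interpretability methods will make it possible to open the "black box" and improve our understanding of protein folding principles. Early initiatives show the enormous potential of generative language models for designing functional sequences. The authors argue that using generative text models to create novel proteins is a promising and largely unexplored field, and they discuss its foreseeable impact on protein design.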

ーーControllable protein design with language models

A deep-learning system that bridges molecule structure and biomedical text, with comprehension comparable to that of human professionals.

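To accelerate the biomedical research process, deep-learning systems have been developed that automatically acquire knowledge about molecule entities by reading large-scale biomedical data. Inspired by the way humans learn deep molecular knowledge from varied reading of both molecule structure and biomedical text, the authors propose a knowledgeable machine reading system that connects the two types of information in a unified deep-learning framework, providing comprehensive assistance for biomedical research. They address the problem that existing machine reading models can only process different types of data separately, and thereby achieve a comprehensive and thorough understanding of molecule entities. By grasping meta-knowledge in an unsupervised fashion within and across different information sources, their system can facilitate various real-world biomedical applications, including molecular property prediction and biomedical relation extraction. Experimental results show that the system even surpasses human professionals in molecular property comprehension, and indicate its potential for automatic drug discovery and documentation in the future.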

ーーPaper by Maosong Sun's team at the Department of Computer Science, Tsinghua University, published in Nature Communications under the title "A deep-learning system bridging molecule structure and biomedical text with comprehension comparable to human professionals".

Single-cell RNA-seq

Open Problems - Multimodal Single-Cell Integration | Kaggle

In the past decade, the advent of single-cell genomics has enabled the measurement of DNA, RNA, and proteins in single cells. These technologies allow the study of biology at an unprecedented scale and resolution and have led to new insights into fundamental drivers of health and disease.

In this competition, you will predict how DNA, RNA, and protein measurements co-vary in single cells as bone marrow stem cells develop into more mature blood cells. You will have access to a 300,000-cell timecourse dataset and develop a model that predicts from DNA to RNA and from RNA to protein at a later unseen timepoint.
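
The task described above, predicting one modality from another at a later unseen timepoint, can be sketched with a time-based hold-out and a simple linear baseline. This is only an illustration on synthetic data. The day values, the matrix sizes, the ridge baseline and the row-wise Pearson score are all assumptions chosen for this example and are not taken from the competition description quoted here.

```python
import numpy as np


def split_by_timepoint(days, holdout_day):
    """Train on cells measured before holdout_day, evaluate on cells from that day."""
    days = np.asarray(days)
    train_idx = np.where(days < holdout_day)[0]
    test_idx = np.where(days == holdout_day)[0]
    return train_idx, test_idx


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)


def rowwise_pearson(Y_true, Y_pred):
    """Pearson correlation between true and predicted targets, one value per cell."""
    yt = Y_true - Y_true.mean(axis=1, keepdims=True)
    yp = Y_pred - Y_pred.mean(axis=1, keepdims=True)
    denom = np.linalg.norm(yt, axis=1) * np.linalg.norm(yp, axis=1)
    return (yt * yp).sum(axis=1) / denom


def demo(seed=0):
    """Run the baseline on synthetic 'RNA -> protein' data and return the mean score."""
    rng = np.random.default_rng(seed)
    n_cells, n_genes, n_proteins = 300, 8, 5      # toy sizes, not the real dataset
    X = rng.normal(size=(n_cells, n_genes))        # stand-in for RNA features
    W_true = rng.normal(size=(n_genes, n_proteins))
    Y = X @ W_true + 0.1 * rng.normal(size=(n_cells, n_proteins))  # stand-in for protein levels
    days = np.repeat([2, 3, 4, 7], n_cells // 4)   # hypothetical sampling days

    train_idx, test_idx = split_by_timepoint(days, holdout_day=7)
    W = fit_ridge(X[train_idx], Y[train_idx])
    scores = rowwise_pearson(Y[test_idx], X[test_idx] @ W)
    return float(scores.mean())


if __name__ == "__main__":
    print(f"mean row-wise Pearson on held-out day: {demo():.3f}")
```

Holding out the latest day rather than a random subset of cells mirrors the "later unseen timepoint" setup: the model is never shown cells from the evaluation day during fitting.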

2023-04-07