BioGPT Agent Tools
The twenty-first century is presenting humankind with unprecedented environmental and medical challenges. The ability to design novel proteins tailored for specific purposes would potentially transform our ability to respond to these issues in a timely manner. Recent advances in the field of artificial intelligence are now setting the stage to make this goal achievable. Protein sequences are inherently similar to natural languages: amino acids arrange in a multitude of combinations to form structures that carry function, the same way as letters form words and sentences carry meaning. Accordingly, it is not surprising that, throughout the history of natural language processing (NLP), many of its techniques have been applied to protein research problems. In the past few years we have witnessed revolutionary breakthroughs in the field of NLP. The implementation of transformer pre-trained models has enabled text generation with human-like capabilities, including texts with specific properties such as style or subject. Motivated by its considerable success in NLP tasks, we expect dedicated transformers to dominate custom protein sequence generation in the near future. Fine-tuning pre-trained models on protein families will enable the extension of their repertoires with novel sequences that could be highly divergent but still potentially functional. The combination of control tags such as cellular compartment or function will further enable the controllable design of novel protein functions. Moreover, recent model interpretability methods will allow us to open the ‘black box’ and thus enhance our understanding of folding principles. Early initiatives show the enormous potential of generative language models to design functional sequences. We believe that using generative text models to create novel proteins is a promising and largely unexplored field, and we discuss its foreseeable impact on protein design.
To accelerate biomedical research process, deep-learning systems are developed to automatically acquire knowledge about molecule entities by reading large-scale biomedical data. Inspired by humans that learn deep molecule knowledge from versatile reading on both molecule structure and biomedical text information, we propose a knowledgeable machine reading system that bridges both types of information in a unified deep-learning framework for comprehensive biomedical research assistance. We solve the problem that existing machine reading models can only process different types of data separately, and thus achieve a comprehensive and thorough understanding of molecule entities. By grasping meta-knowledge in an unsupervised fashion within and across different information sources, our system can facilitate various real-world biomedical applications, including molecular property prediction, biomedical relation extraction and so on. Experimental results show that our system even surpasses human professionals in the capability of molecular property comprehension, and also reveal its promising potential in facilitating automatic drug discovery and documentation in the future.
Source of the first abstract above: "Controllable protein design with language models".
Source of the second abstract above: "A deep-learning system bridging molecule structure and biomedical text with comprehension comparable to human professionals", published in Nature Communications by Maosong Sun's group at the Department of Computer Science, Tsinghua University.
Single cell RNA-seq
Open Problems - Multimodal Single-Cell Integration | Kaggle
In the past decade, the advent of single-cell genomics has enabled the measurement of DNA, RNA, and proteins in single cells. These technologies allow the study of biology at an unprecedented scale and resolution and have led to new insights into fundamental drivers of health and disease.
In this competition, you will predict how DNA, RNA, and protein measurements co-vary in single cells as bone marrow stem cells develop into more mature blood cells. You will have access to a 300,000-cell timecourse dataset and develop a model that predicts from DNA to RNA and from RNA to protein at a later unseen timepoint.
Protein engineering in modern biopharma: the core principle behind antibody therapies, antibody-drug conjugates (ADCs), bispecific T-cell engagers (BiTEs), DARPins, and CAR-T cell therapy.
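Key concepts
Biopharmaceutical platform technologies:
・Therapeutic antibodies (monoclonal antibodies): targeted treatment of cancer, immune disorders, and other diseases.
・Antibody-drug conjugates (ADCs): an antibody carries a small-molecule drug and delivers it to cancer cells, killing them selectively.
・Bispecific T-cell engagers (BiTEs): bind a cancer cell and a T cell at the same time, promoting an immune attack on the tumor.
・DARPins (Designed Ankyrin Repeat Proteins): protein binders that are more stable and easier to produce than conventional antibodies.
・CAR-based therapeutics (CAR-T cell therapy): T cells genetically engineered to recognize and kill cancer cells.
What all these platforms have in common:
・Core goal: design an artificial protein that binds efficiently to a disease-related target protein.
・Protein binding: the drug molecule (for example an antibody) must attach to the target protein precisely and efficiently to have a therapeutic effect.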
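The mathematical model of protein engineering: a "minimization problem":
・Input: the 3D structure of the target protein (typically a .pdb file from the Protein Data Bank, PDB).
・Goal: find an amino acid sequence whose encoded protein (the binder) binds tightly to the target protein.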
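As a rough illustration of the input side, the sketch below reads the C-alpha trace (one-letter sequence plus coordinates) of one chain from PDB-format text using the fixed-width ATOM record columns. The function names `read_ca_trace` and `atom_line` and the demo coordinates are made up for this note; alternate locations, HETATM records, and multi-model files are deliberately ignored, so a real pipeline would more likely use an established structure parser.

```python
# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def atom_line(serial, name, res_name, chain, res_seq, x, y, z):
    """Build one fixed-width ATOM record (hypothetical helper for the demo)."""
    return (f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}"
            f"    {x:8.3f}{y:8.3f}{z:8.3f}")


def read_ca_trace(pdb_text, chain_id="A"):
    """Return (one-letter sequence, list of C-alpha xyz tuples) for one chain."""
    sequence, coords = [], []
    for line in pdb_text.splitlines():
        # Skip non-ATOM records and lines too short to hold coordinates.
        if not line.startswith("ATOM") or len(line) < 54:
            continue
        # Fixed columns (1-based): atom name 13-16, residue name 18-20,
        # chain ID 22, x 31-38, y 39-46, z 47-54.
        if line[12:16].strip() != "CA" or line[21] != chain_id:
            continue
        sequence.append(THREE_TO_ONE.get(line[17:20], "X"))  # "X" = unknown
        coords.append((float(line[30:38]), float(line[38:46]), float(line[46:54])))
    return "".join(sequence), coords


# Synthetic records, not taken from any real PDB entry.
demo = "\n".join([
    atom_line(1, " N  ", "MET", "A", 1, 11.0, 12.0, 13.0),
    atom_line(2, " CA ", "MET", "A", 1, 12.1, 12.5, 13.2),
    atom_line(3, " CA ", "GLY", "A", 2, 15.3, 13.0, 14.8),
    atom_line(4, " CA ", "LYS", "B", 1, 30.0, 30.0, 30.0),
])
seq, ca = read_ca_trace(demo)
print(seq, len(ca))
```

The sequence and coordinates recovered this way describe the target; the design problem is then to propose a second sequence (the binder) that scores well against it.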
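What does "minimization problem" mean?
Protein engineering is at heart an optimization problem:
- Objective function:
  - Scores how tightly each candidate amino acid sequence binds the target protein (its affinity).
  - Higher affinity means tighter binding and, in general, better efficacy. Because tighter binding corresponds to lower binding energy, maximizing affinity is posed as minimizing energy.
- Search space:
  - Proteins are built from 20 amino acids, so a binder of 100 residues has $20^{100} \approx 10^{130}$ possible sequences.
  - A space this large cannot be enumerated exhaustively, so computer simulation, machine learning, and quantum computing are used or proposed to speed up the search.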
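The size estimate above can be checked with a few lines of arithmetic. The sketch below computes the exact count and its order of magnitude; the throughput of 10^12 sequence evaluations per second is a hypothetical figure chosen only to show the scale, not a claim about any real system.

```python
import math

NUM_AMINO_ACIDS = 20
BINDER_LENGTH = 100

# Exact number of length-100 sequences over a 20-letter alphabet.
num_sequences = NUM_AMINO_ACIDS ** BINDER_LENGTH

# Order of magnitude: 100 * log10(20) is about 130.1, hence ~10^130.
log10_size = BINDER_LENGTH * math.log10(NUM_AMINO_ACIDS)

# Hypothetical throughput, for illustration only.
EVALS_PER_SECOND = 10 ** 12
SECONDS_PER_YEAR = 365 * 24 * 3600
years_needed = num_sequences // (EVALS_PER_SECOND * SECONDS_PER_YEAR)

print(f"log10(search space) = {log10_size:.1f}")
print(f"decimal digits in 20**100 = {len(str(num_sequences))}")
print(f"years for brute force: about 10^{len(str(years_needed)) - 1}")
```

Even under that generous assumption, brute force would take on the order of 10^110 years, which is why the search has to be guided rather than exhaustive.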
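Real-world applications
- Drug discovery: screening for effective protein drugs. The development of antibodies against SARS-CoV-2, for example, involves this kind of optimization.
- Computational methods:
  - Classical computing: simulation with Monte Carlo methods and molecular dynamics (MD).
  - Artificial intelligence: AlphaFold 2 and RoseTTAFold use deep learning to predict protein structures, which in turn helps assess binding.
  - Quantum computing: quantum search (Grover's algorithm) and variational approaches such as the Variational Quantum Eigensolver (VQE) have been proposed as ways to speed up the search for an optimal amino acid sequence.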
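To make the Monte Carlo idea concrete, here is a minimal Metropolis sketch that searches sequence space by single-residue mutations. The objective `toy_energy` is a hypothetical stand-in (the number of mismatches against a hidden sequence), not a real binding-energy model; in practice the score would come from a physics-based or learned function evaluated against the target structure. The temperature and step count are arbitrary choices for this toy problem.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_energy(sequence, hidden_optimum):
    """Hypothetical stand-in for binding energy: count of mismatched positions."""
    return sum(a != b for a, b in zip(sequence, hidden_optimum))


def metropolis_search(energy, length, steps, temperature, rng):
    """Minimize `energy` over sequences using single-site Metropolis moves."""
    current = [rng.choice(AMINO_ACIDS) for _ in range(length)]
    current_e = energy(current)
    best, best_e = list(current), current_e
    for _ in range(steps):
        pos = rng.randrange(length)
        old = current[pos]
        current[pos] = rng.choice(AMINO_ACIDS)  # propose a point mutation
        new_e = energy(current)
        delta = new_e - current_e
        # Accept downhill or neutral moves always; accept uphill moves
        # with probability exp(-delta / temperature).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current_e = new_e
            if current_e < best_e:
                best, best_e = list(current), current_e
        else:
            current[pos] = old  # reject: undo the mutation
    return "".join(best), best_e


rng = random.Random(0)
hidden = "".join(rng.choice(AMINO_ACIDS) for _ in range(30))
best_seq, best_energy = metropolis_search(
    lambda s: toy_energy(s, hidden),
    length=30, steps=20000, temperature=0.1, rng=rng,
)
print(best_energy, best_seq)
```

The low temperature makes the search nearly greedy, which suits this smooth toy landscape. A realistic energy landscape has many local minima, so higher temperatures or an annealing schedule would usually be needed.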
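Summary
The passage above describes the central role of protein engineering in modern biopharma and reduces it to an optimization problem:
- Input: the 3D structure of the disease-related target protein (a .pdb file).
- Optimization goal: find the amino acid sequence whose encoded protein binds the target most tightly (that is, minimizes binding energy).
- Method: combine computer simulation, AI, and quantum computing to search the sequence space and so accelerate drug discovery.
Put simply, this is the mathematical optimization problem of "turning a protein into an ideal drug".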