Li Xiangjun's Paper Accepted by BMC Bioinformatics

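The paper "A Sequence Embedding Method for Enzyme Optimal Condition Analysis" by Li Xiangjun (李相君), a master's student in the lab's class of 2019, has been accepted by BMC Bioinformatics. Authors: Li Xiangjun (李相君)#, Dou Zhixin (窦智欣)#, Sun Yuqing (孙宇清)*, Wang Lushan (王禄山), Gong Bin (龚斌), Wan Lin (万林). # co-first authors, * corresponding author.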

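BMC Bioinformatics is an SCI Zone 2 journal in bioinformatics and computational biology (IF = 3.242) that focuses on innovative applications of modeling and statistical methods in bioinformatics. This work proposes a representation learning method based on amino acid and sequence structure information. Through semantic analysis of amino acid sequences, it explores predicting physiological and biochemical function directly from biological sequences, which substantially reduces the time, materials, and other costs of biological experiments. By drawing on the strengths of interdisciplinary research, it advances the rational design of enzymes in biology.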

Abstract

Background: Enzyme activity is influenced by the external environment, so it is important for an enzyme to retain high activity under a specific condition. A common approach is to first determine the optimal condition of an enzyme, either by a gradient test or from its tertiary structure, and then to use protein engineering to mutate the wild-type enzyme for higher activity under an expected condition.

Results: In this paper, we investigate the optimal condition of an enzyme by directly analyzing its sequence. We propose an embedding method that represents the amino acids and the structural information as vectors in a latent space. These vectors contain information about the correlations between amino acids and sites in the aligned amino acid sequences, as well as their correlation with the optimal condition. We crawled and processed the amino acid sequences of the glycoside hydrolase GH11 family and obtained 125 amino acid sequences with a known optimal pH condition. We used a probabilistic approximation method to implement the embedding learning on these samples. Based on the embedding vectors, we design a computational score that determines which of two given amino acid sequences has the better optimal condition; it achieves 80% accuracy on test proteins from the same family. We also give mutation suggestions for obtaining higher activity in an expected environment, and these are consistent with previous professional wet experiments and analysis.
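
To make the pairwise comparison concrete, here is a minimal Python sketch of how a score over aligned sequences could be computed from per-site residue vectors. It is an illustration under stated assumptions, not the scoring function from the paper: the names `random_embeddings`, `condition_score`, and `compare` are hypothetical, the vectors are random stand-ins for learned embeddings, and the plain dot-product sum is an assumed form of the score.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY-"  # 20 residues plus the alignment gap


def random_embeddings(length, dim, seed=0):
    """Hypothetical stand-in for learned vectors: one per (site, residue)."""
    rng = random.Random(seed)
    return {
        (site, aa): [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for site in range(length)
        for aa in AMINO_ACIDS
    }


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def condition_score(seq, embeddings, condition_vec):
    """Assumed score: sum over sites of the residue vector projected
    onto a vector representing the condition (for example, pH)."""
    return sum(dot(embeddings[(i, aa)], condition_vec) for i, aa in enumerate(seq))


def compare(seq_a, seq_b, embeddings, condition_vec):
    """Return (score_a - score_b, sites where the two sequences differ)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    diff = (condition_score(seq_a, embeddings, condition_vec)
            - condition_score(seq_b, embeddings, condition_vec))
    changed = [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]
    return diff, changed


# Toy aligned sequences (made up, not GH11 data).
emb = random_embeddings(length=4, dim=8)
cvec = [1.0] * 8  # hypothetical condition direction
diff, changed = compare("ACD-", "ACDE", emb, cvec)
print(diff > 0, changed)
```

In this sketch a positive difference would mean the first sequence scores higher for the condition, and the differing sites are the candidate positions for substitution. With real learned embeddings, the same comparison could be run between a wild-type sequence and a proposed mutant.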

Conclusion: A new computational method is proposed for sequence-based analysis of the enzyme optimal condition. Compared with the traditional process, which involves many wet experiments and requires multiple mutations, this method can efficiently and effectively recommend the direction and location of amino acid substitutions, with reference significance for an expected condition.

Keywords: Protein sequence analysis; Embedding; Bioinformatics