A collaboration between scientists in the Environmental Genomics and Systems Biology (EGSB) Division at Berkeley Lab and at Stanford University has revealed new insights into how regulatory sequences called enhancers drive gene expression during embryonic development. Enhancers are sections of DNA that orchestrate the expression of a gene despite being located far away from the actual coding sequence.
Their work, published in Nature, shows that multiple short, modular sequences within an enhancer are needed to properly guide expression, and that even a single-nucleotide mutation in one of these regions can change how and where a gene is activated. In one striking example, alterations to an enhancer associated with building structures in the face and limbs caused it to activate in heart and nervous system tissue instead.
Investigating enhancers has always been challenging because each of these sequences contains multiple binding sites for the molecules that switch DNA transcription on or off. The effects of mutations depend on the specific combination and location of the sites that are altered, and can only be revealed through systematic experiments. This complexity, together with the lack of sufficient data to train machine learning algorithms, makes it difficult to build accurate predictive models.
Using a mouse model, the Berkeley Lab team created a wide variety of mutations in seven enhancers known to govern development of the brain, heart, limbs, and face. They then looked for changes in developing tissues across the whole body. Using this large experimental dataset, the Stanford collaborators developed a new machine learning model and tested whether it could identify the same important sequences revealed in the experiments.
They found that although the model could identify many functionally important regions of enhancers by searching for sequence patterns known to indicate binding sites, it missed other sequences that are clearly critical based on the team’s experimental evidence.