| Abstract |
Intrinsically disordered regions (IDRs) are highly flexible regions in proteins that lack a stable 3D structure. They differ greatly from structured (ordered) regions, whose functions have long been understood to depend on their structure, which is ultimately encoded in their sequence. Because they lack structure, IDRs were initially considered purely functionless linkers between structured domains. However, increasing evidence indicates that IDRs are highly functional: they harbor key regulatory sites and motifs that are critical for protein function, or promote the formation of biomolecular condensates. Their importance is further evidenced by their ubiquity in the protein universe: IDRs are found across all kingdoms of life, particularly in eukaryotes. Despite their prevalence and functionality, understanding of IDR sequence-function relationships remains poor. Whilst IDRs are subject to evolution and selection, whereby functional sequence features are conserved, they also evolve more rapidly because they are not constrained to preserve structure, resulting in lower sequence identity between functionally homologous IDRs. This renders alignment ineffective in identifying homologs and limits the efficacy of alignment-based tools in detecting conserved sequence features within them. To tackle these issues, I developed Similarity/Homology Assessment by Relating K-mers (SHARK), a novel alignment-free sequence comparison algorithm that improves on existing alignment-free word-based algorithms by incorporating amino acid physicochemical similarities, and that even outperformed local alignment on a proof-of-concept test of functionally opposing IDRs. I further combined top-performing scoring algorithms into SHARK-dive, a machine-learned gradient boosting classifier. SHARK-dive was trained on a newly curated set of unalignable orthologous sequences.
Importantly, SHARK-dive outperformed the widely used local alignment-based homology search tools BLAST and HMMER in detecting homologs in a systematic benchmark, and was able to distinguish between functionally homologous and unrelated IDRs reported in the literature. Moreover, I developed SHARK-capture, an alignment-free motif detection tool to identify conserved regions. SHARK-capture extends the core concepts of the SHARK algorithm to detect motifs using physicochemical similarities on a continuous scale, without requiring rigidly defined equivalency groups. This approach should allow the detection of multiple conserved motifs across IDRs despite low identity and poor alignment, and is particularly effective when motifs are juxtaposed during evolution. I confirmed this by comparing the performance of SHARK-capture with a variety of existing tools in a systematic motif detection benchmark, and observed that it offers consistently strong performance at both the site and residue levels. In addition, SHARK-capture not only detected highly conserved residues of the GLEBS motif within the IDR of phase-separating BuGZ orthologs, but also detected multiple Pro-X0-4-Gly-like motifs, known to drive phase separation in vitro, within the sequences. I further identified a highly conserved RDYR motif in the C-terminus of orthologs of the ATPase helicase Ded1p, which I subsequently validated experimentally as necessary for the regulation of ATPase activity, demonstrating the utility of SHARK-capture in providing confident experimental hypotheses about functional regions within IDRs. The development of these tools represents an initial step in facilitating the systematic investigation of IDR sequence-function relationships. Future work aims to integrate the SHARK tools for iterative and highly sensitive searches for more remote homologs, thereby enabling the development of an IDR-specific homology resource and a global investigation of the IDR sequence landscape. |