Stream Matrices, Save RAM: Boost Cryptographic Performance!
Hey there, crypto enthusiasts and performance hounds! We're diving into a critical optimization for pq-code-package and mldsa-native: reducing RAM usage by moving away from storing the full matrix in memory. Instead, we'll generate it ad-hoc, on the fly, as it's needed. This is a big deal for resource-constrained environments like embedded systems and IoT devices, and it also helps wherever memory footprint translates into cost, such as densely packed cloud deployments. In post-quantum schemes like ML-DSA, the matrix is large even though it is entirely derived from a short seed, and keeping all of it resident quickly becomes a bottleneck, both for available RAM and for cache behavior. The goal here isn't a minor tweak; it's a shift in how we handle this critical data structure, so that our cryptographic operations stay secure while becoming far more memory-efficient. That makes our MLDSA implementations more scalable and versatile across a wider range of hardware, letting us deploy robust security on even very humble devices without the hefty RAM overhead that often comes with high-security cryptographic algorithms.
Diving Deep: Our Strategy to Reduce RAM Usage
Alright, folks, let's get down to the nitty-gritty of how we're going to pull this off. This isn't a one-shot fix; it's a staged architectural change designed to preserve stability and correctness throughout. The plan: first introduce a struct wrapper for the matrix, then add a helper function that becomes the single point of row access, then verify that all existing operations go through the new structure, then add a compile-time option to choose between the two behaviors, and finally switch to ad-hoc matrix generation. Each step builds on the last, which keeps the transformation manageable and makes regressions easy to isolate and debug. That staging matters because we're touching a fundamental data structure in a cryptographic library, where correctness and security are paramount: after every step the full test suite (./scripts/tests all) should still pass, so our security guarantees stay rock-solid even as memory usage drops. This isn't just about saving RAM; it's about a more robust, maintainable, and adaptable codebase for the future of post-quantum cryptography.
Step 1: Wrapping Our Matrix with mld_polymat
Our first step in this optimization journey is to introduce a proper struct wrapper for the matrix, moving from the bare mld_polyvecl mat[MLDSA_K] array to an encapsulated mld_polymat type. Why bother with a wrapper? Because this seemingly small change is foundational for everything that follows. mld_polymat gives us abstraction and encapsulation: a clean, defined interface that separates how the matrix is used from how it is stored or generated. Initially, mld_polymat simply wraps the existing mld_polyvecl mat[MLDSA_K] array, so nothing changes functionally; it's like putting a new cover on an old book, the content is the same but it now has a well-defined container. That abstraction lets us later change the underlying storage mechanism without rewriting every piece of code that interacts with the matrix, and it keeps pq-code-package and mldsa-native modular, understandable, and maintainable. We're also following the approach taken in mlkem-native/pull/1263, which keeps the related projects consistent and the overall ecosystem cohesive. After this change, the immediate priority is to verify that ./scripts/tests all still passes, confirming the wrapper hasn't introduced any regressions and that all cryptographic operations remain correct. This step is less about immediate memory savings and more about architectural integrity, preparing the ground for the real RAM-saving work later on.
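To make this concrete, here's a minimal, hypothetical sketch of what a Step 1 wrapper could look like. The parameter values and the mld_poly/mld_polyvecl layouts below are simplified placeholders rather than the actual mldsa-native definitions; the only point is the mld_polymat wrapper at the bottom, which changes neither layout nor behavior.

```c
#include <stdint.h>

/* Placeholder parameters and types for illustration only; the real
 * definitions live in the library's headers. */
#define MLDSA_N 256
#define MLDSA_K 4 /* number of matrix rows; larger parameter sets use 6 or 8 */
#define MLDSA_L 4 /* number of matrix columns */

typedef struct { int32_t coeffs[MLDSA_N]; } mld_poly;
typedef struct { mld_poly vec[MLDSA_L]; } mld_polyvecl;

/* Step 1: wrap the old `mld_polyvecl mat[MLDSA_K]` array in a named struct.
 * Same storage, same layout; just a container whose internals we are now
 * free to change in later steps. */
typedef struct { mld_polyvecl row[MLDSA_K]; } mld_polymat;
```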
Step 2: Introducing mld_polymat_get_row for Controlled Access
Next up, we introduce a helper function, mld_polymat_get_row, which takes a mld_polymat pointer and a row index and returns a mld_polyvecl pointer to the requested row. Think of mld_polymat_get_row as the bouncer at the door: instead of letting every piece of code reach directly into the matrix (which is what raw array access allows), this function becomes the only sanctioned way to get a row. Initially it just returns the address of the i-th entry of the mld_polyvecl mat[MLDSA_K] array that mld_polymat wraps, so the data returned is unchanged; what changes is how it is accessed. That decouples the access logic from the storage mechanism. We then rewrite mld_polyvec_matrix_pointwise_montgomery(), a critical function that processes matrix rows, to call mld_polymat_get_row() instead of indexing the matrix data directly. This refactoring funnels all row access through a single, controlled point, and that is exactly what makes the later switch possible: when we move to ad-hoc generation of rows instead of storing them, we only need to modify mld_polymat_get_row itself. Callers like mld_polyvec_matrix_pointwise_montgomery() won't know the difference; they keep calling mld_polymat_get_row and keep getting a mld_polyvecl pointer, whether it was pulled from memory or freshly generated. This separation of concerns is a fundamental principle of good software design and keeps the code flexible and adaptable. After implementing this, we run ./scripts/tests all again to make sure the controlled access path hasn't broken any existing functionality and that our crypto operations remain sound.
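Here's a hedged sketch of how Step 2 could look, building on the placeholder types from the previous sketch. mld_polyveck and the loop body are simplified stand-ins, not the real mldsa-native code; the point is only that every row access now goes through mld_polymat_get_row().

```c
/* Placeholder output type for illustration (one result polynomial per row). */
typedef struct { mld_poly vec[MLDSA_K]; } mld_polyveck;

/* Step 2: the single sanctioned way to obtain a matrix row. For now it
 * just returns the address of the stored row; in a later step it can
 * expand the row on demand instead, without any caller changing. */
mld_polyvecl *mld_polymat_get_row(mld_polymat *mat, unsigned int i)
{
  return &mat->row[i];
}

/* Simplified stand-in for mld_polyvec_matrix_pointwise_montgomery():
 * each row is fetched through mld_polymat_get_row() rather than by
 * indexing the storage array directly. */
void matrix_pointwise_sketch(mld_polyveck *t, mld_polymat *mat,
                             const mld_polyvecl *v)
{
  for (unsigned int i = 0; i < MLDSA_K; i++)
  {
    const mld_polyvecl *row = mld_polymat_get_row(mat, i);
    /* Real code accumulates <row, v> into t->vec[i] using the library's
     * pointwise Montgomery multiply; omitted in this sketch. */
    (void)row;
    (void)t;
    (void)v;
  }
}
```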
Step 3: Verifying Matrix Operations – A Crucial Checkpoint
Alright, this step is less about writing new code and more about a critical sanity check, and it's an important one. Before we plunge into the more drastic change to ad-hoc generation, we must verify that, at this point, the only operations interacting with our matrix are (a) mld_polymat_get_row, (b) mld_polyvec_matrix_expand, and (c) mld_polyvec_matrix_pointwise_montgomery(). This verification step is crucial for identifying our