Approaching Data

Benefits of modernization and strategy to get started

The classic relational database approach, where all data is normalized into one schema and business logic runs in stored procedures, creates maintainability problems that compound risk and cost over time. Every year the central database stays in operation puts the product and the organization further behind.

These days, we have open source frameworks, object-relational mapping (ORM) libraries, and microservices. These approaches enable much better data management through patterns that fit each kind of data.

What needs do databases fulfill?

Zooming out for a second, what are these huge relational databases doing that’s so important?

| Need | Old way | New way |
| --- | --- | --- |
| Data used in the interface | Web application issues SQL queries to store and retrieve the data users need | Web application uses a library that maps objects to persistent storage (an ORM) |
| Temporary data storage for sessions | Create a table; insert new records, update old ones, or remove them | Use a memory cache like Redis |
| Events and streams of data | Create a table; insert and delete rows | Message queue with high performance and flexible delivery options |
| Business logic, permissions | Stored procedures that perform various actions | Keep logic out of the DB; just store and retrieve data |
| Database servers | Few, multi-tenant, centrally managed | As many as are needed; keep data local to the apps |
| Analytics and data science | Reports, temp tables, and stored procedures | ETL to a data lake, along with all the other DBs |
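
To make the first row concrete, here is a minimal sketch of the object-mapping approach using SQLAlchemy, one popular Python ORM. The User model and the SQLite URL are illustrative assumptions, not a prescribed stack.

```python
# Minimal ORM sketch: the application works with plain objects and the
# library handles the SQL. The User model and SQLite URL are assumptions.
from sqlalchemy import create_engine, select, Integer, String
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, Session

class Base(DeclarativeBase):
    pass

class User(Base):
    __tablename__ = "users"
    id: Mapped[int] = mapped_column(Integer, primary_key=True)
    name: Mapped[str] = mapped_column(String(100))

engine = create_engine("sqlite:///app.db")  # any SQLAlchemy-supported DB
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(User(name="Ada"))  # plain object in; the ORM writes the row
    session.commit()
    user = session.scalars(select(User).where(User.name == "Ada")).one()
```

The application code never hand-writes INSERT or SELECT statements, which is exactly what makes it easy to point at a local database later.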

Approaches to move from old to new

  1. Start doing an ETL out of the big DB for reporting (see the first sketch after this list)
  2. Applications with business logic in the database should have that logic moved behind a thin REST API (second sketch below)
  3. Applications that use a DB table as an event queue should have that functionality moved to a message queue (third sketch below)
  4. Applications that keep ephemeral data in tables should move that content to a memory cache (fourth sketch below)
  5. Applications that persist their own data in the central database should have that data moved to a local database, either Postgres or MongoDB depending on the shape of the data
  6. After these steps have been taken, reliance on the central database will be reduced to the point that the last few hold-outs can be addressed individually
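
A minimal sketch of step 1, assuming a pandas-based extract landing Parquet files in a data lake; the connection string, table name, and lake path are hypothetical.

```python
# ETL sketch for step 1: extract a reporting table from the central DB
# and land it in a data lake as Parquet. Connection string, table, and
# lake path are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://reporting_ro@bigdb/prod")  # read-only creds

# Extract: pull only the columns reports actually use
df = pd.read_sql("SELECT id, region, amount, created_at FROM sales", engine)

# Transform: light cleanup so downstream reports stop doing it in SQL
df["created_at"] = pd.to_datetime(df["created_at"], utc=True)

# Load: partitioned Parquet is a common, query-friendly lake format
df.to_parquet("/lake/sales/", partition_cols=["region"])
```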
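
For step 2, a thin REST API might look like the following sketch. FastAPI is one option among many; the endpoint, model, and discount rule are invented for illustration.

```python
# Sketch of step 2: business logic moves out of stored procedures and
# behind a thin REST API. Endpoint, model, and discount rule are invented.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

class OrderIn(BaseModel):
    customer_id: int
    amount: float

@app.post("/orders")
def create_order(order: OrderIn):
    # Logic that used to live in a stored procedure now lives here,
    # where it can be versioned, tested, and deployed with the app.
    if order.amount <= 0:
        raise HTTPException(status_code=422, detail="amount must be positive")
    discount = 0.1 if order.amount > 1000 else 0.0
    total = order.amount * (1 - discount)
    # The database below this layer just stores and retrieves rows.
    return {"customer_id": order.customer_id, "total": total}
```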
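
For step 3, a sketch of publishing to a message queue where the old code inserted rows into an events table, here using RabbitMQ via the pika client; the queue name and payload are hypothetical.

```python
# Sketch of step 3: rows inserted into an "events" table become
# messages published to a queue. Queue name and payload are hypothetical.
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="order_events", durable=True)

# Where the old code did INSERT INTO event_table (...), publish instead:
event = {"type": "order_created", "order_id": 42}
channel.basic_publish(
    exchange="",
    routing_key="order_events",
    body=json.dumps(event),
    properties=pika.BasicProperties(delivery_mode=2),  # persist the message
)
connection.close()
```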
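
For step 4, a sketch of moving ephemeral session data into Redis with an expiry instead of churning rows in a table; the key names and TTL are assumptions.

```python
# Sketch of step 4: ephemeral session data moves from a table with
# INSERT/UPDATE/DELETE churn to a memory cache with built-in expiry.
# Key names and the 30-minute TTL are assumptions.
import json
import redis

cache = redis.Redis(host="localhost", port=6379)

# Instead of INSERT INTO sessions ... plus a cleanup job, set with a TTL:
session = {"user_id": 7, "cart": [101, 202]}
cache.setex("session:abc123", 1800, json.dumps(session))  # expires itself

# Reads replace SELECT ... FROM sessions:
raw = cache.get("session:abc123")
if raw is not None:
    session = json.loads(raw)
```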

Why change?

Keep in mind that an aging RDBMS at the center of everything constrains all the software around it, leaving it unable to adapt to changing needs out in the world. The coordination cost of a centralized database with a schema under strict change control is massive.

Teams will notice their velocity and quality increase as they reduce their reliance on the central database. These teams can suddenly take advantage of CI pipelines that are entirely ephemeral. Security testing can be much more aggressive because the blast radius of a SQL injection is limited to the environment linked to a single review application. Nobody else will be impacted even if that entire database is brought down.
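
As a sketch of what "entirely ephemeral" can look like, the testcontainers library can spin up a throwaway Postgres per pipeline run; the image tag and schema here are illustrative.

```python
# Ephemeral CI database sketch: each run gets a throwaway Postgres, so
# aggressive tests (including SQL injection probes) have a blast radius
# of exactly one container. Image tag and schema are illustrative.
from testcontainers.postgres import PostgresContainer
from sqlalchemy import create_engine, text

with PostgresContainer("postgres:16") as pg:
    engine = create_engine(pg.get_connection_url())
    with engine.begin() as conn:
        conn.execute(text("CREATE TABLE t (id serial PRIMARY KEY, v text)"))
        conn.execute(text("INSERT INTO t (v) VALUES ('hello')"))
        row = conn.execute(text("SELECT v FROM t")).one()
        assert row[0] == "hello"
# The container and all its data vanish here; nothing shared was touched.
```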

The experience for end users will improve as well, since each type of persistent data is stored in the format that fits it best and operated on as such. A message queue can scale out massively when queue depth spikes and shrink back down afterward. A memory cache can load-balance across more nodes as usage increases. And the database won't have expensive queries running synchronously unless the work is actually important.

First step

In practice, the way the database has evolved, and which applications depend on which capabilities, will determine what steps to take. The best step is to decommission as much of the old software as possible, but I've personally spent years trying to get '90s software turned off and somehow never had the budget for the resources to fill those last few gaps.

All else being equal, starting with message queues and memory cache is likely the biggest architectural improvement that can be made.

Alternatively, starting with the ETL process and setting up a proper data lake would shift much of the reporting load without directly impacting the applications or their inputs. It also creates a common destination once the message queues and event streams start flowing and being processed: if a data set has a place in the lake, ask what resolution and scope are necessary to support the existing reports. Clear input, output, and usage targets help with the shift from database to queue.