Look at the big picture – the presentation layer, and what the users would like to get.
Sometimes the “perfect” database design forces the presentation layer to manipulate the data itself. Performance then suffers, and someone asks “do we need a stronger machine?” or suggests “let’s redesign the database.” When you design an analytics solution, the data engineer and the analytics tools must be able to get the data fast, without structural workarounds (a view on top of a view, and so on), while the system stays performant and efficient.
There are many database architectures. Choose the best fit according to your business needs:
- Schema on read or schema on write: Will the data be structured or unstructured? It depends on your source data.
- Reporting tool: Choose a reporting tool that has a native data driver to your database.
- Late transactions: How do you deal with a transaction that arrives late? Do you reprocess it or ignore it?
- How much data to keep: Although information is considered a gold mine, do you need to keep all of it in your analytical database? Or can you work with a “thin” and fast system and keep the old information in cheap, queryable storage?
Choose an ETL tool: One picture is worth more than 1,000 words. An ETL tool can decrease development and maintenance time, and the time saved grows as the process becomes more complicated from a business perspective. Does code generated by an ETL tool run slower or less efficiently than hand-written code? Not really. In both cases it depends on the developer.
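The late-transaction question in the list above can be made concrete with a small policy function. This is only a sketch under assumptions: the names `classify`, `LateAction`, and `reprocess_window` are hypothetical, and the rule (reprocess if the transaction is late by no more than a fixed window, otherwise ignore) is one possible policy, not a recommendation from any specific tool.

```python
from datetime import datetime, timedelta
from enum import Enum


class LateAction(Enum):
    ON_TIME = "on_time"      # belongs to a period that is still open
    REPROCESS = "reprocess"  # late, but recent enough to recompute its period
    IGNORE = "ignore"        # too old: leave the closed period untouched


def classify(event_time: datetime, period_closed_at: datetime,
             reprocess_window: timedelta) -> LateAction:
    """Decide what to do with a transaction relative to the last closed period."""
    if event_time >= period_closed_at:
        return LateAction.ON_TIME
    if period_closed_at - event_time <= reprocess_window:
        return LateAction.REPROCESS
    return LateAction.IGNORE
```

Whatever the rule is, writing it down as one function keeps the decision in a single place instead of spreading it across several loads.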
Micro batch processing
Micro-batch processing involves small pieces of code that transform data. Each micro batch runs as part of a full business process, which is composed of many micro batch processes. The workflow can be started by a trigger: a file arrives, a message is delivered from a topic or queue, a TCP call is received, a CDC event occurs, or a simple schedule fires. Developing with micro batches has many advantages, including:
- Commit points and recovery
- Maintenance and release
- Amount of data increases
- Agile development
- Debug and troubleshooting
Commit points and recovery
“The process failed, run it again.” The longer the process takes, the more resources it consumes, and the more the backlog grows, the more painful a failure becomes. When the process is divided into small micro batches, you can build recovery into the workflow: a rerun resumes from the last commit point instead of starting over.
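A minimal sketch of commit points, assuming each micro batch is a plain Python callable and the list of completed steps is kept in a small JSON file. The names `run_workflow` and `checkpoint` are hypothetical; a real scheduler or ETL tool would have its own mechanism.

```python
import json
from pathlib import Path
from typing import Callable, List, Sequence, Tuple

Step = Tuple[str, Callable[[], None]]


def run_workflow(steps: Sequence[Step], checkpoint: Path) -> List[str]:
    """Run steps in order, skipping those already committed in `checkpoint`.

    Returns the names of the steps executed in this run.
    """
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else []
    executed = []
    for name, step in steps:
        if name in done:
            continue  # committed by an earlier run
        step()        # an exception here leaves the checkpoint untouched
        done.append(name)
        checkpoint.write_text(json.dumps(done))  # commit point
        executed.append(name)
    return executed
```

If the second of three steps fails, the next call to `run_workflow` with the same checkpoint file skips the first step and continues from the failed one.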
Maintenance and release
A micro batch is a process with one dedicated function, so development is more reliable and the result matches the requirements faster. It is like a manufacturing plant: you know what goes in, and you know what the output is. Adding a feature or removing behavior does not necessarily cause a regression.
Amount of data increases
An increase in the amount of data does not necessarily mean spending more money on resources or redeveloping the process in a different technology. With micro batches it is easier to understand which tasks consume the most resources. More often than not, a memory leak or an untuned process causes the resource problems. If the process takes longer to run because the data has grown, the bulk of the data can be split into small pieces that run in parallel.
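Splitting the bulk into pieces and running them in parallel could look roughly like this. It is a sketch, not a tuned implementation: `process_chunk` is a placeholder transformation (it just sums the rows), and `chunks` and `run_parallel` are hypothetical helper names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, List, Sequence


def chunks(rows: Sequence[int], size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def process_chunk(chunk: Sequence[int]) -> int:
    # Placeholder for the real transformation of one piece of data.
    return sum(chunk)


def run_parallel(rows: Sequence[int], size: int, workers: int = 4) -> List[int]:
    """Process each chunk on a worker thread; results keep the chunk order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_chunk, chunks(rows, size)))
```

Threads suit work that mostly waits on I/O. For CPU-heavy transformations in Python, `ProcessPoolExecutor` from the same module is usually the better fit.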
Agile development
A micro batch has an entry point and a result. The developer can mock the process with known input files and compare the result with known expected files. Each developer can take a “piece” of the whole process to develop.
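Comparing the result of a micro batch against a known file can be sketched as a tiny harness. The assumption here is that a micro batch is a callable that takes an input path and an output path. `matches_expected` and `uppercase_batch` are hypothetical names, and the second one exists only to exercise the harness.

```python
import tempfile
from pathlib import Path
from typing import Callable


def matches_expected(batch: Callable[[Path, Path], None],
                     known_input: Path, expected_output: Path) -> bool:
    """Run `batch` on a known input file and compare its output byte for byte."""
    with tempfile.TemporaryDirectory() as tmp:
        actual = Path(tmp) / "actual.out"
        batch(known_input, actual)
        return actual.read_bytes() == expected_output.read_bytes()


def uppercase_batch(src: Path, dst: Path) -> None:
    # Hypothetical micro batch used only to demonstrate the harness.
    dst.write_text(src.read_text().upper())
```

Because every micro batch has the same shape, each developer can keep a folder of known input and expected output files for the piece they own.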
Debug and troubleshooting
Because the workflow looks like an assembly line, when the outcome does not match the desired result, the developer can check the result of each step, which is not possible with a single monolithic process.
The process can enrich the data from other data sources or from other micro services, and it can be written in any language. The only thing that matters is that the micro process receives a file with a known structure and produces a file with a known structure. A known structure does not have to be CSV; it can be JSON, XML, or another format. Depending on the requirements, a micro batch flow can process even a single entity row (structure) and can run every second.
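As an illustration of "known structure in, known structure out", here is a sketch of one micro batch that reads CSV, enriches each row from a lookup, and writes one JSON object per line. The column names (`order_id`, `country_code`), the function name `enrich_orders`, and the in-memory `countries` dictionary are all assumptions made for the example.

```python
import csv
import json
from pathlib import Path
from typing import Dict


def enrich_orders(src: Path, dst: Path, countries: Dict[str, str]) -> int:
    """Read orders from CSV, add a country name, write one JSON object per line.

    Returns the number of rows written.
    """
    count = 0
    with src.open(newline="") as fin, dst.open("w") as fout:
        for row in csv.DictReader(fin):
            row["country_name"] = countries.get(row["country_code"], "unknown")
            fout.write(json.dumps(row) + "\n")
            count += 1
    return count
```

The next micro batch in the flow only needs to know the output structure. It does not care which language or tool produced the file.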
Building a workflow based on many micro batches
The operations that consume the most resources are aggregation, sorting, and keeping lookup data in memory.
Most ETL tools use a known sorting algorithm.
With some ETL tools, aggregating sorted data performs very differently from aggregating unsorted data. It is recommended to split the data into groups of known size (partitions) and aggregate each one. Some tools offer multithreaded processing and can manage the threads with a thread pool.
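The difference between aggregating sorted and unsorted data can be shown in plain Python. This is a sketch of the general idea, not of how any particular ETL tool works internally, and the function names are hypothetical. Sorted input can be aggregated as a stream, one group at a time, while unsorted input needs every key in memory. Partitioned aggregation computes partial totals per fixed-size group and merges them.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Tuple[str, int]  # (key, amount)


def aggregate_sorted(rows: Iterable[Row]) -> Iterator[Row]:
    """Streaming aggregation: needs input sorted by key, holds one group at a time."""
    for key, group in groupby(rows, key=itemgetter(0)):
        yield key, sum(amount for _, amount in group)


def aggregate_unsorted(rows: Iterable[Row]) -> Dict[str, int]:
    """Hash aggregation: accepts any order, but keeps every key in memory."""
    totals: Dict[str, int] = defaultdict(int)
    for key, amount in rows:
        totals[key] += amount
    return dict(totals)


def aggregate_in_partitions(rows: List[Row], size: int) -> Dict[str, int]:
    """Aggregate fixed-size partitions separately, then merge the partial totals."""
    merged: Dict[str, int] = defaultdict(int)
    for start in range(0, len(rows), size):
        for key, total in aggregate_unsorted(rows[start:start + size]).items():
            merged[key] += total
    return dict(merged)
```

Note that `aggregate_sorted` silently produces duplicate keys if the input is not actually sorted, which is why some tools ask you to declare whether the data arrives sorted.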
When a lookup table is big, it is recommended to create a micro service for it. There is no need to create a micro service for every lookup table.
Dividing a Flow
Where possible, it is recommended to separate the heavy tasks into different micro batches.
The entire task is called data preparation.