How to build solid Data Engineering platforms and products?
Let’s outline a checklist for developing a solid Data Engineering platform. Brownfield applications and products need to move toward greenfield to deliver a successful turnaround for the business. I’m trying to explain Data Engineering in plain layman’s terms: no gimmicks, no buzzwords. A solid product is what matters. As I said, that means moving from brownfield to a real greenfield, not just “green coating”; the building is supposed to be strong.
Also, most companies have started stepping up to modernize their applications because of computation bottlenecks, slow querying, manual maintenance problems, cumbersome deployment processes, the cost of fault-tolerance handling, and so on.
Data Engineering is the backbone of AI-related products, such as Machine Learning, Deep Learning, and Reinforcement Learning implementations, and of analytics capabilities in general. By their nature, these applications should be built on distributed data processing to make computation faster.
In corporate terms, data plays a major role in building any kind of software product or application, across all domains.
High-level architecture manifesto for modernization (from a data engineering perspective):
Major use cases for Data Engineering:
We can discuss the design and implementation factors for each use case step by step, including the portfolio needed to build a solid product.
- Data Pipeline
- Fast Data Applications / Development of Data Intensive Applications
- Micro Services Platform / Event Driven Applications
- Reactive Systems
- NoSQL Systems
- Cloud Adoption / Cloud Native Applications.
Data Pipeline:
The main phases of a pipeline implementation are collect, transform, store, and report. The Machine Learning world adds one more phase: modeling. In practice, more phases, or a chain of components, are appended based on the requirements (a chain of responsibilities). The pipeline needs to handle large data volumes and high throughput with low latency, and support self-service management. For this kind of elasticity, a cloud-native approach is more cost effective.
The following well-known patterns cover most pipeline architectures:
- ETL / ELT (the classic pattern)
- Batch processing data pipeline
- Near-real-time processing data pipeline (streaming data engine)
- Data lake
- Data warehouse
- Data mart.
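The collect, transform, and store phases above can be sketched as a tiny batch ETL job. This is a minimal illustration, not a production design: the CSV source, the `payments` table, and the record fields are all hypothetical examples.

```python
# A minimal batch ETL sketch: collect -> transform -> store.
import csv
import io
import sqlite3

RAW_CSV = """user_id,amount
1,10.50
2,not_a_number
3,7.25
"""

def collect(raw):
    """Collect phase: read raw records from the source."""
    return list(csv.DictReader(io.StringIO(raw)))

def transform(rows):
    """Transform phase: clean and type the data, dropping bad records."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append((int(row["user_id"]), float(row["amount"])))
        except ValueError:
            continue  # in a real pipeline, route bad rows to a dead-letter store
    return cleaned

def store(records, conn):
    """Store phase: load the cleaned records into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS payments (user_id INT, amount REAL)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
store(transform(collect(RAW_CSV)), conn)
print(conn.execute("SELECT COUNT(*), SUM(amount) FROM payments").fetchone())
# the malformed row is dropped, so 2 rows are loaded
```

The report (or modeling) phase would then query the stored table; in a real platform each phase is usually a separate, independently scalable component.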
Fast Data Applications / Developing Data-Intensive Applications:
The following characteristics need to be applied to achieve fast, high-performance applications:
- in-memory computation
- Schema-free writes (schema-on-read)
- CQRS (Command Query Responsibility Segregation)
- Event Sourcing
- Concurrency
- Distributed data processing
- Microservices and back-pressure handling
- Component isolation.
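Two of the characteristics above, event sourcing and CQRS, can be sketched together in a few lines. This is a toy illustration under assumed names (a bank-account domain with "deposited"/"withdrawn" events, none of which come from the article): state is never mutated in place, every change is an appended event, and the read side derives the current view by replaying the log.

```python
# A minimal event-sourcing sketch with a CQRS-style read model.
event_log = []  # append-only store: the write side only ever appends

def apply_event(state, event):
    """Pure function: fold one event into the derived state."""
    kind, amount = event
    if kind == "deposited":
        return state + amount
    if kind == "withdrawn":
        return state - amount
    return state

def record(event):
    """Write side: commands produce events, nothing is updated in place."""
    event_log.append(event)

def current_balance():
    """Read side: rebuild the view by replaying the full event log."""
    balance = 0
    for event in event_log:
        balance = apply_event(balance, event)
    return balance

record(("deposited", 100))
record(("withdrawn", 30))
record(("deposited", 5))
print(current_balance())  # 75
```

Because the log is the source of truth, the read model can be rebuilt at any time, and multiple specialized read models can be derived from the same events.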
Microservice Platform:
The main objective of microservices is to split business capabilities into small components and let an orchestrator take care of the arrangement. Each business unit can then have its own characteristics, such as storage and business logic. The implementation of a business unit is technology agnostic, and time to market is quick thanks to loose coupling.
Event Driven Applications:
This design pattern is the flip side of the microservices architecture. It improves the efficiency of data communication between producer and consumer. In other words, data is mainly handled asynchronously to improve data-processing flexibility.
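The asynchronous producer/consumer communication described above can be sketched with a bounded queue, which also demonstrates back pressure: when the consumer falls behind, the producer suspends instead of flooding it. This is a minimal single-process sketch with illustrative names; a real event-driven system would use a broker such as Kafka between the two sides.

```python
# A minimal async producer/consumer sketch with back pressure.
import asyncio

async def producer(queue, n):
    for i in range(n):
        await queue.put(i)  # suspends here when the queue is full (back pressure)
    await queue.put(None)   # sentinel: no more events

async def consumer(queue, results):
    while True:
        event = await queue.get()
        if event is None:
            break
        results.append(event * 2)  # stand-in for real event handling

async def main():
    queue = asyncio.Queue(maxsize=3)  # small buffer forces back pressure
    results = []
    await asyncio.gather(producer(queue, 5), consumer(queue, results))
    return results

print(asyncio.run(main()))  # [0, 2, 4, 6, 8]
```

The key property is that producer and consumer never call each other directly; they only share the queue, which is what keeps the two sides loosely coupled.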
— — further continue in the next article…
Here is just one glance at how the following open-source stack might help make future applications “green”,
at a high level:
Cloud and Management Tools > Data Source > Upstream Services > Data Processing > Compute > Storage > Downstream Services / Dashboard