How to build solid Data Engineering Platforms/products?

3 min read · Mar 13, 2021

Let’s outline a checklist for developing a solid data engineering platform. Brownfield applications and products need to move toward greenfield to become a successful turnaround for the business. I’m trying to explain data engineering in plain layman’s terms, with no gimmicks or buzzwords. A solid product should be the important factor. As I said, this means moving from brownfield to real greenfield, not applying a “green coating”: the building itself is supposed to be strong.

Also, most companies have started to modernize their applications because of computation bottlenecks, slow queries, manual maintenance problems, deployment processes, the cost of fault-tolerance handling, and so on.

Data engineering is the backbone of AI-related products, such as machine learning, deep learning and reinforcement learning implementations, and of analytics capabilities. By their nature, those applications should be implemented with distributed data processing to make computation faster.

In corporate terms, data plays a major role in building any kind of software product or application, in every domain.

High-level architecture manifesto for modernization (from a data engineering perspective):

Major use cases for Data Engineering:

We can discuss the design and implementation factors of every use case step by step, including the portfolio needed to build a solid product.

Data Pipeline
Fast Data Applications / Development of Data Intensive Applications
Micro Services Platform / Event Driven Applications
Reactive Systems
NoSQL Systems
Cloud Adoption / Cloud Native Applications.

Data Pipeline:

The main phases of a pipeline implementation are collection, transformation, storage and reporting. The machine learning world adds one more phase, “modeling”. In detail, more phases or chained components are appended based on the requirements, following the chain-of-responsibility pattern. The pipeline needs to handle large data volumes and high throughput with low latency, and to support self-service management. Because this calls for elasticity, a cloud-native process is more cost effective.
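To make the phases concrete, here is a minimal Python sketch of a pipeline built as a chain of functions. Everything in it is an assumption made for illustration: the sample records, the function names and the in-memory list standing in for real storage are all hypothetical.

```python
from collections import defaultdict

# Hypothetical raw records; a real pipeline would read them from a source system.
RAW_EVENTS = [
    {"user": "a", "amount": "10.5"},
    {"user": "b", "amount": "4.0"},
    {"user": "a", "amount": "2.5"},
]


def collect(source):
    """Collect phase: read the raw records from the source."""
    return list(source)


def transform(records):
    """Transform phase: cast the string amounts to floats."""
    return [{"user": r["user"], "amount": float(r["amount"])} for r in records]


def make_store(warehouse):
    """Store phase: build a step that appends to a list standing in for storage."""
    def store(records):
        warehouse.extend(records)
        return warehouse
    return store


def report(records):
    """Report phase: total amount per user."""
    totals = defaultdict(float)
    for r in records:
        totals[r["user"]] += r["amount"]
    return dict(totals)


def run_pipeline(source, phases):
    """Pass the output of each phase to the next one in the chain."""
    data = source
    for phase in phases:
        data = phase(data)
    return data


warehouse = []
totals = run_pipeline(RAW_EVENTS, [collect, transform, make_store(warehouse), report])
print(totals)
```

Because the pipeline is just a list of phases, an extra step such as “modeling” could be appended without touching the existing ones.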

The following patterns are well-known pipeline architectures.

ETL / ELT (the well-known pattern)
Batch processing data pipeline
Near-real-time processing data pipeline (streaming data engine)
Data lake
Data warehouse
Data mart.
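The difference between the batch and near-real-time patterns above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function names and the sample readings are hypothetical, and a real streaming engine would read from a message broker rather than a list.

```python
from typing import Iterable, Iterator


def batch_total(records: Iterable[float]) -> float:
    """Batch: the whole dataset is available, so it is processed in one pass."""
    return sum(records)


def streaming_totals(records: Iterable[float]) -> Iterator[float]:
    """Near-real time: emit an updated running total as each record arrives."""
    total = 0.0
    for value in records:
        total += value
        yield total


readings = [1.0, 2.0, 3.5]  # hypothetical sample data
print(batch_total(readings))
print(list(streaming_totals(readings)))
```

The batch version gives one answer at the end, while the streaming version gives an up-to-date answer after every record.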

Fast Data Applications / Developing Data Intensive Applications:

The following characteristics need to be applied in the system to achieve faster, high-performance applications:

In-memory computation
Schema-free writes
CQRS (Command Query Responsibility Segregation)
Event Sourcing
Concurrency
Distributed data processing
Microservices and back-pressure handling
Component isolation.
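Two items from the list, CQRS and Event Sourcing, are easier to grasp with a small example. The sketch below is only an illustration in Python: the account domain and the class names (`EventStore`, `CommandHandler`, `BalanceReadModel`) are hypothetical, and the event log lives in memory instead of a durable store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    account: str
    kind: str    # "deposited" or "withdrawn"
    amount: int


class EventStore:
    """Event sourcing: state is kept as an append-only log of events."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def all(self):
        return list(self._events)


class CommandHandler:
    """Write side of CQRS: validates commands and records events."""

    def __init__(self, store):
        self._store = store

    def deposit(self, account, amount):
        if amount <= 0:
            raise ValueError("amount must be positive")
        self._store.append(Event(account, "deposited", amount))

    def withdraw(self, account, amount):
        if amount <= 0:
            raise ValueError("amount must be positive")
        self._store.append(Event(account, "withdrawn", amount))


class BalanceReadModel:
    """Read side of CQRS: answers queries by replaying the event log."""

    def __init__(self, store):
        self._store = store

    def balance(self, account):
        total = 0
        for event in self._store.all():
            if event.account == account:
                total += event.amount if event.kind == "deposited" else -event.amount
        return total


store = EventStore()
commands = CommandHandler(store)
commands.deposit("acc-1", 100)
commands.withdraw("acc-1", 30)
print(BalanceReadModel(store).balance("acc-1"))
```

The write side never returns state and the read side never changes it, so each side can be scaled or stored differently.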

Microservice Platform:

The main objective of microservices is to split a business capability into small components and let an orchestrator take care of arranging them. Each business unit can therefore have its own characteristics, such as storage and business logic. The implementation of a business unit is technology agnostic, and time to market is short thanks to loose coupling.
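As a rough sketch of this idea, the Python example below models two business units that each own their storage, plus an orchestrator that arranges them. It is an assumption-laden illustration: the service names are hypothetical, and plain in-process classes stand in for separately deployed services that would normally talk over a network.

```python
class InventoryService:
    """Business unit that owns its own stock storage."""

    def __init__(self):
        self._stock = {"book": 2}  # private storage, hypothetical data

    def reserve(self, item):
        if self._stock.get(item, 0) <= 0:
            return False
        self._stock[item] -= 1
        return True


class BillingService:
    """Business unit that owns its own invoice storage."""

    def __init__(self):
        self._invoices = []

    def charge(self, customer, item):
        invoice_id = len(self._invoices) + 1
        self._invoices.append((invoice_id, customer, item))
        return invoice_id


class OrderOrchestrator:
    """Arranges the services; it only uses their public methods."""

    def __init__(self, inventory, billing):
        self._inventory = inventory
        self._billing = billing

    def place_order(self, customer, item):
        # Bill the customer only when the item could be reserved.
        if not self._inventory.reserve(item):
            return None
        return self._billing.charge(customer, item)


orchestrator = OrderOrchestrator(InventoryService(), BillingService())
print(orchestrator.place_order("alice", "book"))
```

Since the orchestrator never reaches into a service's storage, either service could be rewritten in another technology without affecting the other.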

Event Driven Applications:

This design pattern is the flip side of the microservices architecture. It improves the efficiency of data communication between producer and consumer. In other words, data is mainly handled asynchronously to make data processing more flexible.
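A minimal way to picture asynchronous producer and consumer communication is a bounded queue between two coroutines. The sketch below uses Python's `asyncio`. The event names, the queue size and the `None` end-of-stream marker are assumptions for illustration, and a real system would normally put a message broker between the two sides.

```python
import asyncio


async def producer(queue, events):
    """Publish events; put() waits while the queue is full (back pressure)."""
    for event in events:
        await queue.put(event)
    await queue.put(None)  # hypothetical end-of-stream marker


async def consumer(queue, handled):
    """Handle events at its own pace, decoupled from the producer."""
    while True:
        event = await queue.get()
        if event is None:
            break
        handled.append(event.upper())


async def main(events):
    queue = asyncio.Queue(maxsize=2)  # small bound so the producer cannot run far ahead
    handled = []
    await asyncio.gather(producer(queue, events), consumer(queue, handled))
    return handled


print(asyncio.run(main(["order_created", "order_paid", "order_shipped"])))
```

The producer never calls the consumer directly. It only knows about the queue, which is what keeps the two sides loosely coupled.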

To be continued in the next article…

Here is just one glance at how the following open source stack might help make future applications green.

At a high level:

Cloud and Management Tools > Data Source > Upstream Service > Data Processing > Compute > Storage > Downstream Services / Dashboard

Written by Jeevan NRMoorthy
