A-Z Data Engineering

Data Engineering (A-Z)

Airflow — programmable, DAG-based job scheduler; a very popular Apache project

BigQuery — Google’s serverless data warehouse that competes with Redshift and Azure Synapse Analytics (formerly Azure SQL Data Warehouse)

Cassandra — distributed NoSQL database built on a wide-column data model (not to be confused with columnar analytics storage)

Databricks — a web-based platform for working with Spark and more

ETL — extract from source, transform, load into destination

Flink — distributed processing engine for data streams

Glue — large-scale serverless ETL and data pipeline service from AWS

Hadoop — big data processing framework comprising MapReduce, YARN, and HDFS

InfluxDB — very popular time-series database

JSON — the de facto data interchange format on the internet

Kafka — distributed event streaming platform, originally developed at LinkedIn and now an Apache project

Looker — browser-based BI tool, acquired by Google and now part of Google Cloud

MongoDB — very popular document-oriented NoSQL database

NoSQL — group of database technologies that use non-relational data models (key-value, document, wide-column, graph)

Oozie — DAG-based workflow scheduler for Hadoop jobs

PostgreSQL — programmers’ favorite open-source database

Query Engine — the piece of software that executes queries against a dataset

Redshift — AWS’s managed, petabyte-scale data warehousing solution

SQL — the language that data speaks

Terraform — the de facto Infrastructure-as-code product by HashiCorp

Unstructured Data — data without a schema or a pre-defined structure

View — virtual table defined by a stored SQL query; its results are not persisted

Wrangling — cleaning the data, making it ready for analysis

Xplenty — data integration platform that extracts data from various cloud apps and moves it to a destination

YARN — resource manager for the Hadoop ecosystem (used by MapReduce and Spark)

ZooKeeper — centralized service for configuration management, naming, and coordination in distributed systems
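Several of the entries above (ETL, JSON, SQL, View, Wrangling) fit together in one small pipeline. The sketch below is only an illustration using Python's standard library: the `RAW` payload, the `readings` table, and the `warm_cities` view are made-up names, and an in-memory SQLite database stands in for a real warehouse.

```python
import json
import sqlite3

# Hypothetical raw input: a JSON payload as it might arrive from an API.
RAW = (
    '[{"city": " berlin ", "temp_c": "21.5"},'
    ' {"city": "paris", "temp_c": null},'
    ' {"city": "Oslo", "temp_c": "12"}]'
)


def extract(payload):
    """Extract: parse the JSON text into Python dictionaries."""
    return json.loads(payload)


def transform(rows):
    """Transform (wrangling): drop missing readings, tidy names, cast types."""
    cleaned = []
    for row in rows:
        if row["temp_c"] is None:
            continue  # skip rows without a reading
        cleaned.append((row["city"].strip().title(), float(row["temp_c"])))
    return cleaned


def load(conn, rows):
    """Load: write the cleaned rows into the destination table."""
    conn.execute("CREATE TABLE IF NOT EXISTS readings (city TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a real destination
load(conn, transform(extract(RAW)))

# A view stores only its query; the result is computed on every read.
conn.execute(
    "CREATE VIEW warm_cities AS SELECT city FROM readings WHERE temp_c > 15"
)
query = "SELECT city FROM warm_cities ORDER BY city"
warm = [r[0] for r in conn.execute(query)]

# New rows in the base table show up in the view without any refresh.
conn.execute("INSERT INTO readings VALUES (?, ?)", ("Madrid", 30.0))
warm_after = [r[0] for r in conn.execute(query)]

print(warm)
print(warm_after)
```

The second query returns the newly inserted city even though the view was never updated, which is what "not persisted" means in practice: the view is just a saved query over the base table.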

Do you know them all? What would you change?
