INTRO
What is data engineering?
- Data Collection and Storage
- Data preparation
- Exploration and Visualization
- Experimentation and Prediction
Data engineers deliver
- the correct data
- in the right form
- to the right people
- as efficiently as possible
A data engineer's responsibilities
- ingest data from different sources
- optimize databases for analysis
- remove corrupted data
- develop, construct, test and maintain data architectures
The five Vs
- Volume
- Variety
- Velocity
- Veracity
- Value
These are what data engineers have to consider
The data pipeline
- ingest
- process
- store
- pipelines are needed so that data scientists (DS) can use up-to-date, accurate data
ETL / Data pipelines
ETL
- Extract data
- Transform extracted data
- Load transformed data to another database
Data pipelines
Move data from one system to another
May follow ETL
Data may not be transformed
Data may be directly loaded in applications
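The extract, transform, load steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the raw records, the `students` table, and the function names are hypothetical, and an in-memory SQLite database stands in for the target database.

```python
import sqlite3

# Hypothetical raw records, standing in for data pulled from a source system.
RAW_ROWS = [
    {"stu_id": "1", "stu_name": " alice "},
    {"stu_id": "2", "stu_name": "BOB"},
    {"stu_id": None, "stu_name": "corrupted"},  # missing id: dropped in transform
]


def extract():
    """Extract: return the raw records (here a hard-coded list)."""
    return list(RAW_ROWS)


def transform(rows):
    """Transform: drop rows without an id, cast ids to int, tidy names."""
    return [
        (int(r["stu_id"]), r["stu_name"].strip().title())
        for r in rows
        if r["stu_id"] is not None
    ]


def load(rows, conn):
    """Load: write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS students (stu_id INT, stu_name VARCHAR(255))"
    )
    conn.executemany("INSERT INTO students VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # assumption: in-memory DB as the target
load(transform(extract()), conn)
loaded = conn.execute(
    "SELECT stu_id, stu_name FROM students ORDER BY stu_id"
).fetchall()
print(loaded)
```

A pipeline that skips `transform` and passes the extracted rows straight to `load` would still be a data pipeline, just not an ETL one.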
Data structures
Structured data
- easy to search and organize
- consistent model, rows and columns
- defined types
- can be grouped to form relations
- stored in relational databases
- about 20% of data is structured
- created and queried using SQL
Semi-structured data
- relatively easy to search and organize
- consistent model, less rigid implementation
- different types, sizes
- can be grouped, but needs more work
- NoSQL databases: JSON, XML, YAML
Unstructured data
- hard to search and organize
- doesn't follow a model, can't be contained in rows and columns
- usually stored in data lakes, can appear in data warehouses or databases
- extremely valuable
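To make "can be grouped, but needs more work" concrete, here is a rough sketch that flattens semi-structured JSON into structured rows with fixed columns. The records and field names (`user_id`, `name`, `favorites`) are made up for illustration.

```python
import json

# Hypothetical semi-structured records: same general model, but fields vary.
raw = """
[
  {"user_id": 1, "name": "Alice", "favorites": ["jazz", "rock"]},
  {"user_id": 2, "name": "Bob"}
]
"""

records = json.loads(raw)

# Flatten into structured rows with fixed columns: (user_id, name, favorite).
rows = []
for rec in records:
    favorites = rec.get("favorites") or [None]  # missing field -> one row with None
    for fav in favorites:
        rows.append((rec["user_id"], rec["name"], fav))

for row in rows:
    print(row)
```

The extra work is the decision about missing or nested fields, which a relational table would have forced up front.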
SQL
- Structured Query Language
- Industry standard for Relational Database Management Systems (RDBMS)
- lets you access many records at once and group, filter, or aggregate them
-- table_name and condition are placeholders
CREATE TABLE table_name (
    stu_id INT,
    stu_name VARCHAR(255)
);
SELECT *
FROM table_name
WHERE condition;
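The filter, group, and aggregate operations mentioned above can be tried out with Python's built-in `sqlite3` module. This is a small sketch, not part of the course material: the `students` table name, the extra `grade` column, and the sample rows are assumptions added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: a `grade` column is added so there is something to aggregate.
conn.execute("CREATE TABLE students (stu_id INT, stu_name VARCHAR(255), grade INT)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [(1, "Alice", 90), (2, "Bob", 72), (3, "Alice", 80)],
)

# Filter: only rows matching a condition.
high = conn.execute(
    "SELECT stu_id FROM students WHERE grade >= 80 ORDER BY stu_id"
).fetchall()

# Group and aggregate: average grade per name.
averages = conn.execute(
    "SELECT stu_name, AVG(grade) FROM students GROUP BY stu_name ORDER BY stu_name"
).fetchall()

print(high)
print(averages)
```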
Data warehouses and data lakes
Data lake:
- stores all the raw data
- unprocessed, messy
- can be petabytes in size (1 PB = 1 million GB)
- stores all data structures
- difficult to analyze
- requires an up-to-date data catalog
- used by DS
- big data, real-time analysis
Data warehouse:
- specific data for specific use
- relatively small
- stores mainly structured data
- more costly to update
- optimized for data analysis
- used by DS and BA
- ad hoc, read-only queries
- ad hoc: written for a specific purpose
Data catalog for data lakes
- What is the source of this data?
- Where is this data used?
- Who is the owner?
- How often is this updated?
- Good practice in terms of data governance
- Ensures reproducibility
- No catalog -> data swamp
- Good practice for any data storage solution:
  - reliability
  - autonomy
  - scalability
  - speed
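One way to picture a data catalog is as a lookup table that answers the four questions above for each dataset. The sketch below is a toy model only: `CatalogEntry`, the dataset path, and every field value are hypothetical, and real catalogs are usually dedicated services rather than a Python dict.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    source: str            # what is the source of this data?
    used_by: list          # where is this data used?
    owner: str             # who is the owner?
    update_frequency: str  # how often is this updated?


# Hypothetical catalog with a single documented dataset.
catalog = {
    "raw/listening_events": CatalogEntry(
        source="mobile app logs",
        used_by=["recommendation model", "weekly usage report"],
        owner="data-engineering team",
        update_frequency="hourly",
    ),
}


def lookup(dataset):
    """Return the catalog entry, or None if the dataset is undocumented."""
    return catalog.get(dataset)


entry = lookup("raw/listening_events")
print(entry.owner, entry.update_frequency)

# None: undocumented data is what turns a lake into a data swamp.
missing = lookup("raw/old_dump")
print(missing)
```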
DB vs. DW
- database:
  - general term
  - loosely defined
- DW
Data processing value
- remove unwanted data
- to save money
- convert data from one type to another
- organize data
- to fit into a schema/structure
- increase productivity
How data engineers (DE) process data
- data manipulation, cleaning, and tidying tasks
  - that can be automated
  - that will always need to be done
- store data in a sanely structured DB
- create views on top of the DB tables
- decide what happens with missing metadata
- optimize the performance of the DB
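Three of the tasks above (handling missing metadata, creating a view, and optimizing performance) can be sketched with `sqlite3`. The `tracks` table, its columns, the `'unknown'` default, and the index name are assumptions chosen for the example; the right policy for missing values depends on the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tracks (track_id INT, title TEXT, genre TEXT)")
conn.executemany(
    "INSERT INTO tracks VALUES (?, ?, ?)",
    [(1, "Song A", "jazz"), (2, "Song B", None), (3, "Song C", "rock")],
)

# Decide what happens with missing metadata: here it is labeled 'unknown'.
conn.execute("UPDATE tracks SET genre = 'unknown' WHERE genre IS NULL")

# Create a view on top of the table so analysts query a stable, tidy shape.
conn.execute(
    "CREATE VIEW tracks_per_genre AS "
    "SELECT genre, COUNT(*) AS n_tracks FROM tracks GROUP BY genre"
)

# Optimize a common lookup by adding an index on the filtered column.
conn.execute("CREATE INDEX idx_tracks_genre ON tracks (genre)")

per_genre = conn.execute(
    "SELECT genre, n_tracks FROM tracks_per_genre ORDER BY genre"
).fetchall()
print(per_genre)
```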
Scheduling
- can apply to any task listed in data processing
- glue of the data engineering system
- runs tasks in a specific order and resolves all dependencies
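"Runs tasks in a specific order and resolves all dependencies" is essentially a topological sort of a task graph. A minimal sketch using the standard-library `graphlib` module (Python 3.9+) follows; the task names and their dependencies are hypothetical, and real schedulers add retries, triggers, and monitoring on top of this idea.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks, each mapped to the set of tasks it depends on.
dependencies = {
    "ingest": set(),
    "clean": {"ingest"},
    "store": {"clean"},
    "build_views": {"store"},
}


def run_pipeline(deps):
    """Run every task only after all of its dependencies have run."""
    executed = []
    for task in TopologicalSorter(deps).static_order():
        executed.append(task)  # a real scheduler would execute the task here
    return executed


order = run_pipeline(dependencies)
print(order)
```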
Manual, Time, Condition
- Manual: because ~
- Time: every, seasons
- Condition: when, if ~
Batches and streams
Batches
- group records at intervals
- often cheaper
Streams
- send individual records right away
Scheduling tools, e.g. Apache Airflow, Luigi
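The difference between batches and streams can be shown with two small generators. This is a simplified sketch: the notes describe batches grouped at time intervals, while the example below groups by a fixed record count (`size`) to keep it deterministic; function names are illustrative.

```python
def stream(records):
    """Stream: hand over each record individually, as soon as it arrives."""
    for record in records:
        yield record


def batches(records, size):
    """Batch: collect records and hand them over `size` at a time."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final, partially filled batch
        yield batch


events = list(range(1, 8))  # hypothetical incoming records
streamed = list(stream(events))
batched = list(batches(events, 3))

print(streamed)
print(batched)
```

Batching trades latency for fewer, larger hand-offs, which is one reason it is often cheaper.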
Parallel computing
- Basis of modern data processing tools
- Why it is necessary:
  - mainly because of memory
  - also for processing power
- How it works:
  - split a task into several smaller subtasks
  - distribute the subtasks over several processing units or computers
Pros and cons
| Pros | Cons |
|---|---|
| extra processing power | moving data incurs a cost |
| reduced memory footprint | communication time |
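The split-and-combine idea can be sketched with the standard-library `concurrent.futures` module. Assumptions to note: threads stand in for separate workers to keep the example simple, so this shows the structure rather than a real speed-up; for CPU-bound work one would typically swap in `ProcessPoolExecutor` or a distributed framework. The helper names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, n_chunks):
    """Split a list into at most n_chunks contiguous chunks."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sum(data, n_workers=4):
    """Sum each chunk in a separate worker, then combine the partial sums."""
    chunks = split(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(sum, chunks))  # each worker sees only its chunk
    return sum(partials)  # combining results is the communication cost


total = parallel_sum(list(range(1, 101)))
print(total)
```

Each worker holds only its own chunk (the reduced memory footprint), while splitting the data and collecting the partial results is the overhead listed under cons.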
refs: https://ko.wikipedia.org/wiki/%EB%B3%91%EB%A0%AC_%EC%BB%B4%ED%93%A8%ED%8C%85
Cloud Computing
Differences from on-premise
- no need for physical space
- electrical and maintenance costs can be reduced (resources are rented)
- DB reliability: data replication
- but there are risks with sensitive data
refs: Data Engineering for Everyone (DataCamp)