Cleaning Data with PySpark (PySpark focus)

All We Need is Data, itself! · June 15, 2022

DataCamp


Immutability and Lazy processing

  • Python variables

    • mutable
    • flexible
    • potential for issues with concurrency
    • likely adds complexity
  • immutability

    • a component of functional programming
    • defined once
    • unable to be directly modified
    • recreated if reassigned
    • able to be shared efficiently
  • lazy processing: Spark defers actual computation until a result is needed
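Lazy processing can be illustrated with plain Python generators — a rough analogy only (not Spark's API), since both defer work until a value is actually demanded:

```python
# Analogy: like a Spark transformation, a generator does no work when created.
def squares(nums):
    for n in nums:
        print(f"computing {n}^2")  # shows exactly when work happens
        yield n * n

lazy = squares([1, 2, 3])  # nothing computed yet ("transformation")
result = list(lazy)        # forcing the generator ("action") runs the work
# result is [1, 4, 9], and the "computing" lines print only at this point
```

In the same way, building a chain of DataFrame transformations is cheap; the cost is paid when an action forces execution.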

Understanding Parquet

  • Spark and CSV files
    • slow to parse
    • files cannot be filtered before loading (no predicate pushdown)
  • The Parquet Format
    • columnar data format
    • supported in Spark and other data processing frameworks
    • supports predicate pushdown
    • automatically stores schema info.
```python
# Two equivalent ways to read a Parquet file
df = spark.read.format('parquet').load('filename.parquet')
df = spark.read.parquet('filename.parquet')
```

```python
# A Parquet file can be registered as a temp view and queried with Spark SQL
df = spark.read.parquet('filename.parquet')
df.createOrReplaceTempView('name')
df2 = spark.sql('SELECT * FROM name')
```

Partitioning and lazy processing

  • Partitioning
    • DataFrames are broken up into partitions
    • Partition size can vary
    • Each partition is handled independently
  • Lazy processing
    • Transformations are lazy
      • .withColumn(...)
      • .select(...)
    • Nothing is actually done until an action is performed
      • .count()
      • .write (e.g. .write.parquet(...))
    • Transformations can be re-ordered for best performance
    • Sometimes causes unexpected behavior
  • Spark is lazy!
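As a rough sketch in plain Python (not Spark's actual planner), the transformation/action split above can be mimicked by recording steps in a plan and only running them when an "action" is called:

```python
# "Transformations" only append to a plan; nothing executes yet.
plan = []

def add_doubled(rows):
    # like .withColumn(): derive a new field per row
    return [{**row, "doubled": row["x"] * 2} for row in rows]

def keep_even(rows):
    # like a filter transformation
    return [row for row in rows if row["x"] % 2 == 0]

plan.append(add_doubled)
plan.append(keep_even)

def count(rows):
    # the "action": execute the recorded plan, then compute the result
    for step in plan:
        rows = step(rows)
    return len(rows)

data = [{"x": 1}, {"x": 2}, {"x": 3}, {"x": 4}]
n = count(data)  # the plan runs only here; n is 2
```

A real query planner can also reorder or fuse the recorded steps before running them, which is why Spark can re-order transformations for performance — and why side effects inside transformations sometimes behave unexpectedly.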

