Cleaning Data with PySpark (PySpark focus)

All We Need is Data, itself! · June 15, 2022

DataCamp


Immutability and Lazy processing

  • Python variables

    • mutable
    • flexible
    • potential for issues with concurrency
    • likely adds complexity
  • immutability

    • a component of functional programming
    • defined once
    • unable to be directly modified
    • recreated if reassigned
    • able to be shared efficiently
  • lazy processing: Spark defers actual computation until a result is needed
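Lazy processing can be illustrated with plain Python generators — a rough analogy only (not Spark's API), since both defer work until a value is actually demanded:

```python
# Analogy: like a Spark transformation, a generator does no work when created.
def squares(nums):
    for n in nums:
        print(f"computing {n}^2")  # shows exactly when work happens
        yield n * n

lazy = squares([1, 2, 3])  # nothing computed yet ("transformation")
result = list(lazy)        # forcing the generator ("action") runs the work
# result is [1, 4, 9], and the "computing" lines print only at this point
```

In the same way, building a chain of DataFrame transformations is cheap; the cost is paid when an action forces execution.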

Understanding Parquet

  • Spark and CSV files
    • slow to parse
    • files cannot be filtered before loading (no predicate pushdown)
  • The Parquet Format
    • columnar data format
    • supported in Spark and other data processing frameworks
    • supports predicate pushdown
    • automatically stores schema info.
```python
# Two equivalent ways to read a Parquet file
df = spark.read.format('parquet').load('filename.parquet')
df = spark.read.parquet('filename.parquet')
```

```python
# A Parquet file can be registered as a temp view and queried with Spark SQL
df = spark.read.parquet('filename.parquet')
df.createOrReplaceTempView('name')
df2 = spark.sql('SELECT * FROM name')
```

Partitioning and lazy processing

  • Partitioning
    • DataFrames are broken up into partitions
    • Partition size can vary
    • Each partition is handled independently
  • Lazy processing
    • Transformations are lazy
      • .withColumn(...)
      • .select(...)
    • Nothing is actually done until an action is performed
      • .count()
      • .write (e.g. .write.parquet(...))
    • Transformations can be re-ordered for best performance
    • Sometimes causes unexpected behavior
  • Spark is lazy!
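As a rough sketch in plain Python (not Spark's actual planner), the transformation/action split above can be mimicked by recording steps in a plan and only running them when an "action" is called:

```python
# "Transformations" only append to a plan; nothing executes yet.
plan = []

def add_doubled(rows):
    # like .withColumn(): derive a new field per row
    return [{**row, "doubled": row["x"] * 2} for row in rows]

def keep_even(rows):
    # like a filter transformation
    return [row for row in rows if row["x"] % 2 == 0]

plan.append(add_doubled)
plan.append(keep_even)

def count(rows):
    # the "action": execute the recorded plan, then compute the result
    for step in plan:
        rows = step(rows)
    return len(rows)

data = [{"x": 1}, {"x": 2}, {"x": 3}, {"x": 4}]
n = count(data)  # the plan runs only here; n is 2
```

A real query planner can also reorder or fuse the recorded steps before running them, which is why Spark can re-order transformations for performance — and why side effects inside transformations sometimes behave unexpectedly.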

