Setting up a PySpark environment on Google Colaboratory

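I often use Google Colaboratory when I want to try out libraries and the like. This time, to get familiar with Spark, I set up an environment that can run PySpark in a Google Colaboratory notebook, so I am recording the process here as a memo to myself. Honestly, there are many steps whose purpose I do not understand yet, so I expect to update this post as my study progresses.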

[Update 2022/3/2]
I have posted updated information in the following article: toeming.hatenablog.com

[End of update]

Without further ado, here are the steps to run in the notebook.

Note that I used this article as a reference: qiita.com

1. Install Java, Spark, and findspark with shell commands (the lines prefixed with "!")

!apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
!tar xf spark-2.3.1-bin-hadoop2.7.tgz
!pip install -q findspark

Ign:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Ign:2 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Hit:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release
Hit:4 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release
Hit:5 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease
Hit:7 http://security.ubuntu.com/ubuntu bionic-security InRelease
Hit:8 http://archive.ubuntu.com/ubuntu bionic InRelease
Hit:9 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease
Hit:10 http://archive.ubuntu.com/ubuntu bionic-updates InRelease
Hit:12 http://ppa.launchpad.net/cran/libgit2/ubuntu bionic InRelease
Hit:13 http://archive.ubuntu.com/ubuntu bionic-backports InRelease
Hit:14 http://ppa.launchpad.net/deadsnakes/ppa/ubuntu bionic InRelease
Hit:15 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic InRelease
Reading package lists... Done
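
The download URL, the tarball name, and the extracted directory name all encode the same Spark and Hadoop versions, so switching versions means editing several lines consistently. Below is a minimal sketch of deriving them from one place. The helper name spark_archive_names is hypothetical, and the naming pattern is simply copied from the wget line above, so it is only known to hold for that release.

```python
def spark_archive_names(spark_version, hadoop_version):
    """Return (download URL, tarball name, extracted directory name).

    Assumption: the naming pattern is the one seen in the wget line of
    this post; other releases may be laid out differently.
    """
    dirname = f"spark-{spark_version}-bin-hadoop{hadoop_version}"
    tarball = f"{dirname}.tgz"
    url = f"http://archive.apache.org/dist/spark/spark-{spark_version}/{tarball}"
    return url, tarball, dirname


url, tarball, dirname = spark_archive_names("2.3.1", "2.7")
print(url)
print(dirname)
```

The directory name returned here is also what SPARK_HOME has to point at (under /content) in the next step.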

2. Tell the environment where Java and Spark live via environment variables

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.3.1-bin-hadoop2.7"
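
If either path is wrong, the failure only shows up later when Spark starts, which makes it harder to diagnose. Here is a small sketch that checks the two variables right away. The function name missing_homes is hypothetical, and it only checks that each directory exists, not that it contains a working Java or Spark.

```python
import os


def missing_homes(names=("JAVA_HOME", "SPARK_HOME")):
    """Return the variables that are unset or do not point to a directory."""
    problems = []
    for name in names:
        path = os.environ.get(name)
        if not path or not os.path.isdir(path):
            problems.append(name)
    return problems


# An empty list means both variables point at existing directories.
print("missing or invalid:", missing_homes())
```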

3. Import the required modules

If evaluating spark at the end prints output starting with "SparkSession - in-memory", things are apparently OK for now.

I still do not really understand findspark and SparkSession.

import findspark
findspark.init()

import pyspark

# SparkContext
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf=conf)

# SparkSession
from pyspark.sql import SparkSession

#spark = SparkSession.builder.getOrCreate() 
spark = SparkSession.builder.appName("My App").getOrCreate()

spark

SparkSession - in-memory

SparkContext

Spark UI

Version
v2.3.1
Master
local
AppName
My App
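
On what findspark.init() is for: as far as I can tell, its main job is to make the pyspark package bundled under SPARK_HOME importable by putting the right entries on sys.path. The sketch below is a rough approximation of that idea, not findspark's actual code. The function name spark_python_paths is hypothetical, and the python/ and python/lib/py4j-*.zip layout is an assumption about how the Spark binary distribution is organized.

```python
import glob
import os


def spark_python_paths(spark_home):
    """List the entries that would need to be on sys.path to import pyspark.

    Assumption: the distribution keeps pyspark under python/ and the
    py4j bridge as a zip file under python/lib/.
    """
    python_dir = os.path.join(spark_home, "python")
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*.zip")))
    return [python_dir] + py4j_zips


# In the notebook this would use the SPARK_HOME set in step 2; the fallback
# path here is the one from this post.
spark_home = os.environ.get("SPARK_HOME", "/content/spark-2.3.1-bin-hadoop2.7")
print(spark_python_paths(spark_home))
```

If this picture is right, it also explains why SPARK_HOME has to be set before findspark.init() is called.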

If you are going to work with DataFrames, run this as well.

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

from pyspark.sql import functions as fn

4. Check that the Spark Context can do something that looks like real processing

First, create a (silly little) text file for testing. The path below assumes Google Drive is already mounted at /content/drive.

cur_dir = "/content/drive/MyDrive/"
filename_txt = "testText.txt"
file_txt = cur_dir+filename_txt

# Create a text file for testing
text = """ start ...
this text is for test.
test for studying Pyspark.
this is a 4th line of the text.
And this is 5th.
this is txt to test how Pyspark work.
this line is the end of text.
"""

# The with block closes the file even if write() fails
with open(file_txt, "w") as f:
    f.write(text)

Then run a simple job on that file through the Spark Context.

# Create an RDD of strings (one element per line of the file)
lines = sc.textFile(file_txt)

# filter() transformation -> first() action
lines_include_pyspark = lines.filter(lambda line: "Pyspark" in line)
print("lines include pyspark : ", lines_include_pyspark.first())

lines include pyspark :  test for studying Pyspark.
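
The split into a "transformation" (filter) and an "action" (first) is the part that took me a moment: the transformation only describes the work, and nothing is computed until an action asks for a result. A plain-Python analogy, with no Spark involved, is a generator expression followed by next(). This is only an analogy under that assumption, not how Spark is implemented; the text is the same test text as above.

```python
text = """ start ...
this text is for test.
test for studying Pyspark.
this is a 4th line of the text.
And this is 5th.
this is txt to test how Pyspark work.
this line is the end of text.
"""

lines = text.splitlines()

# Like filter(): the generator expression only describes the work;
# no line has been examined yet.
matching = (line for line in lines if "Pyspark" in line)

# Like first(): asking for one element triggers evaluation and stops
# at the first line that matches.
first_match = next(matching)
print("lines include pyspark : ", first_match)
```

Note that the substring test is case-sensitive, so a line containing only "PySpark" or "pyspark" would not match.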

When you are done, shut down the Spark Context.

sc.stop()
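
If a cell fails halfway through, it is easy to forget the stop() call and leave the context running. One way to guard against that is a try/finally wrapper. The sketch below uses a hypothetical helper named stopping and a FakeContext stand-in so that it runs without Spark; it assumes only that the wrapped object has a stop() method, as sc does.

```python
from contextlib import contextmanager


@contextmanager
def stopping(resource):
    """Yield the resource and always call its stop() method afterwards."""
    try:
        yield resource
    finally:
        resource.stop()


class FakeContext:
    """Stand-in for a SparkContext so this sketch runs without Spark."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


fake = FakeContext()
try:
    with stopping(fake) as ctx:
        raise RuntimeError("simulated failure in the middle of a job")
except RuntimeError:
    pass

# stop() was called even though the body raised.
print(fake.stopped)
```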

Closing

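So, although this is not an environment where you get any benefit from distributed processing, I think it is now (probably) good enough for trying out the bare-minimum basic syntax on Google Colaboratory as self-study.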

I don't really understand it, but hey, it works, so good enough! (for now)

雑記 in hibernation

Sorting out my thoughts, and notes to self