Wednesday, October 7, 2020

PySpark - Read Shapefile from S3 and Mount S3 as a File System


Advantages of Mounting Amazon S3 as a File System

Mounting an Amazon S3 bucket as a file system means that you can use all your existing tools 
and applications to interact with the bucket, performing read/write operations on files and 
folders. It also lets multiple Amazon EC2 instances concurrently mount and access the same 
data in Amazon S3, just like a shared file system.
Why use an Amazon S3 file system? Any application interacting with the mounted drive 
doesn't have to worry about transfer protocols, security mechanisms, or Amazon 
S3-specific API calls. In some cases, mounting Amazon S3 as a drive on an application 
server can make creating a distributed file store extremely easy.
For example, when building a photo upload application, you can have it store data at a fixed 
path in the file system and, when deploying, mount an Amazon S3 bucket at that fixed path. 
The application then writes all files to the bucket without any Amazon S3 integration at the 
application level. Another major advantage is that legacy applications can scale in the cloud 
with no source code changes: to use an Amazon S3 bucket as a storage backend, the 
application only needs to be configured with the local path where the bucket is mounted. 
This technique is also very helpful when you want to collect logs from various servers in a 
central location for archiving.
After mounting S3 as a local file system, you can use GeoPandas, Pandas, or other libraries 
to access files through an ordinary path, for example:
Location = geopandas.read_file("/dbfs/mnt/bucket-name/geofactor/data/shapefilename.shp")
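
The /dbfs/mnt/... path above comes from Databricks. Here is a minimal mount sketch, 
assuming a Databricks cluster whose instance profile (or stored keys) already grants access 
to the bucket; the bucket name and mount point below are placeholders:

dbutils.fs.mount(
    source="s3a://bucket-name",      # placeholder bucket
    mount_point="/mnt/bucket-name"   # visible as /dbfs/mnt/bucket-name on the driver
)

Once created, the mount persists across clusters in the workspace, so the bucket stays 
reachable through the local-style path.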
In my case, however, we had limitations on mounting S3 as well as permission issues 
(only Spark could read from the S3 bucket). The data is a shapefile, which consists of .dbf, 
.prj, .shp, and .shx files that must be read together, so I zipped them into a single archive. 
And Spark cannot read this zip file directly.

The workaround, instead of mounting S3, is to read the zip file with boto3 and wrap its 
bytes in BytesIO():

from io import BytesIO
from zipfile import ZipFile

# zip_obj is the S3 object handle, e.g. boto3.resource("s3").Object(bucket, key)
buffer = BytesIO(zip_obj.get()["Body"].read())
zip_file = ZipFile(buffer)  # buffer already holds the bytes; no second BytesIO needed
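
Putting it all together, here is a minimal end-to-end sketch. The bucket and key names are 
hypothetical placeholders, and it assumes the archive contains the .shp, .shx, .dbf, and .prj 
members side by side:

import tempfile
from io import BytesIO
from zipfile import ZipFile

import boto3
import geopandas

s3 = boto3.resource("s3")
zip_obj = s3.Object("my-bucket", "geofactor/data/shapefile.zip")  # hypothetical bucket/key

# Download the archive into memory and open it as a zip.
buffer = BytesIO(zip_obj.get()["Body"].read())
with ZipFile(buffer) as archive:
    # Extract every member so GeoPandas can resolve the .shx/.dbf/.prj
    # sidecar files sitting next to the .shp.
    with tempfile.TemporaryDirectory() as tmp_dir:
        archive.extractall(tmp_dir)
        shp_name = next(n for n in archive.namelist() if n.endswith(".shp"))
        gdf = geopandas.read_file(f"{tmp_dir}/{shp_name}")

print(gdf.head())

From here the attribute data can be handed to Spark if needed, for example by converting 
the geometry column to WKT first (gdf.geometry.to_wkt()) and then calling 
spark.createDataFrame().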




