Wednesday, October 7, 2020

PySpark - Read Shapefile from S3 and Mount S3 as a File System


Advantages of Mounting Amazon S3 as a File System

Mounting an Amazon S3 bucket as a file system means that you can use all your existing tools 
and applications to interact with the bucket, performing read/write operations on files and 
folders. It also lets multiple Amazon EC2 instances concurrently mount and access the same 
data in Amazon S3, just like a shared file system.
Why use an Amazon S3 file system? Any application interacting with the mounted drive 
doesn't have to worry about transfer protocols, security mechanisms, or Amazon 
S3-specific API calls. In some cases, mounting Amazon S3 as a drive on an application 
server can make creating a distributed file store extremely easy.
For example, when building a photo upload application, you can have it store data at a fixed 
path in the file system and, when deploying, mount an Amazon S3 bucket at that fixed path. 
The application then writes all files to the bucket without any Amazon S3 integration at the 
application level. Another major advantage is that legacy applications can scale in the cloud 
with no source code changes: to use an Amazon S3 bucket as a storage backend, the 
application only needs to be configured with the local path where the bucket is mounted. 
This technique is also very helpful when you want to collect logs from various servers in a 
central location for archiving.
After mounting S3 as a local file system, you can use GeoPandas, Pandas, or other libraries 
to access files through an ordinary path, for example:
Location = geopandas.read_file("/dbfs/mnt/bucket-name/geofactor/data/shapefilename.shp")
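
The /dbfs/mnt/... path above comes from Databricks. Here is a minimal mount sketch, 
assuming a Databricks cluster whose instance profile (or stored keys) already grants access 
to the bucket; the bucket name and mount point below are placeholders:

dbutils.fs.mount(
    source="s3a://bucket-name",      # placeholder bucket
    mount_point="/mnt/bucket-name"   # visible as /dbfs/mnt/bucket-name on the driver
)

Once created, the mount persists across clusters in the workspace, so the bucket stays 
reachable through the local-style path.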
In my case, however, we had limitations on mounting S3 as well as permission issues 
(only Spark could read from the S3 bucket). The data is a shapefile, which consists of .dbf, 
.prj, .shp, and .shx files that must be read together, so I zipped them into a single archive. 
And Spark cannot read this zip file directly.

The workaround, instead of mounting S3, is to read the zip file with boto3 and wrap its 
bytes in BytesIO():

from io import BytesIO
from zipfile import ZipFile

# zip_obj is the S3 object handle, e.g. boto3.resource("s3").Object(bucket, key)
buffer = BytesIO(zip_obj.get()["Body"].read())
zip_file = ZipFile(buffer)  # buffer already holds the bytes; no second BytesIO needed
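
Putting it all together, here is a minimal end-to-end sketch. The bucket and key names are 
hypothetical placeholders, and it assumes the archive contains the .shp, .shx, .dbf, and .prj 
members side by side:

import tempfile
from io import BytesIO
from zipfile import ZipFile

import boto3
import geopandas

s3 = boto3.resource("s3")
zip_obj = s3.Object("my-bucket", "geofactor/data/shapefile.zip")  # hypothetical bucket/key

# Download the archive into memory and open it as a zip.
buffer = BytesIO(zip_obj.get()["Body"].read())
with ZipFile(buffer) as archive:
    # Extract every member so GeoPandas can resolve the .shx/.dbf/.prj
    # sidecar files sitting next to the .shp.
    with tempfile.TemporaryDirectory() as tmp_dir:
        archive.extractall(tmp_dir)
        shp_name = next(n for n in archive.namelist() if n.endswith(".shp"))
        gdf = geopandas.read_file(f"{tmp_dir}/{shp_name}")

print(gdf.head())

From here the attribute data can be handed to Spark if needed, for example by converting 
the geometry column to WKT first (gdf.geometry.to_wkt()) and then calling 
spark.createDataFrame().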




