Assignment 0 Due Wednesday Jan 31, 11:00am ET

Using AWS Academy, EC2, and Linux Shell

FAQs

If you have trouble signing up AWS Academy due to the term item not showing up, it’s because of the browser compatibility issue. Try updating your Firefox and then try logging in using Firefox. (I will contact the AWS Academy support to see how this can be addressed as this is obviously a bug!)
If you work with WSL, make sure to store your SSH key file under your Ubuntu home directory. Windows FS does not work well with chmod 400. How to locate your /home/user directory is by typing cd and that will automatically bring you to your home dir.
Accessing the SSH key file stored under a cloud drive (OneDrive, Dropbox, or Google Drive) might have permission issue. If that happens, move vockey.pem to a local directory not under any cloud-managed folder and that should fix the permission issue.
For Windows user, we recommend installing WSL (the Windows Subsystem for Linux). You can find the installation documentation here. With WSL, you can use the familiar Linux commands for SSH, file operations, and all other kinds of shell-related tasks.
To copy your submission file from remote EC2 to local, you can use scp (SSH copy). To do so:
```
scp -i vockey.pem ubuntu@YOUR_ADDRESS:PATH_TO_YOUR_FILE .
```
Note that calling scp from EC2 to copy file out will not work as your computer is behind NAT so it is almost not possible to be directly addressable. You need to run scp from your computer to copy file in from a remote EC2, which is addressable.

Overview

You will do all your assignments for DS5110/CS5501 using AWS Academy. Through AWS Academy, you can get access to a wide range of computing, storage, and network resources on Amazon Web Services (AWS). If you haven’t worked with AWS Academy before, you need to go through the AWS Academcy onboarding process.

This assignment will walk you through the AWS Academy registration process and show you how to use AWS and Linux EC2 from AWS Academy to setup an environment for upcoming programming assignments.

Learning outcomes

After completing this assignment, you should be able to:

Create and SSH into an EC2 virtual machine (EC2) instance.
Create a remote Jupyter Notebook server running on AWS EC2.
Get familiar with basic Linux shell commands (browsing file system, installing packages, etc.).

Part 0: Register an AWS Academy account

You will receive a course invitation email sent to your virginia.edu email. You will need to register with AWS Academy Canvas (note this is different from UVA Canvas) before you can participate in the AWS Academy Learner Lab. To register, click on Get Started in the invitation email. It will bring you to the Welcome Aboard web page. Enter a password under your email, select Eastern Time (US & Canada) as the Time Zone, and click on Register.

After successfully registered, you will have access to your AWS Academy portal. You will see the AWS Academy Learner Lab under Dashboard. Click on Accept to accept the invitation to join AWS Academy Learner Lab if you see a message at the very top of the web page.

Click on AWS Academy Learner Lab at the Dashboard to enter. Then click on Modules to view the available modules. Before start using AWS resources, I highly encourage you to go through the AWS Academy Learner Lab Student Guide to get you familiar with how to use Learner Lab (including how to start a lab, how to track your credit, etc.).

To start using the AWS resources, click on Launch AWS Academy Learner Lab under Module, which will bring you to the Terms and Conditions. Click on I agree and enter the lab session.

AWS Academy Setting

The lab session is a web terminal interface with limited functionality. To start the lab, click on Start Lab at the top of the web terminal interface. The starting process should take a few minutes (if this is the very first time you start the lab, as in the background AWS Academy creates a temporary AWS account for you under the Learner Lab). At the same time, you can see your available AWS cloud credit with a total amount of $100 and how much you have used. When the signal light at the top left corner turns green, it indicates the lab has started.

AWS Academy Setting

Click on the green signal light to access your AWS Console Home web page.

A started lab session has a session duration of up to four hours (see the timer 04:00 in the figure above). When the lab session timer runs to 0:00, the session will automatically end, but any data and resources that you created in the AWS account will be retained. If you later launch a new session (for example, the next day), you will find that your work is still in the lab environment. Running EC2 instances will be stopped and then automatically restarted the next time you start a session.

IMPORTANT: Monitor your lab budget in the lab interface above. Whenever you have an active lab session, the latest known remaining budget information will display at the top of this screen. This data comes from AWS Budgets which typically updates every 8 to 12 hours. Therefore the remaining budget that you see may not reflect your most recent account activity. If you exceed your lab budget your lab account will be disabled and all progress and resources will be lost. As a best practice, we highly recommend frequently pushing your progress to online repositories such as GitHub to avoid losing data. Therefore, it is important for you to manage your spending. When you use GitHub or any other online repository service, make sure to create a private repository; DO NOT share any of your code to other students and the internet.

At the AWS Console Home page, click on Services at the top left corner and click on Compute to view all the available computing services that you may use. Click on EC2 to start creating new EC2 VM (virtual machine) instances.

Part 1: Create and access EC2 instances

Step 1: Create a name and choose an OS image for your VM

Under EC2, click on Launch instances to start creating a new VM. Type a name to label your VM under Name and tags. For example, name your first instance vm0.

We recommend using Ubuntu Server 22.04 LTS as the OS image and 64-bit (x86) as the architecture. You can also specify the number of instances to launch at this time. To start off, choose 1. Later for Assignment 1 choose 2 and for Assignment 2 choose 5.

EC2 instances

Step 2: Choose an EC2 instance type

Next choose an EC2 instance type. To test, you can always create one or multiple t2.micro or t1.micro, both of which are free tier eligible, meaning the resource is free of charge. There are, however, a limited range of EC2 instances that you can choose from. We recommend using the t3.large instance type that comes with 2 vCPU cores and 8 GB of memory.

EC2 instances

Step 3: Choose an SSH key pair

What is important when creating new VMs is that you choose a key pair for SSH login. Under Key pair (login), click on the drop down menu and select vockey (type rsa).

EC2 instances

Step 4: Check HTTP/HTTPS options

Under Network settings, you may check Allow HTTPs traffic from the internet and Allow HTTP traffic from the internet so that you can access the web servers hosted on your EC2 VM. (Apache Spark comes with a web-based dash, and would require HTTP/HTTPS traffic.)

EC2 instances

Step 5: Configure storage

You may also increase the storage capacity of the Root volume under Configure storage. By default you will be allocated a small 8-GB root disk. We recommend increasing it to 100 GB so that you have sufficient disk storage capacity for dependency installation and storing datasets. Optionally, you can also add a new EBS volume of 100GB just in case the 100GB root volume runs out of capacity (it runs out of space very fast considering the installation of quite a few large software packages and datasets).

EC2 instances

Step 6: Configure the subnet availability zone

We recommend launching all your EC2 instances in the same availability zone. Any US east zones should be fine. This can be configured in Network settings. Once choosing one, always stick with it when creating new EC2 instances. This is to guarantee the best network performance among your EC2 instances. For example, you may choose to use a subnet within an availability zone of us-east-1a and stick with it.

EC2 instances

Step 7: Launch the EC2 instance

After finalizing the EC2 VM configuration, click on the Launch instance button on the right side to launch the configured VM. It may take a few minutes to start the VM.

Step 8: Download the SSH private key

Now go back to the Learner Lab web page. This is where you can download the SSH key. Click on the AWS Details tab, you will see the information you need to login to your computing resources. Under SSH key, click on Download PEM to download the SSH key to your local computer. It comes with a default name of labsuser.pem.

AWS Academy Setting

Open a terminal (if you use a MacBook, open the Terminal software; if you use a Windows, we recommend installing WSL), locate your private key file that you have just downloaded locally, change its name to vockey.pem and update its permission, if necessary, to ensure your key is not publicly viewable:

% mv labsuser.pem vockey.pem
% chmod 400 vockey.pem

Then, connect to your instance using the following command:

% ssh -i "vockey.pem" ubuntu@[public_IPv4_DNS_address_of_your_EC2_instance]

To view the connection instruction, click on the instance ID from your AWS Console, then click on Connect at the top right corner of the page to view the public DNS address of your EC2 instance. An example name looks something like this: ubuntu@ec2-11-222-3-190.compute-1.amazonaws.com, where ubuntu is your default username on any EC2 instances that you created.

AWS Academy Setting

Great! So far you should be able to SSH login to a remote Linux VM computer that you have just created on AWS running on somewhere in some North Virginia datacenter!

Part 2: Setting Up an AWS-EC2-hosted Jupyter Notebook Service

There are two ways of writing/editing PySpark programs on a remote cloud server:

Use a shell text editor of your choice (e.g., VIM, nano, Emacs, etc.).
Launch a Jupyter Notebook and directly write PySpark code there.

This tutorial is provided in case you need to use Jupyter Notebook for Assignment 1, though you can complete the assignment without it.

Step 0: Install Pip

First, install pip3 (the package installer for Python).

$ sudo apt update
$ sudo apt install -y python3-pip
$ which pip3

The third command above, which pip3, should output something like /usr/bin/pip3 if it’s successfully installed.

Step 1: Install Jupyter Notebook

To install Jupyter Notebook:

$ pip3 install notebook

pip3 by default will install everything in /home/ubuntu/.local/bin on your EC2 instance. This path, however, is not included by the environment variable $PATH, so your shell will not be able to locate the installed programs through pip3. To fix this, you will need to include the path in $PATH by running the following command:

$ source ~/.profile
$ which jupyter

Step 2: Deploy a Jupyter Notebook server

SSH into your EC2 instance where you’ve just installed the Jupyter Notebook (say the scheduler instance vm1) using the following command:

$ ssh -i "vockey.pem" ubuntu@<public_IPv4_DNS_address_of_your_EC2_instance> -L 8000:localhost:8888

The -L option is to forward any connections to the given TCP port 8000 on your local client host to the given remote host (the specified EC2 instance of <public_IPv4_DNS_address_of_your_EC2_instance>) and port (8888), or Unix socket, on the remote side.

Once SSH’ed in, start the Jupyter Notebook server process using command jupyter notebook. You will see some output information generated by the launched Notebook server process:

To access the notebook, open this file in a browser:
    file:///home/ubuntu/.local/share/jupyter/runtime/nbserver-3233-open.html
Or copy and paste one of these URLs:
    http://localhost:8888/?token=557f06135dd4c77bc5267c88abca55db1672f741ff7631a4
 or http://127.0.0.1:8888/?token=557f06135dd4c77bc5267c88abca55db1672f741ff7631a4

On your local client machine, open a web browser and go to localhost:8000 and copy paste the token when it prompts. Now you can start writing Python code using your Notebook GUI from your browser.

Step 3: Work on Python programming using your cloud-native Notebook

Create a new Notebook and select the default Python 3 (ipykernel) Notebook kernel. Than you are good to go.

Notebook Demo

Part 3: Shell Script

If you have successfully cleared all previous parts and steps, congratulations! Now you have a cloud VM with basic programming environment setup.

In this part, you will practice some basic shell command / shell script skill by completing the following tasks.

Task 1: Check Operating System version, hardware configuration, and your Jupyter Notebook install

Create some system and environment information using basic shell commands. You should use man the Linux manual command to print what a command does. For example, if you type man lscpu it will tell you what lscpu is. Save the outputs to submit:

$ cat /etc/os-release > os.txt
$ lscpu > cpu.txt
$ lsmem > mem.txt

Then, create some more files so we can check your install of pip3 and Jupyter Notebook:

$ pip3 --version > pip3.txt
$ jupyter --version > jupyter.txt

Task 2:

This zip file contains two CSV files with StackOverflow post data. You need to download the dataset to your EC2 VM instance:

pip3 install gdown
gdown https://drive.google.com/uc?id=1vlczaocHyghEW5Q-6WprqB2nnna_wrLE

In the above commands, the first command is to install the downloading tool gdown so that you can use it to download large files from Google Drive.

Try running some shell commands to extract the contents and print how many lines contain the text "python".

Now, combine these commands in a count_python.sh script file; the script should have a shebang line so that the following runs with bash:

$ ./count_python.sh

HINTS: You can use unzip to extract the CSV files from the zip archive. If unzip is not installed by default, you should run sudo apt install -y unzip to install it on your EC2 instance.

Deliverables

You should submit a tar.gz file to Canvas, which follows the naming convention of LastName_FirstName_ComputingID_A0.tar.gz. The submitted file should contain the following: os.txt, cpu.txt, mem.txt, pip3.txt, jupyter.txt, count_python.sh.

HINTS: You can use the following command to create a tar.gz file: tar -czvf [submission_file_name] os.txt cpu.txt mem.txt pip3.txt jupyter.txt count_python.sh.

Autograder

You can use supplied autograding test suite to test your work and environment setup. There are two .py files you need to download: autograde.py (https://tddg.github.io/ds5110-cs5501-spring24/assets/datasets/autograde.py) and tester.py (https://tddg.github.io/ds5110-cs5501-spring24/assets/datasets/tester.py). Use wget to download these two files from the links above. Run python3 autograder.py to test your work. The test result will be written to a test.json file in your working directory. This will probably be your grade, but autograders are imperfect, so we reserve the right to deduct further points. Some cases are when students acheive the correct output by hardcoding, or not using an approach we specifically requested.

Assignment 0 Due Wednesday Jan 31, 11:00am ET

Using AWS Academy, EC2, and Linux Shell

FAQs

Overview

Learning outcomes

Part 0: Register an AWS Academy account

Part 1: Create and access EC2 instances

Step 1: Create a name and choose an OS image for your VM

Step 2: Choose an EC2 instance type

Step 3: Choose an SSH key pair

Step 4: Check HTTP/HTTPS options

Step 5: Configure storage

Step 6: Configure the subnet availability zone

Step 7: Launch the EC2 instance

Step 8: Download the SSH private key

Step 9: Login to the EC2 instance through SSH

Part 2: Setting Up an AWS-EC2-hosted Jupyter Notebook Service

Step 0: Install Pip

Step 1: Install Jupyter Notebook

Step 2: Deploy a Jupyter Notebook server

Step 3: Work on Python programming using your cloud-native Notebook

Part 3: Shell Script

Task 1: Check Operating System version, hardware configuration, and your Jupyter Notebook install

Task 2:

Deliverables

Autograder