Module: Concurrent Data Structures in Java or C++
Description
This module consists of an active lab, which we affectionately refer to as the 'spider lab', where students are given an initial 'sequential' version of code in which a 'web crawler' starts at an given input web page and scrapes its contents for links to other pages. Each new link found is added to a list of new pages to be scraped. When the first page is completed, that page is added to the completed list and the crawler goes to the new pageslist, removes the next page on it, and begins the scraping proces again. This continues until a certain number of pages have been 'scraped'. The students are given some information about how multiple 'threads' of crawlers could be used to extract pages from the new pageslist and work on them simultaneously, in the hopes that multiple crawlers might be able to keep up with all the work needed to be done. With threads of crawlers accessing the shared new pagesand completedlists, however, we need a mechanism to enable safe concurrent access to those data structures. Students experiment with using thread-safe concurrent data structures for the shared lists, along with learning how to create the threaded version of the code.
We have written the code for parsing the web pages and finding links-- much of the code needed to complete the spider's task is given to the students. When examining the code, they are directed to complete the task of writing a processPage() method, which processes a page using methods provided. Then they complete the task of crawling through pages using their processPage() method.
Materials
Optional pre-reading (Java, C++)
We sometimes introduce parallel computing concepts by having students read our "concept" module Intermediate Introduction to Parallel Computing.
Alternatively, we may introduce the concepts we need (e.g., the notion of a thread) as part of the course material.
Java
The Java code for this series of labs is a shared Dropbox file called ConcurrentDataStructures.jar.
Notes for Java eclipse users:
Create a new Java Project in eclipse-- you could call it ConcurrentDataStructures. Then import the code in the .jar file above into it (right-click on the project name, Import; General-Archive. Browse to find the jar file. Choose to import it into directory ConcurrentDataStructures/src.
When using eclipse and the Java version of this material, you need JUnit libraries installed on your build path for the Java Project. (right-click on project, build-path; add libraries, choose JUnit, JUnit3)
C++
The source code as used by students is contained in a "tarball" (gzipped tar archive) at www.stolaf.edu/people/rab/sigcse12/cds.tar.gz, and is also available on the MTL for purposes of this lab. The lab assignment sheet below indicates how to obtain and use that tarball.
Also, the document Introduction to STL containers (Microsoft Word 43kB Mar2 12) provides basic information about STL containers, in case they are new to your students.
Your Tasks
Java version
The first step for students is to work through the lab explaining the sequential, one crawler version of the code and complete the 'spider' (pdf version (Acrobat (PDF) 144kB Mar3 12)). The philosophy of this lab is that students can learn by reviewing existing code and complete portion where they are going to be using specific data structures that they have been studying.
Once the students have completed the initial one-spider case and have used the classic data structures available in Java, then they look at developing a threaded crawler with multiple 'spiders' by working on part 2 of the spider lab (pdf version (Acrobat (PDF) 85kB Mar3 12)).
There is a spot in this second lab where we designed a place for the instructor to explain how threads can make this problem more tractable and how to create shared data structures in Java. We use a powerpoint presentation for crawler design with threads, which we have shared for you from Dropbox. It is helpful to also let the students refer to it as they work on the rest of the lab.
Pedagogical Note: The second lab has a part where the students first try having their threads access the original non-thread-safe Java data structures from each spider thread. The concept here was to have them see that problems could occur with this code. In practice, this wasn't particularly successful because it is difficult to see from the printed debugging statements when something might be going wrong. However, when students compared each other's results they could see that they were not getting the same set of pages. In retrospect, we would likely skip that step in the future and have them use the concurrent structures immediately.
C++ version
This is a "guided discovery" laboratory exercise for students (and workshop participants!) to complete.
Note to Windows users:The C++ lab has been developed and taught on a Linux platform, and has not been used with Visual C++ and other Windows tools. Consequently, this laboratory exercise is strongly oriented towards Linux or Macintosh computational tools. We hope to produce a Windows-oriented version of this module during Summer 2012. (Let me know if you'd like to collaborate on this! Dick Brown, rab@stolaf.edu)
Curricular Use
This active module is designed for 2-3 hours of use in a lab setting for a course where data structures are being used.