Over the last couple of weeks, a few really exciting things happened:
1. I was named Andover Youth Services' Youth of the Week because of the virtual teaching initiatives I started there! I really appreciate that Energize Andover gave me the freedom to start running a virtual Python class with a group of 5 girls, an experience I eventually used to start virtual technology classes through the Youth Services.
2. While I recently began a summer internship, I have still been working on my Energize projects in my free time. Kate, the PhD student working with Energize through the BU URBAN program, helped me figure out the best way to visually represent the data using matplotlib. I have written a script that generates a PDF of the following visualizations:
I have also separated out the rooms likely to have sensor issues into their own spreadsheet based on certain conditions in the data. I am now making adjustments to the script to refine the visualizations a bit more.
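The PDF-generation and sensor-flagging steps can be sketched roughly like this. This is a minimal, hypothetical example, not my actual script: the room names, data, and the "big jump between readings" condition are all made up for illustration.

```python
# Sketch: save one matplotlib figure per room into a single PDF,
# then flag rooms whose readings look like sensor glitches.
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
import pandas as pd

data = pd.DataFrame({
    "Room A": [68, 69, 71, 70],
    "Room B": [65, 64, 99, 66],  # 99 looks like a sensor glitch
})

with PdfPages("report.pdf") as pdf:
    for room in data.columns:
        fig, ax = plt.subplots()
        data[room].plot(ax=ax, title=room)
        ax.set_ylabel("Temperature (F)")
        pdf.savefig(fig)  # one page per room
        plt.close(fig)

# Flag rooms whose readings jump implausibly (a made-up threshold).
suspect = [r for r in data.columns if data[r].diff().abs().max() > 20]
print(suspect)  # → ['Room B']
```

The flagged rooms could then be written to their own spreadsheet with `DataFrame.to_csv()` or similar.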
EDIT: I forgot to mention that I also started teaching a bit of matplotlib to the new recruits! I showed them my project and explained how it works.
Also, here are some sample images of what my visualizations will look like:
(Sample images omitted.)
A few important things happened over the last couple of weeks, so here's an update on all of them.
1. School is ending this week, so in order to give the new recruits a bit of a break, the Energize recruit class will meet only once every 3 weeks. However, the plan is for the students to work on their projects of interest in their small groups.
2. I finished testing the historical data version of the weekly report -- it now runs bug-free and is ready to use.
3. I will be working with a PhD student, Kate, through the BU URBAN program! We met yesterday to discuss how I can apply my project to a health context.
Kate is currently studying ways to use carbon dioxide as an indicator of ventilation quality, which is particularly relevant to places looking to safely reopen during the COVID-19 pandemic, as bad ventilation can increase the risk of disease spread. Therefore, we are currently looking into connecting that work to the reporting engine.
On Monday, I continued training the new recruits. Before class, I spent about an hour developing challenges for them to take on (and corresponding solutions) using the student database. (You can see them at my GitHub repo here.)
When class started, they were divided into two groups (one for water consumption and one for political data). Each group worked on challenges that were tailored to the type of data they wanted to work with.
As some students have Chromebooks and are using repl.it (an absolutely fantastic tool!), we tried to integrate the database into repl. When this didn't work, I asked each group to collaborate through repl, with the one member who could run code against the database testing whenever the group was ready. This worked much better, but figuring out the setup took more time than expected.
Back in September, I had figured out how to read from a sqlite file based on the example file in the database's Google Drive folder. In order to give the new recruits an exercise in "real-world" problem-solving (as opposed to a classroom-like environment), I gave them the same challenge to start, having them glean knowledge from the example rather than teaching it to them directly. Interestingly, both groups were getting errors even though their code was correct; we eventually realized that the databases had somehow become empty when they were copied into the repository.
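The sqlite exercise boils down to something like the sketch below. The `students` table and its columns are hypothetical stand-ins; the real schema lives in the example file in the Drive folder. (I build the database in memory here just so the snippet is self-contained.)

```python
# Minimal sketch: read a sqlite table into a pandas DataFrame.
import sqlite3
import pandas as pd

# Stand-in for the real .db file; a file path would work the same way.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, grade INTEGER)")
conn.executemany("INSERT INTO students VALUES (?, ?)",
                 [("Rishika", 9), ("Holly", 9), ("Madeline", 8)])
conn.commit()

# Read the whole table into a DataFrame. An empty result here is
# exactly the symptom both groups hit when the copied files were empty.
df = pd.read_sql_query("SELECT * FROM students", conn)
print(len(df))  # → 3
conn.close()
```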
I am really proud of both groups for adapting really well to both the challenge and the technical issues that came up along the way. The rest of the challenges involve using the pandas library -- I can't wait to see where they go with them next Monday! (Also, keep on the lookout for blog links next week!)
Today, I worked for about an hour on debugging the Weekly Report. My main objective was to fix the incorrect values in the Days with Problems column, and I'm happy to say that I succeeded in debugging this error!
Luckily, I realized that I still had the code for the all_data DataFrame (the one I had previously used to get the correct Days With Problems values), albeit commented out, so I started by uncommenting it. Once that was set up, I had to figure out exactly what was stored in all_data and how I was going to merge it with my new weekly_log table. Once I had made sense of the data, I cleaned up the merge from which the DataFrame originated. Finally, I used a series of groupby commands to isolate just the day for each problematic interval and count the number of unique days belonging to each room; this took a few tries to get right. I then merged the new DataFrame into the weekly_log, and after checking against my manually calculated test data, I concluded that the error had been fixed.
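The groupby-and-merge step looks roughly like this. The column names and data here are made up for illustration; the real all_data and weekly_log tables have many more columns.

```python
# Sketch: count unique days with problems per room, then merge
# the counts into the weekly log.
import pandas as pd

# Stand-in for all_data: one row per problematic interval.
all_data = pd.DataFrame({
    "room": ["101", "101", "101", "102"],
    "timestamp": pd.to_datetime([
        "2020-06-01 08:00", "2020-06-01 14:00",
        "2020-06-02 09:00", "2020-06-01 10:00",
    ]),
})

# Isolate just the day for each interval, then count unique days per room.
all_data["day"] = all_data["timestamp"].dt.date
days_with_problems = (all_data.groupby("room")["day"]
                      .nunique()
                      .rename("Days With Problems"))
print(days_with_problems["101"])  # → 2

# Merge the counts into the weekly log.
weekly_log = pd.DataFrame({"room": ["101", "102"]})
weekly_log = weekly_log.merge(days_with_problems.reset_index(),
                              on="room", how="left")
```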
Today, I worked for about an hour on debugging the Weekly Report.
I discovered that the issues with timestamps were actually human errors, not programmatic errors. However, I looked deeper into the discrepancies in the number of days with problems. This number is too high in the program's results because it counts any day on which the room has data, instead of only the days that actually have problems.
This should not happen because Task II filters out any intervals that don't have problems. At first, I thought that perhaps when I reference the old database in Task IV, I unknowingly bring back the days without problems. I tested this suspicion using the debugger and some strategically placed breakpoints. First, I tried breaking at the end of the "Task III" portion of the code, which led me to discover that the DataFrame at the end of Task III also contained the unproblematic intervals. This meant that the problem was not in Task IV at all -- it had to be earlier, since the data being aggregated in Task IV already included the unproblematic intervals.
That shouldn't have been possible, because Task II should filter them out before they go into the daily database in Task III... however, when I broke at Task II, I finally realized the issue.
Sometime in January, I had changed the central DataFrame of Task III to include all intervals, not just problematic ones, so that I could find the true highest and lowest values. However, I had not realized that the "Days With Problems" column would be aggregated incorrectly.
Now that I know the origin of the problem is not with the switch to the historical report, my task is to develop a solution that correctly counts the number of days with problems.
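The difference between the buggy count and the intended one can be sketched in a few lines. The columns and the has_problem flag are hypothetical simplifications of the real intervals data.

```python
# Sketch of the bug: counting any day with data vs. only days
# that contain a problematic interval.
import pandas as pd

log = pd.DataFrame({
    "room": ["101"] * 3,
    "day": ["Mon", "Tue", "Wed"],
    "temp": [71, 90, 72],
    "has_problem": [False, True, False],
})

# Buggy count: any day on which the room has data.
all_days = log.groupby("room")["day"].nunique()

# Correct count: only days containing a problematic interval.
problem_days = (log[log["has_problem"]]
                .groupby("room")["day"].nunique())

print(int(all_days["101"]), int(problem_days["101"]))  # → 3 1
```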
The last couple months have been pretty busy for me, so I haven't gotten around to posting in a while. (Don't worry -- I'll definitely be posting more regularly, especially as I start working on the report more often!)
As for the class I started for the 5 new girls I recruited into the program (more on that here), they all finished the course! After that, I taught them Version Control (for those who didn't know it) with Git and GitHub, how to use the PyCharm IDE and debugger, and actually guided them through the same data36 pandas tutorial (all 3 parts) that I used when I was first learning to use the library. In the last couple weeks, Dan, Ayush, Justin, and I all demoed our projects for them as examples of the kind of things they will be making.
Now, they are beginning projects in the areas of water consumption and political data. (They were placed into smaller groupings based on their interests. Rishika and Holly are working on water consumption data, while Madeline, Avanthika, and Sarah are working on political data.) They are also starting blogs similar to this one -- stay tuned for links to their blogs on this page!
I'm so proud of how far they have all come -- it is a super awesome achievement to learn a whole language and some of the basics of development, as well as start working on real-life projects, in less than 2 months!
In other news, the Weekly Report is currently undergoing testing. I used similar test rooms to the ones I used to test version 1, and manually calculated values to compare against the data I was getting from the report. Currently, it seems that there are some errors, which I am in the process of debugging. Specifically, there are discrepancies in the number of days with problems as well as in some of the timestamps.
Finally, the future of the Weekly Report is looking very bright. I have been lucky enough to get the opportunity for a really exciting partnership on the project, but I'll go into more detail on that once we get started in a couple of weeks.
Overall, I am really excited for the future of not only my own projects, but those of the new recruits!
On Saturday, I spent about an hour and a half setting up the test for the new Weekly Report and testing the warm and cold spreadsheet.
I adapted the test I had used before, which included test rooms with made-up temperature and CO2 values to reflect a variety of test cases, to fit into the historical report. Since the report produced was comprehensive, I decided to focus on temperature for that day -- everything checked out against the values I determined manually with a calculator (a process that took a decent amount of time even with the few data points I had -- that's why automation is so helpful).
Next, I will test the report on carbon dioxide values and then start deploying to the server. This new version will only make use of cron for the fifteen-minute logging and the two programs (task_zero and generate_historical_report) run at the end of each week. Additionally, since school is closed, the values collected will not be meaningful; they are simply a test of the capabilities of this new report.
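The cron schedule described above might look something like the following. The paths, script names, and exact times are placeholders, not the real deployment:

```
# every 15 minutes: the logging program
*/15 * * * * python3 /path/to/logger.py
# end of each week (Sunday night): the two weekly programs
50 23 * * 0 python3 /path/to/task_zero.py
55 23 * * 0 python3 /path/to/generate_historical_report.py
```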
I also have a bit of functionality to add to the final piece of the new report, based on what I was told by Facility members in January. In the automated email, I should include the top 5 or so rooms that need attention, so that the Facility members can look at them. This is an easily reachable goal, as it simply requires calling DataFrame.head() on each sorted DataFrame to return its top 5 rows.
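That piece is essentially a sort followed by head(). The column name and data below are made up; the real report DataFrames are larger.

```python
# Sketch: pick the top 5 rooms that most need attention.
import pandas as pd

report = pd.DataFrame({
    "room": ["101", "102", "103", "104", "105", "106"],
    "days_with_problems": [1, 5, 3, 0, 4, 2],
})

top_rooms = (report.sort_values("days_with_problems", ascending=False)
             .head(5))  # 5 worst rooms, for the automated email
print(top_rooms["room"].tolist())  # → ['102', '105', '103', '106', '101']
```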
Recently, I had been thinking about ways we can get more people into the Energize program (since I was previously the only member in 10th grade or below, as well as the only girl). Since everyone has a lot of unexpected extra time due to COVID-19, I recruited some girls I know and set up an online "class". This week, I began teaching them about Python and Data Science so that they would be prepared to start working with Energize.
About a week ago, I talked to Mr. Navkal and began recruitment. I ended up with a group of 5 girls, all in eighth and ninth grade, eager to learn or review Python and join Energize! Since then, we have had three class sessions over Zoom. As a syllabus, we have been using Codecademy's Learn Python 3 course. I'm excited to continue running the class and see where this goes!
On Wednesday and Thursday, I spent a total of about 45 minutes on the historical report.
I mainly spent this time integrating Task Three (the creation of the "daily" reports) into the main program. I also separated out what I had from Task One of the old report to create the logging program, which is now a standalone program that will run every 15 minutes.
In future sessions, I need to more comprehensively test Task Three to make sure the data it is producing is accurate and bug-free. I also need to integrate Task Four as the final step in creating a weekly report based on historical data.
Yesterday, I spent about an hour working on the historical report. After debugging an issue with task 0, I finished integrating task 2 into the new system.
When creating a report from historical data, you need a lot more filtering than when running the numbers in real time -- you first have to filter by the week itself (selecting the week you want to report on), and then by day of the week. I had used a dictionary to successfully filter out which days were school days (a basic implementation that assumes every weekday is a school day -- I still have to get access to the school calendar, somehow scrape those dates, and add the updated values into the dictionary). But after implementing most of task II and testing the results, I realized I had never actually filtered out which week I needed to select.
For now, I used a simple input function to determine the start date of the selected week. (Hopefully, this will evolve into an interactive front-end where users can select the day and the parameters.) Once I had the start date, I added 7 days to make the end date, looped through all the days in between, and only set those weekdays to true in the dictionary.
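The week-selection step can be sketched like this, using the simple "every weekday is a school day" rule described above. The start date is hard-coded here for illustration; in the real script it comes from the input function.

```python
# Sketch: build the school-day dictionary for one selected week.
from datetime import date, timedelta

start = date(2020, 4, 6)  # a Monday; would come from input() in practice
school_days = {}
for i in range(7):
    day = start + timedelta(days=i)
    school_days[day] = day.weekday() < 5  # Mon-Fri → True, weekend → False

print(sum(school_days.values()))  # → 5
```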
After this, I finished integrating Task 2 into the generate_historical_report program. Right now, it logs a TemperatureProblemDatabase and a CarbonDioxideProblemDatabase the same way the old task_two did. (Currently it performs Task 2 once for each day; eventually Task 3 should run at the end of the loop as a form of "daily" aggregation, so that data is saved in between days.)