Analyzing Patterns in Student SQL Solutions
Structured Query Language (SQL), the standard language for relational database management systems, is an essential skill for software developers, data scientists, and other professionals who need to interact with databases. SQL is highly structured, and learners can acquire it in diverse ways. However, despite SQL's importance to these fields, little research has been done to understand how students learn SQL as they work on homework assignments. The aim of this project is to analyze students' SQL submissions to homework problems in a Database course.
Via Levenshtein Edit Distance
The first stage of this project focused on computing the Levenshtein edit distance between each of a student's submissions and that student's final submission, to understand how students reached their final solution and how they overcame obstacles along the way. We developed a system that visualizes the edit distances between students' submissions to a SQL problem, enabling instructors to identify interesting learning patterns and approaches. These findings will help instructors target their instruction at difficult SQL areas in the future and help students learn SQL more effectively.
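As a rough illustration of the measurement described above, the following sketch computes the character-level Levenshtein distance from each attempt to the last one. The function names and the sample queries are made up for this example; the project's own system may tokenize or normalize queries differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def distances_to_final(submissions: list[str]) -> list[int]:
    """Edit distance from every submission to the last (final) one."""
    final = submissions[-1]
    return [levenshtein(s, final) for s in submissions]


# Hypothetical sequence of attempts by one student.
attempts = [
    "SELECT name FROM student",
    "SELECT name FROM students",
    "SELECT name FROM students WHERE gpa > 3",
]
print(distances_to_final(attempts))  # [15, 14, 0]
```

A sequence that shrinks steadily toward zero suggests incremental refinement, while large jumps suggest the student restarted with a different approach.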
Via Sequence Alignment Algorithms
In the second part of the project, we are using local and global sequence alignment algorithms to identify patterns of similar approaches students used to solve a given SQL assignment. We started by producing a heatmap that shows the differences and similarities between all the submissions a student made for a given problem. We are currently analyzing heatmaps of different students' submissions to the same SQL problem to identify and categorize similar approaches.
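The sketch below shows one way such a pairwise comparison could be computed: a Needleman-Wunsch (global) or Smith-Waterman (local) alignment score over whitespace-separated tokens, collected into a matrix that a plotting library could render as a heatmap. The scoring values, the tokenization, and the function names are assumptions for illustration, not the project's actual settings.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Alignment score of two token lists.

    local=False: Needleman-Wunsch (global) score.
    local=True:  Smith-Waterman (local) score, i.e. the best-scoring
                 pair of contiguous regions.
    """
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                score = max(score, 0)  # a local alignment may restart anywhere
                best = max(best, score)
            curr.append(score)
        prev = curr
    return best if local else prev[-1]


def similarity_matrix(submissions, local=False):
    """Pairwise alignment scores between all submissions (heatmap input)."""
    tokens = [s.split() for s in submissions]
    return [[alignment_score(x, y, local=local) for y in tokens] for x in tokens]


# Hypothetical submissions to one problem.
subs = [
    "SELECT name FROM students",
    "SELECT name FROM students WHERE gpa > 3",
    "SELECT id FROM courses",
]
matrix = similarity_matrix(subs)
```

With these scores, higher values mean more similar submissions; the diagonal holds each submission's token count because every submission aligns perfectly with itself.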

System Overview

The X and Y axes represent the student's submission numbers. The darker the color, the more similar the submissions are.
Learning Next-Generation Databases


TriQL’s Query Builder and Query Result Interfaces. The QBI allows users to construct the query using a user-friendly GUI, and the Query Result Interface shows the query result in its native database

Breakdown of errors by SQL concept evaluated

Cypher Error submissions per Concept

Breakdown of the most common JavaScript and MongoDB errors by question

TriQL System Architecture: the Query Builder lets users build queries using a GUI; the Intermediate Query Generator converts user queries to Datalog; the Schema and Query Translator generates the schema of the three databases and converts the Datalog query into SQL, Cypher, and MongoDB queries
With more organizations relying on data to make crucial business decisions, database systems have become essential in managing financial, medical, and scientific data. Consequently, managing databases has become a necessary skill for programmers, data analysts, and data scientists to accelerate scientific inquiry and business decision-making. However, with the abundance of databases supporting various data models, such as relational, graph, and document-oriented, beginners often find it challenging to decide which database model to learn. Experienced developers also struggle to learn new database models, because different models have different data structures and query languages. This project aims to develop student and instructor tools that facilitate the learning and teaching of next-generation database systems.
TriQL: A tool for learning relational, graph and document-oriented database programming
This project introduces TriQL, a system for helping novices learn the structures (schemas) and query languages of three major database systems: MySQL (a relational database queried with SQL, the Structured Query Language), Neo4j (a graph database), and MongoDB (a document-oriented database). TriQL offers learners a graphical user interface to design and execute a query against a generic database schema without requiring them to have any database programming experience. TriQL follows an interactive approach to learning new database models, supporting a dynamic and agile learning environment that can be easily integrated into database labs and homework assignments.
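To give a feel for what "one query, three languages" means, here is a minimal sketch that renders a single filter query as SQL, Cypher, and a MongoDB shell call. The `SimpleQuery` class, its fields, and the `Person` example are hypothetical and are not TriQL's internal representation or translator.

```python
import json
from dataclasses import dataclass


@dataclass
class SimpleQuery:
    """Hypothetical intermediate form: return one field of an entity,
    filtered by a single comparison."""
    entity: str
    returns: str
    filter_field: str
    op: str      # one of ">", "<", "="
    value: int


# MongoDB comparison operators corresponding to the symbols above.
MONGO_OPS = {">": "$gt", "<": "$lt", "=": "$eq"}


def to_sql(q: SimpleQuery) -> str:
    return f"SELECT {q.returns} FROM {q.entity} WHERE {q.filter_field} {q.op} {q.value};"


def to_cypher(q: SimpleQuery) -> str:
    return (f"MATCH (n:{q.entity}) WHERE n.{q.filter_field} {q.op} {q.value} "
            f"RETURN n.{q.returns}")


def to_mongo(q: SimpleQuery) -> str:
    criteria = json.dumps({q.filter_field: {MONGO_OPS[q.op]: q.value}})
    projection = json.dumps({q.returns: 1})
    return f"db.{q.entity}.find({criteria}, {projection})"


query = SimpleQuery("Person", "name", "age", ">", 30)
for render in (to_sql, to_cypher, to_mongo):
    print(render(query))
```

Seeing the three renderings side by side highlights how the same intent maps onto tables, labeled nodes, and collections.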
A Quantitative Analysis of Student Solutions to SQL, Graph and Document Database Assignments
In this project, we analyze students’ errors in homework submissions of queries written in SQL, Cypher (the query language for Neo4j, the most prominent graph database), and MongoDB (a document-oriented database). Based on tens of thousands of student submissions from homework assignments in the database course I teach here at the University of Illinois, we provide a quantitative analysis of students’ learning when solving database problems, and we suggest a further improvement to the classification of syntactic errors.
Modeling The Content Structure of MOOCs
The number of Massive Open Online Courses (MOOCs) is increasing rapidly, providing students with tremendous opportunities to improve their knowledge and careers. However, most MOOC platforms consider the course as the smallest unit of content delivery, wasting a valuable opportunity for learners to develop customized content tailored to their interests. This project aims to develop the infrastructure needed for offering customized MOOC courses. Our approach is to mine and model the content of existing MOOC courses so that we can build knowledge structures that facilitate customized learning.
Unsupervised Approach for Modeling Content Structures of MOOCs
In the first part of the project, we introduced an unsupervised approach to build the precedence graph of similar MOOCs, where nodes are clusters of lectures with similar content, and edges depict alternative precedence relationships. Our approach clusters similar lectures using the PCK-Means clustering algorithm, which incorporates pairwise constraints (Must-Link and Cannot-Link) into the standard K-Means algorithm. To build the precedence graph, we link the clusters according to the precedence relations mined from current MOOCs. Experiments over real-world MOOC data show that PCK-Means with our proposed pairwise constraints outperforms the K-Means algorithm on both the Adjusted Mutual Information (AMI) and Fowlkes-Mallows (FMI) scores.
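The idea of adding pairwise constraints to K-Means can be sketched as follows: the assignment step charges a penalty for each Must-Link or Cannot-Link constraint that a candidate assignment would violate. This is a simplified sketch under stated assumptions: a single penalty weight `w`, caller-supplied initial centers, and toy one-dimensional points. It is not the configuration or the constraint set used in the project.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def pck_means(points, k, must_link, cannot_link, w=1.0, init=None, iters=20):
    """Simplified PCK-Means: K-Means whose assignment cost adds a penalty w
    for every violated must-link or cannot-link constraint."""
    centers = [points[i] for i in (init or range(k))]
    # Start from a plain nearest-center assignment.
    labels = [min(range(k), key=lambda h: sq_dist(p, centers[h])) for p in points]
    for _ in range(iters):
        changed = False
        for i, p in enumerate(points):
            def cost(h):
                # Must-link partner currently outside cluster h: violation.
                penalty = sum(w for a, b in must_link
                              if i in (a, b) and labels[b if a == i else a] != h)
                # Cannot-link partner currently inside cluster h: violation.
                penalty += sum(w for a, b in cannot_link
                               if i in (a, b) and labels[b if a == i else a] == h)
                return sq_dist(p, centers[h]) + penalty
            best = min(range(k), key=cost)
            if best != labels[i]:
                labels[i], changed = best, True
        # Recompute each center as the mean of its members.
        for h in range(k):
            members = [p for p, l in zip(points, labels) if l == h]
            if members:
                centers[h] = tuple(sum(c) / len(members) for c in zip(*members))
        if not changed:
            break
    return labels


# Toy data: the last point is closer to the right-hand cluster.
pts = [(0.0,), (1.0,), (5.0,), (6.0,), (2.9,)]
print(pck_means(pts, 2, [], [], init=[0, 2]))                 # [0, 0, 1, 1, 1]
print(pck_means(pts, 2, [(1, 4)], [], w=10.0, init=[0, 2]))   # [0, 0, 1, 1, 0]
```

In the second call, the Must-Link constraint between points 1 and 4 outweighs the distance term, so the borderline point joins the left-hand cluster.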
Topic Transitions in MOOCs
Modeling the relationships among educational topics is a fundamental first step for automating curriculum planning and course design. In the second part of the project, we introduce the Topic Transition Map (TTM), a general structure that models the content of MOOCs at the topic level. TTMs capture the various ways instructors organize topics in their courses by modeling the transitions between topics. We investigate and analyze four different methods that can be exploited to learn the Topic Transition Map: 1) Pairwise Constrained K-Means, 2) Mixture of Unigram Language Model, 3) Hidden Markov Mixture Model, and 4) Structural Topic Model. To evaluate the effectiveness of these methods, we qualitatively compare the topic transition maps generated by each model and investigate how the Topic Transition Map can be used in three sequencing tasks: 1) determining the correct sequence, 2) predicting the next lecture, and 3) predicting the sequence of lectures. Our evaluation revealed that PCK-Means has the highest performance in the first task and HMMULM outperforms the other methods in task 2, while there is no clear winner in task 3.
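Once each lecture has been assigned a topic, a transition map can be estimated by counting how often one topic follows another across courses. The sketch below assumes the topic labels are already available (in the project they would come from one of the four models listed above); the course data and function names are hypothetical.

```python
from collections import Counter, defaultdict


def transition_map(courses):
    """Estimate P(next topic | current topic) from per-course topic sequences."""
    counts = defaultdict(Counter)
    for topics in courses:
        for current, nxt in zip(topics, topics[1:]):
            if current != nxt:  # consecutive lectures on one topic are not a transition
                counts[current][nxt] += 1
    return {topic: {n: c / sum(nexts.values()) for n, c in nexts.items()}
            for topic, nexts in counts.items()}


def predict_next(ttm, topic):
    """Most likely next topic, or None if the topic has no observed successor."""
    nexts = ttm.get(topic)
    return max(nexts, key=nexts.get) if nexts else None


# Hypothetical lecture-level topic labels for three database courses.
courses = [
    ["intro", "sql", "sql", "normalization", "transactions"],
    ["intro", "sql", "indexing", "transactions"],
    ["intro", "er-model", "sql", "normalization"],
]
ttm = transition_map(courses)
print(predict_next(ttm, "sql"))  # normalization
```

The same table supports the sequencing tasks in a simple way: a candidate lecture order can be scored by multiplying the transition probabilities along it.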

Integrating the CustomLearn service into MOOC platforms to accommodate goal-oriented learning and other learners' needs
Clustering Student-Written SQL Queries

Echelon: An AI Tool for Clustering SQL Queries
As part of teaching SQL, instructors often rely on auto-grading systems for marking students' assignments. However, such systems lack essential insights into the approaches students use to solve these assignments, allowing subtle flaws in student intuition to go unseen. Further, manual analysis of students' code submissions ranges from costly to impossible, depending on the assessments' frequency. The goal of this project is to use AI to help database instructors quickly identify trends in students' solutions. To that end, we developed a system called Echelon capable of extracting features that instructors deem significant from students' SQL queries and using them to generate clusters that capture the key approaches taken. The system creates a two-dimensional projection, which is then linked to a dashboard that instructors can use to rapidly assess class performance, using clustering algorithms to group student approaches into clean, intuitive categories. Instructors can then address a variety of student approaches, and thus create a more responsive classroom.
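As a toy illustration of feature-based grouping, the sketch below marks which SQL constructs each query uses and groups students whose queries share the same feature signature. The feature list, the regular expressions, and the exact-match grouping are stand-ins chosen for brevity; they are not Echelon's actual features, clustering algorithm, or projection.

```python
import re
from collections import defaultdict

# Hypothetical features an instructor might care about.
FEATURES = {
    "join": r"\bJOIN\b",
    "subquery": r"\(\s*SELECT\b",
    "group_by": r"\bGROUP\s+BY\b",
}


def extract_features(query: str) -> tuple:
    """Binary vector: 1 if the construct appears in the query, else 0."""
    return tuple(int(bool(re.search(pattern, query, re.IGNORECASE)))
                 for pattern in FEATURES.values())


def group_by_approach(queries: dict) -> dict:
    """Group student ids by identical feature vector (a stand-in for clustering)."""
    groups = defaultdict(list)
    for student, query in queries.items():
        groups[extract_features(query)].append(student)
    return dict(groups)


# Hypothetical student submissions to one problem.
queries = {
    "s1": "SELECT s.name FROM students s JOIN enrollments e ON s.id = e.student_id",
    "s2": "SELECT name FROM students WHERE id IN (SELECT student_id FROM enrollments)",
    "s3": "select s.name from students s join enrollments e on e.student_id = s.id",
}
print(group_by_approach(queries))
# {(1, 0, 0): ['s1', 's3'], (0, 1, 0): ['s2']}
```

Here the join-based and subquery-based solutions fall into separate groups even though they may return the same rows, which is the kind of distinction an auto-grader alone would not surface.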
FIE 2023 Presentations:
- MOOCs
- DB Learning