
Data Pipeline & Data Standard for Primary Education: Challenges and Opportunities

Tuesday, October 24th, 2023

This article is also available in Swedish.

Data is seen as a vital resource for modern societies, with qualities that can spark a technological revolution [1]. The use of digital technology in education has made it possible to collect a wide range of data on students. Educational data, such as text, grades, quizzes, timestamps, and behavioral data on the use of Digital Learning Material (DLM) or educational platforms, can be large, complex, and heterogeneous [2]. Moreover, educational data are spread across several different digital services, which makes them challenging to use strategically. Although data standards would facilitate the interconnection and dissemination of information among the various digital educational tools within schools, there are no established standards for collecting, processing, analyzing, and presenting such data. As a result, school leaders, teachers, and students do not capitalize on the possibility of making decisions based on data.

The significance of data pipelines has been acknowledged in several recent studies. Implementing a data pipeline brings opportunities but also challenges, such as organizational barriers, infrastructure maintenance concerns, and data quality issues [3, 4]; even so, having a data pipeline is essential. Data pipelines are made up of complex chains of connected activities that begin with a data source and end with a data sink. To put it another way, a data pipeline is an interconnected series of operations in which the output of one or more operations serves as the input for a subsequent operation [5]. We proposed a technical solution [6] containing a data pipeline (Figure 1) that employs a secure Swedish repository, the Swedish University Computer Network (SUNET) [7]. Data pipelines allow data to flow, for example, from an application to a data repository or from a data lake to an analytics database. At the start of the pipeline, data is loaded into storage. After that come several steps, each of which produces an output that serves as the input for the following step, until the pipeline is finished.
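The idea of chained steps, where each step consumes the output of the previous one, can be sketched in a few lines of Python. This is only an illustration of the general pattern, not the project's pipeline: the record fields (`student`, `score`, `max_score`) and the two steps are made up for the example.

```python
from typing import Callable, Iterable

Record = dict
Step = Callable[[list], list]


def load(source: Iterable[Record]) -> list:
    """Data source: load raw records into storage (here, a list in memory)."""
    return list(source)


def drop_incomplete(records: list) -> list:
    """Hypothetical step: discard records that have no score."""
    return [r for r in records if r.get("score") is not None]


def add_percentage(records: list) -> list:
    """Hypothetical step: derive a percentage from score and max_score."""
    return [{**r, "percent": 100 * r["score"] / r["max_score"]} for r in records]


def run_pipeline(source: Iterable[Record], steps: list) -> list:
    """Run the steps in order; each step's output is the next step's input."""
    data = load(source)
    for step in steps:
        data = step(data)
    return data  # data sink: here, the result is simply returned


# Made-up example records.
raw = [
    {"student": "s1", "score": 8, "max_score": 10},
    {"student": "s2", "score": None, "max_score": 10},
]
result = run_pipeline(raw, [drop_incomplete, add_percentage])
```

Keeping every step as a plain function from records to records makes it easy to add, remove, or reorder steps without touching the others.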

For all stakeholders involved, data security, privacy, and ethics are the most important considerations. To address these concerns, we used a SUNET drive with a separate SUNET Simple Storage Service (S3) [8] for each stakeholder’s data in the suggested data pipeline. This keeps the data of the different stakeholders, such as Educational Technology (EdTech) companies and municipalities, distinct while allowing each of them to upload and access their own data. S3 is a highly available object store that offers practically unlimited storage capacity at a reasonable cost, and it is supported by several well-known cloud databases. To obtain a thorough assessment of the students’ progress, it was crucial to merge the data held in the various S3s.

Yet, the task of merging this data became quite challenging in the absence of a specific data standard. To tackle this problem, we put forth a data standard that covers all the educational data gathered in the SUNET S3s. Although the proposed data standard is user-friendly, many stakeholders had difficulty formatting their data to adhere to it and uploading the data to the S3s on our SUNET drive. This difficulty arose from a lack of expertise within schools and municipalities, as well as from the limited financial resources and time of the EdTech companies. To address this issue, we developed individualized scripts for each stakeholder based on their data. By having stakeholders run these scripts once a week, we were able to ensure consistent data formatting across all the S3s.
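To give a feel for what such a per-stakeholder formatting script might look like, here is a minimal sketch. It is not one of the project's scripts, and the data standard itself is not described in this post: the standard field names (`student_id`, `activity`, `timestamp`, `score`), the stakeholder's export columns, its semicolon delimiter, and its date format are all assumptions made up for the example.

```python
import csv
import io
from datetime import datetime

# Hypothetical field names of the common data standard.
STANDARD_FIELDS = ["student_id", "activity", "timestamp", "score"]

# Hypothetical mapping from one stakeholder's export columns to the standard.
COLUMN_MAP = {
    "pupil": "student_id",
    "exercise": "activity",
    "time": "timestamp",
    "points": "score",
}


def to_standard(raw_csv: str) -> str:
    """Convert one stakeholder's semicolon-separated export to the standard CSV."""
    reader = csv.DictReader(io.StringIO(raw_csv), delimiter=";")
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=STANDARD_FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        rec = {COLUMN_MAP[name]: value.strip() for name, value in row.items()}
        # Assumed source format: day/month/year hour:minute -> ISO 8601.
        rec["timestamp"] = datetime.strptime(
            rec["timestamp"], "%d/%m/%Y %H:%M"
        ).isoformat()
        # Assumed decimal comma in the export -> decimal point.
        rec["score"] = rec["score"].replace(",", ".")
        writer.writerow(rec)
    return out.getvalue()


# Made-up export with a single row.
export = "pupil;exercise;time;points\nA17;fractions-3;24/10/2023 09:15;7,5\n"
formatted = to_standard(export)
```

Because only `COLUMN_MAP` and the two format conversions are specific to a stakeholder, a script of this shape could be adapted for each data provider while always producing the same output columns.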

Once stakeholders have uploaded their formatted data to their respective S3s on our SUNET drive, the project’s development team (DT) merges and hashes the data. The pseudonymized data is stored as a backup on the SUNET drive. It is crucial to note that the DT uses only pseudonymized data. According to the General Data Protection Regulation (GDPR), pseudonymization means processing personal data in such a way that it can no longer be attributed to a specific individual without additional information. This additional information is kept separately and is subject to measures that prevent the personal data from being linked to identifiable individuals [9]. We established a virtual machine on our university cloud to preprocess and cross-analyze the data, provide server space, develop a multi-dashboard visual learning analytics (VLA) tool, implement a login system, and create backups of the code and processed data. Because multiple stakeholders are involved, each stakeholder accesses its own dashboard within the multi-dashboard VLA tool through an API, which ensures secure access control. By making it easier for the various stakeholders to manage, share, analyze, and visualize data, we hope that the proposed technical solution contributes to future standards for how the school system and DLM companies in Sweden handle, share, and use educational data.
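The merge-and-hash step can be illustrated with a short Python sketch. The post does not say which hashing scheme the DT uses, so the keyed HMAC-SHA256 below is an assumption chosen for the example, as are the field names and the sample records. The secret key plays the role of the "additional information" that must be kept separately from the pseudonymized data.

```python
import hashlib
import hmac


def pseudonymize(student_id: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256, hex encoded)."""
    return hmac.new(secret_key, student_id.encode("utf-8"), hashlib.sha256).hexdigest()


def merge_and_pseudonymize(sources: dict, secret_key: bytes) -> list:
    """Merge records from all stakeholders and drop the original identifier."""
    merged = []
    for stakeholder, records in sources.items():
        for rec in records:
            out = {k: v for k, v in rec.items() if k != "student_id"}
            out["pseudonym"] = pseudonymize(rec["student_id"], secret_key)
            out["source"] = stakeholder
            merged.append(out)
    return merged


# Hypothetical key; in practice it would be stored apart from the data.
key = b"example-key-kept-outside-the-dataset"

# Made-up records from two hypothetical stakeholders.
sources = {
    "edtech_company": [{"student_id": "A17", "activity": "fractions-3", "score": 7.5}],
    "municipality": [{"student_id": "A17", "grade": "B"}],
}
merged = merge_and_pseudonymize(sources, key)
```

Because the same key is used for every source, the same student receives the same pseudonym in all records, so data from different stakeholders can still be cross-analyzed without exposing the original identifier.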

Zeynab (Artemis) Mohseni & Italo Masiello // EdTechLnu


  • Perez, C. (2002). Technological revolutions and financial capital: the dynamics of bubbles and golden ages. Cheltenham (UK): Edward Elgar.
  • Mohseni, Z., Martins, R. M., & Masiello, I. (2022). SBGTool v2.0: An Empirical Study on a Similarity-Based Grouping Tool for Students’ Learning Outcomes. Data, 7(7), 98.
  • Munappy, A. R., Bosch, J., & Olsson, H. H. (2020). Data pipeline management in practice: Challenges and opportunities. In International Conference on Product-Focused Software Process Improvement, Springer, Cham, pp. 168-184.
  • Pervaiz, F., Vashistha, A., & Anderson, R. (2019). Examining the challenges in development data pipeline. In Proceedings of the 2nd ACM SIGCAS Conference on Computing and Sustainable Societies, pp. 13-21.