ONTOLOGY-DRIVEN DATA SEMANTICS DISCOVERY FOR CYBER-SECURITY
Author: Sarah Kushner
Advisor: Marcello Balduccini, PhD
Institution: Drexel University
We present a software architecture for data semantics discovery, capable of extracting semantically rich content from human-readable files without prior specification of their format. Such files come in a wide variety of formats. The architecture, based on work at the intersection of knowledge representation and machine learning, includes machine learning modules for automatic file format identification, tokenization, and entity identification. The process is driven by an ontology, a formal hierarchy of domain-specific concepts, their properties, and their interrelationships. The ontology also provides a layer of abstraction for querying the extracted data. Although the architecture can be applied in a variety of domains, we focus on cyber-forensics applications, aiming to enable the parsing of log files for which no parsing and analysis tools are readily available. We also aim to aggregate and query data from multiple, diverse systems across large networks. The key contributions of our work are the development of an architecture that constitutes a substantial step toward solving a highly practical open problem, the creation of one of the first comprehensive ontologies of cyber assets, and the demonstration of a non-trivial combination of declarative knowledge specification and machine learning.
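To make the idea of ontology-driven extraction and querying more concrete, the following is a minimal sketch in Python, not the architecture itself. It assumes a toy ontology of five hypothetical concepts, and it swaps the machine learning modules for whitespace tokenization and hand-written regular expressions. The concept names, the patterns, and the sample log line are all invented for illustration, and file format identification is left out entirely.

```python
import re

# Hypothetical toy ontology: concept -> parent concept (None marks the root).
ONTOLOGY = {
    "CyberAsset": None,
    "NetworkAddress": "CyberAsset",
    "IPv4Address": "NetworkAddress",
    "MACAddress": "NetworkAddress",
    "Timestamp": "CyberAsset",
}

# Hand-written patterns stand in for learned entity identifiers (assumption).
PATTERNS = {
    "IPv4Address": re.compile(r"^\d{1,3}(\.\d{1,3}){3}$"),
    "MACAddress": re.compile(r"^[0-9A-Fa-f]{2}(:[0-9A-Fa-f]{2}){5}$"),
    "Timestamp": re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}$"),
}


def tokenize(line):
    """Split a line on whitespace (a stand-in for learned tokenization)."""
    return line.split()


def identify_entities(line):
    """Label each token with the first leaf concept whose pattern matches it."""
    entities = []
    for token in tokenize(line):
        for concept, pattern in PATTERNS.items():
            if pattern.match(token):
                entities.append((concept, token))
                break
    return entities


def is_a(concept, ancestor):
    """Walk up the hierarchy to test whether concept falls under ancestor."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = ONTOLOGY[concept]
    return False


def query(entities, concept):
    """Return the values of all entities that fall under the given concept."""
    return [value for found, value in entities if is_a(found, concept)]


# Made-up log line, used only to exercise the sketch.
LOG_LINE = "2024-01-15T10:32:07 accepted 192.168.0.12 from 00:1a:2b:3c:4d:5e"
entities = identify_entities(LOG_LINE)
addresses = query(entities, "NetworkAddress")
```

Here the query names the abstract concept NetworkAddress rather than a concrete format, so it returns both the IPv4 and the MAC address found in the line. This is the kind of abstraction layer that an ontology can provide over data drawn from differently formatted sources.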


