## Big Data and Text Analysis

### Overview

The amount of data available for analysis today is enormous. “Data science” is the research field that studies and applies techniques to extract information, analyze content and discover new knowledge from large data sources (big data), often to obtain a competitive advantage. Data science is therefore of paramount importance for both business and research.

The course has a practical orientation and includes a series of activities that introduce the fundamental techniques of big data analysis. These include algorithms for data management (e.g., MapReduce), the application of data mining algorithms and statistical modeling, and applications of lexical text analysis (NLP).

The course involves developing simple programs in Python (in particular with the pandas and scikit-learn libraries). Through this course, students will:

- know the major technological, scientific and application trends connected with big data and data science.
- model data analysis problems and propose approaches for solving them.
- use some important techniques of data analysis, text analysis and data mining.
- learn the pandas and scikit-learn libraries in order to apply data mining techniques.
- apply data mining and machine learning techniques to systems in production (MLOps).

### Course contents

Theoretical content:

1. Elements of MapReduce for data analysis
2. Data analysis through data mining techniques. Evaluation of models. Machine learning in production (MLOps)
3. Search for similar items (Locality-Sensitive Hashing for Documents)
4. Search for Frequent Itemsets (Market Basket Analysis)
5. Link analysis and PageRank
6. Text analysis, mining and retrieval (text retrieval, language models, NLP for text classification with word embeddings, Sentiment Analysis)
7. Basics of Recommendation Systems (Content-Based Recommendations / Collaborative Filtering)
8. Elements of Time Series Analysis and anomaly detection
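
As an illustration of the first topic in the list above, the following is a minimal single-machine sketch of the MapReduce pattern applied to word counting. It is not course material: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample documents are assumptions made for this example, and a real MapReduce framework would distribute the same three phases across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence on a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input, invented for this sketch.
docs = ["big data analysis", "text analysis with big data", "data mining"]
counts = word_count(docs)
print(counts["data"])  # 3
```

The map and reduce functions never share state, which is what allows a framework to run them in parallel on separate partitions of the data.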


Laboratory:

1. Pandas: representation and management of tabular data. Preprocessing and feature engineering.
2. Scikit-learn: application of classification and regression techniques, model evaluation, feature engineering, pipeline construction and evaluation.


## Ingegneria del Software (Software Engineering)

### Overview

The course provides fundamental knowledge of the models, languages and technologies for the requirements specification, design, implementation, testing, deployment and maintenance of software systems.

### Course contents

The course covers the main theoretical, technological and practical aspects of the phases into which the life cycle of a software system can be divided: requirements specification, design, implementation, testing, deployment and maintenance. In particular:

1. Requirements specification
**Theoretical part**: The goals of requirements specification, the expected inputs and how to write specifications.
**Application part**: Using UML use case diagrams to represent requirements.

2. Design
**Theoretical part**: The objectives of the design phase; high-level and detailed design. Examples of architectural styles. Functional decomposition. Object-Relational Mapping (ORM): definition and issues. Quality characteristics of a design. Introduction to metrics for software evaluation. Definition of cohesion and coupling of software modules.
**Application part**: Use of UML diagrams to describe software. ORM: JPA and Hibernate. Use of design patterns in software design.

3. Implementation and testing
**Theoretical part**: What is meant by good programming and identification of best practices. Software refactoring. Software testing and code inspection. Survey of several testing techniques (path analysis, boundary value analysis, ...).
**Application part**: JUnit: a framework for testing Java programs. Overview of basic defensive programming techniques. Maven and Git: support for programming and software management.

4. Deployment and maintenance
**Theoretical part**: The main problems associated with configuration management and with software after it has gone into production. Types of software licenses.