ELEC0033 Analysis Task

ELEC0033 - 2020/2021
5 Data Analytics Task - Climate Data Analysis using Python
5.1 General Overview
The assignment comprises individual code writing, data analysis and inference. You are
allowed to discuss ideas with peers, but your code, experiments and report must be
based solely on your own work.
The assignment leverages elements covered in class (data analytics lecture). You will be
working with two meteorological datasets: you will be required to crunch data, clean
the datasets and infer hidden patterns. Specifically, you will be asked to solve three
tasks.
The goals of the assignment are the following:
• To further develop your programming skills
• To further develop your skills in, and understanding of, the principles of data
analytics and machine learning
• To acquire experience in dealing with real-world data
5.2 Assignment description

  1. Dataset description
    You will find two pickle files named weather-denmark-resampled.pkl and df_perth.pkl,
    respectively.
    For TASKS 1 and 2, which cover the main aspects of preliminary data analysis, missing
    data and outlier detection, you must use the first dataset.
    For TASK 3, which covers correlation and pattern inference, you will use the second,
    smaller dataset.
  2. Tasks to be solved
    Read the three task descriptions carefully and address them using the provided
    Jupyter notebook named Coursework_weather_data.ipynb.
    TASK 1 - PRELIMINARY ANALYSIS
    In this first task, you will explore the dataset. Follow the instructions below:
    a. Import the weather-denmark-resampled.pkl dataset provided in the folder and
    explore the dataset by answering the following questions.
    i. How many cities are there in the dataset?
    ii. How many observations and features are there in this dataset?
    iii. What are the names of the different features?
    b. Now that you are familiar with the dataset, check whether it contains any
    missing values. If so, remove them using the pandas built-in function.
    c. Extract the general statistical properties summarising the minimum, maximum,
    median, mean and standard deviation values for all the features in the dataset. Spot
    any anomalies in these properties and clearly explain why you classify them as
    anomalies.
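The steps in Task 1 can be sketched on a small synthetic table. This is only an illustration under stated assumptions: the city and feature names are hypothetical, the data is random, and `dropna` is assumed to be the pandas built-in referred to above. With the coursework data, the frame would instead be loaded with `pd.read_pickle("weather-denmark-resampled.pkl")`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real file (all names below are hypothetical).
rng = np.random.default_rng(0)
index = pd.date_range("2006-05-01", periods=48, freq="60min")
columns = pd.MultiIndex.from_product(
    [["CityA", "CityB"], ["Temp", "Pressure"]],
    names=["city", "feature"],
)
df = pd.DataFrame(rng.normal(size=(48, 4)), index=index, columns=columns)
df.iloc[3, 0] = np.nan  # inject one missing value for the demonstration

# a. Cities, observations and features, read from the two column levels.
cities = df.columns.get_level_values(0).unique().tolist()
features = df.columns.get_level_values(1).unique().tolist()
n_observations = len(df)

# b. Count missing values, then drop every row that contains one.
n_missing = int(df.isna().sum().sum())
df_clean = df.dropna()

# c. Summary statistics for every (city, feature) column.
summary = df_clean.agg(["min", "max", "median", "mean", "std"])

print(cities, features, n_observations, n_missing)
print(summary)
```

Dropping rows is the simplest option but it shortens the series, so it is worth comparing `summary` before and after cleaning to check that the statistics are preserved.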
    TASK 2 – OUTLIERS
    The second task focuses on spotting and handling outliers. Follow the instructions
    below:
    d. Store the temperature measurements in May 2006 for the city of Odense. Then
    produce a simple plot of the temperature versus time.
    HINT: In this dataset, the cities are vertically stacked. Therefore, we have a
    multi-column dataset, which basically works as a nested dictionary.
    e. Find the outliers in this set of measurements (if any) and replace them using linear
    interpolation.
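One possible way to approach Task 2 is sketched below on a synthetic temperature series. The 1.5 × IQR rule is an assumption chosen for illustration (the brief does not prescribe a detection method), and the `"Temp"` column label mentioned in the comment is hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic hourly temperatures for one month with a daily cycle.
index = pd.date_range("2006-05-01", "2006-05-31 23:00", freq="60min")
hours = np.arange(len(index))
temps = pd.Series(12 + 5 * np.sin(2 * np.pi * hours / 24), index=index)
temps.iloc[[100, 400]] = [60.0, -40.0]  # inject two obvious outliers

# With (city, feature) columns, the real slice might look like
# df.loc["2006-05", ("Odense", "Temp")]; the "Temp" label is an assumption.
# temps.plot() would then draw temperature versus time (needs matplotlib).

# Flag points lying more than 1.5 * IQR outside the quartiles.
q1, q3 = temps.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (temps < q1 - 1.5 * iqr) | (temps > q3 + 1.5 * iqr)

# Blank the flagged points, then fill them by linear interpolation in time.
cleaned = temps.mask(is_outlier).interpolate(method="time")

print(int(is_outlier.sum()), "outliers replaced")
```

Other detection rules (for example a z-score threshold or a rolling median) fit the same mask-then-interpolate pattern; whichever is used should be justified in the report.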
    TASK 3 – CORRELATION AND INFERENCE
    In this last task, you will look for correlations between features of the data and infer
    hidden patterns. For this task, you will be working with a smaller dataset. Follow the
    instructions below:
    3.1 – CORRELATION
    f. We now take a new dataset (df_perth.pkl), which collects climate data of a city
    in Australia. Here we have just one year of measurements, but more features.
    g. Find any significant correlations between features.
    HINT: you might find it useful to look for trends and recurrent patterns within the
    data.
    h. We now focus on the correlation between precipitation and cloud cover. We
    want to infer the probability of having moderate to heavy rain (> 1 mm/h) as a
    function of the cloud cover index.
    HINT: you might find it useful to create a new column where you have 0 if
    precipitation < 1 mm/h and 1 otherwise.
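The correlation step and the indicator-column hint can be illustrated as follows. This is a minimal sketch on synthetic data: the column names `cloud_cover` and `precip_mm_h` and the 0 to 8 cloud cover scale are assumptions, not the actual contents of df_perth.pkl.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2000
cloud = rng.integers(0, 9, size=n)  # hypothetical cloud cover index, 0 to 8
# Synthetic rain whose typical intensity grows with cloud cover.
precip = rng.exponential(scale=0.1 + 0.2 * cloud)
df = pd.DataFrame({"cloud_cover": cloud, "precip_mm_h": precip})

# g. Pairwise Pearson correlations between all numeric features.
corr = df.corr()

# h. Binary indicator: 0 if precipitation < 1 mm/h and 1 otherwise.
df["rain"] = (df["precip_mm_h"] >= 1).astype(int)

# The mean of a 0/1 column within each group is the empirical probability
# of moderate to heavy rain at that cloud cover value.
p_rain = df.groupby("cloud_cover")["rain"].mean()

print(corr)
print(p_rain)
```

Because the indicator is binary, a logistic regression of `rain` on `cloud_cover` would be a natural smooth alternative to the per-group means.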
    3.2 – INFERENCE
    i. Let’s now assume that we want to predict the photovoltaic production (PV
    production) using multiple linear regression. Explain which features are
    statistically significant in modelling the target variable.
    j. Create a multivariate model using the predictors chosen in the previous
    question.
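A multiple linear regression with a rough significance check can be sketched with NumPy alone. The predictors here are synthetic and the 1.96 threshold is the large-sample approximation to the 5% two-sided t-test; a statistics package such as statsmodels reports exact p-values for the same fit.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
# Hypothetical predictors: the first two drive the target, the third is noise.
X = rng.normal(size=(n, 3))
y = 2.0 + 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

A = np.column_stack([np.ones(n), X])          # prepend an intercept column
beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # ordinary least squares fit

residuals = y - A @ beta
dof = n - A.shape[1]                          # residual degrees of freedom
sigma2 = residuals @ residuals / dof          # residual variance estimate
cov = sigma2 * np.linalg.inv(A.T @ A)         # covariance of the coefficients
t_stats = beta / np.sqrt(np.diag(cov))

# |t| above roughly 1.96 corresponds to p < 0.05 when dof is large.
significant = np.abs(t_stats) > 1.96

# Refit using only the intercept and the predictors judged significant.
keep = significant.copy()
keep[0] = True
beta_sel, *_ = np.linalg.lstsq(A[:, keep], y, rcond=None)

print(np.round(beta, 3), np.round(t_stats, 1), significant)
```

In the coursework setting, `X` would hold the candidate weather features and `y` the PV production column.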
    5.3 Deliverable
    Report
    The report should be written in the form of an academic paper using the ICML format¹.
    The report should be at most 10 pages long excluding references and appendices. The
    report must include the following sections:
    ● Abstract. This section should be a short paragraph (4-5 sentences) that provides a
    brief overview of the methodology and results presented in the report.
    ● Preliminary Analysis. This section describes the study you carried out during Task 1
    and should be organized in the following subsections:
    ○ Data Understanding. This subsection should detail the data that was used
    for this study, clearly describing the content, size and format of the data,
    how many cities are described in the dataset, how many observations and
    how many (and which) features are considered. Further information can
    be provided.
    ○ Data Cleaning. This subsection should describe the missing data
    processing. It is important to describe the methodology that you used to
    search for missing data and how you addressed it in the best
    way (for example, how you ensured that the dataset preserves the same
    statistics/properties). Clearly motivate your answers.
    ○ Data Statistics. This subsection should describe the general statistical
    properties of the dataset with numerical or graphical visualization. Provide
    reflections on anomalies (with clear motivation/supporting evidence
    for anomalies).
    ● Outliers. This section should describe all the steps that were applied to the data
    to find and handle outliers. A justification for each step should also
    be provided. If little or no pre-processing was done, this section should
    clearly justify why.
    ● Data Inference. This section should describe the exploratory and inference
    process. The following subsections should be provided:
    ○ Data Correlation. This subsection should describe the different feature
    correlations that you have investigated in the current dataset. Even if you
    discover few patterns, it is important that you clearly explain and justify
    the methodologies that you adopted. Clearly show results that can support
    your statements.
    ○ Data Inference. This subsection should describe the final step of data
    inference. Again, clearly motivate your solutions, approaches and
    conclusions/results.
    ● Conclusion. This last section summarises the findings, highlights any challenges or
    limitations that were encountered during the study and provides directions for
    potential improvements.
    Please make sure you complement your discussion in each section with relevant
    equations, diagrams, or figures as you see fit. Most importantly, be sure that all your
    answers and solutions are well motivated.
    ¹ https://icml.cc/Conferences/2...
    Marking Criteria
    The marking criteria are given in the table below.

    Criteria | Description | Mark weight
    Abstract/Conclusions | The purpose of the executive summary is to outline the data analytics project, input, envisioned outputs as well as key findings. | 5%
    Task 1 - Preliminary Analysis | Dataset Understanding. Provide a clear description of the dataset answering the following questions: i) How many cities are there in the dataset? ii) How many observations and features are there in this dataset? iii) What are the names of the different features? | 10%
    Task 1 - Preliminary Analysis | Data Cleaning – Missing data. Provide a clear description of the results from your missing data analysis and key outcomes. | 15%
    Task 1 - Preliminary Analysis | Data Statistics. Describe the general statistical properties of the dataset with numerical or graphical visualization. Provide reflections on anomalies (with clear motivation/supporting evidence for anomalies). | 10%
    Task 2 – Outliers | Show the visualization of the temperature measurements, together with some comments on the behaviour depicted in the plots. Provide summaries of the outliers, in terms of the number of outliers detected as well as the techniques adopted to replace outliers (motivate your answers). | 20%
    Task 3 – Inference | Data Correlation. Comment on the significant correlations you found between features and assess rain probability as a function of cloud cover index. Support the text with visualization of results and key insights on the considered approach. | 15%
    Task 3 – Inference | Data Inference. Good understanding of data inference. Comment on the multivariate model using the predictors chosen in the previous question. | 20%
    Report Style | The report needs a clean and clear structure and layout. The quality of images, tables, citations and references will also be taken into account. | 5%
