title: DAMA
#+STARTUP: overview
DAMA
Motivation for Data Management
- Visibility
- Reliability
- Security
- Scalability
Challenges
- reduce cost
- increase storage
- deliver streaming data for a growing number of IoT devices
Motivation for Research Data Management
- the amount of data grows fast
- new technologies
- more accessible data due to standardisation
- required by funding agencies
- Scientific
Key differences between Data Management and Data Science
- Acquisition
- Storage
- Quality
- Governance
- Integrity
Data Science Life Cycle
Acquire, clean, explore, preprocess, model, validate, present
Research Data Life Cycle
Plan, Collect, Assure, Describe, Submit, Preserve, Discover, Integrate, Analyze, Publish
Research Data Life Cycle 2
Processing -> Distribution -> Discovery -> Analysis -> Archiving (a cycle)
- version control
- provenance
- normalization
- integrity preservation
- curation
Storage & Preservation
- storage: mid-term, on one's own disk
- preservation: long-term, accessible to others
- 3-2-1 rule (3 copies, 2 different media, 1 off-site)
- Accessibility, Authenticity, Longevity
Data Model
model
- why? to store data in a database or information system; a conceptual representation of data objects and how they are associated with other data objects
- benefits: helpful for visual representation, business rules, regulatory compliance, consistency conventions
data modelling techniques
- ERM, Entity-Relationship Model
- UML, Unified Modeling Language
Data model types
- Conceptual
- Logical
- Physical
Data Type
Structured
Semi-Structured
self-describing structure: JSON, XML, email, key-value stores (see the JSON sketch after this list)
Unstructured
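A minimal sketch of how such a self-describing, semi-structured record can be parsed, using only the Python standard library; the field names are invented for illustration:
#+BEGIN_SRC python
import json

# A self-describing record: field names travel with the values,
# and different records may carry different fields.
raw = '{"id": 1, "name": "sensor-A", "readings": [21.5, 22.0], "unit": "C"}'

record = json.loads(raw)  # parse the JSON text into a Python dict
print(record["name"], record.get("location", "unknown"))  # a missing field is handled gracefully
#+END_SRC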
Data Acquisition
- Harvesting: from external sources into system storage
- Ingesting: from different storages into the different processing steps
Database
Hierarchical database
- hierarchically structured
- 1-to-1 and 1-to-n relationships
- XML (see the parsing sketch below)
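A small sketch of traversing a hierarchical (1-to-n) XML document with the standard library; the element names are invented:
#+BEGIN_SRC python
import xml.etree.ElementTree as ET

# One parent ("library") with several children ("book"): a 1-to-n hierarchy.
doc = """
<library>
  <book year="2001"><title>Data Modelling</title></book>
  <book year="2015"><title>NoSQL Basics</title></book>
</library>
"""

root = ET.fromstring(doc)
for book in root.findall("book"):   # walk the child nodes of the root
    print(book.get("year"), book.findtext("title"))
#+END_SRC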
Network database
- Predecessor of relational database
- more than one parent
- allows n-to-n relationships
- uses a directed graph with direct links
- can become a complex structure
Relational Database
- not sufficient for the fast growth of big data and its complexity
- hard to scale horizontally
- Normalization: reduces disk space
- Denormalization: fast and optimised for querying
- Snowflake schema
- less disk space
- normalized
- minimal redundancy
- powerful for data analysis
- allows many-to-many relationships
- Star schema (denormalized)
- simple to understand and build
- fast querying, fewer joins (see the sqlite sketch below)
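An illustrative sqlite3 sketch of a star-style layout, one fact table joined to one dimension table; the table and column names are made up, not part of the lecture material:
#+BEGIN_SRC python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Dimension table (descriptive attributes) and fact table (measures plus foreign keys).
cur.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE fact_sales (product_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO dim_product VALUES (?, ?)", [(1, "laptop"), (2, "phone")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?)", [(1, 999.0), (2, 499.0), (1, 1099.0)])

# Typical star-schema analysis query: a single join from the fact table to a dimension.
for name, total in cur.execute(
    "SELECT d.name, SUM(f.amount) FROM fact_sales f "
    "JOIN dim_product d ON d.product_id = f.product_id GROUP BY d.name"
):
    print(name, total)
#+END_SRC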
Object-Oriented Database
NoSQL database
- for big data
ACID compliance
- potential failures, server, power, OS
- Atomicity
- Consistency
- Isolation
- Durability (cf. IBM statement)
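A minimal sqlite3 sketch of atomicity, assuming an invented accounts table: either the whole transfer commits, or the failed transaction is rolled back:
#+BEGIN_SRC python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
con.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100.0), ("bob", 0.0)])
con.commit()

try:
    with con:  # one transaction: commit on success, rollback on any error
        con.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        raise RuntimeError("simulated crash in the middle of the transfer")
except RuntimeError:
    pass

# Atomicity: the partial debit was rolled back, so the balances are unchanged.
print(dict(con.execute("SELECT name, balance FROM accounts")))
#+END_SRC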
CAP Theorem
- Consistency
- Availability
- Partition tolerance
- SQL databases typically choose CA, NoSQL databases typically choose AP
ETL Process
- E, Extract: read data from the source
- L, Load: store data in the final data store
- T, Transform: modify data based on the requirements
- in practice it is mostly ELT (load first, then transform in the target system); see the sketch after this list
- but if the data source allows modification, TEL is also possible
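A toy sketch to make the three steps concrete; the in-memory CSV source and the list acting as the target store are stand-ins for real systems:
#+BEGIN_SRC python
import csv
import io

# Extract: read raw rows from a source (an in-memory CSV stands in for a file, API, or database).
source = io.StringIO("name,price\nlaptop,999\nphone,499\n")
rows = list(csv.DictReader(source))

# Transform: clean and reshape according to the requirements (cast types, rename fields).
transformed = [{"name": r["name"].upper(), "price_eur": float(r["price"])} for r in rows]

# Load: write into the final data store (a plain list stands in for a warehouse table).
warehouse = []
warehouse.extend(transformed)
print(warehouse)
#+END_SRC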
Data Warehouse
- integrated data from different source
- restructures data (denormalization)
- optimized for analysis
- Building
- Data source Layer: integrate internal and external data
- Staging layer: conduct transformation
- Storage Layer: host a database
Database
- real time data
- optimized for modification and querying(very efficient)
- normalized data
Data access in system
- OLTP: Normalized
- OLAP: Denormalized
Data Marts
- subject-oriented database
- to meet the specific group of users
- data access with higher performance
- Data maintenance: different departments can have their own control
- Easy setup: simple design, requires less technical effort
- Analysis, KPIs,
- Easy input
Data Lake
| Aspect    | Data Warehouse     | Data Lake               |
|-----------+--------------------+-------------------------|
| Data      | processed          | raw                     |
| Structure | structured         | all possible, incl. raw |
| Purpose   | ready for analysis | open                    |
| Users     | business users     | data scientists         |
| Cost      | costly             | cheap/easy              |
| Security  | more secure        | less secure             |
| Queries   | SQL                | NoSQL                   |
| Results   | fast               | slow                    |
Data Mesh
- domain-oriented decentralized data ownership and architecture
- data as a product
- self-serve data infrastructure as platform
- federated computational governance
- architecture for data governance
Data Fabric
- conceptual
- focus on highly automating and integrating
- needs more metadata
Metadata
Metadata is a must: it is data that describes other data, necessary for control and efficiency while processing data, and also important and required for access and discovery. Problem: it is stored in different places and in different forms.
FAIR Principles
- Findable
- Accessible
- Interoperable
- Reusable
- The principles emphasise machine-actionability, because humans increasingly rely on computational support to deal with data as a result of the increase in volume, complexity, and creation speed of data.
Type
Description
- who created it, what it is, when, where, how, licence.
Meaning
- Controlled Vocabulary
- list of words used to tag information
- predefined and authorized
- reduces ambiguities of natural language
- improves retrievability
- supports interoperability
- Taxonomies
- with a hierarchical structure
- Thesauri
- control of terms
- Hierarchical, Equivalence, Association relationships between terms
- support consistent indexing
- serve interoperability
- Ontologies: large thesauri
Linked Data
- URIs (Uniform Resource Identifiers) to name individual things
- HTTP URIs so that things can be looked up on the web
- Linked Data rule 2: all conceptual things should have http(s) URIs (see the triples sketch below)
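A small sketch of naming things with HTTP URIs and linking them as triples, assuming the third-party rdflib package (version 6 or later); the example URIs are invented:
#+BEGIN_SRC python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import FOAF

g = Graph()
dataset = URIRef("https://example.org/dataset/42")    # an HTTP URI naming a thing
creator = URIRef("https://example.org/person/alice")  # another thing, also named by a URI

# Triples link things to other things and to literal values.
g.add((dataset, URIRef("http://purl.org/dc/terms/creator"), creator))
g.add((creator, FOAF.name, Literal("Alice Example")))

print(g.serialize(format="turtle"))
#+END_SRC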
Signposting
An approach to make the scholarly web more friendly to machines.
PIDs
standard, invariant and long-term reference to a digital resource, regardless of its status, location, or current owner
Actionable IDs
A Persistent Identifier (PID) policy for the European Open Science Cloud
Machine Actionable
means that a formal statement is syntactically and semantically specified enabling computing systems to carry out automatic processing.
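A hedged sketch of acting on a PID by machine: resolving a DOI over HTTP and asking for machine-readable metadata via content negotiation. It assumes the requests package and that the DOI's registration agency supports this media type; the DOI shown is only a placeholder:
#+BEGIN_SRC python
import requests

doi = "10.1000/182"  # placeholder DOI; replace with any resolvable DOI
resp = requests.get(
    f"https://doi.org/{doi}",
    headers={"Accept": "application/vnd.citationstyles.csl+json"},  # ask for metadata, not a landing page
    timeout=10,
)
if resp.ok:
    print(resp.json().get("title"))  # a machine can now process the metadata automatically
#+END_SRC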
URLs
- subset of URIs
- tied to a network location (resolved to an IP address)
Sustainability by technology
- reliable
- robust
- long term perspective
PID data type
Types are Metadata Elements
Data Governance
define
- Data governance determines what the data represents and what it can be used for
- Authentication and Authorization required
dimensions
- Organization: roles, responsibilities (RACI model)
- Process,
- Technology
- People, Trust, Ethics
- Data
Role
- Informed
- Accountable
- Consulted
- Responsible
Data Quality
- Data quality describes how well the data fits its intended use: operations, decision making, planning, research
Dimension
- accessibility or availability
- accuracy or correctness
- comparability
- completeness or comprehensiveness
- consistency, coherence, or clarity
- refers to whether the same data kept at different places do or do not match.
- ACID ensures transitions from one valid state to another
- in peer-to-peer systems consistency may be violated
- credibility, reliability, or reputation
- flexibility
- plausibility
- relevance, pertinence, or usefulness
- timeliness or latency
- uniqueness
- validity or reasonableness
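A small sketch checking two of these dimensions (completeness, uniqueness) on a toy table, assuming pandas is available; the column names are invented:
#+BEGIN_SRC python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 2, 4],
    "email": ["a@example.org", None, "c@example.org", "d@example.org"],
})

# Completeness: share of non-missing values per column.
print((1 - df.isna().mean()).round(2))

# Uniqueness: duplicated identifiers violate the uniqueness dimension.
print("duplicate ids:", int(df["id"].duplicated().sum()))
#+END_SRC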
Data Management (DAMA)
A dimension is defined as a measurable feature of an object
- Accuracy
- Availability
- Clarity
- Completeness
Data Security
Definition, planning, development, and execution of security policies and procedures to provide proper authentication, authorization, access, and auditing of data and information assets.
Access
- role based access control
- attribute-based access control
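An illustrative sketch contrasting the two access-control styles; the roles, attributes, and policy are made up:
#+BEGIN_SRC python
# Role-based: permissions hang off a role assigned to the user.
ROLE_PERMISSIONS = {"analyst": {"read"}, "steward": {"read", "write"}}

def rbac_allowed(role: str, action: str) -> bool:
    return action in ROLE_PERMISSIONS.get(role, set())

# Attribute-based: the decision evaluates attributes of user, resource, and context.
def abac_allowed(user: dict, resource: dict, action: str) -> bool:
    return (
        action == "read"
        and user["department"] == resource["owner_department"]
        and resource["classification"] != "confidential"
    )

print(rbac_allowed("analyst", "write"))  # False: the analyst role has no write permission
print(abac_allowed({"department": "lab1"},
                   {"owner_department": "lab1", "classification": "public"}, "read"))  # True
#+END_SRC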
GraphQL
A query language for APIs: clients ask for exactly the data they are interested in (see the sketch below)
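A hedged sketch of querying a GraphQL endpoint for exactly the fields of interest, assuming the requests package; the endpoint URL and the schema are invented:
#+BEGIN_SRC python
import requests

query = """
query {
  dataset(id: "42") {      # request only the fields we are interested in
    title
    creator { name }
  }
}
"""

# GraphQL is usually served over a single HTTP endpoint that accepts a POST with the query.
resp = requests.post("https://example.org/graphql", json={"query": query}, timeout=10)
print(resp.json()["data"]["dataset"]["title"])
#+END_SRC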
Microservice Architecture
APIs for different applications, also standardized
Function as a service
Amazon S3
Data Operations
to improve the turnaround time of data analysis
Data Flow Diagram
representing the flow of data through a process or system
Rules
- a process must have at least one input and one output data flow
- a data store must have at least one data flow in and one data flow out
- data entering or leaving a data store must go through a process
- all processes flow to another process or end at a data store
- data can not flow between entities
Event-driven data handling
- Synchronous
- Asynchronous
Message Queue
- producer/consumer
- Eventual consistency / strong consistency(consistency)
- At-least-once/Exactly-once/At-most-once(Processing models)
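A minimal in-process producer/consumer sketch using only the standard library; a real message broker would replace queue.Queue:
#+BEGIN_SRC python
import queue
import threading

q = queue.Queue()

def producer():
    for i in range(3):
        q.put(f"event-{i}")      # asynchronous hand-off: the producer does not wait for processing

def consumer():
    while True:
        msg = q.get()            # blocks until a message arrives
        print("processed", msg)
        q.task_done()            # acknowledge only after successful processing

threading.Thread(target=consumer, daemon=True).start()
producer()
q.join()                         # wait until every produced message has been acknowledged
#+END_SRC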
Read-only Data store
Principles
- they are immutable
- only the original data source can deliver data to an RDS
- one RDS per application
- each RDS is application-specific
- hosted near the original data
- data provider is responsible for data integrity