title: Data Science and Cloud Service Computing

#+STARTUP: overview

Hadoop

one node installation

.bashrc

export HADOOP_HOME=/home/cloud/hadoop-3.3.1
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
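After reloading the shell (for example with `source ~/.bashrc`), the exports can be sanity-checked. The following is only a sketch: `check_hadoop_env` is a hypothetical helper name, not something Hadoop ships, and it only tests that the variables are non-empty, not that the paths exist.

```shell
# Hypothetical helper (not part of Hadoop): report which of the variables
# exported in .bashrc are unset or empty in the current shell.
check_hadoop_env() {
  missing=0
  for name in HADOOP_HOME HADOOP_INSTALL HADOOP_MAPRED_HOME HADOOP_COMMON_HOME \
              HADOOP_HDFS_HOME YARN_HOME HADOOP_COMMON_LIB_NATIVE_DIR HADOOP_OPTS; do
    # Look up the variable whose name is stored in $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "unset: $name"
      missing=1
    fi
  done
  return "$missing"
}

check_hadoop_env || echo "fix ~/.bashrc and source it again"
```

The function returns 0 only when every variable is set, so it can also guard later steps, e.g. `check_hadoop_env && hdfs namenode -format`.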

$HADOOP_HOME/etc/hadoop/hadoop-env.sh

append to end

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
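A wrong `JAVA_HOME` usually only shows up when the daemons fail to start, so it may be worth checking the path first. A minimal sketch, assuming the JDK path from the line above; `check_java_home` is a hypothetical helper name.

```shell
# Hypothetical helper (not part of Hadoop): succeed only if the given
# directory contains an executable bin/java.
check_java_home() {
  if [ -x "$1/bin/java" ]; then
    echo "ok: $1"
  else
    echo "no executable bin/java under: $1"
    return 1
  fi
}

# Path taken from the export above; adjust if the JDK lives elsewhere.
check_java_home /usr/lib/jvm/java-8-openjdk-amd64 || echo "JAVA_HOME needs fixing"
```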

$HADOOP_HOME/etc/hadoop/core-site.xml

in configuration

<property>
     <name>hadoop.tmp.dir</name>
     <value>/home/cloud/tmpdata</value>
     <description>A base for other temporary directories.</description>
 </property>
 <property>
     <name>fs.default.name</name>
     <value>hdfs://localhost:9000</value>
     <description>The name of the default file system.</description>
 </property>
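"in configuration" means the properties sit inside the file's `<configuration>` element. As a sketch of the complete file, the hypothetical helper `write_core_site` below writes a minimal core-site.xml with the two properties above. It overwrites any existing file in the target directory, so try it on a scratch directory first.

```shell
# Hypothetical helper (not part of Hadoop): write a minimal core-site.xml
# into directory $1, with $2 as hadoop.tmp.dir and $3 as the file system URI.
write_core_site() {
  conf_dir=$1
  tmp_dir=$2
  fs_uri=$3
  cat > "$conf_dir/core-site.xml" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>$tmp_dir</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>$fs_uri</value>
  </property>
</configuration>
EOF
}
```

For the single-node values used here the call would be `write_core_site "$HADOOP_HOME/etc/hadoop" /home/cloud/tmpdata hdfs://localhost:9000`.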

$HADOOP_HOME/etc/hadoop/hdfs-site.xml

in configuration

<property>
  <name>dfs.namenode.name.dir</name>
  <value>/home/cloud/dfsdata/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/home/cloud/dfsdata/datanode</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

$HADOOP_HOME/etc/hadoop/mapred-site.xml

in configuration

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

$HADOOP_HOME/etc/hadoop/yarn-site.xml

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>127.0.0.1</value>
</property>
<property>
  <name>yarn.acl.enable</name>
  <value>false</value>
</property>
<property>
  <name>yarn.nodemanager.env-whitelist</name>
  <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>

init namenode

cd hadoop-3---
hdfs namenode -format
cd sbin
./start-dfs.sh
./start-yarn.sh
jps
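On a single node, `jps` should list NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager once both start scripts have run. The sketch below automates that check; `missing_daemons` is a hypothetical helper, and it assumes `jps` prints one `<pid> <class name>` pair per line.

```shell
# Hypothetical helper (not part of Hadoop): read `jps` output on stdin and
# print every expected daemon that does not appear in it.
missing_daemons() {
  jps_output=$(cat)
  for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    # The leading space keeps "NameNode" from matching "SecondaryNameNode".
    if ! printf '%s\n' "$jps_output" | grep -q " $daemon\$"; then
      echo "$daemon"
    fi
  done
}

# Only run the check where a JDK with jps is actually installed.
if command -v jps >/dev/null 2>&1; then
  jps | missing_daemons
fi
```

Empty output means all five daemons are up; any name printed points at the log file to read under `$HADOOP_HOME/logs`.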

hadoop: hadoop1, hadoop2, hadoop3
hostname: project, q3lb, q3l
HDFS Namenode: NameNode, SecondaryNameNode
HDFS DataNode: DataNode, DataNode, DataNode
YARN ResourceM: ResourceManager
YARN NodeM: NodeManager, NodeManager, NodeManager

.bashrc

export HADOOP_HOME=/home/cloud/hadoop-3.3.1
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

$HADOOP_HOME/etc/hadoop/hadoop-env.sh

append to end

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

$HADOOP_HOME/etc/hadoop/core-site.xml

in configuration

<property>
     <name>hadoop.tmp.dir</name>
     <value>/home/cloud/hadoop-3.3.1/data</value>
     <description>A base for other temporary directories.</description>
 </property>
 <property>
     <name>fs.default.name</name>
     <value>hdfs://hostname:9000</value> <!-- watch out: internal floating IP vs. localhost -->
     <description>The name of the default file system</description>
 </property>

$HADOOP_HOME/etc/hadoop/hdfs-site.xml

in configuration

<property>
  <name>dfs.namenode.name.dir</name>
  <value>/home/cloud/hadoop-3.3.1/dfsdata/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/home/cloud/hadoop-3.3.1/dfsdata/datanode</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
<property>
  <name>dfs.namenode.http-address</name>
  <value>inter floatip:9870</value>
</property>
<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>inter floatip:9868</value>
</property>

$HADOOP_HOME/etc/hadoop/yarn-site.xml

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>inter floatip</value>
</property>
<property>
  <name>yarn.acl.enable</name>
  <value>0</value>
</property>
<property>
  <name>yarn.nodemanager.env-whitelist</name>
  <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>

$HADOOP_HOME/etc/hadoop/mapred-site.xml

in configuration

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

$HADOOP_HOME/etc/hadoop/workers

gwdg01
gwdg10
gwdg19

init namenode

cd hadoop-3---
xsync etc/hadoop
hdfs namenode -format
cd sbin
./start-dfs.sh
./start-yarn.sh
jps

command

general commands

hdfs dfs -ls /
hdfs dfs -chmod 777 /testFolder
hdfs dfs -cat /testFolder/text.txt
hdfs dfs -get hdfspath localpath
hdfs dfs -put localpath hdfspath
hdfs dfsadmin -report
hdfs fsck /

word example

hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar wordcount /input /output/
hadoop fs -cat /output/part-r-00000
cd output
hadoop fs -getmerge /hpda04-2.3-output/ out
cat out

map()

map(fun, <key1, val1>) -> list(<key2, val2>)
maps each input record to a list of key-value pairs; all elements in the list must have the same type

Shuffle

shuffle(list(<key2, val2>)) -> list(<key2, list(val2)>)

reduce

reduce(fun, list(<key2, list(val2)>)) -> list(val3)
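The three signatures can be tried out without a cluster. The following is a minimal single-process sketch in plain Python, not Hadoop's API; the helper names map_phase, shuffle_phase, reduce_phase and the word-count functions are made up for illustration.

```python
from collections import defaultdict


def map_phase(records, map_fn):
    """Apply map_fn to every <key1, val1> record and collect the emitted <key2, val2> pairs."""
    pairs = []
    for key, value in records:
        pairs.extend(map_fn(key, value))
    return pairs


def shuffle_phase(pairs):
    """Group the values by key: list(<key2, val2>) -> {key2: list(val2)}."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups, reduce_fn):
    """Apply reduce_fn to every <key2, list(val2)> group."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# Word count expressed with the three phases (hypothetical example functions).
def word_map(_line_no, line):
    return [(word, 1) for word in line.split()]


def word_reduce(_word, counts):
    return sum(counts)


records = [(0, "a b a"), (1, "b c")]
counts = reduce_phase(shuffle_phase(map_phase(records, word_map)), word_reduce)
print(counts)  # {'a': 2, 'b': 2, 'c': 1}
```

In real Hadoop the shuffle is done by the framework between the map and reduce tasks; here it is an ordinary dictionary so that the data flow stays visible.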

Limitation

1, multiple map() and reduce() stages must be specified and chained manually
2, intermediate results have to be written to HDFS instead of being kept in memory, so iterative algorithms are not very efficient with Hadoop

HDFS

description

Hadoop Distributed File System
NameNode vs DataNodes

1, high throughput rather than low latency
2, support for large files
3, computation runs locally on the node that holds the data, so less data is transferred between nodes
4, resilient design for hardware failures

install

tut link https://drive.google.com/drive/folders/1XdPbyAc9iWml0fPPNX91Yq3BRwkZAG2M

java
ssh localhost (create the key first with ssh-keygen -t rsa)
edit the 6 configuration files
./hdfs namenode -format

YARN

Yet Another Resource Negotiator
ResourceManager vs NodeManager
The ResourceManager avoids overutilization and underutilization of the cluster
The NodeManager executes tasks on its local resources

1, the client sends a request to the ResourceManager
2, the ResourceManager allocates a container on a NodeManager
3, the container on the NodeManager starts the ApplicationMaster
4, the ApplicationMaster requests resources from the ResourceManager
5, once the requested resources are allocated, the ApplicationMaster starts the application

Spark

3 node installation

GWDG deployment

floatip: .1 .10 .19
hostname: gwdg01 gwdg10 gwdg19
ip: .8 .5 .10
HDFS Namenode: NameNode SecondaryNameNode
HDFS DataNode: DataNode DataNode DataNode
YARN ResourceM: ResourceManager
YARN NodeM: NodeManager NodeManager NodeManager

description

Results do not need to be saved in HDFS; Spark supports in-memory execution.
Resilient Distributed Datasets (RDDs)
DataFrame from SparkSQL

scala

can be installed from a binary file
can be built from source
can be installed as an IDEA plugin
can come with a Spark installation

install

from source

this is a full ecosystem; I can build a cluster on my own, with embedded Scala

from pip

my Prof can also build an ecosystem from the pip download, with the config in the master: spark-submit --deploy-mode --master yarn test.py But I can't; I cannot even find the conf file in the pip package for pyspark. If you still want to construct a cluster, use a Spark installation from source, as follows

single master node configuration with

cat ~/Documents/spark/myown/test.py

print("hello world")

cd .../spark
./sbin/start-all.sh
curl localhost:8080    # the master web UI shows the spark-url of the master
./bin/spark-submit --master spark-url ./myown/test.py

test.py will be executed

./bin/pyspark --master spark-url

opens an interactive shell connected to the master

pyspark

cd spark
bin/spark-submit examples/src/main/python/wordcount.py testtext.txt &> output.txt

Big data lecture

Association Rule Mining

Transaction: T, one behavior that accomplishes several things (items); transaction instances are denoted t
Item: I, the smallest unit that can be done
Our task is to find the relationships between items

Support: the probability that an itemset occurs, $Support(IS) = \frac{|\{t \in T : IS \subseteq t\}|}{|T|}$. Every itemset IS whose support exceeds a chosen threshold is called a frequent itemset; how to set the threshold is up to the analyst.

Confidence $C(a \rightarrow b) = P(b \mid a) = \frac{P(ab)}{P(a)}$

the probability that b is also done if a is done

Support $S(a \rightarrow b) = P(a \cap b) = P(ab)$

a and b done together, relative to all transactions; this identifies special transaction cases

$X \rightarrow Y (C = c, S = s)$

Lift (practicability) $L(a \rightarrow b) = \frac{Confidence(a \rightarrow b)}{P(b)} = \frac{P(ab)}{P(a) P(b)}$

the impact of a on b being done
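The three measures can be checked by hand on a toy data set. This is a small sketch in plain Python; the transactions and the function names support, confidence and lift are invented for illustration.

```python
# Four hypothetical shopping transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(a, b, transactions):
    """P(b | a) = P(ab) / P(a)."""
    return support(set(a) | set(b), transactions) / support(a, transactions)


def lift(a, b, transactions):
    """Confidence(a -> b) / P(b) = P(ab) / (P(a) P(b))."""
    return confidence(a, b, transactions) / support(b, transactions)


s = support({"bread", "milk"}, transactions)      # 2 of 4 transactions
c = confidence({"bread"}, {"milk"}, transactions)  # (2/4) / (3/4)
l = lift({"bread"}, {"milk"}, transactions)        # (2/3) / (3/4)
print(s, c, l)
```

A lift below 1, as in this toy example, means that buying bread makes milk slightly less likely than its base rate.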

Apriori algorithm

1. (with support level S)

find the frequent itemsets (L)
every subset of a frequent itemset is also a frequent itemset
collect the total transaction set (T), and set the support level
find all $L_{1}$, the single items that satisfy S
find all $L_{2}$, built from all 2-item combinations, that satisfy S
... continue until no larger frequent itemset is left, which gives $L_{k}$

2. (with confidence C) find all subsets of $L_{k}$ that satisfy C.

Note: all operations in these 2 steps are evaluated on the whole transaction set
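Step 1 can be sketched as a level-wise search. The code below is a simplified illustration in plain Python, not an optimized implementation; the function name frequent_itemsets and the example transactions are assumptions.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise search: L_1, then L_2 built from L_1, and so on until nothing survives."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    result = {}
    k = 1
    while level:
        for itemset in level:
            result[itemset] = support(itemset)
        k += 1
        # Join step: unions of two frequent (k-1)-itemsets that contain exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        previous = set(level)
        level = [
            cand for cand in candidates
            if all(frozenset(sub) in previous for sub in combinations(cand, k - 1))
            and support(cand) >= min_support
        ]
    return result


transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]
frequent = frequent_itemsets(transactions, min_support=0.6)
for itemset, sup in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

With a threshold of 0.6 all single items and all pairs are frequent, while {a, b, c} appears in only 2 of 5 transactions and is dropped. Rules are then derived from the frequent itemsets in step 2 by checking the confidence.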

Data Exploration

Single feature: histogram, density, rug, Box-Whisker
Box-Whisker: the range from the lower quartile to the upper quartile is the interquartile range (IQR)

lower boundary: lower quartile - 1.5 * IQR
upper boundary: upper quartile + 1.5 * IQR
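The whisker boundaries can be computed directly with NumPy. This sketch assumes the common Tukey convention of 1.5 times the IQR, exposed as the parameter k; the function name whisker_bounds and the sample data are made up.

```python
import numpy as np


def whisker_bounds(values, k=1.5):
    """Return (lower, upper) boundary: quartiles widened by k * IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
low, high = whisker_bounds(data)
# Points outside the boundaries are drawn as individual outliers in a box plot.
outliers = data[(data < low) | (data > high)]
print(low, high, outliers)
```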

pair-wise scatterplot

hexbin plot

correlation heatmap

Time Series Analysis

Description

Discrete values $\{x_{1}, ..., x_{T}\} = (x_{t})_{t = 1}^{T}$. A core assumption is that the time difference between $x_{t}$ and $x_{t + 1}$ is equal for all $t \in (1 ... T)$. $x_{t}$ can be decomposed into 3 components:

1. trend component T: the change over all time
2. seasonality S: the result of seasonal effects
3. autocorrelation R: how the values depend on prior values

so $x_{t} = T_{t} + S_{t} + R_{t}$

Box-Jenkins for stationary

stationary

The time series is stationary: the mean and variance of the trend and seasonality are constant and can be removed, so the remaining autocorrelation is a stochastic process.
Trend and Seasonal Effects
1. model the trend $T_{t}$ on the time series $x_{t}$
2. detrended time series $\hat{x}_{t} = x_{t} - T_{t}$
3. model the seasonality $S_{t}$ on the time series $x_{t}$
4. get the seasonality adjusted time series $\hat{\hat{x}}_{t} = \hat{x}_{t} - S_{t}$
Regression and Seasonal Means

In this context we only use linear regression to fit the whole time series, which gives $\hat{x}_{t}$.

Then subtract the seasonal means: $\hat{\hat{x}}_{t} = \hat{x}_{t} - (Mean(\hat{x}_{t'}) - Mean(\hat{x}_{t}))$. $Mean(\hat{x}_{t'})$ is taken only over the time points $t'$ with the same mod(t, s), i.e. in the recurring seasonal effect only elements of the same time slot are averaged; this matters if the last season is not complete.

Cons: only works for linear trends and for seasonal effects that have no trend.

Differencing for non-stationary series

For $y = f(x)$ and two points ($x_{1}, y_{1}$) and ($x_{2}, y_{2}$), the difference between the points approximates the slope. Applied to a time series, the first-order difference gives the detrended time series: $\hat{x}_{t} = \Delta^{1} x_{t} = x_{t} - x_{t - 1}$. If needed, the second-order difference is $\Delta^{2} x_{t} = \Delta^{1} x_{t} - \Delta^{1} x_{t - 1} = x_{t} - 2 x_{t - 1} + x_{t - 2}$

Using differences to adjust the seasonal effect: take the difference between two points at the same position in consecutive seasons. $\hat{\hat{x}}_{t} = \Delta_{s} \hat{x}_{t} = \hat{x}_{t} - \hat{x}_{t - s}$

Pro: it can deal with changes in the mean as well as changes in the movement of the mean
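The difference operators are short enough to try on a synthetic series. The sketch below uses NumPy; the helper name difference, the linear trend and the period s = 4 are assumptions chosen so that the result is easy to verify.

```python
import numpy as np


def difference(x, lag=1):
    """Return x_t - x_{t-lag}; the first `lag` points have no predecessor and are dropped."""
    x = np.asarray(x, dtype=float)
    return x[lag:] - x[:-lag]


t = np.arange(24)
season = np.tile([0.0, 2.0, 0.0, -2.0], 6)  # seasonal pattern with period s = 4
x = 0.5 * t + season                        # linear trend plus seasonality

detrended = difference(x, lag=1)            # first-order difference removes the linear trend
second = difference(detrended, lag=1)       # second-order difference
adjusted = difference(detrended, lag=4)     # seasonal difference with s = 4
print(adjusted)                             # all zeros: trend and season are gone
```

Each differencing step shortens the series by the lag, which is the price for not having to fit a trend or seasonal model first.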

Correlation

Autocorrelation is the direct relationship between the values of the time series at different points in time, e.g. for two adjacent points

Partial autocorrelation is the autocorrelation without the carryover, i.e., only the direct correlation between two points that are not adjacent

With autocorrelation and partial autocorrelation we can see the residual seasonal effect left after regression and seasonal means

ARIMA

three ways to model correlation

AR: autoregressive

models the direct influence of the past p points on the time series: $x_{t} = c + \epsilon_{t} + \sum_{i = 1}^{p} a_{i} x_{t - i}$
c: constant over all time
$\epsilon_{t}$: white noise, normally distributed with a mean of 0
MA: moving average

models the random effect on the time series: $x_{t} = c + \epsilon_{t} + \sum_{j = 1}^{q} b_{j} \epsilon_{t - j}$
the difference to AR is that the random influence of the past noise acts on the next value
ARMA: autoregressive and moving average

$x_{t} = c + \epsilon_{t} + \sum_{i = 1}^{p} a_{i} x_{t - i} + \sum_{j = 1}^{q} b_{j} \epsilon_{t - j}$
select p and q

Partial autocorrelation estimates p for AR. p should be large enough to cover a whole season, but if p is too big it can lead to overfitting.

Autocorrelation estimates q for MA: we look at the lag at which the autocorrelation goes towards zero and use this lag for q. At the same time, the effect of the AR part should also be taken into account when determining q.

Text mining

Preprocessing

Creation of a Corpus

contains all texts to analyze
remove the irrelevant content,

links, timestamps
Punctuation and Cases

remove all punctuation and convert everything to lower case; acronyms are a problem here
Stop words

common words should be removed, such as I, to, a
Stemming and Lemmatization

first lemmatization, and then stemming

Visualization

bag-of-words with wordclouds
Term frequency (TF)

is the count of a word within a document
Inverse Document Frequency (IDF)

weights words by their uniqueness within the corpus: $IDF_{t} = \log \frac{N}{D_{t}}$
t: word (term)
N: the number of documents in the corpus
$D_{t}$: the number of documents in the corpus that contain word t
TFIDF

$TFIDF = TF \cdot IDF$
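TF, IDF and their product can be computed on a tiny corpus with the standard library alone. This is a sketch under assumptions: the three example documents are invented, the texts are already tokenized and lower-cased, and the natural logarithm is used because the notes do not fix the base.

```python
import math
from collections import Counter

# Hypothetical corpus of three preprocessed documents.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "a cat and a dog".split(),
]


def tf(doc):
    """Term frequency: raw count of each word within one document."""
    return Counter(doc)


def idf(docs):
    """Inverse document frequency: log(N / D_t) for every word t in the corpus."""
    n = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return {word: math.log(n / d) for word, d in doc_freq.items()}


idf_scores = idf(docs)
tfidf = [{w: count * idf_scores[w] for w, count in tf(doc).items()} for doc in docs]
print(tfidf[0])
```

In the first document "mat" gets the highest IDF because it occurs in no other document, while "the" is frequent in the document but is weighted down because it also appears elsewhere.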
beyond the bag-of-words

ignores the structure of the document
ignores the similarity of words

challenges

dimensionality
Ambiguities

Sensor Fusion lecture

Sensor data processing

Sensor errors
Precision: stochastic
Trueness: systematic

concepts

competitive: many sensors for the same place, for higher accuracy
complementary: many sensors for many places, for higher completeness
dead reckoning: errors accumulate over previous knowledge

$y = H x + e$

The measurement equation projects the state onto the measurement space
y: measurement
x: state
H: measurement matrix
e: measurement error

Jacobian matrix: first-order derivatives

Hessian matrix: second-order derivatives

Matrix derivatives

$\frac{\partial}{\partial x} c^{T} x = \frac{\partial}{\partial x} x^{T} c = c$
$\frac{\partial}{\partial x} x^{T} A x = 2 A x$ (for symmetric A; in general $(A + A^{T}) x$)
$\frac{\partial}{\partial x} A x = \frac{\partial}{\partial x} x^{T} A = A$
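The derivative rules for the scalar-valued expressions can be checked numerically with central finite differences. This is only a sanity check under assumptions: the helper numerical_gradient, the random test vectors and the tolerance are made up, and the quadratic form is tested both for a symmetric matrix S and a general matrix A.

```python
import numpy as np


def numerical_gradient(f, x, eps=1e-6):
    """Central finite-difference approximation of the gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad


rng = np.random.default_rng(0)
c = rng.normal(size=3)
x = rng.normal(size=3)
A = rng.normal(size=(3, 3))   # general, not symmetric
S = A + A.T                   # symmetric

grad_linear = numerical_gradient(lambda v: c @ v, x)
grad_quad_sym = numerical_gradient(lambda v: v @ S @ v, x)
grad_quad_gen = numerical_gradient(lambda v: v @ A @ v, x)

print(np.allclose(grad_linear, c, atol=1e-5))                # d/dx c^T x = c
print(np.allclose(grad_quad_sym, 2 * S @ x, atol=1e-5))      # d/dx x^T S x = 2 S x
print(np.allclose(grad_quad_gen, (A + A.T) @ x, atol=1e-5))  # general case
```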

data analysis code demo

statistical methods

from scipy import stats
from scipy.stats import norm
import numpy as np
import scipy as sp
print(sp.stats.t.ppf(0.95,6))
print(norm.cdf([-1,0,1]))
print(norm.cdf(np.array([-1,0,1])))
print(norm.mean(), norm.std(), norm.var() )
print(norm.pdf(0))
print(norm.cdf(1.96))
print(norm.ppf(0.975))
print(norm.cdf(1))
print(norm.ppf(0.841344746090))
print(norm.sf(1-norm.cdf(1)))
print(norm.ppf(0.9))
print(stats.t.ppf(0.975,3))

determining the confidence interval

import numpy as np
import scipy as sp
import scipy.stats

# 200 samples of the constant 8 with small Gaussian noise
b = 8 + np.random.normal(0, 0.05, 200)


def t_statistic(data, confidence):
    # mean and t-based confidence interval via the inverse survival function
    m, se = np.mean(data), sp.stats.sem(data)
    h = se*sp.stats.t.isf((1-confidence)/2. , df = (len(data)-1) )
    return m, m-h, m+h
print("For the given data set, the mean and its 95% confidence interval are:", t_statistic(b,0.95))

def mean_confidence_interval(data, confidence):
    # same interval via the percent point function: ppf((1+c)/2) equals isf((1-c)/2)
    m, se = np.mean(data), sp.stats.sem(data)
    h = se*sp.stats.t.ppf((1+confidence)/2.,len(data)-1)
    return m, m-h, m+h
print("The same mean and 95% confidence interval computed with ppf:", mean_confidence_interval(b, 0.95))

confidence intervals at several confidence levels with the t distribution

import numpy as np
# import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt

np.random.seed(3)
MU = 64
sigma = 5
size = 10
heights = np.random.normal(MU, sigma, size)
print("according to the mean and deviation we have an example of 10 random numbers: ", heights)

mean_heights = np.mean(heights)
deviation_heights = np.std(heights, ddof=1)   # sample standard deviation
SE = deviation_heights/np.sqrt(size)          # standard error of the mean

print('99% confidence interval is :', stats.t.interval(0.99, df = size-1 , loc = mean_heights, scale=SE))
print('90% confidence interval is :', stats.t.interval(0.90, df = size-1 , loc = mean_heights, scale=SE))
print('80% confidence interval is :', stats.t.interval(0.80, df = size-1 , loc = mean_heights, scale=SE))

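a complete plotted distribution with its confidence interval

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

MU, sigma = 64, 5   # same distribution parameters as in the previous example
sample_size = 1000
heights = np.random.normal(MU, sigma, sample_size)
SE = np.std(heights)/np.sqrt(sample_size)
# 95% confidence interval of the mean, using the normal distribution
(l,u) = stats.norm.interval(0.95, loc = np.mean(heights), scale = SE)
print(l,u)
plt.hist(heights, bins = 20)
y_height = 5
plt.plot([l,u], [y_height, y_height], '_', color='r')
plt.plot(np.mean(heights), y_height, 'o', color= 'b')
plt.show()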

a plotted normal distribution with the central 95% region filled

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

x = np.linspace(-5,5,100)
y = stats.norm.pdf(x,0,1)
plt.plot(x,y)
# dashed lines at the 2.5% and 97.5% quantiles
plt.vlines(-1.96,0,1,colors='r',linestyles='dashed')
plt.vlines(1.96,0,1,colors='r',linestyles='dashed')
fill_x = np.linspace(-1.96,1.96,500)
fill_y = stats.norm.pdf(fill_x, 0,1)
plt.fill_between(fill_x,fill_y)
plt.show()

an example from the internet

import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats as ss
data_url = "https://raw.githubusercontent.com/alstat/Analysis-with-Programming/master/2014/Python/Numerical-Descriptions-of-the-Data/data.csv"
df = pd.read_csv(data_url)
print(df.describe())
# the pandas option display.mpl_style no longer exists, so the default style is used
df.plot(kind = 'box')
plt.show()

1st, 2nd, 3rd order and Gauss fitting

import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize
from scipy.optimize import curve_fit


def f_1_degree(x,A,B):
    return A*x + B

def f_2_degree(x,A,B,C):
    return A*x**2 + B*x + C

def f_3_degree(x,A,B,C,D):
    return A*x**3 + B*x**2 + C*x + D


def f_gauss(x,A,B,sigma):
    return A*np.exp(-(x-B)**2/(2*sigma**2))

def plot_figure():
    plt.figure()

    x0 = [1,2,3,4,5]
    y0 = [1,3,8,18,36]

    #plot original data
    plt.scatter(x0,y0,25,"red")

    # plot f1
    params_1, pcovariance_1 = optimize.curve_fit(f_1_degree,x0,y0)
    x1 = np.arange(0,6,0.01)
    y1 = params_1[0]*x1+params_1[1]
    plt.plot(x1,y1,"blue")
    # params_1[0] is the slope A, params_1[1] is the intercept B
    print("The linear fit for the data is : y = ",params_1[0],"*x + ",params_1[1])
    print("The params uncertainties are:")
    print("a =", params_1[0], "+/-", round(pcovariance_1[0,0]**0.5,3))
    print("b =", params_1[1], "+/-", round(pcovariance_1[1,1]**0.5,3))


    #plot f2
    params_2, pcovariance_2 = curve_fit(f_2_degree,x0,y0)
    x2 = np.arange(0,6,0.01)
    y2 = params_2[0]*x2**2+params_2[1]*x2 + params_2[2]
    plt.plot(x2,y2,"green")
    print("The second order curve fit for the data is : y = " ,params_2[0],"*x² + " ,params_2[1],"*x + ",params_2[2])
    print("The params uncertainties are:")
    print("a =", params_2[0], "+/-", round(pcovariance_2[0,0]**0.5,3))
    print("b =", params_2[1], "+/-", round(pcovariance_2[1,1]**0.5,3))
    print("c =", params_2[2], "+/-", round(pcovariance_2[2,2]**0.5,3))

    #plot f3
    params_3, pcovariance_3 = curve_fit(f_3_degree,x0,y0)
    x3 = np.arange(0,6,0.01)
    y3 = params_3[0]*x3**3+params_3[1]*x3**2 + params_3[2]*x3 + params_3[3]
    plt.plot(x3,y3,"purple")
    print("The third order curve fit for the data is : y =",params_3[0],"*x³ +",params_3[1],"*x² + " ,params_3[2],"*x + ",params_3[3])
    print("The params uncertainties are:")
    print("a =", params_3[0], "+/-", round(pcovariance_3[0,0]**0.5,3))
    print("b =", params_3[1], "+/-", round(pcovariance_3[1,1]**0.5,3))
    print("c =", params_3[2], "+/-", round(pcovariance_3[2,2]**0.5,3))
    print("d =", params_3[3], "+/-", round(pcovariance_3[3,3]**0.5,3))

    #plot gauss
    params_gauss, pcovariance_gauss = curve_fit(f_gauss,x0,y0)
    xgauss = np.arange(0,6,0.01)
    ygauss = f_gauss(xgauss, *params_gauss)
    plt.plot(xgauss,ygauss,"black")
    # params_gauss holds the amplitude A, the mean B and sigma, in this order
    print("The Gauss function fit for the data is : y = ",params_gauss[0],"*exp(-(x-",params_gauss[1],")²/(2*",params_gauss[2],"²))")
    print("The params uncertainties are:")
    print("a =", params_gauss[0], "+/-", round(pcovariance_gauss[0,0]**0.5,3))
    print("mean =", params_gauss[1], "+/-", round(pcovariance_gauss[1,1]**0.5,3))
    print("std =", params_gauss[2], "+/-", round(pcovariance_gauss[2,2]**0.5,3))



    plt.title("plot of the different fits")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.show()
    return

plot_figure()

linear fitting

# matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# 1. x-axis coordinates of the 10 data points
xmin, xmax, pts = 0.01, 2, 10
xx = np.linspace(xmin, xmax, pts)

# 2. y-axis coordinates of the 10 data points
rho = np.sqrt(1/xx) + 0.5*np.exp(xx)*xx**2

# plot the original data
plt.plot(xx, rho, 'bo', label='Original data')

# 3. x-axis coordinates of the 200 points used to draw the fit
x_fine = np.linspace(xmin, xmax, 200)

# fitting: first-order polynomial; cov holds the parameter covariance
params, cov = np.polyfit(xx, rho, 1, cov=True)

# reconstruct the linear function from slope and intercept
bestfit_rho = params[0]*x_fine + params[1]
plt.plot(x_fine, bestfit_rho, 'r-', lw=2, label='First-order linear fit')

print(params)
plt.legend()
plt.show()

polynomial fitting with polyfit

# matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# 1. x-axis coordinates of the 10 data points
xmin, xmax, pts = 0.01, 2, 10
xx = np.linspace(xmin, xmax, pts)

# 2. y-axis coordinates of the 10 data points
rho = np.sqrt(1/xx) + 0.5*np.exp(xx)*xx**2

# plot the original data
plt.plot(xx, rho, 'bo', label='Original data')

# 3. x-axis coordinates of the 200 points used to draw the fit
x_fine = np.linspace(xmin, xmax, 200)

# fitting: the polynomial can be of any order, here 4
params, cov = np.polyfit(xx, rho, 4, cov=True)
p = np.poly1d(params)
plt.plot(x_fine, p(x_fine), 'g-', lw=2, label='The best poly1d fit')

print(params)
plt.xlabel('$x$')
plt.ylabel(r'$\rho$')
plt.legend(fontsize=13)
plt.show()

High performance Data Analysis lecture

concepts

High performance Data Analysis: uses parallel processing to quickly find insights in extremely large data sets

Chap01 overview

Distributed System

Definition:

Components are located separately
communication through message passing between components

Characteristics:

own memory
concurrency
locks

Application:

cloud computing
Internet of Things

Algorithm:

Consensus, Replication

Challenges:

Programming
resource sharing

Levels of parallelism

Bit-level, Instruction level, Data level, Task level

Name typical applications for high-performance data analytics

weather forecast
Simulating nuclear fusion, tokamak reactor

Distinguish HPDA from D/P/S computing and how these topics blend

Stricter than distributed systems (strong scaling vs. weak scaling)

Describe use-cases and challenges in the domain of D/P/S computing

Recommendation engine

Describe how the scientific method relies on D/P/S computing

Simulation models real systems to gain new insight Big Data Analytics extracts insight from data

Name big data challenges and the typical workflow

how to deal with big data (5Vs)
Raw -> Descriptive -> Diagnostic -> Predictive -> Prescriptive

Recite system characteristics for distributed/parallel/computational science

Sketch generic D/P system architectures

Chap02 DataModels & Data Processing Strategies

Define important terminology for data handling and data processing

Raw data, semantic normalization, Data management plan, Data life cycle, data governance, data provenance...

Sketch the ETL process used in data warehouses

extract from a source database; transform with checks and treatment of errors and missing values, and change the layout to fit loading; load (integrate) the data into the data warehouse for the user

Sketch a typical HPDA data analysis workflow

classical: discovery, integration, exploitation; at a high level, with SQL, Java, Scala, Python, and with parallelism for data exploration

Sketch the lambda architecture

Lambda architecture is a concept for enabling real-time processing and batch methods together: batch layer (large scale) + serving layer + speed layer (real time)

Construct suitable data models for a given use-case and discuss their pro/cons

Define relevant semantics for data

data models

Concurrency, Durability, Consensus,

relational model
Columnar model (combined relational model) (HBase)
key-value model (BigTable)
Documents model (MongoDB)
Graph

Chap03 Databases and DataWarehouses

relational model

Cardinality
- one to one
- one to many
- many to many
Normalization Form

reduces dependencies, prevents inconsistency, saves space
- 1NF: no collections in row tuples
- 2NF: no redundancy (entities of many-to-many relations are stored in separate tables)
- 3NF: no dependence between columns
- 4NF: no multiple relationships in one table (not good for big data)
group by

it is done together with aggregation (in SQL or in Python, both)
join

- cross join: Cartesian product of two tables
- natural join: all combinations that are equal on their common attributes
- inner join: only rows for which the join condition is satisfied
- left join: all rows of the left table, matched with right rows where the condition holds
- right join: all rows of the right table, matched with left rows where the condition holds
- full join: all rows of both tables
Transactions

ACID
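The difference between the join types listed under join is easiest to see on two tiny tables. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the tables author and book and their rows are invented for illustration, and only cross, inner and left joins are shown because older SQLite versions lack right and full joins.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table author (id int, name text);
    create table book (author_id int, title text);
    insert into author values (1, 'Ann'), (2, 'Bob');
    insert into book values (1, 'SQL Basics'), (3, 'Orphan');
""")

# inner join: only pairs that satisfy the join condition
inner = con.execute(
    "select name, title from author inner join book on author.id = book.author_id"
).fetchall()

# left join: every author is kept, missing books show up as NULL (None)
left = con.execute(
    "select name, title from author left join book on author.id = book.author_id "
    "order by author.id"
).fetchall()

# cross join: Cartesian product, 2 authors x 2 books
cross = con.execute("select count(*) from author cross join book").fetchone()[0]

print(inner)  # [('Ann', 'SQL Basics')]
print(left)   # [('Ann', 'SQL Basics'), ('Bob', None)]
print(cross)  # 4
```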

Define Database, DBMS, and Data Warehouse

Database: an organized collection of data
DBMS: a software application that lets users work with the collected data
Data Warehouse: a system used for reporting and data analysis, with a multidimensional data cube

Create a relational model for a given problem

Draw an ER(Entity Relational) diagram for a given relational model (and vice versa)

Normalize a small relational model into a redundant-free model

List the result of an inner join of two tables to resolve relationships

Formulate SQL queries for a relational model

Create a Star-Schema from a relational model (and formulate queries)

Sketch the operations for an OLAP cube

Slice
Dice
Roll up
Pivot

Appraise the pro/cons of OLAP vs. traditional relational model

Star-Schema pro: simplification of queries and performance gain, emulates the OLAP cube
Star-Schema cons: data integrity is not guaranteed, no natural support of many-to-many relations

Describe DBMS optimizations: index, bulk loading, garbage cleaning

Chap04 Distributed Storage and Processing with Hadoop

hadoop

map: filters and converts all input into key-value tuples
reduce: receives all tuples with the same key and accumulates them

Describe the architecture and features of Apache Hadoop

HDFS and MapReduce execution engine
High availability,
automatic recovery
Replication of data
Parallel file access
Hierarchical namespace
Rack-awareness

Formulate simple algorithms using the MapReduce programming model

Justify architectural decisions made in Apache Hadoop

Sketch the execution phases of MapReduce and describe their behavior

distributed code
determine files
map
combine
shuffle
partition
reduce
output

Describe limitations of Hadoop1 and the benefits of Hadoop2 with TEZ

Allow modelling and execution of data processing logic
Reconfigure dataflow graph based on data sizes and target load
Controlled by vertex management modules
Task and resource aware scheduling
Pre-launch and re-use containers and caching intermediate results
Everyone has to wait for the process between mapping and reducing

Sketch the parallel file access performed by MapReduce jobs

Chap05 Big Data SQL using Hive

Compare the execution model of SQL in an RDBMS with Hive

Table: Like in relational databases with a schema
Partitions: table key determining the mapping to directories
Buckets/Clusters: Data of partitions are mapped into files

Justify the features of the ORC format(Optimized Row Columnar)

Light-weight index stored within the file
Compression based on data type
Concurrent reads of the same file
Split files without scanning for markers
Support for adding/removal of fields
Partial support of table updates
Partial ACID support (if requested by users)

Apply a bloom filter on example data

Identifies whether an element is a member of a set with n elements
Allows false positives but not false negatives
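A Bloom filter can be sketched in a few lines. The version below is a simplified illustration, not the filter used inside ORC; the class name BloomFilter, the bit array size, the number of hash functions and the use of seeded SHA-256 digests are all assumptions.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive num_hashes bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True means "possibly in the set", False means "definitely not in the set".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
for word in ["hadoop", "hive", "spark"]:
    bf.add(word)

print(bf.might_contain("hive"))   # True: added elements are always found
print(bf.might_contain("storm"))  # usually False, but a false positive is possible
```

Because bits are only ever set and never cleared, an added element can never be reported as missing, which is why false negatives cannot occur.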

Describe how tables are generally mapped to the file system hierarchy and optimizations

Describe how data sampling can be optimized via the mapping of tables on HDFS

Sketch the mapping of a (simple) SQL query to a MapReduce job

Chap06

Create a Columnar Data Model (for HBase) for a given use case

Justify the reasons and implications behind the HBase storage format

medium-size object,
stored by row key,
cell data is kept in store files on HDFS,
Encoding can optimize storage space
- row keys and date
- column family
- Reading data

Describe how HBase interacts with Hive and Hadoop

Describe the features and namespace handling in Zookeeper

Create a Document Data Model (for MongoDB) for a given use case

Provide example data (JSON) for the MongoDB data model and the queries

Sketch the mapping of keys to servers in MongoDB and HBase

Select and justify a suitable shard key for a simple use case

Chap07

Define in-memory processing

Processing of data stored in memory

Data will fit in memory
Additional persistency is required
Fault-tolerance is mandatory

Describe the basic data model of Apache Spark and the SQL extension

It is based on RDDs (Resilient Distributed Datasets), which are immutable tuples. Computation is programmed by transformations with lazy evaluation: all computation is deferred until needed by actions

Program a simple data flow algorithm using Spark RDDs

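from numpy import arange
nums = sc.parallelize(arange(1, 100000))
r1 = nums.filter(lambda x: (x % 2) == 1)
r1 = r1.map(lambda x: (x, x**2))
# tuples cannot be multiplied with *, so the reduction sums them element-wise here
r1 = r1.reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]))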

Sketch the architecture of Spark and the roles of its components

Transformations: map, filter, union, pipe, groupByKey, join
Actions: reduce, count, take, first
Shuffle: repartition

Describe the execution of a simple program on the Spark architecture

Chap08

Define stream processing and its basic concepts

Application for real-time, continuous stream computation on high-velocity data
Stream groupings define how tuples are transferred

Describe the parallel execution of a Storm topology

the graph of the calculation is represented as a network; the parallelism (tasks) is statically defined for a topology

Illustrate how the at-least-once processing semantics is achieved via tuple tracking

a tuple may be executed multiple times; if an error occurs, the tuple is replayed from the Spout

each tuple has a tuple ID
the Acker tracks tuple IDs in a hash map
at each step the Acker XORs the IDs of all derived tuples; when the value returns to 0 the tuple tree is fully processed, otherwise (on timeout) the tuple is replayed from the Spout
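The XOR bookkeeping can be simulated in a few lines. This is a toy model in plain Python, not Storm's API: the class Acker, its methods and the ID generator are hypothetical, and the sketch assumes that every tuple ID is XORed in once when the tuple is emitted and once when it is acknowledged, so that a running value of 0 means the whole tuple tree has been processed.

```python
import random

rng = random.Random(42)  # seeded so that the run is reproducible


def new_id():
    """Random 64-bit tuple ID."""
    return rng.getrandbits(64)


class Acker:
    def __init__(self):
        self.pending = {}  # root tuple ID -> running XOR of emitted and acked IDs

    def track(self, root_id, tuple_id):
        """Called once when a tuple is emitted and once when it is acknowledged."""
        self.pending[root_id] = self.pending.get(root_id, 0) ^ tuple_id

    def is_complete(self, root_id):
        return self.pending.get(root_id) == 0


acker = Acker()
root = new_id()
acker.track(root, root)              # spout emits the root tuple

child_a, child_b = new_id(), new_id()
acker.track(root, child_a)           # bolt emits two derived tuples ...
acker.track(root, child_b)
acker.track(root, root)              # ... and acknowledges the root tuple
partially_done = acker.is_complete(root)

acker.track(root, child_a)           # downstream bolts acknowledge both children
acker.track(root, child_b)
fully_done = acker.is_complete(root)

print(partially_done, fully_done)    # False True
```

The attraction of the scheme is constant memory per root tuple: one 64-bit value, no matter how many tuples are derived from it. If the value has not reached 0 before a timeout, the spout replays the root tuple, which gives at-least-once semantics.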

Describe alternatives for obtaining exactly-once semantics and their challenges

each tuple is executed exactly once,
provide idempotent operations
Execute tuples strongly ordered to avoid replicated execution
Use Storm's transactional topology (processing phase, commit phase [strong ordering])

Sketch how a data flow could be parallelized and distributed across CPU nodes on an example

Chap09

Chap10

List example problems for distributed systems

Reliable broadcast, Atomic commit, Consensus, Leader election, Replication

Sketch the algorithms for two-phase commit

Prepare phase, Commit phase
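The two phases can be sketched with in-memory objects. This is a simplified model that ignores timeouts, crashes and logging; the class Participant and the function two_phase_commit are made up for illustration.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        """Vote yes (True) or no (False) on the transaction."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1 (prepare): the coordinator asks every participant to vote.
    votes = [p.prepare() for p in participants]
    # Phase 2 (commit): commit only if all voted yes, otherwise abort everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False


ok = two_phase_commit([Participant("db1"), Participant("db2")])
failed_group = [Participant("db1"), Participant("db2", can_commit=False)]
failed = two_phase_commit(failed_group)
print(ok, failed)  # True False
```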

consistent hashing

manages key/value data in a distributed system; provides load balancing and fault tolerance
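A hash ring with virtual nodes shows both properties. The following is a minimal sketch, assuming SHA-256 as hash function and 100 virtual nodes per server; the class HashRing and the key names are hypothetical, and the host names are reused from the cluster notes only as labels.

```python
import bisect
import hashlib


def _hash(key):
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, replicas=100):
        self.replicas = replicas  # virtual nodes per server, for better load balancing
        self._ring = []           # sorted list of (hash, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.replicas):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def lookup(self, key):
        """Return the first node clockwise from the key's position on the ring."""
        idx = bisect.bisect(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["gwdg01", "gwdg10", "gwdg19"])
keys = [f"key-{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}

ring.remove("gwdg19")  # simulate a node failure
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(len(moved), "of", len(keys), "keys moved")
```

Only the keys that lived on the removed node are reassigned; all other keys keep their server, which is the fault-tolerance benefit over a plain `hash(key) % num_servers` scheme.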

Discuss semantics when designing distributed systems

Consistency (atomicity, visibility, isolation)
Availability (robustness, scalability, partition)
Durability

Discuss limitations when designing distributed systems

CAP (Consistency, Availability, Partition tolerance): all three cannot be met together in a distributed system

Explain the meaning of the CAP-theorem

Sketch the 3-tier architecture

Presentation, Application processing, Data management

Design systems using the RESTful architecture

Simplicity of the interface, Portability, Cacheable, Traceable

Describing relevant performance factors for HPDA

Time, cost, energy, productivity

Listing peak performance of relevant components

Computation, Communication, Input/Output devices

Assessing /Judging observed application performance

Estimate the workload
Compute the workload throughput per node, W
Compute the hardware capabilities P

E = W / P

Chap11

Sketching the visual analytics workflow

Listing optical illusions

Color, Size & Shape, Moving, Interpretation of objects

Listing 5 goals of graphical displays

show the data
induce the viewer to think about the substance
present many numbers in a small space
make large data sets coherent
serve a reasonably clear purpose
be closely integrated with the statistical

Discuss the 4 guidelines for designing graphics on examples

Use the right visualization for data types
Use building blocks for graphics (known plot styles)
Reduce information to the essential part to be communicated
Consistent use of building blocks and themes (retinal properties)

Describe the challenges when analyzing data

large data volumes and velocities
complex system and storage topologies
understanding the system behavior is difficult
data movement of memory and CPU is costly

Discuss the benefit of in-situ and in-transit data analysis

in-situ: analyze results while the application is still running
in-transit: analyze data while it is on the IO path
interact with application while it runs

Chap12

Sketch a typical I/O stack

Develop a NetCDF data model for a given use case

Compare the performance of different storage media

Sketch application types and access patterns

Justify the use for I/O benchmarks

Can use a simple/understandable sequence of operations
May use a pattern like a realistic workload
Sometimes the only possibility to understand hardware capabilities

Describe an I/O performance optimization technique

Read-ahead, write-behind, async-IO

Describe a strategy for trustworthy benchmark result

single-shot: acceptance test
periodically: regression test

03-01

drop table if exists WikipediaArticles ;
create table WikipediaArticles (
id int,
title varchar(50), 
text varchar(50),
category varchar(50),
link int
) ;
\d wikipediaarticles;

drop table if exists linkarticles ;
    create table linkarticles (
    id int,
    linked int
 ) ;

delete from wikipediaarticles where id = 1;
insert into WikipediaArticles (id, title, text, category, link) values (1, 'math', 'mathematics and nature and nature', 'nature', 1) ;
delete from wikipediaarticles where id = 2;
insert into WikipediaArticles (id, title, text, category, link) values (2, 'phy', 'physics', 'nature', 2) ;
delete from wikipediaarticles where id = 3;
insert into WikipediaArticles (id, title, text, category, link) values (3, 'chemie', 'chemistry', 'science', 3) ;
delete from wikipediaarticles where id = 4;
insert into WikipediaArticles (id, title, text, category, link) values (4, 'bio', 'biology', 'science', 4) ;
select * from wikipediaarticles ;

delete from linkarticles where id = 1;
insert into Linkarticles (id, linked) values (1, 2) ;
insert into Linkarticles (id, linked) values (1, 3) ;
delete from linkarticles where id = 2;
insert into Linkarticles (id, linked) values (2, 3) ;
delete from linkarticles where id = 3;
insert into Linkarticles (id, linked) values (3, 4) ;
delete from linkarticles where id = 4;
insert into Linkarticles (id, linked) values (4, 1) ;
select * from linkarticles ;

select * from wikipediaarticles where title = 'phy';

select * from wikipediaarticles where id in
 (select linked from linkarticles where id in
  (select id from wikipediaarticles where title = 'math')
);

select count(*) , linked from linkarticles group by linked;

select unnest(string_to_array('this is is is a test', ' '));

select id, unnest(string_to_array(text, ' ')) as word, count(*) from WikipediaArticles group by id, word;

select * from wikipediaarticles where category = 'science';
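
The per-article word count query above can be cross-checked in plain Python. This is a minimal sketch, assuming the rows are held in memory as (id, text) tuples copied from the insert statements; `word_counts` is a hypothetical helper name, not part of the exercise.

```python
from collections import Counter

# Rows copied from the INSERT statements above: (id, text).
articles = [
    (1, "mathematics and nature and nature"),
    (2, "physics"),
    (3, "chemistry"),
    (4, "biology"),
]


def word_counts(rows):
    """Count (id, word) pairs, similar to GROUP BY id, word over unnest(...)."""
    counts = Counter()
    for article_id, text in rows:
        for word in text.split(" "):
            counts[(article_id, word)] += 1
    return counts


counts = word_counts(articles)
for (article_id, word), n in sorted(counts.items()):
    print(article_id, word, n)
```

The key is the pair (id, word), which plays the role of the two GROUP BY columns.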

03-02

digraph diagramm {
  WikipediaArticles  -> id
  WikipediaArticles  -> Title
  WikipediaArticles  -> Text
  WikipediaArticles  -> Category
  WikipediaArticles  -> Links
  Links  -> linkarticles
  linkarticles -> lid
  linkarticles -> linked
}

04-01

own mapper and reducer

def mapper(key, value):
  words = key.split()
  for word in words:
    Wmr.emit(word, 1)

# Variant: count stems and lemmas; stem() and lemmatize() are assumed helpers.
def mapper(key, value):
  words = key.split()
  for word in words:
    Wmr.emit("s:" + stem(word), 1)
  for word in words:
    Wmr.emit("l:" + lemmatize(word), 1)


def reducer(key, values):
  count = 0
  for value in values:
    count += int(value)
  # Emit once, after all values for this key have been summed.
  Wmr.emit(key, count)
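
To try the mapper and reducer without a MapReduce service, the map, shuffle, and reduce steps can be simulated locally. This is only a sketch under assumptions: `Wmr.emit` is replaced by `yield` and `return`, and `run_local` is a hypothetical driver, not part of any framework.

```python
from collections import defaultdict


def mapper(line):
    """Yield (word, 1) for every whitespace-separated word."""
    for word in line.split():
        yield word, 1


def reducer(key, values):
    """Sum all counts collected for one key."""
    return key, sum(values)


def run_local(lines):
    """Map every line, group the pairs by key (shuffle), then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


result = run_local(["this is is a test", "a test"])
print(result)
```

The grouping dictionary stands in for the shuffle phase, which a real framework performs between the map and reduce tasks.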

sql

cat ~/Documents/hpda0404.csv

drop table if exists hpda0401 ;

create table hpda0401 (
num int,
germany varchar(10),
english varchar(10),
chinese varchar(10),
listed int
) ;

insert into hpda0401 (num, germany, english, chinese, listed) values (1, 'eins', 'one','一', 1);
insert into hpda0401 (num, germany, english, chinese, listed) values (2, 'zwei', 'two','二', 1);
insert into hpda0401 (num, germany, english, chinese, listed) values (3, 'drei', 'three','三', 2);
insert into hpda0401 (num, germany, english, chinese, listed) values (6, 'sechs', 'six','六', 2);

select germany from hpda0401 where  germany = 'zwei';

select listed, sum(num) as mysum from hpda0401 group by listed;

select

import csv
from functools import reduce
path = "/home/si/Documents/hpda0404.csv"
data = []
with open(path) as f:
    records = csv.DictReader(f)
    for row in records:
        data.append(row)
    print(data)


mapiter = map(lambda x: x["germany"], data)
maplist = [ele for ele in mapiter]
print(maplist)    

filteriter = filter(lambda x: x=="zwei", maplist)
filterlist = [ele for ele in filteriter]
print("select germany WHERE germany == zwei :", filterlist)

summation

import csv
from functools import reduce
path = "/home/si/Documents/hpda0404.csv"
data = []
with open(path) as f:
    records = csv.DictReader(f)
    for row in records:
        data.append(row)
    print(data)


iters = map(lambda x: x["listed"], data)
iterslist = [ele for ele in iters]
iterset = set(iterslist)
print("grouped by ", iterset)

dic = {}
for i in iterset:
    # Collect the num values of all rows in group i, then fold them with reduce.
    temp = [int(n) for (j, n) in map(lambda x: (x["listed"], x["num"]), data) if i == j]
    dic[i] = reduce(lambda x, y: x + y, temp)

print("sum(num) GROUP BY listed : ", dic)

join

cat ~/Documents/hpda0404a.csv 
cat ~/Documents/hpda0404b.csv

import csv
from functools import reduce
path1 = "/home/si/Documents/hpda0404a.csv"
path2 = "/home/si/Documents/hpda0404b.csv"
data1 = []
with open(path1) as f:
    records = csv.DictReader(f)
    for row in records:
        data1.append(row)
    print(data1)

data2 = []
with open(path2) as f:
    records = csv.DictReader(f)
    for row in records:
        data2.append(row)
    print(data2)    


for a in data1:
    aid = [y for y in map(lambda x: x["id"], [a])]
    for b in data2:
        bid = [y for y in map(lambda x: x["id"], [b])]
        if aid == bid:
            (af1, bf2) = ([y for y in map(lambda x: x["germany"], [a])], [y for y in map(lambda x: x["fan"], [b])])
            print(af1, bf2)
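
The nested loops above compare every row of the first file with every row of the second. A hash join is a common alternative that indexes one side first. The following is a minimal sketch, assuming in-memory rows with the column names `id`, `germany`, and `fan` used above; the sample values and the name `hash_join` are made up for illustration.

```python
# Hypothetical in-memory stand-ins for the two CSV files.
left = [
    {"id": "1", "germany": "eins"},
    {"id": "2", "germany": "zwei"},
    {"id": "3", "germany": "drei"},
]
right = [
    {"id": "2", "fan": "b"},
    {"id": "1", "fan": "a"},
    {"id": "4", "fan": "d"},
]


def hash_join(rows_a, rows_b, key):
    """Inner join: index rows_b by key, then look up each row of rows_a."""
    index = {}
    for row in rows_b:
        index.setdefault(row[key], []).append(row)
    joined = []
    for a in rows_a:
        for b in index.get(a[key], []):
            joined.append({**a, **b})
    return joined


pairs = [(row["germany"], row["fan"]) for row in hash_join(left, right, "id")]
print(pairs)
```

Rows without a partner on the other side (ids 3 and 4 here) are dropped, as in an SQL inner join.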

04-02

2.1

from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer

stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

file = "/home/si/Documents/hpda0402wordscount.txt"
sdict = {}
ldict = {}
with open(file, "r") as data:
    datas = data.read()
    words = datas.split(' ')
    for word in words:
        sword = stemmer.stem(word)
        lword = lemmatizer.lemmatize(word)
        if sword in sdict:
            sdict[sword] += 1
        else:
            sdict[sword] = 1

        if lword in ldict:
            ldict[lword] += 1
        else:
            ldict[lword] = 1

    print("---------sdict----------------------")
    for (key, count) in sdict.items():
        print(key, count)

    print("---------ldict----------------------")
    for (key, count) in ldict.items():
        print(key, count)

2.2

2.3

see in Document folder

2.4

mapper

import sys
for line in sys.stdin:
    words = line.strip().split(" ")
    for word in words:
        print(word + "\t" + "1")

reducer

import sys

oldword = ""
count = 0
for line in sys.stdin:
    (word, c) = line.strip().split("\t")
    if word != oldword:
        if count != 0:
            print(oldword +"\t"+ str(count))
        count = 0
        oldword = word
    count = count + int(c)
if oldword != "":
    print(oldword +"\t%d" %(count))
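
The reducer relies on its input being sorted by word, so equal words arrive on consecutive lines. The same logic can be sketched with `itertools.groupby`, which also makes it easy to test without Hadoop. `reduce_sorted` and the sample lines are assumptions for illustration.

```python
from itertools import groupby


def reduce_sorted(lines):
    """Sum the counts of consecutive lines that share the same word."""
    pairs = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Sorted mapper output, as the shuffle phase would deliver it.
sample = ["a\t1\n", "a\t1\n", "b\t1\n", "c\t1\n", "c\t1\n"]
output = list(reduce_sorted(sample))
print(output)
```

If the input is not sorted, the same word shows up in several groups, which is why the sort step between mapper and reducer matters.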

cd /home/hadoop/hadoop-3.3.1/sbin
./start-dfs.sh
./start-yarn.sh
jps

word count example

hdfs dfs -put /home/si/Documents/hpda/hpda04-2.3.txt /
hadoop fs -rm -r /hpda04-2.3-output/
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar wordcount /hpda04-2.3.txt /hpda04-2.3-output/
hadoop fs -cat /hpda04-2.3-output/part-r-00000
cd output
hadoop fs -getmerge /hpda04-2.3-output/ out

With errors

yarn jar share/hadoop/tools/lib/hadoop-streaming-3.3.1.jar -Dmapred.reduce.tasks=1 -Dmapred.map.tasks=11 -mapper /home/si/Documents/hpda/04/mapper.py -reducer /home/si/Documents/hpda/04/reducer.py -input /hpda04-2.3.txt -output /hpda04-2.3-output/

05

import csv

class dataflow:
    def __init__(self):
        self.data = []

    @staticmethod
    def read(filename):
        d = dataflow()
        with open(filename, newline='') as csvfile:
            spamreader = csv.reader(csvfile)
            for row in spamreader:
                d.data.append(row)
        return d

    def map(self, func):
        d = dataflow()
        for x in self.data:
            d.data.append(func(x))
        return d

    def filter(self, func):
        d = dataflow()
        for x in self.data:
            if func(x):
                d.data.append(x)
        return d

    def write(self, filename):
        with open(filename, 'w', newline='') as csvfile:
            spamwriter = csv.writer(csvfile, quoting=csv.QUOTE_MINIMAL)
            for row in self.data:
                spamwriter.writerow(row)
        # Return the dataflow itself so further calls can be chained.
        return self
    def __str__(self):
        return str(self.data)


d = dataflow.read("/home/si/Documents/hpda/05/file.csv")
print(d)
flat = d.map(lambda t: (t[0], eval(t[3])))
bd = flat.filter(lambda t: "HPDA" in t[1])
bd.write("/home/si/Documents/hpda/05/out.csv")

06

MongoDB

show dbs

use testdatabase
db.getCollectionNames()

use testdatabase;
db.wiki.drop();
db.createCollection("wiki");
show collections;

use testdatabase;
db.wiki.insert({_id:1, "person":"Gauss","Beruf":"Mathematiker" })
db.wiki.find()

use testdatabase;
db.wiki.updateOne({"person":"Gauss"}, {$set: {"Beruf": "Mathematiker Physiker"}})
db.wiki.find()

use testdatabase;
db.wiki.updateOne({"person":"Gauss"}, {$set: {"Beruf": "Mathematiker Physiker", "Wohnsite": "Göttingen Hannover"}})
db.wiki.find()

use testdatabase;
db.wiki.drop()

Parallel computation lecture

performance

Amdahl's law:

$S_{total} = \frac{1}{(1 - p) + \frac{p}{s}}$

$S = \frac{1}{1 - P_{B} - P_{D} + \frac{P_{B}}{N_{B}} + \frac{P_{D}}{N_{D}}}$

If the size of the task changes, use Gustafson's law:

$s_{g} = \frac{T_{s} + p T_{p}}{T_{s} + T_{p}}$

$S = \frac{s_{g}}{(s_{g} - P_{p}) + \frac{P_{p}}{N_{p}}}$

Here $P_{p}$ is the scaled share of the task. For example, if a 70% share is doubled, $P_{p} = 1.4$.
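
The 70% example can be evaluated numerically. This is a minimal sketch, assuming the formula $S = s_g / ((s_g - P_p) + P_p / N_p)$ from these notes, a serial share of 0.3, and a parallel share of 0.7 that is doubled to 1.4; the function name `scaled_speedup` is hypothetical.

```python
def scaled_speedup(serial_share, scaled_parallel_share, n_procs):
    """Speedup for a scaled task: s_g / ((s_g - P_p) + P_p / N_p)."""
    s_g = serial_share + scaled_parallel_share  # total scaled workload
    return s_g / ((s_g - scaled_parallel_share) + scaled_parallel_share / n_procs)


# Serial share 0.3, parallel share 0.7 doubled to 1.4, so s_g = 1.7.
for n in (1, 2, 4, 8):
    print(n, round(scaled_speedup(0.3, 1.4, n), 3))
```

With two processors the doubled parallel part takes as long as the original one did, so the speedup equals the scaled workload of 1.7.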

Efficiency:

$E = \frac{S}{p}$

$S = \frac{T_{s}}{T_{p}} = \frac{n}{\frac{n}{p} + \log_{2} p}$

Chap1: introduction

Von Neumann

cpu, interconnection, memory

memory model

shared memory, distributed memory

shared memory

easy to build, hard to scale to large systems

distributed memory

Chap 2: Performance

CPI: cycles per instruction
MIPS: million instructions per second
FLOPS: floating-point operations per second

Benchmark idle

Does this also mean that in a 100% parallel code the speedup is proportional to the number of threads?

fashion inductive

$T_{serial}$: run time of the serial program.
$T_{parallel}$: run time of the parallel program.
$p$: number of processes.
$P$: fraction of the task that can be parallelized.

single Process:

$T_{parallel} = \frac{T_{serial}}{p}$

speedup: $S = \frac{T_{serial}}{T_{parallel}}$. If the task is perfectly parallelizable, $S = p$.

Multi processes

$T_{parallel} = (1 - P) T_{serial} + \frac{P \cdot T_{serial}}{p}$

speedup: $S = \frac{T_{serial}}{T_{parallel}} = \frac{1}{(1 - P) + \frac{P}{p}}$, and $\lim_{p \to \infty} S = \lim_{p \to \infty} \frac{1}{(1 - P) + \frac{P}{p}} = \frac{1}{1 - P}$

Efficiency

$E = \frac{S}{p}$
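
The speedup and efficiency formulas can be tabulated to see how quickly the serial part dominates. This is a small sketch, assuming a parallel fraction $P = 0.9$ chosen only for illustration; the function names are hypothetical.

```python
def amdahl_speedup(parallel_fraction, n_procs):
    """S = 1 / ((1 - P) + P / p)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_procs)


def efficiency(parallel_fraction, n_procs):
    """E = S / p."""
    return amdahl_speedup(parallel_fraction, n_procs) / n_procs


P = 0.9  # assumed parallel fraction
for p in (1, 2, 8, 64, 1024):
    print(p, round(amdahl_speedup(P, p), 2), round(efficiency(P, p), 3))
print("upper bound 1 / (1 - P):", 1.0 / (1.0 - P))
```

For $P = 1$ the formula gives $S = p$, so only a fully parallel code scales proportionally to the number of processes; for any $P < 1$ the speedup stays below $\frac{1}{1 - P}$.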

Amdahl's law

Gustafson's law

$S_{p} = \frac{T_{serial} + p T_{parallel}}{T_{serial} + T_{parallel}}$

Cloud computation lecture

Platform Virtualization

Definition of Virtualization

the process of creating a software-based version of resources.

The reasons for applying virtualization

Utilization: server consolidation
Isolation: the impact of errors is restricted to the virtual resource
Flexibility: many applications access the same physical hardware
On-demand: virtual resources are created/destroyed on request
Migration: fault tolerance, live update, optimization of performance
New research: new OS, new technology
Encapsulation: the current state can be saved, copied, and loaded
Minimal downtime
Fast provisioning

Full virtualization (Hypervisor system, bare metal)

Translation of instructions
implementation: VirtualBox
Hypervisor receives the I/O from the application and translates it to HW
Hypervisor translates the requests from the Guest OS to HW
no special HW support needed
no modified OS needed

Hardware-assisted virtualization (Hypervisor system, bare metal)

implementation: VMware Workstation
can run many virtual machines
needs special HW support
no modified OS needed

Paravirtualization (Hypervisor system, bare metal)

VM (modified OS) runs on the Host
Host runs on the hypervisor
implementation: Linux kernel
needs a modified OS
needs a Host OS level on the hypervisor

Host OS virtualization (Hypervisor system, Hosted)

Guest OS on Hypervisor
Hypervisor on Host OS
Host on HW

no modified OS needed
needs a Hypervisor on the Host OS
inter-VM communication is difficult

OS-level virtualization (Container system)

no hypervisor
multiple lightweight user instances run on a host OS
implementation: Docker

Memory virtualization

shadow page table on Guest OS
Extended Page table in Host

Network virtualization

hypervisor provides a virtual switch, offering every VM an IP address

Features

Encapsulation
Isolation
Hardware abstraction
Migration
Partitioning

Kubernetes

Container-Orchestration System

Cluster
Control Plane
Workload: application running on Kubernetes
Pod: one or more containers that share the same volumes
Deployment
Service

Virtual Machine

Partition
Isolation
Encapsulation
Hardware abstraction
Live Migration

Storage Virtualization

SSD advantages and disadvantages compared with HDD

Reliability
Fast
small size
More expensive
less capacity

Storage virtualization advantages

Faster access: multiple data sources can serve the same data
Independence of logical storage resources
Improved management: data can be moved easily between multiple locations
High reliability: because of redundancy
High efficiency: replication and deduplication
compression, compaction
volumes can be enlarged if needed

Provisioning:

allocate disk space to users on demand
thin provisioning: present an amount of storage to the user without physically allocating all of it

Deduplication

Single-instance storage: if the hash value of a data block equals that of a block already stored, then store only a link to the existing block

checksum with hash value
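
The idea can be sketched as a tiny in-memory block store. This is an illustration under assumptions, not a real storage system: the class name `DedupStore`, the 4-byte block size, and the use of SHA-256 are all choices made for the example.

```python
import hashlib


class DedupStore:
    """Stores each distinct block once and keeps per-file lists of block hashes."""

    def __init__(self):
        self.blocks = {}  # digest -> block bytes, stored once
        self.files = {}   # file name -> list of digests (the "links")

    def put(self, name, data, block_size=4):
        digests = []
        for start in range(0, len(data), block_size):
            block = data[start:start + block_size]
            digest = hashlib.sha256(block).hexdigest()
            # Only store the block if its hash has not been seen before.
            self.blocks.setdefault(digest, block)
            digests.append(digest)
        self.files[name] = digests

    def get(self, name):
        return b"".join(self.blocks[digest] for digest in self.files[name])


store = DedupStore()
store.put("a.txt", b"AAAABBBBAAAA")
store.put("b.txt", b"BBBBAAAA")
print(len(store.blocks), "unique blocks stored")
```

Five blocks are written in total, but only the two distinct ones occupy space; both files can still be reassembled from their lists of links.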

Compression:

compacting the data so that it consumes less space

Cloning

Consuming no storage except what is required for metadata until changes are written to the copy

Snapshotting Copies

a read-only, point-in-time image of a volume

increasing the performance

by using more physical disks at the same time

Modern Datacenters

automation

scaling
Increases repeatability
Makes processes faster
Improves reliability
Disadvantages: additional complexity, illusion of stability

Idempotent

running the same code repeatedly produces the same result; after the first run, further runs change nothing
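
A small sketch of the difference between an idempotent and a non-idempotent operation, assuming a configuration that is modelled as a list of lines; the function names and the sample setting are hypothetical.

```python
def ensure_line(lines, wanted):
    """Idempotent: add the line only if it is missing."""
    if wanted in lines:
        return lines
    return lines + [wanted]


def append_line(lines, wanted):
    """Not idempotent: every call adds another copy."""
    return lines + [wanted]


config = ["max_connections=100"]
once = ensure_line(config, "timeout=30")
twice = ensure_line(once, "timeout=30")
print(once == twice)  # applying it again changes nothing

appended_once = append_line(config, "timeout=30")
appended_twice = append_line(appended_once, "timeout=30")
print(appended_once == appended_twice)
```

Automation tools prefer operations of the first kind, because a run can then be repeated safely after a partial failure.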

Infrastructure as code

Benefits:
- Repeatability
- Agility
- Disaster Recovery
- fast deployment
- live upgrade
Imperative: describe the steps to get to the desired state
Declarative: describe the desired state
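
In the declarative style, a tool compares the desired state with the current state and derives the steps itself. The following is a minimal sketch of that comparison, assuming both states are plain dictionaries; `plan` and the resource names are hypothetical and do not model any specific tool.

```python
def plan(current, desired):
    """Derive the actions that move `current` to `desired`."""
    actions = []
    for name in sorted(desired.keys() - current.keys()):
        actions.append(("create", name))
    for name in sorted(current.keys() - desired.keys()):
        actions.append(("delete", name))
    for name in sorted(desired.keys() & current.keys()):
        if current[name] != desired[name]:
            actions.append(("update", name))
    return actions


current = {"web": {"replicas": 1}, "old-job": {"replicas": 1}}
desired = {"web": {"replicas": 3}, "db": {"replicas": 1}}
actions = plan(current, desired)
print(actions)
```

An imperative script would list these create, delete, and update steps by hand. Here they are computed, and once the desired state is reached the plan is empty.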

Foreman:

provides the initial configuration needed to run an OS

Puppet

Declarative description of resource states
Client / server Architecture
Security through certificates
OS abstraction

Monitor

challenge

collect data from a large number of servers
Watch out for overheating

Real time monitoring

Availability Monitoring: alerting on failures
Capacity Monitoring: detect resource shortages

Historical Monitoring

Long-term information
Trend analysis
Capacity planning

Architecture

Measurement: Blackbox, Whitebox, Gauges, Counters
Collection: push, pull
Analysis: real time, short term, long term, anomaly detection with AI
Alerting
Visualization

Cloud Computing Concepts

Cloud Definition

Cloud Computing is a model for enabling on-demand network
access to a shared pool of configurable computing resources
(networks, servers, storage, applications, services) that can
be rapidly provisioned and released with minimal management
effort or service provider interaction

SOA

Service-Oriented Architecture (SOA) has become a core concept of service computing and provides the fundamental technologies for realizing it.

Advantage

No capital costs
High scalability
High flexibility

Network design

Difference: SDN (Software-Defined Networking)
New architectures have a detached control plane instead of heavy switching/routing logic in hardware

hardware independent
better traffic shaping and QoS (Quality of Service)
Data Center Bridging for local and remote networks

GWDG features

self-service front end
SSH authentication
snapshotting
using OpenStack

Infrastructure as a Service

Different deployment methods

Private Cloud
Community Cloud
Public Cloud
Hybrid Cloud
Storage

CDMI: Cloud Data Management Interface
File, Block Devices, Object Stores, Database Store
example: AWS S3
Network
advantage
- quick implementation of new projects
- Flexibility and scalability
- no hardware costs
- pay only for what you need
disadvantage
- complicated to change provider
- dependency on provider
- internet access is essential

Platform as a Service

Rapid Time-to-Market
Minimal Development
Reduced Pressure on internal resources

Software as a Service

based on IaaS, focus on applications

Web services

Benefits

Programmable access
Distribution over internet
Encapsulation of discrete functionality
can offer standardized interfaces
- TCP/IP protocol
- HTTP based

SOAP

Simple Object Access Protocol
XML based
RPC based

WSDL

Web Services Description Language
XML based

REST

Everything is a resource
Every resource is identified by a unique identifier
Uses a simple and uniform interface
Communication is done via representations
stateless
more flexibility
less redundancy, raw message based
URI and URL

API

Application Programming Interface

Big Data Service

features

Volume: scale of data
Velocity: speed of data transfer
Variety: different forms of data
Veracity: uncertainty of data

processes

Acquisition, Recording
Extraction, Cleaning, Annotation
Integration, Representation
Analysis, Modeling
Interpretation, Visualization

Challenges

Heterogeneity, Incompleteness
Scale
Timeliness
Privacy

Mapreduce

map: map the data into key-value pairs according to our problem
reduce: key-value pairs get accumulated
shuffling: pairs are grouped by key between map and reduce

Large Scale Data Analysis

batch processing

disadvantage: views generated in batch may be out of date

stream processing

disadvantage: expensive and complex

Stream Computation Platform

Apache Storm
Spark Streaming
Apache Flink
Heron

Hadoop

HDFS

Namenode vs DataNodes

YARN

Resource Manager vs NodeManager

Apache Kafka

Fast, efficient IO
Fault tolerant storage
Publish and subscribe to streams of records

Data management cycle

Data
Meta-data
PID
Search
Disposition

Data Grid Data Management

Data Lake

A data lake is a data store in which raw data is kept and whose structure is determined only when data is extracted from the lake

Challenges
- Reliability
- Slow Performance
- Lack of security
Zones
- Transient
- raw
- trusted
- refined

service value system SVS
1. Guiding principles
  - focus on value
  - start where you are
  - progress iteratively with feedback
  - collaborate and promote visibility
  - think and work holistically
  - keep it simple and practical
  - optimize and automate
2. Service Value Chain
  - plan
  - improve
  - engage
  - design
  - transition
  - obtain
  - deliver
3. ITIL Practices
the four dimensions model

Organizations & People
Information & Technology
Value Streams & Processes
Partners & Suppliers

SLA (Service Level Agreement) life cycle

Development
Negotiation
Implementation
Execution
Assessment
Termination

SLA components include

Parties, terms, conditions
service definition, including costs
Performance parameters
what is measured, how, and when (monitoring)
what is done in case an SLA is violated

Message: M
Modulus: N
Ciphertext: C
Public key: E
Encryption: E(x)
private key: D
Decryption: D(x)

RSA Algorithm
1. Select two prime numbers, p [13] and q [17]
2. Compute the modulus N [221]: $N = p \cdot q$
3. Calculate Euler's totient function [192]: $\varphi(N) = (p - 1) \cdot (q - 1)$
4. Randomly choose the public exponent e [5] such that e is relatively prime to $\varphi(N)$
5. Calculate the private exponent d [77] such that $e \cdot d \equiv 1 \pmod{\varphi(N)}$
6. Pack the public key E = (N, e) and publish it
7. Save the private key D = (N, d)

Someone wants to send me the message M [12]
Encryption: $C = M^{e} \bmod N$ [207]
and sends C [207] to me

I do the decryption
Message M: $M = C^{d} \bmod N$ [207**77 % 221]
and get the message [12]
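
The worked example can be checked with Python's built-in `pow`. This is a sketch with the toy numbers from the notes; such small primes are for illustration only, and the three-argument `pow` with a negative exponent needs Python 3.8 or newer.

```python
p, q = 13, 17
N = p * q                    # modulus
phi = (p - 1) * (q - 1)      # Euler's totient of N
e = 5                        # public exponent, relatively prime to phi
d = pow(e, -1, phi)          # private exponent: modular inverse of e mod phi

M = 12                       # message
C = pow(M, e, N)             # encryption: C = M^e mod N
decrypted = pow(C, d, N)     # decryption: M = C^d mod N

print(N, phi, d, C, decrypted)
```

Using `pow(base, exponent, modulus)` avoids computing the huge intermediate number `207**77` before taking the remainder.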

security benefits

Integrity
authentication of the sender
non-repudiation of the message

symmetric encryption

challenge of key exchange
en/decryption with the same key

asymmetric encryption

en/decryption needs more resources
safe key exchange

my collection