Sandbox Hortonworks
http://hortonworks.com/hadoop-tutorial/hello-world-an-introduction-to-hadoop-hcatalog-hive-and-pig/
Chargement de données dans HDFS
données compressées
https://s3.amazonaws.com/hw-sandbox/tutorial1/NYSE-2000-2001.tsv.gz
- Select the ‘Upload’ button
- Select ‘Files’ and a pop-up window will appear.
- Click the button which says, ‘Upload a file’.
- Locate the example data file you downloaded and select it.
- A progress meter will appear. The upload may take a few moments.
Visualisation dans HCatalog
Définition du métamodèle (type et caractéristique de chaque colonne)
Requêtage dans Hive
Les bases HCatalog sont directement accessibles.
On voit la décomposition en jobs Map Reduce dans les logs
select count(*) from nyse2001
select * from nyse2001 where stock_symbol= "IBM"
Pig
- Step 1: Create and name the script
- Step 2: Loading the data
a = LOAD ‘default.nyse_stocks’ USING org.apache.hive.hcatalog.pig.HCatLoader();
- Step 3: Select all records starting with IBM
b = filter a by stock_symbol == 'IBM';
c = group b all;
- Step 4: iterate and average
d = FOREACH c GENERATE AVG(b.stock_volume);
DUMP d;
- Step 5: save the script and execute it
paramètre de pig :
-useHCatalog (entrée)
Tutorial 2 PIG
http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-pig/
batting = load 'Batting.csv' using PigStorage(',');
runs = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
grp_data = GROUP runs BY (year);
max_runs = FOREACH grp_data GENERATE group as grp,MAX(runs.runs) as max_runs;
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
dump join_data;
Tutorial 3 Hive
http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-hive/
create table temp_batting (col_value STRING);
LOAD DATA INPATH '/user/hue/Batting.csv' OVERWRITE INTO TABLE temp_batting;
create table batting (player_id STRING, year INT, runs INT);
insert overwrite table batting
SELECT
regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 1) player_id,
regexp_extract(col_value, '^(?:([^,]*)\,?){2}', 1) year,
regexp_extract(col_value, '^(?:([^,]*)\,?){9}', 1) run
from temp_batting;
SELECT year, max(runs) FROM batting GROUP BY year;
SELECT a.year, a.player_id, a.runs from batting a
JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year ) b
ON (a.year = b.year AND a.runs = b.runs) ;